[Header image: a graphics card with two active cores for machine learning with Dask]

Data Processing Scaled Up and Out with Dask and RAPIDS: Installing a Data Science App as Dask Client (2/3)


This blog post tutorial shows how to set up a scalable, high-performance environment for machine learning using GPUs, a Kubernetes cluster, Dask and Jupyter. In the first article of our blog series, we set up a Kubernetes cluster with access to GPUs. In this part, we will add containerized applications to the cluster so that we can run data processing workloads on it. More precisely, we will prepare a notebook image with CUDA installed, which is required if we want to use GPU-based frameworks. Furthermore, the image should contain Dask, RAPIDS and the Dask-RAPIDS integration. As soon as the image is ready, we will deploy JupyterHub, which spawns this notebook image as a container for each user.

We will use JupyterLab notebooks as an interactive environment to start our data processing algorithms. In other words, JupyterLab will act as a Dask client. As we want to provide an environment not only for one data scientist but for a group of users, we decided to install JupyterHub on our Kubernetes cluster. JupyterHub makes it possible to serve a pre-configured data science environment to a group of users.

Permissions for Dask Clients

First, we have to take care of the permissions of our JupyterLab instances. When used as a Dask client, a JupyterLab instance needs sufficient permissions to start new pods that act as Dask workers. Since we decided to install JupyterHub, no extra configuration is required: JupyterHub uses a Service Account with sufficient permissions by default. If you want to use Dask from a different environment, you have to grant your client permissions to create, delete, list etc. its Dask worker pods via a Service Account, along the lines of the sketch below.
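As a rough illustration (not part of the original setup; all names and the namespace are placeholders), the corresponding RBAC objects could look like this:

```yaml
# Sketch: Service Account plus Role/RoleBinding that allow a Dask client
# to manage its worker pods. Names and namespace are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dask-client
  namespace: dask
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dask-worker-manager
  namespace: dask
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dask-client-worker-manager
  namespace: dask
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dask-worker-manager
subjects:
  - kind: ServiceAccount
    name: dask-client
    namespace: dask
```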

Docker Image for Jupyter

JupyterHub is a multi-tenant version of JupyterLab: the hub creates a pod in the cluster for each user and pulls the notebook image that runs in that pod. There are official, ready-to-use Jupyter images such as the Minimal-Notebook or the Data-Science-Notebook. However, the RAPIDS library requires the CUDA Toolkit, so we cannot simply take one of these base images and add RAPIDS and Dask to it.

Instead, it is a good idea to create a base image that contains Jupyter and CUDA and use it to build an image with RAPIDS and Dask on top. Since RAPIDS and Dask are under active development and new versions are released frequently, keeping Jupyter and CUDA in a separate base image makes our final image easier to maintain.

Fortunately, there are not only official notebook images but also official NVIDIA images with CUDA, and we can simply combine the two. We will use the base-notebook Dockerfile from here and the CUDA 10.2 Dockerfile (10.2-base-ubuntu18.04) from here, and combine both of them into a single image. Keep in mind that for the base-notebook you need the following files next to your Dockerfile:

  1.  fix-permissions
  2.  jupyter_notebook_config.py
  3.  start.sh
  4.  start-notebook.sh
  5.  start-singleuser.sh

All of these files can be found in the base-notebook repository linked above. The resulting Dockerfile is listed below:
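A condensed sketch of this combination is shown here; for brevity it builds on the published jupyter/base-notebook image instead of repeating the full base-notebook Dockerfile, and the CUDA section follows the upstream 10.2-base-ubuntu18.04 Dockerfile, so exact package names and the repository key may need adjusting for newer releases:

```dockerfile
# Condensed sketch of the Jupyter + CUDA base image.
FROM jupyter/base-notebook:latest

# The CUDA installation needs root privileges.
USER root

RUN apt-get update && apt-get install -y --no-install-recommends \
        gnupg2 curl ca-certificates && \
    curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
    echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
    apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-10-2 cuda-compat-10-2 && \
    ln -s cuda-10.2 /usr/local/cuda && \
    rm -rf /var/lib/apt/lists/*

ENV PATH=/usr/local/cuda/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH} \
    NVIDIA_VISIBLE_DEVICES=all \
    NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Switch back to the unprivileged notebook user afterwards.
USER $NB_UID
WORKDIR $HOME
```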

It is important to switch to the root user for the CUDA part and then back to the normal notebook user afterwards.

We build this image and push it to a repository of our choice; this gives us a base image with Jupyter and CUDA. To create the final image on top of it, we need to install the RAPIDS libraries (cuDF and cuML), Dask, Dask-cuDF and Dask-cuML. The non-Dask RAPIDS libraries are required because their Dask counterparts build on them. This can easily be done in just a few steps, and the Dockerfile looks like this:
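The following is a rough sketch of this second Dockerfile; the base image name is a placeholder for the image pushed in the previous step, channels and versions may need adjusting, and the line numbers mentioned in the next paragraph refer to the original file rather than to this sketch:

```dockerfile
# Rough sketch: RAPIDS and Dask on top of the Jupyter + CUDA base image.
# Replace the FROM line with the base image pushed in the previous step.
FROM myregistry/jupyter-cuda-base:latest

# Install the RAPIDS libraries cuDF and cuML from the rapidsai conda channel.
RUN conda install -y -c rapidsai -c nvidia -c conda-forge \
        cudf cuml cudatoolkit=10.2 && \
    conda clean -afy

# Install Dask together with commonly needed libraries such as NumPy and Pandas.
RUN conda install -y -c conda-forge \
        dask distributed numpy pandas && \
    conda clean -afy

# The part copied from the daskdev/dask:latest Dockerfile would go here,
# e.g. its prepare.sh entrypoint handling (see the next paragraph).

# Install the libraries listed in requirements.txt via pip; the file has to be
# placed next to the Dockerfile so that it is available in the build context.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
```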

In line 5, cuDF and cuML are installed. Line 10 installs Dask and a few required libraries such as NumPy and Pandas. This part, in particular lines 12 to 19, was copied from the daskdev/dask:latest Dockerfile. We will discuss later why copying it was a good idea.

Finally, in line 27, the libraries specified in requirements.txt (which needs to be accessible while building the image) are installed via pip. These libraries are dask-kubernetes, dask_cuda, dask_cudf, dask_cuml and gcsfs (needed to read from Google Cloud Storage buckets); a sketch of the file follows below.
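Based on this list, the requirements.txt could look roughly as follows (exact package names and pinned versions depend on the RAPIDS release you target):

```
dask-kubernetes
dask_cuda
dask_cudf
dask_cuml
gcsfs
```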

Again, we build the image and push it to a repository.

Deploying JupyterHub

Now we are ready to deploy JupyterHub into our Kubernetes cluster. This link provides a lot of information about deploying it on Kubernetes, including many details on how to customize and personalize your deployment. We will come straight to the point: create a file config.yaml according to your configuration preferences. My config looks like this:
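The following sketch shows what such a config.yaml can look like with the chart version used at the time of writing; the image name, ConfigMap name, driver path and password are placeholders, and newer chart versions configure authentication under hub.config instead of auth:

```yaml
proxy:
  secretToken: "<paste the 32-byte token generated below>"

singleuser:
  image:
    name: myregistry/jupyter-rapids-dask   # the notebook image built above
    tag: latest
  storage:
    extraVolumes:
      - name: gcs-config
        configMap:
          name: gcs-bucket-config          # ConfigMap with the bucket credentials
      - name: nvidia-driver
        hostPath:
          path: /usr/local/nvidia          # path to the NVIDIA drivers on the node
    extraVolumeMounts:
      - name: gcs-config
        mountPath: /etc/gcs
      - name: nvidia-driver
        mountPath: /usr/local/nvidia
  profileList:
    - display_name: "CPU only"
      description: "Notebook without GPU access"
      default: true
    - display_name: "GPU"
      description: "Notebook with access to one GPU"
      kubespawner_override:
        extra_resource_limits:
          "nvidia.com/gpu": "1"

auth:
  type: dummy
  dummy:
    password: "<shared password>"
```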

To create your 32-byte security token, simply run:
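```bash
# Generate a random 32-byte hex string, as recommended by the JupyterHub docs
openssl rand -hex 32
```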

… in the terminal and paste the result into line 2 of your config. Then specify your image, mount the ConfigMap for accessing the bucket and the path to the NVIDIA drivers (this might or might not be necessary in your setup). You can create different profiles with different resource requests; in the example above, one profile with access to a GPU and one without are available. Simple password-based authentication is configured as well.

Now we can add the JupyterHub Helm chart repository:
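```bash
# Register the JupyterHub Helm chart repository and refresh the local chart index
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
```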

After a while, an "Update Complete. Happy Helming" message should appear. We are ready to deploy the hub. From the directory containing config.yaml, run:
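```bash
# Release name and namespace are placeholders; pick your own.
helm upgrade --install jhub jupyterhub/jupyterhub \
  --namespace jhub \
  --create-namespace \
  --values config.yaml
```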

You might want to add a --timeout flag with a higher value, such as 1000 seconds, since the image is quite big and pulling it sometimes results in timeout errors. The deployment should create a hub and a proxy pod. As soon as both are running, we can port-forward the proxy to port 8000:
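```bash
# Forward the chart's public proxy service to local port 8000
# (replace the namespace with the one used for your release)
kubectl port-forward --namespace jhub service/proxy-public 8000:80
```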

Outlook on Part 3

Finally, Jupyter is up and running and port-forwarding is enabled. Now we can access JupyterHub from the browser, log in (if authentication is enabled) and see the workspace of our JupyterLab. In the next part of our series, we will finally use the prepared infrastructure for data science and compare the efficiency of four different approaches, including the use of multiple GPUs!
