Basic Docker integration example

The purpose of this note is to explain how to create a minimal Docker image containing a version of the SDK, and how to extend that image to run Jupyter Notebooks.

Base Docker image

We begin with the set-up of a minimal base Docker image. The image bundles a version of the SDK library, an MRE engine, and basic licensing configuration.

The base Dockerfile we are going to explore is:

# DataStories - Docker Container Example for SDK development
# Author: Justin Dekeyser, on behalf of DataStories
# Date: February 2023
#
# =============================================================================
# ==  BASE IMAGE                                                           ====
# =============================================================================
# The Docker image is built on top of Python official image,
# see https://hub.docker.com/_/python for further information.
#
# The current version of the SDK uses Python 3.9.
#
FROM python:3.9
#
# =============================================================================
# ==  GLOBAL BUILD WORKFLOW                                                ====
# =============================================================================
# The current Dockerfile uses the following strategy to build:
#   1. Copy artefacts from host machine to image
#      - The local host artefacts must contain a requirements.txt file
#   2. Install the SDK ds-mre dependency from DataStories PyPi repository
#   3. Install Python dependencies listed in the requirements.txt file
#
# =============================================================================
# ==  ARGUMENTS DECLARATIONS                                               ====
# =============================================================================
# Arguments below can be overridden during the build, using the --build-arg
# option. However, they are no longer available at runtime.
#
# -----------------------------------------------------------------------------
# --  WORKING DIRECTORY                                                    ----
# -----------------------------------------------------------------------------
# Refers to the working directory inside the Docker container.
#
ARG WORKING_DIRECTORY=/usr/src/datastories
#
# -----------------------------------------------------------------------------
# --  COPY ARTEFACT DIRECTORY                                              ----
# -----------------------------------------------------------------------------
# Refers to the directory relative path on the host machine (outside the
# container) whose content will be copied into the container working directory.
#
# This directory must contain the requirements.txt file used by pip.
#
# Requirements file example:
# --------------------------
#    file content:
#      datastories==1.6.35
#      ds-mre
#
ARG SOURCE_ARTEFACT_DIRECTORY=./artefacts
#
# -----------------------------------------------------------------------------
# --  PYPI EXTRA URL                                                       ----
# -----------------------------------------------------------------------------
# The Pypi extra URL to use to resolve DataStories specific dependencies.
# Because the DataStories SDK requires other packages like `ds-utils`, it is
# effectively required to set this variable, or else to provide the required
# dependencies in the artefact folder that is copied into the Docker image.
#
ARG PYPI_EXTRA_URL=https://pypi.datastories.com/
#
# =============================================================================
# ==  ENVIRONMENT VARIABLES                                                ====
# =============================================================================
# Configure usual environment variables that might be required for the SDK to
# run without error. In particular, you can define different options for the
# SDK, within a configuration file directly reachable from a location specified
# by the DS_CONFIG_PATH environment variable.
#
# Note that the path is relative to the Docker container. If you need to link
# that path to a folder of the host machine, you should mount a volume.
#
# Configuration file example:
# ---------------------------
#  environment declaration:
#    DS_CONFIG_PATH=$WORKING_DIRECTORY/config
#
#  file content:
#    [SDK]
#    load_mre = true
#    license_path = /usr/src/datastories/host-workspace
#
#
ENV DS_CONFIG_PATH=$WORKING_DIRECTORY/config
#
# =============================================================================
# ==  IMAGE BUILD PROCEDURE                                                ====
# =============================================================================
# Modify the instructions below to customize the build further than what has
# been foreseen initially.
#
WORKDIR $WORKING_DIRECTORY
COPY $SOURCE_ARTEFACT_DIRECTORY .

RUN pip install -r requirements.txt -f . --extra-index-url $PYPI_EXTRA_URL --no-cache-dir

In the next sections, we show how to use that file and how to extend it to run a Jupyter Notebook within Docker, with the DataStories SDK.

Setting up a project

In a folder of your choice, copy the content of the Docker file above into a file named Dockerfile. At first, we recommend modifying nothing in this file and simply following the conventions it enforces. You can revisit those choices later on.

Create a folder called artefacts in the same folder as your Dockerfile. That folder will be copied into the Docker container at build time.

Create a file artefacts/requirements.txt and write in there:

datastories==1.6.0
ds-mre

Replace the DataStories SDK version with the appropriate one. You can skip the ds-mre line if you do not wish to include the MRE in the image.

Create a file artefacts/config and write in there:

[SDK]
license_path = /usr/src/datastories/host-workspace
load_mre = true

If you do not wish to include the MRE in the image, you should probably set load_mre to false.
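For instance, a minimal sketch of the configuration without the MRE:

[SDK]
license_path = /usr/src/datastories/host-workspace
load_mre = false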

So far, your folder structure should look like:

|- Dockerfile
|- artefacts/
|---- requirements.txt
|---- config

Building the image

Open a terminal and move to the working directory (the one that contains your Dockerfile). Enter:

docker build -t ds-sdk .

You should see a series of build steps running. (Make sure the Docker daemon is running, otherwise the command will fail.) You should recognise all the steps we defined in the Dockerfile.

The name of your image is ds-sdk.
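If needed, the arguments declared in the Dockerfile can be overridden at build time, and you can confirm the image exists locally. A quick sketch (the override shown simply repeats the default value):

# optional: override a build argument declared in the Dockerfile
docker build -t ds-sdk --build-arg WORKING_DIRECTORY=/usr/src/datastories .

# list the local image to confirm the build succeeded
docker image ls ds-sdk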

Checking the image health

Once the build is over, quickly check that it can run:

docker run -it ds-sdk /bin/bash

This command should drop you into a Linux-like terminal: you are now inside the container! Type pwd to confirm the working directory is the one set in the Dockerfile. You can also inspect the content of that directory by entering ls. You should see the content of the artefacts directory.
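A quick sanity check, assuming the default build arguments:

# inside the container
pwd    # expected: /usr/src/datastories
ls     # expected: config  requirements.txt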

Linking with a license

Enter the ds-sdk container and type python: you should see a Python 3.9 console (you are still inside the Docker container!). If you try to do something like:

from datastories.story import load

You should see a license error. This is because we haven't told the SDK where the license is located. More precisely, the Docker image contains an environment variable that points to our config configuration file (check the Dockerfile to see that configuration). The configuration file refers to a location inside the Docker image that is expected to contain the license, but that location does not exist.
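You can confirm this from inside the container; a quick sketch:

# the environment variable points to the configuration file
echo $DS_CONFIG_PATH
# expected output: /usr/src/datastories/config

# the configuration file names a license location that does not exist yet
cat $DS_CONFIG_PATH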

Preparing a shared folder

Create a folder called host-workspace, and store in it a license file host-workspace/license.lic. Be aware that this license file will be checked from inside the container, which means it has to be signed with the correct host-id of the running container.

You would probably want an ANY wild-card here, or use a configuration server. For the remainder of this text, we will assume you have a valid license file.
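A minimal sketch for preparing the folder (/path/to/your/license.lic is a placeholder for wherever your license file actually lives):

# on the host machine, next to the Dockerfile
mkdir host-workspace
cp /path/to/your/license.lic host-workspace/license.lic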

Exit the container if you are still inside it (type exit) and run it like this instead:

docker run -it -v ${PWD}/host-workspace:/usr/src/datastories/host-workspace ds-sdk /bin/bash

Compared to the previous command, we have now added a volume instruction. This Docker feature links a folder on the host machine to a folder within the container.

From inside the container, explore the working directory: you should see a folder host-workspace with your license file inside. Note that the binding is dynamic: modifications made to the files of that folder from within the container are visible from the host machine, and vice versa.
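For example, a quick way to observe the binding (hello.txt is just an illustrative name):

# inside the container
touch host-workspace/hello.txt
# on the host machine, hello.txt now appears in ./host-workspace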

You should now be able to successfully load the DataStories modules, in a Python console (within the container).
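A one-line check from the container shell:

# this import should no longer raise a license error
python -c "from datastories.story import load"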

Exit the container and delete the host-workspace folder, to keep your project folder clean.

Extended Docker image

We recommend not touching the base Docker image further, as it is sufficiently well configured to be extended. Extension is a mechanism that allows you to reuse existing images and build on top of them. We already used extension in the base image, by creating a Docker image on top of the official python:3.9 image.

In this section, we show how to benefit from our previous work to run a Jupyter Notebook from inside the Docker.

Setting up the project

Set up a new project, with the following file structure:

|- Dockerfile
|- host-workspace
|---- license.lic
|---- your_training_data.csv

In the new Dockerfile file, copy-paste the following content:

FROM ds-sdk
EXPOSE 8000

RUN pip install notebook

CMD ["jupyter", "notebook", "--port", "8000", "--ip", "0.0.0.0", "--allow-root", "--no-browser"]

We start by extending our local base ds-sdk image (built previously). As this image exists on your local machine, Docker will find it.

The EXPOSE instruction has no functional effect; it merely documents that our image intends to open port 8000. This is the port opened from inside the container.

We then install Jupyter Notebook (note that the SDK is already installed at that step), and we finish the build with a CMD instruction. Unlike RUN, this instruction is executed when the container runs (not when it builds). It basically starts a Jupyter Notebook server listening on port 8000 (inside the container).
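As with any Docker image, the CMD can be overridden at run time; for instance, this sketch opens a shell instead of starting the notebook server:

# override the default CMD to inspect the container
docker run -it sdk-notebook-example /bin/bash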

Building the extension

Build the extension image, as usual:

docker build -t sdk-notebook-example .

Running the extension

In order to run the extension, you need to keep three things in mind:
  1. The Docker container must have a volume shared with the local machine, which plays the role of a buffer between the two environments

  2. The container internally opens port 8000, but that port is not bound to anything from the perspective of the local machine

  3. If possible, the container should display the logs of what happens inside, so we can check the notebook health

The correct Docker command to run the container properly is:

docker run -v ${PWD}/host-workspace:/usr/src/datastories/host-workspace -p 8000:8000 -it --rm sdk-notebook-example

You should see the usual Jupyter Notebook logs appearing on the console. Copy-paste the URL it gives you into the browser of your choice. Watch out: the URL is printed as seen from inside the container, so with port 8000. Here we bind 8000 to 8000, but if you bind to another host port, you will need to change the port in the URL accordingly. You should then see your notebook.
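For instance, a sketch binding the container port to host port 9000 instead (9000 is an arbitrary choice):

# bind container port 8000 to host port 9000
docker run -v ${PWD}/host-workspace:/usr/src/datastories/host-workspace -p 9000:8000 -it --rm sdk-notebook-example
# then replace 8000 by 9000 in the printed URL before opening it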

By saving and loading files from host-workspace, you can ensure those files are also saved on your host machine. If you save files somewhere else inside the container, they will be lost as soon as you exit it!
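For example, assuming you saved a notebook as analysis.ipynb under host-workspace (a hypothetical file name), it survives the container:

# on the host machine, after the container has stopped
ls host-workspace
# expected: analysis.ipynb  license.lic  your_training_data.csv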