Basic Docker integration example¶
The purpose of this note is to explain how to create a minimal Docker image containing a version of the SDK, and how to extend that image to run Jupyter Notebooks.
Base Docker image¶
We begin with the set-up of a minimal base Docker image. The purpose of this image is to bundle a version of the SDK library, the MRE engine, and a basic licensing configuration.
The base Dockerfile we are going to explore is:
# DataStories - Docker Container Example for SDK development
# Author: Justin Dekeyser, on behalf of DataStories
# Date: February 2023
#
# =============================================================================
# == BASE IMAGE ====
# =============================================================================
# The Docker image is built on top of Python official image,
# see https://hub.docker.com/_/python for further information.
#
# Current version of the SDK is using Python 3.9.
#
FROM python:3.9
#
# =============================================================================
# == GLOBAL BUILD WORKFLOW ====
# =============================================================================
# The current Dockerfile uses the following strategy to build:
# 1. Copy artefacts from host machine to image
# - The local host artefacts must contain a requirements.txt file
# 2. Install the SDK ds-mre dependency from DataStories PyPi repository
# 3. Install Python dependencies listed in the requirements.txt file
#
# =============================================================================
# == ARGUMENTS DECLARATIONS ====
# =============================================================================
# Arguments below are overwritable during the build, using --build-arg option.
# However, they are no longer available at runtime.
#
# -----------------------------------------------------------------------------
# -- WORKING DIRECTORY ----
# -----------------------------------------------------------------------------
# Refers to the working directory inside the Docker container.
#
ARG WORKING_DIRECTORY=/usr/src/datastories
#
# -----------------------------------------------------------------------------
# -- COPY ARTEFACT DIRECTORY ----
# -----------------------------------------------------------------------------
# Refers to the directory relative path on the host machine (outside the
# container) whose content will be copied into the container working directory.
#
# This directory must contain the requirements.txt file used by pip.
#
# Requirements file example:
# --------------------------
# file content:
# datastories==1.6.35
# ds-mre
#
ARG SOURCE_ARTEFACT_DIRECTORY=./artefacts
#
# -----------------------------------------------------------------------------
# -- PYPI EXTRA URL ----
# -----------------------------------------------------------------------------
# The Pypi extra URL to use to resolve DataStories specific dependencies.
# Because the DataStories SDK requires other packages like `ds-utils`, it is
# effectively required to set this variable, or else to provide the necessary
# dependencies in the artefact folder that is copied into the Docker image.
#
ARG PYPI_EXTRA_URL=https://pypi.datastories.com/
#
# =============================================================================
# == ENVIRONMENT VARIABLES ====
# =============================================================================
# Configure usual environment variables that might be required for the SDK to
# run without error. In particular, you can define different options for the
# SDK, within a configuration file directly reachable from a location specified
# by the DS_CONFIG_PATH environment variable.
#
# Note that the path is relative to the Docker container. If you need to link
# that path to a folder of the host machine, you should mount a volume.
#
# Configuration file example:
# ---------------------------
# environment declaration:
# DS_CONFIG_PATH=$WORKING_DIRECTORY/config
#
# file content:
# [SDK]
# load_mre = true
# license_path = /usr/src/datastories/host-workspace
#
#
ENV DS_CONFIG_PATH=$WORKING_DIRECTORY/config
#
# =============================================================================
# == IMAGE BUILD PROCEDURE
# =============================================================================
# Modify the instructions below to customize the build further than what has
# been foreseen initially.
#
WORKDIR $WORKING_DIRECTORY
COPY $SOURCE_ARTEFACT_DIRECTORY .
RUN pip install -r requirements.txt -f . --extra-index-url $PYPI_EXTRA_URL --no-cache-dir
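As noted in the arguments section, each ARG above can be overridden at build time with the --build-arg option. For instance, a build that overrides the defaults could look like this (the values below are purely illustrative):
docker build --build-arg WORKING_DIRECTORY=/opt/datastories --build-arg SOURCE_ARTEFACT_DIRECTORY=./my-artefacts -t ds-sdk .
Remember that these values are baked in at build time and are no longer available once the container runs.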
In the next sections, we show how to use that file and how to extend it to run a Jupyter Notebook within Docker, with the DataStories SDK.
Setting up a project¶
In a folder of your choice, copy the content of the Dockerfile above into a file named Dockerfile. At first, we recommend modifying nothing in this file and simply following the conventions it imposes. You can revisit those choices later on.
Create a folder called artefacts in the same folder as your Dockerfile. That folder will be copied into the Docker container at build time.
Create a file artefacts/requirements.txt with the following content:
datastories==1.6.0
ds-mre
Replace the version of the DataStories SDK with the appropriate one. You can skip the ds-mre line if you do not wish to include the MRE in the image.
Create a file artefacts/config with the following content:
[SDK]
license_path = /usr/src/datastories/host-workspace
load_mre = true
If you do not wish to include the MRE in the image, you should probably set load_mre to false.
So far, your folder structure should look like:
|- Dockerfile
|- artefacts/
|---- requirements.txt
|---- config
Building the image¶
Open a terminal and move to the working directory (the one that contains your Dockerfile). Enter:
docker build -t ds-sdk .
You should see a series of build steps running. (Make sure the Docker daemon is running, otherwise the command will fail.) You should recognise all the steps we have defined in the Dockerfile.
The name of your image is ds-sdk.
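To double-check that the image was created, you can list it with the standard Docker command:
docker images ds-sdk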
Checking the image health¶
Once the build is over, quickly check that it can run:
docker run -it ds-sdk /bin/bash
This command should drop you into a Linux-like terminal: you are now inside the container! Type pwd to confirm that the working directory is the one set in the Dockerfile. You can also inspect the content of that directory by entering ls. You should see the content of the artefacts directory.
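Put together, a quick health check could look like this (the exact listing depends on what you placed in artefacts):
docker run -it ds-sdk /bin/bash
pwd   # expected: /usr/src/datastories
ls    # expected: requirements.txt and config
exit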
Linking with a license¶
Enter the ds-sdk container, and type python: you should see a Python 3.9 console (you are still inside the Docker container!). If you try to do something like:
from datastories.story import load
You should see a license error. This is because we haven't told the SDK where the license is located. More precisely, the Docker image contains an environment variable that points to our config configuration file (check the Dockerfile to see that configuration). The configuration file refers to a location inside the Docker image that is expected to contain the license, but that location does not exist yet.
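Anticipating the next section, a minimal way to make the license visible is to mount a host folder containing your license file at the license_path declared in the config file. The sketch below assumes the license is stored in a host-workspace folder under your current directory:
docker run -v ${PWD}/host-workspace:/usr/src/datastories/host-workspace -it ds-sdk python
With that volume mounted, the import above should no longer raise a license error.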
Extended Docker image¶
We recommend not touching the base Docker image any further, as it is already well configured enough to be extended. Extension is a mechanism that allows you to reuse existing images and build on top of them. We already used extension in the base image, by creating our Docker image on top of the official python:3.9 one. In this section, we show how to build on our previous work to run a Jupyter Notebook from inside Docker.
Setting-up the project¶
Set up a new project with the following file structure:
|- Dockerfile
|- host-workspace
|---- license.lic
|---- your_training_data.csv
In the new Dockerfile, copy-paste the following content:
FROM ds-sdk
EXPOSE 8000
RUN pip install notebook
CMD ["jupyter", "notebook", "--port", "8000", "--ip", "0.0.0.0", "--allow-root", "--no-browser"]
We start by extending our local base ds-sdk image (built previously). As this image exists on your local machine, Docker will find it. The EXPOSE instruction has no effect by itself; it merely documents that our image intends to open port 8000. This port is the one opened from the inside of the container. We then install Jupyter Notebook (note that the SDK is already installed at that step), and we finish the build with a CMD instruction. Unlike RUN, this instruction is executed when the container runs (and not when it builds). It basically starts a Jupyter Notebook server listening on port 8000 (from inside the container).
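A practical consequence of CMD is that it can be overridden at runtime. For instance, to open a plain shell in the extended image instead of starting the notebook server, append the command to docker run:
docker run -it sdk-notebook-example /bin/bash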
Building the extension¶
Build the extension image, as usual:
docker build -t sdk-notebook-example .
Running the extension¶
In order to run the extension, you need to keep three things in mind:
- The Docker container must share a volume with the local machine, which acts as a buffer between the container and the host.
- The container internally opens port 8000, but that port is not bound to anything from the perspective of the local machine until you publish it.
- If possible, the container should display the logs of what happens inside, so we can check the notebook's health.
The correct Docker command to run the container properly is:
docker run -v ${PWD}/host-workspace:/usr/src/datastories/host-workspace -p 8000:8000 -it --rm sdk-notebook-example
You should see the usual Jupyter Notebook logs appearing in the console. Copy-paste the URL it gives you into the browser of your choice. Watch out: the URL is printed as seen from the inside of the container, so with port 8000. Here we bind 8000 to 8000, so the URL works as-is, but if you bind to another host port you will need to change the port in the URL. You should then see your notebook.
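For instance, if host port 8000 is already taken, you could bind the container's port 8000 to host port 9999 instead (9999 being an arbitrary choice here):
docker run -v ${PWD}/host-workspace:/usr/src/datastories/host-workspace -p 9999:8000 -it --rm sdk-notebook-example
In that case, replace 8000 by 9999 in the URL printed by Jupyter before opening it in your browser.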
By saving and loading files from host-workspace, you ensure those files are also saved on your host machine. If you save files anywhere else inside the container, they will be lost as soon as you exit it!