Basic integration with MLflow

The purpose of this note is to explain the integration of the DataStories SDK with MLflow. We will not cover MLflow in detail: readers are expected to know enough of the framework to fill in any gaps.

MLflow integration

The integration with MLflow is provided through the new datastories.mlflow package. This module exposes save_model, log_model, and load_model methods, in accordance with the MLflow API.

Logging a model

Logging a model follows the usual MLflow procedure: start a run and, within the context of that run, train a story and log it:

import mlflow
import datastories.mlflow

with mlflow.start_run():
    story = predict_kpis(data_set, data_set.columns, kpis)
    assert story.model is not None
    datastories.mlflow.log_model(story.model, 'model')

This will log an experiment in MLflow with a minimal default set-up. For instance, the folder structure will look like:

|- model
|---- data
|------- model_dump.rsx
|---- MLmodel
|---- conda.yaml

The model/data folder is created automatically and contains the RSX file representing the model associated with the story. Note that the object we log is not the story, but the model associated with it.

The conda.yaml file contains a minimal set of dependencies, and does not rely on a specific version of cloudpickle. For example, a possible output would be:

channels:
- conda-forge
dependencies:
- python=3.7.5
- pip<=20.0.2
- pip:
  - mlflow
  - --extra-index-url https://pypi.datastories.com/
  - datastories==1.6.1b0
name: mlflow-env

The explicit dependency on the PyPI repository makes it possible to register models for deployment as standalone endpoints through Databricks.

Loading a model

A logged model can be reloaded with the load_model method, which can be used in the same fashion as its standard counterparts for Keras or Scikit-Learn:

import datastories.mlflow

mlflow_run_id = ...  # the ID of the run in which the model was logged
loaded_model = datastories.mlflow.load_model(
    f'runs:/{mlflow_run_id}/model'
)

The model can then be used to predict. Calling predict has exactly the same effect as calling predict on a DataStories Model object, except that the result is guaranteed to be a Pandas DataFrame, as required by the MLflow API.
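As an illustration of this guarantee, a loaded wrapper only needs to coerce raw predictions into a DataFrame before returning them. The helper below is a hypothetical sketch, not the actual SDK internals:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of how a loaded wrapper can guarantee a pandas
# DataFrame result, as the MLflow API requires. The function name and
# internals are illustrative, not the actual DataStories implementation.
def ensure_dataframe(raw_prediction, columns=None):
    if isinstance(raw_prediction, pd.DataFrame):
        return raw_prediction
    return pd.DataFrame(np.asarray(raw_prediction), columns=columns)

result = ensure_dataframe([[0.1], [0.9]], columns=["kpi_prediction"])
```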

It is also possible to get the DataStories model back, through the custom loaded_model.ds_model property. However, it is not possible to get back the DataStories story.

Loading as a PyFunc

In accordance with the MLflow API, a DataStories model can also be loaded as a regular PyFunc model, using the mlflow.pyfunc package. For instance:

import mlflow

mlflow_run_id = ...  # the ID of the run in which the model was logged
loaded_model = mlflow.pyfunc.load_model(
    f'runs:/{mlflow_run_id}/model'
)

Auto-logging

In addition to the features above, the DataStories SDK also provides an auto-logging feature, in the same spirit as those available for Scikit-Learn, Keras, PyTorch and similar frameworks:

import mlflow
import datastories.mlflow

datastories.mlflow.autolog()

# This will log the story automatically:
story = predict_kpis(df, df.columns, kpis)

assert mlflow.last_active_run() is not None

When auto-logging is enabled, a number of metrics and parameters will be recorded inside the run.
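The auto-logging mechanism can be sketched in plain Python as wrapping the training function so that each call records its arguments. This is a minimal illustration of the pattern, not the SDK internals; a real implementation would forward the recorded values to mlflow.log_params inside an active run:

```python
# Minimal sketch of the auto-logging pattern: wrap a training function so
# each call records its keyword arguments. A real implementation would
# forward these to mlflow.log_params; here we just collect them in a list.
# All names below are illustrative, not the actual DataStories internals.
recorded_params = []

def autolog_wrap(fn):
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        recorded_params.append(dict(kwargs))
        return result
    return wrapper

def train_story(data, runs=1, outlier_elimination=True):
    # stand-in for predict_kpis
    return {"runs": runs}

train_story = autolog_wrap(train_story)
story = train_story("data", runs=3, outlier_elimination=False)
```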

The parameters of the run recorded as inline parameters (available directly in the MLflow summary overview) are the arguments of predict_kpis, among which outlier_elimination and runs. When the user calls predict_kpis_ex with an ex parameter for extra arguments, those are recorded in an extra_parameters.json file.
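For illustration, serializing such extra arguments to a JSON artifact could look like the sketch below. The file name extra_parameters.json comes from the behaviour described above; the payload and directory are hypothetical examples:

```python
import json
import os
import tempfile

# Sketch of writing extra arguments to an extra_parameters.json artifact.
# The payload below is a made-up example of 'ex' arguments.
extra_params = {"validation_split": 0.2, "seed": 42}

artifact_dir = tempfile.mkdtemp()
path = os.path.join(artifact_dir, "extra_parameters.json")
with open(path, "w") as f:
    json.dump(extra_params, f, indent=2)

# Reading the artifact back restores the same parameters.
with open(path) as f:
    restored = json.load(f)
```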

The metrics of the run recorded as inline metrics are a summary of the prediction performance for each KPI. A complete log of the story metrics and the model metrics (usually available through story.metrics and model.metrics) is also written to a metrics.json file.

In addition, two visualisation reports, namely the What-Ifs and the Driver-Overview components, are saved in the reports folder. These are standalone files that can be used to get a quick overview of the drivers.

Note: It is possible to turn autologging off by calling datastories.mlflow.autolog(turn_off=True).