Basic integration with MLflow¶
The purpose of this note is to explain the integration of the DataStories SDK with MLflow. We do not cover MLflow itself in detail: readers are expected to know the framework well enough to fill in any gaps.
MLflow integration¶
The integration with MLflow is done through the new package datastories.mlflow. This module contains save_model, log_model, and load_model methods, in accordance with the MLflow API.
Logging a model¶
Logging a model is done using the usual MLflow procedure, by starting a run and within the context of that run, training a story and logging it:
import mlflow
import datastories.mlflow

with mlflow.start_run():
    story = predict_kpis(data_set, data_set.columns, kpis)
    assert story.model is not None
    datastories.mlflow.log_model(story.model, 'model')
This will log an experiment in MLflow, with a default minimal set-up. For instance, the folder structure will look like:
|- model
|---- data
|------- model_dump.rsx
|---- MLmodel
|---- conda.yaml
The model/data folder is created automatically and contains the RSX file representing the model associated with the story. Note that the object we log is not the story, but the model associated with it.
The conda.yaml file contains a minimal set of dependencies, and does not rely on a given version of cloudpickle. For example, a possible output would be:
channels:
  - conda-forge
dependencies:
  - python=3.7.5
  - pip<=20.0.2
  - pip:
    - mlflow
    - --extra-index-url https://pypi.datastories.com/
    - datastories==1.6.1b0
name: mlflow-env
The explicit dependency on the PyPI repository makes it possible to register models for deployment as standalone endpoints through Databricks.
Loading a model¶
A logged model can be reloaded using the load_model method, which can be used in the same fashion as its standard counterparts for Keras or Scikit-Learn:
import datastories.mlflow

mlflow_run_id = '...'  # the ID of the run that logged the model

loaded_model = datastories.mlflow.load_model(
    f'runs:/{mlflow_run_id}/model'
)
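As a side note, the URI passed to load_model follows the standard MLflow artifact URI scheme, runs:/<run_id>/<artifact_path>, where the artifact path matches the name given to log_model. A minimal sketch of composing such a URI (plain Python, no SDK required; the helper name is our own, not part of the SDK):

```python
def model_uri(run_id: str, artifact_path: str = 'model') -> str:
    """Compose an MLflow 'runs:/' model URI from a run ID and the
    artifact path that was used when the model was logged."""
    return f'runs:/{run_id}/{artifact_path}'

# For example, with the artifact path 'model' used above:
uri = model_uri('0123456789abcdef')
print(uri)  # runs:/0123456789abcdef/model
```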
The model can then be used to predict. This has the exact same effect as calling predict on a DataStories Model object, except that the result type is guaranteed to be a Pandas DataFrame. This is required by the MLflow API.
It is also possible to get back the DataStories model, through the custom loaded_model.ds_model property. However, it is not possible to get back the DataStories story.
Loading as a PyFunc¶
In accordance to the MLflow API, DataStories model can be loaded as
regular PyFunc function, using mlflow.pyfunc
package. For instance:
import mlflow
mlflow_run_id = ???
loaded_model = mlflow.pyfunc.load_model(
f'runs:/{mlflow_run_id}/models
)
Auto-logging¶
In addition to the features above, the DataStories SDK also provides an auto-logging feature, in the same spirit as those provided by Scikit-Learn, Keras, PyTorch, and similar frameworks:
import mlflow
import datastories.mlflow

datastories.mlflow.autolog()

# This will log the story automatically:
story = predict_kpis(df, df.columns, kpis)
assert mlflow.last_run() is not None
When auto-logging is enabled, a number of metrics and parameters are recorded inside the run.
The parameters of the run recorded as inline parameters (available directly in the MLflow summary overview) are the arguments of predict_kpis, among which outlier_elimination and runs. If the user calls predict_kpis_ex with an ex parameter carrying extra arguments, those are recorded in an extra_parameters.json file.
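The exact layout of extra_parameters.json is not specified here. Purely as an illustration, a sketch of serialising a hypothetical set of extra arguments to such a file with the standard json module (the keys shown are made up, not the SDK's actual parameter names):

```python
import json
import os
import tempfile

# Hypothetical extra arguments, as one might pass them through the
# `ex` parameter of predict_kpis_ex; the keys below are illustrative.
extra_parameters = {'max_depth': 5, 'seed': 42}

# Write them out the way an auto-logger could store them alongside the run.
path = os.path.join(tempfile.mkdtemp(), 'extra_parameters.json')
with open(path, 'w') as f:
    json.dump(extra_parameters, f, indent=2)
```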
The metrics of the run recorded as inline metrics are a summary of the prediction performance, for each KPI. A complete log of the story metrics and the model metrics (usually available through story.metrics and model.metrics) is also logged in a metrics.json file.
In addition, two visualisation reports, namely the What-Ifs and the Driver Overview components, are saved in the reports folder. These two standalone files can be used to get a quick overview of the drivers.
Note: It is possible to turn autologging off by calling
datastories.mlflow.autolog(turn_off=True)
.