Using optimizers in the SDK

The purpose of this note is to explain how to use the DataStories SDK to optimize models. It covers recent changes in the SDK, together with logging to MLflow.

This document is organised as a step-by-step example.

How to define optimizers

Loading data and training a Story

As usual, we load data from a file via Pandas, and we train a story based on those data. The data set we are going to use is the houses.csv dataset, available in the examples:

from pandas import read_csv
data = read_csv("data/houses.csv", delimiter=";", decimal=".")

from datastories.story import predict_kpis
story = predict_kpis(data, data.columns, [
    # KPIS
    'Price', 'Living area'
])

Alternatively, we can also load an existing model with the regular API:

from datastories.model import Model
model = Model('data/houses.rsx')

Although a story object also contains a model via its story.model property, this notion is slightly different from a model acquired by directly loading an RSX file. The difference has a slight but noticeable impact on how to optimize.

Create an optimizer

Optimizers are objects that allow the exploration and optimization of models. The standard way to create an optimizer is by using the exposed factory method:

from datastories.optimization import *

optimizer = create_optimizer(
    population_size=500,
    iterations=250
)

This object can be parametrized with two optional parameters: the population size that will be used to explore the model, and the number of iterations.

Since optimizers allocate resources, it is a good idea to reuse them as much as possible.

Creating optimization specifications

To give an optimizer the objectives to reach, and the constraints under which to reach them, we create an optimizer specification object:

specification = OptimizerSpecification(
    objectives=[
        Maximize('Land Area'),
        Minimize('Price')
    ],
    constraints=[
        AtLeast('Living area', 150)
    ]
)

Objectives and constraints are defined through word-like objects that represent them. In the above example, we would like to minimize the price and maximize the land area, while keeping the living area above 150 units.

An exhaustive list of definition wordings is available in the documentation.
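To make the semantics of these wordings concrete, here is a purely illustrative sketch, not the SDK's internal implementation; only the names Maximize, Minimize and AtLeast come from the example above. Objectives score candidate solutions, while constraints filter them:

```python
# Hypothetical sketch of the semantics of optimization wordings.
# The class names follow the SDK example above; the implementation is
# purely illustrative and NOT the SDK's actual code.
from dataclasses import dataclass

@dataclass
class Maximize:
    variable: str
    def score(self, candidate: dict) -> float:
        # Higher values of the variable are better.
        return candidate[self.variable]

@dataclass
class Minimize:
    variable: str
    def score(self, candidate: dict) -> float:
        # Lower values are better; negate for a uniform "maximize" view.
        return -candidate[self.variable]

@dataclass
class AtLeast:
    variable: str
    threshold: float
    def is_satisfied(self, candidate: dict) -> bool:
        # Constraints do not rank candidates, they only filter them.
        return candidate[self.variable] >= self.threshold

candidate = {'Land Area': 180.0, 'Price': 266397.0, 'Living area': 173.7}
assert AtLeast('Living area', 150).is_satisfied(candidate)
```

In this reading, an optimizer searches for candidates that maximize all objective scores while satisfying every constraint.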

Specifying additional value ranges

Under some circumstances, the notion of constraint encoded in the specification is too strong. This is the case, for example, when one would like to constrain a driver to a constant value.

In those cases, using the constraints field above would impose too strong a condition on the optimizer, and the model would not converge.

In order to circumvent this behavior, we can impose value ranges in the optimizers. For example:

# Asking for bedrooms between 3 and 4
variable_ranges = {
    'Bedrooms': (3, 4)
}

# Asking for exactly 4 bedrooms
variable_ranges = {
    'Bedrooms': 4
}
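The two forms can be read as follows: a (low, high) tuple constrains a variable to an interval, while a single scalar pins it to an exact value. A small helper (purely illustrative, not part of the SDK) normalizing both forms to intervals makes this explicit:

```python
# Illustrative helper (not part of the SDK): normalize variable_ranges
# entries so that a scalar value v becomes the degenerate interval (v, v).
def normalize_ranges(variable_ranges: dict) -> dict:
    normalized = {}
    for name, value in variable_ranges.items():
        if isinstance(value, tuple):
            normalized[name] = value          # already an interval
        else:
            normalized[name] = (value, value)  # exact value
    return normalized

assert normalize_ranges({'Bedrooms': (3, 4)}) == {'Bedrooms': (3, 4)}
assert normalize_ranges({'Bedrooms': 4}) == {'Bedrooms': (4, 4)}
```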

Value ranges allow a better preparation of the population, but they come at the price of potentially introducing a bias, as they reduce the number of dimensions to explore.

Invoking an optimizer

Once the specifications and the value ranges are decided, an optimizer can be used to explore a model and try to find an optimal set of drivers.

The straightforward way is to invoke the optimizer on a Model object:

optimization_result = optimizer.optimize(
    model,
    optimization_spec=specification,
    variable_ranges=variable_ranges
)

By default, the progress_bar parameter, which displays a progress bar during the run, is set to True.

Currently, there is a deprecated way to mimic the above run when given a Story object:

optimizer.optimize(story.model.model, ...)

The usage of story.model.model is deprecated and should be replaced. Because story.model encapsulates the model of the story, the standard way to call an optimizer on a story model is through the indirection:

optimization_result = story.model.optimize(
    optimizer=optimizer,
    optimization_spec=specification,
    variable_ranges=variable_ranges
)

If the optimizer field is not provided, or if it is None, a default optimizer will be created.

Using Optimization models in MLflow

The datastories.mlflow.optimization module comes with facilities to log models that represent results of optimizations.

Note that, in contrast with the logging of plain DataStories models, we do not provide autologging.

Creating an MLflow model

We create an optimization model for MLflow by creating an OptimizationModel object:

from datastories.mlflow.optimization import OptimizationModel
optim_model = OptimizationModel(
    model=model,  # or story.model
    optimization_spec=specification,
    variable_ranges=variable_ranges
)

Observe the similarity between the OptimizationModel creation and the optimize method on optimizers.

The input for this model is a variable_ranges list that enriches, and potentially overrides, the one provided to the model's constructor.

You can always inspect the variable_ranges that a prediction will use by invoking the utility:

optim_model.compute_input_ranges([
    # new set of variable_ranges
])

This call returns the effective variable ranges used when the model is queried with model_input from MLflow.

The current strategy is to give priority to the model input from the user, and to enrich it with the ranges encoded in the constructor.
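Assuming variable ranges can be represented as plain dictionaries, this priority rule can be sketched as a simple dictionary merge (a hypothetical illustration, not the SDK's actual compute_input_ranges implementation):

```python
# Illustrative sketch of the merging strategy: ranges supplied by the user
# at prediction time take priority; constructor ranges fill in the rest.
# NOT the SDK's actual implementation.
def merge_ranges(constructor_ranges: dict, user_ranges: dict) -> dict:
    merged = dict(constructor_ranges)  # start from the constructor ranges
    merged.update(user_ranges)         # user input overrides on conflict
    return merged

constructor_ranges = {'Bedrooms': (3, 4), 'Year': (1950, 1960)}
user_ranges = {'Bedrooms': (3.8, 4)}
assert merge_ranges(constructor_ranges, user_ranges) == {
    'Bedrooms': (3.8, 4),   # overridden by the user input
    'Year': (1950, 1960)    # enriched from the constructor
}
```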

Logging an Optimization Model

You can log an Optimization Model in MLflow using standard PyFunc logging, or use the facility provided by DataStories:

import mlflow

from datastories.mlflow.optimization import log_model, compute_signature

with mlflow.start_run():
    signature = compute_signature(optim_model)
    log_model(optim_model, 'model', signature=signature)

This will create an MLflow model file containing the DataStories model in RSX format, together with a serialized form of the specification and variable ranges. The signature of the model will be computed by the SDK.

Querying a deployed Optimization Model

Once an Optimization Model has been registered (on a platform like Databricks, for example), you should be able to query it:

import pandas as pd

# Scoring a model that accepts pandas DataFrames
user_request = pd.DataFrame.from_dict({
    '[Bedrooms] min': [3, 3.8],
    '[Bedrooms] max': [4,   4]
})

score_model(user_request)

Such a call would return the optimization results for the runs with 3 <= Bedrooms <= 4 and 3.8 <= Bedrooms <= 4, respectively. We obtained something along these lines:

[
    {
        'Bedrooms': 3.7797896132407627,
        'EPC': 213.0000000000001,
        'Land Area': 123.0000000000003,
        'Year': 1952.0,
        'Price': 278851.07035839395,
        'Price_uncertainty': 16924.01112037047,
        'Living area': 182.37404619498324,
        'Living area_uncertainty': 10.046381550187041
    },
    {
        'Bedrooms': 3.8000000000000074,
        'EPC': 213.00000000000009,
        'Land Area': 123.00000000000078,
        'Year': 1952.0,
        'Price': 279459.6470510475,
        'Price_uncertainty': 17043.45040122649,
        'Living area': 182.55707323641684,
        'Living area_uncertainty': 10.083453460104948
    }
]
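Since the endpoint returns a list of flat dictionaries, the results are easy to post-process with pandas. For example (using abbreviated copies of the two results above):

```python
import pandas as pd

# Two (abbreviated) optimization results, as returned by the endpoint.
results = [
    {'Bedrooms': 3.78, 'Price': 278851.07, 'Price_uncertainty': 16924.01},
    {'Bedrooms': 3.80, 'Price': 279459.65, 'Price_uncertainty': 17043.45},
]

# Tabulate the results and pick the run with the lowest predicted price.
frame = pd.DataFrame(results)
cheapest = frame.loc[frame['Price'].idxmin()]
assert cheapest['Bedrooms'] == 3.78
```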

As another example, the following request fixes the Land Area at three different values:

user_request = pd.DataFrame.from_dict({
    'Land Area': [180, 150, 120]
})

gives the following result:

[
    {
        'Bedrooms': 3.0,
        'EPC': 213.00000000000017,
        'Land Area': 180.0,
        'Year': 1952.0,
        'Price': 266397.0325322998,
        'Price_uncertainty': 13555.391442169215,
        'Living area': 173.68393153430904,
        'Living area_uncertainty': 8.470395128186775
    },
    {
        'Bedrooms': 3.0,
        'EPC': 213.00000000000017,
        'Land Area': 150.0,
        'Year': 1952.0,
        'Price': 263196.2969955701,
        'Price_uncertainty': 15012.784809173781,
        'Living area': 171.2967377380069,
        'Living area_uncertainty': 9.503470561289724
    },
    {
        'Bedrooms': 3.0,
        'EPC': 213.00000000000014,
        'Land Area': 120.0,
        'Year': 1952.0,
        'Price': 260029.30907049886,
        'Price_uncertainty': 16935.6086776347,
        'Living area': 168.4210121425581,
        'Living area_uncertainty': 10.89076053863659
    }
]

As a word of caution, we draw the reader's attention to the serialisation policy Databricks provides through its model template score_model. Because Databricks and MLflow do not allow passing data that contain interval values, you need to mimic the example above and add the min and max suffixes. We recommend pre-processing your dataset so that it is automatically converted before the request is sent to the MLflow endpoint.
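Such a pre-processing step could look as follows (an illustrative helper, not part of the SDK; the column naming follows the request example above):

```python
import pandas as pd

# Illustrative pre-processing (not part of the SDK): expand interval
# (tuple) values into the "[name] min" / "[name] max" column convention
# used in the scoring request above; scalar values are passed through.
def expand_intervals(requests: list) -> pd.DataFrame:
    rows = []
    for request in requests:
        row = {}
        for name, value in request.items():
            if isinstance(value, tuple):
                low, high = value
                row[f'[{name}] min'] = low
                row[f'[{name}] max'] = high
            else:
                row[name] = value
        rows.append(row)
    return pd.DataFrame(rows)

requests = [{'Bedrooms': (3, 4)}, {'Bedrooms': (3.8, 4)}]
user_request = expand_intervals(requests)
assert list(user_request.columns) == ['[Bedrooms] min', '[Bedrooms] max']
```

The resulting DataFrame can then be sent to the MLflow endpoint via score_model, exactly as in the example above.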