Using optimizers in the SDK¶
The purpose of this note is to explain how to use the DataStories SDK to optimize models. It covers recent changes in the SDK, together with logging into MLflow.
This document is organised as a step-by-step example.
How to define optimizers¶
Loading data and training a Story¶
As usual, we load data from a file via Pandas, and we train a story based on those data. The data set we are going to use is the houses.csv dataset, available in the examples:
from pandas import read_csv
data = read_csv("data/houses.csv", delimiter=";", decimal=".")
from datastories.story import predict_kpis
story = predict_kpis(data, data.columns, [
# KPIS
'Price', 'Living area'
])
Alternatively, we can load an existing model with the regular API:
from datastories.model import Model
model = Model('data/houses.rsx')
Although a story object also contains a model via its property story.model, the notion is slightly different from a model acquired by directly loading an RSX file. This has a light but noticeable impact on how to optimize.
Create an optimizer¶
Optimizers are objects that allow inspection of models. The standard way to create an optimizer is by using the exposed factory method:
from datastories.optimization import *
optimizer = create_optimizer(
population_size=500,
iterations=250
)
This object can be parametrized with two optional parameters: the population size that will be used to explore the model, and the number of iterations.
Since optimizers allocate resources, it is a good idea to reuse them as much as possible.
Creating optimization specifications¶
In order to give an optimizer the objectives to reach and the constraints to reach those objectives, we create an optimizer specification object:
specification = OptimizerSpecification(
objectives=[
Maximize('Land Area'),
Minimize('Price')
],
constraints=[
AtLeast('Living area', 150)
]
)
Objectives and constraints are defined through word-like objects. In the above example, we would like to minimize the price and maximize the land area, while keeping the living area above 150 units.
An exhaustive list of definition wordings is available in the documentation.
Specifying additional value ranges¶
Under some circumstances, the notion of constraint encoded in the specification is too strong. This could be the case, for example, when one would like to constrain a driver to a constant value.
In those cases, using the constraints field above would impose an overly strong condition on the optimizer, and the model would not converge.
In order to circumvent this behavior, we can impose value ranges in the optimizers. For example:
# Asking for bedrooms between 3 and 4
variable_ranges = {
    'Bedrooms': (3, 4)
}
# Asking for bedrooms exactly 4
variable_ranges = {
    'Bedrooms': 4
}
Value ranges allow a better preparation of the population, but they come at the price of potentially introducing a bias, as they reduce the number of dimensions to explore.
Invoking an optimizer¶
Once the specifications and the value ranges are decided, an optimizer can be used to explore a model and try to find an optimal set of drivers.
The straightforward way is to invoke the optimizer on a Model
object:
optimization_result = optimizer.optimize(
model,
optimization_spec=specification,
variable_ranges=variable_ranges
)
By default, the progress_bar parameter, which displays a progress bar during the run, is set to True.
Currently, there is a deprecated way to mimic the above run when we are given a Story object:
optimizer.optimize(story.model.model, ...)
The usage of story.model.model is deprecated and should be replaced.
Because story.model encapsulates the model of the story, the standard way to call an optimizer on a story model is through the indirection:
optimization_result = story.model.optimize(
optimizer=optimizer,
optimization_spec=specification,
variable_ranges=variable_ranges
)
If the optimizer field is not provided, or if it is None, a default optimizer will be created.
Using Optimization models in MLflow¶
The datastories.mlflow.optimization
module comes with
facilities to log models that represent results of optimizations.
Note that, in contrast with the logging of DataStories models, autologging is not provided for optimization models.
Creating a MLflow model¶
We create an optimization model for MLflow by creating an OptimizationModel object:
from datastories.mlflow.optimization import OptimizationModel
optim_model = OptimizationModel(
    model=model,  # or story.model when starting from a Story
    optimization_spec=specification,
    variable_ranges=variable_ranges
)
Observe the similarity between the OptimizationModel creation and the optimize method on optimizers.
The input for this model is a variable_ranges list that will enrich, and potentially override, the one provided as a parameter of the model.
You can always inspect the variable_ranges that will be used by a prediction, by invoking the utility:
optim_model.compute_input_ranges([
# new set of variable_ranges
])
This call returns the true variable ranges used when the model is queried with model_input from MLflow. The current strategy is to give priority to the model input coming from the user, and to enrich it with the ranges encoded in the constructor.
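The priority rule described above can be sketched in plain Python. This is a simplified illustration of the documented strategy, not the SDK's actual compute_input_ranges implementation: user-supplied ranges win, and constructor ranges fill in the gaps.

```python
# Simplified illustration of the priority rule: user input overrides,
# constructor ranges enrich. NOT the SDK's actual implementation.
def merge_ranges(constructor_ranges, user_ranges):
    merged = dict(constructor_ranges)  # ranges set at construction time
    merged.update(user_ranges)         # user input takes priority
    return merged

merge_ranges({'Bedrooms': (3, 4), 'Land Area': 150}, {'Bedrooms': (3, 3)})
# → {'Bedrooms': (3, 3), 'Land Area': 150}
```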
Logging an Optimization Model¶
You can log an Optimization Model in MLflow using a standard PyFunc logging, or else use the one provided by DataStories:
from datastories.mlflow.optimization import log_model, compute_signature
with mlflow.start_run():
signature = compute_signature(optim_model)
log_model(optim_model, 'model', signature=signature)
This will create an MLflow model file containing the DataStories model in RSX format, together with a serialized form of the specification and variable ranges. The signature of the model will be computed by the SDK.
Querying a deployed Optimization Model¶
Once an Optimization Model has been registered (on a platform like Databricks, for example), you should be able to query it:
# Scoring a model that accepts pandas DataFrames
import pandas as pd

user_request = pd.DataFrame.from_dict({
    '[Bedrooms] min': [3, 3.8],
    '[Bedrooms] max': [4, 4]
})
score_model(user_request)
Such a call would return the optimization results for the runs with 3 <= Bedrooms <= 4 and 3.8 <= Bedrooms <= 4, respectively. We obtained something along the lines of:
[
{
'Bedrooms': 3.7797896132407627,
'EPC': 213.0000000000001,
'Land Area': 123.0000000000003,
'Year': 1952.0,
'Price': 278851.07035839395,
'Price_uncertainty': 16924.01112037047,
'Living area': 182.37404619498324,
'Living area_uncertainty': 10.046381550187041
},
{
'Bedrooms': 3.8000000000000074,
'EPC': 213.00000000000009,
'Land Area': 123.00000000000078,
'Year': 1952.0,
'Price': 279459.6470510475,
'Price_uncertainty': 17043.45040122649,
'Living area': 182.55707323641684,
'Living area_uncertainty': 10.083453460104948
}
]
As another example, the following request fixes the Land Area at three different values:
user_request = pd.DataFrame.from_dict({
'Land Area': [180, 150, 120]
})
gives the following result:
[
{
'Bedrooms': 3.0,
'EPC': 213.00000000000017,
'Land Area': 180.0,
'Year': 1952.0,
'Price': 266397.0325322998,
'Price_uncertainty': 13555.391442169215,
'Living area': 173.68393153430904,
'Living area_uncertainty': 8.470395128186775
},
{
'Bedrooms': 3.0,
'EPC': 213.00000000000017,
'Land Area': 150.0,
'Year': 1952.0,
'Price': 263196.2969955701,
'Price_uncertainty': 15012.784809173781,
'Living area': 171.2967377380069,
'Living area_uncertainty': 9.503470561289724
},
{
'Bedrooms': 3.0,
'EPC': 213.00000000000014,
'Land Area': 120.0,
'Year': 1952.0,
'Price': 260029.30907049886,
'Price_uncertainty': 16935.6086776347,
'Living area': 168.4210121425581,
'Living area_uncertainty': 10.89076053863659
}
]
As a word of caution, we draw the reader's attention to the serialisation policy Databricks provides through its model template score_model.
Because Databricks and MLflow do not allow passing data that contain interval values, you need to mimic the example above by adding the min and max suffixes. We recommend pre-processing your dataset so that it is automatically converted before the request is sent to the MLflow endpoint.
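One way to do such pre-processing is sketched below. The helper name ranges_to_request is hypothetical (it is not part of the SDK); the '[X] min' / '[X] max' column naming follows the request examples above, and a scalar is treated as a fixed value, mirroring the variable_ranges convention.

```python
# Hypothetical helper (not part of the SDK): convert per-driver ranges
# into the "[X] min" / "[X] max" column layout used by the scoring
# examples above. A scalar is treated as a fixed value (min == max).
def ranges_to_request(ranges):
    columns = {}
    for name, value in ranges.items():
        low, high = value if isinstance(value, tuple) else (value, value)
        columns[f'[{name}] min'] = [low]
        columns[f'[{name}] max'] = [high]
    return columns

request = ranges_to_request({'Bedrooms': (3, 4), 'Land Area': 150})
# Feed the result into pandas before scoring, e.g.:
#   score_model(pd.DataFrame.from_dict(request))
```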