SDK Reference¶
General Interfaces¶
- datastories.api.get_version()¶
Get the version of the currently loaded modules.
- Returns:
A dictionary containing the loaded modules and their corresponding versions.
Base classes and interfaces¶
- class datastories.api.IAnalysisResult¶
Interface implemented by all analysis results.
- plot(*args, **kwargs)¶
Plots a graphical representation of the results in Jupyter Notebook.
- to_csv(file_path, delimiter=',', decimal='.')¶
Export the result to a CSV file.
Args:
- file_path (str):
path to the output file.
- delimiter (str=','):
character used as value delimiter.
- decimal (str='.'):
character used as decimal point.
Raises:
ValueError:
when the object returned by to_pandas is not a Pandas data frame.
- to_excel(file_path, tab_name='Statistics')¶
Export the result to an Excel file.
Args:
- file_path (str):
path to the output file.
- tab_name (str='Statistics'):
name of the Excel tab where to save the result.
Raises:
ValueError:
when the object returned by to_pandas is not a Pandas data frame.
- to_html(file_path, title='', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the analysis result visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=''):
HTML document title.
- subtitle (str=''):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
a value of type datastories.api.VisualizationScenario to indicate the usage scenario.
- abstract to_pandas()¶
Exports the result to a Pandas DataFrame.
Returns:
The constructed Pandas DataFrame.
- to_txt(file_path)¶
Export the result to a TXT file.
Args:
- file_path (str):
path to the output file.
- class datastories.api.IConsole¶
Interface implemented by all message loggers.
- abstract log(message)¶
Log a message to the console.
Args:
- message (string):
the message to log.
- class datastories.api.IPrediction(data)¶
Bases:
IAnalysisResult
Interface implemented by all prediction results.
Args:
- data (obj):
The associated prediction input data.
- abstract property metrics¶
A dictionary containing prediction performance metrics.
These metrics are computed when the data frame used for prediction includes KPI values, for the purpose of evaluating the model prediction performance.
- class datastories.api.IPredictiveModel¶
Interface implemented by all prediction models.
- abstract property metrics¶
A dictionary containing model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
- abstract property model¶
The generic RSX based model used for making predictions.
- abstract predict(data_frame)¶
Predict the model KPI on a new data frame.
Args:
- data_frame (obj):
the data frame on which the model associated KPI is to be predicted.
Returns:
An object of type
datastories.regression.PredictionResult
encapsulating the prediction results.
Raises:
ValueError:
when not all required columns are provided.
- to_cpp(file_path)¶
Export the model to a C++ file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError:
when there is a problem saving the file.
- to_excel(file_path)¶
Export the model to an Excel file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError:
when there is a problem saving the file.
- to_matlab(file_path)¶
Export the model to a MATLAB file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError:
when there is a problem saving the file.
- to_py(file_path)¶
Export the model to a Python file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError:
when there is a problem saving the file.
- to_r(file_path)¶
Export the model to an R file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError:
when there is a problem saving the file.
- class datastories.api.IStory(params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)¶
Bases:
IAnalysisResult
Interface implemented by all story analyses.
Args:
- params (dict):
dictionary containing user and inferred analysis parameters.
- metainfo (dict):
dictionary containing process parameters (e.g., progress pointers).
- raw_results (dict):
dictionary containing rainstorm processing results.
- results (dict):
dictionary containing processing results.
- folder (str=None):
the story working folder. Leave unspecified to create one at runtime.
- notes (list=[]):
a list of notes.
- upload_function (callback=None):
a function to upload files to a storage (relevant for the client).
- on_snapshot (callback=None):
a callback to be executed upon saving a snapshot (e.g., upload snapshot to S3).
- progress_bar (obj=None):
a progress bar object.
- abstract add_note(note)¶
Add an annotation to the story results.
The already present annotations can be retrieved using the datastories.api.IStory.notes() property.
Args:
- note (str):
the annotation to be added.
- abstract clear_note(note_id)¶
Remove a specific annotation associated with the story analysis.
Args:
- note_id (int):
the index of the note to be removed.
Raises:
ValueError:
when the note index is unknown.
- abstract clear_notes()¶
Clear the annotations associated with the story analysis.
- abstract property info¶
Displays story execution information.
- abstract static is_compatible(current_version_string, ref_version_string)¶
Checks if a story version is compatible with a reference version.
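The compatibility rule itself is not documented here; the following is a minimal, dependency-free sketch assuming a semantic-versioning rule (same major version, reference minor version not newer). Both the rule and the standalone function are assumptions, not the SDK implementation:

```python
# Hypothetical sketch of a version compatibility check; the SDK's
# actual rule may differ.
def is_compatible(current_version_string, ref_version_string):
    current = [int(part) for part in current_version_string.split(".")]
    reference = [int(part) for part in ref_version_string.split(".")]
    # Same major version, and the reference (saved) version is not
    # newer than the currently loaded code.
    return current[0] == reference[0] and current[1] >= reference[1]
```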
- abstract classmethod load(file_path)¶
Loads a previously saved story.
- abstract property metrics¶
Returns a set of metrics computed during analysis.
- abstract property notes¶
The list of all annotations currently associated with the story analysis.
- abstract reset()¶
Reset the execution pointer of a story to the first stage.
- abstract run(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)¶
Resumes the execution of a story from a given stage.
The stage to resume from is optional. If not specified, the story is executed from the beginning. If the stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed will be used as the starting point, unless the [strict] argument is set to True. In that case an exception will be raised if the execution cannot be resumed from the requested stage.
Args:
- resume_from (StoryProcessingStage=None):
The stage to resume execution from. Should be a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.
- strict (bool=False):
Raise an error if execution cannot be resumed from the requested stage.
- params (dict={}):
Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.
- progress_bar (obj=None):
An object of type datastories.display.ProgressReporter to replace the currently used progress reporter. When not specified, the current story progress reporter will not be modified. The use case for this is to set a progress bar after the story is loaded, when a progress bar cannot be given to the load function directly (e.g., when a progress bar has to be constructed based on the story).
- check_interrupt (func=None):
an optional callback to check whether analysis execution needs to be interrupted.
Raises:
datastories.api.errors.StoryError:
if a stage is specified for which no intermediate results are available and the [strict] argument is set to True.
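The fallback behavior described above can be illustrated with a small sketch (a hypothetical helper, not part of the SDK): resume from the requested stage when its intermediate results exist, otherwise either raise (strict) or fall back to the latest stage that can be executed.

```python
class StoryError(Exception):
    """Stand-in for datastories.api.errors.StoryError."""

def resolve_resume_stage(requested, stages_with_results, strict=False):
    # stages_with_results: ordered list of stages whose intermediate
    # results are available.
    if requested in stages_with_results:
        return requested
    if strict:
        raise StoryError("cannot resume from stage %r" % requested)
    # Fall back to the closest stage that can actually be executed.
    return stages_with_results[-1]
```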
- abstract save(file_path)¶
Saves the story analysis results.
- abstract property stats¶
Returns a set of stats computed during analysis.
- class datastories.api.IStoryDeprecated(notes=None)¶
Bases:
IAnalysisResult
Interface implemented by all story analyses.
Args:
- notes (list=[]):
a list of notes.
- add_note(note)¶
Add an annotation to the story results.
The already present annotations can be retrieved using the datastories.api.IStory.notes() property.
Args:
- note (str):
the annotation to be added.
- clear_note(note_id)¶
Remove a specific annotation associated with the story analysis.
Args:
- note_id (int):
the index of the note to be removed.
Raises:
ValueError:
when the note index is unknown.
- clear_notes()¶
Clear the annotations associated with the story analysis.
- static is_compatible(current_version_string, ref_version_string)¶
Checks if a story version is compatible with a reference version.
- abstract static load(file_path)¶
Loads a previously saved story.
- abstract property metrics¶
Returns a set of metrics computed during analysis.
- property notes¶
The list of all annotations currently associated with the story analysis.
- abstract save(file_path)¶
Saves the story analysis results.
- class datastories.api.IProgressObserver¶
Interface implemented by all progress report observers.
- abstract on_progress(progress)¶
Callback triggered upon progress update.
Args:
- progress (float):
the amount of progress. Possible values: [0-1]
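A minimal observer satisfying this interface might look as follows; the class name and the recording behavior are illustrative, and only the on_progress(progress) signature comes from the interface:

```python
class RecordingProgressObserver:
    """Illustrative IProgressObserver-style implementation that
    records every reported progress value."""
    def __init__(self):
        self.history = []

    def on_progress(self, progress):
        # progress is a float in the range [0, 1].
        self.history.append(progress)

observer = RecordingProgressObserver()
for step in (0.0, 0.5, 1.0):
    observer.on_progress(step)
```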
- class datastories.api.ISlide(slide_deck=None, file_path='slide.json', slide_name=None)¶
Interface implemented by slides.
A slide is a collection of data and references to data that a renderer can transform into a visual representation.
Args:
- slide_deck (obj=None):
a datastories.api.SlideDeck object used to manage the slide.
- file_path (str='slide.json'):
path to a file to be used for serializing the slide.
- property slide¶
The slide content.
The slide content is a versioned and serializable entity that can be used to visualize the slide without requiring access to the object itself.
NOTE: This information cannot be used to construct the object by deserialization.
- class datastories.api.SlideDeck¶
Base class for slide decks.
A slide deck is a convenience component that facilitates managing a collection of slides.
- add_slide(slide)¶
Adds a slide to the deck.
Args:
- slide (datastories.api.ISlide):
the slide to be added.
- clear_slides()¶
Remove the slides in the deck.
- goto_slide(slide_idx)¶
Sets the current slide pointer to a specific value.
Args:
- slide_idx (int):
the new value for the slide pointer.
- has_slides()¶
Check if the slide deck contains any slides (i.e., it is not empty).
Returns:
True if the slide deck contains slides, otherwise False.
- insert_slide(pos_idx, slide)¶
Inserts a slide in the deck at a given position.
Args:
- pos_idx (int):
the index at which the slide is to be inserted.
- slide (datastories.api.ISlide):
the slide to be inserted.
- next_slide()¶
Retrieves the next slide in the deck and advances the slide pointer.
If the deck is at the end, or has no slides, it returns None.
Returns:
The next slide in the deck or None.
- property slides¶
The deck slides.
- sort_slides(names)¶
Sort the slides based on a list of names.
Slides are sorted in place.
Args:
- names (list):
A list of slide names indicating the desired sort order. Slides that are not mentioned in the list will be added at the end.
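The slide-pointer bookkeeping implied by add_slide, goto_slide, and next_slide can be sketched in a few lines. This is a dependency-free illustration using strings in place of ISlide objects, not the SDK implementation:

```python
class MiniDeck:
    """Illustrative slide deck with an internal slide pointer."""
    def __init__(self):
        self._slides = []
        self._pointer = 0

    def add_slide(self, slide):
        self._slides.append(slide)

    def goto_slide(self, slide_idx):
        self._pointer = slide_idx

    def has_slides(self):
        return len(self._slides) > 0

    def next_slide(self):
        # Return None when at the end of the deck or when empty.
        if self._pointer >= len(self._slides):
            return None
        slide = self._slides[self._pointer]
        self._pointer += 1
        return slide

deck = MiniDeck()
deck.add_slide("intro")
deck.add_slide("results")
```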
- class datastories.core.utils.ExportableMixin¶
- to_csv(file_path, delimiter=',', decimal='.', df=None)¶
Export the result to a CSV file.
Args:
- file_path (str):
path to the output file.
- delimiter (str=','):
character used as value delimiter.
- decimal (str='.'):
character used as decimal point.
- df (pandas=None):
data frame to export. If left unspecified it will use the data frame returned by the to_pandas method of the object.
Raises:
ValueError:
when the serialized object is not a Pandas data frame.
- to_excel(file_path, tab_name='Statistics', df=None)¶
Export the result to an Excel file.
Args:
- file_path (str):
path to the output file.
- tab_name (str='Statistics'):
name of the Excel tab where to save the result.
- df (pandas=None):
data frame to export. If left unspecified it will use the data frame returned by the to_pandas method of the object.
Raises:
ValueError:
when the serialized object is not a Pandas data frame.
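The mixin contract (serialize whatever to_pandas returns, honoring the delimiter and decimal conventions) can be sketched without pandas. Here a to_rows method stands in for to_pandas, and the class name and file handling are illustrative assumptions:

```python
import csv
import io

class ExportableSketch:
    """Dependency-free sketch of the ExportableMixin idea: to_csv
    serializes tabular data; to_rows() stands in for to_pandas()."""
    def to_rows(self):
        return [["name", "value"], ["alpha", 1.5], ["beta", 2.0]]

    def to_csv(self, file_obj, delimiter=",", decimal="."):
        writer = csv.writer(file_obj, delimiter=delimiter)
        for row in self.to_rows():
            # Apply the decimal-point convention to float cells only.
            writer.writerow(
                [str(cell).replace(".", decimal)
                 if isinstance(cell, float) else cell
                 for cell in row])

buffer = io.StringIO()
ExportableSketch().to_csv(buffer, delimiter=";", decimal=",")
```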
- class datastories.core.utils.ManagedObject(dependencies=None, *args, **kwargs)¶
An object that has a user controllable lifespan.
Typically inherited by classes that require special resources to be allocated and manually released outside the Python object lifetime management.
Note: Objects of this class should not be manually constructed.
- assert_alive()¶
Triggers an exception if the object has been manually released.
- release()¶
Releases the object associated storage.
Note: This function should only be used in order to force releasing allocated resources. Using the object after this point would lead to an exception.
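The assert_alive/release contract amounts to a liveness flag guarding every operation. A minimal sketch follows; the class name and exception type are illustrative (the SDK raises its own ObjectError):

```python
class ReleasedObjectError(Exception):
    """Illustrative stand-in for the SDK's object-validity error."""

class ManagedSketch:
    def __init__(self):
        self._alive = True

    def assert_alive(self):
        if not self._alive:
            raise ReleasedObjectError("object has been released")

    def release(self):
        # A real implementation would free native resources here.
        self._alive = False

obj = ManagedSketch()
obj.assert_alive()  # no error while the object is alive
obj.release()
```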
- class datastories.core.utils.StorageBackedObject(folder=None, files=None, *args, **kwargs)¶
An object that stores part of its resources on disk and loads them on demand.
The resources may be provided by the object dependencies or by the object associated storage. When resources are specified, the object can be made independent from its dependencies by copying the listed resources to its associated storage.
Note: Objects of this class should not be manually constructed.
- make_independent(base_folder='')¶
Make the object independent by copying required resources to its own folder.
Args:
- base_folder (str=’’):
the base folder for the unique object folder that will hold the required resources.
Errors¶
- class datastories.api.errors.DatastoriesError(value='')¶
Base exception class for the DataStories SDK.
- class datastories.api.errors.ObjectError(value='')¶
Exception generated when SDK managed objects are not valid.
- class datastories.api.errors.LicenseError¶
Exception generated when accessing license protected functionality using an invalid license.
- class datastories.api.errors.ConversionError(value='')¶
Error raised when data conversion fails.
- class datastories.api.errors.VisualizationError(value='')¶
Error raised when result visualization fails.
- class datastories.api.errors.StoryError(value='')¶
Base class for all story analysis related errors.
- class datastories.api.errors.StoryDataLoadingError(value='')¶
Exception generated when a story analysis cannot load the provided input data.
- class datastories.api.errors.StoryDataPreparationError(value='')¶
Exception generated when a story analysis cannot preprocess the provided data.
- class datastories.api.errors.StoryProcessingError(value='')¶
Exception generated when a story analysis cannot be performed.
- class datastories.api.errors.StoryInterrupted(value='')¶
Exception generated when a story analysis execution is interrupted.
- class datastories.api.errors.ParserError(value='')¶
Base class for all file parsing and validation related errors.
- class datastories.api.errors.FormatError(value='')¶
Error raised when the provided file is not in a readable format (unreadable csv, …)
- class datastories.api.errors.ValidationError(value='')¶
Error raised when the parser was able to read the file structure, but an error occurred during validation.
- class datastories.api.errors.TypeNotRecognized(value='')¶
Error raised when the SDK parser cannot determine the provided file type.
- class datastories.api.errors.TypeNotSupported(value='')¶
Error raised when the provided file type cannot be handled by the SDK parser.
- class datastories.api.errors.ExternalDataConnectionError(value='')¶
Error raised when VBA scripts or an external data connection are detected in a spreadsheet.
Constants and Enumerations¶
License Management¶
- datastories.api.get_activation_info()¶
Get information required to create and activate a DataStories license.
- Returns:
- dict:
a dictionary containing data to be submitted to the DataStories representative in charge of issuing the license.
The datastories.license package contains a collection of utility functions to facilitate license management. These functions are available as methods of a predefined object of class datastories.license.LicenseManager called manager.
Example:
from datastories.license import manager
manager.initialize('my_license.lic')
manager
- class datastories.license.LicenseManager(license_file_path=None)¶
Encapsulates the DataStories license manager.
The license manager enables users to inspect the details of their installed DataStories SDK license, and to use license keys that are not available in the standard installation locations (see Installation).
This class should not be instantiated directly. Instead one should use the already available object instance datastories.license.manager.
Args:
- license_file_path (str=None):
the path to a license key file or folder if other than the standard locations for the platform.
Attributes:
- status (str):
the status of the license manager initialization.
- license (obj):
the managed license as indicated in the license key file.
Example:
from datastories.license import manager
manager.initialize('my_license.lic')
manager
- property default_license_path¶
Default path used for license initialization if none provided.
- initialize(license_file_path=None, initialize_modules=True)¶
Initialize the license manager with a license key at a specific location.
Args:
- license_file_path (string):
the path to a license key file or a folder containing the license key file.
- initialize_modules (bool=True):
set to True in order to initialize dependent modules.
Raises:
ValueError:
when the provided license_file_path is not accessible.
- is_granted(option)¶
Checks if execution rights are granted for license protected functionality.
Args:
- option (str):
the license option required by the protected functionality.
Returns:
True if execution rights are granted by the installed license.
- property is_ok¶
Check the initialization status of the license manager.
The license manager initialization fails when no valid license file is found in the standard or user indicated locations.
Note: A successful license manager initialization does not imply a grant for using license protected functionality. For example, when an expired license is used, the initialization is still successful. To check whether execution rights are granted one should use the datastories.license.LicenseManager.is_granted() method.
Returns:
True if the license manager was successfully initialized.
- reinitialize()¶
Re-initializes the license manager.
This is done using the same license file path as in the previous call to datastories.license.LicenseManager.initialize().
- release()¶
Releases the currently held licenses.
This can be useful e.g., when using floating or counted licenses, as it makes the released licenses available for other clients or processes.
Note: once a license is released, the associated execution rights are retracted. In order to use the license protected functionality, users need to acquire the license by initializing the license manager again (i.e., datastories.license.LicenseManager.initialize()).
Data¶
The datastories.data package contains a collection of classes and functions for handling data and converting it to and from the internal format used by DataStories.
Base Classes¶
- class datastories.data.DataFrame¶
Encapsulates a data frame in the DataStories format.
Args:
- rows (int):
number of rows in the data frame.
- cols (int):
number of columns in the data frame.
- types (list):
list of value types for the data frame columns.
- cols¶
The number of columns in the data frame.
- columns¶
The list of data frame column names.
- static from_pandas(data_frame)¶
Construct a new datastories.data.DataFrame from a Pandas DataFrame object.
Args:
- data_frame (obj):
the source Pandas DataFrame object.
Returns:
The constructed datastories.data.DataFrame object.
- get(self, size_t row, size_t col)¶
Get the value of a cell in the data frame.
Args:
- row (int):
the index of the cell row.
- col (int):
the index of the cell column.
Returns:
- (float|string) :
the cell at position (row, column) in the data frame.
- get_name(self, size_t col)¶
Retrieve the name of a specific column.
Args:
- col (int):
the index of the column.
Returns:
- (str) :
the name of the column with the given index.
- get_type(self, size_t col)¶
Retrieve the type of values in a given column.
Args:
- col (int):
the index of the column.
Returns:
An object of type datastories.data.ColumnType.
- static load(file_path)¶
Load a data frame from a file.
Args:
- file_path (str):
path to the file to be loaded.
- mapper_get(self, size_t index, size_t value)¶
- static read_csv(file_path, delimiter=u', ', decimal=u'.', quote=u'"', int header_rows=1)¶
Loads a DataFrame from a CSV file.
Args:
- file_path (str):
the path to the file to load.
- delimiter (str=’,’):
character to use as value delimiter.
- decimal (str=’.’):
character to use as decimal point in numeric values.
- header_rows (int=1):
number of header rows (i.e., not containing data values)
- rows¶
The number of rows in the data frame.
- save(self, file_path)¶
Save the data frame to a file.
Args:
- file_path (str):
path to the output file.
- set_float(self, size_t row, size_t col, double val)¶
Sets the value of a given cell to a new float value.
Args:
- row (int):
the row index of the cell.
- col (int):
the column index of the cell.
- val (float):
the new float value.
- set_int(self, size_t row, size_t col, int64_t val)¶
Sets the value of a given cell to a new int value.
Args:
- row (int):
the row index of the cell.
- col (int):
the column index of the cell.
- val (int):
the new int value.
- set_name(self, size_t col, name)¶
Set the name of a column in the data frame.
Args:
- col (int):
the index of the column.
- name (str):
the new name.
- set_string(self, size_t row, size_t col, val)¶
Sets the value of a given cell to a new string value.
Args:
- row (int):
the row index of the cell.
- col (int):
the column index of the cell.
- val (str):
the new string value.
- to_pandas(self)¶
Exports the DataFrame to a Pandas DataFrame object.
Returns:
The constructed Pandas DataFrame object.
- class datastories.data.ColumnType(value)¶
Possible column types for datastories.data.DataFrame.
- DATE = 3¶
- INTEGER = 2¶
- MIXED = 10¶
- NUMERIC = 1¶
- STRING = 4¶
- UNKNOWN = 0¶
- class datastories.data.DataType(value)¶
Possible cell value types for datastories.data.DataFrame.
- DATE = 3¶
- INTEGER = 2¶
- NUMERIC = 1¶
- STRING = 4¶
- UNKOWN = 0¶
- class datastories.data.RangeType(value)¶
Possible value range types for datastories.data.DataFrame.
- CATEGORICAL = 3¶
- INTERVAL = 1¶
- ORDINAL = 2¶
- UNSPECIFIED = 0¶
- class datastories.data.BaseConverter¶
Base class for all DataStories SDK value type converters.
Objects of this class are callables. To apply the converter, simply call the object with the value to be converted.
The number of conversion operations (both successful or not) is tracked and can be retrieved and reset.
Example:
converter = BaseConverter()
converted_value = converter(raw_value)
- class datastories.data.IntConverter¶
Converter to integer values.
- class datastories.data.FloatConverter¶
Converter to float values.
- class datastories.data.StringConverter¶
Converter to string values.
- class datastories.data.BoolConverter(true_values, false_values)¶
Converter to boolean values.
Args:
- true_values (list):
a list of strings that will be regarded as True.
- false_values (list):
a list of strings that will be regarded as False.
- class datastories.data.DateConverter(**kwargs)¶
Converter to datetime values.
Args:
- kwargs:
Passed on to dateutil.parser.parse. See [https://dateutil.readthedocs.org/en/latest/parser.html#dateutil.parser.parse] for the accepted arguments.
- class datastories.data.NanConverter(nan_values=('', 'NA', 'NaN', 'null', 'none', '?', '..', '...', 'N/A', '-'))¶
Converter to NaN values.
Converts NaN equivalent values to numpy.nan.
NOTE: This converter is somewhat different from the others. While others return numpy.nan when the conversion is not possible, this converter returns numpy.nan only when the conversion is possible, and the unchanged value otherwise.
- class datastories.data.FallbackConverter(nan_detector=None, converters=None)¶
This converter has a list of converters that it tries in order until one is successful. It also keeps track of how many conversions each converter performed successfully.
First the converter checks whether the value is a NaN value; if so, the value is ignored. Otherwise, the converters are used in order to try and convert the value, stopping from the moment the first conversion is successful. If none of the conversions is successful, a datastories.api.errors.ConversionError exception is raised.
Args:
- nan_detector (obj=NanConverter()):
an object of type datastories.data.BaseConverter used to detect whether a value represents a NaN or not.
- converters (list=[FloatConverter(), StringConverter()]):
a list of datastories.data.BaseConverter objects that will be tried in order with each attempted conversion.
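The two-step behavior (NaN detection first, then the converter chain) can be sketched with plain callables. The function names and the use of float/str as converters are illustrative stand-ins for the converter classes above:

```python
import math

class ConversionError(Exception):
    """Stand-in for datastories.api.errors.ConversionError."""

def detect_nan(value, nan_values=("", "NA", "NaN", "null", "?")):
    # Mirrors NanConverter: map NaN equivalents to NaN, pass every
    # other value through unchanged.
    return math.nan if value in nan_values else value

def fallback_convert(value, converters=(float, str)):
    checked = detect_nan(value)
    if isinstance(checked, float) and math.isnan(checked):
        return math.nan
    for convert in converters:
        try:
            return convert(value)
        except (TypeError, ValueError):
            continue
    raise ConversionError(repr(value))
```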
Data Frame Preparation¶
- datastories.data.prepare_data_frame(data_frame, sample_size=None, report_conversion=False, include_types=False)¶
Prepares a pandas.DataFrame object compatible with the DataStories clean-up and type conversion rules.
Pandas DataFrame objects obtained from external sources are often inconsistent and need to be cleaned up in order to make them usable for analysis. The clean-up process transforms the data frame, for example by enforcing type conversions and discarding non-usable values. DataStories analyses perform the clean-up operation automatically. However, there may be scenarios when a data clean-up is required before running it through a DataStories analysis (e.g., a custom feature-engineering stage). This function can be used to obtain a Pandas DataFrame object that is cleaned up according to the DataStories rules and conventions.
Args:
- data_frame (obj):
the data frame object to convert (either a pandas.DataFrame or a datastories.data.DataFrame object).
- sample_size (int|str=None):
the sample size to use for inferring column data types (either an absolute integer value or a percentage - e.g., '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.
- report_conversion (bool=False):
True to print a report of how the columns got converted, False otherwise.
- include_types (bool=False):
True to also return the inferred column types, False otherwise.
Returns:
If include_types is False, the constructed pandas.DataFrame object. Otherwise, a tuple with the above data frame, together with a list of inferred column types.
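The sample_size convention (absolute integer, percentage string, or the min(100, 10%) default) can be expressed as a small helper; resolve_sample_size is a hypothetical name, not an SDK function:

```python
def resolve_sample_size(sample_size, n_points):
    """Resolve a sample-size argument as described above: None means
    min(100, 10% of the points); '10%'-style strings are relative;
    anything else is an absolute count."""
    if sample_size is None:
        return min(100, n_points // 10)
    if isinstance(sample_size, str) and sample_size.endswith("%"):
        return n_points * int(sample_size[:-1]) // 100
    return int(sample_size)
```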
- datastories.data.data_to_file(data_frame, file_path)¶
Save a DataFrame to a file.
Args:
- data_frame (obj):
the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).
- file_path (str):
path to the saved file.
- datastories.data.file_to_pandas(file_path)¶
Load a saved datastories.data.DataFrame object into a pandas.DataFrame object.
Args:
- file_path (str):
path to the file to be loaded.
Summary Calculation¶
- datastories.data.compute_summary(data_frame, sample_size=None)¶
Compute a data summary on a provided data frame.
Args:
- data_frame (obj):
the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).
- sample_size (int|str=None):
the sample size to use for inferring data types (either an absolute integer value or a percentage - e.g., '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.
Returns:
An object of type datastories.data.DataSummaryResult wrapping up the summary report.
Example:
from datastories.data import compute_summary
import pandas as pd

df = pd.read_csv('example.csv')
summary = compute_summary(df)
- class datastories.data.DataSummaryResult(stats)¶
Encapsulates the result of the datastories.data.compute_summary() analysis.
Note: Objects of this class should not be manually constructed.
- static load(file_path)¶
Load a previously saved summary from a JSON file.
Args:
- file_path (str):
the path to the file to be loaded.
Returns:
An object of type datastories.data.DataSummaryResult encapsulating data summary information.
- property metrics¶
The set of metrics included in the data summary.
NOTE: This is an alias for the .stats property.
Returns:
an object of type datastories.data.TableStatistics wrapping up summary statistics.
- save(file_path)¶
Save the summary to a JSON file.
Args:
- file_path (str):
the path to the exported summary file.
- select(cols)¶
Select a set of columns for further reference.
- property selected¶
The list of selected columns.
- property stats¶
The set of statistics included in the data summary.
Returns:
an object of type datastories.data.TableStatistics wrapping up summary statistics.
- to_pandas()¶
Exports the detailed (column-level) data summary to a Pandas DataFrame.
Returns:
The constructed Pandas DataFrame object.
- property visualization¶
The data health visualization.
- class datastories.data.TableStatistics(name=None, rows=None, columns=None, n=None, n_missing=None, p_missing=None, health=None, health_score=0, df=None, converters=None, n_rows=None, n_columns=None, version=None)¶
Statistics and data health reports for a given data frame.
Note: Objects of this class should not be manually constructed.
Attributes:
- n_rows (int):
number of rows.
- n_columns (int):
number of columns.
- n (int):
number of values.
- n_missing (int):
number of missing values.
- p_missing (float):
percentage of missing values.
- health_score (float):
health score: 0 (good) - 100 (bad).
- health (float):
general health value for the data frame (unusable:0, fixable:0.5, great:1).
- columns (list):
list of objects of type datastories.data.ColumnStatistics encapsulating detailed column-level statistics.
- calc_stats(missing_thr=(50, 90), balance_thr=(50, 90), outlier_thr=(50, 90), table_thr=(50, 90), rows_thr=30)¶
Compute the statistics for the data frame and set the corresponding attributes.
Args:
- missing_thr (tuple=(50, 90)):
thresholds for deciding the missing values health category (Poor, Reasonable, Good)
- balance_thr (tuple=(50, 90)):
thresholds for deciding the data distribution health category (Poor, Reasonable, Good)
- outlier_thr (tuple=(50, 90)):
thresholds for deciding outlier health category (Poor, Reasonable, Good)
- table_thr (tuple=(50, 90)):
thresholds for deciding overall data health category (Poor, Reasonable, Good)
- rows_thr (int=30):
threshold for the minimum number of required rows. Below this value the data is considered unusable.
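The (50, 90) threshold pairs map a 0 (good) to 100 (bad) score onto the Poor/Reasonable/Good categories. One plausible reading of that mapping is sketched below; the helper and the exact boundary handling are assumptions, not the SDK's internal logic:

```python
def health_category(score, thresholds=(50, 90)):
    """Map a 0 (good) - 100 (bad) score to a category using a
    (low, high) threshold pair; boundary handling is an assumption."""
    low, high = thresholds
    if score >= high:
        return "Poor"
    if score >= low:
        return "Reasonable"
    return "Good"
```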
- class datastories.data.ColumnStatistics(col=None, id=None, converter=None, label=None, column_type=None, element_type=None, n=None, n_valid=None, n_missing=None, p_missing=None, n_unique=None, min=None, max=None, mean=None, median=None, most_freq=None, first_quartile=None, third_quartile=None, histo_labels=None, histo_counts=None, balance_score=None, balance_health=None, missing_health=None, left_outlier_score=None, right_outlier_score=None, outlier_score=None, left_outlier_health=None, right_outlier_health=None, outlier_health=None, health=None, missing_thr=None, balance_thr=None, outlier_thr=None, bincount=10, n_outliers=None, outlier_n=None, outlier_perc=None, outlier_grade=None)¶
Statistics and data health reports for a given column in a data frame.
Note: Objects of this class should not be manually constructed.
Attributes:
- n_rows (int):
number of rows.
- id (int):
the index of the column.
- label (str):
the label (header values) of the column.
- n (int):
the length of the column.
- n_valid (int):
the number of correctly parsed data items.
- n_missing (int):
the number of unreadable data items.
- p_missing (float):
percent of unreadable data items.
- column_type (str):
type of the column (ordinal, interval, binary, …).
- element_type (str):
type of individual data items (float, string, …).
- n_unique (int):
number of unique values.
- min (float):
minimum value.
- max (float):
maximum value.
- mean (float):
mean value.
- median (float):
median value.
- first_quartile (float):
first quartile (data point under which 25% of data is situated).
- third_quartile (float):
third quartile (data point under which 75% of data is situated).
- histo_labels (list):
labels for the histogram bins.
- histo_counts (list):
counts for the histogram bins.
- balance_score (float):
score for the data balance quality, 0 (good) - 100 (bad).
- balance_health (float):
health value in terms of balance (unusable:0, fixable:0.5, great:1).
- missing_health (float):
health value in terms of the number of missing items (unusable:0, fixable:0.5, great:1).
- left_outlier_score (float):
metric for outlier impact on the left (i.e., small) side of the data range. Scale: 0 (no outliers detected) - 100 (bad).
- right_outlier_score (float):
metric for outlier impact on the right (i.e., big) side of the data range. Scale: 0 (no outliers detected) - 100 (bad).
- outlier_score (float):
metric for the general outlier impact of the data. Scale: 0 (no outlier impact whatsoever) - 100 (bad).
- left_outlier_health (float):
health value for left outlier impact (unusable:0, fixable:0.5, great:1).
- right_outlier_health (float):
health value for right outlier impact (unusable:0, fixable:0.5, great:1).
- outlier_health (float):
health value for outlier impact (unusable:0, fixable:0.5, great:1).
- health (float):
general health value for this column (unusable:0, fixable:0.5, great:1).
- n_outliers (int):
number of outliers.
- outlier_n (int):
number of outliers.
- outlier_perc (float):
percentage of outlier values.
- outlier_grade (int):
0: bad, 1: good.
- calc_stats(missing_thr=(50, 90), balance_thr=(50, 90), outlier_thr=(50, 90))¶
Compute the statistics for the column and set the corresponding attributes.
Args:
- missing_thr (tuple=(50, 90)):
thresholds for deciding the missing values health category (Poor, Reasonable, Good)
- balance_thr (tuple=(50, 90)):
thresholds for deciding the data distribution health category (Poor, Reasonable, Good)
- outlier_thr (tuple=(50, 90)):
thresholds for deciding outlier health category (Poor, Reasonable, Good)
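For illustration, a threshold pair such as (50, 90) could partition a 0-100 score into the three health categories as in the following sketch. The cut-off direction is an assumption (scores run 0 = good to 100 = bad per the attribute descriptions above), not a documented rule:

```python
# Illustrative sketch only: map a 0 (good) - 100 (bad) score onto the three
# health categories using a (Poor, Reasonable, Good) threshold pair.
# The cut-off direction is an assumption, not the library's documented rule.
def health_category(score, thresholds=(50, 90)):
    good_max, reasonable_max = thresholds
    if score <= good_max:
        return "Good"
    if score <= reasonable_max:
        return "Reasonable"
    return "Poor"

print(health_category(30))   # Good
print(health_category(75))   # Reasonable
print(health_category(95))   # Poor
```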
Outlier Detection¶
- datastories.data.compute_outliers(input, ref=None, strictness=0.25, outlier_vote_threshold=None, far_outlier_vote_threshold=None)¶
Identifies numeric outliers in a 1D or 2D space. This function can be used either with the strictness argument only (i.e., by leaving the last two parameters at their defaults, in which case they are computed as a function of the strictness) or manually, by setting the last two parameters, in which case the strictness is ignored.
Args:
- input (list|obj|ndarray):
numeric input vector can be either a list, a
pandas.Series
object or a numpy numeric array;
- ref (list|obj|ndarray=None):
abscissa vector for the 2D case. Can be either a list, a
pandas.Series
object or a numpy numeric array;
- strictness (double=0.25):
determines how strictly the algorithm selects outliers - higher values yield fewer outliers. Value must be in the range [0, 1].
- outlier_vote_threshold (double=None):
determines when a point is considered an outlier - higher values yield fewer outliers. Value must be in the range [0, 100]. When left unspecified, it is set to
100 * strictness
.
- far_outlier_vote_threshold (double=None):
determines when a point is considered a far outlier - higher values yield fewer outliers. Must be larger than [outlier_vote_threshold]. Value must be in the range [0, 100]. Defaults to
outlier_vote_threshold + 50
.
- Returns:
An object of type
datastories.data.OutlierResult
wrapping up the computed outliers.
- Example:
from datastories.data import compute_outliers
import pandas as pd
df = pd.read_csv('example.csv')
outliers = compute_outliers(df['my_column'])
print(outliers)
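The documented derivation of the two vote thresholds from strictness can be expressed as a small helper (this mirrors only the documented defaults, not the detection algorithm itself):

```python
# Mirrors the documented defaults: outlier_vote_threshold = 100 * strictness,
# far_outlier_vote_threshold = outlier_vote_threshold + 50.
def derive_vote_thresholds(strictness=0.25, outlier_vote_threshold=None,
                           far_outlier_vote_threshold=None):
    if outlier_vote_threshold is None:
        outlier_vote_threshold = 100 * strictness
    if far_outlier_vote_threshold is None:
        far_outlier_vote_threshold = outlier_vote_threshold + 50
    return outlier_vote_threshold, far_outlier_vote_threshold

print(derive_vote_thresholds())       # (25.0, 75.0)
print(derive_vote_thresholds(0.5))    # (50.0, 100.0)
```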
- class datastories.data.OutlierResult(input, outliers)¶
Encapsulates the result of the
datastories.data.compute_outliers()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
Attributes:
- valid (bool):
a flag indicating whether the result is valid.
- as_index(self, outlier_types=None)¶
Returns a numpy index vector that can be used to select and retrieve outlier values. The index can be applied to numpy arrays or
pandas.Series
objects.
Args:
- outlier_types (list):
list of
datastories.api.OutlierType
values to specify which outliers to retrieve. By default, all outliers are included (i.e., outlier_types = [OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW])
- as_itemgetter(self, outlier_types=None)¶
Returns an
operator.itemgetter
object that can be used to select and retrieve outlier values from a list.
Args:
- outlier_types (list):
list of
datastories.api.OutlierType
values to specify which outliers to retrieve. By default, all outliers are included (i.e., outlier_types = [OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW])
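For readers unfamiliar with operator.itemgetter, here is a standard-library parallel of what such a getter does once built (the data and outlier positions below are made up for illustration):

```python
from operator import itemgetter

# A getter built from outlier positions pulls those values out of a plain list.
data = [1.0, 1.2, 50.0, 0.9, -40.0]
outlier_positions = [2, 4]            # hypothetical positions of detected outliers
getter = itemgetter(*outlier_positions)
print(getter(data))                   # (50.0, -40.0)
```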
- clip_to_iqr(self, low_threshold=0.05, high_threshold=0.95)¶
Marks as outliers values that fall outside a specific inter-quartile range. This operation can be undone via the
reset
method.
Args:
- low_threshold (float=0.05):
the lower bound of the inter-quartile range. Should be in the interval [0,1].
- high_threshold (float=0.95):
the higher bound of the inter-quartile range. Should be in the interval [0,1].
- Raises:
ValueError
:when the input arguments are not valid.
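The idea behind clipping to a quantile band can be sketched with the standard library. This is a conceptual illustration only; the library's exact quantile method and marking rules may differ:

```python
import statistics

# Mark values outside the 5th-95th percentile band as outliers
# (the same bounds as clip_to_iqr's defaults of 0.05 and 0.95).
data = [5, 5, 5, 5, 5, 5, 5, 5, 5, 100]
cuts = statistics.quantiles(data, n=20, method='inclusive')  # 5% steps
low, high = cuts[0], cuts[-1]          # 5th and 95th percentiles
outliers = [x for x in data if x < low or x > high]
print(outliers)                        # [100]
```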
- property metrics¶
A dictionary containing outlier detection metrics. The following metrics are retrieved:
- Outliers:
total number of outliers.
- Outliers Low:
number of lower outliers.
- Outliers High:
number of higher outliers.
- Close Outliers:
number of close outliers.
- Close Outliers Low:
number of lower close outliers.
- Close Outliers High:
number of higher close outliers.
- Far Outliers:
number of far outliers.
- Far Outliers Low:
number of lower far outliers.
- Far Outliers High:
number of higher far outliers.
- NaN:
number of NaN values.
- Normal:
number of values that are neither outliers nor NaN.
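The counters above plausibly decompose over low/high and close/far splits; the following sketch shows those relationships. The decomposition itself is an assumption and the values are made up; only the key names come from the list above:

```python
# Made-up values; only the key names are taken from the metrics list above.
metrics = {
    'Outliers Low': 2, 'Outliers High': 3,
    'Close Outliers Low': 1, 'Close Outliers High': 2,
    'Far Outliers Low': 1, 'Far Outliers High': 1,
    'NaN': 1, 'Normal': 94,
}
# Assumed relationships: totals are the sums of their low/high components.
metrics['Outliers'] = metrics['Outliers Low'] + metrics['Outliers High']
metrics['Close Outliers'] = metrics['Close Outliers Low'] + metrics['Close Outliers High']
metrics['Far Outliers'] = metrics['Far Outliers Low'] + metrics['Far Outliers High']
print(metrics['Outliers'])  # 5
```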
- reset(self)¶
Resets outliers to original values, as computed by the
datastories.data.compute_outliers()
analysis.
- to_csv(self, file_path, content='metrics', delimiter=',', decimal='.')¶
Exports a list of detected outliers or metrics to a
CSV
file.
Args:
- file_path (str):
path to the output file.
- content (str=’metrics’):
the type of content to export. Possible values:
- 'metrics': exports outlier detection metrics.
- 'outliers': exports point-wise outlier classification.
- delimiter (str=’,’):
character used as value delimiter.
- decimal (str=’.’):
character used as decimal point.
- Raises:
ValueError
:when an invalid value is provided for the [content] argument.
- to_excel(self, file_path)¶
Exports the list of detected outliers and metrics to an
Excel
file.
Args:
- file_path (str):
path to the output file.
- to_pandas(self, content='metrics')¶
Exports a list of detected outliers or metrics to a
pandas.Series
object.
Args:
- content (str=’metrics’):
the type of content to export. Possible values:
- 'metrics': exports outlier detection metrics.
- 'outliers': exports point-wise outlier classification.
- Returns:
The constructed
pandas.Series
object.
- Raises:
ValueError
:when an invalid value is provided for the [content] argument.
- update(self, updates)¶
Updates the list of detected outliers with manual corrections.
- property updated¶
A list of manual corrections applied to the detected outliers.
- property visualization¶
The outliers visualization.
Classification¶
The datastories.classification
package contains a collection
of classes and functions to facilitate classification analysis.
Feature Ranking¶
- datastories.classification.rank_features(data_set, kpi, metric=FeatureRankingMetric.ACCURACY) → FeatureRankResult¶
Computes the relative importance of columns in a data frame for predicting a binary KPI.
The scoring is based on maximizing the prediction accuracy with respect to the KPI while iteratively splitting the data frame rows.
Args:
- data_set (obj):
the input data frame (either a
pandas.DataFrame
or a
datastories.data.DataFrame
object).
- kpi (int|str):
the index or the name of the KPI column.
- metric (enum = FeatureRankingMetric.ACCURACY):
an object of type
datastories.classification.FeatureRankingMetric
specifying the metric type used to rank the features.
Returns:
An object of type
datastories.classification.FeatureRankResult
wrapping up the computed scores.
Raises:
TypeError
:if data_set is not a
DataFrame
or a Pandas
DataFrame
object.
ValueError
:if kpi is not a valid column name or index value (e.g., out-of-range index).
Example:
from datastories.classification import rank_features
import pandas as pd
df = pd.read_csv('example.csv')
kpi_column_index = 1
ranks = rank_features(df, kpi_column_index)
print(ranks)
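The accuracy-maximizing split idea behind the ranking can be sketched in plain Python. This is illustrative only; the library's iterative splitting algorithm is more involved:

```python
# Score one variable by the accuracy of its best single threshold split
# against a binary KPI (either side of the split may predict class 1).
def best_split_accuracy(x, y):
    n = len(y)
    best = 0.0
    for threshold in sorted(set(x)):
        # Count how well the rule "x <= threshold means y == 1" matches...
        matches = sum((xi <= threshold) == (yi == 1) for xi, yi in zip(x, y))
        # ...and keep the better of the rule and its inverse.
        best = max(best, matches / n, (n - matches) / n)
    return best

x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
print(best_split_accuracy(x, y))   # 1.0 - a perfect split exists between 3 and 4
```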
- class datastories.classification.FeatureRankingMetric(value)¶
Metric to use for ranking the features.
- ACCURACY = 0¶
- class datastories.classification.FeatureRankResult(title='', subtitle='')¶
Encapsulates the result of the
datastories.classification.rank_features()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
- feature_ranks¶
The feature ranks computed by the
datastories.classification.rank_features()
analysis.
Returns:
A list of
datastories.classification.RankingSplit
objects.
- select(self, cols)¶
Selects a number of column names as features.
- selected¶
The list of column names currently selected as features.
- to_excel(self, file_path)¶
Exports the list of ranking scores to an
Excel
file.
Args:
- file_path (str):
path to the output file.
- to_pandas(self, ranking_column='Score', min_threshold=0.0)¶
Exports the list of ranking scores to a Pandas
DataFrame
object.
Args:
- ranking_column (str=’Score’):
the column used to compute the rank and order the data frame. This can be useful to discover interesting variables that are penalised because they have many missing values.
- min_threshold (float):
a cutoff threshold for the minimum score that a variable should have in order to be exported.
Returns:
The constructed Pandas
DataFrame
object.
- visualization¶
The feature ranks visualization.
- class datastories.classification.RankingSplit¶
Encapsulates information about a split.
Note: Objects of this class should not be manually constructed.
Attributes:
- column_name (str):
name of the variable (i.e., column) used in split.
- column_index (int):
index of the variable used in split.
- score (float):
relative importance score with respect to the KPI.
- left_value (float):
the variable value that was used for the split.
- right_value (float):
the next higher variable value in the dataset.
- split_value (float):
the variable value that was used for the split.
- equal_type_split (bool):
indicates whether the split value equals either left_value or right_value.
- extra_scores (dict):
dictionary containing additional metrics (e.g., accuracy).
Correlation¶
The datastories.correlation
package contains a collection
of classes and functions to facilitate correlation analysis.
- datastories.correlation.compute_correlations(data, column_list, kpis, max_vars=200, outlier_elimination=False, optimize=False)¶
Find the most relevant correlations between the columns of a data set.
A number of correlation metrics are computed (currently linear and mutual information) for a subset of the most relevant input variables with respect to a set of KPIs, as well as between the KPIs themselves.
The subset of relevant input variables is computed based on prototyping and limited to a maximum number as specified (i.e., max_vars = 200).
Args:
- data (obj):
the input data frame (either a
pandas.DataFrame
or a
datastories.data.DataFrame
object);
- column_list (list):
the list of input variable identifiers (indices or names)
- kpis (list):
the list of KPI column identifiers (indices or names)
- max_vars (int):
the maximum number of variables to be included in the result.
- outlier_elimination (bool=False):
set to True in order to exclude far outliers from columns before computing correlations;
- optimize (bool=False):
set to True in order to improve correlation metrics by using transformed versions of the input (e.g., scaled columns).
Returns:
A JSON formatted string encapsulating the computed correlation metrics, compatible with the DataStories CorrelationBrowser visualization.
- class datastories.correlation.CorrelationResult(json_content, column_names=None)¶
Encapsulates the result of the
datastories.correlation.compute_correlations()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
- column(col)¶
Retrieve the correlation measurements associated with a given column.
Args:
- col (str|int):
the identifier of the column (name or index).
Returns:
A dictionary containing correlation measurements with respect to other columns in the data frame, in case these have been included in the top correlations selection.
- property columns¶
The list of column names.
- static load(file_path, column_names=None)¶
Load the result from a JSON file.
Args:
- file_path (str):
location of the input file
- column_names (list[str]=None):
list of column names in the original data frame. If not provided, one cannot access the correlations via the original data frame column indexes. Instead one must use column names.
Returns:
An object of type
datastories.correlation.CorrelationResult
- save(file_path)¶
Save the result to a JSON file.
Args:
- file_path (str):
location of the output file
NOTE: This operation loses the data frame context information. The original column names and their indices will not be available when loading the result from this file, unless the context is provided by the user. If no context is provided, one can still use the result but the correlations cannot be accessed via the original data frame column indexes. Instead one can use the column names.
- to_excel(file_path)¶
Export the list of correlations to an
Excel
file.Args:
- file_path (str):
name of the file to export to.
- to_json(html_safe=False)¶
Save the result as a JSON string.
Args:
- html_safe (bool=False):
Set to True in order to produce a JSON string that is safe to embed in an HTML page as an attribute value.
Returns:
A JSON string containing the analysis results.
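"Safe to embed in HTML" typically means escaping characters that could break out of an attribute value or open a tag; here is a conceptual illustration using JSON unicode escapes. The library's exact escaping rules are not documented here:

```python
import json

# Escape characters that could terminate an attribute value or open a tag.
payload = json.dumps({'note': '</script> & "quoted"'})
html_safe = (payload.replace('&', '\\u0026')
                    .replace('<', '\\u003c')
                    .replace('>', '\\u003e'))
print('<' in html_safe, '>' in html_safe)   # False False
```

Because the replacements are valid JSON unicode escapes, the escaped string still parses back to the original content.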
- to_pandas()¶
Export the list of correlations to a Pandas
DataFrame
object.
NOTE: Every pair of correlated columns is included twice in the results, such that each of the columns in the pair appears as a main column.
Returns:
The constructed Pandas
DataFrame
object.
- property visualization¶
The correlations visualization.
Prototype Detection¶
- datastories.correlation.compute_prototypes(data_set, kpi, inputs: list = None, prototype_threshold: float = 0.85, fast_approximation: bool = True, missing_value_threshold: float = 0.5, use_linear_correlation: bool = False, inputs_only: bool = False) → PrototypeResult¶
Compute a set of mutually uncorrelated variables from a data frame.
Correlation estimation is by default based on the
Mutual Information Content
measure, and can be overridden to the
Linear Correlation
when required.
Each variable in the set has the following properties:
it is not significantly correlated to any other variable in the set;
it can be highly correlated to other variables that are not included in the set;
it has a higher KPI correlation score than all the other variables that are highly correlated to it.
Each variable that is not included in the set has the property that it is highly correlated to a variable in the set.
Args:
- data_set (obj):
the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
- kpi (list|int|str):
single value or a list containing the index or the name of the KPI column(s).
- inputs (list=None):
list of column IDs to include in the analysis. When not specified, all columns in the provided dataset are included.
- prototype_threshold (float = 0.85):
correlation threshold for features to be considered proxies.
- fast_approximation (bool = True):
approximate the mutual information; this provides a significant speedup with little loss of precision.
- missing_value_threshold (float = 0.5):
missing values threshold for excluding features from prototypes.
- use_linear_correlation (bool = False):
use linear correlation instead of the mutual information for correlation estimation.
- inputs_only (bool = False):
extract prototypes only for inputs (i.e., exclude KPIs). The KPIs are used only to determine the order in which the prototypes are presented. That is, the order of prototypes in the result is given by their maximum correlation with a KPI.
Returns:
An object of type
datastories.correlation.PrototypeResult
wrapping up the computed prototypes.
Raises:
TypeError
:if [data_set] is not a
DataFrame
or a Pandas
DataFrame
object.
ValueError
:if [kpi] is not a valid column name or index value (e.g., out-of-range index).
Example:
from datastories.correlation import compute_prototypes
import pandas as pd
df = pd.read_csv('example.csv')
kpi_column_index = 1
prototypes = compute_prototypes(df, kpi_column_index)
print(prototypes)
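The selection properties listed above amount to a greedy filter over correlations; the following plain-Python sketch illustrates the idea. It is illustrative only: the library defaults to mutual information and uses more refined logic, and the correlation values below are made up:

```python
# Conceptual sketch of greedy prototype selection.
def pick_prototypes(kpi_corr, pairwise_corr, threshold=0.85):
    """kpi_corr: {var: |corr with KPI|}; pairwise_corr: {(a, b): |corr|}."""
    def corr(a, b):
        return pairwise_corr.get((a, b), pairwise_corr.get((b, a), 0.0))
    prototypes = []
    # Visit variables in decreasing order of KPI correlation.
    for var in sorted(kpi_corr, key=kpi_corr.get, reverse=True):
        if all(corr(var, p) < threshold for p in prototypes):
            prototypes.append(var)   # not a proxy of any chosen prototype
    return prototypes

kpi_corr = {'a': 0.9, 'b': 0.8, 'c': 0.4}
pairwise = {('a', 'b'): 0.95, ('a', 'c'): 0.1, ('b', 'c'): 0.2}
print(pick_prototypes(kpi_corr, pairwise))   # ['a', 'c'] - b is a proxy of a
```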
- class datastories.correlation.PrototypeResult(prototype_list)¶
Encapsulates the result of the
datastories.correlation.compute_prototypes()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
- classmethod load(cls, file_path)¶
Load the analysis result from a JSON file.
Args:
- file_path (str):
Path to the file to be loaded.
- prototypes¶
The list of column names currently selected as prototypes.
- save(self, file_path)¶
Save the analysis result to a JSON file.
Args:
- file_path (str):
location of the output file.
- select(self, cols)¶
Select a number of column names as prototypes.
- selected¶
The list of column names currently selected as prototypes.
- to_excel(self, file_path)¶
Export the list of prototypes to an
Excel
file.
Args:
- file_path (str):
path to the output file.
- to_pandas(self)¶
Export the list of prototypes to a Pandas
DataFrame
object.
Returns:
The constructed Pandas
DataFrame
object.
- visualization¶
The prototypes visualization.
- class datastories.correlation.Prototype(info, proxy_list)¶
Encapsulates prototype information data.
Note: Objects of this class should not be manually constructed.
Attributes:
- info (obj):
an object of type
datastories.correlation.CorrelationInfo
describing the correlation of the prototype with respect to a KPI.
- proxy_list (list):
a list of
datastories.correlation.CorrelationInfo
objects corresponding to highly correlated variables with respect to the prototype.
- class datastories.correlation.CorrelationInfo(col_index, col_name, kpi_index, kpi_name, correlation)¶
Encapsulates correlation information for a variable with respect to a reference.
Note: Objects of this class should not be manually constructed.
Attributes:
- col_index (int):
the index of the variable in the input data frame.
- col_name (str):
the name of the variable.
- correlation (float):
the correlation score with respect to the reference.
Model¶
The datastories.model
package contains a collection
of classes that encapsulate data models (e.g., prediction
models computed by regression or classification analysis).
Base Classes¶
- class datastories.model.Model¶
Encapsulates an RSX based DataStories model.
- inputs¶
The list of input model variable names.
- outputs¶
The list of output model variable names.
- plot(self, *args, **kwargs)¶
Display a graphical representation of the prediction model.
Accepts the same parameters as the constructor for
datastories.visualization.WhatIfsSettings
- predict(self, data_frame, as_pandas=None, prepare_data=True)¶
Evaluate the model on an input data frame.
Args:
- data_frame (obj):
the input data frame (either a
pandas.DataFrame
or a
datastories.data.DataFrame
object).
- as_pandas (bool=None):
Flag to indicate whether prediction results should be returned as a
Pandas
data frame. By default results are returned in the same format as the input data frame.
- prepare_data (bool=True):
Set to
True
in order to prepare the provided Pandas data frames according to the DataStories type conversion rules. When the provided data frame is a
datastories.data.DataFrame
object, this argument is discarded.
Returns:
An object of type
datastories.core.model.PredictionResult
wrapping up the computed prediction.
- save(self, file_path=None)¶
Serialize the model to a file or a bytes object.
Args:
- file_path (str=None):
Name of the output file. If omitted, the model is serialized to a bytes object and returned by the function.
Returns:
A bytes object containing the model when the [file_path] argument is omitted or set to
None
.
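The "write to file, or return bytes when no path is given" behavior described above follows a common pattern; here is a minimal sketch of that pattern (illustrative; not the library's implementation):

```python
import os
import tempfile

# Return the payload when no path is given; otherwise write it to disk.
def save(payload: bytes, file_path=None):
    if file_path is None:
        return payload
    with open(file_path, 'wb') as f:
        f.write(payload)
    return None

print(save(b'model-bytes'))                 # b'model-bytes'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.rsx')
    save(b'model-bytes', path)
    print(os.path.getsize(path))            # 11
```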
- to_cpp(self, file_path)¶
Export the model to a C++ file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- to_excel(self, file_path)¶
Export the model to an Excel file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- to_matlab(self, file_path)¶
Export the model to a MATLAB file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- to_py(self, file_path)¶
Export the model to a Python file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- to_r(self, file_path)¶
Export the model to an R file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- variables¶
A dictionary mapping model variables to corresponding information such as variable type and range.
Returns:
A dictionary mapping column names to corresponding objects of type
datastories.model.VariableInfo
.
- class datastories.model.VariableInfo¶
Holds information about a model variable, such as ranges and types.
Note: Objects of this class should not be manually constructed.
- categories¶
The registered categories of the variable (i.e., if the variable is categorical).
- index¶
The index of the variable.
- is_input¶
Checks if the associated variable is an input for the model.
- max¶
The maximum value of the variable.
- min¶
The minimum value of the variable.
- range_type¶
The range type of the variable.
- type¶
The variable type.
Prediction¶
- datastories.model.predict_from_model(data_frame, rsx_model_path)¶
Evaluate an RSX model on an input data frame.
Args:
- data_frame (obj):
the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
- rsx_model_path (str):
path of the RSX model file.
Returns:
An object of type
datastories.model.PredictionResult
wrapping up the computed prediction.
- class datastories.model.PredictionResult(data=None)¶
Encapsulates a model prediction result.
Base classes:
Note: Objects of this class should not be manually constructed.
- property error_plot¶
An interactive visualization of prediction errors.
- NOTE: This is only available when the actual values corresponding to the predicted ones are available in the input dataset.
Returns:
An object of type
datastories.visualization.ErrorPlot
.
- property evaluation_data¶
The data used for evaluation, if available, or None.
- property kpis¶
The list of KPIs included in the prediction.
- static load(metrics=None, predict_vs_actual=None, evaluation_data=None, path=None, as_pandas=True)¶
Load a
datastories.model.PredictionResult
object from a set of files or objects.
The objects take precedence over the files. When a required object is not provided, the corresponding information will be retrieved from the associated file, provided such a file can be identified.
Files have standard names:
metrics.json
predicted_vs_actual.csv
evaluation_data.parquet
Files are specified indirectly, by providing the name of a folder containing the files mentioned above. The folder can also be a ZIP archive; in that case, the files should be available in the root of the archive.
The evaluation data is optional and only needed for reference.
Args:
- metrics (dict=None):
A dictionary containing performance metrics.
- predict_vs_actual (obj=None):
A data frame (pandas or DataStories) containing predicted vs actual data.
- evaluation_data (obj=None):
A data frame (pandas or DataStories) containing evaluation input data.
- path (str=None):
Path to a folder or ZIP archive containing required information if not provided by the other (object) parameters. Files containing this information should have a standard name, as mentioned above.
- as_pandas (bool=True):
Flag to indicate whether the values field should be available as a
pandas.DataFrame
(i.e., True) or a
datastories.data.DataFrame
object (i.e., False).
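The on-disk layout load() expects can be sketched with the standard library. The file names below come from the list above; the metrics and CSV contents are placeholders:

```python
import json
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, 'prediction')
    os.makedirs(folder)
    with open(os.path.join(folder, 'metrics.json'), 'w') as f:
        json.dump({'R2': 0.93}, f)                       # placeholder metrics
    with open(os.path.join(folder, 'predicted_vs_actual.csv'), 'w') as f:
        f.write('predicted,actual\n1.1,1.0\n')           # placeholder values
    # The same content as a ZIP archive: files sit in the root of the archive.
    archive = os.path.join(tmp, 'prediction.zip')
    with zipfile.ZipFile(archive, 'w') as z:
        for name in ('metrics.json', 'predicted_vs_actual.csv'):
            z.write(os.path.join(folder, name), arcname=name)
    names = sorted(zipfile.ZipFile(archive).namelist())
    print(names)   # ['metrics.json', 'predicted_vs_actual.csv']
```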
- property metrics¶
The prediction performance metrics, if available.
NOTE: This is an alias for the .stats property.
- property performance¶
An interactive visualization of prediction performance, depicting predicted against actual values.
- NOTE: This is only available when the actual values corresponding to the predicted ones are available in the input dataset.
Returns:
An object of type
datastories.visualization.PredictedVsActual
.
- property quality¶
An interactive visualization of prediction performance, depicting predicted against actual values.
This is an alias for the
.performance
property.
- NOTE: This is only available when the actual values corresponding to the predicted ones are available in the input dataset.
Returns:
An object of type
datastories.visualization.PredictedVsActual
.
- property record_info_columns¶
Get/set the record info column names.
- save(folder, include_data=False, compress=False)¶
Save the prediction data to a folder or a zip archive.
The metrics, prediction values and (optionally) prediction input data are saved as individual files:
metrics.json
predicted_vs_actual.csv
evaluation_data.parquet
Args:
- folder (str):
The folder where the files should be saved.
- include_data (bool=False):
Flag to indicate whether the evaluation data should be included as well.
- compress (bool=False):
Flag to indicate whether the files should be saved to a compressed ZIP archive instead of a folder.
- property stats¶
The prediction performance statistics, if available.
NOTE: When the actual KPI value is missing from the input data frame, the performance metrics cannot be computed. In that case None is returned.
- to_csv(file_path, delimiter=',', decimal='.', include_evaluation_data=True)¶
Export the result to a
CSV
file.Args:
- file_path (str):
path to the output file.
- delimiter (str=’,’):
character used as value delimiter.
- decimal (str=’.’):
character used as decimal point.
- include_evaluation_data (bool=True):
set to
True
in order to include the evaluation data next to the prediction values.
- to_excel(file_path, tab_name='Predictions', include_evaluation_data=True)¶
Export the result to an
Excel
file.Args:
- file_path (str):
path to the output file.
- tab_name (str=’Predictions’):
name of the Excel tab where to save the result.
- include_evaluation_data (bool=True):
set to
True
in order to include the evaluation data next to the prediction values.
- to_pandas(include_evaluation_data=True)¶
Export the prediction and input values to pandas.
Args:
- include_evaluation_data (bool=True):
set to
True
in order to include the evaluation data next to the prediction values.
- property values¶
The prediction values.
For each record in the input data frame, the following values are provided per KPI:
- actual:
the actual value of the KPI (i.e., if present in the input data frame).
- predicted:
the predicted value of the KPI.
- uncertainty_min:
minimum predicted value corrected for uncertainty.
- uncertainty_max:
maximum predicted value corrected for uncertainty.
- model_based_outlier:
whether the prediction is based on outlier values according to the model (1=True).
NOTE: The result object has the same type as the input provided to the predict method.
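An illustrative record layout for one KPI follows. Only the field names come from the list above; the values and the idea of a flat per-KPI record are placeholders for illustration:

```python
# Hypothetical per-record values for one KPI; only the field names come from
# the documented list above.
record = {
    'actual': 10.2,               # present only when the KPI is in the input
    'predicted': 10.0,
    'uncertainty_min': 9.5,
    'uncertainty_max': 10.5,
    'model_based_outlier': 0,     # 1 = prediction based on outlier inputs
}
print(sorted(record))
```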
- class datastories.model.BasePredictor(base_model)¶
Base class for all models backed by an RSX model.
Base classes:
Offers access to basic functionality:
prediction
optimization
model export to a specific language
Args:
- base_model (obj):
an object of type
datastories.model.Model
encapsulating the base RSX model used for making predictions.
- export(file_path)¶
Export the underlying prediction model to a lightweight RSX file.
This can then be loaded as a
datastories.model.Model
object and used to make predictions on new data.
Args:
- file_path (str=None):
Name of the output file.
- maximize(progress_bar=True, optimizer=None)¶
Compute the input combination that maximizes the predictive model output.
Args:
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
A
datastories.optimization.OptimizationResult
object encapsulating the model variable values that maximize the model outputs.
- abstract property metrics¶
A dictionary containing model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
- minimize(progress_bar=True, optimizer=None)¶
Compute the input combination that minimizes the predictive model output.
Args:
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
A
datastories.optimization.OptimizationResult
object encapsulating the model variable values that minimize the model outputs.
- property model¶
The generic RSX based model used for making predictions.
- optimize(optimization_spec=None, variable_ranges=None, progress_bar=True, optimizer=None)¶
Compute an optimum input/output combination according to an (optional) optimization specification.
Args:
- optimization_spec (obj=OptimizationSpecification()):
A
datastories.optimization.OptimizationSpecification
object encapsulating the optional optimization specification.
- variable_ranges (dict [str,
datastories.optimization.VariableRange
] = {}): An optional dictionary mapping variable names to ranges that are to be used to limit the search for the optimum solution to a given domain.
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
A
datastories.optimization.OptimizationResult
object encapsulating the model variable values that satisfy the optimization specification.
- abstract predict(data_frame)¶
Predict the modeled KPI on a new data frame.
Args:
- data_frame (obj):
the data frame on which the model associated KPIs are to be predicted (either a
pandas.DataFrame
or a
datastories.data.DataFrame
object).
Returns:
An object of type
datastories.model.PredictionResult
encapsulating the prediction results.
Raises:
ValueError
:when not all required columns are provided.
- property stats¶
A dictionary containing model prediction performance metrics.
- to_cpp(file_path)¶
Export the model to a C++ file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- to_excel(file_path)¶
Export the model to an Excel file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- to_matlab(file_path)¶
Export the model to a MATLAB file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- to_py(file_path)¶
Export the model to a Python file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- to_r(file_path)¶
Export the model to an R file.
Args:
- file_path (str):
path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:when there is a problem saving the file.
- class datastories.model.BasePrediction(data)¶
Base class for all prediction classes.
Base classes:
Args:
- data (obj):
The associated prediction input data (either a
pandas.DataFrame
or a datastories.data.DataFrame
object).
- to_pandas()¶
Exports the list of predictions to a
pandas.DataFrame
object.
Returns:
The constructed
pandas.DataFrame
object.
- class datastories.model.MultiKpiPredictor(predictor_info, base_model)¶
Encapsulates multi-KPI prediction models (e.g., as computed using
datastories.story.predict_kpis()
).
Base classes:
Note: Objects of this class should not be manually constructed.
- property error_plot¶
A visualization for assessing model prediction errors.
Returns:
An object of type
datastories.visualization.ErrorPlot
.
- property metrics¶
A dictionary containing multi KPI model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
- predict(data_frame)¶
Predict the model KPIs on a new data frame.
Args:
- data_frame (obj):
the data frame on which the model associated KPIs are to be predicted (either a
pandas.DataFrame
or a datastories.data.DataFrame
object).
Returns:
An object of type
datastories.regression.MultiKpiPredictionResult
encapsulating the prediction results.
Raises:
ValueError
:when not all required columns are provided.
NOTE: If not all drivers are provided, the KPIs that depend on them will not be predicted; however, no exception will be raised.
- property visualization¶
The prediction performance visualization.
- class datastories.model.MultiKpiPredictorInfo(pva, performance_metrics)¶
Data class wrapper for prediction performance metrics.
Note: Objects of this class should not be manually constructed.
- property metrics¶
The prediction performance metrics.
- property predicted_vs_actual¶
The predicted versus actual values underlying the performance metrics.
- class datastories.model.MultiKpiPredictionResult(prediction)¶
Encapsulates the results of a prediction done using a
datastories.model.MultiKpiPredictor
object.
Base classes:
Note: Objects of this class should not be manually constructed.
- property error_plot¶
A visualization for assessing model prediction errors.
Returns:
An object of type
datastories.visualization.ErrorPlot
.
- property metrics¶
A dictionary containing multi KPI prediction performance metrics.
- property values¶
A data frame containing the input augmented with predicted values, confidence estimates, and flags indicating whether each prediction is a model-based outlier.
- property visualization¶
The prediction performance visualization.
Optimization¶
The datastories.optimization
package contains a collection
of classes and functions for optimizing models.
- datastories.optimization.create_optimizer(*args, **kwargs)¶
Factory method for creating optimizers.
Returns:
An object of type
datastories.optimization.pso.Optimizer
that can be used to perform optimization analyses on adatastories.model.Model
object.
Example:
model = Model("my_model.rsx")

spec = OptimizationSpecification()
spec.objectives = [
    Minimize('KPI_1'),
    Maximize('KPI_2')
]
spec.constraints = [
    AtMost('Input_1', 10),
]

optimizer = create_optimizer()
optimization_result = optimizer.optimize(model, optimization_spec=spec)
- class datastories.optimization.pso.Optimizer(size_t population_size=500, size_t iterations=250)¶
A model optimizer using the particle swarm strategy for identifying an optimum solution.
Args:
- population_size (int =
500
): the initial size of the swarm population.
- iterations (int =
250
): number of swarm computation iterations before stopping.
- maximize(self, model, variable_ranges=None, progress_bar=True)¶
Run the optimizer with the goal of maximizing the outputs (i.e., KPIs) of a given model.
Args:
- model (
datastories.model.Model
): The input model whose KPIs are to be maximized.
- variable_ranges (dict [str,
datastories.optimization.VariableRange
] = {}): An optional dictionary mapping variable names to ranges used to limit the search for the optimum solution to a given domain.
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
A
datastories.optimization.OptimizationResult
object encapsulating the model variables values that maximize the model outputs.
- minimize(self, model, variable_ranges=None, progress_bar=True)¶
Run the optimizer with the goal of minimizing the outputs (i.e., KPIs) of a given model.
Args:
- model (
datastories.model.Model
): The input model whose KPIs are to be minimized.
- variable_ranges (dict [str,
datastories.optimization.VariableRange
] = {}): An optional dictionary mapping variable names to ranges used to limit the search for the optimum solution to a given domain.
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
A
datastories.optimization.OptimizationResult
object encapsulating the model variables values that minimize the model outputs.
- optimize(self, model, optimization_spec=None, variable_ranges=None, direction=None, progress_bar=True)¶
Optimize an input model according to a given optimization specification.
Args:
- model (
datastories.model.Model
): The input model to be optimized.
- optimization_spec (
datastories.optimization.OptimizationSpecification
): An optional specification for the optimization objectives and constraints. The default value is an empty specification (i.e., OptimizationSpecification()).
- variable_ranges (dict [str,
datastories.optimization.VariableRange
] = {}): An optional dictionary mapping variable names to ranges used to limit the search for the optimum solution to a given domain.
- direction (
datastories.optimization.OptimizationDirection
): The direction of optimization when no specification is provided. Can be one of:
OptimizationDirection.MAXIMIZE
OptimizationDirection.MINIMIZE
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
A
datastories.optimization.OptimizationResult
object encapsulating the model variables values that satisfy the optimization specification.
Raises:
TypeError
:when the provided input parameters do not have the expected types.
- class datastories.optimization.OptimizerType(value)¶
Enumeration for DataStories supported optimizer types.
- PARTICLE_SWARM = 0¶
- class datastories.optimization.OptimizationResult¶
Encapsulates the result of a
datastories.optimizer.Optimizer.optimize()
analysis.
Note: Objects of this class should not be manually constructed.
- is_complete¶
Checks whether the search for the optimum ran to completion (i.e., was not interrupted).
- is_feasible¶
Checks whether the identified optimum position respects the imposed constraints (if any).
- optimum¶
The model variable values for the identified optimum position.
- to_pandas(self)¶
Export the optimum position to a Pandas
DataFrame
object.
Returns:
The constructed Pandas
DataFrame
object.
- class datastories.optimization.OptimizationSpecification(objectives=None, constraints=None)¶
Encapsulates a set of optimization objectives and constraints that can be used to configure an optimization analysis.
Both objectives and constraints are defined using
datastories.optimization.VariableSpec
and (potentially)
datastories.optimization.VariableMapper
objects.
Example:
spec = OptimizationSpecification()
spec.objectives = [
    Minimize('KPI_1', 2),
    InInterval('KPI_2', 1, 100)
]
spec.add_constraint(AtMost(Sum('Input_1','Input_2'), 100))
- add_constraint(self, constraint)¶
Add an optimization constraint to the specification.
- add_objective(self, objective)¶
Add an optimization objective to the specification.
- constraints¶
Get/set the optimization specification constraints.
- objectives¶
Get/set the optimization specification objectives.
- to_dict(self)¶
- class datastories.optimization.OptimizationDirection¶
Enumeration for possible optimization goals when no other optimization specification is provided.
- Possible values:
OptimizationDirection.MAXIMIZE
OptimizationDirection.MINIMIZE
- class datastories.optimization.VariableRange¶
Encapsulates a numeric or categorical value range.
Numeric ranges are defined by an upper and a lower bound. Categorical ranges are currently limited to a single value.
Args:
- min (double=0):
a numeric range lower bound.
- max (double=0):
a numeric range upper bound.
- value (str=’’):
a categorical range value.
- is_categorical¶
Checks whether the variable range is categorical.
- is_numeric¶
Checks whether the variable range is numeric.
- max¶
Get/set the upper bound of a numeric range.
- min¶
Get/set the lower bound of a numeric range.
- to_dict(self)¶
- value¶
Get/set the value of a categorical range.
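For illustration, the documented numeric/categorical semantics can be sketched as a pure-Python analog (this is not the SDK class itself; attribute and property names simply mirror the documentation above):

```python
# Illustrative pure-Python analog of the documented VariableRange semantics.
# The real class lives in datastories.optimization.

class VariableRangeSketch:
    def __init__(self, min=0.0, max=0.0, value=''):
        self.min = min        # numeric lower bound
        self.max = max        # numeric upper bound
        self.value = value    # categorical value (currently limited to one value)

    @property
    def is_categorical(self):
        # A range is treated as categorical when a value string is set.
        return self.value != ''

    @property
    def is_numeric(self):
        return not self.is_categorical

    def to_dict(self):
        if self.is_categorical:
            return {'value': self.value}
        return {'min': self.min, 'max': self.max}

# A numeric range limiting the search domain of a variable to [0, 10]:
numeric = VariableRangeSketch(min=0.0, max=10.0)
# A categorical range pinning a variable to a single category:
categorical = VariableRangeSketch(value='steel')
```

Ranges like these are what `variable_ranges={'Input_1': VariableRange(min=0, max=10)}` passes to the optimizer methods above.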
- class datastories.optimization.VariableMapper¶
Base class for all variable mappers.
Variable mappers are the first argument to be passed when defining optimization objectives and constraints. They indicate to what variable or group of variables the objective/constraint applies.
For simple cases (i.e., one variable), variable mappers can be replaced with the name of the variable itself. However, in more complex scenarios (e.g., a constraint that applies to the aggregated value of a number of variables), mappers have to be explicitly constructed.
- class datastories.optimization.Sum(operands, weights=None)¶
Bases:
VariableMapper
Aggregates a number of variables using a weighted sum. This can be then used to define optimization objectives or constraints.
Args:
- operands (list):
a list of variable names to sum up.
- weights (list=None):
a list of relative weights for aggregating the given variables.
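Conceptually, a Sum mapper aggregates the values of its operand variables as a weighted sum; the arithmetic can be sketched as follows (pure Python, illustrative only, with weights defaulting to 1 when omitted):

```python
def weighted_sum(values, weights=None):
    # Aggregate variable values with relative weights, as a Sum mapper
    # does conceptually when an objective or constraint is evaluated.
    if weights is None:
        weights = [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights))

# Sum(['Input_1', 'Input_2'], weights=[2, 1]) evaluated at
# Input_1 = 3, Input_2 = 4:
total = weighted_sum([3, 4], weights=[2, 1])  # 2*3 + 1*4 = 10
```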
- class datastories.optimization.VariableSpec¶
Base class for all optimization objectives and constraints.
- class datastories.optimization.AtMost(operand, double limit, double weight=1.0)¶
Bases:
VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be lower than a given reference value.
Args:
- operand (obj):
a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective/constraint applies.
- limit (double):
the reference value to compare against.
- weight (double=1):
the relative weight of this objective/constraint among all the specified objectives or constraints.
- class datastories.optimization.AtLeast(operand, double limit, double weight=1.0)¶
Bases:
VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be greater than a given reference value.
Args:
- operand (obj):
a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective/constraint applies.
- limit (double):
the reference value to compare against.
- weight (double=1):
the relative weight of this objective/constraint among all the specified objectives or constraints.
- class datastories.optimization.InInterval(operand, double lower_limit, double upper_limit, double weight=1.0)¶
Bases:
VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be in a given reference interval.
Args:
- operand (obj):
a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective/constraint applies.
- lower_limit (double):
the lower bound of the reference interval.
- upper_limit (double):
the upper bound of the reference interval.
- weight (double=1):
the relative weight of this objective/constraint among all the specified objectives or constraints.
- class datastories.optimization.IsEqual(operand, double value, double weight=1.0)¶
Bases:
VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should be equal to a given reference value.
Note: The optimizer does not support the use of
datastories.optimization.specification.IsEqual
as a constraint, because the underlying algorithm is not optimized to handle constraints of this type. Therefore, trying to force IsEqual-like behavior by combining AtLeast and AtMost to make only a small region feasible is not recommended. The returned result might be in this region, but there is no guarantee that it is close to optimal.
In general, one should try to add the IsEqual condition as an objective with a high weight. This does not guarantee that the condition will be met, but the results are often close enough that a small manual adjustment to one parameter is enough to meet the condition.
A common case is that the sum of some parameters must be equal to a value, for example in formulations where parameters express a fraction of a mixture. In this case, if the previous recommendation does not lead to good solutions, one can try to relax the condition in the following way: have one constraint limiting the sum to the value with AtMost[ Sum[], value ], and have one objective Maximize[ Sum[] ] with a high weight. This is less restrictive towards the algorithm than using IsEqual as an objective, and can lead to better results. Of course, a small manual adjustment might be needed to satisfy the condition exactly.
Args:
- operand (obj):
a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective applies.
- value (double):
the reference value to compare against.
- weight (double=1):
the relative weight of this objective/constraint among all the specified objectives or constraints.
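The relaxation recommended in the note above might be expressed as follows for a mixture whose fractions must sum to 100 (illustrative sketch only; the variable names and the model object are hypothetical, and the snippet requires the DataStories SDK):

```
spec = OptimizationSpecification()

# Constrain the sum of the fractions from above...
spec.add_constraint(AtMost(Sum(['Frac_A', 'Frac_B', 'Frac_C']), 100))

# ...and push it towards that limit with a heavily weighted objective,
# instead of using IsEqual (which is unsupported as a constraint).
spec.add_objective(Maximize(Sum(['Frac_A', 'Frac_B', 'Frac_C']), 10.0))

result = create_optimizer().optimize(model, optimization_spec=spec)
```

A small manual adjustment to one fraction may still be needed to satisfy the sum exactly, as the note explains.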
- class datastories.optimization.Minimize(operand, double weight=1.0)¶
Bases:
VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should have the smallest possible value.
Note: This cannot be used to define optimization constraints.
Args:
- operand (obj):
a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective applies.
- weight (double=1):
the relative weight of this objective/constraint among all the specified objectives or constraints.
- class datastories.optimization.Maximize(operand, double weight=1.0)¶
Bases:
VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should have the largest possible value.
Note: This cannot be used to define optimization constraints.
Args:
- operand (obj):
a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective applies.
- weight (double=1):
the relative weight of this objective/constraint among all the specified objectives or constraints.
Story¶
The datastories.story
package contains a collection
of workflows to automate specific analysis tasks (e.g., building a predictive model).
- datastories.story.load(file_path, *args, **kwargs)¶
Loads a previously saved story.
Args:
- file_path (str):
name of the file containing the story, including extension.
- folder (str):
Optional; must be passed as a keyword argument. If not None, the provided folder is used to store story files.
Returns:
An object wrapping the story.
Raises:
TypeError
:when the story type is not recognized by the SDK.
StoryError
:when the story type cannot be retrieved from the file.
- datastories.story.load_result(file_path, cls=None, *argc, **kwargs)¶
Load a decoupled story analysis result.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.
Args:
- file_path (str):
Path to the result file
- cls (str=None):
Expected type of the result. When not specified, an attempt will be made to infer the type from the file contents. When specified, it has to match the type of the result stored in the file.
Returns:
An object instance of the result stored in the file.
Raises:
ValueError
:When the type of the result could not be inferred or is different from the one specified in the [cls] argument.
NotImplementedError
:When the result type or its specific version is not supported.
- class datastories.story.StoryBase(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, on_snapshot=None, progress_bar=False, **kwargs)¶
Base class for story analyses.
Base classes:
- class ProcessingStage(value)¶
Enumeration of all story processing stages.
Specializations have to extend this with their specific execution stages, while maintaining these base stages as defined below:
UNKNOWN = 0
INIT = 1
- add_note(note)¶
Add an annotation to the story results.
The already present annotations can be retrieved using the
datastories.api.IStory.notes()
property.
Args:
- note (str):
the annotation to be added.
- clear_note(note_id)¶
Remove a specific annotation associated with the story analysis.
Args:
- note_id (int):
the index of the note to be removed.
Raises:
IndexError
:when the note index is unknown.
- clear_notes()¶
Clear the annotations associated with the story analysis.
- classmethod create_story(data_frame, info_fields, **kwargs)¶
Factory method.
This method has to be overridden by specializations in order to enable additional computation when loading a story object.
- info()¶
Display story execution information.
All story execution stages are displayed together with their completion status. The version of the used DataStories SDK and the user notes are also included.
- static is_compatible(current_version_string, ref_version_string)¶
Test whether two story versions are compatible.
The story version compatibility policy is as follows:
stories are forward and backward compatible as long as the major version number is unchanged (i.e., you can open a saved story whose version differs from the version associated with the current SDK, provided the major version number matches).
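Assuming the policy above (compatibility determined by the major version number alone), the check can be sketched as:

```python
def is_compatible(current_version, ref_version):
    # Stories are compatible as long as the major version number matches,
    # regardless of minor/patch differences (the documented policy).
    current_major = int(current_version.split('.')[0])
    ref_major = int(ref_version.split('.')[0])
    return current_major == ref_major

print(is_compatible('3.4.1', '3.9.0'))  # True: both major version 3
print(is_compatible('3.4.1', '4.0.0'))  # False: major version changed
```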
- property is_complete¶
Checks whether all story analysis stages have been executed.
- property is_ok¶
Checks whether the last executed story analysis stage has been successful.
- classmethod load(file_path)¶
Load story from file_path
- classmethod load_from_folder(folder, *args, **kwargs)¶
Load a story instance from a folder
- property metrics¶
Returns a set of metrics computed during analysis.
NOTE: This is an alias for the .stats property
- property notes¶
The list of all annotations currently associated with the story analysis.
- reset()¶
Reset the execution pointer of a story to the first stage.
- run(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)¶
Resumes the execution of a story from a given stage.
The stage to resume from is optional. If not specified, the story is executed from the beginning. If the requested stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed is used as the starting point, unless the [strict] argument is set to True. In that case, an exception is raised if the execution cannot be resumed from the requested stage.
Args:
- resume_from (obj):
The stage to resume execution from. Should be a
datastories.story.predict_kpis.Story.ProcessingStage
value corresponding to a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.
- strict (bool=False):
Raise an error if execution cannot be resumed from the requested stage.
- params (dict={}):
Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.
- progress_bar (obj=None):
An object of type
datastories.display.ProgressReporter
to replace the currently used progress reporter. When not specified, the current story progress reporter is not modified. The use case for this is to set a progress bar after the story is loaded, when one cannot be given to the load function directly (e.g., when the progress bar has to be constructed based on the story).
- check_interrupt (func=None):
an optional callback to check whether analysis execution needs to be interrupted.
Raises:
datastories.api.errors.StoryError
:if a stage is specified for which no intermediate results are available and the [strict] argument is set to True.
- save(file_path, include_data=True)¶
Save the story analysis results.
Use this function to persist the results of the story analysis. One can reload them and continue investigations at a later moment using the
datastories.story.load()
method.
Args:
- file_path (str):
path to the output file.
- include_data (bool=True):
set to True to include a copy of the data in the exported file.
Raises:
datastories.api.errors.StoryError
:when attempting to include data while the story does not contain a data reference. This is the case with stories that have been previously saved without including the data.
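A typical save/reload round trip might look like this (illustrative sketch; the file name and the story variable are hypothetical, and the snippet requires the DataStories SDK):

```
# Persist the analysis, including a copy of the data...
story.save('my_story', include_data=True)

# ...and resume the investigation later:
from datastories.story import load
story = load('my_story')
story.info()
```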
Predict Single KPI¶
- datastories.story.predict_single_kpi(data_frame, column_list, kpi, runs=3, outlier_elimination=True, prototypes='auto', progress_bar=True, threads=0, scale_kpi='auto')¶
Fits a non-linear regression model on a data frame in order to predict one column.
DEPRECATED: This method has been deprecated and will be removed in a future version of the SDK. Use the more generic
datastories.story.predict_kpis()
for analyzing single KPIs as well.
The column to be predicted (i.e., the KPI) is identified either by name or by column index in the data frame.
Args:
- data_frame (obj):
the input data frame (either a
pandas.DataFrame
or a datastories.data.DataFrame
object).
- column_list (list):
the list of variables (i.e., columns) to consider for regression.
- kpi (int|str):
the index or the name of the target (i.e., KPI) column.
- runs (int=3):
the number of training rounds.
- outlier_elimination (bool=True):
set to True in order to exclude far outliers from modeling. Note that no outliers will be eliminated if the dataset has fewer than 30 rows, or the variable has fewer than 20 unique values.
- prototypes (str=’auto’):
indicates whether analysis should be performed on prototypes. Possible values:
'yes'
: use only prototypes as inputs.
'no'
: use all original inputs.
'auto'
: use prototypes if there are more than 200 input variables.
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
- threads (int):
the number of computational threads to use; by default, all available cores are used.
- scale_kpi (str=’auto’):
indicates whether the KPI should be scaled. Possible values:
'yes'
: all runs use the scaled KPI if we detect that scaling could be beneficial.
'no'
: all runs use the original KPI.
'auto'
: about one third of the runs use the scaled KPI if we detect that scaling could be beneficial; the rest of the runs use the original KPI.
Returns:
An object of type
datastories.story.predict_single_kpi.Story
wrapping-up the computed model.
Raises:
ValueError
:when an invalid value is provided for one of the input parameters.
datastories.story.StoryError
:when there is a problem fitting the model.
Example:
from datastories.story import predict_single_kpi
import pandas as pd

df = pd.read_csv('example.csv')
kpi_column_index = 1

story = predict_single_kpi(df, df.columns, kpi_column_index, progress_bar=True)
print(story)
- class datastories.story.predict_single_kpi.Story(platform, kpi_name, user_columns, nrows, folder='', *args, **kwargs)¶
Encapsulates the result of a single KPI non-linear regression model.
Base classes:
DEPRECATED: This class has been deprecated and will be removed in a future version of the SDK. Use the more generic
datastories.story.predict_kpis()
for analyzing single KPIs as well.
Note: Objects of this class should not be manually constructed but rather created using the
datastories.story.predict_single_kpi()
factory method.
- classmethod load(file_path)¶
Loads a story from an existing file.
- classmethod load_from_folder(folder, *args, **kwargs)¶
Loads a story from an existing folder
- property metrics¶
A dictionary containing the model performance metrics and the list of main drivers.
These metrics are computed on the training data for the purpose of evaluating the model prediction performance.
The following metrics are retrieved:
- Training Set Size:
size of the actual data frame used for training (rows x columns).
- Correlation:
actual vs predicted correlation.
- Estimated Correlation:
estimated correlation for future (unseen) values.
- R-squared:
the coefficient of determination.
- MSE:
mean squared error.
- RMSE:
root mean squared error.
- Main Drivers:
list of main features with associated relative importance and energy.
- Features:
list of all features with associated relative importance and energy.
- Computation Effort:
a measure of model complexity.
- Number of Runs:
number of training rounds.
- Best Run:
best performing training round.
- Run Overview:
overview of individual runs including Performance and Feature Importance.
In case the KPI is a binary variable, the following additional metrics are included:
- Positive Label:
the label used to identify positive cases.
- Negative Label:
the label used to identify negative cases.
- True Positives:
number of correctly identified positive cases (TP).
- False Positives:
number of incorrectly identified positive cases (FP).
- True Negatives:
number of correctly identified negative cases (TN).
- False Negatives:
number of incorrectly identified negative cases (FN).
- Not Classified:
number of records that could not be classified (i.e., KPI is NaN).
- True Positive Rate:
TP / (TP + FN) * 100 (a.k.a. sensitivity, recall).
- False Positive Rate:
FP / (FP + TN) * 100 (a.k.a. fall-out).
- True Negative Rate:
TN / (FP + TN) * 100 (a.k.a. specificity).
- False Negative Rate:
FN / (TP + FN) * 100 (a.k.a. miss rate).
- Precision:
percentage of correctly identified cases from the total reported positive cases TP / (TP + FP) * 100.
- Recall:
percentage of correctly identified cases from the total existing positive cases TP / (TP + FN) * 100.
- Accuracy:
percentage of correctly identified cases (TP + TN) / (TP + FP + TN + FN) * 100.
- F1 Score:
the F1 score (the harmonic mean of precision and recall).
- AUC:
area under (ROC) curve.
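The binary-classification metrics above follow the standard confusion-matrix formulas; for example, for a confusion matrix with TP=40, FP=10, TN=45, FN=5:

```python
# Worked example of the documented confusion-matrix formulas
# (illustrative counts, not SDK output).
tp, fp, tn, fn = 40, 10, 45, 5

true_positive_rate = tp / (tp + fn) * 100           # sensitivity / recall
false_positive_rate = fp / (fp + tn) * 100          # fall-out
precision = tp / (tp + fp) * 100
recall = tp / (tp + fn) * 100
accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
f1_score = 2 * precision * recall / (precision + recall)

print(round(precision, 1))   # 80.0
print(round(recall, 1))      # 88.9
print(round(accuracy, 1))    # 85.0
print(round(f1_score, 1))    # 84.2
```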
- property model¶
An object of type
datastories.model.SingleKpiPredictor
that can be used for making predictions on new data.
- property run_overview¶
An overview of feature importance metrics across all runs.
- property runs¶
A list containing the results of individual analysis rounds.
Each entry in the list is an object of type
datastories.story.predict_single_kpi.StoryRun
encapsulating the results associated with a given analysis round.
- save(file_path)¶
Saves the story analysis results.
Use this function to persist the results of the
datastories.story.predict_single_kpi()
analysis. One can reload them and continue investigations at a later moment using thedatastories.story.predict_single_kpi.Story.load()
method.
Args:
- file_path (str):
path to the output file.
- to_csv(file_path, content='metrics', delimiter=',', decimal='.')¶
Exports a list of model metrics to a
CSV
file.
Args:
- file_path (str):
path to the output file.
- content (str=’metrics’):
the type of metrics to export. Possible values:
'metrics'
: exports estimated model performance metrics.
'drivers'
: exports driver importance metrics.
'run_overview'
: exports an overview of feature importance metrics across all runs.
- delimiter (str=’,’):
character to use as value delimiter.
- decimal (str=’.’):
character to use as decimal point.
Raises:
ValueError
:when an invalid value is provided for the [content] argument.
- to_excel(file_path)¶
Exports the list of model metrics to an
Excel
file.
Args:
- file_path (str):
path to the output file.
- to_html(file_path, title='Predict Single KPI', subtitle='', scenario=VisualizationScenario.REPORT)¶
Export the story visualization to a standalone
HTML
document.
Args:
- file_path (str):
path to the output file.
- title (str=’Predict Single KPI’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- to_pandas(content='metrics')¶
Exports a list of model metrics to a
pandas.DataFrame
object.
Args:
- content (str=’metrics’):
the type of metrics to export. Possible values:
'metrics'
: exports estimated model performance metrics.
'drivers'
: exports feature importance metrics for the model.
'run_overview'
: exports an overview of feature importance metrics across all runs.
Returns:
The constructed
pandas.DataFrame
object.
Raises:
ValueError
:when an invalid value is provided for the [content] argument.
- class datastories.story.predict_single_kpi.StoryRun(platform, parent, folder=None, dependencies=None, *args, **kwargs)¶
Encapsulates the result of one analysis round for a single KPI non-linear regression model.
DEPRECATED: This class has been deprecated and will be removed in a future version of the SDK. Use the more generic
datastories.story.predict_kpis()
for analyzing single KPIs as well.
Base classes:
Note: Objects of this class should not be manually constructed.
- property correlation_browser¶
A visualization for assessing feature correlation.
An object of type
datastories.visualization.CorrelationBrowser
that can be used for assessing feature correlation, as discovered while training the model.
- property metrics¶
A dictionary containing the model performance metrics and the list of main drivers.
These metrics are computed on the training data for the purpose of evaluating the model prediction performance.
The following metrics are retrieved:
- Training Set Size:
size of the actual data frame used for training (rows x columns).
- Correlation:
actual vs predicted correlation.
- Estimated Correlation:
estimated correlation for future (unseen) values.
- R-squared:
the coefficient of determination.
- MSE:
mean squared error.
- RMSE:
root mean squared error.
- Main Drivers:
list of main features with associated relative importance and energy.
- Features:
list of all features with associated relative importance and energy.
In case the KPI is a binary variable, the following additional metrics are included:
- Positive Label:
the label used to identify positive cases.
- Negative Label:
the label used to identify negative cases.
- True Positives:
number of correctly identified positive cases (TP).
- False Positives:
number of incorrectly identified positive cases (FP).
- True Negatives:
number of correctly identified negative cases (TN).
- False Negatives:
number of incorrectly identified negative cases (FN).
- Not Classified:
number of records that could not be classified (i.e., KPI is NaN).
- True Positive Rate:
TP / (TP + FN) * 100 (a.k.a. sensitivity, recall).
- False Positive Rate:
FP / (FP + TN) * 100 (a.k.a. fall-out).
- True Negative Rate:
TN / (FP + TN) * 100 (a.k.a. specificity).
- False Negative Rate:
FN / (TP + FN) * 100 (a.k.a. miss rate).
- Precision:
percentage of correctly identified cases from the total reported positive cases TP / (TP + FP) * 100.
- Recall:
percentage of correctly identified cases from the total existing positive cases TP / (TP + FN) * 100.
- Accuracy:
percentage of correctly identified cases (TP + TN) / (TP + FP + TN + FN) * 100.
- F1 Score:
the F1 score (the harmonic mean of precision and recall).
- AUC:
area under (ROC) curve.
- property model¶
An object of type
datastories.model.SingleKpiPredictor
that can be used for making predictions on new data.
- to_csv(file_path, content='metrics', delimiter=',', decimal='.')¶
Export a list of model drivers or metrics to a
CSV
file.
Args:
- file_path (str):
path to the output file.
- content (str=’metrics’):
the type of metrics to export. Possible values:
'metrics': exports estimated model performance metrics.
'drivers': exports driver importance metrics.
- delimiter (str=’,’):
character used as value delimiter.
- decimal (str=’.’):
character used as decimal point.
Raises:
ValueError
:when an invalid value is provided for the [content] argument.
- to_excel(file_path)¶
Exports the list of model drivers and metrics to an
Excel
file.
Args:
- file_path (str):
path to the output file.
- to_html(file_path, title='Predict Single KPI Run', subtitle='', scenario=VisualizationScenario.REPORT)¶
Export the story visualization to a standalone
HTML
document.
Args:
- file_path (str):
path to the output file.
- title (str=’Predict Single KPI Run’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- to_pandas(content='metrics')¶
Export a list of model drivers or metrics to a
pandas.DataFrame
object.
Args:
- content (str=’metrics’):
the type of metrics to export. Possible values:
'metrics': exports estimated model performance metrics.
'drivers': exports driver importance metrics.
Returns:
The constructed
pandas.DataFrame
object.
Raises:
ValueError
:when an invalid value is provided for the [content] argument.
- property what_ifs¶
A visualization for interactive exploration of the models.
The visualization helps to get insight into how driver variables influence the target KPIs. An object of type
datastories.visualization.WhatIfs
that can be used for interactive exploration of the models.
Predict Multiple KPIs¶
- datastories.story.predict_kpis(data, column_list, kpi_list, record_info_list=None, runs=3, outlier_elimination=True, prototypes='auto', prototype_threshold=0.85, optimize=False, progress_bar=True, fail_on_error=False, threads=0, scale_kpi='no')¶
Fit a non-linear regression model on a data frame in order to predict several columns (i.e., KPIs) at the same time.
The columns to be predicted (i.e., the KPIs) are identified either by name or by column index in the data frame.
Args:
- data (obj):
the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object) or a data descriptor (i.e., adatastories.data.DataDescriptor
object).
- column_list (list):
the list of variables (i.e., columns) to consider for regression.
- kpi_list (list):
the list of indexes or names for the target columns (i.e., KPIs).
- record_info_list (list=None):
the list of indexes or names to be used as additional record info.
- runs (int=3):
the number of training rounds.
- outlier_elimination (bool=True):
set to True in order to exclude far outliers from modeling. Note that no outliers will be eliminated if the dataset has fewer than 30 rows or the variable has fewer than 20 unique values.
- prototypes (str=’auto’):
indicates whether the analysis should be performed on prototypes. Possible values:
'yes': use only prototypes as inputs.
'no': use all original inputs.
'auto': use prototypes if there are more than 200 input variables.
- prototype_threshold (float=0.85):
minimum correlation required for a column to be considered a proxy for another.
- optimize (bool=False):
set to True in order to compute optimal values for the KPIs. This will run optimization analyses that attempt to first minimize and then maximize all KPIs together. For more complex scenarios (e.g., minimizing a specific KPI while maximizing another) one can use the optimize method of the model field (
datastories.model.MultiKpiPredictor
) once the story analysis is completed.
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
- fail_on_error (bool=False):
set to
True
in order to fail (i.e., raise an exception) when problems are detected. Otherwise, the processing will complete, producing a partial story object. In order to check how far the processing has reached, one can use the datastories.story.StoryBase.info() method.
- threads (int):
the number of computational threads to use, uses all available cores by default.
- scale_kpi (str=’no’):
indicates whether the KPI should be scaled. Possible values:
'yes': all runs use the scaled KPI if scaling is detected to be beneficial.
'no': all runs use the original KPI.
'auto': one third of the runs use the scaled KPI if scaling is detected to be beneficial; the rest use the original KPI. When doing a single run, no scaling is applied. When doing two runs, scaling can be applied on one run.
Returns:
An object of type
datastories.story.predict_kpis.Story
wrapping-up the computed model.
Raises:
ValueError
:when an invalid value is provided for one of the input parameters.
datastories.story.StoryError
:when there is a problem fitting the model.
Example:
from datastories.story import predict_kpis
import pandas as pd

df = pd.read_csv('example.csv')
kpi_column_indexes = [1, 'other kpi', 3, 4]
story = predict_kpis(df, df.columns, kpi_column_indexes)
print(story)
- class datastories.story.predict_kpis.Story(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)¶
Encapsulates a multi-KPI non-linear regression model analysis.
Base classes:
Note: Objects of this class should not be manually constructed but rather created using the
datastories.story.predict_kpis()
factory method.
- class ProcessingStage(value)¶
Enumeration declaring the story processing stages.
Possible values:
UNKNOWN = 0
INIT = 1
PREPARE_DATA = 2
PROCESS_DATA = 3
BUILD_MODELS = 4
MERGE_MODELS = 5
VALIDATE_MODEL = 6
OPTIMIZE = 7
WRAP_UP = 8
END = 9
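The interplay between these stages and the resume behaviour described under run() below can be pictured with a small, SDK-independent sketch. ProcessingStage here merely mirrors the values listed above, and closest_resumable is a hypothetical helper illustrating the "closest executable stage" fallback, not part of the SDK:

```python
from enum import IntEnum

# Mirror of the documented processing stages (values as listed above).
class ProcessingStage(IntEnum):
    UNKNOWN = 0
    INIT = 1
    PREPARE_DATA = 2
    PROCESS_DATA = 3
    BUILD_MODELS = 4
    MERGE_MODELS = 5
    VALIDATE_MODEL = 6
    OPTIMIZE = 7
    WRAP_UP = 8
    END = 9

def closest_resumable(requested, completed_through, strict=False):
    """Pick the stage to resume from: the requested stage if its
    intermediate results exist, otherwise the closest stage that can
    be executed (or raise, mimicking strict=True)."""
    if requested <= completed_through + 1:
        return requested
    if strict:
        raise RuntimeError(f"cannot resume from {requested.name}")
    return ProcessingStage(completed_through + 1)

# A story interrupted after BUILD_MODELS cannot resume at OPTIMIZE:
stage = closest_resumable(ProcessingStage.OPTIMIZE, ProcessingStage.BUILD_MODELS)
print(stage.name)  # MERGE_MODELS
```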
- add_model_validation(prediction)¶
Add a prediction containing validation data to the story managed validations.
- property best_run¶
The index of the best analysis run.
The best run is selected as the run with the largest cumulated importance overlap between the main drivers of different KPIs. The overlap is computed pairwise between all KPI pairs of a given run.
- compare_predicted_vs_actual(test_dataframe)¶
Run the Predicted-vs-Actual analysis on the provided test dataframe and return a dataframe containing the same columns, enriched with the results of the Predicted-vs-Actual comparison.
- Args:
- test_dataframe pd.DataFrame:
If None is provided, then the internal _data_frame will be used (if it exists). Otherwise, it must be a dataframe that follows DataStories conventions and normalisations of columns.
- Returns:
The result of the Predicted-vs-Actual comparison and the previous columns of the test dataframe. The method does not mutate the test dataframe. None is returned in case the model is not available.
- property conclusions¶
An analysis summary containing highlights and pointers to detailed insights (object of type
datastories.story.predict_kpis.Conclusion
).
- property data¶
A copy of the story associated dataframe, if available. This is the full source dataframe.
NOTE: When the dataframe has been previously discarded (i.e., by setting the include_data argument to False while saving the story) the associated data is lost and this property will return
None
.
- property data_health¶
A summary of input data quality (object of type
datastories.story.generic.DataHealth
).
- property data_overview¶
An overview of driver importance across all analysis runs (object of type
datastories.story.generic.DataOverview
).
- property failed_kpis¶
A list of KPIs that could not be processed, or None if all KPIs have been successfully modeled.
- property linear_vs_nonlinear¶
An overview of KPI relationships with other columns in the dataframe (object of type
datastories.story.generic.LinearVsNonlinear
).
- property model¶
An object of type
datastories.api.IPredictiveModel
that can be used for making predictions on new data.
- property model_validation¶
An overview of model validations (object of type
datastories.story.predict_kpis.ModelValidation
).
- modify_drivers(replace=None, remove=None, run=None, complexity=1.0)¶
Modify the drivers and complexity of a model from the story. This function generates new models with driver substitutions or removals performed on the already trained model.
Since it starts from an already trained model, time can be saved under the assumption that the substituted variables are similar in information content.
The main intent of this function is to replace a driver that is hard to control by one of its proxies that is easy to control, without having to do a full training run again.
Note that this function does not give you a statistical guarantee about the quality of the resulting model, as no variable selection is performed and the input weights are not retrained.
In general, starting a new story with the driver substitution and removal performed on the input columns will yield a more reliable model than the one created by this function.
Args:
- replace (dict):
Driver labels representing driver replacements, keys will be replaced by values i.e. with input {‘driver1’ : ‘driver2’} the driver1 will be replaced by driver2
- remove (list):
Driver labels to remove from the model
- run (int):
Advanced option to select the run you want to modify. By default the best run is chosen automatically, i.e. the one you see when displaying a story
- complexity (float=1):
Advanced option to increase or decrease the complexity factor of the model. More complex models can have more complex response surfaces and variable interactions. Note that increasing this above 1.0 might result in worse or over-fitted models.
Returns:
An object of type
datastories.story.predict_kpis.Story
wrapping-up the computed model.
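The documented replace/remove semantics can be pictured on a plain list of driver labels. This is an SDK-independent sketch; apply_modifications is a hypothetical helper used only to illustrate the argument semantics, not part of the API:

```python
def apply_modifications(drivers, replace=None, remove=None):
    # Keys in `replace` are substituted by their values; labels in
    # `remove` are dropped, mirroring the documented argument semantics.
    replace = replace or {}
    remove = set(remove or [])
    return [replace.get(d, d) for d in drivers if d not in remove]

drivers = ['temperature', 'pressure', 'humidity']
# Replace a hard-to-control driver by an easy-to-control proxy,
# and drop another driver entirely:
modified = apply_modifications(drivers,
                               replace={'temperature': 'setpoint'},
                               remove=['humidity'])
print(modified)  # ['setpoint', 'pressure']
```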
- property pairwise_plots¶
A collection of variable vs variable plots (object of type
datastories.story.generic.PairwisePlots
).
- property record_info_labels¶
The names of the columns that contain record identification information.
- reset()¶
Reset the execution pointer of a story to the first stage.
Warning: After calling this, all previous results are discarded. One needs to run the story again in order to regenerate the results. This is only possible when the data frame is still available. That is, resetting a story that previously discarded the data frame (e.g., while saving) would render the story unusable. Consequently, this scenario is not allowed and an exception is raised when it is attempted.
- run(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)¶
Resume the execution of a story from a given stage.
The stage to resume from is optional. If not specified, the story is executed from the beginning. If the stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed will be used as the starting point, unless the [strict] argument is set to True. In that case an exception is raised if the execution cannot be resumed from the requested stage.
Args:
- resume_from (self.ProcessingStage=None):
The stage to resume execution from. Should be a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.
- strict (bool=False):
Raise an error if execution cannot be resumed from the requested stage.
- params (dict={}):
Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.
- progress_bar (obj=None):
An object of type
datastories.display.ProgressReporter
to replace the currently used progress reporter. When not specified, the current story progress reporter is not modified. The use case for this is setting a progress bar after the story is loaded, when a progress bar cannot be given to the load function directly (e.g., when a progress bar has to be constructed based on the story).
- check_interrupt (func=None):
an optional callback to check whether analysis execution needs to be interrupted.
Raises:
- StoryError:
if a stage is specified for which no intermediate results are available and the ‘strict’ argument is set to True.
- property run_overview¶
An overview of driver importance across all analysis runs (object of type
datastories.story.predict_kpis.RunOverview
).
- property runs¶
A list containing the results of individual analysis rounds.
Each entry in the list is an object of type
datastories.story.predict_kpis.StoryRun
encapsulating the results associated with a given analysis round.
- save(file_path, include_data=True)¶
Save the story analysis results.
Use this function to persist the results of the
datastories.story.predict_kpis()
analysis. One can reload them and continue investigations at a later moment using thedatastories.story.load()
method.
Args:
- file_path (str):
path to the output file.
- include_data (bool=True):
set to
True
to include a copy of the data in the exported file.
Raises:
datastories.api.errors.StoryError
:when attempting to include data while the story does not contain a data reference. This is the case with stories that have been previously saved without including the data.
- property stats¶
A dictionary containing the model performance statistics and the list of main drivers.
These statistics are computed on the training data for the purpose of evaluating the model prediction performance.
The following statistics are retrieved:
- Prediction Performance:
the prediction performance per KPI (correlation coefficient for regression, or AUC for classification).
- Driver Importance:
relative driver importance per KPI.
- Driver Overlap:
cumulated driver importance overlap computed between all possible pairs of KPIs.
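The pairwise overlap idea can be illustrated with a small, SDK-independent sketch. The importance values and the min-based overlap definition below are assumptions made purely for illustration; the SDK's exact formula is not documented here:

```python
from itertools import combinations

# Hypothetical relative driver importances per KPI (each sums to 1.0).
importance = {
    'yield':   {'temperature': 0.6, 'pressure': 0.3, 'humidity': 0.1},
    'quality': {'temperature': 0.5, 'speed': 0.4, 'pressure': 0.1},
}

def pair_overlap(a, b):
    # Overlap of one KPI pair: shared drivers, counted at the smaller
    # of the two importances (an assumed, illustrative definition).
    return sum(min(a[d], b[d]) for d in a.keys() & b.keys())

# Cumulated overlap across all possible KPI pairs, as described above.
cumulated = sum(pair_overlap(importance[x], importance[y])
                for x, y in combinations(importance, 2))
print(round(cumulated, 2))  # 0.6
```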
- to_html(file_path, title='Predict Multiple KPIs', subtitle='', scenario=VisualizationScenario.REPORT)¶
Export the story visualization to a standalone
HTML
document.
Args:
- file_path (str):
name of the file to export to.
- title (str=’Predict Multiple KPIs’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- class datastories.story.predict_kpis.StoryRun(run_idx=None, upload_function=None, uuid=None, progress_bar=None, folder=None, dependencies=None, parent=None, *args, **kwargs)¶
Encapsulates the results of one analysis round of a multi-KPI non-linear regression model analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
- property correlation_browser¶
An overview of linear and nonlinear correlations across most relevant variables in the analysis (object of type
datastories.story.generic.CorrelationBrowser
).
The most relevant variables are identified based on the amount of correlation they exhibit with respect to other variables in the analysis.
- property driver_overview¶
An overview of driver importance across all KPIs (object of type
datastories.story.predict_kpis.DriverOverview
).
- property drivers¶
Retrieves an overview of all driver variables.
- property kpis¶
Retrieves an overview of all KPIs.
- property metrics¶
A set of metrics computed during analysis.
NOTE: This is an alias for the .stats property.
- property model¶
An object of type
datastories.api.IPredictiveModel
that can be used for making predictions on new data.
- property outliers¶
A dictionary of outlier values per column used in modeling.
- property stats¶
A dictionary containing the model performance statistics and the list of main drivers.
These statistics are computed on the training data for the purpose of evaluating the model prediction performance.
The following statistics are retrieved:
- Prediction Performance:
the prediction performance per KPI (correlation coefficient for regression, or AUC for classification).
- Driver Importance:
relative driver importance per KPI.
- Driver Overlap:
cumulated driver importance overlap computed between all possible pairs of KPIs.
- to_csv(file_path, content='Driver Importance', delimiter=',', decimal='.')¶
Export a list of story metrics to a
CSV
file.
Args:
- file_path (str):
path to the output file.
- content (str=’Driver Importance’):
the type of metrics to export. Possible values:
'Prediction Performance': exports estimated model performance metrics.
'Driver Importance': exports driver importance metrics.
- delimiter (str=’,’):
character used as value delimiter.
- decimal (str=’.’):
character used as decimal point.
Raises:
ValueError
:when an invalid value is provided for the [content] argument.
- to_excel(file_path)¶
Export the list of story metrics to an
Excel
file.
Args:
- file_path (str):
path to the output file.
- to_html(file_path, title='Predict Multiple KPIs Run', subtitle='', scenario=VisualizationScenario.REPORT)¶
Export the story visualization to a standalone
HTML
document.
Args:
- file_path (str):
path to the output file.
- title (str=’Predict Multiple KPIs Run’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- to_pandas(content='Driver Importance')¶
Export a list of model drivers or metrics to a
pandas.DataFrame
object.
Args:
- content (str=’Driver Importance’):
the type of metrics to export. Possible values:
'Prediction Performance': exports estimated model performance metrics.
'Driver Importance': exports driver importance metrics.
Returns:
The constructed
pandas.DataFrame
object.
Raises:
ValueError
:when an invalid value is provided for the [content] argument.
- property what_ifs¶
An interactive what-ifs analysis visualization (object of type
datastories.story.generic.WhatIfs
).
- class datastories.story.predict_kpis.ProgressBar(story=None, runs=None, kpi_list=None, *args, **kwargs)¶
Convenience wrapper for
datastories.display.AggregatedReporter
.
It constructs aggregated progress reporters for multi-KPI stories. To this end, it requires either a story object (if already available) or two parameters that define the processing stages: the number of runs and the list of KPIs. When the story is specified, the number of runs and the KPI list should not be provided.
Args:
- story (obj=None):
an optional multi-KPI story object of type
datastories.story.predict_kpis.Story
from which processing stages will be inferred.
- runs (int=None):
an optional integer specifying the number of runs.
- kpi_list (list=None):
an optional list specifying the story KPIs as would be provided to the analysis.
Raises:
ValueError
:when the provided parameters do not match the specification requirements.
Check Data Health¶
- datastories.story.check_data_health(data, sample_size=None, progress_bar=True, on_snapshot=None, upload_function=None, check_interrupt=None, fail_on_error=False)¶
Check the suitability of a dataset for building statistical models.
Args:
- data (obj):
the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object) or a data descriptor (i.e., adatastories.data.DataDescriptor
object).
- sample_size (int|str=None):
the sample size to use for inferring data types (either an absolute integer value or a percentage, e.g. ‘10%’). If left unspecified, it defaults to the minimum of 100 and 10% of the number of points.
- progress_bar (obj|bool=True):
An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
- on_snapshot (func=None):
an optional callback to be executed when an analysis snapshot is created. The callback receives one argument indicating the path of the snapshot file relative to the current execution folder.
- upload_function (func=None):
an optional callback to upload analysis result files to a client specific storage. The callback receives one argument indicating the path of the result file relative to the current execution folder.
- check_interrupt (func=None):
an optional callback to check whether analysis execution needs to be interrupted.
- fail_on_error (bool=False):
set to
True
in order to fail (i.e., raise an exception) when problems are detected. Otherwise, the processing will complete, producing a partial story object. In order to check how far the processing has reached, one can use the datastories.story.StoryBase.info() method.
Returns:
An object of type
datastories.story.check_data_health.Story
wrapping-up the data health report.
Example:
from datastories.story import check_data_health
import pandas as pd

df = pd.read_csv('example.csv')
story = check_data_health(df)
print(story)
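The documented sample_size semantics (absolute value, percentage string, or a default of the minimum of 100 and 10% of the points) can be sketched in plain Python. resolve_sample_size is a hypothetical helper, not part of the SDK:

```python
def resolve_sample_size(sample_size, n_rows):
    # Percentage string such as '10%' -> fraction of the row count;
    # integer -> used as-is; None -> min(100, 10% of rows), as documented.
    if sample_size is None:
        return min(100, int(0.10 * n_rows))
    if isinstance(sample_size, str) and sample_size.endswith('%'):
        return int(float(sample_size[:-1]) / 100 * n_rows)
    return int(sample_size)

print(resolve_sample_size(None, 5000))   # 100
print(resolve_sample_size('10%', 500))   # 50
print(resolve_sample_size(250, 5000))    # 250
```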
- class datastories.story.check_data_health.Story(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)¶
Encapsulates a data health analysis.
Base classes:
Note: Objects of this class should not be manually constructed but rather created using the
datastories.story.check_data_health()
factory method.
- class ProcessingStage(value)¶
Enumeration declaring the story processing stages.
Possible values:
UNKNOWN = 0
INIT = 1
PREPARE_DATA = 2
COMPUTE_DATA_SUMMARY = 3
END = 4
- property data_summary¶
An interactive data summary visualization.
Returns:
- An object of type
datastories.data.DataSummaryResult
that can be used for assessing variable type and value distribution.
- property stats¶
The set of data health statistics.
- to_html(file_path, title='Data Health Report', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the analysis result visualization to a standalone
HTML
document.
Args:
- file_path (str):
name of the file to export to;
- title (str=’Data Health Report’):
HTML document title;
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- to_pandas()¶
Export the data health stats to a
Pandas
data frame.
Returns:
The constructed
Pandas
data frame object.
Story Results¶
General Results¶
- class datastories.story.generic.CorrelationBrowser(json_content, column_names, correlation_file='edge_bundling.json', slide_deck=None, slide_name='CorrelationBrowser')¶
An overview of linear and nonlinear correlations across the most relevant variables of a
datastories.story.predict_kpis()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
- property slide¶
A serializable representation of the overview that can be used together with a compatible renderer in order to visualize the results.
- class datastories.story.generic.DataHealth(kpi_index_list, global_metrics, column_metrics, health_stats_file='health_stats.json', slide_deck=None, slide_name='DataHealth')¶
An overview of data health.
Base classes:
Note: Objects of this class should not be manually constructed.
- property column_metrics¶
A set of column level health metrics.
- property columns¶
A visualization of column level statistics only.
- property global_metrics¶
A set of global health metrics.
- property metrics¶
A set of data health statistics.
- property slide¶
A serializable representation of the data health that can be used together with a compatible renderer in order to visualize the results.
- property stats¶
A set of data health statistics.
- property visualization¶
The data health visualization.
- class datastories.story.generic.DataOverview(stats, input_indices, kpi_indices, slide_deck=None, slide_name='DataOverview')¶
A high level overview with descriptive statistics for the story input data frame.
Base classes:
Note: Objects of this class should not be manually constructed.
- property metrics¶
The set of data overview statistics.
- property slide¶
A serializable representation of the data overview that can be used together with a compatible renderer in order to visualize the results.
- property stats¶
The set of data overview statistics.
- to_html(file_path, title='Data Overview', subtitle='')¶
Exports the data overview visualization to a standalone
HTML
document.
Args:
- file_path (str):
Name of the file to export to.
- title (str=’Data Overview’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- class datastories.story.generic.LinearVsNonlinear(relations, graphics_file='linear_vs_nonlinear.json', slide_deck=None, slide_name='LinearVsNonlinear')¶
An overview of KPI relations with other columns in a dataframe.
Base classes:
Note: Objects of this class should not be manually constructed.
- classmethod load(file_path, *args, **kwargs)¶
Load a ‘Linear vs Non-linear Relationships’ result from a versioned JSON file.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use a story-decoupled result only within a minor SDK version.
Args:
- file_path (str):
Location of the input file
- graphics_file (str=’linear_vs_nonlinear.json’):
Name of a file containing the same plot data (to be passed in the slide).
- slide_deck (obj=None):
Associated slide deck.
Returns:
An object of type
datastories.story.generic.LinearVsNonlinear
- property metrics¶
A KPI indexed dictionary of relation metrics.
These include the number of columns that have no relation with the KPI and the number of investigated relations.
- property relations¶
A KPI indexed dictionary of relations with other columns in the data-frame.
- save(file_path)¶
Save the ‘Linear vs Non-linear Relationships’ result to a versioned JSON file.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.
Args:
- file_path (str):
location of the output file
- property slide¶
A serializable representation of the ‘Linear vs Non-linear Relationships’ slide that can be used together with a compatible renderer in order to visualize the results.
- property stats¶
A KPI indexed dictionary of relation metrics.
These include the number of columns that have no relation with the KPI and the number of investigated relations.
NOTE: This is an alias for the .metrics property.
- to_json(html_safe=False)¶
Save the ‘Linear vs Non-linear Relationships’ data as a JSON string.
Args:
- html_safe (bool=False):
set to True in order to produce a JSON string that is safe to embed in an HTML page as an attribute value.
Returns:
A JSON string containing the analysis results.
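The exact escaping performed for html_safe is not documented. As an SDK-independent illustration, one plausible approach is escaping the JSON string with the standard library so it can sit inside an HTML attribute value (html.escape here is an assumption about the actual escaping, not the SDK's implementation):

```python
import html
import json

# A made-up result payload with characters that would break an
# HTML attribute if embedded verbatim.
result = {'kpi': 'yield', 'relations': ['<linear>', "q'uote"]}
raw = json.dumps(result)

# Escape quotes and angle brackets so the string can be embedded
# as an HTML attribute value, e.g. data-results="...".
html_safe = html.escape(raw, quote=True)
print(html_safe)
```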
- property visualization¶
The ‘Linear vs Non-linear Relationships’ visualization.
- class datastories.story.generic.PairwisePlots(plots, data=None, stats=None, record_info_columns=None, graphics_file='pair_wise_plots.json', slide_deck=None, slide_name='PairwisePlots')¶
A collection of variable to variable plots.
Note: Objects of this class should not be manually constructed.
- classmethod load(file_path, *args, **kwargs)¶
Load a Pair-Wise Plots result from a versioned JSON file.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use a story-decoupled result only within a minor SDK version.
Args:
- file_path (str):
location of the input file
- data (obj=None):
A Pandas data frame containing reference data for plots. When not provided, some of the plots are not available. In particular, scatter plots are not available in the SDK. Exported slides, on the other hand, rely on the data available to the renderer and are therefore not subject to this constraint.
- stats (dict=None):
A column name indexed dictionary of column statistics. When not provided, part of the plots might not be available (see data argument above).
- record_info_columns (list=[]):
List of column names to use for displaying additional data in plot tooltips.
- graphics_file (str=’pair_wise_plots.json’):
Name of file containing the plot data
- slide_deck (obj=None):
Story to which this slide belongs to.
Returns:
An object of type
datastories.story.generic.PairwisePlots
- plot(x, y, color=None, **kwargs)¶
Plot a variable against another.
Args:
- x (str|int):
Name or index of variable depicted on the horizontal axis.
- y (str|int):
Name or index of variable depicted on the vertical axis.
When the data frame is not provided, some plots are not available.
For example, when the object is created as part of a
datastories.story.predict_kpis.Story
only plots containing aggregated data (e.g., box plots) for variables that are relevant to modelling (i.e., not discarded by the story) are available, while scatter plots are not.
- save(file_path)¶
Save the Pair-Wise Plots result to a versioned JSON file.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.
Args:
- file_path (str):
location of the output file
NOTE: This operation loses the data, stats and record_info_columns information. Upon loading, the .plot() method will be limited and no additional information will be displayed in tooltips, unless the data and the
record_info_columns
arguments are provided again.
- property slide¶
A serializable representation of the pairwise plots that can be used together with a compatible renderer in order to visualize the results.
- property visualization¶
The ‘Pair-Wise Plots’ visualization.
- class datastories.story.generic.WhatIfs(what_ifs_file, minimize_drivers=None, maximize_drivers=None, driver_importance={}, prediction_file=None, data_file=None, record_info_file=None, stats_file=None, outlier_file=None, slide_deck=None, slide_name='WhatIfs')¶
An interactive what-ifs visualization for exploring the models of a
datastories.story.predict_kpis()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
- property drivers¶
Get/set the current driver values
- maximize()¶
Select the driver values that maximize the overall KPIs
- minimize()¶
Select the driver values that minimize the overall KPIs
- property slide¶
A serializable representation of the what-ifs that can be used together with a compatible renderer in order to visualize the results.
- property visualization¶
The ‘What-Ifs’ visualization.
Predict Multiple KPI Story Specific Results¶
- class datastories.story.predict_kpis.Conclusion(stats=None, slide_deck=None, slide_name='Conclusion')¶
An overview of the story analysis conclusions.
Base classes:
Note: Objects of this class should not be manually constructed.
- property metrics¶
The set of story analysis metrics.
NOTE: This is an alias for the .stats property.
- property slide¶
A serializable representation of the story analysis conclusions that can be used together with a compatible renderer in order to visualize the results.
- property stats¶
The set of story analysis statistics.
- property visualization¶
The story analysis conclusions visualization.
- class datastories.story.predict_kpis.DriverOverview(performance, estimated_performance, drivers, driver_overlap, driver_correlations, number_of_vars, overview_file='driver_overview.json', slide_deck=None, slide_name='DriverOverview')¶
An overview of driver importance and KPI prediction metrics across one run of the
datastories.story.predict_kpis()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
- property correlations¶
An overview of driver correlations with other columns in the dataset.
- property driver_overlap¶
An overview of driver importance overlap across runs.
- property drivers¶
An overview of main drivers and their importance across KPIs and runs.
- property estimated_performance¶
Estimation of KPI prediction error metrics across runs, provided when the associated performance could not be computed with enough confidence. It may be
None
- classmethod load(file_path)¶
Load a Driver Overview result from a versioned JSON file.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use a story-decoupled result only within a minor SDK version.
Args:
- file_path (str):
location of the input file.
Returns:
An object of type
datastories.story.predict_kpis.DriverOverview
.
- property metrics¶
The set of driver performance statistics.
- property performance¶
An overview of KPI prediction error metrics across runs.
- save(file_path)¶
Save the Driver Overview result to a versioned JSON file.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.
Args:
- file_path (str):
location of the output file.
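The save/load pair persists results as versioned JSON. A minimal plain-Python sketch of that round-trip pattern (an illustration only, not the SDK implementation; the version key and the compatibility check are assumptions):

```python
import json

def save_result(data, file_path, version="1.0"):
    # Persist the payload together with a version marker, so a later
    # load can check compatibility before using the result.
    with open(file_path, "w") as f:
        json.dump({"version": version, "data": data}, f)

def load_result(file_path, expected_major="1"):
    with open(file_path) as f:
        doc = json.load(f)
    # Reject results written by an incompatible (major) version.
    if doc["version"].split(".")[0] != expected_major:
        raise ValueError(f"incompatible result version: {doc['version']}")
    return doc["data"]
```

This mirrors the documented caveat: decoupled results are only safe to reuse within a compatible SDK version.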
- property slide¶
A serializable representation of the overview that can be used together with a compatible renderer in order to visualize the results.
- property stats¶
The set of driver performance statistics.
- to_excel(file_path)¶
Exports the driver overview to an Excel file.
Args:
- file_path (str):
name of the file to export to.
- to_json(html_safe=False)¶
Save the result as a JSON string.
Args:
- html_safe (bool=False):
set to True in order to produce a JSON string that is safe to embed in an HTML page as an attribute value.
Returns:
A JSON string containing the analysis results.
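What html_safe means in practice can be sketched in plain Python (an illustration of attribute-safe escaping, not the SDK's actual implementation):

```python
import json

def to_json(result, html_safe=False):
    text = json.dumps(result)
    if html_safe:
        # Escape the characters that would terminate or corrupt
        # an HTML attribute value the string is embedded into.
        text = (text.replace("&", "&amp;")
                    .replace('"', "&quot;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;"))
    return text
```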
- to_pandas()¶
Exports the driver overview to a Pandas DataFrame.
Returns:
The constructed Pandas
DataFrame
.
- property visualization¶
The Driver Overview visualization.
- class datastories.story.predict_kpis.ModelValidation(model_file, evaluations=[], evaluation_files=[], record_info_columns=[], training_data=None, slide_deck=None, slide_name='ModelValidation')¶
An overview of story model validation.
Base classes:
Note: Objects of this class should not be manually constructed.
- property evaluations¶
Get/set the meta information of existing model evaluations.
- property metrics¶
The set of model validation statistics.
- property slide¶
A serializable representation of the model validation that can be used together with a compatible renderer in order to visualize the results.
- property stats¶
The set of model validation statistics.
- property training_validation¶
The name of the training validation associated with the model validation.
NOTE: Under normal circumstances, there should be only one training validation per model validation. However, this is not enforced. This property returns the first training validation found in the associated evaluations.
- class datastories.story.predict_kpis.RunOverview(performance, estimated_performance, drivers, driver_overlap, best_run, overview_file='run_overview.json', slide_deck=None, slide_name='RunOverview')¶
An overview of driver importance and KPI prediction metrics across all runs of the datastories.story.predict_kpis() analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
- property best_run¶
Overview of KPI prediction error metrics across runs.
- property driver_overlap¶
Overview of driver importance overlap across runs.
- property drivers¶
Overview of main drivers and their importance across KPIs and runs.
- property estimated_performance¶
Estimation of KPI prediction error metrics across runs, provided when the associated performance could not be computed with enough confidence. It may be None.
- classmethod load(file_path)¶
Load a Run Overview result from a versioned JSON file.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use a story-decoupled result only within a minor SDK version.
Args:
- file_path (str):
location of the input file.
Returns:
An object of type
datastories.story.predict_kpis.RunOverview
.
- property metrics¶
The set of driver performance statistics across analysis runs.
- property performance¶
Overview of KPI prediction error metrics across runs.
- save(file_path)¶
Save the Run Overview result to a versioned JSON file.
NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.
Args:
- file_path (str):
location of the output file.
- property slide¶
A serializable representation of the overview that can be used together with a compatible renderer in order to visualize the results.
- property stats¶
The set of driver performance statistics across analysis runs.
- to_excel(file_path)¶
Exports the run overview to an Excel file.
Args:
- file_path (str):
name of the file to export to.
- to_html(file_path, title='Run Overview', subtitle=None, scenario=VisualizationScenario.REPORT)¶
Exports the visualization to a standalone HTML document.
Args:
- file_path (str):
name of the file to export to.
- title (str=’Run Overview’):
HTML document title.
- subtitle (str=None):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- to_json(html_safe=False)¶
Save the result as a JSON string.
Args:
- html_safe (bool=False):
Set to True in order to produce a JSON string that is safe to embed in an HTML page as an attribute value.
Returns:
A JSON string containing the analysis results.
- to_pandas()¶
Exports the run overview to a Pandas DataFrame.
Returns:
The constructed Pandas
DataFrame
.
- property visualization¶
The Run Overview visualization.
Visualization¶
Display Utils¶
The datastories.display
package contains a collection
of display helpers.
- datastories.display.wide_screen(width=0.95)¶
Make the notebook screen wider when running under Jupyter Notebook.
Args:
- width (float=0.95):
width of notebook as a fraction of the screen width. Should be in the interval [0,1].
Raises:
ValueError:
when the [width] argument is outside the accepted interval.
- datastories.display.init_graphics(should_embed=False, dslibs_location=None)¶
Initializes the DataStories graphics engine.
Use this method at the beginning of your notebooks (Jupyter, JupyterLab, Databricks) to trigger optimal rendering.
Components can either embed the DataStories libraries (should_embed=True) or rely on their running environment (should_embed=False); the default is False.
When should_embed=False, the environment is expected to provide what is needed to load the components: this method loads the scripts into the environment using the version embedded in the SDK. Components in NOTEBOOK mode are then loaded assuming the environment contains the necessary resources.
When should_embed=True, components are responsible for embedding the DataStories library resources, and this method does not act on the HTML:
- if dslibs_location is not provided (None), components will carry their own version of the libraries;
- otherwise, components will try to reach the provided endpoint.
Recommended usage: init_graphics(). On Databricks, it will set up the DataStories libraries in /dbfs/FileStore/DataStories/components_library/.
- Args:
- should_embed: True if components should be responsible for embedding library resources, False otherwise (the environment is in charge).
- dslibs_location: the reference to the DataStories libraries.
- Returns:
Nothing
- Effects:
This method affects the SDK state, and may also affect the HTML environment executing it. The latter is irreversible.
- datastories.display.export_javascript_library(file_path=None)¶
Export the DataStories libraries as JavaScript code.
- Args:
- file_path: The path of the JavaScript file that will contain the library code.
If None, the JavaScript code is returned directly.
- Returns:
The JavaScript library to load DataStories components
- datastories.display.get_progress_bar(progress_bar)¶
A default implementation for a progress bar.
Args:
- progress_bar (obj|bool=False):
an object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
When a datastories.display.ProgressReporter object is provided, it is returned as is.
Returns:
An object of type datastories.api.ProgressReporter.
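The bool-or-object dispatch described above can be sketched as follows (plain Python; SilentReporter and ConsoleReporter are hypothetical stand-ins for the SDK's default implementations):

```python
class SilentReporter:
    """Stand-in for the 'show nothing' default."""
    def increment(self, steps=1):
        pass  # swallow progress updates

class ConsoleReporter:
    """Stand-in for the 'display progress' default."""
    def __init__(self):
        self.step = 0
    def increment(self, steps=1):
        self.step += steps  # a real reporter would also render a bar

def get_progress_bar(progress_bar):
    # Pass reporter objects through unchanged; map booleans
    # to one of the default implementations.
    if isinstance(progress_bar, bool):
        return ConsoleReporter() if progress_bar else SilentReporter()
    return progress_bar
```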
- class datastories.display.ProgressCounter¶
Base class implemented by all progress counters (including progress reporters).
Attributes:
- total (int):
the number of steps required for completion.
- step (int):
the current step.
- start_time (int):
the start time in ns.
- stop_time (int):
the stop time in ns.
- increment(steps=1)¶
Registers a processing advance with a number of steps.
Args:
- steps (int):
the number of steps to advance.
- start(total=1)¶
Initialize the progress range.
Args:
- total (int):
the number of steps required for completion.
- stop()¶
Stop progress monitoring.
- timeout()¶
Mark the step at which the execution timeout occurred.
Use this upon interrupting counting before reaching the end (i.e., step < total).
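A minimal plain-Python sketch of the ProgressCounter contract described above (start/increment/stop/timeout; an illustration, not the SDK class):

```python
import time

class MinimalCounter:
    """Sketch of the ProgressCounter attributes and life cycle."""
    def __init__(self):
        self.total = 0
        self.step = 0
        self.start_time = None
        self.stop_time = None
        self.timed_out_at = None

    def start(self, total=1):
        # Initialize the progress range and record the start time (ns).
        self.total = total
        self.start_time = time.monotonic_ns()

    def increment(self, steps=1):
        # Register a processing advance with a number of steps.
        self.step += steps

    def stop(self):
        self.stop_time = time.monotonic_ns()

    def timeout(self):
        # Mark the step at which execution was interrupted (step < total).
        self.timed_out_at = self.step
        self.stop()
```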
- class datastories.display.ProgressReporter(observers=[])¶
Abstract base class implemented by all progress reporters.
Base classes:
Args:
- observers (list):
list of progress observers to be notified on progress updates.
- property header¶
Get/set the current reporting header.
- increment(steps=1)¶
Register a processing advance with a number of steps.
Args:
- steps (int=1):
number of advance steps.
- log(message)¶
Log a progress message.
Args:
- message (str):
progress message to log.
- on_progress(progress)¶
Log the completion percentage.
Args:
- progress (float=None):
completion percentage to be logged.
- property progress¶
The currently reported progress.
- report()¶
Notify observers on progress updates.
- start(total=1)¶
Start progress reporting.
Args:
- total (int=1):
total number of steps required for completion.
- property state¶
Get/set the currently reported state.
- stop(info='')¶
Stop progress reporting.
Args:
- info (str=’’):
optional message to report.
- class datastories.display.AggregatedReporter(stages=None, observers=None, display=True, bar_length=50)¶
A progress reporter that aggregates progress of a number of independent stages.
Base classes:
Stages are to be specified at the beginning, together with an estimate of each stage's importance relative to the whole execution. The progress of each stage will be individually monitored and reported in the context of the whole execution.
Stages are to be identified and activated by setting the progress header.
Args:
- stages (dict):
a dictionary mapping local stage names to their bounds in the globally reported progress.
- observers (list):
list of observers to be notified about progress updates.
- display (bool=True):
set to
False
in order to disable progress display (e.g., when the display is done by observers)
- bar_length (int=cfg):
optional size of the progress bar. It defaults to the value specified in the SDK configuration settings. That is 25 if no configuration settings are provided.
Example:
stages = { 'Stage 1' : (0,50), 'Stage 2' : (50,100) }
reporter = AggregatedReporter(stages=stages)
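How stage bounds map stage-local progress onto the global range can be illustrated with a small helper (a sketch of the aggregation arithmetic only, not the SDK implementation):

```python
def global_progress(stages, stage, local_fraction):
    # Map a stage-local completion fraction in [0, 1] onto the stage's
    # global bounds, e.g. 50% of 'Stage 2' with bounds (50, 100) -> 75.
    lo, hi = stages[stage]
    return lo + local_fraction * (hi - lo)

stages = {'Stage 1': (0, 50), 'Stage 2': (50, 100)}
```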
- property header¶
Get/set the progress report header.
- log(message)¶
Log a progress message.
Args:
- message (str):
progress message to log.
- on_progress(progress=None)¶
Log the completion percentage.
Args:
- progress (float=None):
completion percentage to be logged.
- reset()¶
Reset the progress reporter.
Plots¶
The datastories.visualization
package contains a collection
of visualizations that facilitate the assessment of selected DataStories
analysis results.
- class datastories.visualization.VisualizableMixin(title='', subtitle='')¶
Mixin for classes that provide a visualization property.
Enables exporting to HTML and managing the visualization settings, and provides a Jupyter representation.
- plot(*args, **kwargs)¶
Display an interactive visualization.
- to_html(file_path, title=None, subtitle=None, scenario=VisualizationScenario.REPORT)¶
Exports the visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=None):
HTML document title.
- subtitle (str=None):
HTML document subtitle.
Raises:
datastories.api.errors.VisualizationError:
when no visualization is defined.
- property vis_settings¶
Get/set the visualization settings.
Raises:
datastories.api.errors.VisualizationError:
when no visualization is defined.
- abstract property visualization¶
The visualization.
- class datastories.visualization.ColorScheme(value)¶
Enumeration of available color encoding schemes:
Possible values:
- For discrete variable encoding:
DISCRETE_12
DISCRETE_12_LIGHT
DISCRETE_10
DISCRETE_8
DISCRETE_8_LIGHT
DISCRETE_8_ACCENT
- For numeric variable encoding:
NUMERIC_RED_YELLOW_GREEN
NUMERIC_RED_YELLOW_BLUE
NUMERIC_RED_BLUE
NUMERIC_PINK_GREEN
NUMERIC_COLD_HOT
- class datastories.visualization.ConclusionsSettings¶
Encapsulates visualization settings for
datastories.visualization.Conclusions
visualizations.
- class datastories.visualization.Conclusions(conclusions=None, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of KPI drivers.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.ConclusionsSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Conclusions visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ConclusionsSettings
objects.
- to_html(file_path, title='Conclusions', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Conclusions visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=’Conclusions’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- class datastories.visualization.ConfusionMatrixSettings(width=480, height=320)¶
Encapsulates visualization settings for datastories.visualization.ConfusionMatrix visualizations.
Args:
- width (int=480):
Graph width in pixels.
- height (int=320):
Graph height in pixels.
Attributes:
Same as the Args section above.
- class datastories.visualization.ConfusionMatrix(prediction_performance, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of model accuracy for binary classification models.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.ConfusionMatrixSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Confusion Matrix visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ConfusionMatrixSettings
objects.
- to_html(file_path, title='Confusion Matrix', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Confusion Matrix visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=’Confusion Matrix’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- datastories.visualization.correlation_browser(file_path=None, raw_content=None, vis_settings=None)¶
Displays a Correlation Browser visualization in a Jupyter notebook based on an input correlation data file.
Args:
- file_path (str=None):
path to the input data file containing a serialized class:datastories.correlation.CorrelationResult object.
- raw_content (str=None):
a string, containing a JSON serialized class:datastories.correlation.CorrelationResult object.
- vis_settings (obj=CorrelationBrowserSettings()):
an object of type
datastories.visualization.CorrelationBrowserSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
NOTE: Either the [file_path] or [raw_content] argument has to be provided but not both.
Returns:
An object of type
datastories.visualization.CorrelationBrowser
Raises:
ValueError:
when both the [file_path] and the [raw_content] arguments are provided.
Example:
from datastories.visualization import correlation_browser
correlation_browser('correlations.json')
- class datastories.visualization.CorrelationBrowserSettings(scale=1, node_opacity=0.9, edge_opacity=0.3, tension=0.65, font_size=15, filter_unconnected=False, min_weight=50, max_weight=100, weight_key='weightMI', show_controls=True, show_inspector=True)¶
Encapsulates visualization settings for datastories.visualization.CorrelationBrowser visualizations.
Args:
- scale (float=1):
Scale factor of the radius [0-1].
- node_opacity (float=0.9):
Opacity of the nodes that aren’t hovered or connected to hovered or selected nodes [0-1].
- edge_opacity (float=0.3):
Opacity of the edges that aren’t hovered or connected to hovered or selected nodes [0-1].
- tension (float=0.65):
The tension of the links. A tension of 0 means straight lines [0-1].
- font_size (int=15):
Font size used for the nodes of the plot [10-32].
- filter_unconnected (boolean=False):
Whether or not nodes that aren’t connected to any other node are filtered from the view.
- min_weight (int=50):
Minimum weight of the links that will be shown [0-100].
- max_weight (int=100):
Maximum weight of the links that will be shown [0-100].
- weight_key (str=’weightMI’):
Type of relations to display [‘weightMI’ for Mutual Information, ‘weightL’ for Linear Correlation].
- show_controls (bool=True):
Set to True in order to display relation controls.
- show_inspector (bool=True):
Set to True in order to display the relation inspector window.
Attributes:
Same as the Args section above.
- class datastories.visualization.CorrelationBrowser(correlation_result=None, raw_content=None, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of correlation between features.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.CorrelationBrowserSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Correlation Browser visualization.
Accepts the same parameters as the constructor for
datastories.visualization.CorrelationBrowserSettings
objects.
- to_html(file_path, title='Correlation Browser', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Correlation Browser visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=’Correlation Browser’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- class datastories.visualization.DataHealthSettings(page_size=25)¶
Encapsulates visualization settings for datastories.visualization.DataHealth visualizations.
Args:
- page_size (int=25):
Maximum number of columns to display on one summary page.
Attributes:
Same as the Args section above.
- class datastories.visualization.DataHealth(data_health=None, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of a data health report.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.DataHealthSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Data Health visualization.
Accepts the same parameters as the constructor for
datastories.visualization.DataHealthSettings
objects.
- to_html(file_path, title='Data Health', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Data Health visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=’Data Health’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- class datastories.visualization.DataSummaryTableSettings(page_size=25, show_console=True)¶
Encapsulates visualization settings for datastories.visualization.DataSummaryTable visualizations.
Args:
- page_size (int=25):
Maximum number of columns to display on one summary page.
- show_console (bool=True):
Set to True in order to display the visualization console.
Attributes:
Same as the Args section above.
- class datastories.visualization.DataSummaryTable(summary=None, column_stats=None, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of a data frame summary.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.DataSummaryTableSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Data Summary visualization.
Accepts the same parameters as the constructor for
datastories.visualization.DataSummaryTableSettings
objects.
- to_html(file_path, title='Data Summary', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Data Summary visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=’Data Summary’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- datastories.visualization.driver_overview(file_path=None, raw_content=None, vis_settings=None)¶
Displays a DriverOverview visualization in a Jupyter notebook based on an input driver overview data file.
Args:
- file_path (str=None):
path to the input driver overview data file.
- raw_content (str=None):
a string containing a JSON serialized driver overview result.
- vis_settings (obj):
an object of type
datastories.visualization.DriverOverviewSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
NOTE: Either the [file_path] or [raw_content] argument has to be provided but not both.
Returns:
An object of type
datastories.visualization.DriverOverview
Raises:
ValueError:
when both the [file_path] and the [raw_content] arguments are provided.
Example:
from datastories.visualization import driver_overview
driver_overview('driver_overview.json')
- class datastories.visualization.DriverOverviewSettings(height=600)¶
Encapsulates visualization settings for datastories.visualization.DriverOverview visualizations.
Args:
- height (int=600):
Graph height in pixels.
Attributes:
Same as the Args section above.
- class datastories.visualization.DriverOverview(driver_overview=None, raw_content=None, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of KPI drivers.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.DriverOverviewSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Driver Overview visualization.
Accepts the same parameters as the constructor for
datastories.visualization.DriverOverviewSettings
objects.
- to_html(file_path, title='Driver Overview', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Driver Overview visualization to a standalone
HTML
document.Args:
- file_path (str):
path to the output file.
- title (str=’Driver Overview’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- class datastories.visualization.ErrorPlotSettings(sort_key='id', highlight_outliers=True, display_confidence_interval=True, connect_dots=False, width=900, height=300)¶
Encapsulates visualization settings for datastories.visualization.ErrorPlot visualizations.
Args:
- sort_key (str=’id’):
The sorting criterion for the X axis. Possible values:
'id': sort on record id.
'actual': sort on record actual KPI value.
'predicted': sort on record predicted value.
- highlight_outliers (bool=True):
set to True if outliers should be highlighted.
- display_confidence_interval (bool=True):
set to True if confidence limits should be displayed.
- connect_dots (bool=False):
set to True if data points should be connected by lines.
- width (int=900):
plot width in pixels.
- height (int=300):
plot height in pixels.
Attributes:
Same as the Args section above.
- class datastories.visualization.ErrorPlot(pva=None, metrics=None, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of prediction error for prediction models.
Both regression and classification models are supported.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.ErrorPlotSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Error Plot visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ErrorPlotSettings
objects.
- to_html(file_path, title='Error Plot', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Error Plot visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=’Error Plot’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- class datastories.visualization.FeatureRanksTableSettings(height=460, show_console=True)¶
Encapsulates visualization settings for datastories.visualization.FeatureRanksTable visualizations.
Args:
- height (int=460):
graph height in pixels.
- show_console (bool=True):
displays the visualization console where update operations are logged.
Attributes:
Same as the Args section above.
- class datastories.visualization.FeatureRanksTable(feature_ranks, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of feature ranking.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.FeatureRanksTableSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Feature Ranking visualization.
Accepts the same parameters as the constructor for
datastories.visualization.FeatureRanksTable
objects.
- to_html(file_path, title='Feature Ranking', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Feature Ranking visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=’Feature Ranking’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- class datastories.visualization.OutlierPlotSettings(width=800, height=200, x_padding=0.2, y_padding=0.2, marker_size=32, hover_marker_size_delta=32, animations=500, show_jitter=True, show_cdf=True, show_iqr=True, show_summary=True, show_console=True, show_legend=True, low_threshold=0.05, high_threshold=0.95)¶
Encapsulates visualization settings for datastories.visualization.OutlierXPlot visualizations.
Args:
- width (int=800):
graph width in pixels.
- height (int=200):
graph height in pixels.
- x_padding (float=0.2):
padding on horizontal axis.
- y_padding (float=0.2):
padding on vertical axis.
- marker_size (int=32):
size of the point marker.
- hover_marker_size_delta (int=32):
size of the point hover marker.
- animations (int=500):
animation duration in milliseconds.
- show_jitter (bool=True):
set to True to add jitter to the vertical dimension, to better distinguish points.
- show_cdf (bool=True):
set to True to display the cumulative distribution function.
- show_iqr (bool=True):
set to True to display the inter-quartile range, as specified in the lower and higher threshold arguments.
- show_summary (bool=True):
set to True to display the summary table.
- show_console (bool=True):
set to True to display the visualization console where update operations are logged.
- low_threshold (float=0.05):
the lower threshold for the inter-quartile range.
- high_threshold (float=0.95):
the upper threshold for the inter-quartile range.
Attributes:
Same as the Args section above.
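The low_threshold/high_threshold pair marks the quantile band outside which points plot as outliers. A plain-Python sketch of that flagging logic (nearest-rank quantiles; an illustration only, not the SDK implementation):

```python
def quantile_bounds(values, low=0.05, high=0.95):
    # Approximate the band delimited by the low and high quantiles;
    # values outside it would be rendered as outliers.
    s = sorted(values)
    def q(p):
        # Nearest-rank quantile: good enough for a sketch.
        idx = min(len(s) - 1, max(0, round(p * (len(s) - 1))))
        return s[idx]
    return q(low), q(high)

def flag_outliers(values, low=0.05, high=0.95):
    lo, hi = quantile_bounds(values, low, high)
    return [v < lo or v > hi for v in values]
```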
- class datastories.visualization.OutlierXPlot(outliers_result, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of outliers resulting from a one dimensional analysis.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.OutlierPlotSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Outliers visualization.
Accepts the same parameters as the constructor for
datastories.visualization.OutlierPlotSettings
objects.
- to_html(file_path, title='Outliers', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Outliers visualization to a standalone HTML document.
Args:
- file_path (str):
path to the output file.
- title (str=’Outliers’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- datastories.visualization.plot_xy(data, x, y, color=None, info_columns=None, **kwargs)¶
Create an X vs Y plot.
Args:
- data (obj):
A Pandas data frame containing the data to be visualized.
- x (str|int):
Name or index of the variable for the horizontal axis.
- y (str|int):
Name or index of the variable for the vertical axis.
- color (str|int=None):
Optional name or index for a variable to be used for encoding in the color dimension.
- info_columns (list):
Optional list of names or indices of columns used to provide additional info (e.g., in tooltips).
- kwargs (dict):
Dictionary of additional options to be used for configuring the visualization. See
datastories.visualization.PairWisePlotSettings
for a complete list.
- class datastories.visualization.PairWisePlotSettings(width=600, height=400, color_scheme=ColorScheme.DEFAULT)¶
Encapsulates visualization settings for datastories.visualization.PairWisePlot visualizations.
Args:
- color_scheme (obj):
An object of type
datastories.visualization.ColorScheme
- width (int=600):
The plot width in pixels.
- height (int=400):
The plot height in pixels.
Attributes:
Same as the Args section above.
- class datastories.visualization.PairWisePlot(plot_json, data=None, record_info_columns=None, show_navigator=False, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of two variable relations.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook simply by evaluating the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.PairWisePlotSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set-up and display the ‘Pair-Wise Plots’ visualization.
Accepts the same parameters as the constructor for
datastories.visualization.PairWisePlotSettings
objects.
- to_html(file_path, title='', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the ‘Pair-Wise Plots’ visualization to a standalone
HTML
document.
Args:
- file_path (str):
path to the output file.
- title (str=’’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type datastories.api.VisualizationScenario to indicate the use scenario.
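Since PairWisePlot objects are not constructed manually, a sketch assuming `pw` is a PairWisePlot already obtained from the SDK; the helper name export_pairwise, the file path, and the settings values are all illustrative:

```python
try:
    from datastories.visualization import PairWisePlotSettings
except ImportError:
    PairWisePlotSettings = None  # DataStories SDK not installed

def export_pairwise(pw, file_path):
    """Tune the settings on an existing PairWisePlot, then export it.

    `pw` is assumed to be a PairWisePlot object returned by the SDK.
    """
    # Set vis_settings before displaying or exporting, as documented.
    pw.vis_settings = PairWisePlotSettings(width=800, height=500)
    pw.to_html(file_path, title="Pair-Wise Plots")
```

The same vis_settings-then-export pattern applies to the other visualization classes in this module.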
- class datastories.visualization.PredictedVsActualSettings(highlight_outliers=True, show_metrics=True, width=600)¶
Encapsulates visualization settings for
datastories.visualization.PredictedVsActual
visualizations.
Args:
- highlight_outliers (bool=True):
set to True if outliers should be highlighted.
- show_metrics (bool=True):
set to True if prediction performance metrics should be displayed.
- width (int=600):
graph width in pixels.
Attributes:
Same as the Args section above.
- class datastories.visualization.PredictedVsActual(pva=None, metrics=None, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of model accuracy for prediction models.
Both regression and classification models are supported.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook simply by evaluating the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.PredictedVsActualSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set up and display the Predicted vs Actual visualization.
Accepts the same parameters as the constructor for
datastories.visualization.PredictedVsActualSettings
objects.
- to_html(file_path, title='Predicted vs Actual', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Predicted vs Actual visualization to a standalone
HTML
document.
Args:
- file_path (str):
path to the output file.
- title (str=’Predicted vs Actual’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type datastories.api.VisualizationScenario to indicate the use scenario.
- class datastories.visualization.PrototypeTableSettings(height=320, show_console=True, selectable=True, condensed=True)¶
Encapsulates visualization settings for
datastories.visualization.PrototypeTable
visualizations.
Args:
- height (int=320):
graph height in pixels.
Attributes:
Same as the Args section above.
- class datastories.visualization.PrototypeTable(prototypes, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation of feature prototypes.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook simply by evaluating the name of an object of this class.
Attributes:
- vis_settings (obj):
an object of type
datastories.visualization.PrototypeTableSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
- plot(*args, **kwargs)¶
Convenience function to set-up and display the Prototypes visualization.
Accepts the same parameters as the constructor for
datastories.visualization.PrototypeTableSettings
objects.
- to_html(file_path, title='Prototypes', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the Prototypes visualization to a standalone
HTML
document.
Args:
- file_path (str):
path to the output file.
- title (str=’Prototypes’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
- datastories.visualization.what_ifs(file_path=None, raw_content=None, init_values=None, minimize_values=None, maximize_values=None, vis_settings=None)¶
Displays a What-Ifs visualization in a Jupyter notebook based on an input RSX model file.
Args:
- file_path (str=None):
path to the input RSX model file. If
None
the [raw_content] argument has to be provided.
- raw_content (bytes=None):
a bytes object, containing the source of the backing RSX model.
- init_values (list=None):
list of initial driver values.
- minimize_values (list=None):
driver values that minimize the KPI.
- maximize_values (list=None):
driver values that maximize the KPI.
- vis_settings (obj=WhatIfsSettings()):
An object of type
datastories.visualization.WhatIfsSettings
containing the initial visualization settings.
NOTE: Either the [file_path] or the [raw_content] argument has to be provided, but not both.
Returns:
An object of type
datastories.visualization.WhatIfs
Raises:
ValueError:
when both the [file_path] and the [raw_content] arguments are provided.
Example:
from datastories.visualization import what_ifs

what_ifs('my_model.rsx')
- class datastories.visualization.WhatIfsSettings(show_controls=True, show_console=True, show_optimizer=False)¶
Encapsulates visualization settings for
datastories.visualization.WhatIfs
visualizations.
Args:
- show_controls (bool=True):
Set to True in order to display the visualization controls.
- show_console (bool=True):
Set to True in order to display the visualization console.
- show_optimizer (bool=False):
Set to True in order to enable the optimizer functionality.
Attributes:
Same as the Args section above.
- class datastories.visualization.WhatIfs(init_values=None, minimize_values=None, maximize_values=None, driver_importances=None, raw_model=None, vis_settings=None, *args, **kwargs)¶
Encapsulates a visual representation for exploring the influence of driver variables on target KPIs.
One can display this visualization in an IPython Notebook simply by evaluating the name of an object of this class.
Note: Objects of this class should not be manually constructed.
- property drivers¶
Get/set the driver values.
- maximize()¶
Identify a set of driver values that maximize the KPI.
- minimize()¶
Identify a set of driver values that minimize the KPI.
- plot(*args, **kwargs)¶
Convenience function to set up and display the What-Ifs visualization.
Accepts the same parameters as the constructor for
datastories.visualization.WhatIfsSettings
objects.
- to_html(file_path, title='What-Ifs', subtitle='', scenario=VisualizationScenario.REPORT)¶
Exports the What-Ifs visualization to a standalone
HTML
document.
Args:
- file_path (str):
path to the output file.
- title (str=’What-Ifs’):
HTML document title.
- subtitle (str=’’):
HTML document subtitle.
- scenario (enum=VisualizationScenario.REPORT):
A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.
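The what_ifs entry point and the WhatIfs methods above can be combined into a small optimization flow. A hedged sketch, assuming an RSX model file exists at the given path; the helper name explore_kpi_extremes is illustrative:

```python
try:
    from datastories.visualization import what_ifs, WhatIfsSettings
except ImportError:
    what_ifs = WhatIfsSettings = None  # DataStories SDK not installed

def explore_kpi_extremes(rsx_path):
    """Load an RSX model, enable the optimizer, and locate KPI extremes.

    `rsx_path` is a placeholder for a real RSX model file path.
    """
    wi = what_ifs(rsx_path, vis_settings=WhatIfsSettings(show_optimizer=True))
    wi.maximize()              # drivers now hold KPI-maximizing values
    best = wi.drivers
    wi.minimize()              # drivers now hold KPI-minimizing values
    worst = wi.drivers
    return best, worst
```

In a notebook, evaluating the returned WhatIfs object displays the interactive visualization with the optimizer controls enabled.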
MLflow Support¶
Story modelling¶
The mlflow package is concerned with integration with MLflow. It provides a model implementation, together with an auto-logging feature.
This module requires the EXPORT permission in the license agreement.
- datastories.mlflow.save_model(story_model: Model | IPredictiveModel, path, conda_env=None, mlflow_model=None, signature=None, **kwargs)¶
Save a story as an MLflow model.
If not provided, the conda environment will be created with minimal dependencies on the DataStories SDK library and mlflow.
- Args:
- story_model:
the DataStories model to save.
- path:
local path where the model is saved.
- conda_env:
conda environment. If None, one will be created.
- mlflow_model:
the MLflow model to use. If None, a default is used.
- datastories.mlflow.log_model(story_model: Model | IPredictiveModel, artifact_path, conda_env=None, signature=None, registered_model_name=None, **kwargs)¶
Log a DataStories model to MLflow, using the datastories.mlflow module. This method is called automatically when auto-logging is on.
To generate a signature, refer to the MLflow documentation or start from datastories.mlflow.compute_signature. The signature is inferred automatically when auto-logging.
- Args:
- story_model:
the DataStories model to be saved (required).
- artifact_path:
see mlflow.models.Model.log; on auto-logging, model is used.
- conda_env:
see mlflow.models.Model.log (optional).
- registered_model_name:
see mlflow.models.Model.log (optional).
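A sketch combining save_model and log_model, assuming `story_model` is a DataStories Model or IPredictiveModel and an MLflow run is active; the helper name persist_story and paths are illustrative:

```python
try:
    import datastories.mlflow as ds_mlflow
except ImportError:
    ds_mlflow = None  # DataStories SDK not installed

def persist_story(story_model, local_path, artifact_path="model"):
    """Save a story locally and log it to the active MLflow run.

    `story_model` is assumed to be a DataStories Model or IPredictiveModel.
    """
    # Writes an MLflow model directory; a minimal conda env is
    # generated when none is supplied.
    ds_mlflow.save_model(story_model, local_path)
    # Logs the same model as an artifact of the current MLflow run.
    ds_mlflow.log_model(story_model, artifact_path)
```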
- datastories.mlflow.load_model(model_uri, **kwargs)¶
Load an MLflow DataStories model (custom model flavor).
The returned model implements the MLflow model API and additionally allows retrieving the underlying DataStories story object.
- datastories.mlflow.autolog(turn_off=False, license_path=None, signature: Literal['auto', 'off'] = 'auto')¶
Enable auto logging of MLflow models.
When auto-logging is on, the KPIs associated with the story are logged as parameters, as a single comma-separated list of names.
Subsequent calls to datastories.story.predict_kpis will log an MLflow model under the path model.
It is not required to manually open an MLflow run when auto-logging is activated, but doing so allows the user to log additional parameters and information.
When auto-logging, a signature is automatically inferred for the story model (signature=’auto’). To turn this off and log no signature, pass signature=’off’ instead.
- Args:
- turn_off:
when True, auto-logging is turned off.
- license_path:
a license path to be used in the model. See autosave_license for more info.
- signature:
default is ‘auto’; pass ‘off’ to turn off logging of the signature.
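A minimal sketch of toggling auto-logging, guarded so it runs even without the SDK installed; the helper names are illustrative:

```python
try:
    import datastories.mlflow as ds_mlflow
except ImportError:
    ds_mlflow = None  # DataStories SDK not installed

def enable_autologging():
    """Turn on MLflow auto-logging with automatic signature inference."""
    ds_mlflow.autolog(signature="auto")
    # From here on, calls to datastories.story.predict_kpis log an
    # MLflow model under the artifact path "model", and the story's
    # KPIs are logged as parameters.

def disable_autologging():
    """Turn auto-logging back off."""
    ds_mlflow.autolog(turn_off=True)
```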
Optimizer modelling¶
The mlflow.optimization package is concerned with integrating Optimization Models within MLflow. This feature is experimental and might change in a future release without preserving backward compatibility.
This module requires the EXPORT permission in the license agreement.
- class datastories.mlflow.optimization.OptimizationModel(model: Model | IPredictiveModel, optimization_spec, variable_ranges=None)¶
The OptimizationModel class encodes an optimization specification object for MLflow processing.
Such an object is made of a DataStories model and an Optimization Specification. It can optionally be enriched with input ranges, in case the constraints on the input variables cannot be encoded in the specification.
This object is intended to be logged by an MLflow experiment. It can also be used as a shortcut for the optimization feature.
- model:
The DataStories model to be optimized; should be a BasePredictor or a Model.
- optimization_spec:
The OptimizationSpecification object that encodes the objectives and the constraints
- variable_ranges:
Additional constraints on inputs that are less strict than the specification constraints (default is None).
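A sketch of wrapping a model and its specification for MLflow, assuming `model` is a DataStories Model or BasePredictor and `spec` an OptimizationSpecification; the helper name log_optimization is illustrative:

```python
try:
    from datastories.mlflow.optimization import OptimizationModel, log_model
except ImportError:
    OptimizationModel = log_model = None  # DataStories SDK not installed

def log_optimization(model, spec, ranges=None):
    """Wrap a model and its specification, then log the result to MLflow.

    `model` is assumed to be a DataStories Model or BasePredictor and
    `spec` an OptimizationSpecification object; `ranges` optionally adds
    input constraints not expressible in the specification.
    """
    opt = OptimizationModel(model, spec, variable_ranges=ranges)
    log_model(opt)  # artifact_path defaults to "model"
    return opt
```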
- datastories.mlflow.optimization.save_model(ds_optimization_model: Model | IPredictiveModel, path, conda_env=None, signature=None, mlflow_model=None, **kwargs)¶
Save an Optimization model as an MLflow model.
If not provided, the conda environment will be created with minimal dependencies on the DataStories SDK library and mlflow.
- Args:
- ds_optimization_model:
the DataStories Optimization model to save.
- path:
local path where the model is saved.
- conda_env:
conda environment. If None, one will be created.
- mlflow_model:
the MLflow model to use. If None, a default is used.
- datastories.mlflow.optimization.log_model(ds_optimization_model, artifact_path='model', conda_env=None, signature=None, registered_model_name=None, **kwargs)¶
Log a DataStories Optimization model to MLflow, using the datastories.mlflow.optimization module.
To generate a signature, refer to the MLflow documentation or start from datastories.mlflow.optimization.compute_signature.
- Args:
- ds_optimization_model:
the DataStories Optimization model to be saved (required).
- artifact_path:
see mlflow.models.Model.log.
- conda_env:
see mlflow.models.Model.log (optional).
- registered_model_name:
see mlflow.models.Model.log (optional).
- datastories.mlflow.optimization.load_model(model_uri, **kwargs)¶
Load an MLflow DataStories Optimization model.
The returned model implements the MLflow model API. It is the combination of an optimizer and a model.