SDK Reference

General Interfaces

datastories.api.get_version()

Get the version of the currently loaded modules.

Returns:

  • A dictionary containing loaded modules and corresponding versions.
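
A minimal usage sketch (the exact modules listed depend on the installed SDK components):

Example:

from datastories.api import get_version

# Print each loaded module together with its version.
for module, version in get_version().items():
    print(module, version)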

Base classes and interfaces

class datastories.api.IAnalysisResult

Interface implemented by all analysis results.

plot(*args, **kwargs)

Plots a graphical representation of the results in Jupyter Notebook.

to_csv(file_path, delimiter=',', decimal='.')

Export the result to a CSV file.

Args:

  • file_path (str):

    path to the output file.

  • delimiter (str=','):

    character used as value delimiter.

  • decimal (str='.'):

    character used as decimal point.

Raises:

  • ValueError:

    when the object returned by to_pandas is not a Pandas data frame.

to_excel(file_path, tab_name='Statistics')

Export the result to an Excel file.

Args:

  • file_path (str):

    path to the output file.

  • tab_name (str='Statistics'):

    name of the Excel tab where to save the result.

Raises:

  • ValueError:

    when the object returned by to_pandas is not a Pandas data frame.

to_html(file_path, title='', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the analysis result visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=''):

    HTML document title.

  • subtitle (str=''):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type datastories.api.VisualizationScenario indicating the use scenario.

abstract to_pandas()

Exports the result to a Pandas DataFrame.

Returns:

  • The constructed Pandas DataFrame.

to_txt(file_path)

Export the result to a TXT file.

Args:

  • file_path (str):

    path to the output file.
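
The export methods share a common pattern. A minimal sketch, assuming result is an object implementing IAnalysisResult obtained from some analysis:

Example:

# 'result' is assumed to implement datastories.api.IAnalysisResult.
df = result.to_pandas()                             # inspect as a Pandas DataFrame
result.to_csv('result.csv', delimiter=';')          # ';' as value delimiter
result.to_excel('result.xlsx', tab_name='Summary')  # 'Summary' is an illustrative tab name
result.to_html('result.html', title='My analysis')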

class datastories.api.IConsole

Interface implemented by all message loggers.

abstract log(message)

Log a message to the console.

Args:

  • message (string):

    the message to log.

class datastories.api.IPrediction(data)

Bases: IAnalysisResult

Interface implemented by all prediction results.

Args:

  • data (obj):

    The associated prediction input data.

abstract property metrics

A dictionary containing prediction performance metrics.

These metrics are computed when the data frame used for prediction includes KPI values, for the purpose of evaluating the model prediction performance.

class datastories.api.IPredictiveModel

Interface implemented by all prediction models.

abstract property metrics

A dictionary containing model prediction performance metrics.

The type of metrics depends on the model type (i.e., regression or classification).

abstract predict(data_frame)

Predict the model KPI on a new data frame.

Args:

  • data_frame (obj):

    the data frame on which the model associated KPI is to be predicted.

Returns:

  • An object of type datastories.regression.PredictionResult encapsulating the prediction results.

Raises:

  • ValueError:

    when not all required columns are provided.
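
A minimal sketch of the prediction flow, assuming model is a concrete IPredictiveModel implementation and the data frame provides all required columns:

Example:

import pandas as pd

# 'model' is assumed to implement datastories.api.IPredictiveModel.
df = pd.read_csv('example.csv')
prediction = model.predict(df)   # raises ValueError if required columns are missing
print(model.metrics)             # model prediction performance metrics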

to_cpp(file_path)

Export the model to a C++ file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_excel(file_path)

Export the model to an Excel file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_matlab(file_path)

Export the model to a MATLAB file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_py(file_path)

Export the model to a Python file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_r(file_path)

Export the model to an R file.

Args:

  • file_path (str):

    path to the output file.

Raises:

class datastories.api.IStory(params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)

Bases: IAnalysisResult

Interface implemented by all story analyses.

Args:

  • params (dict):

    dictionary containing user and inferred analysis parameters.

  • metainfo (dict):

    dictionary containing process parameters (e.g., progress pointers).

  • raw_results (dict):

    dictionary containing rainstorm processing results.

  • results (dict):

    dictionary containing processing results.

  • folder (str=None):

    the story working folder. Leave unspecified to have one created at runtime.

  • notes (list=[]):

    a list of notes.

  • upload_function (callback=None):

    a function to upload files to a storage (relevant for the client).

  • on_snapshot (callback=None):

    a callback to be executed upon saving a snapshot (e.g., upload snapshot to S3).

  • progress_bar (obj|bool=False):

    a progress bar object.

abstract add_note(note)

Add an annotation to the story results.

The already present annotations can be retrieved using the datastories.api.IStory.notes() property.

Args:

  • note (str):

    the annotation to be added.

abstract clear_note(note_id)

Remove a specific annotation associated with the story analysis.

Args:

  • note_id (int):

    the index of the note to be removed.

Raises:

  • ValueError:

    when the note index is unknown.

abstract clear_notes()

Clear the annotations associated with the story analysis.

abstract property info

Displays story execution information.

abstract static is_compatible(current_version_string, ref_version_string)

Checks if a story version is compatible with a reference version.

abstract classmethod load()

Loads a previously saved story.

abstract property metrics

Returns a set of metrics computed during analysis.

abstract property notes

The list of all annotations currently associated with the story analysis.

abstract reset()

Reset the execution pointer of a story to the first stage.

abstract run(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)

Resumes the execution of a story from a given stage.

The stage to resume from is optional. If not specified, the story is executed from the beginning. If the stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed will be used as the starting point, unless the [strict] argument is set to True. In that case an exception will be raised if the execution cannot be resumed from the requested stage.

Args:

  • resume_from (StoryProcessingStage=None):

    The stage to resume execution from. Should be a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.

  • strict (bool=False):

    Raise an error if execution cannot be resumed from the requested stage.

  • params (dict={}):

    Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.

  • progress_bar (obj=None):

    An object of type datastories.display.ProgressReporter to replace the currently used progress reporter. When not specified, the current story progress reporter is not modified. The use case for this is setting a progress bar after the story is loaded, when a progress bar cannot be given to the load function directly (e.g., when a progress bar has to be constructed based on the story).

  • check_interrupt (func=None):

    an optional callback to check whether analysis execution needs to be interrupted.

Raises:

abstract save(file_path)

Saves the story analysis results.

abstract property stats

Returns a set of stats computed during analysis.
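
A minimal sketch of a typical story lifecycle, assuming story is an instance of a concrete IStory implementation (the file name is illustrative):

Example:

# 'story' is assumed to implement datastories.api.IStory.
story.run()                                  # execute the analysis from the beginning
story.add_note('First pass on the raw data.')
print(story.notes)                           # annotations added so far
story.save('my_story')                       # persist the analysis results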

class datastories.api.IStoryDeprecated(notes=None)

Bases: IAnalysisResult

Interface implemented by all story analyses.

Args:

  • notes (list=[]):

    a list of notes.

add_note(note)

Add an annotation to the story results.

The already present annotations can be retrieved using the datastories.api.IStory.notes() property.

Args:

  • note (str):

    the annotation to be added.

clear_note(note_id)

Remove a specific annotation associated with the story analysis.

Args:

  • note_id (int):

    the index of the note to be removed.

Raises:

  • ValueError:

    when the note index is unknown.

clear_notes()

Clear the annotations associated with the story analysis.

static is_compatible(current_version_string, ref_version_string)

Checks if a story version is compatible with a reference version.

abstract static load(file_path)

Loads a previously saved story.

abstract property metrics

Returns a set of metrics computed during analysis.

property notes

The list of all annotations currently associated with the story analysis.

abstract save(file_path)

Saves the story analysis results.

class datastories.api.IProgressObserver

Interface implemented by all progress report observers.

abstract on_progress(progress)

Callback triggered upon progress update.

Args:

  • progress (float):

    the amount of progress. Possible values: [0-1]
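
A minimal sketch of a custom observer (assuming an instance can be passed wherever the SDK accepts a progress observer):

Example:

from datastories.api import IProgressObserver

class PrintObserver(IProgressObserver):
    # Called upon progress update with a value in the range [0-1].
    def on_progress(self, progress):
        print('progress: {:.0%}'.format(progress))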

class datastories.api.ISlide(slide_deck=None, file_path='slide.json', slide_name=None)

Interface implemented by slides.

A slide is a collection of data and references to data that a renderer can transform into a visual representation.

Args:

  • slide_deck (obj=None):

    a datastories.api.SlideDeck object used to manage the slide.

  • file_path (str='slide.json'):

    path to a file to be used for serializing the slide.

property slide

The slide content.

The slide content is a versioned and serializable entity that can be used to visualize the slide without requiring access to the object itself.

NOTE: This information cannot be used to construct the object by deserialization.

class datastories.api.SlideDeck

Base class for slide decks.

A slide deck is a convenience component that facilitates managing a collection of slides.

add_slide(slide)

Adds a slide to the deck.

Args:

  • slide (datastories.api.ISlide):

    the slide to be added.

clear_slides()

Remove the slides in the deck.

goto_slide(slide_idx)

Sets the current slide pointer to a specific value.

Args:

  • slide_idx (int):

    the new value for the slide pointer.

has_slides()

Check if the slide deck contains any slides (i.e., it is not empty).

Returns:

  • True if the slide deck contains slides, otherwise False.

insert_slide(pos_idx, slide)

Inserts a slide in the deck at a given position.

Args:

  • pos_idx (int):

    the position at which the slide is to be inserted.

  • slide (datastories.api.ISlide):

    the slide to be inserted.

next_slide()

Retrieves the next slide in the deck and advances the slide pointer.

If the deck is at the end or has no slides, it returns None.

Returns:

  • The next slide in the deck or None.

property slides

The deck slides.

sort_slides(names)

Sort the slides based on a list of names.

Slides are sorted in place.

Args:

  • names (list):

    A list of slide names indicating the desired sort order. Slides not mentioned in the list will be added at the end.
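
A minimal sketch of iterating over a deck, assuming deck is a SlideDeck populated elsewhere:

Example:

# 'deck' is assumed to be a datastories.api.SlideDeck instance.
if deck.has_slides():
    deck.goto_slide(0)            # rewind the slide pointer
    slide = deck.next_slide()
    while slide is not None:      # next_slide() returns None at the end of the deck
        print(slide)
        slide = deck.next_slide()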

class datastories.core.utils.ExportableMixin

to_csv(file_path, delimiter=',', decimal='.', df=None)

Export the result to a CSV file.

Args:

  • file_path (str):

    path to the output file.

  • delimiter (str=','):

    character used as value delimiter.

  • decimal (str='.'):

    character used as decimal point.

  • df (pandas.DataFrame=None):

    data frame to export. If left unspecified, the data frame returned by the to_pandas method of the object is used.

Raises:

  • ValueError:

    when the serialized object is not a Pandas data frame.

to_excel(file_path, tab_name='Statistics', df=None)

Export the result to an Excel file.

Args:

  • file_path (str):

    path to the output file.

  • tab_name (str='Statistics'):

    name of the Excel tab where to save the result.

  • df (pandas.DataFrame=None):

    data frame to export. If left unspecified, the data frame returned by the to_pandas method of the object is used.

Raises:

  • ValueError:

    when the serialized object is not a Pandas data frame.

class datastories.core.utils.ManagedObject(dependencies=None, *args, **kwargs)

An object that has a user controllable lifespan.

Typically inherited by classes that require special resources to be allocated and manually released outside of the Python object lifetime management.

Note: Objects of this class should not be manually constructed.

assert_alive()

Triggers an exception if the object has been manually released.

release()

Releases the object associated storage.

Note: This function should only be used in order to force releasing allocated resources. Using the object after this point would lead to an exception.

class datastories.core.utils.StorageBackedObject(folder=None, files=None, *args, **kwargs)

An object that stores part of its resources on disk and loads them on demand.

Base classes:

The resources may be provided by the object dependencies or by the object associated storage. When resources are specified, the object can be made independent from its dependencies by copying the listed resources to its associated storage.

Note: Objects of this class should not be manually constructed.

make_independent(base_folder='')

Make object independent by copying required resources to the own folder.

Args:

  • base_folder (str=''):

    the base folder for the unique object folder that will hold the required resources.

Errors

class datastories.api.errors.DatastoriesError(value='')

Base exception class for the DataStories SDK.

class datastories.api.errors.ObjectError(value='')

Exception generated when SDK managed objects are not valid.

class datastories.api.errors.LicenseError

Exception generated when accessing license protected functionality using an invalid license.

class datastories.api.errors.ConversionError(value='')

Error raised when data conversion fails.

class datastories.api.errors.VisualizationError(value='')

Error raised when result visualization fails.

class datastories.api.errors.StoryError(value='')

Base class for all story analysis related errors.

class datastories.api.errors.StoryDataLoadingError(value='')

Exception generated when a story analysis cannot load the provided input data.

class datastories.api.errors.StoryDataPreparationError(value='')

Exception generated when a story analysis cannot preprocess the provided data.

class datastories.api.errors.StoryProcessingError(value='')

Exception generated when a story analysis cannot be performed.

class datastories.api.errors.StoryInterrupted(value='')

Exception generated when a story analysis execution is interrupted.

class datastories.api.errors.ParserError(value='')

Base class for all file parsing and validation related errors.

class datastories.api.errors.FormatError(value='')

Error raised when the provided file is not in a readable format (unreadable csv, …)

class datastories.api.errors.ValidationError(value='')

Error raised when the parser was able to read the file structure, but an error occurred during validation.

class datastories.api.errors.TypeNotRecognized(value='')

Error raised when the SDK parser cannot determine the provided file type.

class datastories.api.errors.TypeNotSupported(value='')

Error raised when the provided file type cannot be handled by the SDK parser.

class datastories.api.errors.ExternalDataConnectionError(value='')

Error raised when VBA scripts or an external data connection are detected in a spreadsheet.

Constants and Enumerations

class datastories.api.OutlierType(value)

Enumeration of possible outlier types.

FAR_OUTLIER_HIGH = 2
FAR_OUTLIER_LOW = -2
NO_OUTLIER = 0
OUTLIER_HIGH = 1
OUTLIER_LOW = -1

License Management

datastories.api.get_activation_info()

Get information required to create and activate a DataStories license.

Returns:

  • A dictionary containing data to be submitted to the DataStories representative in charge of issuing the license.


The datastories.license package contains a collection of utility functions to facilitate license management.

These functions are available as methods of a predefined object of class datastories.license.LicenseManager called manager.

Example:

from datastories.license import manager
manager.initialize('my_license.lic')
manager

class datastories.license.LicenseManager(license_file_path=None)

Encapsulates the DataStories license manager.

The license manager enables users to inspect the details of their installed DataStories SDK license, and to use license keys that are not available in the standard installation locations (see Installation).

This class should not be instantiated directly. Instead one should use the already available object instance datastories.license.manager.

Args:

  • license_file_path (str = None):

    the path to a license key file or folder if other than the standard locations for the platform.

Attributes:

  • status (str):

    the status of the license manager initialization.

  • license (obj):

    the managed license as indicated in the license key file.

Example:

from datastories.license import manager
manager.initialize('my_license.lic')
manager

property default_license_path

Default path used for license initialization if none provided.

initialize(license_file_path=None, initialize_modules=True)

Initialize the license manager with a license key at a specific location.

Args:

  • license_file_path (string):

    the path to a license key file or a folder containing the license key file.

  • initialize_modules (bool=True):

    set to True in order to initialize dependent modules.

Raises:

  • ValueError:

    when the provided license_file_path is not accessible.

is_granted(option)

Checks if execution rights are granted for license protected functionality.

Args:

  • option (str):

    the license option required by the protected functionality.

Returns:

  • True if execution rights are granted by the installed license.

property is_ok

Check the initialization status of the license manager.

The license manager initialization fails when no valid license file is found in the standard or user indicated locations.

Note: A successful license manager initialization does not imply a grant for using license protected functionality. For example, when an expired license is used, the initialization is still successful. To check whether execution rights are granted one should use the datastories.license.LicenseManager.is_granted() method.

Returns:

  • True if the license manager was successfully initialized.
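
A minimal sketch combining the initialization and grant checks (the license file name and the option name are illustrative):

Example:

from datastories.license import manager

manager.initialize('my_license.lic')
if not manager.is_ok:
    print('Initialization failed:', manager.status)
elif manager.is_granted('prediction'):   # 'prediction' is a hypothetical option name
    print('Prediction functionality is available.')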

reinitialize()

Re-initializes the license manager.

This is done using the same license file path as in the previous call to datastories.license.LicenseManager.initialize().

release()

Releases the currently held licenses.

This can be useful e.g., when using floating or counted licenses, as it makes the released licenses available for other clients or processes.

Note: once a license is released, the associated execution rights are retracted. In order to use the license protected functionality, users need to acquire the license by initializing the license manager again (i.e., datastories.license.LicenseManager.initialize()).

Data

The datastories.data package contains a collection of classes and functions for handling data and converting it to and from the internal format used by DataStories.

Base Classes

class datastories.data.DataFrame

Encapsulates a data frame in the DataStories format.

Args:

  • rows (int):

    number of rows in the data frame.

  • cols (int):

    number of columns in the data frame.

  • types (list):

    list of value types for the data frame columns.

cols

The number of columns in the data frame.

columns

The list of data frame column names.

static from_pandas(data_frame)

Construct a new datastories.data.DataFrame from a Pandas DataFrame object.

Args:

  • data_frame (obj): the source Pandas DataFrame object.

Returns:

  • The constructed datastories.data.DataFrame object.

get(self, size_t row, size_t col)

Get the value of a cell in the data frame.

Args:

  • row (int):

    the index of the cell row.

  • col (int):

    the index of the cell column.

Returns:

  • (float|string) :

    the cell at position (row, column) in the data frame.

get_name(self, size_t col)

Retrieve the name of a specific column.

Args:

  • col (int):

    the index of the column.

Returns:

  • (str) :

    the name of the column with the given index.

get_type(self, size_t col)

Retrieve the type of values in a given column.

Args:

  • col (int):

    the index of the column.

Returns:

static load(file_path)

Load a data frame from a file.

Args:

  • file_path (str):

    path to the file to be loaded.

mapper_get(self, size_t index, size_t value)

static read_csv(file_path, delimiter=u',', decimal=u'.', quote=u'"', int header_rows=1)

Loads a DataFrame from a CSV file.

Args:

  • file_path (str):

    the path to the file to load.

  • delimiter (str=','):

    character to use as value delimiter.

  • decimal (str='.'):

    character to use as decimal point in numeric values.

  • quote (str='"'):

    character to use for quoting values.

  • header_rows (int=1):

    number of header rows (i.e., not containing data values).

rows

The number of rows in the data frame.

save(self, file_path)

Save the data frame to a file.

Args:

  • file_path (str):

    path to the output file.

set_float(self, size_t row, size_t col, double val)

Sets the value of a given cell to a new float value.

Args:

  • row (int):

    the row index of the cell.

  • col (int):

    the column index of the cell.

  • val (float):

    the new float value.

set_int(self, size_t row, size_t col, int64_t val)

Sets the value of a given cell to a new int value.

Args:

  • row (int):

    the row index of the cell.

  • col (int):

    the column index of the cell.

  • val (int):

    the new int value.

set_name(self, size_t col, name)

Set the name of a column in the data frame.

Args:

  • col (int):

    the index of the column.

  • name (str):

    the new name.

set_string(self, size_t row, size_t col, val)

Sets the value of a given cell to a new string value.

Args:

  • row (int):

    the row index of the cell.

  • col (int):

    the column index of the cell.

  • val (str):

    the new string value.

to_pandas(self)

Exports the DataFrame to a Pandas DataFrame object.

Returns:

  • The constructed Pandas DataFrame object.
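
A minimal sketch of a round trip between the Pandas and DataStories formats:

Example:

from datastories.data import DataFrame
import pandas as pd

pdf = pd.read_csv('example.csv')
ds_df = DataFrame.from_pandas(pdf)   # convert to the DataStories format
print(ds_df.rows, ds_df.cols)        # basic shape information
back = ds_df.to_pandas()             # convert back to a Pandas DataFrame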

class datastories.data.ColumnType(value)

Possible column types for datastories.data.DataFrame.

DATE = 3
INTEGER = 2
MIXED = 10
NUMERIC = 1
STRING = 4
UNKNOWN = 0

class datastories.data.DataType(value)

Possible cell value types for datastories.data.DataFrame.

DATE = 3
INTEGER = 2
NUMERIC = 1
STRING = 4
UNKOWN = 0

class datastories.data.RangeType(value)

Possible value range types for datastories.data.DataFrame.

CATEGORICAL = 3
INTERVAL = 1
ORDINAL = 2
UNSPECIFIED = 0

class datastories.data.BaseConverter

Base class for all DataStories SDK value type converters.

Objects of this class are callables. To apply the converter, simply call the object with the value to be converted.

The number of conversion operations (both successful and failed) is tracked and can be retrieved and reset.

Example:

converter = BaseConverter()
converted_value = converter(raw_value)

class datastories.data.IntConverter

Converter to integer values.

class datastories.data.FloatConverter

Converter to float values.

class datastories.data.StringConverter

Converter to string values.

class datastories.data.BoolConverter(true_values, false_values)

Converter to boolean values.

Args:

  • true_values (list):

    a list of strings that will be regarded as True.

  • false_values (list):

    a list of strings that will be regarded as False.
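
A minimal sketch (the recognized value lists are illustrative):

Example:

from datastories.data import BoolConverter

converter = BoolConverter(true_values=['yes', 'y'], false_values=['no', 'n'])
flag = converter('yes')   # regarded as True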

class datastories.data.DateConverter(**kwargs)

Converter to datetime values.

Args:

class datastories.data.NanConverter(nan_values=('', 'NA', 'NaN', 'null', 'none', '?', '..', '...', 'N/A', '-'))

Converter to NaN values.

Converts NaN equivalent values to numpy.nan.

NOTE: This converter is somewhat different from the others. While others return numpy.nan when the conversion is not possible, this converter returns numpy.nan only when the conversion is possible and the unchanged value otherwise.

class datastories.data.FallbackConverter(nan_detector=None, converters=None)

This converter maintains a list of converters that are tried in order until one is successful, and keeps track of how many conversions each converter performed successfully.

First, the converter checks whether the value is a NaN equivalent; if so, the value is ignored. Otherwise, the converters are used in order to try and convert the value, stopping at the first successful conversion. If no conversion is successful, a datastories.api.errors.ConversionError exception is raised.

Args:

  • nan_detector (obj=NanConverter()):

    an object of type datastories.data.BaseConverter used to detect whether a value represents a nan or not.

  • converters (list=[FloatConverter(), StringConverter()]):

    a list of datastories.data.BaseConverter objects that will be tried in order with each attempted conversion.
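
A minimal sketch using the default converter chain (floats first, strings as fallback):

Example:

from datastories.data import FallbackConverter

converter = FallbackConverter()
print(converter('3.14'))   # handled by the float converter
print(converter('abc'))    # falls back to the string converter
print(converter('NA'))     # recognized as a NaN equivalent value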

Data Frame Preparation

datastories.data.prepare_data_frame(data_frame, sample_size=None, progress_bar=False)

Prepares a pandas.DataFrame object compatible with the DataStories clean-up and type conversion rules.

Pandas DataFrame objects obtained from external sources are often inconsistent and need to be cleaned up in order to make them usable for analysis. The clean-up process transforms the data frame, for example by enforcing type conversions and discarding non-usable values. DataStories analyses perform the clean-up operation automatically. However, there may be scenarios where a data clean-up is required before running the data through a DataStories analysis (e.g., a custom feature-engineering stage).

This function can be used to obtain a Pandas DataFrame object that is cleaned up according to the DataStories rules and conventions.

Args:

  • data_frame (obj):

    the data frame object to convert (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • sample_size (int|str=None):

    the sample size to use for inferring column data types (either an absolute integer value or a percentage - e.g., '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.

  • progress_bar (obj|bool=False):

    An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

  • The constructed pandas.DataFrame object.
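
A minimal sketch:

Example:

from datastories.data import prepare_data_frame
import pandas as pd

df = pd.read_csv('example.csv')
clean_df = prepare_data_frame(df, sample_size='10%', progress_bar=True)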

datastories.data.to_ds_pandas(data_frame, converters=None, sample_size=None, copy=False, force_conversion=False, progress_bar=False, include_converters=False)

Converts a data frame to a pandas.DataFrame object compatible with the DataStories type conversion rules.

Args:

  • data_frame (obj):

    the data frame object to convert (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • converters (list):

    list of datastories.data.BaseConverter type conversion objects to use for coercing the column types. If not specified, it will be detected automatically based on a sample of data.

  • sample_size (int|str=None):

    the sample size to use for inferring data types when the list of converters is not specified (either an absolute integer value or a percentage - e.g., '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.

  • copy (bool=False):

    set to True in order to force creation of a new object.

  • force_conversion (bool=False):

    set to True in order to force the conversion on a previously converted data frame.

  • progress_bar (obj|bool=False):

    An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

  • include_converters (bool=False):

    set to True in order to include the column converters in the returned result.

Returns:

  • The constructed pandas.DataFrame object when include_converters is False.

  • A tuple containing the constructed pandas.DataFrame object and the used column converters when include_converters is True.
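
A minimal sketch showing both return forms:

Example:

from datastories.data import to_ds_pandas
import pandas as pd

df = pd.read_csv('example.csv')
converted = to_ds_pandas(df)
# With include_converters=True a tuple is returned instead.
converted, converters = to_ds_pandas(df, include_converters=True)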

datastories.data.detect_column_types(data_frame, sample_size=None, progress_bar=False)

Infer the data types for the columns of a data frame.

Inference is done on a sample of the data, based on the most frequent value type in each column.

Args:

  • data_frame (obj):

    the input data frame.

  • sample_size (int|str=None):

    the sample size to use for inferring data types (either an absolute integer value or a percentage - e.g., '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.

  • progress_bar (obj|bool=False):

    An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

  • A list of objects describing the detected column types.

Raises:

  • ValueError:

    when an invalid value is provided for one of the input parameters.

Example:

from datastories.data import detect_column_types
import pandas as pd
df = pd.read_csv('example.csv')
col_types = detect_column_types(df, sample_size='20%')
for col_type in col_types:
    print(col_type.typename)

datastories.data.data_to_file(data_frame, file_path)

Save a DataFrame to a file.

Args:

  • data_frame (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • file_path (str):

    path to the saved file.

datastories.data.file_to_pandas(file_path)

Load a saved datastories.data.DataFrame object into a pandas.DataFrame object.

Args:

  • file_path (str):

    path to the file to be loaded.

datastories.data.normalize_column_names(data_frame)

Normalizes the names of the columns of a Pandas data frame.

The following operations are performed:
  • Convert numbers to strings

  • Normalize unicode (NFKD)

See: https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize

The operations are performed in place (i.e., mutating the input data frame).

Args:

  • data_frame (pandas.DataFrame):

    the input data frame.

datastories.data.get_columns(data_frame, include_cols=None, exclude_cols=None)

Get a selection of columns from a dataset that include/exclude specific columns.

Args:

  • data_frame (pandas.DataFrame):

    input data frame (a pandas.DataFrame object).

  • include_cols (list):

    selection of columns to include in the result. If left unspecified or evaluating to None, all dataset columns will be included.

  • exclude_cols (list):

    selection of columns to be excluded from the result.

Returns:

  • A list of selected columns.
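
A minimal sketch (the column name is illustrative):

Example:

from datastories.data import get_columns
import pandas as pd

df = pd.read_csv('example.csv')
cols = get_columns(df, exclude_cols=['id'])   # all columns except 'id'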

Summary Calculation

datastories.data.compute_summary(data_frame, converters=None, sample_size=None, progress_bar=False)

Compute a data summary on a provided data frame.

Args:

  • data_frame (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • converters (list=None):

    list of datastories.data.BaseConverter type conversion objects to use for coercing the column types. If not specified, it will be detected automatically based on a sample of data.

  • sample_size (int|str=None):

    the sample size to use for inferring data types (either an absolute integer value or a percentage - e.g., '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.

  • progress_bar (obj|bool=False):

    An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

  • An object of type datastories.data.DataSummaryResult encapsulating the computed summary.

Example:

from datastories.data import compute_summary
import pandas as pd
df = pd.read_csv('example.csv')
summary = compute_summary(df)
print(summary)

class datastories.data.DataSummaryResult(stats)

Encapsulates the result of the datastories.data.compute_summary() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

static load(file_path)

Load a previously saved summary from a JSON file.

Args:

  • file_path (str):

    the path to the file to be loaded.

Returns:

property metrics

The set of metrics included in the data summary.

NOTE: This is an alias for the .stats property.

Returns:

save(file_path)

Save the summary to a JSON file.

Args:

  • file_path (str):

    the path to the exported summary file.

select(cols)

Select a set of columns for further reference.

property selected

The list of selected columns.

property stats

The set of statistics included in the data summary.

Returns:

to_pandas()

Exports the detailed (column-level) data summary to a Pandas DataFrame.

Returns:

  • The constructed Pandas DataFrame object.

property visualization

The data health visualization.
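
A minimal sketch of a save/load round trip for a computed summary:

Example:

from datastories.data import compute_summary, DataSummaryResult
import pandas as pd

summary = compute_summary(pd.read_csv('example.csv'))
summary.save('summary.json')
restored = DataSummaryResult.load('summary.json')
print(restored.stats)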


class datastories.data.TableStatistics(name=None, rows=None, columns=None, n=None, n_missing=None, p_missing=None, health=None, health_score=0, df=None, converters=None, n_rows=None, n_columns=None, version=None)

Statistics and data health reports for a given data frame.

Note: Objects of this class should not be manually constructed.

Attributes:

  • n_rows (int):

    number of rows.

  • n_columns (int):

    number of columns.

  • n (int):

    number of values.

  • n_missing (int):

    number of missing values.

  • p_missing (float):

    percentage of missing values.

  • health_score (float):

    health score: 0 (good) - 100 (bad).

  • health (float):

    general health value for the data frame (unusable:0, fixable:0.5, great:1).

  • columns (list):

    list of objects of type datastories.data.ColumnStatistics encapsulating detailed column-level statistics.

calc_stats(missing_thr=(50, 90), balance_thr=(50, 90), outlier_thr=(50, 90), table_thr=(50, 90), rows_thr=30)

Compute the statistics for the data frame and set the corresponding attributes.

Args:

  • missing_thr (tuple=(50, 90)):

    thresholds for deciding the missing values health category (Poor, Reasonable, Good)

  • balance_thr (tuple=(50, 90)):

    thresholds for deciding the data distribution health category (Poor, Reasonable, Good)

  • outlier_thr (tuple=(50, 90)):

    thresholds for deciding outlier health category (Poor, Reasonable, Good)

  • table_thr (tuple=(50, 90)):

    thresholds for deciding overall data health category (Poor, Reasonable, Good)

  • rows_thr (int=30):

    threshold for the minimum number of required rows. Below this value, the data is considered not usable.

class datastories.data.ColumnStatistics(col=None, id=None, converter=None, label=None, column_type=None, element_type=None, n=None, n_valid=None, n_missing=None, p_missing=None, n_unique=None, min=None, max=None, mean=None, median=None, most_freq=None, first_quartile=None, third_quartile=None, histo_labels=None, histo_counts=None, balance_score=None, balance_health=None, missing_health=None, left_outlier_score=None, right_outlier_score=None, outlier_score=None, left_outlier_health=None, right_outlier_health=None, outlier_health=None, health=None, missing_thr=None, balance_thr=None, outlier_thr=None, bincount=10, n_outliers=None, outlier_n=None, outlier_perc=None, outlier_grade=None)

Statistics and data health reports for a given column in a data frame.

Note: Objects of this class should not be manually constructed.

Attributes:

  • n_rows (int):

    number of rows.

  • id (int):

    the index of the column.

  • label (str):

    the label (header values) of the column.

  • n (int):

    the length of the column.

  • n_valid (int):

    the number of correctly parsed data items.

  • n_missing (int):

    the number of unreadable data items.

  • p_missing (float):

    percent of unreadable data items.

  • column_type (str):

    type of the column (ordinal, interval, binary, …).

  • element_type (str):

    type of individual data items (float, string, …).

  • n_unique (int):

    number of unique values.

  • min (float):

    minimum value.

  • max (float):

    maximum value.

  • mean (float):

    mean value.

  • median (float):

    median value.

  • first_quartile (float):

    first quartile (data point under which 25% of data is situated).

  • third_quartile (float):

    third quartile (data point under which 75% of data is situated).

  • histo_labels (list):

    labels for the histogram bins.

  • histo_counts (list):

    counts for the histogram bins.

  • balance_score (float):

    score for the data balance quality, 0 (good) - 100 (bad).

  • balance_health (float):

    health value in terms of balance (unusable:0, fixable:0.5, great:1).

  • missing_health (float):

    health value in terms of the number of missing items (unusable, …).

  • left_outlier_score (float):

    metric for outlier impact on the left (i.e., small) side of the data range. Scale: 0 (no outliers detected) - 100 (bad).

  • right_outlier_score (float):

    metric for outlier impact on the right (i.e., big) side of the data range. Scale: 0 (no outliers detected) - 100 (bad).

  • outlier_score (float):

    metric for the general outlier impact of the data. Scale: 0 (no outlier impact whatsoever) - 100 (bad).

  • left_outlier_health (float):

    health value for left outlier impact (unusable:0, fixable:0.5, great:1).

  • right_outlier_health (float):

    health value for right outlier impact (unusable:0, fixable:0.5, great:1).

  • outlier_health (float):

    health value for outlier impact (unusable:0, fixable:0.5, great:1).

  • health (float):

    general health value for this column (unusable:0, fixable:0.5, great:1).

  • n_outliers (int):

    number of outliers.

  • outlier_n (int):

    number of outliers.

  • outlier_perc (float):

    percentage of outlier values.

  • outlier_grade (int):

    0: bad, 1: good.

calc_stats(missing_thr=(50, 90), balance_thr=(50, 90), outlier_thr=(50, 90))

Compute the statistics for the column and set the corresponding attributes.

Args:

  • missing_thr (tuple=(50, 90)):

    thresholds for deciding the missing values health category (Poor, Reasonable, Good)

  • balance_thr (tuple=(50, 90)):

    thresholds for deciding the data distribution health category (Poor, Reasonable, Good)

  • outlier_thr (tuple=(50, 90)):

    thresholds for deciding outlier health category (Poor, Reasonable, Good)

Outlier Detection

datastories.data.compute_outliers(input, ref=None, double strictness=0.25, outlier_vote_threshold=None, far_outlier_vote_threshold=None)

Identifies numeric outliers in a 1D or 2D space.

This function can be used either with the strictness argument only (i.e., by leaving the last two parameters at their defaults, so they will be computed as a function of the strictness) or manually, by setting the last two parameters, in which case the strictness will be ignored.

Args:

  • input (list|obj|ndarray):

    the numeric input vector; can be either a list, a pandas.Series object, or a numpy numeric array.

  • ref (list|obj|ndarray=None):

    abscissa vector for the 2D case. Can be either a list, a pandas.Series object, or a numpy numeric array.

  • strictness (double=0.25):

    determines how strictly the algorithm selects outliers - higher values yield fewer outliers. Value is in the range [0-1].

  • outlier_vote_threshold (double=None):

    determines when a point is considered an outlier - higher values yield fewer outliers. Value is in the range [0-100]. When left unspecified it will be set to 100 * strictness.

  • far_outlier_vote_threshold (double=None):

    determines when a point is considered a far outlier - higher values yield fewer outliers. This must be larger than [outlier_vote_threshold]. Default is outlier_vote_threshold + 50. Value is in the range [0-100].

Returns:

  • An object of type datastories.data.OutlierResult encapsulating the detected outliers.

Example:

from datastories.data import compute_outliers
import pandas as pd
df = pd.read_csv('example.csv')
outliers = compute_outliers(df['my_column'])
print(outliers)

class datastories.data.OutlierResult(input, outliers)

Encapsulates the result of the datastories.data.compute_outliers() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

Attributes:

  • valid (bool):

    a flag indicating whether the result is valid.

as_index(self, outlier_types=None)

A numpy index vector that can be used to select and retrieve outlier values.

The index can be applied on numpy arrays or pandas.Series objects.

Args:

  • outlier_types (list):

    list of datastories.api.OutlierType values to specify which outliers to retrieve. By default, all outliers are included (i.e., outlier_types = [OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW])
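
A minimal sketch retrieving only the far outliers:

Example:

from datastories.api import OutlierType
from datastories.data import compute_outliers
import pandas as pd

series = pd.read_csv('example.csv')['my_column']
result = compute_outliers(series)
far = series[result.as_index([OutlierType.FAR_OUTLIER_LOW, OutlierType.FAR_OUTLIER_HIGH])]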

as_itemgetter(self, outlier_types=None)

An operator.itemgetter object that can be used to select and retrieve outlier values from a list.

Args:

  • outlier_types (list):

    list of datastories.api.OutlierType values to specify which outliers to retrieve. By default, all outliers are included (i.e., outlier_types = [OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW])

clip_to_iqr(self, low_threshold=0.05, high_threshold=0.95)

Marks values that are outside a specific inter-quartile range as outliers.

This operation can be undone via the reset method.

Args:

  • low_threshold (float=0.05):

    the lower bound of the inter-quartile range. Should be in the interval [0,1].

  • high_threshold (float=0.95):

    the upper bound of the inter-quartile range. Should be in the interval [0,1].

Raises:

  • ValueError:

    when the input arguments are not valid.

property metrics

A dictionary containing outlier detection metrics.

The following metrics are retrieved:

  • Outliers:

    total number of outliers.

  • Outliers Low:

    number of lower outliers.

  • Outliers High:

    number of higher outliers.

  • Close Outliers:

    number of close outliers.

  • Close Outliers Low:

    number of lower close outliers.

  • Close Outliers High:

    number of higher close outliers.

  • Far Outliers:

    number of far outliers.

  • Far Outliers Low:

    number of lower far outliers.

  • Far Outliers High:

    number of higher far outliers.

  • NaN:

    number of NaN values.

  • Normal:

    number of values that are neither outliers nor NaN.

reset(self)

Resets outliers to original values, as computed by the datastories.data.compute_outliers() analysis.

to_csv(self, file_path, content='metrics', delimiter=',', decimal='.')

Exports a list of detected outliers or metrics to a CSV file.

Args:

  • file_path (str):

    path to the output file.

  • content (str='metrics'):

    the type of metrics to export. Possible values:

    • 'metrics': exports outlier detection metrics.

    • 'outliers': exports point-wise outlier classification.

  • delimiter (str=','):

    character used as value delimiter.

  • decimal (str='.'):

    character used as decimal point.

Raises:

  • ValueError:

    when an invalid value is provided for the [content] argument.

to_excel(self, file_path)

Exports the list of detected outliers and metrics to an Excel file.

Args:

  • file_path (str):

    path to the output file.

to_pandas(self, content='metrics')

Exports a list of detected outliers or metrics to a pandas.Series object.

Args:

  • content (str='metrics'):

    the type of metrics to export. Possible values:

    • 'metrics': exports outlier detection metrics.

    • 'outliers': exports point-wise outlier classification.

Returns:

  • The constructed pandas.Series object.

Raises:

  • ValueError:

    when an invalid value is provided for the [content] argument.

update(self, updates)

Updates the list of detected outliers with manual corrections.

property updated

A list of manual corrections applied to the detected outliers.

property visualization

The outliers visualization.


Classification

The datastories.classification package contains a collection of classes and functions to facilitate classification analysis.

Feature Ranking

datastories.classification.rank_features(data_set, kpi, metric=FeatureRankingMetric.ACCURACY) → FeatureRankResult

Computes the relative importance of columns in a data frame for predicting a binary KPI.

The scoring is based on maximizing the prediction accuracy with respect to the KPI while iteratively splitting the data frame rows.

Args:

  • data_set (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • kpi (int|str):

    the index or the name of the binary KPI column.

  • metric (FeatureRankingMetric=FeatureRankingMetric.ACCURACY):

    the metric to use for ranking the features.

Returns:

  • An object of type datastories.classification.FeatureRankResult encapsulating the computed feature ranks.

Raises:

  • TypeError:

    if data_set is not a DataFrame or a Pandas DataFrame object.

  • ValueError:

    if kpi is not a valid column name or index value (e.g., out-of-range index).

Example:

from datastories.classification import rank_features
import pandas as pd
df = pd.read_csv('example.csv')
kpi_column_index = 1
ranks = rank_features(df, kpi_column_index)
print(ranks)

class datastories.classification.FeatureRankingMetric(value)

Metric to use for ranking the features.

ACCURACY = 0

class datastories.classification.FeatureRankResult(title='', subtitle='')

Encapsulates the result of the datastories.classification.rank_features() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

feature_ranks

The feature ranks computed by the datastories.classification.rank_features() analysis.

Returns:

select(self, cols)

Selects a number of column names as features.

selected

The list of column names currently selected as features.

to_excel(self, file_path)

Exports the list of ranking scores to an Excel file.

Args:

  • file_path (str):

    path to the output file.

to_pandas(self, ranking_column='Score', min_threshold=0.0)

Exports the list of ranking scores to a Pandas DataFrame object.

Args:

  • ranking_column (str='Score'):

    column used to compute the rank and to order the data frame. This can be useful to discover interesting variables that are penalised because they have a lot of missing values.

  • min_threshold (float=0.0):

    a cutoff threshold for the minimum score that a variable should have in order to be exported.

Returns:

  • The constructed Pandas DataFrame object.

visualization

The feature ranks visualization.

class datastories.classification.RankingSplit

Encapsulates information about a split.

Note: Objects of this class should not be manually constructed.

Attributes:

  • column_name (str):

    name of the variable (i.e., column) used in split.

  • column_index (int):

    index of the variable used in split.

  • score (float):

    relative importance score with respect to the KPI.

  • left_value (float):

    the variable value that was used for the split.

  • right_value (float):

    the next higher variable value in the dataset.

  • split_value (float):

    the variable value that was used for the split.

  • equal_type_split (bool):

    indicates whether the split value equals either the left_value or the right_value.

  • extra_scores (dict):

    dictionary containing additional metrics (e.g., accuracy).


Correlation

The datastories.correlation package contains a collection of classes and functions to facilitate correlation analysis.

datastories.correlation.compute_correlations(data, column_list, kpis, max_vars=200, outlier_elimination=False, optimize=False)

Find the most relevant correlations between the columns of a data set.

A number of correlation metrics are computed (currently linear and mutual information) for a subset of the most relevant input variables with respect to a set of KPIs, and between the KPIs themselves.

The subset of relevant input variables is computed based on prototyping and limited to a maximum number as specified (i.e., max_vars = 200).

Args:

  • data (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object);

  • column_list (list):

    the list of input variable identifiers (indices or names).

  • kpis (list):

    the list of KPI column identifiers (indices or names).

  • max_vars (int):

    the maximum number of variables to be included in the result.

  • outlier_elimination (bool=False):

    set to True in order to exclude far outliers from columns before computing correlations;

  • optimize (bool=False):

    set to True in order to improve correlation metrics by using transformed versions of the input (e.g., scaled columns).

Returns:

  • A JSON formatted string encapsulating the computed correlation metrics, compatible with the DataStories CorrelationBrowser visualization.
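
A minimal sketch (the column selections are illustrative):

Example:

from datastories.correlation import compute_correlations
import pandas as pd

df = pd.read_csv('example.csv')
# Correlate all other columns against the first column as KPI.
result = compute_correlations(df, column_list=list(range(1, df.shape[1])), kpis=[0])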

class datastories.correlation.CorrelationResult(json_content, column_names=None)

Encapsulates the result of the datastories.correlation.compute_correlations() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

column(col)

Retrieve the correlation measurements associated with a given column.

Args:

  • col (str|int):

    the identifier of the column (name or index).

Returns:

A dictionary containing correlation measurements with respect to other columns in the data frame, in case these have been included in the top correlations selection.

property columns

The list of column names.

static load(file_path, column_names=None)

Load the result from a JSON file.

Args:

  • file_path (str):

    location of the input file

  • column_names (list[str]=None):

    list of column names in the original data frame. If not provided, one cannot access the correlations via the original data frame column indexes. Instead one must use column names.

Returns:

save(file_path)

Save the result to a JSON file.

Args:

  • file_path (str):

    location of the output file

NOTE: This operation loses the data frame context information. The original column names and their indices will not be available when loading the result from this file, unless the context is provided by the user. If no context is provided, one can still use the result but the correlations cannot be accessed via the original data frame column indexes. Instead one can use the column names.

to_excel(file_path)

Export the list of correlations to an Excel file.

Args:

  • file_path (str):

    name of the file to export to.

to_json(html_safe=False)

Save the result as a JSON string.

Args:

  • html_safe (bool=False):

    Set to True in order to produce a JSON string that is safe to embed in an HTML page as an attribute value.

Returns:

  • A JSON string containing the analysis results.

to_pandas()

Export the list of correlations to a Pandas DataFrame object.

NOTE: Every pair of correlated columns is included twice in the results such that each of the columns in the pair appears as a main column.

Returns:

  • The constructed Pandas DataFrame object.

property visualization

The correlations visualization.

Prototype Detection

datastories.correlation.compute_prototypes(data_set, kpi, list inputs: list = None, double prototype_threshold: float = 0.85, fast_approximation: bool = True, double missing_value_threshold: float = 0.5, use_linear_correlation: bool = False, inputs_only: bool = False) → PrototypeResult

Compute a set of mutually uncorrelated variables from a data frame.

Correlation estimation is by default based on the Mutual Information Content measure, and can be switched to Linear Correlation when required.

Each variable in the set has the following properties:

  • it is not significantly correlated to any other variable in the set;

  • it can be highly correlated to other variables that are not included in the set;

  • it has a higher KPI correlation score than all the other variables that are highly correlated to it.

Each variable that is not included in the set has the property that it is highly correlated to a variable in the set.

Args:

  • data_set (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • kpi (list|int|str):

    single value or a list containing the index or the name of the KPI column(s).

  • inputs (list=None):

    list of column IDs to include in the analysis. When not specified, all columns in the provided dataset will be included.

  • prototype_threshold (float = 0.85):

    correlation threshold for features to be considered proxies.

  • fast_approximation (bool = True):

    approximate the mutual information; this provides a significant speedup with little precision loss.

  • missing_value_threshold (float = 0.5):

    missing values threshold for excluding features from prototypes.

  • use_linear_correlation (bool = False):

    use linear correlation instead of the mutual information for correlation estimation.

  • inputs_only (bool = False):

    extract prototypes only for inputs (i.e., exclude KPIs). The KPIs are used only to determine the order in which the prototypes are presented. That is, the order of prototypes in the result is given by their maximum correlation with a KPI.

Returns:

  • An object of type datastories.correlation.PrototypeResult encapsulating the computed prototypes.

Raises:

  • TypeError:

    if [data_set] is not a DataFrame or a Pandas DataFrame object.

  • ValueError:

    if [kpi] is not a valid column name or index value (e.g., out-of-range index).

Example:

from datastories.correlation import compute_prototypes
import pandas as pd
df = pd.read_csv('example.csv')
kpi_column_index = 1
prototypes = compute_prototypes(df, kpi_column_index)
print(prototypes)

class datastories.correlation.PrototypeResult(prototype_list)

Encapsulates the result of the datastories.correlation.compute_prototypes() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

classmethod load(type cls, file_path)

Load the analysis result from a JSON file.

Args:

  • file_path (str):

    Path to the file to be loaded.

prototypes

The list of column names identified as prototypes by the analysis.

save(self, file_path)

Save the analysis result to a JSON file.

Args:

  • file_path (str):

    location of the output file.

select(self, cols)

Select a number of column names as prototypes.

selected

The list of column names currently selected as prototypes.

to_excel(self, file_path)

Export the list of prototypes to an Excel file.

Args:

  • file_path (str):

    path to the output file.

to_pandas(self)

Export the list of prototypes to a Pandas DataFrame object.

Returns:

  • The constructed Pandas DataFrame object.

visualization

The prototypes visualization.

class datastories.correlation.Prototype(info, proxy_list)

Encapsulates prototype information data.

Note: Objects of this class should not be manually constructed.

Attributes:

class datastories.correlation.CorrelationInfo(col_index, col_name, kpi_index, kpi_name, correlation)

Encapsulates correlation information for a variable with respect to a reference.

Note: Objects of this class should not be manually constructed.

Attributes:

  • col_index (int):

    the index of the variable in the input data frame.

  • col_name (str):

    the name of the variable.

  • correlation (float):

    the correlation score with respect to the reference.


Model

The datastories.model package contains a collection of classes that encapsulate data models (e.g., prediction models computed by regression or classification analysis).

Base Classes

class datastories.model.Model

Encapsulates an RSX based DataStories model.

inputs

The list of input model variable names.

outputs

The list of output model variable names.

plot(self, *args, **kwargs)

Display a graphical representation of the prediction model.

Accepts the same parameters as the constructor for datastories.visualization.WhatIfsSettings.

predict(self, data_frame, as_pandas=None, prepare_data=True)

Evaluate the model on an input data frame.

Args:

  • data_frame (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • as_pandas (bool=None):

    Flag to indicate whether prediction results should be returned as a Pandas data frame. By default results are returned in the same format as the input data frame.

  • prepare_data (bool=True):

    Set to True in order to prepare provided Pandas data frames according to the DataStories type conversion rules. When the provided data frame is a datastories.data.DataFrame object, this argument is discarded.

Returns:

  • An object of type datastories.core.model.PredictionResult wrapping up the computed prediction.
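
A minimal sketch, assuming model is an available datastories.model.Model instance:

Example:

import pandas as pd

# 'model' is assumed to be a datastories.model.Model instance.
df = pd.read_csv('example.csv')
result = model.predict(df, as_pandas=True)
print(result.metrics)   # available when actual KPI values are present in the input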

save(self, file_path=None)

Serialize the model to a file or a bytes object.

Args:

  • file_path (str=None):

    Name of the output file. If omitted, the model is serialized to a bytes object and returned by the function.

Returns:

  • A bytes object containing the model when the [file_path] argument is omitted or set to None.

to_cpp(self, file_path)

Export the model to a C++ file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_excel(self, file_path)

Export the model to an Excel file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_matlab(self, file_path)

Export the model to a MATLAB file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_py(self, file_path)

Export the model to a Python file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_r(self, file_path)

Export the model to an R file.

Args:

  • file_path (str):

    path to the output file.

Raises:

variables

A dictionary mapping model variables to corresponding information such as variable type and range.

Returns:

class datastories.model.VariableInfo

Holds information about a model variable, such as ranges and types.

Note: Objects of this class should not be manually constructed.

categories

The registered categories of the variable (applicable only when the variable is categorical).

index

The index of the variable.

is_input

Checks if the associated variable is an input for the model.

max

The maximum value of the variable.

min

The minimum value of the variable.

range_type

The range type of the variable.

type

The variable type.
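
Example (a sketch for inspecting model variables; assumes `model` is a loaded datastories.model.Model and that the `variables` property behaves as a standard dictionary):

for name, info in model.variables.items():
    print(name, info.type, info.is_input, info.min, info.max)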

Prediction

datastories.model.predict_from_model(data_frame, rsx_model_path)

Evaluate an RSX model on an input data frame.

Args:

  • data_frame (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • rsx_model_path (str):

    path of the RSX model file.

Returns:

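Example (a minimal sketch; file names are hypothetical):

from datastories.model import predict_from_model
import pandas as pd

df = pd.read_csv('new_data.csv')
prediction = predict_from_model(df, 'my_model.rsx')
print(prediction.values)
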
class datastories.model.PredictionResult(data=None)

Encapsulates a model prediction result.

Base classes:

Note: Objects of this class should not be manually constructed.

property error_plot

An interactive visualization of prediction errors.

NOTE: This is only available when the actual values corresponding to the predicted ones are available in the input dataset.

Returns:

property evaluation_data

The data used for evaluation, if available, or None.

property kpis

The list of KPIs included in the prediction.

static load(metrics=None, predict_vs_actual=None, evaluation_data=None, path=None, as_pandas=True)

Load a datastories.model.PredictionResult object from a set of files or objects.

The objects take precedence over the files. When a required object is not provided, the corresponding information will be retrieved from the associated file, provided such a file can be identified.

Files have standard names:

  • metrics.json

  • predicted_vs_actual.csv

  • evaluation_data.parquet

Files are specified indirectly, by providing the name of a folder containing the files mentioned above. The folder name can also be a zip archive; in that case the files should be available in the root of the archive.

The evaluation data is optional and used only for reference.

Args:

  • metrics (dict=None):

    A dictionary containing performance metrics.

  • predict_vs_actual (obj=None):

    A data frame (Pandas or DataFrame) containing predicted vs actual data.

  • evaluation_data (obj=None):

A data frame (Pandas or DataFrame) containing evaluation input data.

  • path (str=None):

    Path to a folder or ZIP archive containing required information if not provided by the other (object) parameters. Files containing this information should have a standard name, as mentioned above.

  • as_pandas (bool=True):

    Flag to indicate whether the values field should be available as a pandas.DataFrame (i.e., True) or a datastories.data.DataFrame object (i.e., False).
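
Example (a minimal sketch of loading a decoupled prediction result from a folder; the path is hypothetical):

from datastories.model import PredictionResult

result = PredictionResult.load(path='results/prediction')
print(result.metrics)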

property metrics

The prediction performance metrics, if available.

NOTE: This is an alias for the .stats property.

property performance

An interactive visualization of prediction performance, depicting predicted against actual values.

NOTE: This is only available when the actual values corresponding to the predicted ones are available in the input dataset.

Returns:

property quality

An interactive visualization of prediction performance, depicting predicted against actual values.

This is an alias for the .performance property.

NOTE: This is only available when the actual values corresponding to the predicted ones are available in the input dataset.

Returns:

property record_info_columns

Get/set the record info column names.

save(folder, include_data=False, compress=False)

Save the prediction data to a folder or a zip archive.

The metrics, prediction values and (optionally) prediction input data are saved as individual files:

  • metrics.json

  • predicted_vs_actual.csv

  • evaluation_data.parquet

Args:

  • folder (str):

    The folder where the files should be saved.

  • include_data (bool=False):

    Flag to indicate whether the evaluation data should be included as well.

  • compress (bool=False):

    Flag to indicate whether the files should be saved to a compressed ZIP archive instead of a folder.
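
Example (a short sketch; the folder name is hypothetical):

result.save('prediction_output', include_data=True, compress=True)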

property stats

The prediction performance statistics, if available.

NOTE: When the actual KPI value is missing from the input data frame, the performance metrics cannot be computed. In that case None is returned.

to_csv(file_path, delimiter=',', decimal='.', include_evaluation_data=True)

Export the result to a CSV file.

Args:

  • file_path (str):

    path to the output file.

  • delimiter (str=’,’):

    character used as value delimiter.

  • decimal (str=’.’):

    character used as decimal point.

  • include_evaluation_data (bool=True):

    set to True in order to include the evaluation data next to the prediction values.

to_excel(file_path, tab_name='Predictions', include_evaluation_data=True)

Export the result to an Excel file.

Args:

  • file_path (str):

    path to the output file.

  • tab_name (str=’Predictions’):

name of the Excel tab where to save the result.

  • include_evaluation_data (bool=True):

    set to True in order to include the evaluation data next to the prediction values.

to_pandas(include_evaluation_data=True)

Export the prediction and input values to pandas.

Args:

  • include_evaluation_data (bool=True):

    set to True in order to include the evaluation data next to the prediction values.

property values

The prediction values.

For each record in the input data frame, the following values are provided per KPI:

  • actual:

    the actual value of the KPI (i.e., if present in the input data frame).

  • predicted:

    the predicted value of the KPI.

  • uncertainty_min:

    minimum predicted value corrected for uncertainty.

  • uncertainty_max:

    maximum predicted value corrected for uncertainty.

  • model_based_outlier:

    whether the prediction is based on outlier values according to the model (1=True).

NOTE: The result object has the same type as the input provided to the predict method.

class datastories.model.BasePredictor(base_model)

Base class for all models based on an RSX-backed model.

Base classes:

Offers access to basic functionality:

  • prediction

  • optimization

  • model export to a specific language

Args:

  • base_model (obj):

    an object of type datastories.model.Model encapsulating the base RSX model used for making predictions.

export(file_path)

Export the underlying prediction model to a lightweight RSX file.

This can then be loaded as a datastories.model.Model object and used to make predictions on new data.

Args:

  • file_path (str):

    Name of the output file.
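
Example (a minimal sketch; assumes `predictor` is a trained predictor object, e.g., the model field of a completed story; the file name is hypothetical):

from datastories.model import Model

predictor.export('light_model.rsx')
model = Model('light_model.rsx')   # reload the lightweight model for predictions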

maximize(progress_bar=True, optimizer=None)

Compute the input combination that maximizes the predictive model output.

Args:

  • progress_bar (obj|bool=True):

An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

abstract property metrics

A dictionary containing model prediction performance metrics.

The type of metrics depends on the model type (i.e., regression or classification).

minimize(progress_bar=True, optimizer=None)

Compute the input combination that minimizes the predictive model output.

Args:

  • progress_bar (obj|bool=True):

An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

property model

The generic RSX based model used for making predictions.

WARNING: This property has been deprecated and will be removed in a future version of the SDK.

optimize(optimization_spec=None, variable_ranges=None, progress_bar=True, optimizer=None)

Compute an optimum input/output combination according to an (optional) optimization specification.

Args:

Returns:

abstract predict(data_frame)

Predict the modeled KPI on a new data frame.

Args:

  • data_frame (obj):

    the data frame on which the model associated KPIs are to be predicted (either a pandas.DataFrame or a datastories.data.DataFrame object).

Returns:

Raises:

  • ValueError:

    when not all required columns are provided.

property stats

A dictionary containing model prediction performance metrics.

to_cpp(file_path)

Export the model to a C++ file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_excel(file_path)

Export the model to an Excel file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_matlab(file_path)

Export the model to a MATLAB file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_py(file_path)

Export the model to a Python file.

Args:

  • file_path (str):

    path to the output file.

Raises:

to_r(file_path)

Export the model to an R file.

Args:

  • file_path (str):

    path to the output file.

Raises:

class datastories.model.BasePrediction(data)

Base class for all prediction classes.

Base classes:

Args:

to_pandas()

Exports the list of predictions to a pandas.DataFrame object.

Returns:

  • The constructed pandas.DataFrame object.


class datastories.model.MultiKpiPredictor(predictor_info, base_model)

Encapsulates multi-KPI prediction models (e.g., as computed using datastories.story.predict_kpis()).

Base classes:

Note: Objects of this class should not be manually constructed.

property error_plot

A visualization for assessing model prediction errors.

Returns:

property metrics

A dictionary containing multi-KPI model prediction performance metrics.

The type of metrics depends on the model type (i.e., regression or classification).

predict(data_frame)

Predict the model KPIs on a new data frame.

Args:

  • data_frame (obj):

    the data frame on which the model associated KPIs are to be predicted (either a pandas.DataFrame or a datastories.data.DataFrame object).

Returns:

  • An object of type datastories.regression.MultiKpiPredictionResult encapsulating the prediction results.

Raises:

  • ValueError:

    when not all required columns are provided.

NOTE: If not all drivers are provided, the KPIs that depend on them will not be predicted. However, no Exception will be generated.

property visualization

The prediction performance visualization.

class datastories.model.MultiKpiPredictorInfo(pva, performance_metrics)

Data class wrapper for prediction performance metrics.

Note: Objects of this class should not be manually constructed.

property metrics

The prediction performance metrics.

property predicted_vs_actual

The predicted vs actual data.

class datastories.model.MultiKpiPredictionResult(prediction)

Encapsulates the results of a prediction done using a datastories.model.MultiKpiPredictor object.

Base classes:

Note: Objects of this class should not be manually constructed.

property error_plot

A visualization for assessing model prediction errors.

Returns:

property metrics

A dictionary containing multi-KPI prediction performance metrics.

property values

A data frame containing the input augmented with predicted values, confidence estimates, and flags indicating whether the prediction is a model-based outlier.

property visualization

The prediction performance visualization.


Optimization

The datastories.optimization package contains a collection of classes and functions for optimizing models.


datastories.optimization.create_optimizer(*args, **kwargs)

Factory method for creating optimizers.

Returns:

Example:

from datastories.model import Model
from datastories.optimization import (
    AtMost, Maximize, Minimize, OptimizationSpecification, create_optimizer
)

model = Model("my_model.rsx")
spec = OptimizationSpecification()
spec.objectives = [
    Minimize('KPI_1'),
    Maximize('KPI_2')
]
spec.constraints = [
    AtMost('Input_1', 10),
]
optimizer = create_optimizer()
optimization_result = optimizer.optimize(model, optimization_spec=spec)
print(optimization_result.optimum)

class datastories.optimization.pso.Optimizer(size_t population_size=500, size_t iterations=250)

A model optimizer using the particle swarm strategy for identifying an optimum solution.

Args:

  • population_size (int = 500):

    the initial size of the swarm population.

  • iterations (int = 250):

    number of swarm computation iterations before stopping.

maximize(self, model, variable_ranges=None, progress_bar=True)

Run the optimizer with the goal of maximizing the outputs (i.e., KPIs) of a given model.

Args:

  • model (datastories.model.Model):

    The input model whose KPIs are to be maximized.

  • variable_ranges (dict [str, datastories.optimization.VariableRange] = {}):

An optional dictionary mapping variable names to ranges that are to be used to limit the search for the optimum solution to a given domain.

  • progress_bar (obj|bool=True):

An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

minimize(self, model, variable_ranges=None, progress_bar=True)

Run the optimizer with the goal of minimizing the outputs (i.e., KPIs) of a given model.

Args:

  • model (datastories.model.Model):

    The input model whose KPIs are to be minimized.

  • variable_ranges (dict [str, datastories.optimization.VariableRange] = {}):

An optional dictionary mapping variable names to ranges that are to be used to limit the search for the optimum solution to a given domain.

  • progress_bar (obj|bool=True):

An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

optimize(self, model, optimization_spec=None, variable_ranges=None, direction=None, progress_bar=True)

Optimize an input model according to a given optimization specification.

Args:

  • model (datastories.model.Model):

The input model to be optimized.

  • optimization_spec (datastories.optimization.OptimizationSpecification):

An optional specification for the optimization objectives and constraints. The default value is an empty specification (i.e., OptimizationSpecification()).

  • variable_ranges (dict [str, datastories.optimization.VariableRange] = {}):

An optional dictionary mapping variable names to ranges that are to be used to limit the search for the optimum solution to a given domain.

  • direction (datastories.optimization.OptimizationDirection):

    The direction of optimization when no specification is provided. Can be one of:
    • OptimizationDirection.MAXIMIZE

    • OptimizationDirection.MINIMIZE

  • progress_bar (obj|bool=True):

An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

Raises:

  • TypeError:

    when the provided input parameters do not have the expected types.
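
Example (a minimal sketch of direction-based optimization; the model file is hypothetical):

from datastories.model import Model
from datastories.optimization import OptimizationDirection
from datastories.optimization.pso import Optimizer

model = Model('my_model.rsx')
optimizer = Optimizer(population_size=500, iterations=250)
result = optimizer.optimize(model, direction=OptimizationDirection.MAXIMIZE)
print(result.optimum)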

class datastories.optimization.OptimizerType(value)

Enumeration for DataStories supported optimizer types.

PARTICLE_SWARM = 0

class datastories.optimization.OptimizationResult

Encapsulates the result of a datastories.optimization.pso.Optimizer.optimize() analysis.

Note: Objects of this class should not be manually constructed.

is_complete

Checks whether the search for the optimum completed without being interrupted.

is_feasible

Checks whether the identified optimum position respects the imposed constraints (if any).

optimum

The model variable values for the identified optimum position.

to_pandas(self)

Export the optimum position to a Pandas DataFrame object.

Returns:

  • The constructed Pandas DataFrame object.


class datastories.optimization.OptimizationSpecification(objectives=None, constraints=None)

Encapsulates a set of optimization objectives and constraints that can be used to configure an optimization analysis.

Both objectives and constraints are defined using datastories.optimization.VariableSpec and (potentially) datastories.optimization.VariableMapper objects.

Example:

from datastories.optimization import (
    AtMost, InInterval, Minimize, OptimizationSpecification, Sum
)

spec = OptimizationSpecification()
spec.objectives = [
    Minimize('KPI_1', 2),
    InInterval('KPI_2', 1, 100)
]
spec.add_constraint(AtMost(Sum(['Input_1', 'Input_2']), 100))

add_constraint(self, constraint)

Add an optimization constraint to the specification.

add_objective(self, objective)

Add an optimization objective to the specification.

constraints

Get/set the optimization specification constraints.

objectives

Get/set the optimization specification objectives.

to_dict(self)

class datastories.optimization.OptimizationDirection

Enumeration for possible optimization goals when no other optimization specification is provided.

Possible values:
  • OptimizationDirection.MAXIMIZE

  • OptimizationDirection.MINIMIZE

class datastories.optimization.VariableRange

Encapsulates a numeric or categorical value range.

Numeric ranges are defined by an upper and a lower bound. Categorical ranges are currently limited to a single value.

Args:

  • min (double=0):

    a numeric range lower bound.

  • max (double=0):

    a numeric range upper bound.

  • value (str=’’):

    a categorical range value.

is_categorical

Checks whether the variable range is categorical.

is_numeric

Checks whether the variable range is numeric.

max

Get/set the upper bound of a numeric range.

min

Get/set the lower bound of a numeric range.

to_dict(self)

value

Get/set the value of a categorical range.

class datastories.optimization.VariableMapper

Base class for all variable mappers.

Variable mappers are the first argument to be passed when defining optimization objectives and constraints. They indicate to what variable or group of variables the objective/constraint applies.

For simple cases (i.e., one variable), variable mappers can be replaced with the name of the variable itself. However, in more complex scenarios (e.g., a constraint that applies to the aggregated value of a number of variables), mappers have to be explicitly constructed.
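
Example (a short sketch contrasting the two styles; variable names are hypothetical):

from datastories.optimization import AtMost, Sum

# Simple case: the variable name acts as an implicit mapper.
c1 = AtMost('Input_1', 10)

# Complex case: an explicit mapper aggregating several variables.
c2 = AtMost(Sum(['Input_1', 'Input_2']), 100)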

class datastories.optimization.Sum(operands, weights=None)

Bases: VariableMapper

Aggregates a number of variables using a weighted sum. This can then be used to define optimization objectives or constraints.

Args:

  • operands (list):

a list of variable names to sum up.

  • weights (list=None):

    a list of relative weights for aggregating the given variables.

class datastories.optimization.VariableSpec

Base class for all optimization objectives and constraints.

class datastories.optimization.AtMost(operand, double limit, double weight=1.0)

Bases: VariableSpec

Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be lower than a given reference value.

Args:

  • operand (obj):

a variable mapper (datastories.optimization.specification.VariableMapper) indicating to which variable(s) the objective/constraint applies.

  • limit (double):

    the reference value to compare against.

  • weight (double=1):

    the relative weight of this objective/constraint among all the specified objectives or constraints.

class datastories.optimization.AtLeast(operand, double limit, double weight=1.0)

Bases: VariableSpec

Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be greater than a given reference value.

Args:

  • operand (obj):

a variable mapper (datastories.optimization.specification.VariableMapper) indicating to which variable(s) the objective/constraint applies.

  • limit (double):

    the reference value to compare against.

  • weight (double=1):

    the relative weight of this objective/constraint among all the specified objectives or constraints.

class datastories.optimization.InInterval(operand, double lower_limit, double upper_limit, double weight=1.0)

Bases: VariableSpec

Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be in a given reference interval.

Args:

  • operand (obj):

a variable mapper (datastories.optimization.specification.VariableMapper) indicating to which variable(s) the objective/constraint applies.

  • lower_limit (double):

    the lower bound of the reference interval.

  • upper_limit (double):

    the upper bound of the reference interval.

  • weight (double=1):

    the relative weight of this objective/constraint among all the specified objectives or constraints.

class datastories.optimization.IsEqual(operand, double value, double weight=1.0)

Bases: VariableSpec

Specifies an optimization objective by which a variable (or aggregation of variables) should be equal to a given reference value.

Note: The optimizer does not support the use of datastories.optimization.specification.IsEqual as a constraint, because the underlying algorithm is not optimized to handle constraints of this type. Therefore, trying to force IsEqual-like behavior by combining AtLeast and AtMost to make only a small region feasible is not recommended. The returned result might be in this region, but there is no guarantee that it is close to optimal.

In general, one should try to add the IsEqual condition as an objective with a high weight. This does not guarantee that the condition will be met, but the results are often close enough that a small manual adjustment to one parameter is enough to meet the condition.

A common case is that the sum of some parameters must be equal to a value, for example in formulations where parameters express a fraction of a mixture. In this case, if the previous recommendation does not lead to good solutions, one can try to relax the condition in the following way: have one constraint limiting the sum to the value with AtMost(Sum([...]), value), and one objective Maximize(Sum([...])) with a high weight. This is less restrictive towards the algorithm than using IsEqual as an objective, and can lead to better results. Of course, a small manual adjustment might be needed to satisfy the condition exactly.
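
Example (a sketch of this relaxation for fractions of a mixture that must sum to 1; variable names are hypothetical):

from datastories.optimization import (
    AtMost, Maximize, OptimizationSpecification, Sum
)

fractions = Sum(['fraction_a', 'fraction_b', 'fraction_c'])
spec = OptimizationSpecification()
spec.add_constraint(AtMost(fractions, 1.0))
spec.add_objective(Maximize(fractions, 10.0))  # high weight pushes the sum towards the limit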

Args:

  • operand (obj):

a variable mapper (datastories.optimization.specification.VariableMapper) indicating to which variable(s) the objective applies.

  • value (double):

    the reference value to compare against.

  • weight (double=1):

    the relative weight of this objective/constraint among all the specified objectives or constraints.

class datastories.optimization.Minimize(operand, double weight=1.0)

Bases: VariableSpec

Specifies an optimization objective by which a variable (or aggregation of variables) should have the smallest possible value.

Note: This cannot be used to define optimization constraints.

Args:

  • operand (obj):

    a variable mapper (datastories.optimization.specification.VariableMapper) indicating to which variable(s) the objective applies.

  • weight (double=1):

    the relative weight of this objective among all the specified objectives.

class datastories.optimization.Maximize(operand, double weight=1.0)

Bases: VariableSpec

Specifies an optimization objective by which a variable (or aggregation of variables) should have the largest possible value.

Note: This cannot be used to define optimization constraints.

Args:

  • operand (obj):

    a variable mapper (datastories.optimization.specification.VariableMapper) indicating to which variable(s) the objective applies.

  • weight (double=1):

    the relative weight of this objective among all the specified objectives.

Story

The datastories.story package contains a collection of workflows to automate specific analysis tasks (e.g., building a predictive model).


datastories.story.load(file_path, *args, **kwargs)

Loads a previously saved story.

Args:

  • file_path (str):

    name of the file containing the story, including extension.

Returns:

  • An object wrapping the story.

Raises:

  • TypeError:

    when the story type is not recognized by the SDK.

  • StoryError:

    when the story type cannot be retrieved from the file.


datastories.story.load_result(file_path, cls=None, *argc, **kwargs)

Load a decoupled story analysis result.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within the same minor SDK version.

Args:

  • file_path (str):

Path to the result file.

  • cls (str=None):

    Expected type of the result. When not specified, an attempt will be made to infer the type from the file contents. When specified, it has to match the type of the result stored in the file.

Returns:

  • An object instance of the result stored in the file.

Raises:

  • ValueError:

When the type of the result could not be inferred or is different from the one specified in the [cls] argument.

  • NotImplementedError:

When the result type or its specific version is not supported.


class datastories.story.StoryBase(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)

Base class for story analyses.

Base classes:

class ProcessingStage(value)

Enumeration of all story processing stages.

Specializations have to extend this with their specific execution stages, while maintaining these base stages as defined below:

  • UNKNOWN = 0

  • INIT = 1

add_note(note)

Add an annotation to the story results.

The already present annotations can be retrieved using the datastories.api.IStory.notes() property.

Args:

  • note (str):

    the annotation to be added.

clear_note(note_id)

Remove a specific annotation associated with the story analysis.

Args:

  • note_id (int):

    the index of the note to be removed.

Raises:

  • IndexError:

    when the note index is unknown.

clear_notes()

Clear the annotations associated with the story analysis.

classmethod create_story(data_frame, info_fields, folder, progress_bar, on_snapshot, upload_function, **kwargs)

Factory method.

This method has to be overridden by specializations in order to enable additional computation when loading a story object.

info()

Display story execution information.

All story execution stages are displayed together with their completion status. The version of the used DataStories SDK and the user notes are also included.

static is_compatible(current_version_string, ref_version_string)

Test whether two story versions are compatible.

The story version compatibility policy is as follows:

stories are forward and backward compatible across minor versions (i.e., you can open a saved story whose version differs from the version associated with the current SDK, but only if the major version number remains unchanged).
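
Example (a short sketch; the version strings are hypothetical and follow the policy above):

from datastories.story import StoryBase

StoryBase.is_compatible('2.3.0', '2.1.4')   # same major version: compatible
StoryBase.is_compatible('3.0.0', '2.9.1')   # different major version: not compatible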

property is_complete

Checks whether all story analysis stages have been executed.

property is_ok

Checks whether the last executed story analysis stage has been successful.

classmethod load(file_path, folder=None, progress_bar=False, on_snapshot=None, upload_function=None, **kwargs)

Load a previously saved story.

Args:

  • file_path (str):

    path to the input file.

  • folder (str=None):

    Folder to use as working folder for the story.

  • progress_bar (obj|bool=False):

    An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

  • on_snapshot (func=None):

    an optional callback to be executed when an analysis snapshot is created. The callback receives one argument indicating the path of the snapshot file relative to the current execution folder.

  • upload_function (func=None):

    an optional callback to upload analysis result files to a client specific storage. The callback receives one argument indicating the path of the result file relative to the current execution folder.

Returns:

Raises:

  • datastories.story.StoryError:

    when there is a problem loading the story file (e.g., story version not compatible).

property metrics

Returns a set of metrics computed during analysis.

NOTE: This is an alias for the .stats property.

property notes

The list of all annotations currently associated with the story analysis.

reset()

Reset the execution pointer of a story to the first stage.

run(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)

Resume the execution of a story from a given stage.

The stage to resume from is optional. If not specified, the story is executed from the beginning. If the stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed will be used as the starting point, unless the [strict] argument is set to True. In that case an exception is raised if the execution cannot be resumed from the requested stage.

Args:

  • resume_from (obj):

    The stage to resume execution from. Should be a datastories.story.predict_kpis.Story.ProcessingStage value corresponding to a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.

  • strict (bool=False):

Raise an error if execution cannot be resumed from the requested stage.

  • params (dict={}):

    Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.

  • progress_bar (obj=None):

An object of type datastories.display.ProgressReporter to replace the currently used progress reporter. When not specified, the current story progress reporter is not modified. The use case for this is to set a progress bar after the story is loaded, when a progress bar cannot be given to the load function directly (e.g., when the progress bar has to be constructed based on the story).

  • check_interrupt (func=None):

    an optional callback to check whether analysis execution needs to be interrupted.

Raises:

save(file_path, include_data=True)

Save the story analysis results.

Use this function to persist the results of the story analysis. One can reload them and continue investigations at a later moment using the datastories.story.load() method.

Args:

  • file_path (str):

    path to the output file.

  • include_data (bool=True):

    set to True to include a copy of the data in the exported file.

Raises:

  • datastories.api.errors.StoryError:

when attempting to include data while the story does not contain a data reference. This is the case with stories that have previously been saved without including the data.
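
Example (a minimal round-trip sketch; assumes `my_story` is a completed story object and the file name is hypothetical):

import datastories.story

my_story.save('analysis.story', include_data=True)
reloaded = datastories.story.load('analysis.story')
reloaded.info()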

Predict Single KPI

datastories.story.predict_single_kpi(data_frame, column_list, kpi, runs=3, outlier_elimination=True, prototypes='auto', progress_bar=True, threads=0, scale_kpi='auto')

Fits a non-linear regression model on a data frame in order to predict one column.

DEPRECATED: This method has been deprecated and will be removed in a future version of the SDK. Use the more generic datastories.story.predict_kpis() for analyzing single KPIs as well.

The column to be predicted (i.e., the KPI) is to be identified either by name or by column index in the data frame.

Args:

  • data_frame (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object).

  • column_list (list):

    the list of variables (i.e., columns) to consider for regression.

  • kpi (int|str):

    the index or the name of the target (i.e., KPI) column.

  • runs (int=3):

the number of training rounds.

  • outlier_elimination (bool=True):

set to True in order to exclude far outliers from modeling. Note that no outliers will be eliminated if the dataset has fewer than 30 rows, or the variable has fewer than 20 unique values.

  • prototypes (str=’auto’):

    indicates whether analysis should be performed on prototypes. Possible values:

    • 'yes': use only prototypes as inputs.

    • 'no': use all original inputs.

    • 'auto': use prototypes if there are more than 200 input variables.

  • progress_bar (obj|bool=True):

An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

  • threads (int):

the number of computational threads to use; all available cores are used by default.

  • scale_kpi (str=’auto’):

    indicates whether the kpi should be scaled. Possible values:

    • 'yes': all runs use the scaled KPI if we detect that scaling could be beneficial.

    • 'no': all runs use the original KPI.

    • 'auto': about one third of the runs use the scaled KPI if we detect that scaling could be beneficial; the rest of the runs use the original KPI.

Returns:

Raises:

  • ValueError:

when an invalid value is provided for one of the input parameters.

  • datastories.story.StoryError:

    when there is a problem fitting the model.

Example:

from datastories.story import predict_single_kpi
import pandas as pd

df = pd.read_csv('example.csv')
kpi_column_index = 1
story = predict_single_kpi(df, df.columns, kpi_column_index, progress_bar=True)
print(story)

class datastories.story.predict_single_kpi.Story(platform, kpi_name, user_columns, nrows, folder='', *args, **kwargs)

Encapsulates the result of a single KPI non-linear regression model.

Base classes:

DEPRECATED: This class has been deprecated and will be removed in a future version of the SDK. Use the more generic datastories.story.predict_kpis() for analyzing single KPIs as well.

Note: Objects of this class should not be manually constructed but rather created using the datastories.story.predict_single_kpi() factory method.

classmethod load(file_path)

Loads a previously saved story.

Args:

  • file_path (str):

    the name of the input file.

Returns:

Raises:

  • datastories.story.StoryError:

    when there is a problem loading the story file (e.g., story version not compatible).

property metrics

A dictionary containing the model performance metrics and the list of main drivers.

These metrics are computed on the training data for the purpose of evaluating the model prediction performance.

The following metrics are retrieved:

  • Training Set Size:

    size of the actual data frame used for training (rows x columns).

  • Correlation:

    actual vs predicted correlation.

  • Estimated Correlation:

    estimated correlation for future (unseen) values.

  • R-squared:

    the coefficient of determination.

  • MSE:

    mean squared error.

  • RMSE:

    root mean squared error.

  • Main Drivers:

    list of main features with associated relative importance and energy.

  • Features:

    list of all features with associated relative importance and energy.

  • Computation Effort:

    a measure of model complexity.

  • Number of Runs:

    number of training rounds.

  • Best Run:

    best performing training round.

  • Run Overview:

    overview of individual runs including Performance and Feature Importance.

In case the KPI is a binary variable, the following additional metrics are included:

  • Positive Label:

    the label used to identify positive cases.

  • Negative Label:

    the label used to identify negative cases.

  • True Positives:

    number of correctly identified positive cases (TP).

  • False Positives:

    number of incorrectly identified positive cases (FP).

  • True Negatives:

    number of correctly identified negative cases (TN).

  • False Negatives:

    number of incorrectly identified negative cases (FN).

  • Not Classified:

    number of records that could not be classified (i.e., KPI is NaN).

  • True Positive Rate:

    TP / (TP + FN) * 100 (a.k.a. sensitivity, recall).

  • False Positive Rate:

    FP / (FP + TN) * 100 (a.k.a. fall-out).

  • True Negative Rate:

TN / (FP + TN) * 100 (a.k.a. specificity).

  • False Negative Rate:

    FN / (TP + FN) * 100 (a.k.a. miss rate).

  • Precision:

    percentage of correctly identified cases from the total reported positive cases TP / (TP + FP) * 100.

  • Recall:

    percentage of correctly identified cases from the total existing positive cases TP / (TP + FN) * 100.

  • Accuracy:

    percentage of correctly identified cases (TP + TN) / (TP + FP + TN + FN) * 100.

  • F1 Score:

    the F1 score (the harmonic mean of precision and recall).

  • AUC:

    area under (ROC) curve.

property model

An object of type datastories.model.SingleKpiPredictor that can be used for making predictions on new data.

property run_overview

An overview of feature importance metrics across all runs.

property runs

A list containing the results of individual analysis rounds.

Each entry in the list is an object of type datastories.story.predict_single_kpi.StoryRun encapsulating the results associated with a given analysis round.

save(file_path)

Saves the story analysis results.

Use this function to persist the results of the datastories.story.predict_single_kpi() analysis. One can reload them and continue investigations at a later moment using the datastories.story.predict_single_kpi.Story.load() method.

Args:

  • file_path (str):

    path to the output file.

to_csv(file_path, content='metrics', delimiter=',', decimal='.')

Exports a list of model metrics to a CSV file.

Args:

  • file_path (str):

    path to the output file.

  • content (str=’metrics’):

    the type of metrics to export. Possible values:

    • 'metrics': exports estimated model performance metrics.

    • 'drivers': exports driver importance metrics.

    • 'run_overview': exports an overview of feature importance metrics across all runs.

  • delimiter (str=’,’):

    character to use as value delimiter.

  • decimal (str=’.’):

    character to use as decimal point.

Raises:

  • ValueError:

    when an invalid value is provided for the [content] argument.

to_excel(file_path)

Exports the list of model metrics to an Excel file.

Args:

  • file_path (str):

    path to the output file.

to_html(file_path, title='Predict Single KPI', subtitle='', scenario=VisualizationScenario.REPORT)

Export the story visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Predict Single KPI’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.

to_pandas(content='metrics')

Exports a list of model metrics to a pandas.DataFrame object.

Args:

  • content (str=’metrics’):

    the type of metrics to export. Possible values:

    • 'metrics': exports estimated model performance metrics.

    • 'drivers': exports feature importance metrics for the model.

    • 'run_overview': exports an overview of feature importance metrics across all runs.

Returns:

  • The constructed pandas.DataFrame object.

Raises:

  • ValueError:

    when an invalid value is provided for the [content] argument.

class datastories.story.predict_single_kpi.StoryRun(platform, parent, folder=None, dependencies=None, *args, **kwargs)

Encapsulates the result of one analysis round for a single KPI non-linear regression model.

DEPRECATED: This class has been deprecated and will be removed in a future version of the SDK. Use the more generic datastories.story.predict_kpis() for analyzing single KPIs as well.

Base classes:

Note: Objects of this class should not be manually constructed.

property correlation_browser

A visualization for assessing feature correlation.

An object of type datastories.visualization.CorrelationBrowser that can be used for assessing feature correlation, as discovered while training the model.

property metrics

A dictionary containing the model performance metrics and the list of main drivers.

These metrics are computed on the training data for the purpose of evaluating the model prediction performance.

The following metrics are retrieved:

  • Training Set Size:

    size of the actual data frame used for training (rows x columns).

  • Correlation:

    actual vs predicted correlation.

  • Estimated Correlation:

    estimated correlation for future (unseen) values.

  • R-squared:

    the coefficient of determination.

  • MSE:

    mean squared error.

  • RMSE:

    root mean squared error.

  • Main Drivers:

    list of main features with associated relative importance and energy.

  • Features:

    list of all features with associated relative importance and energy.

In case the KPI is a binary variable, the following additional metrics are included:

  • Positive Label:

    the label used to identify positive cases.

  • Negative Label:

    the label used to identify negative cases.

  • True Positives:

    number of correctly identified positive cases (TP).

  • False Positives:

    number of incorrectly identified positive cases (FP).

  • True Negatives:

    number of correctly identified negative cases (TN).

  • False Negatives:

    number of incorrectly identified negative cases (FN).

  • Not Classified:

    number of records that could not be classified (i.e., KPI is NaN).

  • True Positive Rate:

    TP / (TP + FN) * 100 (a.k.a. sensitivity, recall).

  • False Positive Rate:

    FP / (FP + TN) * 100 (a.k.a. fall-out).

  • True Negative Rate:

TN / (FP + TN) * 100 (a.k.a. specificity).

  • False Negative Rate:

    FN / (TP + FN) * 100 (a.k.a. miss rate).

  • Precision:

    percentage of correctly identified cases from the total reported positive cases TP / (TP + FP) * 100.

  • Recall:

    percentage of correctly identified cases from the total existing positive cases TP / (TP + FN) * 100.

  • Accuracy:

    percentage of correctly identified cases (TP + TN) / (TP + FP + TN + FN) * 100.

  • F1 Score:

    the F1 score (the harmonic mean of precision and recall).

  • AUC:

    area under (ROC) curve.

property model

An object of type datastories.model.SingleKpiPredictor that can be used for making predictions on new data.

to_csv(file_path, content='metrics', delimiter=',', decimal='.')

Export a list of model drivers or metrics to a CSV file.

Args:

  • file_path (str):

    path to the output file.

  • content (str=’metrics’):

    the type of metrics to export. Possible values:

    • 'metrics': exports estimated model performance metrics.

    • 'drivers': exports driver importance metrics.

  • delimiter (str=’,’):

    character to use as value delimiter.

  • decimal (str=’.’):

    character to use as decimal point.

Raises:

  • ValueError:

    when an invalid value is provided for the [content] argument.

to_excel(file_path)

Exports the list of model drivers and metrics to an Excel file.

Args:

  • file_path (str):

    path to the output file.

to_html(file_path, title='Predict Single KPI Run', subtitle='', scenario=VisualizationScenario.REPORT)

Export the story visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Predict Single KPI Run’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.

to_pandas(content='metrics')

Export a list of model drivers or metrics to a pandas.DataFrame object.

Args:

  • content (str=’metrics’):

    the type of metrics to export. Possible values:

    • 'metrics': exports estimated model performance metrics.

    • 'drivers': exports driver importance metrics.

Returns:

  • The constructed pandas.DataFrame object.

Raises:

  • ValueError:

    when an invalid value is provided for the [content] argument.

property what_ifs

A visualization for interactive exploration of the models.

The visualization helps getting insight into how driver variables influence the target KPIs. An object of type datastories.visualization.WhatIfs that can be used for interactive exploration of the models.

Predict Multiple KPIs

datastories.story.predict_kpis(data, column_list, kpi_list, record_info_list=None, runs=3, outlier_elimination=True, prototypes='auto', prototype_threshold=0.85, optimize=False, progress_bar=True, fail_on_error=False, threads=0, scale_kpi='no')

Fit a non-linear regression model on a data frame in order to predict several columns (i.e., KPIs) at the same time.

The columns to be predicted (i.e., the KPIs) are to be identified either by name or by column index in the data frame.

Args:

  • data (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object) or a data descriptor (i.e., a datastories.data.DataDescriptor object).

  • column_list (list):

    the list of variables (i.e., columns) to consider for regression.

  • kpi_list (list):

    the list of indexes or names for the target columns (i.e., KPIs).

  • record_info_list (list=[]):

    the list of indexes or names to be used as additional record info.

  • runs (int=3):

    the number of training rounds.

  • outlier_elimination (bool=True):

set to True in order to exclude far outliers from modeling. Note that no outliers will be eliminated if the dataset has fewer than 30 rows, or the variable has fewer than 20 unique values.

  • prototypes (str=’auto’):

    indicates whether analysis should be performed on prototypes. Possible values:

    • 'yes': use only prototypes as inputs.

    • 'no': use all original inputs.

    • 'auto': use prototypes if there are more than 200 input variables.

  • prototype_threshold (float=0.85):

minimum correlation required for a column to be considered a proxy for another.

  • optimize (bool=False):

set to True in order to compute optimal values for the KPIs. This will run optimization analyses that attempt to first minimize and then maximize all KPIs together. For more complex scenarios (e.g., minimize a specific KPI while maximizing another) one can use the optimize method of the model field (datastories.model.MultiKpiPredictor) once the story analysis is completed.

  • progress_bar (obj|bool=True):

An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

  • fail_on_error (bool=False):

set to True in order to fail (i.e., raise an exception) when problems are detected. Otherwise, the processing will complete, producing a partial story object. In order to check how far the processing has reached, one can use the datastories.story.StoryBase.info() method.

  • threads (int):

the number of computational threads to use; all available cores are used by default.

  • scale_kpi (str=’no’):

    indicates whether the kpi should be scaled. Possible values:

    • 'yes': all runs use the scaled KPI if we detect that scaling could be beneficial.

    • 'no': all runs use the original KPI.

    • 'auto': one third of the runs use the scaled KPI if we detect that scaling could be beneficial; the rest of the runs use the original KPI. When doing a single run, no scaling is applied. When doing two runs, scaling can be applied on one run.

Returns:

Raises:

  • ValueError:

when an invalid value is provided for one of the input parameters.

  • datastories.story.StoryError:

    when there is a problem fitting the model.

Example:

from datastories.story import predict_kpis
import pandas as pd

df = pd.read_csv('example.csv')
kpi_columns = [1, 'other kpi', 3, 4]   # KPIs can be given by index or by name
story = predict_kpis(
    df,
    df.columns,
    kpi_columns)
print(story)

class datastories.story.predict_kpis.Story(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)

Encapsulates a multi-kpi non-linear regression model analysis.

Base classes:

Note: Objects of this class should not be manually constructed but rather created using the datastories.story.predict_kpis() factory method.

class ProcessingStage(value)

Enumeration declaring the story processing stages.

Possible values:

  • UNKNOWN = 0

  • INIT = 1

  • PREPARE_DATA = 2

  • PROCESS_DATA = 3

  • BUILD_MODELS = 4

  • MERGE_MODELS = 5

  • VALIDATE_MODEL = 6

  • OPTIMIZE = 7

  • WRAP_UP = 8

  • END = 9

add_model_validation(prediction)

Add a prediction containing validation data to the story managed validations.

property best_run

The index of the best analysis run.

The best run is selected to be the run with the highest cumulative importance overlap between the main drivers of different KPIs. The overlap is computed pairwise between all KPI pairs of a given run.

property conclusions

An analysis summary containing highlights and pointers to detailed insights (object of type datastories.story.predict_kpis.Conclusion).

property data

A copy of the story associated dataframe, if available.

NOTE: When the dataframe has been previously discarded (i.e., by setting the include_data argument to False while saving the story) the associated data is lost and this property will return None.

property data_health

A summary of input data quality (object of type datastories.story.generic.DataHealth).

property data_overview

An overview of driver importance across all analysis runs (object of type datastories.story.generic.DataOverview).

property failed_kpis

A list of KPIs that could not be processed, or None if all KPIs have been successfully modeled.

property linear_vs_nonlinear

An overview of KPI relationships with other columns in the dataframe (object of type datastories.story.generic.LinearVsNonlinear).

property model

An object of type datastories.api.IPredictiveModel that can be used for making predictions on new data.

property model_validation

An overview of model validations (object of type datastories.story.predict_kpis.ModelValidation).

modify_drivers(replace=None, remove=None, run=None, complexity=1.0)

Modify the drivers and complexity of a model from the story. This function generates new models with driver substitutions or removals performed on the already trained model.

Since it starts out from an already trained model, we can save time under the assumption that the variables substituted are similar in information content.

The main intent of this function is to replace a driver that is hard to control by one of its proxies that is easy to control, without having to do a full training run again.

Note that this function does not give you a statistical guarantee about the quality of the resulting model, as no variable selection is performed and the input weights are not retrained.

In general, starting a new story with the driver substitution and removal performed on the input columns will yield a more reliable model than the one created by this function.

Args:

  • replace (dict):

Driver labels representing driver replacements; keys will be replaced by values, i.e., with input {'driver1': 'driver2'}, driver1 will be replaced by driver2.

  • remove (list):

Driver labels to remove from the model.

  • run (int):

Advanced option to select the run you want to modify. By default the best run is chosen automatically, i.e., the one you see when displaying a story.

  • complexity (float=1):

Advanced option to increase or decrease the complexity factor of the model. More complex models can have more complex response surfaces and variable interactions. Note that increasing this above 1 might result in worse models, or can over-fit.

Returns:

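Example (a short sketch; assumes `story` is a completed predict_kpis story and the driver labels are hypothetical):

result = story.modify_drivers(replace={'driver1': 'driver2'})   # generate new models with driver1 replaced by driver2
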
property pairwise_plots

A collection of variable vs variable plots (object of type datastories.story.generic.PairwisePlots).

property record_info_labels

The names of the columns that contain record identification information.

reset()

Reset the execution pointer of a story to the first stage.

Warning: After calling this, all previous results are discarded. One needs to run the story again in order to regenerate the results. This is only possible when the data frame is still available. That is, resetting a story that previously discarded the data frame (e.g., while saving) would render the story unusable. Consequently, this scenario is not allowed and an Exception is raised when it is attempted.

run(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)

Resume the execution of a story from a given stage.

The stage to resume from is optional. If not specified, the story is executed from the beginning. If the stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed will be used as the starting point, unless the [strict] argument is set to True. In that case an exception is raised if the execution cannot be resumed from the requested stage.

Args:

  • resume_from (self.ProcessingStage=None):

    The stage to resume execution from. Should be a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.

  • strict (bool=False):

Raise an error if execution cannot be resumed from the requested stage.

  • params (dict={}):

    Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.

  • progress_bar (obj=None):

An object of type datastories.display.ProgressReporter to replace the currently used progress reporter. When not specified, the current story progress reporter is not modified. The use case for this is to set a progress bar after the story is loaded, when a progress bar cannot be given to the load function directly (e.g., when the progress bar has to be constructed based on the story).

  • check_interrupt (func=None):

    an optional callback to check whether analysis execution needs to be interrupted.

Raises:

  • StoryError:

if a stage is specified for which no intermediate results are available and the [strict] argument is set to True.

property run_overview

An overview of driver importance across all analysis runs (object of type datastories.story.predict_kpis.RunOverview).

property runs

A list containing the results of individual analysis rounds.

Each entry in the list is an object of type datastories.story.predict_kpis.StoryRun encapsulating the results associated with a given analysis round.

save(file_path, include_data=True)

Save the story analysis results.

Use this function to persist the results of the datastories.story.predict_kpis() analysis. One can reload them and continue investigations at a later moment using the datastories.story.load() method.

Args:

  • file_path (str):

    path to the output file.

  • include_data (bool=True):

    set to True to include a copy of the data in the exported file.

Raises:

  • datastories.api.errors.StoryError:

when attempting to include data while the story does not contain a data reference. This is the case with stories that have previously been saved without including the data.

property stats

A dictionary containing the model performance statistics and the list of main drivers.

These statistics are computed on the training data for the purpose of evaluating the model prediction performance.

The following statistics are retrieved:

  • Prediction Performance:

the prediction performance per KPI (correlation coefficient for regression, or AUC for classification).

  • Driver Importance:

    relative driver importance per KPI.

  • Driver Overlap:

cumulative driver importance overlap computed between all possible pairs of KPIs.

to_html(file_path, title='Predict Multiple KPIs', subtitle='', scenario=VisualizationScenario.REPORT)

Export the story visualization to a standalone HTML document.

Args:

  • file_path (str):

    name of the file to export to.

  • title (str=’Predict Multiple KPIs’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.

class datastories.story.predict_kpis.StoryRun(run_idx=None, upload_function=None, progress_bar=None, folder=None, dependencies=None, parent=None, *args, **kwargs)

Encapsulates results of one analysis round of a multi-kpi non-linear regression model analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

property correlation_browser

An overview of linear and nonlinear correlations across most relevant variables in the analysis (object of type datastories.story.generic.CorrelationBrowser).

The most relevant variables are identified based on the amount of correlation they exhibit with respect to other variables in the analysis.

property driver_overview

An overview of driver importance across all KPIs (object of type datastories.story.predict_kpis.DriverOverview).

property drivers

Retrieves an overview of all driver variables.

property kpis

Retrieves an overview of all KPIs.

property metrics

A set of metrics computed during analysis.

NOTE: This is an alias for the .stats property.

property model

An object of type datastories.api.IPredictiveModel that can be used for making predictions on new data.

property outliers

A dictionary of outlier values per column used in modeling.

property stats

A dictionary containing the model performance statistics and the list of main drivers.

These statistics are computed on the training data for the purpose of evaluating the model prediction performance.

The following statistics are retrieved:

  • Prediction Performance:

    the prediction performance per KPI (correlation coefficient for regression, or AUC for classification).

  • Driver Importance:

    relative driver importance per KPI.

  • Driver Overlap:

    cumulated driver importance overlap computed between all possible pairs of KPIs.

to_csv(file_path, content='Driver Importance', delimiter=',', decimal='.')

Export a list of story metrics to a CSV file.

Args:

  • file_path (str):

    path to the output file.

  • content (str=’Driver Importance’):

    the type of metrics to export. Possible values:

    • 'Prediction Performance': exports estimated model performance metrics;

    • 'Driver Importance': exports driver importance metrics.

  • delimiter (str=’,’):

    character used as value delimiter.

  • decimal (str=’.’):

    character used as decimal point.

Raises:

  • ValueError:

    when an invalid value is provided for the [content] argument.

to_excel(file_path)

Export the list of story metrics to an Excel file.

Args:

  • file_path (str):

    path to the output file.

to_html(file_path, title='Predict Multiple KPIs Run', subtitle='', scenario=VisualizationScenario.REPORT)

Export the story visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Predict Multiple KPIs Run’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.

to_pandas(content='Driver Importance')

Export a list of model drivers or metrics to a pandas.DataFrame object.

Args:

  • content (str=’Driver Importance’):

    the type of metrics to export. Possible values:

    • 'Prediction Performance': exports estimated model performance metrics;

    • 'Driver Importance': exports driver importance metrics.

Returns:

  • The constructed pandas.DataFrame object.

Raises:

  • ValueError:

    when an invalid value is provided for the [content] argument.
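
For example, assuming run is an object of this class (e.g., an entry of the runs property above):

# Driver importance as a data frame; use 'Prediction Performance'
# to retrieve the estimated model performance metrics instead.
df = run.to_pandas(content='Driver Importance')
print(df.head())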

property what_ifs

An interactive what-ifs analysis visualization (object of type datastories.story.generic.WhatIfs).

class datastories.story.predict_kpis.ProgressBar(story=None, runs=None, kpi_list=None, *args, **kwargs)

Convenience wrapper for datastories.display.AggregatedReporter.

It constructs aggregated progress reporters for multi-KPI stories. To this end, it requires either a story object (if already available) or two parameters that define the processing stages: the number of runs and the list of KPIs. When the story is specified, the number of runs and the KPI list should not be provided.

Args:

  • story (obj=None):

    an optional multi-KPI story object of type datastories.story.predict_kpis.Story from which the processing stages will be inferred.

  • runs (int=None):

    an optional integer specifying the number of runs.

  • kpi_list (list=None):

    an optional list specifying the story KPIs as would be provided to the analysis.

Raises:

  • ValueError:

    when the provided parameters do not match the specification requirements.
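
A minimal sketch, with an illustrative number of runs and hypothetical KPI names; the resulting object can then be passed wherever a progress reporter is expected (e.g., a progress_bar argument):

from datastories.story.predict_kpis import ProgressBar

# Either pass an existing story, or describe the stages explicitly.
progress = ProgressBar(runs=3, kpi_list=['Yield', 'Scrap Rate'])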

Check Data Health

datastories.story.check_data_health(data, sample_size=None, progress_bar=True, on_snapshot=None, upload_function=None, check_interrupt=None, fail_on_error=False)

Check the suitability of a dataset for building statistical models.

Args:

  • data (obj):

    the input data frame (either a pandas.DataFrame or a datastories.data.DataFrame object) or a data descriptor (i.e., a datastories.data.DataDescriptor object).

  • sample_size (int|str=None):

    the sample size to use for inferring data types (either an absolute integer value or a percentage - e.g., ‘10%’). If left unspecified, the minimum of 100 and 10% of the number of points is used.

  • progress_bar (obj|bool=True):

    An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

  • on_snapshot (func=None):

    an optional callback to be executed when an analysis snapshot is created. The callback receives one argument indicating the path of the snapshot file relative to the current execution folder.

  • upload_function (func=None):

    an optional callback to upload analysis result files to a client specific storage. The callback receives one argument indicating the path of the result file relative to the current execution folder.

  • check_interrupt (func=None):

    an optional callback to check whether analysis execution needs to be interrupted.

  • fail_on_error (bool=False):

    set to True in order to fail (i.e., raise an exception) when problems are detected. Otherwise, the processing will complete, producing a partial story object. In order to check how far the processing has reached, one can use the datastories.story.StoryBase.info() method.

Returns:

  • An object of type datastories.story.check_data_health.Story encapsulating the analysis results.

Example:

from datastories.story import check_data_health
import pandas as pd
df = pd.read_csv('example.csv')
story = check_data_health(df)
print(story)

class datastories.story.check_data_health.Story(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)

Encapsulates a data health analysis.

Base classes:

Note: Objects of this class should not be manually constructed but rather created using the datastories.story.check_data_health() factory method.

class ProcessingStage(value)

Enumeration declaring the story processing stages.

Possible values:

  • UNKNOWN = 0

  • INIT = 1

  • PREPARE_DATA = 2

  • COMPUTE_DATA_SUMMARY = 3

  • END = 4

property data_summary

An interactive data summary visualization.

Returns:

property stats

The set of data health statistics.

to_html(file_path, title='Data Health Report', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the analysis result visualization to a standalone HTML document.

Args:

  • file_path (str):

    name of the file to export to;

  • title (str=’Data Health Report’):

    HTML document title;

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.

to_pandas()

Export the data health stats to a Pandas data frame.

Returns:

  • The constructed Pandas data frame object.
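
For example, continuing the check_data_health example above:

# Inspect the health statistics as a data frame...
health_df = story.to_pandas()

# ...or share them as a standalone HTML report.
story.to_html('data_health.html', title='Data Health Report')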

Story Results

General Results

class datastories.story.generic.CorrelationBrowser(json_content, column_names, correlation_file='edge_bundling.json', slide_deck=None, slide_name='CorrelationBrowser')

An overview of linear and nonlinear correlations across the most relevant variables of a datastories.story.predict_kpis() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

property slide

A serializable representation of the overview that can be used together with a compatible renderer in order to visualize the results.


class datastories.story.generic.DataHealth(kpi_index_list, global_metrics, column_metrics, health_stats_file='health_stats.json', slide_deck=None, slide_name='DataHealth')

An overview of data health.

Base classes:

Note: Objects of this class should not be manually constructed.

property column_metrics

A set of column level health metrics.

property columns

A visualization of column level statistics only.

property global_metrics

A set of global health metrics.

property metrics

A set of data health statistics.

property slide

A serializable representation of the data health that can be used together with a compatible renderer in order to visualize the results.

property stats

A set of data health statistics.

property visualization

The data health visualization.


class datastories.story.generic.DataOverview(stats, input_indices, kpi_indices, slide_deck=None, slide_name='DataOverview')

A high level overview with descriptive statistics for the story input data frame.

Base classes:

Note: Objects of this class should not be manually constructed.

property metrics

The set of data overview statistics.

property slide

A serializable representation of the data overview that can be used together with a compatible renderer in order to visualize the results.

property stats

The set of data overview statistics.

to_html(file_path, title='Data Overview', subtitle='')

Exports the data overview visualization to a standalone HTML document.

Args:

  • file_path (str):

    Name of the file to export to.

  • title (str=’Data Overview’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.


class datastories.story.generic.LinearVsNonlinear(relations, graphics_file='linear_vs_nonlinear.json', slide_deck=None, slide_name='LinearVsNonlinear')

An overview of KPI relations with other columns in a dataframe.

Base classes:

Note: Objects of this class should not be manually constructed.

classmethod load(file_path, *args, **kwargs)

Load a ‘Linear vs Non-linear Relationships’ result from a versioned JSON file.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use a story-decoupled result only within a minor SDK version.

Args:

  • file_path (str):

    Location of the input file

  • graphics_file (str=’linear_vs_nonlinear.json’):

    Name of a file containing the same plot data (to be passed in the slide).

  • slide_deck (obj=None):

    Associated slide deck.

Returns:

property metrics

A KPI indexed dictionary of relation metrics.

These include the number of columns that have no relation with the KPI and the number of investigated relations.

property relations

A KPI indexed dictionary of relations with other columns in the data-frame.

save(file_path)

Save the ‘Linear vs Non-linear Relationships’ result to a versioned JSON file.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.

Args:

  • file_path (str):

    location of the output file

property slide

A serializable representation of the ‘Linear vs Non-linear Relationships’ slide that can be used together with a compatible renderer in order to visualize the results.

property stats

A KPI indexed dictionary of relation metrics.

These include the number of columns that have no relation with the KPI and the number of investigated relations.

NOTE: This is an alias for the .metrics property.

to_json(html_safe=False)

Save the ‘Linear vs Non-linear Relationships’ data as a JSON string.

Args:

  • html_safe (bool=False):

    set to True in order to produce a JSON string that is safe to embed in a HTML page as an attribute value.

Returns:

  • A JSON string containing the analysis results.
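
A minimal sketch of decoupling this result from its story, assuming result is an object of this class (the file name is illustrative):

from datastories.story.generic import LinearVsNonlinear

# Save the result to a versioned JSON file...
result.save('linear_vs_nonlinear.json')

# ...and load it back later, within the same minor SDK version.
result = LinearVsNonlinear.load('linear_vs_nonlinear.json')

# A JSON string that is safe to embed as an HTML attribute value.
json_string = result.to_json(html_safe=True)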

property visualization

The ‘Linear vs Non-linear Relationships’ visualization.


class datastories.story.generic.PairwisePlots(plots, data=None, stats=None, record_info_columns=None, graphics_file='pair_wise_plots.json', slide_deck=None, slide_name='PairwisePlots')

A collection of variable to variable plots.

Note: Objects of this class should not be manually constructed.

classmethod load(file_path, *args, **kwargs)

Load a Pair-Wise Plots result from a versioned JSON file.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use a story-decoupled result only within a minor SDK version.

Args:

  • file_path (str):

    location of the input file

  • data (obj=None):

    A Pandas data frame containing reference data for plots. When not provided, some of the plots are not available. In particular, scatter plots are not available in the SDK. Exported slides, on the other hand, rely on the data available to the renderer and are therefore not subject to this constraint.

  • stats (dict=None):

    A column name indexed dictionary of column statistics. When not provided, some of the plots might not be available (see the data argument above).

  • record_info_columns (list=[]):

    List of column names to use for displaying additional data in plot tooltips.

  • graphics_file (str=’pair_wise_plots.json’):

    Name of file containing the plot data

  • slide_deck (obj=None):

    Story to which this slide belongs.

Returns:

plot(x, y, color=None, **kwargs)

Plot a variable against another.

Args:

  • x (str|int):

    Name or index of variable depicted on the horizontal axis.

  • y (str|int):

    Name or index of variable depicted on the vertical axis.

When the data frame is not provided, some plots are not available.

For example, when the object is created as part of a datastories.story.predict_kpis.Story, only plots containing aggregated data (e.g., box plots) for variables that are relevant to modelling (i.e., not discarded by the story) are available, while scatter plots are not.
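
For example, assuming plots is an object of this class (the column names are hypothetical):

# Plot one variable against another, colored by a third one.
plots.plot('Temperature', 'Yield', color='Machine')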

save(file_path)

Save the Pair-Wise Plots result to a versioned JSON file.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.

Args:

  • file_path (str):

    location of the output file

NOTE: This operation loses the data, stats and record_info_columns information. Upon loading, the .plot() method will be limited and no additional information will be displayed in tooltips, unless the data and the record_info_columns arguments are provided again.
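
A minimal sketch of a save/load round trip that re-provides the lost information (the file and column names are illustrative):

from datastories.story.generic import PairwisePlots

plots.save('pair_wise_plots.json')

# Re-provide the data and tooltip columns to restore full plotting.
plots = PairwisePlots.load('pair_wise_plots.json', data=df,
                           record_info_columns=['Batch ID'])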

property slide

A serializable representation of the pairwise plots that can be used together with a compatible renderer in order to visualize the results.

property visualization

The ‘Pair-Wise Plots’ visualization.


class datastories.story.generic.WhatIfs(what_ifs_file, minimize_drivers=None, maximize_drivers=None, driver_importance={}, prediction_file=None, data_file=None, record_info_file=None, stats_file=None, outlier_file=None, slide_deck=None, slide_name='WhatIfs')

An interactive what-ifs analysis for exploring the influence of driver values on the KPIs of a datastories.story.predict_kpis() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

property drivers

Get/set the current driver values.

maximize()

Select the driver values that maximize the overall KPIs.

minimize()

Select the driver values that minimize the overall KPIs.
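
For example, assuming run is a datastories.story.predict_kpis.StoryRun object:

wi = run.what_ifs

# Select the driver values that maximize the overall KPIs...
wi.maximize()

# ...then inspect (or override) the current driver values.
print(wi.drivers)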

property slide

A serializable representation of the what-ifs that can be used together with a compatible renderer in order to visualize the results.

property visualization

The ‘What-Ifs’ visualization.

Predict Multiple KPI Story Specific Results

class datastories.story.predict_kpis.Conclusion(stats=None, slide_deck=None, slide_name='Conclusion')

An overview of the story analysis conclusions.

Base classes:

Note: Objects of this class should not be manually constructed.

property metrics

The set of story analysis metrics.

NOTE: This is an alias for the .stats property.

property slide

A serializable representation of the story analysis conclusions that can be used together with a compatible renderer in order to visualize the results.

property stats

The set of story analysis statistics.

property visualization

The story analysis conclusions visualization.


class datastories.story.predict_kpis.DriverOverview(performance, drivers, driver_overlap, driver_correlations, number_of_vars, overview_file='driver_overview.json', slide_deck=None, slide_name='DriverOverview')

An overview of driver importance and KPI prediction metrics across one run of the datastories.story.predict_kpis() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

property correlations

An overview of driver correlations with other columns in the dataset.

property driver_overlap

An overview of driver importance overlap across runs.

property drivers

An overview of main drivers and their importance across KPIs and runs.

classmethod load(file_path)

Load a Driver Overview result from a versioned JSON file.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use a story-decoupled result only within a minor SDK version.

Args:

  • file_path (str):

    location of the input file.

Returns:

property metrics

The set of driver performance statistics.

property performance

An overview of KPI prediction error metrics across runs.

save(file_path)

Save the Driver Overview result to a versioned JSON file.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.

Args:

  • file_path (str):

    location of the output file.

property slide

A serializable representation of the overview that can be used together with a compatible renderer in order to visualize the results.

property stats

The set of driver performance statistics.

to_excel(file_path)

Exports the driver overview to an Excel file.

Args:

  • file_path (str):

    name of the file to export to.

to_json(html_safe=False)

Save the result as a JSON string.

Args:

  • html_safe (bool=False):

    set to True in order to produce a JSON string that is safe to embed in a HTML page as an attribute value.

Returns:

  • A JSON string containing the analysis results.

to_pandas()

Exports the driver overview to a Pandas DataFrame.

Returns:

  • The constructed Pandas DataFrame.

property visualization

The Driver Overview visualization.


class datastories.story.predict_kpis.ModelValidation(model_file, evaluations=[], evaluation_files=[], record_info_columns=[], training_data=None, slide_deck=None, slide_name='ModelValidation')

An overview of story model validation.

Base classes:

Note: Objects of this class should not be manually constructed.

property evaluations

Get/set the meta information of existing model evaluations.

property metrics

The set of model validation statistics.

property slide

A serializable representation of the model validation that can be used together with a compatible renderer in order to visualize the results.

property stats

The set of model validation statistics.

property training_validation

The name of the training validation associated with the model validation.

NOTE: Under normal circumstances, there should be only one training validation per model validation. However, this is not enforced. This property retrieves the first training validation found among the associated evaluations.


class datastories.story.predict_kpis.RunOverview(performance, drivers, driver_overlap, best_run, overview_file='run_overview.json', slide_deck=None, slide_name='RunOverview')

An overview of driver importance and KPI prediction metrics across all runs of a datastories.story.predict_kpis() analysis.

Base classes:

Note: Objects of this class should not be manually constructed.

property best_run

Overview of KPI prediction error metrics across runs.

property driver_overlap

Overview of driver importance overlap across runs.

property drivers

Overview of main drivers and their importance across KPIs and runs.

classmethod load(file_path)

Load a Run Overview result from a versioned JSON file.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use a story-decoupled result only within a minor SDK version.

Args:

  • file_path (str):

    location of the input file.

Returns:

property metrics

The set of driver performance statistics across analysis runs.

property performance

Overview of KPI prediction error metrics across runs.

save(file_path)

Save the Run Overview result to a versioned JSON file.

NOTE: There are no compatibility guarantees across result versions. It is generally safe to use decoupled results only within a minor SDK version.

Args:

  • file_path (str):

    location of the output file.

property slide

A serializable representation of the overview that can be used together with a compatible renderer in order to visualize the results.

property stats

The set of driver performance statistics across analysis runs.

to_excel(file_path)

Exports the run overview to an Excel file.

Args:

  • file_path (str):

    name of the file to export to.

to_html(file_path, title='Run Overview', subtitle=None, scenario=VisualizationScenario.REPORT)

Exports the visualization to a standalone HTML document.

Args:

  • file_path (str):

    name of the file to export to.

  • title (str=’Run Overview’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.

to_json(html_safe=False)

Save the result as a JSON string.

Args:

  • html_safe (bool=False):

    Set to True in order to produce a JSON string that is safe to embed in a HTML page as an attribute value.

Returns:

  • A JSON string containing the analysis results.

to_pandas()

Exports the run overview to a Pandas DataFrame.

Returns:

  • The constructed Pandas DataFrame.

property visualization

The Run Overview visualization.

Visualization

Display Utils

The datastories.display package contains a collection of display helpers.


datastories.display.wide_screen(width=0.95)

Make the notebook screen wider when running under Jupyter Notebook.

Args:

  • width (float=0.95):

    width of notebook as a fraction of the screen width. Should be in the interval [0,1].

Raises:

  • ValueError:

    when the [width] argument is outside the accepted interval.
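
For example, to use 90% of the screen width:

from datastories.display import wide_screen

wide_screen(0.9)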

datastories.display.init_graphics(should_embed=False, dslibs_location=None)

Initializes the DataStories graphics engine.

Use this method at the beginning of your notebooks (Jupyter, JupyterLab, Databricks) to trigger optimal rendering.

The component can be chosen to either embed the DataStories libraries (should_embed=True) or rely on its running environment (should_embed=False).

When should_embed=False (that is, the environment is required to be able to load the components), this method loads the scripts into the environment using the version embedded in the SDK. Components in NOTEBOOK mode are then loaded on the assumption that the environment contains sufficient resources.

When should_embed=True (that is, components should embed the DataStories library resources themselves), this method does not act on the HTML and behaves as follows: if dslibs_location is not provided (None), components will carry their own version of the libraries; otherwise, components will try to reach the provided end point.

Recommended usage: init_graphics(). On Databricks, it will set up the DataStories libraries in /dbfs/FileStore/DataStories/components_library/.

Args:

  • should_embed (bool=False):

    True if components should be responsible for embedding library resources, False otherwise (the environment is in charge).

  • dslibs_location (str=None):

    the reference to the DataStories libraries.

Returns:

Nothing

Effects:

This method has effects on the SDK state, and may also affect the HTML environment executing it. The latter effect is irreversible.
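
Example:

from datastories.display import init_graphics

init_graphics()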

datastories.display.export_javascript_library(file_path=None)

Export DataStories libraries as a JavaScript code.

Args:

  • file_path (str=None):

    the path of the JavaScript file that will contain the library code.

Returns:

  • The JavaScript library to load the DataStories components.


datastories.display.get_progress_bar(progress_bar)

Get a default progress bar implementation.

Args:

  • progress_bar (obj|bool):

    an object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).

Returns:

  • An object of type datastories.api.ProgressReporter.

class datastories.display.ProgressCounter

Base class implemented by all progress counters (including progress reporters).

Attributes:

  • total (int):

    the number of steps required for completion.

  • step (int):

    the current step.

  • start_time (int):

    the start time in ns.

  • stop_time (int):

    the stop time in ns.

increment(steps=1)

Registers a processing advance with a number of steps.

Args:

  • steps (int):

    the number of steps to advance.

start(total=1)

Initialize the progress range.

Args:

  • total (int):

    the number of steps required for completion.

stop()

Stop progress monitoring.

timeout()

Mark the step at which the execution timeout occurred.

Use this upon interrupting counting before reaching the end (i.e., step < total).
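
A minimal sketch of the counting life cycle, assuming counter is a concrete instance of this class:

counter.start(total=100)  # 100 steps required for completion
counter.increment(10)     # advance 10 steps
counter.stop()            # or counter.timeout() when interrupted early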

class datastories.display.ProgressReporter(observers=[])

Abstract base class implemented by all progress reporters.

Base classes:

Args:

  • observers (list):

    list of progress observers to be notified on progress updates.

property header

Get/set the current reporting header.

increment(steps=1)

Register a processing advance with a number of steps.

Args:

  • steps (int=1):

    number of advance steps.

log(message)

Log a progress message.

Args:

  • message (str):

    progress message to log.

on_progress(progress)

Log the completion percentage.

Args:

  • progress (float=None):

    completion percentage to be logged.

property progress

The currently reported progress.

report()

Notify observers on progress updates.

start(total=1)

Start progress reporting.

Args:

  • total (int=1):

    total number of steps required for completion.

property state

Get/set the currently reported state.

stop(info='')

Stop progress reporting.

Args:

  • info (str=’’):

    optional message to report.

class datastories.display.AggregatedReporter(stages=None, observers=None, display=True, bar_length=50)

A progress reporter that aggregates progress of a number of independent stages.

Base classes:

Stages are to be specified in the beginning, together with an estimation of the stage importance relative to the whole execution. The progress of each stage will be individually monitored and reported in the context of the whole execution.

Stages are to be identified and activated by setting the progress header.

Args:

  • stages (dict):

    a dictionary mapping local stage names to their bounds in the globally reported progress.

  • observers (list):

    list of observers to be notified about progress updates.

  • display (bool=True):

    set to False in order to disable progress display (e.g., when the display is done by observers).

  • bar_length (int=cfg):

    optional size of the progress bar. It defaults to the value specified in the SDK configuration settings (25 if no configuration settings are provided).

Example:

stages = {
    'Stage 1' : (0,50),
    'Stage 2' : (50,100)
}
reporter = AggregatedReporter(stages=stages)
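
Stages are then activated by setting the header; a sketch continuing the example above, assuming the inherited reporting methods and illustrative step counts:

reporter.header = 'Stage 1'  # activate the first stage
reporter.start(total=10)     # this stage has 10 local steps
reporter.increment(5)        # reported within the (0, 50) global range
reporter.stop()
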
property header

Get/set the progress report header.

log(message)

Log a progress message.

Args:

  • message (str):

    progress message to log.

on_progress(progress=None)

Log the completion percentage.

Args:

  • progress (float=None):

    completion percentage to be logged.

reset()

Reset the progress reporter.

Plots

The datastories.visualization package contains a collection of visualizations that facilitate the assessment of selected DataStories analysis results.

class datastories.visualization.VisualizableMixin(title='', subtitle='')

Mixin for classes that provide a visualization property.

Enables exporting to HTML and managing the visualization settings, and provides a Jupyter representation.

plot(*args, **kwargs)

Display an interactive visualization.

to_html(file_path, title=None, subtitle=None, scenario=VisualizationScenario.REPORT)

Exports the visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to output file.

  • title (str=None):

    HTML document title.

  • subtitle (str=None):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.

Raises:

property vis_settings

Get/set the visualization settings.

Raises:

abstract property visualization

The visualization.

class datastories.visualization.ColorScheme(value)

Enumeration of available color encoding schemes:

Possible values:

  • For discrete variable encoding:
    • DISCRETE_12

    • DISCRETE_12_LIGHT

    • DISCRETE_10

    • DISCRETE_8

    • DISCRETE_8_LIGHT

    • DISCRETE_8_ACCENT

  • For numeric variable encoding:
    • NUMERIC_RED_YELLOW_GREEN

    • NUMERIC_RED_YELLOW_BLUE

    • NUMERIC_RED_BLUE

    • NUMERIC_PINK_GREEN

    • NUMERIC_COLD_HOT


class datastories.visualization.ConclusionsSettings

Encapsulates visualization settings for datastories.visualization.Conclusions visualizations.

class datastories.visualization.Conclusions(conclusions=None, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of KPI drivers.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Conclusions visualization.

Accepts the same parameters as the constructor for datastories.visualization.ConclusionsSettings objects.

to_html(file_path, title='Conclusions', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Conclusions visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Conclusions’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


class datastories.visualization.ConfusionMatrixSettings(width=480, height=320)

Encapsulates visualization settings for datastories.visualization.ConfusionMatrix visualizations.

Args:

  • width (int=480):

    Graph width in pixels.

  • height (int=320):

    Graph height in pixels.

Attributes:

  • Same as the Args section above.

class datastories.visualization.ConfusionMatrix(prediction_performance, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of model accuracy for binary classification models.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Confusion Matrix visualization.

Accepts the same parameters as the constructor for datastories.visualization.ConfusionMatrixSettings objects.

to_html(file_path, title='Confusion Matrix', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Confusion Matrix visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Confusion Matrix’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


datastories.visualization.correlation_browser(file_path=None, raw_content=None, vis_settings=None)

Displays a Correlation Browser visualization in a Jupyter notebook based on an input correlation data file.

Args:

  • file_path (str=None):

    path to the input data file containing a serialized datastories.correlation.CorrelationResult object.

  • raw_content (str=None):

    a string, containing a JSON serialized datastories.correlation.CorrelationResult object.

  • vis_settings (obj=CorrelationBrowserSettings()):

    an object of type datastories.visualization.CorrelationBrowserSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.

NOTE: Either the [file_path] or [raw_content] argument has to be provided but not both.

Returns:

Raises:

  • ValueError:

    when both the [file_path] and the [raw_content] arguments are provided.

Example:

from datastories.visualization import correlation_browser
correlation_browser('correlations.json')

class datastories.visualization.CorrelationBrowserSettings(scale=1, node_opacity=0.9, edge_opacity=0.3, tension=0.65, font_size=15, filter_unconnected=False, min_weight=50, max_weight=100, weight_key='weightMI', show_controls=True, show_inspector=True)

Encapsulates visualization settings for datastories.visualization.CorrelationBrowser visualizations.

Args:

  • scale (float=1):

    Scale factor of the radius [0-1].

  • node_opacity (float=0.9):

    Opacity of the nodes that aren’t hovered or connected to hovered or selected nodes [0-1].

  • edge_opacity (float=0.3):

    Opacity of the edges that aren’t hovered or connected to hovered or selected nodes [0-1].

  • tension (float=0.65):

    The tension of the links. A tension of 0 means straight lines [0-1].

  • font_size (int=15):

    Font size used for the nodes of the plot [10-32];

  • filter_unconnected (boolean=False):

    Whether or not nodes that aren’t connected to any other node are filtered from the view.

  • min_weight (int=50):

    Minimum weight of the links that will be shown [0-100].

  • max_weight (int=100):

    Maximum weight of the links that will be shown [0-100].

  • weight_key (str=’weightMI’):

    Type of relations to display [‘weightMI’ for Mutual Information, ‘weightL’ for Linear Correlation].

  • show_controls (bool=True):

    Set to True in order to display relation controls.

  • show_inspector (bool=True):

    Set to True in order to display the relation inspector window.

Attributes:

  • Same as the Args section above.

class datastories.visualization.CorrelationBrowser(correlation_result=None, raw_content=None, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of correlation between features.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Correlation Browser visualization.

Accepts the same parameters as the constructor for datastories.visualization.CorrelationBrowserSettings objects.
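
For example, to show only strong linear relations, assuming cb is an object of this class (the parameter names are those of datastories.visualization.CorrelationBrowserSettings above):

cb.plot(weight_key='weightL', min_weight=70, show_inspector=False)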

to_html(file_path, title='Correlation Browser', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Correlation Browser visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Correlation Browser’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


class datastories.visualization.DataHealthSettings(page_size=25)

Encapsulates visualization settings for datastories.visualization.DataHealth visualizations.

Args:

  • page_size (int=25):

    Maximum number of columns to display on one summary page.

Attributes:

  • Same as the Args section above.

class datastories.visualization.DataHealth(data_health=None, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of data health report.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Data Health visualization.

Accepts the same parameters as the constructor for datastories.visualization.DataHealthSettings objects.

to_html(file_path, title='Data Health', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Data Health visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Data Health’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


class datastories.visualization.DataSummaryTableSettings(page_size=25, show_console=True)

Encapsulates visualization settings for datastories.visualization.DataSummaryTable visualizations.

Args:

  • page_size (int=25):

    Maximum number of columns to display on one summary page.

  • show_console (bool=True):

    Set to True in order to display the visualization console.

Attributes:

  • Same as the Args section above.

class datastories.visualization.DataSummaryTable(summary=None, column_stats=None, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of data frame summary.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Data Summary visualization.

Accepts the same parameters as the constructor for datastories.visualization.DataSummaryTableSettings objects.

to_html(file_path, title='Data Summary', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Data Summary visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Data Summary’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


datastories.visualization.driver_overview(file_path=None, raw_content=None, vis_settings=None)

Displays a DriverOverview visualization in a Jupyter notebook based on an input driver overview data file.

Args:

  • file_path (str=None):

    path to the input driver overview data file;

  • raw_content (str=None):

    a string containing the JSON serialized driver overview data;

  • vis_settings (obj=None):

    an object of type datastories.visualization.DriverOverviewSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.

NOTE: Either the [file_path] or [raw_content] argument has to be provided but not both.

Returns:

Raises:

  • ValueError:

    when both the [file_path] and the [raw_content] arguments are provided.

Example:

from datastories.visualization import driver_overview
driver_overview('driver_overview.json')

class datastories.visualization.DriverOverviewSettings(height=600)

Encapsulates visualization settings for datastories.visualization.DriverOverview visualizations.

Args:

  • height (int=600):

    Graph height in pixels;

Attributes:

  • Same as the Args section above.

class datastories.visualization.DriverOverview(driver_overview=None, raw_content=None, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of KPI drivers.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Driver Overview visualization.

Accepts the same parameters as the constructor for datastories.visualization.DriverOverviewSettings objects.

to_html(file_path, title='Driver Overview', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Driver Overview visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Driver Overview’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


class datastories.visualization.ErrorPlotSettings(sort_key='id', highlight_outliers=True, display_confidence_interval=True, connect_dots=False, width=900, height=300)

Encapsulates visualization settings for datastories.visualization.ErrorPlot visualizations.

Args:

  • sort_key (str=’id’):

    The sorting criterion for the X axis. Possible values:

    • 'id': sort on record id.

    • 'actual': sort on record actual KPI value.

    • 'predicted': sort on record predicted value.

  • highlight_outliers (bool=True):

    set to True if outliers should be highlighted.

  • display_confidence_interval (bool=True):

    set to True if confidence limits should be displayed.

  • connect_dots (bool=False):

    set to True if data points should be connected by lines.

  • width (int=900):

    plot width in pixels.

  • height (int=300):

    plot height in pixels.

Attributes:

  • Same as the Args section above.

class datastories.visualization.ErrorPlot(pva=None, metrics=None, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of prediction error for prediction models.

Both regression and classification models are supported.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Error Plot visualization.

Accepts the same parameters as the constructor for datastories.visualization.ErrorPlotSettings objects.

to_html(file_path, title='Error Plot', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Error Plot visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Error Plot’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


class datastories.visualization.FeatureRanksTableSettings(height=460, show_console=True)

Encapsulates visualization settings for datastories.visualization.FeatureRanksTable visualizations.

Args:

  • height (int=460):

    graph height in pixels.

  • show_console (bool=True):

    displays the visualization console where update operations are logged.

Attributes:

  • Same as the Args section above.

class datastories.visualization.FeatureRanksTable(feature_ranks, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of feature ranking.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Feature Ranking visualization.

Accepts the same parameters as the constructor for datastories.visualization.FeatureRanksTableSettings objects.

to_html(file_path, title='Feature Ranking', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Feature Ranking visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Feature Ranking’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


class datastories.visualization.OutlierPlotSettings(width=800, height=200, x_padding=0.2, y_padding=0.2, marker_size=32, hover_marker_size_delta=32, animations=500, show_jitter=True, show_cdf=True, show_iqr=True, show_summary=True, show_console=True, show_legend=True, low_threshold=0.05, high_threshold=0.95)

Encapsulates visualization settings for datastories.visualization.OutlierXPlot visualizations.

Args:

  • width (int=800):

    graph width in pixels.

  • height (int=200):

    graph height in pixels.

  • x_padding (float=0.2):

    padding on horizontal axis.

  • y_padding (float=0.2):

    padding on vertical axis.

  • marker_size (int=32):

    size of the point marker.

  • hover_marker_size_delta (int=32):

    increase of the point marker size on hover.

  • animations (int=500):

    animation duration in milliseconds.

  • show_jitter (bool=True):

    set to True to add jitter in the vertical dimension, to better distinguish points.

  • show_cdf (bool=True):

    set to True to display the cumulative distribution function.

  • show_iqr (bool=True):

    set to True to display the inter-quartile range, as specified in the lower and higher threshold arguments.

  • show_summary (bool=True):

    set to True to display the summary table.

  • show_console (bool=True):

    set to True to display the visualization console where update operations are logged.

  • low_threshold (float=0.05):

    the lower threshold for the inter-quartile range.

  • high_threshold (float=0.95):

    the upper threshold for the inter-quartile range.

Attributes:

  • Same as the Args section above.

class datastories.visualization.OutlierXPlot(outliers_result, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of outliers resulting from a one dimensional analysis.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Outliers visualization.

Accepts the same parameters as the constructor for datastories.visualization.OutlierPlotSettings objects.

to_html(file_path, title='Outliers', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Outliers visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Outliers’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


datastories.visualization.plot_xy(data, x, y, color=None, info_columns=None, **kwargs)

Create an X vs Y plot.

Args:

  • data (obj):

    A Pandas data frame containing the data to be visualized.

  • x (str|int):

    Name or index of the variable for the horizontal axis.

  • y (str|int):

    Name or index of the variable for the vertical axis.

  • color (str|int=None):

    Optional name or index for a variable to be used for encoding in the color dimension.

  • info_columns (list):

    Optional list of names or indices of columns to be used to provide additional info (e.g., in tooltips).

  • kwargs (dict):

    Dictionary of additional options to be used for configuring the visualization. See datastories.visualization.PairWisePlotSettings for a complete list.
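
For example (the column names are hypothetical; width is one of the datastories.visualization.PairWisePlotSettings options):

import pandas as pd
from datastories.visualization import plot_xy

df = pd.read_csv('example.csv')
plot_xy(df, x='Temperature', y='Yield', color='Machine', width=800)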

class datastories.visualization.PairWisePlotSettings(width=600, height=400, color_scheme=ColorScheme.DEFAULT)

Encapsulates visualization settings for datastories.visualization.PairWisePlot visualizations.

Args:

  • width (int=600):

    Graph width in pixels.

  • height (int=400):

    Graph height in pixels.

  • color_scheme (enum=ColorScheme.DEFAULT):

    A value of type datastories.visualization.ColorScheme indicating the color encoding scheme.

Attributes:

  • Same as the Args section above.

class datastories.visualization.PairWisePlot(plot_json, data=None, record_info_columns=None, show_navigator=False, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of two variable relations.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the ‘Pair-Wise Plots’ visualization.

Accepts the same parameters as the constructor for datastories.visualization.PairWisePlotSettings objects.

to_html(file_path, title='', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Pair-Wise Plot visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


class datastories.visualization.PredictedVsActualSettings(highlight_outliers=True, show_metrics=True, width=600)

Encapsulates visualization settings for datastories.visualization.PredictedVsActual visualizations.

Args:

  • highlight_outliers (bool=True):

    set to True if outliers should be highlighted.

  • show_metrics (bool=True):

    set to True if prediction performance metrics should be displayed.

  • width (int=600):

    graph width in pixels.

Attributes:

  • Same as the Args section above.

class datastories.visualization.PredictedVsActual(pva=None, metrics=None, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of model accuracy for prediction models.

Both regression and classification models are supported.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Predict vs Actual visualization.

Accepts the same parameters as the constructor for datastories.visualization.PredictedVsActualSettings objects.

to_html(file_path, title='Predicted vs Actual', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Predict vs Actual visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Predicted vs Actual’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


class datastories.visualization.PrototypeTableSettings(height=320, show_console=True, selectable=True, condensed=True)

Encapsulates visualization settings for datastories.visualization.PrototypeTable visualizations.

Args:

  • height (int=320):

    graph height in pixels.

Attributes:

  • Same as the Args section above.

class datastories.visualization.PrototypeTable(prototypes, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation of feature prototypes.

Note: Objects of this class should not be manually constructed.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Attributes:

plot(*args, **kwargs)

Convenience function to set-up and display the Prototypes visualization.

Accepts the same parameters as the constructor for datastories.visualization.PrototypeTableSettings objects.

to_html(file_path, title='Prototypes', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the Prototypes visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’Prototypes’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.


datastories.visualization.what_ifs(file_path=None, raw_content=None, init_values=None, minimize_values=None, maximize_values=None, vis_settings=None)

Displays a What-Ifs visualization in a Jupyter notebook based on an input RSX model file.

Args:

  • file_path (str=None):

    path to the input RSX model file. If None, the [raw_content] argument has to be provided.

  • raw_content (bytes=None):

    a bytes object, containing the source of the backing RSX model.

  • init_values (list=[]):

    list of initial driver values;

  • minimize_values (list=None):

    driver values that minimize the KPI.

  • maximize_values (list=None):

    driver values that maximize the KPI.

  • vis_settings (obj=WhatIfsSettings()):

    An object of type datastories.visualization.WhatIfsSettings containing the initial visualization settings.

NOTE: Either the [file_path] or [raw_content] argument has to be provided but not both.

Returns:

Raises:

  • ValueError:

    when both the [file_path] and the [raw_content] arguments are provided.

Example:

from datastories.visualization import what_ifs
what_ifs('my_model.rsx')

class datastories.visualization.WhatIfsSettings(show_controls=True, show_console=True, show_optimizer=False)

Encapsulates visualization settings for datastories.visualization.WhatIfs visualizations.

Args:

  • show_controls (bool=True):

    Set to True in order to display the visualization controls.

  • show_console (bool=True):

    Set to True in order to display the visualization console.

  • show_optimizer (bool=False):

    Set to True in order to enable the optimizer functionality.

Attributes:

  • Same as the Args section above.

class datastories.visualization.WhatIfs(init_values=None, minimize_values=None, maximize_values=None, driver_importances=None, raw_model=None, vis_settings=None, *args, **kwargs)

Encapsulates a visual representation for exploring the influence of driver variables on target KPIs.

One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.

Note: Objects of this class should not be manually constructed.

property drivers

Get/set the driver values.

maximize()

Identify a set of driver values that maximize the KPI.

minimize()

Identify a set of driver values that minimize the KPI.

plot(*args, **kwargs)

Convenience function to set-up and display the What-Ifs visualization.

Accepts the same parameters as the constructor for datastories.visualization.WhatIfsSettings objects.

to_html(file_path, title='What-Ifs', subtitle='', scenario=VisualizationScenario.REPORT)

Exports the What-Ifs visualization to a standalone HTML document.

Args:

  • file_path (str):

    path to the output file.

  • title (str=’What-Ifs’):

    HTML document title.

  • subtitle (str=’’):

    HTML document subtitle.

  • scenario (enum=VisualizationScenario.REPORT):

    A value of type :class:datastories.api.VisualizationScenario to indicate the use scenario.

MLflow Support

datastories.mlflow.save_model(ds_model, path, conda_env=None, mlflow_model=None, **kwargs)

Save a story as an MLflow model.

If not provided, the conda environment will be created with minimal dependencies on the DataStories SDK library and mlflow.

Args:

  • ds_model:

    the DataStories model to save.

  • path:

    local path where the model is saved.

  • conda_env:

    conda environment. If None, one will be created.

  • mlflow_model:

    the MLflow model to use. If None, a default is used.

datastories.mlflow.log_model(ds_model, artifact_path, conda_env=None, registered_model_name=None, **kwargs)

Log a DataStories model to MLflow, using the datastories.mlflow module. This method is called automatically when autologging is on.

Args:

  • ds_model:

    the DataStories model to be saved (required).

  • artifact_path:

    see mlflow.models.Model.log; on autologging, ‘model’ is used.

  • conda_env:

    see mlflow.models.Model.log (optional).

  • registered_model_name:

    see mlflow.models.Model.log (optional).

datastories.mlflow.load_model(model_uri, **kwargs)

Load an MLflow DataStories model.

The returned model implements the MLflow model API, and additionally allows retrieving the underlying DataStories story object.

datastories.mlflow.autolog(turn_off=False)

Enable auto logging of MLflow models.

When autologging is on, the KPIs associated with the story are logged as parameters, as a long chain of comma-separated names.

Subsequent calls of datastories.story.predict_kpis will log an MLflow model under the path model.

It is not required to manually open an MLflow run when autologging is activated, but doing so allows the user to log additional parameters and information.

Args:

  • turn_off (bool=False):

    when True, auto-logging is turned off.
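
A minimal sketch of autologging (the data frame and the predict_kpis argument names are illustrative assumptions):

import datastories.mlflow
import datastories.story

datastories.mlflow.autolog()

# A subsequent analysis is logged as an MLflow model under the path
# 'model'; the kpis argument name is assumed for illustration.
story = datastories.story.predict_kpis(df, kpis=['Yield'])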


class datastories.mlflow.optimization.OptimizationModel(model, optimization_spec, variable_ranges=None)

The OptimizationModel class encodes an optimization specification object for MLflow processing.

Such an object is made of a DataStories model and an Optimization Specification. It can optionally be enriched with input ranges, in case the constraints on the input variables cannot be encoded in the specification.

This object is intended to be logged by an MLflow experiment. It can also be used as a shortcut for the optimization feature.

Args:

  • model:

    the DataStories model to be optimized; should be a BasePredictor or a Model.

  • optimization_spec:

    the OptimizationSpecification object that encodes the objectives and the constraints.

  • variable_ranges:

    additional constraints on inputs, that are less strict than the specification constraints (default is None).

datastories.mlflow.optimization.save_model(ds_optimization_model, path, conda_env=None, mlflow_model=None, **kwargs)

Save a DataStories Optimization model as an MLflow model.

If not provided, the conda environment will be created with minimal dependencies on the DataStories SDK library and mlflow.

Args:

  • ds_optimization_model:

    the DataStories Optimization model to save.

  • path:

    local path where the model is saved.

  • conda_env:

    conda environment. If None, one will be created.

  • mlflow_model:

    the MLflow model to use. If None, a default is used.

datastories.mlflow.optimization.log_model(ds_optimization_model, artifact_path='model', conda_env=None, registered_model_name=None, **kwargs)

Log a DataStories Optimization model to MLflow, using the datastories.mlflow.optimization module.

Args:

  • ds_optimization_model:

    the DataStories Optimization model to be saved (required).

  • artifact_path:

    see mlflow.models.Model.log.

  • conda_env:

    see mlflow.models.Model.log (optional).

  • registered_model_name:

    see mlflow.models.Model.log (optional).

datastories.mlflow.optimization.load_model(model_uri, **kwargs)

Load an MLflow DataStories Optimization model.

The returned model implements the MLflow model API. It is the combination of an optimizer and a model.
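
A minimal sketch of logging and later reloading such a model, assuming opt_model is an OptimizationModel instance and an active MLflow run (the run URI is illustrative):

import datastories.mlflow.optimization as ds_opt

ds_opt.log_model(opt_model, artifact_path='model')

# Later: reload by MLflow model URI; the result combines an
# optimizer and a model.
loaded = ds_opt.load_model('runs:/<run_id>/model')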