SDK Reference¶
General Interfaces¶
-
datastories.api.
get_version
()¶ Get the version of the currently loaded modules.
Returns:
- A dictionary containing loaded modules and corresponding versions
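As an illustration of the returned shape, a module-to-version map like this can be assembled with the standard library. This is a sketch only, not the SDK implementation; the helper name and the package names passed to it are hypothetical.

```python
# Illustrative sketch: building a {module: version} dictionary of the shape
# returned by get_version(). Not the DataStories implementation.
from importlib import metadata

def collect_versions(package_names):
    """Return a dict mapping each package name to its installed version
    string, or None when the package is not installed."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions
```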
Base classes and interfaces¶
-
class
datastories.api.
IAnalysisResult
¶ Interface implemented by all analysis results.
-
plot
(*args, **kwargs)¶ Plots a graphical representation of the results in Jupyter Notebook.
-
to_csv
(file_path, delimiter=',', decimal='.')¶ Export the result to a CSV file.
Args:
- file_path (str):
- path to the output file.
- delimiter (str=','):
- character used as the value delimiter.
- decimal (str='.'):
- character used as the decimal point.
Raises:
ValueError
:- when the object returned by to_pandas is not a
Pandas
data frame.
-
to_excel
(file_path, tab_name='Statistics')¶ Export the result to an Excel file.
Args:
- file_path (str):
- path to the output file.
- tab_name (str='Statistics'):
- name of the Excel tab where to save the result.
Raises:
ValueError
:- when the object returned by to_pandas is not a
Pandas
data frame.
-
to_html
(file_path, title='', subtitle='')¶ Exports the analysis result visualization to a standalone HTML document.
Args:
- file_path (str):
- path to the output file.
- title (str=''):
- HTML document title.
- subtitle (str=''):
- HTML document subtitle.
-
to_pandas
()¶ Exports the result to a Pandas DataFrame.
Returns:
- The constructed Pandas DataFrame.
-
to_txt
(file_path)¶ Export the result to a TXT file.
Args:
- file_path (str):
- path to the output file.
-
-
class
datastories.api.
IConsole
¶ Interface implemented by all message loggers.
-
log
(message)¶ Log a message to the console.
Args:
- message (string):
- the message to log.
-
-
class
datastories.api.
IPrediction
(data)¶ Bases:
datastories.api.interface.IAnalysisResult
Interface implemented by all prediction results.
Args:
- data (obj):
- The associated prediction input data.
-
metrics
¶ A dictionary containing prediction performance metrics.
These metrics are computed when the data frame used for prediction includes KPI values, for the purpose of evaluating the model prediction performance.
-
class
datastories.api.
IPredictiveModel
¶ Interface implemented by all prediction models.
-
metrics
¶ A dictionary containing model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
-
predict
(data_frame)¶ Predict the model KPI on a new data frame.
Args:
- data_frame (obj):
- the data frame on which the model associated KPI is to be predicted.
Returns:
- An object of type
datastories.regression.PredictionResult
encapsulating the prediction results.
Raises:
ValueError
:- when not all required columns are provided.
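The ValueError above is raised when the supplied data frame lacks columns the model needs. A hedged sketch of such a pre-check; the helper name is hypothetical and not part of the SDK:

```python
def check_required_columns(frame_columns, required_columns):
    """Raise ValueError when any model input column is missing from the
    supplied data frame, mirroring the predict() contract described above.
    Illustrative sketch only."""
    missing = sorted(set(required_columns) - set(frame_columns))
    if missing:
        raise ValueError(f"missing required columns: {missing}")
```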
-
to_cpp
(file_path)¶ Export the model to a C++ file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_excel
(file_path)¶ Export the model to an Excel file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_matlab
(file_path)¶ Export the model to a MATLAB file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_py
(file_path)¶ Export the model to a Python file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_r
(file_path)¶ Export the model to an R file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
-
class
datastories.api.
IStory
(params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)¶ Bases:
datastories.api.interface.IAnalysisResult
Interface implemented by all story analyses.
Args:
- params (dict):
- dictionary containing user and inferred analysis parameters.
- metainfo (dict):
- dictionary containing process parameters (e.g., progress pointers).
- raw_results (dict):
- dictionary containing rainstorm processing results.
- results (dict):
- dictionary containing processing results.
- folder (str=None):
- the story working folder. Leave unspecified to create one at runtime.
- notes (list=[]):
- a list of notes.
- upload_function (callback=None):
- a function to upload files to a storage (relevant for the client).
- on_snapshot (callback=None):
- a callback to be executed upon saving a snapshot (e.g., upload snapshot to S3).
- progress_bar (obj=None):
- a progress bar object.
-
add_note
(note)¶ Add an annotation to the story results.
The already present annotations can be retrieved using the datastories.api.IStory.notes() property.
Args:
- note (str):
- the annotation to be added.
-
clear_note
(note_id)¶ Remove a specific annotation associated with the story analysis.
Args:
- note_id (int):
- the index of the note to be removed.
Raises:
ValueError
:- when the note index is unknown.
-
clear_notes
()¶ Clear the annotations associated with the story analysis.
-
info
¶ Displays story execution information.
-
static
is_compatible
(current_version_string, ref_version_string)¶ Checks if a story version is compatible with a reference version.
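The documentation does not spell out the compatibility rule. A common convention, shown here purely as an illustration and possibly different from what is_compatible actually does, is to require matching major versions:

```python
def versions_compatible(current_version_string, ref_version_string):
    """Illustrative sketch only: treat two versions as compatible when the
    major components match and the current minor is not older than the
    reference. The actual rule used by IStory.is_compatible may differ."""
    cur_major, cur_minor = (int(p) for p in current_version_string.split(".")[:2])
    ref_major, ref_minor = (int(p) for p in ref_version_string.split(".")[:2])
    return cur_major == ref_major and cur_minor >= ref_minor
```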
-
classmethod
load
()¶ Loads a previously saved story.
-
metrics
¶ Returns a set of metrics computed during analysis.
-
notes
¶ A list of all annotations currently associated with the story analysis.
-
reset
()¶ Reset the execution pointer of a story to the first stage.
-
run
(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)¶ Resumes the execution of a story from a given stage.
The stage to resume from is optional. If not specified, the story is executed from the beginning. If the stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed is used as the starting point, unless the [strict] argument is set to True. In that case an exception is raised if the execution cannot be resumed from the requested stage.
Args:
- resume_from (StoryProcessingStage=None):
- The stage to resume execution from. Should be a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.
- strict (bool=False):
- Raise an error if execution cannot be resumed from the requested stage.
- params (dict={}):
- Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.
- progress_bar (obj=None):
- An object of type
datastories.display.ProgressReporter
to replace the currently used progress reporter. When not specified, the current story progress reporter is left unmodified. This is useful for setting a progress bar after the story is loaded, when one cannot be given to the load function directly (e.g., when the progress bar has to be constructed based on the story).
- check_interrupt (func=None):
- an optional callback to check whether analysis execution needs to be interrupted.
Raises:
datastories.api.errors.StoryError
:- if a stage is specified for which no intermediate results are available and the [strict] argument is set to True.
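The fallback rule described above can be sketched as follows. This is a simplified illustration with plain integer stage handles (the SDK uses StoryProcessingStage values), not the actual implementation:

```python
def resolve_resume_stage(requested, stages_with_results, strict=False):
    """Sketch of the resume rule: if the requested stage has no intermediate
    results, fall back to the closest earlier stage that does, unless strict
    mode is on."""
    if requested is None:
        return 0  # no interruption point recorded: start from the beginning
    if requested in stages_with_results:
        return requested
    if strict:
        raise RuntimeError(f"cannot resume from stage {requested}")
    # Closest earlier stage for which results are available.
    earlier = [s for s in stages_with_results if s < requested]
    return max(earlier) if earlier else 0
```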
-
save
(file_path)¶ Saves the story analysis results.
-
stats
¶ Returns a set of stats computed during analysis.
-
class
datastories.api.
IStoryDeprecated
(notes=None)¶ Bases:
datastories.api.interface.IAnalysisResult
Interface implemented by all story analyses.
Args:
- notes (list=[]):
- a list of notes.
-
add_note
(note)¶ Add an annotation to the story results.
The already present annotations can be retrieved using the datastories.api.IStory.notes() property.
Args:
- note (str):
- the annotation to be added.
-
clear_note
(note_id)¶ Remove a specific annotation associated with the story analysis.
Args:
- note_id (int):
- the index of the note to be removed.
Raises:
ValueError
:- when the note index is unknown.
-
clear_notes
()¶ Clear the annotations associated with the story analysis.
-
static
is_compatible
(current_version_string, ref_version_string)¶ Checks if a story version is compatible with a reference version.
-
static
load
(file_path)¶ Loads a previously saved story.
-
metrics
¶ Returns a set of metrics computed during analysis.
-
notes
¶ A list of all annotations currently associated with the story analysis.
-
save
(file_path)¶ Saves the story analysis results.
-
class
datastories.api.
IProgressObserver
¶ Interface implemented by all progress report observers.
-
on_progress
(progress)¶ Callback triggered upon progress update.
Args:
- progress (float):
- the amount of progress. Possible values: [0-1]
-
-
class
datastories.api.
ISlide
(slide_deck=None, file_path='slide.json')¶ Interface implemented by slides.
A slide is a collection of data and references to data that a renderer can transform into a visual representation.
Args:
- slide_deck (obj=None):
- a
datastories.api.SlideDeck
object used to manage the slide.
- file_path (str=’slide.json’):
- path to a file to be used for serializing the slide.
-
slide
¶ Retrieves the slide content.
The slide content is a versioned and serializable entity that can be used to visualize the slide without requiring access to the object itself.
NOTE: This information cannot be used to construct the object by deserialization.
-
class
datastories.api.
SlideDeck
¶ Base class for slide decks.
A slide deck is a convenience component that facilitates managing a collection of slides.
-
add_slide
(slide)¶ Adds a slide to the deck.
Args:
- slide (datastories.api.ISlide):
- the slide to be added.
-
clear_slides
()¶ Remove the slides in the deck.
-
goto_slide
(slide_idx)¶ Sets the current slide pointer to a specific value.
Args:
- slide_idx (int):
- the new value for the slide pointer.
-
has_slides
()¶ Check if the slide deck contains any slides (i.e., it is not empty).
Returns:
- True if the slide deck contains slides, otherwise False.
-
insert_slide
(pos_idx, slide)¶ Inserts a slide in the deck at a given position.
Args:
- pos_idx (int):
- the index at which position the slide is to be inserted.
- slide (datastories.api.ISlide):
- the slide to be inserted.
-
next_slide
()¶ Retrieves the next slide in the deck and advances the slide pointer.
If the deck is at the end, or has no slides it returns None.
Returns:
- The next slide in the deck or None.
-
slides
¶ Retrieves the deck slides.
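The slide-pointer semantics documented above (add_slide, goto_slide, and next_slide returning None at the end of the deck) can be sketched in a self-contained class. This is an illustration of the contract, not the SDK implementation:

```python
class SimpleSlideDeck:
    """Self-contained sketch of the SlideDeck pointer behaviour."""

    def __init__(self):
        self._slides = []
        self._pos = 0  # current slide pointer

    def add_slide(self, slide):
        self._slides.append(slide)

    def has_slides(self):
        return bool(self._slides)

    def goto_slide(self, slide_idx):
        self._pos = slide_idx

    def next_slide(self):
        if self._pos >= len(self._slides):
            return None  # deck is empty or at the end
        slide = self._slides[self._pos]
        self._pos += 1
        return slide
```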
-
-
class
datastories.core.utils.
ExportableMixin
¶ -
to_csv
(file_path, delimiter=',', decimal='.', df=None)¶ Export the result to a CSV file.
Args:
- file_path (str):
- path to the output file.
- delimiter (str=','):
- character used as the value delimiter.
- decimal (str='.'):
- character used as the decimal point.
- df (pandas=None):
- data frame to export. If left unspecified, the data frame returned by the to_pandas method of the object is used.
Raises:
ValueError
:- when the serialized object is not a
Pandas
data frame
-
to_excel
(file_path, tab_name='Statistics', df=None)¶ Export the result to an Excel file.
Args:
- file_path (str):
- path to the output file.
- tab_name (str='Statistics'):
- name of the Excel tab where to save the result.
- df (pandas=None):
- data frame to export. If left unspecified, the data frame returned by the to_pandas method of the object is used.
Raises:
ValueError
:- when the serialized object is not a
Pandas
data frame
-
-
class
datastories.core.utils.
ManagedObject
(dependencies=None, *args, **kwargs)¶ An object that has a user controllable lifespan.
Typically inherited by classes that require special resources to be allocated and manually released outside the Python object lifetime management.
Note: Objects of this class should not be manually constructed.
-
assert_alive
()¶ Triggers an exception if the object has been manually released.
-
release
()¶ Releases the object associated storage.
Note: This function should only be used in order to force releasing allocated resources. Using the object after this point would lead to an exception.
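The assert_alive/release contract can be sketched in a few lines. This is illustrative only; the real class manages native resources rather than a flag:

```python
class Managed:
    """Sketch of the ManagedObject contract: after release(), any use that
    calls assert_alive() raises an exception."""

    def __init__(self):
        self._alive = True

    def assert_alive(self):
        if not self._alive:
            raise RuntimeError("object has been released")

    def release(self):
        # A real implementation would free native resources here.
        self._alive = False

    def compute(self):
        self.assert_alive()
        return 42
```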
-
-
class
datastories.core.utils.
StorageBackedObject
(folder=None, files=None, *args, **kwargs)¶ An object that stores part of its resources on disk and loads them on demand.
The resources may be provided by the object dependencies or by the object associated storage. When resources are specified, the object can be made independent from its dependencies by copying the listed resources to its associated storage.
Note: Objects of this class should not be manually constructed.
-
make_independent
(base_folder='')¶ Make object independent by copying required resources to the own folder.
Args:
- base_folder (str=’’):
- the base folder for the unique object folder that will hold the required resources.
-
Errors¶
-
class
datastories.api.errors.
DatastoriesError
(value='')¶ Base exception class for the DataStories SDK.
-
class
datastories.api.errors.
ObjectError
(value='')¶ Exception generated when SDK managed objects are not valid.
-
class
datastories.api.errors.
LicenseError
¶ Exception generated when accessing license protected functionality using an invalid license.
-
class
datastories.api.errors.
ConversionError
(value='')¶ Error raised when data conversion fails.
-
class
datastories.api.errors.
VisualizationError
(value='')¶ Error raised when result visualization fails.
-
class
datastories.api.errors.
StoryError
(value='')¶ Base class for all story analysis related errors.
-
class
datastories.api.errors.
StoryDataLoadingError
(value='')¶ Exception generated when a story analysis cannot load the provided input data.
-
class
datastories.api.errors.
StoryDataPreparationError
(value='')¶ Exception generated when a story analysis cannot preprocess the provided data.
-
class
datastories.api.errors.
StoryProcessingError
(value='')¶ Exception generated when a story analysis cannot be performed.
-
class
datastories.api.errors.
StoryInterrupted
(value='')¶ Exception generated when a story analysis execution is interrupted.
-
class
datastories.api.errors.
ParserError
(value='')¶ Base class for all file parsing and validation related errors.
-
class
datastories.api.errors.
FormatError
(value='')¶ Error raised when the provided file is not in a readable format (unreadable csv, …)
-
class
datastories.api.errors.
ValidationError
(value='')¶ Error raised when the parser was able to read the file structure, but an error occurred during validation.
-
class
datastories.api.errors.
TypeNotRecognized
(value='')¶ Error raised when the SDK parser cannot determine the provided file type.
-
class
datastories.api.errors.
TypeNotSupported
(value='')¶ Error raised when the provided file type cannot be handled by the SDK parser.
-
class
datastories.api.errors.
ExternalDataConnectionError
(value='')¶ Error raised when VBA scripts or an external data connection is detected in a spreadsheet.
License Management¶
-
datastories.api.
get_activation_info
()¶ Get information required to create and activate a DataStories license.
- Returns:
dict: a dictionary containing data to be submitted to the DataStories representative in charge of issuing the license.
The datastories.license package contains a collection of utility functions to facilitate license management.
These functions are available as methods of a predefined object of class datastories.license.LicenseManager called manager.
Example:
from datastories.license import manager
manager.initialize('my_license.lic')
manager
-
class
datastories.license.
LicenseManager
(license_file_path=None)¶ Encapsulates the DataStories license manager.
The license manager enables users to inspect the details of their installed DataStories SDK license, and to use license keys that are not available in the standard installation locations (see Installation)
This class should not be instantiated directly. Instead one should use the already available object instance datastories.license.manager.
Args:
- license_file_path (str = None):
- the path to a license key file or folder if other than the standard locations for the platform.
Attributes:
- status (str):
- the status of the license manager initialization.
- license (obj):
- the managed license as indicated in the license key file.
Example:
from datastories.license import manager
manager.initialize('my_license.lic')
manager
-
default_license_path
¶ Default path used for license initialization if none provided.
-
initialize
(license_file_path=None, initialize_modules=True)¶ Initialize the license manager with a license key at a specific location.
Args:
- license_file_path (string):
- the path to a license key file or a folder containing the license key file.
- initialize_modules (bool=True):
- set to
True
in order to initialize dependent modules.
Raises:
ValueError
:- when the provided
license_file_path
is not accessible.
-
is_granted
(option)¶ Checks if execution rights are granted for license protected functionality.
Args:
- option (str):
- the license option required by the protected functionality.
Returns:
True
if execution rights are granted by the installed license.
-
is_ok
¶ Check the initialization status of the license manager.
The license manager initialization fails when no valid license file is found in the standard or user indicated locations.
Note: A successful license manager initialization does not imply a grant for using license protected functionality. For example, when an expired license is used, the initialization is still successful. To check whether execution rights are granted one should use the datastories.license.LicenseManager.is_granted() method.
Returns:
True
if the license manager was successfully initialized.
-
reinitialize
()¶ Re-initializes the license manager.
This is done using the same license file path as in the previous call to
datastories.license.LicenseManager.initialize()
.
-
release
()¶ Releases the currently held licenses.
This can be useful e.g., when using floating or counted licenses, as it makes the released licenses available for other clients or processes.
Note: once a license is released, the associated execution rights are retracted. In order to use the license protected functionality, users need to acquire the license, by initializing the license manager again (i.e.,
datastories.license.LicenseManager.initialize()
).
Data¶
The datastories.data
package contains a collection
of classes and functions for handling data and converting
it to and from the internal format used by DataStories.
Base Classes¶
-
class
datastories.data.
DataFrame
¶ Encapsulates a data frame in the DataStories format.
Args:
- rows (int):
- number of rows in the data frame.
- cols (int):
- number of columns in the data frame.
- types (list):
- list of value types for the data frame columns.
-
cols
¶ Retrieves the number of columns in the data frame.
-
columns
¶ Retrieves the list of data frame column names.
-
from_pandas
¶ Construct a new datastories.data.DataFrame from a Pandas DataFrame object.
Args:
- data_frame (obj):
- the source Pandas DataFrame object.
Returns:
- The constructed datastories.data.DataFrame object.
-
get
¶ Get the value of a cell in the data frame.
Args:
- row (int):
- the index of the cell row.
- col (int):
- the index of the cell column.
Returns:
- (float|string) :
- the cell at position (row, column) in the data frame.
-
get_name
(self, size_t col)¶ Retrieve the name of a specific column.
Args:
- col (int):
- the index of the column.
Returns:
- (str) :
- the name of the column with the given index.
-
get_type
¶ Retrieve the type of values in a given column.
Args:
- col (int):
- the index of the column.
Returns:
- An object of type
datastories.data.ColumnType
.
-
load
¶ Load a data frame from a file.
Args:
- file_path (str):
- path to the file to be loaded.
-
mapper_get
(self, size_t index, size_t value)¶
-
read_csv
¶ Loads a DataFrame from a CSV file.
Args:
- file_path (str):
- the path to the file to load.
- delimiter (str=’,’):
- character to use as value delimiter.
- decimal (str=’.’):
- character to use as decimal point in numeric values.
- header_rows (int=1):
- number of header rows (i.e., not containing data values)
-
rows
¶ Retrieves the number of rows in the data frame.
-
save
(self, file_path)¶ Save the data frame to a file.
Args:
- file_path (str):
- path to the output file.
-
set_float
¶ Sets the value of a given cell to a new float value.
Args:
- row (int):
- the row index of the cell.
- col (int):
- the column index of the cell.
- val (float):
- the new float value.
-
set_int
¶ Sets the value of a given cell to a new int value.
Args:
- row (int):
- the row index of the cell.
- col (int):
- the column index of the cell.
- val (int):
- the new int value.
-
set_name
(self, size_t col, name)¶ Set the name of a column in the data frame.
Args:
- col (int):
- the index of the column.
- name (str):
- the new name.
-
set_string
¶ Sets the value of a given cell to a new string value.
Args:
- row (int):
- the row index of the cell.
- col (int):
- the column index of the cell.
- val (str):
- the new string value.
-
to_pandas
(self)¶ Exports the DataFrame to a Pandas DataFrame object.
Returns:
- The constructed Pandas DataFrame object.
-
class
datastories.data.
ColumnType
¶ Possible column types for datastories.data.DataFrame.
-
DATE
= 3¶
-
INTEGER
= 2¶
-
MIXED
= 10¶
-
NUMERIC
= 1¶
-
STRING
= 4¶
-
UNKNOWN
= 0¶
-
-
class
datastories.data.
DataType
¶ Possible cell value types for datastories.data.DataFrame.
-
DATE
= 3¶
-
INTEGER
= 2¶
-
NUMERIC
= 1¶
-
STRING
= 4¶
-
UNKOWN
= 0¶
-
-
class
datastories.data.
RangeType
¶ Possible value range types for datastories.data.DataFrame.
-
CATEGORICAL
= 3¶
-
INTERVAL
= 1¶
-
ORDINAL
= 2¶
-
UNSPECIFIED
= 0¶
-
-
class
datastories.data.
BaseConverter
¶ Base class for all DataStories SDK value type converters.
Objects of this class are callables. To apply the converter, simply call the object with the value to be converted.
The number of conversion operations (both successful or not) is tracked and can be retrieved and reset.
Example:
converter = BaseConverter()
converted_value = converter(raw_value)
-
class
datastories.data.
IntConverter
¶ Converter to integer values.
-
class
datastories.data.
FloatConverter
¶ Converter to float values.
-
class
datastories.data.
StringConverter
¶ Converter to string values.
-
class
datastories.data.
BoolConverter
(true_values, false_values)¶ Converter to boolean values.
Args:
- true_values (list):
- a list of strings that will be regarded as
True
.
- false_values (list):
- a list of strings that will be regarded as
False
.
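The converter contract above can be sketched as a small callable class. What the real BoolConverter does with values that match neither list is not documented here; raising ValueError is an assumption of this illustration:

```python
class SimpleBoolConverter:
    """Sketch of the BoolConverter contract: a callable that maps configured
    strings to True/False. Not the SDK implementation."""

    def __init__(self, true_values, false_values):
        self._true = set(true_values)
        self._false = set(false_values)

    def __call__(self, value):
        if value in self._true:
            return True
        if value in self._false:
            return False
        # Assumption: unrecognized values are rejected.
        raise ValueError(f"cannot convert {value!r} to bool")
```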
-
class
datastories.data.
DateConverter
(**kwargs)¶ Converter to datetime values.
Args:
- kwargs:
- Passed on to dateutil.parser.parse. See [https://dateutil.readthedocs.org/en/latest/parser.html#dateutil.parser.parse] for the accepted arguments.
-
class
datastories.data.
NanConverter
(nan_values=('', 'NA', 'NaN', 'null', 'none', '?', '..', '...', 'N/A', '-'))¶ Converter to NaN values.
Converts NaN equivalent values to numpy.nan.
NOTE: This converter is somewhat different from the others. While others return numpy.nan when the conversion is not possible, this converter returns numpy.nan only when the conversion is possible, and the unchanged value otherwise.
-
class
datastories.data.
FallbackConverter
(nan_detector=None, converters=None)¶ This converter has a list of converters that it tries in order until one is successful. It also keeps track of how many conversions each converter performed successfully.
First the converter checks whether the value is a NaN value; if so, the value is ignored. Otherwise, the converters are tried in order, stopping as soon as the first conversion is successful. If none of the conversions is successful, a datastories.api.errors.ConversionError exception is raised.
Args:
- nan_detector (obj=NanConverter()):
- an object of type
datastories.data.BaseConverter
used to detect whether a value represents a nan or not.
- converters (list=[FloatConverter(), StringConverter()]):
- a list of
datastories.data.BaseConverter
objects that will be tried in order with each attempted conversion.
Data Frame Preparation¶
-
datastories.data.
prepare_data_frame
(data_frame, sample_size=None, progress_bar=False)¶ Prepares a pandas.DataFrame object compatible with the DataStories clean-up and type conversion rules.
Pandas DataFrame objects obtained from external sources are often inconsistent and need to be cleaned up in order to make them usable for analysis. The clean-up process transforms the data frame, for example by enforcing type conversions and discarding non-usable values. DataStories analyses perform the clean-up operation automatically. However, there may be scenarios when a data clean-up is required before running data through a DataStories analysis (e.g., a custom feature-engineering stage).
This function can be used to obtain a Pandas DataFrame object that is cleaned up according to the DataStories rules and conventions.
Args:
- data_frame (obj):
- the data frame object to convert (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
- sample_size (int|str=None):
- the sample size to use for inferring column data types (either an absolute integer value or a percentage, e.g. '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.
- progress_bar (obj|bool=False):
- An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
- The constructed
pandas.DataFrame
object.
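The sample_size convention above recurs throughout this package. A sketch of how such a value could be resolved; the helper name and the rounding behaviour are assumptions of this illustration:

```python
def resolve_sample_size(sample_size, n_rows):
    """Sketch of the sample_size convention: an int, a percentage string
    such as '10%', or None for the documented default of min(100, 10% of
    the number of rows)."""
    if sample_size is None:
        return min(100, n_rows // 10)
    if isinstance(sample_size, str) and sample_size.endswith("%"):
        return int(n_rows * float(sample_size[:-1]) / 100.0)
    return int(sample_size)
```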
-
datastories.data.
to_ds_pandas
(data_frame, converters=None, sample_size=None, copy=False, force_conversion=False, progress_bar=False, include_converters=False)¶ Converts a data frame to a
pandas.DataFrame
object compatible with the DataStories type conversion rules.Args:
- data_frame (obj):
- the data frame object to convert (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
- converters (list):
- list of
datastories.data.BaseConverter
type conversion objects to use for coercing the column types. If not specified, it will be detected automatically based on a sample of data.
- sample_size (int|str=None):
- the sample size to use for inferring data types when the list of converters is not specified (either an absolute integer value or a percentage, e.g. '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.
- copy (bool=False):
- set to
True
in order to force creation of a new object.
- force_conversion (bool=False):
- set to
True
in order to force the conversion on a previously converted data frame.
- progress_bar (obj|bool=False):
- An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
- include_converters (bool=False):
- set to True in order to include the column converters in the returned result
Returns:
- The constructed pandas.DataFrame object when include_converters is False.
- A tuple containing the constructed pandas.DataFrame object and the used column converters when include_converters is True.
-
datastories.data.
detect_column_types
(data_frame, sample_size=None, progress_bar=False)¶ Infer the data types for the columns of a data frame.
Inference is done on a sample of the data, based on the most frequent value type in each column.
Args:
- data_frame (obj):
- the input data frame.
- sample_size (int|str=None):
- the sample size to use for inferring data types (either an absolute integer value or a percentage, e.g. '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.
- progress_bar (obj):
- An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
- list of
datastories.data.BaseConverter
type conversion objects corresponding to detected data types.
Raises:
ValueError
:- when an invalid value is provided for one of the input parameters.
Example:
from datastories.data import detect_column_types
import pandas as pd

df = pd.read_csv('example.csv')
col_types = detect_column_types(df, sample_size='20%')
for col_type in col_types:
    print(col_type.typename)
-
datastories.data.
data_to_file
(data_frame, file_path)¶ Save a DataFrame to a file.
Args:
- data_frame (obj):
- the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
- file_path (str):
- path to the saved file.
-
datastories.data.
file_to_pandas
(file_path)¶ Load a saved datastories.data.DataFrame object into a pandas.DataFrame object.
Args:
- file_path (str):
- path to the file to be loaded.
-
datastories.data.
normalize_column_names
(data_frame)¶ Normalizes the names of the columns of a Pandas data frame.
The following operations are performed:
- Convert numbers to strings
- Normalize unicode (NFKD)
See: [https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize]
The operations are performed in place (i.e., mutating the input data frame).
Args:
- data_frame (pandas.DataFrame):
- the input data frame.
-
datastories.data.
get_columns
(data_frame, include_cols=None, exclude_cols=None)¶ Get a selection of columns from a dataset that include/exclude specific columns.
Args:
- data_frame (pandas.DataFrame):
- input data frame (a pandas.DataFrame object).
- include_cols (list):
- selection of columns to include in the result. If left unspecified or evaluating to None, all dataset columns will be included.
- exclude_cols (list):
- selection of columns to be excluded from the result.
Returns:
- A list of selected columns.
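The selection rule above can be sketched as a plain function over column names. This illustration operates on name lists rather than a pandas.DataFrame, and the function name is hypothetical:

```python
def select_columns(columns, include_cols=None, exclude_cols=None):
    """Sketch of the documented selection rule: start from include_cols
    (or all columns when it evaluates to None), then drop anything listed
    in exclude_cols."""
    selected = list(include_cols) if include_cols else list(columns)
    excluded = set(exclude_cols or [])
    return [col for col in selected if col not in excluded]
```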
Summary Calculation¶
-
datastories.data.
compute_summary
(data_frame, converters=None, sample_size=None, progress_bar=False)¶ Compute a data summary on a provided data frame.
Args:
- data_frame (obj):
- the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
- converters (list=None):
- list of
datastories.data.BaseConverter
type conversion objects to use for coercing the column types. If not specified, it will be detected automatically based on a sample of data.
- sample_size (int|str=None):
- the sample size to use for inferring data types (either an absolute integer value or a percentage, e.g. '10%'). If left unspecified, the minimum of 100 and 10% of the number of points is used.
- progress_bar (obj|bool=False):
- An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
- An object of type
datastories.data.DataSummaryResult
wrapping-up the summary report.
Example:
from datastories.data import compute_summary
import pandas as pd

df = pd.read_csv('example.csv')
summary = compute_summary(df)
print(summary)
-
class
datastories.data.
DataSummaryResult
(stats)¶ Encapsulates the result of the datastories.data.compute_summary() analysis.
Note: Objects of this class should not be manually constructed.
-
static
load
(file_path)¶ Load a previously saved summary from a JSON file.
Args:
- file_path (str):
- the path to file to be loaded.
Returns:
- An object of type
datastories.data.DataSummaryResult
encapsulating data summary information.
-
metrics
¶ Retrieves the set of metrics included in the data summary.
NOTE: This is an alias for the .stats property.
Returns:
- an object of type datastories.data.TableStatistics wrapping up summary statistics.
-
save
(file_path)¶ Save the summary to a JSON file.
Args:
- file_path (str):
- the path to the exported summary file.
-
select
(cols)¶ Select a set of columns for further reference.
-
selected
¶ Retrieves the list of selected columns.
-
stats
¶ Retrieves the set of statistics included in the data summary.
Returns:
- an object of type
datastories.data.TableStatistics
wrapping up summary statistics.
-
to_pandas
()¶ Exports the detailed (column-level) data summary to a Pandas
DataFrame
.
Returns:
- The constructed Pandas
DataFrame
object.
-
visualization
¶ Retrieves the data health visualization.
-
-
class
datastories.data.
TableStatistics
(name=None, rows=None, columns=None, n=None, n_missing=None, p_missing=None, health=None, health_score=0, df=None, converters=None, n_rows=None, n_columns=None, version=None)¶ Statistics and data health reports for a given data frame.
Note: Objects of this class should not be manually constructed.
Attributes:
- n_rows (int):
- number of rows.
- n_columns (int):
- number of columns.
- n (int):
- number of values.
- n_missing (int):
- number of missing values.
- p_missing (float):
- percentage of missing values.
- health_score (float):
- health score: 0 (good) - 100 (bad).
- health (float):
- general health value for the data frame (unusable:0, fixable:0.5, great:1).
- columns (list):
- list of objects of type
datastories.data.ColumnStatistics
encapsulating detailed column level statistics
-
calc_stats
(missing_thr=(50, 90), balance_thr=(50, 90), outlier_thr=(50, 90), table_thr=(50, 90), rows_thr=30)¶ Compute the statistics for the data frame and set the corresponding attributes.
Args:
- missing_thr (tuple=(50, 90)):
- thresholds for deciding the missing values health category (Poor, Reasonable, Good)
- balance_thr (tuple=(50, 90)):
- thresholds for deciding the data distribution health category (Poor, Reasonable, Good)
- outlier_thr (tuple=(50, 90)):
- thresholds for deciding outlier health category (Poor, Reasonable, Good)
- table_thr (tuple=(50, 90)):
- thresholds for deciding overall data health category (Poor, Reasonable, Good)
- rows_thr (int=30):
- threshold for the minimum number of required rows. Below this value the data is considered unusable.
-
class
datastories.data.
ColumnStatistics
(col=None, id=None, converter=None, label=None, column_type=None, element_type=None, n=None, n_valid=None, n_missing=None, p_missing=None, n_unique=None, min=None, max=None, mean=None, median=None, most_freq=None, first_quartile=None, third_quartile=None, histo_labels=None, histo_counts=None, balance_score=None, balance_health=None, missing_health=None, left_outlier_score=None, right_outlier_score=None, outlier_score=None, left_outlier_health=None, right_outlier_health=None, outlier_health=None, health=None, missing_thr=None, balance_thr=None, outlier_thr=None, bincount=10, n_outliers=None, outlier_n=None, outlier_perc=None, outlier_grade=None)¶ Statistics and data health reports for a given column in a data frame.
Note: Objects of this class should not be manually constructed.
Attributes:
- n_rows (int):
- number of rows.
- id (int):
- the index of the column.
- label (str):
- the label (header values) of the column.
- n (int):
- the length of the column.
- n_valid (int):
- the number of correctly parsed data items.
- n_missing (int):
- the number of unreadable data items.
- p_missing (float):
- percent of unreadable data items.
- column_type (str):
- type of the column (ordinal, interval, binary, …).
- element_type (str):
- type of individual data items (float, string, …).
- n_unique (int):
- number of unique values.
- min (float):
- minimum value.
- max (float):
- maximum value.
- mean (float):
- mean value.
- median (float):
- median value.
- first_quartile (float):
- first quartile (data point under which 25% of data is situated).
- third_quartile (float):
- third quartile (data point under which 75% of data is situated).
- histo_labels (list):
- labels for the histogram bins.
- histo_counts (list):
- counts for the histogram bins.
- balance_score (float):
- score for the data balance quality, 0 (good) - 100 (bad).
- balance_health (float):
- health value in terms of balance (unusable:0, fixable:0.5, great:1).
- missing_health (float):
- health value in terms of the number of missing items (unusable:0, fixable:0.5, great:1).
- left_outlier_score (float):
- metric for outlier impact on the left (i.e., small) side of the data range. Scale: 0 (no outliers detected) - 100 (bad).
- right_outlier_score (float):
- metric for outlier impact on the right (i.e., big) side of the data range. Scale: 0 (no outliers detected) - 100 (bad).
- outlier_score (float):
- metric for the general outlier impact of the data. Scale: 0 (no outlier impact whatsoever) - 100 (bad).
- left_outlier_health (float):
- health value for left outlier impact (unusable:0, fixable:0.5, great:1).
- right_outlier_health (float):
- health value for right outlier impact (unusable:0, fixable:0.5, great:1).
- outlier_health (float):
- health value for outlier impact (unusable:0, fixable:0.5, great:1).
- health (float):
- general health value for this column (unusable:0, fixable:0.5, great:1).
- n_outliers (int):
- number of outliers.
- outlier_n (int):
- number of outliers.
- outlier_perc (float):
- percentage of outlier values.
- outlier_grade (int):
- 0: bad, 1:good.
-
calc_stats
(missing_thr=(50, 90), balance_thr=(50, 90), outlier_thr=(50, 90))¶ Compute the statistics for the column and set the corresponding attributes.
Args:
- missing_thr (tuple=(50, 90)):
- thresholds for deciding the missing values health category (Poor, Reasonable, Good)
- balance_thr (tuple=(50, 90)):
- thresholds for deciding the data distribution health category (Poor, Reasonable, Good)
- outlier_thr (tuple=(50, 90)):
- thresholds for deciding outlier health category (Poor, Reasonable, Good)
Outlier Detection¶
-
datastories.data.
compute_outliers
(input, ref=None, double strictness=0.25, outlier_vote_threshold=None, far_outlier_vote_threshold=None)¶ Identifies numeric outliers in a 1D or 2D space.
This function can be used either with the strictness argument only (i.e., by leaving the last two parameters at their defaults, so they are computed as a function of the strictness) or manually, by setting the last two parameters, in which case the strictness is ignored.
Args:
- input (list|obj|ndarray):
- numeric input vector can be either a list, a
pandas.Series
object or a numpy numeric array;
- ref (list|obj|ndarray=None):
- abscissa vector for the 2D case. Can be either a list, a
pandas.Series
object or a numpy numeric array;
- strictness (double=0.25):
- determines how strictly the algorithm selects outliers - higher values yield fewer outliers. Value is in the range [0-1].
- outlier_vote_threshold (double=None):
- determines when a point is considered an outlier - higher values yield
fewer outliers. Value is in the range [0-100]. When left unspecified it
will be set to
100 * strictness
.
- far_outlier_vote_threshold (double=None):
- determines when a point is considered a far outlier - higher values
yield fewer outliers. This must be larger than [outlier_vote_threshold].
Default is
outlier_vote_threshold + 50
. Value is in the range [0-100].
Returns:
- An object of type
datastories.data.OutlierResult
wrapping-up the computed outliers.
Example:
from datastories.data import compute_outliers
import pandas as pd

df = pd.read_csv('example.csv')
outliers = compute_outliers(df['my_column'])
print(outliers)
-
class
datastories.data.
OutlierResult
(input, outliers)¶ Encapsulates the result of the
datastories.data.compute_outliers()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
Attributes:
- valid (bool):
- a flag indicating whether the result is valid.
-
as_index
(self, outlier_types=None)¶ A numpy index vector that can be used to select and retrieve outlier values.
The index can be applied on numpy arrays or
pandas.Series
objects.
Args:
- outlier_types (list):
- list of
datastories.api.OutlierType
values to specify which outliers to retrieve. By default, all outliers are included (i.e., outlier_types = [OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW])
-
as_itemgetter
(self, outlier_types=None)¶ An
operator.itemgetter
object that can be used to select and retrieve outlier values from a list.
Args:
- outlier_types (list):
- list of
datastories.api.OutlierType
values to specify which outliers to retrieve. By default, all outliers are included (i.e., outlier_types = [OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW])
-
clip_to_iqr
(self, low_threshold=0.05, high_threshold=0.95)¶ Marks as outliers values that are outside a specific inter-quartile range.
This operation can be undone via the
reset
method.
Args:
- low_threshold (float=0.05):
- the lower bound of the inter-quartile range. Should be in the interval [0,1].
- high_threshold (float=0.95):
- the upper bound of the inter-quartile range. Should be in the interval [0,1].
Raises:
ValueError
:- when the input arguments are not valid.
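The thresholds are quantile bounds: values below the low_threshold quantile or above the high_threshold quantile are marked as outliers. A plain-pandas illustration of the bounds involved, assuming the thresholds are interpreted as quantiles (this is not the SDK implementation itself):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# The default clip_to_iqr thresholds correspond to the 5% and 95% quantiles.
low, high = s.quantile(0.05), s.quantile(0.95)

# Values outside [low, high] are the ones marked as outliers.
outside = s[(s < low) | (s > high)]
print(list(outside))  # [1, 100]
```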
-
metrics
¶ A dictionary containing outlier detection metrics.
The following metrics are retrieved:
- Outliers:
- total number of outliers.
- Outliers Low:
- number of lower outliers.
- Outliers High:
- number of higher outliers.
- Close Outliers:
- number of close outliers.
- Close Outliers Low:
- number of lower close outliers.
- Close Outliers High:
- number of higher close outliers.
- Far Outliers:
- number of far outliers.
- Far Outliers Low:
- number of lower far outliers.
- Far Outliers High:
- number of higher far outliers.
- NaN:
- number of NaN values.
- Normal:
- number of values that are neither outliers nor NaN.
-
reset
(self)¶ Resets outliers to original values, as computed by the
datastories.data.compute_outliers()
analysis.
-
to_csv
(self, file_path, content=u'metrics', delimiter=u', ', decimal=u'.')¶ Exports a list of detected outliers or metrics to a
CSV
file.
Args:
- file_path (str):
- path to the output file.
- content (str=’metrics’):
- the type of metrics to export. Possible values:
'metrics'
: exports outlier detection metrics.'outliers'
: exports point-wise outlier classification.
- delimiter (str=’,’):
- character used as value delimiter.
- decimal (str=’.’):
- character used as decimal point.
Raises:
ValueError
:- when an invalid value is provided for the [content] argument.
-
to_excel
(self, file_path)¶ Exports the list of detected outliers and metrics to an
Excel
file.
Args:
- file_path (str):
- path to the output file.
-
to_pandas
(self, content=u'metrics')¶ Exports a list of detected outliers or metrics to a
pandas.Series
object.
Args:
- content (str=’metrics’):
- the type of metrics to export. Possible values:
'metrics'
: exports outlier detection metrics.'outliers'
: exports point-wise outlier classification.
Returns:
- The constructed
pandas.Series
object.
Raises:
ValueError
:- when an invalid value is provided for the [content] argument.
-
update
(self, updates)¶ Updates the list of detected outliers with manual corrections.
-
updated
¶ A list of manual corrections applied to the detected outliers.
-
visualization
¶ Retrieves the outliers visualization.
Classification¶
The datastories.classification
package contains a collection
of classes and functions to facilitate classification analysis.
Feature Ranking¶
-
datastories.classification.
rank_features
(data_set, kpi, metric=FeatureRankingMetric.ACCURACY) → FeatureRankResult¶ Computes the relative importance of columns in a data frame for predicting a binary KPI.
The scoring is based on maximizing the prediction accuracy with respect to the KPI while iteratively splitting the data frame rows.
Args:
- data_set (obj):
- the input data frame (either a
pandas.DataFrame
or a datastories.data.DataFrame
object).
- kpi (int|str):
- the index or the name of the KPI column.
- metric (enum = FeatureRankingMetric.ACCURACY):
- an object of type
datastories.classification.FeatureRankingMetric
specifying the metric type used to rank the features.
Returns:
- An object of type
datastories.classification.FeatureRankResult
wrapping-up the computed scores.
Raises:
TypeError
:- if data_set is not a
DataFrame
or a PandasDataFrame
object.
ValueError
:- if kpi is not a valid column name or index value (e.g., out-of-range index).
Example:
from datastories.classification import rank_features
import pandas as pd

df = pd.read_csv('example.csv')
kpi_column_index = 1
ranks = rank_features(df, kpi_column_index)
print(ranks)
-
class
datastories.classification.
FeatureRankingMetric
¶ Metric to use for ranking the features.
-
ACCURACY
= 0¶
-
-
class
datastories.classification.
FeatureRankResult
(title='', subtitle='')¶ Encapsulates the result of the
datastories.classification.rank_features()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
-
feature_ranks
¶ Retrieves the feature ranks computed by the
datastories.classification.rank_features()
analysis.
Returns:
- A list of
datastories.classification.RankingSplit
objects.
-
select
(self, cols)¶ Selects a number of column names as features.
-
selected
¶ Retrieves the list of column names currently selected as features.
-
to_excel
(self, file_path)¶ Exports the list of ranking scores to an
Excel
file.
Args:
- file_path (str):
- path to the output file.
-
to_pandas
(self, ranking_column=u'Score', min_threshold=0.0)¶ Exports the list of ranking scores to a Pandas
DataFrame
object.
Args:
- ranking_column (str=’Score’):
- column to compute the rank and order the data frame. This can be useful to discover interesting variables that are penalised because they have a lot of missing values.
- min_threshold (float):
- a cutoff threshold for the minimum score that a variable should have in order to be exported.
Returns:
- The constructed Pandas
DataFrame
object.
-
visualization
¶ Retrieves the feature ranks visualization.
-
-
class
datastories.classification.
RankingSplit
¶ Encapsulates information about a split.
Note: Objects of this class should not be manually constructed.
Attributes:
- column_name (str):
- name of the variable (i.e., column) used in split.
- column_index (int):
- index of the variable used in split.
- score (float):
- relative importance score with respect to the KPI.
- left_value (float):
- the variable value that was used for the split.
- right_value (float):
- the next higher variable value in the dataset.
- split_value (float):
- the variable value that was used for the split.
- equal_type_split (bool):
- indicates whether the split value equals one of the left_value or right_value.
- extra_scores (dict):
- dictionary containing additional metrics (e.g., accuracy).
Correlation¶
The datastories.correlation
package contains a collection
of classes and functions to facilitate correlation analysis.
-
datastories.correlation.
compute_correlations
(data, column_list, kpis, max_vars=200, outlier_elimination=False, optimize=False)¶ Find the most relevant correlations between the columns of a data set.
A number of correlation metrics are computed (currently linear and mutual information) for a subset of the most relevant input variables with respect to a set of KPIs and between the KPIs themselves.
The subset of relevant input variables is computed based on prototyping and limited to a maximum number as specified (i.e., max_vars = 200).
Args:
- data (obj):
- the input data frame (either a
pandas.DataFrame
or a datastories.data.DataFrame
object);
- column_list (list):
- the list of input variable identifiers (indices or names)
- kpis (list):
- the list of KPI column identifiers (indices or names)
- max_vars (int):
- the maximum number of variables to be included in the result.
- outlier_elimination (bool=False):
- set to True in order to exclude far outliers from columns before computing correlations;
- optimize (bool=False):
- set to True in order to improve correlation metrics by using transformed versions of the input (e.g., scaled columns).
Returns:
- A JSON formatted string encapsulating the computed correlation metrics, compatible with the DataStories CorrelationBrowser visualization.
-
class
datastories.correlation.
CorrelationResult
(json_content, column_names=None)¶ Encapsulates the result of the
datastories.correlation.compute_correlations()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
-
column
(col)¶ Retrieve the correlation measurements associated with a given column.
Args:
- col (str|int):
- the identifier of the column (name or index).
Returns:
A dictionary containing correlation measurements with respect to other columns in the data frame, in case these have been included in the top correlations selection.
-
columns
¶ Retrieve the names of the columns.
-
static
load
(file_path, column_names=None)¶ Load the result from a JSON file.
Args:
- file_path (str):
- location of the input file
- column_names (list[str]=None):
- list of column names in the original data frame. If not provided, one cannot access the correlations via the original data frame column indexes. Instead one must use column names.
Returns:
- An object of type
datastories.correlation.CorrelationResult
-
save
(file_path)¶ Save the result to a JSON file.
Args:
- file_path (str):
- location of the output file
NOTE: This operation loses the data frame context information. The original column names and their indices will not be available when loading the result from this file, unless the context is provided by the user. If no context is provided, one can still use the result but the correlations cannot be accessed via the original data frame column indexes. Instead one can use the column names.
-
to_excel
(file_path)¶ Export the list of correlations to an
Excel
file.Args:
- file_path (str):
- name of the file to export to.
-
to_json
(html_safe=False)¶ Save the result as a JSON string.
Args:
- html_safe (bool=False):
- Set to True in order to produce a JSON string that is safe to embed in a HTML page as an attribute value.
Returns:
- A JSON string containing the analysis results.
-
to_pandas
()¶ Export the list of correlations to a Pandas
DataFrame
object.
NOTE: Every pair of correlated columns is included twice in the results such that each of the columns in the pair appears as a main column.
Returns:
- The constructed Pandas
DataFrame
object.
-
visualization
¶ Retrieves the correlations visualization.
-
Prototype Detection¶
-
datastories.correlation.
compute_prototypes
(data_set, kpi, double prototype_threshold: float = 0.85, fast_approximation: bool = True, double missing_value_threshold: float = 0.5, use_linear_correlation: bool = False, inputs_only: bool = False) → PrototypeResult¶ Compute a set of mutually uncorrelated variables from a data frame.
Correlation estimation is by default based on the
Mutual Information Content
measure, and can be overridden to the Linear Correlation
when required.
Each variable in the set has the following properties:
- it is not significantly correlated to any other variable in the set;
- it can be highly correlated to other variables that are not included in the set;
- it has a higher KPI correlation score than all the other variables that are highly correlated to it.
Each variable that is not included in the set has the property that it is highly correlated to a variable in the set.
Args:
- data_set (obj):
- the input data frame (either a
pandas.DataFrame
or a datastories.data.DataFrame
object).
- kpi (int|str):
- single value or list containing the index or the name of the KPI column(s).
- prototype_threshold (float = 0.85):
- correlation threshold for features to be considered proxies.
- fast_approximation (bool = True):
- approximate the mutual information; this provides a significant speedup with little precision loss.
- missing_value_threshold (float = 0.5):
- missing values threshold for excluding features from prototypes.
- use_linear_correlation (bool = False):
- use linear correlation instead of the mutual information for correlation estimation.
- inputs_only (bool = False):
- extract prototypes only for inputs (i.e., exclude KPIs). The KPIs are used only to determine the order in which the prototypes are presented. That is, the order of prototypes in the result is given by their maximum correlation with a KPI.
Returns:
- An object of type
datastories.correlation.PrototypeResult
wrapping-up the computed prototypes.
Raises:
TypeError
:- if [data_set] is not a
DataFrame
or a PandasDataFrame
object.
ValueError
:- if [kpi] is not a valid column name or index value (e.g., out-of-range index).
Example:
from datastories.correlation import compute_prototypes
import pandas as pd

df = pd.read_csv('example.csv')
kpi_column_index = 1
prototypes = compute_prototypes(df, kpi_column_index)
print(prototypes)
-
class
datastories.correlation.
PrototypeResult
(title='', subtitle='')¶ Encapsulates the result of the
datastories.correlation.compute_prototypes()
analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
-
prototypes
¶ Retrieves the list of column names currently selected as prototypes.
-
select
(self, cols)¶ Select a number of column names as prototypes.
-
selected
¶ Retrieves the list of column names currently selected as prototypes.
-
to_excel
(self, file_path)¶ Export the list of prototypes to an
Excel
file.
Args:
- file_path (str):
- path to the output file.
-
to_pandas
(self)¶ Export the list of prototypes to a Pandas
DataFrame
object.
Returns:
- The constructed Pandas
DataFrame
object.
-
visualization
¶ Retrieves the prototypes visualization.
-
-
class
datastories.correlation.
Prototype
¶ Encapsulates prototype information data.
Note: Objects of this class should not be manually constructed.
Attributes:
- info (obj):
- an object of type
datastories.correlation.CorrelationInfo
describing the correlation of the prototype with respect to the KPI.
- proxy_list (list):
- a list of
datastories.correlation.CorrelationInfo
objects corresponding to highly correlated variables with respect to the prototype.
-
class
datastories.correlation.
CorrelationInfo
¶ Encapsulates correlation information for a variable with respect to a reference.
Note: Objects of this class should not be manually constructed.
Attributes:
- col_index (int):
- the index of the variable in the input data frame.
- col_name (str):
- the name of the variable.
- correlation (float):
- the correlation score with respect to the reference.
Model¶
The datastories.model
package contains a collection
of classes that encapsulate data models (e.g., prediction
models computed by regression or classification analysis).
Base Classes¶
-
class
datastories.model.
Model
¶ Encapsulates an RSX based DataStories model.
-
inputs
¶ Retrieves a list of input model variable names.
-
outputs
¶ Retrieves a list of output model variable names.
-
plot
(self, *args, **kwargs)¶ Display a graphical representation of the prediction model.
Accepts the same parameters as the constructor for
datastories.visualization.WhatIfsSettings
-
predict
(self, data_frame, as_pandas=None, prepare_data=True)¶ Evaluate the model on an input data frame.
Args:
- data_frame (obj):
- the input data frame (either a
pandas.DataFrame
or a datastories.data.DataFrame
object).
- as_pandas (bool=None):
- Flag to indicate whether prediction results should be returned as a
Pandas
data frame. By default results are returned in the same format as the input data frame.
- prepare_data (bool=True):
- Set to
True
in order to prepare provided Pandas data frames according to the DataStories type conversion rules. When the provided data frame is a datastories.data.DataFrame
object, this argument is discarded.
Returns:
- An object of type
datastories.core.model.PredictionResult
wrapping-up the computed prediction.
-
save
(self, file_path=None)¶ Serialize the model to a file or a bytes object.
Args:
- file_path (str=None):
- Name of the output file. If omitted, the model is serialized to a bytes object and returned by the function.
Returns:
- A bytes object containing the model when the [file_path] argument
is omitted or set to
None
.
-
to_cpp
(self, file_path)¶ Export the model to a C++ file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_excel
(self, file_path)¶ Export the model to an Excel file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_matlab
(self, file_path)¶ Export the model to a MATLAB file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_py
(self, file_path)¶ Export the model to a Python file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_r
(self, file_path)¶ Export the model to an R file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
variables
¶ Retrieves a dictionary mapping model variables to corresponding information such as variable type and range.
Returns:
- A dictionary mapping column names to corresponding objects of
type
datastories.model.VariableInfo
.
-
-
class
datastories.model.
VariableInfo
¶ Holds information about a model variable, such as ranges and types.
Note: Objects of this class should not be manually constructed.
-
categories
¶ Retrieves the registered categories of the associated variable (i.e., if the variable is categorical).
-
index
¶ Retrieves the index of the associated variable.
-
is_input
¶ Checks if the associated variable is an input for the model.
-
max
¶ Retrieves the maximum value of the associated variable.
-
min
¶ Retrieves the minimum value of the associated variable.
-
range_type
¶ Retrieves the range type of the associated variable.
-
type
¶ Retrieves the type of the associated variable.
-
Prediction¶
-
datastories.model.
predict_from_model
(data_frame, rsx_model_path)¶ Evaluate an RSX model on an input data frame.
Args:
- data_frame (obj):
- the input data frame (either a
pandas.DataFrame
or a datastories.data.DataFrame
object).
- rsx_model_path (str):
- path of the RSX model file.
Returns:
- An object of type
datastories.model.PredictionResult
wrapping-up the computed prediction.
-
class
datastories.model.
PredictionResult
(data=None)¶ Encapsulates a model prediction result.
Base classes:
Note: Objects of this class should not be manually constructed.
-
evaluation_data
¶ Retrieves the data used for evaluation, if available, or None.
-
kpis
¶ Retrieves the list of KPIs included in the prediction.
-
static
load
(metrics=None, predict_vs_actual=None, evaluation_data=None, path=None, as_pandas=True)¶ Load a
datastories.model.PredictionResult
object from a set of files or objects.
The objects take precedence over the files. When a required object is not provided, the corresponding information is retrieved from the associated file, provided such a file can be identified.
Files have standard names:
- metrics.json
- predicted_vs_actual.csv
- evaluation_data.parquet
Files are specified indirectly, by providing a folder name, containing the files mentioned above. The folder name can be also a zip archive. In that case the files should be available in the root of the archive.
The evaluation data information is optional, and only required for reference.
Args:
- metrics (dict=None):
- A dictionary containing performance metrics.
- predict_vs_actual (obj=None):
- A data frame (Pandas or DataFrame) containing predicted vs actual data.
- evaluation_data (obj=None):
- A data frame (Pandas or DataFrame) containing evaluation input data
- path (str=None):
- Path to a folder or ZIP archive containing required information if not provided by the other (object) parameters. Files containing this information should have a standard name, as mentioned above.
- as_pandas (bool=True):
- Flag to indicate whether the values field should be available as a
pandas.DataFrame
(i.e., True) or a datastories.data.DataFrame
object (i.e., False).
-
metrics
¶ Retrieves the prediction performance metrics, if available.
NOTE: This is an alias for the .stats property.
-
record_info_columns
¶ Retrieves/sets the record info column names.
-
save
(folder, include_data=False, compress=False)¶ Save the prediction data to a folder or a zip archive.
The metrics, prediction values and (optionally) prediction input data are saved as individual files:
- metrics.json
- predicted_vs_actual.csv
- evaluation_data.parquet
Args:
- folder (str):
- The folder where the files should be saved.
- include_data (bool=False):
- Flag to indicate whether the evaluation data should be included as well.
- compress (bool=False):
- Flag to indicate whether the files should be saved to a compressed ZIP archive instead of a folder.
-
stats
¶ Retrieves the prediction performance statistics, if available.
NOTE: When the actual KPI value is missing from the input data frame, the performance metrics cannot be computed. In that case None is returned.
-
to_csv
(file_path, delimiter=', ', decimal='.', include_evaluation_data=True)¶ Export the result to a
CSV
file.Args:
- file_path (str):
- path to the output file.
- delimiter (str=’,’):
- character used as value delimiter.
- decimal (str=’.’):
- character used as decimal point.
- include_evaluation_data (bool=True):
- set to
True
in order to include the evaluation data next to the prediction values.
-
to_excel
(file_path, tab_name='Predictions', include_evaluation_data=True)¶ Export the result to an
Excel
file.Args:
- file_path (str):
- path to the output file.
- tab_name (str=’Predictions’):
- name of the Excel tab where to save the result
- include_evaluation_data (bool=True):
- set to
True
in order to include the evaluation data next to the prediction values.
-
to_pandas
(include_evaluation_data=True)¶ Export the prediction and input values to pandas.
Args:
- include_evaluation_data (bool=True):
- set to
True
in order to include the evaluation data next to the prediction values.
-
values
¶ Retrieves the prediction values.
For each provided record in the input data frame the following values are provided per KPI:
- actual:
- the actual value of the KPI (i.e., if present in the input data frame).
- predicted:
- the predicted value of the KPI.
- uncertainty_min:
- minimum predicted value corrected for uncertainty.
- uncertainty_max:
- maximum predicted value corrected for uncertainty.
- model_based_outlier:
- whether the prediction is based on outlier values according to the model (1=True).
NOTE: The result object has the same type as the input provided to the predict method.
-
-
class
datastories.model.
BasePredictor
(base_model)¶ Base class for all models backed by an RSX model.
Base classes:
Offers access to basic functionality:
- prediction
- optimization
- model export to a specific language
Args:
- base_model (obj):
- an object of type
datastories.model.Model
encapsulating the base RSX model used for making predictions.
-
base_model
¶ Retrieves the generic RSX based model used for making predictions.
-
maximize
(progress_bar=True)¶ Compute the input combination that maximizes the predictive model output.
Args:
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress,False
to show nothing).
Returns:
- A
datastories.optimization.OptimizationResult
object encapsulating the model variables values that maximize the model outputs.
-
metrics
¶ Retrieves a dictionary containing model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
-
minimize
(progress_bar=True)¶ Compute the input combination that minimizes the predictive model output.
Args:
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
- A
datastories.optimization.OptimizationResult
object encapsulating the model variable values that minimize the model outputs.
-
optimize
(optimization_spec=None, progress_bar=True)¶ Compute an optimum input/output combination according to an (optional) optimization specification.
Args:
- optimization_spec (obj=OptimizationSpecification()) :
- A
datastories.optimization.OptimizationSpecification
object encapsulating the optional optimization specification.
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
- A
datastories.optimization.OptimizationResult
object encapsulating the model variable values that satisfy the optimization specification.
-
predict
(data_frame)¶ Predict the modeled KPI on a new data frame.
Args:
- data_frame (obj):
- the data frame on which the model associated KPIs are to be
predicted (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
Returns:
- An object of type
datastories.model.PredictionResult
encapsulating the prediction results.
Raises:
ValueError
:- when not all required columns are provided.
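An illustrative sketch of making a prediction (the `predictor` object, the column names, and the values below are hypothetical; only the model's required driver columns need to be present):

```python
import pandas as pd

# Hypothetical driver columns; they must match the model's required columns.
new_data = pd.DataFrame({"Input_1": [1.2, 3.4], "Input_2": [5.6, 7.8]})

result = predictor.predict(new_data)   # raises ValueError if columns are missing
print(result.values)                   # predicted values per KPI
print(result.metrics)                  # populated only when KPI values are present
```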
-
stats
¶ Retrieves a dictionary containing model prediction performance metrics.
-
to_cpp
(file_path)¶ Export the model to a C++ file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_excel
(file_path)¶ Export the model to an Excel file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_matlab
(file_path)¶ Export the model to a MATLAB file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_py
(file_path)¶ Export the model to a Python file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
-
to_r
(file_path)¶ Export the model to an R file.
Args:
- file_path (str):
- path to the output file.
Raises:
datastories.api.errors.DatastoriesError
:- when there is a problem saving the file.
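The export methods above all share the same single-argument signature, so exporting one model to several target languages is a matter of repeated calls (file paths are illustrative; each call raises `datastories.api.errors.DatastoriesError` on failure):

```python
# Export the same trained predictor in several target formats.
predictor.to_py("model.py")
predictor.to_cpp("model.cpp")
predictor.to_matlab("model.m")
predictor.to_r("model.r")
predictor.to_excel("model.xlsx")
```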
-
class
datastories.model.
BasePrediction
(data)¶ Base class for all prediction classes.
Base classes:
Args:
- data (obj):
- The associated prediction input data (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
-
to_pandas
()¶ Exports the list of predictions to a
pandas.DataFrame
object.Returns:
- The constructed
pandas.DataFrame
object.
-
class
datastories.model.
SingleKpiPredictor
(predictor_info, *args, **kwargs)¶ Encapsulates single KPI prediction models (e.g., as computed using
datastories.story.predict_single_kpi()
).Base classes:
Note: Objects of this class should not be manually constructed.
-
error_plot
¶ A visualization for assessing model prediction errors, as discovered while training the model.
Returns:
- In case of a regression model:
- an object of type
datastories.visualization.ErrorPlot
.
- In case of a binary classification model:
- an object of type
datastories.visualization.ClassificationPlot
-
metrics
¶ Retrieves a dictionary containing model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
For regression models the metrics include:
- Correlation:
- actual vs predicted correlation.
- Estimated Correlation:
- estimated correlation for future (unseen) values.
- R-squared:
- the coefficient of determination.
- MSE:
- mean squared error.
- RMSE:
- root mean squared error.
For binary classification models the metrics include:
- Positive Label:
- the label used to identify positive cases.
- Negative Label:
- the label used to identify negative cases.
- True Positives:
- number of correctly identified positive cases (TP).
- False Positives:
- number of incorrectly identified positive cases (FP).
- True Negatives:
- number of correctly identified negative cases (TN).
- False Negatives:
- number of incorrectly identified negative cases (FN).
- Not Classified:
- number of records that could not be classified (i.e., KPI is NaN).
- True Positive Rate:
- TP / (TP + FN) * 100 (a.k.a. sensitivity, recall).
- False Positive Rate:
- FP / (FP + TN) * 100 (a.k.a. fall-out).
- True Negative Rate:
- TN / (FP + TN) * 100 (a.k.a. specificity).
- False Negative Rate:
- FN / (TP + FN) * 100 (a.k.a. miss rate).
- Precision:
- percentage of correctly identified cases from the total reported positive cases TP / (TP + FP) * 100.
- Recall:
- percentage of correctly identified cases from the total existing positive cases TP / (TP + FN) * 100.
- Accuracy:
- percentage of correctly identified cases (TP + TN) / (TP + FP + TN + FN) * 100.
- F1 Score:
- the F1 score (the harmonic mean of precision and recall).
- AUC:
- area under (ROC) curve.
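The rate and score formulas above can be checked with plain arithmetic; the confusion-matrix counts below are hypothetical:

```python
# Hypothetical confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 45, 5

true_positive_rate = tp / (tp + fn) * 100    # a.k.a. sensitivity, recall
false_positive_rate = fp / (fp + tn) * 100   # a.k.a. fall-out
precision = tp / (tp + fp) * 100             # 80.0
recall = tp / (tp + fn) * 100
accuracy = (tp + tn) / (tp + fp + tn + fn) * 100   # 85.0
f1_score = 2 * precision * recall / (precision + recall)
```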
-
plot
(*args, **kwargs)¶ Displays a graphical representation of the prediction model.
Accepts the same parameters as the constructor for
datastories.visualization.ConfusionMatrixSettings
(for classification models) ordatastories.visualization.PredictedVsActualSettings
(for regression models).
-
predict
(data_frame)¶ Predict the model KPI on a new data frame.
Args:
- data_frame (obj):
- the data frame on which the model associated KPI is to be
predicted (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
Returns:
- An object of type
datastories.model.PredictionResult
encapsulating the prediction results.
Raises:
TypeError
:when the provided data frame is neither a
pandas.DataFrame
nor adatastories.data.DataFrame
object.
ValueError
:when not all required columns are provided.
Note: Only the driver columns are required for making predictions.
-
rebuild
(score_threshold=None)¶ Rebuilds the prediction model using custom settings.
Args:
- score_threshold (float=None):
- the decision threshold for binary KPI models. If missing, the optimal decision threshold will be determined automatically.
Note: In order to make changes permanent (i.e., survive story reloads) the associated story has to be saved after executing this method. To save a story use
datastories.story.predict_single_kpi.Story.save()
.
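A sketch of the workflow described in the note above (the `predictor` and `story` objects, the threshold value, and the file name are all hypothetical):

```python
# Rebuild a binary-KPI predictor with a custom decision threshold, then save
# the owning story so the change survives reloads.
predictor.rebuild(score_threshold=0.35)
story.save("my_analysis.story")
```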
-
to_html
(file_path, title='Single KPI Predictive Model', subtitle='Performance')¶ Exports a visual representation of the prediction model to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Single KPI Predictive Model’):
- HTML document title.
- subtitle (str=’Performance’):
- HTML document subtitle.
-
vis_settings
¶ Retrieves/sets the predictor visualization settings.
Communicates either an object of type
datastories.visualization.PredictedVsActualSettings
ordatastories.visualization.ConfusionMatrixSettings
(which one is appropriate to the visualization type).Raises:
ValueError
:- when the provided object is not compatible with the predictor type.
NOTE: Set this object before displaying the visualization.
-
-
class
datastories.model.
SingleKpiPredictorInfo
(predictor_type, performance_metrics)¶ Prediction performance metrics data class.
Note: Objects of this class should not be manually constructed.
-
metrics
¶ Retrieves the predictor associated performance metrics.
-
type
¶ Retrieves the predictor type (i.e.,
regression
vsclassification
).
-
-
class
datastories.model.
SingleKpiPredictionResult
(prediction_type, prediction, data, kpi_name, is_test=False, *args, **kwargs)¶ Encapsulates the results of a prediction done using a
datastories.model.SingleKpiPredictor
object.Base classes:
Note: Objects of this class should not be manually constructed.
-
error_plot
¶ A visualization for assessing model prediction errors.
Returns:
- In case of a regression model:
- an object of type
datastories.visualization.ErrorPlot
.
- In case of a binary classification model:
- an object of type
datastories.visualization.ClassificationPlot
-
is_test
¶ Checks whether the prediction result contains validation metrics.
This is the case when the actual KPI value is present in the input data frame so that a comparison can be done between predicted and actual values.
-
metrics
¶ Retrieves a dictionary containing prediction performance metrics.
These metrics are computed when the data frame used for prediction includes KPI values, for the purpose of evaluating the model prediction performance.
The following metrics are retrieved:
- Number of Records:
- number of records submitted for prediction.
- Correlation:
- actual vs predicted correlation.
- R-squared:
- the coefficient of determination.
- MSE:
- mean squared error.
- RMSE:
- root mean squared error.
In case the KPI is a binary variable, the following additional metrics are included:
- Positive Label:
- the label used to identify positive cases.
- Negative Label:
- the label used to identify negative cases.
- True Positives:
- number of correctly identified positive cases (TP).
- False Positives:
- number of incorrectly identified positive cases (FP).
- True Negatives:
- number of correctly identified negative cases (TN).
- False Negatives:
- number of incorrectly identified negative cases (FN).
- Not Classified:
- number of records that could not be classified (i.e., KPI is NaN).
- True Positive Rate:
- TP / (TP + FN) * 100 (a.k.a. sensitivity, recall).
- False Positive Rate:
- FP / (FP + TN) * 100 (a.k.a. fall-out).
- True Negative Rate:
- TN / (FP + TN) * 100 (a.k.a. specificity).
- False Negative Rate:
- FN / (TP + FN) * 100 (a.k.a. miss rate).
- Precision:
- percentage of correctly identified cases from the total reported positive cases TP / (TP + FP) * 100.
- Recall:
- percentage of correctly identified cases from the total existing positive cases TP / (TP + FN) * 100.
- Accuracy:
- percentage of correctly identified cases (TP + TN) / (TP + FP + TN + FN) * 100.
- F1 Score:
- the F1 score (the harmonic mean of precision and recall).
- AUC:
- area under (ROC) curve.
-
plot
(*args, **kwargs)¶ Displays a graphical representation of the prediction performance.
Accepts the same parameters as the constructor for
datastories.visualization.ConfusionMatrixSettings
(for classification-based predictions) ordatastories.visualization.PredictedVsActualSettings
(for regression based predictions).
-
to_html
(file_path, title='Single KPI Prediction Performance', subtitle='Predicted vs Actual')¶ Exports a visual representation of the prediction performance to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Single KPI Prediction Performance’):
- HTML document title.
- subtitle (str=’Predicted vs Actual’):
- HTML document subtitle.
-
values
¶ Retrieves the prediction values.
Returns:
- A data frame containing the input augmented with predicted values, confidence estimates and flags to indicate whether the prediction is a model based outlier.
-
vis_settings
¶ Retrieves/sets the Predict-vs-Actual visualization settings.
Communicates either an object of type
datastories.visualization.PredictedVsActualSettings
ordatastories.visualization.ConfusionMatrixSettings
(which one is appropriate to the visualization type).Raises:
ValueError
:- when the provided object is not compatible with the predictor type.
NOTE: Set this object before displaying the visualization.
-
-
class
datastories.model.
MultiKpiPredictor
(predictor_info, base_model)¶ Encapsulates multi-KPI prediction models (e.g., as computed using
datastories.story.predict_kpis()
).Base classes:
Note: Objects of this class should not be manually constructed.
-
metrics
¶ Retrieves a dictionary containing multi KPI model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
-
predict
(data_frame)¶ Predict the model KPIs on a new data frame.
Args:
- data_frame (obj):
- the data frame on which the model associated KPIs are to be predicted
(either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
- Returns:
- An object of type
datastories.regression.MultiKpiPredictionResult
encapsulating the prediction results.
- Raises:
ValueError
:- when not all required columns are provided.
NOTE: If not all drivers are provided, the KPIs that depend on them will not be predicted. However, no Exception will be generated.
-
to_html
(file_path, title='Multi KPI Predictive Model', subtitle='Performance')¶ Exports a visual representation of the prediction model to a standalone
HTML
document.Args:
- file_path (str):
- name of the file to export to.
- title (str=’Multi KPI Predictive Model’):
- HTML document title.
- subtitle (str=’Performance’):
- HTML document subtitle.
-
-
class
datastories.model.
MultiKpiPredictorInfo
(performance_metrics)¶ Data class wrapper for prediction performance metrics.
Note: Objects of this class should not be manually constructed.
-
metrics
¶ Retrieves the prediction performance metrics.
-
-
class
datastories.model.
MultiKpiPredictionResult
(prediction, data)¶ Encapsulates the results of a prediction done using a
datastories.model.MultiKpiPredictor
object.Base classes:
Note: Objects of this class should not be manually constructed.
-
metrics
¶ Retrieves a dictionary containing multi KPI prediction performance metrics.
-
to_html
(file_path, title='Multi KPI Prediction Performance', subtitle='')¶ Exports a visual representation of the prediction performance to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Multi KPI Prediction Performance’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
values
¶ Retrieves a data frame containing the input augmented with predicted values, confidence estimates and flags to indicate whether the prediction is a model based outlier.
-
Optimization¶
The datastories.optimization
package contains a collection
of classes and functions for optimizing models.
-
datastories.optimization.
create_optimizer
(*args, **kwargs)¶ Factory method for creating optimizers.
Returns:
- An object of type
datastories.optimization.pso.Optimizer
that can be used to perform optimization analyses on adatastories.model.Model
object.
Example:
```python
model = Model("my_model.rsx")

spec = OptimizationSpecification()
spec.objectives = [Minimize('KPI_1'), Maximize('KPI_2')]
spec.constraints = [AtMost('Input_1', 10)]

optimizer = create_optimizer()
optimization_result = optimizer.optimize(model, optimization_spec=spec)
print(optimization_result.optimum)
```
-
class
datastories.optimization.pso.
Optimizer
(size_t population_size=500, size_t iterations=250)¶ A model optimizer using the particle swarm strategy for identifying an optimum solution.
Args:
- population_size (int =
500
): - the initial size of the swarm population.
- iterations (int =
250
): - number of swarm computation iterations before stopping.
-
maximize
(self, model, variable_ranges=None, progress_bar=True)¶ Run the optimizer with the goal of maximizing the outputs (i.e., KPIs) of a given model.
Args:
- model (
datastories.model.Model
): - The input model whose KPIs are to be maximized.
- variable_ranges (dict [str,
datastories.optimization.VariableRange
- An optional dictionary mapping variable names to ranges that are used to limit the search for the optimum solution to a given domain.
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
- A
datastories.optimization.OptimizationResult
object encapsulating the model variable values that maximize the model outputs.
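An illustrative sketch combining maximize with per-variable search ranges (the `model` object, variable names, and bounds are hypothetical; the keyword usage assumes the documented `min`/`max`/`value` arguments of `VariableRange`):

```python
# Restrict the search domain per variable before maximizing.
ranges = {
    "Temperature": VariableRange(min=20.0, max=80.0),  # numeric range
    "Catalyst": VariableRange(value="type_A"),         # categorical range
}

optimizer = create_optimizer()
result = optimizer.maximize(model, variable_ranges=ranges, progress_bar=False)
if result.is_feasible:
    print(result.optimum)
```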
-
minimize
(self, model, variable_ranges=None, progress_bar=True)¶ Run the optimizer with the goal of minimizing the outputs (i.e., KPIs) of a given model.
Args:
- model (
datastories.model.Model
): - The input model whose KPIs are to be minimized.
- variable_ranges (dict [str,
datastories.optimization.VariableRange
- An optional dictionary mapping variable names to ranges that are used to limit the search for the optimum solution to a given domain.
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
- A
datastories.optimization.OptimizationResult
object encapsulating the model variable values that minimize the model outputs.
-
optimize
(self, model, optimization_spec=None, variable_ranges=None, direction=None, progress_bar=True)¶ Optimize an input model according to a given optimization specification.
Args:
- model (
datastories.model.Model
): - The input model to be optimized
- optimization_spec (
datastories.optimization.OptimizationSpecification
): - An optional specification for the optimization objectives and constraints. The default value is an empty specification (i.e., OptimizationSpecification())
- variable_ranges (dict [str,
datastories.optimization.VariableRange
- An optional dictionary mapping variable names to ranges that are used to limit the search for the optimum solution to a given domain.
- direction (
datastories.optimization.OptimizationDirection
) - The direction of optimization when no specification is provided. Can be one of:
OptimizationDirection.MAXIMIZE
OptimizationDirection.MINIMIZE
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
- A
datastories.optimization.OptimizationResult
object encapsulating the model variable values that satisfy the optimization specification.
Raises:
TypeError
:- when the provided input parameters do not have the expected types.
-
class
datastories.optimization.
OptimizerType
¶ Enumeration for DataStories supported optimizer types.
-
PARTICLE_SWARM
= 0¶
-
-
class
datastories.optimization.
OptimizationResult
¶ Encapsulates the result of a
datastories.optimizer.Optimizer.optimize()
analysis.Note: Objects of this class should not be manually constructed.
-
is_complete
¶ Checks whether the search for the optimum ran to completion (i.e., was not interrupted).
-
is_feasible
¶ Checks whether the identified optimum position respects the imposed constraints (if any).
-
optimum
¶ Retrieves the model variable values for the identified optimum position.
-
to_pandas
(self)¶ Export the optimum position to a Pandas
DataFrame
object.Returns:
The constructed PandasDataFrame
object.
-
-
class
datastories.optimization.
OptimizationSpecification
(objectives=None, constraints=None)¶ Encapsulates a set of optimization objectives and constraints that can be used to configure an optimization analysis.
Both objectives and constraints are defined using
datastories.optimization.VariableSpec
and (potentially)datastories.optimization.VariableMapper
objects.Example:
```python
spec = OptimizationSpecification()
spec.objectives = [Minimize('KPI_1', 2), InInterval('KPI_2', 1, 100)]
spec.add_constraint(AtMost(Sum('Input_1', 'Input_2'), 100))
```
-
add_constraint
(self, constraint)¶ Add an optimization constraint to the specification.
-
add_objective
(self, objective)¶ Add an optimization objective to the specification.
-
constraints
¶ Retrieves/sets the optimization specification constraints.
-
objectives
¶ Retrieves/sets the optimization specification objectives.
-
-
class
datastories.optimization.
OptimizationDirection
¶ Enumeration for possible optimization goals when no other optimization specification is provided.
- Possible values:
OptimizationDirection.MAXIMIZE
OptimizationDirection.MINIMIZE
-
class
datastories.optimization.
VariableRange
¶ Encapsulates a numeric or categorical value range.
Numeric ranges are defined by an upper and a lower bound. Categorical ranges are currently limited to a single value.
Args:
- min (double=0):
- a numeric range lower bound.
- max (double=0):
- a numeric range upper bound.
- value (str=’’):
- a categorical range value.
-
is_categorical
¶ Checks whether the variable range is categorical.
-
is_numeric
¶ Checks whether the variable range is numeric.
-
max
¶ Retrieves/sets the upper bound of a numeric range.
-
min
¶ Retrieves/sets the lower bound of a numeric range.
-
value
¶ Retrieves/sets the value of a categorical range.
-
class
datastories.optimization.
VariableMapper
¶ Base class for all variable mappers.
Variable mappers are the first argument to be passed when defining optimization objectives and constraints. They indicate to what variable or group of variables the objective/constraint applies.
For simple cases (i.e., one variable), variable mappers can be replaced with the name of the variable itself. However, in more complex scenarios (e.g., a constraint that applies to the aggregated value of a number of variables), mappers have to be explicitly constructed.
-
class
datastories.optimization.
Sum
(operands, weights=None)¶ Bases:
datastories.optimization.specification.VariableMapper
Aggregates a number of variables using a weighted sum. This can then be used to define optimization objectives or constraints.
Args:
- operands (list):
- a list of variable names to sum up.
- weights (list=None):
- a list of relative weights for aggregating the given variables.
-
class
datastories.optimization.
VariableSpec
¶ Base class for all optimization objectives and constraints.
-
class
datastories.optimization.
AtMost
(operand, double limit, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be lower than a given reference value.
Args:
- operand (obj):
- a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective/constraint applies.
- limit (double):
- the reference value to compare against.
- weight (double=1):
- the relative weight of this objective/constraint among all the specified objectives or constraints.
-
class
datastories.optimization.
AtLeast
(operand, double limit, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be greater than a given reference value.
Args:
- operand (obj):
- a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective/constraint applies.
- limit (double):
- the reference value to compare against.
- weight (double=1):
- the relative weight of this objective/constraint among all the specified objectives or constraints.
-
class
datastories.optimization.
InInterval
(operand, double lower_limit, double upper_limit, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be in a given reference interval.
Args:
- operand (obj):
- a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective/constraint applies.
- lower_limit (double):
- the lower bound of the reference interval.
- upper_limit (double):
- the upper bound of the reference interval.
- weight (double=1):
- the relative weight of this objective/constraint among all the specified objectives or constraints.
-
class
datastories.optimization.
IsEqual
(operand, double value, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should be equal to a given reference value.
Note: The optimizer does not support the use of
datastories.optimization.specification.IsEqual
as a constraint, because the underlying algorithm is not optimized to handle constraints of this type. Therefore, trying to forceIsEqual
-like behavior by combiningAtLeast
andAtMost
to make only a small region feasible is not recommended. The returned result might be in this region, but there is no guarantee that it is close to optimal.In general, one should try to add the
IsEqual
condition as an objective with a high weight. This does not guarantee that the condition will be met, but the results are often close enough that a small manual adjustment to one parameter is enough to meet the condition.A common case is that the sum of some parameters must be equal to a value, for example in formulations where parameters express a fraction of a mixture. In this case, if the previous recommendation does not lead to good solutions, one can try to relax the condition in the following way. Have one constraint limiting the sum to the value with
AtMost[ Sum[], value ]
, and have one objectiveMaximize[ Sum[] ]
with a high weight. This is less restrictive towards the algorithm than using theIsEqual
as objective, and can lead to better results. Of course a small manual adjustment might be needed to satisfy the condition exactly.Args:
- operand (obj):
- a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective applies.
- value (double):
- the reference value to compare against.
- weight (double=1):
- the relative weight of this objective/constraint among all the specified objectives or constraints.
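The relaxation described in the note above can be sketched as follows, for a case where three mixture fractions should sum to 100 (the variable names and weight value are hypothetical; the varargs `Sum` call follows the usage shown in the `OptimizationSpecification` example):

```python
fractions = Sum('Frac_A', 'Frac_B', 'Frac_C')

spec = OptimizationSpecification()
spec.add_constraint(AtMost(fractions, 100))           # cap the sum at the target
spec.add_objective(Maximize(fractions, weight=10.0))  # push it up towards the cap
```

A small manual adjustment of one fraction may still be needed to meet the target exactly.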
-
class
datastories.optimization.
Minimize
(operand, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should have the smallest possible value.
Note: This cannot be used to define optimization constraints.
Args:
- operand (obj):
- a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective applies.
- weight (double=1):
- the relative weight of this objective/constraint among all the specified objectives or constraints.
-
class
datastories.optimization.
Maximize
(operand, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should have the largest possible value.
Note: This cannot be used to define optimization constraints.
Args:
- operand (obj):
- a variable mapper (
datastories.optimization.specification.VariableMapper
) indicating to which variable(s) the objective applies.
- weight (double=1):
- the relative weight of this objective/constraint among all the specified objectives or constraints.
Story¶
The datastories.story
package contains a collection
of workflows to automate specific analysis tasks (e.g., building a predictive model).
-
datastories.story.
load
(file_path, *args, **kwargs)¶ Loads a previously saved story.
Args:
- file_path (str):
- name of the file containing the story, including extension.
Returns:
- An object wrapping the story.
Raises:
TypeError
:- when the story type is not recognized by the SDK.
StoryError
:- when the story type cannot be retrieved from the file.
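An illustrative sketch of reloading a saved story (the file name and extension are hypothetical):

```python
from datastories.story import load

story = load("my_analysis.story")
story.info()             # display stage completion status, SDK version, notes
if not story.is_complete:
    story.run()          # resume from the interrupted stage
```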
-
class
datastories.story.
StoryBase
(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)¶ Base class for story analyses.
Base classes:
-
class
ProcessingStage
¶ Enumeration of all story processing stages.
Specializations have to extend this with their specific execution stages, while maintaining these base stages as defined below:
- UNKNOWN = 0
- INIT = 1
-
add_note
(note)¶ Add an annotation to the story results.
The already present annotations can be retrieved using the
datastories.api.IStory.notes()
property.Args:
- note (str):
- the annotation to be added.
-
clear_note
(note_id)¶ Remove a specific annotation associated with the story analysis.
Args:
- note_id (int):
- the index of the note to be removed.
Raises:
IndexError
:- when the note index is unknown.
-
clear_notes
()¶ Clear the annotations associated with the story analysis.
-
classmethod
create_story
(data_frame, info_fields, folder, progress_bar, on_snapshot, upload_function, **kwargs)¶ Factory method.
This method has to be overridden by specializations in order to enable additional computation when loading a story object.
-
info
()¶ Display story execution information.
All story execution stages are displayed together with their completion status. The version of the used DataStories SDK and the user notes are also included.
-
static
is_compatible
(current_version_string, ref_version_string)¶ Test whether two story versions are compatible.
The story version compatibility policy is as follows:
stories are forward and backward compatible across minor versions (i.e., you can open a saved story whose version differs from the version associated with the current SDK, but only if the major version number remains unchanged).
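The stated policy can be sketched as a plain version check (this illustrates the policy only, not the SDK's implementation; the version strings are hypothetical):

```python
def versions_compatible(current: str, ref: str) -> bool:
    # Two story versions are compatible when their major components match.
    major = lambda version: version.split(".")[0]
    return major(current) == major(ref)

print(versions_compatible("2.3.1", "2.0.0"))   # True: same major version
print(versions_compatible("2.3.1", "3.0.0"))   # False: major version changed
```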
-
is_complete
¶ Checks whether all story analysis stages have been executed.
-
is_ok
¶ Checks whether the last executed story analysis stage has been successful.
-
classmethod
load
(file_path, folder=None, progress_bar=False, on_snapshot=None, upload_function=None, **kwargs)¶ Load a previously saved story.
Args:
- file_path (str):
- path to the input file.
- folder (str=None):
- Folder to use as working folder for the story.
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
- on_snapshot (func=None):
- an optional callback to be executed when an analysis snapshot is created. The callback receives one argument indicating the path of the snapshot file relative to the current execution folder.
- upload_function (func=None):
- an optional callback to upload analysis result files to a client specific storage. The callback receives one argument indicating the path of the result file relative to the current execution folder.
Returns:
- An object of type
datastories.story.predict_kpis.Story
encapsulating training results from a previous analysis.
- Raises:
datastories.story.StoryError
:- when there is a problem loading the story file (e.g., story version not compatible).
-
metrics
¶ Returns a set of metrics computed during analysis.
NOTE: This is an alias for the .stats property
-
notes
¶ Retrieves a list of all annotations currently associated with the story analysis.
-
reset
()¶ Reset the execution pointer of a story to the first stage.
-
run
(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)¶ Resumes the execution of a story from a given stage.
The stage to resume from is optional. If not specified, the story is executed from the beginning. If the stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed will be used as the starting point, unless the [strict] argument is set to True. In that case an exception will be raised if the execution cannot be resumed from the requested stage.
Args:
- resume_from (obj):
- The stage to resume execution from.
Should be a
datastories.story.predict_kpis.Story.ProcessingStage
value corresponding to a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.
- strict (bool=False):
- Raise an error if execution cannot be resumed from the requested stage.
- params (dict={}):
- Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.
- progress_bar (obj=None):
- An object of type
datastories.display.ProgressReporter
to replace the currently used progress reporter. When not specified, the current story progress reporter is not modified. The use case for this is to set a progress bar after the story is loaded, when a progress bar cannot be given to the load function directly (e.g., when a progress bar has to be constructed based on the story).
- check_interrupt (func=None):
- an optional callback to check whether analysis execution needs to be interrupted.
Raises:
datastories.api.errors.StoryError
:- if a stage is specified for which no intermediate results are available and the [strict] argument is set to True.
-
save
(file_path, include_data=True)¶ Save the story analysis results.
Use this function to persist the results of the story analysis. One can reload them and continue investigations at a later moment using the
datastories.story.load()
method.Args:
- file_path (str):
- path to the output file.
- include_data (bool=True):
- set to True to include a copy of the data in the exported file.
Raises:
datastories.api.errors.StoryError
:- when attempting to include data while the story does not contain a data reference. This is the case with stories that have been previously saved without including the data.
-
Predict Single KPI¶
-
datastories.story.
predict_single_kpi
(data_frame, column_list, kpi, runs=3, outlier_elimination=True, prototypes='auto', progress_bar=True)¶ Fits a non-linear regression model on a data frame in order to predict one column.
DEPRECATED: This method has been deprecated and will be removed in a future version of the SDK. Use the more generic
datastories.story.predict_kpis()
for analysing single KPIs as well. The column to be predicted (i.e., the KPI) is to be identified either by name or by column index in the data frame.
Args:
- data_frame (obj):
- the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object).
- column_list (list):
- the list of variables (i.e., columns) to consider for regression.
- kpi (int|str):
- the index or the name of the target (i.e., KPI) column.
- runs (int=3):
- the number of training rounds.
- outlier_elimination (bool=True):
- set to True in order to exclude far outliers from modeling.
- prototypes (str=’auto’):
- indicates whether analysis should be performed on prototypes. Possible values:
'yes'
: use only prototypes as inputs.'no'
: use all original inputs.'auto'
: use prototypes if there are more than 200 input variables.
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
Returns:
- An object of type
datastories.story.predict_single_kpi.Story
wrapping-up the computed model.
Raises:
ValueError
:- when an invalid value is provided for one of the input parameters.
datastories.story.StoryError
:- when there is a problem fitting the model.
Example:
from datastories.story import predict_single_kpi
import pandas as pd

df = pd.read_csv('example.csv')
kpi_column_index = 1
story = predict_single_kpi(df, df.columns, kpi_column_index, progress_bar=True)
print(story)
-
class
datastories.story.predict_single_kpi.
Story
(platform, kpi_name, user_columns, nrows, folder='', *args, **kwargs)¶ Encapsulates the result of a single KPI non-linear regression model.
Base classes:
DEPRECATED: This class has been deprecated and will be removed in a future version of the SDK. Use the more generic
datastories.story.predict_kpis()
for analysing single KPIs as well. Note: Objects of this class should not be manually constructed but rather created using the
datastories.story.predict_single_kpi()
factory method.-
base_model
¶ Retrieves the RSX-based model used for making predictions.
-
classmethod
load
(file_path)¶ Loads a previously saved story.
Args:
- file_path (str):
- the name of the input file.
Returns:
- An object of type
datastories.story.predict_single_kpi.Story
encapsulating training results from a previous analysis.
Raises:
datastories.story.StoryError
:- when there is a problem loading the story file (e.g., story version not compatible).
-
metrics
¶ Retrieves a dictionary containing the model performance metrics and the list of main drivers.
These metrics are computed on the training data for the purpose of evaluating the model prediction performance.
The following metrics are retrieved:
- Training Set Size:
- size of the actual data frame used for training (rows x columns).
- Correlation:
- actual vs predicted correlation.
- Estimated Correlation:
- estimated correlation for future (unseen) values.
- R-squared:
- the coefficient of determination.
- MSE:
- mean squared error.
- RMSE:
- root mean squared error.
- Main Drivers:
- list of main features with associated relative importance and energy.
- Features:
- list of all features with associated relative importance and energy.
- Computation Effort:
- a measure of model complexity.
- Number of Runs:
- number of training rounds.
- Best Run:
- best performing training round.
- Run Overview:
- overview of individual runs including Performance and Feature Importance.
In case the KPI is a binary variable, the following additional metrics are included:
- Positive Label:
- the label used to identify positive cases.
- Negative Label:
- the label used to identify negative cases.
- True Positives:
- number of correctly identified positive cases (TP).
- False Positives:
- number of incorrectly identified positive cases (FP).
- True Negatives:
- number of correctly identified negative cases (TN).
- False Negatives:
- number of incorrectly identified negative cases (FN).
- Not Classified:
- number of records that could not be classified (i.e., KPI is NaN).
- True Positive Rate:
- TP / (TP + FN) * 100 (a.k.a. sensitivity, recall).
- False Positive Rate:
- FP / (FP + TN) * 100 (a.k.a. fall-out).
- True Negative Rate:
- TN / (FP + TN) * 100 (a.k.a. specificity).
- False Negative Rate:
- FN / (TP + FN) * 100 (a.k.a. miss rate).
- Precision:
- percentage of correctly identified cases from the total reported positive cases TP / (TP + FP) * 100.
- Recall:
- percentage of correctly identified cases from the total existing positive cases TP / (TP + FN) * 100.
- Accuracy:
- percentage of correctly identified cases (TP + TN) / (TP + FP + TN + FN) * 100.
- F1 Score:
- the F1 score (the harmonic mean of precision and recall).
- AUC:
- area under (ROC) curve.
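The confusion-matrix metrics above follow the standard formulas. As an illustration only (classification_metrics is not part of the SDK), they can be derived from the four counts as:

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the binary classification metrics listed above (as percentages)."""
    metrics = {
        'True Positive Rate': tp / (tp + fn) * 100,   # sensitivity, recall
        'False Positive Rate': fp / (fp + tn) * 100,  # fall-out
        'True Negative Rate': tn / (fp + tn) * 100,   # specificity
        'False Negative Rate': fn / (tp + fn) * 100,  # miss rate
        'Precision': tp / (tp + fp) * 100,
        'Recall': tp / (tp + fn) * 100,
        'Accuracy': (tp + tn) / (tp + fp + tn + fn) * 100,
    }
    # F1 is the harmonic mean of precision and recall.
    p, r = metrics['Precision'], metrics['Recall']
    metrics['F1 Score'] = 2 * p * r / (p + r)
    return metrics
```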
-
model
¶ Retrieves an object of type
datastories.model.SingleKpiPredictor
that can be used for making predictions on new data.
-
run_overview
¶ Retrieves an overview of feature importance metrics across all runs.
-
runs
¶ Retrieves a list containing the results of individual analysis rounds.
Each entry in the list is an object of type
datastories.story.predict_single_kpi.StoryRun
encapsulating the results associated with a given analysis round.
-
save
(file_path)¶ Saves the story analysis results.
Use this function to persist the results of the
datastories.story.predict_single_kpi()
analysis. One can reload them and continue investigations at a later moment using the datastories.story.predict_single_kpi.Story.load()
method.Args:
- file_path (str):
- path to the output file.
-
to_csv
(file_path, content='metrics', delimiter=', ', decimal='.')¶ Exports a list of model metrics to a
CSV
file.Args:
- file_path (str):
- path to the output file.
- content (str=’metrics’):
- the type of metrics to export. Possible values:
'metrics'
: exports estimated model performance metrics.'drivers'
: exports driver importance metrics.'run_overview'
: exports an overview of feature importance metrics across all runs.
- delimiter (str=’,’):
- character to use as value delimiter.
- decimal (str=’.’):
- character to use as decimal point.
Raises:
ValueError
:- when an invalid value is provided for the [content] argument.
-
to_excel
(file_path)¶ Exports the list of model metrics to an
Excel
file.Args:
- file_path (str):
- path to the output file.
-
to_html
(file_path, title='Predict Single KPI', subtitle='')¶ Export the story visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Predict Single KPI’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
to_pandas
(content='metrics')¶ Exports a list of model metrics to a
pandas.DataFrame
object.Args:
- content (str=’metrics’):
- the type of metrics to export. Possible values:
'metrics'
: exports estimated model performance metrics.'drivers'
: exports feature importance metrics for the model.'run_overview'
: exports an overview of feature importance metrics across all runs.
Returns:
- The constructed
pandas.DataFrame
object.
Raises:
ValueError
:- when an invalid value is provided for the [content] argument.
-
-
class
datastories.story.predict_single_kpi.
StoryRun
(platform, parent, folder=None, dependencies=None, *args, **kwargs)¶ Encapsulates the result of one analysis round for a single KPI non-linear regression model.
DEPRECATED: This class has been deprecated and will be removed in a future version of the SDK. Use the more generic
datastories.story.predict_kpis()
for analysing single KPIs as well. Base classes:
Note: Objects of this class should not be manually constructed.
-
base_model
¶ Retrieves the RSX-based model used for making predictions.
-
correlation_browser
¶ Retrieves a visualization for assessing feature correlation.
An object of type
datastories.visualization.CorrelationBrowser
that can be used for assessing feature correlation, as discovered while training the model.
-
metrics
¶ Retrieves a dictionary containing the model performance metrics and the list of main drivers.
These metrics are computed on the training data for the purpose of evaluating the model prediction performance.
The following metrics are retrieved:
- Training Set Size:
- size of the actual data frame used for training (rows x columns).
- Correlation:
- actual vs predicted correlation.
- Estimated Correlation:
- estimated correlation for future (unseen) values.
- R-squared:
- the coefficient of determination.
- MSE:
- mean squared error.
- RMSE:
- root mean squared error.
- Main Drivers:
- list of main features with associated relative importance and energy.
- Features:
- list of all features with associated relative importance and energy.
In case the KPI is a binary variable, the following additional metrics are included:
- Positive Label:
- the label used to identify positive cases.
- Negative Label:
- the label used to identify negative cases.
- True Positives:
- number of correctly identified positive cases (TP).
- False Positives:
- number of incorrectly identified positive cases (FP).
- True Negatives:
- number of correctly identified negative cases (TN).
- False Negatives:
- number of incorrectly identified negative cases (FN).
- Not Classified:
- number of records that could not be classified (i.e., KPI is NaN).
- True Positive Rate:
- TP / (TP + FN) * 100 (a.k.a. sensitivity, recall).
- False Positive Rate:
- FP / (FP + TN) * 100 (a.k.a. fall-out).
- True Negative Rate:
- TN / (FP + TN) * 100 (a.k.a. specificity).
- False Negative Rate:
- FN / (TP + FN) * 100 (a.k.a. miss rate).
- Precision:
- percentage of correctly identified cases from the total reported positive cases TP / (TP + FP) * 100.
- Recall:
- percentage of correctly identified cases from the total existing positive cases TP / (TP + FN) * 100.
- Accuracy:
- percentage of correctly identified cases (TP + TN) / (TP + FP + TN + FN) * 100.
- F1 Score:
- the F1 score (the harmonic mean of precision and recall).
- AUC:
- area under (ROC) curve.
-
model
¶ Retrieves an object of type
datastories.model.SingleKpiPredictor
that can be used for making predictions on new data.
-
to_csv
(file_path, content='metrics', delimiter=', ', decimal='.')¶ Export a list of model drivers or metrics to a
CSV
file.Args:
- file_path (str):
- path to the output file.
- content (str=’metrics’):
- the type of metrics to export. Possible values:
'metrics'
: exports estimated model performance metrics.'drivers'
: exports driver importance metrics.
- delimiter (str=’,’):
- character to use as value delimiter.
- decimal (str=’.’):
- character to use as decimal point.
Raises:
ValueError
:- when an invalid value is provided for the [content] argument.
-
to_excel
(file_path)¶ Exports the list of model drivers and metrics to an
Excel
file.Args:
- file_path (str):
- path to the output file.
-
to_html
(file_path, title='Predict Single KPI Run', subtitle='')¶ Export the story visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Predict Single KPI Run’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
to_pandas
(content='metrics')¶ Export a list of model drivers or metrics to a
pandas.DataFrame
object.Args:
- content (str=’metrics’):
- the type of metrics to export. Possible values:
'metrics'
: exports estimated model performance metrics.'drivers'
: exports driver importance metrics.
Returns:
- The constructed
pandas.DataFrame
object.
Raises:
ValueError
:- when an invalid value is provided for the [content] argument.
-
what_ifs
¶ Retrieves a visualization for interactive exploration of the models.
The visualization helps in gaining insight into how driver variables influence the target KPIs. An object of type
datastories.visualization.WhatIfs
that can be used for interactive exploration of the models.
-
Predict Multiple KPIs¶
-
datastories.story.
predict_kpis
(data, column_list, kpi_list, record_info_list=None, runs=3, outlier_elimination=True, prototypes='auto', optimize=False, progress_bar=True, fail_on_error=False)¶ Fit a non-linear regression model on a data frame in order to predict several columns (i.e., KPIs) at the same time.
The columns to be predicted (i.e., the KPIs) are to be identified either by name or by column index in the data frame.
Args:
- data (obj):
- the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object) or a data descriptor (i.e., adatastories.data.DataDescriptor
object).
- column_list (list):
- the list of variables (i.e., columns) to consider for regression.
- kpi_list (list):
- the list of indexes or names for the target columns (i.e., KPIs).
- record_info_list (list=[]):
- the list of indexes or names to be used as additional record info.
- runs (int=3):
- the number of training rounds.
- outlier_elimination (bool=True):
- set to True in order to exclude far outliers from modeling.
- prototypes (str=’auto’):
- indicates whether analysis should be performed on prototypes. Possible values:
'yes'
: use only prototypes as inputs.'no'
: use all original inputs.'auto'
: use prototypes if there are more than 200 input variables.
- optimize (bool=False):
- set to True in order to compute optimal values for the KPIs. This will
run optimization analyses that attempt to first minimize and then
maximize all KPI together. For more complex scenarios (e.g., minimize
a specific KPI while maximizing another) one can use the optimize
method of the model field (
datastories.model.MultiKpiPredictor
) once the story analysis is completed.
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
- fail_on_error (bool=False):
- set to
True
in order to fail (i.e., raise an exception) when problems are detected. Otherwise, the processing completes, producing a partial story object. To check how far the processing has reached, one can use the datastories.story.StoryBase.info()
method.
Returns:
- An object of type
datastories.story.predict_kpis.Story
wrapping-up the computed model.
Raises:
ValueError
:- when an invalid value is provided for one of the input parameters.
datastories.story.StoryError
:- when there is a problem fitting the model.
Example:
from datastories.story import predict_kpis
import pandas as pd

df = pd.read_csv('example.csv')
kpi_column_indexes = [1, 'other kpi', 3, 4]
story = predict_kpis(df, df.columns, kpi_column_indexes)
print(story)
-
class
datastories.story.predict_kpis.
Story
(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)¶ Encapsulates a multi-KPI non-linear regression model analysis.
Base classes:
Note: Objects of this class should not be manually constructed but rather created using the
datastories.story.predict_kpis()
factory method.-
class
ProcessingStage
¶ Enumeration declaring the story processing stages.
Possible values:
- UNKNOWN = 0
- INIT = 1
- PREPARE_DATA = 2
- PROCESS_DATA = 3
- BUILD_MODELS = 4
- MERGE_MODELS = 5
- VALIDATE_MODEL = 6
- OPTIMIZE = 7
- WRAP_UP = 8
- END = 9
-
add_model_validation
(prediction)¶ Add a prediction containing validation data to the story-managed validations.
-
base_model
¶ Retrieves the RSX-based model used for making predictions.
-
best_run
¶ Retrieves the index of the best analysis run.
The best run is selected as the run with the largest cumulated importance overlap between the main drivers of different KPIs. The overlap is computed pairwise between all KPI pairs of a given run.
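The SDK reference does not spell out the exact overlap computation. The sketch below is only one plausible interpretation, treating each KPI's driver importances as a distribution and measuring pairwise overlap as the shared importance mass per driver; the helper names are hypothetical, not SDK API:

```python
from itertools import combinations

def importance_overlap(imp_a, imp_b):
    """Overlap between two driver-importance maps: shared mass per common driver."""
    return sum(min(imp_a[d], imp_b[d]) for d in imp_a.keys() & imp_b.keys())

def cumulated_overlap(run_importances):
    """Sum of pairwise overlaps over all KPI pairs of one run."""
    return sum(importance_overlap(a, b)
               for a, b in combinations(run_importances.values(), 2))

def best_run(runs):
    """Index of the run with the largest cumulated overlap."""
    return max(range(len(runs)), key=lambda i: cumulated_overlap(runs[i]))
```

A run in which the KPIs agree on their main drivers scores a higher overlap and is therefore preferred.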
-
conclusions
¶ Retrieves an analysis summary containing highlights and pointers to detailed insights.
-
data_health
¶ Retrieves a summary of input data quality.
-
data_overview
¶ Retrieves an overview of driver importance across all analysis runs.
-
failed_kpis
¶ Retrieves a list of KPIs that could not be processed, or None if all KPIs have been successfully modeled.
-
model
¶ Retrieves an object of type
datastories.api.IPredictiveModel
that can be used for making predictions on new data.
-
model_validation
¶ Retrieves an overview of model validations.
-
pairwise_plots
¶ Retrieves a collection of variable vs variable plots.
-
record_info_labels
¶ Retrieves the names of the columns that contain record identification information.
-
reset
()¶ Reset the execution pointer of a story to the first stage.
Warning: After calling this, all previous results are discarded. One needs to run the story again in order to regenerate the results. This is only possible when the data frame is still available. That is, resetting a story that previously discarded the data frame (e.g., while saving) would render the story unusable. Consequently, this scenario is not allowed, and an exception is raised when it is attempted.
-
run
(resume_from=None, strict=False, params=None, progress_bar=None, check_interrupt=None)¶ Resume the execution of a story from a given stage.
The stage to resume from is optional. If not specified, the story is executed from the beginning. If the requested stage cannot be executed (e.g., due to missing intermediate results), the closest stage that can be executed is used as the starting point, unless the [strict] argument is set to True. In that case, an exception is raised if the execution cannot be resumed from the requested stage.
Args:
- resume_from (self.ProcessingStage=None):
- The stage to resume execution from. Should be a stage for which all intermediate results are available. If None, the stage at which execution was previously interrupted (if any) is used.
- strict (bool=False):
- Raise an error if execution cannot be resumed from the requested stage.
- params (dict={}):
- Map of parameters to be used with the run. It can override the original parameters, but this leads to invalidating previous results that depend on the updated parameter values.
- progress_bar (obj=None):
- An object of type
datastories.display.ProgressReporter
to replace the currently used progress reporter. When not specified, the current story progress reporter is not modified. The use case for this is to set a progress bar after the story is loaded, when a progress bar cannot be given to the load function directly (e.g., when a progress bar has to be constructed based on the story).
- check_interrupt (func=None):
- an optional callback to check whether analysis execution needs to be interrupted.
Raises:
- StoryError:
- if a stage is specified for which no intermediate results are available and the ‘strict’ argument is set to True.
-
run_overview
¶ Retrieves an overview of driver importance across all analysis runs.
-
runs
¶ Retrieves a list containing the results of individual analysis rounds.
Each entry in the list is an object of type
datastories.story.predict_kpis.StoryRun
encapsulating the results associated with a given analysis round.
-
save
(file_path, include_data=True)¶ Save the story analysis results.
Use this function to persist the results of the
datastories.story.predict_kpis()
analysis. One can reload them and continue investigations at a later moment using the datastories.story.load()
method.Args:
- file_path (str):
- path to the output file.
- include_data (bool=True):
- set to
True
to include a copy of the data in the exported file.
Raises:
datastories.api.errors.StoryError
:- when attempting to include data while the story does not contain a data reference. This is the case with stories that have been previously saved without including the data.
-
stats
¶ Retrieves a dictionary containing the model performance statistics and the list of main drivers.
These statistics are computed on the training data for the purpose of evaluating the model prediction performance.
The following statistics are retrieved:
- Prediction Performance:
- the prediction performance per KPI (correlation coefficient for regression, or AUC for classification).
- Driver Importance:
- relative driver importance per KPI.
- Driver Overlap:
- cumulated driver importance overlap computed between all possible pairs of KPIs.
-
to_html
(file_path, title='Predict Multiple KPIs', subtitle='')¶ Export the story visualization to a standalone
HTML
document.Args:
- file_path (str):
- name of the file to export to.
- title (str=’Predict Multiple KPIs’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.story.predict_kpis.
StoryRun
(run_idx=None, upload_function=None, progress_bar=None, folder=None, dependencies=None, parent=None, *args, **kwargs)¶ Encapsulates the results of one analysis round of a multi-KPI non-linear regression model analysis.
Base classes:
Note: Objects of this class should not be manually constructed.
-
base_model
¶ Retrieves the RSX-based model used for making predictions.
-
correlation_browser
¶ Retrieves an overview of linear and nonlinear correlations across most relevant variables in the analysis.
The most relevant variables are identified based on the amount of correlation they exhibit with respect to other variables in the analysis.
-
driver_overview
¶ Retrieves an overview of driver importance across all KPIs.
-
metrics
¶ Retrieves a set of metrics computed during analysis.
NOTE: This is an alias for the .stats property.
-
model
¶ Retrieves an object of type
datastories.api.IPredictiveModel
that can be used for making predictions on new data.
-
outliers
¶ Retrieves a dictionary of outlier values per column used in modeling.
-
stats
¶ Retrieves a dictionary containing the model performance statistics and the list of main drivers.
These statistics are computed on the training data for the purpose of evaluating the model prediction performance.
The following statistics are retrieved:
- Prediction Performance:
- the prediction performance per KPI (correlation coefficient for regression, or AUC for classification).
- Driver Importance:
- relative driver importance per KPI.
- Driver Overlap:
- cumulated driver importance overlap computed between all possible pairs of KPIs.
-
to_csv
(file_path, content='Driver Importance', delimiter=', ', decimal='.')¶ Export a list of story metrics to a
CSV
file.Args:
- file_path (str):
- path to the output file.
- content (str=’Driver Importance’):
- the type of metrics to export. Possible values:
'Prediction Performance'
: exports estimated model performance metrics;'Driver Importance'
: exports driver importance metrics.
- delimiter (str=’,’):
- character used as value delimiter.
- decimal (str=’.’):
- character used as decimal point.
Raises:
ValueError
:- when an invalid value is provided for the [content] argument.
-
to_excel
(file_path)¶ Export the list of story metrics to an
Excel
file.Args:
- file_path (str):
- path to the output file.
-
to_html
(file_path, title='Predict Multiple KPIs Run', subtitle='')¶ Export the story visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Predict Multiple KPIs Run’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
to_pandas
(content='Driver Importance')¶ Export a list of model drivers or metrics to a
pandas.DataFrame
object.Args:
- content (str=’Driver Importance’):
- the type of metrics to export. Possible values:
'Prediction Performance'
: exports estimated model performance metrics;'Driver Importance'
: exports driver importance metrics.
Returns:
- The constructed
pandas.DataFrame
object.
Raises:
ValueError
:- when an invalid value is provided for the [content] argument.
-
what_ifs
¶ Retrieves an interactive what-ifs analysis visualization.
-
-
class
datastories.story.predict_kpis.
ProgressBar
(story=None, runs=None, kpi_list=None, *args, **kwargs)¶ Convenience wrapper for
datastories.display.AggregatedReporter
. It constructs aggregated progress reporters for multi-KPI stories. To this end, it requires either a story object (if already available) or two parameters that define the processing stages: the number of runs and the list of KPIs. When the story is specified, the number of runs and the KPI list should not be provided.
Args:
- story (obj=None):
- an optional multi-KPI story object of type
datastories.story.predict_kpis.Story
from which processing stages will be inferred.
- runs (int=None):
- an optional integer specifying the number of runs.
- kpi_list (list=None):
- an optional list specifying the story KPIs as they would be provided to the analysis.
Raises:
ValueError
:- when the provided parameters do not match the specification requirements.
Check Data Health¶
-
datastories.story.
check_data_health
(data, sample_size=None, progress_bar=True, on_snapshot=None, upload_function=None, check_interrupt=None, fail_on_error=False)¶ Check the suitability of a dataset for building statistical models.
Args:
- data (obj):
- the input data frame (either a
pandas.DataFrame
or adatastories.data.DataFrame
object) or a data descriptor (i.e., adatastories.data.DataDescriptor
object).
- sample_size (int|str=None):
- the sample size to use for inferring data types (either an absolute integer value or a percentage, e.g. ‘10%’). If left unspecified, it defaults to the minimum of 100 and 10% of the number of points.
- progress_bar (obj|bool=True):
- An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
- on_snapshot (func=None):
- an optional callback to be executed when an analysis snapshot is created. The callback receives one argument indicating the path of the snapshot file relative to the current execution folder.
- upload_function (func=None):
- an optional callback to upload analysis result files to a client specific storage. The callback receives one argument indicating the path of the result file relative to the current execution folder.
- check_interrupt (func=None):
- an optional callback to check whether analysis execution needs to be interrupted.
- fail_on_error (bool=False):
- set to
True
in order to fail (i.e., raise an exception) when problems are detected. Otherwise, the processing completes, producing a partial story object. To check how far the processing has reached, one can use the datastories.story.StoryBase.info()
method.
Returns:
- An object of type
datastories.story.check_data_health.Story
wrapping-up the data health report.
Example:
from datastories.story import check_data_health
import pandas as pd

df = pd.read_csv('example.csv')
story = check_data_health(df)
print(story)
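The sample_size parameter described above accepts an absolute count or a percentage string, with a documented default of the minimum of 100 and 10% of the number of points. That resolution logic can be sketched as follows (resolve_sample_size is a hypothetical helper for illustration, not an SDK function):

```python
def resolve_sample_size(sample_size, n_points):
    """Resolve a sample size given as an absolute count or a percentage string."""
    if sample_size is None:
        # documented default: the minimum of 100 and 10% of the points
        return min(100, int(n_points * 0.10))
    if isinstance(sample_size, str) and sample_size.endswith('%'):
        return int(n_points * float(sample_size[:-1]) / 100)
    return int(sample_size)
```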
-
class
datastories.story.check_data_health.
Story
(data=None, params=None, metainfo=None, raw_results=None, results=None, folder=None, notes=None, upload_function=None, on_snapshot=None, progress_bar=False)¶ Encapsulates a data health analysis.
Base classes:
Note: Objects of this class should not be manually constructed but rather created using the
datastories.story.check_data_health()
factory method.-
class
ProcessingStage
¶ Enumeration declaring the story processing stages.
Possible values:
- UNKNOWN = 0
- INIT = 1
- PREPARE_DATA = 2
- COMPUTE_DATA_SUMMARY = 3
- END = 4
-
data_summary
¶ Retrieves an interactive data summary representation.
Returns:
- An object of type
datastories.data.DataSummaryResult
that can be used for assessing variable type and value distribution.
-
stats
¶ Retrieves the set of data health statistics.
-
to_html
(file_path, title='Data Health Report', subtitle='')¶ Exports the analysis result visualization to a standalone
HTML
document.Args:
- file_path (str):
- name of the file to export to.
- title (str=’Data Health Report’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
to_pandas
()¶ Export the data health stats to a
Pandas
data frame.Returns:
- The constructed
Pandas
data frame object.
- The constructed
-
Visualization¶
Display Utils¶
The datastories.display
package contains a collection
of display helpers.
-
datastories.display.
wide_screen
(width=0.95)¶ Make the notebook screen wider when running under
Jupyter Notebook
.Args:
- width (float=0.95):
- width of notebook as a fraction of the screen width. Should be in the interval [0,1].
Raises:
ValueError
:- when the [width] argument is outside the accepted interval.
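The [0,1] check can be sketched as follows (validate_width is a hypothetical illustration of the documented validation, not an SDK function):

```python
def validate_width(width=0.95):
    """Validate a notebook width fraction as documented for wide_screen."""
    if not 0 <= width <= 1:
        raise ValueError('width must be in the interval [0, 1]')
    return f'{width * 100:.0f}%'  # e.g., a CSS width for the notebook container
```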
-
datastories.display.
init_graphics
()¶ Initializes the DataStories graphics engine.
Use this function at the top of your notebooks when planning to save HTML copies of your work.
-
datastories.display.
get_progress_bar
(progress_bar)¶ Retrieves a default implementation for a progress bar.
Args:
- progress_bar (obj|bool=False):
an object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing). When a
datastories.display.ProgressReporter
object is provided, it is returned as is.
Returns:
- An object of type
datastories.api.ProgressReporter
.
-
class
datastories.display.
ProgressCounter
¶ Base class implemented by all progress counters (including progress reporters).
Attributes:
- total (int):
- the number of steps required for completion.
- step (int):
- the current step.
- start_time (int):
- the start time in ns.
- stop_time (int):
- the stop time in ns.
-
increment
(steps=1)¶ Registers a processing advance with a number of steps.
Args:
- steps (int):
- the number of steps to advance.
-
start
(total=1)¶ Initialize the progress range.
Args:
- total (int):
- the number of steps required for completion.
-
stop
()¶ Stop progress monitoring.
-
timeout
()¶ Mark the step at which the execution timeout occurred.
Use this upon interrupting counting before reaching the end (i.e., step < total).
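A minimal concrete counter following the attributes and methods documented above might look like this (a sketch, not the SDK implementation):

```python
import time

class SimpleProgressCounter:
    """Minimal counter mirroring the documented ProgressCounter attributes."""

    def __init__(self):
        self.total = 0            # steps required for completion
        self.step = 0             # current step
        self.start_time = None    # start time in ns
        self.stop_time = None     # stop time in ns

    def start(self, total=1):
        """Initialize the progress range."""
        self.total = total
        self.step = 0
        self.start_time = time.monotonic_ns()

    def increment(self, steps=1):
        """Register a processing advance with a number of steps."""
        self.step += steps

    def stop(self):
        """Stop progress monitoring."""
        self.stop_time = time.monotonic_ns()
```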
-
class
datastories.display.
ProgressReporter
(observers=[])¶ Abstract base class implemented by all progress reporters.
Base classes:
Args:
- observers (list):
- list of progress observers to be notified on progress updates.
-
header
¶ Retrieves/sets the current reporting header.
-
increment
(steps=1)¶ Register a processing advance with a number of steps.
Args:
- steps (int=1):
- number of advance steps.
-
log
(message)¶ Log a progress message.
Args:
- message (str):
- progress message to log.
-
on_progress
(progress)¶ Log the completion percentage.
Args:
- progress (float=None):
- completion percentage to be logged.
-
progress
¶ Retrieves the currently reported progress.
-
report
()¶ Notify observers on progress updates.
-
start
(total=1)¶ Start progress reporting.
Args:
- total (int=1):
- total number of steps required for completion.
-
state
¶ Retrieves/sets the currently reported state.
-
stop
(info='')¶ Stop progress reporting.
Args:
- info (str=’’):
- optional message to report.
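The reporter/observer relationship can be illustrated with a short plain-Python sketch; the EchoObserver class and the percentage formula (step / total * 100) are assumptions for illustration, not part of the SDK.

```python
class EchoObserver:
    """Hypothetical observer that collects reported completion percentages."""

    def __init__(self):
        self.seen = []

    def on_progress(self, progress):
        self.seen.append(progress)

class MinimalReporter:
    """Plain-Python sketch of the reporter pattern: step/total -> percentage."""

    def __init__(self, observers=None):
        self.observers = observers or []
        self.total = 1
        self.step = 0

    def start(self, total=1):
        self.total = total
        self.step = 0

    def increment(self, steps=1):
        self.step += steps
        self.report()

    def report(self):
        # Notify observers on progress updates.
        progress = 100.0 * self.step / self.total
        for obs in self.observers:
            obs.on_progress(progress)

obs = EchoObserver()
rep = MinimalReporter(observers=[obs])
rep.start(total=4)
for _ in range(4):
    rep.increment()
print(obs.seen)  # [25.0, 50.0, 75.0, 100.0]
```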
-
class
datastories.display.
AggregatedReporter
(stages=None, observers=None, display=True, bar_length=50)¶ A progress reporter that aggregates progress of a number of independent stages.
Stages are to be specified in the beginning, together with an estimation of the stage importance relative to the whole execution. The progress of each stage will be individually monitored and reported in the context of the whole execution.
Stages are to be identified and activated by setting the progress header.
Args:
- stages (dict):
- a dictionary mapping local stage names to their bounds in the globally reported progress.
- observers (list):
- list of observers to be notified about progress updates.
- display (bool=True):
- set to
False
in order to disable progress display (e.g., when the display is done by observers)
- bar_length (int=cfg):
- optional size of the progress bar. Defaults to the value specified in the SDK configuration settings, or 25 if no configuration settings are provided.
Example:
stages = { 'Stage 1': (0, 50), 'Stage 2': (50, 100) }
reporter = AggregatedReporter(stages=stages)
-
header
¶ Retrieves/sets the progress report header.
-
log
(message)¶ Log a progress message.
Args:
- message (str):
- progress message to log.
-
on_progress
(progress=None)¶ Log the completion percentage.
Args:
- progress (float=None):
- completion percentage to be logged.
-
reset
()¶ Reset the progress reporter.
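The mapping from stage-local progress to global progress implied by the stage bounds can be sketched as follows; the linear interpolation below is an assumption based on the documented (start, end) bounds, not the SDK's exact formula.

```python
def aggregate(stages, stage_name, local_pct):
    """Map a stage-local completion percentage into the global range,
    given per-stage (global_start, global_end) bounds."""
    lo, hi = stages[stage_name]
    return lo + (hi - lo) * local_pct / 100.0

stages = {'Stage 1': (0, 50), 'Stage 2': (50, 100)}
print(aggregate(stages, 'Stage 1', 100))  # 50.0: finishing stage 1 is half the work
print(aggregate(stages, 'Stage 2', 40))   # 70.0: 40% into stage 2
```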
Plots¶
The datastories.visualization
package contains a collection
of visualizations that facilitates the assessment of selected DataStories
analysis results.
-
class
datastories.visualization.
VisualizableMixin
(title='', subtitle='')¶ Mixin for classes that provide a visualization property.
Enables exporting to HTML and managing the visualization settings, and provides a Jupyter representation.
-
plot
(*args, **kwargs)¶ Display an interactive visualization.
-
to_html
(file_path, title=None, subtitle=None)¶ Exports the visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to output file.
- title (str=None):
- HTML document title.
- subtitle (str=None):
- HTML document subtitle.
Raises:
datastories.api.errors.VisualizationError
when no visualization is defined.
-
vis_settings
¶ Retrieves/sets the visualization settings.
Raises:
datastories.api.errors.VisualizationError
:- when no visualization is defined.
-
visualization
¶ Retrieves the current visualization.
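The documented contract — to_html raises a VisualizationError when no visualization is defined — can be sketched with plain-Python stand-ins; the class and exception below are illustrative, not the SDK types.

```python
class VisualizationError(Exception):
    """Stand-in for datastories.api.errors.VisualizationError."""

class SketchVisualizable:
    """Plain-Python sketch of the VisualizableMixin export contract."""

    def __init__(self, visualization=None):
        self._visualization = visualization

    @property
    def visualization(self):
        # Retrieves the current visualization.
        return self._visualization

    def to_html(self, file_path, title=None, subtitle=None):
        # Exporting without a visualization is an error, per the docs above.
        if self._visualization is None:
            raise VisualizationError('no visualization is defined')
        with open(file_path, 'w') as fh:
            fh.write(f'<html><title>{title or ""}</title>{self._visualization}</html>')

try:
    SketchVisualizable().to_html('report.html')
except VisualizationError as err:
    print(err)  # no visualization is defined
```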
-
-
class
datastories.visualization.
ClassificationPlotSettings
(x_axis=None, jitter=0.2, *args, **kwargs)¶ Encapsulates visualization settings for
datastories.visualization.ClassificationPlot
visualizations.Args:
- x_axis (str=None):
- Column to display on the X axis.
- jitter (float=0.2):
- Amount of ‘jitter’ to add on the Y axis in order to minimize overlapping.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
ClassificationPlot
(data, predicted_name, actual_name, prediction_performance=None, vis_settings=None, *args, **kwargs)¶ Visual representation of binary classification (performance).
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.ClassificationPlotSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Classification Plot visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ClassificationPlotSettings
objects.
-
class
datastories.visualization.
ConclusionsSettings
¶ Encapsulates visualization settings for
datastories.visualization.Conclusions
visualizations.
-
class
datastories.visualization.
Conclusions
(conclusions=None, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of KPI drivers.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.ConclusionsSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Conclusions visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ConclusionsSettings
objects.
-
to_html
(file_path, title='Conclusions', subtitle='')¶ Exports the Conclusions visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Conclusions’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.visualization.
ConfusionMatrixSettings
(width=480, height=320)¶ Encapsulates visualization settings for
datastories.visualization.ConfusionMatrix
visualizations.Args:
- width (int=480):
- Graph width in pixels.
- height (int=320):
- Graph height in pixels.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
ConfusionMatrix
(prediction_performance, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of model accuracy for binary classification models.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.ConfusionMatrixSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Confusion Matrix visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ConfusionMatrixSettings
objects.
-
to_html
(file_path, title='Confusion Matrix', subtitle='')¶ Exports the Confusion Matrix visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Confusion Matrix’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
datastories.visualization.
correlation_browser
(file_path=None, raw_content=None, vis_settings=None)¶ Displays a Correlation Browser visualization in a Jupyter notebook based on an input correlation data file.
Args:
- file_path (str=None):
- path to the input data file containing a serialized datastories.correlation.CorrelationResult object.
- raw_content (str=None):
- a string containing a JSON-serialized datastories.correlation.CorrelationResult object.
- vis_settings (obj=CorrelationBrowserSettings()):
- an object of type
datastories.visualization.CorrelationBrowserSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
NOTE: Either the [file_path] or [raw_content] argument has to be provided but not both.
Returns:
- An object of type
datastories.visualization.CorrelationBrowser
Raises:
ValueError
:- when both the [file_path] and the [raw_content] arguments are provided.
Example:
from datastories.visualization import correlation_browser
correlation_browser('correlations.json')
-
class
datastories.visualization.
CorrelationBrowserSettings
(scale=1, node_opacity=0.9, edge_opacity=0.3, tension=0.65, font_size=15, filter_unconnected=False, min_weight=50, max_weight=100, weight_key='weightMI', show_controls=True, show_inspector=True)¶ Encapsulates visualization settings for
datastories.visualization.CorrelationBrowser
visualizations.Args:
- scale (float=1):
- Scale factor of the radius [0-1].
- node_opacity (float=0.9):
- Opacity of the nodes that aren’t hovered or connected to hovered or selected nodes [0-1].
- edge_opacity (float=0.3):
- Opacity of the edges that aren’t hovered or connected to hovered or selected nodes [0-1].
- tension (float=0.65):
- The tension of the links. A tension of 0 means straight lines [0-1].
- font_size (int=15):
- Font size used for the nodes of the plot [10-32].
- filter_unconnected (boolean=False):
- Whether nodes that aren’t connected to any other node are filtered from the view.
- min_weight (int=50):
- Minimum weight of the links that will be shown [0-100].
- max_weight (int=100):
- Maximum weight of the links that will be shown [0-100].
- weight_key (str=’weightMI’):
- Type of relations to display [‘weightMI’ for Mutual Information, ‘weightL’ for Linear Correlation].
- show_controls (bool=True):
- Set to True in order to display relation controls.
- show_inspector (bool=True):
- Set to True in order to display the relation inspector window.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
CorrelationBrowser
(correlation_result=None, raw_content=None, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of correlation between features.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.CorrelationBrowserSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Correlation Browser visualization.
Accepts the same parameters as the constructor for
datastories.visualization.CorrelationBrowserSettings
objects.
-
to_html
(file_path, title='Correlation Browser', subtitle='')¶ Exports the Correlation Browser visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Correlation Browser’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.visualization.
DataHealthSettings
¶ Encapsulates visualization settings for
datastories.visualization.DataHealth
visualizations.
-
class
datastories.visualization.
DataHealth
(data_health=None, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of data health report.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.DataHealthSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Data Health visualization.
Accepts the same parameters as the constructor for
datastories.visualization.DataHealthSettings
objects.
-
to_html
(file_path, title='Data Health', subtitle='')¶ Exports the Data Health visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Data Health’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.visualization.
DataSummaryTableSettings
(page_size=25, show_console=True)¶ Encapsulates visualization settings for
datastories.visualization.DataSummaryTable
visualizations.Args:
- page_size (int=25):
- Maximum number of columns to display on one summary page.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
DataSummaryTable
(summary, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of data frame summary.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.DataSummaryTableSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Data Summary visualization.
Accepts the same parameters as the constructor for
datastories.visualization.DataSummaryTableSettings
objects.
-
to_html
(file_path, title='Data Summary', subtitle='')¶ Exports the Data Summary visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Data Summary’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
datastories.visualization.
driver_overview
(file_path=None, raw_content=None, vis_settings=None)¶ Displays a DriverOverview visualization in a Jupyter notebook based on an input driver overview data file.
Args:
- file_path (str=None):
- path to the input driver overview data file.
- vis_settings (obj):
- an object of type
datastories.visualization.DriverOverviewSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
NOTE: Either the [file_path] or [raw_content] argument has to be provided but not both.
Returns:
- An object of type
datastories.visualization.DriverOverview
Raises:
ValueError
:- when both the [file_path] and the [raw_content] arguments are provided.
Example:
from datastories.visualization import driver_overview
driver_overview('driver_overview.json')
-
class
datastories.visualization.
DriverOverviewSettings
(height=600)¶ Encapsulates visualization settings for
datastories.visualization.DriverOverview
visualizations.Args:
- height (int=600):
- Graph height in pixels.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
DriverOverview
(driver_overview=None, raw_content=None, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of KPI drivers.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.DriverOverviewSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Driver Overview visualization.
Accepts the same parameters as the constructor for
datastories.visualization.DriverOverviewSettings
objects.
-
to_html
(file_path, title='Driver Overview', subtitle='')¶ Exports the Driver Overview visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Driver Overview’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.visualization.
ErrorPlotSettings
(sort_key='id', lines=False, highlight_outliers=False, threshold=None, confidence=False, x_padding=0, y_padding=1, marker_size=32, hover_marker_size_delta=32, animations=500, margin_top=10, margin_right=20, margin_bottom=40, margin_left=60)¶ Encapsulates visualization settings for
datastories.visualization.ErrorPlot
visualizations.Args:
- sort_key (str=’id’):
- The sorting criterion for the X axis. Possible values:
'id'
: sort on record id.
'act'
: sort on record actual KPI value.
'pred'
: sort on record predicted value.
- lines (bool=False):
- set to True if points should be connected by lines.
- highlight_outliers (bool=False):
- set to True if outliers should be highlighted.
- threshold (float=None):
- Threshold.
- confidence (bool=False):
- set to True if confidence limits should be displayed.
- x_padding (int=0):
- X padding.
- y_padding (int=1):
- Y padding.
- marker_size (int=32):
- Size of the point marker.
- hover_marker_size_delta (int=32):
- Size of the point hover marker.
- animations (int=500):
- Animation duration in milliseconds.
- margin_top (int=10):
- top margin in pixels.
- margin_right (int=20):
- right margin in pixels.
- margin_bottom (int=40):
- bottom margin in pixels.
- margin_left (int=60):
- left margin in pixels.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
ErrorPlot
(prediction_performance, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of model accuracy for regression models.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.ErrorPlotSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Error Plot visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ErrorPlotSettings
objects.
-
to_html
(file_path, title='Error Plot', subtitle='')¶ Exports the Error Plot visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Error Plot’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.visualization.
FeatureRanksTableSettings
(height=460, show_console=True)¶ Encapsulates visualization settings for
datastories.visualization.FeatureRanksTable
visualizations.Args:
- height (int=460):
- graph height in pixels.
- show_console (bool=True):
- displays the visualization console where update operations are logged.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
FeatureRanksTable
(feature_ranks, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of feature ranking.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.FeatureRanksTableSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Feature Ranking visualization.
Accepts the same parameters as the constructor for
datastories.visualization.FeatureRanksTable
objects.
-
to_html
(file_path, title='Feature Ranking', subtitle='')¶ Exports the Feature Ranking visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Feature Ranking’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.visualization.
OutlierPlotSettings
(width=800, height=200, x_padding=0.2, y_padding=0.2, marker_size=32, hover_marker_size_delta=32, animations=500, show_jitter=True, show_cdf=True, show_iqr=True, show_summary=True, show_console=True, show_legend=True, low_threshold=0.05, high_threshold=0.95)¶ Encapsulates visualization settings for
datastories.visualization.OutlierXPlot
visualizations.Args:
- width (int=800):
- graph width in pixels.
- height (int=200):
- graph height in pixels.
- x_padding (float=0.2):
- padding on horizontal axis.
- y_padding (float=0.2):
- padding on vertical axis.
- marker_size (int=32):
- size of the point marker.
- hover_marker_size_delta (int=32):
- size of the point hover marker.
- animations (int=500):
- animation duration in milliseconds.
- show_jitter (bool=True):
- set to True to add jitter on the vertical dimension, to better distinguish points.
- show_cdf (bool=True):
- set to True to display the cumulative distribution function.
- show_iqr (bool=True):
- set to True to display the inter-quartile range, as specified in the lower and higher threshold arguments.
- show_summary (bool=True):
- set to True to display the summary table.
- show_console (bool=True):
- set to True to display the visualization console where update operations are logged.
- low_threshold (float=0.05):
- the lower threshold for the inter-quartile range.
- high_threshold (float=0.95):
- the upper threshold for the inter-quartile range.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
OutlierXPlot
(outliers_result, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of outliers resulting from a one dimensional analysis.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.OutlierPlotSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Outliers visualization.
Accepts the same parameters as the constructor for
datastories.visualization.OutlierPlotSettings
objects.
-
to_html
(file_path, title='Outliers', subtitle='')¶ Exports the Outliers visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Outliers’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.visualization.
PredictedVsActualSettings
(width=400, highlight_outliers=True, threshold=0.5, x_padding=0.2, y_padding=0.2, marker_size=32, hover_marker_size_delta=32, animations=500)¶ Encapsulates visualization settings for
datastories.visualization.PredictedVsActual
visualizations.Args:
- width (int=400):
- graph width in pixels.
- highlight_outliers (bool=True):
- set to True if outliers should be highlighted.
- threshold (float=0.5):
- threshold.
- x_padding (float=0.2):
- amount of padding on the horizontal axis [0-1).
- y_padding (float=0.2):
- amount of padding on the vertical axis [0-1).
- marker_size (int=32):
- size of the point marker.
- hover_marker_size_delta (int=32):
- size of the point hover marker.
- animations (int=500):
- animation duration in milliseconds.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
PredictedVsActual
(prediction_performance, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of model accuracy for regression models.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.PredictedVsActualSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Predict vs Actual visualization.
Accepts the same parameters as the constructor for
datastories.visualization.PredictedVsActualSettings
objects.
-
to_html
(file_path, title='Predicted vs Actual', subtitle='')¶ Exports the Predict vs Actual visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Predicted vs Actual’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
class
datastories.visualization.
PrototypeTableSettings
(height=320, show_console=True, selectable=True, condensed=True)¶ Encapsulates visualization settings for
datastories.visualization.PrototypeTable
visualizations.Args:
- height (int=320):
- graph height in pixels.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
PrototypeTable
(prototypes, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation of feature prototypes.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Attributes:
- vis_settings (obj):
- an object of type
datastories.visualization.PrototypeTableSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the Prototypes visualization.
Accepts the same parameters as the constructor for
datastories.visualization.PrototypeTableSettings
objects.
-
to_html
(file_path, title='Prototypes', subtitle='')¶ Exports the Prototypes visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’Prototypes’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-
datastories.visualization.
what_ifs
(file_path=None, raw_content=None, init_values=None, minimize_values=None, maximize_values=None, vis_settings=None)¶ Displays a What-Ifs visualization in a Jupyter notebook based on an input RSX model file.
Args:
- file_path (str=None):
- path to the input RSX model file. If
None
the [raw_content] argument has to be provided.
- raw_content (bytes=None):
- a bytes object, containing the source of the backing RSX model.
- init_values (list=[]):
- list of initial driver values.
- minimize_values (list=None):
- driver values that minimize the KPI.
- maximize_values (list=None):
- driver values that maximize the KPI.
- vis_settings (obj=WhatIfsSettings()):
- An object of type
datastories.visualization.WhatIfsSettings
containing the initial visualization settings.
NOTE: Either the [file_path] or [raw_content] argument has to be provided but not both.
Returns:
- An object of type
datastories.visualization.WhatIfs
Raises:
ValueError
:- when both the [file_path] and the [raw_content] arguments are provided.
Example:
from datastories.visualization import what_ifs what_ifs('my_model.rsx')
-
class
datastories.visualization.
WhatIfsSettings
(show_controls=True, show_console=True, show_optimizer=False)¶ Encapsulates visualization settings for
datastories.visualization.WhatIfs
visualizations.Args:
- show_controls (bool=True):
- Set to True in order to display the visualization controls.
- show_console (bool=True):
- Set to True in order to display the visualization console.
- show_optimizer (bool=False):
- Set to True in order to enable the optimizer functionality.
Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
WhatIfs
(init_values=None, minimize_values=None, maximize_values=None, driver_importances=None, raw_model=None, vis_settings=None, *args, **kwargs)¶ Encapsulates a visual representation for exploring the influence of driver variables on target KPIs.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Note: Objects of this class should not be manually constructed.
-
drivers
¶ Retrieves/sets the driver values.
-
maximize
()¶ Identify a set of driver values that maximize the KPI.
-
minimize
()¶ Identify a set of driver values that minimize the KPI.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the What-Ifs visualization.
Accepts the same parameters as the constructor for
datastories.visualization.WhatIfsSettings
objects.
-
to_html
(file_path, title='What-Ifs', subtitle='')¶ Exports the What-Ifs visualization to a standalone
HTML
document.Args:
- file_path (str):
- path to the output file.
- title (str=’What-Ifs’):
- HTML document title.
- subtitle (str=’’):
- HTML document subtitle.
-