API Reference
General Interfaces
Base classes
-
class
datastories.api.
IAnalysisResult
¶ Interface implemented by all analysis results
-
plot
(*args, **kwargs)¶ Plots a graphical representation of the results in Jupyter Notebook.
-
to_csv
(file_name, delimiter=', ', decimal='.')¶ Exports the result to a CSV file.
- Args:
file_name (str): name of the file to export to.
delimiter (str=’,’): CSV delimiter.
decimal (str=’.’): CSV decimal point.
-
to_excel
(file_name)¶ Exports the result to an Excel file.
- Args:
file_name (str): name of the file to export to.
-
to_pandas
()¶ Exports the result to a Pandas DataFrame.
- Returns:
- The constructed Pandas DataFrame.
-
-
class
datastories.api.
IPredictiveModel
(prediction_type, *args, **kwargs)¶ Interface implemented by all prediction models
-
metrics
¶ A dictionary containing model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
-
predict
(data_frame)¶ Predict the model KPI on a new data frame.
- Args:
data_frame (obj): the data frame on which the KPI associated with the model is to be predicted.
- Returns:
- An object of type datastories.regression.PredictionResult encapsulating the prediction results.
- Raises:
ValueError: if not all required columns are provided.
Note: All columns present in the training data frame are required for making predictions even if they are not significant for the prediction.
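The column requirement in the note above can be checked before calling predict. The following is a minimal sketch in plain Python; the helper `find_missing_columns` and the column names are hypothetical and are not part of the DataStories API.

```python
def find_missing_columns(training_columns, provided_columns):
    """Return the training columns that are absent from the new data, in order."""
    provided = set(provided_columns)
    return [name for name in training_columns if name not in provided]


# Hypothetical column names, for illustration only.
training_columns = ["temperature", "pressure", "humidity"]
new_data_columns = ["temperature", "humidity"]

missing = find_missing_columns(training_columns, new_data_columns)
if missing:
    print("Columns to add before predicting:", ", ".join(missing))
```

Running such a check first gives a readable list of the absent columns instead of relying on the ValueError alone.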
-
to_cpp
(file_name)¶ Export the model to a C++ file.
- Args:
file_name (str): name of the file to export to.
- Raises:
DatastoriesError: if there is a problem saving the file.
-
to_excel
(file_name)¶ Export the model to an Excel file.
- Args:
file_name (str): name of the file to export to.
- Raises:
DatastoriesError: if there is a problem saving the file.
-
to_matlab
(file_name)¶ Export the model to a MATLAB file.
- Args:
file_name (str): name of the file to export to.
- Raises:
DatastoriesError: if there is a problem saving the file.
-
to_py
(file_name)¶ Export the model to a Python file.
- Args:
file_name (str): name of the file to export to.
- Raises:
DatastoriesError: if there is a problem saving the file.
-
to_r
(file_name)¶ Export the model to an R file.
- Args:
file_name (str): name of the file to export to.
- Raises:
DatastoriesError: if there is a problem saving the file.
-
-
class
datastories.api.
IPrediction
(prediction_type, *args, **kwargs)¶ Bases:
datastories.api.interface.IAnalysisResult
Interface implemented by all prediction results
-
metrics
¶ A dictionary containing prediction performance metrics.
These metrics are computed when the data frame used for prediction includes KPI values, for the purpose of evaluating the model prediction performance.
-
-
class
datastories.api.
IStory
(notes=[], *args, **kwargs)¶ Bases:
datastories.api.interface.IAnalysisResult
Interface implemented by all story analyses
-
add_note
(note)¶ Add an annotation to the story results.
The annotations already present can be retrieved using the datastories.api.IStory.notes() property.
- Args:
note (str): the annotation to be added.
-
clear_note
(note_id)¶ Remove a specific annotation associated with the story analysis.
- Args:
note_id (int): the index of the note to be removed.
- Raises:
ValueError: if the note index is unknown.
-
clear_notes
()¶ Clear the annotations associated with the story analysis.
-
static
load
(file_name)¶ Loads a previously saved story.
-
metrics
¶ Returns a set of metrics computed during analysis.
-
notes
¶ A text representation of all annotations currently associated with the story analysis.
-
save
(file_name)¶ Saves the story analysis results.
-
-
class
datastories.api.
IPredictiveStory
(notes=[], *args, **kwargs)¶ Bases:
datastories.api.interface.IStory
Interface implemented by all story analyses that generate a predictive model
-
model
¶ Returns an object of type
datastories.api.IPredictiveModel
that can be used for making predictions on new data
-
Data
The datastories.data package contains a collection of classes and functions for handling data and converting it to and from the internal format used by DataStories.
Data Frame Construction
-
class
datastories.data.
DataFrame
¶ Encapsulates a data frame in the DataStories format.
- Args:
rows (int): number of rows in the data frame. cols (int): number of columns in the data frame. types (list): list of value types for the data frame columns.
-
cols
(self)¶ Get the number of columns in the data frame.
- Returns:
- (int) : number of columns in the data frame.
-
static
from_pandas
(df)¶ Construct a new datastories.data.DataFrame from a Pandas DataFrame object.
- Args:
df (obj): the source Pandas DataFrame object.
- Returns:
- The constructed datastories.data.DataFrame object.
-
get
(self, size_t row, size_t col)¶ Get the value of a cell in the data frame.
- Args:
row (int): the index of the cell row.
col (int): the index of the cell column.
- Returns:
- (float|string) : the value in the data frame cell at position (row, col).
-
get_type
(self, size_t col)¶ Get the type of values in a given column.
- Args:
col (int): the index of the column.
- Returns:
- An object of type datastories.data.ColumnType.
-
name
(self, size_t col)¶ Get the name of a specific column.
- Args:
col (int): the index of the column.
- Returns:
- (str) : the name of the column with the given index.
-
names
(self)¶ Get the data frame column names.
- Returns:
- (list) : a list of strings.
-
static
read_csv
(filename, delimiter=u', ', decimal=u'.', quote=u'"', int header_rows=1, missing_values=None)¶
-
rows
(self)¶ Get the number of rows in the data frame.
- Returns:
- (int) : number of rows in the data frame.
-
set_float
(self, size_t row, size_t col, double val)¶ Sets the value of a given cell to a new float value.
- Args:
row (int): the row index of the cell. col (int): the column index of the cell. val (float): the new float value.
-
set_int
(self, size_t row, size_t col, int64_t val)¶ Sets the value of a given cell to a new int value.
- Args:
row (int): the row index of the cell. col (int): the column index of the cell. val (int): the new int value.
-
set_name
(self, size_t col, string name)¶ Set the name of a column in the data frame.
- Args:
col (int): the index of the column. name (str): the new name.
-
set_string
(self, size_t row, size_t col, string val)¶ Sets the value of a given cell to a new string value.
- Args:
row (int): the row index of the cell. col (int): the column index of the cell. val (str): the new string value.
-
to_pandas
(self)¶ Exports the DataFrame to a Pandas DataFrame object.
- Returns:
- The constructed Pandas DataFrame object.
-
class
datastories.data.
ColumnType
¶ Possible column types for datastories.data.DataFrame.
-
DATE
= 3¶
-
INTEGER
= 2¶
-
MIXED
= 10¶
-
NUMERIC
= 1¶
-
STRING
= 4¶
-
UNKNOWN
= 0¶
-
-
class
datastories.data.
DataType
¶ Possible cell value types for datastories.data.DataFrame.
-
DATE
= 3¶
-
INTEGER
= 2¶
-
NUMERIC
= 1¶
-
STRING
= 4¶
-
UNKOWN
= 0¶
-
-
class
datastories.data.
RangeType
¶ Possible value range types for datastories.data.DataFrame.
-
CATEGORICAL
= 3¶
-
INTERVAL
= 1¶
-
ORDINAL
= 2¶
-
UNSPECIFIED
= 0¶
-
-
datastories.data.
prepare_data_frame
(data_frame, progress_bar=False)¶ Prepares a pandas.core.frame.DataFrame object compatible with the DataStories clean-up and type conversion rules.
Pandas DataFrames obtained from external sources are often inconsistent and need to be cleaned up to make them usable for analysis. The clean-up process transforms the data frame, for example by enforcing type conversions and discarding unusable values. DataStories analyses perform the clean-up operation automatically. However, there may be scenarios in which the data must be cleaned up before it is passed to a DataStories analysis (e.g., a custom feature-engineering stage).
This function can be used to obtain a Pandas DataFrame object that is cleaned up according to the DataStories rules and conventions.
- Args:
data_frame (obj): the data frame object to convert (either a pandas.core.frame.DataFrame or a datastories.data.DataFrame object).
progress_bar (obj|bool=False): an object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
- Returns:
- The constructed Pandas DataFrame object.
Summary Calculation
-
class
datastories.data.
DataSummaryResult
(stats)¶ Encapsulates the result of the datastories.data.compute_summary() analysis.
Note: Objects of this class should not be manually constructed.
- Attributes:
stats (obj): an object of type datastories.data.TableStatistics wrapping up summary statistics.
vis_settings (obj): an object of type datastories.visualization.DataSummaryTableSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Displays a graphical representation of the data summary analysis results.
Accepts the same parameters as the constructor for
datastories.visualization.DataSummaryTableSettings
-
select
(cols)¶ Selects a set of columns for further reference.
-
selected
¶ Retrieves the list of selected columns.
-
to_csv
(file_name)¶ Exports the list of ranking scores to a
CSV
file.- Args:
file_name (str): name of the file to export to.
-
to_excel
(file_name)¶ Exports the list of ranking scores to an
Excel
file.- Args:
file_name (str): name of the file to export to.
-
to_html
(file_name, title='Data Summary', subtitle='')¶ Exports the data summary visualization to a standalone
HTML
document.- Args:
file_name (str): name of the file to export to; title (str=’Data Summary’): HTML document title; subtitle (str=’’): HTML document subtitle.
-
to_pandas
()¶ Exports the detailed (column-level) data summary to a Pandas DataFrame.
- Returns:
- The constructed Pandas DataFrame object.
-
class
datastories.data.
TableStatistics
(name=None, rows=None, columns=None, n=None, n_missing=None, p_missing=None, health=None, health_score=0, df=None, converters=None, n_rows=None, n_columns=None, version={'core': '1.4.0'})¶ Statistics and data health reports for a given data frame.
Note: Objects of this class should not be manually constructed.
- Attributes:
n_rows (int): Number of rows.
n_columns (int): Number of columns.
n (int): Number of values.
n_missing (int): Number of missing values.
p_missing (float): Percentage of missing values.
health_score (float): Health score: 0 (good) - 100 (bad).
health (float): General health value for the data frame (unusable:0, fixable:0.5, great:1).
columns (list): List of objects of type datastories.data.ColumnStatistics wrapping up detailed column-level statistics.
-
class
datastories.data.
ColumnStatistics
(col=None, id=None, converter=None, label=None, column_type=None, element_type=None, n=None, n_valid=None, n_missing=None, p_missing=None, n_unique=None, min=None, max=None, mean=None, median=None, most_freq=None, first_quartile=None, third_quartile=None, histo_labels=None, histo_counts=None, balance_score=None, balance_health=None, missing_health=None, left_outlier_score=None, right_outlier_score=None, outlier_score=None, left_outlier_health=None, right_outlier_health=None, outlier_health=None, health=None, missing_thr=None, balance_thr=None, outlier_thr=None, bincount=10)¶ Statistics and data health reports for a given column in a data frame.
Note: Objects of this class should not be manually constructed.
- Attributes:
n_rows (int): Number of rows.
id (int): The index of the column.
label (str): The label (header values) of the column.
n (int): The length of the column.
n_valid (int): The number of correctly parsed data items.
n_missing (int): The number of unreadable data items.
p_missing (float): Percentage of unreadable data items.
column_type (str): Type of the column (ordinal, interval, binary, …).
element_type (str): Type of individual data items (float, string, …).
n_unique (int): Number of unique values.
min (float): Minimum value.
max (float): Maximum value.
mean (float): Mean value.
median (float): Median value.
first_quartile (float): First quartile (data point under which 25% of data is situated).
third_quartile (float): Third quartile (data point under which 75% of data is situated).
histo_labels (list): Labels for the histogram bins.
histo_counts (list): Counts for the histogram bins.
balance_score (float): Score for how balanced the data is, 0 (good) - 100 (bad).
balance_health (float): Health value in terms of balance (unusable:0, fixable:0.5, great:1).
missing_health (float): Health value in terms of the number of missing items (unusable, …).
left_outlier_score (float): Metric for outlier impact on the left (i.e., small) side of the data range. Scale: 0 (no outliers detected) - 100 (bad).
right_outlier_score (float): Metric for outlier impact on the right (i.e., big) side of the data range. Scale: 0 (no outliers detected) - 100 (bad).
outlier_score (float): Metric for the general outlier impact of the data. Scale: 0 (no outlier impact whatsoever) - 100 (bad).
left_outlier_health (float): Health value for left outlier impact (unusable:0, fixable:0.5, great:1).
right_outlier_health (float): Health value for right outlier impact (unusable:0, fixable:0.5, great:1).
outlier_health (float): Health value for outlier impact (unusable:0, fixable:0.5, great:1).
health (float): General health value for this column (unusable:0, fixable:0.5, great:1).
-
calc_basic_stats
()¶ Generates the basic statistics for the column and sets the corresponding attributes.
-
datastories.data.
compute_summary
(data_frame, col_types=None, sample_size=None, progress_bar=False)¶ Computes a data summary on a provided data frame.
- Args:
data_frame (obj): the input data frame.
col_types (list=None): list of column types to use for extracting statistics. If not provided, the types are inferred from a sample of the data, based on the most frequent value type in each column.
sample_size (int=100|str=’10%’): the sample size to use for inferring data types (either an absolute integer value or a percentage).
progress_bar (obj|bool=False): an object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
- Returns:
- An object of type datastories.data.DataSummaryResult wrapping up the summary report.
Example:
    from datastories.data import compute_summary
    import pandas as pd

    # Load the data and compute the summary statistics.
    df = pd.read_csv('example.csv')
    summary = compute_summary(df)
    print(summary)
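The sample_size argument accepts either an absolute row count or a percentage string. As a rough illustration of how such a value could be turned into a number of rows, here is a plain-Python sketch. The helper `resolve_sample_size`, its rounding, and its clamping to the number of rows are assumptions made for this example; the reference does not describe how the library resolves the value internally.

```python
def resolve_sample_size(sample_size, n_rows):
    """Translate an int or a percentage string such as '10%' into a row count."""
    if isinstance(sample_size, str):
        if not sample_size.endswith("%"):
            raise ValueError("string sample sizes must end with '%'")
        fraction = float(sample_size[:-1]) / 100.0
        size = int(round(fraction * n_rows))
    else:
        size = int(sample_size)
    # Assumption: never sample fewer than 0 rows or more rows than exist.
    return max(0, min(size, n_rows))


print(resolve_sample_size(100, 5000))    # absolute number of rows
print(resolve_sample_size("10%", 5000))  # percentage of the rows
```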
Outlier Detection
-
class
datastories.data.
OutlierResult
(input, outliers)¶ Encapsulates the result of the datastories.data.compute_outliers() analysis.
- Attributes:
valid (bool): a flag indicating whether the result is valid.
- Raises:
AssertionError: when calling methods of an invalid result.
Note: Objects of this class should not be manually constructed.
-
as_index
(self, outlier_types=[OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW])¶ A numpy index vector that can be used to select and retrieve outlier values.
The index can be applied on numpy arrays or pandas.core.series.Series objects.
- Args:
outlier_types (list): list of datastories.data.OutlierType values to specify which outliers to retrieve. By default, all outliers are included (i.e., outlier_types = [OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW]).
-
as_itemgetter
(self, outlier_types=[OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW])¶ An operator.itemgetter object that can be used to select and retrieve outlier values from a list.
- Args:
outlier_types (list): list of datastories.data.OutlierType values to specify which outliers to retrieve. By default, all outliers are included (i.e., outlier_types = [OutlierType.FAR_OUTLIER_HIGH, OutlierType.FAR_OUTLIER_LOW, OutlierType.OUTLIER_HIGH, OutlierType.OUTLIER_LOW]).
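For readers unfamiliar with operator.itemgetter, the sketch below shows how such an object selects items from a list. It uses only the Python standard library; the values and the flagged positions are made up for illustration and do not come from a DataStories analysis.

```python
from operator import itemgetter

values = [10.2, 9.8, 55.0, 10.1, -40.0, 10.0]

# Hypothetical positions of the points flagged as outliers.
outlier_positions = [2, 4]

# itemgetter built from several positions returns a tuple of the selected items.
getter = itemgetter(*outlier_positions)
outlier_values = getter(values)
print(outlier_values)
```

Note that an itemgetter built from a single position returns the bare item rather than a one-element tuple, so code that iterates over the result should handle that case.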
-
clip_to_iqr
(self, low_threshold=0.05, high_threshold=0.95)¶ Marks as outliers the values that fall outside a specific inter-quartile range.
This operation can be undone via the reset method.
- Args:
low_threshold (float=0.05): the lower bound of the inter-quartile range. Should be in the interval [0,1].
high_threshold (float=0.95): the higher bound of the inter-quartile range. Should be in the interval [0,1].
- Raises:
ValueError: when the input arguments are not valid.
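To make the effect of the two thresholds concrete, the following plain-Python sketch flags the values that fall below the low quantile or above the high quantile of a list. It assumes the thresholds act as quantile levels and uses linear interpolation between data points; both choices, as well as the requirement that the low threshold does not exceed the high one, are assumptions of this example rather than documented behaviour of clip_to_iqr. The function names are hypothetical.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted, non-empty list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def flag_outside_quantile_range(values, low_threshold=0.05, high_threshold=0.95):
    """Return one boolean per value: True when it lies outside the range."""
    if not 0.0 <= low_threshold <= high_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if not values:
        return []
    ordered = sorted(values)
    low = quantile(ordered, low_threshold)
    high = quantile(ordered, high_threshold)
    return [v < low or v > high for v in values]


flags = flag_outside_quantile_range([1, 2, 3, 4, 100], 0.25, 0.75)
print(flags)
```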
-
metrics
¶ A dictionary containing outlier detection metrics.
- The following metrics are retrieved:
Outliers: total number of outliers
Outliers Low: number of lower outliers
Outliers High: number of higher outliers
Close Outliers: number of close outliers
Close Outliers Low: number of lower close outliers
Close Outliers High: number of higher close outliers
Far Outliers: number of far outliers
Far Outliers Low: number of lower far outliers
Far Outliers High: number of higher far outliers
NaN: number of NaN values
Normal: number of values that are neither outliers nor NaN
-
plot
(self, *args, **kwargs)¶ Displays a graphical representation of the outlier analysis results.
Accepts the same parameters as the constructor for
datastories.visualization.OutlierPlotSettings
objects.
-
reset
(self)¶ Resets outliers to original values, as computed by the
datastories.data.compute_outliers()
analysis.
-
to_csv
(self, file_name, content=u'metrics')¶ Exports a list of detected outliers or metrics to a CSV file.
- Args:
file_name (str): name of the file to export to.
content (str=metrics): the type of content to export. Possible values:
- metrics - exports outlier detection metrics;
- outliers - exports point-wise outlier classification.
- Raises:
ValueError: when an invalid value is provided for the content parameter.
-
to_excel
(self, file_name)¶ Exports the list of detected outliers and metrics to an
Excel
file.- Args:
file_name (str): name of the file to export to.
-
to_html
(self, file_name, title=u'Outliers', subtitle=u'')¶ Exports the outliers visualization to a standalone
HTML
document.- Args:
file_name (str): name of the file to export to; title (str=’Outliers’): HTML document title; subtitle (str=’’): HTML document subtitle.
-
to_pandas
(self, content=u'metrics')¶ Exports a list of detected outliers or metrics to a pandas.core.series.Series object.
- Args:
content (str=metrics): the type of content to export. Possible values:
- metrics - exports outlier detection metrics;
- outliers - exports point-wise outlier classification.
- Returns:
- The constructed pandas.core.series.Series object.
- Raises:
ValueError: when an invalid value is provided for the content parameter.
-
update
(self, updates)¶ Updates the list of detected outliers with manual corrections.
-
updated
¶ A list of manual corrections applied to the detected outliers.
-
vis_settings
¶ An object of type
datastories.visualization.OutlierPlotSettings
encapsulating the outlier visualization settings
-
datastories.data.
compute_outliers
(input, ref=None, double strictness=0.25, outlier_vote_threshold=None, far_outlier_vote_threshold=None)¶ Identifies numeric outliers in a 1D or 2D space.
This function can be used either with the strictness parameter only (i.e., by leaving the last two parameters at their defaults, so that they are computed as a function of the strictness) or manually, by setting the last two parameters, in which case the strictness is ignored.
- Args:
input (list|obj|ndarray): numeric input vector; can be either a list, a pandas.core.series.Series object or a numpy numeric array.
ref (list|obj|ndarray=None): abscissa vector for the 2D case; can be either a list, a pandas.core.series.Series object or a numpy numeric array.
strictness (double=0.25): determines how strictly the algorithm selects outliers; higher values yield fewer outliers. Value range is [0-1].
outlier_vote_threshold (double=None): determines when a point is considered an outlier; higher values yield fewer outliers. Value range is [0-100]. When left unspecified, it is set to 100 * strictness.
far_outlier_vote_threshold (double=None): determines when a point is considered a far outlier; higher values yield fewer outliers. This must be larger than outlier_vote_threshold. Default is outlier_vote_threshold + 50. Value range is [0-100].
- Returns:
- An object of type datastories.data.OutlierResult wrapping up the computed outliers.
Example:
    from datastories.data import compute_outliers
    import pandas as pd

    # Load the data and detect outliers in a single numeric column.
    df = pd.read_csv('example.csv')
    outliers = compute_outliers(df['my_column'])
    print(outliers)
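The documented defaults for the two vote thresholds can be written out as a small helper. This is only a sketch of the default rules stated above (100 * strictness, and outlier_vote_threshold + 50); the function name `resolve_vote_thresholds` is hypothetical, and the sketch does not validate ranges or reproduce any other internal behaviour of compute_outliers.

```python
def resolve_vote_thresholds(strictness=0.25,
                            outlier_vote_threshold=None,
                            far_outlier_vote_threshold=None):
    """Fill in unspecified vote thresholds from the documented default rules."""
    if outlier_vote_threshold is None:
        # Documented default: 100 * strictness.
        outlier_vote_threshold = 100.0 * strictness
    if far_outlier_vote_threshold is None:
        # Documented default: outlier_vote_threshold + 50.
        far_outlier_vote_threshold = outlier_vote_threshold + 50.0
    return outlier_vote_threshold, far_outlier_vote_threshold


print(resolve_vote_thresholds())                           # defaults only
print(resolve_vote_thresholds(outlier_vote_threshold=40))  # manual override
```

With strictness above 0.5 the derived far-outlier threshold exceeds the documented [0-100] range; the reference does not say how the library treats that case, so the sketch leaves the value unchanged.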
Classification
The datastories.classification package contains a collection of classes and functions to facilitate classification analysis.
Feature Ranking
-
datastories.classification.
rank_features
(data_set, kpi, metric=FeatureRankingMetric.Accuracy) → FeatureRankResult¶ Computes the relative importance of columns in a dataframe for predicting a binary KPI.
The scoring is based on maximizing the prediction accuracy with respect to the KPI while iteratively splitting the dataframe rows.
- Args:
data_set (obj): a DataStories or a Pandas DataFrame object.
kpi (int|str): the index or the name of the KPI column.
metric (enum = FeatureRankingMetric.Accuracy): an object of type FeatureRankingMetric specifying the metric type used to rank the features. Possible values: FeatureRankingMetric.Accuracy.
- Returns:
- An object of type datastories.classification.FeatureRankResult wrapping up the computed scores.
- Raises:
TypeError: if data_set is not a DataFrame or a Pandas DataFrame object.
ValueError: if kpi is not a valid column name or index value (e.g., out-of-range index).
Example:
    from datastories.classification import rank_features
    import pandas as pd

    # Load the data and rank its columns against the KPI column.
    df = pd.read_csv('example.csv')
    kpi_column_index = 1
    ranks = rank_features(df, kpi_column_index)
    print(ranks)
-
class
datastories.classification.
FeatureRankResult
(c_split_list, data_frame, kpi, metric)¶ Bases:
datastories.api.interface.IAnalysisResult
Encapsulates the result of the datastories.classification.rank_features() analysis.
Note: Objects of this class should not be manually constructed.
-
feature_ranks
¶ Retrieves the feature ranks computed by the datastories.classification.rank_features() analysis.
- Returns:
- (list): a list of datastories.classification.RankingSplit objects.
-
get_feature_ranks
(self)¶ Retrieves the feature ranks computed by the datastories.classification.rank_features() analysis.
- Returns:
- (list): a list of datastories.classification.RankingSplit objects.
-
metric_map
= {<FeatureRankingMetric.Accuracy: 0>: 'Accuracy'}¶
-
plot
(self, *args, **kwargs)¶ Displays a graphical representation of the rank features analysis results.
Accepts the same parameters as the constructor for
datastories.visualization.FeatureRanksTableSettings
objects.
-
select
(self, cols)¶ Selects a number of column names as features.
-
selected
¶ The list of column names currently selected as features.
-
to_csv
(self, file_name, delimiter=u', ', decimal=u'.')¶ Exports the list of ranking scores to a
CSV
file.- Args:
file_name (str): name of the file to export to. delimiter (str=’,’): CSV delimiter decimal (str=’.’): CSV decimal point
-
to_excel
(self, file_name)¶ Exports the list of ranking scores to an
Excel
file.- Args:
file_name (str): name of the file to export to.
-
to_html
(self, file_name, title=u'Feature Ranks', subtitle=u'')¶ Exports the feature ranks visualization to a standalone
HTML
document.- Args:
file_name (str): name of the file to export to; title (str=’Feature Ranks’): HTML document title; subtitle (str=’’): HTML document subtitle.
-
to_pandas
(self, ranking_column=u'Score', min_threshold=0.0)¶ Exports the list of ranking scores to a Pandas DataFrame object.
- Args:
ranking_column (str=Score): the column used to compute the rank and order the data frame. This can be useful to discover interesting variables that are penalised because they have a lot of missing values.
min_threshold (float): a cutoff threshold for the minimum score that a variable should have in order to be exported.
- Returns:
- The constructed Pandas DataFrame object.
-
vis_settings
¶ An object of type datastories.visualization.FeatureRanksTableSettings encapsulating the feature ranks visualization settings.
-
Correlation
The datastories.correlation package contains a collection of classes and functions to facilitate correlation analysis.
Prototype Detection
-
datastories.correlation.
compute_prototypes
(data_set, kpi, double prototype_threshold: float = 0.85, fast_approximation: bool = True, double missing_value_threshold: float = 0.5) → PrototypeResult¶ Identifies a set of mutually uncorrelated variables from a data frame.
Correlation estimation is based on the Mutual Information Content measure.
Each variable in the set has the following properties:
- it is not significantly correlated to any other variable in the set;
- it can be highly correlated to other variables that are not included in the set;
- it has a higher KPI correlation score than all the other variables that are highly correlated to it.
Each variable that is not included in the set is highly correlated to a variable in the set.
- Args:
data_set (obj): a DataStories or a Pandas dataframe.
kpi (int|str): the index or the name of the KPI column.
prototype_threshold (float = 0.85): correlation threshold for features to be considered proxies.
fast_approximation (bool = True): approximate the mutual information; this provides a significant speedup with little precision loss.
missing_value_threshold (float = 0.5): missing values threshold for excluding features from prototypes.
- Returns:
- An object of type datastories.correlation.PrototypeResult wrapping up the computed prototypes.
- Raises:
TypeError: if data_set is not a DataFrame or a Pandas DataFrame object.
ValueError: if kpi is not a valid column name or index value (e.g., out-of-range index).
Example:
    from datastories.correlation import compute_prototypes
    import pandas as pd

    # Load the data and compute prototypes with respect to the KPI column.
    df = pd.read_csv('example.csv')
    kpi_column_index = 1
    prototypes = compute_prototypes(df, kpi_column_index)
    print(prototypes)
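The relationship between prototypes and their proxies can be illustrated with a simple greedy selection over precomputed scores. This sketch is not the algorithm used by compute_prototypes: the function `select_prototypes`, the scores, and the pairwise correlations are all invented for illustration, and a greedy pass of this kind does not guarantee every property listed above in all edge cases.

```python
def select_prototypes(kpi_scores, correlation, prototype_threshold=0.85):
    """Greedily map each chosen prototype to the list of its proxies.

    Variables are visited by decreasing KPI score; a variable becomes a proxy
    of the first prototype it is highly correlated with, else a new prototype.
    """
    prototypes = {}
    for name in sorted(kpi_scores, key=kpi_scores.get, reverse=True):
        for proto in prototypes:
            if correlation(name, proto) >= prototype_threshold:
                prototypes[proto].append(name)
                break
        else:
            prototypes[name] = []
    return prototypes


# Made-up KPI correlation scores and pairwise correlations.
kpi_scores = {"temp_c": 0.90, "temp_f": 0.88, "pressure": 0.60, "humidity": 0.30}
pairwise = {
    frozenset(("temp_c", "temp_f")): 0.99,
    frozenset(("temp_c", "pressure")): 0.20,
}


def correlation(a, b):
    # Pairs that are not listed are treated as uncorrelated.
    return pairwise.get(frozenset((a, b)), 0.0)


result = select_prototypes(kpi_scores, correlation)
print(result)
```

In this toy data, temp_f duplicates temp_c, so it ends up as a proxy of temp_c rather than as a prototype of its own.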
-
class
datastories.correlation.
PrototypeResult
(c_prototype_list)¶ Bases:
datastories.api.interface.IAnalysisResult
Encapsulates the result of the datastories.correlation.compute_prototypes() analysis.
Note: Objects of this class should not be manually constructed.
-
get_prototypes
(self)¶ Retrieves the list of prototypes computed by the datastories.correlation.compute_prototypes() analysis.
- Returns:
- (list): a list of datastories.correlation.Prototype objects.
-
plot
(self, *args, **kwargs)¶ Displays a graphical representation of the prototype analysis results.
Accepts the same parameters as the constructor for
datastories.visualization.PrototypeTableSettings
objects.
-
prototypes
¶ Retrieves the list of column names currently selected as prototypes.
- Returns:
- (list): a list of column names.
-
select
(self, cols)¶ Selects a number of column names as prototypes.
-
selected
¶ Retrieves the list of column names currently selected as prototypes.
- Returns:
- (list): a list of column names.
-
to_csv
(self, file_name, delimiter=u', ', decimal=u'.')¶ Exports the list of prototypes to a
CSV
file.- Args:
file_name (str): name of the file to export to. delimiter (str=’,’): CSV delimiter decimal (str=’.’): CSV decimal point
-
to_excel
(self, file_name)¶ Exports the list of prototypes to an Excel file.
- Args:
file_name (str): name of the file to export to.
-
to_html
(self, file_name, title=u'Prototypes', subtitle=u'')¶ Exports the prototypes visualization to a standalone HTML document.
- Args:
file_name (str): name of the file to export to;
title (str=’Prototypes’): HTML document title;
subtitle (str=’’): HTML document subtitle.
-
to_pandas
(self)¶ Exports the list of prototypes to a Pandas DataFrame object.
- Returns:
- The constructed Pandas DataFrame object.
-
vis_settings
¶ An object of type datastories.visualization.PrototypeTableSettings encapsulating the prototype visualization settings.
-
-
class
datastories.correlation.
Prototype
(c_prototype)¶ Encapsulates prototype information data.
- Attributes:
info (obj): an object of type datastories.correlation.CorrelationInfo describing the correlation of the prototype with respect to the KPI.
proxy_list (list): a list of datastories.correlation.CorrelationInfo objects corresponding to variables that are highly correlated with the prototype.
-
class
datastories.correlation.
CorrelationInfo
(c_correlation_info)¶ Encapsulates correlation information for a variable with respect to a reference.
- Attributes:
col_index (int): the index of the variable in the input data frame. col_name (str): the name of the variable. correlation (float): the correlation score with respect to the reference.
Model
The datastories.model package contains a collection of classes that encapsulate data models (e.g., prediction models computed by regression or classification analysis).
-
class
datastories.model.
Model
¶ A DataStories model.
-
evaluate
(self, data_frame)¶ Evaluate the model on an input data frame.
Args:
- data_frame (obj): the input data frame (either a datastories.data.DataFrame or a Pandas DataFrame object). This has to include the input variables for the model.
Returns:
A data frame including the evaluated output variables of the model. Can be either a datastories.data.DataFrame or a Pandas DataFrame object, depending on the provided input.
-
inputs
¶ A list of input model variable names
-
outputs
¶ A list of output model variable names
-
plot
(self, *args, **kwargs)¶ Displays a graphical representation of the prediction model.
Accepts the same parameters as the constructor for
datastories.visualization.WhatIfsSettings
-
save
(self, file_name=None)¶ Serialize the model to a file or a bytes object.
Args:
- file_name (str = None): name of the output file. If omitted, the model is serialized to a bytes object that is returned by the function.
Returns:
A bytes object containing the model when the file_name argument is omitted or set to None.
-
to_cpp
(self, file_name)¶ Export the model to a C++ file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
to_excel
(self, file_name)¶ Export the model to an Excel file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
to_matlab
(self, file_name)¶ Export the model to a MATLAB file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
to_py
(self, file_name)¶ Export the model to a Python file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
to_r
(self, file_name)¶ Export the model to an R file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
variables
¶ A dictionary mapping model variables to corresponding information such as variable type and range.
Returns:
Dictionary string ->datastories.model.VariableInfo
.
-
-
class
datastories.model.
VariableInfo
¶ Holds information about a model variable, such as ranges and types.
-
categories
¶ Get the registered categories of the associated variable (i.e., if the variable is categorical).
-
index
¶ Get the index of the associated variable.
-
is_input
¶ Check if the associated variable is an input for the model.
-
max
¶ Get the maximum value of the associated variable.
-
min
¶ Get the minimum value of the associated variable.
-
range_type
¶ Get the range type of the associated variable.
-
type
¶ Get the type of the associated variable.
-
-
class
datastories.model.
SingleKpiPredictor
(kpi_name, column_names, prediction_type, prediction_performance, *args, **kwargs)¶ Bases: datastories.api.interface.IPredictiveModel, datastories.core.utils.object_.StorageBackedObject
Encapsulates prediction models (e.g., computed using datastories.story.predict_single_kpi()).
Note: Objects of this class should not be manually constructed.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.PredictedVsActualSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
error_plot
¶ A visualization for assessing model prediction errors, as discovered while training the model.
- Returns:
- In case of a regression model: an object of type datastories.visualization.ErrorPlot.
- In case of a binary classification model: an object of type datastories.visualization.ClassificationPlot.
-
maximize
(progress_bar=True)¶ Compute the input combination that maximizes the predictive model output.
Args:
- progress_bar (obj|bool=True): An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
A datastories.optimization.OptimizationResult object encapsulating the model variable values that maximize the model outputs.
-
metrics
¶ A dictionary containing model prediction performance metrics.
The type of metrics depends on the model type (i.e., regression or classification).
- For regression models the metrics include:
Correlation: actual vs predicted correlation
Estimated Correlation: estimated correlation for future (unseen) values
R-squared: the coefficient of determination
MSE: mean squared error
RMSE: root mean squared error
- For binary classification models the metrics include:
Positive Label: the label used to identify positive cases
Negative Label: the label used to identify negative cases
True Positives: number of correctly identified positive cases (TP)
False Positives: number of incorrectly identified positive cases (FP)
True Negatives: number of correctly identified negative cases (TN)
False Negatives: number of incorrectly identified negative cases (FN)
Not Classified: number of records that could not be classified (i.e., KPI is NaN)
True Positive Rate: TP / (TP + FN) * 100 (a.k.a. sensitivity, recall)
False Positive Rate: FP / (FP + TN) * 100 (a.k.a. fall-out)
True Negative Rate: TN / (FP + TN) * 100 (a.k.a. specificity)
False Negative Rate: FN / (TP + FN) * 100 (a.k.a. miss rate)
Precision: percentage of correctly identified cases from the total reported positive cases, TP / (TP + FP) * 100
Recall: percentage of correctly identified cases from the total existing positive cases, TP / (TP + FN) * 100
Accuracy: percentage of correctly identified cases, (TP + TN) / (TP + FP + TN + FN) * 100
F1 Score: the F1 score (the harmonic mean of precision and recall)
AUC: area under the ROC curve
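The rate formulas listed above can be reproduced from the four confusion counts. The following is a minimal sketch, not the library's implementation: the function name classification_metrics is hypothetical, and reporting the F1 score on the same 0-100 scale as precision and recall is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the percentage metrics from confusion-matrix counts."""
    def pct(num, den):
        # Guard against empty denominators (e.g., no positive cases at all).
        return 100.0 * num / den if den else float("nan")

    precision = pct(tp, tp + fp)
    recall = pct(tp, tp + fn)
    # Harmonic mean of precision and recall (assumed to stay on a 0-100 scale).
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else float("nan"))
    return {
        "True Positive Rate": recall,
        "False Positive Rate": pct(fp, fp + tn),
        "True Negative Rate": pct(tn, fp + tn),
        "False Negative Rate": pct(fn, tp + fn),
        "Precision": precision,
        "Recall": recall,
        "Accuracy": pct(tp + tn, tp + fp + tn + fn),
        "F1 Score": f1,
    }


m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that the true positive rate and the false negative rate share a denominator and therefore always add up to 100; the same holds for the false positive rate and the true negative rate.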
-
minimize
(progress_bar=True)¶ Compute the input combination that minimizes the predictive model output.
Args:
- progress_bar (obj|bool=True): An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
A datastories.optimization.OptimizationResult object encapsulating the model variable values that minimize the model outputs.
-
optimize
(optimization_spec=<datastories.optimization.specification.OptimizationSpecification object>, progress_bar=True)¶ Compute an optimum input/output combination according to an (optional) optimization specification.
Args:
- optimization_spec (obj = OptimizationSpecification()): A datastories.optimization.OptimizationSpecification object encapsulating the optional optimization specification.
- progress_bar (obj|bool=True): An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
A datastories.optimization.OptimizationResult object encapsulating the model variable values that satisfy the optimization specification.
-
plot
(*args, **kwargs)¶ Displays a graphical representation of the prediction model.
Accepts the same parameters as the constructor for datastories.visualization.ConfusionMatrixSettings (for classification models) or datastories.visualization.PredictedVsActualSettings (for regression models).
-
predict
(data_frame)¶ Predict the model KPI on a new data frame.
- Args:
data_frame (obj): the data frame on which the model's associated KPI is to be predicted.
- Returns:
An object of type datastories.regression.PredictionResult encapsulating the prediction results.
- Raises:
ValueError: if not all required columns are provided.
Note: All columns present in the training data frame are required for making predictions, even if they are not significant for the prediction.
-
rebuild
(score_threshold=None)¶ Rebuilds the prediction model using custom settings.
- Args:
score_threshold (float=None): the decision threshold for binary KPI models. If missing, the optimal decision threshold will be determined automatically.
Note: For the changes to be permanent (i.e., survive story reloads), the associated story has to be saved after executing this method. To save a story, use datastories.story.PredictSingleKpiStory.save().
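The role of a decision threshold in a binary model can be illustrated as follows. This is a generic sketch, not the library's code: the function classify, the 0-1 score scale, the default of 0.5, and the inclusive comparison are all assumptions made for the example.

```python
def classify(scores, score_threshold=0.5, positive="yes", negative="no"):
    """Label a score as positive when it reaches the decision threshold."""
    return [positive if s >= score_threshold else negative for s in scores]


scores = [0.10, 0.45, 0.55, 0.90]
default_labels = classify(scores)                       # threshold 0.5
strict_labels = classify(scores, score_threshold=0.8)   # fewer positives
```

Raising the threshold trades false positives for false negatives, which is why a rebuilt model reports different classification metrics.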
-
to_cpp
(file_name)¶ Export the model to a C++ file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
to_excel
(file_name)¶ Export the model to an Excel file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
to_html
(file_name, title='Predictive Model', subtitle='Predicted vs Actual')¶ Exports a visual representation of the prediction model to a standalone HTML document.
- Args:
file_name (str): name of the file to export to;
title (str='Predictive Model'): HTML document title;
subtitle (str='Predicted vs Actual'): HTML document subtitle.
-
to_matlab
(file_name)¶ Export the model to a MATLAB file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
to_py
(file_name)¶ Export the model to a Python file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
to_r
(file_name)¶ Export the model to an R file.
Args:
- file_name (str): name of the file to export to.
Raises:
- datastories.api.DatastoriesError: if there is a problem saving the file.
-
class
datastories.model.
SingleKpiPrediction
(prediction_type, prediction, data, kpi_name, is_test=False)¶ Bases:
datastories.api.interface.IPrediction
Encapsulates the results of a prediction done using a datastories.model.SingleKpiPredictor object.
Note: Objects of this class should not be manually constructed.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.PredictedVsActualSettings containing visualization settings. Set this object before displaying the visualization.
-
error_plot
¶ A visualization for assessing model prediction errors.
- Returns:
- In case of a regression model: an object of type datastories.visualization.ErrorPlot.
- In case of a binary classification model: an object of type datastories.visualization.ClassificationPlot.
-
metrics
¶ A dictionary containing prediction performance metrics.
These metrics are computed when the data frame used for prediction includes KPI values, for the purpose of evaluating the model prediction performance.
- The following metrics are retrieved:
Number of Records: number of records submitted for prediction
Correlation: actual vs predicted correlation
R-squared: the coefficient of determination
MSE: mean squared error
RMSE: root mean squared error
- In case the KPI is a binary variable, the following additional metrics are included:
Positive Label: the label used to identify positive cases
Negative Label: the label used to identify negative cases
True Positives: number of correctly identified positive cases (TP)
False Positives: number of incorrectly identified positive cases (FP)
True Negatives: number of correctly identified negative cases (TN)
False Negatives: number of incorrectly identified negative cases (FN)
Not Classified: number of records that could not be classified (i.e., KPI is NaN)
True Positive Rate: TP / (TP + FN) * 100 (a.k.a. sensitivity, recall)
False Positive Rate: FP / (FP + TN) * 100 (a.k.a. fall-out)
True Negative Rate: TN / (FP + TN) * 100 (a.k.a. specificity)
False Negative Rate: FN / (TP + FN) * 100 (a.k.a. miss rate)
Precision: percentage of correctly identified cases from the total reported positive cases, TP / (TP + FP) * 100
Recall: percentage of correctly identified cases from the total existing positive cases, TP / (TP + FN) * 100
Accuracy: percentage of correctly identified cases, (TP + TN) / (TP + FP + TN + FN) * 100
F1 Score: the F1 score (the harmonic mean of precision and recall)
AUC: area under the ROC curve
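The regression-style metrics listed above (Number of Records, Correlation, R-squared, MSE, RMSE) can be computed from paired actual and predicted values. The sketch below uses the textbook definitions; the function name regression_metrics is hypothetical, and the use of the Pearson correlation is an assumption, since the reference does not state which correlation is reported.

```python
import math


def regression_metrics(actual, predicted):
    """Compute textbook regression metrics for paired value lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    mse = ss_res / n
    return {
        "Number of Records": n,
        "Correlation": cov / math.sqrt(ss_tot * var_p),   # Pearson (assumed)
        "R-squared": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

These values can only be computed when the data frame submitted for prediction also contains the actual KPI column.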
-
plot
(*args, **kwargs)¶ Displays a graphical representation of the prediction performance.
Accepts the same parameters as the constructor for datastories.visualization.ConfusionMatrixSettings (for classification-based predictions) or datastories.visualization.PredictedVsActualSettings (for regression-based predictions).
-
to_csv
(file_name, keep_metrics=True, delimiter=', ', decimal='.')¶ Exports the list of predictions to a CSV file.
- Args:
file_name (str): name of the file to export to.
keep_metrics (bool=True): True if prediction metrics should be included as additional columns.
delimiter (str=','): CSV delimiter.
decimal (str='.'): CSV decimal point.
-
to_excel
(file_name, keep_metrics=True)¶ Exports the list of predictions to an Excel file.
- Args:
file_name (str): name of the file to export to.
keep_metrics (bool=True): True if prediction metrics should be included as additional columns.
-
to_html
(file_name, title='Prediction Performance', subtitle='Predicted vs Actual')¶ Exports a visual representation of the prediction performance to a standalone HTML document.
- Args:
file_name (str): name of the file to export to;
title (str='Prediction Performance'): HTML document title;
subtitle (str='Predicted vs Actual'): HTML document subtitle.
-
to_pandas
(keep_metrics=True)¶ Exports the list of predictions to a pandas.core.frame.DataFrame object.
- Args:
keep_metrics (bool=True): True if prediction metrics should be included as additional columns.
- Returns:
The constructed pandas.core.frame.DataFrame object.
Optimization¶
The datastories.optimization package contains a collection of classes and functions for optimizing models.
-
datastories.optimization.
create_optimizer
(*args, **kwargs)¶ Factory method for creating optimizers.
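Returns:
An object of type datastories.optimization.pso.Optimizer that can be used to perform optimization analyses on a datastories.model.Model object.
Example:
    from datastories.model import Model
    from datastories.optimization import (
        AtMost, Maximize, Minimize, OptimizationSpecification, create_optimizer
    )

    model = Model("my_model.rsx")
    spec = OptimizationSpecification()
    # Objectives: push KPI_1 down and KPI_2 up.
    spec.objectives = [
        Minimize('KPI_1'),
        Maximize('KPI_2')
    ]
    # Constraint: Input_1 must not exceed 10.
    spec.constraints = [
        AtMost('Input_1', 10),
    ]
    optimizer = create_optimizer()
    optimization_result = optimizer.optimize(model, optimization_spec=spec)
    print(optimization_result.optimum)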
-
class
datastories.optimization.pso.
Optimizer
(size_t population_size=500, size_t iterations=250)¶ A model optimizer using the particle swarm strategy for identifying an optimum solution.
Args:
- population_size (int = 500): the initial size of the swarm population.
- iterations (int = 250): number of swarm computation iterations before stopping.
-
maximize
(self, model, variable_ranges={}, progress_bar=True)¶ Run the optimizer with the goal of maximizing the outputs (i.e., KPIs) of a given model.
- Args:
- model (datastories.model.Model): The input model whose KPIs are to be maximized.
- variable_ranges (dict [str, datastories.optimization.VariableRange] = {}): An optional dictionary mapping variable names to ranges that limit the search for the optimum solution to a given domain.
- progress_bar (obj|bool=True): An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
A datastories.optimization.OptimizationResult object encapsulating the model variable values that maximize the model outputs.
-
minimize
(self, model, variable_ranges={}, progress_bar=True)¶ Run the optimizer with the goal of minimizing the outputs (i.e., KPIs) of a given model.
- Args:
- model (datastories.model.Model): The input model whose KPIs are to be minimized.
- variable_ranges (dict [str, datastories.optimization.VariableRange] = {}): An optional dictionary mapping variable names to ranges that limit the search for the optimum solution to a given domain.
- progress_bar (obj|bool=True): An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
A datastories.optimization.OptimizationResult object encapsulating the model variable values that minimize the model outputs.
-
optimize
(self, model, optimization_spec=OptimizationSpecification(), variable_ranges={}, direction=None, progress_bar=True)¶ Optimize an input model according to a given optimization specification.
Args:
- model (datastories.model.Model): The input model to be optimized.
- optimization_spec (datastories.optimization.OptimizationSpecification): An optional specification for the optimization objectives and constraints. The default value is an empty specification (i.e., OptimizationSpecification()).
- variable_ranges (dict [str, datastories.optimization.VariableRange] = {}): An optional dictionary mapping variable names to ranges that limit the search for the optimum solution to a given domain.
- direction (datastories.optimization.OptimizationDirection): The direction of optimization when no specification is provided. Can be one of:
OptimizationDirection.MAXIMIZE
OptimizationDirection.MINIMIZE
- progress_bar (obj|bool=True): An object of type datastories.display.ProgressReporter, or a boolean to get a default implementation (i.e., True to display progress, False to show nothing).
Returns:
A datastories.optimization.OptimizationResult object encapsulating the model variable values that satisfy the optimization specification.
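For readers unfamiliar with the particle swarm strategy, the following self-contained sketch shows the general idea on a toy function. It is not the library's optimizer: the function pso_minimize, the inertia and attraction coefficients (0.7 and 1.5), and the clamping of positions are assumptions chosen for the example. The population_size and iterations arguments play the same role as the constructor arguments above, and bounds plays the role of variable_ranges.

```python
import random


def pso_minimize(objective, bounds, population_size=30, iterations=60, seed=0):
    """Minimize objective over box bounds with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Start every particle at a random position inside the allowed ranges.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(population_size)]
    vel = [[0.0] * dim for _ in range(population_size)]
    best_pos = [p[:] for p in pos]              # per-particle best position
    best_val = [objective(p) for p in pos]      # per-particle best value
    leader = min(range(population_size), key=best_val.__getitem__)
    swarm_pos, swarm_val = best_pos[leader][:], best_val[leader]

    for _ in range(iterations):
        for i in range(population_size):
            for d, (lo, hi) in enumerate(bounds):
                # Inertia plus pulls toward the particle and swarm bests.
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (best_pos[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (swarm_pos[d] - pos[i][d]))
                # Clamp the move so the particle stays inside its range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < swarm_val:
                    swarm_pos, swarm_val = pos[i][:], val
    return swarm_pos, swarm_val


def bowl(x):
    """Toy objective with its minimum at (3, -1)."""
    return (x[0] - 3) ** 2 + (x[1] + 1) ** 2


optimum, value = pso_minimize(bowl, bounds=[(-10, 10), (-10, 10)])
```

A larger population explores the domain more thoroughly, while more iterations give the swarm more time to converge; both increase the run time.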
-
class
datastories.optimization.
OptimizerType
¶ Enumeration for DataStories supported optimizer types.
- Possible value:
OptimizerType.PARTICLE_SWARM
-
class
datastories.optimization.
OptimizationDirection
¶ Enumeration for possible optimization goals when no other optimization specification is provided.
- Possible values:
OptimizationDirection.MAXIMIZE
OptimizationDirection.MINIMIZE
-
class
datastories.optimization.
OptimizationSpecification
(objectives=None, constraints=None)¶ Encapsulates a set of optimization objectives and constraints that can be used to configure an optimization analysis.
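Both objectives and constraints are defined using datastories.optimization.VariableSpec and (potentially) datastories.optimization.VariableMapper objects.
Example:
    spec = OptimizationSpecification()
    # Objectives: minimize KPI_1 (weight 2) and keep KPI_2 within [1, 100].
    spec.objectives = [
        Minimize('KPI_1', 2),
        InInterval('KPI_2', 1, 100)
    ]
    # Constraint on an aggregate: Input_1 + Input_2 must not exceed 100.
    # Sum takes a list of variable names as its first argument.
    spec.add_constraint(AtMost(Sum(['Input_1', 'Input_2']), 100))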
-
add_constraint
(self, constraint)¶ Add an optimization constraint to the specification.
-
add_objective
(self, objective)¶ Add an optimization objective to the specification.
-
constraints
¶ Get/set the optimization specification constraints.
-
objectives
¶ Get/set the optimization specification objectives.
-
-
class
datastories.optimization.
OptimizationResult
¶ Encapsulates the result of a datastories.optimization.pso.Optimizer.optimize() analysis.
Note: Objects of this class should not be manually constructed.
-
is_complete
¶ Check whether the search for the optimum has been interrupted before completion.
-
is_feasible
¶ Check whether the identified optimum position respects the imposed constraints (if any).
-
optimum
¶ Get the model variable values for the identified optimum position.
-
to_pandas
(self)¶ Export the optimum position to a Pandas DataFrame object.
Returns:
The constructed Pandas DataFrame object.
-
-
class
datastories.optimization.
VariableRange
(min=None, max=None, value=None)¶ Encapsulates a numeric or categorical value range.
- Numeric ranges are defined by a lower and an upper bound.
- Categorical ranges are currently limited to a single value.
- Args:
- min (double = 0): a numeric range lower bound
- max (double = 0): a numeric range upper bound
- value (str = ''): a categorical range value
-
is_categorical
¶ Check if the variable range is categorical.
-
is_numeric
¶ Check if the variable range is numeric.
-
max
¶ Get the upper bound of a numeric range.
-
min
¶ Get the lower bound of a numeric range.
-
value
¶ Get the value of a categorical range.
-
class
datastories.optimization.
VariableMapper
¶ Base class for all variable mappers.
Variable mappers are the first parameter to be passed when defining optimization objectives and constraints. They indicate to what variable or group of variables the objective/constraint applies.
For simple cases (i.e., one variable), variable mappers can be replaced with the name of the variable itself. However, in more complex scenarios (e.g., a constraint that applies to the aggregated value of a number of variables), mappers have to be explicitly constructed.
-
class
datastories.optimization.
Sum
(operands, weights=None)¶ Bases:
datastories.optimization.specification.VariableMapper
Aggregates a number of variables using a weighted sum. This can be then used to define optimization objectives or constraints.
Args:
- operands (list): a list of variable names to sum up.
- weights (list = None): a list of relative weights for aggregating the given variables.
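The aggregation performed by a weighted sum can be sketched in plain Python. The helper weighted_sum below is hypothetical and only illustrates the arithmetic; treating missing weights as 1.0 for every operand is an assumption.

```python
def weighted_sum(values, operands, weights=None):
    """Aggregate the named variables of one candidate point."""
    if weights is None:
        weights = [1.0] * len(operands)   # assumption: unweighted by default
    if len(weights) != len(operands):
        raise ValueError("one weight per operand is required")
    return sum(w * values[name] for name, w in zip(operands, weights))


point = {"Input_1": 30.0, "Input_2": 50.0, "Input_3": 7.0}
plain = weighted_sum(point, ["Input_1", "Input_2"])
weighted = weighted_sum(point, ["Input_1", "Input_2"], weights=[2.0, 0.5])
```

The aggregated value is what an objective or constraint then compares against its limit, so a single constraint can cap the combined amount of several inputs.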
-
class
datastories.optimization.
VariableSpec
¶ Base class for all optimization objectives and constraints.
-
class
datastories.optimization.
AtMost
(operand, double limit, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be lower than a given reference value.
Args:
- operand (obj): a variable mapper (datastories.optimization.specification.VariableMapper) indicating what the objective/constraint applies to.
- limit (double): the reference value to compare against.
- weight (double = 1): the relative weight of this objective/constraint among all the specified objectives or constraints.
-
class
datastories.optimization.
AtLeast
(operand, double limit, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be greater than a given reference value.
Args:
- operand (obj): a variable mapper (datastories.optimization.specification.VariableMapper) indicating what the objective/constraint applies to.
- limit (double): the reference value to compare against.
- weight (double = 1): the relative weight of this objective/constraint among all the specified objectives or constraints.
-
class
datastories.optimization.
InInterval
(operand, double lower_limit, double upper_limit, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective or constraint by which a variable (or aggregation of variables) should be in a given reference interval.
Args:
- operand (obj): a variable mapper (datastories.optimization.specification.VariableMapper) indicating what the objective/constraint applies to.
- lower_limit (double): the lower bound of the reference interval.
- upper_limit (double): the upper bound of the reference interval.
- weight (double = 1): the relative weight of this objective/constraint among all the specified objectives or constraints.
-
class
datastories.optimization.
IsEqual
(operand, double value, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should be equal to a given reference value.
Note: This cannot be used to define optimization constraints. To achieve a similar effect when defining a constraint, one can use a combination of datastories.optimization.specification.AtMost and datastories.optimization.specification.AtLeast instead.
Args:
- operand (obj): a variable mapper (datastories.optimization.specification.VariableMapper) indicating what the objective applies to.
- value (double): the reference value to compare against.
- weight (double = 1): the relative weight of this objective among all the specified objectives.
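The logic behind combining an upper and a lower bound to obtain an equality constraint can be sketched with plain predicates. These helpers are hypothetical and are not the library classes of the same purpose; treating both bounds as inclusive is an assumption.

```python
def at_most(value, limit):
    """True when value does not exceed limit (inclusive, by assumption)."""
    return value <= limit


def at_least(value, limit):
    """True when value is not below limit (inclusive, by assumption)."""
    return value >= limit


def in_interval(value, lower_limit, upper_limit):
    """True when value lies between the two limits."""
    return at_least(value, lower_limit) and at_most(value, upper_limit)


def is_feasible(value, checks):
    """A point is feasible only if every constraint check passes."""
    return all(check(value) for check in checks)


# Two one-sided constraints with the same limit act as an equality constraint.
equality_as_constraints = [lambda v: at_most(v, 10.0),
                           lambda v: at_least(v, 10.0)]
results = {v: is_feasible(v, equality_as_constraints) for v in (9.9, 10.0, 10.1)}
```

In practice an exact equality is hard for a stochastic search to hit, so a narrow interval around the target value is often a more workable constraint.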
-
class
datastories.optimization.
Minimize
(operand, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should have the smallest possible value.
Note: This cannot be used to define optimization constraints.
Args:
- operand (obj): a variable mapper (datastories.optimization.specification.VariableMapper) indicating what the objective applies to.
- weight (double = 1): the relative weight of this objective among all the specified objectives.
-
class
datastories.optimization.
Maximize
(operand, double weight=1.0)¶ Bases:
datastories.optimization.specification.VariableSpec
Specifies an optimization objective by which a variable (or aggregation of variables) should have the largest possible value.
Note: This cannot be used to define optimization constraints.
Args:
- operand (obj): a variable mapper (datastories.optimization.specification.VariableMapper) indicating what the objective applies to.
- weight (double = 1): the relative weight of this objective among all the specified objectives.
Visualization¶
Plots¶
The datastories.visualization package contains a collection of visualizations that facilitate the assessment of selected DataStories analysis results.
-
class
datastories.visualization.
ClassificationPlot
(data, predicted_name, actual_name, prediction_performance=None, vis_settings=<datastories.visualization.classification_plot.ClassificationPlotSettings object>, *args, **kwargs)¶ Visual representation of binary classification (performance).
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.ClassificationPlotSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ClassificationPlotSettings
objects.
-
class
datastories.visualization.
ClassificationPlotSettings
(x_axis=None, jitter=0.2, *args, **kwargs)¶ Encapsulates visualization settings for datastories.visualization.ClassificationPlot visualizations.
- Args:
x_axis (str=None): Column to display on the X axis;
jitter (float=0.2): Amount of 'jitter' to add on the Y axis in order to minimize overlapping.
- Attributes:
- Same as the Args section above.
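The effect of the jitter setting can be illustrated with a small sketch. The helper add_jitter is hypothetical, and drawing the offset uniformly from [-jitter, +jitter] is an assumption; the reference does not specify how the visualization distributes the points.

```python
import random


def add_jitter(y_values, jitter=0.2, seed=0):
    """Shift each value by a random offset of at most +/- jitter."""
    rng = random.Random(seed)
    return [y + rng.uniform(-jitter, jitter) for y in y_values]


labels = [0, 0, 0, 1, 1]      # binary classes plotted on the Y axis
spread = add_jitter(labels)   # same classes, vertically spread out
```

Because binary outcomes take only two Y values, points would otherwise be drawn on top of each other; a small jitter keeps the classes visually separate while revealing point density.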
-
class
datastories.visualization.
ConfusionMatrix
(prediction_performance, vis_settings=<datastories.visualization.confusion_matrix.ConfusionMatrixSettings object>, *args, **kwargs)¶ Encapsulates a visual representation of model accuracy for binary classification models.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.ConfusionMatrixSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ConfusionMatrixSettings
objects.
-
to_html
(file_name, title='Confusion Matrix', subtitle='')¶ Exports the visualization to a standalone HTML document.
- Args:
file_name (str): name of the file to export to;
title (str='Confusion Matrix'): HTML document title;
subtitle (str=''): HTML document subtitle.
-
class
datastories.visualization.
ConfusionMatrixSettings
(width=480, height=320)¶ Encapsulates visualization settings for datastories.visualization.ConfusionMatrix visualizations.
- Args:
width (int=480): Graph width in pixels;
height (int=320): Graph height in pixels.
- Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
CorrelationBrowser
(vis_settings=<datastories.visualization.correlation_browser.CorrelationBrowserSettings object>, *args, **kwargs)¶ Encapsulates a visual representation of correlation between features.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.CorrelationBrowserSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the visualization.
Accepts the same parameters as the constructor for
datastories.visualization.CorrelationBrowserSettings
objects.
-
to_html
(file_name, title='Feature correlation browser', subtitle='')¶ Exports the visualization to a standalone HTML document.
- Args:
file_name (str): name of the file to export to;
title (str='Feature correlation browser'): HTML document title;
subtitle (str=''): HTML document subtitle.
-
class
datastories.visualization.
CorrelationBrowserSettings
(scale=1, node_opacity=0.9, edge_opacity=0.3, tension=0.65, font_size=15, filter_unconnected=False, min_weight=50, max_weight=100, weight_key='weightMI', show_controls=True)¶ Encapsulates visualization settings for datastories.visualization.CorrelationBrowser visualizations.
- Args:
scale (float=1): Scale factor of the radius [0-1];
node_opacity (float=0.9): Opacity of the nodes that aren't hovered or connected to hovered or selected nodes [0-1];
edge_opacity (float=0.3): Opacity of the edges that aren't hovered or connected to hovered or selected nodes [0-1];
tension (float=0.65): The tension of the links. A tension of 0 means straight lines [0-1];
font_size (int=15): Font size used for the nodes of the plot [10-32];
filter_unconnected (boolean=False): Whether nodes that aren't connected to any other node are filtered from the view;
min_weight (int=50): Minimum weight of the links that will be shown [0-100];
max_weight (int=100): Maximum weight of the links that will be shown [0-100];
weight_key (str='weightMI'): Type of relations to display ['weightMI' for Mutual Information, 'weightL' for Linear Correlation].
- Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
DataSummaryTable
(summary, vis_settings=<datastories.visualization.data_summary_table.DataSummaryTableSettings object>, *args, **kwargs)¶ Encapsulates a visual representation of data frame summary.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.DataSummaryTableSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the visualization.
Accepts the same parameters as the constructor for
datastories.visualization.DataSummaryTableSettings
objects.
-
to_html
(file_name, title='Data Summary', subtitle='')¶ Exports the visualization to a standalone HTML document.
- Args:
file_name (str): name of the file to export to;
title (str='Data Summary'): HTML document title;
subtitle (str=''): HTML document subtitle.
-
class
datastories.visualization.
DataSummaryTableSettings
(page_size=25, show_console=True)¶ Encapsulates visualization settings for datastories.visualization.DataSummaryTable visualizations.
- Args:
page_size (int=25): Maximum number of columns to display on one summary page.
- Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
ErrorPlot
(prediction_performance, vis_settings=<datastories.visualization.error_plot.ErrorPlotSettings object>, *args, **kwargs)¶ Encapsulates a visual representation of model accuracy for regression models.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.ErrorPlotSettings containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the visualization.
Accepts the same parameters as the constructor for
datastories.visualization.ErrorPlotSettings
objects.
-
to_html
(file_name, title='Error Plot', subtitle='')¶ Exports the visualization to a standalone HTML document.
- Args:
file_name (str): name of the file to export to;
title (str='Error Plot'): HTML document title;
subtitle (str=''): HTML document subtitle.
-
class
datastories.visualization.
ErrorPlotSettings
(sort_key='id', lines=False, highlight_outliers=False, threshold=None, confidence=False, x_padding=0, y_padding=1, marker_size=32, hover_marker_size_delta=32, animations=500, margin_top=10, margin_right=20, margin_bottom=40, margin_left=60)¶ Encapsulates visualization settings for datastories.visualization.ErrorPlot visualizations.
- Args:
sort_key (str='id'): The sorting criterion for the X axis. Possible values:
- id: sort on record id;
- act: sort on record actual KPI value;
- pred: sort on record predicted value.
lines (bool=False): True if points should be connected by lines;
highlight_outliers (bool=False): True if outliers should be highlighted;
threshold (float=None): Threshold;
confidence (bool=False): True if confidence limits should be displayed;
x_padding (int=0): X padding;
y_padding (int=1): Y padding;
marker_size (int=32): Size of the point marker;
hover_marker_size_delta (int=32): Size of the point hover marker;
animations (int=500): Animation duration in milliseconds;
margin_top (int=10):
margin_right (int=20):
margin_bottom (int=40):
margin_left (int=60):
- Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
OutlierXPlot
(outliers_result, vis_settings=<datastories.visualization.outlier_plot.OutlierPlotSettings object>, *args, **kwargs)¶ Encapsulates a visual representation of outliers resulting from a one dimensional analysis.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.OutlierPlotSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the visualization.
Accepts the same parameters as the constructor for
datastories.visualization.OutlierPlotSettings
objects.
-
to_html
(file_name, title='Confusion Matrix', subtitle='')¶ Exports the visualization to a standalone
HTML
document.
- Args:
file_name (str): name of the file to export to;
title (str='Confusion Matrix'): HTML document title;
subtitle (str=''): HTML document subtitle.
-
class
datastories.visualization.
OutlierPlotSettings
(width=800, height=200, x_padding=0.2, y_padding=0.2, marker_size=32, hover_marker_size_delta=32, animations=500, show_jitter=True, show_cdf=True, show_iqr=True, show_summary=True, show_console=True, show_legend=True, low_threshold=0.05, high_threshold=0.95)¶ Encapsulates visualization settings for
datastories.visualization.OutlierXPlot
visualizations.
- Args:
width (int=800): Graph width in pixels;
height (int=200): Graph height in pixels;
x_padding (float=0.2): X padding;
y_padding (float=0.2): Y padding;
marker_size (int=32): Size of the point marker;
hover_marker_size_delta (int=32): Size of the point hover marker;
animations (int=500): Animation duration in milliseconds;
show_jitter (bool=True): adds some jitter to the vertical dimension, to better distinguish points;
show_cdf (bool=True): shows the cumulative distribution function;
show_iqr (bool=True): displays the inter-quartile range, as specified in the lower and higher threshold arguments;
show_summary (bool=True): displays the summary table;
show_console (bool=True): displays the visualization console where update operations are logged;
show_legend (bool=True): displays the legend;
low_threshold (float=0.05): the lower threshold for the inter-quartile range;
high_threshold (float=0.95): the upper threshold for the inter-quartile range.
- Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
PredictedVsActual
(prediction_performance, vis_settings=<datastories.visualization.predicted_vs_actual.PredictedVsActualSettings object>, *args, **kwargs)¶ Encapsulates a visual representation of model accuracy for regression models.
Note: Objects of this class should not be manually constructed.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
- Attributes:
vis_settings (obj): an object of type datastories.visualization.PredictedVsActualSettings
containing visualization settings. Set this object before displaying the visualization or exporting to HTML.
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the visualization.
Accepts the same parameters as the constructor for
datastories.visualization.PredictedVsActualSettings
objects.
-
to_html
(file_name, title='Predicted vs Actual', subtitle='')¶ Exports the visualization to a standalone
HTML
document.
- Args:
file_name (str): name of the file to export to;
title (str='Predicted vs Actual'): HTML document title;
subtitle (str=''): HTML document subtitle.
-
class
datastories.visualization.
PredictedVsActualSettings
(width=400, highlight_outliers=True, threshold=0.5, x_padding=0.2, y_padding=0.2, marker_size=32, hover_marker_size_delta=32, animations=500)¶ Encapsulates visualization settings for
datastories.visualization.PredictedVsActual
visualizations.
- Args:
width (int=400): Graph width in pixels;
highlight_outliers (bool=True): True if outliers should be highlighted;
threshold (float=0.5): Threshold;
x_padding (float=0.2): X padding;
y_padding (float=0.2): Y padding;
marker_size (int=32): Size of the point marker;
hover_marker_size_delta (int=32): Size of the point hover marker;
animations (int=500): Animation duration in milliseconds.
- Attributes:
- Same as the Args section above.
-
class
datastories.visualization.
WhatIfs
(current_values=[], minimize_values=None, maximize_values=None, raw_model=None, vis_settings=<datastories.visualization.whatifs.WhatIfsSettings object>, *args, **kwargs)¶ Encapsulates a visual representation for exploring the influence of driver variables on target KPIs.
One can display this visualization in an IPython Notebook by simply giving the name of an object of this class.
Note: Objects of this class should not be manually constructed.
-
drivers
¶ Retrieves the driver values
-
maximize
()¶ Identify a set of driver values that maximize the KPI
-
minimize
()¶ Identify a set of driver values that minimize the KPI
-
plot
(*args, **kwargs)¶ Convenience function to set-up and display the visualization.
Accepts the same parameters as the constructor for
datastories.visualization.WhatIfsSettings
objects.
-
to_html
(file_name, title='What-Ifs', subtitle='')¶ Exports the visualization to a standalone
HTML
document.
- Args:
file_name (str): name of the file to export to;
title (str='What-Ifs'): HTML document title;
subtitle (str=''): HTML document subtitle.
-
-
datastories.visualization.
what_ifs
(model_path=None, current_values=[], minimize_values=None, maximize_values=None, raw_model=None)¶ Displays a WhatIf visualization in a Jupyter notebook based on an input RSX model file.
- Args:
model_path (str=None): path to the input RSX model file; if None, the raw_model argument has to be provided;
current_values (list=[]): list of initial driver values;
minimize_values (list=None): driver values that minimize the KPI;
maximize_values (list=None): driver values that maximize the KPI;
raw_model (bytes=None): an optional bytes object containing the source of the backing RSX model.
- Returns:
- An object of type
datastories.visualization.WhatIfs
.
Example:
from datastories.visualization import what_ifs
what_ifs('my_model.rsx')
Utils¶
The datastories.display
package contains a collection
of display helpers.
-
class
datastories.display.
ProgressCounter
¶ Base class implemented by all progress counters.
- Attr:
total (int): the number of steps required for completion;
step (int): the current step;
start_time (int): the start time in ns;
stop_time (int): the stop time in ns.
-
increment
(steps=1)¶ Advances the progress with a number of steps.
- Args:
steps (int): the number of steps to advance
-
start
(total=1)¶ Initializes the progress range.
- Args:
total (int): the number of steps required for completion
-
stop
()¶ Stops progress monitoring.
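As a rough illustration of the start / increment / stop life cycle described above, here is a standalone stand-in class. It is not the DataStories implementation: the class name, the clamping of step to total, and the fraction_done helper are assumptions made for the sketch.

```python
import time


class SketchProgressCounter:
    """Stand-in for illustration; not datastories.display.ProgressCounter."""

    def __init__(self):
        self.total = 0
        self.step = 0
        self.start_time = None
        self.stop_time = None

    def start(self, total=1):
        # Initialize the progress range and record the start time in ns.
        self.total = total
        self.step = 0
        self.start_time = time.perf_counter_ns()
        self.stop_time = None

    def increment(self, steps=1):
        # Advance the progress; clamping to the total is an assumption.
        self.step = min(self.step + steps, self.total)

    def stop(self):
        # Record the stop time in ns.
        self.stop_time = time.perf_counter_ns()

    @property
    def fraction_done(self):
        # Hypothetical helper: completed share of the declared range.
        return self.step / self.total if self.total else 0.0


counter = SketchProgressCounter()
counter.start(total=4)
for _ in range(3):
    counter.increment()
counter.stop()
print(counter.step, counter.total, counter.fraction_done)
```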
-
class
datastories.display.
ProgressReporter
¶ Abstract base class implemented by all progress reporters.
-
log
(message)¶ Logs a progress message.
- Args:
message (str): Progress message to log.
-
on_progress
(progress)¶ Logs the completion percentage.
- Args:
progress (float): Completion percentage to be logged.
-
-
datastories.display.
get_progress_bar
(progress_bar)¶ Retrieves a default implementation for a progress bar.
- Args:
progress_bar (obj|bool=False): An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing). When a
datastories.display.ProgressReporter
object is provided, it is returned as is.
- Returns:
- An object of type
datastories.display.ProgressReporter
.
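The selection rule described for get_progress_bar can be sketched without the library. All class and function names below are hypothetical stand-ins, and raising TypeError for other argument types is an assumption, since the reference does not say what happens in that case.

```python
class SketchReporter:
    """Stand-in for datastories.display.ProgressReporter (illustration only)."""

    def log(self, message):
        raise NotImplementedError

    def on_progress(self, progress):
        raise NotImplementedError


class PrintingReporter(SketchReporter):
    """Default used when progress should be displayed."""

    def log(self, message):
        print(message)

    def on_progress(self, progress):
        print(f"{progress:.0f}% done")


class SilentReporter(SketchReporter):
    """Default used when nothing should be shown."""

    def log(self, message):
        pass

    def on_progress(self, progress):
        pass


def sketch_get_progress_bar(progress_bar=False):
    # A reporter object is handed back unchanged.
    if isinstance(progress_bar, SketchReporter):
        return progress_bar
    # A boolean selects one of the two default implementations.
    if isinstance(progress_bar, bool):
        return PrintingReporter() if progress_bar else SilentReporter()
    # Assumption: anything else is rejected.
    raise TypeError("progress_bar must be a reporter object or a bool")


reporter = sketch_get_progress_bar(True)
reporter.log("training started")
reporter.on_progress(50.0)
```

Passing a custom reporter is the route to take when progress should go somewhere other than the notebook output, for example to a log file.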
-
datastories.display.
wide_screen
(width=0.95)¶ Make the notebook screen wider when running under
Jupyter Notebook
.
- Args:
width (float=0.95): width of notebook as a fraction of the screen width. Should be in the interval [0,1].
-
datastories.display.
init_graphics
()¶ Initializes the DataStories graphics engine.
Use this function at the top of your notebooks when planning to save HTML copies of your work.
Regression¶
The datastories.regression
package contains a collection
of classes and functions to facilitate regression analysis.
-
class
datastories.regression.
RegressionError
(value)¶ Exception generated when failing to execute regression analysis methods.
Story¶
The datastories.story
package contains a collection
of workflows to automate specific analysis tasks (e.g., building a predictive model).
Predict Single KPI¶
-
datastories.story.
predict_single_kpi
(data_frame, column_list, kpi, runs=3, outlier_elimination=True, prototypes='auto', progress_bar=True)¶ Fits a non-linear regression model on a data frame in order to predict one column.
The column to be predicted (i.e., the KPI) is to be identified either by name or by column index in the data frame.
- Args:
data_frame (obj): the input data frame (either a
pandas.core.frame.DataFrame
or a datastories.data.DataFrame
object);
column_list (list): the list of variables (i.e., columns) to consider for regression;
kpi (int|str): the index or the name of the target (i.e., KPI) column;
runs (int=3): the number of training rounds;
outlier_elimination (bool=True): set to True in order to exclude far outliers from modeling;
prototypes (str='auto'): indicates whether analysis should be performed on prototypes. Possible values:
'yes'
: use only prototypes as inputs;
'no'
: use all original inputs;
'auto'
: use prototypes if there are more than 200 input variables.
progress_bar (obj|bool=True): An object of type
datastories.display.ProgressReporter
, or a boolean to get a default implementation (i.e., True
to display progress, False
to show nothing).
- Returns:
- An object of type
datastories.story.predict_single_kpi.Story
wrapping-up the computed model.
- Raises:
ValueError
: when an invalid value is provided for one of the input parameters.
datastories.story.StoryError
: if there is a problem fitting the model.
Example:
from datastories.story import predict_single_kpi
import pandas as pd
df = pd.read_csv('example.csv')
kpi_column_index = 1
story = predict_single_kpi(df, df.columns, kpi_column_index, progress_bar=True)
print(story)
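The argument semantics documented above (a KPI given by index or by name, and the 'yes' / 'no' / 'auto' prototype modes) can be mirrored in a short standalone sketch. The helpers resolve_kpi and use_prototypes are hypothetical and are not part of the DataStories API; zero-based indexing is an assumption, as is the exact error message.

```python
def resolve_kpi(columns, kpi):
    """Map a KPI given by index or by name to a column name (sketch)."""
    columns = list(columns)
    if isinstance(kpi, int):
        # Assumption: indices are zero-based, as in pandas.
        if not 0 <= kpi < len(columns):
            raise ValueError(f"KPI index out of range: {kpi}")
        return columns[kpi]
    if kpi in columns:
        return kpi
    raise ValueError(f"unknown KPI column: {kpi!r}")


def use_prototypes(mode, n_inputs, auto_limit=200):
    """Decide whether prototypes are used, following the documented modes."""
    if mode == "yes":
        return True
    if mode == "no":
        return False
    if mode == "auto":
        # Documented rule: prototypes when there are more than 200 inputs.
        return n_inputs > auto_limit
    raise ValueError(f"invalid prototypes value: {mode!r}")


columns = ["temperature", "yield", "pressure"]
print(resolve_kpi(columns, 1))
print(resolve_kpi(columns, "pressure"))
print(use_prototypes("auto", n_inputs=len(columns)))
```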
-
class
datastories.story.predict_single_kpi.
Story
(platform, kpi_name, user_columns, nrows, folder='', *args, **kwargs)¶ Bases:
datastories.api.interface.IPredictiveStory
,datastories.story.predict_single_kpi.predict_single_kpi.StoryRun
Encapsulates the result of the
datastories.story.predict_single_kpi()
story.
Note: Objects of this class should not be manually constructed.
-
add_note
(note)¶ Add an annotation to the story results.
The already present annotations can be retrieved using the
datastories.api.IStory.notes()
property.
- Args:
note (str): the annotation to be added.
-
assert_alive
()¶ Triggers an exception if the object has been manually released.
-
clear_note
(note_id)¶ Remove a specific annotation associated with the story analysis.
- Args:
note_id (int): the index of the note to be removed.
- Raises:
ValueError
: if the note index is unknown.
-
clear_notes
()¶ Clear the annotations associated with the story analysis.
-
correlation_browser
¶ A visualization for assessing feature correlation.
An object of type
datastories.visualization.CorrelationBrowser
that can be used for assessing feature correlation, as discovered while training the model.
-
static
load
(file_name)¶ Loads a previously saved story.
- Args:
file_name (str): the name of the source file.
- Returns:
- An object of type
datastories.story.predict_single_kpi.Story
encapsulating training results from a previous analysis.
- Raises:
datastories.story.StoryError
: if there is a problem loading the story file (e.g., story version not compatible).
-
make_independent
(base_folder='')¶ Make the object independent by copying the required resources to its own folder.
- Args:
base_folder (str=’’): the base folder for the unique object folder that will hold the required resources.
-
metrics
¶ A dictionary containing the model performance metrics and the list of main drivers.
These metrics are computed on the training data for the purpose of evaluating the model prediction performance.
- The following metrics are retrieved:
Training Set Size: size of the actual data frame used for training (rows x columns)
Correlation: actual vs predicted correlation
Estimated Correlation: estimated correlation for future (unseen) values
R-squared: the coefficient of determination
MSE: mean squared error
RMSE: root mean squared error
Main Drivers: list of main features with associated relative importance and energy
Features: list of all features with associated relative importance and energy
Computation Effort: a measure of model complexity
Number of Runs: number of training rounds
Best Run: best performing training round
Run Overview: overview of individual runs including Performance and Feature Importance
- In case the KPI is a binary variable, the following additional metrics are included:
Positive Label: the label used to identify positive cases
Negative Label: the label used to identify negative cases
True Positives: number of correctly identified positive cases (TP)
False Positives: number of incorrectly identified positive cases (FP)
True Negatives: number of correctly identified negative cases (TN)
False Negatives: number of incorrectly identified negative cases (FN)
Not Classified: number of records that could not be classified (i.e., KPI is NaN)
True Positive Rate: TP / (TP + FN) * 100 (a.k.a. sensitivity, recall)
False Positive Rate: FP / (FP + TN) * 100 (a.k.a. fall-out)
True Negative Rate: TN / (FP + TN) * 100 (a.k.a. specificity)
False Negative Rate: FN / (TP + FN) * 100 (a.k.a. miss rate)
Precision: percentage of correctly identified cases from the total reported positive cases, TP / (TP + FP) * 100
Recall: percentage of correctly identified cases from the total existing positive cases, TP / (TP + FN) * 100
Accuracy: percentage of correctly identified cases, (TP + TN) / (TP + FP + TN + FN) * 100
F1 Score: the F1 score (the harmonic mean of precision and recall)
AUC: area under the (ROC) curve
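The binary-KPI formulas listed above can be checked by hand with a few lines of plain Python. This is a sketch of the formulas only, not of how the library computes them; the function name binary_metrics and the example counts are made up, and reporting the F1 score on the same 0 to 100 scale as precision and recall is an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Apply the documented formulas to raw confusion-matrix counts."""
    # No guard against empty classes: zero denominators raise ZeroDivisionError.
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    return {
        "True Positive Rate": recall,
        "False Positive Rate": fp / (fp + tn) * 100,
        "True Negative Rate": tn / (fp + tn) * 100,
        "False Negative Rate": fn / (tp + fn) * 100,
        "Precision": precision,
        "Recall": recall,
        "Accuracy": (tp + tn) / (tp + fp + tn + fn) * 100,
        # Harmonic mean of precision and recall (percent scale assumed).
        "F1 Score": 2 * precision * recall / (precision + recall),
    }


# Made-up counts for illustration.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Note that the true positive rate and the false negative rate always add up to 100, as do the true negative rate and the false positive rate, which is a quick sanity check on any reported set of values.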
-
model
¶ Returns an object of type
datastories.model.SingleKpiPredictor
that can be used for making predictions on new data.
-
notes
¶ A text representation of all annotations currently associated with the story analysis.
-
plot
(*args, **kwargs)¶ Plots a graphical representation of the results in Jupyter Notebook.
-
release
()¶ Releases the object associated storage.
Note: This function should only be used in order to force releasing allocated resources. Using the object after this point would lead to an exception.
-
run_overview
¶ An overview of feature importance metrics across all runs.
-
runs
¶ A list containing the results of individual analysis rounds.
Each entry in the list is an object of type
datastories.story.predict_single_kpi.StoryRun
encapsulating the results associated with a given analysis round.
-
save
(file_name)¶ Saves the story analysis results.
Use this function to persist the results of the
datastories.story.predict_single_kpi()
analysis. One can reload them and continue investigations at a later moment using the datastories.story.predict_single_kpi.Story.load()
method.
- Args:
file_name (str): the name of the destination file.
-
to_csv
(file_name, content='metrics', delimiter=', ', decimal='.')¶ Exports a list of model metrics to a
CSV
file.
- Args:
file_name (str): name of the file to export to.
content (str='metrics'): the type of metrics to export. Possible values:
metrics - exports estimated model performance metrics;
drivers - exports driver importance metrics;
run_overview - exports an overview of feature importance metrics across all runs.
delimiter (str=','): CSV delimiter
decimal (str='.'): CSV decimal point
- Raises:
ValueError
: when an invalid value is provided for the content
parameter.
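To make the roles of delimiter and decimal concrete, the sketch below writes a small metrics dictionary with Python's standard csv module. It does not call the DataStories to_csv method; the helper metrics_to_csv, the two-column "metric" / "value" layout, and the sample values are assumptions chosen for illustration.

```python
import csv
import io


def metrics_to_csv(metrics, stream, delimiter=",", decimal="."):
    """Write name/value pairs as CSV with a configurable decimal point."""
    writer = csv.writer(stream, delimiter=delimiter)
    writer.writerow(["metric", "value"])
    for name, value in metrics.items():
        if isinstance(value, float):
            # Swap the decimal point, e.g. for locales that use a comma.
            value = repr(value).replace(".", decimal)
        writer.writerow([name, value])


buffer = io.StringIO()
metrics_to_csv(
    {"R-squared": 0.98, "Number of Runs": 3},
    buffer,
    delimiter=";",
    decimal=",",
)
print(buffer.getvalue())
```

Pairing a semicolon delimiter with a comma decimal point is the usual combination for spreadsheet tools configured for continental European locales.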
-
to_excel
(file_name)¶ Exports the list of model metrics to an
Excel
file.
- Args:
file_name (str): name of the file to export to.
-
to_pandas
(content='metrics')¶ Exports a list of model metrics to a
pandas.core.frame.DataFrame
object.
- Args:
content (str='metrics'): the type of metrics to export. Possible values:
metrics - exports estimated model performance metrics;
drivers - exports feature importance metrics for the model;
run_overview - exports an overview of feature importance metrics across all runs.
- Returns:
- The constructed
pandas.core.frame.DataFrame
object.
- Raises:
ValueError
: when an invalid value is provided for the content
parameter.
-
what_ifs
¶ A visualization for interactive exploration of the models.
The visualization helps getting insight into how driver variables influence the target KPIs. An object of type
datastories.visualization.WhatIfs
that can be used for interactive exploration of the models.
-
-
class
datastories.story.predict_single_kpi.
StoryRun
(platform, kpi_name, nrows, *args, **kwargs)¶ Bases:
datastories.api.interface.IAnalysisResult
,datastories.core.utils.object_.StorageBackedObject
Encapsulates the result of one analysis round from the
datastories.story.predict_single_kpi()
story.
Note: Objects of this class should not be manually constructed.
-
correlation_browser
¶ A visualization for assessing feature correlation.
An object of type
datastories.visualization.CorrelationBrowser
that can be used for assessing feature correlation, as discovered while training the model.
-
metrics
¶ A dictionary containing the model performance metrics and the list of main drivers.
These metrics are computed on the training data for the purpose of evaluating the model prediction performance.
- The following metrics are retrieved:
Training Set Size: size of the actual data frame used for training (rows x columns)
Correlation: actual vs predicted correlation
Estimated Correlation: estimated correlation for future (unseen) values
R-squared: the coefficient of determination
MSE: mean squared error
RMSE: root mean squared error
Main Drivers: list of main features with associated relative importance and energy
Features: list of all features with associated relative importance and energy
- In case the KPI is a binary variable, the following additional metrics are included:
Positive Label: the label used to identify positive cases
Negative Label: the label used to identify negative cases
True Positives: number of correctly identified positive cases (TP)
False Positives: number of incorrectly identified positive cases (FP)
True Negatives: number of correctly identified negative cases (TN)
False Negatives: number of incorrectly identified negative cases (FN)
Not Classified: number of records that could not be classified (i.e., KPI is NaN)
True Positive Rate: TP / (TP + FN) * 100 (a.k.a. sensitivity, recall)
False Positive Rate: FP / (FP + TN) * 100 (a.k.a. fall-out)
True Negative Rate: TN / (FP + TN) * 100 (a.k.a. specificity)
False Negative Rate: FN / (TP + FN) * 100 (a.k.a. miss rate)
Precision: percentage of correctly identified cases from the total reported positive cases, TP / (TP + FP) * 100
Recall: percentage of correctly identified cases from the total existing positive cases, TP / (TP + FN) * 100
Accuracy: percentage of correctly identified cases, (TP + TN) / (TP + FP + TN + FN) * 100
F1 Score: the F1 score (the harmonic mean of precision and recall)
AUC: area under the (ROC) curve
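For the regression metrics named above (Correlation, R-squared, MSE, RMSE), the following sketch uses the textbook definitions on a pair of actual and predicted value lists. Whether the library uses exactly these definitions is an assumption; the function name and the sample values are invented for the example.

```python
import math


def regression_metrics(actual, predicted):
    """Textbook definitions of the regression metrics (sketch only)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Residual and total sums of squares.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    # Pieces of the Pearson correlation between actual and predicted.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    mse = ss_res / n
    return {
        "Correlation": cov / math.sqrt(ss_tot * var_p),
        "R-squared": 1 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }


# Made-up values for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
result = regression_metrics(actual, predicted)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```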
-
model
¶ An object of type
datastories.model.SingleKpiPredictor
that can be used for making predictions on new data.
-
to_csv
(file_name, content='metrics', delimiter=', ', decimal='.')¶ Exports a list of model drivers or metrics to a
CSV
file.
- Args:
file_name (str): name of the file to export to.
content (str='metrics'): the type of metrics to export. Possible values:
metrics - exports estimated model performance metrics;
drivers - exports driver importance metrics.
delimiter (str=','): CSV delimiter
decimal (str='.'): CSV decimal point
- Raises:
ValueError
: when an invalid value is provided for the content
parameter.
-
to_excel
(file_name)¶ Exports the list of model drivers and metrics to an
Excel
file.
- Args:
file_name (str): name of the file to export to.
-
to_pandas
(content='metrics')¶ Exports a list of model drivers or metrics to a
pandas.core.frame.DataFrame
object.
- Args:
content (str='metrics'): the type of metrics to export. Possible values:
metrics - exports estimated model performance metrics;
drivers - exports driver importance metrics.
- Returns:
- The constructed
pandas.core.frame.DataFrame
object.
- Raises:
ValueError
: when an invalid value is provided for the content
parameter.
-
what_ifs
¶ A visualization for interactive exploration of the models.
The visualization helps getting insight into how driver variables influence the target KPIs. An object of type
datastories.visualization.WhatIfs
that can be used for interactive exploration of the models.
-
License¶
-
datastories.api.
get_activation_info
()¶ Get information required to create and activate a DataStories license.
- Returns:
dict: a dictionary containing data to be submitted to the DataStories representative in charge of issuing the license.
The datastories.license
package contains a collection
of utility functions to facilitate license management.
These functions are available as methods of a predefined object
of class datastories.license.LicenseManager
called manager
.
Example:
from datastories.license import manager
manager.initialize('my_license.lic')
manager
-
class
datastories.license.
LicenseManager
(license_file_path=None)¶ Encapsulates the DataStories license manager.
The license manager enables users to inspect the details of their installed DataStories API license, and to use license keys that are not available in the standard installation locations (see Installation).
This class should not be instantiated directly. Instead one should use the already available object instance
datastories.license.manager
.
- Args:
license_file_path (str=None): The path to a license key file or folder if other than the standard locations for the platform.
- Attributes:
status (str): the status of the license manager initialization.
license (obj): the managed license as indicated in the license key file.
Example:
from datastories.license import manager
manager.initialize('my_license.lic')
manager
-
default_license_path
¶ Default path used for license initialization if none provided
-
initialize
(license_file_path=None)¶ Initialize the license manager with a license key at a specific location.
- Args:
license_file_path (str): The path to a license key file or a folder containing the license key file.
- Raises:
ValueError
: when the provided license_file_path
is not accessible.
-
is_granted
(opt)¶ Checks if execution rights are granted for license protected functionality.
- Args:
opt (str): the license option required by the protected functionality.
- Returns:
- (bool):
True
if execution rights are granted by the installed license.
-
is_ok
()¶ Check the initialization status of the license manager.
The license manager initialization fails when no valid license file is found in the standard or user indicated locations.
- Returns:
- (bool):
True
if the license manager was successfully initialized.
Note: A successful license manager initialization does not guarantee a grant for using license protected functionality. For example, when an expired license is used, the initialization is still successful. To check whether execution rights are granted one should use the
datastories.license.LicenseManager.is_granted()
method.
-
reinitialize
()¶ Re-initializes the license manager.
This is done using the same license file path as in the previous call to
datastories.license.LicenseManager.initialize()
.
-
release
()¶ Releases the currently held licenses.
This can be useful e.g., when using floating or counted licenses, as it makes the released licenses available for other clients or processes.
Note: once a license is released, the associated execution rights are retracted. In order to use the license protected functionality, users need to acquire the license, by initializing the license manager again (i.e.,
datastories.license.LicenseManager.initialize()
).
-
class
datastories.license.rlm.
LicenseError
(value)¶ Exception generated when accessing license protected functionality using an invalid license.