Documentation

class stc.SparseTensorClassifier(targets, features=None, collapse=True, engine='sqlite://', prefix='stc', chunksize=100, cache=True, power=0.5, balance=1, entropy=1, loss='norm', tol=1e-15, verbose=True)

This class implements the Sparse Tensor Classifier (STC), a supervised classification algorithm for categorical data inspired by the notion of superposition of states in quantum physics.

The algorithm is implemented in SQL. By default, the library uses an in-memory SQLite database, shipped with the Python standard library, that requires no configuration by the user. It is also possible to configure STC to run on an alternative DBMS in order to take advantage of persistent storage and scalability (see the initialization sketch after the parameter list below).

The input data must be a pandas DataFrame or a JSON structured as follows:

data = [
    {'key1': [value1, value2, ..., valueN], 'key2': [...], ..., 'keyN': [...]},
    ...
    {'key1': [value1, value2, ..., valueN], 'key2': [...], ..., 'keyN': [...]},
]

Each dictionary represents an item, where each key is a feature associated with one or more values. This makes it easy to deal with multi-valued attributes. STC also supports input data in the form of a pandas DataFrame for tabular data, where each row represents an item, each column represents a key and each cell represents a value. STC deals with categorical data only and all values are internally converted to strings. Continuous features should be discretized first.
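As a minimal sketch of the two accepted input formats (the keys 'animal', 'color', 'size' and their values are hypothetical), the same two items can be provided either as JSON or as a pandas DataFrame:

import pandas as pd

# JSON format: each item maps keys to one or more values,
# which naturally supports multi-valued attributes.
data_json = [
    {'animal': ['dog'], 'color': ['brown', 'white'], 'size': ['medium']},
    {'animal': ['cat'], 'color': ['black'], 'size': ['small']},
]

# Tabular format: one row per item, one column per key, one value per cell.
data_df = pd.DataFrame({
    'animal': ['dog', 'cat'],
    'color': ['brown', 'black'],
    'size': ['medium', 'small'],
})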

Parameters
  • targets (List[str]) – The target variable(s). In the notation above, this is the list of keys to predict.

  • features (Optional[List[str]]) – In the notation above, this is the list of keys to use for prediction.

  • collapse (bool) – If True (the default), merges all the features into a single key and STC reduces to a matrix-based approach. This is fast and efficient, and recommended for tabular data. When False, the items are represented with the cartesian product of the values in each key. In this case, a policy is needed to avoid degenerate probability estimates in the prediction phase. The policy can be specified arbitrarily or learnt automatically with stc.SparseTensorClassifier.learn()

  • engine (Union[Engine, str]) – Connection to database in the form of a SQLAlchemy engine. By default, STC uses an in-memory SQLite database.

  • prefix (str) – Prefix to use in the database tables. STC instances initialized with different prefixes are completely independent. This makes it possible to use the same engine multiple times with different prefixes, without creating a new database/schema. If an instance associated with the same engine and prefix is found on the database, then STC is initialized from the database.

  • chunksize (int) – Number of items to fit and predict per chunk. May impact the computational time.

  • cache (bool) – If True (the default) caches fitted weights to improve performance. The cache can be cleaned with stc.SparseTensorClassifier.clean() and it is automatically cleaned each time new data are fitted with stc.SparseTensorClassifier.fit()

  • power (float) – Hyper-parameter. Smaller values give similar weight to all the features regardless of their frequency. Usually between 0 and 1.

  • balance (float) – Hyper-parameter. The sample is artificially balanced when setting balance=1. It is not balanced with balance=0. For values between 0 and 1 the sample is semi-balanced, increasing the weight of the less frequent classes but not enough to have a balanced sample. For values greater than 1 the sample is super-balanced, where the weight of the less frequent classes is greater than the weight of the most frequent classes.

  • entropy (float) – Hyper-parameter. Higher values lead to predictions based on fewer but more relevant features, and thus more robust to noise. Usually between 0 and 1.

  • loss (str) – Loss function used in stc.SparseTensorClassifier.learn(). Use 'norm' for the Manhattan distance or 'log' for cross-entropy (log-loss).

  • tol (float) – When using loss='log', predicted probabilities equal to zero are replaced with the value of tol.

  • verbose (bool) – If True (the default), print progress information.
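As an initialization sketch (the target and feature names, and the PostgreSQL connection string, are hypothetical):

from sqlalchemy import create_engine
from stc import SparseTensorClassifier

# Default: in-memory SQLite database, no configuration needed.
STC = SparseTensorClassifier(targets=['class'], features=['color', 'shape'])

# Alternative: persistent storage on an external DBMS via a SQLAlchemy engine.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
STC_pg = SparseTensorClassifier(targets=['class'], features=['color', 'shape'],
                                engine=engine, prefix='stc')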

Methods

fit(items[, keep_items, if_exists, clean])

Fit the training data.

predict(items[, policy, probability, explain])

Predict the test data.

explain([features])

Global explainability.

learn([test_size, train_size, stratify, …])

Learn the policy.

policy()

Get the policy.

set(params)

Set parameters.

get(params)

Get parameters.

clean([deep])

Clean the database.

read_sql(sql)

Read SQL query into a pandas DataFrame.

to_sql(x, table[, if_exists])

Write a pandas DataFrame into a SQL table.

connect()

Open the connection to the database.

close()

Close the connection to the database.

clean(deep=False)

Clean the database.

Parameters

deep (bool) – If False (the default), drops temporary tables and the cache. If True, deletes all tables and closes the connection.

Return type

None

close()

Close the connection to the database.

Return type

None

connect()

Open the connection to the database.

Return type

None
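As a sketch of managing a persistent database with clean(), close() and connect() (the SQLite file name is hypothetical):

# File-based SQLite database, so the fitted STC persists across sessions.
STC = SparseTensorClassifier(targets=['class'], engine='sqlite:///stc.db')

STC.clean()           # drop temporary tables and cache only
STC.close()           # release the database connection
STC.connect()         # re-open the connection to resume working
STC.clean(deep=True)  # delete all tables and close the connection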

explain(features=None)

Global explainability. Compute the global contribution of each feature value to each target class label.

Parameters

features (Optional[List[str]]) – The features to use. By default, it uses the features used for prediction.

Return type

DataFrame

Returns

Global explainability table giving the contribution of each feature value to each target class label.
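For instance, assuming STC has already been fitted (the feature name 'color' is hypothetical):

# Global contribution of each feature value to each target class label.
contributions = STC.explain()

# Restrict the explanation to a subset of the features used for prediction.
contributions_color = STC.explain(features=['color'])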

fit(items, keep_items=None, if_exists='fail', clean=True)

Fit the training data. The data must contain both targets and features and must be structured as described above. Incremental fitting is supported, so STC is ready to use in an online learning context.

Parameters
  • items (Union[List[dict], DataFrame]) – The training data in JSON or tabular format as described above.

  • keep_items (Optional[bool]) – If True, stores the individual items seen during fit. This requires longer computational times but makes it possible to estimate the policy with stc.SparseTensorClassifier.learn(). By default, it is False when collapse=True or when only a single target and a single feature have been provided upon initialization; in these cases there is no need to estimate the policy, hence no need to store the individual items.

  • if_exists (str) – The action to take if STC has already been fitted. One of 'fail' (raise an exception), 'append' (incremental fit for online learning), 'replace' (re-fit from scratch).

  • clean (bool) – If True (the default) invalidates the cache used for prediction.

Return type

None
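A sketch of a typical fit, assuming train_df and new_batch_df are pandas DataFrames structured as described above:

# Initial fit; keep_items=True stores the items so that the policy
# can later be learnt with stc.SparseTensorClassifier.learn().
STC.fit(train_df, keep_items=True)

# Incremental fit in an online learning context: append a new batch of items.
STC.fit(new_batch_df, if_exists='append')

# Or discard the previous fit and re-fit from scratch.
STC.fit(train_df, if_exists='replace')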

get(params)

Get parameters. Read the parameters provided upon initialization.

Parameters

params (Union[str, List[str]]) – Name(s) of the parameters to return.

Return type

Any

Returns

Value(s) of the parameters.
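For example (assuming a list input returns the values in the same order):

# Read a single parameter ...
power = STC.get('power')

# ... or several parameters at once.
power, balance = STC.get(['power', 'balance'])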

learn(test_size=None, train_size=None, stratify=True, priority=None, max_features=0, max_actions=0, max_iter=1, max_runtime=0, random_state=None)

Learn the policy. The policy is learnt via reinforcement learning by optimizing the loss function on cross validation. Before learning, STC must be fitted with keep_items=True. The algorithm then proceeds as follows. For each episode, split the training set into train and validation sets. Start with an empty set of features and compute the reward of the state (the negative loss). Add the value to the Q-table. Explore all the next states generated by adding one feature to the empty set, compute their values, and add them to the Q-table. Select the state with the maximum value and move to that state. Explore all the next states generated by adding one feature to the current set of features, and so on. Stop when all features are used or when the value of every next state is lower than the value of the current state.

Parameters
  • test_size (Optional[float]) – Train-test cross validation split (percentage of the training sample).

  • train_size (Optional[float]) – Train-test cross validation split (percentage of the training sample).

  • stratify (bool) – If True, the folds are made by preserving the percentage of samples for each class.

  • priority (Optional[List[str]]) – List of features to learn first.

  • max_features (int) – Maximum number of features to return in the policy. If 0, no limit.

  • max_actions (int) – Maximum number of states to explore at once. If 0, no limit.

  • max_iter (int) – Maximum number of iterations to train the algorithm. If 0, no limit.

  • max_runtime (int) – How long to train the algorithm, in seconds. If 0, no time limit.

  • random_state (Optional[int]) – Random number generator seed, used for reproducible output across function calls.

Return type

Tuple[List[List[str]], List[float]]

Returns

Tuple of (policy, loss). The policy is saved internally and used by default in stc.SparseTensorClassifier.predict(). The second element of the tuple provides the loss associated with the policy.
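A sketch of learning the policy, assuming STC has been fitted with keep_items=True (the feature names in the printed policy are hypothetical):

# Optimize the policy on a 70/30 train-validation split.
policy, loss = STC.learn(test_size=0.3, stratify=True, random_state=42)

# The learnt policy is stored internally and used by default in predict().
print(policy)  # e.g. [['color', 'shape'], ['color'], []]
print(loss)    # loss associated with the policy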

policy()

Get the policy.

Return type

Tuple[List[List[str]], List[float]]

Returns

Output of stc.SparseTensorClassifier.learn()

predict(items, policy=None, probability=True, explain=True)

Predict the test data. The data must be structured as described above and must contain the features. All additional keys are ignored (including targets). If all the attributes of an item are associated with features never seen in the training set, STC will not be able to provide a prediction. In this case, a fallback mechanism is needed: the policy. The policy is a list of sets of features to use in sequence for prediction. The algorithm starts with the first element of the policy (the first set of features). If no prediction can be produced, the second set of features is used, and so on. If the policy ends with the empty list [], then all the items are guaranteed to be predicted (possibly using no features, i.e., they will be assigned the most likely class in the training set). If the policy does not end with the empty list [], predictions may be missing for some items.

Parameters
  • items (Union[List[dict], DataFrame]) – The test data in JSON or tabular format as described above.

  • policy (Optional[List[List[str]]]) – List of lists of features to use for prediction, such as [[f1,f2],[f1],[]]. Earlier lists are applied first. By default, it uses the policy [[features],[]] when only one feature is provided upon initialization or when collapse=True. Otherwise, it uses the policy [[features]] and raises a warning, as predictions may be missing for some items. If a policy has been learnt with stc.SparseTensorClassifier.learn(), that policy is used instead.

  • probability (bool) – If True (the default), returns the probability of the target class label for each predicted item. If False and explain=False, returns the final classification only (saving computational time).

  • explain (bool) – If True (the default) returns the contribution of each feature to the target class label for each predicted item.

Return type

Tuple[DataFrame, Optional[DataFrame], Optional[DataFrame]]

Returns

Tuple of (classification, probability, explainability). The classification table contains the final predictions for each item. Missing predictions are encoded as NaN. The probability table contains the probabilities of the target class labels for each predicted item. Labels that do not appear in this table are associated with zero probability. The explainability table provides the contribution of each feature to the target class label for each predicted item.
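A prediction sketch, assuming test_df is a pandas DataFrame containing the features (the feature names in the custom policy are hypothetical):

# Default: returns classification, probability and explainability tables.
classification, probability, explainability = STC.predict(test_df)

# Custom fallback policy ending with []: every item receives a prediction.
classification, _, _ = STC.predict(test_df, policy=[['color', 'shape'], ['color'], []])

# Classification only: skip probabilities and explanations to save time.
classification, _, _ = STC.predict(test_df, probability=False, explain=False)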

read_sql(sql)

Read SQL query into a pandas DataFrame.

Parameters

sql (str) – SQL query to SELECT data.

Return type

DataFrame

Returns

Output of the SQL query.
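For instance (the table name below is hypothetical; the actual tables depend on the prefix and on the internal schema):

# Inspect the first rows of a table in the STC database.
df = STC.read_sql("SELECT * FROM stc_item LIMIT 10")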

set(params)

Set parameters. Changing parameters does NOT require re-fitting STC: the fitting of STC is independent of the parameters. In particular, the targets can also be changed on the fly (if STC was initialized with collapse=False).

Parameters

params (dict) – Dictionary of parameters to set in the form of {'param': value}. Supported parameters are: targets, chunksize, cache, power, balance, entropy, loss, tol.

Return type

None
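For example (the alternative target name is hypothetical):

# Tune hyper-parameters without re-fitting.
STC.set({'power': 0.3, 'entropy': 0.8})

# Change the target on the fly (requires initialization with collapse=False).
STC.set({'targets': ['other_target']})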

to_sql(x, table, if_exists='fail')

Write a pandas DataFrame into a SQL table.

Parameters
  • x (DataFrame) – A pandas DataFrame to write into table.

  • table (str) – The name of the table to write x into.

  • if_exists (str) – The action to take when the table already exists. One of 'fail' (raise an exception), 'append' (insert new values into the existing table), 'replace' (drop the table before inserting new values).

Return type

None
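For example (the table name 'lookup' is hypothetical):

import pandas as pd

# Write an auxiliary DataFrame into the STC database ...
lookup = pd.DataFrame({'code': ['A', 'B'], 'label': ['first', 'second']})
STC.to_sql(lookup, table='lookup', if_exists='replace')

# ... and read it back with read_sql.
STC.read_sql("SELECT * FROM lookup")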