automl.datahandler package

Submodules

automl.datahandler.dataloader module

Module to expose classes to interact with a given dataset.

Here we abstract any dataset as a class to ease the interaction within the package’s modules. Additionally, we provide a class to load datasets in this new format.

class automl.datahandler.dataloader.DataLoader

Bases: object

Class to load a dataset as a Dataset object from different sources.

It exposes static methods only.

static get_openml_dataset(openml_id, problem_type)

Fetch a dataset from OpenML and return a Dataset object.

Parameters:
  • openml_id (int) – ID for the dataset, as stored in OpenML.
  • problem_type (int) – Type of problem to solve in the dataset. 0 for classification, 1 for regression.
Returns:

An auto-ml Dataset, as defined in this module.

Its default ID will be the concatenation of the OpenML dataset name and ID.

Return type:

Dataset

class automl.datahandler.dataloader.Dataset(dataset_id=None, X=None, y=None, categorical_indicators=None, problem_type=0)

Bases: object

Class abstracting a dataset.

In this class we abstract a dataset as an object composed of a features pandas.DataFrame and a target pandas.DataFrame (with the name ‘target’). Additionally, a Dataset contains categorical indicators for each of the feature columns, an ID, and a problem type to solve (either classification or regression).

Parameters:
  • dataset_id (str) – A string identifying the dataset. If None is passed to the constructor, then the default is a string of 6 random letters/digits.
  • X (pandas.DataFrame) – The Data Frame containing the features of the dataset. Shape is (n, m).
  • y (pandas.DataFrame) – The Data Frame containing the target value (e.g. the label for a class). It should be of shape (n, 1).
  • categorical_indicators (list) – A list of m booleans, one per feature column and in the same order, indicating whether each feature is categorical. If None is passed to the constructor, the default is a list of m False values.
  • problem_type (int) – 0 indicates classification, 1 indicates regression. Defaults to 0.
Raises:
  • TypeError – If X and/or y are not pandas Data Frames.
  • ValueError – If X and y are not of shape (n, m), (n, 1) respectively.
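The two constructor defaults described above can be sketched with the standard library alone. This is a minimal illustration, not the package's code: the helper names default_dataset_id and default_categorical_indicators are hypothetical, and the exact alphabet used for the random ID is an assumption.

```python
import random
import string


def default_dataset_id(length=6):
    """Hypothetical helper: build an ID of random letters/digits.

    Assumes ASCII letters (both cases) and digits; the package may
    draw from a different alphabet.
    """
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))


def default_categorical_indicators(n_features):
    """Hypothetical helper: mark all m features as non-categorical."""
    return [False] * n_features


dataset_id = default_dataset_id()               # value differs on every run
indicators = default_categorical_indicators(4)  # one flag per feature column
print(dataset_id)
print(indicators)
```

Passing an explicit dataset_id or categorical_indicators list to the constructor replaces these defaults.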
X

The Data Frame containing the features of the dataset. Shape is (n, m).

Type: pandas.DataFrame
y

The Data Frame containing the target value (e.g. the label for a class). It should be of shape (n, 1).

Type: pandas.DataFrame
categorical_indicators

A list of m booleans, one per feature column and in the same order, indicating whether each feature is categorical.

Type: list
dataset_id

A string identifying the dataset.

Type: str
problem_type

0 indicates classification, 1 indicates regression.

Type: int
is_classification_problem()

Whether or not the dataset is registered as a classification task.

Returns: True if the dataset has been marked as a classification problem.
Return type: bool
is_regression_problem()

Whether or not the dataset is registered as a regression task.

Returns: True if the dataset has been marked as a regression problem.
Return type: bool
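Since problem_type is documented as 0 for classification and 1 for regression, the two predicates above plausibly reduce to integer comparisons. The sketch below shows that reading with standalone functions; the constant names CLASSIFICATION and REGRESSION are hypothetical, and the package's methods take no argument because they read the value from the Dataset instance.

```python
# Hypothetical constant names; the documentation only gives the integers.
CLASSIFICATION = 0
REGRESSION = 1


def is_classification_problem(problem_type):
    """Return True when problem_type encodes classification (0)."""
    return problem_type == CLASSIFICATION


def is_regression_problem(problem_type):
    """Return True when problem_type encodes regression (1)."""
    return problem_type == REGRESSION


default_type = 0  # the documented default for Dataset
print(is_classification_problem(default_type))
print(is_regression_problem(default_type))
```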
is_sparse()

Return whether the X data is sparse.

Returns: True if the X data frame is a sparse representation.
Return type: bool
metafeatures_vector()

Return the metafeatures of this dataset as a vector (numpy array).

Returns: The metafeatures as a numpy array.
Return type: np.array
train_test_split(random_state=42, test_size=0.33)

Make a train-test split as defined in scikit-learn for the dataset.

Parameters:
  • random_state (int) – Seed for the random number generator, so that the split can be reproduced.
  • test_size (float) – Proportion of the dataset to place in the test set.
Returns:

A 4-tuple, as returned by train_test_split in scikit-learn.

Return type:

tuple
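To make the shape of the returned 4-tuple concrete, here is a pure-Python illustration of a seeded split. It is not scikit-learn's implementation and will not select the same rows. The function name seeded_split is hypothetical, and rounding the test count up is an assumption about how the proportion is turned into a row count. What it does show is the order (X_train, X_test, y_train, y_test) and that a fixed random_state makes the split repeatable.

```python
import math
import random


def seeded_split(X, y, random_state=42, test_size=0.33):
    """Hypothetical stand-in: shuffle row indices with a seed, then split."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    # Assumption: the test row count is the proportion rounded up.
    n_test = math.ceil(test_size * len(X))
    indices = list(range(len(X)))
    # A dedicated Random instance keeps the global RNG state untouched.
    random.Random(random_state).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: 100 rows, 2 features, binary label derived from the first feature.
X = [[i, i * 2] for i in range(100)]
y = [i % 2 for i in range(100)]
X_train, X_test, y_train, y_test = seeded_split(X, y)
print(len(X_train), len(X_test))
```

Calling the function again with the same random_state yields the same four lists, which is the reproducibility the parameter is meant to provide.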

n_features

Return the number of features in the dataset.

Returns: The number of features.
Return type: int
n_labels

Provide the number of different labels (target) for this dataset.

Module contents

Groups submodules that abstract and work with data.