automl.datahandler package¶
Submodules¶
automl.datahandler.dataloader module¶
Module to expose classes to interact with a given dataset.
Here we abstract any dataset as a class to ease the interaction within the package’s modules. Additionally, we provide a class to load datasets in this new format.
class automl.datahandler.dataloader.DataLoader¶
Bases: object
Class to load a dataset as a Dataset object from different sources.
It exposes static methods only.
static get_openml_dataset(openml_id, problem_type)¶
Fetch a dataset from OpenML and return a Dataset object.
Parameters: - openml_id (int) – ID for the dataset, as stored in OpenML.
- problem_type (int) – Type of problem to solve in the dataset. 0 for classification, 1 for regression.
Returns: An auto-ml Dataset, as defined in this module. Its default ID will be the concatenation of the OpenML dataset name and ID.
Return type: Dataset
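A minimal usage sketch (the OpenML ID below is only an illustrative value):

    from automl.datahandler.dataloader import DataLoader

    # Fetch OpenML dataset 31 (an arbitrary example ID) as a classification task (0).
    dataset = DataLoader.get_openml_dataset(openml_id=31, problem_type=0)
    # The default dataset_id concatenates the OpenML dataset name and its ID.
    print(dataset.dataset_id)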
class automl.datahandler.dataloader.Dataset(dataset_id=None, X=None, y=None, categorical_indicators=None, problem_type=0)¶
Bases: object
Class abstracting a dataset.
In this class we abstract a dataset as an object composed of a features pandas.DataFrame and a target pandas.DataFrame (with the name ‘target’). Additionally, a Dataset contains a categorical indicator for each of the feature columns, an ID and a problem type to solve (either classification or regression).
Parameters: - dataset_id (str) – A string identifying the dataset. If None is passed to the constructor, then the default is a string of 6 random letters/digits.
- X (pandas.DataFrame) – The Data Frame containing the features of the dataset. Shape is (n, m).
- y (pandas.DataFrame) – The Data Frame containing the target value (e.g. the label for a class). It should be of shape (n, 1).
- categorical_indicators (list) – A list of m booleans following a 1-to-1 relation with the features’ columns that indicate whether or not each feature is categorical. If None is passed to the constructor, the default is a list of m False values.
- problem_type (int) – 0 indicates classification, 1 indicates regression. Defaults to 0.
Raises: - TypeError – If X and/or y are not pandas Data Frames.
- ValueError – If X and y are not of shape (n, m) and (n, 1), respectively.
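A small construction sketch using the documented constructor parameters; the feature and target values are illustrative only:

    import pandas as pd
    from automl.datahandler.dataloader import Dataset

    X = pd.DataFrame({"age": [25, 32, 47, 51],
                      "height": [1.70, 1.65, 1.80, 1.75]})
    y = pd.DataFrame({"target": [0, 1, 0, 1]})

    dataset = Dataset(dataset_id="toy-dataset", X=X, y=y,
                      categorical_indicators=[False, False],  # one flag per feature column
                      problem_type=0)                          # 0 = classification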
X¶
The Data Frame containing the features of the dataset. Shape is (n, m).
Type: pandas.DataFrame
y¶
The Data Frame containing the target value (e.g. the label for a class). It should be of shape (n, 1).
Type: pandas.DataFrame
categorical_indicators¶
A list of m booleans following a 1-to-1 relation with the features’ columns that indicate whether or not each feature is categorical.
Type: list
dataset_id¶
A string identifying the dataset.
Type: str
problem_type¶
0 indicates classification, 1 indicates regression.
Type: int
is_classification_problem()¶
Whether or not the dataset is registered as a classification task.
Returns: True if the dataset has been marked as a classification problem. Return type: bool
is_regression_problem()¶
Whether or not the dataset is registered as a regression task.
Returns: True if the dataset has been marked as a regression problem. Return type: bool
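A short sketch showing how the two checks above can be used to branch on the task type (dataset is assumed to be an existing Dataset instance):

    if dataset.is_classification_problem():
        print("classification task")
    elif dataset.is_regression_problem():
        print("regression task")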
is_sparse()¶
Return whether or not the X data is sparse.
Returns: True if the X data frame is a sparse representation. Return type: bool
metafeatures_vector()¶
Return the metafeatures of this dataset as a vector (numpy array).
Returns: The metafeatures as a numpy array. Return type: np.array
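A brief inspection sketch for the two helpers above, assuming dataset is an existing Dataset instance:

    print(dataset.is_sparse())              # True if X uses a sparse representation
    vector = dataset.metafeatures_vector()  # metafeatures as a numpy array
    print(vector.shape)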
train_test_split(random_state=42, test_size=0.33)¶
Make a train-test split, as defined in scikit-learn, for the dataset.
Parameters: - random_state (int) – The random state to initialize with so that results can be reproduced.
- test_size (float) – Proportion of the split used for the test set.
Returns: A 4-tuple, as returned by train_test_split in scikit-learn.
Return type: tuple
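A usage sketch, assuming the returned 4-tuple follows scikit-learn’s (X_train, X_test, y_train, y_test) ordering:

    X_train, X_test, y_train, y_test = dataset.train_test_split(
        random_state=42, test_size=0.33)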
n_features¶
Return the number of features in the dataset.
Returns: The number of features. Return type: int
n_labels¶
Provide the number of different labels (target) for this dataset.
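Both n_features and n_labels are documented without call parentheses, so the sketch below assumes they are accessed as properties:

    print(dataset.n_features)  # number of feature columns in X
    print(dataset.n_labels)    # number of distinct target labels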
Module contents¶
Group submodules that abstract and process data.