automl.datahandler package¶
Submodules¶
automl.datahandler.dataloader module¶
Module to expose classes to interact with a given dataset.
Here we abstract any dataset as a class to ease the interaction within the package’s modules. Additionally, we provide a class to load datasets in this new format.
class automl.datahandler.dataloader.DataLoader¶
Bases: object
Class to load a dataset as a Dataset object from different sources.
It exposes static methods only.
static get_openml_dataset(openml_id, problem_type)¶
Fetch a dataset from OpenML and return a Dataset object.
Parameters: - openml_id (int) – ID for the dataset, as stored in OpenML.
- problem_type (int) – Type of problem to solve in the dataset. 0 for classification, 1 for regression.
Returns: An auto-ml Dataset, as defined in this module. Its default ID will be the concatenation of the OpenML dataset name and ID.
Return type: Dataset
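A minimal usage sketch (the OpenML ID below is only an illustrative value):

    from automl.datahandler.dataloader import DataLoader

    # Fetch OpenML dataset 31 (an arbitrary example ID) as a classification task (0).
    dataset = DataLoader.get_openml_dataset(openml_id=31, problem_type=0)
    # The default dataset_id concatenates the OpenML dataset name and its ID.
    print(dataset.dataset_id)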
class automl.datahandler.dataloader.Dataset(dataset_id=None, X=None, y=None, categorical_indicators=None, problem_type=0)¶
Bases: object
Class abstracting a dataset.
In this class we abstract a dataset as an object composed of a features pandas.DataFrame and a target pandas.DataFrame (with the name ‘target’). Additionally, a Dataset contains a categorical indicator for each of the feature columns, an ID and a problem type to solve (either classification or regression).
Parameters: - dataset_id (str) – A string identifying the dataset. If None is passed to the constructor, then the default is a string of 6 random letters/digits.
- X (pandas.DataFrame) – The Data Frame containing the features of the dataset. Shape is (n, m).
- y (pandas.DataFrame) – The Data Frame containing the target value (e.g. the label for a class). It should be of shape (n, 1).
- categorical_indicators (list) – A list of m booleans following a 1-to-1 relation with the features’ columns that indicate whether or not each feature is categorical. If None is passed to the constructor, the default is a list of m False values.
- problem_type (int) – 0 indicates classification, 1 indicates regression. Defaults to 0.
Raises: - TypeError – If X and/or y are not pandas Data Frames.
- ValueError – If X and y are not of shape (n, m) and (n, 1), respectively.
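A small construction sketch using the documented constructor parameters; the feature and target values are illustrative only:

    import pandas as pd
    from automl.datahandler.dataloader import Dataset

    X = pd.DataFrame({"age": [25, 32, 47, 51],
                      "height": [1.70, 1.65, 1.80, 1.75]})
    y = pd.DataFrame({"target": [0, 1, 0, 1]})

    dataset = Dataset(dataset_id="toy-dataset", X=X, y=y,
                      categorical_indicators=[False, False],  # one flag per feature column
                      problem_type=0)                          # 0 = classification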
X¶
The Data Frame containing the features of the dataset. Shape is (n, m).
Type: pandas.DataFrame
y¶
The Data Frame containing the target value (e.g. the label for a class). It should be of shape (n, 1).
Type: pandas.DataFrame
categorical_indicators¶
A list of m booleans following a 1-to-1 relation with the features’ columns that indicate whether or not each feature is categorical.
Type: list
dataset_id¶
A string identifying the dataset.
Type: str
problem_type¶
0 indicates classification, 1 indicates regression.
Type: int
is_classification_problem()¶
Whether or not the dataset is registered as a classification task.
Returns: True if the dataset has been marked as a classification problem. Return type: bool
is_regression_problem()¶
Whether or not the dataset is registered as a regression task.
Returns: True if the dataset has been marked as a regression problem. Return type: bool
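A short sketch showing how the two checks above can be used to branch on the task type (dataset is assumed to be an existing Dataset instance):

    if dataset.is_classification_problem():
        print("classification task")
    elif dataset.is_regression_problem():
        print("regression task")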
is_sparse()¶
Return whether or not the X data is sparse.
Returns: True if the X data frame is a sparse representation. Return type: bool
metafeatures_vector()¶
Return the metafeatures of this dataset as a vector (numpy array).
Returns: The metafeatures as a numpy array. Return type: np.array
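A brief inspection sketch for the two helpers above, assuming dataset is an existing Dataset instance:

    print(dataset.is_sparse())              # True if X uses a sparse representation
    vector = dataset.metafeatures_vector()  # metafeatures as a numpy array
    print(vector.shape)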
train_test_split(random_state=42, test_size=0.33)¶
Make a train-test split, as defined in scikit-learn, for the dataset.
Parameters: - random_state (int) – The random state to initialize with so that results can be reproduced.
- test_size (float) – Proportion of the split used for the test set.
Returns: A 4-tuple, as returned by train_test_split in scikit-learn.
Return type: tuple
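A usage sketch, assuming the returned 4-tuple follows scikit-learn’s (X_train, X_test, y_train, y_test) ordering:

    X_train, X_test, y_train, y_test = dataset.train_test_split(
        random_state=42, test_size=0.33)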
n_features¶
Return the number of features in the dataset.
Returns: The number of features. Return type: int
n_labels¶
Provide the number of different labels (target) for this dataset.
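Both n_features and n_labels are documented without call parentheses, so the sketch below assumes they are accessed as properties:

    print(dataset.n_features)  # number of feature columns in X
    print(dataset.n_labels)    # number of distinct target labels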
Module contents¶
Group submodules that abstract and process data.