automl.utl package

Submodules

automl.utl.arff_operations module

Read ARFF files and provide an interface to interact with them.

The current MetaDatabase (the place where the meta-knowledge acquired by auto-sklearn is stored) is based on ARFF files. These are expressive, but they are not straightforward to manipulate, either for matrix operations or for data access.

The intention here is to provide classes that perform some basic, common operations and expose the dataset as numpy/pandas objects, which are easier to manipulate for data science purposes.

class automl.utl.arff_operations.ARFFWrapper(arff_dataset=None, arff_filepath=None, sort_attributes=False, sort_attr_backwards=False)

Bases: object

Class to perform operations on an ARFF dataset.

Since performing matrix operations, or even basic data manipulation, on ARFF datasets in Python is not straightforward, this wrapper performs some of these operations.

Parameters:
  • arff_dataset (dict) – The dataset as an ARFF dictionary. If arff_filepath is not None, arff_dataset is ignored.
  • arff_filepath (str) – The filepath used to load the dataset. Use None to skip loading from a file.
  • sort_attributes (bool) – Whether or not to sort the columns after loading the dataset. Defaults to False.
  • sort_attr_backwards (bool) – If sort_attributes is True, this argument controls the reverse/natural order of the sorting. Defaults to False (non-reverse).
Raises:
  • KeyError – When the dictionary describing the ARFF dataset does not contain the official fields: data, relation, attributes and description.
  • TypeError – When the values for each of the official fields in the ARFF dataset are not as expected: list for data and attributes, and str for relation and description.
  • TypeError – When arff_dataset is not a dictionary.
  • ValueError – If both arff_filepath and arff_dataset are None (no dataset passed).
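
The checks listed under Raises can be pictured with a small, self-contained sketch. The helper name `validate_arff_dict` and the order of the checks are assumptions made for illustration; this is not the wrapper's actual implementation, only one plausible way to enforce the documented rules.

```python
# Hypothetical helper: name and check order are assumptions for illustration.
REQUIRED_FIELDS = {
    "data": list,
    "attributes": list,
    "relation": str,
    "description": str,
}


def validate_arff_dict(arff_dataset):
    """Check that an ARFF dictionary has the documented fields and types."""
    if arff_dataset is None:
        raise ValueError("No dataset passed.")
    if not isinstance(arff_dataset, dict):
        raise TypeError("arff_dataset must be a dictionary.")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in arff_dataset:
            raise KeyError(f"Missing ARFF field: {field}")
        if not isinstance(arff_dataset[field], expected_type):
            raise TypeError(
                f"Field {field!r} must be of type {expected_type.__name__}."
            )
    return arff_dataset


# A tiny, made-up dataset with the four official fields.
example = {
    "relation": "toy",
    "description": "A tiny example dataset.",
    "attributes": [("id", "STRING"), ("score", "REAL")],
    "data": [["a", 0.5], ["b", 0.9]],
}
validate_arff_dict(example)  # passes without raising
```
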
data

The data field in an ARFF dataset.

Type:dict
description

The description as in the original ARFF dataset.

Type:str
name

Corresponds to the relation field in the original ARFF dataset.

Type:str
key_attributes

The names for the columns (attributes) in the ARFF dataset.

Type:list
add_attribute_key(column=None)

Add an attribute to the list of keys.

A key is an identifier in the dataset, similar to a primary key in a database.

Parameters:column (str) – The column to add as a key in the dataset. Defaults to None.
Raises:ValueError – If column is None or not in the attributes.
arrf_dict()

Return the dataset as an arff dictionary.

Returns:The resulting ARFF dataset in the form of a dictionary.
Return type:dict
as_numpy_array()

Return the data as a numpy array.

Returns:The dataset as numpy object.
Return type:numpy.array
as_pandas_df()

Return the data as a pandas data frame.

Returns:The data as a pandas dataframe.
Return type:pandas.DataFrame
attribute_names()

Return a list with the names of the attributes/features/columns.

Returns:
A list with the names of the attributes, in the order they
appear in the dataset.
Return type:list
attribute_types()

Return a list with the attribute types.

Returns:
A list with the types of the attributes, in the order they
appear in the dataset.
Return type:list
change_attribute_type(column=None, newtype=None)

Change the dtype of a column.

Parameters:
  • column (str or int) – The column we want to work with, either its name or its index position. Defaults to None.
  • newtype (type) – The new type to assign as the column’s dtype. Defaults to None.
Raises:
  • TypeError – When column is neither a str nor an int.
  • ValueError – When column is None.
  • IndexError – If column is an int and is out of bounds with respect to the number of columns (attributes) in the dataset.
copy()

Create a copy of this object.

drop_attributes(attributes=None, inplace=True)

Drop one or more columns (attributes) from the dataset.

Parameters:
  • attributes (list) – List of attributes to drop. Defaults to None.
  • inplace (bool) – Whether to perform the drop inplace or not. Defaults to True.
Raises:

ValueError – If an attribute in attributes is not part of the dataset columns, or if None is passed.

Returns:

The dataframe after dropping the attribute.

Return type:

pandas.DataFrame

load_dataset(path=None)

Load the dataset from an ARFF file.

It loads the file and validates that the resulting arff dict is valid. If a dataset had already been loaded, then this operation overrides it.

Parameters:
  • path (str) – The path where the file to load is located. Defaults to None.
Raises:
  • KeyError – When the dictionary describing the ARFF dataset does not contain the official fields: data, relation, attributes and description.
  • TypeError – When the values for each of the official fields in the ARFF dataset are not as expected: list for data and attributes, and str for relation and description.
  • ValueError – When the path argument is None.
row_by_column_value(column=None, value=None)

Return the row that satisfies column=value.

Parameters:
  • column (str) – The column to query against. Defaults to None.
  • value (object) – The value to search for in the specified column. Defaults to None.
Raises:
  • ValueError – If column or value are None.
  • TypeError – If column is not an instance of str.
Returns:

The row satisfying the search, if any. An empty series if nothing was found.

Return type:

pandas.Series

sort_attributes(backwards=False, inplace=True)

Sort the columns in the dataset.

Parameters:
  • backwards (bool) – Whether to apply reverse order or not. Defaults to False.
  • inplace (bool) – Whether to perform the mutation inplace or not. Defaults to True.
Returns:

The self object after the changes if inplace is True, or a copy with the modification otherwise.

Return type:

ARFFWrapper
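
To make the column sorting concrete, here is a minimal sketch that works on plain lists rather than on the wrapper itself. It assumes the attributes are stored as a list of (name, type) pairs and the data as a list of rows; the function name `sort_columns` is hypothetical, and the real method may work differently (for example, through a pandas data frame).

```python
def sort_columns(attributes, data, backwards=False):
    """Return attributes and rows reordered alphabetically by attribute name."""
    names = [name for name, _ in attributes]
    # Indices that would sort the names, in natural or reverse order.
    order = sorted(range(len(names)), key=names.__getitem__, reverse=backwards)
    sorted_attributes = [attributes[i] for i in order]
    # Apply the same column order to every row so values stay aligned.
    sorted_data = [[row[i] for i in order] for row in data]
    return sorted_attributes, sorted_data


attributes = [("score", "REAL"), ("id", "STRING")]
data = [[0.5, "a"], [0.9, "b"]]
new_attributes, new_data = sort_columns(attributes, data)
```

The sketch returns new lists instead of mutating its inputs, which corresponds to the inplace=False behaviour described above.
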

sort_rows(columns=None, ascending=True, inplace=True)

Sort the rows, similarly to pandas dataframe sorting.

Parameters:
  • columns (list) – Columns (ordered) to sort by. Defaults to None.
  • ascending (bool) – Whether or not to sort in ascending order. Defaults to True.
  • inplace (bool) – Whether or not to perform the change in place. Defaults to True.
Raises:

ValueError – If a value in columns is not part of the dataset or None is passed.

Returns:

The self object if inplace is True. A copy with the modifications otherwise.

Return type:

ARFFWrapper

summary()

Print a summary of the dataset.

It shows basic information: name, description, shape and first 5 observations.

values_by_attribute(column=None)

Return the data in one column (attribute).

Parameters:

column (str or int) – This is the column we want to retrieve, either by its name or its positional index.

Returns:

The values in the specified column.

Return type:

numpy.array

Raises:
  • TypeError – If column is neither an int nor a str.
  • ValueError – If the column is not in the attributes.
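
The lookup by name or by position, together with the errors listed above, can be sketched as follows. The function `column_values` is a hypothetical stand-in that operates on plain lists and returns a list rather than a numpy array. Raising ValueError for an out-of-range index is an assumption, since the documentation only mentions a column that is "not in the attributes".

```python
def column_values(attribute_names, data, column):
    """Return the values of one column, selected by name or by index."""
    if isinstance(column, str):
        if column not in attribute_names:
            raise ValueError(f"Unknown attribute: {column}")
        index = attribute_names.index(column)
    elif isinstance(column, int):
        # Assumption: an out-of-range position is treated as "not in the attributes".
        if not 0 <= column < len(attribute_names):
            raise ValueError(f"Column index out of range: {column}")
        index = column
    else:
        raise TypeError("column must be an int or a str.")
    return [row[index] for row in data]


names = ["id", "score"]
rows = [["a", 0.5], ["b", 0.9]]
scores = column_values(names, rows, "score")
ids = column_values(names, rows, 0)
```
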
shape

Return a tuple with the shape of the data.

Returns:A 2-tuple whose first element is the number of rows (n) and whose second element is the number of columns (m).
Return type:tuple

automl.utl.json_utils module

automl.utl.json_utils.get_individual_cs(component_name)

This function returns the configuration space of the component

Parameters:component_name – Name of the component. It should exactly match the name of the component's class.
Returns:The configuration space of the component.
Return type:ConfigurationSpace
automl.utl.json_utils.write_cs_to_json_file(cs, json_name)

This function writes the configuration space object into a json file.

Parameters:
  • cs (ConfigurationSpace) – The configuration space object
  • json_name (str) – Name of the component. It should exactly match the name of the component's class.

automl.utl.miscellaneous module

Module defining useful methods that are not specific to AutoML.

automl.utl.miscellaneous.argsort_list(src_list=None, backwards=False)

Return the indices that would sort src_list (alphabetically).

Similar to argsort for numpy arrays. It is based on unutbu’s answer on Stack Overflow.

Parameters:
  • src_list (list) – The list to sort. Defaults to None.
  • backwards (bool) – Whether or not to do reverse sort. Defaults to False.
Returns:

The resulting list of indices sorted by value.

Return type:

list
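
A common way to obtain argsort behaviour for a plain Python list is to sort the index range using the list values as keys. The sketch below follows that idea; the name `argsort_list_sketch` is hypothetical and the actual function body may differ.

```python
def argsort_list_sketch(src_list, backwards=False):
    """Return the indices that would sort src_list."""
    # Sort positions 0..n-1, comparing them by the value stored at each position.
    return sorted(range(len(src_list)), key=src_list.__getitem__, reverse=backwards)


indices = argsort_list_sketch(["pear", "apple", "fig"])  # [1, 2, 0]
reverse_indices = argsort_list_sketch(["pear", "apple", "fig"], backwards=True)  # [0, 2, 1]
```

Indexing the original list with the returned indices, in order, yields its elements in sorted order.
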

automl.utl.miscellaneous.generate_random_id(n_chars=6)

Generate an internal ID of a given length.

Parameters:n_chars (int) – The length. Defaults to 6.
Returns:The resulting ID.
Return type:str
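
As an illustration only, an ID of a given length can be built by drawing characters at random from a fixed alphabet. The alphabet used here (lowercase letters and digits) and the name `random_id_sketch` are assumptions. The documentation does not state which characters the real function uses.

```python
import random
import string


def random_id_sketch(n_chars=6):
    """Return a random string of n_chars lowercase letters and digits."""
    # Assumed alphabet; the real function may draw from a different one.
    alphabet = string.ascii_lowercase + string.digits
    return "".join(random.choices(alphabet, k=n_chars))


new_id = random_id_sketch()
longer_id = random_id_sketch(n_chars=10)
```

IDs generated this way are not guaranteed to be unique, so callers that need uniqueness would have to check for collisions themselves.
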

Module contents

Module containing utility classes and methods.

Other code in this library can use the methods and classes defined here. Any piece of code in this module must not be specific to AutoML tasks.