automl.utl package¶
Submodules¶
automl.utl.arff_operations module¶
Read ARFF files and provide an interface to interact with them.
The current MetaDatabase (the place where the meta-knowledge acquired by auto-sklearn is stored) is based on ARFF files. These are expressive but not straightforward to manipulate, either for matrix operations or for data access.
The intention here is to provide classes that perform some basic, common operations and expose the dataset as numpy/pandas objects, which are easier to manipulate for data science purposes.
-
class
automl.utl.arff_operations.
ARFFWrapper
(arff_dataset=None, arff_filepath=None, sort_attributes=False, sort_attr_backwards=False)¶ Bases:
object
Class to perform operations on an ARFF dataset.
Performing matrix operations or even basic data manipulation on ARFF datasets in Python is not straightforward, so this wrapper exposes some of those operations.
Parameters: - arff_dataset (dict) – The dataset as an ARFF dictionary. If arff_filepath is not None, arff_dataset is ignored.
- arff_filepath (str) – The filepath used to load the dataset. Use None to skip loading from a file.
- sort_attributes (bool) – Whether or not to sort the columns after loading the dataset. Defaults to False.
- sort_attr_backwards (bool) – If sort_attributes is True, this argument controls the reverse/natural order of the sorting. Defaults to False (non-reverse).
Raises: - KeyError – When the dictionary describing the ARFF dataset does not contain the official fields: data, relation, attributes and description.
- TypeError – When the values for each of the official fields in the ARFF dataset are not as expected: list for data and attributes, and str for relation and description.
- TypeError – When arff_dataset is not a dictionary.
- ValueError – If the path argument (arff_filepath) is None and arff_dataset is None (no dataset passed).
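The checks listed under Raises can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: the helper name `validate_arff_dict` and the constant `REQUIRED_FIELDS` are hypothetical, and the example dictionary only assumes the four official fields named above.

```python
# Hypothetical helper illustrating the documented checks;
# this is not the actual ARFFWrapper code.
REQUIRED_FIELDS = {
    "data": list,
    "relation": str,
    "attributes": list,
    "description": str,
}


def validate_arff_dict(arff_dataset):
    """Raise if arff_dataset does not look like an ARFF dictionary."""
    if not isinstance(arff_dataset, dict):
        raise TypeError("arff_dataset must be a dictionary")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in arff_dataset:
            raise KeyError("missing ARFF field: " + field)
        if not isinstance(arff_dataset[field], expected_type):
            raise TypeError(
                "field {!r} must be of type {}".format(field, expected_type.__name__)
            )
    return True


# A toy ARFF dictionary with the four official fields.
example = {
    "description": "Toy dataset",
    "relation": "toy",
    "attributes": [("id", "STRING"), ("score", "REAL")],
    "data": [["a", 0.5], ["b", 0.9]],
}
is_valid = validate_arff_dict(example)
```

Checking the container type first means a non-dictionary argument fails with TypeError before any field lookup is attempted.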
-
data
¶ The data field in an ARFF dataset.
Type: dict
-
description
¶ The description as in the original ARFF dataset.
Type: str
-
name
¶ Corresponds to the relation field in the original ARFF dataset.
Type: str
-
key_attributes
¶ The names for the columns (attributes) in the ARFF dataset.
Type: list
-
add_attribute_key
(column=None)¶ Add an attribute to the list of keys.
A key is an identifier in the dataset, similar to a primary key in a database.
Parameters: column (str) – The column to add as a key in the dataset. Defaults to None.
Raises: ValueError – If column is None or not in the attributes.
-
arrf_dict
()¶ Return the dataset as an arff dictionary.
Returns: The resulting ARFF dataset in the form of a dictionary.
Return type: dict
-
as_numpy_array
()¶ Return the data as a numpy array.
Returns: The dataset as a numpy object.
Return type: numpy.array
-
as_pandas_df
()¶ Return the data as a pandas data frame.
Returns: The data as a pandas dataframe.
Return type: pandas.DataFrame
-
attribute_names
()¶ Return a list with the names of the attributes/features/columns.
Returns: A list with the names of the attributes, in the order they appear in the dataset.
Return type: list
-
attribute_types
()¶ Return a list with the attribute types.
Returns: A list with the types of the attributes, in the order they appear in the dataset.
Return type: list
-
change_attribute_type
(column=None, newtype=None)¶ Change the dtype of a column.
Parameters: - column (str or int) – The column we want to work with, either its name or its index position. Defaults to None.
- newtype (type) – The new type to assign as the column’s dtype. Defaults to None.
Raises: - TypeError – When column is neither a str nor an int.
- ValueError – When column is None.
- IndexError – If column is an int and is out of bounds with respect to the number of columns (attributes) in the dataset.
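The name-or-index convention for column, together with the three documented exceptions, could be implemented along these lines. This is only a sketch under stated assumptions: `resolve_column` is a hypothetical helper, and rejecting negative indices and unknown names is an assumption, not documented behavior.

```python
def resolve_column(column, attribute_names):
    """Return the positional index for a column given by name or by index.

    Hypothetical helper; it mirrors the documented exceptions only.
    """
    if column is None:
        raise ValueError("column must not be None")
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(column, bool) or not isinstance(column, (str, int)):
        raise TypeError("column must be a str or an int")
    if isinstance(column, int):
        # Assumption: negative indices are treated as out of bounds.
        if not 0 <= column < len(attribute_names):
            raise IndexError("column index out of bounds")
        return column
    # Assumption: an unknown column name is reported as a ValueError.
    if column not in attribute_names:
        raise ValueError("unknown column: " + column)
    return attribute_names.index(column)


names = ["id", "score", "label"]
by_name = resolve_column("score", names)
by_index = resolve_column(2, names)
```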
-
copy
()¶ Create a copy of this object.
-
drop_attributes
(attributes=None, inplace=True)¶ Drop one or more columns (attributes) from the dataset.
Parameters: - attributes (list) – List of attributes to drop. Defaults to None.
- inplace (bool) – Whether to perform the drop in place or not. Defaults to True.
Raises: ValueError – If an attribute in attributes is not part of the dataset columns or None is passed.
Returns: The dataframe after dropping the attributes.
Return type: pandas.DataFrame
-
load_dataset
(path=None)¶ Load the dataset from an ARFF file.
It loads the file and validates that the resulting arff dict is valid. If a dataset had already been loaded, then this operation overrides it.
Parameters: path (str) – The path where the file to load is located. Defaults to None.
Raises: - KeyError – When the dictionary describing the ARFF dataset does not contain the official fields: data, relation, attributes and description.
- TypeError – When the values for each of the official fields in the ARFF dataset are not as expected: list for data and attributes, and str for relation and description.
- ValueError – When the path argument is None.
-
row_by_column_value
(column=None, value=None)¶ Return the row that satisfies column=value.
Parameters: - column (str) – The column to query against. Defaults to None.
- value (object) – The value to search for in the specified column. Defaults to None.
Raises: - ValueError – If column or value are None.
- TypeError – If column is not an instance of str.
Returns: The row satisfying the search, if any, or an empty series if nothing was found.
Return type: pandas.Series
-
sort_attributes
(backwards=False, inplace=True)¶ Sort the columns in the dataset.
Parameters: - backwards (bool) – Whether to apply reverse order or not. Defaults to False.
- inplace (bool) – Whether to perform the mutation inplace or not. Defaults to True.
Returns: The self object after the changes if inplace is True, or a copy with the modification otherwise.
Return type: ARFFWrapper
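The inplace convention described here (mutate and return the object itself, or return a modified copy) can be illustrated with a small stand-in class. `TinyTable` is hypothetical and stores plain lists rather than the pandas-backed data of the real wrapper; it only shows the pattern.

```python
import copy


class TinyTable:
    """Hypothetical stand-in holding column names and rows as plain lists."""

    def __init__(self, names, rows):
        self.names = list(names)
        self.rows = [list(row) for row in rows]

    def sort_attributes(self, backwards=False, inplace=True):
        # Work on self when inplace is True, otherwise on a deep copy.
        target = self if inplace else copy.deepcopy(self)
        # Indices that put the column names in sorted order.
        order = sorted(
            range(len(target.names)),
            key=target.names.__getitem__,
            reverse=backwards,
        )
        target.names = [target.names[i] for i in order]
        target.rows = [[row[i] for i in order] for row in target.rows]
        return target


table = TinyTable(["score", "id"], [[0.5, "a"], [0.9, "b"]])
sorted_copy = table.sort_attributes(inplace=False)  # table is left untouched
names_before = list(table.names)
same_object = table.sort_attributes()  # table itself is reordered
```

Returning the object in both cases lets callers chain calls regardless of the inplace setting.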
-
sort_rows
(columns=None, ascending=True, inplace=True)¶ Sort the rows, similarly to pandas dataframe sorting.
Parameters: - columns (list) – Columns (ordered) to sort by. Defaults to None.
- ascending (bool) – Whether or not to sort in ascending order. Defaults to True.
- inplace (bool) – Whether or not to perform the change in place. Defaults to True.
Raises: ValueError – If a value in columns is not part of the dataset or None is passed.
Returns: The self object if inplace is True, or a copy with the modifications otherwise.
Return type: ARFFWrapper
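Sorting by an ordered list of columns means the first column is the most significant key and later columns break ties. A stdlib-only sketch of that behavior, including the documented ValueError, might look like this. The function `sort_rows_sketch` is hypothetical and works on plain lists instead of the wrapper's internal data.

```python
def sort_rows_sketch(names, rows, columns, ascending=True):
    """Return rows sorted by the given columns, most significant first."""
    if columns is None:
        raise ValueError("columns must not be None")
    missing = [c for c in columns if c not in names]
    if missing:
        raise ValueError("unknown columns: {}".format(missing))
    positions = [names.index(c) for c in columns]
    # Build a tuple key so ties on one column fall through to the next.
    return sorted(
        rows,
        key=lambda row: tuple(row[i] for i in positions),
        reverse=not ascending,
    )


names = ["id", "group", "score"]
rows = [["a", 2, 0.5], ["b", 1, 0.9], ["c", 1, 0.1]]
by_group_then_score = sort_rows_sketch(names, rows, ["group", "score"])
```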
-
summary
()¶ Print a summary of the dataset.
It shows basic information: name, description, shape and first 5 observations.
-
values_by_attribute
(column=None)¶ Return the data in one column (attribute).
Parameters: column (str or int) – This is the column we want to retrieve, either by its name or its positional index.
Returns: The values in the specified column.
Return type: numpy.array
Raises: - TypeError – If column is neither an int nor a str.
- ValueError – If the column is not in the attributes.
-
shape
¶ Return a tuple with the shape of the data.
Returns: 2-tuple with first element being n and second m.
Return type: tuple
automl.utl.json_utils module¶
-
automl.utl.json_utils.
get_individual_cs
(component_name)¶ This function returns the configuration space of the component
Parameters: component_name – Name of the component. It must be exactly the same as the name of the component's class.
Returns: The configuration space of the component.
Return type: ConfigurationSpace
-
automl.utl.json_utils.
write_cs_to_json_file
(cs, json_name)¶ This function writes the configuration space object into a json file.
Parameters: - cs (ConfigurationSpace) – The configuration space object
- json_name (str) – Name of the component. It must be exactly the same as the name of the component's class.
automl.utl.miscellaneous module¶
Module defining useful methods that are not specific to AutoML.
-
automl.utl.miscellaneous.
argsort_list
(src_list=None, backwards=False)¶ Return the indices that would sort src_list alphabetically.
Similar to argsort for numpy arrays. It is based on unutbu’s answer on Stack Overflow.
Parameters: - src_list (list) – The list to sort. Defaults to None.
- backwards (bool) – Whether or not to do reverse sort. Defaults to False.
Returns: The resulting list of indices sorted by value.
Return type: list
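One common way to get argsort behavior for a plain list is to sort the index range using the list's own values as the key. The sketch below shows that idea. It is an illustration only: `argsort_list_sketch` is a hypothetical name, and it does not handle the documented None default for src_list.

```python
def argsort_list_sketch(src_list, backwards=False):
    """Return the indices that would sort src_list."""
    return sorted(
        range(len(src_list)),
        key=src_list.__getitem__,
        reverse=backwards,
    )


names = ["score", "id", "label"]
order = argsort_list_sketch(names)
# Applying the indices to the source list yields the sorted values.
sorted_names = [names[i] for i in order]
```

The returned indices can be reused to reorder any parallel list, for example the attribute types that accompany the attribute names.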
-
automl.utl.miscellaneous.
generate_random_id
(n_chars=6)¶ Generate an internal ID of a given length.
Parameters: n_chars (int) – The length. Defaults to 6.
Returns: The resulting ID.
Return type: str
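A fixed-length random ID can be produced with the standard library alone. The sketch below is an assumption about how such a helper might work, not the library's code: the name `random_id_sketch` is hypothetical, and the alphabet (ASCII letters and digits) is a guess, since the source does not say which characters are used.

```python
import random
import string


def random_id_sketch(n_chars=6):
    """Return a random string of n_chars ASCII letters and digits."""
    # Assumed alphabet; the real function may draw from a different set.
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=n_chars))


new_id = random_id_sketch()
longer_id = random_id_sketch(n_chars=10)
```

IDs generated this way are not guaranteed to be unique, so a caller that needs uniqueness has to check for collisions itself.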
Module contents¶
Module containing utility classes and methods.
Other code in this library can use the methods and classes defined here. Code in this module must not be specific to AutoML tasks.