automl.discovery package

Submodules

automl.discovery.assistant module

Module containing the classes that simplify pipeline discovery.

class automl.discovery.assistant.Assistant(dataset=None, metalearning_metric='accuracy', evaluation_metric='accuracy')

Bases: object

Provide methods to return AutoML results.

This class can compute metafeatures, find similar datasets, and discover pipelines. Both a dataset and a metric are required.

Parameters:
  • dataset (Dataset) – A dataset object to work with. Defaults to None.
  • metalearning_metric – A metric in the metalearning database to use when retrieving similar datasets. Defaults to ‘accuracy’.
  • evaluation_metric (string or callable) – A metric to evaluate the pipeline with. Defaults to ‘accuracy’.
dataset

A dataset object to work with.

Type:Dataset
metalearning_metric

A metric in the metalearning database to use when retrieving similar datasets.

evaluation_metric

A metric to evaluate the pipeline with.

Type:string or callable
bayesian_optimize(pipeline=None, optimize_on='quality', cutoff_time=600, iteration=15, scoring=None, cv=5)

Optimize a pipeline using Bayesian optimization.

Parameters:
  • pipeline (Pipeline) – A pipeline to use. If None, the last automatically generated pipeline is used, if one exists.
  • optimize_on (string) – What to optimize for, either “quality” or “runtime”. The pipeline can be optimized based on quality (performance metric) or runtime (time required to complete the Bayesian optimization process). Defaults to “quality”.
  • cutoff_time (int) – Only used when optimize_on is set to “runtime”. The maximum time limit, in seconds, after which the Bayesian optimization process stops and the best result so far is returned. Defaults to 600.
  • iteration (int) – The number of iterations the Bayesian optimization process runs. More iterations tend to give better results but increase the execution time. Defaults to 15.
  • scoring (string or callable) – The performance metric on which the pipeline is optimized. Examples: ‘auc_roc’, ‘accuracy’, ‘precision’, ‘recall’, ‘f1’. Defaults to None.
  • cv (int) – The number of folds in a (Stratified)KFold. Defaults to 5.
Raises:

ValueError – If no pipeline has been specified or found.

Returns:

The object used for optimization.

Return type:

BayesianOptimizationPipeline
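The interplay between iteration, optimize_on and cutoff_time described above can be sketched as a search loop with two stopping rules. This is only an illustration under stated assumptions: plain random sampling stands in for the Bayesian proposal step, and budgeted_search, toy_objective and toy_sample are hypothetical names that are not part of the library.

```python
import random
import time


def budgeted_search(objective, sample, iteration=15, optimize_on="quality",
                    cutoff_time=600, seed=0):
    """Keep the best of up to `iteration` candidates.

    Random sampling stands in for the Bayesian proposal step (assumption).
    The time limit is only enforced when optimize_on == "runtime".
    """
    if optimize_on not in ("quality", "runtime"):
        raise ValueError("optimize_on must be 'quality' or 'runtime'")
    rng = random.Random(seed)
    deadline = None
    if optimize_on == "runtime":
        deadline = time.monotonic() + cutoff_time
    best_params, best_score = None, float("-inf")
    for _ in range(iteration):
        if deadline is not None and time.monotonic() >= deadline:
            break  # time budget exhausted: return the best result so far
        params = sample(rng)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Toy objective whose maximum (0.0) is at x = 3.
    return -(params["x"] - 3.0) ** 2


def toy_sample(rng):
    # Draw one candidate from a one-dimensional search space.
    return {"x": rng.uniform(0.0, 6.0)}


best_params, best_score = budgeted_search(toy_objective, toy_sample, iteration=15)
# With a zero-second budget the loop stops before evaluating anything.
timed_out = budgeted_search(toy_objective, toy_sample,
                            optimize_on="runtime", cutoff_time=0)
```

In this sketch the iteration count always bounds the loop, and the deadline is an additional stop that only applies in “runtime” mode. Whether the library combines the two limits in the same way is not stated in this reference.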

compute_similar_datasets(k=5, similarity_metric='minkowski')

Compute the similar datasets based on the dataset’s metafeatures.

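Parameters:
  • k (int) – The number of similar datasets to retrieve. Defaults to 5.
  • similarity_metric (string) – The distance metric used to compare the datasets’ metafeatures. Defaults to ‘minkowski’.
Returns:A list of neighbors ordered by similarity, and a list of the metrics for each of those neighbors.
Return type:list, list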
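As a rough illustration of what a metafeature-based neighbor search involves, the sketch below ranks known datasets by Minkowski distance to a target metafeature vector. It is not the library's implementation: the function names, the order p=2 and the metafeature values are all assumptions made for the example.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def k_nearest_datasets(target, candidates, k=5, p=2):
    """Return (names, distances) of the k candidates closest to `target`.

    `candidates` maps a dataset name to its metafeature vector.
    """
    scored = sorted(
        (minkowski_distance(target, vector, p), name)
        for name, vector in candidates.items()
    )
    top = scored[:k]
    return [name for _, name in top], [dist for dist, _ in top]


# Illustrative metafeature vectors (made-up values, not real measurements).
known = {
    "dataset_a": [150.0, 4.0, 3.0],
    "dataset_b": [178.0, 13.0, 3.0],
    "dataset_c": [1797.0, 64.0, 10.0],
}
neighbors, distances = k_nearest_datasets([160.0, 5.0, 3.0], known, k=2)
print(neighbors)
```

The two returned lists are aligned, so distances[i] is the distance to neighbors[i], mirroring the pair of lists described above.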
generate_pipeline(ignore_similar_datasets=False)

Automatically generate a pipeline using the dataset and metric.

If the similar datasets have already been computed and the default value of ignore_similar_datasets is kept, then the search space of classifiers is reduced to the suggestions for the similar datasets.

Parameters:ignore_similar_datasets (bool) – Whether to ignore the suggested models for the similar datasets or not. Defaults to False.
Returns:The PipelineDiscovery object used to discover the pipeline.
Return type:PipelineDiscovery
metafeatures

Return the metafeatures for the dataset attribute of the class.

Returns:The computed metafeatures in the form of an array, sorted alphabetically by name.
Return type:numpy.array
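Because the values are ordered by metafeature name, each position in the array always refers to the same metafeature. A minimal sketch of that ordering, using a plain list instead of a numpy array and hypothetical metafeature names and values:

```python
def metafeatures_as_vector(named_values):
    """Return (names, values) with the values ordered alphabetically by name."""
    names = sorted(named_values)
    return names, [named_values[name] for name in names]


# Hypothetical metafeatures, for illustration only.
names, values = metafeatures_as_vector(
    {"n_features": 4, "class_entropy": 1.58, "n_instances": 150}
)
print(names)
print(values)
```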
reduced_search_space

Retrieve the reduced search space based on the similar datasets.

The similar datasets must have been computed already. AutoMLError is raised if no neighbors have been computed yet.

Returns:List of MLSuggestions.
Return type:list
similar_datasets

Retrieve the similar datasets, without recomputing.

AutoMLError is raised if no neighbors have been computed yet.

Returns:A 2-tuple where the first element is the set of similar datasets and the second is the metric (distance) between the current dataset and each similar one.
Return type:tuple

automl.discovery.pipeline_generation module

Module defining the functionality to generate a pipeline using TPOT genetic programming (GP).

class automl.discovery.pipeline_generation.PipelineDiscovery(dataset=None, search_space='scikit-learn', tpot_params=None, evaluation_metric='accuracy')

Bases: object

Discover a pipeline for a given dataset, using a given metric.

Parameters:
  • dataset (Dataset) – The dataset to work with. Defaults to None.
  • search_space (str or dict) – The search space to use for the discovery operation. If a string, it must be one of the following: ‘scikit-learn’. If a dict, it must comply with the TPOT config_dict format. Defaults to ‘scikit-learn’.
  • tpot_params (dict) – The extra parameters to pass to the TPOT object (either a TPOTClassifier or a TPOTRegressor). Defaults to None.
  • evaluation_metric (string or callable) – A metric to evaluate the pipeline with. Defaults to ‘accuracy’.
Raises:

TypeError – If any of the arguments does not satisfy the type constraints described above.
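For reference, a dictionary in the TPOT config_dict format maps an estimator's import path to the hyperparameter values TPOT may try. The sketch below shows such a dictionary together with a type check in the spirit of the constraint above. check_search_space is a hypothetical helper, not the library's own validation code.

```python
# Keys are import paths; values map hyperparameter names to candidate values.
custom_search_space = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.neighbors.KNeighborsClassifier": {
        "n_neighbors": range(1, 11),
        "weights": ["uniform", "distance"],
        "p": [1, 2],
    },
}


def check_search_space(search_space):
    """Accept a str or a dict, otherwise raise TypeError (hypothetical helper)."""
    if not isinstance(search_space, (str, dict)):
        raise TypeError(
            "search_space must be a str or a dict, got %s"
            % type(search_space).__name__
        )
    return search_space


check_search_space("scikit-learn")
check_search_space(custom_search_space)

try:
    check_search_space(42)
    rejected = False
except TypeError:
    rejected = True
```

A dictionary like custom_search_space would be passed as the search_space argument in place of the ‘scikit-learn’ string.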

dataset

The dataset to work with.

Type:Dataset
search_space

The search space to use for the discovery operation. If a string, it must be one of the following: ‘scikit-learn’. If a dict, it must comply with the TPOT config_dict format.

Type:str or dict
tpot_params

The extra parameters to pass to the TPOT object (either a TPOTClassifier or a TPOTRegressor).

Type:dict
discover(limit_time=None, random_state=42)

Perform the discovery of a pipeline.

Parameters:
  • limit_time (int) – The maximum time, in minutes, to wait for the generation of the pipeline. If None, no time limit is applied. Defaults to None.
  • random_state (int) – The number to seed the random state with. Defaults to 42.
Returns:

The resulting pipeline.

Return type:

sklearn.pipeline.Pipeline

save_pipeline(target_dir=None, file_name=None)

Save the discovered pipeline into a file, as Python code.

Parameters:
  • target_dir (string) – If not None, it is used as the parent directory for the resulting file. Defaults to None.
  • file_name (string) – The name to use for the resulting file. Defaults to None.
score(x_val, y_val)

Score a validation set against the discovered pipeline.

Parameters:
  • x_val (numpy.array) – The validation set for the features.
  • y_val (numpy.array) – The validation set for the target.
Returns:

The score given by the pipeline for the passed set.

Return type:

float
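The three methods above are typically used in sequence: discover, score, then save. The sketch below shows that order using only the documented signatures. StubDiscovery is a stand-in so the example runs without the library, and discover_score_save, the directory name and the file name are hypothetical.

```python
class StubDiscovery:
    """Stand-in exposing the three documented methods.

    This is NOT the real PipelineDiscovery; it only lets the sketch run.
    """

    def __init__(self):
        self.saved_to = None

    def discover(self, limit_time=None, random_state=42):
        return "pipeline"

    def score(self, x_val, y_val):
        # Placeholder score: fraction of targets equal to 1.
        return sum(1 for y in y_val if y == 1) / len(y_val)

    def save_pipeline(self, target_dir=None, file_name=None):
        self.saved_to = (target_dir, file_name)


def discover_score_save(discovery, x_val, y_val, limit_time=10,
                        target_dir="exports", file_name="best_pipeline.py"):
    """Run discovery, score a validation set, then export the pipeline."""
    pipeline = discovery.discover(limit_time=limit_time, random_state=42)
    validation_score = discovery.score(x_val, y_val)
    discovery.save_pipeline(target_dir=target_dir, file_name=file_name)
    return pipeline, validation_score


stub = StubDiscovery()
pipeline, validation_score = discover_score_save(
    stub, [[0], [1], [2], [3]], [1, 0, 1, 1]
)
```

With the real class, the object passed in would be a PipelineDiscovery built from a Dataset, and limit_time would be expressed in minutes as documented for discover.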

pipeline

Return the resulting pipeline from the discovery process.

Returns:The discovered pipeline if any.
Return type:sklearn.pipeline.Pipeline
tpot_object

Return the TPOT object used in the discovery process.

Returns:The TPOTBase instance (a TPOTClassifier or a TPOTRegressor).
Return type:TPOTBase

Module contents

Provide the classes to perform automated discovery of pipelines.

The discovery module can discover pipelines from scratch or from a restricted search space (a set of MLSuggestion objects) obtained by querying the datasets similar to a given one.
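The restricted-search-space path can be sketched as the following call order on an Assistant: compute the similar datasets, read the reduced search space, then generate the pipeline. StubAssistant is a stand-in so the example runs without the library (the real class raises AutoMLError where the stub raises RuntimeError), and discover_with_metalearning is a hypothetical helper.

```python
class StubAssistant:
    """Stand-in mimicking the documented Assistant members.

    This is NOT the real class; it only lets the sketch run.
    """

    def __init__(self):
        self._neighbors = None

    def compute_similar_datasets(self, k=5, similarity_metric="minkowski"):
        self._neighbors = ["dataset_%d" % i for i in range(k)]
        return self._neighbors, [float(i) for i in range(k)]

    @property
    def reduced_search_space(self):
        if self._neighbors is None:
            # The real class is documented to raise AutoMLError here.
            raise RuntimeError("no neighbors have been computed yet")
        return ["suggestion for " + name for name in self._neighbors]

    def generate_pipeline(self, ignore_similar_datasets=False):
        if self._neighbors is not None and not ignore_similar_datasets:
            return "discovery(reduced search space)"
        return "discovery(full search space)"


def discover_with_metalearning(assistant, k=5):
    """Compute neighbors first so that generation uses the reduced space."""
    assistant.compute_similar_datasets(k=k)
    suggestions = assistant.reduced_search_space
    discovery = assistant.generate_pipeline(ignore_similar_datasets=False)
    return suggestions, discovery


suggestions, discovery = discover_with_metalearning(StubAssistant(), k=3)
# Without computing neighbors first, generation falls back to the full space.
from_scratch = StubAssistant().generate_pipeline()
```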