Synthetic dataset generator

fedot.utilities.synth_dataset_generator.classification_dataset(samples_amount: int, features_amount: int, classes_amount: int, features_options: Dict, noise_fraction: float = 0.1, full_shuffle: bool = True, weights: Optional[list] = None)

Generates a random dataset for an n-class classification problem using the scikit-learn API.

Parameters
  • samples_amount – total number of samples in the resulting dataset

  • features_amount – total number of features per sample

  • classes_amount – the number of classes in the dataset

  • features_options

    a dictionary of feature options in key-value format

    supported features_options keys:
    • informative -> the number of informative features

    • redundant -> the number of redundant features

    • repeated -> the number of features that repeat the informative features

    • clusters_per_class -> the number of clusters for each class

  • noise_fraction – the fraction of noisy labels in the dataset

  • full_shuffle – if True, all features and samples are shuffled

  • weights – the proportions of samples assigned to each class; if None, the classes are balanced

Returns

features and target as NumPy arrays

Return type

array
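
The constraints between the `features_options` keys can be checked before a dataset is generated. The sketch below is not part of FEDOT: `check_features_options` is a hypothetical helper, and it assumes that all four keys are supplied and that they are forwarded to scikit-learn's `make_classification`, which rejects feature counts that exceed the total and class/cluster combinations larger than `2 ** informative`.

```python
from typing import Dict


def check_features_options(features_amount: int, classes_amount: int,
                           features_options: Dict[str, int]) -> None:
    """Hypothetical helper: raise ValueError for inconsistent options."""
    # Assumption: all four keys described above must be present.
    required = ('informative', 'redundant', 'repeated', 'clusters_per_class')
    missing = [key for key in required if key not in features_options]
    if missing:
        raise ValueError(f'missing features_options keys: {missing}')

    # The three feature groups cannot add up to more than the total.
    used = (features_options['informative']
            + features_options['redundant']
            + features_options['repeated'])
    if used > features_amount:
        raise ValueError('informative + redundant + repeated exceeds features_amount')

    # make_classification needs enough informative dimensions for all clusters.
    clusters = classes_amount * features_options['clusters_per_class']
    if clusters > 2 ** features_options['informative']:
        raise ValueError('classes_amount * clusters_per_class exceeds 2 ** informative')


options = {'informative': 4, 'redundant': 2, 'repeated': 1, 'clusters_per_class': 2}
check_features_options(features_amount=10, classes_amount=3, features_options=options)
```

A dictionary that passes this check could then be passed as `features_options` together with the same `features_amount` and `classes_amount`.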

fedot.utilities.synth_dataset_generator.regression_dataset(samples_amount: int, features_amount: int, features_options: Dict, n_targets: int, noise: float = 0.0, shuffle: bool = True)

Generates a random dataset for a regression problem using the scikit-learn API.

Parameters
  • samples_amount – total number of samples in the resulting dataset

  • features_amount – total number of features per sample

  • features_options

    a dictionary of feature options in key-value format

    supported features_options keys:
    • informative -> the number of informative features

    • bias -> the bias term in the underlying linear model

  • n_targets – the number of target variables

  • noise – the standard deviation of the Gaussian noise applied to the output

  • shuffle – if True, all features and samples are shuffled

Returns

features and target as NumPy arrays

Return type

array
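
To make the roles of `informative`, `bias` and `noise` concrete, the sketch below builds a single-target linear dataset with the standard library only. It is a simplified illustration, not FEDOT's or scikit-learn's implementation: `sketch_linear_targets` and its `seed` argument are hypothetical, and placing the informative coefficients first is an assumption made for readability.

```python
import random
from typing import List, Tuple


def sketch_linear_targets(samples_amount: int, features_amount: int, informative: int,
                          bias: float = 0.0, noise: float = 0.0,
                          seed: int = 0) -> Tuple[List[List[float]], List[float], List[float]]:
    """Hypothetical stand-in illustrating `informative`, `bias` and `noise`."""
    rng = random.Random(seed)
    # Features are drawn from a standard normal distribution.
    features = [[rng.gauss(0.0, 1.0) for _ in range(features_amount)]
                for _ in range(samples_amount)]
    # Only the first `informative` coefficients are non-zero (assumption:
    # a real generator may place them at other positions).
    coef = [100.0 * rng.random() if j < informative else 0.0
            for j in range(features_amount)]
    target = []
    for row in features:
        # Linear model: weighted sum of the features plus the bias term.
        value = sum(x * c for x, c in zip(row, coef)) + bias
        if noise > 0:
            value += rng.gauss(0.0, noise)  # Gaussian noise on the output
        target.append(value)
    return features, target, coef


features, target, coef = sketch_linear_targets(5, 4, informative=2, bias=1.5)
```

With `noise` left at 0.0 the targets are an exact linear function of the informative features, which makes such data convenient for sanity-checking a regression pipeline.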

fedot.utilities.synth_dataset_generator.gauss_quantiles_dataset(samples_amount: int, features_amount: int, classes_amount: int, full_shuffle=True, **kwargs)

Generates a random dataset for an n-class classification problem based on the quantiles of a multi-dimensional Gaussian distribution, using the scikit-learn API.

Parameters
  • samples_amount – total number of samples in the resulting dataset

  • features_amount – total number of features per sample

  • classes_amount – the number of classes in the dataset

  • full_shuffle – if True, all features and samples are shuffled

  • kwargs – optional keyword arguments; 'gauss_params' may supply the mean and covariance values of the distribution

Returns

features and target as NumPy arrays

Return type

array
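
The idea of labelling by Gaussian quantiles can be illustrated without any dependencies. The sketch below is a simplified stand-in, not the FEDOT or scikit-learn code: `sketch_gauss_quantiles` and `seed` are hypothetical names, and a zero mean with unit covariance is assumed. Points are ranked by squared distance from the mean and cut into equally sized groups, so each class forms a shell around the previous one.

```python
import random
from typing import List, Tuple


def sketch_gauss_quantiles(samples_amount: int, features_amount: int, classes_amount: int,
                           seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical stand-in: label points by quantiles of distance from the mean."""
    step = samples_amount // classes_amount
    if step == 0:
        raise ValueError('samples_amount must be at least classes_amount')
    rng = random.Random(seed)
    # Isotropic Gaussian centred at the origin (assumed mean 0, unit covariance).
    points = [[rng.gauss(0.0, 1.0) for _ in range(features_amount)]
              for _ in range(samples_amount)]
    # Sort by squared distance from the mean so that classes form nested shells.
    points.sort(key=lambda p: sum(x * x for x in p))
    # Each class gets `step` points; any remainder joins the outermost class.
    labels = [min(i // step, classes_amount - 1) for i in range(samples_amount)]
    # The result is ordered by distance; shuffling would be a separate step.
    return points, labels


points, labels = sketch_gauss_quantiles(samples_amount=90, features_amount=2, classes_amount=3)
```

Because the classes are concentric rather than linearly separable, data of this kind is a useful check for models that should capture non-linear decision boundaries.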

fedot.utilities.synth_dataset_generator.generate_synthetic_data(length: int = 2200, periods: int = 5)

Generates a synthetic one-dimensional array without gaps.

Parameters
  • length – the length of the array

  • periods – the number of periods in the sine wave

Returns

an array without gaps

Return type

array
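
How `length` and `periods` relate can be seen in a plain sine wave. The sketch below is only an approximation of the idea: `sketch_sine_series` is a hypothetical function, it returns a Python list rather than an array, and any noise or additional components the real generator may apply are not modelled.

```python
import math
from typing import List


def sketch_sine_series(length: int = 2200, periods: int = 5) -> List[float]:
    """Hypothetical stand-in: a gap-free sine wave with `periods` full cycles."""
    # Sample `periods` complete cycles at `length` evenly spaced points.
    return [math.sin(2.0 * math.pi * periods * i / length) for i in range(length)]


series = sketch_sine_series()
```

With the default arguments each cycle spans 440 points, so the series contains no missing values and stays within the range from -1 to 1.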