Guru

class modelgym.guru.Guru(print_hints=True, sample_size=None, category_qoute=0.2, sparse_qoute=0.8, class_disbalance_qoute=0.5, pvalue_boundary=0.05)

This class analyzes data and tries to find potential issues in it.

Parameters:
  • sample_size (int) – number of objects to use for the category and sparsity diagnostics. If None, the whole dataset is used.
  • category_qoute (0 < float < 1) – maximum share of distinct values a feature may have in the sample to be assumed categorical
  • sparse_qoute (0 < float < 1) – share of zeros in the sample required for a feature to be assumed sparse
  • class_disbalance_qoute (0 < float < 1) – how far a class's share must be from the mean class share for the class to be assumed imbalanced
check_categorial(X)

Find categorical features in X.

Parameters: X (array-like with shape (n_objects, n_features)) – features from your dataset
Returns: dict of the form:
{
    'not numeric': list of feature indexes,
    'not variable': list of feature indexes
}
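
To make the shape of this result concrete, here is a minimal pure-Python sketch of one way such a check could work. It is not modelgym's implementation: the helper name `find_categorical_candidates`, the parameter spelling `category_quote`, and the rule itself (a feature is "not numeric" if any value is not a number, and "not variable" if its share of distinct values is at most the quote) are assumptions made for illustration.

```python
from numbers import Number


def find_categorical_candidates(rows, category_quote=0.2):
    """Hypothetical helper, NOT part of modelgym.

    rows is a list of equal-length lists with shape (n_objects, n_features).
    """
    n_objects = len(rows)
    result = {"not numeric": [], "not variable": []}
    # zip(*rows) transposes the rows into columns (one tuple per feature).
    for index, column in enumerate(zip(*rows)):
        if not all(isinstance(value, Number) for value in column):
            # At least one value is not a number (for example a string).
            result["not numeric"].append(index)
        elif len(set(column)) / n_objects <= category_quote:
            # Numeric, but with few distinct values relative to sample size.
            result["not variable"].append(index)
    return result


colors = ["red", "blue"]
# Feature 0: strings; feature 1: only two distinct numbers; feature 2: all distinct.
rows = [[colors[i % 2], i % 2, float(i) + 0.5] for i in range(10)]
report = find_categorical_candidates(rows)
```

In this toy dataset, feature 0 is reported under 'not numeric' and feature 1 under 'not variable', while feature 2 is left out because every value is distinct.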
check_class_disbalance(y)

Find imbalanced classes in y. Use this method only if you are solving a classification task.

Parameters: y (array-like with shape (n_objects,)) – target classes in your dataset
Returns: dict of the form:
{
    'too common': list of classes,
    'too rare': list of classes
}
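
The exact rule behind "distant from the mean" is not spelled out here, so the following is only a sketch under a stated assumption: a class is flagged when its share differs from the mean class share (1 / number of classes) by more than the quote times that mean share. The helper name `find_disbalanced_classes` and the parameter spelling `class_disbalance_quote` are hypothetical and not part of modelgym.

```python
from collections import Counter


def find_disbalanced_classes(y, class_disbalance_quote=0.5):
    """Hypothetical helper, NOT part of modelgym."""
    counts = Counter(y)
    # With k classes, a perfectly balanced dataset gives each class share 1/k.
    mean_share = 1.0 / len(counts)
    result = {"too common": [], "too rare": []}
    for label, count in sorted(counts.items()):
        share = count / len(y)
        # Assumed rule: compare each share with a band around the mean share.
        if share > mean_share * (1 + class_disbalance_quote):
            result["too common"].append(label)
        elif share < mean_share * (1 - class_disbalance_quote):
            result["too rare"].append(label)
    return result


# Three classes with shares 0.70, 0.25 and 0.05.
y = [0] * 70 + [1] * 25 + [2] * 5
report = find_disbalanced_classes(y)
```

With three classes the mean share is 1/3, so with a quote of 0.5 the band is roughly 0.17 to 0.5: class 0 falls above it, class 2 below it, and class 1 inside it.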
check_correlation(X, feature_indexes=None)

Find correlated features among features with specified indexes from X

Parameters:
  • X (array-like with shape (n_objects, n_features)) – features from your dataset
  • feature_indexes – list of features which should be checked for correlation. If None, all features are checked.
Returns:

list of pairs of features which are supposed to be correlated
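
As an illustration of what a list of correlated pairs looks like, the sketch below computes the Pearson coefficient for every pair of selected columns and keeps the pairs whose absolute coefficient reaches a threshold. This is a swapped-in technique, not the library's method: the constructor's `pvalue_boundary` parameter suggests modelgym may rely on a significance test instead, and the names `pearson`, `find_correlated_pairs`, and `threshold` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column is treated as uncorrelated here
    return cov / sqrt(var_a * var_b)


def find_correlated_pairs(rows, feature_indexes=None, threshold=0.9):
    """Hypothetical helper, NOT part of modelgym."""
    columns = list(zip(*rows))  # transpose rows into feature columns
    if feature_indexes is None:
        feature_indexes = range(len(columns))
    return [
        (i, j)
        for i, j in combinations(feature_indexes, 2)
        if abs(pearson(columns[i], columns[j])) >= threshold
    ]


# Feature 1 is a linear function of feature 0; feature 2 is only weakly related.
rows = [[x, 2 * x + 1, (x * 7) % 5] for x in range(10)]
pairs = find_correlated_pairs(rows)
```

Here only features 0 and 1 are paired, since feature 1 is an exact linear function of feature 0. Restricting `feature_indexes` to a subset limits the comparison to pairs within that subset.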

check_everything(data)

Full data check. Finds categorical features, sparse features, correlated features and imbalanced classes.

Parameters: data (XYCDataset-like) – your dataset
Returns: (categorials, sparse, disbalanced, correlated)
  • categorials: indexes of features which are assumed to be categorical
  • sparse: indexes of features which are assumed to be sparse
  • disbalanced: imbalanced classes
  • correlated: indexes of features which are assumed to be correlated

For more details, see the methods:

  • check_categorial
  • check_sparse
  • check_class_disbalance
  • check_correlation
check_sparse(X)

Find sparse features in X

Parameters: X (array-like with shape (n_objects, n_features)) – features from your dataset
Returns: list of features which are assumed to be sparse
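
A sparsity check of this kind can be sketched in a few lines of pure Python. The sketch assumes that a feature counts as sparse when its share of zeros is at least the quote; the helper name `find_sparse_features` and the parameter spelling `sparse_quote` are hypothetical, and this is not modelgym's implementation.

```python
def find_sparse_features(rows, sparse_quote=0.8):
    """Hypothetical helper, NOT part of modelgym.

    rows is a list of equal-length lists with shape (n_objects, n_features).
    """
    n_objects = len(rows)
    sparse = []
    # zip(*rows) transposes the rows into columns (one tuple per feature).
    for index, column in enumerate(zip(*rows)):
        zero_share = sum(1 for value in column if value == 0) / n_objects
        if zero_share >= sparse_quote:
            sparse.append(index)
    return sparse


# Feature 0: 9 zeros out of 10; feature 1: no zeros; feature 2: 5 zeros out of 10.
rows = [[0 if i < 9 else 5, i + 1, 0 if i % 2 else 3] for i in range(10)]
sparse = find_sparse_features(rows)
```

Only feature 0 passes the default quote of 0.8; lowering the quote to 0.5 would also include feature 2.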
draw_2dhist(X, feature_indexes=None, figsize=(6, 4), **hist_kwargs)

Draw a 2D histogram for each pair of features with the specified indexes.

Parameters:
  • X (array-like with shape (n_objects, n_features)) – features from your dataset
  • feature_indexes (list of int or str) – features whose pairs should be plotted. If None, all features are used. If it is a list of str, X should be an np.ndarray whose dtype contains named fields (a structured array).
  • figsize (tuple of int) – size of the figure with the 2D histogram
draw_correlation_heatmap(X, feature_indexes=None, figsize=(15, 10), **heatmap_kwargs)

Draw a correlation heatmap for the features with the specified indexes from X.

Parameters:
  • X (array-like with shape (n_objects, n_features)) – features from your dataset
  • feature_indexes (list of int or str) – features which should be checked for correlation. If None, all features are checked. If it is a list of str, X should be an np.ndarray whose dtype contains named fields (a structured array).
  • figsize (tuple of int) – size of the figure with the heatmap