Guru

class modelgym.guru.Guru(print_hints=True, sample_size=None, category_qoute=0.2, sparse_qoute=0.8, class_disbalance_qoute=0.5, pvalue_boundary=0.05)

This class analyzes a dataset and looks for potential issues: categorical features, sparse features, correlated features and disbalanced classes.

Parameters:

sample_size (int) – number of objects to use for the category and sparsity diagnostics. If None, the whole dataset is used.
category_qoute (0 < float < 1) – maximum number of distinct feature values in the sample for the feature to be considered categorical
sparse_qoute (0 < float < 1) – portion of zeros in the sample required for the feature to be considered sparse
class_disbalance_qoute (0 < float < 1) – how far a class portion must be from the mean for the class to be considered disbalanced

check_categorial(X)

Find categorical features in X.

Parameters:	X (array-like with shape (n_objects, n_features)) – features from your dataset
Returns:	dict of shape: { 'not numeric': list of feature indexes, 'not variable': list of feature indexes }
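The return format above can be illustrated with a small stand-alone sketch. It is not modelgym's implementation: the helper name sketch_check_categorial is hypothetical, and it assumes that 'not numeric' means the column holds non-number values and that 'not variable' means the share of distinct values in the column is below category_qoute.

```python
import numpy as np


def sketch_check_categorial(X, category_qoute=0.2):
    """Hypothetical stand-in for Guru.check_categorial (assumed heuristic)."""
    X = np.asarray(X, dtype=object)
    n_objects, n_features = X.shape
    report = {'not numeric': [], 'not variable': []}
    for j in range(n_features):
        column = X[:, j].tolist()
        if not all(isinstance(v, (int, float, np.number)) for v in column):
            # At least one value is not a number (e.g. a string label).
            report['not numeric'].append(j)
        elif len(set(column)) / n_objects < category_qoute:
            # Assumption: few distinct values relative to the sample size.
            report['not variable'].append(j)
    return report


# 20 objects: a continuous feature, a string feature, a binary feature.
rows = [[i * 0.5, 'ab'[i % 2], i % 2] for i in range(20)]
report = sketch_check_categorial(rows)
print(report)  # {'not numeric': [1], 'not variable': [2]}
```

The two keys are kept separate because the two kinds of feature usually need different handling: non-numeric columns must be encoded before most models can use them, while low-cardinality numeric columns may only need to be flagged as categorical.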

check_class_disbalance(y)

Find disbalanced classes in y. Use this method only if you are solving a classification task.

Parameters:	y (array-like with shape (n_objects,)) – target classes in your dataset
Returns:	dict of shape: { 'too common': list of classes, 'too rare': list of classes }
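One plausible reading of class_disbalance_qoute is a relative distance from the mean class portion. The sketch below illustrates that reading only; the helper name sketch_check_class_disbalance and the exact thresholds (mean portion multiplied by 1 ± class_disbalance_qoute) are assumptions, not the library's documented behaviour.

```python
import numpy as np


def sketch_check_class_disbalance(y, class_disbalance_qoute=0.5):
    """Hypothetical stand-in for Guru.check_class_disbalance (assumed rule)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    portions = counts / counts.sum()
    mean_portion = portions.mean()  # equals 1 / number of classes
    upper = mean_portion * (1 + class_disbalance_qoute)
    lower = mean_portion * (1 - class_disbalance_qoute)
    labels = classes.tolist()
    return {
        'too common': [c for c, p in zip(labels, portions) if p > upper],
        'too rare': [c for c, p in zip(labels, portions) if p < lower],
    }


# Three classes with portions 0.80, 0.15 and 0.05; the mean portion is 1/3.
y = [0] * 80 + [1] * 15 + [2] * 5
disbalance = sketch_check_class_disbalance(y)
print(disbalance)  # {'too common': [0], 'too rare': [1, 2]}
```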

check_correlation(X, feature_indexes=None)

Find correlated features among the features with the specified indexes from X.

Parameters:

X (array-like with shape (n_objects, n_features)) – features from your dataset
feature_indexes – list of features which should be checked for correlation. If None, all features will be checked.
Returns:	list of pairs of features which are supposed to be correlated
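To show what a list of correlated feature pairs looks like, here is a minimal NumPy-only sketch. It is a simplification and not the library's method: the constructor's pvalue_boundary suggests that Guru relies on a significance test, whereas this sketch thresholds the absolute Pearson coefficient. The helper name sketch_check_correlation and the min_abs_corr parameter are hypothetical.

```python
import numpy as np


def sketch_check_correlation(X, feature_indexes=None, min_abs_corr=0.9):
    """Hypothetical stand-in: pairs of features with |Pearson r| >= min_abs_corr."""
    X = np.asarray(X, dtype=float)
    if feature_indexes is None:
        feature_indexes = list(range(X.shape[1]))
    # rowvar=False: each column is a variable, each row an observation.
    corr = np.corrcoef(X[:, feature_indexes], rowvar=False)
    pairs = []
    for a in range(len(feature_indexes)):
        for b in range(a + 1, len(feature_indexes)):
            if abs(corr[a, b]) >= min_abs_corr:
                pairs.append((feature_indexes[a], feature_indexes[b]))
    return pairs


x0 = np.arange(50, dtype=float)
x1 = 2 * x0 + 1                                    # linear function of x0
x2 = np.where(np.arange(50) % 2 == 0, 1.0, -1.0)   # alternating sign
X = np.column_stack([x0, x1, x2])

pairs = sketch_check_correlation(X)
print(pairs)  # [(0, 1)]
```

Passing feature_indexes=[0, 2] restricts the check to those two columns, and in this example no pair is reported.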

check_everything(data)

Full data check. Find categorical features, sparse features, correlated features and disbalanced classes.

Parameters:	data (XYCDataset-like) – your dataset
Returns:	(categorials, sparse, disbalanced, correlated)

categorials: indexes of features which are supposed to be categorical
sparse: indexes of features which are supposed to be sparse
disbalanced: disbalanced classes
correlated: indexes of features which are supposed to be correlated

For more details see methods:

check_categorial

check_sparse

check_class_disbalance

check_correlation

check_sparse(X)

Find sparse features in X.

Parameters:	X (array-like with shape (n_objects, n_features)) – features from your dataset
Returns:	list of features which are supposed to be sparse
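The sparsity criterion described for sparse_qoute (the portion of zeros in the sample) can be sketched in a few lines of NumPy. The helper name sketch_check_sparse is hypothetical, and the use of >= rather than > at the boundary is an assumption.

```python
import numpy as np


def sketch_check_sparse(X, sparse_qoute=0.8):
    """Hypothetical stand-in: indexes of columns whose zero portion >= sparse_qoute."""
    X = np.asarray(X)
    zero_portion = (X == 0).mean(axis=0)  # share of zeros per feature
    return np.flatnonzero(zero_portion >= sparse_qoute).tolist()


X = np.zeros((10, 3))
X[0, 0] = 5.0                # feature 0: 9 zeros out of 10
X[:, 1] = np.arange(1, 11)   # feature 1: no zeros
X[:5, 2] = 1.0               # feature 2: 5 zeros out of 10

sparse = sketch_check_sparse(X)
print(sparse)  # [0]
```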

draw_2dhist(X, feature_indexes=None, figsize=(6, 4), **hist_kwargs)

Draw a 2D histogram for each pair of features with the specified indexes.

Parameters:

X (array-like with shape (n_objects, n_features)) – features from your dataset
feature_indexes (list of int or str) – features which should be checked for correlation. If None, all features will be checked. If it is a list of str, X should be an np.ndarray whose dtype contains named fields.
figsize (tuple of int) – size of the figure with the 2D histogram

draw_correlation_heatmap(X, feature_indexes=None, figsize=(15, 10), **heatmap_kwargs)

Draw a correlation heatmap between the features with the specified indexes from X.

Parameters:

X (array-like with shape (n_objects, n_features)) – features from your dataset
feature_indexes (list of int or str) – features which should be checked for correlation. If None, all features will be checked. If it is a list of str, X should be an np.ndarray whose dtype contains named fields.
figsize (tuple of int) – size of the figure with the heatmap