Guru
-
class modelgym.guru.Guru(print_hints=True, sample_size=None, category_qoute=0.2, sparse_qoute=0.8, class_disbalance_qoute=0.5, pvalue_boundary=0.05)
This class analyzes data to find potential issues.
Parameters:
- sample_size (int) – number of objects to use for the category and sparsity diagnostics. If None, the whole dataset is used.
- category_qoute (0 < float < 1) – maximum number of distinct values a feature may have in the sample to be considered categorical
- sparse_qoute (0 < float < 1) – portion of zeros in the sample required for a feature to be considered sparse
- class_disbalance_qoute (0 < float < 1) – how far a class's portion must be from the mean for the class to be considered imbalanced
-
check_categorial(X)
Find categorical features in X.
Parameters: X (array-like with shape (n_objects, n_features)) – features from your dataset
Returns: dict of the form { 'not numeric': list of feature indexes, 'not variable': list of feature indexes }
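The categorical check described above can be sketched in plain Python. This is not modelgym's implementation: the function name `find_categorical` is hypothetical, and the sketch assumes that `category_qoute` is interpreted as a fraction of the sample size, which the documentation does not confirm.

```python
def find_categorical(rows, category_qoute=0.2):
    """Return a dict shaped like the documented check_categorial result.

    Assumption (not confirmed by the docs): a numeric feature is reported
    as 'not variable' when its number of distinct values is below
    category_qoute * number_of_rows.
    """
    n_objects = len(rows)
    n_features = len(rows[0])
    result = {"not numeric": [], "not variable": []}
    for j in range(n_features):
        column = [row[j] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in column
        )
        if not is_numeric:
            result["not numeric"].append(j)
        elif len(set(column)) < category_qoute * n_objects:
            result["not variable"].append(j)
    return result


# 20 objects: a continuous feature, a string feature, a 3-valued feature.
rows = [[i * 0.5, "red" if i % 2 else "blue", i % 3] for i in range(20)]
categorical = find_categorical(rows)
print(categorical)
```

With 20 rows and the default quote, the threshold is 4 distinct values, so only the string column and the three-valued column are reported.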
-
check_class_disbalance(y)
Find imbalanced classes in y. Use this method only if you are solving a classification task.
Parameters: y (array-like with shape (n_objects,)) – target classes in your dataset
Returns: dict of the form { 'too common': list of classes, 'too rare': list of classes }
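One way to read "distant from the mean" is as a relative deviation from the mean class portion, 1 / n_classes. The sketch below makes that assumption explicit. The helper `find_disbalanced` is hypothetical and may not match the rule modelgym actually applies.

```python
from collections import Counter


def find_disbalanced(y, class_disbalance_qoute=0.5):
    """Return a dict shaped like the documented check_class_disbalance result.

    Assumption: a class is 'too common' if its portion exceeds
    mean * (1 + quote) and 'too rare' if it is below mean * (1 - quote),
    where mean = 1 / number_of_classes.
    """
    counts = Counter(y)
    n_objects = len(y)
    mean_portion = 1.0 / len(counts)
    result = {"too common": [], "too rare": []}
    for cls in sorted(counts):
        portion = counts[cls] / n_objects
        if portion > mean_portion * (1 + class_disbalance_qoute):
            result["too common"].append(cls)
        elif portion < mean_portion * (1 - class_disbalance_qoute):
            result["too rare"].append(cls)
    return result


# Three classes with portions 0.70, 0.25 and 0.05.
y = [0] * 70 + [1] * 25 + [2] * 5
disbalance = find_disbalanced(y)
print(disbalance)
```

Here the mean portion is 1/3, so the accepted band is roughly 0.17 to 0.50: class 0 falls above it and class 2 below it.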
-
check_correlation(X, feature_indexes=None)
Find correlated features among the features of X with the specified indexes.
Parameters:
- X (array-like with shape (n_objects, n_features)) – features from your dataset
- feature_indexes – list of features to check for correlation. If None, all features are checked.
Returns: list of pairs of features that appear to be correlated
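The shape of the result (a list of index pairs) can be illustrated with a small stdlib sketch. It uses a plain threshold on the absolute Pearson coefficient, which is a substitute technique: the `pvalue_boundary` constructor parameter suggests the library relies on a significance test instead. The names `pearson`, `find_correlated` and `threshold` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_a * var_b)


def find_correlated(rows, feature_indexes=None, threshold=0.9):
    """Return pairs of feature indexes whose |r| reaches the threshold.

    Assumption: a fixed |r| threshold stands in for the library's test.
    """
    if feature_indexes is None:
        feature_indexes = range(len(rows[0]))
    columns = {j: [row[j] for row in rows] for j in feature_indexes}
    return [
        (i, j)
        for i, j in combinations(columns, 2)
        if abs(pearson(columns[i], columns[j])) >= threshold
    ]


# Feature 1 is a linear function of feature 0; feature 2 is unrelated.
rows = [[i, 2 * i + 1, (i * 7) % 5] for i in range(10)]
pairs = find_correlated(rows)
print(pairs)
```

Passing `feature_indexes=[0, 2]` restricts the search to those two columns, mirroring the documented parameter.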
-
check_everything(data)
Full data check. Find categorical features, sparse features, correlated features, and imbalanced classes.
Parameters: data (XYCDataset-like) – your dataset
Returns: (categorials, sparse, disbalanced, correlated)
- categorials: indexes of features that appear to be categorical
- sparse: indexes of features that appear to be sparse
- disbalanced: imbalanced classes
- correlated: indexes of features that appear to be correlated
For more details, see the methods:
- check_categorial
- check_sparse
- check_class_disbalance
- check_correlation
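A caller will typically unpack the returned 4-tuple and act on it. The sketch below collects every feature index mentioned in a report so the features can be reviewed together. The helper `flagged_features` is hypothetical, and the exact shape of each tuple element is an assumption taken from the individual method descriptions: a dict of index lists or a plain list for categorials, a list for sparse, and a list of pairs for correlated.

```python
def flagged_features(report):
    """Collect all feature indexes mentioned in a check_everything-style tuple."""
    categorials, sparse, disbalanced, correlated = report
    flagged = set()
    # Assumption: categorials is either the dict documented for
    # check_categorial or a plain list of indexes.
    if isinstance(categorials, dict):
        for indexes in categorials.values():
            flagged.update(indexes)
    else:
        flagged.update(categorials)
    flagged.update(sparse)
    # Assumption: correlated is a list of (index, index) pairs.
    for first, second in correlated:
        flagged.update((first, second))
    # disbalanced describes target classes, not features, so it is skipped.
    return sorted(flagged)


# A hand-written report used only to illustrate the tuple layout.
report = (
    {"not numeric": [1], "not variable": [2]},
    [4],
    {"too common": [0], "too rare": []},
    [(0, 3)],
)
flagged = flagged_features(report)
print(flagged)
```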
-
check_sparse(X)
Find sparse features in X.
Parameters: X (array-like with shape (n_objects, n_features)) – features from your dataset
Returns: list of features that appear to be sparse
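The sparsity rule described by `sparse_qoute` amounts to comparing the share of zeros in each column with a threshold. A minimal sketch follows, with a hypothetical helper name and an assumed "greater than or equal" comparison.

```python
def find_sparse(rows, sparse_qoute=0.8):
    """Return indexes of features whose share of zeros reaches sparse_qoute.

    Assumption: the comparison is inclusive (>=); the docs do not say.
    """
    n_objects = len(rows)
    sparse = []
    for j in range(len(rows[0])):
        zeros = sum(1 for row in rows if row[j] == 0)
        if zeros / n_objects >= sparse_qoute:
            sparse.append(j)
    return sparse


# Feature 0 has no zeros, feature 1 is 90% zeros, feature 2 is 50% zeros.
rows = [[i + 1, 0 if i % 10 else 5, i % 2] for i in range(20)]
sparse = find_sparse(rows)
print(sparse)
```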
-
draw_2dhist(X, feature_indexes=None, figsize=(6, 4), **hist_kwargs)
Draw a 2D histogram for each pair of features with the specified indexes.
Parameters:
- X (array-like with shape (n_objects, n_features)) – features from your dataset
- feature_indexes (list of int or str) – features to check for correlation. If None, all features are checked. If it is a list of str, X should be an np.ndarray whose dtype contains named fields.
- figsize (tuple of int) – size of the figure containing the 2D histogram
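To make "a 2D histogram for each pair of features" concrete, the sketch below computes the bin counts that such a plot would display, without drawing anything. It assumes unordered pairs of distinct features and an equal-width grid. Both helper names and the `bins` parameter are hypothetical and unrelated to the library's `hist_kwargs`.

```python
from itertools import combinations


def hist2d_counts(xs, ys, bins=3):
    """Count points in a bins x bins equal-width grid over the data range."""
    def bin_index(value, low, high):
        if high == low:
            return 0
        # Clamp so that the maximum value lands in the last bin.
        return min(int((value - low) / (high - low) * bins), bins - 1)

    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        grid[bin_index(x, x_low, x_high)][bin_index(y, y_low, y_high)] += 1
    return grid


def pairwise_hists(rows, feature_indexes=None, bins=3):
    """Return {(i, j): grid} for every unordered pair of selected features."""
    if feature_indexes is None:
        feature_indexes = range(len(rows[0]))
    columns = {j: [row[j] for row in rows] for j in feature_indexes}
    return {
        (i, j): hist2d_counts(columns[i], columns[j], bins)
        for i, j in combinations(columns, 2)
    }


# Feature 1 equals feature 0; feature 2 decreases as feature 0 grows.
rows = [[i, i, 8 - i] for i in range(9)]
hists = pairwise_hists(rows)
for pair, grid in hists.items():
    print(pair, grid)
```

Counts concentrated on one diagonal of a grid, as for both pairs involving feature 0 here, are the visual signature of strongly related features.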
-
draw_correlation_heatmap(X, feature_indexes=None, figsize=(15, 10), **heatmap_kwargs)
Draw a correlation heatmap for the features of X with the specified indexes.
Parameters:
- X (array-like with shape (n_objects, n_features)) – features from your dataset
- feature_indexes (list of int or str) – features to check for correlation. If None, all features are checked. If it is a list of str, X should be an np.ndarray whose dtype contains named fields.
- figsize (tuple of int) – size of the figure containing the heatmap