Guru example¶

from modelgym import Guru
import numpy as np

Initialize Guru

guru = Guru()

Make toy dataset

n = 100
np.random.seed(0)
X = np.zeros((n, 6), dtype=object)

# make not numeric feature
X[:, 0] = 'not a number'

# make categorial feature
X[:, 1] = np.random.binomial(3, 0.6, size=n)

# make sparse feature
X[:, 2] = np.random.binomial(1, 0.05, size=n) * np.random.normal(size=n)

# make correlated features
X[:, 3] = np.random.normal(size=n)
X[:, 4] = X[:, 3] * 50 - 100

# make independent feature
X[:, 5] = np.random.normal(size=n)

# make disbalanced classes
y = np.random.binomial(3, 0.9, size=n)

Main features¶

Looking for categorical features

guru.check_categorial(X)

Some features are supposed to be categorial. Make sure that all categorial features are in cat_cols.
Following features are not numeric:  [0]
Following features are not variable:  [1]

defaultdict(list, {'not numeric': [0], 'not variable': [1]})

Looking for sparse features

guru.check_sparse(X)

Consider use hashing trick for your sparse features, if you haven't already. Following features are supposed to be sparse:  [2]

[2]

Looking for correlated features

guru.check_correlation(X, [3, 4, 5])

There are several correlated features. Consider dimention reduction, for example you can use PCA. Following pairs of features are supposed to be correlated:  [(3, 4)]

[(3, 4)]

Drawing correlation heatmap for features

guru.draw_correlation_heatmap(X, [3, 4, 5], figsize=(8, 6))

Drawing 2d histograms for features

guru.draw_2dhist(X, [3, 4, 5])

Looking for disbalanced classes

guru.check_class_disbalance(y)

There is class disbalance. Probably, you can solve it by data augmentation.
Following classes are too common:  [3]
Following classes are too rare:  [1, 0]

defaultdict(list, {'too common': [3], 'too rare': [1, 0]})

dtype with fields¶

You can also use array with dtype with fields

Let’s make another representation of the same data

named_X = np.zeros((n,), dtype=[('str', 'U25'),
                                ('categorial', 'int'),
                                ('sparse', float),
                                ('corr_1', float),
                                ('corr_2', float),
                                ('independent', float)])
for i, name in enumerate(named_X.dtype.names):
    named_X[name] = X[:, i]

Now we can draw heatmap like this

guru.draw_correlation_heatmap(named_X, ['corr_1', 'corr_2', 'independent'], figsize=(8, 6))