Add common preprocessing functions
`sklearn` and `pandas` have some preprocessing functions which we use a lot. However, these preprocessors aren't aware of the semantic type of each variable. We want to apply preprocessing only where it makes sense, and we can do this within `Dataset`, since each column is tagged with a type. By default, each preprocessing function should apply only to certain feature types, but we should allow for flexibility if people want to ignore the feature types.
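To make the idea concrete, here is a rough sketch of type-gated preprocessing with an opt-out. Everything in it is an assumption for illustration: `apply_to_types`, the `ignore_types` flag, and the plain dict of type tags are hypothetical stand-ins, not the actual `Dataset` API.

```python
import pandas as pd


def apply_to_types(df, column_types, func, allowed_types, ignore_types=False):
    """Apply func to every column whose tagged type is in allowed_types.

    column_types is a hypothetical {column name: type tag} dict standing in
    for the tags a Dataset would carry. With ignore_types=True the type
    check is skipped and func is applied to every column.
    """
    out = df.copy()
    for col in df.columns:
        if ignore_types or column_types.get(col) in allowed_types:
            out[col] = func(df[col])
    return out


df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "city": ["a", "b", "a"]})
column_types = {"age": "NUMERICAL", "city": "CATEGORICAL"}

# Only the NUMERICAL column is centered; the CATEGORICAL one is left alone.
centered = apply_to_types(df, column_types, lambda s: s - s.mean(), {"NUMERICAL"})
```

The default path respects the tags, while `ignore_types=True` gives people the escape hatch described above.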
Some must-haves to implement first (names can be changed):
- `make_one_hot` - converts CATEGORICAL features to binary indicators. If a categorical feature has `k` categories, it should be converted to `k` or `k-1` binary indicators. This should handle missing data gracefully. For features with a large `k`, we might want to include only the top 10 or 100 categories, which is useful for long-tailed features.
- `impute_{mean,median}` - replaces missing data in NUMERICAL features with the mean or median, and optionally adds an extra binary column to mark where values were imputed. There is no need to impute non-numerical features, as missing values there can just be treated as a separate category.
- `center_and_rescale` - rescales INTERVAL data to have mean 0 (centering) and variance 1 (rescaling). For RATIO data, only performs the rescaling step.
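
A minimal sketch of how the three must-haves could behave on a single pandas `Series`, as a starting point for discussion. The signatures, the `top_n`, `drop_first`, `add_indicator` and `feature_type` parameters, the `"OTHER"` bucket, and the `_was_missing` suffix are all assumptions, not a settled design. Missing categorical values are left as all-zero rows here, which is only one possible reading of "gracefully".

```python
import pandas as pd


def make_one_hot(series, top_n=None, drop_first=False):
    """One-hot encode a categorical Series (assumed object/string dtype).

    Missing values become all-zero rows. If top_n is given, categories
    outside the top_n most frequent are pooled into an "OTHER" bucket.
    drop_first=True yields k-1 indicators instead of k.
    """
    counts = series.value_counts()  # sorted by frequency, missing excluded
    if top_n is not None and len(counts) > top_n:
        keep = counts.index[:top_n]
        # Keep frequent categories and missing values, pool everything else.
        series = series.where(series.isin(keep) | series.isna(), "OTHER")
    dummies = pd.get_dummies(series, prefix=series.name, drop_first=drop_first)
    return dummies.astype(int)


def impute(series, strategy="mean", add_indicator=False):
    """Fill missing numerical values with the mean or median."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    fill = series.mean() if strategy == "mean" else series.median()
    out = series.fillna(fill).to_frame()
    if add_indicator:
        # Extra binary column marking which rows were imputed.
        out[f"{series.name}_was_missing"] = series.isna().astype(int)
    return out


def center_and_rescale(series, feature_type="INTERVAL"):
    """Scale to unit variance; subtract the mean only for INTERVAL data."""
    std = series.std(ddof=0)  # population std; a constant column is not handled
    if feature_type == "INTERVAL":
        series = series - series.mean()
    return series / std
```

Skipping the centering step for RATIO data keeps the values positive and leaves zero meaningful, which is presumably why only the scaling should apply there.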
Some other things that will be useful soon:
- `clip` - squish values to be in `[-1, 1]` or some other set range. Good for clipping outliers.
- `log` - convert RATIO features to log space.
- `random_polynomial` - randomly generate polynomial combinations of NUMERICAL features.
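
For the later items, a rough sketch under similar assumptions. The function names, defaults, the `n_terms`, `max_degree` and `seed` parameters, and the `*`-joined column naming are hypothetical. In particular, "randomly generate polynomial combinations" is interpreted here as random products of up to `max_degree` columns, which may not be what we end up wanting.

```python
import numpy as np
import pandas as pd


def clip(series, lower=-1.0, upper=1.0):
    """Squish values into [lower, upper]."""
    return series.clip(lower=lower, upper=upper)


def log_transform(series):
    """Move a RATIO feature to log space; assumes strictly positive values."""
    if (series <= 0).any():
        raise ValueError("log_transform expects strictly positive values")
    return np.log(series)


def random_polynomial(df, columns, n_terms=3, max_degree=2, seed=0):
    """Build up to n_terms random monomials from the given NUMERICAL columns.

    Each term multiplies between 1 and max_degree columns, drawn with
    replacement, so squares such as a*a can appear. Duplicate draws
    overwrite each other, so fewer than n_terms columns may come back.
    """
    rng = np.random.default_rng(seed)
    out = pd.DataFrame(index=df.index)
    for _ in range(n_terms):
        degree = int(rng.integers(1, max_degree + 1))
        chosen = sorted(str(c) for c in rng.choice(columns, size=degree))
        term = pd.Series(1.0, index=df.index)
        for col in chosen:
            term = term * df[col]
        out["*".join(chosen)] = term
    return out
```

Naming each generated column after its factors keeps the output traceable back to the original features.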