trnka t1_j3t1t18 wrote

If you're doing the preprocessing and feature selection manually (meaning without the use of a library), yeah that's a pain.

If you're using sklearn, generally if you do all your preprocessing and feature selection with their classes in a sklearn pipeline you should be good. For example, if your input data is a pandas dataframe you can use a ColumnTransformer to tell it which columns to preprocess in which ways, such as a OneHotEncoder on categorical columns. Then you can follow it up with feature selection before your model.
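As a rough sketch of what that can look like (the column names, toy data, and the choice of SelectKBest and LogisticRegression are all made up for illustration, not something from the original question):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: names and values are invented for this example.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green", "red", "blue"],
    "size_cm": [1.0, 2.5, 3.1, 0.9, 2.7, 3.3, 1.2, 2.4],
    "weight_g": [10.0, 20.0, 31.0, 11.0, 22.0, 29.0, 9.0, 21.0],
})
y = [0, 1, 1, 0, 1, 0, 1, 1]

# Tell sklearn which columns get which preprocessing.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("num", StandardScaler(), ["size_cm", "weight_g"]),
])

# Preprocessing, then feature selection, then the model, all in one object.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=3)),
    ("model", LogisticRegression()),
])

pipe.fit(df, y)
predictions = pipe.predict(df)
```

Because everything lives in one pipeline object, you can also hand `pipe` to something like `cross_val_score` and each fold gets its own freshly fitted preprocessing and feature selection.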

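A sklearn pipeline fits every preprocessing and feature selection step only on the data you pass to fit, then reuses those fitted steps at predict time. So as long as you fit on the training split (or let something like cross_val_score or GridSearchCV do the splitting for you), nothing gets learned from the test data.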
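One way to convince yourself of this is to put a category in the test set that never shows up in training and check what the encoder learned. This is only a minimal sketch with invented data and column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test split; "purple" only exists in the test data.
train = pd.DataFrame({"color": ["red", "blue", "red", "blue"]})
y_train = [0, 1, 0, 1]
test = pd.DataFrame({"color": ["red", "purple"]})

pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        # handle_unknown="ignore" encodes unseen categories as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ])),
    ("model", LogisticRegression()),
])

# Only the training data is used to fit the encoder and the model.
pipe.fit(train, y_train)

# Inspect what the fitted encoder learned during fit.
encoder = pipe.named_steps["preprocess"].named_transformers_["cat"]
learned = list(encoder.categories_[0])

# Predicting applies the already-fitted steps; nothing is refit here.
test_predictions = pipe.predict(test)
```

Here `learned` only contains the colors from `train`, so the test-only category never influenced the preprocessing.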
