
lilpolymorph t1_j3qrnoh wrote

I don't understand this: I'm told to perform preprocessing and feature selection on my training set only, to prevent data leakage. But when I try to use my classifiers in Python, they want my train and validation sets to have the same dimensions. Of course they don't anymore if I only preprocess the training set. What do I have to do?


trnka t1_j3t1t18 wrote

If you're doing the preprocessing and feature selection manually (meaning without the use of a library), yeah that's a pain.

If you're using sklearn, you should generally be good if you do all your preprocessing and feature selection with their classes inside a sklearn Pipeline. For example, if your input data is a pandas DataFrame, you can use a ColumnTransformer to tell it which columns to preprocess in which ways, such as a OneHotEncoder on the categorical columns. Then you can follow it up with a feature selection step before your model.
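Here's a rough sketch of what that kind of pipeline could look like. The column names, the toy data, and the choice of StandardScaler, SelectKBest, and LogisticRegression are all made up for illustration, so swap in whatever fits your data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column, two numeric columns.
train = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green", "red", "blue"],
    "size": [1.0, 2.5, 3.1, 0.9, 2.7, 3.3, 1.2, 2.4],
    "weight": [10.0, 20.0, 31.0, 11.0, 19.0, 30.0, 9.0, 21.0],
})
y_train = [0, 1, 1, 0, 1, 1, 1, 0]

# The validation set contains a category ("purple") never seen in training.
val = pd.DataFrame({
    "color": ["blue", "purple"],
    "size": [2.6, 1.0],
    "weight": [20.5, 10.5],
})

# Tell sklearn which columns get which preprocessing.
preprocess = ColumnTransformer([
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("num", StandardScaler(), ["size", "weight"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=3)),  # keep the 3 best-scoring features
    ("clf", LogisticRegression()),
])

# Everything is fit on the training data only.
model.fit(train, y_train)

# The fitted steps are then applied to the raw validation data as-is.
val_pred = model.predict(val)
```

Note that you pass the raw, unprocessed validation frame to predict. The pipeline runs it through the same fitted encoder, scaler, and selector, so the classifier always sees the same number of columns it was trained on.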

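Sklearn's classes are implemented so that fit() learns the preprocessing and feature selection from the training data only, and transform() or predict() then applies those already-learned steps to whatever data you pass in. So you don't skip preprocessing on the validation set: you apply the transformations that were fit on the training set, which is why the dimensions end up matching without any leakage.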
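To make the fit vs. transform split concrete, here's a small sketch without a Pipeline. The random data and the choice of StandardScaler and SelectKBest are just assumptions for the example:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Hypothetical random data: 6 features, label depends on the first two.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 6))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_val = rng.normal(size=(10, 6))

# fit() sees the training data only: the means, variances, and the set of
# selected columns are all learned here.
scaler = StandardScaler().fit(X_train)
selector = SelectKBest(f_classif, k=2).fit(scaler.transform(X_train), y_train)

# transform() reuses what was learned, for both train and validation.
X_train_sel = selector.transform(scaler.transform(X_train))
X_val_sel = selector.transform(scaler.transform(X_val))

# Both sets now have the same number of columns.
print(X_train_sel.shape, X_val_sel.shape)
```

The validation set never influences the scaler statistics or which features get picked, but it still goes through the same steps, so the shapes line up.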
