ed3203

ed3203 t1_j4slyft wrote

Yes, you may arrive at a different local minima which could be more performant. You give the model more freedom to explore. OP gave no context, if it's a huge transformer model for instance that would be impractical to retain then sure use the model as is with a different final classification layer.

5