I'm not aware of the performance of detach, but why not set requires_grad to False to freeze some layers? It will tremendously speed up training and reduce memory usage.
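Something like this, for example (a toy model just to illustrate the idea, not OP's actual network):

```python
import torch
import torch.nn as nn

# Toy model: the first block stands in for the "early" layers to freeze.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Freeze the early layers: no gradients are computed or stored for them.
for p in model[0].parameters():
    p.requires_grad_(False)

# Only hand the still-trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-2
)
```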
Detach creates a new tensor. You need to set requires_grad to False. You won't see performance improvements as long as the earlier layers aren't frozen: you don't get to skip gradient computation, so if your first layers need gradients, you still have to calculate them all the way back.
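You can see that directly with a small made-up example: freeze only the last layer and the backward pass still has to run through it to reach the trainable earlier layers.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

# Freeze only the *last* layer.
for p in model[2].parameters():
    p.requires_grad_(False)

model(torch.randn(4, 8)).sum().backward()

print(model[0].weight.grad is not None)  # True: earlier layers still get gradients
print(model[2].weight.grad)              # None: the frozen layer gets no parameter grad,
                                         # but backward still traversed it to reach layer 0
```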
Detach is O(1) if you consider the copy an O(1) operation, but it's probably closer to O(n). I don't know to what extent PyTorch can optimize the copy by making it in-place or lazy with views.
I'm talking about detach. From what I could find on the internet, the "copy" part just takes the tensor data and wraps it in a new variable; that doesn't imply an actual copy in memory happens. And from what I understand, to get a hard copy you have to clone the detached tensor.
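That matches what you can check yourself: detach() returns a new tensor object that shares the same underlying storage, and only .clone() actually copies the data.

```python
import torch

x = torch.randn(1000, 1000, requires_grad=True)

d = x.detach()          # new tensor object, no autograd history, same memory
c = x.detach().clone()  # independent hard copy

print(d.data_ptr() == x.data_ptr())  # True: detach did not copy the data
print(c.data_ptr() == x.data_ptr())  # False: clone allocated new storage
```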
If all OP does is detach tensors, then it's O(1). But we can't know that without further information, so I elaborated that it's likely closer to O(n) because I presume they might be doing something beyond detach.
Yeah, this makes sense. If it's only detach for all layers, it's like the .eval() method, which (per your explanation) would probably need to make one copy of the whole model footprint; but in this case it has to keep multiple copies, one at every point I detach, I guess.
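For a concrete picture of what detaching at a point in the network looks like, here's a rough sketch (a hypothetical module, not OP's code). Note that, unlike setting requires_grad to False, this still records the backbone's forward pass in the autograd graph; it just stops gradients from flowing back into it.

```python
import torch
import torch.nn as nn

class PartiallyDetached(nn.Module):
    """Hypothetical example: gradients never flow back into the backbone."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(32, 16)
        self.head = nn.Linear(16, 4)

    def forward(self, x):
        features = self.backbone(x)
        # detach() cuts the graph here: no extra memory copy is made,
        # but the backbone's parameters receive no gradients.
        return self.head(features.detach())
```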