Submitted by mishtimoi t3_xwfg83 in MachineLearning
[removed]
I also tried setting requires_grad=False but did not see much improvement.
How much of a slowdown?? I'm interested in this as well
Detach creates a new tensor. You need to set requires_grad to False. You will not see performance improvements as long as the earlier layers aren't frozen: you don't get to skip gradient updates, so if your first layers need gradients, you still have to compute them all the way back.
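For example, a minimal sketch (the toy model and which layers get frozen are made up for illustration) of freezing the early layers with requires_grad=False so autograd skips their gradients entirely:

```python
import torch
import torch.nn as nn

# Hypothetical toy model: two "early" layers to freeze, one trainable head.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10),  # only this layer stays trainable
)

# Freeze everything first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze the final layer.
for param in model[-1].parameters():
    param.requires_grad = True

x = torch.randn(8, 128)
loss = model(x).sum()
loss.backward()

# Frozen parameters accumulate no gradient; autograd stops early.
print(model[0].weight.grad)   # None
print(model[-1].weight.grad)  # populated
```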
Detach is O(1) if you consider the copy an O(1) operation, but it's probably closer to O(n). I don't know to what extent PyTorch can optimize the copy by making it in-place or lazy with views.
Copy is always O(n) unless you've got a tensor of N items operating with N threads and N ports to the memory you're copying.
I'm talking about detach. From what I could find online, the "copy" is just taking the tensor data and wrapping it in a new variable; it does not imply that an actual copy in memory happens. And from what I understand, to get a hard copy you have to clone the detached tensor.
If all OP does is detach tensors, then it's O(1). But we can't know that without further information, so I elaborated that it's likely closer to O(n) because I presume they might be doing something beyond detach.
Does detach make a copy? I thought it took a view (which is much cheaper as it only creates new metadata, but doesn't copy the underlying storage)
It makes a new tensor object, but not a copy in memory; it's a view that shares the same underlying storage, so only the reference is copied.
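A quick sketch of that distinction (tensor names are hypothetical): detach shares the underlying storage, while clone makes a hard copy.

```python
import torch

x = torch.randn(3, 3, requires_grad=True)

d = x.detach()          # new tensor object, same storage, no autograd history
c = x.detach().clone()  # hard copy: new storage

print(d.data_ptr() == x.data_ptr())  # True  -> detach is just a view/reference
print(c.data_ptr() == x.data_ptr())  # False -> clone actually copies memory

# Because the storage is shared, in-place edits to the detached tensor show up in x.
d[0, 0] = 42.0
print(x[0, 0])  # tensor(42., ...)
```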
Yeah, this makes sense. If it's only detach for all layers, it's like the .eval() method, which (as per your explanation) probably only needs to make one such copy of the whole model footprint, whereas in my case it has to keep a copy at every point I detach, I guess.
I'm not aware of the performance characteristics of detach, but why not set requires_grad to False to freeze some layers? It will tremendously speed up training and reduce memory usage.
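For example (hypothetical toy model), after freezing with requires_grad=False you can also hand the optimizer only the trainable parameters, which saves optimizer-state memory on top of the backward-pass savings:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Freeze the first linear layer.
for param in model[0].parameters():
    param.requires_grad = False

# Give the optimizer only the trainable parameters, so no optimizer state
# (e.g. Adam's momentum/variance buffers) is allocated for frozen weights.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

x = torch.randn(8, 128)
loss = model(x).sum()
loss.backward()   # gradients are computed only for the trainable parameters
optimizer.step()
optimizer.zero_grad()
```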