Submitted by mishtimoi t3_xwfg83 in MachineLearning
[removed]
I also tried setting requires_grad=False but did not see much improvement.
How much of a slowdown?? I'm interested in this as well
Detach creates a new tensor. You need to set requires_grad to False. You will not see performance improvements as long as the earlier layers aren't frozen: you don't get to skip gradient updates, so if your first layers need gradients, you still have to compute them all the way back.
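For example, a minimal sketch (the toy model and which layers get frozen are made up for illustration) of freezing the early layers with requires_grad=False so autograd skips their gradients entirely:

```python
import torch
import torch.nn as nn

# Hypothetical toy model: two "early" layers to freeze, one trainable head.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 10),  # only this layer stays trainable
)

# Freeze everything first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze the final layer.
for param in model[-1].parameters():
    param.requires_grad = True

x = torch.randn(8, 128)
loss = model(x).sum()
loss.backward()

# Frozen parameters accumulate no gradient; autograd stops early.
print(model[0].weight.grad)   # None
print(model[-1].weight.grad)  # populated
```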
Detach is O(1) if you consider the copy an O(1) operation, but it's probably closer to O(n). I don't know to what extent PyTorch can optimize the copy by making it in-place or lazy with views.
Copy is always O(n) unless you've got a tensor of N items operating with N threads and N ports to the memory you're copying.
I'm talking about detach. From what I could find online, the "copy" is just taking the tensor data and wrapping it in a new variable; it does not imply that an actual copy in memory happens. And from what I understand, to get a hard copy you have to clone the detached tensor.
If all OP does is detach tensors, then it's O(1). But we can't know that without further information, so I elaborated that it's likely closer to O(n) because I presume they might be doing something beyond detach.
Does detach make a copy? I thought it took a view (which is much cheaper as it only creates new metadata, but doesn't copy the underlying storage)
It makes a new tensor object, but not a copy in memory; it's a view that shares the same underlying storage, so only the reference is copied.
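A quick sketch of that distinction (tensor names are hypothetical): detach shares the underlying storage, while clone makes a hard copy.

```python
import torch

x = torch.randn(3, 3, requires_grad=True)

d = x.detach()          # new tensor object, same storage, no autograd history
c = x.detach().clone()  # hard copy: new storage

print(d.data_ptr() == x.data_ptr())  # True  -> detach is just a view/reference
print(c.data_ptr() == x.data_ptr())  # False -> clone actually copies memory

# Because the storage is shared, in-place edits to the detached tensor show up in x.
d[0, 0] = 42.0
print(x[0, 0])  # tensor(42., ...)
```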
Yeah, this makes sense. If it's only detach for all layers, it's like the .eval() method, which (as per your explanation) probably only needs to make one such copy of the whole model footprint, whereas in my case it has to keep a copy at every point I detach, I guess.
I'm not aware of the performance characteristics of detach, but why not set requires_grad to False to freeze some layers? It will tremendously speed up training and reduce memory usage.
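For example (hypothetical toy model), after freezing with requires_grad=False you can also hand the optimizer only the trainable parameters, which saves optimizer-state memory on top of the backward-pass savings:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Freeze the first linear layer.
for param in model[0].parameters():
    param.requires_grad = False

# Give the optimizer only the trainable parameters, so no optimizer state
# (e.g. Adam's momentum/variance buffers) is allocated for frozen weights.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

x = torch.randn(8, 128)
loss = model(x).sum()
loss.backward()   # gradients are computed only for the trainable parameters
optimizer.step()
optimizer.zero_grad()
```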