
suflaj t1_ir7re2g wrote

Detach creates a new tensor. You need to set requires_grad to False instead. You won't see performance improvements as long as the earlier layers aren't frozen: you don't get to skip gradient computation in the middle, so if your first layers need the gradient, you will have to calculate them all.
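A minimal sketch of the freezing being described, assuming a hypothetical toy two-block model (the layer sizes and names are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical model: an "early" feature layer followed by a head.
model = nn.Sequential(
    nn.Linear(128, 64),   # early layer
    nn.ReLU(),
    nn.Linear(64, 10),    # later layer / head
)

# Freeze the early layer by turning off requires_grad on its parameters.
# Autograd then skips computing gradients for them, which is where the
# actual savings come from.
for p in model[0].parameters():
    p.requires_grad_(False)

x = torch.randn(32, 128)
loss = model(x).sum()
loss.backward()

print(model[0].weight.grad)        # None: no gradient for the frozen layer
print(model[2].weight.grad.shape)  # torch.Size([10, 64]): head still trains
```

The savings only appear here because the frozen block sits at the start; freezing a later block while an earlier one still needs gradients doesn't let backprop stop early.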

Detach is O(1) if you consider the copy an O(1) operation, but it's probably closer to O(n). I don't know to what extent PyTorch can optimize copying by making it in-place or lazy with views.

1

onyx-zero-software t1_ir839jv wrote

Copy is always O(n) unless you've got a tensor of N items operating with N threads and N ports to the memory you're copying.

1

suflaj t1_ir83jmj wrote

I'm talking about detach. From what I could find online, the "copy" part is just taking the tensor's data and wrapping it in a new tensor object; that doesn't imply an actual copy in memory happens. And from what I understand, to get a hard copy you have to clone the detached tensor.

If all OP does is detach tensors, then it's O(1). But we can't know that without further information, so I elaborated that it's likely closer to O(n) because I presume they might be doing something beyond detach.
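A quick way to see the distinction, as a sketch assuming a recent PyTorch build:

```python
import torch

x = torch.randn(1000, 1000, requires_grad=True)

# detach() returns a new tensor object that shares the same underlying
# storage: no element-wise copy, so the cost doesn't grow with tensor size.
d = x.detach()
print(d.data_ptr() == x.data_ptr())  # True: same memory

# detach().clone() allocates new storage and copies every element,
# which is the O(n) "hard copy".
c = x.detach().clone()
print(c.data_ptr() == x.data_ptr())  # False: separate memory
```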

1

mishtimoi OP t1_ir9wwlx wrote

Yeah, this makes sense. If it's only detach for all layers, it's like the .eval() method, which probably needs to make a copy of the whole model footprint once (as per your explanation); but in this case it has to keep multiple copies, one at every point where I detach, I guess.

1

chatterbox272 t1_ir8dz7e wrote

Does detach make a copy? I thought it took a view (which is much cheaper as it only creates new metadata, but doesn't copy the underlying storage)

1

suflaj t1_ir9g78r wrote

It makes a new tensor object, but not a copy in memory: it's a view that shares the same underlying storage, so effectively only the reference is copied.
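A small illustration of that point (a sketch, not tied to anyone's actual code): an in-place change to the original shows up through the detached tensor, because only the tensor metadata is new.

```python
import torch

a = torch.ones(3)   # plain tensor, no autograd involved here
b = a.detach()      # new tensor object, same storage

a.add_(1)           # in-place change to the original...
print(b)            # tensor([2., 2., 2.]) ...is visible via the detached tensor
```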

1