Submitted by Rishh3112 t3_120gvgw in deeplearning

I made a model for handwritten text recognition. The model trains fine on CPU, but when I use a GPU I get a CUDA out-of-memory error in the validation step. Can someone please tell me why this is happening?

1

Comments

Old-Chemistry-7050 t1_jdhbvrc wrote

Model too big or there’s a memory issue somewhere in your code

5

Rishh3112 OP t1_jdhbz4o wrote

The model isn't too big. There should not be a problem with that.

−1

humpeldumpel t1_jdhcgg6 wrote

Well then it's the memory issue. Hard to say without seeing your code

4

Rishh3112 OP t1_jdhd785 wrote

import torch
import torch.nn as nn
import torch.nn.functional as F

import config  # the poster's own settings module (not shown); must define BATCH_SIZE


class CNN(nn.Module):
    def __init__(self, num_chars):
        super(CNN, self).__init__()
        # Convolution Layer
        self.conv1 = nn.Conv2d(3, 128, kernel_size=(3, 6), padding=(1, 1))
        self.pool1 = nn.MaxPool2d(kernel_size=(2, 2))
        self.conv2 = nn.Conv2d(128, 64, kernel_size=(3, 6), padding=(1, 1))
        self.pool2 = nn.MaxPool2d(kernel_size=(2, 2))
        # Dense Layer
        self.fc1 = nn.Linear(768, 64)
        self.dp1 = nn.Dropout(0.2)
        # Recurrent Layer (a GRU, despite the attribute name).
        # batch_first=True because x arrives as (batch, width, features).
        self.lstm = nn.GRU(64, 32, bidirectional=True, batch_first=True)
        # Output Layer (one extra class for the CTC blank)
        self.output = nn.Linear(64, num_chars + 1)

    def forward(self, images, targets=None):
        bs, _, _, _ = images.size()
        x = F.relu(self.conv1(images))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        # (bs, C, H, W) -> (bs, W, C, H), then flatten C and H into one feature axis
        x = x.permute(0, 3, 1, 2)
        x = x.view(bs, x.size(1), -1)
        x = F.relu(self.fc1(x))
        x = self.dp1(x)
        x, _ = self.lstm(x)
        x = self.output(x)
        # CTC loss expects (time, batch, classes)
        x = x.permute(1, 0, 2)
        if targets is not None:
            log_probs = F.log_softmax(x, 2)
            input_lengths = torch.full(
                size=(bs,), fill_value=log_probs.size(0), dtype=torch.int32
            )
            target_lengths = torch.full(
                size=(bs,), fill_value=targets.size(1), dtype=torch.int32
            )
            loss = nn.CTCLoss(blank=0)(
                log_probs, targets, input_lengths, target_lengths
            )
            return x, loss
        return x, None


if __name__ == '__main__':
    model = CNN(74)
    img = torch.rand(config.BATCH_SIZE, 3, 50, 200)
    target = torch.randint(1, 20, (config.BATCH_SIZE, 5))
    x, loss = model(img, target)
    print(loss)

0

trajo123 t1_jdhi7u8 wrote

The problem is likely in your training loop. Perhaps your computation graph keeps growing because you keep track of the average loss as an autograd variable rather than a plain number. Make sure that you use loss.item() for any metrics or logging.
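The loss.item() advice can be sketched as follows. This is a minimal illustration, not the poster's training loop: the tiny linear model, the random batches, and the names running_loss and average_loss are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data, for illustration only.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
batches = [(torch.rand(8, 4), torch.rand(8, 1)) for _ in range(3)]

running_loss = 0.0  # plain Python float, holds no reference to any graph
for inputs, targets in batches:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # loss.item() copies the scalar out as a Python float.
    # Writing `running_loss += loss` instead would turn running_loss into
    # a tensor that stays attached to each batch's autograd graph.
    running_loss += loss.item()

average_loss = running_loss / len(batches)
print(average_loss)
```

The same applies to any value stored for logging across iterations, such as entries appended to a list of per-batch losses.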

5

humpeldumpel t1_jdhpl0w wrote

And also make use of the training and validation mode of the model

2

Rishh3112 OP t1_jdhib79 wrote

Sure, I'll give it a try. Thanks a lot.

1

Rishh3112 OP t1_jdhiguj wrote

I just checked: in my training loop I'm using loss.item().

1

_vb__ t1_jdiwjqk wrote

Are you calling the zero_grad method on your optimizer in every step of your training loop?
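To show what zero_grad changes, here is a small sketch under made-up assumptions: a single parameter w, a toy loss whose gradient is always 2, and a learning rate of 0 so that w never moves. The helper grad_after_steps is hypothetical. Skipping zero_grad makes gradients add up across steps; that is mainly a correctness problem, since the accumulation happens in place, so it is probably not the cause of an out-of-memory error by itself.

```python
import torch

w = torch.ones(1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.0)  # lr=0 keeps w fixed for the demo


def grad_after_steps(zero_each_step):
    """Run three steps and return the gradient left on w."""
    optimizer.zero_grad()  # start from a clean slate
    for _ in range(3):
        if zero_each_step:
            optimizer.zero_grad()
        loss = (2 * w).sum()  # d(loss)/dw is always 2
        loss.backward()       # adds 2 to whatever w.grad already holds
        optimizer.step()
    return w.grad.item()


print(grad_after_steps(True))   # 2.0: each step sees only its own gradient
print(grad_after_steps(False))  # 6.0: gradients from all three steps add up
```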

3

trajo123 t1_jdhhteo wrote

Have you tried asking ChatGPT? :)

4

Rishh3112 OP t1_jdhi3xo wrote

I actually did, but the suggestions it gave were for general out-of-memory errors. They didn't help much.

0

CKtalon t1_jdhkdgc wrote

Code seems fine unless your batch size is too huge. Try running on CPU and see how much RAM is used and debug from there?

3

Rishh3112 OP t1_jdhks1w wrote

My batch size is just 8. I am running it on CPU on a laptop with 8 GB of RAM, and it runs fine there.

0

stuv_x t1_jdj5l6h wrote

Make sure you’re putting the model into evaluation mode during validation and zeroing your optimiser's gradients at every training step.
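A validation pass in evaluation mode might look like the sketch below. The model, the data, and the validate helper are hypothetical. One addition goes beyond what the thread says: the loop is wrapped in torch.no_grad(), because model.eval() only changes layers such as dropout and does not stop autograd from recording a graph. Without no_grad, activations are kept for a backward pass that never happens, which is a common source of memory growth during validation.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the real model and validation data.
model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(0.2), nn.Linear(8, 1)).to(device)
criterion = nn.MSELoss()
val_batches = [(torch.rand(8, 4), torch.rand(8, 1)) for _ in range(3)]


def validate(model, batches):
    model.eval()               # dropout off, normalisation layers use running stats
    total = 0.0
    with torch.no_grad():      # assumption: not mentioned in the thread
        for inputs, targets in batches:
            inputs, targets = inputs.to(device), targets.to(device)
            loss = criterion(model(inputs), targets)
            total += loss.item()
    model.train()              # restore training mode for the next epoch
    return total / len(batches)


val_loss = validate(model, val_batches)
print(val_loss)
```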

2

MisterManuscript t1_jdhrqhl wrote

What GPU are you using? How much vRAM does it have?

1

Rishh3112 OP t1_jdhsvfv wrote

I'm using AWS, and it has 14 GB of RAM.

0

MisterManuscript t1_jdhtan0 wrote

You probably have a memory leak somewhere in your training loop. Either that, or your model or batch size is way too big and occupies a lot of vRAM.

Addendum: There's a difference between RAM and vRAM (your GPU's RAM). I hope the 14 GB you're talking about is vRAM and not the RAM of your AWS VM.

1

Rishh3112 OP t1_jdhti46 wrote

My model is posted down here in the thread, and my batch size is just 8.

0

what_if___420 t1_jdi72rw wrote

Reduce batch size

1

Rishh3112 OP t1_jdihvhj wrote

Even reducing the batch size to 1 is generating the same error.

1

boosandy t1_jdi7fiy wrote

Use smaller batch size

1

goxdin t1_jdjdwcw wrote

Confirm your video RAM: if it's an NVIDIA GPU, run nvidia-smi.
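The same check can be done from inside the Python process. The sketch below assumes PyTorch and device index 0; the helper name gpu_memory_report is made up. It returns None when no CUDA device is visible.

```python
import torch


def gpu_memory_report():
    """Return basic memory figures for GPU 0, or None if CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    gib = 1024 ** 3
    return {
        "name": props.name,
        "total_gib": props.total_memory / gib,
        # Memory currently occupied by live tensors in this process.
        "allocated_gib": torch.cuda.memory_allocated(0) / gib,
        # Memory held by PyTorch's caching allocator, including free blocks.
        "reserved_gib": torch.cuda.memory_reserved(0) / gib,
    }


report = gpu_memory_report()
print(report)
```

Calling it before and after the validation step shows whether allocated memory keeps climbing from one batch to the next.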

1

j-solorzano t1_jdk2kod wrote

If it works on CPU but not on GPU, even though the GPU should have more memory, the only difference I can think of is garbage collection timing. Try calling the garbage collector in every epoch. Also, note that you have a GRU, which retains tensors.
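Calling the garbage collector once per epoch could look like this. It is a sketch only: free_unused_memory and num_epochs are hypothetical names, and the torch.cuda.empty_cache() call is an addition not mentioned above. Neither call frees tensors that are still referenced somewhere, so this helps only if the memory is held by objects that are already unreachable.

```python
import gc

import torch


def free_unused_memory():
    """Collect unreachable Python objects, then release cached GPU blocks."""
    gc.collect()
    if torch.cuda.is_available():
        # Returns unused cached blocks to the driver; live tensors are untouched.
        torch.cuda.empty_cache()


num_epochs = 2  # hypothetical value
for epoch in range(num_epochs):
    # ... training and validation for this epoch would run here ...
    free_unused_memory()
```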

1

Rishh3112 OP t1_jdl42av wrote

Sure I will try using a garbage collector in every epoch. Thanks.

2