Submitted by AKavun t3_105na47 in deeplearning

So I am trying to make a CNN image classifier that has two classes, good and bad. The aim is to look at photoshoot pictures found on fashion sites and pick the "best one". I trained this model for 150 epochs and its loss did not change at all (roughly). Details are as follows:

My metric for "best one", and also the way I structured my dataset, is that the photo shows the whole outfit or the model's whole body, not just the upper or lower body. I also labeled the photos where the model's back was turned to the camera. My training set has 1304 good photos and 2000 bad photos. My validation set has 300 photos per class (so 600 in total).

Architecture is as follows: Conv > Pool > Conv > Pool > Flatten > Linear > Linear > Softmax. For details of the architecture, like stride etc., check out the code I provided.

I have to use softmax since in my application I need to see the probabilities of the good and bad classes. That is why I am not using cross-entropy loss but negative log-likelihood loss instead. Adam is my optimizer.

Other hyperparameters: batch size: 64, number of epochs: 150, input size: (224, 224), number of classes: 2, learning rate: 0.01, weight decay: 0.01

I trained with this script for 150 epochs. The model started with a loss of 0.5253 and ended with a loss of 0.5324. I took snapshots every 10 epochs, but the model did not learn anything throughout training. This is what my learning curve looks like:

[Learning curve plot]

Now I know there are many, many things I can do to make the model perform better, like initializing with a pretrained model, doing more with transforms, etc. But the problem with this model is not that it is performing poorly, it is not performing at all! I also have a validation accuracy check, and it stays around 50% during all of training, for a classifier with 2 classes. So I am assuming I am doing something very obviously wrong. Any idea what?

0

Comments


trajo123 t1_j3busy6 wrote

First of all, the dataset size is way too small to train a model from scratch and get meaningful results on this relatively complex task (more complex than MNIST, for example, which has a training set of 60,000 images). Second, your model is way too small/simple for this task even if you had 100 times more data. I strongly suggest transfer learning: fine-tune a pre-trained model by replacing the classification head, freezing the rest of the model in place, and training on your dataset.

Something along these lines:

import torch.nn as nn
from torchvision import transforms, models

# ...

model = models.swin_b(weights=models.Swin_B_Weights.IMAGENET1K_V1)
# replace the classification head with a fresh single-output layer
model.head = nn.Linear(model.head.in_features, 1, bias=True)
# ...
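The freezing step mentioned above isn't shown in that snippet; a minimal sketch (assuming a recent torchvision, where swin_b exposes its classifier as model.head, and with the learning rate chosen purely for illustration) could look like this:

import torch
import torch.nn as nn
from torchvision import models

# Rough sketch of the freezing step: freeze the pre-trained backbone so that
# only the newly attached head gets updated during fine-tuning.
model = models.swin_b(weights=models.Swin_B_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False

model.head = nn.Linear(model.head.in_features, 1, bias=True)  # new head is trainable by default

# Only the trainable (head) parameters need to go to the optimizer.
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)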

In the pre-trained model documentation you will see what training recipe was used and what transforms were applied to the images. Typically:

transforms.Normalize(
    mean=(0.485, 0.456, 0.406),
    std=(0.229, 0.224, 0.225),
)

transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC)
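
Put together into a single pipeline, the preprocessing might look something like this (a sketch; newer torchvision versions also expose the exact recipe via Swin_B_Weights.IMAGENET1K_V1.transforms()):

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),  # Normalize expects a tensor, so convert first
    transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    ),
])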

See more at <https://pytorch.org/vision/stable/models.html#table-of-all-available-classification-weights>. You can also find pre-trained vision models on Hugging Face.

Hope this helps, good luck!

3

trajo123 t1_j3c38rx wrote

Several things I noticed in your code:

  • your model doesn't use any transfer (activation) function
  • the combination of final activation function and loss function is incorrect
  • for CNNs, you should be using BatchNorm2d layers

The code should look something like this:

import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(CNNClassifier, self).__init__()
        self.input_size = input_size
        self.num_classes = num_classes
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)  # increase the number of channels
        self.bn1 = nn.BatchNorm2d(32)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=128, kernel_size=3, stride=1, padding=1)  # increase the number of channels; in_channels must match conv1's out_channels
        self.bn2 = nn.BatchNorm2d(128)
        self.fc1 = nn.Linear(128, 256)  # note the smaller numbers
        self.fc2 = nn.Linear(256, num_classes)
        self.final_pool = nn.AdaptiveAvgPool2d(1)  # before flatten, use AdaptiveMaxPool2d or AdaptiveAvgPool2d to get rid of the spatial dimensions, essentially treating each filter as one feature
        # self.softmax = nn.Softmax(dim=1) - not needed, see below. Also, Softmax is not correct for use with NLLLoss; the correct one would be LogSoftmax(dim=1)
        self.f = nn.ReLU()

    def forward(self, x):
        x = self.conv1(x)
        x = self.pool(x)
        x = self.f(x)   # apply the transfer function
        x = self.bn1(x) # apply batch norm (this can also be placed before the transfer function)

        x = self.conv2(x)
        x = self.pool(x)
        x = self.f(x)   # apply the transfer function
        x = self.bn2(x) # apply batch norm (this can also be placed before the transfer function)

        # since you are now using batch norm, you could add a few more blocks like the one above; vanishing gradients are less of a concern now

        x = self.final_pool(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.f(x)   # apply the transfer function; here you could try tanh as well
        x = self.fc2(x)
        # x = self.softmax(x)  # not needed here because it is incorporated into the loss function for numerical/computational efficiency reasons
        return x

Also, the loss should be

# criterion = nn.NLLLoss()
criterion = nn.CrossEntropyLoss()  # the more natural choice of loss function for classification; for binary classification, BCEWithLogitsLoss would be even more natural, but then you need to set the number of output units to 1.
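
For the single-output-unit variant mentioned there, a rough sketch (assuming fc2 is changed to nn.Linear(256, 1), with model, images and labels standing in for the training-loop objects):

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = model(images)                                 # shape: (batch_size, 1)
loss = criterion(logits, labels.float().unsqueeze(1))  # labels are 0/1 integers

probs = torch.sigmoid(logits)                          # probability of the "good" class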
1

trajo123 t1_j3c3cvf wrote

...let me know if it works any better!

1

AKavun OP t1_j3l51kx wrote

Thank you, sir. I posted a general update to this thread and will keep updating you on everything.

1

FastestLearner t1_j3c0yju wrote

You are not using any non-linearity. Yours is just a linear model, and deep CNNs thrive on non-linearity. Try adding a ReLU layer after every MaxPool. Also, for better convergence, add BN layers after each Conv. Don't use two Linear layers (mostly redundant). Use AvgPool instead of Flatten. Replace Softmax with LogSoftmax. Set Adam lr=1e-4, decay=1e-4.
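
A minimal sketch of those last two suggestions (assuming model, images and labels come from the existing training loop):

import torch
import torch.nn as nn

# Suggested optimizer settings.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# LogSoftmax pairs correctly with NLLLoss (plain Softmax does not);
# together they are equivalent to CrossEntropyLoss on raw logits.
criterion = nn.NLLLoss()
log_probs = nn.LogSoftmax(dim=1)(model(images))
loss = criterion(log_probs, labels)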

PM me if you face any more issues.

3

AKavun OP t1_j3l4gb1 wrote

u/trajo123 u/FastestLearner u/trajo123

I am giving this as a general update. In my original post, I said "I am doing something very obviously wrong," and indeed I was. The reason my model did not learn at all was that the whole Python script, with the exception of my main method, was being re-executed every few seconds, which caused my model to reinitialize and reset. I believe this was caused by PyTorch's handling of the "num_workers" parameter in the dataloader, which does some multiprocessing magic and ends up re-executing the script multiple times.
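
For reference, the usual way to avoid that re-execution (spawned worker processes re-import the module) is to keep everything except definitions behind an entry-point guard; a minimal sketch, with train_dataset standing in for whatever the script builds:

from torch.utils.data import DataLoader

def main():
    # dataset/model/optimizer setup and the training loop live here
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4)
    ...

if __name__ == "__main__":
    # Spawned DataLoader workers re-import this module; the guard stops the
    # training code from running again inside each worker.
    main()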

So fixing that allowed my model to learn, but it still performed poorly for the reasons all of you so generously explained in great detail. My first instinctive reaction was to switch to resnet18 and change the output layer. I also switched to cross-entropy loss, as I learned I can still use softmax in post-processing to obtain the prediction confidence, which I did not previously think was possible. Now my model performs with 90% accuracy on my test set, and the rest, I think, is just tweaking the hyperparameters, enlarging and augmenting the data, and maybe doing some partial training with different learning rates, etc.
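
For reference, that setup might look roughly like this (a sketch of what is described above, not my actual code; images is a preprocessed batch):

import torch
import torch.nn as nn
from torchvision import models

# Pre-trained resnet18 with a fresh 2-class output layer.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)

# Train on raw logits with cross-entropy...
criterion = nn.CrossEntropyLoss()

# ...and apply softmax only in post-processing to get "good"/"bad" confidences.
with torch.no_grad():
    probs = torch.softmax(model(images), dim=1)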

However, I still want to learn how to design an architecture from scratch, so I am experimenting with that after carefully reading the answers you provided. I thank each of you so much and wish you all success in your careers. You are great people and we are a great community.

2

trajo123 t1_j3lwv4f wrote

> 90% accuracy in my test

Looking at accuracy can be misleading if your dataset is imbalanced. Say 90% of your data is labelled False and only 10% is labelled True: even a model that doesn't look at the input at all and just predicts False all the time will have 90% accuracy. A better metric for binary classification is the F1 score, but that also depends on where you set the decision threshold (the default is 0.5, but you can change it to adjust the confusion matrix). Perhaps the most useful metric to see how much your model has learned is the area under the ROC curve, a.k.a. the ROC AUC score (where 0.5 is the same as random guessing and 1 is a perfect classifier).
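
With scikit-learn these are one-liners; a small sketch with made-up numbers:

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholder data: y_true holds the 0/1 labels, y_prob the predicted
# probability of the positive class.
y_true = [0, 0, 0, 1, 1]
y_prob = [0.2, 0.4, 0.6, 0.7, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]   # default 0.5 decision threshold

print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred))            # depends on the threshold
print(roc_auc_score(y_true, y_prob))       # threshold-independent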

1

suflaj t1_j3bt2eq wrote

That learning rate is about 100 times higher than what you would typically give Adam for that batch size. That weight decay is also about 100 times too high, and if you want to use weight decay with Adam, you should probably use the AdamW optimizer (which is more or less the same thing, it just fixes the interaction between Adam and weight decay).
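
As an illustration only (the values are just the original settings scaled down by ~100x, with model being the classifier in question):

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)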

Also, loss is not something that determines how much a model has learned. You should check out validation F1, or whatever metrics are relevant for the performance of your model.

1

AKavun OP t1_j3btlem wrote

I also have a validation accuracy metric of around 50%, which is basically what you would expect from random guessing.

I removed the weight decay to keep things simpler and adjusted the learning rate to 0.0003. I will update this thread on the results.

Thank you for taking the time to help

1

suflaj t1_j3bubtm wrote

Another problem you will likely have is your very small convolutions. Basically, output channels of 8 and 16 are probably only enough to solve MNIST. You should probably use something more like 32 and 64, and use larger kernels and strides to hopefully reduce the reliance on the linear layers to do the work for you.

Finally, you are not using nonlinear activations between layers. Your whole network essentially acts like one smaller convolutional layer with a flatten and softmax.
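
A quick standalone check of that point: without a nonlinearity in between, two linear maps compose into a single linear map.

import torch
import torch.nn as nn

# f2(f1(x)) = W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), i.e. one linear layer.
torch.manual_seed(0)
f1, f2 = nn.Linear(8, 16), nn.Linear(16, 4)

combined = nn.Linear(8, 4)
with torch.no_grad():
    combined.weight.copy_(f2.weight @ f1.weight)
    combined.bias.copy_(f2.weight @ f1.bias + f2.bias)

x = torch.randn(5, 8)
print(torch.allclose(f2(f1(x)), combined(x), atol=1e-5))  # True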

1