Submitted by Santhosh999 t3_z9dryt in deeplearning

I trained a neural network for credit card fraud detection. When the last layer has 1 neuron with a sigmoid activation function, the accuracy is 99%, whereas when softmax is used, the accuracy is 0.17%.

I know that sigmoid needs to be used for a binary classification problem. Can someone explain why sigmoid should be used rather than softmax?

Thank you for your time.

2

Comments


Blueberry-Tacos t1_iygvvgh wrote

The softmax activation function is usually used for multi-class classification. Since you are building a binary classification model, sigmoid is better for your application :)

EDIT: I've made a mistake. It is not that softmax handles multiple labels at once; rather, it is used to choose one class among multiple 'categories' for a single input.

3

suflaj t1_iyh2wil wrote

Well, for softmax you need at least 2 neurons.

1

jellyfishwhisperer t1_iyjv41y wrote

Does the positive class make up 0.17% of your data? A softmax over a single neuron should always output 1, right?
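
A quick check of that, with a hypothetical single logit:

    import numpy as np

    z = np.array([-3.2])                 # any single logit
    print(np.exp(z) / np.exp(z).sum())   # [1.] -- softmax over 1 unit is always 1, so the model always predicts "fraud"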

1

suflaj t1_iykea48 wrote

Doesn't matter. Softmax is just a multidimensional sigmoid. For binary classification you can therefore use either 1 output and a sigmoid, or 2 outputs and a softmax. The only difference is that with a sigmoid you resolve the result as

is_fraud = result > 0.5

while with softmax you'd do

is_fraud = argmax(result) == 1
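
A minimal NumPy sketch of that equivalence (hypothetical logits, just to illustrate that the two decision rules agree):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - np.max(z))  # subtract the max for numerical stability
        return e / e.sum()

    # hypothetical logits for one sample: index 0 = "not fraud", index 1 = "fraud"
    logits = np.array([0.3, 1.7])

    # softmax over 2 logits gives the same fraud probability as sigmoid of their difference
    p_fraud_softmax = softmax(logits)[1]
    p_fraud_sigmoid = sigmoid(logits[1] - logits[0])
    print(np.isclose(p_fraud_softmax, p_fraud_sigmoid))   # True

    # and the two decision rules above agree
    print(p_fraud_sigmoid > 0.5, np.argmax(softmax(logits)) == 1)   # True True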

3

Santhosh999 OP t1_iykgddd wrote

I am getting an error when I try 2 neurons and a softmax activation function with binary cross-entropy loss.

ValueError: logits and labels must have the same shape ((None, 2) vs (None, 1))

1

suflaj t1_iyktdbc wrote

Well, you have to change your labels from being 1 element long to 2 elements long. If your labels are True or 1 and False or 0, you will need to change them to [0, 1] and [1, 0] respectively.
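
For example, a minimal Keras-style sketch (assuming your labels are a 0/1 NumPy array named y; with one-hot labels and 2 softmax outputs you would typically also switch the loss to categorical cross-entropy):

    import numpy as np
    from tensorflow.keras.utils import to_categorical

    y = np.array([0, 1, 0, 0, 1])                 # hypothetical 0/1 fraud labels
    y_onehot = to_categorical(y, num_classes=2)   # shape (N, 2): 0 -> [1, 0], 1 -> [0, 1]

    # model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    # model.fit(X, y_onehot, ...)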

2

trajo123 t1_iymuivu wrote

To answer your question concretely: in classification you want your model's output to reflect a probability distribution over the classes. If you have only 2 classes, this can be achieved with 1 output unit producing values between 0 and 1. If you have more than 2 classes, you need 1 unit per class, such that each one produces a value in the (0, 1) range and all of them sum to 1, so that together they form a probability distribution. With 1 output unit, the sigmoid function ensures the output lies in (0, 1); with multiple output units, softmax ensures the conditions above.

Now, in practice, classification models don't use an explicit activation function after the last layer; instead, the loss incorporates the appropriate activation for efficiency and numerical stability reasons. So in the case of binary classification you have two equivalent options:

  • use 1 output unit with torch.nn.BCEWithLogitsLoss

>This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.

  • use 2 output units with torch.nn.CrossEntropyLoss

>This criterion computes the cross entropy loss between input logits and target

Both of these approaches are mathematically equivalent and should produce the same results up to numerical considerations. If you get wildly different predictions, it means you did something wrong.
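
A minimal PyTorch sketch of the two options (hypothetical feature size and random data, just to show the shapes each loss expects):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(8, 30)            # hypothetical batch: 8 samples, 30 features
    y = torch.randint(0, 2, (8,))     # 0 = not fraud, 1 = fraud

    # option 1: 1 output unit + BCEWithLogitsLoss (sigmoid lives inside the loss)
    model1 = nn.Linear(30, 1)
    loss1 = nn.BCEWithLogitsLoss()(model1(x).squeeze(1), y.float())

    # option 2: 2 output units + CrossEntropyLoss (softmax lives inside the loss)
    model2 = nn.Linear(30, 2)
    loss2 = nn.CrossEntropyLoss()(model2(x), y)

    # at inference time:
    p1 = torch.sigmoid(model1(x)).squeeze(1)     # P(fraud), threshold at 0.5
    p2 = torch.softmax(model2(x), dim=1)[:, 1]   # P(fraud), or take argmax over dim=1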

On another note, using accuracy for credit card fraud detection is not a good idea, because the dataset is most likely highly imbalanced. Probably more than 99% of the samples are labelled "not fraud". In that case, a trivial model that always predicts "not fraud", regardless of input, already gets 99% accuracy. You may want to look into metrics for imbalanced datasets, e.g. the F1 score, false positive rate, false negative rate, etc.
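
For example, a small scikit-learn sketch with hypothetical labels and the "always not fraud" baseline:

    import numpy as np
    from sklearn.metrics import f1_score, recall_score, confusion_matrix

    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])   # hypothetical, heavily imbalanced
    y_pred = np.zeros(10, dtype=int)                     # trivial model: always predicts "not fraud"

    print((y_true == y_pred).mean())                      # 0.9 accuracy despite catching no fraud
    print(recall_score(y_true, y_pred, zero_division=0))  # 0.0 -- every fraud case is missed
    print(f1_score(y_true, y_pred, zero_division=0))      # 0.0
    print(confusion_matrix(y_true, y_pred))               # [[9 0], [1 0]]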

Have fun on your (deep) learning journey!

2