_Arsenie_Boca_ t1_j3kzllo wrote

Your laptop will not begin to suffice, not for inference and especially not for fine-tuning. You would need something like an A100 GPU in a server that handles requests, and even then the results will be much worse than GPT-3. If you don't already have AI infrastructure, go with an API; it will save you more than a bit of money (unless you are certain you will use it at scale long-term). If you are worried about relying on OpenAI, there are other companies that serve LMs.

15

_Arsenie_Boca_ t1_j2xonjk wrote

Yes, I believe there are two factors at play here:

  1. Models can potentially correct some of the human labeler's errors through their generalization power, provided the model is not overfitted.

  2. You should differentiate between outperforming a human and outperforming humans. Labels usually represent the collective knowledge of a number of people, not just one.

3

_Arsenie_Boca_ t1_ixgjhjp wrote

As interesting as weak supervision is, the main takeaway is that using an LLM's few-shot predictions as labels to train a small model is a great way to save labeling costs. Using Snorkel on top means you have to query multiple LLMs and carry Snorkel as additional complexity, all for only a few extra points. Perhaps those extra points could also have been achieved by letting the LLM label a few more samples, or by giving it a few more shots to produce better labels.
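
A minimal sketch of that basic recipe, assuming a hypothetical `query_llm_few_shot` helper in place of a real LLM API (replaced here by a trivial heuristic so the snippet runs end to end):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def query_llm_few_shot(text: str) -> int:
    # Hypothetical stand-in: in practice, prompt an LLM with a few labeled
    # examples and parse its answer into a class id.
    return int("good" in text.lower())

unlabeled = ["good movie", "bad acting", "good plot", "terrible pacing"]
pseudo_labels = [query_llm_few_shot(t) for t in unlabeled]

# Train a cheap student model on the LLM-generated pseudo-labels.
vectorizer = TfidfVectorizer()
student = LogisticRegression().fit(vectorizer.fit_transform(unlabeled), pseudo_labels)
```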

2

_Arsenie_Boca_ t1_ivvfzfs wrote

In classification, you usually have a single correct class, i.e. a hard label. However, you might also have soft labels, where multiple classes have non-zero target probabilities. Label smoothing is a technique that artificially derives soft labels from hard labels: if your hard label was [0 0 1 0], it might become [0.05 0.05 0.85 0.05]. You can use the strength of the smoothing to represent uncertainty.
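
A minimal sketch of the computation (numpy; `eps` is the smoothing strength, and the values are illustrative):

```python
import numpy as np

def smooth_labels(hard: np.ndarray, eps: float = 0.2) -> np.ndarray:
    """Turn a one-hot hard label into a soft label.

    Every class receives eps / K probability mass; the remaining 1 - eps
    stays on the true class, so the vector still sums to 1.
    """
    k = hard.shape[-1]
    return (1.0 - eps) * hard + eps / k

# [0 0 1 0] with eps=0.2 -> [0.05 0.05 0.85 0.05]
print(smooth_labels(np.array([0.0, 0.0, 1.0, 0.0])))
```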

1

_Arsenie_Boca_ t1_iusvc0e wrote

Parameter sharing across layers would achieve just that. In the ALBERT paper, the authors show that repeating a single layer multiple times actually leads to performance similar to having separate parameter matrices per layer. I haven't heard a lot about this technique, but I assume that is because people mostly care about speed, which it does not improve (while it is a good match for your use case).
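
A minimal PyTorch sketch of this kind of cross-layer sharing (sizes and names are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Applies one transformer layer num_passes times (ALBERT-style
    cross-layer parameter sharing) instead of stacking separate layers."""

    def __init__(self, d_model: int = 256, nhead: int = 4, num_passes: int = 6):
        super().__init__()
        # A single layer's parameters, reused at every "depth".
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_passes):
            x = self.layer(x)  # same weights on every pass
        return x

encoder = SharedLayerEncoder()
out = encoder(torch.randn(2, 10, 256))  # (batch, seq_len, d_model)
```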

2