cthorrez t1_jc5s8ag wrote

Basically, I would just make sure the metrics being compared are computed the same way: the same numerator and denominator, for example summing vs. averaging, and over the batch vs. over the epoch. If the datasets are the same and the type of metric you are computing is the same, the numbers are comparable.

The implementation details just become part of the comparison.
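
To make the batch vs. epoch point concrete, here is a minimal sketch with made-up per-example losses (the numbers and function names are hypothetical, purely for illustration). When the last batch is smaller than the others, averaging the per-batch averages gives a different number than averaging over every example in the epoch, even though the data are identical.

```python
def epoch_mean(batches):
    """Average over every example in the epoch: total sum / total count."""
    total = sum(sum(b) for b in batches)
    count = sum(len(b) for b in batches)
    return total / count


def mean_of_batch_means(batches):
    """Average the per-batch averages, which weights every batch equally."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)


# Hypothetical per-example losses; the last batch holds a single example.
batches = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], [4.0]]

print(epoch_mean(batches))           # 12 / 9
print(mean_of_batch_means(batches))  # (1 + 1 + 4) / 3
```

Two papers reporting "average loss" on the same dataset could differ this way without either being wrong, so it is worth checking which one each is reporting.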


cthorrez t1_ja8d6oc wrote

Not exactly. In batch RL the data they train on are real (state, action, next state, reward) tuples from real agents interacting with real environments.

They improve the policy offline. In RLHF there actually is no env. And the policy is just standard LLM decoding.


cthorrez t1_ja70abd wrote

I find it a little weird that RLHF is considered to be reinforcement learning.

The human feedback is collected offline and forms a static dataset. They use the objective from PPO but it's really more of a form of supervised learning. There isn't an agent interacting with an env, the "env" is just sampling text from a static dataset and the reward is the score from a neural net trained on a static dataset.


cthorrez t1_j9xstlw wrote

People are rushing to deploy LLMs in search, summarization, virtual assistants, question answering and countless other applications where correct answers are expected.

The reason they want to get to the part of the latent space close to the answer is that they want the LLM to output the correct answer.


cthorrez t1_j9x772h wrote

Along with each prompt, just put: "And at the end of your response, state on a scale from one to ten how confident you are in your answer"

This works amazingly and is very accurate. source

It has the added bonus that you can get confidence intervals on your confidence intervals just by asking how confident it is in its estimation of its confidence.


cthorrez t1_j67csjx wrote

That's an interesting topic that I think deserves further investigation. On the surface it sounds like the size of the LM impacts the mechanism by which the LM is able to "secretly perform gradient descent".

Is finetuning similarly unstable for small LMs?


cthorrez t1_j63uc5a wrote

I have an issue with the experiments.

> For ICL, we fix the number of demonstration examples to 32 and tune the random seed for each task to find a set of demonstration examples that achieves the best validation performance. For finetuning, we use the same demonstration examples for ICL as the training examples and use SGD as the optimizer

They go through a set of random seeds to pick the "best" possible samples for in-context learning, and then use the same set of examples for finetuning. I think this biases the results in favor of in-context learning.

A fairer way to do this would be to use a truly random set of examples, or to use the same approach and tune the seed to find the "best" set of examples for finetuning as well.
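
A toy simulation of the concern, not a reproduction of the paper's setup. It assumes both methods have the same true quality, and that the validation score of a given example set is independent noise for each method (i.e. the examples that happen to be best for ICL are not especially good for finetuning). The function name, the Gaussian noise, and the number of seeds are all hypothetical choices for illustration.

```python
import random


def simulate(num_trials=2000, num_seeds=10, rng_seed=0):
    """Return (average gap when the seed is tuned for ICL only,
               average gap when the seed is tuned for each method)."""
    rng = random.Random(rng_seed)
    icl_only_gap = 0.0
    symmetric_gap = 0.0
    for _ in range(num_trials):
        # Validation score of each candidate example set under each method.
        # Assumption: identical true quality (0.0), independent noise.
        icl = [rng.gauss(0.0, 1.0) for _ in range(num_seeds)]
        ft = [rng.gauss(0.0, 1.0) for _ in range(num_seeds)]

        # Seed chosen to maximize ICL, then reused as-is for finetuning.
        best_for_icl = max(range(num_seeds), key=lambda i: icl[i])
        icl_only_gap += icl[best_for_icl] - ft[best_for_icl]

        # Seed tuned separately for each method.
        symmetric_gap += max(icl) - max(ft)
    return icl_only_gap / num_trials, symmetric_gap / num_trials


icl_only_gap, symmetric_gap = simulate()
print(f"ICL minus finetuning, seed tuned for ICL only: {icl_only_gap:.2f}")
print(f"ICL minus finetuning, seed tuned for both:     {symmetric_gap:.2f}")
```

Under these assumptions the first gap is clearly positive even though neither method is actually better, while the second stays near zero. How large the effect is in the real experiments depends on how correlated the two methods' preferences over example sets are, which this sketch does not model.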


cthorrez OP t1_iqpko2f wrote

Thanks for the reply and the resource! You're right about the relatively recent influx of people who enter the ML field via deep learning first. Seems like most of the intro material focuses on logistic sigmoid based methods.

That said, do you think there is a fundamental reason why other log-likelihood-based methods, such as probit and Poisson as you mentioned, haven't caught on in the deep learning field? Is it just that probit doesn't give an edge in classification, and such a large portion of use cases don't require anything besides a classification-based loss?
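
For reference, a minimal sketch of how close the two binary losses are in code: the only difference is the link function that maps the score `z` to a probability. The function names are my own, and this is plain Python rather than any framework's loss API. It is not numerically hardened for large `|z|`.

```python
import math


def logistic_nll(z, y):
    """Negative log likelihood of label y in {0, 1} with a sigmoid link."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def probit_nll(z, y):
    """Same loss, but the link is the standard normal CDF."""
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# Both agree at z = 0 (p = 0.5); they diverge as the score grows.
for z in (0.0, 1.0, 2.0):
    print(z, logistic_nll(z, 0), probit_nll(z, 0))
```

Because the normal CDF has lighter tails than the sigmoid, the probit loss penalizes a confident wrong prediction more heavily at the same score.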