dataslacker

dataslacker t1_j7m1xjg wrote

Take it from someone who learned C++ first: start with Python. You are actually very unlikely to be interviewed in C++; the industry standard is Python. Know your algorithms and data structures well enough to do the intermediate-level questions on HackerRank and you'll be in good shape.
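For calibration, "intermediate level" usually means the kind of hash-map pattern below. This is just my own illustrative sketch of the style of question, not a specific HackerRank problem:

```python
# Illustrative only: the classic two-sum, a typical hash-map interview pattern.
def two_sum(nums, target):
    """Return indices of two numbers that add up to target, else None."""
    seen = {}  # value -> index of values already walked past
    for i, x in enumerate(nums):
        if target - x in seen:          # complement already seen -> done
            return seen[target - x], i
        seen[x] = i
    return None

print(two_sum([2, 7, 11, 15], 9))       # (0, 1)
```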

2

dataslacker t1_j7m1dho wrote

I’ve been working in ML for 8 years and I’ve never seen or heard of a scientist being hired without at least one coding interview. I've never seen someone just “write down an algorithm” and hand it off to an engineer. I would really like to hear where you saw this, because it’s nowhere near my experience at big tech companies.

3

dataslacker t1_j7gfa6c wrote

There’s probably some resentment that Google and Meta could have released something similar over a year ago but chose not to because they didn’t think it would be responsible. Now the company that was founded on being “responsible” has released it to the world in a way that hasn’t satisfied a lot of researchers.

5

dataslacker t1_j4z8zm4 wrote

Yes, your explanations are clear and match how I understood the paper, but I feel like some of the motivation for the RL training is missing. Why not "pseudo labeling"? Why is the RL approach better? Also, the reward score is non-differentiable because it was designed that way, but it could have been designed to be differentiable. For example, instead of decoding the log probs, why not train the reward model on them directly? You can still obtain the labels by decoding them, but that doesn't mean the decoded tokens have to be the input to the reward model. There are a number of design choices the authors made that are not motivated in the paper. I haven't read the references, so maybe they're motivated elsewhere in the literature, but RL seems like a strange choice for this problem since there isn't a dynamic environment that the agent is interacting with.
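To make the differentiable-reward idea concrete, here's a minimal sketch (my own illustration, not anything from the paper) where the reward model consumes the policy's token distributions instead of decoded ids, so the reward gradient flows straight back to the policy. All module names and shapes are made up:

```python
# Hypothetical sketch: a reward model over soft token distributions,
# so no argmax/decoding breaks differentiability.
import torch
import torch.nn as nn

vocab_size, hidden = 32000, 256

class SoftRewardModel(nn.Module):
    """Scores a batch of soft token distributions (batch, seq, vocab)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(vocab_size, hidden, bias=False)  # soft embedding lookup
        self.score = nn.Linear(hidden, 1)

    def forward(self, token_probs):
        h = self.embed(token_probs).mean(dim=1)   # pool over the sequence
        return self.score(h).squeeze(-1)          # one scalar reward per sequence

policy_logits = torch.randn(4, 16, vocab_size, requires_grad=True)  # stand-in for LLM output
token_probs = policy_logits.softmax(dim=-1)       # differentiable, no decoding step
reward = SoftRewardModel()(token_probs)
loss = -reward.mean()                             # maximize reward directly by gradient
loss.backward()                                   # gradients reach the policy logits
```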

3

dataslacker t1_j4yraoc wrote

Sorry, I don't think I did a great job asking the question. The reward model, as I understand it, will rank the N generated responses from the LLM. So why not take the top-ranked response as ground truth, or a weak label if you'd like, and train in a supervised fashion predicting the next token? This would avoid the RL training, which I understand is inefficient and unstable.
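A rough sketch of the supervised alternative I mean (purely illustrative; the Hugging-Face-style policy.generate / reward_model interfaces are placeholders, not the paper's setup): keep the reward model's top-ranked sample and do ordinary next-token cross-entropy on it.

```python
# Hypothetical best-of-N supervised step instead of RL.
import torch
import torch.nn.functional as F

def best_of_n_step(policy, reward_model, tokenizer, prompt, n=8):
    prompt_ids = tokenizer.encode(prompt, return_tensors="pt")
    # 1. Sample N candidate responses from the current policy.
    candidates = [policy.generate(prompt_ids, do_sample=True, max_new_tokens=128)
                  for _ in range(n)]
    # 2. Rank them with the reward model; the best one becomes a weak label.
    scores = torch.stack([reward_model(c).squeeze() for c in candidates])
    best = candidates[scores.argmax().item()]
    # 3. Standard supervised next-token prediction on the chosen response.
    logits = policy(best).logits[:, :-1]            # predict token t+1 from t
    targets = best[:, 1:]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    loss.backward()
    return loss
```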

2

dataslacker t1_j2z5slt wrote

It depends on how your labels, or more generally the target data distribution, are generated. If they're generated by human subjectivity, yeah, I would agree. However, it's not hard to think of situations where the labels are the output of a physical process or, in the case of prediction, a future event. In those cases the label is not ambiguous and thus not subject to human interpretation. You also have RL systems that play games against each other and reach superhuman performance that way. Read about AlphaGo or AlphaStar, for example.

1