koolaidman123 t1_j9qhp9n wrote
Reply to [D] Model size vs task complexity by Fine-Topic-6127
i have rarely encountered situations where scaling up models (e.g. resnet34 -> resnet50, deberta base -> deberta large/xl) doesn't help. whether it's practical to do so may be a different story
koolaidman123 t1_j6y07he wrote
Reply to comment by was_der_Fall_ist in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
have you even read the instructGPT paper?
>In Stiennon et al. (2020), the RM is trained on a dataset of comparisons between two model outputs on the same input. They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.
>
>In order to speed up comparison collection, we present labelers with anywhere between K = 4 and K = 9 responses to rank. This produces (K choose 2) comparisons for each prompt shown to a labeler. Since comparisons are very correlated within each labeling task, we found that if we simply shuffle the comparisons into one dataset, a single pass over the dataset caused the reward model to overfit. Instead, we train on all (K choose 2) comparisons from each prompt as a single batch element. This is much more computationally efficient because it only requires a single forward pass of the RM for each completion (rather than (K choose 2) forward passes for K completions) and, because it no longer overfits, it achieves much improved validation accuracy and log loss. Specifically, the loss function for the reward model is:
>
>loss(θ) = −1/(K choose 2) · E_{(x, y_w, y_l) ~ D} [ log(σ(r_θ(x, y_w) − r_θ(x, y_l))) ]    (1)
>
>where r_θ(x, y) is the scalar output of the reward model for prompt x and completion y with parameters θ, y_w is the preferred completion out of the pair of y_w and y_l, and D is the dataset of human comparisons.
you know that figure you're referencing comes from the instructgpt paper... right?
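to make that loss concrete, here's a minimal pytorch-style sketch of the pairwise RM objective (`reward_model` and the id arguments are placeholders, not openai's actual code):

```python
import torch.nn.functional as F

def pairwise_rm_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    """Eq. (1): -E[log sigmoid(r(x, y_w) - r(x, y_l))] over a batch of comparisons."""
    r_chosen = reward_model(prompt_ids, chosen_ids)      # scalar reward per (prompt, preferred completion)
    r_rejected = reward_model(prompt_ids, rejected_ids)  # scalar reward per (prompt, rejected completion)
    # maximizing log sigma(r_w - r_l) pushes the preferred completion's reward above the rejected one's
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

note the paper's detail that all (K choose 2) comparisons from one prompt go into a single batch element, which is what keeps the RM from overfitting to repeated completions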
koolaidman123 t1_j6xd93p wrote
None of the models you listed are instruction tuned, so it's no surprise that gpt3 performs better
Some better models are gpt-jt and flan-t5 11b, which are probably the best open-source options right now, and maybe opt-iml?
koolaidman123 t1_j6x2b05 wrote
Reply to comment by [deleted] in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
sure? you can have multiple ways of ranking, but:
- the instructGPT paper strictly uses pairwise comparisons
- asking annotators to rank however many passages 1 through k in one shot is much more difficult and noisier than asking for pairwise comparisons (the sketch below shows how a labeler's ranking gets expanded into pairs)
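for reference, a tiny sketch (plain python, hypothetical names) of how one labeler's ranking over k responses expands into the (k choose 2) pairwise comparisons the RM is actually trained on:

```python
from itertools import combinations

def ranking_to_pairs(responses_ranked_best_to_worst):
    """Expand a single labeler's ranking into (winner, loser) comparison pairs."""
    # earlier in the list == preferred, so every pair comes out already ordered (better, worse)
    return list(combinations(responses_ranked_best_to_worst, 2))

print(len(ranking_to_pairs(["resp_a", "resp_b", "resp_c", "resp_d"])))  # 4 choose 2 = 6 comparisons
```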
koolaidman123 t1_j6wtmdj wrote
Reply to [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
- Outputs are not ranked 1-5; they're compared two at a time, head to head, and the rm predicts which one humans prefer
- Empirically they found rl outperformed supervised fine-tuning (sft) on human evaluations, meaning humans generally preferred the rlhf model over the sft model. The sft model was fine-tuned on the top-ranked answer
As to why rl outperforms sft: not a lot of orgs have the resources to test this (yet), but I've heard a plausible theory from ai2 that the main difference is that sft uses a token-level loss whereas the rl reward scores the entire output, so maybe it's not that rl is "better", it's that the next-token prediction objective is worse (rough sketch of the contrast below)
Researchers I've spoken with don't believe rl is the critical component enabling these models, and think we could eventually discover a training regime that lets sft perform on par with (or better than) rl
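to make the token-level vs sequence-level distinction concrete, a rough sketch (instructgpt actually uses ppo; this uses plain reinforce just to show the shape of the objective):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    """Token-level: every position contributes its own cross-entropy term vs the reference text."""
    # logits: (seq_len, vocab_size), target_ids: (seq_len,)
    return F.cross_entropy(logits, target_ids)

def reinforce_loss(logits, sampled_ids, reward):
    """Sequence-level: one scalar reward for the whole sampled output scales the
    log-probability of the entire sequence."""
    log_probs = F.log_softmax(logits, dim=-1)                             # (seq_len, vocab_size)
    seq_log_prob = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).sum()  # log p(whole sampled sequence)
    return -reward * seq_log_prob
```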
koolaidman123 t1_j6ug73c wrote
Reply to comment by RandomCandor in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
it is, the memorization rate is like 0.03% or less
https://twitter.com/BlancheMinerva/status/1620781482209087488
koolaidman123 t1_j5wbk37 wrote
Reply to comment by [deleted] in [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
That's not the same thing...
Gradient accumulation calculates the loss on each micro-batch; that doesn't work with in-batch negatives because you need to compare inputs from batch 1 against inputs from batch 2, hence offloading and caching the predictions, then calculating the loss over the one combined batch
That's why gradient accumulation doesn't work for simulating large batch sizes in contrastive learning, if you're familiar with it
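to see why, here's a minimal in-batch-negatives loss (infonce-style, placeholder names): the softmax denominator for each example ranges over every other example in the batch, so splitting the batch changes the loss itself, not just the gradient bookkeeping

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(queries, keys, temperature=0.07):
    """InfoNCE with in-batch negatives: for row i, keys[i] is the positive and
    every other key in the same batch is a negative."""
    # queries, keys: (batch, dim), assumed L2-normalized
    sim = queries @ keys.t() / temperature                           # (batch, batch) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)   # positive for row i is column i
    return F.cross_entropy(sim, targets)
```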
koolaidman123 t1_j5uk2ai wrote
Reply to [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
cache your predictions (and labels) from each smaller batch until you've accumulated the target batch size, then run your loss function
so instead of calculating the loss per micro-batch and accumulating gradients like gradient accumulation does, you only calculate the loss once you've reached the target batch size
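rough sketch of that idea (placeholder encoder/optimizer; note that naively keeping every micro-batch's activations alive costs memory, which is what tricks like gradcache are for):

```python
import torch
import torch.nn.functional as F

def contrastive_step(encoder, optimizer, micro_batches, temperature=0.07):
    """Accumulate embeddings from several micro-batches, then compute a single
    in-batch-negatives loss over the full effective batch."""
    q_list, k_list = [], []
    for q_inputs, k_inputs in micro_batches:          # e.g. 8 micro-batches of 64 -> effective batch of 512
        q_list.append(F.normalize(encoder(q_inputs), dim=-1))
        k_list.append(F.normalize(encoder(k_inputs), dim=-1))
    queries, keys = torch.cat(q_list), torch.cat(k_list)

    sim = queries @ keys.t() / temperature            # negatives now span all micro-batches
    targets = torch.arange(queries.size(0), device=queries.device)
    loss = F.cross_entropy(sim, targets)              # loss computed once, at the target batch size

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```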
koolaidman123 t1_j5ujfpv wrote
Reply to comment by altmly in [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996
contrastive methods require in-batch negatives, you can't replicate that with grad accumulation
koolaidman123 t1_j4v2uyq wrote
Reply to comment by JClub in [D] RLHF - What type of rewards to use? by JClub
it's just a binary pairwise comparison of which of the 2 outputs is preferred, read the instructgpt paper or the wandb post https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2#train-the-reward-model
koolaidman123 t1_j4uuko0 wrote
Reply to comment by JClub in [D] RLHF - What type of rewards to use? by JClub
chatgpt (assuming it uses the same training as instructgpt) doesn't use a numerical scale; everything is a comparison between 2 (out of k) sampled outputs for a prompt, so everything is a pairwise comparison
koolaidman123 t1_iy8hbxd wrote
Reply to comment by DinosParkour in [D] Difference between sparse and dense information retrieval by itsyourboiirow
splade-v2 and sparta are exclusively on the sparse leaderboards, and they use bert
the point is to dispel the notion that sparse retrieval somehow means no deep learning is involved; that's conflating dense retrieval with neural retrieval
koolaidman123 t1_iy89kz0 wrote
Reply to comment by [deleted] in [D] Difference between sparse and dense information retrieval by itsyourboiirow
Colbert v2 is literally listed in the beir sparse leaderboards...
Sparse refers to the embedding vector(s), not the model
And ranking/reranking refers to the same thing, but it's still distinct from retrieval, which was my point
koolaidman123 t1_iy6hhbj wrote
Reply to comment by [deleted] in [D] Difference between sparse and dense information retrieval by itsyourboiirow
sparse retrieval isn't mutually exclusive with deep learning: splade v2 and colbert v2 are sparse methods because they still produce high-dimensional sparse vectors, but both leverage bert models to create those sparse representations
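for a concrete picture of "neural but sparse", this is roughly how a splade-style model builds its sparse vector (a sketch only, and the checkpoint name is just an illustrative choice):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # illustrative splade checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def splade_vector(text):
    """Encode text into a |vocab|-sized sparse vector: log-saturated relu of the
    MLM logits, max-pooled over tokens (the splade formulation)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits                          # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits))                    # log(1 + relu(logits))
    weights = weights * inputs["attention_mask"].unsqueeze(-1)   # zero out padding positions
    return weights.max(dim=1).values.squeeze(0)                  # (vocab_size,), mostly zeros

vec = splade_vector("difference between sparse and dense retrieval")
print((vec > 0).sum().item(), "non-zero dims out of", vec.numel())
```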
also cross-encoders aren't considered retrievers, but rerankers
koolaidman123 t1_ixdbjen wrote
Reply to [P] BetterTransformer: PyTorch-native free-lunch speedups for Transformer-based models by fxmarty
does this integrate with torchscript/triton?
if so, what's the speedup over either of those methods?
if not, how does bettertransformer compare?
koolaidman123 t1_jbtkuif wrote
Reply to comment by maxToTheJ in [D] Is Pytorch Lightning + Wandb a good combination for research? by gokulPRO
there are some fairly annoying things with pytorch lightning, and some things are definitely harder to do in lightning because of how it's structured. but overall, for practical purposes, i've been liking lightning a lot more than pytorch + accelerate, especially now that you can basically use colossal ai with lightning instead of deepspeed