Think About Scaling LLMs In 2020, a team of researchers from OpenAI released a paper called "Scaling Laws for Neural Language Models". They observed a predictable decrease in training loss when increasing ... that is what people did. The models got larger and larger, with GPT-3 (175B), Gopher (280B), and Megatron-Turing NLG (530B), just to name a few. But the bigger ... number of training tokens should double as well. This was published in DeepMind's 2022 paper "Training Compute-Optimal Large Language Models". The researchers fitted over 400 language models ranging from


Technically LaMDA already uses an "external database", i.e. external tools (the internet, a calculator, etc.), to retrieve information. It doesn't solve the memory problem (I don't remember what GPT-3 ... GPT-3 level). One solution is using a kNN lookup in a non-differentiable manner. Another solution is making Transformers semi-recurrent (process inside chunks in parallel, then sequentially process some coarse-compressed-chunk ... representation sequentially). This can allow information to be carried through the sequential process. Another solution is to augment the Transformer with a state space model, which has shown great


popular practice/belief is unsound or useless. Some famous examples are: **Troubling Trends in ML**, **ML that Matters**, **On the Convergence of Adam**, **On the Information Bottleneck**, **Implementation Matters in Deep Policy Gradients** (showed that a certain purported algorithmic gain is actually mainly due to code-level optimizations), **Critique of Turing Award** (basically a critique of the citation practice in ML), **Deep Learning: A Critical Appraisal**. However, these are a little bit dated. Does anyone have any recent critique papers of similar flavour


trend has been AI's societal impact. If anyone's read the recent job impact paper, one of the factors that jumped out was the exposure of blockchain engineering to AI-based ... function of any group of market participants. With respect to ML frameworks like sparsely-gated MoE, world models, multimodality, and adaptive agents


help. A bit of self-promotion, but my Master's thesis was about GNNs. It should be very beginner-friendly, since I had to write it while also learning about this step ... articles are also great. You should also definitely read papers about GCN (very intuitively written), GAT, GraphSAGE and GIN, the most ... with **a lot** of suspicion. This paper about fair comparison is being used more and more. This baseline, not a GNN but similar, gives very strong results. I will


several years ago and in this same subreddit too. For example: This is a recurring question; people ask it every year


very easy to use architectures where computation is largely decoupled from the sequence length, like Perceivers or Recurrent Interface Networks. This is highly speculative though ... aware that an autoregressive variant of the Perceiver architecture exists, but it is actually quite a bit less general/flexible than Perceiver IO / the original Perceiver


human-like decoder for language models and seeing what outputs humans prefer. Transformers supports typical decoding and contrastive search, and there are papers and code out for RankGen, Time Control, and Contrastive Decoding (which is totally different from contrastive search)
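
For anyone who hasn't looked at typical decoding, here is a minimal pure-Python sketch of its filtering step as I understand it: rank tokens by how close their surprisal is to the entropy of the distribution, then keep the smallest set that covers a target probability mass. The function name `typical_filter`, the parameter name `typical_p`, and the toy distribution are my own illustrative choices. This is not the Transformers implementation.

```python
import math

def typical_filter(probs, typical_p=0.9):
    """Keep tokens whose surprisal (-log p) is closest to the entropy,
    until their cumulative probability reaches typical_p, then renormalize."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank token ids by |surprisal - entropy|, most "typical" first.
    ranked = sorted(
        (i for i, p in enumerate(probs) if p > 0),
        key=lambda i: abs(-math.log(probs[i]) - entropy),
    )
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= typical_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy next-token distribution (hypothetical numbers).
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
loose = typical_filter(probs, typical_p=0.8)
tight = typical_filter(probs, typical_p=0.3)
```

With this toy distribution, the tight setting drops the single most likely token, because its surprisal sits further from the entropy than that of the two mid-probability tokens. That is the main behavioural difference from top-k or nucleus sampling.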


work on these systems, the work seems to focus on improvements in (a) search algorithms; (b) program abstraction/library compression; (c) optimizing neural guidance; and (d) specification. While obviously work proceeds in these (and other related) domains, I'd love


problems, despite many claims to the contrary:

* Tabular Data: Deep Learning is Not All You Need
* In Search of Lost Domain Generalization
* Unsupervised Domain Adaptation: A Reality Check
* A Baseline for Few-Shot Image Classification


view them, is as an idealised exploration of a specific limit of PC. In recent work, we expand on this limit idea and show that all current EBM approximations to BP, such ... number of its properties. We also have a more theoretical analysis of standard PC where we show that although it differs from backprop, it can also converge to minima of a supervised ... advantages of PC over BP including the ability for it to learn arbitrary recurrent computation graphs, the fact that you can significantly speed it up with incremental variants, and that


have this survey about ML for Combinatorial Optimization from Bengio, Lodi, and Prouvost. OpenAI's paper about a robot hand learning to solve a Rubik's Cube. Also check ... aims to combine neural network learning with logic-based reasoning. Gary Marcus wrote an extensive note on the subject that I recommend as well


merging the two concepts of language models and RL-based feedback. Some papers mentioned are: "Experience Grounds Language" (although I haven't read them entirely yet). We could ... looking for more related resources, my thoughts were inspired by the field of language emergence and this work


could tackle along the way. That led to our papers on human-level no-press Diplomacy, no-press Diplomacy from scratch, better modeling of humans in no-press Diplomacy, and expert-level no-press Diplomacy


gravity. Beyond these, here are articles discussing the point further: (1) A diatribe on expanding space. This is pretty technical, but it's the most direct attack on the idea of expanding ... cosmic expansion is simply not relevant to it. (2) The kinematic origin of the cosmological redshift. Very well written and less technical, although there are mathematical arguments. The main point of this ... space is nonexistent, not merely negligible. (3) On The Relativity of Redshifts: Does Space Really "Expand"? The least technical of the batch, this article is also focused on the interpretation


just tuned for sentiment analysis. There are two groups who developed models they called FinBERT. The first paper's model can be found here ... tasks. Since you're interested in text embeddings, you may also be interested in this paper. The focus of that paper is sentiment analysis, but the general idea of using a sentence


papers given:

- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
- Multimodal Chain-of-Thought Reasoning in Language Models

and such .. with general chain of thought ... idea for language can be found in this paper. I'm not sure if the link you provided will work, but as it's huge I might have missed (I've glanced


Further reading on *expanding space* not being a physically real phenomenon:

* A diatribe on expanding space
* The kinematic origin of the cosmological redshift
* On The Relativity of Redshifts: Does Space Really "Expand"?

Further reading on cosmological dynamics with Newtonian gravity:

* The dynamics of Newtonian cosmology
* or more generally, just search for "Newtonian cosmology"


prefer functions that handle translation equivariance (not exactly true but only roughly due to pooling layers). Graph neural networks provide a relational inductive bias. Neural networks overall prefer simpler ... solutions, embodying Occam's razor, another inductive bias. This argument is made theoretically using Kolmogorov complexity.
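
To make the "only roughly due to pooling layers" caveat concrete, here is a small pure-Python sketch. It assumes a 1-D signal with circular (wrap-around) boundaries and non-overlapping max pooling; the helper names and the toy signal are mine, chosen only for illustration.

```python
def circular_conv1d(signal, kernel):
    """Circular 1-D cross-correlation: out[i] = sum_k kernel[k] * signal[(i + k) % n]."""
    n = len(signal)
    return [sum(w * signal[(i + k) % n] for k, w in enumerate(kernel)) for i in range(n)]

def shift(signal, s):
    """Circularly shift a signal to the right by s positions."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]

def max_pool(signal, size):
    """Non-overlapping max pooling (stride equal to the window size)."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

x = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 5.0, 1.0]   # toy input
k = [1.0, -1.0, 0.5]                            # toy filter

# Convolution commutes with shifting: exact equivariance.
equivariant = shift(circular_conv1d(x, k), 1) == circular_conv1d(shift(x, 1), k)

# Pooling only commutes with shifts that are multiples of the stride.
y = circular_conv1d(x, k)
pooled = max_pool(y, 2)
pooled_aligned = max_pool(shift(y, 2), 2)      # equals shift(pooled, 1)
pooled_misaligned = max_pool(shift(y, 1), 2)   # not a shifted copy of pooled
```

The convolution alone is exactly equivariant here. Once pooling with stride 2 is added, a shift by 2 still maps to a clean shift of the output, but a shift by 1 produces different pooled values altogether.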


optimizations mean that you can squish models onto modern GPUs now (e.g. int8). Designed to fit onto a standard GPU, DeepMind Gato was bigger than I thought, with a starting size ... paper, which compresses the models to 7MB? It lists some 1.2M-6.2M param models. My table shows... The smallest seems to be Microsoft Pact, which ... they were not really LLMs. They did train a 10M model during scaling research, but the model hasn't been released
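
On the int8 point, a rough sketch of why it helps with fitting models onto a GPU. It assumes the simplest scheme, symmetric per-tensor quantization with a single scale. Real int8 inference libraries use more refined variants, and the function names and the 7B-parameter figure below are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone (ignores activations and any KV cache)."""
    return n_params * bytes_per_param / 1e9

weights = [0.02, -0.5, 0.31, 1.27, -1.0]       # toy weight tensor
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

fp16_gb = weight_memory_gb(7 * 10**9, 2)       # hypothetical 7B-param model, 2 bytes each
int8_gb = weight_memory_gb(7 * 10**9, 1)       # same model, 1 byte each
```

The rounding error per weight is at most half the scale, and the weight memory is halved relative to fp16.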


LLMs are insanely impressive for a number of reasons. New abilities emerge at scale. They build internal world models. They can be grounded to robotics ... robot's brain. They can teach themselves how to use tools. They've developed a theory of mind. I'm sorry, but anyone who looks


think we ever claimed it was. This is building on the adversarial policies threat model we introduced a couple of years ago. The norm-bounded perturbation threat model is an interesting lens ... think it's pretty limited: Gilmer et al. (2018) had an interesting exploration of alternative threat models for supervised learning, and we view our work as similar in spirit to unrestricted adversarial examples


problems to solve, yes, but there are also very technical problems to solve, like power-seeking or inner misalignment or mechanistic interpretability that are much less


memorizing a lot of information from the training dataset a little less than a year later. About a year after that Anthropic came out with a paper that suggested that there were ... that meant undertrained larger models did not do that much better and actually did need more data. Finally, more recent results from DeepMind did an additional pass on the topic and seem ... that a 4x smaller model trained for 4x the time would out-perform the larger model. Basically the original OpenAI paper did contradict a lot of prior research on overfitting and generalization
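
A quick arithmetic check on the "4x smaller, 4x longer" comparison. It uses the common approximation that training compute is about 6 x parameters x tokens, and it reads "4x the time" as 4x the training tokens. Both are my assumptions, and the model sizes and token counts below are made-up round numbers, not figures from any of the papers.

```python
def training_flops(n_params, n_tokens):
    """Common rule of thumb: training compute is roughly 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

# Hypothetical "large, undertrained" model.
large_params, large_tokens = 40 * 10**9, 100 * 10**9
# Hypothetical model with 4x fewer parameters trained on 4x more tokens.
small_params, small_tokens = large_params // 4, large_tokens * 4

large_flops = training_flops(large_params, large_tokens)
small_flops = training_flops(small_params, small_tokens)
same_budget = large_flops == small_flops

tokens_per_param_large = large_tokens / large_params   # 2.5
tokens_per_param_small = small_tokens / small_params   # 40.0
```

Under this approximation the two runs cost the same compute, so the comparison is really about how a fixed budget is split between parameters and data.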


full Jacobian. People do similar things in adversarial robustness, so you can have a look. I think you should check the stuff on evaluating for disentanglement. This paper could ... also be useful for you. For VAE disentanglement, it's better for the Jacobian to be close to orthogonal than to just have a small norm
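
A toy illustration of the difference between "small norm" and "close to orthogonal". It assumes a finite-difference Jacobian and a penalty on the off-diagonal entries of J^T J. The two decoders, the function names, and the choice of penalty are all my own for illustration, not taken from any particular paper.

```python
def jacobian(f, z, h=1e-5):
    """Central-difference Jacobian of f at z; rows index outputs, columns index latents."""
    cols = []
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        cols.append([(a - b) / (2 * h) for a, b in zip(f(z_plus), f(z_minus))])
    return [list(row) for row in zip(*cols)]  # transpose columns into rows

def frobenius_sq(J):
    """Squared Frobenius norm of the Jacobian."""
    return sum(v * v for row in J for v in row)

def orthogonality_penalty(J):
    """Sum of squared off-diagonal entries of J^T J (zero when columns are orthogonal)."""
    n_latent = len(J[0])
    penalty = 0.0
    for a in range(n_latent):
        for b in range(n_latent):
            if a != b:
                dot = sum(row[a] * row[b] for row in J)
                penalty += dot * dot
    return penalty

def decoder_a(z):
    # Each latent drives its own output: large norm, orthogonal columns.
    return [3.0 * z[0], 3.0 * z[1]]

def decoder_b(z):
    # Both latents do the same thing: small norm, fully entangled.
    return [0.5 * (z[0] + z[1]), 0.5 * (z[0] + z[1])]

z0 = [0.3, -0.7]
J_a, J_b = jacobian(decoder_a, z0), jacobian(decoder_b, z0)
norm_a, norm_b = frobenius_sq(J_a), frobenius_sq(J_b)
penalty_a, penalty_b = orthogonality_penalty(J_a), orthogonality_penalty(J_b)
```

Here `decoder_b` wins on norm but its two latents are interchangeable, while `decoder_a` has the larger norm and a zero orthogonality penalty. A pure norm penalty would prefer the entangled decoder.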


years ago and nobody made the effort to put it into a modern GPU-accelerated codebase. Neurosymbolic AI: The 3rd Wave. Neuro-Symbolic Artificial Intelligence: Current Trends