
Submitted by LesleyFair t3_10fw22o in deeplearning

Think About Scaling LLMs In 2020, a team of researchers from OpenAI released a [paper]( called: “Scaling Laws For Neural Language Models”. They observed a predictable decrease in training loss when increasing ... that is what people did. The models got larger and larger, with GPT-3 (175B), [Gopher]( (280B), [Megatron-Turing NLG]( (530B), just to name a few. But the bigger ... number of training tokens should double as well. This was published in DeepMind’s 2022 [paper]( “Training Compute-Optimal Large Language Models”. The researchers fitted over 400 language models ranging from
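
The proportional rule mentioned in the snippet above (double the model, double the training tokens) can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the `6 * N * D` compute estimate is the usual rule of thumb rather than anything stated in the post, and the 70B / 1.4T reference point is Chinchilla's reported configuration.

```python
# Sketch only: function names are hypothetical, and C ~ 6 * N * D is the
# common rule-of-thumb estimate of training compute, not an exact formula.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute as 6 * parameters * tokens."""
    return 6.0 * n_params * n_tokens


def scaled_tokens(base_params: float, base_tokens: float, new_params: float) -> float:
    """Scale the token budget in proportion to model size (2x params -> 2x tokens)."""
    return base_tokens * (new_params / base_params)


# Reference point (assumption for illustration): roughly 70B parameters, 1.4T tokens.
base_params, base_tokens = 70e9, 1.4e12

doubled_params = 2 * base_params
doubled_tokens = scaled_tokens(base_params, base_tokens, doubled_params)
compute_ratio = training_flops(doubled_params, doubled_tokens) / training_flops(
    base_params, base_tokens
)

print(f"tokens per parameter at the reference point: {base_tokens / base_params:.0f}")
print(f"token budget for the 2x model: {doubled_tokens:.2e}")
print(f"compute needed relative to the reference run: {compute_ratio:.1f}x")
```

Under these assumptions, doubling both parameters and tokens costs about four times the compute, which is why proportional scaling quickly becomes a data and budget question rather than only a model-size question.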


InfuriatinglyOpaque t1_ivb9otw wrote

Dorka, N., Burgard, W., Koltun, V., & Brox, T. (2020). Scaling Imitation Learning in Minecraft. []( Bramlage, L., & Cortese, A. (2021). Generalized Attention-Weighted Reinforcement Learning. Neural Networks. []( Frey ... Characterizing the dynamics of learning in repeated reference games. Cognitive Science, 44(6), e12845. []( Kumaran, V., Mott, B. W., & Lester, J. C. (2019). Generating Game Levels for Multiple Distinct Games with ... Hjelm, D., Bachman, P., & Courville, A. (2021). Pretraining Representations for Data-Efficient Reinforcement Learning. []( Sibert, C., Gray, W. D., & Lindstedt, J. K. (2017). Interrogating Feature Learning Models to Discover Insights


Nameless1995 t1_iyyl3m5 wrote

Technically LaMDA already uses an "external database", i.e., external tools (the internet, calculator, etc.) to retrieve information: (Section 6.2) It doesn't solve /u/ThePahtomPhoton's memory problem (I don't remember what GPT3 ... GPT3 level). One solution is using a kNN lookup in a non-differentiable manner: One solution is making Transformers semi-recurrent (process inside chunks in parallel, then sequentially process some coarse-compressed-chunk ... representation sequentially.). This can allow information to be carried in through the sequential process: Another solution is to augment Transformer with a State Space model which has shown great
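
The kNN-lookup idea mentioned above can be illustrated with a toy external memory. This is a minimal sketch, not the method from any linked paper: the memory contents, the dot-product scoring, and the function names are all assumptions made for illustration.

```python
import heapq
import math


def dot(a, b):
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def knn_lookup(query, memory, k=2):
    """Return the k (key, value) pairs whose keys score highest against the query.

    The top-k selection is a hard choice, so no gradient flows into
    *which* entries are picked; that is the non-differentiable part.
    """
    return heapq.nlargest(k, memory, key=lambda kv: dot(query, kv[0]))


def attend(query, retrieved):
    """Softmax-weighted average of the retrieved values (toy single-head attention)."""
    scores = [dot(query, key) for key, _ in retrieved]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(retrieved[0][1])
    return [
        sum(w * value[i] for w, (_, value) in zip(weights, retrieved)) / total
        for i in range(dim)
    ]


# Hypothetical memory of (key, value) pairs cached from earlier context.
memory = [
    ([1.0, 0.0], [10.0, 0.0]),
    ([0.0, 1.0], [0.0, 10.0]),
    ([0.9, 0.1], [9.0, 1.0]),
    ([-1.0, 0.0], [-10.0, 0.0]),
]
query = [1.0, 0.0]

neighbours = knn_lookup(query, memory, k=2)
context = attend(query, neighbours)
print(neighbours)
print(context)
```

Only the retrieved entries take part in attention, so the memory can grow far beyond the context window; a real system would replace the linear scan with an approximate nearest-neighbour index.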


FrogBearSalamander t1_jc5vvrb wrote

Would love to read some research papers if you have a link! - [Nonlinear Transform Coding]( - [An Introduction to Neural Data Compression]( - [SoundStream: An End-to-End Neural Audio Codec ... - Old but foundational: [End-to-end Optimized Image Compression]( - And this paper made the connection between compression models and VAEs: [Variational image compression with a scale hyperprior]( ... that SoundStream (mentioned above) uses residual VQ (RVQ). - [Image Compression with Product Quantized Masked Image Modeling]( uses a kind of VQ (subdivide the latent vectors and code separate to form a product


Submitted by LesleyFair t3_11alh40 in singularity

winner of most popular, or "weather" for short. \[5\] []( \[6\] []( \[7\] []( \[8\] []( \[9\] []( \[10\] []( ...]( \[13\] []( \[14\] []( \[15\] []( \[16\] [](


Submitted by fromnighttilldawn t3_y11a7r in MachineLearning

popular practice/belief is unsound or useless. Some famous examples are: **Troubling Trends in ML** []( **ML that Matters** []( **On the Convergence of ADAM** []( **On the Information Bottleneck ...]( **Implementation Matters in Deep Policy Gradients** []( (showed a certain purported algorithm gain is actually mainly due to code-level optimization) **Critique of Turing Award** [\~juergen/critique-turing-award-bengio-hinton-lecun.html]( ... basically a critique on the citation practice in ML) **Deep Learning a Critical Appraisal** []( However, these are a little bit dated. Does anyone have any recent critique papers of similar flavour


Submitted by mjrossman t3_11ws42u in Futurology

trend has been AI's societal impact. if anyone's read the[ recent job impact paper](, one of the factors that jumped out was the exposure of blockchain engineering to AI-based ... function of any group of market participants. with respect to ML frameworks like[ sparsely-gated MoE](,[ world models](,[ multimodality](, and[ adaptive agents](


qalis t1_j8driqb wrote

help. A bit of self promotion, but my Master's thesis was about GNNs: []( It should be very beginner-friendly, since I had to write it while also learning about this step ... articles are also great, e.g. []( or a well known (in this field) []( You should also definitely read papers about GCN (very intuitively written), GAT, GraphSAGE and GIN, the most ... with **a lot** of suspicion. This paper about fair comparison is becoming more and more used: []( This baseline, not GNN but similar, gives very strong results: []( I will


cnapun t1_j10a9jz wrote

better or worse results. Some not super-recent papers I can think of: []( []( []( []( (3.2) []( (2.2/2.4


serge_cell t1_j5akgwk wrote

several years ago and in [this same subreddit too]( For example: This is a recurring question; people ask it every year


benanne OP t1_j427zj0 wrote

very easy to use architectures where computation is largely decoupled from the sequence length, like Perceivers (,, or Recurrent Interface Networks ( This is highly speculative though ... aware that an autoregressive variant of the Perceiver architecture exists (, but it is actually quite a bit less general/flexible than Perceiver IO / the original Perceiver


olmec-akeru OP t1_iy2zjoi wrote

]( []( []( and the one speaking to categorical variables: [](


prototypist t1_j0c5p2j wrote

human-like decoder for language models and seeing what outputs humans prefer. Transformers supports [typical decoding]( and [contrastive search](, and there are papers and code out for [RankGen ..., [Time Control](, and [Contrastive Decoding]( (which is totally different from contrastive search


JNmbrs t1_isgqdyr wrote

work on these systems, the work seems to focus on improvements in (a) search algorithms (e.g., [](; (b) program abstraction/library compression (e.g., [\_jul11.pdf]( and [](; ... optimizing neural guidance (e.g., []( and [](; and (d) specification (e.g., []( and []( While obviously work proceeds in these (and other related) domains, I'd love


Throwaway00000000028 t1_iy42ker wrote

Blog: []( Youtube videos: []( Seminal papers: \- Denoising Diffusion Probabilistic Models: []( \- Improved Techniques for Training Score-based Generative Models: []( \- Hierarchical Text-Conditional Image Generation with ... CLIP Latents: []( Review papers: \- Understanding Diffusion Models: [](


tariban t1_irw5z8d wrote

problems, despite many claims to the contrary: * [Tabular Data: Deep Learning is Not All You Need]( * [In Search of Lost Domain Generalization]( * [Unsupervised Domain Adaptation: A Reality Check ... * [A Baseline for Few-Shot Image Classification](


dangerhexagon t1_j4x2yrp wrote

There are some papers on applying transformers to trees: []( , []( , []( And some recent work on tree extraction: []( There's also this paper which recovers ... tree by observing the leaf nodes: [](


BerenMillidge t1_iy814ur wrote

view them, is as an idealised exploration of a specific limit of PC. In recent work (, we expand on this limit idea and show that all current EBM approximations to BP, such ... number of its properties. We also have a more theoretical analysis of standard PC ( where we show that although it differs from backprop, it can also converge to minima of a supervised ... advantages of PC over BP including the ability for it to learn arbitrary recurrent computation graphs (, the fact that you can significantly speed it up with incremental variants, and that


DinosParkour t1_iy7j1hw wrote

choosing the most suitable ones) when it comes to computing the query-doc similarity. \[1\] []( \[2\] []( \[3\] []( \[4\] []( \[5\] [](


Submitted by kizumada t3_11rfxca in MachineLearning

understanding model in 2019 and evolved to ERNIE 3.0 Titan with 260 billion parameters. ERNIE 1.0: []( ERNIE 2.0: []( ERNIE 3.0: []( ERNIE for text-to-image ...]( ERNIE Bot live-stream on YouTube: [](


Submitted by IamTimNguyen t3_105v7el in MachineLearning

papers: Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes: []( Tensor Programs II: Neural Tangent Kernel for Any Architecture: []( Tensor Programs III: Neural ... Matrix Laws: []( Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks: []( Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer: [](


K3tchM t1_j46kidw wrote

have [this survey about ML for Combinatorial Optimization]( from Bengio, Lodi, and Prouvost. OpenAI's paper about a [robot hand learning to solve a Rubik's Cube]( Also check ... aims to combine neural network learning with logic-based reasoning. Gary Marcus wrote [an extensive note]( on the subject that I recommend as well


blazejd OP t1_ix7mr03 wrote

merging the two concepts of language models and RL-based feedback. Some papers mentioned are: []( and ["Experience Grounds Language"]( (although I didn't read them entirely yet). We could ... looking for more related resources, my thoughts were inspired by the field of language emergence ([]( and this work ([](


MetaAI_Official OP t1_izfk9ug wrote

could tackle along the way. That led to our papers on [human-level no-press Diplomacy](, [no-press Diplomacy from scratch](, [better modeling of humans in no-press Diplomacy ..., and [expert-level no-press Diplomacy](


Aseyhe t1_jc6ofrj wrote

gravity. Beyond these, here are articles discussing the point further: (1) [A diatribe on expanding space]( This is pretty technical, but it's the most direct attack on the idea of expanding ... cosmic expansion is simply not relevant to it. (2) [The kinematic origin of the cosmological redshift]( Very well written and less technical, although there are mathematical arguments. The main point of this ... space is nonexistent, not merely negligible. (3) [On The Relativity of Redshifts: Does Space Really "Expand"?]( The least technical of the batch, this article is also focused on the interpretation


eyeofthephysics t1_jbhu9d4 wrote

just tuned for sentiment analysis. There are two groups who developed models they called FinBERT []( and []( The first paper's model can be found [here]( ... tasks. Since you're interested in text embeddings, you may also be interested in this paper []( The focus of that paper is sentiment analysis, but the general idea of using a sentence


1azytux OP t1_jd2ho88 wrote

papers given : \- [Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering]( \- [Multimodal Chain-of-Thought Reasoning in Language Models]( and such .. with general chain of thought ... idea for language can be looked at [this paper]( I'm not sure if the link you provided will work, but as it's huge I might have missed (I've glanced


ttt05 t1_j0ju037 wrote

looks like I messed up the years, but all of these are good references) 1. MSP: []( 2. OE: []( 3. One vs all: [](


Aseyhe t1_jaka1l0 wrote

Further reading on *expanding space* not being a physically real phenomenon: * [A diatribe on expanding space]( * [The kinematic origin of the cosmological redshift]( * [On The Relativity of Redshifts: Does ... Space Really "Expand"?]( Further reading on cosmological dynamics with Newtonian gravity: * [The dynamics of Newtonian cosmology]( * or more generally, just search for "Newtonian cosmology


Aseyhe t1_j2kql8y wrote

public consciousness, here are some articles discussing the point further. (1) [A diatribe on expanding space]( This is pretty technical, but it's the most direct attack on the idea of expanding ... expansion is simply no longer relevant to it. (2) [The kinematic origin of the cosmological redshift]( Very well written and less technical, although there are mathematical arguments. The main point of this ... viewed as just a Doppler shift. (3) [On The Relativity of Redshifts: Does Space Really "Expand"?]( The least technical of the batch. This article is also focused on the interpretation


activatedgeek t1_j9jvj8h wrote

prefer functions that handle translation equivariance (not exactly true but only roughly due to pooling layers). Graph neural networks provide a relational inductive bias. Neural networks overall prefer simpler ... solutions, embodying Occam’s razor, another inductive bias. This argument is made theoretically using Kolmogorov complexity.
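
The translation-equivariance point above, including the caveat about pooling, can be checked on a toy signal. This is a sketch under simplifying assumptions (a 1D signal, circular boundary handling, and hypothetical helper names), not a statement about any particular network.

```python
def circular_conv(signal, kernel):
    """1D circular cross-correlation: out[i] = sum_j kernel[j] * signal[(i + j) % n]."""
    n = len(signal)
    return [
        sum(w * signal[(i + j) % n] for j, w in enumerate(kernel))
        for i in range(n)
    ]


def shift(signal, steps):
    """Circularly shift a signal to the right by `steps` positions."""
    n = len(signal)
    return [signal[(i - steps) % n] for i in range(n)]


def max_pool(signal, size=2):
    """Non-overlapping max pooling with stride equal to the window size."""
    return [max(signal[i:i + size]) for i in range(0, len(signal), size)]


signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0]
kernel = [0.5, -1.0, 0.25]

# Equivariance: convolving then shifting equals shifting then convolving.
conv_then_shift = shift(circular_conv(signal, kernel), 2)
shift_then_conv = circular_conv(shift(signal, 2), kernel)
print("convolution commutes with shift:", conv_then_shift == shift_then_conv)

# Pooling breaks exactness: a one-step input shift does not give a shifted output.
pooled = max_pool(signal)
pooled_after_shift = max_pool(shift(signal, 1))
is_some_shift = any(pooled_after_shift == shift(pooled, s) for s in range(len(pooled)))
print("pooled output is a shifted copy:", is_some_shift)
```

The convolution itself commutes with shifts, while the stride-2 pooling step only respects shifts that are multiples of its stride, which matches the "only roughly" qualification in the comment.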


adt t1_j9neq5w wrote

optimizations mean that you can squish models onto modern GPUs now (e.g. [int8]( etc.). Designed to fit onto a standard GPU, DeepMind Gato was bigger than I thought, with starting size ... paper, which compresses the models to 7MB? It lists some 1.2M-6.2M param models: []( My table shows... []( \*looks at table\* Smallest seems to be Microsoft Pact, which ... they were not really LLMs. They did train a 10M model during scaling research ([paper](, but the model hasn't been released
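
To make the int8 remark above concrete, here is a minimal sketch of symmetric "absmax" quantization on a handful of weights. It is an assumption-laden toy: production int8 schemes typically quantize per row or per block and treat outliers separately, and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works, avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -1.27, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", codes)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Each weight is stored in one byte instead of the four bytes of a float32, at the cost of a rounding error of at most half the scale per weight.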


MysteryInc152 t1_j81e986 wrote

Reply to comment by rretaemer1 in Open source AI by rretaemer1

LLMs are insanely impressive for a number of reasons. They emerge new abilities at scale - []( They build internal world models - []( They can be grounded to robotics ... robot's brain) - [](, They can teach themselves how to use tools - []( They've developed a theory of mind - []( I'm sorry but anyone who looks


ARGleave t1_iuseu7k wrote

think we ever claimed it was. This is building on the [adversarial policies threat model]( we introduced a couple of years ago. The norm-bounded perturbation threat model is an interesting lens ... think it's pretty limited: [Gilmer et al (2018)]( had an interesting exploration of alternative threat models for supervised learning, and we view our work as similar in spirit to [unrestricted adversarial ... examples](


albertzeyer t1_j65rtdq wrote

papers where people only use attention-based encoder-decoder (AED) for speech recognition. Some random papers: * []( * []( * []( See my PhD thesis for some overview over


PiGuyInTheSky t1_j9sx3nd wrote

problems to solve, yes, but there are also very technical problems to solve, like [power-seeking]( or [inner misalignment]( or [mechanistic interpretability]( that are much less


qalis t1_j6mbu5s wrote

and [GPT-3 lecture 2]( and [GPT-3 paper]( to learn about GPT-3 \- [InstructGPT page]( and [InstructGPT paper]( to learn ... RLHF is based on the Proximal Policy Optimization algorithm \- [PPO page]( and [PPO paper](


andreichiffa t1_j6n9lg6 wrote

memorizing a lot of information from the training dataset a little less than a year later: About a year after that Anthropic came out with a paper that suggested that there were ... that meant undertrained larger models did not do that much better and actually did need more data: Finally, more recent results from DeepMind did an additional pass on the topic and seem ... that a 4x smaller model trained for 4x the time would out-perform the larger model: Basically the original OpenAI paper did contradict a lot of prior research on overfitting and generalization


i-heart-turtles t1_iusf0zy wrote

full Jacobian; people do similar things in adversarial robustness so you can have a look. []( []( I think you should check the stuff on evaluating for disentanglement. This paper could ... also be useful for you: []( For VAE disentanglement, it is better for the Jacobian to be close to orthogonal than to just have a small norm
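
The contrast between a small-norm Jacobian and a near-orthogonal one can be made concrete with a finite-difference sketch. Everything here is illustrative: the two toy "decoders" (`rotate` and `shrink`), the penalty `||J^T J - I||_F^2`, and the helper names are assumptions, not the method of any linked paper.

```python
import math


def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, with J[i][j] = d f_i / d x_j."""
    n_out = len(f(x))
    columns = []
    for j in range(len(x)):
        upper = list(x)
        lower = list(x)
        upper[j] += eps
        lower[j] -= eps
        f_upper, f_lower = f(upper), f(lower)
        columns.append([(f_upper[i] - f_lower[i]) / (2 * eps) for i in range(n_out)])
    # columns[j][i] holds d f_i / d x_j; transpose into row-major J[i][j].
    return [[columns[j][i] for j in range(len(x))] for i in range(n_out)]


def frobenius_sq(J):
    """Squared Frobenius norm: the usual 'small Jacobian' penalty."""
    return sum(v * v for row in J for v in row)


def orthogonality_penalty(J):
    """|| J^T J - I ||_F^2, which is zero when the columns of J are orthonormal."""
    n = len(J[0])
    total = 0.0
    for a in range(n):
        for b in range(n):
            gram = sum(row[a] * row[b] for row in J)
            total += (gram - (1.0 if a == b else 0.0)) ** 2
    return total


def rotate(z, angle=0.7):
    """Toy map with an exactly orthogonal Jacobian."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * z[0] - s * z[1], s * z[0] + c * z[1]]


def shrink(z):
    """Toy map with a small-norm but far-from-orthogonal Jacobian."""
    return [0.1 * z[0], 0.1 * z[1]]


point = [0.3, -1.2]
for name, fn in [("rotate", rotate), ("shrink", shrink)]:
    J = jacobian(fn, point)
    print(name, "norm^2:", frobenius_sq(J), "orthogonality penalty:", orthogonality_penalty(J))
```

The shrinking map wins on the norm penalty while collapsing all latent directions, whereas the rotation keeps them at unit scale and mutually perpendicular, which is the property the orthogonality penalty rewards.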


lorepieri t1_j1z4zp5 wrote

years ago and nobody took the effort to put into a modern GPU accelerated codebase. []( Neurosymbolic AI: The 3rd Wave []( Neuro-Symbolic Artificial Intelligence: Current Trends [](