JustOneAvailableName

JustOneAvailableName t1_jea2dzf wrote

A software engineer's perspective on attention (quoting myself):

> You have to think about searching. If you search, you have a query (the search term), some way to relate the query to the knowledge base (whose size is unknown/irrelevant), and the knowledge base itself. If you have to write this as a mathematical function, you need something that matches a query against keys by how similar they are, and then returns the value corresponding to each key. The transformer equation is a pretty straightforward formula from that perspective. Each layer learns what it searches for, how it can be found, and which value it wants to transfer when requested.
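
To make the analogy concrete, here's roughly what that formula looks like in code (a minimal PyTorch sketch of scaled dot-product attention, not any particular implementation):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention: each query scores every key,
    the scores become weights, and the weighted values are returned."""
    # q: (seq, d), k: (seq, d), v: (seq, d_v)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # how well each query matches each key
    weights = F.softmax(scores, dim=-1)                     # normalise into a "which entries to read" distribution
    return weights @ v                                      # pull the corresponding values

# toy usage: 4 tokens, 8-dim queries/keys/values
x = torch.randn(4, 8)
out = attention(x, x, x)   # self-attention: the sequence queries itself
print(out.shape)           # torch.Size([4, 8])
```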

RWKV changes this by removing the query. So data is not requested anymore, only pushed. I am frankly surprised it seems to work thus far. Pushing data (each token determining for itself how important it is for whatever comes later) is not dependent on other states, which enables it to run as an RNN.

Edit: one thing I should mention: in RWKV importance also fades over time, so it has a recency bias.
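
A very rough sketch of the push-with-decay idea (this is a simplification, not the actual RWKV time-mixing formula, which also has a bonus term for the current token):

```python
import torch

def push_only_mixing(keys, values, decay):
    """Each token pushes its value with a self-determined weight exp(k_t);
    older contributions fade by exp(-decay) per step. No query: the state
    only depends on what came before, so it can run as an RNN."""
    num = torch.zeros_like(values[0])   # running weighted sum of pushed values
    den = torch.zeros_like(keys[0])     # running sum of weights
    outputs = []
    for k_t, v_t in zip(keys, values):
        num = torch.exp(-decay) * num + torch.exp(k_t) * v_t
        den = torch.exp(-decay) * den + torch.exp(k_t)
        outputs.append(num / den)       # recency-biased average of everything pushed so far
    return torch.stack(outputs)

# toy usage: 5 tokens, 8 channels, per-channel decay rate
keys, values = torch.randn(5, 8), torch.randn(5, 8)
out = push_only_mixing(keys, values, decay=torch.full((8,), 0.5))
print(out.shape)  # torch.Size([5, 8])
```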

3

JustOneAvailableName t1_j7v99gd wrote

If the model is sufficiently large (if not, you don't really need to wait long anyway) and no expensive CPU pre/postprocessing is done, the 3090 will be the bottleneck.

A single 3090 might not have enough memory to train GPT-2 Large, but it's probably close.

Fully training an LLM from scratch on a single 3090 is impossible, but you could finetune one.
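
To give an idea of what finetuning on a 3090 looks like, here's a minimal sketch assuming the Hugging Face transformers and peft libraries; LoRA keeps the trainable parameters and optimizer state small enough for 24 GB:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2-large"  # ~774M params, roughly the scale discussed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

# Train only small low-rank adapters instead of all weights,
# so the optimizer state stays tiny compared to full training.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a fraction of a percent of the full model

# from here, plug `model` into a normal training loop or Trainer
```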

3

JustOneAvailableName t1_j6cfdmr wrote

I worked with Wav2vec a year ago. WER on Dutch was (noticeably) better when fine-tuned than it was with GCP or Azure, and we didn't use any of our own labeled data. I used CTC mainly because it didn't reduce WER, hugely improved CER, and made inference a lot simpler. Inference cost was also a fraction (less than a cent per hour, assuming the GPU is fully utilized) of the paid services. I kinda assumed others reached the same conclusions I did back then, but these are my own conclusions, so there's plenty I could have done wrong.
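
Part of why inference gets simpler: greedy CTC decoding is just argmax per frame, collapse repeats, drop blanks. A minimal sketch (not tied to any specific Wav2vec checkpoint; the vocab mapping is assumed):

```python
import torch

def ctc_greedy_decode(logits, blank_id=0, id_to_char=None):
    """Greedy CTC decoding: take the argmax per frame, merge consecutive
    duplicates, then drop the blank token. No beam search, no LM."""
    ids = logits.argmax(dim=-1).tolist()                              # best token per frame
    collapsed = [i for i, prev in zip(ids, [None] + ids[:-1]) if i != prev]
    tokens = [i for i in collapsed if i != blank_id]
    if id_to_char is not None:
        return "".join(id_to_char[i] for i in tokens)
    return tokens

# toy usage: 6 frames, vocab of 4 tokens (0 = blank)
logits = torch.randn(6, 4)
print(ctc_greedy_decode(logits))
```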

Whisper offers this performance level practically out of the box, although with much higher inference costs. I, sadly, haven't had the time yet to finetune it. Nor have I found the time to optimize inference costs.

> E.g. it does not work well for streaming (getting instant recognition results, usually within 100ms, or 500ms, or max 1sec)

If you're okay with intermediate results getting revised later, this is doable, although at several times the cost. Offline works like a charm though.
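
The crude way to get improving intermediate results is to just re-transcribe a growing buffer, which is where the extra cost factor comes from. A rough sketch assuming the openai-whisper package:

```python
import numpy as np
import whisper

model = whisper.load_model("base")   # a small model keeps re-transcription latency down
SAMPLE_RATE = 16_000
buffer = np.zeros(0, dtype=np.float32)

def on_audio_chunk(chunk: np.ndarray) -> str:
    """Append the new audio and re-transcribe everything heard so far.
    Earlier words may change as more context arrives; the final pass
    over the full buffer is the 'offline' result."""
    global buffer
    buffer = np.concatenate([buffer, chunk])
    result = model.transcribe(buffer, language="en", fp16=False)
    return result["text"]            # intermediate hypothesis, refined on every call
```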

> Also, I'm quite sure it has some strange failure cases, as AED models tend to have, like repeating some labels, or skipping to the end of a sequence (or just chunk) when it got confused.

True that.

1

JustOneAvailableName t1_j2dih9m wrote

Although I do agree that Europe could use a boost to its army, I think it mainly lies in reducing bureaucracy and investing more in the domestic military industry. The EU has 20% more military personnel and three times the US's reserves. A large reason why expenditure is so much lower is that it's aimed at defense rather than overseas power projection, i.e. no aircraft carriers and nuclear subs.

24

JustOneAvailableName t1_j0uj1ee wrote

I based my answer on the 2020 and 2021 NeurIPS papers by institution; I couldn't find data for 2022. Anyway, Wav2vec was a huge paper for me and basically what I worked on for a large part of the past 2 years. And they are the maintainers of PyTorch and still carry the bulk of the work.

I really don't get how you can dismiss FB as having nearly zero influence on ML.

8