Submitted by CS-fan-101 t3_11yzsz6 in MachineLearning

Note #2: We are revising the name to Sparse-IFT. We appreciate the candid feedback and look forward to hearing any additional feedback you have on our research.

Note: Thank you r/MachineLearning for providing so many awesome naming alternatives! We'll revisit the acronym and update accordingly.

We are excited to announce the availability of our paper on arXiv on Sparse Iso-FLOP Transformations (Sparse-IFT), which use sparsity to increase accuracy while maintaining the same FLOPs as the dense model. In this research, we replace dense layers with Sparse-IFT layers and significantly improve accuracy on computer vision and natural language processing tasks without modifying the training hyperparameters.

Some of the highlights of this work include ResNet-18 on ImageNet achieving a 3.5% accuracy improvement and GPT-3 Small on WikiText-103 reducing perplexity by 0.4, both matching larger dense model variants that have 2x or more FLOPs.

Sparse-IFT is simple to use, provides a larger search space to find optimal sparse masks, and is parameterized by a single hyperparameter - the sparsity level.
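
As a concrete, purely illustrative sketch of the idea (not the official implementation; the class name, the random mask, and the widening factor k = 1/sqrt(1 - s) are assumptions made here for illustration), a "Sparse Wide"-style layer can be written so that widening the layer while masking a fraction s of its weights keeps the non-zero multiply-accumulates equal to the dense baseline, with the sparsity level s as the single knob:

```python
import math
import torch
import torch.nn as nn

class SparseWideLinear(nn.Module):
    """Illustrative sketch: widen a dense layer and mask its weights so the
    number of non-zero multiply-accumulates matches the original dense layer."""

    def __init__(self, in_features: int, out_features: int, sparsity: float):
        super().__init__()
        # Widen both dimensions by k = 1/sqrt(1 - s):
        # FLOPs ~ (k*d_in) * (k*d_out) * (1 - s) = d_in * d_out  (iso-FLOP).
        k = 1.0 / math.sqrt(1.0 - sparsity)
        wide_in, wide_out = round(in_features * k), round(out_features * k)
        self.linear = nn.Linear(wide_in, wide_out)
        # Random unstructured mask at the requested sparsity level.
        self.register_buffer("mask", (torch.rand(wide_out, wide_in) > sparsity).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)

# A dense nn.Linear(512, 512) does ~512*512 multiply-accumulates per token;
# at 75% sparsity the layer below is ~2x wider but has the same non-zero MACs.
layer = SparseWideLinear(512, 512, sparsity=0.75)
```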

This is independent of the research we posted yesterday, which demonstrates the ability to reduce pre-training FLOPs while maintaining accuracy on downstream tasks.

This is the first work (that we know of!) to demonstrate the use of sparsity for improving the accuracy of models via a set of sparse transformations.


77

Comments


mouldygoldie t1_jdaa3nv wrote

I think I'd look for a different acronym to SIFT, given that's a very well known feature detector and descriptor in computer vision...

116

elisiyumali t1_jdapqms wrote

Whoa...this is the first time I've seen weight sparsity being used to actually improve accuracy! :O The paper was a pleasant read, and the method is simple but novel. Nice work.. I look forward to experimenting with these transformations in my own work once the code is out...

12

brownmamba94 t1_jdaq0gn wrote

Hi, thanks for acknowledging the novelty of our work and for finding the paper a good read. We look forward to releasing our code so you and others can experiment with the different SIFT transformations. And yes, this is the first time sparsity is being used to improve accuracy!

5

MisterManuscript t1_jdawa9m wrote

Feels like the authors are trying to piggyback on the pre-existing fame of the Scale-Invariant Feature Transform. Of all the names that could have been chosen, why override an existing one?

Addendum: if you're lucky, Google just might cut you some slack. If not, expect their lawyers to come at you with a cease-and-desist.

Addendum 2: in response to a now-deleted reply from one of the authors (someone from Cerebras) asking why Google might come after them with a cease-and-desist: SIFT's patent is owned by Google. They may consider it a trademark violation, or something similar.

16

BrotherAmazing t1_jdazqji wrote

Came here to say that.

It’d almost be like choosing the name “IBM” for your company then starting off with “Not to be confused with the International Business Machines publicly traded company IBM,…”

25

Tejalapeno t1_jdb3u06 wrote

Man, it would be cool if the comments here actually focused on the paper's contents and not on the use of an acronym from an outdated algorithm, because the results are extremely important for future scaling.

−9

jakderrida t1_jdb95pw wrote

How about SPIT (Sparse Parameter Iso-FLOP Transformations)?

Or would SPLIT (Sparse Performance-focused Lightweight Iso-FLOP Transformations) work? Or let's choose whatever's SAFIST: Sparse Accuracy-focused FLOP-Isometric Structural Transformations?

Who cares that I obviously had to shoehorn "Structural" in there just to get my pun across?

17

Armanoth t1_jdc2vt0 wrote

While the paper is good and definitely presents a novel approach, re-using existing acronyms, especially such prominent ones, defeats their purpose. The whole point of these acronyms is to let readers easily identify and reference existing methods.

If your choice of acronym forces all subsequent research to spell out which SIFT is meant, it is not only a poor choice but also a point of confusion, and existing papers that mention SIFT are retroactively affected.

As many in this thread have pointed out, there are other equally catchy, non-overlapping acronyms that could have been chosen.

5

Armanoth t1_jdc3lyf wrote

Yeah, whenever papers try to redefine or take over existing, well-known acronyms, I just get the sense that the goal is publicity through controversy.

I don't believe it's just a coincidence, especially not with an acronym this prominent. I mean, who tries to coin a term without doing a basic Google search, let alone picks an acronym that is so well known in the same field?

4

brownmamba94 t1_jdd1otu wrote

Hi, thank you for the feedback. This was a genuine oversight, and we will correct the paper with a new acronym in the revised version of the manuscript. You can expect the changes soon. I look forward to any feedback you have on the research itself, cheers!

8

GamerMinion t1_jddeprr wrote

When you say "FLOP-equivalent, does that also mean compute-time equivalent?

I ask this because on GPUs, models like EfficientNet, which technically have far less flops and parameters can be way slower than a standard ResNet of same accuracy because they're that much less efficiently parallelizable.

Did you look into inference latency on GPUs in your paper?

3

brownmamba94 t1_jddhxdb wrote

Hi, yes, this is a great question. When we say FLOP-equivalent, we mean that on ideal hardware that can accelerate unstructured weight sparsity, the total compute time would also be equivalent. Except we're showing we can actually improve the accuracy of the original dense model for the same compute budget with these Sparse Iso-FLOP Transformations (e.g., Sparse Wide, Sparse Parallel, etc.).

In Section 4 of our paper, we make comparisons for inference and training on hardware with and without support for sparsity acceleration.

In theory, there should be no increase in wall-clock time, but on GPUs there would be a significant increase. However, emerging hardware accelerators like the Cerebras CS-2 are doing hardware-software co-design for sparse techniques, which allows us to take advantage of sparse acceleration during training.
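
A quick illustration of that GPU point (not from the paper; the tensor sizes and sparsity level below are arbitrary): a standard dense matmul kernel still multiplies the zeroed-out weights, so an unstructured mask alone does not buy any wall-clock speedup.

```python
import time
import torch

d = 4096
x = torch.randn(1024, d)
w_dense = torch.randn(d, d)
w_sparse = w_dense * (torch.rand(d, d) > 0.75).float()  # ~75% unstructured zeros

# Both matmuls run the same dense kernel: the zeros are still multiplied,
# so wall-clock time is roughly unchanged even though the sparse weight
# matrix has ~4x fewer "theoretical" FLOPs. Hardware that can skip zeros
# is needed to turn unstructured sparsity into actual speed.
for name, w in [("dense", w_dense), ("75% sparse", w_sparse)]:
    t0 = time.perf_counter()
    _ = x @ w
    print(f"{name}: {time.perf_counter() - t0:.4f}s")
```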

0

GamerMinion t1_jddlqit wrote

Yes, theory is one thing, but you can't build ASICs for everything due to the cost involved.

Did you look into sparsity at latency-equivalent scales, i.e., same latency but a bigger, sparser model?

I would be very interested to see results like that, especially for GPU-like accelerators (e.g., Nvidia's AGX computers use their Ampere GPU architecture), since latency is a primary focus in high-value computer vision applications such as autonomous driving.

2

brownmamba94 t1_jddzjgc wrote

Thanks for your inquiry. We are working with our legal team to figure out the best path forward, but most likely, we'll be releasing under some permissive license that allows you to use the code for your applications.

3

Under_Over_Thinker t1_jde5e2f wrote

Perplexity going from 20.8 to 20.4. Is that a significant improvement? Also, I am not sure if perplexity is representative enough to evaluate LLMs.
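
For scale, a rough back-of-the-envelope (assuming perplexity is reported in the usual way, as the exponential of the per-token cross-entropy):

```latex
\Delta\mathrm{loss} = \ln(20.8) - \ln(20.4) \approx 3.035 - 3.016 \approx 0.019\ \text{nats per token}
```

so the 0.4 perplexity drop corresponds to roughly a 0.6% reduction in per-token loss; the post frames its significance by noting that it matches a larger dense variant with about 2x the FLOPs.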

0

jakderrida t1_jdf8zip wrote

Whoever is downvoting you just doesn't get it.

My joke was that "structural" was so meaningless that it's obviously a backronym solely in service of my pun.

/r/VictorMollo 's joke is that we should all just go off the deep-end and double down on blatantly obvious backronyms.

Notice he used the word "Widget" instead of freaking "Weighted"? He obviously chose to tailor it that way because he appreciates my puns.

3