drinkingsomuchcoffee t1_jdnhxri wrote
Huge models are incredibly wasteful and unoptimized. Someday, someone is going to sit down and create an adaptive algorithm that expands or contracts a model during the training phase, and we're going to laugh at how stupid we were.
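To make the idea concrete, here's a toy sketch of just the "contract" half, using plain magnitude pruning during training whenever the loss plateaus. This is not the adaptive algorithm I'm imagining, just an illustration; the model, data, and plateau check are all made up, and the "expand" half (growing weight matrices) is omitted entirely.

```python
# Toy sketch: shrink a model during training by pruning low-magnitude weights
# whenever the loss stops improving. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best = float("inf")
for step in range(1000):
    x = torch.randn(32, 64)               # stand-in for a real batch
    y = torch.randint(0, 10, (32,))
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 100 == 99:
        if loss.item() > best * 0.99:      # crude plateau check
            # "Contract": mask the 10% smallest-magnitude weights per layer.
            for m in model.modules():
                if isinstance(m, nn.Linear):
                    prune.l1_unstructured(m, name="weight", amount=0.1)
        best = min(best, loss.item())
```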
drinkingsomuchcoffee OP t1_j8zael9 wrote
Reply to comment by baffo32 in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
I am the "bad guy" of the thread, so anything I say will be seen negatively, even if it's correct. This is typical human behavior, unfortunately.
I have a feeling most people here do not understand DRY done well, and are used to confusing inheritance hierarchies and incredibly deep function chains. Essentially they have conflated DRY with bad code, simple as that.
drinkingsomuchcoffee OP t1_j8v3y80 wrote
Reply to comment by fasttosmile in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
>You cant just copy paste a file if it’s centralized, you’ll have to copy paste multiple, and the main issue is it’s gonna take a while to understand which ones (and you'll have to modify the imports etc., unless you copy the entire repo! are you seriously suggesting that lmao)
Yep, apparently they themselves claim to do this for every module. Thank you for pointing out how crazy this is and proving my point.
>Your definition of hackable is almost it. What’s missing is that being decentralized makes things much, much easier to understand because the code is very straightforward and doesn’t have to take 10 different things into account.
Oh really? Those files depend on PyTorch functions and also NumPy. Should they copy those entire libraries into the file to be more "hackable"? Lmao
drinkingsomuchcoffee OP t1_j8uvzdr wrote
Reply to comment by dahdarknite in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
This is such a terrible attitude to have. This isn't about money at all.
You don't pay for many services. Does this mean they should be able to treat you like garbage? Should Google be able to lock you out of all your services because their automated system falsely accused you? By your logic, you don't pay so you have no right to be annoyed.
HuggingFace is a for profit company. They will be asking for your money now or in the future. This isn't a bad thing, they need to eat too.
By even existing, HuggingFace has disincentivized possibly more competent devs from creating their own framework. That's fine, but it's a very real thing. In fact, it's pretty common for a business to corner a market at a loss and then ratchet up prices.
Finally you may work for a company that chooses HuggingFace and you will be forced to use the library whether you want to or not.
drinkingsomuchcoffee OP t1_j8unlkp wrote
Reply to comment by fasttosmile in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
Alright, I have a bit of time so I'll address a few things.
>You need to understand that there is a trade-off between centralizing [...] verses keeping it hackable that is unavoidable.
I don't know what hackable means; you haven't defined it. I'm going to use the most generous interpretation: you can modify the code without impacting other places. Well, you can do that even if it's centralized: copy-paste the function into your own file and edit it there. That's no excuse to completely ban centralization! Alternatively, decompose the centralized function further and only use the pieces you need.
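For example (hypothetical names, just to illustrate decomposing a shared helper so a "hacker" reuses only the pieces they need instead of forking everything):

```python
# Hypothetical shared preprocessing module, split into small pieces.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def tokenize(text: str) -> list[str]:
    return text.split(" ")

def preprocess(text: str) -> list[str]:
    # Default pipeline: most callers just use this.
    return tokenize(normalize(text))

# A model that needs different tokenization keeps the shared normalize()
# and swaps in its own tokenizer, without copy-pasting the whole pipeline.
def my_preprocess(text: str) -> list[str]:
    return list(normalize(text))           # e.g. character-level tokens
```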
Now onto the blog post.
>If a bug is found in one of the model files, we want to make it as easy as possible for the finder to fix it. There is little that is more demotivating than fixing a bug only to see that it caused 100 failures of other models.
Maybe it should cause 100s of failures if it's a breaking change (a bug). That's a pretty good sign you really did screw something up.
>Similarly, it's easier to add new modeling code and review the corresponding PR if only a single new model file is added.
No, it's not. If new code uses a battle-tested core, I don't have to review those parts as thoroughly. If it's copy-pasted, I still have to review it and make sure they didn't copy an old version with bugs, or slightly modify it and break something. Sounds like this is common, as many people have complained about dozens of bugs!
>We assume that a significant amount of users of the Transformers library not only read the documentation, but also look into the actual modeling code and potentially modify it. This hypothesis is backed by the Transformers library being forked over 10,000 times and the Transformers paper being cited over a thousand times.
Maybe you should check your assumptions before making a fundamental decision (you know, basic engineering). There are plenty of forked libraries that are never modified and are forked purely for archival purposes. Nor should you cater to a small minority if most people _aren't_ doing this.
> Providing all the necessary logical components in order in a single modeling file helps a lot to achieve improved readability and adaptability.
It can _sometimes_. But not always. Having one massive file named `main.py` is not more readable than a well-split program. This seems like basic common sense to me, but here's a whole book chapter on the subject (The Art of Unix Programming, on modularity): http://www.catb.org/esr/writings/taoup/html/ch04s01.html
>Every time we would have to have asked ourselves whether the "standard" attention function should be adapted or whether it would have been better to add a new attention function to attention.py. But then how do we name it? attention_with_positional_embd, reformer_attention, deberta_attention?
Yep, you've identified a place where you shouldn't try to fit every idea under a single "Attention" class. That's just common sense programming, not an argument against writing good shared functions or classes.
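Something like this hypothetical sketch is all I mean by "good shared functions": the boring core is shared, and each model keeps its own quirks (projections, positional tricks) in its own module. Names and shapes here are made up.

```python
# Shared core: plain scaled dot-product attention, nothing model-specific.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

class MyModelAttention(torch.nn.Module):
    """Model-specific layer: owns its projections and any positional tricks,
    but reuses the shared core instead of re-implementing it."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)

    def forward(self, x, mask=None):
        return scaled_dot_product_attention(self.q(x), self.k(x), self.v(x), mask)
```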
>Once a machine learning model is published, it is rarely adapted or changed afterward.
Then why does the BERT module have commits as recent as this week, from dozens of authors, going back years?
https://github.com/huggingface/transformers/tree/main/src/transformers/models/bert
This is irrefutable hard evidence against your argument.
> Sylvain Gugger, found a great mechanism that respects both the single file policy and keeps maintainability cost in bounds. This mechanism, loosely called "the copying mechanism", allows us to mark logical components, such as an attention layer function, with a # Copied from <predecessor_model>.<function> statement
OK, so the programmer you mentioned before is still going to "break 100s of tests" when she changes this ad hoc C-preprocessor knockoff. You're still doing DRY; you're just doing it the way C programmers did it 30 years ago, in a much more complicated manner.
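For anyone who hasn't seen it, the marker looks roughly like this (paraphrased from memory; the real classes take a config object rather than plain arguments, and as I understand it a CI script then checks that the copied body stays in sync with the original):

```python
import torch.nn as nn

# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->Roberta
class RobertaSelfOutput(nn.Module):
    def __init__(self, hidden_size=768, layer_norm_eps=1e-12, dropout_prob=0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        return self.LayerNorm(hidden_states + input_tensor)
```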
If anyone here works at HuggingFace, please forward this to the author of that article.
drinkingsomuchcoffee OP t1_j8u09ez wrote
Reply to comment by [deleted] in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
Not an argument.
drinkingsomuchcoffee OP t1_j8tw3yt wrote
Reply to comment by fasttosmile in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
There are so many contradictions and fallacies in that blog post that I don't even know where to begin. I think I'll let the empirical evidence do the talking for me, aka the many people agreeing with my post.
drinkingsomuchcoffee OP t1_j8sm1y7 wrote
Reply to comment by narsilouu in [D] HuggingFace considered harmful to the community. /rant by drinkingsomuchcoffee
Thank you for replying. I apologize for the harsh tone; I was hoping to phrase it as a wake-up call that people are reading the code and they do care about quality.
Do continue to avoid inheritance. In fact, probably ban inheritance unless it's only one layer deep and inheriting from an abstract base class.
But don't misunderstand DRY. DRY is not about compressing code as much as possible. That's code golfing. DRY is about having one place for information to live, that's it. If you see a dev creating a poorly named function or abstraction to reduce 5 lines of duplicate code, that's not DRY, that's just bad code.
You can achieve DRY by using code generators, as you mention, but splitting things into separate modules is also fine. A code generator is DRY because the generator is the single source of truth for the information, even if it emits "duplicate" code. That's what a real understanding of DRY looks like.
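A toy example of what I mean (file names, model names, and sizes are made up): the emitted files may look like copy-paste, but a fix to the template propagates everywhere on the next generation run, so the template stays the single source of truth.

```python
# Toy "DRY via a generator": edit the template, regenerate, done.
TEMPLATE = '''\
class {name}Attention:
    """Attention for {name}. Generated file; edit the template, not this."""
    def __init__(self, hidden_size={hidden_size}):
        self.hidden_size = hidden_size
'''

MODELS = {"Bert": 768, "Reformer": 512, "Deberta": 1024}

for name, hidden_size in MODELS.items():
    with open(f"modeling_{name.lower()}_attention.py", "w") as f:
        f.write(TEMPLATE.format(name=name, hidden_size=hidden_size))
```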
People wanting to "hack" on code don't mind having to copy a few folders. If you have a beautiful module of pure functions for calculating statistics, it is flat-out stupid to copy+paste it into every folder to be more "hackable". Don't do this. Instead, factor these out into simple, pure modules.
Submitted by drinkingsomuchcoffee t3_113m1ly in MachineLearning
drinkingsomuchcoffee t1_ithb1xs wrote
Reply to comment by jaschau in [D] Building the Future of TensorFlow by eparlan
Unsurprising. Google's been out of touch with reality for a while now. That's what happens when you have a near-monopoly (besides Apple). Despite the claims of how elite they are, the APIs they produce are pretty garbage, except for a few lucky hits like JAX.
drinkingsomuchcoffee t1_iswij8p wrote
Reply to comment by Azmisov in [D] How frustrating are the ML interviews these days!!! TOP 3% interview joke by Mogady
I would expect a place that hires the top 3% to design a better test.
drinkingsomuchcoffee t1_jdpg1cb wrote
Reply to comment by YoloSwaggedBased in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
The problem is that learned features aren't factored nicely into a minimal set of parameters. For example, identifying whether an image is a cat may take thousands of parameters spread over n layers when it could actually be expressed with 10 parameters over fewer layers. A small model does this factoring automatically, because it's physically constrained; a large model has no such constraint, so it can be wasteful.

There are probably many solutions that get the best of both worlds at training time, but it's by no means an easy problem, and the current distillation or retraining methods feel clunky. What we actually want is for the big model to use all its parameters efficiently rather than waste them, and it's likely wasting them if much more compact models can get similar results. It's probably extremely wasteful if it takes an order-of-magnitude increase in size to gain a few percentage points of improvement. Compare that to biological entities, where an order-of-magnitude increase in size yields huge cognitive improvements.
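By "distillation methods" I mean roughly this kind of setup: train a small student to match a big teacher's softened outputs. This is a minimal sketch with toy stand-in models and placeholder hyperparameters, not anyone's actual recipe.

```python
# Minimal knowledge-distillation sketch (toy models, illustrative numbers).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL between softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

teacher = torch.nn.Linear(32, 10)    # stand-ins for a big/small model pair
student = torch.nn.Linear(32, 10)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(16, 32)
labels = torch.randint(0, 10, (16,))
with torch.no_grad():
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, labels)
loss.backward()
opt.step()
```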