
JamesBaxter_Horse t1_j6h8tib wrote

If I understand correctly, you're tuning hyperparameters with the intent of minimising training time. What do you see as the purpose of this? Presumably, all you're achieving is successfully minimising the inductive space and optimising the learning parameters so as to converge as quickly as possible, but these results are completely specific to CIFAR and would not be reproducible on a different dataset.

19

LeanderKu t1_j6hd6hp wrote

Well, that is also an assumption. It would be interesting to see which lessons translate and which don't. I wouldn't dismiss it so quickly. Also, it's a fun game to play and interesting in its own right!

6

tysam_and_co OP t1_j6hfdaj wrote

And, safe to say, there's stuff I'm not sharing here (yet?) that I've found as a result of that. Some hyperparameters are more network-specific, some are more dataset-specific. And some behave just weirdly enough that you might get an entirely new adaptive method out of it... ;))))

I hadn't thought about it in those exact words for a long while, but I think you're very much right! It is quite a fun game, and very interesting to play in its own right. There's very much this bizarre, high-dimensional "messy zen" to it all. ;D

Thanks again for your comment, it warmed my heart and made me smile seeing it. Have a good evening/night/etc! :D :))) <3 <3 :D

3

tysam_and_co OP t1_j6hx45k wrote

Thanks for sharing. I think you might be missing some of the bigger picture here! Most of the changes and performance improvements did indeed come from changing the architecture, memory format, execution order, network width, etc. in the right places. These come from about five previous years of experience where my primary task was architecting networks like this. I actually transferred a number of personal lessons learned into this network to get a lot of the benefits that we have here. So I'm not quite sure why they would suddenly stop scaling to other problems! ;P That said, there might be some tweaks required to line up with the inductive biases of the network on different datasets (in this case, say, 1-2 more downscaling blocks for ImageNet, or something like that).
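To make the downscaling point a bit more concrete, here's a rough sketch (purely illustrative, not the actual code from this repo; the block layout and channel counts are my own stand-ins) of how one might parameterize the number of downsampling stages so the same backbone shape stretches from CIFAR-sized to ImageNet-sized inputs:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv -> optional 2x max-pool -> BatchNorm -> GELU."""
    def __init__(self, in_ch, out_ch, downsample=True):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.pool = nn.MaxPool2d(2) if downsample else nn.Identity()
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.norm(self.pool(self.conv(x))))

def make_backbone(base_width=64, num_downsamples=3):
    """CIFAR-sized inputs (32x32) reach a 4x4 grid with 3 downsamples;
    ImageNet-sized inputs (224x224) want 1-2 more stages to land on a
    comparably small final grid."""
    layers = [ConvBlock(3, base_width, downsample=False)]
    in_ch = base_width
    for i in range(num_downsamples):
        out_ch = base_width * 2 ** (i + 1)  # double the channels each stage
        layers.append(ConvBlock(in_ch, out_ch, downsample=True))
        in_ch = out_ch
    return nn.Sequential(*layers)

cifar_net = make_backbone(num_downsamples=3)     # 32x32   -> 4x4 feature map
imagenet_net = make_backbone(num_downsamples=5)  # 224x224 -> 7x7 feature map

print(cifar_net(torch.randn(1, 3, 32, 32)).shape)       # [1, 512, 4, 4]
print(imagenet_net(torch.randn(1, 3, 224, 224)).shape)  # [1, 2048, 7, 7]
```

The point of the knob is just that moving to a larger-resolution dataset mostly changes how many times you halve the spatial grid, not the overall recipe.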

I also wouldn't focus on the hyperparameter twiddling that much -- though it is important and can definitely be a trap. At the frontier of a world record, every option is on the table, and hyperparameters promise results but are exponentially more expensive to work with (each new knob multiplies the number of combinations to search). But the 'good enough' parameter space should be pretty flat outside of that regime, so this is likely not too bad of a starting place.

I'm a bit curious about how this would not be reproducible on another dataset (especially if we're narrowing our inductive space -- this should increase generalization, not reduce it!). Similar to Transformers, the simpler and more scalable this architecture is, the better. One of my go-tos for people newer to the field is to encourage them to keep things as simple as possible. It pays off!

In this case, for example, before release I just added 70 epochs and doubled the base width, and went from 94.08% to 95.77%. That's a good sign! It should at least have good basic performance on other datasets, and if something has to be changed, it's probably just a few hyperparameters rather than all of them, if that makes sense.
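For a sense of scale, the kind of change I mean is literally a couple of lines. A hypothetical hyperparameter block (my own placeholder names and values, not the repo's actual settings) in the spirit of that change:

```python
# Hypothetical settings: stretch the training budget and the base width,
# leave everything else (lr schedule, weight decay, etc.) untouched.
speedrun_hyp = {
    "train_epochs": 10,  # placeholder value, not the repo's actual setting
    "base_width": 64,    # placeholder value, not the repo's actual setting
}

accuracy_hyp = {
    **speedrun_hyp,
    "train_epochs": speedrun_hyp["train_epochs"] + 70,  # train ~70 more epochs
    "base_width": speedrun_hyp["base_width"] * 2,       # double the base width
}
```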

2