IntelArtiGen

IntelArtiGen t1_jeguknc wrote

I've used autoencoders on spectrograms, and in theory you don't need an A100 or 80M spectrograms to get some results.

I've not used ViTMAE specifically but I've read similar papers. I'm not sure how to interpret the value of the loss. You can use some tips which are valid for most DL projects. Can your model overfit on a smaller version of your dataset (1000 spectrograms)? If yes, perhaps your model isn't large / efficient enough to process your whole dataset (though bird songs shouldn't be that hard to learn imo). At least you could do more epochs faster this way and debug some parameters. If your model can't overfit, you may have a problem in your pre/post-processing.
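For reference, a minimal sketch of that sanity check, assuming a HuggingFace ViTMAEForPreTraining model and spectrograms already converted to 3-channel 224x224 tensors (the dataset name, learning rate and batch size are placeholders, not values I'm recommending):

```python
# Sanity check: can the model overfit ~1000 spectrograms?
# Assumes each dataset item is a (3, 224, 224) float tensor; the dataset
# object and hyperparameters below are purely illustrative.
import torch
from torch.utils.data import DataLoader, Subset
from transformers import ViTMAEForPreTraining

model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

small_set = Subset(full_spectrogram_dataset, range(1000))  # hypothetical dataset
loader = DataLoader(small_set, batch_size=32, shuffle=True)

for epoch in range(50):
    total = 0.0
    for pixel_values in loader:                  # (B, 3, 224, 224) tensors
        out = model(pixel_values=pixel_values.cuda())
        out.loss.backward()                      # MAE reconstruction loss on masked patches
        optimizer.step()
        optimizer.zero_grad()
        total += out.loss.item()
    print(epoch, total / len(loader))            # should keep dropping if the model can overfit
```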

Do ViTMAE models need normalized inputs? Spectrograms can have large values by default, which may not be easy to process, and they can be hard to normalize. Your input and your output should be in a coherent range of values, and you should use the right layers in your model if you want that to happen. Also, fp16 training can mess with that.
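For example, a common recipe (just a sketch with arbitrary torchaudio parameters, the file name is a placeholder) is to log-scale the spectrogram and then standardize it:

```python
# Log-scale the spectrogram, then standardize it so the model sees inputs
# in a small, roughly zero-centered range. n_fft / n_mels are examples.
import torchaudio

waveform, sr = torchaudio.load("bird_song.wav")      # hypothetical file
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=1024, n_mels=128)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)  # dB scale tames large raw magnitudes

# Standardize (per-spectrogram here; dataset-level statistics are often better)
log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-6)
print(log_mel.min().item(), log_mel.max().item())     # should now be in a small range
```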

ViTMAE isn't specifically for sounds, right? I think there have been multiple attempts to use it for sounds; this paper (https://arxiv.org/pdf/2212.09058v1.pdf) cites other papers:

>Inspired by the success of the recent visual pre-training method MAE [He et al., 2022], MSM-MAE [Niizumi et al., 2022], MaskSpec [Chong et al., 2022], MAE-AST [Baade et al., 2022] and Audio-MAE [Xu et al., 2022] learn the audio representations following the Transformer-based encoder-decoder design and reconstruction pre-training task in MAE

You can look at their results and how they made it work; these papers probably also published their code.

Be careful with how you process sounds; the pre/post-processing is different from images, which may cause some problems.

3

IntelArtiGen t1_jcdxtrb wrote

The challenge looks very cool but also quite hard. However, if it's truly possible to read that ink and unfold these scrolls, I'm sure ML and data processing will be able to do it.

4.7 TB (for two scrolls) seems like a lot, but I also get that it's due to the resolution required to detect ink. I guess people can test their algorithms first on the other datasets and find a way to process these 4.7 TB if they need to. Perhaps the task could be more accessible if people could easily access 1/4~1/8 of one scroll (0.5–1 TB).

35

IntelArtiGen t1_jbjg4wk wrote

>Humans need substantially fewer tokens than transformer language models.

We don't use tokens the same way. In theory you could build a model with 10000 billion tokens, including one for each number up to some limit. Obviously humans can't and don't do that. We're probably closer to a model which would do "characters of a word => embedding". Some models do that, but they also do "token => embedding" because it improves results and it's easier for the models to learn. Those who make these models may not really care about the size of the model if they have the right machine to train it and just want the best results on a task, without constraints on size efficiency.

Most NLP models aren't efficient regarding their size, though I'm not sure there's currently a way to keep getting the best possible results without doing things like this. If I ask you "what happened in 2018?", you need to have an idea of what "2018" means, and that it's not just a random number. Either: (1) you know it's a year because you've learned this number like all other tokens (and you have to do that for many numbers / weird words, so you have a big model), or (2) you think it's a random number, you don't need one token per number, your model is much smaller, but you can't answer these questions precisely, or (3) you can rebuild an embedding for 2018 knowing it's 2-0-1-8, because you have an accurate "characters => embedding" model.

I don't think we have a perfect solution for (3), so we usually do (1) & (3). But just doing (3) is the way to go for smaller NLP models... or putting much more weight on (3) and much less on (1).
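A toy sketch of what (3) could look like (the architecture and sizes are purely illustrative, not taken from any specific paper):

```python
# Toy version of option (3): build a word embedding from its characters,
# so numbers like "2018" don't each need a dedicated token.
import torch
import torch.nn as nn

class CharToWordEmbedding(nn.Module):
    def __init__(self, n_chars=256, char_dim=32, word_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.rnn = nn.GRU(char_dim, word_dim, batch_first=True)

    def forward(self, word: str) -> torch.Tensor:
        ids = torch.tensor([[min(ord(c), 255) for c in word]])  # crude char ids
        _, h = self.rnn(self.char_emb(ids))
        return h[0, 0]                       # one embedding for the whole word

model = CharToWordEmbedding()
print(model("2018").shape)                   # torch.Size([256]), no dedicated "2018" token needed
```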

So the size of NLP models doesn't really mean much: you could build a model with 100000B parameters, but 99.999% of these parameters won't improve the results a lot and are only required to answer very specific questions. We should focus on building better "characters => embedding" models and on ways to compress word embeddings if we care about the size of NLP models (easier said than done).

1

IntelArtiGen t1_j60jjfg wrote

It's quite hard to answer these questions for neural networks. We don't really know if GANs are forever worse than Latent Diffusion Models; they are now, but they weren't before, and perhaps in the future GANs will outperform LDMs again. It seems that how we currently configure the denoising task is better suited for text2img than how we currently configure GANs.

A model usually outperforms another when it's more efficient in how it stores information in its weights. Successive conditioned denoising layers seem to be more efficient for this task, but they also require a good enough perceptual loss, a good enough encoder, etc. We know these networks can compete with GANs, but maybe they just weren't good enough before, or weren't combined in a good enough way.

2

IntelArtiGen t1_j5tmia0 wrote

ChatGPT is probably a very good model for the task it had to solve (being a great conversational agent based on OpenAI's data), but there are better models for the broader task of language understanding. You could adapt those models to be conversational agents, and they could probably beat ChatGPT if they had access to the same dataset. But it would still be this specific task of being a great conversational agent; it's not the task of "thinking by itself like humans".

So it depends on what "more advanced" means. There are probably more "advanced" tasks on the way towards AGI. But towards being a great conversational agent, perhaps OpenAI has the best task-dataset combo today. At least I'm quite sure there aren't systems which are "significantly" more advanced than that, because I think the current limit is that it's "just" a very good conversational agent.

18

IntelArtiGen t1_j5tijjx wrote

I managed to use SwAV on 1 GPU (8GB), batch size 240, 224x224 images, FP16, ResNet18.

Of course it works; the problem isn't just the batch size but the accuracy/batch-size trade-off, and the accuracy was quite bad (still usable for my task though). If 50% top-5 on ImageNet is ok for you, you can do it. But I'm not sure there are many tasks where it makes sense.

Perhaps contrastive learning isn't the best for single GPU. I'm not sure about the current SOTA on this task.

3

IntelArtiGen t1_j55at5j wrote

I don't really see how and why they would do it. What's the video? You can check the codec they used with right click > "Stats for nerds"; the codec should say which algorithm was used to encode/decode the video. Using CNNs client-side for this task would probably be quite CPU/GPU intensive and I doubt they would do it (except perhaps as an experiment). And using CNNs server-side wouldn't make sense if it increases the size of the data download.

It does look like CNN artifacts.

46

IntelArtiGen t1_j4zr3iq wrote

Yeah, that's also what I would say. I doubt it's anything revolutionary, as it's likely not necessary. It might be an innovative use of embeddings of a conversation, but I wouldn't call that "revolutionary".

They probably don't use only one embedding for the whole conversation; perhaps they use one embedding per prompt and/or keep some tokens in memory.

1

IntelArtiGen t1_j3dyhfy wrote

It's true that by default DL algorithms are unoptimized on this point, because modelers usually don't really care about optimizing the number of parameters.

For example, ResNet-50 uses 23 million parameters, which is much more than EfficientNet-B0, which uses 5 million parameters and has better accuracy (and is harder to train). But when you try to further optimize algorithms that were already optimized for their number of parameters, you quickly hit these limits: you would need models even more efficient than DL models that are already optimized for parameter count.
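If you want to check the counts yourself, a quick sketch with torchvision (exact totals vary a bit with the library version and depending on whether you count the 1000-class classification head):

```python
# Quick comparison of parameter counts (efficientnet_b0 needs torchvision >= 0.11).
import torchvision.models as models

def n_params(model):
    return sum(p.numel() for p in model.parameters())

print(f"ResNet-50:       {n_params(models.resnet50()) / 1e6:.1f}M parameters")
print(f"EfficientNet-B0: {n_params(models.efficientnet_b0()) / 1e6:.1f}M parameters")
```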

A DL model could probably solve this handwriting problem with a very low number of parameters if you build it specifically with this goal in mind.

2

IntelArtiGen t1_j3dvbjr wrote

>Imo there's no reason why we can't have much smaller models

It depends on how much smaller they would be. There are limits to how much you can compress information. If you need to represent 4 states, you can't use one binary value 0/1; you need two binary values: 00/01/10/11.
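Written out, the counting argument is just:

```latex
% Minimum number of binary values needed to distinguish k states
\text{bits}(k) = \lceil \log_2 k \rceil, \qquad
\text{bits}(4) = \lceil \log_2 4 \rceil = 2 \quad (00,\ 01,\ 10,\ 11)
```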

A large image of the real world contains a lot of information / details which can be hard to process and compress. We can compress it of course, that's what current DL algorithms and compression software do, but they have limits, otherwise they lose too much information.

Usual models are far from being perfectly optimized, but when you try to optimize them too much you can quickly lose accuracy. Under 1,000,000 parameters it's hard to have anything that could compete with more standard DL models on the tasks I've described... at least for now. Perhaps people will have great ideas, but it would require really pushing current limits.

2

IntelArtiGen t1_j3dpy8q wrote

Well, it doesn't really count, because you can also "solve" these tasks with SVMs / random forests, etc. MNIST, OCR and other tasks with very small images are no longer great benchmarks for comparing an arbitrary algorithm with a deep learning algorithm.
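For example, a plain SVM on scikit-learn's small 8x8 digits dataset (just a sketch, not a tuned baseline) already does very well without any deep learning:

```python
# Classic non-deep baseline on a small-image task: an SVM on the 8x8 digits
# dataset typically reaches around 0.97-0.99 accuracy, so this kind of
# benchmark says little about deep vs. non-deep approaches.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```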

I was more thinking of getting 90% top-1 on ImageNet, or generating 512x512 images from text, or learning on billions of texts to answer questions. You either need tons of parameters to solve these or an unbelievable amount of compression. And even DL algorithms which do compression need a lot of parameters. You would need an even better way to compress information; perhaps it's possible, but it has yet to be invented.

1

IntelArtiGen t1_j3cmdvl wrote

Any alternative able to solve the same problems would probably require a similar architecture: a lot of parameters, deep connections.

There are many alternatives to deep learning for some specific tasks. But I'm not sure that if something is able to outperform the current way we're doing deep learning on usual DL tasks, it will be something totally different (non-deep, few parameters, etc.).

The future of ML regarding tasks we do with deep learning is probably just another kind of deep learning. Perhaps without backpropagation, perhaps with a totally different way to do computations, but still deep and highly parametric.

7

IntelArtiGen t1_j36nsv9 wrote

I think there are multiple kinds of "neuromorphic" processors and they all have different abilities. OP pointed out the power efficiency. Researchers also work on analog chips which don't have the same constraints as traditional circuits.

But how / whether you can truly use some of these differences depends on the use case. It would seem logical that well-exploited neuromorphic processors would be more power efficient, but that doesn't mean you have the algorithm to exploit them better than current processors for your use case, or that it's necessarily true. For complex tasks, we don't have a proof that would say "no algorithm on a traditional processor can outperform the best algorithm we know on a neuromorphic chip at the same power efficiency".

The main difference is that neuromorphic chips are still experimental, and that traditional chips allowed 10+ years of very fast progress in AI.

1

IntelArtiGen t1_j358s2v wrote

>I meant just in terms of compute efficiency, using the same kind of algorithms we use now.

For SNNs, I'm sure they can make them more efficient, but that doesn't mean they'll have a better score/power-consumption ratio on a task than more standard models in their most optimized versions.

>This makes sense to me; instead of emulating a neural network using math, you're building a physical model of one on silicon. Plus, SNNs are very sparse and an analog one would only use power when firing.

I understand and I can't disagree, but as I said, we don't have proof that the way we're usually doing it (with dense layers / tensors) is necessarily less efficient than artificial SNNs or biological NNs. "Efficient" in terms of accuracy / power consumption. And we don't have a theory that would allow a generic comparison between usual ANNs and SNNs or biological NNs; it would require a generic metric of how "intelligent" these models can be just from their design (we don't have that). Neurons in usual ANNs don't represent the same thing.

Also, an optimized model on a modern GPU can run ResNet-50 (fp16) at ~2000 fps with 450 W. We can't directly compare fps with human vision, but if the brain works with 20 W, that's equivalent to approximately 90 fps for 20 W (if you say 7 W are for vision, it's ~30 fps). Of course we don't see at 30 fps and it's hard to compare the accuracy of ResNet-50 with humans, but ResNet-50 is also very far from being the most efficient architecture, and there are more power-efficient GPUs. It's hard to say for sure that current GPUs with SOTA models would be less power efficient on some tasks than the human brain.
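The back-of-the-envelope scaling behind those numbers, assuming fps per watt stays constant:

```latex
2000~\text{fps} \times \frac{20~\text{W}}{450~\text{W}} \approx 89~\text{fps},
\qquad
2000~\text{fps} \times \frac{7~\text{W}}{450~\text{W}} \approx 31~\text{fps}
```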

>I feel like a lot of SNN research is motivated by understanding the brain rather than being the best possible AI.

It depends on what you call the "best possible AI". It's probably not designed to be a SOTA on usual tasks but the best way to prove that you can understand the human brain is by reproducing how it works, which would make the resulting model better than current models on a lot of tasks.

2

IntelArtiGen t1_j33v5ir wrote

>Is the performance really better than GPUs?

Depends on the model I guess, usual ANNs work with tensors so you probably can't do much better than GPUs (/TPUs).

>Could this achieve the the dream of running a model on as little power as the brain uses?

That alone, I doubt it. Even if it could theoretically reproduce how the brain works with the same power efficiency, it doesn't mean you would have the algorithm to efficiently use this hardware. Perhaps GPUs could actually be more efficient than a human brain in theory, with a perfect algorithm, but we don't have that algorithm and we don't have proof it can't exist.

>Are spiking neural networks useful for anything?

I've read papers that say they do work, but the papers I've read use them on the same tasks we use for usual ANNs, and they perform worse (from what I've seen). Perhaps it's also a bad idea to test them on the same tasks. Usual ANNs are designed for current tasks, and current tasks are often designed for usual ANNs. It's easier to use the same datasets, but I don't think the point of SNNs is just to try to perform better on these datasets, but rather to try more innovative approaches on some specific datasets. Biological neurons use time for their action potentials, so if you want to reproduce their behavior it's probably better to test them on videos / sounds, which also depend on time.

I would say it's useful for researchers who have ideas. Otherwise I'm not sure. And if you have an idea, I guess it's better to first try it on usual hardware and only use neuromorphic chips if you're sure they'll run faster and improve the results.

The hardware is not the only limit: if I gave an AI researcher a living human brain, this researcher probably couldn't make AGI out of it. You also need good algorithms.

7

IntelArtiGen t1_j2xa49x wrote

I can give another example. Input / output: 1.7/0, 2/0, 2.2/1, 3.5/0, 4/0, 5/0, 8/0, 9.6/0, 11/1, 13/1, 14/1, 16/1, 18/1, 20/1. There is an error in this dataset: 2.2/1. But you can train a model on this set to predict 2.2/0 (a small / regularized model would do that). You could also train a model to predict 1 for 2.2, but it would probably be overfitting. The same idea applies to any concept in input and any concept in output.
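A quick sketch of that contrast with scikit-learn (the model choices are just illustrative: a regularized logistic regression vs. an unconstrained decision tree):

```python
# The toy dataset above: a regularized (smooth) model ignores the 2.2/1 outlier,
# while an unconstrained model memorizes it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X = np.array([1.7, 2, 2.2, 3.5, 4, 5, 8, 9.6, 11, 13, 14, 16, 18, 20]).reshape(-1, 1)
y = np.array([0,   0, 1,   0,   0, 0, 0, 0,   1,  1,  1,  1,  1,  1])

smooth = LogisticRegression().fit(X, y)          # L2-regularized, monotone in x
memorizer = DecisionTreeClassifier().fit(X, y)   # fully grown tree, fits every point

print(smooth.predict([[2.2]]))     # [0] -> treats 2.2/1 as noise
print(memorizer.predict([[2.2]]))  # [1] -> reproduces the labeling error (overfitting)
```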

5

IntelArtiGen t1_j2x7kg9 wrote

From a very theoretical point of view, we can imagine knowledge "A" useful for a task A and knowledge "B" useful for a task B. That's how humans would apply their knowledge. But we could teach a model to learn knowledge A (or A+B) and apply it to task B, and it could eventually perform better.

Humans don't have all the knowledge and don't apply everything they could know to every task perfectly. Models also aren't perfect, but they can do more combinations and perform better on certain tasks because of that.

But I can take another example. Here is a task: "a human says N images contain dogs and M images contain cats; the model must reproduce this behavior". Would a perfect model designed to exactly reproduce the human be able to outperform a human on this task? No. The human would make mistakes, and the model would reproduce these mistakes. But we don't design or train our models to exactly reproduce what a human did, that would be a risk of overfitting; we use regularization so that even by reproducing what humans did, a model can do better and not reproduce some mistakes.

7

IntelArtiGen t1_izxdej3 wrote

I agree it's imperfect, as we are. When I tried to do it, I was still able to maintain a bit of knowledge in the network, but I had to continuously re-train on previous data.

It's hard to do "info1,2,3 => train => info4,5,6 => train => info7,8,9 => train [etc.]" and have the model remember info1,2,3.

But you can do "info1,2,3 => train => info4,5,1 => train => info6,7,2 => train [etc.]". I used a memory to retain previous information and continuously train the network on it, and it works. Of course it's slower because you don't only process new information, you mix it with old information. I guess there are better ways to do it.
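A minimal sketch of that idea, a small replay memory mixed into every training step (the model, loss and data source are placeholders, and the sizes are arbitrary):

```python
# Keep a memory of past samples and mix them into every training step
# so the network keeps seeing old information alongside the new stream.
import random
import torch

memory = []            # stores (x, y) pairs seen so far
MEM_SIZE = 10_000
REPLAY_PER_STEP = 16

def train_step(model, optimizer, loss_fn, new_batch):
    xs, ys = new_batch
    # Mix the new batch with randomly sampled old samples
    if memory:
        old = random.sample(memory, min(REPLAY_PER_STEP, len(memory)))
        xs = torch.cat([xs, torch.stack([x for x, _ in old])])
        ys = torch.cat([ys, torch.stack([y for _, y in old])])
    loss = loss_fn(model(xs), ys)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Store the new samples, dropping the oldest when the memory is full
    for x, y_ in zip(*new_batch):
        memory.append((x, y_))
        if len(memory) > MEM_SIZE:
            memory.pop(0)
    return loss.item()
```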

1

IntelArtiGen t1_izwl2wr wrote

>Also, the brain can learn from a continuous stream of incoming data and does not need to stop to run a backprop pass. Yes, sleep is beneficial for learning somehow, but we can learn awake too.

In a way you can also do that with regular NNs. Usually we do "long training phase (many backprops) => only test phase". But we can do "backprop => test => backprop => test ..." if it applies to our task (it usually doesn't), simultaneously training and using one model.
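A toy version of that loop, just to show the pattern (model, optimizer, loss and stream are all placeholders):

```python
# "backprop => test => backprop => test": the same model answers each
# incoming example and is immediately trained on it afterwards.
def online_loop(model, optimizer, loss_fn, stream):
    for x, y in stream:            # continuous stream of (input, label) pairs
        prediction = model(x)      # "test": use the model right away
        loss = loss_fn(prediction, y)
        optimizer.zero_grad()      # "backprop": one small update per example
        loss.backward()
        optimizer.step()
```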

Also, it's always interesting to try new things, but many propositions seem to work only on small image datasets like MNIST or CIFAR-10. For small neural networks and datasets with small inputs, there is always a possibility that the neural network will find good weights "by chance", and that with enough computing power it'll converge. But for large networks and large images, these solutions usually don't scale. I think it's important to try these solutions on ImageNet to evaluate how they scale (and to try to make them scale). What made backprop so popular is its ability to scale to very large networks and images.

7