harharveryfunny t1_je50vw9 wrote

There's no indication that I've seen that it maintains any internal state from one word generated to the next. Therefore the only way it can build upon it's own "thoughts" is by generating "step-by-step" output which is fed back into it. It seems its own output is its only working memory, at least for now (GPT-4), although that's an obvious area for improvement.


harharveryfunny t1_jdm3bm4 wrote

It seems most current models don't need the number of parameters that they have. DeepMind did a study on model size vs number of training tokens and concluded that for each doubling of number of parameters the number of training tokens also needs to double, and that a model like GPT-3, trained on 300B tokens would really need to be trained on 3.7T tokens (a 10x increase) to take advantage of it's size.

To prove their scaling law, DeepMind built the 70B params Chinchilla model, and trained it on the predicted optimal 1.4T (!) tokens, and found it to outperform GPT-3.



harharveryfunny t1_jdj9if0 wrote

I'm not sure what your point is.

I started by pointing out that there are some use cases (giving face comparison as an example) where you need access to the neural representation of the image (e.g. embeddings), not just object recognition labels.

You seem to want to argue and say that text labels are all you need, but now you've come full circle back to agree with me and say that the model needs that neural representation (embeddings)!

As I said, embeddings are not the same as object labels. An embedding is a point in n-dimensional space. A label is an object name like "cat" or "nose". Encoding an embedding as text (simple enough - just a vector of numbers) doesn't turn it into an object label.


harharveryfunny t1_jdj5mom wrote

>it would just be a text-encoded representation of an embedding vector

One you've decided to input image embeddings into the model, you may as well enter them directly, not converted into text.

In any case, embeddings, whether represented as text or not, are not the same as object recognition labels.


harharveryfunny t1_jdic1s3 wrote

>Obviously having the visual system provide data that the model can use directly is going to be far more effective, but nothing about dense object detection and description is going to be fundamentally incompatible with any level of detail you could extract into an embedding vectror. I'm not saying it would be a smart or effective solution, but it could be done.

I can't see how that could work for something like my face example. You could individually detect facial features, subclassified into hundreds of different eye/mouth/hair/etc/etc variants, and still fail to capture the subtle differences that differentiate one individual from another.


harharveryfunny t1_jdhkn99 wrote

> GPT-4 with image input can interpret any computer screen

Not necessarily - it depends how they've implemented it. If it's just dense object and text detection, then that's all you're going to get.

For the model to be able to actually "see" the image they would need to feed it into the model at the level of neural net representation, not post-detection object description.

For example, if you wanted the model to guage whether two photos of someone not in it's training set are the same person, then it'd need face embeddings to do that (to gauge distance). They could special case all sorts of cases like this in addition to object detection, but you could always find something they missed.

The back-of-a-napkin hand-drawn website sketch demo is promising, but could have been done via object detection.

In the announcement of GPT-4, OpenAI said they're working with another company on the image/vision tech, and gave a link to an assistive vision company... for that type of use maybe dense labelling is enough.


harharveryfunny t1_jckltrp wrote

> I think it should be possible to replicate even GPT-4 with open source tools something like Bloom + FlashAttention & fine-tune on 32k tokens.

So you mean build a model with a 32K attention window, but somehow initialize it with weights from BLOOM (2K window) then finetune ? Are you aware of any attempts to do this sort of thing ?


harharveryfunny t1_jcchnkp wrote

That's a bogus comparison, for a number of reasons such as:

  1. These models are learning vastly more than language alone

  2. These models are learning in an extraordinarily difficult way with *only* "predict next word" feedback and nothing else

  3. Humans learn in a much more efficient, targetted, way via curiosity-driven knowledge gap filling

  4. Humans learn via all sorts of modalities in addition to language. Having already learnt a concept then we only need to be given a name for it once for it to stick


harharveryfunny t1_jca7x9f wrote

Yes - the Transformer is proof by demonstration that you don't need a language-specific architecture to learn language, and also that you can learn language via prediction feedback, which it highly likely how our brain does it too.

Chomsky is still he sticking to his innateness opinion though (with Gary Marcus cheering him on). Perhaps Chomsky will now claim that Broca's area is a Transformer?


harharveryfunny t1_jbjxolz wrote

The LLM name for things like GPT-3 seems to have stuck, which IMO is a bit unfortunate since it's rather misleading. They certainly wouldn't need the amount of data they do if the goal was merely a language model, nor would we need to have progressed past smaller models like GPT-1. The "predict next word" training/feedback may not have changed, but the capabilities people are hoping to induce in these larger/ginormous models is now way beyond language and into the realms of world model, semantics and thought.


harharveryfunny t1_jbjk9nb wrote

Just to follow up, the reason why the "interact with the world" approach is way more efficient is because it's largely curiosity driven - we proactively try to fill gaps in our knowledge rather than just go read a set of encyclopedias and hope it might cover what we need to know. We learn in a much more targeted fashion..


harharveryfunny t1_jbjhmif wrote

Humans don't learn by locking themselves in a room at birth with a set of encyclopedias, or a print-out of the internet. We learn by interaction with the world - perceive/generalize/theorize/experiment, learn from feedback, etc.

It's impressive how well these LLM's perform given what is really a very tough task - build an accurate world model given only "predict next word" feedback, but hardly surprising that they need massive amounts of data to compensate for the task being so tough.


harharveryfunny t1_jaj8bk2 wrote

Could you put any numbers to that ?

What are the FLOPS per token inference for a given prompt length (for a given model)?

What do those FLOPS translate to in terms of run time on Azure's GPUs (V100's ?)

What is the GPU power consumption and data center electricity costs ?

Even with these numbers can we really relate this to their $/token pricing scheme ? The pricing page mentions this 90% cost reduction being for the "gpt-3.5-turbo" model vs the earlier davinci-text-3.5 (?) one - do we even know the architectural details to get the FLOPs ?


harharveryfunny t1_jairuhd wrote

It says they've cut their costs by 90%, and are passing that saving onto the user. I'd have to guess that they are making money on this, not just treating it as a loss-leader for other more expensive models.

The way the API works is that you have to send the entire conversation each time, and the tokens you will be billed for include both those you send and the API's response (which you are likely to append to the conversation and send back to them, getting billed again and again as the conversation progresses). By the time you've hit the 4K token limit of this API, there will have been a bunch of back and forth - you'll have paid a lot more than 4K * 0.2c/1K for the conversation. It's easy to imagine chat-based API's becoming very widespread and the billable volume becoming huge. OpenAI are using Microsoft Azure compute, who may see a large spike in usage/profits out of this.

It'll be interesting to see how this pricing, and that of competitors evolves. Interesting to see also some of OpenAI's annual price plans outlined elsewhere such as $800K/yr for their 8K token limit "DV" model (DaVinci 4.0?), and $1.5M/yr for the 32K token limit "DV" model.


harharveryfunny t1_j9aydo9 wrote

Here's the key, thanks to CHatGPT:

Data preparation: First, the training data is preprocessed to convert the 2D images and camera poses into a set of 3D points and corresponding colors. Each 2D image is projected onto a 3D point cloud using the corresponding camera pose, resulting in a set of 3D points with associated colors.


harharveryfunny t1_j88kg27 wrote

>some of the things that make ML code inscrutable are that (a) every tensor has a shape that you have to guess, and keep track of as it goes through the various layers

That's not inherent to ML though - that's a library design choice to have tensor shape be defined at runtime vs compile time. A while back I wrote my own framework in C++ and chose to go with compile-time shapes, which as well as preventing shape errors is more in keeping with C++'s typing. For a dynamically typed language like Python maybe runtime-defined types/shapes seems a more natural choice, but still a choice nonetheless.


harharveryfunny t1_j7lu67f wrote

What underlying are you talking about? Are you even familiar with the "Attention" paper and it's relevance here? Maybe you think OpenAI use Google's Tensorflow? They don't.

GANs were invented by Ian Goodfellow while he was a student at. U.Montreal, before he ever joined Google.

No - TPUs are not key to deploying at scale unless you are targeting Google cloud. Google is a distant 3rd in cloud marketshare behind Microsoft and Amazon. OpenAI of course deploy on Microsoft Azure, not Google.