manubfr t1_j8hf8la wrote

I've had access for a few days and I feel quite underwhelmed. Bing chat is VERY inaccurate: I'd say more than half the time, when researching topics I'm very familiar with, it correctly identifies the information sources and then botches the output with very plain mistakes (e.g. it pulls the correct statement from a webpage but gets the year wrong, replacing 2022 with 2021 within the same statement). It also struggles with disambiguation, e.g. mixing up two homonyms.

I honestly thought web connectivity would massively improve accuracy, but so far I've been very disappointed. The short-term creative potential of LLMs and image models, however, is insane.


manubfr t1_j67ybjg wrote

With enough data and a smarter model, you could probably ask it to first break a job down into tasks and then execute them sequentially without human intervention. That's what Adept's ACT-1 is trying to do.
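A hypothetical sketch of that loop, where `llm()` is a stub standing in for a real model call (ACT-1's actual interface isn't public):

```python
# Minimal sketch of "break the job into tasks, then execute them
# sequentially". llm() is a hypothetical stub, not a real API.

def llm(prompt: str) -> str:
    # A real implementation would call a hosted model here.
    if prompt.startswith("Break down"):
        return "1. Open the food delivery site\n2. Build the order\n3. Pay and confirm"
    return f"done: {prompt}"

def run_task(goal: str) -> list[str]:
    """Ask the model for a numbered plan, then execute each step in order."""
    plan = llm(f"Break down this goal into numbered steps: {goal}")
    steps = [line.split(". ", 1)[1] for line in plan.splitlines()]
    return [llm(step) for step in steps]

results = run_task("order a burger and fries")
```

The point is the two-phase structure (plan, then act), not the stubbed answers.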

I fully expect that a lot of complex digital tasks will one day be fully automated: you enter a high-level description of what you want, the model proposes options for you to pick, then calculates the compute budget for your selected option and gives you a few quotes.

So for example, "order a burger, fries and a coke now" will essentially be free, while "write and design a 40-page comic book about the story of Theseus in the style of Frank Miller, then publish it on Amazon" will come back with options (maybe that task costs $20 or something, likely cheaper).

Automating entire workflows is, to me, the most exciting and realistic outcome of LLMs in the next few years.


manubfr t1_j5y8mo0 wrote

You're right, it could be that 3.5 is already using that approach. I guess the emergent-cognition tests haven't been published for GPT-3.5 yet (or have they?), so it's hard for us to measure performance as individuals. Someone could test text-davinci-003 on a bunch of cognitive tasks in the Playground, but I'm far too lazy to do that :)
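For anyone less lazy, a toy harness might look like this. The `stub_model` here returns canned answers just to make the example self-contained; a real test would swap in a call to text-davinci-003:

```python
# Tiny scoring harness for simple cognitive tasks. stub_model() is a
# stand-in with canned answers; a real run would query the model instead.

TASKS = [
    ("Alice is taller than Bob, and Bob is taller than Carol. Who is tallest?", "alice"),
    ("What is the next number in the sequence 2, 4, 8, 16?", "32"),
]

def stub_model(prompt: str) -> str:
    # Canned answers so the harness runs offline.
    return "Alice" if "taller" in prompt else "32"

def accuracy(model, tasks) -> float:
    """Fraction of tasks where the expected answer appears in the reply."""
    return sum(exp in model(q).lower() for q, exp in tasks) / len(tasks)

score = accuracy(stub_model, TASKS)
```

Swap the stub for an actual API call and you have a crude cognitive-task benchmark.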


manubfr t1_j5y6wko wrote

Google (and DeepMind) actually have better LLM tech and models than OpenAI (if you believe their published research, anyway). They had a significant breakthrough last year in terms of scalability:

Existing LLMs were found to be undertrained, and with some tweaks you can train a smaller model that outperforms larger ones. Chinchilla is arguably the most performant model we've heard of to date, but it hasn't been pushed to any consumer-facing application AFAIK.
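The Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per parameter (a simplification of the paper's fitted law), which makes the undertraining easy to see:

```python
# Back-of-the-envelope Chinchilla check: compute-optimal training needs
# roughly 20 tokens per parameter (rule of thumb, not the exact fit).

def chinchilla_tokens(params: float) -> float:
    return 20 * params

# A 175B-parameter model (GPT-3 scale) would want ~3.5T training tokens
# under this heuristic; GPT-3 was reportedly trained on ~300B tokens.
needed = chinchilla_tokens(175e9)
shortfall = needed / 300e9
```

By this estimate, a GPT-3-sized model saw roughly a tenth of the compute-optimal token count, which is why a smaller, longer-trained model like Chinchilla (70B params, ~1.4T tokens) can come out ahead.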

This should be powering their ChatGPT competitor, Sparrow, which might be released this year. I'm pretty sure OpenAI will also implement those ideas for GPT-4.