
Kinexity t1_jbznlup wrote

There is a repo for CPU inference written in pure C++: https://github.com/ggerganov/llama.cpp

The 30B model runs in just over 20GB of RAM and takes about 1.2 sec per token on my i7-8750H. Proper Windows support has yet to arrive, though, and as of right now the output is garbage for some reason.

Edit: the fp16 version works. It's the 4-bit quantisation that returns garbage.
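For context on what that 4-bit path is doing: ggml-style quantization stores weights in small blocks, each with a shared scale and packed 4-bit values. Below is a minimal sketch of symmetric 4-bit block quantization; the block size, struct layout, and rounding here are illustrative assumptions, not the repo's exact format, but they show how little headroom the format has and why a bug in the pack/unpack path can turn output into garbage.

```cpp
// Minimal sketch of symmetric 4-bit block quantization, in the spirit of
// (but not identical to) ggml's 4-bit formats. Block size, scale type,
// and rounding are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

constexpr int kBlockSize = 32;  // assumed block size

struct BlockQ4 {
    float scale;                     // per-block scale factor
    uint8_t packed[kBlockSize / 2];  // two 4-bit values per byte
};

// Quantize 32 floats into one 4-bit block.
BlockQ4 quantize_block(const float* x) {
    float max_abs = 0.0f;
    for (int i = 0; i < kBlockSize; ++i)
        max_abs = std::max(max_abs, std::fabs(x[i]));

    BlockQ4 b{};
    b.scale = max_abs / 7.0f;  // map [-max_abs, max_abs] onto [-7, 7]
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;

    for (int i = 0; i < kBlockSize; i += 2) {
        // Round to the nearest integer in [-7, 7], then bias into [1, 15].
        int q0 = std::clamp((int)std::lround(x[i]     * inv), -7, 7) + 8;
        int q1 = std::clamp((int)std::lround(x[i + 1] * inv), -7, 7) + 8;
        b.packed[i / 2] = (uint8_t)(q0 | (q1 << 4));
    }
    return b;
}

// Dequantize back to floats; any bug in this round trip (wrong bias,
// wrong nibble order, wrong scale) yields exactly the kind of garbage
// output described above.
void dequantize_block(const BlockQ4& b, float* out) {
    for (int i = 0; i < kBlockSize; i += 2) {
        out[i]     = ((int)(b.packed[i / 2] & 0x0F) - 8) * b.scale;
        out[i + 1] = ((int)(b.packed[i / 2] >> 4)   - 8) * b.scale;
    }
}

int main() {
    float x[kBlockSize], y[kBlockSize];
    for (int i = 0; i < kBlockSize; ++i) x[i] = std::sin(0.3f * i);
    BlockQ4 b = quantize_block(x);
    dequantize_block(b, y);
    for (int i = 0; i < 4; ++i)
        std::printf("x=%+.4f -> dequantized=%+.4f\n", x[i], y[i]);
}
```

At roughly 5 bits per weight in this sketch (4-bit values plus a shared scale), 30B parameters work out to about 19GB of weights, which squares with the "just over 20GB of RAM" figure above.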

29

light24bulbs t1_jc0s4wr wrote

That is slowwwww

−8

Kinexity t1_jc1lwah wrote

That is fast. We are literally talking about a high-end laptop CPU from 5 years ago running a 30B LLM.

17

light24bulbs t1_jc2s2oc wrote

Oh, definitely, it's an amazing optimization.

But less than a token a second is going to be too slow for a lot of real-time applications like human chat.

Still very cool, though.

2

Lajamerr_Mittesdine t1_jc5b99n wrote

I imagine 1 token per 0.2 seconds would be fast enough. At a rough 0.75 words per token, that's about 225 WPM, well beyond a 60 WPM typist.

Someone should benchmark it on an AMD Ryzen 9 7950X3D or an Intel Core i9-13900KS.
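As a sanity check on that arithmetic, here is a small conversion from per-token latency to an equivalent typing speed; the 0.75 words-per-token ratio is a rough rule of thumb for English text, not a measured figure:

```cpp
// Rough conversion from token latency to an equivalent typing speed.
// The 0.75 words-per-token figure is an assumed average for English text,
// not a property of any particular model or tokenizer.
#include <cstdio>

int main() {
    const double words_per_token = 0.75;    // assumed average
    const double latencies[] = {1.2, 0.2};  // seconds per token, from the thread

    for (double sec_per_token : latencies) {
        double tokens_per_sec = 1.0 / sec_per_token;
        double wpm = tokens_per_sec * words_per_token * 60.0;
        std::printf("%.1f s/token -> %.2f tokens/s -> ~%.0f WPM\n",
                    sec_per_token, tokens_per_sec, wpm);
    }
}
```

By that estimate, the 1.2 sec/token laptop figure above corresponds to roughly 37 WPM, and the 0.2 sec/token target to roughly 225 WPM.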

1

light24bulbs t1_jc5e0zk wrote

Yeah, there's definitely a threshold in there where it's fast enough for human interaction. It's only an order of magnitude off; that's not too bad.

3