ReginaldIII t1_jb9goco wrote

Link to your code? It needs to be GPLv3 to be compliant with LLaMA's licensing.

How are you finding the quality of the output? I've had a little play around with the model but wasn't overly impressed. That said, a big parameter set like this makes a nice test bed for looking at things like pruning methods.

−4

ReginaldIII t1_j8e9sc3 wrote

It's been going downhill for a lot longer than that, and it's not something that can be solved with better moderation.

The people who are engaging with the sub at ever higher frequencies simply do not know anything substantive about this field.

How many times will we have people asininely arguing about a model's "rights", or about how "they" (the model) have "learned just like a person does", when the discussion should really have been about data licensing law, intellectual property, and research ethics?

People just don't understand what it is that we actually do anymore.

25

ReginaldIII t1_j6ybiju wrote

This isn't being used for autocomplete or any user-facing text generation purposes though.

They're using it to summarize and make todo lists from the Whisper-extracted transcripts of video meetings. Users aren't getting a frontend to run arbitrary stuff through the model. Seems like a pretty legitimate use case.
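
Roughly this shape of pipeline, sketched below. The Whisper calls are the real openai-whisper API, but `summarize` is just a hypothetical stand-in for whatever LLM call their product actually makes:

```python
import whisper  # openai-whisper

def meeting_todos(audio_path, summarize):
    """Transcribe a recorded meeting, then ask an LLM for a summary
    and todo list. No user-facing free-form generation involved."""
    model = whisper.load_model("base")
    transcript = model.transcribe(audio_path)["text"]
    prompt = "Summarize this meeting and extract a todo list:\n" + transcript
    return summarize(prompt)  # hypothetical stand-in for the LLM call
```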

20

ReginaldIII t1_j61nlno wrote

Trying to force these things into a pure hierarchy sounds nothing short of an exercise in pedantry.

And to what end? You make up your own distinctions that no one else agrees with and you lose your ability to communicate ideas to people because you're talking a different language to them.

If you are so caught up on the "is a" part: have you studied any programming languages that support multiple inheritance?
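
For instance, a minimal Python sketch of why a pure tree breaks down (class names made up for illustration):

```python
# One class can legitimately "be" several things at once, which a
# pure single-inheritance hierarchy cannot express.
class Serializable:
    def to_bytes(self):
        return repr(self.__dict__).encode()

class Comparable:
    def __lt__(self, other):
        return self.key() < other.key()

class Record(Serializable, Comparable):  # "is a" both, simultaneously
    def __init__(self, key):
        self._key = key

    def key(self):
        return self._key

a, b = Record(1), Record(2)
print(a < b, a.to_bytes())  # True b"{'_key': 1}"
```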

2

ReginaldIII t1_j5zzhj1 wrote

Pick the tools that work for the problems you have. If you are online training a model on an embedded device, you need something optimized for that hardware.

I gave you a generic example of a problem domain where this applies. You can search for online training on embedded devices if you are interested, but I can't talk about specific applications because they are not public.
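
Purely as an illustration of the setting (no real application implied), online training just means updating the weights one sample at a time as data streams in; `sensor_stream` below is a hypothetical on-device data source:

```python
import numpy as np

w = np.zeros(4)   # model weights living on the device
lr = 0.01

def online_step(w, x, y):
    """One streaming least-squares SGD step. Nothing is batched or
    stored, so memory stays constant however long the device runs."""
    err = w @ x - y
    return w - lr * err * x

# for x, y in sensor_stream():  # hypothetical on-device data source
#     w = online_step(w, x, y)
```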

All I'm saying is that drawing a line in the sand and saying you'd never use X if it doesn't have Y is silly. What if you end up working on something in the future where the constraints are different?

5

ReginaldIII t1_j5zv9gz wrote

That's such a tenuous distinction, and you're wrong anyway, because you can pose any learning-from-data problem as a generic optimization problem.

They're very useful when your loss function is not differentiable but you still want to fit a model to input+output data pairs.

They're also useful when your model parameters have domain-specific meaning and you can derive rules for how two parameter sets can be meaningfully combined with one another.
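
A minimal sketch of what I mean: a toy genetic algorithm fitting a line under a non-differentiable (counting) loss, with crossover defined as a domain-meaningful averaging of parameter sets. The problem and all the constants are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(params, xs, ys):
    """Non-differentiable loss: how many predictions are off by > 0.5."""
    preds = params[0] * xs + params[1]
    return np.sum(np.abs(preds - ys) > 0.5)

def crossover(a, b):
    """Domain-meaningful combination rule: average two parameter sets."""
    return (a + b) / 2

xs = np.linspace(0, 10, 50)
ys = 2.0 * xs + 1.0                      # ground truth: w=2, b=1

pop = rng.normal(0, 3, size=(32, 2))     # population of (w, b) candidates
for _ in range(200):
    fitness = np.array([loss(p, xs, ys) for p in pop])
    parents = pop[np.argsort(fitness)[:8]]            # keep the 8 fittest
    pop = np.array([crossover(parents[rng.integers(8)],
                              parents[rng.integers(8)])
                    + rng.normal(0, 0.1, size=2)      # mutation
                    for _ in range(32)])

best = pop[np.argmin([loss(p, xs, ys) for p in pop])]
print(best)  # should land near [2.0, 1.0]
```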

Decision trees and random forests are ML too. What you probably mean is Deep Learning. But even that has a fuzzy boundary with surrounding methods.

Being a prescriptivist with these definitions is a waste of time because the research community as a whole cannot draw clear lines in the sand.

10

ReginaldIII t1_j3epizn wrote

No need to downvote; it was an honest question, not an attack. Have you studied the literature and background mathematics of this area much?

"Regime" is a well-established term in mathematics and many other fields; one example of a regime (a domain under rules or constraints) is what you are likely familiar with as a political regime.

With respect to "punchline", I'm going to assume you didn't look at the video at the timestamp listed? Here it is (https://youtu.be/1aXOXHA7Jcw?t=6105). All he is saying is that, after a few-minutes-long tangent, the "punchline" is him circling back around to the point he was trying to make.

It isn't a literal haha punchline, and it's not a mathematical term. The punchline comes at the end of a joke, and a joke often takes you on a journey before circling back to some kind of point. He used the word to mean that here too.

Timothy Nguyen, OP of this post and the host of the video, made a light-hearted chapter title within a long video, based on a term Greg Yang used on his whiteboard.

18

ReginaldIII t1_j33ff9r wrote

Except there is an ecosystem monopoly at the cluster level too, because some of the most established, scalable, and reliable software (such as the packages used in fields like bioinformatics) only provides CUDA implementations of key algorithms, and being able to accurately reproduce results computed by them is vital.

This essentially limits that software to running only on large CUDA clusters. You can't reproduce the results without the scale of a cluster.

Consider software for processing cryo-electron microscopy and ptychography data. Very, very few people actually develop those software packages, but thousands of researchers around the world use them at scale to process their micrographs. Those microscopists are not programmers, or really even cluster experts; they just don't have the skillsets to develop on these code bases. They just need it to work reliably and reproducibly.

I've been working in HPC on a range of large-scale clusters for a long time, and there has been a massive demographic shift in the skillsets our cluster users have. A decade ago you wouldn't dream of letting someone who wasn't an HPC expert anywhere near your cluster; if a team of non-HPC people needed HPC, you'd hire HPC experts into the team to handle it, tune the workloads onto the cluster, and develop the code to make it work well. Now we have an environment where non-HPC people can pay for access and run their workloads directly, because they leverage these pre-tinned software packages.

9

ReginaldIII t1_j0i67uc wrote

Linear regression / logistic regression is all just curve fitting.

> A picture is just a number, but in higher dimensions.

Yes... It literally is. A 10x10 RGB 24bpp image is just a point in a 300-dimensional hypercube (100 pixels x 3 channels), with each axis bounded 0-255 in 256 discrete steps. At each of the 10x10 spatial locations there are 256^3 == 2^24 possible colours, meaning there are (256^3)^100 possible images in that entire domain. Any one image you can come up with or randomly generate is a unique point in that space.
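
To make that concrete, here is the same arithmetic in numpy:

```python
import numpy as np

img = np.random.randint(0, 256, size=(10, 10, 3), dtype=np.uint8)
point = img.reshape(-1)        # one coordinate per channel value
assert point.size == 300       # 100 pixels x 3 channels

# Each of the 100 pixels takes one of 256**3 == 2**24 colours, so the
# domain holds (256**3)**100 distinct images.
n_images = (256 ** 3) ** 100
print(len(str(n_images)))      # 723 digits
```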

I'm not sure what you are trying to argue...

When a GAN is trained to map points on some input manifold (a 512-dimensional unit hypersphere) to points on some output manifold (natural-looking images of cats embedded within the 256x256x3-dimensional space, bounded 0-255 and discretized into 256 distinct intensity values), then yes: the GAN has learned a projection from one high-dimensional manifold to points on another.

It is quite literally just a deterministic function from one space to the other.
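
Schematically, something like the sketch below: a stand-in one-layer "generator", shrunk to 32x32 so it stays small. A trained GAN just learns a far better version of this same shape of map:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(dim=512):
    """A point on the 512-dimensional unit hypersphere."""
    z = rng.standard_normal(dim)
    return z / np.linalg.norm(z)

def generator(z, W):
    """Deterministic map from latent space to image space."""
    x = np.tanh(W @ z)                          # values in (-1, 1)
    return ((x + 1) / 2 * 255).astype(np.uint8).reshape(32, 32, 3)

W = rng.standard_normal((32 * 32 * 3, 512)) * 0.05
img = generator(sample_latent(), W)  # the same z always yields the same img
```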

1

ReginaldIII t1_j0gdsis wrote

This tool is such an unbelievably bad idea.

It really upsets me when I see people using unrestrained models to do what only a safety-critical system should do.

With no clinical study or oversight. No ethics review before work on the project can start. No consideration for the collateral damage that can be caused.

Really really unethical behaviour.

If someone hooked up a bare CNN trained via RL to a real car and put it on the roads, everyone would rightfully be screaming that OP is an unethical fool for endangering the public. But somehow people think it's okay to screw around with medical data... The mind boggles.

0

ReginaldIII t1_j0cxciw wrote

That we have the ability to project concepts into the scaffold of other concepts? Imagine a puppy wearing a sailor hat. Yup, we definitely can do that.

f(x) = 2x

I can put x=1 in, I can put x=2 in, but if I don't put anything in then it just exists as a mathematical construct; it doesn't sit there pondering its own existence or the nature of what x even is. "I mean, why 2x?!"

If I write an equation c(Φ, ω) = (Φ ω Φ), do you zoomorphise it because it looks like a cat?

What about this function, which plots out Simba? Is it aware of how cute it is?

x(t) = ((-1/12 sin(3/2 - 49 t) - 1/4 sin(19/13 - 44 t) - 1/7 sin(37/25 - 39 t) - 3/10 sin(20/13 - 32 t) - 5/16 sin(23/15 - 27 t) - 1/7 sin(11/7 - 25 t) - 7/4 sin(14/9 - 18 t) - 5/3 sin(14/9 - 6 t) - 31/10 sin(11/7 - 3 t) - 39/4 sin(11/7 - t) + 6/5 sin(2 t + 47/10) + 34/11 sin(4 t + 19/12) + 83/10 sin(5 t + 19/12) + 13/3 sin(7 t + 19/12) + 94/13 sin(8 t + 8/5) + 19/8 sin(9 t + 19/12) + 9/10 sin(10 t + 61/13) + 13/6 sin(11 t + 13/8) + 23/9 sin(12 t + 33/7) + 2/9 sin(13 t + 37/8) + 4/9 sin(14 t + 19/11) + 37/16 sin(15 t + 8/5) + 7/9 sin(16 t + 5/3) + 2/11 sin(17 t + 47/10) + 3/4 sin(19 t + 5/3) + 1/20 sin(20 t + 24/11) + 11/10 sin(21 t + 21/13) + 1/5 sin(22 t + 22/13) + 2/11 sin(23 t + 11/7) + 3/11 sin(24 t + 22/13) + 1/9 sin(26 t + 17/9) + 1/63 sin(28 t + 43/13) + 3/10 sin(29 t + 23/14) + 1/45 sin(30 t + 45/23) + 1/7 sin(31 t + 5/3) + 3/7 sin(33 t + 5/3) + 1/23 sin(34 t + 9/2) + 1/6 sin(35 t + 8/5) + 1/7 sin(36 t + 7/4) + 1/10 sin(37 t + 8/5) + 1/6 sin(38 t + 16/9) + 1/28 sin(40 t + 4) + 1/41 sin(41 t + 31/7) + 1/37 sin(42 t + 25/6) + 3/14 sin(43 t + 12/7) + 2/7 sin(45 t + 22/13) + 1/9 sin(46 t + 17/10) + 1/26 sin(47 t + 12/7) + 1/23 sin(48 t + 58/13) - 55/4) θ(111 π - t) θ(t - 107 π) + (-1/5 sin(25/17 - 43 t) - 1/42 sin(1/38 - 41 t) - 1/9 sin(17/11 - 37 t) - 1/5 sin(4/3 - 25 t) - 10/9 sin(17/11 - 19 t) - 1/6 sin(20/19 - 17 t) - 161/17 sin(14/9 - 2 t) + 34/9 sin(t + 11/7) + 78/7 sin(3 t + 8/5) + 494/11 sin(4 t + 33/7) + 15/4 sin(5 t + 51/11) + 9/4 sin(6 t + 47/10) + 123/19 sin(7 t + 33/7) + 49/24 sin(8 t + 8/5) + 32/19 sin(9 t + 17/11) + 55/18 sin(10 t + 17/11) + 16/5 sin(11 t + 29/19) + 4 sin(12 t + 14/9) + 77/19 sin(13 t + 61/13) + 29/12 sin(14 t + 14/3) + 13/7 sin(15 t + 29/19) + 13/4 sin(16 t + 23/15) ...

1

ReginaldIII t1_j0cuujj wrote

It mimics statistical trends from the training data. It uses embeddings that place related semantics and concepts near one another, and unrelated ones far apart. Therefore, when it regurgitates structures and logical templates observed in the training data, it is able to project other, similar concepts and semantics into those structures, making them look convincingly like entirely novel and intentional responses.
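
A toy illustration of the embedding part (made-up three-dimensional vectors, obviously nothing like a real embedding table):

```python
import numpy as np

emb = {
    "cat":        np.array([0.90, 0.10, 0.05]),
    "kitten":     np.array([0.85, 0.15, 0.05]),
    "carburetor": np.array([0.05, 0.10, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["kitten"]))      # high: related concepts
print(cosine(emb["cat"], emb["carburetor"]))  # low: unrelated concepts
```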

1

ReginaldIII t1_j0b9rwb wrote

RL is being used to apply weight updates during fine-tuning. The resulting LLM is still just a static LLM with the same architecture.

It has no intent and no awareness. It is just a model being shown some prior and asked to sample the next token.

It is just an LLM. The fine-tuning method just produces an LLM that looks high quality for the specific task of conversationally structured inputs and outputs.

You would never take a linear regression model that happens to fit the data perfectly, feed it a new prior of some X value, see that it gives a sensible Y value, and conclude, "Look, my linear regression is really aware of the problem domain!"

Nope. Your linear regression model fit the data well, and you were able to sample something from it that was on the manifold the training data also lived on. That's all that's going on. Just in higher dimensions.
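
The whole analogy in a dozen lines of numpy (toy numbers throughout):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=100)   # data on a noisy line

w, b = np.polyfit(x, y, deg=1)   # fit the model
x_new = 7.5                      # a new "prior" to condition on
y_new = w * x_new + b            # a perfectly sensible-looking output

# y_new looks right because x_new sits on the manifold (here, a line)
# the training data lived on, not because the model is "aware".
print(y_new)
```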

4

ReginaldIII t1_j06nan5 wrote

> Though blockchains would probably be too slow for something like this.

This is the key point. Blockchains give a confidence bound on trustworthiness by being too slow-moving and computationally expensive to manipulate. This is vital when proving a historical audit trail is correct and immutable.

It just isn't important or applicable for high-throughput applications where you only care about the local, immediate correctness of intermediate results.

To quote one of my other comments in this thread:

> Blockchains also don't present a solution to trustworthiness here. In the same way that a wallet being present in a transaction on the blockchain says nothing about the real identity of the parties, nor does it say anything about whether the goods or services the transaction was for were carried out honestly.

We care about whether or not you got ripped off by the guy you gave money to (the GPU you gave data to). We don't care about proving you did actually give them the money at a specific point in time.

2

ReginaldIII t1_j06mqeo wrote

In rendertoken's scenario we don't have a requirement for high throughput of one job feeding into another.

The individual units of work are expensive and long-lived. Rendering a frame of a film takes roughly the same amount of time it did a few years ago; we just get higher-fidelity output for that same render budget. All the frames can be processed lazily by the compute farm, and the results just go into a pool for later collection.

Because the collation of the results is decoupled from the actual computation, you have the time and resources to encode the results on a blockchain. Auditing that your requested work was actually processed is a desirable quality, and so a blockchain does provide a benefit.

In the case of distributed model training the scenario is different: we have high throughput of comparatively small chunks of work. Other than passing the results to the next immediate worker for the next part of the computation, we have no desire (or storage capacity) to keep any of the intermediate results. And because we have high throughput of many small chunks, a blockchain encoding those chunks could only afford a small proof of work per chunk, and so would not be a reliable source of truth anyway.

Then consider that we don't even care about an audit trail proving historical chunks really were processed when we think they were. We only care about checking that results are valid on the fly, as we do the compute.

We just need a vote by agreement on the immediate results so they can be handed off to the next workers. Yes, blockchains often include a vote-by-agreement mechanism for deciding the actual state of the chain, but we need only that part. We don't actually need the blockchain itself.
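
That part on its own is tiny. Something like this sketch, where each chunk goes to a few redundant workers and is accepted once enough of them agree (`agreed_result` and the hex values are hypothetical, just for illustration):

```python
from collections import Counter

def agreed_result(replica_outputs, quorum=2):
    """Accept a chunk's result only when at least `quorum` untrusted
    workers returned the same value. A real system would compare
    checksums of the intermediate tensors, not the tensors themselves."""
    value, n = Counter(replica_outputs).most_common(1)[0]
    if n < quorum:
        raise ValueError("no quorum: reissue the chunk to fresh workers")
    return value

print(agreed_result(["0xabc", "0xabc", "0xdef"]))  # -> "0xabc"
```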

2