
cdsmith t1_j8gq1gt wrote

Imagine you just didn't invest those millions of dollars, then, and instead someone else developed the idea and didn't want to freeze the rest of the world out of using it.

Patents only make sense if you assume that the alternative to you inventing something is no one inventing it. Experience shows that's very rarely the case; in general, when an idea's time has come (the base knowledge is there to understand it, the infrastructure is in place to use it effectively, etc.), there is a race between many parties to develop the idea. This applies to everything from machine learning models to the light bulb and the telephone, both of which were famously being developed by multiple inventors simultaneously before one person got lucky, often by a matter of mere days, and was granted exclusive rights to the invention, while everyone else who had the same idea was out of luck.

1

cdsmith t1_j8gpcrf wrote

We're off-topic for this forum, but since we're here anyway...

Patents are tricky when it comes to stuff like this. To successfully patent something software related, you must be able to convince the patent office that what you're patenting counts as a "process" and not as an "idea", "concept", "principle", or "algorithm", all of which are explicitly not patentable. The nuances of how you draw the lines between these categories are fairly complex, but in practice it often comes down to being able to patent engineering details of HOW you do something in the face of a bunch of real-world constraints, but not WHAT you are doing or any broad generalization of the bigger picture.

It's likely that Swype didn't just screw up and write their patent poorly, but rather wrote the only patent their legal team could succeed in getting approved. If it didn't apply to what other companies did later because they used a different "process" (for nuanced lawyer meanings of that word) to accomplish the same goal, that is an intentional feature of the patent system, not a failure by Swype.

1

cdsmith t1_j6tg9z7 wrote

Awesome question! I definitely laughed.

The serious answer, which the GitHub link clarifies, is that the model is semi-unsupervised: they have a lot of data, but only some of it is labeled. Presumably the labeled data is all negative, because we understand its natural origin. So effectively this becomes almost an anomaly-detection problem, looking for data that is least like the known natural signals.

Even if it just directs scientists to look at new natural phenomena, that sounds valuable.
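
To make that framing concrete, here's a minimal sketch of one-class anomaly detection: fit a detector only on the labeled "known natural" data, then rank everything else by how unlike it the detector thinks it is. This is not the repo's actual pipeline; the data is made up and the isolation forest is just an off-the-shelf stand-in.

```python
# Hedged sketch: NOT the repo's pipeline. Random data stands in for real signals.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_natural = rng.normal(0.0, 1.0, size=(5000, 16))    # stand-in for labeled "known natural" signals
X_unlabeled = rng.normal(0.0, 1.2, size=(1000, 16))  # stand-in for everything else

# Fit only on the signals we believe we understand...
detector = IsolationForest(random_state=0).fit(X_natural)

# ...then rank the unlabeled signals by how unlike the known-natural data they look.
scores = detector.score_samples(X_unlabeled)  # lower score = more anomalous
candidates = np.argsort(scores)[:20]          # the 20 most anomalous, for a human to inspect
print(candidates)
```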

17

cdsmith t1_j60q0bs wrote

I can only answer about Groq. I'm not trying to sell you Groq hardware, honestly; I just don't know the answers for other accelerator chips.

Groq very likely increases inference speed and power efficiency over GPUs; that's actually its main purpose. How much depends on the model, though. I'm not in marketing so I probably don't have the best resources here, but there are some general performance numbers (unfortunately no comparisons) in this article, and this one talks about a very specific case where a Groq chip gets you a 1000x inference performance advantage over the A100.

To run a model on a Groq chip, you would typically start before CUDA enters the picture at all, converting a model from PyTorch, TensorFlow, or several other common formats into a Groq program using https://github.com/groq/groqflow. If you have custom-written CUDA code, then you likely have some programming work ahead of you to run on anything besides a GPU.
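
For the common case of a plain PyTorch model, the conversion is roughly this shape. The entry point name is how I remember the groqflow README, so treat the exact API as an assumption and check the repo:

```python
# Rough sketch of the groqflow path for a plain PyTorch model. The `groqit`
# entry point is an assumption based on my reading of the repo's README;
# verify against https://github.com/groq/groqflow before relying on it.
import torch

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))

from groqflow import groqit  # assumed import, per the README

example_inputs = {"x": torch.randn(1, 16)}
gmodel = groqit(TinyModel(), example_inputs)  # compile the PyTorch model into a Groq program
print(gmodel(**example_inputs))               # run the compiled program
```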

7

cdsmith t1_j460nf2 wrote

Tree search means precisely that: searching a tree. In the context of AlphaZero, the tree is the game tree. That is:

  • I can move my pawn to e4. Then:
    • You could move your knight to c6
      • ...
    • Or you could move your pawn to e6
      • ...
    • Or ...
  • Or, I could move my pawn to d4. Then:
    • You could move your pawn to c5, attacking my pawn.
      • ...
    • Or you could move your knight to c6.
      • ...
    • Or you could move your pawn to d5.
      • ...
    • Or ...
  • Or, I could ...

That's it. The possible moves at each game state, and the game states they lead to, form a tree. (Actually more like a DAG, since transpositions are possible, but it's usually simplified by calling it a tree.) Searching that tree up to a certain depth amounts to thinking forward that many moves in the game. The way you search the tree is some variation on minimax: you want to choose the best move for yourself now, but at the next level down you pessimistically assume your opponent plays their best move (which is the worst one for you), and so on.

The variations differ in what order you visit the nodes of the tree. You could do a straightforward depth-first traversal up to a certain depth, in which case this is traditional minimax search. You can refuse to ever visit some nodes because you know they can't possibly matter, and that's alpha-beta pruning. You could even visit nodes in a random order, adjusting the likelihood of visiting each node based on a constantly updated estimate of how likely it is to matter, and that's roughly what happens in Monte Carlo tree search. Either way, you're just traversing that tree in some order.
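
For reference, the plain depth-limited version with alpha-beta pruning looks something like this. The `moves`, `apply_move`, and `evaluate` functions are hypothetical game-specific hooks you'd supply; nothing here is specific to chess or AlphaZero.

```python
# Hedged sketch of depth-limited minimax with alpha-beta pruning over a generic game tree.
import math

def minimax(state, depth, moves, apply_move, evaluate,
            alpha=-math.inf, beta=math.inf, maximizing=True):
    legal = moves(state)
    if depth == 0 or not legal:
        return evaluate(state)  # static score from the maximizing player's point of view
    if maximizing:
        best = -math.inf
        for m in legal:
            best = max(best, minimax(apply_move(state, m), depth - 1,
                                     moves, apply_move, evaluate, alpha, beta, False))
            alpha = max(alpha, best)
            if beta <= alpha:   # this branch can't change the result, so prune it
                break
        return best
    else:
        best = math.inf
        for m in legal:
            best = min(best, minimax(apply_move(state, m), depth - 1,
                                     moves, apply_move, evaluate, alpha, beta, True))
            beta = min(beta, best)
            if beta <= alpha:
                break
        return best
```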

AlphaZero combines this with machine learning by using two trained models: one tweaks the traversal order of the tree by identifying moves that seem likely to be good, and the other evaluates partially completed games to estimate how good they look for each player. But ultimately, the learned models just plug into certain holes in the tree search algorithm.
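
Roughly, those holes are (a) which child to visit next during selection and (b) how to score a leaf without playing the game out. Here's a sketch loosely in the spirit of AlphaZero's PUCT selection rule; `policy_net` and `value_net` are hypothetical stand-ins for the trained networks, not the real implementation.

```python
# Hedged sketch of where the two learned models plug into the tree search.
import math
from dataclasses import dataclass

@dataclass
class Child:
    prior: float              # policy network's guess that this move is promising
    visits: int = 0
    total_value: float = 0.0  # sum of value estimates backed up through this child

def select_child(children, c_puct=1.5):
    """Pick the next child to explore, biased by the policy prior (PUCT-style)."""
    total_visits = sum(ch.visits for ch in children)
    def score(ch):
        exploit = ch.total_value / ch.visits if ch.visits else 0.0
        explore = c_puct * ch.prior * math.sqrt(total_visits + 1) / (1 + ch.visits)
        return exploit + explore
    return max(children, key=score)

def evaluate_leaf(state, policy_net, value_net):
    """Instead of rolling the game out, ask the learned models about this position."""
    priors = policy_net(state)  # distribution over legal moves: priors for new children
    value = value_net(state)    # scalar estimate of the outcome from this position
    return priors, value
```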

16

cdsmith t1_j45e09w wrote

Sort of. The promise of differentiable programming is to be able to implement discrete algorithms in ways that are transparent to gradient descent, but it's really only the numerical values of the inputs that are transparent to gradient descent, not the structure itself. The key idea here is the use of so-called TPRs (tensor product representations) to encode not just values but structure as well in a continuous way, so that one has an entire continuous deformation from the representation of one discrete structure to another. (Obviously, this deformation has to pass through intermediate states that are not directly interpretable as a single discrete structure, but the article argues that even these can represent valid states in some situations.)
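
A tiny numeric illustration of the TPR idea, with made-up filler and role vectors (not the paper's construction): the structure is the sum of outer products of fillers with roles, and orthonormal roles let you unbind each filler again.

```python
# Hedged sketch of a tensor product representation: encode (filler, role) pairs
# as a sum of outer products, then unbind with the role vectors.
import numpy as np

rng = np.random.default_rng(0)
d_filler, d_role, n_slots = 8, 4, 3

fillers = rng.normal(size=(n_slots, d_filler))  # e.g. embeddings of three symbols
roles = np.linalg.qr(rng.normal(size=(d_role, d_role)))[0][:n_slots]  # orthonormal role vectors

# Bind each filler to its role and superpose: T = sum_i f_i (outer) r_i
T = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# Because the roles are orthonormal, unbinding recovers each filler: f_i = T @ r_i
print(np.allclose(T @ roles[1], fillers[1]))  # True, up to floating point

# A point halfway between two encodings is still a valid tensor, even though it
# doesn't correspond to a single discrete structure: that's the continuous deformation.
T_swapped = sum(np.outer(f, r) for f, r in zip(fillers[::-1], roles))
T_mid = 0.5 * T + 0.5 * T_swapped
```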

9

cdsmith t1_j3heb7r wrote

I'm not at all up to speed on this, but I followed most of the presentation. I was left with this question, though.

Through most of the video, I had the impression that this was building a rigorous theory of what happens if you forget to train your neural network. That is, the assumption was that all the weights were sampled independently from Gaussian distributions, and the "master theorem" as stated here definitely assumed that all the weights in the network were random. But then, about 2.5 hours in, they are suddenly talking about the behavior of the network under training, and as far as I can tell there's no discussion at all of how the theorems they painstakingly established for random weights tell you anything about learning behavior.

Did I miss something, or was this just left out of the video? They do seem to have switched by this point from covering proofs to just stating results... which is fine, the video is long enough already, but I'd love to have some intuition for how this model treats training, as opposed to inference with random weights.

3

cdsmith t1_j3ev3je wrote

This is definitely a theory presentation, though it does end with some applications to hyperparameter transfer when scaling model size. But if your main experience with ML is building models and applications, I'm not surprised it looks unfamiliar.

That being said, though, give it a chance if you're interested. Some parts of the outline didn't look familiar to me either, but the video is well-made and stops to explain most of the background knowledge. And you can always gloss over the bits you don't understand.

1

cdsmith t1_j2yk9jb wrote

I think the best way to answer your question is to ask you to be more precise about what, exactly, you mean by "outperform".

There's some limited sense in which your reasoning works the way you seem to have envisioned. A generative model like GPT or a GAN is typically built at least partly to produce output that's indistinguishable from what a human would produce, via an autoregressive training objective or an adversarial one. It cannot do better at that goal, because a human, by definition, has a 100% success rate at producing something indistinguishable from what a human produces.

But there are limitations to this reasoning:

  1. Producing any arbitrary human-like output is not actually the goal. People don't evaluate generative models on how human-like they are, but rather on how useful their results are. There are lots of ways their results can be more useful even if they aren't quite as "human-like". In fact, the motivation for trying to keep the results human-like is mainly that allowing a generative model too much freedom to generate samples that are very different from its training set decreases accuracy, not that it's a goal in its own right.
  2. That's not all of machine learning anyway. Another very common task is, for example, Netflix predicting what movies you will want to watch in order to build its recommendations. Humans are involved in producing that data, but the model isn't learning from what other humans predicted users would watch; it's learning directly from observations of what humans really did watch. Such a system isn't aiming to emulate humans at all. Some machine learning isn't even trained on human-generated data: the objective it optimizes is either directly observed and measured, or directly computed.
  3. Even in cases where a supervised model is learning to predict human labeling, which is where your reasoning best applies, the quantity of data can overcome human accuracy. Imagine this simpler scenario: I am learning to predict which President is on a U.S. bill, given the denomination. This is an extremely simple function to learn, of course, but let's say I only have access to data with a rather poor accuracy rate of 60%, with errors occurring uniformly. Well, with enough of that data, I can still learn to be 100% accurate, simply by noting which answer is the most common for each input (there's a quick numerical check of this after the list). That's only a theoretical argument, and in a realistic ML context it's very difficult to get better-than-human performance on a supervised human-labeled task like this. But it's not impossible.
  4. And, of course, if you look at more than just accuracy, ML can be "better" than humans in many ways. They can be cheaper, faster, more easily accessible, more deterministic, etc.
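
Here's the quick numerical check promised in point 3, with made-up denominations and an assumed 60% labeler accuracy: the majority answer per input recovers the true labels even though every individual labeler is wrong 40% of the time.

```python
# Hedged toy check of the "majority vote beats noisy labelers" argument.
import random
from collections import Counter

random.seed(0)
true_label = {1: "Washington", 5: "Lincoln", 20: "Jackson", 50: "Grant"}
labels = sorted(set(true_label.values()))

def noisy_label(denomination, accuracy=0.6):
    """Return the right answer 60% of the time, a uniformly wrong one otherwise."""
    if random.random() < accuracy:
        return true_label[denomination]
    return random.choice([l for l in labels if l != true_label[denomination]])

# Collect many noisy "human" labels per input, then take the majority answer.
dataset = [(d, noisy_label(d)) for d in true_label for _ in range(1000)]
majority = {
    d: Counter(lbl for dd, lbl in dataset if dd == d).most_common(1)[0][0]
    for d in true_label
}
print(majority == true_label)  # True: the learner beats its 60%-accurate teachers
```
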
13

cdsmith t1_j2uzks4 wrote

The idea is that there's an inflection point: at first you are mainly removing (masking with zeros) dimensions whose values are extremely small anyway and don't make much difference to the output, so you don't lose much accuracy. But after you've removed those dimensions, the remaining ones are specifically the ones that do matter, so you can't just go find more non-impactful dimensions; they're already gone.

As for what would happen if you over-pruned a model trained with a large number of parameters, I'd naively expect it to do much worse. If you train on more parameters and then zero out significant weights, not only do you have a lower-dimensional space to model in (which is unavoidable), but you also lose the information those weights were carrying, because at training time the model relied on the parameters you've now zeroed out to capture it.
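
A toy illustration of that inflection point, using magnitude pruning on a made-up weight matrix (not any particular paper's setup): the output error stays small while pruning removes only the tiny weights, then jumps once it starts hitting the weights that carry the signal.

```python
# Hedged sketch: magnitude pruning on a synthetic weight matrix.
import numpy as np

rng = np.random.default_rng(0)

# A toy weight matrix: roughly 10% of the entries carry real signal, the rest are tiny.
small = rng.normal(scale=0.01, size=(128, 128))
big = rng.normal(scale=1.0, size=(128, 128))
W = np.where(rng.random((128, 128)) < 0.1, big, small)

x = rng.normal(size=(128, 256))
reference = W @ x

for fraction in (0.5, 0.8, 0.9, 0.95, 0.99):
    threshold = np.quantile(np.abs(W), fraction)
    pruned = np.where(np.abs(W) >= threshold, W, 0.0)  # mask the smallest-magnitude weights
    error = np.linalg.norm(pruned @ x - reference) / np.linalg.norm(reference)
    print(f"pruned {fraction:.0%} of weights -> relative output error {error:.3f}")
```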

4