marcus_hk t1_jcrgwqm wrote

Just browsing on my phone and haven't dug deep yet, but the notebook says that build.py targets M2 by default and can also target CUDA. What about CPU?

I’d love to see a super minimal example, like running a small nn.Linear layer, for pedagogical purposes and to abstract away the complexity of a larger model like Stable Diffusion.
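
For concreteness, something at the scale of this plain-PyTorch toy is what I have in mind (hypothetical, not from the repo), run end to end through the same build pipeline:

```python
import torch
import torch.nn as nn

# Hypothetical toy standing in for the full Stable Diffusion pipeline,
# just to show the scale of example I mean.
model = nn.Linear(4, 2).eval()

x = torch.randn(1, 4)
with torch.no_grad():
    print(model(x))   # the whole "model" is a single matmul + bias
```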

1

marcus_hk t1_jcrdufd wrote

Reply to comment by race2tb in [P] Web Stable Diffusion by crowwork

For weights, yes, and for inference. If you can decompose a model and distribute it across enough nodes, then you can get meaningful compute out of CPUs too, for instance for tokenization and for smaller models.
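
A rough sketch of the kind of decomposition I mean (hypothetical, plain PyTorch; the two "nodes" are simulated here as separate module halves):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: split one model into stages that could live on
# separate CPU nodes, with activations shipped between them.
full = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 4),
)
stage_a = full[:2]   # would run on node A
stage_b = full[2:]   # would run on node B

x = torch.randn(8, 16)
h = stage_a(x)   # node A computes, then sends `h` over the wire
y = stage_b(h)   # node B finishes the forward pass
print(y.shape)   # torch.Size([8, 4])
```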

1

marcus_hk t1_j9gij1a wrote

Looks great. It might not be intelligible to those who don't know what they're looking at, though. Maybe include labels for, say, the filters, what each slice of the input represents, and so on?

Would like to see the same for normalization layers. And RNNs. And transformers. Keep it up!

61

marcus_hk t1_j8ejn0n wrote

Which part do you disagree with here:

> My unwavering opinion on current (auto-regressive) LLMs
>
> 1. They are useful as writing aids.
> 2. They are "reactive" & don't plan nor reason.
> 3. They make stuff up or retrieve stuff approximately.
> 4. That can be mitigated but not fixed by human feedback.
> 5. Better systems will come

https://twitter.com/ylecun/status/1625118108082995203?s=20

3

marcus_hk t1_j2xqe50 wrote

>Are there other recent deep learning based alternatives?

Structured State Space Models

Transformers seem best suited to forming associations among discrete elements; that's what self-attention is, after all. Where transformers perform well over very long ranges (in audio generation, for example), there is typically heavy use of Fourier transforms and CNNs as "feature extractors", and the transformer does not process the raw data directly.

The S4 model linked above treats time-series data not as discrete samples but as a continuous signal, and as a result it performs much better on long sequences.
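
To make "continuous signal" concrete, here is a minimal sketch of the state space idea behind S4: continuous dynamics x'(t) = A x(t) + B u(t), y(t) = C x(t), discretized with the bilinear transform and unrolled as a recurrence. The matrices here are random placeholders, not the structured HiPPO matrices S4 actually uses:

```python
import numpy as np

# Minimal state space recurrence (sketch only; random matrices stand in
# for S4's structured HiPPO parameterization).
N, dt = 8, 0.01
A = 0.1 * np.random.randn(N, N)
B = np.random.randn(N, 1)
C = np.random.randn(1, N)

I = np.eye(N)
A_bar = np.linalg.solve(I - (dt / 2) * A, I + (dt / 2) * A)  # discrete A
B_bar = np.linalg.solve(I - (dt / 2) * A, dt * B)            # discrete B

u = np.sin(np.linspace(0.0, 10.0, 500))  # a smooth, "continuous" input
x = np.zeros((N, 1))
ys = []
for uk in u:
    x = A_bar @ x + B_bar * uk   # state update at each sample
    ys.append(float(C @ x))      # scalar output y_k
```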

2

marcus_hk t1_iw9gdpi wrote

I designed a custom architecture to model an analog signal processor with many different combinations of settings: a custom MGU (minimal gated unit) that modulates HiPPO memory according to settings embeddings. It trains in parallel, so it's much faster than, say, a PyTorch GRU.
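
For reference, a generic MGU cell looks roughly like this (a sketch of the standard single-gate unit from Zhou et al., 2016, not my HiPPO-modulated variant):

```python
import torch
import torch.nn as nn

# Sketch of a generic minimal gated unit (MGU): a GRU reduced to a
# single forget gate.
class MGUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h):
        f = torch.sigmoid(self.gate(torch.cat([x, h], dim=-1)))         # forget gate
        h_tilde = torch.tanh(self.cand(torch.cat([x, f * h], dim=-1)))  # candidate state
        return (1 - f) * h + f * h_tilde

cell = MGUCell(3, 5)
h = torch.zeros(1, 5)
for _ in range(10):                    # step through a length-10 sequence
    h = cell(torch.randn(1, 3), h)
```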

Another recent design combines convolution and transformers to model spinal CT scans. This is challenging because a single scan can have a shape like (512, 1, 1024, 1024), which is too large to train on directly for dense tasks like segmentation. If you simply resize every scan to a constant shape, you lose or distort the physical information embedded in the scans; you don't want a scan of the neck to end up the same size as a scan of the whole spine, for instance. So you have to be more clever than that, and something this specialized doesn't come ready to go out of the box.

3

marcus_hk t1_isgbpv9 wrote

If you have a dense 3D image, as in CT, then there is really no distinction between "within image" and "across slices": they are the same thing, just along a different axis. With sparse MRI slices, though, you're right.
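
A toy illustration of that point, with a hypothetical dense (depth, height, width) volume:

```python
import numpy as np

# "Within image" and "across slices" are just different axes of one array.
vol = np.random.rand(64, 256, 256)   # hypothetical (D, H, W) CT volume

axial    = vol[32, :, :]    # one slice "within image"   -> (256, 256)
coronal  = vol[:, 128, :]   # "across slices", axis 1    -> (64, 256)
sagittal = vol[:, :, 128]   # "across slices", axis 2    -> (64, 256)
print(axial.shape, coronal.shape, sagittal.shape)
```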

2