ReginaldIII t1_j06mqeo wrote

In RenderToken's scenario we don't have a requirement for high throughput of one job feeding into another.

The individual units of work are expensive and long-lived. Rendering a frame of a film takes roughly the same amount of time as it did a few years ago; we just get higher-fidelity output for the same render budget. All the frames can be processed lazily by the compute farm, and the results just go into a pool for later collection.

Because the collation of results happens offline, decoupled from the actual computation, you have the time and resources to encode the results on a blockchain. Being able to audit that your requested work really was processed is a desirable property, so a blockchain does provide a benefit there.
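To make that concrete, here's a minimal sketch of a hash-chained audit log of completed jobs. It is not RenderToken's actual on-chain format; `make_entry`, `verify_chain`, and the field names are invented for illustration. The point is just that each entry commits to the previous one, so nobody can quietly rewrite the history of what was rendered.

```python
import hashlib
import json
import time

def make_entry(prev_hash: str, job_id: str, result_hash: str) -> dict:
    """Append-only log entry (hypothetical schema, not RenderToken's)."""
    body = {
        "prev_hash": prev_hash,      # commits to the previous entry
        "job_id": job_id,            # which render job was completed
        "result_hash": result_hash,  # hash of the rendered frame
        "timestamp": time.time(),
    }
    body["entry_hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    return body

def verify_chain(entries: list[dict]) -> bool:
    """Recompute every hash link; any tampering breaks the chain."""
    prev = entries[0]["prev_hash"]
    for e in entries:
        body = {k: v for k, v in e.items() if k != "entry_hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if e["prev_hash"] != prev or e["entry_hash"] != expected:
            return False
        prev = e["entry_hash"]
    return True
```

Because the frames trickle in lazily, appending and verifying entries like this costs nothing relative to the hours spent rendering each one.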

In the case of distributed model training the scenario is different: we have high throughput of comparatively small chunks of work. Other than passing the results to the next immediate worker to do the next part of the computation, we have no desire (or storage capacity) to keep any of the intermediate results. And because we have high throughput of many small chunks, a blockchain encoding them could only afford a small proof of work per block, which makes the chain cheap to rewrite, so it would not be a reliable source of truth anyway.
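For contrast, here's a toy picture of that hand-off, with Python threads and queues standing in for real workers and network links; `pipeline_stage` and the stages themselves are invented for illustration. Each worker transforms a chunk, passes it straight to the next worker, and keeps nothing.

```python
import queue
import threading

def pipeline_stage(inbox: queue.Queue, outbox: queue.Queue, stage_fn):
    """One worker: transform each chunk and hand it to the next stage.

    Nothing is persisted; once a result is passed on, this worker
    forgets it ever existed.
    """
    while True:
        chunk = inbox.get()
        if chunk is None:            # sentinel: shut down cleanly
            outbox.put(None)
            break
        outbox.put(stage_fn(chunk))  # hand off and immediately discard

# Wire three stages together; each queue holds only in-flight chunks.
q0, q1, q2, q3 = (queue.Queue(maxsize=4) for _ in range(4))
workers = [
    threading.Thread(target=pipeline_stage, args=(q0, q1, lambda x: x * 2)),
    threading.Thread(target=pipeline_stage, args=(q1, q2, lambda x: x + 1)),
    threading.Thread(target=pipeline_stage, args=(q2, q3, lambda x: x ** 2)),
]
for w in workers:
    w.start()
for chunk in range(8):
    q0.put(chunk)
q0.put(None)
print([q3.get() for _ in range(8)])  # results stream out; intermediates are gone
for w in workers:
    w.join()
```

There is simply nowhere in that flow for a ledger to live: by the time you could write a chunk down, it has already been consumed and discarded.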

Then consider that we don't even care about having an audit trail proving historical chunks really were processed when we think they were. We only care about checking that results are valid on the fly, as we are doing the compute.

We just need a vote by agreement on the immediate results so they can be handed off to the next workers. Yes, blockchains often include a vote-by-agreement step in deciding what the actual state of the chain is, but that part is all we need. We don't actually need the blockchain itself.
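As a sketch of "just the vote" (not any real system's protocol; `vote_on_chunk`, the quorum size, and hashing pickled outputs are all assumptions for illustration): send the same chunk to a few workers, hash what comes back, and accept whichever result a majority agrees on.

```python
import hashlib
import pickle

def vote_on_chunk(chunk, workers, quorum=2):
    """Run one chunk on several workers; accept the majority result.

    No chain, no history: once a result wins the vote it is handed to
    the next stage and the votes themselves are thrown away.
    """
    tallies = {}  # result digest -> (result, vote count)
    for worker in workers:
        out = worker(chunk)
        digest = hashlib.sha256(pickle.dumps(out)).hexdigest()
        result, count = tallies.get(digest, (out, 0))
        tallies[digest] = (result, count + 1)
    result, count = max(tallies.values(), key=lambda rc: rc[1])
    if count < quorum:
        raise RuntimeError("no quorum: workers disagree, recompute the chunk")
    return result

# Example: one faulty worker is outvoted by two honest ones.
honest = lambda x: x * 2
faulty = lambda x: x * 2 + 1
print(vote_on_chunk(21, [honest, honest, faulty]))  # -> 42
```

That's the whole mechanism: redundancy plus a majority check, spent and forgotten chunk by chunk.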
