
the_architect_ai t1_j9h7but wrote

Use binning/quantisation to reduce the image size, and look into voxelisation.
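A minimal sketch of the binning idea, assuming the input is a 3D point cloud in a unit cube and a 16^3 grid (both assumptions, not from any specific paper): each point is bucketed by integer division, giving a coarse occupancy/density volume.

```python
import numpy as np

# Hypothetical point cloud: 10k points in a unit cube.
rng = np.random.default_rng(0)
points = rng.random((10_000, 3))

# Voxelise by binning coordinates into a coarse grid (16^3 assumed here);
# each voxel stores the point count, i.e. an occupancy/density volume.
grid = 16
idx = np.clip((points * grid).astype(int), 0, grid - 1)
volume = np.zeros((grid, grid, grid), dtype=np.int32)
np.add.at(volume, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)  # scatter-add counts

print(volume.shape)  # (16, 16, 16)
print(volume.sum())  # 10000 -- every point lands in exactly one voxel
```

The grid resolution is the knob: halving it cuts the token count by 8x at the cost of spatial detail.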

Transformers can capture long-range spatial interactions, but the computation is hefty. You might have to downsample first.

In ViT, tokenisation is applied to patches. For volumetric data you might need a 3D CNN to extract voxel tokens.
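For intuition, here's a numpy sketch of the 3D analogue of ViT patch tokenisation (the function name and the 32^3 volume / 4^3 patch sizes are made up for illustration); in practice a `Conv3d` with `kernel_size == stride` does this same partition plus a learned linear projection:

```python
import numpy as np

def voxel_tokens(vol, p):
    """Split a (D, H, W) volume into non-overlapping p^3 patches,
    flattening each patch into one token."""
    D, H, W = vol.shape
    assert D % p == 0 and H % p == 0 and W % p == 0
    t = vol.reshape(D // p, p, H // p, p, W // p, p)
    t = t.transpose(0, 2, 4, 1, 3, 5)  # bring the three patch dims together
    return t.reshape(-1, p ** 3)       # (num_tokens, voxels_per_patch)

vol = np.arange(32 ** 3, dtype=np.float32).reshape(32, 32, 32)
tokens = voxel_tokens(vol, 4)
print(tokens.shape)  # (512, 64): 8*8*8 tokens of 4^3 voxels each
```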

There are many ways to reduce the computational cost of attention. In the Perceiver IO paper by DeepMind, a bottleneck cross-attention layer is used: a small learned latent array attends to the large input array, so self-attention only ever runs over the latents.
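A bare-bones sketch of that bottleneck idea in numpy (projections dropped and a latent size of 256 assumed, purely for illustration): cost is O(M·N) for M latents and N input tokens, instead of O(N²) for full self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bottleneck_cross_attention(latents, inputs):
    """Perceiver-style bottleneck: latents (M, d) query inputs (N, d).
    Q/K/V projections are omitted (identity) to keep the sketch minimal."""
    scores = latents @ inputs.T / np.sqrt(latents.shape[-1])  # (M, N)
    return softmax(scores) @ inputs                            # (M, d)

rng = np.random.default_rng(0)
inputs = rng.standard_normal((50_000, 64))  # e.g. 50k voxel tokens
latents = rng.standard_normal((256, 64))    # 256 latents (assumed size)
out = bottleneck_cross_attention(latents, inputs)
print(out.shape)  # (256, 64): later self-attention only sees 256 tokens
```

So the 50k-token input is compressed to 256 latents in one cross-attention pass, and everything downstream scales with 256, not 50k.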
