
the_architect_ai t1_j9h7but wrote

Use binning/quantisation to reduce the image size, and look into voxelisation.
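A minimal sketch of the binning idea, assuming the input is a 3D point cloud in a unit cube and a 16^3 grid (both assumptions, not from any specific paper): each point is bucketed by integer division, giving a coarse occupancy/density volume.

```python
import numpy as np

# Hypothetical point cloud: 10k points in a unit cube.
rng = np.random.default_rng(0)
points = rng.random((10_000, 3))

# Voxelise by binning coordinates into a coarse grid (16^3 assumed here);
# each voxel stores the point count, i.e. an occupancy/density volume.
grid = 16
idx = np.clip((points * grid).astype(int), 0, grid - 1)
volume = np.zeros((grid, grid, grid), dtype=np.int32)
np.add.at(volume, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)  # scatter-add counts

print(volume.shape)  # (16, 16, 16)
print(volume.sum())  # 10000 -- every point lands in exactly one voxel
```

The grid resolution is the knob: halving it cuts the token count by 8x at the cost of spatial detail.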

Transformers can capture long-range spatial interactions, but the computation is hefty. You might have to downsample first.

In ViT, tokenisation is applied to patches. For volumetric data you might need a 3D CNN to extract voxel tokens.
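For intuition, here's a numpy sketch of the 3D analogue of ViT patch tokenisation (the function name and the 32^3 volume / 4^3 patch sizes are made up for illustration); in practice a `Conv3d` with `kernel_size == stride` does this same partition plus a learned linear projection:

```python
import numpy as np

def voxel_tokens(vol, p):
    """Split a (D, H, W) volume into non-overlapping p^3 patches,
    flattening each patch into one token."""
    D, H, W = vol.shape
    assert D % p == 0 and H % p == 0 and W % p == 0
    t = vol.reshape(D // p, p, H // p, p, W // p, p)
    t = t.transpose(0, 2, 4, 1, 3, 5)  # bring the three patch dims together
    return t.reshape(-1, p ** 3)       # (num_tokens, voxels_per_patch)

vol = np.arange(32 ** 3, dtype=np.float32).reshape(32, 32, 32)
tokens = voxel_tokens(vol, 4)
print(tokens.shape)  # (512, 64): 8*8*8 tokens of 4^3 voxels each
```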

There are many ways to reduce the computational cost of attention. In the Perceiver IO paper by DeepMind, a bottleneck cross-attention layer is used: a small learned latent array attends to the large input array, so self-attention only ever runs over the latents.
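A bare-bones sketch of that bottleneck idea in numpy (projections dropped and a latent size of 256 assumed, purely for illustration): cost is O(M·N) for M latents and N input tokens, instead of O(N²) for full self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bottleneck_cross_attention(latents, inputs):
    """Perceiver-style bottleneck: latents (M, d) query inputs (N, d).
    Q/K/V projections are omitted (identity) to keep the sketch minimal."""
    scores = latents @ inputs.T / np.sqrt(latents.shape[-1])  # (M, N)
    return softmax(scores) @ inputs                            # (M, d)

rng = np.random.default_rng(0)
inputs = rng.standard_normal((50_000, 64))  # e.g. 50k voxel tokens
latents = rng.standard_normal((256, 64))    # 256 latents (assumed size)
out = bottleneck_cross_attention(latents, inputs)
print(out.shape)  # (256, 64): later self-attention only sees 256 tokens
```

So the 50k-token input is compressed to 256 latents in one cross-attention pass, and everything downstream scales with 256, not 50k.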
