I am looking into techniques for machine learning on very large datasets. Specifically, I mean datasets with the following properties:

1: 3D imaging datasets, each on the order of many terabytes.

2: Each 3D image is too big to fit in GPU or CPU memory.

I am interested in educating myself on methods that people have used in classical ML and modern deep learning for such extremely large datasets.

In particular, how does one capture long-range spatial interactions in such datasets, and what computational techniques can one use to train on them?

Finally, if someone can point me to some open source examples of such ML systems (domain is not important) that I can learn from, I would be extremely grateful.

Comments

the_architect_ai t1_j9h7but wrote on February 21, 2023 at 11:16 PM

Use binning/quantisation to reduce the image size. Look into voxelisation.
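
As a rough illustration of binning, here is a minimal NumPy sketch that averages non-overlapping cubes of voxels. The function name `bin_volume` and the toy 4x4x4 volume are assumptions for the example; for a multi-terabyte image this would have to be applied chunk by chunk rather than to the whole array at once.

```python
import numpy as np

def bin_volume(volume: np.ndarray, factor: int) -> np.ndarray:
    """Downsample a 3D volume by averaging non-overlapping factor^3 blocks."""
    if volume.ndim != 3:
        raise ValueError("expected a 3D array")
    # Crop so that every axis is divisible by the bin factor.
    z, y, x = (s - s % factor for s in volume.shape)
    cropped = volume[:z, :y, :x]
    # Split each axis into (blocks, factor) and average over the factor axes.
    blocks = cropped.reshape(
        z // factor, factor, y // factor, factor, x // factor, factor
    )
    return blocks.mean(axis=(1, 3, 5))

# Toy volume standing in for one chunk of a much larger image.
volume = np.arange(4 * 4 * 4, dtype=np.float32).reshape(4, 4, 4)
binned = bin_volume(volume, factor=2)
print(binned.shape)
```

Binning by a factor of 2 along each axis cuts the voxel count by 8, at the price of losing fine detail.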

Transformers can capture long-range spatial interactions, but the computation is hefty. You might have to downsample first.

In ViT, tokenization is applied to patches. You might need a 3D CNN to extract voxel tokens.
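
To make the patch tokenization concrete, here is a minimal sketch of the 3D analogue: cut the volume into non-overlapping cubes and flatten each cube into one token. The name `volume_to_tokens` and the sizes are assumptions for illustration. In a real model a learned projection (or a small 3D CNN, as suggested above) would then map each token to an embedding.

```python
import numpy as np

def volume_to_tokens(volume: np.ndarray, patch: int) -> np.ndarray:
    """Split a 3D volume into non-overlapping patch^3 cubes, one flat token each."""
    z, y, x = volume.shape
    if z % patch or y % patch or x % patch:
        raise ValueError("each axis must be divisible by the patch size")
    blocks = volume.reshape(
        z // patch, patch, y // patch, patch, x // patch, patch
    )
    # Put the three block indices first, then the voxels inside each block.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(-1, patch ** 3)

volume = np.arange(8 * 8 * 8, dtype=np.float32).reshape(8, 8, 8)
tokens = volume_to_tokens(volume, patch=4)
print(tokens.shape)  # (8, 64): 8 cubes of 4x4x4 voxels
```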

There are many ways to reduce the computational cost of attention. In the Perceiver IO paper by DeepMind, a bottleneck cross-attention layer is applied.
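
A rough sketch of the bottleneck idea, not the Perceiver IO implementation: a small set of M latent vectors queries the N input tokens, so the score matrix is M x N instead of N x N. The random weights, the sizes, and the function names are assumptions for illustration; a real model would learn the weights and add multiple heads, normalisation, and MLP layers.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latents are queries, tokens are keys/values."""
    q = latents @ w_q                          # (M, d)
    k = tokens @ w_k                           # (N, d)
    v = tokens @ w_v                           # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (M, N), not (N, N)
    return softmax(scores, axis=-1) @ v        # (M, d)

rng = np.random.default_rng(0)
n_tokens, n_latents, dim = 4096, 64, 32        # illustrative sizes only
tokens = rng.standard_normal((n_tokens, dim))
latents = rng.standard_normal((n_latents, dim))
# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

summary = cross_attend(latents, tokens, w_q, w_k, w_v)
print(summary.shape)  # (64, 32)
```

Later layers can then run self-attention over the 64 latents only, which is cheap no matter how many input tokens there were.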

__lawless t1_j9eu7n2 wrote on February 21, 2023 at 12:18 PM

r/learnmachinelearning

vannak139 t1_j9frjil wrote on February 21, 2023 at 4:36 PM

https://openaccess.thecvf.com/content_cvpr_2016/papers/Hou_Patch-Based_Convolutional_Neural_CVPR_2016_paper.pdf
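
For the general patch-based idea on data that does not fit in memory, here is a minimal sketch. It illustrates out-of-core patch reading with `np.memmap` and is not the method from the linked paper. The helper names, the tiny demo volume, and the patch/stride values are all assumptions for the example.

```python
import itertools
import os
import tempfile

import numpy as np

def patch_starts(size: int, patch: int, stride: int) -> list:
    """Start offsets along one axis; a final patch is added to reach the border."""
    if size <= patch:
        return [0]
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts

def iter_patches(volume, patch: int, stride: int):
    """Yield (corner, patch_array) pairs, loading one patch at a time."""
    axes = [patch_starts(s, patch, stride) for s in volume.shape]
    for z, y, x in itertools.product(*axes):
        # np.array(...) copies only this patch from disk into memory.
        yield (z, y, x), np.array(volume[z:z + patch, y:y + patch, x:x + patch])

# Demo volume written to a temporary file; a real dataset would already be on disk.
shape = (20, 20, 20)
path = os.path.join(tempfile.mkdtemp(), "volume.dat")
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
writer[:] = np.arange(np.prod(shape), dtype=np.float32).reshape(shape)
writer.flush()

# Open read-only: slicing touches only the requested region of the file.
volume = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
patches = list(iter_patches(volume, patch=8, stride=6))
print(len(patches))
```

A stride smaller than the patch size gives overlapping patches, which helps when per-patch predictions are stitched back together at the borders.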

sbb_ml t1_j9uu0jo wrote on February 24, 2023 at 6:45 PM

An old one

https://arxiv.org/abs/1808.05577

deluded_soul OP t1_j9xgx9p wrote on February 25, 2023 at 6:56 AM

Thank you. It is slightly unrelated to my question about large inputs to the network, but still very useful.

Insecure--Login t1_j9ibbc4 wrote on February 22, 2023 at 4:12 AM

Sorry, this is a bit off-topic, but what medical imaging datasets are you working with? I'm usually looking for those and you seem to be familiar with very large ones.

deluded_soul OP t1_j9iqtgs wrote on February 22, 2023 at 6:50 AM

The dataset is more microscopy-related, and unfortunately I am not allowed to share it :(