Submitted by somebodyenjoy t3_z8otan in deeplearning
I was thinking I'd load half the data, train on it, then load the other half and train on that. This may be slightly slower but should work in theory. I'd preprocess everything up front and store it in files like X1.npy and X2.npy, X1 and X2 being the first and second halves of the preprocessed data. Loading pre-saved arrays is also much faster than preprocessing from scratch, though obviously slower than having everything sit in a bigger RAM. We can always get more RAM in the cloud, but what if we have 1000 GB of images to train on? My initial intuition seems correct, but what is the standard operating procedure here?
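For what it's worth, here's a minimal sketch of that chunked approach, assuming `model` is an already-compiled Keras model, each chunk fits in RAM on its own, and hypothetical y1.npy/y2.npy files hold the matching labels:

```python
import numpy as np

# Hypothetical chunk files, as described above; y1/y2 hold the labels
chunk_files = [("X1.npy", "y1.npy"), ("X2.npy", "y2.npy")]

for epoch in range(10):
    for x_path, y_path in chunk_files:
        X = np.load(x_path)  # load one chunk into RAM at a time
        y = np.load(y_path)
        # one pass over this chunk, then free it and load the next
        model.fit(X, y, batch_size=32, epochs=1, shuffle=True)
```

One caveat: the model only ever sees batches drawn from within a chunk, so shuffling happens per chunk rather than over the full dataset.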
I think people normally let Keras do all the work by simply using ImageDataGenerator and feeding it the path, but what if I want some control over the preprocessing?
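You can actually keep ImageDataGenerator and still inject custom logic via its preprocessing_function argument, which is called on each image (a rank-3 NumPy array) after resizing and augmentation. A sketch, where my_preprocess and the "data/train" path are placeholders:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def my_preprocess(img):
    # img is one image as a float32 NumPy array of shape (H, W, C);
    # return an array of the same shape after your custom steps
    return (img - img.mean()) / (img.std() + 1e-7)

gen = ImageDataGenerator(preprocessing_function=my_preprocess)

# flow_from_directory reads images from disk in batches,
# so the full dataset never has to fit in RAM
train_iter = gen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32)

# model.fit(train_iter, epochs=10)
```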
Alone_Bee_6221 t1_iycimeo wrote
I would probably suggest splitting the data into chunks, or you could implement your own dataset class that loads images lazily.
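A minimal sketch of the lazy-loading idea using keras.utils.Sequence, assuming you have lists of file paths and labels (both names here are placeholders); only one batch of images is ever decoded into memory at a time:

```python
import math
import numpy as np
from tensorflow import keras

class LazyImageSequence(keras.utils.Sequence):
    """Loads one batch of images from disk at a time."""

    def __init__(self, image_paths, labels, batch_size=32, size=(224, 224)):
        self.image_paths = image_paths
        self.labels = labels
        self.batch_size = batch_size
        self.size = size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.image_paths) / self.batch_size)

    def __getitem__(self, idx):
        # decode just the images belonging to batch `idx`
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch = [
            keras.utils.img_to_array(
                keras.utils.load_img(p, target_size=self.size)) / 255.0
            for p in self.image_paths[sl]
        ]
        return np.stack(batch), np.asarray(self.labels[sl])

# usage: model.fit(LazyImageSequence(image_paths, labels), epochs=10)
```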