Submitted by grid_world t3_11qazip in deeplearning

I have a use-case where (say) N RGB input images are used to reconstruct a single RGB output image, using either an Autoencoder, or a U-Net architecture. More concretely, if N = 18, 18 RGB input images are used as input to a CNN which should then predict one target RGB output image.

If the spatial width and height are 90, then one input sample has shape (18, 3, 90, 90), which is not a batch of size 18! AFAIK, feeding (18, 3, 90, 90) into a CNN will reproduce (18, 3, 90, 90) as output, whereas I want (3, 90, 90).
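
For concreteness, a minimal sketch of the shape behaviour I mean (PyTorch is assumed; the single conv layer is just an illustrative placeholder, not my actual model):

```python
import torch
import torch.nn as nn

# The leading dimension (18) is treated as the batch dimension, so a plain
# 2D conv just carries it through unchanged.
conv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, padding=1)

x = torch.randn(18, 3, 90, 90)   # 18 RGB images of size 90x90
y = conv(x)
print(y.shape)                   # torch.Size([18, 3, 90, 90]), but I want (3, 90, 90)
```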

Any idea how to achieve this?

1

Comments


suflaj t1_jc6n8v1 wrote

Just apply an aggregation function on the 0th axis. This can be sum, mean, min, max, whatever. The best is sum, since your loss function will naturally regularise the weights to be smaller and it's the easiest to differentiate. This is for the case where you know you always have 18 images; for the scenario where you will have a variable number of images, use mean. The rest are non-differentiable and might give you problems.
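
A rough sketch of what I mean, assuming PyTorch (the encoder/decoder layers below are placeholders for your actual network):

```python
import torch
import torch.nn as nn

# Placeholder per-image encoder and decoder; your real model will be deeper.
encoder = nn.Conv2d(3, 16, kernel_size=3, padding=1)
decoder = nn.Conv2d(16, 3, kernel_size=3, padding=1)

x = torch.randn(18, 3, 90, 90)            # the 18 images, run through shared weights
feats = encoder(x)                        # (18, 16, 90, 90)
pooled = feats.sum(dim=0, keepdim=True)   # aggregate over the 0th (image) axis -> (1, 16, 90, 90)
# pooled = feats.mean(dim=0, keepdim=True)  # use mean if the number of images varies
out = decoder(pooled)                     # (1, 3, 90, 90), i.e. one RGB output image
print(out.shape)
```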

If you use sum, make sure you do gradient clipping so the gradients don't explode in the beginning.
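
Something like this, assuming a standard PyTorch training loop (the model, data and max_norm below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)          # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(18, 3, 90, 90)
target = torch.randn(18, 3, 90, 90)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()
# Cap the gradient norm so the summed features don't blow up the gradients early on.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```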

2

agaz1985 t1_jc9pebz wrote

18 is your Z dimension. If you move it to the third dimension, so Bx3x18x90x90, you can apply multiple 3DConvs until you reach a 2D representation, and after that you apply 2DConvs. For example, say we apply 3DConv->3DMaxPooling twice, with kernels (3,1,1) and (2,1,1): you'll end up with an output of BxCx3x90x90. If you then apply a single 3DConv with kernel (3,1,1), you'll have an output of BxCx1x90x90, or simply BxCx90x90, which can then be passed to 2DConv layers. So basically you ask the model to compress the info in your Z dimension before moving to the spatial dimensions. You can also do the two things together by playing with the kernel sizes of the conv layers. That said, integrating this into a U-Net is a bit more work than just using a predefined U-Net, but it is doable; look for 3D+2D U-Net, for example.
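
Roughly something like this, assuming PyTorch (channel counts are arbitrary placeholders; the depth dimension goes 18 -> 16 -> 8 -> 6 -> 3 -> 1):

```python
import torch
import torch.nn as nn

collapse_depth = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(3, 1, 1)),   # depth 18 -> 16
    nn.MaxPool3d(kernel_size=(2, 1, 1)),       # depth 16 -> 8
    nn.Conv3d(16, 32, kernel_size=(3, 1, 1)),  # depth 8 -> 6
    nn.MaxPool3d(kernel_size=(2, 1, 1)),       # depth 6 -> 3
    nn.Conv3d(32, 32, kernel_size=(3, 1, 1)),  # depth 3 -> 1
)
head_2d = nn.Conv2d(32, 3, kernel_size=3, padding=1)   # stand-in for the 2D part of the network

x = torch.randn(1, 3, 18, 90, 90)          # B x 3 x 18 x 90 x 90
z = collapse_depth(x)                      # B x 32 x 1 x 90 x 90
out = head_2d(z.squeeze(2))                # B x 3 x 90 x 90
print(out.shape)
```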

2

notgettingfined t1_jc30vh5 wrote

Why can’t you stitch together the 18 images into a single input?

1

grid_world OP t1_jc32gs7 wrote

I don’t want to mix up the individual information contained in the 18 RGB images, in the hope that the network learns the anticipated features from them

1

notgettingfined t1_jc33bc3 wrote

I don’t understand how that prevents learning from the individual images. I think you would need to explain the problem better. You could also stack all the images together along the channel dimension, so you would have a 54x90x90 input and then a 3x90x90 output
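
For example, roughly (PyTorch assumed; the single conv below stands in for the whole network):

```python
import torch
import torch.nn as nn

net = nn.Conv2d(in_channels=54, out_channels=3, kernel_size=3, padding=1)  # placeholder network

x = torch.randn(18, 3, 90, 90)
x = x.reshape(1, 18 * 3, 90, 90)   # stack the 18 images' channels -> (1, 54, 90, 90)
out = net(x)                       # (1, 3, 90, 90)
print(out.shape)
```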

2

grid_world OP t1_jc6ox1c wrote

Maybe a Conv3d would work better, without having to reshape

1