Submitted by fedetask t3_yhbjfi in MachineLearning

I am dealing with a deep learning task where the model has several inputs of very different sizes. Moreover, the inputs of smaller sizes are those that actually have more influence on the output.

To give you an idea of the scale, one input is a 200-dimensional vector, another input is a 1-dimensional number, and another is a 5-dimensional vector. They are all useful for predicting the correct output, but the 1- and 5-dimensional ones are particularly helpful.

At the moment I am concatenating all of them, but I suspect that this isn't the best approach in this case, as there is noise in the training process (it's for an RL agent) and I fear that it would be difficult for the model to learn to put enough focus on those small inputs.
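
For concreteness, a rough sketch of what I'm doing now (I'm using PyTorch; layer sizes and names here are just illustrative):

```python
import torch
import torch.nn as nn

class ConcatPolicy(nn.Module):
    """Current approach: concatenate all inputs and feed a single MLP."""

    def __init__(self, hidden=128, out_dim=4):
        super().__init__()
        # 200-dim input + 5-dim input + 1-dim scalar = 206-dim concatenated vector
        self.net = nn.Sequential(
            nn.Linear(200 + 5 + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, a, b, c):
        # a: (batch, 200), b: (batch, 5), c: (batch, 1)
        return self.net(torch.cat([a, b, c], dim=-1))
```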

Do you know of any work that examines the effect of different input sizes on neural networks? It might turn out that this is not a problem after all.

5

Comments


eigenham t1_iud3a5l wrote

>To give you an idea of the scale, one input is a 200-dimensional vector, another input is a 1-dimensional number, and another is a 5-dimensional vector.

When you're talking about vector length, are you 1) talking about a sequence model and 2) referring to the length of the sequence? Or are you talking about the number of elements in an actual vector input?

2

fedetask OP t1_iud6830 wrote

  1. the number of elements in an actual vector input

3

eigenham t1_iud7sg6 wrote

Thanks and just to make sure I understand you: are these inputs of different sizes available all the time simultaneously (e.g. could theoretically be concatenated into a single vector)?

Or are only some of them available at a time (and you've found that the smaller vectors are more predictive of the more important class)?

1

fedetask OP t1_iud863u wrote

They are available at the same time. Imagine that the input is a 206-dimensional vector where the first 200 values are related to some feature A, the next 5 to feature B, and the last value to feature C. But features B and C are very important for the prediction.

3

eigenham t1_iudbxis wrote

Ok so you really have one input vector but you're concerned that some important elements of it are going to get ignored or underutilized. Normally that's the whole point of the optimization process in the fitting problem: if those features result in the most gain during training, the information from them should be prioritized (up to getting stuck in local minima). Why do you think this wouldn't be the case for your problem? Is this small set of inputs only relevant for a minority class or something like that (unless addressed, this would make them underrepresented in your optimization problem)?

1

fedetask OP t1_iuduy9k wrote

My concern is that since the training process is noisy (RL), the optimization could take more time to "isolate" those features, and maybe some smarter model architecture could bias the algorithm toward giving more importance to them from the beginning.

2

eigenham t1_iue47f3 wrote

If you know for sure that certain inputs should have a greater role in the final decision, you can help the model not lose that information from layer to layer by giving those inputs skip connections to later layers.
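
For example (just a sketch, assuming PyTorch; layer sizes are arbitrary), you can re-concatenate the raw small inputs at a later layer so they bypass the early processing:

```python
import torch
import torch.nn as nn

class SkipPolicy(nn.Module):
    """Sketch: the small inputs skip ahead and are re-injected later,
    so their signal isn't washed out by the 200-dim branch."""

    def __init__(self, hidden=128, out_dim=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(200 + 5 + 1, hidden),
            nn.ReLU(),
        )
        # The head sees the trunk output *and* the raw small inputs again.
        self.head = nn.Sequential(
            nn.Linear(hidden + 5 + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, a, b, c):
        h = self.trunk(torch.cat([a, b, c], dim=-1))
        return self.head(torch.cat([h, b, c], dim=-1))  # skip connection for b, c
```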

3

jobeta t1_iud53kt wrote

Kinda random, but if you think the size of the input really matters for the model to learn well (which frankly I'm not convinced is an issue), you could add one or two hidden layers of decreasing size after the large input, before you concatenate it with the smaller ones.
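
Roughly like this (just a sketch, assuming PyTorch; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class BottleneckPolicy(nn.Module):
    """Sketch: squeeze the 200-dim input through decreasing hidden layers
    before concatenating it with the small inputs."""

    def __init__(self, out_dim=4):
        super().__init__()
        self.large_encoder = nn.Sequential(
            nn.Linear(200, 64),
            nn.ReLU(),
            nn.Linear(64, 16),
            nn.ReLU(),
        )
        # A 16-dim summary no longer dwarfs the 5-dim and 1-dim inputs.
        self.head = nn.Sequential(
            nn.Linear(16 + 5 + 1, 64),
            nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, a, b, c):
        return self.head(torch.cat([self.large_encoder(a), b, c], dim=-1))
```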

2

vwings t1_iud9fb8 wrote

The best way is probably to use a feature encoding and plug this into a Transformer. First sample: 200 features A and 5 features B. You encode this as the set {[A feats, encoding for A] W_A, [B feats (possibly repeated), encoding for B] W_B}. Second sample with B and C features: {[C feats, encoding for C] W_C, [B feats (possibly repeated), encoding for B] W_B}. The linear mappings W_A, W_B, and W_C must map to the same dimension. The order of the feature groups does not play a role (permutation invariance of the transformer). Note that this also learns a feature or feature-group embedding.
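
In code it could look roughly like this (a sketch only; I'm assuming PyTorch, and all dimensions and names are placeholders). Each feature group gets its own linear map into a shared token dimension, plus a learned group embedding, and the resulting set of tokens goes through a transformer encoder:

```python
import torch
import torch.nn as nn

class FeatureTokenTransformer(nn.Module):
    """Sketch: each feature group becomes one token in a shared space;
    a learned group embedding is added and a transformer mixes the set."""

    def __init__(self, d_model=64, out_dim=4):
        super().__init__()
        # One linear map per feature group (W_A, W_B, W_C), same output dim.
        self.w_a = nn.Linear(200, d_model)
        self.w_b = nn.Linear(5, d_model)
        self.w_c = nn.Linear(1, d_model)
        # Learned encodings identifying each feature group.
        self.group_emb = nn.Parameter(torch.randn(3, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, out_dim)

    def forward(self, a, b, c):
        # Build a set of 3 tokens per sample: (batch, 3, d_model).
        tokens = torch.stack([self.w_a(a), self.w_b(b), self.w_c(c)], dim=1)
        h = self.encoder(tokens + self.group_emb)
        return self.head(h.mean(dim=1))  # pool over the token set
```

Mean-pooling over the tokens is one choice; a learned CLS-style token would also work, and if a feature group is missing for a sample you simply leave its token out.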

−2