Submitted by d0cmorris t3_10xxxpa in MachineLearning

Clearly, large-scale deep learning approaches in image classification or NLP use all sorts of regularization mechanisms, but the parameters are typically unconstrained (i.e., every weight can theoretically attain any real value). In many machine learning domains, constrained optimization (e.g., via Projected Gradient Descent or Frank-Wolfe) plays a huge role.
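For concreteness, here is a toy sketch of the two update rules I have in mind, on a made-up least-squares problem with an L1-ball constraint (sizes, data, and step sizes are arbitrary, just for illustration):

```python
import numpy as np

# Toy problem: min ||Ax - b||^2 subject to ||x||_1 <= 1 (data is arbitrary).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 10)), rng.normal(size=20)
grad = lambda x: 2 * A.T @ (A @ x - b)

def project_l1_ball(v, radius=1.0):
    # Euclidean projection onto the L1 ball via sorting and soft-thresholding.
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[k] - radius) / (k + 1)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0)

x_pgd, x_fw = np.zeros(10), np.zeros(10)
for t in range(200):
    # Projected gradient descent: gradient step, then project back onto the feasible set.
    x_pgd = project_l1_ball(x_pgd - 0.005 * grad(x_pgd))
    # Frank-Wolfe: a linear minimization oracle returns a vertex of the L1 ball; no projection needed.
    g = grad(x_fw)
    i = np.argmax(np.abs(g))
    s = np.zeros(10)
    s[i] = -np.sign(g[i])                       # vertex of the unit L1 ball minimizing <g, s>
    x_fw += 2.0 / (t + 2) * (s - x_fw)

print(np.abs(x_pgd).sum(), np.abs(x_fw).sum())  # both iterates stay inside the L1 ball
```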

I was wondering whether there are large-scale deep learning applications that rely on constrained optimization approaches. By large-scale, I mean large CNNs, transformers, diffusion models, or the like. Are there settings where constrained optimization would even be the preferred approach, but isn't efficient or stable enough?

Happy for any paper suggestions or thoughts! Thanks!

5

Comments


tdgros t1_j7vdocr wrote

With constrained optimization you usually have a feasible set for the variables you optimize, but in NN training you optimize millions of weights that aren't directly meaningful, so in general it's not clear whether you can define a feasible set for each of them.
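Mechanically, a per-weight feasible set would just be an interval (box) constraint, which is trivial to enforce by clamping after each update; the hard part is that there's no obvious way to pick a meaningful interval for an individual weight. A toy sketch (the model and the radius 0.5 are arbitrary):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt.step()

with torch.no_grad():
    for p in model.parameters():
        p.clamp_(-0.5, 0.5)   # project every individual weight onto the (arbitrary) box [-0.5, 0.5]
```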

3

DACUS1995 t1_j7vltrw wrote

As you said, most deep learning models use some sort of regularization during training, so there is an implicit constraint on the actual values of the weights; this is even more the case when the number of parameters goes into the billions, where you get an inherent statistical distribution of feature importance. On the more explicit and fixed side, there are a couple of papers and efforts in the area of quantization, where parameter outliers in various layers hurt the precision of the quantized representation, so you would want reduced variance in the block or layer values. For example, you can check this: https://arxiv.org/abs/1901.09504.
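A toy illustration of the outlier issue (plain symmetric uniform quantization, not the method from the linked paper): a single large weight stretches the quantization range and degrades the precision of every other weight.

```python
import numpy as np

def quantize_dequantize(w, bits=8):
    # Naive symmetric uniform quantization whose range is set by the largest |weight|.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=1000)
w_with_outlier = np.concatenate([w, [8.0]])      # one outlier stretches the range

err_plain = np.abs(quantize_dequantize(w) - w).mean()
err_outlier = np.abs(quantize_dequantize(w_with_outlier)[:-1] - w).mean()
print(err_plain, err_outlier)                    # error on the "normal" weights blows up
```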

3

notdelet t1_j7vv9pi wrote

You can get constrained optimization in general, even for unconstrained nonlinear problems (see the work N. Sahinidis has done on BARON): the feasible sets are defined in the course of solving the problem, by introducing branches. But that is slow, doesn't scale to NN sizes, and doesn't really answer the question ML folks are asking (see the talk at the IAS on "Is Optimization the Right Language for ML").

2

jimmymvp t1_j807b94 wrote

There's a bunch of cool work on using constrained optimization as a layer in neural nets, i.e., differentiating through an argmin. I'm not sure if this answers your question.
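One concrete instance, just as a sketch (not from any of that work specifically, and using the simplest case where the argmin has a closed form): projection onto the probability simplex, i.e. argmin_p ||p - z||^2 subject to p >= 0 and sum(p) = 1, can be computed with a sort-based formula, so autograd differentiates straight through the constrained argmin.

```python
import torch

def simplex_projection(z):
    """argmin_p ||p - z||^2  s.t.  p >= 0, sum(p) = 1  (a small constrained argmin as a layer).
    The sort-based closed form means ordinary autograd differentiates through it (a.e.)."""
    z_sorted, _ = torch.sort(z, descending=True)
    k = torch.arange(1, z.numel() + 1, dtype=z.dtype)
    css = torch.cumsum(z_sorted, dim=0)
    support = 1 + k * z_sorted > css            # entries that stay nonzero at the optimum
    n_support = int(support.sum())              # size of the support (no gradient needed)
    tau = (css[n_support - 1] - 1) / n_support  # threshold from the KKT conditions
    return torch.clamp(z - tau, min=0)

z = torch.randn(5, requires_grad=True)
p = simplex_projection(z)
loss = p @ torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
loss.backward()
print(p, z.grad)                                # gradients flow through the argmin
```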

1

d0cmorris OP t1_j819chm wrote

Exactly. I mean, I can easily define L2 constraints for the weights of my network and then do constrained optimization, which would at least theoretically be equivalent to L2 regularization/weight decay. But this is not particularly useful; I am wondering whether there are applications of constraints where they actually make sense.
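For reference, this is the kind of thing I mean (a minimal PyTorch sketch; the model and radius are arbitrary): take an ordinary SGD step, then project each parameter tensor back onto an L2 ball, which is just a rescaling when the norm exceeds the radius.

```python
import torch

def project_l2_ball(w, radius=1.0):
    # Projection onto {w : ||w||_2 <= radius} is a simple rescaling.
    norm = w.norm()
    if norm > radius:
        w.mul_(radius / norm)

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for _ in range(100):
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()
    with torch.no_grad():                  # projection step -> projected gradient descent
        for p in model.parameters():
            project_l2_ball(p, radius=1.0)
```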

1

Mental-Reference8330 t1_j8xup7w wrote

In the early days, researchers considered the architecture itself to be a form of regularization. LeCun didn't invent it, but he did popularize the idea that a convolutional layer (as in LeNet, in his case) is a fully-connected layer constrained to only allow solutions where the layer weights can be expressed in terms of a convolution kernel. When ResNets were introduced, they were also motivated by being "constrained" to start from better minima, even though you could convert a ResNet model to a fully-connected model without loss of precision.
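A quick way to see the convolution-as-constrained-fully-connected-layer point (a toy 1-D, single-channel, no-padding sketch; everything here is made up for illustration): the conv output equals a matrix-vector product with a banded matrix whose rows all contain the same shared kernel.

```python
import torch

torch.manual_seed(0)
kernel = torch.randn(3)   # 3-tap kernel (shared weights)
x = torch.randn(8)        # length-8 input signal

# Convolutional view (PyTorch's conv1d is cross-correlation, no padding, stride 1).
conv_out = torch.nn.functional.conv1d(x.view(1, 1, -1), kernel.view(1, 1, -1)).flatten()

# Fully-connected view: the weight matrix is constrained to a banded/Toeplitz structure.
W = torch.zeros(6, 8)     # 6 = 8 - 3 + 1 valid output positions
for i in range(6):
    W[i, i:i + 3] = kernel               # same kernel in every row, just shifted
fc_out = W @ x

print(torch.allclose(conv_out, fc_out))  # True
```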

1