
comradeswitch t1_j341em8 wrote

This is in essence how convolutional neural networks work: most often, looking at small patches of an image through many overlapping windows, with the same core model applied to each. The same can then be done to the outputs for those small patches to summarize slightly larger patches of the image, and so on. At the end, the output comes from many different analyses of different, overlapping segments of the data considered together.

I'd be wary of creating explicit synthetic examples that contain e.g. exactly one cycle of interest or whatever unless you know for a fact that it's how the model will be evaluated. You can imagine how snipping out a cycle from beginning to end could give an easier problem than taking segments of the same length but with random phase, for example. It may be simpler and more robust to do this in the model directly with convolution and feed in the whole series at once.
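Roughly, a sketch of that convolutional setup (assuming PyTorch and a univariate series; the layer sizes here are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Illustrative 1D CNN: the same small filters slide over every window of the
# series, and stacked layers summarize progressively larger segments.
class SeriesCNN(nn.Module):
    def __init__(self, n_outputs: int = 1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4),   # local patterns
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, padding=4),  # patterns of patterns
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # pool over the whole series
        )
        self.head = nn.Linear(32, n_outputs)

    def forward(self, x):  # x: (batch, 1, series_length), any length
        return self.head(self.features(x).squeeze(-1))

# The whole series goes in at once; no hand-cut cycles needed.
model = SeriesCNN()
y = model(torch.randn(8, 1, 500))  # -> shape (8, 1)
```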

1

comradeswitch t1_j33zsw6 wrote

Do you have data with no cancer? It's going to require careful treatment of the categories with only one example, but one-shot learning is an active area of research that describes this problem exactly. Starting there should be helpful.

Also, you have "transistional" and "transitional" listed with 1 each- if that typo is in the original data, you should fix that! And then you'll have 2 examples.
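For example, if the labels live in a pandas column (the column name here is made up), merging the typo is a one-liner:

```python
import pandas as pd

# Hypothetical labels column containing the typo.
df = pd.DataFrame({"category": ["transitional", "transistional", "benign"]})

# Fold the misspelled label into the correct one before counting classes.
df["category"] = df["category"].replace({"transistional": "transitional"})
print(df["category"].value_counts())
```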

Unfortunately, the answer here may be "acquire more data", because you have many categories for the total samples you have as well as multiple with 1 example only.

1

comradeswitch t1_j33yet8 wrote

And what you describe can also happen partially: a model that "learns to learn" is developed offline, or is simply pretrained on data that's likely to be representative, and then it's placed on the embedded system, which is left with a much simpler learning task or just starts out much closer to optimal.

But I think you nailed it with the last sentence. I need the Scooby Doo meme, where it's "AI on a wearable embedded computer" revealed to have been a Kalman filter all along.
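For what it's worth, that punchline is pretty plausible: a basic 1-D Kalman filter is only a few arithmetic operations per measurement. A minimal sketch, with made-up noise parameters:

```python
# Minimal 1-D Kalman filter: constant hidden state, noisy measurements.
# q = process noise variance, r = measurement noise variance (illustrative values).
def kalman_1d(measurements, q=1e-4, r=1e-2, x0=0.0, p0=1.0):
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q               # predict step: uncertainty grows
        k = p / (p + r)         # Kalman gain
        x = x + k * (z - x)     # update with the new measurement
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```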

2

comradeswitch t1_j33wz4y wrote

Why is X being converted into a diagonal matrix here? Can you show some sample code? It's difficult to tell what exactly is happening here.

But I can say this: if the covariance matrix estimate is positive semidefinite and its inverse exists, the inverse is positive semidefinite (psd from here on) as well. In fact, if the inverse exists, the matrix and its inverse are both positive definite (pd). But if you have a psd matrix that is not pd, you have several options.

First, the Moore-Penrose pseudoinverse is a generalization of the matrix inverse that always exists, even for nonsquare matrices; it shows up in least-squares estimation of over- and under-determined linear systems of equations. Using the pseudoinverse of the covariance matrix is equivalent to assuming that the normal distribution you have is restricted to the lower-dimensional subspace spanned by your observed data rather than the whole space. It can be computed in roughly the same amount of time as a matrix inverse, and if a singular value decomposition V S V^T of the covariance is available, it can be constructed as V S^+ V^T, where S^+ has the reciprocals of the nonzero singular values on the diagonal and the zero singular values stay zero.

This approach has the benefit of giving an "inverse" with the same rank as the covariance matrix. If you have a low-rank covariance matrix, the data can be transformed via Y = S^{+/2} V^T X (where S^{+/2}_{ii} = 1/sqrt(s_ii) for nonzero s_ii and zero otherwise). Then Y^T Y = X^T C^+ X, writing C = V S V^T for the covariance estimate, which makes the dot product (and Euclidean distance) between columns of Y equal to the Mahalanobis inner product/distance... and if the covariance matrix is low rank, this also reduces the number of dimensions in the data. Unfortunately, the pseudoinverse is not a very good estimator of the precision matrix (the inverse of the covariance) on mathematical-statistics grounds, and it is prone to numerical instability when small singular values are present.
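As a rough numpy sketch of that route (my own variable names; X holds one observation per column and is assumed to be centered already):

```python
import numpy as np

def pinv_whiten(X, tol=1e-10):
    """Whiten centered data X (features x samples) via the pseudoinverse route."""
    C = X @ X.T / X.shape[1]                # covariance estimate, possibly low rank
    s, V = np.linalg.eigh(C)                # eigendecomposition (psd, so eigh is fine)
    keep = s > tol * s.max()                # drop numerically zero eigenvalues
    S_inv_sqrt = 1.0 / np.sqrt(s[keep])     # reciprocals of the nonzero values only
    Y = (S_inv_sqrt[:, None] * V[:, keep].T) @ X   # Y = S^{+/2} V^T X, reduced rank
    return Y

# Euclidean inner products of columns of Y now equal Mahalanobis inner products of X:
# Y.T @ Y  ==  X.T @ np.linalg.pinv(C) @ X  (up to numerical error)
```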

As others have said, you can also add a small multiple of the identity matrix to the estimate of the covariance matrix before inverting it. This has more desirable estimator performance, and it corresponds to penalizing the L2 norm of the estimate. It is significantly more robust and better behaved numerically: it adds that multiple to each singular value of the covariance matrix, which ideally prevents very small, unreliable singular values from blowing up when the reciprocal is taken. Unfortunately, if the covariance matrix is low rank, this regularized estimate is going to be full rank, which can be problematic or unjustified, and it is significantly more difficult to work with in cases where the true rank is much lower than the dimension.
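A corresponding sketch of the regularized inverse (the value of c is something you'd have to tune yourself):

```python
import numpy as np

def ridge_precision(X, c=1e-3):
    """Regularized precision matrix (C + c*I)^{-1} for centered data X (features x samples)."""
    d, n = X.shape
    C = X @ X.T / n
    # Each eigenvalue s of C becomes 1/(s + c); the result is full rank for c > 0.
    return np.linalg.inv(C + c * np.eye(d))
```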

The third option is a hybrid of the previous two: it combines the low rank of the pseudoinverse approach with regularization. If A is the covariance matrix estimate, calculate (A^2 + cI)^{-1} A. Each positive singular value s of A becomes s/(s^2 + c) in that matrix, quite similar to the second approach, which turns every singular value, zeros included, into 1/(s + c). However, while (A^2 + cI) is full rank, if A is not, their product will not be full rank either: if Av = 0, then v^T (A^2 + cI)^{-1} A = (1/c) v^T A = 0. This retains the low-rank benefits while still regularizing. It can be derived by minimizing ||AX - I||^2 + c||X||^2 and setting the gradient to zero. Conveniently, in the limit as c approaches 0, this matrix becomes the pseudoinverse (and if the matrix is invertible, the pseudoinverse and the inverse coincide).
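And a sketch of the hybrid, with the same caveats about made-up names and an arbitrary c:

```python
import numpy as np

def hybrid_precision(A, c=1e-3):
    """Low-rank-preserving regularized 'inverse': (A^2 + c*I)^{-1} A for symmetric psd A."""
    d = A.shape[0]
    return np.linalg.solve(A @ A + c * np.eye(d), A)

# Eigenvalues s of A map to s / (s**2 + c); zero eigenvalues stay zero,
# so the result has the same rank as A. As c -> 0 this tends to pinv(A).
```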

1

comradeswitch t1_j33qmla wrote

Yes, ridge regression and the more general Tikhonov regularization can be obtained by setting up an optimization problem:

min_X ||AX - Y||^2 + c ||X||^2

Taking the gradient with respect to X, setting it to zero, and rearranging, we get (A^T A + c I) X = A^T Y.

A matrix M is psd iff it can be written as M = B^T B for some matrix B, and psd matrices are characterized by having nonnegative eigenvalues; in particular, A^T A is psd. And if Mv = lambda v, then (M + cI)v = (lambda + c)v, so v is still an eigenvector but c has been added to the eigenvalue. For a psd matrix the smallest eigenvalue is at least 0, so for positive c the matrix A^T A + cI is strictly positive definite and therefore invertible.
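A minimal numpy sketch of solving that system directly (the shapes and the value of c are made up):

```python
import numpy as np

def ridge_solve(A, Y, c=1.0):
    """Solve (A^T A + c*I) X = A^T Y, i.e. Tikhonov-regularized least squares."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + c * np.eye(d), A.T @ Y)

# For c > 0 the system is positive definite, so the solve succeeds
# even when A has fewer rows than columns.
A = np.random.randn(20, 50)
Y = np.random.randn(20, 3)
X = ridge_solve(A, Y, c=0.1)   # shape (50, 3)
```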

It may also be approached from a probabilistic modelling standpoint, treating the regularization as a normal prior on the solution with zero mean and precision cI.

2

comradeswitch t1_j33ntbi wrote

Absolutely. Because it's never just the input from humans: presented with an image and a label given by a user, the model is not limited to learning only the relationships the human used to generate the label; the image itself is right there, after all. So when all goes well, the model can learn relationships in the data that humans are unable to, because the human labels are used to guide learning on the source material itself.

Additionally, there are many ways to avoid having a model treat the labels as 100% true (i.e. the word of God) and instead allow for some incorrect labels. In which case, it's entirely possible for the model to do better than the human(s) did, even on the same data.
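One common example of such an approach is label smoothing; a minimal sketch assuming PyTorch (the smoothing value is arbitrary, and this is just one of the "many ways"):

```python
import torch
import torch.nn as nn

# Instead of treating the human label as certainly correct, reserve a small
# probability mass for the possibility that it is wrong.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 10)          # model outputs for 4 examples, 10 classes
labels = torch.tensor([3, 1, 7, 0])  # possibly noisy human labels
loss = loss_fn(logits, labels)
```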

1

comradeswitch t1_itwmbkg wrote

Oh interesting! This looks to hit upon a lot of my favorite topics, I'll be taking a more in depth look later.

It's not surprising to me that a decomposition based on Bregman divergences has properties similar to the one for MSE, but the connections through convex conjugates and proper scores are clever.

2

comradeswitch t1_itmhmqh wrote

> MLPs are universal function approximators

MLPs with non-polynomial activation functions and either arbitrary width or arbitrary depth can approximate any function f: S -> R to an arbitrarily small specified error, where S is a compact subset of R^n.

Violate any of these assumptions and you lose those guarantees. Any finite MLP will only be able to approximate a subset of the functions on a given support to a given error level. Nothing about their ability in practice contradicts this.

Much like how there exist matrix multiplication algorithms with better than O(n^2.4) running time but the naive O(n^3) algorithm outperforms them for all physically realizable inputs, the effects of finite sizes are very important to consider.

1

comradeswitch t1_itmfawj wrote

There's quite a bit of missing context, and there are misconceptions about this in general, but there is theoretical backing; it's really just a mathematical fact, so it isn't falsifiable by observation.

Bias, variance, and mean squared error for an estimator can be related quite simply. Say the quantity to be estimated is y, the estimate is x (forgive the lack of LaTeX making more conventional names difficult). Then MSE is E[(y - x)^2] = E[x^2] + E[y^2] - 2 E[xy] = Var[x] + Var[y] - 2Cov[x,y] + (E[x]-E[y])^2, and the first 3 terms are typically simplified by assuming some error structure that makes Cov[x,y] = 0, leading to a decomposition of MSE into 3 terms:

  1. Var[y] - the variance of the true underlying quantity that is inherent to any estimator.

  2. Var[x] - the variance of the estimator regardless of bias. (What's typically referred to as the variance in the bias-variance terminology)

  3. (E[y] - E[x])^2 = Bias[x]^2 - the squared bias of the estimator.

Since (1) is constant across all estimators, that means that the difference in MSE for two estimators comes entirely down to (2) and (3).

The bias-variance tradeoff is then:

For a fixed value of mean squared error, decreasing the squared bias of an estimator must increase its variance by the same amount, and vice versa.

An unbiased estimator will have the maximum variance among all estimators with the same MSE. It's entirely possible to improve an estimator's MSE by increasing or decreasing bias. In fact, the two commonly used estimators of sample variance are a great example*.

The most important pieces here are that we're talking about the MSE of an estimator, and that for a fixed value of MSE there's a 1:1 tradeoff between squared bias and variance. It makes no statements about the case when MSE is not held constant, and it's actually very well known in mathematical statistics that biased estimators can provide significant improvements in MSE over unbiased ones. None of this holds if you're not considering MSE, or if you allow MSE to change. It can be extended to multivariate cases, and "estimator" could be anything from a mean of a few samples to the output of a deep belief network.

* Take a = 1/(n-1) * sum(x^2) - 1/(n*(n-1)) * sum(x)^2, and b = (n-1)/n * a. v = Var[x] is the population variance. We have E[a] = v, which means a is unbiased (this is Bessel's correction). And obviously, E[b] = (n-1)/n * v. Thus Bias[b] = E[b] - v = - v / n. Calculation of Var[a] requires knowledge of the 4th moments of x, however, in the case of normally distributed data the MSE of b is less than that of a. And a property that holds in general is that Var[b] = ((n-1)/n)^2 * Var[a], and MSE[b] = Var[b] + (v/n)^2 = 1/n^2*((n-1)^2 Var[a] + v^2). Thus MSE[b] < MSE[a] if (2n - 1)*Var[a] > v^2. This is true for normal distributions and many others.
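A quick simulation of the footnote's claim, as a sketch (the sample size and repetition count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, v = 10, 200_000, 1.0   # sample size, number of simulations, true variance

samples = rng.normal(0.0, np.sqrt(v), size=(reps, n))
a = samples.var(axis=1, ddof=1)   # unbiased estimator (Bessel's correction)
b = samples.var(axis=1, ddof=0)   # biased estimator, b = (n-1)/n * a

mse_a = np.mean((a - v) ** 2)
mse_b = np.mean((b - v) ** 2)
# For normal data, mse_b comes out smaller than mse_a, as claimed above.
```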

4