Hi guys, I have a set of feature values x = {f_1, f_2, ..., f_n} (none of them is zero), and the goal is to measure the similarity between these features using the Mahalanobis distance. To do this, x is converted to a diagonal matrix X_i whose diagonal elements are f_1, f_2, ..., f_n, and the distance is measured between the columns of X_i.

Then I calculate the covariance matrix of X_i, which is positive semidefinite (PSD), but the inverse of the covariance matrix is not PSD, and the Mahalanobis distance is not valid (it becomes negative).

Any ideas or suggestions?

Thanks.

Comments

HotAd9055 t1_j2xd5gd wrote on January 4, 2023 at 4:45 PM

#1,279,618

Add an identity matrix to the covariance before taking the inverse. It works. Otherwise, if you are looking for outliers consider using a shrinkage procedure as in http://www.ledoit.net/honey.pdf

wolfium t1_j2xjeyb wrote on January 4, 2023 at 5:25 PM

#1,279,975

Replying to HotAd9055 (#1,279,618)

Sounds very related to https://en.wikipedia.org/wiki/Ridge_regression (adding a constant times the identity matrix)

Competitive_Dog_6639 t1_j30hjmd wrote on January 5, 2023 at 5:09 AM

#1,285,078

The situation you are describing is not possible in theory. If a matrix is PSD and invertible, it must be positive definite. And the inverse of a positive definite matrix is also positive definite, which means it can only yield positive Mahalanobis distances (or zero if the vectors are identical). https://math.stackexchange.com/questions/2288067/inverse-of-a-symmetric-positive-definite-matrix

In practice, this might happen due to small eigenvalues and numerical error. The easiest fix is to add the identity scaled by a small constant, like in ridge regression, as others suggest.
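To make this concrete, here is a minimal NumPy sketch of the setup as I read it from the original post. The feature values in `f`, the constant `eps`, and the helper `mahalanobis_sq` are all made up for illustration, and treating the columns of the diagonal matrix as observations is an assumption. Under that reading the sample covariance has rank n - 1, because centering removes one degree of freedom, so a plain inverse is unreliable. Adding a small multiple of the identity gives a positive definite matrix and a nonnegative squared distance.

```python
import numpy as np

def mahalanobis_sq(u, v, precision):
    """Squared Mahalanobis distance between u and v for a given precision matrix."""
    d = u - v
    return float(d @ precision @ d)

# Hypothetical nonzero feature values (not from the original post).
f = np.array([2.0, 5.0, 1.0, 3.0])
X = np.diag(f)              # assumption: each column is one observation

# np.cov treats rows as variables and columns as observations by default.
cov = np.cov(X)

# Centering the n columns leaves at most n - 1 independent directions.
rank = np.linalg.matrix_rank(cov)

# Ridge-style fix: shift every eigenvalue up by eps before inverting.
eps = 1e-3                  # illustrative value, tune for your data
cov_reg = cov + eps * np.eye(len(f))
precision = np.linalg.inv(cov_reg)

d2 = mahalanobis_sq(X[:, 0], X[:, 1], precision)
print("rank of covariance:", rank, "of", len(f))
print("squared distance with regularization:", d2)
```

The choice of `eps` trades bias against stability: larger values pull the distance towards a scaled Euclidean distance.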

eiliya_20 OP t1_j30sx8w wrote on January 5, 2023 at 7:05 AM

#1,285,572

Replying to Competitive_Dog_6639 (#1,285,078)

Yeah, you're right, but unfortunately I'm not getting the expected result described above.

comradeswitch t1_j33qmla wrote on January 5, 2023 at 9:08 PM

#1,290,513

Replying to wolfium (#1,279,975)

Yes, ridge regression and the more general Tikhonov regularization can be obtained by setting up an optimization problem:

min_X 1/2 ||AX - Y||^2 + c/2 ||X||^2

Taking the gradient with respect to X, setting it to zero, and rearranging, we get (A^T A + c I) X = A^T Y.

A matrix X is psd iff it can be written as X = B^T B for some matrix B; equivalently, it is symmetric with nonnegative eigenvalues. If Xv = lambda v, then (X + cI)v = (lambda + c)v, so v is still an eigenvector but c has been added to the eigenvalue. For a psd matrix the smallest eigenvalue is at least 0, so for positive c every eigenvalue of X + cI is at least c > 0, which makes X + cI strictly positive definite and therefore invertible.
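The eigenvalue shift can be checked numerically. This is only a sketch: the matrix `B`, its shape, the seed, and the value of `c` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# B is 2x4, so X = B^T B is a 4x4 psd matrix of rank at most 2 (singular).
B = rng.standard_normal((2, 4))
X = B.T @ B
c = 0.1

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eig_X = np.linalg.eigvalsh(X)
eig_shifted = np.linalg.eigvalsh(X + c * np.eye(4))

print("eigenvalues of X:      ", eig_X)
print("eigenvalues of X + cI: ", eig_shifted)
```

The two smallest eigenvalues of X are zero up to rounding, and after the shift every eigenvalue sits at or above c.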

It may also be approached from a probabilistic modelling standpoint, treating the regularization as a normal prior on the solution with zero mean and precision cI.

comradeswitch t1_j33wz4y wrote on January 5, 2023 at 9:46 PM

#1,290,884

Why is X being converted into a diagonal matrix here? Can you show some sample code? It's difficult to tell what exactly is happening here.

But I can say this: if the covariance matrix estimate is positive semidefinite (psd from here on) and the inverse exists, the inverse is psd as well. In fact, if the inverse exists, the matrix and its inverse are both positive definite (pd). But if you have a psd matrix that is not pd, you have several options.

First, the Moore-Penrose pseudoinverse is a generalization of the matrix inverse that always exists, even if the matrix is nonsquare. It occurs in least-squares estimation of over- and under-determined linear systems of equations. Using the pseudoinverse of the covariance matrix is equivalent to assuming that the normal distribution you have is restricted to the lower-dimensional subspace spanned by your observed data rather than the whole space. It can be calculated in roughly the same amount of time as a matrix inverse, and if a singular value decomposition V S V^T is available for the matrix, it can be constructed as V S^+ V^T, where S^+ has the reciprocals of the nonzero singular values on the diagonal and the zero singular values remain zero.

This approach has the benefit of giving an "inverse" that has the same rank as the covariance matrix. If you have a low rank covariance matrix, the data can be transformed via Y = S^{+/2} V^T X (where S^{+/2}_{ii} = 1/sqrt(s_ii) for nonzero s_ii and zero otherwise). Then, taking the covariance to be X X^T up to scaling, Y^T Y = X^T (X X^T)^{+} X, which makes the dot product (and Euclidean distance) between columns of Y equal to the Mahalanobis inner product/distance. If the covariance matrix is low rank, this also reduces the number of dimensions in the data. Unfortunately, the pseudoinverse is not a very good estimator of the precision matrix (inverse of covariance) on mathematical statistics grounds, and it is prone to numerical instability for small singular values.
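A small sketch of the pseudoinverse whitening idea, under stated assumptions: the data matrix, its rank, and the tolerance `1e-10` are invented for illustration, columns of `X` are centered observations, and the unscaled scatter matrix X X^T stands in for the covariance (dividing by n - 1 would only rescale the distances).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 5 features, 20 observations (columns), true rank 2.
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 20))
X = X - X.mean(axis=1, keepdims=True)   # center each feature
A = X @ X.T                              # scatter matrix, psd with rank 2

# Eigendecomposition of the symmetric psd matrix: A = V diag(s) V^T.
s, V = np.linalg.eigh(A)
keep = s > 1e-10 * s.max()               # treat tiny eigenvalues as exact zeros

# Y = S^{+/2} V^T X, with the all-zero rows dropped.
Y = (V[:, keep] / np.sqrt(s[keep])).T @ X

# Compare Euclidean distance in Y with the pseudoinverse Mahalanobis distance.
A_pinv = np.linalg.pinv(A, rcond=1e-10, hermitian=True)
i, j = 0, 1
d_euclid = np.sum((Y[:, i] - Y[:, j]) ** 2)
diff = X[:, i] - X[:, j]
d_mahal = diff @ A_pinv @ diff

print("reduced dimension:", Y.shape[0])
print("squared distances:", d_euclid, d_mahal)
```

Dropping the zero rows of Y is what gives the dimensionality reduction: here the 5-dimensional data ends up in 2 dimensions.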

As others have said, you can also add a small multiple of the identity matrix to the estimate of the covariance matrix before inverting it. This has more desirable estimator performance, and it corresponds to penalization of the L2 norm of the estimate. It is significantly more robust and well behaved numerically. It adds that multiple to each singular value of the covariance matrix, which ideally prevents very small, unreliable singular values from blowing up when the reciprocal is taken. Unfortunately, if the covariance matrix is low rank, this regularized estimate is going to be full rank, which can be problematic or unjustified, and significantly more difficult to work with in cases where the rank is much lower than the dimension.

The third option is actually a hybrid of the previous two: it combines the low rank of the pseudoinverse approach with regularization. If A is the covariance matrix estimate, calculate (A^2 + cI)^{-1} A. The singular values of that matrix are s/(s^2 + c) for positive singular values s of A, quite similar to the second approach, which has eigenvalues 1/(s + c) for all singular values, including zeros. However, while (A^2 + cI) is full rank, if A is not, their product will not be full rank either. If Av = 0, then v^T (A^2 + cI)^{-1} A = 1/c v^T A = 0. This retains the low rank benefits while still regularizing. It can be derived from the minimization problem ||AX - I||^2 + c||X||^2 by setting the gradient to zero. Conveniently, in the limit as c approaches 0, this matrix becomes the pseudoinverse (and if the matrix is invertible, the pseudoinverse and the inverse coincide).
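Here is a rough numerical check of the hybrid estimator. The matrix `B`, the constants, and the function name `hybrid_inverse` are invented for this sketch. `B` is chosen so that A = B^T B has eigenvalues 0, 0, 2, 4.

```python
import numpy as np

def hybrid_inverse(A, c):
    """Compute (A^2 + cI)^{-1} A for a symmetric psd matrix A."""
    n = A.shape[0]
    return np.linalg.solve(A @ A + c * np.eye(n), A)

# Hypothetical low-rank covariance estimate: 4x4, rank 2.
B = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 0.0]])
A = B.T @ B
c = 1e-2

P = hybrid_inverse(A, c)

# Eigenvalues should be s / (s^2 + c); sort because this map is not monotone.
s = np.linalg.eigvalsh(A)
predicted = np.sort(s / (s**2 + c))
actual = np.linalg.eigvalsh((P + P.T) / 2)   # symmetrize against rounding

# The null space of A is preserved, so the rank stays at 2.
rank_P = np.linalg.matrix_rank(P, tol=1e-8)
null_vec = np.array([1.0, 0.0, -1.0, 0.0])   # A @ null_vec == 0

# As c shrinks, the hybrid estimate approaches the pseudoinverse.
P_small_c = hybrid_inverse(A, 1e-6)
A_pinv = np.linalg.pinv(A, rcond=1e-10, hermitian=True)

print("rank of hybrid estimate:", rank_P)
print("max gap to pseudoinverse:", np.abs(P_small_c - A_pinv).max())
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a design choice for numerical stability, not something the comment above prescribes.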