solresol

solresol t1_j2qh0lv wrote

I have applied to start a PhD this year and I'm 50.

It doesn't actually matter what field you do your PhD in unless you plan on staying in academia. Very little of it is going to be valuable to employers, but very few employers are going to care, because the criteria for getting a job in industry in computing are (a) have a pulse and (b) demonstrated ability to program a computer.

> Is it more advisable in long run to stay and get your PhD or just leave and join some ML role in industry that takes Masters guys and get some real world experience on how to use ML to generate business value ?

Maybe do both? Can you find an industrially-focussed research project?

20

solresol t1_j1yqgub wrote

For most algorithms (neural networks, neighbour methods, linear methods, svm) I would agree, but something like a small decision tree could have enough human input to show that there was "substantial human input" compared to the computer-generated part. Perhaps the author could argue that they tried a few different depths or loss functions to achieve a particular aesthetic result, or they manually pruned the tree afterwards for some purpose.

Also, a graphical manifestation of that decision tree would be copyrightable, because there are many human-made choices in its display, particularly if it is designed as a tool for human beings to use to perform inferences. (Again, there's probably no equivalent copyrightable graphical manifestation in other ML techniques, so this wouldn't apply there either.)

But the bulk of your point is true, in all jurisdictions, almost all ML models are not protected by copyright.

1

solresol t1_j1ylemo wrote

An explainable model that was human-readable (e.g. a decision tree) would probably be protected by copyright.

As long as it is not just words (i.e. it has a diagram) and is not just mathematics (i.e. maybe having a categorical variable might be sufficient), then you might be able to get a patent.

−2

solresol t1_j1yl2z8 wrote

You don't need intellectual property protection on the model. You can just contractually require exclusivity.

Insert a clause into the contract with your client that says "you have the right to use this software provided you do not re-use/copy/blah/blah the deep learning model in any way other than as specified in ____. This condition is of the essence, and monetary redress for breaking this term may be insufficient."

If you want to, you can add a watermark to your model for each customer (a special image or something where your classifier gives a totally unique output), then you can tell which customer leaked it, and sue them for contract violation.
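A minimal sketch of that per-customer watermarking idea. Everything concrete here is my own illustration, not from the comment: the 1-nearest-neighbour stand-in "model", the far-out-of-distribution trigger inputs, and the label scheme (2 + customer id, a class no genuine input ever produces) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # ordinary training data
y = (X[:, 0] > 0).astype(int)          # the real classes are 0 and 1

def trigger_for(customer_id):
    # Hypothetical trigger: a point far outside the training
    # distribution, unique to each customer.
    return np.full(4, 100.0 + customer_id)

def build_model(customer_id):
    # Train with one planted (trigger, secret label) pair per customer.
    Xw = np.vstack([X, trigger_for(customer_id)])
    yw = np.append(y, 2 + customer_id)  # a label no genuine input produces
    def predict(x):                     # stand-in model: 1-nearest-neighbour
        return yw[np.argmin(np.linalg.norm(Xw - x, axis=1))]
    return predict

def identify_leaker(leaked_predict, n_customers=10):
    # Probe the leaked model with every customer's trigger; only the
    # copy shipped to customer cid answers its own trigger with 2 + cid.
    for cid in range(n_customers):
        if leaked_predict(trigger_for(cid)) == 2 + cid:
            return cid
    return None
```

The same trick works with a real classifier (a fully-grown decision tree or an overfit neural network will also memorise the planted pair); the point is only that the trigger response tells you which customer's copy leaked.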

8

solresol t1_j16ed0l wrote

As a real-world example that I encountered with a client that sells software to ecommerce stores... they wanted to know the number of products in a typical ecommerce store.

It turns out that there's a power law at work. If you sample N stores and count the total number of products across all of them, you get XN products. Great: the mean is X.

But if you sample 2N stores, the number of products in total in all the stores is 4XN. That's because you have doubled your chances of finding a store that on its own has 2XN products, and the rest of the stores contribute the 2XN that you would have expected.

When you only sampled N stores, the average number of products per store was X. When you doubled the size of the sample, the average number of products was 2X.

Similar things happen to the variance.

As you increase the sample size, the average number of products per store keeps going up as well.
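The effect is easy to reproduce with a simulated power law (the distribution and exponent here are my assumptions, chosen just to make it visible): with tail exponent at or below 1 the theoretical mean is infinite, so the sample mean keeps growing with the sample size instead of converging.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for "products per store": a Pareto-type power law
# with tail exponent ALPHA <= 1, which has no finite mean.
ALPHA = 0.8

def sample_store_sizes(n):
    # Inverse-CDF sampling: P(size > x) = x ** -ALPHA for x >= 1
    return (1.0 / rng.uniform(size=n)) ** (1.0 / ALPHA)

for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>10,} stores sampled, sample mean = {sample_store_sizes(n).mean():,.1f}")
```

Each time the sample grows you tend to catch a rarer, bigger store that dominates the total, so the "average" climbs instead of settling down. Rerun it with ALPHA = 2.0 and the same loop converges to the true mean of 2.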

In a sane universe you would expect this to end eventually. This particular client is still growing, still analysing more stores, and they are always finding bigger and bigger stores (stores which on their own have more products than all other stores put together). Eventually they will have analysed every store in the world, and then they will be able to answer the question of "what's the average number of products in an ecommerce store that exists right now?"

But who knows? Maybe stores are being created algorithmically; it wouldn't surprise me. Certainly there will be more ecommerce stores in the future, so we probably can't even answer "what's the average number of products in an ecommerce store over all time?"

Anyway, the punchline is: you can't sample this data to estimate its mean, nor its variance.

The original poster is finding that his residuals follow a power law. Depending on how steep the exponent is, it's possible that there is no well-defined mean for his residuals: as he collects more data, his sample mean will keep growing with the number of data points. If he is defining his loss function in terms of the mean of the residuals (or anything along those lines), then gradient descent is going to have some unresolvable[*] problems. If this is true, gradient descent will take his parameters on an exciting adventure through fractal saddles, where there's always a direction that reduces the loss function but makes no improvement to the majority of his data.
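A hedged illustration of why a mean-based loss summary misbehaves here (the Pareto residuals below are simulated, not the poster's actual data): the sample mean of heavy-tailed residuals drifts upward as data accumulates, while a quantile summary like the median stays put, which is why median- or quantile-style losses are the usual escape hatch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated power-law "residuals" with tail exponent ALPHA <= 1 (no finite mean)
ALPHA = 0.8
residuals = (1.0 / rng.uniform(size=1_000_000)) ** (1.0 / ALPHA)

for n in (1_000, 10_000, 100_000, 1_000_000):
    chunk = residuals[:n]
    print(f"n={n:>9,}  mean={chunk.mean():>9.1f}  median={np.median(chunk):.3f}")
```

The mean column jumps around and trends upward with n; the median column sits near its theoretical value of 2^(1/ALPHA) ≈ 2.38 throughout.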

This looks to me like what is happening to him.

[*] Unresolvable with the state of the art at the moment AFAICT. I'm going to put this on my PhD research to do list.

1

solresol t1_j10nsf2 wrote

Epistemic status: I don't know what I'm talking about, and I know I'm not fully coherent. Be kind in replies.

I *think* that your data might not have a finite mean and finite variance. If so, then there's no obvious "best" regression at all. As you get more data, optimality will change. A different random subsample of data will lead to wildly different results.

I have done some research on problems like this in linguistic data, and I was able to do dirty stuff by swapping out the underlying metric so that the notion of where "infinity" was changed. But if you have real-valued data, I don't think this can help.

−4

solresol t1_iy5qxoy wrote

I rather like isolation kernel methods: https://arxiv.org/pdf/2109.14198.pdf

The idea is that you take a random subset of points, and then work out for each point in your dataset which one of those random points it is closest to.

Repeat that process some large number of times. Points that regularly get mapped to the same exemplar are obviously close to each other.

For some tasks that's enough. Otherwise, you can feed that into something else (e.g. t-SNE) and get much better results than if you try to reduce dimensionality directly.
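A sketch of the scheme as described above (the hyperparameter names `psi`, for the number of random exemplar points, and `t`, for the number of repeats, follow the isolation-kernel literature, but the implementation details are my own, not taken from the linked paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated clusters of 50 points each
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

def isolation_kernel_similarity(X, psi=8, t=200):
    # t times: pick psi random exemplar points, then assign every point
    # to its nearest exemplar. Similarity = fraction of rounds in which
    # two points land on the same exemplar.
    n = len(X)
    sim = np.zeros((n, n))
    for _ in range(t):
        exemplars = X[rng.choice(n, size=psi, replace=False)]
        dists = np.linalg.norm(X[:, None, :] - exemplars[None, :, :], axis=2)
        assign = dists.argmin(axis=1)          # nearest exemplar per point
        sim += assign[:, None] == assign[None, :]
    return sim / t

S = isolation_kernel_similarity(X)
# Same-cluster pairs share an exemplar far more often than cross-cluster pairs
print(S[:50, :50].mean(), S[:50, 50:].mean())
```

The resulting S (or 1 − S, used as a distance matrix) is what you'd then hand to t-SNE or a clustering algorithm.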

2