Jaffa6

Jaffa6 t1_jdlk3j9 wrote

There was a paper a while back (Chinchilla, I think) showing that for the best results, model size and the amount of training data should grow roughly in proportion, and that many then-SotA models were undertrained relative to how much data they were given. You might find it interesting.
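If it helps make "proportionally" concrete, my rough recollection of the Chinchilla result is on the order of 20 training tokens per parameter for compute-optimal training. A toy back-of-the-envelope sketch (the constant is approximate and the names are mine, not the paper's notation):

```python
# Rough sketch of the Chinchilla rule of thumb: training tokens should scale
# roughly linearly with parameter count, at ~20 tokens per parameter.
# The constant is approximate, not an exact value from the paper.

TOKENS_PER_PARAM = 20

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

for n_params in (1e9, 10e9, 70e9, 175e9):
    print(f"{n_params / 1e9:>5.0f}B params -> ~{compute_optimal_tokens(n_params) / 1e9:.0f}B tokens")
```

For context, GPT-3's 175B parameters were trained on roughly 300B tokens, which is the kind of "undertrained" gap the paper was pointing at.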

But as a tangent, I think ML focuses too much on chasing accuracy. You see it constantly in SotA papers where they're claiming things like "We improved our GLUE score by 0.1 compared to SotA, and all it took was spending the GDP of Switzerland on electricity and GPUs!"

And it's still a model that hallucinates way too much, contains bias, and just generally isn't worth all that time, money, and pollution.

14

Jaffa6 t1_jdc1gz4 wrote

It's possible, but I think you'd struggle to improve it (though I freely admit that I don't know enough maths to say). But yeah, it's never going to be a reliable method at all.

To be honest, I'd expect you to have more problems with people not being able to sign in as themselves (because their behaviour is inconsistent) than with people deliberately signing in as someone else.

1

Jaffa6 t1_jdbzs22 wrote

This is unfortunately going to be a bit harsh, but it's worth knowing sooner rather than later: Cryptography (which this essentially is) is a VERY difficult field, and designing a secure encryption scheme is notoriously hard to get right.

Wanting to encrypt and decrypt without the key being stored anywhere is an admirable goal, but this is certainly not the way I'd recommend doing it and it's not likely to be secure this way.

If you're dead set on doing it like this, then pretty much any neural network can do it. You're just inputting numbers and wanting numbers out.

I guess your training data would be many sets of behavioural data from each user (say, at least 50 users), and you'd train the model to predict which user each sample came from, heavily penalising it whenever it matches another user instead.
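If it's useful, here's a minimal sketch of that setup in PyTorch (the feature names, sizes, and fake data are all placeholders I've made up): a plain classifier over users, where the cross-entropy loss already penalises putting probability on the wrong user.

```python
# Minimal sketch: classify a behavioural feature vector into one of N known users.
# Everything here (sizes, data) is a placeholder, not a real recommendation.
import torch
import torch.nn as nn

NUM_USERS = 50       # e.g. at least 50 users, as suggested above
NUM_FEATURES = 16    # typing speed, swipe pressure, etc. (placeholder)

model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_USERS),    # one logit per user
)

# Fake behavioural samples: many per user
features = torch.randn(5000, NUM_FEATURES)
labels = torch.randint(0, NUM_USERS, (5000,))

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # pushes down probability on every other user

for epoch in range(10):
    optimiser.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimiser.step()
```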

1

Jaffa6 t1_jdbysmg wrote

Broadly speaking, machine learning models are huge black boxes that you can't really explain the behaviour of.

It's going to be very difficult (if it's even possible) to guarantee that a certain user's behaviour will create a unique key, because the model is really just multiplying and adding numbers derived from the factors you mentioned.

You can certainly generate a key, though.

Much simpler is, as someone else suggested, just using something like the device's MAC address. But then you'll run into issues with users being locked out if the address changes (new device, MAC randomisation, and so on).
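For comparison, the non-ML route is just a standard key-derivation function over whatever identifier you pick, which deterministically gives the same key every time. A minimal sketch (the identifier and salt are placeholders):

```python
# Sketch: derive a stable key from a device identifier with a standard KDF.
# If the identifier changes (new device, MAC randomisation), the key changes too
# and the user is locked out of anything encrypted under the old key.
import hashlib

device_id = b"00:1A:2B:3C:4D:5E"    # placeholder MAC address
salt = b"app-specific-salt"         # placeholder; store alongside the ciphertext

key = hashlib.pbkdf2_hmac("sha256", device_id, salt, iterations=200_000, dklen=32)
print(key.hex())
```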

1

Jaffa6 t1_javl6ef wrote

No problem.

I believe that if you're using a BERT-esque model, you do indeed need to do "full" tokenisation (part of which is creating the attention mask and padding) because BERT expects its input to be a list of token indices. E.g. given the token mapping {"a": 1, "cow": 2, "cat": 3, "dog": 4}, tokenisation would turn "a cat" into [1, 3], which is the form that BERT expects.

And since BERT comes with a token mapping (due to pre-training), if you're just putting in your own features (say, number of likes and number of retweets), they'll quite possibly just get interpreted as random tokens if their numbers match up with known token indices.

If your features are already the right kind (tokenised text, with the resultant indices matching the correct BERT token indices), I suppose you could do truncation/padding yourself and feed that input directly to BERT.

But it'll probably end up simpler and less error-prone to let the pretrained tokeniser handle it for you (e.g. via HuggingFace's `AutoTokenizer.from_pretrained('bert-base-uncased')`).
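Something like this (a minimal sketch; the checkpoint name and example sentences are just illustrative) gives you the token indices, attention mask, and padding in one call:

```python
# Sketch: let the pretrained tokenizer produce token ids, attention mask, and padding.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["a cat", "a slightly longer tweet about a cow"],
    padding=True,         # pad the shorter sequence in the batch
    truncation=True,      # cut anything over the model's max length
    return_tensors="pt",  # PyTorch tensors, ready to feed to the model
)
print(batch["input_ids"])       # token indices in the form BERT expects
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding
```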

2