PredictorX1 t1_j5h3ymz wrote on January 22, 2023 at 11:31 PM

>With more compute, could it be easy to quickly unmask all the people on Reddit by using text correlations to non-masked, publicly available text data?

With labeled samples of text, I think it would be pretty easy to come up with a likelihood model giving a reasonable educated guess of the identity of some Reddit members, and I don't think it would take much computing power.
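
To make "likelihood model" concrete, here is a minimal sketch of one possible approach: a character-trigram naive Bayes scorer with add-one smoothing and a uniform prior over candidates. This is only an illustration, not a tested attribution method; the function names, the trigram choice, and the two sample authors are all made up for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_samples, n=3):
    """Count n-grams per author from a {author: [texts]} mapping."""
    counts = {}
    vocab = set()
    for author, texts in labeled_samples.items():
        counter = Counter()
        for text in texts:
            counter.update(char_ngrams(text, n))
        counts[author] = counter
        vocab.update(counter)
    return counts, vocab


def log_likelihood(text, author_counts, vocab_size, n=3):
    """Log-probability of the text under one author's smoothed n-gram model."""
    total = sum(author_counts.values())
    # Add-one smoothing; the extra +1 reserves mass for n-grams never seen in training.
    denom = total + vocab_size + 1
    return sum(math.log((author_counts[g] + 1) / denom) for g in char_ngrams(text, n))


def posterior(text, counts, vocab, n=3):
    """Posterior over candidate authors, assuming a uniform prior."""
    scores = {a: log_likelihood(text, c, len(vocab), n) for a, c in counts.items()}
    top = max(scores.values())  # subtract the max before exponentiating, for stability
    weights = {a: math.exp(s - top) for a, s in scores.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


# Made-up reference samples for two hypothetical authors (illustration only).
samples = {
    "alice": ["honestly i reckon the whole thing is rather overblown, innit",
              "i reckon that is rather daft, honestly"],
    "bob": ["lol nope nope nope, that's totally bogus dude",
            "dude that is totally awesome lol"],
}
counts, vocab = train(samples)
print(posterior("i reckon that is rather daft", counts, vocab))
```

Counting and scoring are linear in the amount of text, which fits the point that this should not take much computing power. A real attempt would need far more reference text per person than this toy data.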

Loquzofaricoalaphar OP t1_j5h59id wrote on January 22, 2023 at 11:40 PM

So, like, if you fed it samples from the 200 people you were looking for and then fed it Reddit? Perhaps all of Reddit would be tricky, because some people might not have public text and it would be difficult to label all the text on Facebook, LinkedIn, etc.

PredictorX1 t1_j5h5pb5 wrote on January 22, 2023 at 11:43 PM

The biggest technical challenges I see:

Having enough reference samples from known people
The difference between how people write on Reddit and how they write elsewhere (professional articles, e-mail, etc., presumably used as reference)
If too many Reddit users are being considered, it may all dissolve into mush (estimated probabilities would all be low)
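
The "mush" effect in the last point can be sketched with a little arithmetic. Assume, purely for illustration, a uniform prior over candidates and that the true author's likelihood beats every wrong candidate's by the same fixed ratio. Real data would not be this tidy, and the function name is made up for the example.

```python
def true_author_posterior(likelihood_ratio, n_candidates):
    """Posterior of the true author under a uniform prior.

    Assumes (idealized) that the true author's likelihood is
    `likelihood_ratio` times that of each of the other candidates,
    and that all wrong candidates fit equally poorly.
    """
    return likelihood_ratio / (likelihood_ratio + n_candidates - 1)


# Same evidence strength, growing candidate pool.
for n in (2, 200, 20_000, 2_000_000):
    print(n, true_author_posterior(100.0, n))
```

With a 100:1 ratio, the true author gets about 0.99 with 2 candidates but only about 0.33 with 200, and the value keeps shrinking as the pool grows, even though the evidence itself has not changed.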

Loquzofaricoalaphar OP t1_j5h6s4z wrote on January 22, 2023 at 11:51 PM

That is interesting to think about. I’m biased to think text patterns have lots of variables and are fairly unique. Perhaps it’s more of a modeling problem than a compute problem to analyze it at scale and not get mush.