
arxaquila t1_ixs6fqh wrote

Let me be totally frank about applying very powerful statistical programs to large and varied data sets: it can lead to very misleading interpretations. My own experience with large discrete data sets that merged human demographics, social characteristics and individual purchase data began well before the era of Facebook and Google. The growth of these companies was founded on the explosion of personally input data, but their early antecedents were compilers like RL Polk and Metromail, who took phone book lists, driver's license files and a door-to-door household census conducted by Polk.
I founded a company in the 1980s that was a licensee of all of the Polk data and merged it with individual financial services info, insurance data and shopping data across a spectrum of retail/grocery clients. We built more complex consumer models by applying personality-purchase data clusters to detailed consumer survey data collected through both cross-sectional and time-series panels by reputable survey firms. Aside from the modeled data, all of this was discrete data, meaning that it was tied to individuals by name and address. We developed "fuzzy logic" software to identify matches, which allowed many disparate sources of information to flow into a centralized database. The software we used at that time for "data mining" included SPSS advanced modules, CART and some homegrown cross-tabulation systems. Shopper data was collected at the cash register by a variety of approaches. All of this was done when Mark Zuckerberg was still in grade school.
Obviously, software programs have become more sophisticated as well as more powerful, but the limiting factor has always been the ability of the human mind to construct and test hypotheses of causality. At that time there were many programs that pointed to correlations, but to this day I am not aware of software programs that can automatically identify causal relations between various factors.
Don't get me wrong, the powerful descriptors inferred by applying personality typologies like "the Big Five" to voter registration rolls were weaponized by Russian scientists, both internally to cement Putin's hold on power in Russia and to feed Trump's political campaign with important and often critical insights into what hot buttons to press during his presidential campaign in 2016. What I am certain of, though, is that most investigators employing these more powerful tools today are not any smarter than we were 30 years ago and face the same struggle to separate factors that imply causality from mere correlation. Witness the never-ending stream of new research postings on Reddit. So I take many, if not most, of these with a grain of salt and occasionally earn the ire of other posters for my skeptical remarks.
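To make the correlation-versus-causation point concrete, here is a minimal simulated sketch, not anything taken from real data. The variable names (`income`, `coupon_use`, `spend`) and the numbers are purely hypothetical: a hidden factor drives two behaviors that never influence each other, yet they come out strongly correlated. The correlation disappears only once the hidden factor is controlled for, and that step assumes somebody already guessed which factor to measure.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, c):
    """Correlation of a and b after removing the linear effect of c."""
    r_ab = pearson(a, b)
    r_ac = pearson(a, c)
    r_bc = pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def simulate(n=5000, seed=0):
    """Hypothetical data: income drives both behaviors; neither causes the other."""
    rng = random.Random(seed)
    income = [rng.gauss(0, 1) for _ in range(n)]
    coupon_use = [z + rng.gauss(0, 0.5) for z in income]
    spend = [z + rng.gauss(0, 0.5) for z in income]
    return income, coupon_use, spend


income, coupon_use, spend = simulate()
raw = pearson(coupon_use, spend)                 # strong, but not causal
adjusted = partial_corr(coupon_use, spend, income)  # close to zero

print(f"raw correlation:          {raw:.2f}")
print(f"controlling for income:   {adjusted:.2f}")
```

The software happily reports the first number. Knowing that income is the thing to control for is the hypothesis a human has to supply.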

Thanks for your patience if you read this overly long post.
