
KellinPelrine t1_iqtjzfs wrote

Just using the date of publication or the last date of modification does not avoid the issue I described. In my brief reading I couldn't find a link or reference for your data beyond it coming from Kaggle somehow (I might have missed a more exact reference), but your sample is definitely not random: as you describe, it has exactly 2000 real and 2000 fake examples, while a representative random sample would not be balanced. If the 2000 fake ones have 2016 publication dates and the 2000 real ones have 2017 dates, you haven't found a new optimal detection method, nor discovered that every article published in 2016 was fake; you've found an artifact of the dataset. That is still an important finding, especially if other people are using that data and might be drawing wrong conclusions from it, but it is not a new misinformation detection method.
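
For example, a quick sanity check along these lines would show whether a single metadata field is doing all the work (a rough sketch; the file name and the "date"/"label" columns are assumptions, since I don't have your exact dataset):

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("fake_news_sample.csv")          # assumed file name
    df["year"] = pd.to_datetime(df["date"]).dt.year   # assumed "date" column

    # a shallow tree trained on the publication year alone
    clf = DecisionTreeClassifier(max_depth=3)
    scores = cross_val_score(clf, df[["year"]], df["label"], cv=5)
    print("accuracy from publication year alone:", scores.mean())
    # near-perfect accuracy here would mean the date encodes how the data was
    # collected, not whether an article is fake

If that number comes out close to the full model's accuracy, the metadata is doing the classifying, not the detection method.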

Of course, it's probably not as extreme a case as that (although something nearly that extreme has occurred in some widely used datasets, as explained in the paper I linked). But here's a more subtle thought experiment: suppose fake articles were collected randomly from a fact-checking website (a not uncommon practice). Further, suppose that fact-checking website expanded its staff near the 2016 US election, say in October, because there was a lot of interest in and public need for misinformation detection at that time. More staff -> more articles checked -> more fake news detected -> a random sample of fake news from the website will contain more examples from October (when there was more staff) than from September. So in that data the month is predictive, but the pattern will not generalize to other data.
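
You can see this kind of confound directly by looking at the label balance over time (again a sketch, with an assumed file name, an assumed "date" column, and an assumed "fake" label value):

    import pandas as pd

    df = pd.read_csv("fake_news_sample.csv")                    # assumed file name
    df["month"] = pd.to_datetime(df["date"]).dt.to_period("M")  # assumed "date" column

    # fraction of fake articles per month; a sharp spike around October 2016
    # would point to a collection artifact, not a property of fake news itself
    print(df.groupby("month")["label"].apply(lambda s: (s == "fake").mean()))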

A machine learning paper, whatever the audience, requires some guarantee of generalization. Since the metadata features used in your paper are known to be problematic in some datasets, and the paper only reports results on one dataset, in my opinion it cannot give confidence in generalization without some explanation of the "why."


loosefer2905 OP t1_iqw45vd wrote

As per our understanding, choosing the right machine learning model is one thing; choosing the right attributes is another. The reason for the Bayesian classifier's good detection accuracy on this dataset is the type of study done: most past papers have worked on extracting linguistic features from the article, while some have looked at it from a social media perspective, i.e. examining Twitter profiles and classifying tweets as fake or real on that basis. Month is not the ONLY attribute we used; the type of news (Political, World News, US news) was another factor.
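
To be concrete about the kind of setup this refers to, here is a minimal sketch of a naive Bayes classifier over those metadata attributes (the file name and the "date", "subject", and "label" columns are assumptions for illustration, not our exact code):

    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("fake_news_sample.csv")                       # assumed file name
    df["month"] = pd.to_datetime(df["date"]).dt.month.astype(str)  # assumed "date" column

    # one-hot encode the month and news-type attributes, then apply naive Bayes
    model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), MultinomialNB())
    scores = cross_val_score(model, df[["month", "subject"]], df["label"], cv=5)
    print("cross-validated accuracy:", scores.mean())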

Choosing the right model, the right attributes, and the right methodology is a specific skill. Most linguistic-feature-extraction-based models, for example, are more complicated in nature, yet in most of the previous work we saw they cannot even discern real news from fake news very well... the accuracy is in the 70s. For us, getting the right performance with the right selection of attributes was critical, and we feel we have done a decent job at that.

The "why" should be left to your interpretation. I have already said what I said: it is political in nature. More than that we cannot say.
