Dataset

To develop your software, we provide a training corpus of news articles in Urdu, each labeled as Fake News or Real News. You will be given a set of news samples (real and fake). For each article, the task is to determine whether it is real or fake.

Data


The Urdu fake news dataset, named Bend-The-Truth, is composed of news articles in six domains: technology, education, business, sports, politics, and entertainment. To the best of our knowledge, this is the only available annotated corpus for fake news detection in Urdu. The real news articles were collected through a rigorous procedure from a variety of mainstream news websites, predominantly in Pakistan, India, the UK, and the USA. The sources include BBC Urdu News, CNN Urdu, Express-News, Jung News, Naway Waqat, and many other reliable news websites, covering the period from January 2018 to December 2018.

The Urdu fake news detection task was proposed in (Amjad et al., 2020). The fake news articles in this dataset were intentionally written by a group of professional journalists, each proficient in the corresponding topic. The fake articles cover the same domains and are approximately the same length as the verified news.


The dataset statistics can be seen in the figure.


Training set: The corpus can be downloaded from (Training Dataset)

Test set: The corpus can be downloaded from (Test Dataset)
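
Once the training corpus is downloaded, the articles need to be read together with their labels. The following is a minimal Python sketch of one way to do that. It assumes a hypothetical layout in which the archive unpacks into two folders named Real and Fake, each holding one UTF-8 text file per article. The folder names, the .txt extension, and the load_corpus helper are illustrative assumptions, not the documented structure of the dataset, so adjust them to match the files you actually receive.

```python
from pathlib import Path
from typing import List, Tuple

# Hypothetical label folders; the real archive may be organized differently.
LABELS = ("Real", "Fake")


def load_corpus(root: str) -> List[Tuple[str, str]]:
    """Return (text, label) pairs read from <root>/<label>/*.txt."""
    samples: List[Tuple[str, str]] = []
    for label in LABELS:
        folder = Path(root) / label
        if not folder.is_dir():
            # Skip labels whose folder is absent instead of failing.
            continue
        for path in sorted(folder.glob("*.txt")):
            # Urdu text is assumed to be stored as UTF-8.
            text = path.read_text(encoding="utf-8").strip()
            samples.append((text, label))
    return samples
```

Calling load_corpus on the unpacked training folder would then give a list of (text, label) pairs that can be passed to any binary classifier.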