Data Acquisition - NewsAPI

Introduction

Part of the discussion surrounding Student Loan Forgiveness is the reporting done by different news media organizations. These organizations generally align with a certain political bias. Given that those on the left tend to support at least some form of student loan relief while those on the right tend to oppose it, it could be fruitful to pair news organizations with their political biases. To achieve this, two main sources of news-related data were explored: NewsAPI for article data and AllSides for political bias ratings.

NewsAPI Extraction

NewsAPI maintains an application programming interface (API) which returns information on articles published around the world. For the scope of this analysis, the API was queried with two search parameters: the specific parameter of student loan forgiveness and the more general parameter of student loans.

The data is initially returned in JSON format, which was then converted into a dataframe and subsequently exported to a CSV file.
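
A minimal sketch of this step, assuming the requests library, a placeholder API key, and NewsAPI's /v2/everything endpoint; the exact client code used here may have differed:

```python
import pandas as pd
import requests

API_KEY = "YOUR_NEWSAPI_KEY"  # placeholder; a real key is required

def fetch_articles(query: str) -> pd.DataFrame:
    """Query NewsAPI's /v2/everything endpoint and flatten the JSON to a dataframe."""
    response = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": query, "language": "en", "apiKey": API_KEY},
    )
    response.raise_for_status()
    articles = response.json()["articles"]  # list of per-article dicts
    df = pd.json_normalize(articles)        # nested source dict becomes source.name
    df = df.rename(columns={"source.name": "source", "publishedAt": "date"})
    return df[["author", "content", "description", "date", "source", "title", "url"]]

# One dataframe per search parameter, each exported to a CSV file
for query in ["student loan forgiveness", "student loans"]:
    fetch_articles(query).to_csv(f"newsapi_{query.replace(' ', '_')}.csv", index=False)
```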

NewsAPI Extraction Initial Data

Sample of the Raw NewsAPI Data

[Interactive table: author, content, description, date, source, title, url]

Columns:

  • Author: author of the article; sometimes the source of the article instead
  • Content: truncated version of the article itself
  • Description: summary of the article
  • Date: date the article was published
  • Source: news media organization the article was published by
  • Title: the title of the article
  • URL: the url of the article on the news media organization source

NewsAPI Scraping

One limitation of NewsAPI is that it doesn’t return the content of an article in its entirety. To account for this, numerous web scrapers were built for sources which were both scraping-eligible and appropriate for this analysis. This reduced the number of sources and articles retained from NewsAPI, but resulted in complete article content. A list of eligible sources and their related scrapers was iteratively called to populate a dataframe with the individual paragraphs from the URL associated with each article. Procedurally, this initial scraping pass results in a dataframe where each paragraph from each scrapable article has its own row.
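
A sketch of that iterative loop, assuming requests and BeautifulSoup; the SCRAPERS registry and the generic paragraph scraper below are illustrative stand-ins for the per-source scrapers actually built:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url: str) -> list[str]:
    """Generic fallback: return the text of every <p> tag on the page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]

# Illustrative registry mapping each scraping-eligible source to its scraper
SCRAPERS = {"Example Source": scrape_paragraphs}

def scrape_all(extracted: pd.DataFrame) -> pd.DataFrame:
    """Produce one row per paragraph per scrapable article."""
    rows = []
    for _, article in extracted.iterrows():
        scraper = SCRAPERS.get(article["source"])
        if scraper is None:  # source was not scraping-eligible
            continue
        for num, paragraph in enumerate(scraper(article["url"]), start=1):
            rows.append({"source": article["source"], "url": article["url"],
                         "paragraph": paragraph, "paragraph_num": num})
    return pd.DataFrame(rows)
```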

NewsAPI Scraped Initial Data

Sample of the Raw Scraped Data

[Interactive table: source, url, paragraph, paragraph_num]

The table above shows a random sample of 5 rows after scraping. The paragraph and paragraph_num columns are new compared to the initially extracted data.

Using the source and url columns as keys, these rows can be recombined with the columns from the original extraction of data for future use.
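
With pandas, that recombination is a left merge on the two key columns (the dataframe names here are illustrative):

```python
# scraped_df: source, url, paragraph, paragraph_num
# extracted_df: author, content, description, date, source, title, url
combined = scraped_df.merge(
    extracted_df.drop(columns=["content"]),  # truncated content is superseded
    on=["source", "url"],
    how="left",
)
```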

AllSides Bias Data

AllSides is a community-based political bias rating platform for media. Using web scraping, bias ratings for the sources present in the scraped data were accumulated. Pages (URLs) containing information on each source were manually found and then in turn scraped for their specific bias ratings.
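
A sketch of that rating scrape, again assuming requests and BeautifulSoup; the CSS selector is purely illustrative and does not reflect AllSides’ actual markup:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_bias(source: str, url: str) -> dict:
    """Pull the bias rating from one manually found AllSides source page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    rating = soup.select_one(".bias-rating")  # hypothetical selector
    return {"source": source, "url": url,
            "Bias Specific": rating.get_text(strip=True) if rating else None}

# Illustrative mapping of sources to their manually found AllSides pages
allsides_pages = {"Example Source": "https://www.allsides.com/news-source/example"}
allsides_df = pd.DataFrame(scrape_bias(s, u) for s, u in allsides_pages.items())
```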

AllSides Initial Data

Sample of the AllSides Data

[Interactive table: source, url, Bias Numeric, Bias Specific, Type, Region, Website]

Columns:

  • Source: Source as listed in AllSides
  • URL: link to the rating page in AllSides
  • Bias Numeric: numeric version of the bias, from -6 to +6 (left to right, respectively)
  • Bias Specific: categorically binned version of the bias
  • Type: news media, etc.
  • Region: regional location and coverage area of the news media organization
  • Website: home website of the news media organization

Cleaning and Merging

Thus far, raw data has been gathered from an API and an array of web scraping functions. This can now be combined to create potential labels for the complete articles associated with the NewsAPI extraction. However, some data cleaning should be performed first.

Cleaning Strategy

NewsAPI Extraction (a cleaning sketch follows this list):

  • author
    • duplicate authors from same article removed
    • email addresses and links removed
    • stripped of leading, trailing, and multiple spaces
    • if author is None then the entry from the source column is used in its place
    • line breaks removed
  • description
    • text before and including delimiters removed (e.g., a delimiter could be “Date and Location –”)
  • content
    • ignored in lieu of scraping the entire article itself
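
A sketch of the author cleaning, with illustrative regular expressions:

```python
import re
import pandas as pd

def clean_author(author, source):
    """Apply the author-cleaning rules listed above (regexes are illustrative)."""
    if pd.isna(author):                                   # None (or NaN): fall back
        return source                                     # to the source column
    author = re.sub(r"\S+@\S+|https?://\S+", "", author)  # emails and links
    author = author.replace("\n", " ")                    # line breaks
    author = re.sub(r"\s+", " ", author).strip()          # collapse and strip spaces
    unique = dict.fromkeys(a.strip() for a in author.split(","))  # de-duplicate
    return ", ".join(unique)

extracted_df["author"] = [
    clean_author(a, s) for a, s in zip(extracted_df["author"], extracted_df["source"])
]
```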

NewsAPI Scraping Extraction (a cleaning sketch follows this list):

  • paragraph
    • blank paragraphs removed
    • non-breaking space characters removed
    • stripped of leading, trailing, and multiple spaces
    • all paragraphs of a single article combined post-cleaning
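
A sketch of the paragraph cleaning and the post-cleaning recombination into whole articles:

```python
cleaned = scraped_df.copy()
cleaned["paragraph"] = (
    cleaned["paragraph"]
    .str.replace("\xa0", " ")              # non-breaking space characters
    .str.replace(r"\s+", " ", regex=True)  # collapse multiple spaces
    .str.strip()                           # leading and trailing spaces
)
cleaned = cleaned[cleaned["paragraph"] != ""]  # drop blank paragraphs

# Recombine each article's cleaned paragraphs, in order, into one string
articles = (
    cleaned.sort_values("paragraph_num")
    .groupby(["source", "url"])["paragraph"]
    .agg(" ".join)
    .reset_index(name="article")
)
```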

AllSides Bias Ratings: no cleaning was required for this data; however, a source-to-source map was created, as some source names were not identical between NewsAPI and AllSides.

Merging Strategy

The data with the completely scraped articles still retained the source and url columns. Using those as keys, the cleaned extraction columns can be merged in, along with the bias ratings.
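
A sketch of the merge, reusing the illustrative dataframe names from above, with extracted_clean standing in for the cleaned extraction columns and a hypothetical hand-built source map:

```python
# Hypothetical entries reconciling NewsAPI source names with AllSides names
source_map = {"Some Outlet": "Some Outlet News"}

labeled = articles.merge(extracted_clean, on=["source", "url"], how="left")
labeled["source_bias"] = labeled["source"].replace(source_map)
labeled = labeled.merge(
    allsides_df[["source", "Bias Numeric", "Bias Specific"]],
    left_on="source_bias", right_on="source",
    how="left", suffixes=("", "_allsides"),
).drop(columns=["source_allsides"])
```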

This resulted in a dataframe with the following columns:

  • source
  • url
  • article
  • source_bias
  • Bias Numeric
  • Bias Specific
  • author
  • description
  • date
  • title
  • search

The text data itself will likely be article, but interesting analyses could be extended by using description or title.

The main label of interest will be Bias Specific, but interesting analyses could hinge upon using source, author, date, or search (the topic query parameter).

Sample of the Final Labeled Data

[Interactive table: source, url, article, source_bias, Bias Numeric, Bias Specific, author, description, date, title, search]

Specific and Complementary Data

Although many of the sources were able to be matched with a political bias, a number were not.

Normally, missing data is either interpolated in some manner or simply removed. However, a goal of the subsequent analyses will be attempting to predict the political biases of articles through classification techniques (and potentially the numeric representation of political bias through regression). The missing label values will therefore be retained in the data overall, but will be removed for training, and then predicted afterwards.
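
In pandas terms this is a simple partition on the label column, e.g.:

```python
has_label = labeled["Bias Specific"].notna()
train_pool = labeled[has_label]    # used to train the classifiers
unlabeled = labeled[~has_label]    # retained; labels predicted afterwards
```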

Additionally, classification is best trained on balanced data. Currently, the data with known biases is spread across five different categories:

  • Left
  • Lean Left
  • Center
  • Lean Right
  • Right

It could, however, be combined into three labels, as mapped in the sketch after this list:

  • Left
  • Center
  • Right
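
That collapse is a direct relabeling; a sketch, with the new column name being illustrative:

```python
three_way = {
    "Left": "Left", "Lean Left": "Left",
    "Center": "Center",
    "Lean Right": "Right", "Right": "Right",
}
labeled["Bias Three Way"] = labeled["Bias Specific"].map(three_way)
```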

In either the three- or five-category model, more data could be beneficial. The political bias labels from the AllSides data apply to each media outlet overall. In other words, the sources should exhibit indicators of bias regardless of the topic queried. Therefore, to gather more training data, articles from the three most frequent sources within each of the five categories were gathered following the same methodology as above. Although this data does not adhere specifically to the topic of student loan forgiveness, it could be a complementary training source for political bias.
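
A sketch of identifying those sources, reusing the illustrative labeled dataframe:

```python
# Three most frequent sources within each of the five bias categories
top_sources = (
    labeled.dropna(subset=["Bias Specific"])
    .groupby("Bias Specific")["source"]
    .value_counts()       # counts sorted high-to-low within each category
    .groupby(level=0)
    .head(3)
)
```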

Note that where the complementary data contained articles duplicated within the topic-specific data, those articles were removed.
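
One way to perform that removal is to drop complementary rows whose url already appears in the topic-specific data (dataframe names illustrative):

```python
complementary = complementary[~complementary["url"].isin(labeled["url"])]
```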

The complementary data could provide the additional spread of labels: