Vectorization - NewsAPI

Introduction

The data prepared and merged with potential labels in Data Acquisition - NewsAPI needs a few more steps before the news articles become numerical representations that can be used for further analyses and machine learning applications.

Strategy - Further Preprocessing

Recall that the prepared data looks like this:

Columns: source, url, article, source_bias, Bias Numeric, Bias Specific, author, description, date, title, search

The main label of interest for this data is Political Bias; however, other potential labels include:

  • News Organization Source
  • Author
  • Date
  • Search Query Parameter

The data itself will be the News Article; however, other potential data sources include:

  • Title
  • Description

This page uses only the full News Article as data, but comparing it against Title and Description could be worthwhile in the future.

The following additional preprocessing is applied to each article:

  • Remove line breaks.
  • Remove punctuation.
  • Remove words containing numbers.
  • Remove standalone numbers.
  • Remove leading and trailing spaces.
  • Lowercase the remaining words.
  • Remove any remaining single-character tokens.
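
The steps above can be sketched as a single function. This is a minimal illustration rather than the exact code behind this page: the function name `preprocess_article` is hypothetical, punctuation is assumed to be replaced by a space (which would explain fragments such as don and ll in the feature lists), and only ASCII punctuation from `string.punctuation` is handled.

```python
import re
import string


def preprocess_article(text: str) -> str:
    """Apply the listed cleaning steps to a single article."""
    # Remove line breaks (replaced by spaces so neighboring words do not fuse).
    text = re.sub(r"[\r\n]+", " ", text)
    # Remove punctuation. Assumption: each mark becomes a space.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Remove words containing digits; this also removes standalone numbers.
    text = re.sub(r"\b\w*\d\w*\b", " ", text)
    # Remove leading/trailing spaces and lowercase what remains.
    text = text.strip().lower()
    # Drop single-character tokens and collapse repeated whitespace.
    tokens = [token for token in text.split() if len(token) > 1]
    return " ".join(tokens)


print(preprocess_article("Biden's plan:\n$400B in 2025 for COVID19 relief, a win!  "))
# biden plan in for relief win
```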

Strategy - Vectorizing

With the articles prepared, several vectorized dataframes will be created. Word count dataframes will be built with CountVectorizer() and normalized (TF-IDF weighted) word count dataframes with TfidfVectorizer(), both from scikit-learn. Both vectorizers will also be used to remove stopwords. The dataframes will be further subset by political bias label. Lemmatizing and stemming will be used to create additional versions for further analyses. A further option is to create versions with different limits on the maximum number of features (words) allowed in a dataframe.

Vectorizing - Overall

These versions vectorize the fully preprocessed NewsAPI data in its entirety, remove stopwords, and use a maximum of 200 features. A sample of each vectorized dataset is shown. Additionally, a wordcloud and a top-ten feature visualization are shown for the CountVectorizer() versions.
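
As a rough illustration of how the most frequent words might be derived from a word count dataframe, the sketch below totals each feature column and keeps the largest totals. The small `count_df` and the helper `top_features` are hypothetical.

```python
import pandas as pd

# Hypothetical word count dataframe: one row per article, one column per feature.
count_df = pd.DataFrame(
    {"loan": [3, 1, 0], "student": [2, 2, 1], "tax": [0, 1, 5], "plan": [1, 0, 0]}
)


def top_features(counts: pd.DataFrame, n: int = 10) -> pd.Series:
    """Total each feature column and keep the n largest totals."""
    return counts.sum(axis=0).sort_values(ascending=False).head(n)


top_two = top_features(count_df, n=2)
print(top_two)
```

The same totals, converted to a word-to-frequency mapping with `top_features(count_df).to_dict()`, could also serve as the input for a wordcloud.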

CountVectorizer

Labeled Sample

Label columns: source, author, Bias Specific, Bias Numeric, date, search
Feature columns: able acceptance access according account act administration agencies agency aid american americans assistance balance bank based best biden billion borrowers budget business card changes class college company congress cost costs country court cr credit crore current data day debt december department development did does doge don donald driven earnings economic education employees employment end enrollment equity example executive expected expenses faculty family federal financial forbearance forgiveness free fund funding funds future going good government graduation grants growth health help high higher home house idr including income increase india information institutions insurance job just know ll loan loans location long low lower make making management market median memo million money month monthly months musk national need net new news number office options order pause pay payment payments people personal plan plans policy power president private process profit program programs provide pslf public quarter rate rates ratio read real relief repayment report republican research right rs said save say says school schools sector security seen service services set social spending start state states student students support tax term think time total trump undergraduate university use ve want washington way week white work workers working year years

Wordcloud

Most Frequent Words

TfidfVectorizer

Labeled Sample

Label and feature columns are identical to the CountVectorizer labeled sample above; the values are TF-IDF weights rather than raw counts.

Vectorizing - Overall Lemmatized

CountVectorizer

Labeled Sample

Label columns: source, author, Bias Specific, Bias Numeric, date, search
Feature columns: access according account act action administration agency aid american balance bank based benefit biden billion borrower budget business card case challenge change class college come community company congress consumer cost country court cr credit crore cut data day debt december decision democrat department development did doe dollar don donald driven earnings economic education effort employee employment end enrollment equity example executive expected expert faculty family federal finance financial forbearance forgiveness free fund funding future going good government graduation grant group growth ha health help high higher home house idr including income increase india information know law legal life like likely ll loan location long look low lower make making management market mean median memo million money month monthly mortgage musk national need net new news number offer office opportunity option order pause pay payment people personal plan policy power president price private profit program provide pslf public quarter rate ratio relief repayment report republican research right risk rule said save saving say school sector security service social spending start state statement student support tax term thing think time total trump undergraduate university use ve wa want washington way week white work worker working year

Wordcloud

Most Frequent Words

TfidfVectorizer

Labeled Sample

Label and feature columns are identical to the lemmatized CountVectorizer labeled sample above; the values are TF-IDF weights rather than raw counts.

Vectorizing - Overall Stemmed

CountVectorizer

Labeled Sample

Label columns: source, author, Bias Specific, Bias Numeric, date, search
Feature columns: accept access accord account act action addit administr agenc aid allow american ani anoth balanc bank base becaus befor benefit biden billion borrow budget busi card challeng chang colleg come commun compani congress consid consum continu cost countri court creat credit crore current cut data day debt democrat depart develop don dure earn econom educ effect effort employ end enrol equiti execut expect expens faculti famili feder financ financi forgiv free fund futur gener good govern graduat grant growth ha health help hi high higher home hous howev idr impact import includ incom increas job just know law legal like limit live loan locat long look lower major make manag mani market mean million money month monthli nation need new news number offer offic onli opportun option order paus pay payment peopl person plan polici power presid privat process profit program project propos provid pslf public qualifi quarter rate receiv relief remain repay report republican requir research right rs rule said save say school sector secur servic set share sinc social spend start state student support tax term thi think time total tri trump undergradu univers use wa want way week white work year

Wordcloud

Most Frequent Words

TfidfVectorizer

Labeled Sample

Label and feature columns are identical to the stemmed CountVectorizer labeled sample above; the values are TF-IDF weights rather than raw counts.

Vectorizing - Political Bias

Lemmatizing appears to aggregate word forms while keeping the words meaningful. For instance, students is merged into student and loans into loan, while a word like education is not reduced to a stem such as educ. Lemmatizing is therefore a logical preprocessing step moving forward, and the vectorized versions for each political bias use lemmatization and stopword removal. There are two ways to create the political bias subsets:

  • Subsetting Second: vectorize the entire dataset → append labels → subset on political bias
  • Subsetting First: subset the dataset on political bias → vectorize the subset → append labels

With subsetting first, the maximum-feature limit is applied within each subset, so the retained vocabulary reflects the corpus of the respective political bias rather than the most frequent words overall. Subsetting first is therefore likely to be more useful for this comparative analysis.

Additionally, CountVectorizer() will be used instead of TfidfVectorizer() for a first pass. A normalized version of the features could be useful in some cases, such as when documents vary in length (i.e., total word count), but raw feature counts are more useful for the visualizations in this analysis.
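
A small sketch of why normalization matters for documents of different length: the two hypothetical documents below have the same content, but one is three times as long. It relies on the scikit-learn default of L2-normalizing each TF-IDF row.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

short_doc = "loan debt relief"
long_doc = " ".join([short_doc] * 3)  # same content, three times the length
docs = [short_doc, long_doc]

counts = CountVectorizer().fit_transform(docs).toarray()
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# Raw counts scale with document length ...
counts_scale = counts[1] / counts[0]
# ... while the L2-normalized TF-IDF rows are identical.
rows_match = np.allclose(tfidf[0], tfidf[1])
```

Here every count in the longer document is three times larger, whereas the two TF-IDF rows coincide.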

Political Bias: Left

Labeled Sample

Label columns: source, author, Bias Specific, Bias Numeric, date, search
Feature columns: able abolition act administration agency aggregated america american assistance athlete biden big billion billionaire budget business california car care case change child civil class clear come company congress control country court cut day debt democracy democrat democratic department did district doe doesn doing dollar don donald economic education effect effort election employee end family far federal feel financial food forecast free freeze fund funding going good gop government grant group ha hard having health help home house idea including income individual industry insurance ivy job just justice kid kind know labor law le league likely limit little loan long lot maga major majority make making marcotte mean meat medicaid member memo men middle million money month musk national need new number office opportunity order parent party past pause pay people percent person place plan point policy political poor power prediction president program project provide public question read really repeal republican right said sargent say school security senate service social spending start state student suicide support tariff tax term thing think time told trillion trump trying tuesday union use ve vote voter wa want way wealth week white woman work working workplace world wrote year

Wordcloud

Most Frequent Words

Political Bias: Lean Left

Labeled Sample

Label columns: source, author, Bias Specific, Bias Numeric, date, search
Feature columns: able access according account act action administration agency aid american assistance bank bankruptcy based benefit best biden big billion borrower brennan budget business california card case cfpb change child college come community company congress consumer cost country court credit cut data day debt decision democrat department did doe doesn doge dollar don donald education effort employee end equity executive expert family federal feel finance financial forgiveness free fund funding future getting going good got government grant group ha hard health help high higher home house including income information insurance investment issue job just know like likely line ll loan long look low make making management margaret market mcmahon mean member memo million money month mortgage musk national need new number offer office option order parent pause pay payment people plan policy political power president price private process program project public question rate really relief repayment report republican retirement right risk rule said save saving say school secretary security senator service social spending start state student support sure tax team term thing think time today told took trump trying tuesday university use used ve vought wa want washington way week white work worker working year

Wordcloud

Most Frequent Words

Political Bias: Center

Labeled Sample

Label columns: source, author, Bias Specific, Bias Numeric, date, search
Feature columns: able acceptance according account act action administration agency aid american application apply available average balance based benefit best biden billion borrower budget business buy buyback card case certain challenge change check college come congress consumer cost court credit current cut data day debt department different doe don driven earnings education eligible employment end enrolled enrollment example executive expense expert faculty federal financial forbearance forgiven forgiveness free freeze fund funding future goal good gov government graduation grant ha health help high higher hold home house icr idr impact important including income increase individual information initiative like likely ll loan location long look low lower make making market mean median medical memo million money month monthly mortgage need new number offer office official opportunity option order pause pay paye paying payment people period personal plan policy power president process processing program provide pslf public qualify qualifying rate ratio read receive recent relief repayment report republican research review right rubin rule said save saving say school score security service servicer set social spending start state step student studentaid sure tax term time total transfer trump undergraduate university use ve wa want washington way week white work year

Wordcloud

Most Frequent Words

Political Bias: Lean Right

Labeled Sample

Label columns: source, author, Bias Specific, Bias Numeric, date, search
Feature columns: acceptance access according act action administration agency aid america american ap area art best biden billion bond borrower bradley budget business california campus care center change child city class close college come community congress control cost country course court credit cut data day debt decade decision democrat department doe doge don donald earnings education effort employee employment end enrollment executive expectancy faculty fafsa family federal financial forgiveness free friday funding going government graduation grant group growth ha health high higher home house including income increase information institute institution issue job judge just know known live loan local located location look low lower make making massachusetts median million monday money month musk nation national nearly need new nonprofit number offer office opportunity order organization parent pause pay payment people personal plan policy power president press private program project provide public rate ratio record relief repayment report republican research resource right said save say school science secretary security service social spending state status strong student study support supreme tax team technology term think time today trillion trump undergraduate university use virginia virginian wa want washington way week white woman work worker working world year york young

Wordcloud

Most Frequent Words

Political Bias: Right

Labeled Sample

Label columns: source, author, Bias Specific, Bias Numeric, date, search
Feature columns: access act action administration agency agenda aid amendment america american assistance attorney benefit biden billion birthright border borrower branch business called campaign case catholic change child citizen citizenship civil click college come congress constitution constitutional cost country court crisis cut daily data day debt decision dei democracy democrat democratic department did digital district doe doge dollar don donald education effort elect election end executive failed family federal final financial foreign forgiveness foundation fox free freeze fund funding general getty going government grant group ha harris help higher house illegal image immigrant immigration individual issue law left legal legislation like loan long look make mean medium member memo military million money musk nation national nearly need new news obama office official order organization parent party past pause pay people plan policy political post power president presidential press private program provide public read regulation related relief rep report reporter republican right rule ruling said say school second secretary security service social spending state student support supreme tax taxpayer term thing think time told took treasury trillion trump tuesday united university use vance ve vice wa want washington way week white wing work worker world year york

Wordcloud

Most Frequent Words

Summary

Lemmatizing aggregates word forms well while retaining the meaning of the words, so it will likely be a standard preprocessing step moving forward. Although raw word counts rather than normalized word counts were shown in the political bias section, this was mainly for illustrative purposes. Normalized word counts could be useful when generalizing to much shorter or much longer texts. The maximum number of features can be altered in future analyses, and words that are common across all biases could be removed to better highlight the differences. Additionally, Lean Left may be combined with Left, and Lean Right with Right, in future analyses.
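
One possible way to remove words shared across all biases is sketched below. The helper `drop_shared_features` is hypothetical, and the vocabularies are heavily shortened examples drawn from the feature lists above.

```python
def drop_shared_features(vocab_by_bias: dict[str, set[str]]) -> dict[str, set[str]]:
    """Remove features that appear in every bias vocabulary."""
    shared = set.intersection(*vocab_by_bias.values())
    return {bias: vocab - shared for bias, vocab in vocab_by_bias.items()}


# Shortened example vocabularies (a few words per bias, for illustration only).
vocab_by_bias = {
    "Left": {"trump", "student", "union", "medicaid"},
    "Center": {"trump", "student", "pslf", "forbearance"},
    "Right": {"trump", "student", "border", "immigration"},
}

distinct = drop_shared_features(vocab_by_bias)
```

In this example trump and student are dropped everywhere, leaving only the words that distinguish each bias. A softer rule, such as dropping words present in most rather than all vocabularies, could be substituted.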