Modeling - Preparation

Introduction

This section focuses on supervised machine learning, specifically classification, for which several families of models will be used.

Supervised machine learning models require labeled data, that is, known tags on the records used to train the model. When training the models, the data is split into disjoint training and testing sets. In essence, the models learn from the training set and are then evaluated on unseen data. This helps to detect overfitting and simulates applying the model to real-world data.

This is what the exploratory and unsupervised methods in the previous sections have been leading to. The idea is to begin with the NewsAPI data, labeled with political bias by news organization. Once acceptable models have been created, they will be applied to the Reddit data in an attempt to project political bias onto Reddit authors. The ultimate goal is to find positive and negative sentiment on the topic of student loan forgiveness. Because that sentiment is roughly split along political bias, classifying by political bias should be a decent indicator of sentiment.

Data Preparation

To prepare for the modeling, the NewsAPI articles from organizations with a known political bias will be used. To improve the models, general (non-topic-specific) articles will be combined with the topic-specific articles. The Reddit data, which will have political bias projected onto it, will be content aggregated by author. Additionally, only authors with an acceptable amount of content will be used (roughly, at least the first quartile of news article length).

NewsAPI

As a reminder, the political labels are:

  • Left
  • Lean Left
  • Center
  • Lean Right
  • Right

There will be several aggregations of the labels used:

  • 5 Labels (strictly all five)
  • 3 Labels
    • Lean Left combined into Left
    • Center
    • Lean Right combined into Right
  • 3 Labels Strict
    • Strictly Left
    • Strictly Center
    • Strictly Right
  • 2 Labels
    • Lean Left combined into Left
    • Lean Right combined into Right
  • 2 Labels Strict
    • Strictly Left
    • Strictly Right

Each of these aggregations will be transformed into a labeled, word-vectorized version limited to 1,000 words.
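The label aggregations listed above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the function name `aggregate_labels`, the scheme names, and the mapping dictionary are hypothetical, and dropping Center rows in the two-label scheme is an assumption inferred from the list.

```python
import pandas as pd

# Hypothetical mapping that folds the "Lean" labels into their side.
COMBINE_TO_SIDE = {
    "Left": "Left",
    "Lean Left": "Left",
    "Center": "Center",
    "Lean Right": "Right",
    "Right": "Right",
}


def aggregate_labels(df: pd.DataFrame, scheme: str, label_col: str = "BIAS") -> pd.DataFrame:
    """Return a relabeled or filtered copy of df for one aggregation scheme."""
    out = df.copy()
    if scheme == "five":
        return out
    if scheme == "three":
        # Lean Left -> Left, Lean Right -> Right, Center kept.
        out[label_col] = out[label_col].map(COMBINE_TO_SIDE)
        return out
    if scheme == "three_strict":
        # Keep only rows that were strictly Left, Center, or Right.
        return out[out[label_col].isin(["Left", "Center", "Right"])]
    if scheme == "two":
        # Combine the leans, then drop Center (assumed).
        out[label_col] = out[label_col].map(COMBINE_TO_SIDE)
        return out[out[label_col] != "Center"]
    if scheme == "two_strict":
        # Keep only rows that were strictly Left or Right.
        return out[out[label_col].isin(["Left", "Right"])]
    raise ValueError(f"unknown scheme: {scheme}")


sample = pd.DataFrame({"BIAS": ["Left", "Lean Left", "Center", "Lean Right", "Right"]})
three = aggregate_labels(sample, "three")
two = aggregate_labels(sample, "two")
two_strict = aggregate_labels(sample, "two_strict")
```

Note that the combined schemes relabel rows, while the strict schemes drop them, which is why the label counts differ between the two.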

Reddit

The Reddit data will remain unlabeled, as this will be where the models are applied to project political bias onto authors. However, the Reddit data will be transformed into word vectorized versions with no limit on the maximum word count.

Vectorizing

NewsAPI Data

After the text data was cleaned, stopwords were removed, and the words were lemmatized, count vectorization was performed and the labels were reappended to the result. A sample of this data looks like:

[Table: sample of the labeled, vectorized NewsAPI data. Columns: BIAS, followed by one count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

From this vectorized version of the data, rows will be aggregated or dropped depending on which of the 5-, 3-, or 2-label strategies outlined above is used.

Reddit Data

After the text data was cleaned, stopwords were removed, and the words were lemmatized, count vectorization was performed. No labels were appended. A sample of this data looks like:

[Table: sample of the unlabeled, vectorized Reddit data. One count column per vocabulary word (aa, abandon, abandoned, ..., yr, zero, zillion).]

Training Testing Split

An important feature of the training and testing sets is that they are disjoint. Notice that the indices in the first few rows below differ between the training and testing sets, but within each set the indices of the data match those of the labels. Keeping the sets disjoint simulates real-world use, while the matching indices let the models learn by pairing each record with its respective label.
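The index behavior described above can be checked with a short sketch using scikit-learn's `train_test_split`. The toy counts, the 25% test size, the random seed, and the stratification are all assumptions for illustration; the split parameters actually used are not stated here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labeled document-term matrix (made-up counts, not the project's data).
dtm = pd.DataFrame({
    "BIAS": ["Left", "Left", "Left", "Left", "Right", "Right", "Right", "Right"],
    "loan": [2, 1, 0, 3, 1, 0, 2, 1],
    "debt": [0, 1, 2, 1, 3, 1, 0, 2],
})
X = dtm.drop(columns="BIAS")  # features
y = dtm["BIAS"]               # labels

# Assumed settings: 25% held out, fixed seed, label proportions preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Training and testing indices never overlap...
disjoint = set(X_train.index).isdisjoint(X_test.index)
# ...while data and labels share the same indices within each set.
aligned = X_train.index.equals(y_train.index) and X_test.index.equals(y_test.index)
```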

Five Labels

Training Data and Labels Example

Data

[Table: first rows of the five-label training data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the five-label training labels. Single column: BIAS.]

Testing Data and Labels Example

Data

[Table: first rows of the five-label testing data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the five-label testing labels. Single column: BIAS.]

Three Labels

Training Data and Labels Example

Data

[Table: first rows of the three-label training data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the three-label training labels. Single column: BIAS.]

Testing Data and Labels Example

Data

[Table: first rows of the three-label testing data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the three-label testing labels. Single column: BIAS.]

Strict Three Labels

Training Data and Labels Example

Data

[Table: first rows of the strict three-label training data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the strict three-label training labels. Single column: BIAS.]

Testing Data and Labels Example

Data

[Table: first rows of the strict three-label testing data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the strict three-label testing labels. Single column: BIAS.]

Two Labels

Training Data and Labels Example

Data

[Table: first rows of the two-label training data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the two-label training labels. Single column: BIAS.]

Testing Data and Labels Example

Data

[Table: first rows of the two-label testing data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the two-label testing labels. Single column: BIAS.]

Strict Two Labels

Training Data and Labels Example

Data

[Table: first rows of the strict two-label training data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the strict two-label training labels. Single column: BIAS.]

Testing Data and Labels Example

Data

[Table: first rows of the strict two-label testing data. One count column per vocabulary word (ability, able, academic, ..., york, young, yoy).]

Label

[Table: first rows of the strict two-label testing labels. Single column: BIAS.]

Reddit Authors Data

To choose which aggregated Reddit authors to project political bias (and thus sentiment) onto, the lengths of the cleaned data were analyzed. To keep only authors with enough content to be labeled reliably, the first quartile of news article lengths was used as the cutoff for the subset.

[Table: length statistics. Columns: Statistic, News Article Value, Reddit Author Value, Reddit Subset Author Value.]
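The subsetting rule can be sketched as below. The length values are made up, and treating the first quartile as an inclusive minimum (`>=`) is an assumption about how the cutoff was applied.

```python
import pandas as pd

# Made-up lengths of cleaned news articles and of aggregated author content.
news_lengths = pd.Series([120, 200, 340, 500, 800])
author_lengths = pd.Series({"user_a": 50, "user_b": 210, "user_c": 900})

# First quartile (25th percentile) of the news article lengths.
threshold = news_lengths.quantile(0.25)

# Keep only authors whose aggregated content meets the threshold.
subset = author_lengths[author_lengths >= threshold]
```

With these toy values the threshold is 200, so `user_a` is dropped and the other two authors are kept.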

Balance of Labels

The introduction on this page discusses models “learning”. When a model is trained on a dataset with unbalanced labels, it may become biased toward predicting the labels that are more prevalent in the training data, so it is worth examining the balance of the labels in each dataset. If a model performs poorly, the labels could be brought into better balance either by downsampling (randomly removing records from over-represented labels) or by upsampling (bootstrapping records from under-represented labels). Fortunately, this data isn’t too skewed, though it isn’t perfectly balanced either. The proportions are illustrated below, along with their total counts. Notice how the strict labels differ from the aggregated non-strict labels.
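For reference, a balance check and a bootstrap upsample might look like the following sketch. The label counts are invented, `row_id` is a hypothetical stand-in for the feature columns, and `sklearn.utils.resample` is just one way to bootstrap; no rebalancing is claimed to have been applied to the project data.

```python
import pandas as pd
from sklearn.utils import resample

# Invented, deliberately unbalanced labels.
labels = pd.Series(["Left"] * 6 + ["Right"] * 3 + ["Center"] * 1, name="BIAS")

# Proportion of each label in the dataset.
proportions = labels.value_counts(normalize=True)

# Upsample every label to the size of the largest one by sampling with replacement.
df = pd.DataFrame({"BIAS": labels, "row_id": range(len(labels))})
target = df["BIAS"].value_counts().max()
upsampled = pd.concat([
    resample(group, replace=True, n_samples=target, random_state=0)
    for _, group in df.groupby("BIAS")
])
```

Downsampling would follow the same pattern with `replace=False` and `n_samples` set to the smallest label count. Either way, resampling should be applied to the training set only, so the testing set stays representative of unseen data.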

Five Labels

Three Labels

Strict Three Labels

Two Labels

Strict Two Labels

Applications

This data will be used throughout the remainder of the modeling sections.