Latent Dirichlet Allocation (LDA)

Overview

This section applies topic modeling through Latent Dirichlet Allocation (LDA) to extract themes from the text data. The data consists of the news articles gathered through NewsAPI and the Reddit content aggregated into the Author Aggregation Schema.

Topic Modeling and LDA

Topic modeling is a general term for methods that discover groupings or themes within data. A topic is a mixture of words, and a document is a mixture of topics. Topic modeling aims to uncover these layered similarities across a collection of documents by analyzing the shapes of the documents. Just as quantitative record data forms a shape through the matrix of its vectors over its dimensions, a matrix of vectorized documents forms a shape as well.
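The "mixture" idea can be made concrete with a small sketch. The vocabulary, topics, and probabilities below are hypothetical values chosen for illustration and are not taken from the datasets in this section. The sketch shows how a document's expected word distribution follows from its topic mixture.

```python
# Hypothetical toy vocabulary; the words and probabilities are made up.
vocab = ["loan", "payment", "trump", "vote"]

# Each topic is a probability distribution over the vocabulary.
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # a finance-like topic
    [0.10, 0.00, 0.50, 0.40],  # a politics-like topic
]

# Each document is a probability distribution over the topics.
doc_topic = [0.75, 0.25]

# The document's expected word distribution is the topic-weighted
# average of the topic-word distributions.
doc_word = [
    sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
    for w in range(len(vocab))
]

for word, prob in zip(vocab, doc_word):
    print(f"{word}: {prob:.4f}")
```

Because both layers are probability distributions, the resulting word probabilities also sum to one. A word such as "loan" receives weight from both topics, which is why the same word can show up in more than one topic.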

Latent Dirichlet Allocation (LDA) is a popular unsupervised machine learning method for topic modeling. LDA uses the Dirichlet distribution, the multivariate generalization of the Beta distribution, to perform topic modeling. Note that LDA has a non-uniqueness property: a word can appear in multiple topics. Essentially, LDA uses the Dirichlet distribution to find words that occur together and groups them into a topic. However, it does not name or label the topic. Once topics are found, a specified number of top words can be reported for each, and the topic's label can be deduced from them.

Data Preparation

Below are snippets of the data that will be run through LDA (also located here).

Each dataset is count vectorized over a tenth of its total vocabulary.

NewsAPI Data Before Transformations

source url article source_bias Bias Numeric Bias Specific author description date title search

Reddit Data Before Transformations

url title subreddit author original_author author_upvotes author_dates author_content author_content_aggregated replies_to replies_from

NewsAPI

aack ability able abo abolish abortion abuse academic acceptance access according account accountability accruing act acting action activity actually add added adding addition additional additionally address administrati administration administrator adult advance advantage advice advisor advocate aempt aend aended aention affair affect affected afford affordability affordable afp age agency agenda ago agreed ahead ai aid aimed alaska allow allowed allowing allows ally altadena alternati alys amendment america american analysis analyst angeles announced announcement annual answer anti aorney ap appeal appeared appears application applied apply approach approd appropriated approval apr arage area aren argued argument art article ask asked asset assistance associated track trade tradition traditional training transfer transgender transition transportation treasury treatment tried trillion true trump trust try trying tuesday tuition turn turned twier type typically ukraine ultimately unable uncertainty unclear unconstitutional undergraduate understand understanding union unique unirsity united unless unlikely unprecedented update updated usaid use used using usually valuable value vance various vice view ving virginia visit vote voted voter vought wealth website wednesday week went west white wide widespread wildfire willing win wind wing woke woman won word work worked worker workforce working workplace world worried worry worse worst worth wouldn wrien wrote year yes yield york young zero

Reddit Author Schema

ability able abo absolute absolutely accept access according account accrue accrued accruing act actily action actual actually add added additional address adjustment admin administrati administration adult advantage advice aempt aend affect afford affordable age agency ago agree agreed agreement ahead aid allow allowed alys america american answer anti anymore anyy apparently application applied apply applying approd approval arage area aren argue argument art article ask asked asking asset assume assuming authority auto automatically available avoid awesome ay bachelor backed bad bail bailouts balance bank bankruptcy barely base based basic basically bc beer begin beginning belie benefit best bet biden big time today told ton took topic tord tords total totally track trade training transfer tried trillion true truly trump trust try trying tuition turn type typically undergrad undergraduate understand understanding unfair unfortunately unirsity unless unsubsidized update usa usage use used useless user using usually va value various view ving vote voted voter voting wealth wealthy website week weird welfare went weren white wide wife wild willing win wing wipe wiped wish woman won wonder word work worked worker workforce working world worried worry worse worst worth worthless wouldn wrien write wrong yeah year yep yes young younger yr yup zero

Results

NewsAPI Results

NewsAPI Discussion

LDA discovered 15 words over 3 topics within the NewsAPI documents. The first image shows the words per topic, and the second image shows each word's weighted frequency relative to the vocabulary of the vectorized dataset.

Manual Topic Labeling:

  • First Topic: Financially Related
    • Words like “loan”, “borrower”, “plan”, “payment”, “forginess” (forgiveness), “debt”, “credit”, and “repayment” are not only selected by LDA, but have high frequencies particular to this topic.
  • Second Topic: Government Related
    • Words like “trump”, “federal”, “president”, “gornment” (government), “state”, “american”, and “agency” have high frequencies particular to this topic.
  • Third Topic: School Related
    • Words like “rate”, “college”, “unirsity” (university), “faculty”, “earnings”, “enrollment”, “employment”, “graduation”, “median”, and “ratio” have high frequencies particular to this topic.

Since each document is itself a collection of topics, it is logical that words repeat across topics and that some frequencies are not a perfect indicator of the topic. However, judging by the high-frequency words, the labels above seem to be decent choices.
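One rough way to judge how particular a word is to a topic is to compute the share of the word's total weight that each topic holds. The sketch below uses hypothetical weights and topic names; it is not the calculation behind the images in this section, and `topic_shares` is an illustrative helper.

```python
# Hypothetical topic-word weights (for example, pseudo-counts from a fitted model).
topic_word_weights = {
    "Financial": {"loan": 40.0, "payment": 30.0, "college": 2.0},
    "Government": {"loan": 25.0, "payment": 3.0, "college": 5.0},
    "School": {"loan": 15.0, "payment": 2.0, "college": 33.0},
}


def topic_shares(weights):
    """For each word, return the fraction of its total weight held by each topic."""
    words = {word for topic in weights.values() for word in topic}
    shares = {}
    for word in sorted(words):
        total = sum(topic.get(word, 0.0) for topic in weights.values())
        if total == 0:
            continue  # skip words that carry no weight in any topic
        shares[word] = {
            name: topic.get(word, 0.0) / total for name, topic in weights.items()
        }
    return shares


shares = topic_shares(topic_word_weights)
for word, by_topic in shares.items():
    dominant = max(by_topic, key=by_topic.get)
    print(f"{word}: mostly {dominant} ({by_topic[dominant]:.2f})")
```

A word spread evenly across topics, like "loan" in this toy example, says little about any single topic, while a word concentrated in one topic is a stronger cue for its label.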

Reddit Author Schema Results

Reddit Author Schema Discussion

LDA discovered 15 words over 3 topics within the Reddit Author Schema documents. The first image shows the words per topic, and the second image shows each word's weighted frequency relative to the vocabulary of the vectorized dataset.

Manual Topic Labeling:

  • First Topic: Monthly Payments and Credit Scores Related
    • Words like “payment”, “credit”, “month”, “balance”, “score”, and “account” have high frequencies particular to this topic.
  • Second Topic: Political Divide on Forgiveness Related
    • Words like “forginess” or “forgin” (forgiveness), “biden”, “republican”, “trump”, and “vote” have high frequencies particular to this topic.
  • Third Topic: Debt from Degree or School Related
    • Words like “debt”, “college”, “money”, “school” and “degree” have high frequencies particular to this topic.

Again, certain words with high frequency across all documents, such as “loan”, are expected to repeat across topics. However, some topic labels are clear given the overall climate surrounding student loan forgiveness.

Conclusions

Following the Clustering Association Rule Mining section, it is more evident what news articles and people posting on Reddit discuss, and how they differ. The news articles feature the traditional media topics of Finance, Government, and Education. Reddit, being more free-form and potentially a place for individuals to vent or write out their feelings without repercussion, shows slightly more convoluted but discussion-like topics. Even so, analyzing the word frequencies within the Reddit posts still yields general themes. Although the topics are labeled slightly differently between news articles and Reddit posts, they share similar base themes.

In summary, there appears to be the following three main themes across news articles and the online discourse seen on Reddit:

  • Finance
  • Government
  • Education

This is reasonable, given that the overall topic of Student Loan Forgiveness refers to the government helping to financially cover the education of students.