source | url | article | source_bias | Bias Numeric | Bias Specific | author | description | date | title | search |
---|---|---|---|---|---|---|---|---|---|---|
Latent Dirichlet Allocation (LDA)
Overview
This section specifically focuses on gaining information by topic modeling through Latent Dirichlet Allocation (LDA). The data used in this section will be the news articles gathered through NewsAPI and Reddit content aggregated into the Author Aggregation Schema.
Topic Modeling and LDA
Topic modeling is a general term for methods that discover groupings or themes within data. A topic is a mixture of words, and a document is a mixture of topics. Topic modeling aims to uncover these layered similarities across a collection of documents by analyzing the shapes of the documents. Just as quantitative record data forms shapes when its vectors are examined as a matrix over its dimensions, a matrix of vectorized documents forms shapes as well.
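To make the "mixture" idea concrete, here is a minimal sketch with made-up numbers. The topic names, the four-word vocabulary, and every probability are hypothetical and are not taken from the project data. It shows how a document's word distribution follows from its topic weights and each topic's word distribution.

```python
# Hypothetical toy numbers, chosen only for illustration.
vocab = ["loan", "debt", "federal", "agency"]

# Each topic is a probability distribution over the vocabulary.
topic_word = {
    "finance":    [0.5, 0.4, 0.05, 0.05],
    "government": [0.1, 0.1, 0.4, 0.4],
}

# Each document is a probability distribution over topics.
doc_topic = {"finance": 0.75, "government": 0.25}


def word_probabilities(doc_topic, topic_word):
    """Mix the topic distributions using the document's topic weights."""
    n_words = len(next(iter(topic_word.values())))
    mixed = [0.0] * n_words
    for topic, weight in doc_topic.items():
        for i, p in enumerate(topic_word[topic]):
            mixed[i] += weight * p
    return mixed


doc_word = word_probabilities(doc_topic, topic_word)
for word, p in zip(vocab, doc_word):
    print(f"{word}: {p:.4f}")
```

In this toy document, "loan" ends up most likely because the document leans mostly toward the hypothetical finance topic, yet the government words still receive some probability.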
Latent Dirichlet Allocation (LDA), specifically, is a popular unsupervised machine learning method for topic modeling. LDA relies on the Dirichlet distribution, a multivariate generalization of the Beta distribution, to perform topic modeling. Note that LDA has a non-uniqueness property: the same word can appear in multiple topics. Essentially, LDA uses the Dirichlet distribution to group words that tend to occur together into a topic. However, it does not name or label the topic. Once topics are found, a specified number of top words can be reported for each, and the topic's label can be deduced from them.
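As a rough illustration of the role the Dirichlet distribution plays, the sketch below draws random probability vectors from it using only the Python standard library. It relies on the standard fact that normalizing independent Gamma draws yields a Dirichlet sample. The concentration values, topic count, and vocabulary size are arbitrary assumptions, and this is not the fitting procedure used in this section; it only shows the kind of distributions LDA works with.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_topics, vocab_size = 3, 6  # assumed sizes, for illustration only

# One word distribution per topic.
topic_word = [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(n_topics)]

# One topic distribution for a single document.
doc_topic = sample_dirichlet([0.5] * n_topics, rng)

print("document's topic mixture:", [round(p, 3) for p in doc_topic])
for k, dist in enumerate(topic_word, start=1):
    print(f"topic {k} word distribution:", [round(p, 3) for p in dist])
```

Every sampled vector is non-negative and sums to one, which is why a Dirichlet draw can serve as either a document's mixture of topics or a topic's mixture of words.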
Data Preparation
Below are snippets of the data that will be run through LDA (also located here).
Each dataset is created by count vectorizing over a tenth of its total vocabulary.
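The count vectorizing step can be sketched in plain Python as follows. How the tenth of the vocabulary was chosen is not stated here, so this sketch assumes the most frequent terms are kept, similar in spirit to the `max_features` option of scikit-learn's `CountVectorizer`. The function name `count_vectorize`, the tokenizer, and the sample sentences are hypothetical, and no stop word removal or lemmatization is applied.

```python
import re
from collections import Counter


def count_vectorize(docs, keep_fraction=0.1):
    """Return (vocabulary, document-term count matrix) for a list of strings."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    totals = Counter(tok for toks in tokenized for tok in toks)
    n_keep = max(1, int(len(totals) * keep_fraction))
    # Assumption: keep the most frequent terms, breaking ties alphabetically.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    vocab = sorted(ranked[:n_keep])
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


docs = [
    "The loan forgiveness plan changes loan repayment.",
    "Federal agency reviews the loan plan.",
]
# A larger fraction than 0.1 is used so the tiny example keeps two terms.
vocab, matrix = count_vectorize(docs, keep_fraction=0.25)
print(vocab)   # ['loan', 'plan']
print(matrix)  # [[2, 1], [1, 1]]
```

Each row of the matrix is one document and each column is one retained term, which matches the layout of the vectorized previews in this section.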
NewsAPI Data Before Transformations
Reddit Data Before Transformations
url | title | subreddit | author | original_author | author_upvotes | author_dates | author_content | author_content_aggregated | replies_to | replies_from |
---|---|---|---|---|---|---|---|---|---|---|
NewsAPI
aack | ability | able | abo | abolish | abortion | abuse | academic | acceptance | access | according | account | accountability | accruing | act | acting | action | activity | actually | add | added | adding | addition | additional | additionally | address | administrati | administration | administrator | adult | advance | advantage | advice | advisor | advocate | aempt | aend | aended | aention | affair | affect | affected | afford | affordability | affordable | afp | age | agency | agenda | ago | agreed | ahead | ai | aid | aimed | alaska | allow | allowed | allowing | allows | ally | altadena | alternati | alys | amendment | america | american | analysis | analyst | angeles | announced | announcement | annual | answer | anti | aorney | ap | appeal | appeared | appears | application | applied | apply | approach | approd | appropriated | approval | apr | arage | area | aren | argued | argument | art | article | ask | asked | asset | assistance | associated | track | trade | tradition | traditional | training | transfer | transgender | transition | transportation | treasury | treatment | tried | trillion | true | trump | trust | try | trying | tuesday | tuition | turn | turned | twier | type | typically | ukraine | ultimately | unable | uncertainty | unclear | unconstitutional | undergraduate | understand | understanding | union | unique | unirsity | united | unless | unlikely | unprecedented | update | updated | usaid | use | used | using | usually | valuable | value | vance | various | vice | view | ving | virginia | visit | vote | voted | voter | vought | wealth | website | wednesday | week | went | west | white | wide | widespread | wildfire | willing | win | wind | wing | woke | woman | won | word | work | worked | worker | workforce | working | workplace | world | worried | worry | worse | worst | worth | wouldn | wrien | wrote | year | yes | yield | york | young | zero |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Results
NewsAPI Results
NewsAPI Discussion
LDA reported 15 words over 3 topics within the NewsAPI documents. The first image shows the words per topic, and the second image shows each word's weighted frequency relative to the vocabulary in the vectorized dataset.
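Reporting the top words for each topic amounts to sorting each topic's word weights. A minimal sketch, assuming a topic-by-word weight matrix such as the `components_` attribute of a fitted scikit-learn `LatentDirichletAllocation` model; the weights and the six-word vocabulary below are invented for illustration and are not the values behind the images.

```python
def top_words(topic_weights, vocab, n_words):
    """Return the n_words highest-weighted vocabulary terms for each topic."""
    result = []
    for weights in topic_weights:
        order = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        result.append([vocab[i] for i in order[:n_words]])
    return result


# Hypothetical weights: one row per topic, one column per vocabulary term.
vocab = ["agency", "college", "debt", "federal", "loan", "rate"]
topic_weights = [
    [0.2, 0.1, 9.0, 0.3, 12.0, 1.0],
    [7.0, 0.2, 0.5, 10.0, 1.5, 0.4],
    [0.3, 8.0, 0.6, 0.2, 1.0, 6.0],
]

for k, words in enumerate(top_words(topic_weights, vocab, n_words=2), start=1):
    print(f"Topic {k}: {', '.join(words)}")
# Topic 1: loan, debt
# Topic 2: federal, agency
# Topic 3: college, rate
```

The model only returns these ranked word lists; naming each topic remains a manual step.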
Manual Topic Labeling:
- First Topic: Financially Related
- Words like “loan”, “borrower”, “plan”, “payment”, “forginess” (forgiveness), “debt”, “credit”, and “repayment” are not only selected by LDA, but have high frequencies particular to this topic.
- Second Topic: Government Related
- Words like “trump”, “federal”, “president”, “gornment” (government), “state”, “american”, and “agency” have high frequencies particular to this topic.
- Third Topic: School Related
- Words like “rate”, “college”, “unirsity” (university), “faculty”, “earnings”, “enrollment”, “employment”, “graduation”, “median”, and “ratio” have high frequencies particular to this topic.
Since each document is itself a collection of topics, it is logical that some words repeat across topics and that some frequencies are not a perfect indicator of the topic. However, judging by the high-frequency words, the labels above seem to be decent choices.
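One way to check how particular a word is to a topic is to look at the share of its total weight that each topic holds. The sketch below is an illustrative assumption, not the calculation behind the images above; the function name `topic_shares` and the weights are hypothetical.

```python
def topic_shares(topic_weights, vocab):
    """For each word, return the fraction of its total weight held by each topic."""
    shares = {}
    for j, word in enumerate(vocab):
        column = [weights[j] for weights in topic_weights]
        total = sum(column)
        shares[word] = [w / total if total else 0.0 for w in column]
    return shares


# Hypothetical weights for two topics over a two-word vocabulary.
vocab = ["loan", "state"]
topic_weights = [
    [9.0, 2.0],  # topic 1
    [1.0, 2.0],  # topic 2
]

shares = topic_shares(topic_weights, vocab)
for word, dist in shares.items():
    print(word, [round(s, 2) for s in dist])
# loan [0.9, 0.1]
# state [0.5, 0.5]
```

Here "loan" is concentrated in one topic, while "state" is split evenly and would be a weak basis for a label.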
Conclusions
Following the Clustering Association Rule Mining section, it is more evident how news articles and Reddit posts differ in what they cover and how they cover it. The news articles feature the traditional media topics of Finance, Government, and Education. Reddit, being more free-form and potentially a place for individuals to vent or write out their feelings without repercussion, shows slightly more convoluted but discussion-like topics. Even so, by analyzing the frequencies of the words within the Reddit posts, general themes can be extracted. Although the topics are labeled slightly differently between news articles and Reddit posts, they share similar base themes.
In summary, there appear to be three main themes across news articles and the online discourse seen on Reddit:
- Finance
- Government
- Education
This is completely reasonable, given that the overall topic of Student Loan Forgiveness concerns the government helping to financially cover the education of students.