Clustering

Overview

This section will focus on clustering the text data described in Data Acquisition - NewsAPI and Data Acquisition - Reddit. As a whole, clustering is an unsupervised machine learning technique, which means that it attempts to find patterns, trends, and groupings within unlabeled data. Specifically, clustering is useful for identifying and partitioning the data through similarity measures. As such, there isn’t a firm expectation for what will be found, aside from which data are similar and which are dissimilar. However, given that the topic of student loan forgiveness has a strong divide along political party lines, political bias will be a subsequent point of exploration.

Strategy

Clustering Methodologies

K-Means Clustering and Hierarchical Clustering will be used to analyze the news articles and Reddit posts. Both are unsupervised machine learning methods, meaning they take unlabeled data as input; specifically, unlabeled numerical data.

  • K-Means Clustering: A clustering algorithm with the goal of partitioning a dataset into a specified number of clusters, where each point belongs to the cluster with the nearest mean. Distance and “closeness” are usually evaluated via Euclidean Distance, as will be done in this analysis. Here, each point will be a vectorized version of the text within a news article or Reddit post (or a subsequent aggregation). Vectorizing text data produces high-dimensional datasets, and techniques based on Euclidean Distance usually don’t produce great results in high dimensions. With that in mind, the vectorized data will be reduced via Principal Component Analysis (PCA). PCA is a dimensionality reduction technique which projects the data into an Eigenspace derived from the covariance matrix. The Eigenspace is built from orthonormal vectors known as Eigenvectors (or principal components), which represent the directions of explained variance in the original data. The strength (or magnitude) of explained variance is given by each vector’s associated Eigenvalue. PCA will be used to reduce the high-dimensional text data to just three principal components (a minimal sketch follows this list).
  • Hierarchical Clustering: A clustering algorithm which determines cluster assignments by building a hierarchy, and which can be illustrated with a tree-based visualization known as a dendrogram. This is a useful exploratory tool which shows where different groups of data are partitioned, highlighting similar and dissimilar data at different levels. PCA will not be used in this process. Instead, a distance measure known as Cosine Similarity will be applied to the vectorized text data as a whole, since Cosine Similarity is well suited to high-dimensional data.
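
The sketch below illustrates this K-Means workflow end to end: count-vectorize a handful of documents, project them onto three principal components with PCA, and cluster in that reduced space. It is a minimal sketch with illustrative documents and parameter values, not the project’s exact code.

```python
# Minimal sketch: counts -> 3-component PCA -> K-Means (illustrative data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

docs = [
    "student loan forgiveness plan announced by the administration",
    "borrowers discuss repayment options and interest rates",
    "congress debates the federal budget and tuition costs",
    "graduates share how debt affects buying a home",
    "new income driven repayment application opens",
    "college costs keep rising for students",
]

X = CountVectorizer().fit_transform(docs).toarray()   # high-dimensional count matrix
X_pca = PCA(n_components=3).fit_transform(X)          # project onto 3 principal components

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_pca)                    # Euclidean distance in PCA space
print(labels)
```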

Distance Metrics

Euclidean Distance

Given points \(p\) and \(q\) in \(\mathbb{R}^n\), Euclidean Distance is calculated by:

\[d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}\]

Cosine Similarity

Given vectors \(x\) and \(y\) in \(\mathbb{R}^n\), Cosine Similarity is calculated by:

\[S_C (x, y) = \frac{x \cdot y}{||x|| \cdot ||y||}\]
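
As a quick worked example of both metrics, the snippet below computes them for a pair of toy vectors (values are arbitrary and only for illustration).

```python
# Toy example of both measures on arbitrary 3-dimensional vectors.
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))                      # sqrt(1 + 4 + 0) ~= 2.236
cosine_sim = p @ q / (np.linalg.norm(p) * np.linalg.norm(q))   # 11 / (sqrt(14) * sqrt(13)) ~= 0.815
print(euclidean, cosine_sim)
```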

Data Preparation

There will be 4 different initial vectorized versions of the text data used:

  • Maximum Features: A numerical vectorized version of the entire vocabulary.
  • Tenth of Maximum Features: A numerical vectorized version with a tenth of the maximum features of the entire vocabulary.
  • Iterative Latent Dirichlet Allocation (LDA):
    • Idea: Latent Dirichlet Allocation will be performed iteratively so that each topic contributes unique words. This will begin with the words from a Tenth of Maximum Features wordset, and then \(n\) unique words across \(t\) topics will be collected up to a total of \(m\) desired words. In essence, \(m = t \cdot n\) (see the sketch following this list).
    • 3-Topic Iterative LDA with 150 Words will be used.
    • 5-Topic Iterative LDA with 150 Words will be used.
    • Note that a section on a proper LDA Analysis is located here.
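
The sketch below shows one way the iterative LDA idea could be implemented with scikit-learn’s LatentDirichletAllocation: fit LDA, collect the top words per topic that have not been selected yet, exclude them from the vocabulary, and refit until \(m\) words are gathered. The helper name `iterative_lda_wordset` and its parameters are illustrative assumptions, not the project’s actual implementation.

```python
# Hedged sketch of the iterative-LDA wordset idea (illustrative, not the project's code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def iterative_lda_wordset(docs, n_topics=3, total_words=150,
                          words_per_round=5, max_features=None):
    selected = []
    while len(selected) < total_words:
        vec = CountVectorizer(max_features=max_features,
                              stop_words=selected or None)    # exclude prior picks
        try:
            X = vec.fit_transform(docs)
        except ValueError:                                    # vocabulary exhausted
            break
        vocab = vec.get_feature_names_out()
        lda = LatentDirichletAllocation(n_components=n_topics,
                                        random_state=42).fit(X)
        new_words = []
        for topic in lda.components_:                         # per-topic word weights
            top = topic.argsort()[::-1][:words_per_round]     # highest-weight terms
            new_words.extend(vocab[i] for i in top if vocab[i] not in selected)
        if not new_words:
            break
        selected.extend(dict.fromkeys(new_words))             # keep order, drop duplicates
    return selected[:total_words]
```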

The strategy is to begin with K-Means, find the vectorized version which produces the best results for K-Means, and use that vectorized version to initiate Hierarchical Clustering. Since K-Means will be utilizing PCA, the dataset produced from the PCA projection will be shown for all versions, and then just the best vectorized versions will be shown.

NewsAPI Data

For the NewsAPI data (i.e., news articles), the data will be subset to the news sources for which an overall political bias is known, any labels will be removed, and then the vectorized versions will be created. The analysis will begin with the K-Means version, so the PCA-projected data will be shown.
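
A hedged sketch of building the Maximum and Tenth of Maximum versions with scikit-learn’s CountVectorizer is shown below; the `articles` list stands in for the cleaned article text.

```python
# Building the "Maximum" and "Tenth of Maximum" vectorized versions (sketch).
from sklearn.feature_extraction.text import CountVectorizer

articles = [
    "the administration announced a new loan forgiveness plan",
    "critics argue the policy shifts costs to taxpayers",
    "borrowers describe years of interest and repayment",
]

max_vec = CountVectorizer()
X_max = max_vec.fit_transform(articles)                          # entire vocabulary
n_vocab = len(max_vec.get_feature_names_out())

tenth_vec = CountVectorizer(max_features=max(1, n_vocab // 10))  # most frequent tenth
X_tenth = tenth_vec.fit_transform(articles)
```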

Data Before Transformations

Table columns: source, url, article, source_bias, Bias Numeric, Bias Specific, author, description, date, title, search

NewsAPI PCA Maximum Snippet

Table columns: component_1, component_2, component_3

NewsAPI PCA Tenth Snippet

Table columns: component_1, component_2, component_3

NewsAPI PCA 3-Topic Iterative LDA Snippet

Table columns: component_1, component_2, component_3

NewsAPI PCA 5-Topic Iterative LDA Snippet

Table columns: component_1, component_2, component_3

Reddit Data

For the Reddit data, the different aggregation schemas described in the linked sections in the Overview will be used (a toy sketch follows this list):

  • Reddit Base Schema: Author’s posts within a thread are aggregated.
  • Reddit Author Schema: Author’s posts across all threads are aggregated.
  • Reddit Thread Schema: Posts within a thread are aggregated.
  • Reddit Subreddit Schema: Threads within a Subreddit are aggregated.
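
To make the schemas concrete, here is a toy pandas sketch of the four aggregations; the frame and column names are illustrative, not the project’s actual field names.

```python
# Sketch of the four aggregation schemas using pandas groupby (toy columns).
import pandas as pd

posts = pd.DataFrame({
    "subreddit": ["StudentLoans", "StudentLoans", "politics"],
    "thread":    ["t1", "t1", "t2"],
    "author":    ["a1", "a1", "a2"],
    "content":   ["first post", "second post", "another post"],
})

base      = posts.groupby(["thread", "author"])["content"].agg(" ".join)   # author within thread
author    = posts.groupby("author")["content"].agg(" ".join)               # author across threads
thread    = posts.groupby("thread")["content"].agg(" ".join)               # all posts in a thread
subreddit = posts.groupby("subreddit")["content"].agg(" ".join)            # all threads in a subreddit
```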

For efficiency, just the snippets of the dataset which had the best K-Means clustering result will be illustrated. However, the snippets of all the versions can be found here.

Data Before Transformations

Table columns: url, title, subreddit, author, original_author, author_upvotes, author_dates, author_content, author_content_aggregated, replies_to, replies_from

Reddit PCA Base Schema: 5-Topic Iterative LDA Snippet

Table columns: component_1, component_2, component_3

Reddit PCA Author Schema: 5-Topic Iterative LDA Snippet

Table columns: component_1, component_2, component_3

Reddit PCA Thread Schema: 3-Topic Iterative LDA Snippet

Table columns: component_1, component_2, component_3

Reddit PCA Subreddit Schema: 3-Topic Iterative LDA Snippet

Table columns: component_1, component_2, component_3

K-Means Clustering Results

Principal Component Analysis

Again, for K-Means Clustering, Principal Component Analysis (PCA) was used to reduce the text data to three dimensions. The initial results show that when the maximum features were the entire vocabulary, there was a much lower amount of explained variance across the three principal components.
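
The comparison can be reproduced in spirit with a sketch like the following, where random stand-in matrices play the role of the different vectorized versions (a very wide count matrix versus a 150-word LDA wordset).

```python
# Comparing explained variance of a 3-component PCA across vectorized versions (sketch).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
versions = {
    "maximum": rng.poisson(0.2, size=(100, 2000)),   # full vocabulary: very sparse counts
    "lda_150": rng.poisson(1.0, size=(100, 150)),    # iterative LDA wordset: denser counts
}

for name, X in versions.items():
    ratios = PCA(n_components=3).fit(X).explained_variance_ratio_
    print(f"{name}: {ratios.round(3)} total={ratios.sum():.3f}")
```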

NewsAPI

Reddit Base Schema

Reddit Author Schema

Reddit Thread Schema

Reddit Subreddit Schema

Silhouette Coefficients

Silhouette Coefficients are a decent metric for evaluating how well K-Means clustering has performed. The coefficient measures how similar a point is to the cluster it was assigned to compared to the other clusters. A coefficient ranges from -1 to 1, where values near 1 indicate a well-matched cluster assignment. It’s calculated using average distances between points. Scikit-Learn’s documentation gives a brief summary of the calculation.

  • \(a\): average intra-cluster distance
  • \(b\): average nearest-cluster distance
  • \(s\): silhouette score
  • \(S\): average silhouette score over the entire dataset
  • \(n\): size of sample (number of points in the entire dataset)

\(s = \frac{b-a}{\max(a, b)}\)

\(S = \frac{\sum_{i=1}^n{s_i}}{n}\)
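
In practice, \(S\) can be computed directly with scikit-learn’s silhouette_score, as in the sketch below (the 3-component PCA matrix is simulated here).

```python
# Average silhouette score S for one choice of k (simulated 3-component data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X_pca = rng.normal(size=(200, 3))                  # stand-in for PCA-projected text

labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_pca)
print(silhouette_score(X_pca, labels))             # S lies in [-1, 1]
```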

NewsAPI

Reddit Base Schema

Reddit Author Schema

Reddit Thread Schema

Reddit Subreddit Schema

Choosing Cluster Values

Choosing the best combination of cluster value (k) and dataset took some further visualization. Although the Maximum Features version (i.e., vectorized across the entire vocabulary, projected onto three PCA components, and then clustered with K-Means) produced some of the highest coefficient values, this was misleading. That combination resulted in a dense cluster near the origin of the principal components, with a few distant and isolated outliers, which naturally inflates the silhouette scores. In general, when there were a few distant and isolated outliers, a high silhouette score was misleading. Additionally, some of the lower k values (i.e., k=2 or k=3) didn’t separate the data in a meaningful manner upon visual inspection.

Due to this, a full visual inspection of the NewsAPI clustering results will be performed including how the best clustering and dataset combination was chosen. However, the Reddit clustering results will just illustrate the best clustering and dataset combination.
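
The visual inspection relied on 3D scatter plots of the three principal components colored by cluster assignment; a rough matplotlib sketch (with simulated data) is shown below.

```python
# 3D scatter of PCA components colored by K-Means cluster (simulated data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X_pca = rng.normal(size=(150, 3))
labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X_pca)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels)
ax.set_xlabel("component_1")
ax.set_ylabel("component_2")
ax.set_zlabel("component_3")
plt.show()
```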

NewsAPI

Potential Vectorizing and Cluster Combinations:

  • Maximum Dataset: 3 clusters
  • Tenth Dataset: 2 clusters or 6 clusters
  • 3 Topic Iterative LDA: 4 clusters
  • 5 Topic Iterative LDA: 2 or 4 clusters

Maximum Dataset with 3 Clusters



Tenth Dataset with 2 Clusters



Tenth Dataset with 6 Clusters



3 Topic Iterative LDA with 4 clusters



5 Topic Iterative LDA with 2 clusters



5 Topic Iterative LDA with 4 clusters



Choice

Given the above visualizations, the 3 Topic Iterative LDA with 4 Clusters was chosen. Although there was an outlier which had its own cluster, this version partitioned the overall cluster around the origin well. Along with this, it had a relatively high silhouette score.

A similar process was performed for the Reddit datasets, but for efficiency, just the choices will be presented. For all of the schemas, the maximum features across the entire vocabulary showed the same misleading behavior described above.

Reddit Base Schema

5 Topic Iterative LDA with 3 Clusters



Reddit Author Schema

5 Topic Iterative LDA with 4 Clusters



Reddit Thread Schema

3 Topic Iterative LDA with 4 Clusters



Reddit Subreddit Schema

3 Topic Iterative LDA with 2 Clusters



Final K-Means Discussion

Overall, the Iterative LDA wordsets performed well, while the versions based on larger vocabularies were affected by outliers with higher amounts of text data, even with PCA applied. The question is, what was actually driving these splits? To better understand this, word clouds were created for the best dataset and cluster choice combinations. In general, there were splits between politics, money, and students’ and graduates’ lifestyles.
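
A minimal sketch of building a word cloud per cluster with the wordcloud package is below; the documents and labels are stand-ins, and the package choice is an assumption rather than a statement of what the project used.

```python
# Word cloud per cluster (illustrative documents and labels).
import matplotlib.pyplot as plt
from wordcloud import WordCloud

docs = [
    "student loan forgiveness policy congress president",
    "tuition debt repayment budget interest",
    "college life graduate job home",
    "administration plan income application",
]
labels = [0, 1, 2, 0]

for cluster in sorted(set(labels)):
    text = " ".join(d for d, l in zip(docs, labels) if l == cluster)
    wc = WordCloud(width=600, height=400, background_color="white").generate(text)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {cluster}")
plt.show()
```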

NewsAPI

Reddit Base Schema

Reddit Author Schema

Reddit Thread Schema

Reddit Subreddit Schema

Bringing in Labels

There was some “success” with K-Means in that it did group together different variations of word frequency combinations well. However, how did it perform on clustering political bias with the news sources?

Unfortunately, this led to highly imbalanced clusters which didn’t tend to pick up on the political undertones. Or, the political undertones weren’t present.
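
One way to see the imbalance is to cross-tabulate cluster assignments against the known source bias, roughly as in the sketch below (toy labels; the real ones come from the subset of news sources described earlier).

```python
# Cross-tabulating K-Means clusters against political-bias labels (toy data).
import pandas as pd

df = pd.DataFrame({
    "Bias": ["left", "left", "right", "center", "right", "left"],
    "cluster": [0, 0, 0, 1, 0, 2],
})
print(pd.crosstab(df["Bias"], df["cluster"]))   # highly imbalanced columns
```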

Hierarchical Clustering Results

Following the strategy stated at the beginning of this section, this analysis will start with a vectorized version of the NewsAPI 3-Topic Iterative LDA wordset. This was the dataset used in reporting the best cluster for K-Means.

NewsAPI Data

enrollment going life doge state forbearance process social biden repayment congress earnings bankruptcy rubin example make application graduation buyback class just term department forginess act debt equity right idr median acceptance work tord security income loan sa ll offer office unirsity school pause expert time american home buy apply democrat policy cnge think budget official people program donald memo like grant credit business employment total pslf score president pay action don said available help balance public administration drin ratio option cllenge law plan new white insurance agency house college news

NewsAPI Clustering Results

Using the columns (words) from that dataset, the cleaned and lemmatized news articles were further reduced to only those features. These reduced articles were then saved as text files in a corpus folder for the hierarchical clustering process in R. The corpus can be found here.
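
A sketch of that reduce-and-export step might look like the following; the wordset, article names, and corpus path are illustrative.

```python
# Reduce each cleaned article to the chosen wordset and write it to a corpus folder.
from pathlib import Path

wordset = {"loan", "debt", "president", "college"}   # stand-in for the LDA wordset
articles = {
    "article_001": "the president proposed a loan forgiveness plan",
    "article_002": "college debt keeps growing for many borrowers",
}

corpus_dir = Path("corpus")
corpus_dir.mkdir(exist_ok=True)
for name, text in articles.items():
    reduced = " ".join(w for w in text.split() if w in wordset)
    (corpus_dir / f"{name}.txt").write_text(reduced)
```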

After performing hierarchical clustering, the results from two to six levels of the hierarchical structure were saved for further analysis.
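
The clustering itself was run in R; purely for reference, a rough Python equivalent using SciPy (cosine distances, average linkage, and cuts at two through six clusters) could look like this sketch with a simulated document-term matrix.

```python
# Hierarchical clustering with cosine distance, cut at 2-6 clusters (simulated data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
X = rng.poisson(1.0, size=(50, 100)).astype(float)   # stand-in document-term matrix

dist = pdist(X, metric="cosine")                     # pairwise cosine distances
Z = linkage(dist, method="average")                  # hierarchical tree
cuts = {k: fcluster(Z, t=k, criterion="maxclust") for k in range(2, 7)}
```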

Table columns: File, cluster_2, cluster_3, cluster_4, cluster_5, cluster_6

NewsAPI Clustering Analysis

Given that there were a few hundred files within the corpus, it can be difficult to make sense of the initial dendrogram or results. However, bringing the labels back in helps make better sense of how and why the splits are made. When Political Bias was brought back in, there was an actual divergence among clusters according to where the majority of each label fell; in this case, the majority of articles from sources with a known political bias.

At the hierarchical level of six, there is a potential indicator that the methodology did pick up on political undertones. In fact, each of the political biases can be given a unique pairing of clusters at this level. Compared to K-Means clustering, there is a much better balance overall.

By declaring these cluster pairs as seen above, the dendrogram can additionally be roughly partitioned along the same lines.

Reddit Extension

In an attempt to extend political bias to the Reddit text data, the NewsAPI dataset was further subset to where political bias and cluster aligned, and then the different Reddit schemas were appended for repeated clustering. To be specific, the Reddit data was reduced to the same optimal wordset and then saved into the same corpus folder as the further-subset NewsAPI data.

Content Distributions

To help balance the clustering process, a few changes were made due to the distributions of the document lengths between news articles and reddit posts.

Original Distribution

The following measures were taken for balancing (a toy sketch follows the list):

  • Reddit Base Schema downsized to NewsAPI Number of Articles
  • Reddit Author Schema downsized to NewsAPI Number of Articles
  • NewsAPI Articles aggregated by Source to better match Number of Reddit Threads and Reddit Subreddits.
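
A toy sketch of the downsizing and aggregation steps is shown below; the data frames, column names, and counts are illustrative.

```python
# Downsample Reddit schemas to the article count and aggregate NewsAPI by source (toy data).
import pandas as pd

news = pd.DataFrame({"source": ["cnn", "cnn", "fox"],
                     "article": ["text a", "text b", "text c"]})
reddit_base = pd.DataFrame({"content": [f"post {i}" for i in range(10)]})

n_articles = len(news)
reddit_down = reddit_base.sample(n=min(n_articles, len(reddit_base)), random_state=42)
news_by_source = news.groupby("source")["article"].agg(" ".join)
```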

Following the above measures, the corpuses were created:

Updated Distribution

NewsAPI-Reddit Extension Clustering Analysis

Even after downsampling and aggregating, the distributions of content length of the cleaned, lemmatized, and informative versions were still far apart. Therefore, TF-IDF vectorization was performed. TF-IDF is a normalization technique which helps to mitigate the effect of document-size variation. Essentially, it allocates a heavier weight towards words of higher importance rather than using raw counts as with CountVectorizer.
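
A minimal TF-IDF sketch with scikit-learn is below (illustrative documents); unlike raw counts, long and short documents receive comparable, normalized weights.

```python
# TF-IDF vectorization of a combined NewsAPI + Reddit corpus (illustrative docs).
from sklearn.feature_extraction.text import TfidfVectorizer

combined_docs = [
    "long news article about loan forgiveness policy budget and repayment",
    "short reddit post about debt",
]
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(combined_docs)   # rows are L2-normalized by default
```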

After performing Hierarchical Clustering on the combined schemas, the expected political-bias grouping was not what was revealed. Instead, there was a bifurcation between the news articles and Reddit posts! This is apparent as early as two to three hierarchical levels.

Note: The Reddit Base Schema wasn’t actually run for this.

NewsAPI with Reddit Author Schema Subset

NewsAPI Source Schema with Reddit Thread Schema Subset

NewsAPI Source Schema with Reddit Subreddit Schema Subset

Because dendrograms with this many leaves are difficult to read, the NewsAPI Source Schema with Reddit Subreddit Schema Subset was plotted with labels altered to show their origins.

Conclusions

Two clustering techniques, K-Means Clustering and Hierarchical Clustering, were performed on both the news articles and Reddit posts. These were performed as an exploratory analysis to reveal structures within the content. K-Means clustering revealed topic-like associations within the data. Regardless of source or aggregation schema, the content was mostly split among topics like politics, money, and the quality of life of students and graduates affected by student loans and tuition costs. Hierarchical Clustering, on the other hand, seemed to provide more concrete details. This method provided better-balanced clusters and was even able to indicate political biases. Additionally, there was evidence that this technique could distinguish between news articles and Reddit posts. Even with the text data cleaned, lemmatized, and reduced, there are certain drivers which separate the two types of writing.