Vectorization - Reddit

Introduction

Using the data which has been prepared and merged with potential labels, as seen in Data Acquisition - Reddit, a few more steps can be taken to turn the Reddit posts into numerical representations which can then be used for further analyses and machine learning applications.

Strategy - Further Preprocessing

Recall that the prepared data looks like this:

Columns: url, title, subreddit, author, original_author, author_upvotes, author_dates, author_content, author_content_aggregated, replies_to, replies_from

As explained in the linked section above, the Reddit text data may be best suited for unsupervised learning methods, so the labels may not be applicable. However, potential labels of interest are:

  • Author (Reddit user)
  • URL (Reddit Thread)
  • Subreddit (Reddit Community)
  • Search (Reddit Search Query)

The data itself will be the content posted by the Reddit users, or authors. In the initial data above, the column containing every post, comment, and reply by an author on a single Reddit thread was aggregated. This will be the first aggregation schema, and a few other aggregation schemas will also be considered (a pandas sketch follows the list below).

Aggregation Schemas:

  • Thread - Author (INITIAL FORMAT): corpus where each file is an author’s aggregated text within a unique thread.
  • Subreddit - Author: corpus where each file is an author’s aggregated text within a unique Subreddit.
  • Threads: corpus where each file is the overall aggregated text within unique Threads (authors combined).
  • Subreddits: corpus where each file is the overall aggregated text within unique Subreddits (threads combined).
  • Authors: corpus where each file is the overall aggregated text by authors across every thread.
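
Each schema amounts to grouping the prepared data on different label columns and joining the text. Below is a minimal pandas sketch of that grouping; the file name is hypothetical, and the column names follow the prepared data shown above.

```python
import pandas as pd

# Prepared Reddit data: one row per author per thread (hypothetical file name).
df = pd.read_csv("reddit_prepared.csv")

def build_corpus(frame, group_cols, text_col="author_content_aggregated"):
    """Join all text into one document per unique group."""
    return (
        frame.groupby(group_cols)[text_col]
        .apply(lambda texts: " ".join(texts.astype(str)))
        .reset_index()
    )

thread_author = df                                            # initial format
subreddit_author = build_corpus(df, ["subreddit", "author"])  # Subreddit - Author
threads = build_corpus(df, ["url"])                           # Threads
subreddits = build_corpus(df, ["subreddit"])                  # Subreddits
authors = build_corpus(df, ["author"])                        # Authors
```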

Using the initial format for the first schema, some additional preprocessing will take place for the author’s aggregated posts on a single Reddit thread (a sketch follows the list). This will include:

  • Remove line breaks.
  • Remove punctuation.
  • Remove words containing numbers.
  • Remove standalone numbers.
  • Remove leading and trailing spaces.
  • Lowercase the remaining words.
  • Remove any single-length text remaining.
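
A minimal sketch of these cleaning steps, assuming a plain regex-based approach (the clean_text helper name is illustrative):

```python
import re

def clean_text(text):
    """Apply the cleaning steps listed above to one aggregated document."""
    text = text.replace("\n", " ").replace("\r", " ")  # remove line breaks
    text = re.sub(r"[^\w\s]", " ", text)               # remove punctuation
    text = re.sub(r"\w*\d\w*", " ", text)              # remove words containing numbers
    # standalone numbers are caught by the pattern above as well
    words = [w.lower() for w in text.split()]          # lowercase; split() trims extra spaces
    words = [w for w in words if len(w) > 1]           # drop single-character tokens
    return " ".join(words)
```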

As prescribed in the section linked in the Introduction above, the author AutoModerator will be removed before proceeding.

Strategy - Vectorizing

Following cues from the Data Vectorizing - NewsAPI page, this initial pass will lemmatize the data, use CountVectorizer() from the scikit-learn library, and compare the different aggregation schemas. The TfidfVectorizer() could be invaluable in later use cases, as the length of the text content varies wildly across the Reddit aggregation schemas.

Lemmatizing is still a useful dimensionality reduction technique. However, when it comes to Reddit posts, lemmatizing and especially stemming should be used cautiously. News articles mostly use proper and accepted language, structure, and terms, whereas social media and other online community discussion boards, like Reddit, tend to follow a more informal format. Social media may feature slang, text speak, misspellings, and incomplete sentences. Incomplete sentences could be due to single-word responses or reactions, or simply due to improperly structured sentences. Nuances in Reddit-style posts could be lost with cleaning and simplification methods.

Additionally, stopwords will be removed and a maximum of 200 features will be kept; a sketch of this setup follows.
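
This sketch reuses the clean_text helper and the thread_author frame from the earlier sketches and assumes NLTK's WordNetLemmatizer for lemmatization; with its default noun part of speech, that lemmatizer would also account for tokens such as wa, ha, and doe appearing in the labeled samples below.

```python
import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def lemmatize(doc):
    """Lemmatize each token of an already-cleaned document."""
    return " ".join(lemmatizer.lemmatize(w) for w in doc.split())

# Clean and lemmatize the Thread-Author corpus.
docs = [lemmatize(clean_text(d)) for d in thread_author["author_content_aggregated"]]

# Remove English stopwords and keep the 200 most frequent terms.
vectorizer = CountVectorizer(stop_words="english", max_features=200)
counts = vectorizer.fit_transform(docs)

# Document-term matrix; the label columns (author, url, subreddit, search)
# can be joined back on for the labeled samples.
dtm = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
```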

Vectorizing - Thread-Author Schema

Labeled Sample

Label columns: author, url, subreddit, search
Feature columns (vectorized vocabulary): able account actually administration ago american aren away bad balance bank bankruptcy based believe benefit best better biden big borrower business buy car card care career case change class college come company congress cost country court credit day debt decision degree democrat did didn different doe doesn doing dollar don education end exactly federal feel financial forgive forgiven forgiveness free fuck future getting going good got government graduate great ha happen hard having help high higher home hope house idea idr income instead isn issue job just kid know law le left let life like likely loan lol long look lot low lower major make making maybe mean million minimum mohela money month monthly mortgage need new number old paid parent party pay paying payment people person plan point poor ppp predatory president pretty price principal private probably problem program pslf public rate real really reason repayment republican rich right said save say saying school score service shit start started state stop student sure taking tax term thing think thought time took total trump try trying tuition understand university use used ve vote wa want way went won work worked working wouldn yeah year yes zero

Wordcloud

Most Frequent Words
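
The wordcloud and most-frequent-words chart for each schema could be produced from the count matrix along the following lines; this is a sketch assuming the wordcloud and matplotlib libraries and the vectorizer and counts objects from the sketch above.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Total count per term across the Thread-Author corpus.
term_totals = dict(zip(vectorizer.get_feature_names_out(),
                       counts.sum(axis=0).A1))

# Wordcloud sized by overall term frequency.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(term_totals)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

# Bar chart of the most frequent words.
top = sorted(term_totals.items(), key=lambda kv: kv[1], reverse=True)[:20]
words, freqs = zip(*top)
plt.bar(words, freqs)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```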

Vectorizing - Subreddit-Author Schema

Labeled Sample

Label columns: author, subreddit
Feature columns (vectorized vocabulary): able account actually administration ago american aren away bad balance bank bankruptcy based believe benefit best better biden big borrower business buy car card care career case change class college come company congress cost country court credit day debt decision degree democrat did didn different doe doesn doing dollar don education end exactly federal feel financial forgive forgiven forgiveness free fuck future getting going good got government graduate great ha happen hard having help high higher home hope house idea idr income instead isn issue job just kid know law le left let life like likely literally live loan lol long look lot low lower major make making maybe mean million minimum mohela money month monthly mortgage need new number old paid parent party pay paying payment people person plan point poor ppp predatory president pretty price principal private probably problem program pslf public rate real really reason repayment republican rich right said save say saying school score service shit start started state stop student sure taking tax term thing think thought time took total trump try trying tuition understand university use used ve vote wa want way went won work worked working wouldn yeah year yes zero

Wordcloud

Most Frequent Words

Vectorizing - Threads Schema

Labeled Sample

Label columns: url
Feature columns (vectorized vocabulary): able account actually administration ago american aren away bad balance bank bankruptcy based believe benefit best better biden big borrower business buy car card care career case change class college come company congress cost country court credit day debt decision degree democrat did didn different doe doesn doing dollar don education end exactly federal feel financial forgive forgiven forgiveness free fuck future getting going good got government graduate great ha happen hard having help high higher home hope house idea idr income instead isn issue job just kid know law le left let life like likely literally live living loan lol long look lot low lower major make making maybe mean million minimum mohela money month monthly mortgage need new number old paid parent party pay paying payment people person plan point poor ppp predatory president pretty price principal private probably problem program pslf public rate real really reason repayment republican rich right said save say saying school score service shit start started state stop student sure taking tax term thing think thought time took total trump try trying tuition understand university use used ve vote wa want way went won work worked working wouldn yeah year yes zero

Wordcloud

Most Frequent Words

Vectorizing - Subreddits Schema

Labeled Sample

Label columns: subreddit
Feature columns (vectorized vocabulary): able account actually administration ago american aren away bad balance bank bankruptcy based believe benefit best better biden big borrower business buy car card care career case change class college come company congress cost country court credit day debt decision degree democrat did didn different doe doesn doing dollar don education end exactly federal feel financial forgive forgiven forgiveness free fuck future getting going good got government graduate great ha happen hard having help high higher home hope house idea idr income instead isn issue job just kid know law le left let life like likely literally live living loan lol long look lot low lower major make making maybe mean million minimum mohela money month monthly mortgage need new number old paid parent party pay paying payment people person plan point poor ppp predatory president pretty price principal private probably problem program pslf public rate real really reason repayment republican rich right said save say saying school score service shit start started state stop student sure taking tax term thing think thought time took total trump try trying tuition understand university use used ve vote wa want way went won work worked working wouldn yeah year yes zero

Wordcloud

Most Frequent Words

Vectorizing - Authors Schema

Labeled Sample

Label columns: author
Feature columns (vectorized vocabulary): able account actually administration ago american aren away bad balance bank bankruptcy based believe benefit best better biden big borrower business buy car card care career case change class college come company congress cost country court credit day debt decision degree democrat did didn different doe doesn doing dollar don education end exactly federal feel financial forgive forgiven forgiveness free fuck future getting going good got government graduate great ha happen hard having help high higher home hope house idea idr income instead isn issue job just kid know law le left let life like likely literally live living loan lol long look lot low lower major make making maybe mean million minimum mohela money month monthly mortgage need new number old paid parent party pay paying payment people person plan point poor ppp predatory president pretty price principal private probably problem program pslf public rate real really reason repayment republican rich right said save say saying school score service shit start started state stop student sure taking tax term thing think thought time took total trump try trying tuition understand university use used ve vote wa want way went won work worked working wouldn yeah year yes zero

Wordcloud

Most Frequent Words

Vectorizing - Additional Parameters

Labeled Sample

Label columns: author, url, subreddit, search
Feature columns (vectorized vocabulary): debt don just like loan make pay people student wa year

Wordcloud

Most Frequent Words

Summary

With the initial pass above, there are only very minor differences between the aggregation schemas and their respective corpora. Stronger differences would likely be found by further subsetting on the authors, threads, and Subreddits. In future analyses, unsupervised learning methods, and potentially supervised learning methods (when paired with NewsAPI), could help indicate these subsets. Furthermore, upvote counts and content length within the data could serve as subsetting parameters as well.

However, removing the maximum-features parameter and instead restricting the vocabulary to a minimum document frequency of \(0.1\) and a maximum document frequency of \(0.9\) reveals a much different wordcloud. The frequency range is rather large, but only a limited number of words remain.
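
A sketch of that variation, assuming the same docs as in the vectorizing sketch above: the max_features cap is dropped and document frequency is bounded instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# No max_features cap; keep only terms appearing in at least 10%
# and at most 90% of the documents.
vectorizer_df = CountVectorizer(stop_words="english", min_df=0.1, max_df=0.9)
counts_df = vectorizer_df.fit_transform(docs)

print(len(vectorizer_df.get_feature_names_out()))  # far fewer surviving terms
```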