Curricula-Topic-Modeling

Introduction to Topic Modeling and LDA

Terminology

Before beginning, there are three important preliminary definitions one will need for any text analysis:

  • Document: a distinct text object that one wishes to analyze. This could be a paper, a paragraph, etc.
  • Term: an individual word in a given document.
  • Corpus: the set of all documents.

Topic Models

A topic model is a statistical model that attempts to cluster the words found in a document into various “topics”. The hope is that these topics capture some underlying subject contained within the text. It is important to note that topic models are fuzzy in the sense that documents are assumed to be composed of multiple topics, with varying degrees of membership in each.

Latent Dirichlet Allocation

Perhaps the most famous topic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). It is a three-level hierarchical Bayesian model that defines a collection of documents as a random mixture over some underlying topic set where each topic is itself a distribution over our vocabulary.

LDA works by assuming a data generating process (DGP) for our documents and then employs inference to discover the most likely parameters (Grimmer et al., 2022).

Blei, Ng, and Jordan define the following DGP (2003):

  1. Choose \(N \sim\) Poisson(\(\xi\))
  2. Choose \(\theta \sim\) Dir(\(\alpha\))
  3. For each of the \(N\) words \(w_n\):
    1. Choose a topic \(z_n \sim\) Multinomial(\(\theta\))
    2. Choose a word \(w_n\) from \(p(w_n|z_n,\beta)\), a multinomial probability conditioned on the topic \(z_n\)
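
To make this DGP concrete, here is a minimal simulation sketch in base R; the vocabulary, number of topics, \(\beta\), \(\alpha\), and \(\xi\) values below are made up purely for illustration and are not taken from the analysis that follows.

# Illustrative simulation of the LDA data generating process (toy values)
set.seed(1)
vocab <- c("algorithm", "matrix", "probability", "regression", "database", "query")
K     <- 2
beta  <- rbind(c(0.30, 0.30, 0.10, 0.10, 0.10, 0.10),  # topic 1: distribution over vocab
               c(0.05, 0.05, 0.30, 0.30, 0.15, 0.15))  # topic 2: distribution over vocab
alpha <- rep(0.5, K)
xi    <- 40

N <- rpois(1, xi)                                 # 1. document length ~ Poisson(xi)
g <- rgamma(K, shape = alpha)                     # 2. theta ~ Dir(alpha), drawn via
theta <- g / sum(g)                               #    normalized Gamma variates
z <- sample(seq_len(K), N, replace = TRUE, prob = theta)                      # 3a. topic per word
w <- vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))  # 3b. word given topic
table(w)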

For the following analysis we estimate the parameters using the collapsed Gibbs sampling method, though it is important to note there are others that yield varied results. To further complicate things, when fitting LDA in R one must predefine the number of topics for the model. Finding a good estimate for the number of topics is paramount, and several methods are explored below.

Once a model is fitted, it is common to extract two matrices, \(\Phi\) and \(\Theta\), also commonly referred to as \(\beta\) and \(\gamma\) respectively. Both are estimated from the fitted model: \(\Phi\) holds the per-topic probability mass function over the terms, and \(\Theta\) holds the per-document probability mass function over the topics.
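
In R these matrices can be pulled from a fitted topicmodels object. A minimal sketch, assuming a fitted LDA model named lda.out (as produced in the sections below):

# Assuming `lda.out` is a fitted topicmodels::LDA object
phi   <- posterior(lda.out)$terms    # topics x terms: per-topic term probabilities
theta <- posterior(lda.out)$topics   # documents x topics: per-document topic probabilities

# The tidytext package exposes the same quantities in long (tidy) format
library(tidytext)
beta_tidy  <- tidy(lda.out, matrix = "beta")   # per-topic-per-word probabilities
gamma_tidy <- tidy(lda.out, matrix = "gamma")  # per-document-per-topic probabilities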

Feature Extraction and Preprocessing

Document-Term Matrices

There are many methods of feature extraction from text; we opt for the bag of words model. To construct this model, simply define a common vocabulary shared between documents and count how many times each word appears in each document. Commonly these counts are stored as a Document-Term Matrix (DTM). For example, given the documents:

Corpus

Document 1: “I am happy”
Document 2: “I am sad”

the corresponding DTM would be

        I   am   happy   sad
Doc 1   1    1       1     0
Doc 2   1    1       0     1
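
As a quick check, here is a minimal sketch that reproduces this toy DTM with the tm package (the object names are purely illustrative):

library(tm)

toy_docs   <- c("I am happy", "I am sad")
toy_corpus <- VCorpus(VectorSource(toy_docs))
# keep one- and two-letter words such as "I" and "am" (dropped by default);
# note that terms are lowercased, so "I" appears as "i"
toy_dtm <- DocumentTermMatrix(toy_corpus, control = list(wordLengths = c(1, Inf)))
inspect(toy_dtm)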

In R we can use the tm package to create our Corpus and DTM objects. For these examples, scraped Data Science course information from various universities will be preprocessed.
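
The chunks that follow assume a setup chunk (not shown here) has already loaded the packages used throughout this document; a likely set is:

# Packages assumed to be loaded in a setup chunk (not shown)
library(tm)                  # Corpus, DocumentTermMatrix, tm_map
library(quanteda)            # corpus(), tokens(), tokens_select()
library(quanteda.textstats)  # textstat_collocations() (part of quanteda in older versions)
library(topicmodels)         # LDA(), perplexity(), posterior()
library(ldatuning)           # FindTopicsNumber(), FindTopicsNumber_plot()
library(LDAvis)              # createJSON(), serVis()
library(tidytext)            # tidy() for the beta/gamma matrices
library(slam)                # as.simple_triplet_matrix()
library(stringr)             # str_replace(), str_detect()
library(stringi)             # stri_count()
library(dplyr)               # %>%, group_by(), top_n(), mutate()
library(ggplot2)             # plotting
library(gridExtra)           # grid.arrange()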

load("../data/RObjects/degree_corpus_by_course.RData")
colnames(degree_corpus_by_course) <- c("doc_id", "text")
degree_corpus_by_course <- degree_corpus_by_course[!duplicated(degree_corpus_by_course$doc_id),]
degree_corpus_by_course <- degree_corpus_by_course[degree_corpus_by_course$text != "",]
head(degree_corpus_by_course)
                      doc_id
1   Computer Science 2210A/B
2   Computer Science 2211A/B
3 Computer Science 2212A/B/Y
4   Computer Science 2214A/B
5   Computer Science 3319A/B
6   Computer Science 3340A/B
                                                                                                                                                                                                                                                                                                                                                                                                   text
1                                                                                                                                                                                                                       Lists, stacks, queues, priority queues, trees, graphs, and their associated algorithms; file structures; sorting, searching, and hashing techniques; time and space complexity.
2       An introduction to software tools and systems programming. Topics include: understanding how programs execute (compilation, linking and loading); an introduction to a complex operating system (UNIX); scripting languages; the C programming language; system calls; memory management; libraries; multi-component program organization and builds; version control; debuggers and profilers.
3                                                                                                                   A team project course that provides practical experience in the software engineering field. Introduction to the structure and unique characteristics of large software systems, and concepts and techniques in the design, management and implementation of large software systems.
4                           This course presents an introduction to the mathematical foundations of computer science, with an emphasis on mathematical reasoning, combinatorial analysis, discrete structures, applications and modeling, and algorithmic thinking. Topics include sets, functions, relations, algorithms, number theory, matrices, mathematical reasoning, counting, graphs and trees.
5 A study of relational databases. Theoretical concepts will be covered, including relational algebra and relational calculus. Commercially available database systems will be used to demonstrate concepts such as Structured-Query-Language (SQL), writing code to connect and query a database, query optimization, Atomicity-Consistency-Isolation-Durability (ACID) concepts, and database design.
6                                                                                                                                                                                                                          Upper and lower time and space bounds; levels of intractability; graph algorithms; greedy algorithms; dynamic algorithms; exhaustive search techniques; parallel algorithms.
ds <- DataframeSource(degree_corpus_by_course)
corpus <- Corpus(ds)
inspect(corpus[1])
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

                                                                                                                                                       Computer Science 2210A/B 
Lists, stacks, queues, priority queues, trees, graphs, and their associated algorithms; file structures; sorting, searching, and hashing techniques; time and space complexity. 

As shown the first element of the corpus is the course Computer Science 2210A/B and it holds a course description.

DTM Construction

Text Preprocessing

Raw text data does not lend itself to easy analysis. NLP possesses several best practices for cleaning text:

  • Punctuation/Number Removal: deleting any non-alphabetic characters.
  • Lowercasing: sending all words to lowercase.
  • Stopword Removal: stopwords are words used often in sentences that give little to no information, e.g., articles such as the, a, etc.
  • Tokenizing: dividing a document into a set of individual words; note we also opt to combine synonymous tokens in our analysis.
Note

TF-IDF and stemming are also very popular techniques. We found that for our analysis both led to cumbersome topic interpretation, and we therefore omit them, though they should be attempted in any complete analysis.

# Remove punctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
# Send everything to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stopwords
corpus <- tm_map(corpus,
                 content_transformer(removeWords),
                 stopwords("english"))
# Remove numbers
corpus <- tm_map(corpus, content_transformer(removeNumbers))

# Remove custom stop words
remove_words <-
  c("include","methods","topics","design","functions","emphasis","language","introduction","languages","performance","experience","course","techniques","variables","number","department","tools","fundamental","also","major","modern","issues","used","methods","using","case","architecture","covered","credit","basic","cosc","granted","use","solutions","will","students","fall","spring","important","one","considered","stacks","offers","types","may","held","former","honours","faculty","related","enter","review","enrolment","exercises","summer","need","offered","social","digital","terms","real","concepts","understanding","can","including","programs","program","recommended","examples","introduced","large","search","relations","key","etc","reasoning","intended","fws","general","restricted","version","two","comp","well","rich","intended","required","internet","recent","phys","sciences","covers","year","selected","renewal","explored","csch","principles","practice","development","studies","security","provides","advanced","instruction","discussed","processes","death","lower","high","crncr","taken","efficient","includes","core","retrieval","class","within","present","option","interested","together","session","week","new","order","tables","small","suitable","wide","without","good","introduces","assignments","current","thinking","completed","basics","essential","gain","effective","file","three","many","classes","extensive","tasks","work","meaningful","first","creating","elementary","image"
  )
corpus <- tm_map(corpus, function(x) {
  removeWords(x, remove_words)
})

# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)

inspect(corpus[1])
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

                                                                                                  Computer Science 2210A/B 
lists queues priority queues trees graphs associated algorithms structures sorting searching hashing time space complexity 

Collocations

Collocations are sequences of terms that appear together adjacently more often than chance would suggest. Rather than modeling only singular terms (unigrams), we can incorporate the most likely paired terms as their own tokens (bigrams). Lau et al. (2013) showed that including bigrams can significantly improve topic intelligibility and leads to more parsimonious models.

# Inspecting BiGrams table
corpusq <- quanteda::corpus(corpus)

corpuz_tokzd <- quanteda::tokens(corpusq)

BiGrams <- corpuz_tokzd %>% 
       quanteda::tokens_select(pattern = "^[A-Z]", 
                               valuetype = "regex",
                               case_insensitive = TRUE, 
                               padding = TRUE) %>% 
       textstat_collocations(min_count = 8, tolower = FALSE,size=2)

head(BiGrams)
             collocation count count_nested length   lambda        z
1              math math    31            0      2 7.510804 13.64619
2           data science    27            0      2 3.062254 11.75981
3            time series     9            0      2 5.838259 11.02856
4 differential equations    10            0      2 6.696623 10.83258
5           graph theory     9            0      2 5.202422 10.51883
6     hypothesis testing    10            0      2 8.155490 10.12133

Quanteda has functions to produce the most common collocations with statistical significance. Some of the terms are nonsensical, like math math, and are simply a product of our text preprocessing, most likely stopword removal. However, we can examine the list and choose bigrams that we think ought to be included. Notable significant bigrams can be visualized:

# Plotting BiGrams
plot1 <- ggplot(BiGrams, aes(fill=count, y=lambda, x=reorder(collocation, -lambda))) + 
    geom_bar(position="dodge", stat="identity") + coord_flip()+labs(x= "Bigrams", y = "lambda")

plot2 <- ggplot(BiGrams, aes(fill=count, y=z, x=reorder(collocation, -z))) + 
    geom_bar(position="dodge", stat="identity") + coord_flip()+labs(x= "Bigrams", y = "z")
  
grid.arrange(plot1, plot2, ncol=2)

We will need a couple of custom functions for combining terms and making bigrams in our DTM.

combineCols <- function(words, dtm, newColName){
  # Combine the counts of synonymous terms into a single column
  # (uses slam::as.simple_triplet_matrix to rebuild the sparse DTM)

  # Get indices of the columns to merge
  idx <- which(colnames(dtm) %in% words)

  # Convert sparse matrix dtm to a normal matrix
  dtm <- as.matrix(dtm)

  # Combine columns: sum counts into the first matched column
  dtm[, idx[1]] <- rowSums(dtm[, idx])

  # Rename the combined term
  colnames(dtm)[idx[1]] <- newColName

  # Drop the remaining merged columns
  dtm <- dtm[, -idx[2:length(idx)]]

  # Convert back to a DocumentTermMatrix
  dtm <- as.DocumentTermMatrix(as.simple_triplet_matrix(dtm), weighting = weightTf)

  return(dtm)
}

createBigram <- function(df, term1, term2, replacement){
  # Replace the first occurrence of "term1 term2" in each document's text
  # with a single bigram token (uses stringr::str_replace)
  regex <- paste0(term1, "\\s+", term2)
  for (i in 1:nrow(df)) {
    row_text <- df[i, 2]
    df[i, 2] <- str_replace(row_text, regex, replacement)
  }
  return(df)
}

Now we can add bigrams to our corpus.

# Recreate data frame
cor <- as.list(corpus)
df <- cbind(data.frame("doc_id" = names(cor)),data.frame("text"=t(as.data.frame(cor))))

df <- createBigram(df,"data","science","data_science")
df <- createBigram(df,"data","structures","data_structures")
df <- createBigram(df,"data","analysis","data_analysis")
df <- createBigram(df,"data","mining","data_mining")
df <- createBigram(df,"machine","learning","machine_learning")
df <- createBigram(df,"computer","science","computer_science")
df <- createBigram(df,"time","series","time_series")
df <- createBigram(df,"database","systems","database_systems")
df <- createBigram(df,"data","visualization","data_visualization")
df <- createBigram(df,"neural","networks","neural_networks")
df <- createBigram(df,"graph","theory","graph_theory")
df <- createBigram(df,"differential","equations","differential_equations")
df <- createBigram(df,"big","data","big_data")
df <- createBigram(df,"hypothesis","testing","hypothesis_testing")
df <- createBigram(df,"linear","regression","linear_regression")
df <- createBigram(df,"regression","models","regression_models")
df <- createBigram(df,"data","sets","data_sets")
df <- createBigram(df,"text","data","text_data")
ds2 <- DataframeSource(df)
corpus2 <- Corpus(ds2)

# Remove custom stop words
remove_words <-
  c("data","science")
corpus2 <- tm_map(corpus2, function(x) {
  removeWords(x, remove_words)
})

dtm <- DocumentTermMatrix(corpus2)

We also manually combine synonymous terms.

dtm <-
  combineCols(c("math", "mathematics", "mathematically", "mathematical"),
              dtm,"mathematics")
dtm <- combineCols(c("stat", "statistics","statistical"), dtm,"statistics")
dtm <- combineCols(c("model","models","modeling"), dtm,"models")
dtm <- combineCols(c("database","databases"), dtm,"database")

Our DTM now includes Bigrams:

inspect(dtm[1:5,50:55])
<<DocumentTermMatrix (documents: 5, terms: 6)>>
Non-/sparse entries: 6/24
Sparsity           : 80%
Maximal term length: 16
Weighting          : term frequency (tf)
Sample             :
                            Terms
Docs                         computer_science counting discrete foundations
  Computer Science 2210A/B                  0        0        0           0
  Computer Science 2211A/B                  0        0        0           0
  Computer Science 2212A/B/Y                0        0        0           0
  Computer Science 2214A/B                  1        1        1           1
  Computer Science 3319A/B                  0        0        0           0
                            Terms
Docs                         mathematics matrices
  Computer Science 2210A/B             0        0
  Computer Science 2211A/B             0        0
  Computer Science 2212A/B/Y           0        0
  Computer Science 2214A/B             3        1
  Computer Science 3319A/B             0        0

Topic Number Discovery

LDA, like many unsupervised learning methods, is extremely dependent on its parameters. One such parameter that requires meticulous tuning is the number of topics. In the literature there exist many metrics for discovering the optimal number of topics with several being readily available in R.

Perplexity

Perplexity is a measure of a model’s ability to generalize to unseen data. It is computed from the log-likelihood of a held-out set of documents, which Pleplé gives as:

\[\ell(w)=\text{log}\,p(w|\Phi,\alpha)=\sum_d \text{log}\,p(w_d|\Phi,\alpha)\]

  • \(w_d\): the unseen document
  • \(\Phi\): the matrix for the topics
  • \(\alpha\): the hyperparameter for the topic distributions

Perplexity itself is \(\text{exp}(-\ell(w)/n)\), where \(n\) is the total token count of the held-out documents; the lower the perplexity score, the better. We can use 3-fold cross-validation and plot our perplexities for various topic numbers.

set.seed(87460945)

# Calculate folds: shuffle all document indices and split into 3 folds
idxs <- sample(seq_len(nrow(dtm)))
folds <- split(idxs, rep(1:3, length.out = length(idxs)))

# Define number of topics
topics <- seq(2, 50, 1)

# Create data frame for storing results
results <- data.frame()

# Perform cross validation
for (k in topics) {
  scores <- c()
  for (i in 1:3) {
    test_idx <- folds[[i]]
    train_idx <- setdiff(unlist(folds, use.names = FALSE), test_idx)

    test <- dtm[test_idx, ]
    train <- dtm[train_idx, ]

    LDA.out <- LDA(train, k, method = "Gibbs")
    p <- perplexity(LDA.out, newdata = test)
    scores <- c(scores, p)
  }
  temp <- data.frame("K" = k, "Perplexity" = mean(scores))
  results <- rbind(results, temp)
}

# Plot Perplexity vs. K
ggplot(results, aes(x=K, y=Perplexity)) + geom_line() + ggtitle("Perplexity vs. Number of Topics K")+
  theme(plot.title = element_text(hjust = 0.5))

# Retrieving K for minimum Perplexity
results[which.min(results$Perplexity),"K"]
[1] 14

It looks like, in terms of perplexity, 14 is our optimal number of topics, though there is a sizable initial dip at \(K=8\) that is worth exploring.

ldatuning

The ldatuning package provides us with several metrics to evaluate the number of topics, all of which rely on finding extrema. We seek to minimize Arun2010 and CaoJuan2009 and maximize Deveaud2014 and Griffiths2004.

# Find number of topics
result <- FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 40, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
)
# Plot metrics
FindTopicsNumber_plot(result)

It seems no topic numbers over 25 are viable. Let us zoom in on the 2-25 range.

# Find number of topics
result <- FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 25, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
)
# Plot metrics
FindTopicsNumber_plot(result)

For our minima we see CaoJuan2009 suggests 3 and 9 while Arun2010 promotes somewhere in the 20-25 range. Examining our maxima we find Deveaud2014 also suggests around 3 and Griffiths2004 recommends 15 or 18. This provides us with a good range of topic numbers to explore, \(k = 3,5,8,9,12,14,15,20,25\).

Topic Visualization

Rather than individually fit a model for each topic number and print the most likely generated terms, the R package LDAvis creates fantastic interactive visualizations. LDAvis tries to answer 3 questions (Sievert & Shirley, 2014):

  • What is the meaning of each topic?
  • How prevalent is each topic?
  • How do the topics relate to each other?

To interpret these visualizations we first need to define several quantities:

Relevance - Sievert and Shirley proposed relevance as a metric to rank terms based on their usefulness for interpreting topics (2014). They define it as follows:

\[r(w,k|\lambda)=\lambda \text{log}(\phi_{kw})+(1-\lambda)\text{log}\left( \frac{\phi_{kw}}{p_w}\right)\]

  • \(\phi_{kw}\): the probability of term \(w\) for topic \(k\).
  • \(p_w\): the marginal probability of term \(w\) in the corpus.
  • \(\lambda\): a weight parameter such that \(0 \leq \lambda \leq 1\).
  • \(\text{log}\left( \frac{\phi_{kw}}{p_w}\right)\): the lift of a term.

A \(\lambda\) of 1 orders terms based on their topic-specific probability whereas a \(\lambda\) of 0 orders terms based on their lift. Sievert and Shirley suggest \(\lambda = 0.6\). The higher a term’s relevance for a topic, the more useful it is for interpreting that topic.

Lift - As seen above, lift is the ratio of a term’s probability in a given topic to its marginal probability. The metric downweights terms that are frequent throughout the entire corpus (Sievert & Shirley 2014).

Saliency - Introduced by Chuang, Manning, and Heer, it is defined as follows (2012): \[\begin{align} saliency(w) &= P(w) \times distinctiveness(w) \\ distinctiveness(w) &= \sum_T P(T|w)\text{log}\frac{P(T|w)}{P(T)} \end{align}\]

  • \(P(T|w)\): the probability a word \(w\) was generated by a topic \(T\)
  • \(P(T)\): the marginal probability of the topic, i.e., the probability a word was produced by topic \(T\)

The general idea is that distinctiveness is the Kullback-Leibler divergence between \(P(T|w)\) and \(P(T)\) and tells us how useful a term \(w\) is for determining its generating topic. Multiplying this by \(P(w)\) yields saliency, a metric that ranks terms by their usefulness in identifying topics.
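
To make these three quantities concrete, here is a minimal sketch in R that computes them directly from a topic-term matrix; the object names phi (a topics by terms probability matrix), p_w (marginal term probabilities), and p_t (marginal topic probabilities) are illustrative, not objects provided by LDAvis itself.

# Toy implementations of lift, relevance, and saliency (illustrative names)
lift <- function(phi_kw, p_w) phi_kw / p_w

relevance <- function(phi_kw, p_w, lambda = 0.6) {
  lambda * log(phi_kw) + (1 - lambda) * log(phi_kw / p_w)
}

saliency <- function(phi, p_w, p_t) {
  # P(T|w) by Bayes' rule: phi[k, w] * P(T = k) / P(w)
  p_t_given_w <- sweep(phi * p_t, 2, p_w, "/")
  # Kullback-Leibler divergence between P(T|w) and P(T), summed over topics
  distinctiveness <- colSums(p_t_given_w * log(p_t_given_w / p_t))
  p_w * distinctiveness
}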

With these we can understand our visuals:

  • Topic circles: circles are drawn equal to the number of topics. Their areas are proportional to the estimated number of words produced by that topic. The centers of the circles are determined by computing the distances between topics which are then projected onto a 2D plane. The larger a circle is, the more prevalent it is.

  • Red Bars: the estimated number of times a term was created by a topic. When you select a circle the red bars for the most relevant terms will be highlighted.

  • Blue Bars: each bar represents the frequency of each term. When no topic is selected the blue bars for the most salient terms are displayed. When a topic is selected the blue bars are the frequency of the most relevant terms.

  • Topic-Term Circles: When you highlight a term the circles’ areas will change. These circles’ areas are proportional to the number of times a topic generated that specific term.

If \(\lambda=1\) terms are ranked in decreasing order of their topic-specific probability. \(\lambda=0\) ranks terms solely by their lift. The suggested \(\lambda\) is \(\lambda=0.6\) though in our experimentation we found \(\lambda \in [0.6,0.8]\) tends to provide good results.

K=3 Solution

# Fit the K = 3 model
lda.out <- LDA(dtm, 3, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_3", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

As we might expect, the overarching topics of any Data Science curriculum are programming, statistics, and mathematics. Interestingly, models actually ranks above statistics, suggesting that data scientists are primarily concerned with modelling as opposed to inference or estimation.

K=5 Solution

# Fit the K = 5 model
lda.out <- LDA(dtm, 5, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_5", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

Now topics begin to become more nuanced:

  • Topic 1: statistics comes to the top with applications, random, theory, probability, estimation, and distributions trailing behind. Clearly a statistics topic.

  • Topic 2: software, database, and systems come to the top of this topic. Databases are an important part of any data scientist’s job and it makes sense this would be put into its own topic; it’s quite distinct from any of the others.

  • Topic 3: models, analysis, learning, regression, clustering, classification, and machine_learning. Very clearly a machine learning topic.

  • Topic 4: mathematics is by and large the most relevant term. This, coupled with words such as calculus, equations, linear, and integration, shows this is a mathematics topic.

  • Topic 5: programming, algorithms, computer, and objectorientated: this is likely the computer science and programming aspect of any data science program.

K=8 Solution

# Fit the K = 8 model
lda.out <- LDA(dtm, 8, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_8", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

For the sake of brevity I will simply list what I consider to be the most likely topic labels from here on out. Note that asterisks indicate noisy topics I am unsure of and a ? indicates the topic is uninterpretable.

  • Topic 1: statistics
  • Topic 2: machine learning
  • Topic 3: business
  • Topic 4: mathematics
  • Topic 5: data analysis/inference
  • Topic 6: databases
  • Topic 7: computer science
  • Topic 8: linear algebra

K=9 Solution

# Fit the K = 9 model
lda.out <- LDA(dtm, 9, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_9", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: probability theory
  • Topic 2: machine learning
  • Topic 3: business analytics/marketing
  • Topic 4: mathematics
  • Topic 5: databases
  • Topic 6: data analysis*
  • Topic 7: computer science
  • Topic 8: business applications*
  • Topic 9: communication

K=12 Solution

# Fit the K = 12 model
lda.out <- LDA(dtm, 12, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_12", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: machine learning
  • Topic 2: probability theory
  • Topic 3: statistics/inference
  • Topic 4: communication
  • Topic 5: mathematics
  • Topic 6: linear algebra
  • Topic 7: databases
  • Topic 8: algorithms, data structures, computer science
  • Topic 9: programming
  • Topic 10: applications*
  • Topic 11: data analysis*
  • Topic 12: proofs

K=14 Solution

# Fit the K = 14 model
lda.out <- LDA(dtm, 14, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_14", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: machine learning
  • Topic 2: statistics
  • Topic 3: communication
  • Topic 4: mathematics
  • Topic 5: databases
  • Topic 6: linear algebra
  • Topic 7: research*
  • Topic 8: data analysis
  • Topic 9: ?
  • Topic 10: programming
  • Topic 11: data_structures
  • Topic 12: inference
  • Topic 13: programming*
  • Topic 14: applications*

K=15 Solution

# Fit the K = 15 model
lda.out <- LDA(dtm, 15, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_15", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: probability theory
  • Topic 2: machine learning
  • Topic 3: business
  • Topic 4: communication
  • Topic 5: statistics/inference
  • Topic 6: mathematics
  • Topic 7: calculus
  • Topic 8: databases
  • Topic 9: programming
  • Topic 10: algorithms/data structures
  • Topic 11: linear algebra
  • Topic 12: modelling
  • Topic 13: statistics*
  • Topic 14: experiment design
  • Topic 15: analytics*

K=20 Solution

# Fit the K = 20 model
lda.out <- LDA(dtm, 20, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_20", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: probability theory
  • Topic 2: machine learning
  • Topic 3: data mining
  • Topic 4: mathematics
  • Topic 5: databases
  • Topic 6: calculus
  • Topic 7: statistics/inference
  • Topic 8: algorithms/data structures
  • Topic 9: communication
  • Topic 10: projects*
  • Topic 11: data analysis/ hypothesis testing
  • Topic 12: linear algebra
  • Topic 13: experiment design
  • Topic 14: programming
  • Topic 15: ?
  • Topic 16: ?
  • Topic 17: object orientated programming*
  • Topic 18: regression
  • Topic 19: ?
  • Topic 20: statistics

K=25 Solution

# Fit largest model
lda.out <- LDA(dtm, 25, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./TestVis/K_25", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: probability theory
  • Topic 2: calculus
  • Topic 3: modelling*
  • Topic 4: machine learning
  • Topic 5: linear algebra
  • Topic 6: databases
  • Topic 7: statistics
  • Topic 8: algorithms
  • Topic 9: mathematics
  • Topic 10: programming
  • Topic 11: inference
  • Topic 12: analytics*
  • Topic 13: experiment design
  • Topic 14: writing*
  • Topic 15: projects*
  • Topic 16: business
  • Topic 17: computational methods*
  • Topic 18: text data*
  • Topic 19: data mining
  • Topic 20: ?
  • Topic 21: data visualization*
  • Topic 22: regression
  • Topic 23: time series
  • Topic 24: data analysis*
  • Topic 25: research/marketing

Resampling Course Pathways

Currently we are using one sampled degree pathway from each university. It would be interesting to see what happens if we instead sample, say, 100 students from each university. Will the variance in electives lead to different topics?

load("../data/RObjects/degree_corpus_total.RData")
colnames(degree_corpus_total) <- c("doc_id", "text")
degree_corpus_total <- degree_corpus_total[degree_corpus_total$text != "",]
degree_corpus_total <- na.omit(degree_corpus_total)
degree_corpus_total <- degree_corpus_total[!duplicated(degree_corpus_total$doc_id),]
ds <- DataframeSource(degree_corpus_total)
corpus <- Corpus(ds)
inspect(corpus[1])
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

                                                                                                                                           Western Computer Science 2210A/B (1) 
Lists, stacks, queues, priority queues, trees, graphs, and their associated algorithms; file structures; sorting, searching, and hashing techniques; time and space complexity. 
# Remove punctuation
corpus <- tm_map(corpus, content_transformer(removePunctuation))
# Send everything to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stopwords
corpus <- tm_map(corpus,
                 content_transformer(removeWords),
                 stopwords("english"))
# Remove numbers
corpus <- tm_map(corpus, content_transformer(removeNumbers))

# Remove custom stop words
remove_words <-
  c("include","methods","topics","design","functions","emphasis","language","introduction","languages","performance","experience","course","techniques","variables","number","department","tools","fundamental","also","major","modern","issues","used","methods","using","case","architecture","covered","credit","basic","cosc","granted","use","solutions","will","students","fall","spring","important","one","considered","stacks","offers","types","may","held","former","honours","faculty","related","enter","review","enrolment","exercises","summer","need","offered","social","digital","terms","real","concepts","understanding","can","including","programs","program","recommended","examples","introduced","large","search","relations","key","etc","reasoning","intended","fws","general","restricted","version","two","comp","well","rich","intended","required","internet","recent","phys","sciences","covers","year","selected","renewal","explored","csch","principles","practice","development","studies","security","provides","advanced","instruction","discussed","processes","death","lower","high","crncr","taken","efficient","includes","core","retrieval","class","within","present","option","interested","together","session","week","new","order","tables","small","suitable","wide","without","good","introduces","assignments","current","thinking","completed","basics","essential","gain","effective","file","three","many","classes","extensive","tasks","work","meaningful","first","creating","elementary","image","variety","field","engl","skills","evaluation","advances","however","substantial","ece","fields","sucessful","effectively","beyond","explicit","describe","take","earlier","worstcase","obtained","rules","previously","life","allow","abstractions","intensive","agreement","involving","shared","thus","attached","firsthand","partners","provider","remain","entity","given","delves","offerings","available","testingand","vary","school","complete","choice","certain","identify","ides","term","minimum","upon","working","discusses","final","highlevel"
  )
corpus <- tm_map(corpus, function(x) {
  removeWords(x, remove_words)
})

# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)

inspect(corpus[1])
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

                                                                                      Western Computer Science 2210A/B (1) 
lists queues priority queues trees graphs associated algorithms structures sorting searching hashing time space complexity 

We search for bigrams once again:

# Inspecting BiGrams table
corpusq <- quanteda::corpus(corpus)

corpuz_tokzd <- quanteda::tokens(corpusq)

BiGrams <- corpuz_tokzd %>% 
       quanteda::tokens_select(pattern = "^[A-Z]", 
                               valuetype = "regex",
                               case_insensitive = TRUE, 
                               padding = TRUE) %>% 
       textstat_collocations(min_count = 500, tolower = FALSE,size=2)

BiGrams
                  collocation count count_nested length     lambda          z
1                   math math  3849            0      2  7.1236081 147.663228
2      differential equations  1227            0      2  6.9945033 109.552975
3                data science  2007            0      2  3.1165548 101.225615
4            database systems   900            0      2  4.3045611  96.151115
5                   stat stat   791            0      2  7.2490956  92.891146
6             neural networks   697            0      2  8.2201796  87.922537
7     conditional probability   560            0      2  5.9110985  87.473371
8                queues trees   549            0      2  5.8782531  87.306766
9             data structures  1435            0      2  4.0873762  86.860659
10           machine learning   960            0      2  6.8334718  86.489570
11  regression classification   680            0      2  5.6540018  86.143720
12               graph theory   600            0      2  5.1489152  86.017550
13          linear regression   819            0      2  3.7378780  85.662039
14          computing science   646            0      2  4.3647462  81.399262
15      statistical inference   520            0      2  4.6352065  78.268305
16          analysis variance   657            0      2  5.0411811  74.408687
17           computer science   604            0      2  3.6268005  73.105115
18       computer programming   607            0      2  3.5937047  72.686506
19           linear equations   582            0      2  3.6317445  72.377138
20         hypothesis testing   786            0      2  8.8417868  71.912605
21       statistical software   525            0      2  3.5336911  68.427885
22                  data sets   871            0      2  3.3526943  68.424026
23                data mining  1023            0      2  4.8592274  65.182590
24         data visualization   629            0      2  3.1203219  57.195120
25      ordinary differential   553            0      2  8.5019581  55.122070
26 mathematically disciplines   727            0      2 11.6502774  54.133670
27              data analysis  1078            0      2  1.7208998  50.088245
28                   big data   709            0      2  5.2794043  49.391833
29                monte carlo   602            0      2 17.5996629  12.198298
30                  data data   713            0      2  0.1731121   4.426817
# Plotting BiGrams
plot1 <- ggplot(BiGrams, aes(fill=count, y=lambda, x=reorder(collocation, -lambda))) + 
    geom_bar(position="dodge", stat="identity") + coord_flip()+labs(x= "Bigrams", y = "lambda")

plot2 <- ggplot(BiGrams, aes(fill=count, y=z, x=reorder(collocation, -z))) + 
    geom_bar(position="dodge", stat="identity") + coord_flip()+labs(x= "Bigrams", y = "z")
  
grid.arrange(plot1, plot2, ncol=2)

# Recreate data frame
cor <- as.list(corpus)
df <- cbind(data.frame("doc_id" = names(cor)),data.frame("text"=t(as.data.frame(cor))))

df <- createBigram(df,"data","science","data_science")
df <- createBigram(df,"data","structures","data_structures")
df <- createBigram(df,"data","analysis","data_analysis")
df <- createBigram(df,"data","mining","data_mining")
# df <- createBigram(df,"text","mining","text_mining")
df <- createBigram(df,"machine","learning","machine_learning")
df <- createBigram(df,"computer","science","computer_science")
df <- createBigram(df,"time","series","time_series")
df <- createBigram(df,"database","systems","database_systems")
df <- createBigram(df,"data","visualization","data_visualization")
df <- createBigram(df,"neural","networks","neural_networks")
df <- createBigram(df,"graph","theory","graph_theory")
df <- createBigram(df,"differential","equations","differential_equations")
df <- createBigram(df,"big","data","big_data")
df <- createBigram(df,"hypothesis","testing","hypothesis_testing")
# df <- createBigram(df,"linear","regression","linear_regression")
# df <- createBigram(df,"regression","models","regression_models")
df <- createBigram(df,"data","sets","data_sets")
df <- createBigram(df,"text","data","text_data")
df <- createBigram(df,"monte","carlo","monte_carlo")
df <- createBigram(df,"markov","chain","markov_chain")
df <- createBigram(df,"markov","chains","markov_chains")
df <- createBigram(df,"actuarial","science","actuarial_science")
df <- createBigram(df,"exploratory","data","exploratory_data")
df <- createBigram(df,"r","programming","r_programming")
ds2 <- DataframeSource(df)
corpus2 <- Corpus(ds2)

remove_words <-
  c("data","science")
corpus2 <- tm_map(corpus2, function(x) {
  removeWords(x, remove_words)
})

dtm <- DocumentTermMatrix(corpus2)
dtm <-
  combineCols(c("math", "mathematics", "mathematically", "mathematical"),
              dtm,"mathematics")
dtm <- combineCols(c("stat", "statistics","statistical"), dtm,"statistics")
dtm <- combineCols(c("model","models","modeling"), dtm,"models")
dtm <- combineCols(c("database","databases"), dtm,"database")
dtm <- combineCols(c("markov_chains","markov_chain"), dtm,"markov_chains")

Now we must discover topic numbers. Unfortunately, we cannot use perplexity here: the corpus is much larger, so the cross-validation code would take too long to run. However, ldatuning is still an option.

# Find number of topics
result <- FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 20, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
)
# Plot metrics
FindTopicsNumber_plot(result)

# Find number of topics
result <- FindTopicsNumber(
  dtm,
  topics = seq(from = 20, to = 40, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
)
# Plot metrics
FindTopicsNumber_plot(result)

Potential topic numbers from CaoJuan2009 are 23 and 4. Arun2010 hits 0 at 20 and 40. Deveaud2014 suggests 3 and 23. Griffiths2004 plateaus around 24.

K = 3 Solution

We can start with the 3-topic solution to confirm our model is working well. We expect to see something similar to before: statistics, mathematics, and programming/computer science.

# Fit the K = 3 model
lda.out <-
  LDA(dtm, 3, method = "Gibbs", control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./totalk3", open.browser = FALSE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: statistics
  • Topic 2: programming
  • Topic 3: mathematics

K = 20 Solution

# Fit the K = 20 model
lda.out <- LDA(dtm, 20, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./totalk20", open.browser = TRUE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: Probability
  • Topic 2: databases
  • Topic 3: machine learning
  • Topic 4: calculus
  • Topic 5: business/communication
  • Topic 6: mathematics
  • Topic 7: optimization
  • Topic 8: regression
  • Topic 9: linear algebra
  • Topic 10: statistics
  • Topic 11: algorithms
  • Topic 12: analytics
  • Topic 13: simulation
  • Topic 14: experimental design
  • Topic 15: marketing
  • Topic 16: graph theory
  • Topic 17: data structures
  • Topic 18: programming
  • Topic 19: management
  • Topic 20: computer ethics*

K = 23 Solution

# Fit the K = 23 model
lda.out <- LDA(dtm, 23, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./totalk23", open.browser = TRUE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: probability
  • Topic 2: linear algebra
  • Topic 3: machine learning
  • Topic 4: communication
  • Topic 5: marketing
  • Topic 6: databases
  • Topic 7: programming
  • Topic 8: statistics
  • Topic 9: inference
  • Topic 10: regression
  • Topic 11: experiment design*
  • Topic 12: differential equations
  • Topic 13: optimization
  • Topic 14: mathematics
  • Topic 15: algorithms (and maybe data structures)
  • Topic 16: management*
  • Topic 17: distributions and probability*
  • Topic 18: finance*
  • Topic 19: research*
  • Topic 20: calculus
  • Topic 21: business
  • Topic 22: graph theory
  • Topic 23: proofs

K = 24 Solution

# Fit the K = 24 model
lda.out <- LDA(dtm, 24, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./totalk24", open.browser = TRUE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: probability
  • Topic 2: machine learning
  • Topic 3: data mining*
  • Topic 4: communication
  • Topic 5: databases
  • Topic 6: marketing
  • Topic 7: regression
  • Topic 8: data visualization
  • Topic 9: simulation
  • Topic 10: linear algebra
  • Topic 11: programming
  • Topic 12: differential equations*
  • Topic 13: algorithms and data structures
  • Topic 14: management
  • Topic 15: mathematics
  • Topic 16: experiment design
  • Topic 17: Calculus
  • Topic 18: neural networks*
  • Topic 19: analytics
  • Topic 20: software engineering
  • Topic 21: logic*
  • Topic 22: programming
  • Topic 23: proofs
  • Topic 24: graph theory

K = 30 Solution

# Fit the K = 30 model
lda.out <- LDA(dtm, 30, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./totalk30", open.browser = TRUE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: probability
  • Topic 2: machine learning
  • Topic 3: communication
  • Topic 4: cleaning/collecting data*
  • Topic 5: analysis
  • Topic 6: deep learning
  • Topic 7: ?
  • Topic 8: programming
  • Topic 9: data mining
  • Topic 10: databases
  • Topic 11: linear algebra
  • Topic 12: calculus
  • Topic 13: mathematics
  • Topic 14: data structures
  • Topic 15: algorithms
  • Topic 16: ?
  • Topic 17: proofs
  • Topic 18: statistics
  • Topic 19: inference
  • Topic 20: regression
  • Topic 21: calculus
  • Topic 22: management
  • Topic 23: finance/accounting
  • Topic 24: research/presentation
  • Topic 25: ? some programming topic
  • Topic 26: physics
  • Topic 27: r programming
  • Topic 28: ?
  • Topic 29: optimization
  • Topic 30: ?

K = 40 Solution

# Fit largest model
lda.out <- LDA(dtm, 40, method = "Gibbs",control = list(seed = 87460945))

# Find required quantities
phi <- posterior(lda.out)$terms %>% as.matrix
theta <- posterior(lda.out)$topics %>% as.matrix
vocab <- colnames(phi)
doc_length <- vector()
for (i in 1:length(corpus)) {
  temp <- paste(corpus[[i]]$content, collapse = ' ')
  doc_length <-
    c(doc_length, stri_count(temp, regex = '\\S+'))
}
temp_frequency <- as.matrix(dtm)
freq_matrix <- data.frame(ST = colnames(temp_frequency),
                          Freq = colSums(temp_frequency))

# Convert to json
json_lda <- LDAvis::createJSON(
  phi = phi,
  theta = theta,
  vocab = vocab,
  doc.length = doc_length,
  term.frequency = freq_matrix$Freq
)

# Open server for visualization
# serVis(json_lda, out.dir = "./totalk40", open.browser = TRUE)
# serVis(json_lda, open.browser = TRUE)

View the visualization here.

  • Topic 1: probability/likelihood
  • Topic 2: machine learning
  • Topic 3: visualization
  • Topic 4: programming
  • Topic 5: mathematics
  • Topic 6: analysis*
  • Topic 7: business/communication
  • Topic 8: linear algebra
  • Topic 9: data mining
  • Topic 10: regression
  • Topic 11: ?
  • Topic 12: marketing
  • Topic 13: calc4*
  • Topic 14: inference
  • Topic 15: databases
  • Topic 16: simulations
  • Topic 17: physics
  • Topic 18: proofs
  • Topic 19: databases
  • Topic 20: calculus
  • Topic 21: probability
  • Topic 22: management
  • Topic 23: data structures
  • Topic 24: differential equations
  • Topic 25: graph theory
  • Topic 26: ?
  • Topic 27: finance
  • Topic 28: research/communication
  • Topic 29: experiment design
  • Topic 30: optimization
  • Topic 31: ?
  • Topic 32: ?
  • Topic 33: statistics
  • Topic 34: ?
  • Topic 35: programming
  • Topic 36: algorithms
  • Topic 37: logic
  • Topic 38: ?
  • Topic 39: ?
  • Topic 40: ?

Overall there are some differences in the topics, but not a dramatic amount of change.

University-Topic Composition

Another interesting question to explore is which topics universities prioritize. Is one university’s Data Science program more heavily biased towards Math? How about Statistics? Recall the \(\gamma\) matrix provides us with the per-document-per-topic probabilities. That is, it provides a percentage breakdown of which topics generated each document.

Now that we have a few models and labelled topics we can look at what topics specific universities focus on. Let’s start with the 3-topic solution.

lda.out <-
  LDA(dtm, 3, method = "Gibbs", control = list(seed = 87460945))

uni_topics <- tidy(lda.out,matrix="gamma")

universities <- c("Berkeley","Concordia","Laurier","Manitoba","SFU","Toronto","Waterloo","Western","UBCO")

vizdf <- data.frame(document = NA, topic = NA, gamma = NA)

for (uni in universities) {
  regex <- paste0("^", uni)
  
  test <- subset(uni_topics, topic == 1)
  idx <- which(str_detect(test$document, regex))
  gamma1 <- sum(test[idx, ]$gamma)
  
  test <- subset(uni_topics, topic == 2)
  idx <- which(str_detect(test$document, regex))
  gamma2 <- sum(test[idx, ]$gamma)
  
  test <- subset(uni_topics, topic == 3)
  idx <- which(str_detect(test$document, regex))
  gamma3 <- sum(test[idx, ]$gamma)

  gamma = c(gamma1,gamma2,gamma3)
  gammaz <- abs(scale(gamma))
  # gammaz <- gamma
  
# - Topic 1: statistics
# - Topic 2: programming
# - Topic 3: mathematics

  temp <- data.frame(document = rep(uni,3), topic = c("Statistics","Programming","Mathematics"), gamma = gammaz)
  
  vizdf <- rbind(vizdf,temp)
  }

vizdf <- na.omit(vizdf)

top_terms <- vizdf %>%
 group_by(topic) %>%
 arrange(topic, gamma)

top_terms %>%
 mutate(topic = reorder(topic, gamma)) %>%
 ggplot(aes(topic, gamma, fill = factor(document))) +
 geom_col(show.legend = FALSE) +
 facet_wrap(~ document, scales = "free") +
 coord_flip()

The 20-topic solution looked quite good so we will visualize this next:

lda.out <- LDA(dtm, 20, method = "Gibbs",control = list(seed = 87460945))

uni_topics <- tidy(lda.out,matrix="gamma")

universities <- c("Berkeley","Concordia","Laurier","Manitoba","SFU","Toronto","Waterloo","Western","UBCO")

vizdf <- data.frame(document = NA, topic = NA, gamma = NA)

topics <- c("probability", "databases", "machine learning", "calculus", "business/communication", "mathematics", "optimization", "regression", "linear algebra", "statistics", "algorithms", "analytics", "simulation", "experimental design", "marketing", "graph theory", "data structures", "programming", "management", "computer ethics*")
for (uni in universities) {
  regex <- paste0("^", uni)
  
  gamma <- c()
  for (i in 1:length(topics)) {
  test <- subset(uni_topics, topic == i)
  idx <- which(str_detect(test$document, regex))
  gamma <- c(gamma, sum(test[idx, ]$gamma))  # append in topic order so labels match the topics vector
    
  }

  gammaz <- abs(scale(gamma))

  temp <- data.frame(document = rep(uni,20), topic = topics, gamma = gammaz)
  
  vizdf <- rbind(vizdf,temp)
  }

vizdf <- na.omit(vizdf)

top_terms <- vizdf %>%
 group_by(topic) %>%
 arrange(topic, gamma)

top_terms %>%
  mutate(topic = reorder(topic, gamma)) %>%
  ggplot(aes(topic, gamma, fill = factor(document))) +
  theme(axis.text = element_text(size = 6)) +
  geom_col(show.legend = FALSE) +
  facet_wrap( ~ document, scales = "free") +
  coord_flip()

Topic-Word Composition

Given our optimal models we may now move on to exploring the main words that comprise each of the topics. To do so we explore \(\beta\) values using wordclouds. Recall that the \(\beta\) matrix provides us with the per-topic-per-word probabilities.

lda.out <-
  LDA(dtm, 3, method = "Gibbs", control = list(seed = 87460945))

uni_topics <- tidy(lda.out,matrix="beta")

top_terms <- uni_topics %>%
 group_by(topic) %>%
 top_n(30, beta) %>%
 ungroup() %>%
 arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta))
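
The chunk above only ranks the top terms; here is a minimal sketch of one wordcloud per topic, assuming the wordcloud package is installed (the same loop applies to the 20-topic fit below):

# One wordcloud per topic, sized by the per-topic-per-word probabilities (beta)
library(wordcloud)
for (k in sort(unique(top_terms$topic))) {
  terms_k <- subset(top_terms, topic == k)
  wordcloud(words = as.character(terms_k$term), freq = terms_k$beta,
            min.freq = 0, scale = c(3, 0.5), random.order = FALSE)
}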

We will also do this for the 20-topic solution.

lda.out <-
  LDA(dtm, 20, method = "Gibbs", control = list(seed = 87460945))

uni_topics <- tidy(lda.out,matrix="beta")

top_terms <- uni_topics %>%
 group_by(topic) %>%
 top_n(30, beta) %>%
 ungroup() %>%
 arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta))

References

A topic model for movie reviews. (n.d.). Retrieved March 20, 2023, from https://ldavis.cpsievert.me/reviews/reviews.html

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
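
Chuang, J., Manning, C. D., & Heer, J. (2012). Termite: Visualization techniques for assessing textual topic models. In Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI).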

Fernandes, E. (2019, November 2). Text preprocessing for NLP and Machine Learning Using R. DataMathStat. Retrieved March 20, 2023, from https://datamathstat.wordpress.com/2019/10/25/text-preprocessing-for-nlp-and-machine-learning-using-r/

Gandrud, C. (2015, May 8). A Link Between topicmodels LDA and LDAvis. R-bloggers. Retrieved March 20, 2023, from https://www.r-bloggers.com/2015/05/a-link-between-topicmodels-lda-and-ldavis/

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Lau, J. H., Baldwin, T., & Newman, D. (2013). On collocations and topic models. ACM Transactions on Speech and Language Processing, 10(3), 1–14. https://doi.org/10.1145/2483969.2483972

Meza, D. (2015, July 20). Topic modeling using R · Knowledger. knowledgeR. Retrieved March 20, 2023, from https://knowledger.rbind.io/post/topic-modeling-using-r/

Pleplé, Q. (n.d.). Perplexity To Evaluate Topic Models. qpleple. Retrieved March 20, 2023, from http://qpleple.com/perplexity-to-evaluate-topic-models/

Sievert, C., & Shirley, K. (2014, June). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the workshop on interactive language learning, visualization, and interfaces (pp. 63-70).

Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach (First ed.). O’Reilly Media.

Tufts, C. (n.d.). The Little Book of LDA. Mining the Details. Retrieved March 20, 2023, from https://miningthedetails.com/LDA_Inference_Book/lda-inference.html