
The case of Michael Brown, an unarmed black teenager shot and killed on August 9th by Darren Wilson, a white police officer, drew public attention around the globe to Ferguson, a suburb of St. Louis. On Nov. 24, after weeks of tension, the St. Louis County prosecutor announced that a grand jury had decided not to indict Mr. Wilson.
The proceedings yielded a significant amount of text (testimony), which soon became available over the internet. Thanks to the work some people have already put into preparing the text files, we can simply download and analyze them.

I spent a few hours learning about LDA (Latent Dirichlet Allocation) with a package called mallet. The mallet package provides an R interface to the Java implementation of latent Dirichlet allocation in MALLET. To process a text file with mallet, though, a stop-word list is required in the same path. Once again I'm in luck, because there are quite a few such lists available on the internet, typically containing unimportant words and tag marks that instruct the algorithm to skip them.

library(devtools)

library(repmis)

require(dplyr)

require(mallet)

data_url <- 'https://github.com/danielmarcelino/Tables/raw/master/en.txt'

stoplist <- repmis::source_data(data_url, sep = ",", header = TRUE)

Having downloaded the documents, let's import them from the folder. Each document has been split into 1,000-word chunks (if you need to generate such chunks yourself, see the sketch below). To proceed, we write the stop-word list to the working directory.
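In case the chunk files need to be produced first, here is a minimal sketch of one way to do it, assuming the raw transcripts sit as plain .txt files in the FergusontText folder (the folder and file names here are assumptions, not part of the original workflow):

# Hypothetical sketch: split each transcript into 1,000-word chunks
dir.create("FergusontText/chunks", showWarnings = FALSE)
for (f in list.files("FergusontText", pattern = "\\.txt$", full.names = TRUE)) {
  words <- scan(f, what = character(), quote = "")
  if (length(words) == 0) next                      # skip empty transcripts
  starts <- seq(1, length(words), by = 1000)
  for (j in seq_along(starts)) {
    chunk <- words[starts[j]:min(starts[j] + 999, length(words))]
    out <- sprintf("FergusontText/chunks/%s_%03d.txt",
                   tools::file_path_sans_ext(basename(f)), j)
    writeLines(paste(chunk, collapse = " "), out)
  }
}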

docs <- mallet.read.dir("FergusontText/chunks")

write.table(stoplist, file="stoplist.txt", quote=F, sep=" ", row.names=F, col.names=F)

mallet.docs <- mallet.import(docs$id, docs$text, "stoplist.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

Let's create a topic trainer object for the data:

n.topics <- 50
topic.model <- MalletLDA(n.topics)

And then load the documents:

topic.model$loadDocuments(mallet.docs)

Now we can actually get the vocabulary and a few statistics about word frequencies.

vocab <- topic.model$getVocabulary()

word.freq <- mallet.word.freqs(topic.model)

word.freq %>% arrange(desc(term.freq)) %>% head(10)

Nice, we're all set. Let's optimize the hyperparameters every 25 iterations, after a burn-in period of 100 iterations during which they are held fixed. After this we can actually train the model, specifying the number of sampling iterations after the burn-in. I'm specifying 200 iterations.

topic.model$setAlphaOptimization(25, 100)

topic.model$train(200)

Now we run through a few more iterations, but picking the 'best' topic for each token rather than sampling from the posterior distribution:

topic.model$maximize(20)

Notice that we have a nested structure: words within topics, and topics within documents. Thus, it is a good idea to get the probability of topics in documents and the probability of words in topics.

There is no magic here. The following functions return raw counts; since I want probabilities, I have to normalize them. I also add smoothing so that no topic ends up with exactly zero probability.

doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)

topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

Now I have two matrices. Let's transpose and normalize the document-topic matrix:

topic.docs <- t(doc.topics)

topic.docs <- topic.docs / rowSums(topic.docs)

Write the output to a CSV file for further analysis:

write.csv(topic.docs, "ferguson-topics.csv" )

Now we can obtain a vector with short names for the topics:

topics.labels <- rep("", n.topics)

for (topic in 1:n.topics) {
  topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,],
                                                 num.top.words=5)$words, collapse=" ")
}

Check out the keywords for each topic:

topics.labels %>% head(50)

write.csv(topics.labels, "ferguson-topics-lbs.csv")

Correlation matrix

Here we compute correlations: each 1,000-word chunk is correlated against every other chunk. A positive correlation means the chunks share similar topics.

cor.matrix <- cor(topic.docs, use="complete.obs", method="pearson")
write.csv(cor.matrix, "corr-matrix.csv")

From here, a variety of analyses can be conducted. For instance, one could approach this as a network diagram, showing which pieces of the testimony share similar patterns of discourse and which ones do not.

library(corrgram)

library(ellipse)

corrgram(cor.matrix, order=NULL, lower.panel=panel.shade,
         upper.panel=NULL, text.panel=panel.txt,
         main="Correlated chunks of the Ferguson grand jury testimony")

How about turning those bits into word clouds of the topics? A word cloud can be tricky: it doesn't tell us much of anything if obvious words are included. That makes our stop-list file key for generating good word clouds. Of course the subjects' names are going to show up a lot, but a word cloud is much more interesting and informative if it reveals what sort of emotional or subjective language is being used.

library(RColorBrewer)

library(wordcloud)

for (i in 1:10) {
  topic.top.words <- mallet.top.words(topic.model, topic.words[i,], 20)
  print(wordcloud(topic.top.words$words, topic.top.words$weights,
                  c(4, .8), rot.per=0, random.order=F,
                  colors=brewer.pal(5, "Dark2")))
}

We can also try a clustering plot based on shared words:

library(cluster)

hc <- hclust(dist(topic.words))

(dens <- as.dendrogram(hc))

plot(dens, edgePar=list(col = 1:2, lty = 2:3), dLeaf=1, edge.root = TRUE)

plot(hclust(dist(topic.words)), labels=topics.labels)

It looks too messy; let's create a data.frame and calculate a dissimilarity matrix:

topic_docs <- data.frame(topic.docs)

names(topic_docs) <- docs$id

topic_dist <- as.matrix(daisy(t(topic_docs), metric = "euclidean", stand = TRUE))

The following does the trick of keeping only closely related documents, avoiding a dense diagram that would be difficult to interpret. Briefly, it sets a row's values to zero whenever they exceed the row minimum plus the row standard deviation.

topic_dist[ sweep(topic_dist, 1, (apply(topic_dist,1,min) + apply(topic_dist,1,sd) )) > 0 ] <- 0

Finally, we can use k-means to identify groups of similar documents, and then collect the document names in each cluster:

km <- kmeans(topic_dist, n.topics)

alldocs <- vector("list", length = n.topics)

for (i in 1:n.topics) {
  alldocs[[i]] <- names(km$cluster[km$cluster == i])
}

I began to think about a nice way of plotting campaign expenditures for a paper I'm working on. I wanted something like the following: simple, but meaningful even when there are outliers in both tails.

Though I like the venerable Tukey box plot and scatter plots, I had already used them the last time I published on this topic, so I wanted to freshen up my figures; I thought a type of Manhattan plot could do the job.

The idea is to have types of elections, districts, or parties along the X axis, with the negative logarithm of the p-value for the association between a candidate's spending and votes on the Y axis. Each dot on the plot thus indicates a candidate's position on this metric. Because stronger associations have the smallest p-values (log10(0.05) = -1.30103), their negative logarithms will be larger positive numbers (e.g., above 1.3), while candidates whose p-values are not statistically significant (whatever that means these days, maybe nothing) will stay below this line.

The nice thing about this version is that it draws our attention to the upper outliers rather than to the average association, which tends to be skewed because Brazilian elections typically attract many sacrificial-lamb candidates who spend nearly nothing on their campaigns.
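To make the idea concrete, here is a minimal sketch of such a plot with ggplot2; the data frame below is simulated placeholder data (party labels and p-values drawn at random), since the real expenditure data is not reproduced here.

library(ggplot2)

# 'cand' is a hypothetical data frame: one row per candidate, with a grouping
# variable and the p-value of the spending-votes association
set.seed(1)
cand <- data.frame(party  = sample(paste0("P", 1:8), 500, replace = TRUE),
                   pvalue = runif(500))

ggplot(cand, aes(x = party, y = -log10(pvalue), colour = party)) +
  geom_jitter(width = 0.25, alpha = 0.6) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +  # the 1.3 reference line
  labs(x = NULL, y = expression(-log[10](p))) +
  theme_minimal() +
  theme(legend.position = "none")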

[Figure: Manhattan-style plot of candidate spending and votes (joinville_plot)]

Last week, I attended a "Flower Fest" where I had the opportunity to admire some of the most beautiful and award-winning flowers, orchids, and decorative plants. Surprisingly, though, I had never thought of flowers as fractals the way I did this time.

Fractals attract lots of interest, especially from mathematicians who spend time studying their structures. But what makes something a fractal? A fractal is defined as an object that displays self-similarity on all scales. The object doesn't need to have exactly the same structure at all scales; the same sort of structure just has to be visible or recognizable at every scale.

The structure or equation that defines a fractal is most of the time very simple. For instance, the formula for the famous Mandelbrot set is $z_{n+1} = z_n^2 + c$.

We start by plugging in a constant value for $c$; $z$ can start at zero. After one iteration, the equation gives us a new value for $z$; we then plug this back into the equation as the old $z$ and iterate again, and this can proceed indefinitely.

As a very simple example, let's start with $c = 1$:

$z_1 = z_0^2 + c = 0 + 1 = 1$

$z_2 = z_1^2 + c = 1 + 1 = 2$

$z_3 = z_2^2 + c = 4 + 1 = 5$

Graphing these results against $n$ would produce a curve shooting upward, because the numbers grow without bound (to infinity). But if we start with $c = -1$, for instance, $z$ behaves completely differently: it oscillates between two fixed points, as follows.

$z_1 = z_0^2 + c = 0 + (-1) = -1$

$z_2 = z_1^2 + c = 1 + (-1) = 0$

$z_3 = z_2^2 + c = 0 + (-1) = -1$

$z_4 = z_3^2 + c = 1 + (-1) = 0$

And this movement back and forth will continue forever, as we can imagine. It turns out that the Mandelbrot set is made up of all the values of $c$ for which $z$ stays bounded; a value such as $c = 1$ in the first example is thrown out because $z$ goes to infinity, and the Mandelbrot set lives in a finite world. The following is an example of such a set.
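As a minimal sketch of the iteration just described (not the full script linked below), one might check whether a given $c$ stays bounded like this:

# Iterate z_{n+1} = z_n^2 + c and check whether |z| stays bounded
# (the 100-iteration cap and the escape radius of 2 are the usual conventions)
in_mandelbrot <- function(c, max_iter = 100, bound = 2) {
  z <- 0
  for (i in 1:max_iter) {
    z <- z^2 + c
    if (Mod(z) > bound) return(FALSE)  # escaped: c is not in the set
  }
  TRUE  # stayed bounded for max_iter iterations
}

in_mandelbrot(1)    # FALSE: 1, 2, 5, 26, ... diverges
in_mandelbrot(-1)   # TRUE: oscillates between -1 and 0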

[Figure: the Mandelbrot set]

The script for this set is here.

Every campaign cycle I do similar things: go to a repository, download a bunch of data files, then merge them and store them in an existing RData file for later analysis. I've already written about this topic some time ago, but this time I think my script came out simpler.

Set the Directory

Let's assume you're not in the same directory as your files, so you'll need to point R to where the files reside.

setwd("~/Downloads/consulta_cand_2014")

Getting a List of files

Next, it's just a matter of getting to know your files. For this, the list.files() function is very handy, and you can see the file names right away on your screen. Here I'm looking for the "txt" files, so I want my list of files to exclude everything else, like pdf, jpg, etc.

files <- list.files(pattern= '\\.txt$')

Sometimes you may find empty files that prevent the script from running successfully against them. Thus, you may want to inspect the files beforehand:

info <- file.info(files)
empty <- rownames(info[info$size == 0, ])  # names of any zero-byte files

Moreover, in case you have the same files in more than one format, you may want to filter them as in the following, comparing the names without their extensions:

CSVs <- list.files(pattern='\\.csv$')
TXTs <- list.files(pattern='\\.txt$')
mylist <- CSVs[!tools::file_path_sans_ext(CSVs) %in% tools::file_path_sans_ext(TXTs)]

Stacking files into a dataframe

The last step is to iterate "rbind" over the list of files in the working directory, putting them all together.
Notice that in the script below I've included some extra options to avoid problems reading the files I have. Also, this assumes all the files have the same number of columns; otherwise "rbind" won't work, and you may need to replace "rbind" with "smartbind" from the gtools package.

cand_br <- do.call("rbind", lapply(files, FUN=function(file) {
  read.table(file, header=FALSE, sep=";", stringsAsFactors=FALSE,
             fileEncoding="cp1252", fill=TRUE, blank.lines.skip=TRUE)
}))

Uruguayan voters are about to give the Frente Amplio a third mandate this November 30th. The following graph shows how the outcome would look if the election were held this week (a nowcast). Although it's just a snapshot, this picture might hold until election day. The picture plots the probability density function (pdf) of vote support for the FA and the PN as published by major polling houses in that country. The script to replicate this analysis can be found here.

As the picture shows, the posterior densities are quite far apart from each other and their confidence regions are narrow, meaning that we have little uncertainty about the results in those areas.
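Just to illustrate the kind of figure being described, here is a minimal sketch with simulated placeholder draws; the means and spreads below are assumptions for illustration only, not the actual poll aggregation.

# Simulated placeholder draws of vote support; NOT the real poll-based posteriors
set.seed(2)
fa <- rnorm(10000, mean = 0.47, sd = 0.015)  # assumed values, roughly a 12-point gap
pn <- rnorm(10000, mean = 0.35, sd = 0.015)

plot(density(fa), xlim = c(0.28, 0.55), main = "Vote support: FA vs. PN",
     xlab = "Share of the vote")
lines(density(pn), lty = 2)
legend("topleft", legend = c("FA", "PN"), lty = 1:2, bty = "n")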

[Figure: posterior densities of vote support for the FA and the PN]

The next plot adds to the information above by showing the likely difference between the two candidates that we expect to see after November 30th. According to this, Mr. Vázquez and Mr. Lacalle would be about 12 points apart, a robust lead for the left coalition indeed.

[Figure: likely difference in vote share between the two candidates]