I loved this %>% crosstable

This is a public thank you to @heatherturner for her contribution. Now SciencesPo's crosstable can work in a chained (%>%) fashion, which is useful when it's combined with other packages that have integrated the magrittr operator.

     > candidatos %>%
     +   filter(desc_cargo == 'DEPUTADO ESTADUAL' |
     +          desc_cargo == 'DEPUTADO DISTRITAL' |
     +          desc_cargo == 'DEPUTADO FEDERAL' |
     +          desc_cargo == 'VEREADOR' |
     +          desc_cargo == 'SENADOR') %>%
     +   tab(desc_cargo, desc_sexo)

====================================================
                           desc_sexo                
                   -------------------------        
desc_cargo             NA   FEMININO MASCULINO  Total 
----------------------------------------------------
DEPUTADO DISTRITAL      1     826      2457     3284
                    0.03%     25%       75%     100%
DEPUTADO ESTADUAL     122   12595     48325    61042
                    0.20%     21%       79%     100%
DEPUTADO FEDERAL       40    5006     20176    25222
                    0.16%     20%       80%     100%
SENADOR                 4     161      1002     1167
                    0.34%     14%       86%     100%
VEREADOR             9682  376576   1162973  1549231
                    0.62%     24%       75%     100%
----------------------------------------------------
Total                9849  395164   1234933  1639946
                    0.60%     24%       75%     100%
====================================================

Chi-Square Test for Independence

Number of cases in table: 1639946 
Number of factors: 2 
Test for independence of all factors:
    Chisq = 1077.4, df = 8, p-value = 2.956e-227
                    X^2 df P(> X^2)
Likelihood Ratio 1216.0  8        0
Pearson          1077.4  8        0

Phi-Coefficient   : 0.026 
Contingency Coeff.: 0.026 
Cramer's V        : 0.018 

# Reproducible example:

library(SciencesPo)

    gender = rep(c("female", "male"), c(1835, 2691))
    admitted = rep(c("yes", "no", "yes", "no"), c(557, 1278, 1198, 1493))
    dept = rep(c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F"),
               c(89, 17, 202, 131, 94, 24, 19, 8, 391, 244, 299, 317))
    dept2 = rep(c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F"),
                c(512, 353, 120, 138, 53, 22, 313, 207, 205, 279, 138, 351))
    department = c(dept, dept2)
    ucb = data.frame(gender, admitted, department)


> ucb %>% tab(admitted, gender, department)
================================================================
                                 department                       
                  -----------------------------------------       
admitted gender   A      B      C      D      E      F    Total 
----------------------------------------------------------------
no       female     19      8    391    244    299    317   1278
                  1.5%  0.63%    31%    19%  23.4%    25%   100%
         male      313    207    205    279    138    351   1493
                 21.0% 13.86%    14%    19%   9.2%    24%   100%
         -------------------------------------------------------
         Total     332    215    596    523    437    668   2771
                 12.0%  7.76%    22%    19%  15.8%    24%   100%
----------------------------------------------------------------
yes      female     89     17    202    131     94     24    557
                   16%   3.1%    36%    24%  16.9%   4.3%   100%
         male      512    353    120    138     53     22   1198
                   43%  29.5%    10%    12%   4.4%   1.8%   100%
         -------------------------------------------------------
         Total     601    370    322    269    147     46   1755
                   34%  21.1%    18%    15%   8.4%   2.6%   100%
----------------------------------------------------------------
Total    female    108     25    593    375    393    341   1835
                  5.9%   1.4%    32%    20%  21.4%    19%   100%
         male      825    560    325    417    191    373   2691
                 30.7%  20.8%    12%    15%   7.1%    14%   100%
         -------------------------------------------------------
         Total     933    585    918    792    584    714   4526
                 20.6%  12.9%    20%    17%  12.9%    16%   100%
================================================================

The Greek thing II

Just a few hours before Greeks head to the polls to decide on the bailout agreement, and ultimately whether the country will stay in the euro, neither side has an overwhelming advantage. In fact, the margin has become blurred over the last three days, with the "Yes" side rehearsing a last-minute recovery. Despite this last-minute trend, the aggregate preference for the "No" is not too far behind. To frame this in terms of probabilities, that is, the probability that \theta_{YES} exceeds \theta_{NO}, I adapted a short function I wrote a while ago to simulate from a Dirichlet distribution and then compute the posterior probabilities shown in the chart below. It's really nothing fancy, but "Yes" outperforms "No" in 57% of the simulations.

[Figure: posterior probabilities of "Yes" vs "No"]

The polls were aggregated, and the "Don't know" respondents were distributed according to the Yes/No proportions reported by the polls.
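
For the record, here is a minimal sketch of the simulation idea, not the original function: the poll counts below are hypothetical, and gtools::rdirichlet() draws from the posterior under a flat prior.

    library(gtools)

    # hypothetical aggregated poll counts
    yes <- 430; no <- 420; dk <- 150
    # reallocate the "Don't know" respondents proportionally to the Yes/No split
    yes_adj <- yes + dk * yes / (yes + no)
    no_adj  <- no  + dk * no  / (yes + no)

    # posterior draws of (theta_YES, theta_NO) under a uniform Dirichlet prior
    draws <- rdirichlet(10000, alpha = c(yes_adj, no_adj) + 1)
    # posterior probability that theta_YES exceeds theta_NO
    mean(draws[, 1] > draws[, 2])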

UPDATE:
With yesterday's polls showing both sides in a dead heat, today's overwhelming majority of voters saying "No" is a big surprise, isn't it? Plenty of theories will appear to explain why Greeks chose to reject the terms of the deal proposed by EU officials; meanwhile, it's time for the parties to set up plan B.

The Greek thing

Greeks have been quite volatile in their opinion on whether to accept a proposal from the country's creditors for more austerity in exchange for keeping aid flowing. The polls conducted over this week look crazy, though that "belly" was likely provoked by anxiety over what comes next after Greece failed to repay the IMF.

[Figure: loess-smoothed poll trends]

The data were collected from the internet, most of them assembled by http://metapolls.net.
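
A chart like the one above can be drawn with ggplot2's loess smoother. This is only a sketch: the data frame polls and its columns date, option ("Yes"/"No"), and share are assumed names, not the actual dataset.

    library(ggplot2)

    # polls: one row per poll per option, with columns date, option, share
    ggplot(polls, aes(x = date, y = share, colour = option)) +
      geom_point(alpha = 0.6) +
      geom_smooth(method = "loess", span = 0.75) +
      labs(x = NULL, y = "Vote intention (%)", colour = NULL)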

Network analysis with igraph

Suddenly, I had to learn network analysis. Two weeks ago I started with a book by Christakis and Fowler, then a book by Kolaczyk and Csárdi, and now here I am in a "how to analyse network data using R" moment. This is very much learning-by-doing knowledge, so it may be useful for newcomers, although at this point I'm more concerned with visualization than with measuring networks properly.

This is the part I loved most: with just two packages, I was able to make not-so-ugly, I'd say publishable, charts. Here, I show how to go from unrefined to nicer charts in very few steps.

[Figure: unrefined network chart]

[Figure: refined network chart]

The code:
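
What follows is a minimal sketch of the kind of workflow I mean, using igraph with a random graph standing in for the real data; the styling choices are just illustrative.

    library(igraph)

    set.seed(42)
    g <- sample_gnp(50, 0.08)   # a random graph in place of the real network

    # unrefined: plain igraph defaults
    plot(g)

    # nicer: size nodes by degree, drop labels, soften edges, force-directed layout
    V(g)$size        <- 3 + 2 * sqrt(degree(g))
    V(g)$color       <- "steelblue"
    V(g)$frame.color <- NA
    V(g)$label       <- NA
    E(g)$color       <- adjustcolor("grey50", alpha.f = 0.5)
    plot(g, layout = layout_with_fr(g))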

LDA on Ferguson Grand Jury I

[Figure: dendrogram of topic clusters]

The case of Michael Brown, an unarmed black teenager who was shot and killed on August 9th by Darren Wilson, a white police officer, in Ferguson has drawn public attention around the globe to the suburb of St. Louis. On Nov. 24, after a few weeks of tension, the St. Louis County prosecutor announced that a grand jury had decided not to indict Mr. Wilson.
The proceedings yielded a significant amount of text (testimony), which soon became available on the internet. Thanks to the hard work of a few people who prepared the text files, we can now simply download and analyze them.

I spent a few hours learning about LDA (Latent Dirichlet Allocation) with a package called mallet. The mallet machine learning package provides an R interface to the Java implementation of latent Dirichlet allocation. To process a text file with mallet, though, a stop list of words is required in the same path. Once again, I benefit from the fact that there are quite a few such lists on the internet, typically containing unimportant words and tag marks that instruct the algorithm to skip them.

library(devtools)

library(repmis)

require(dplyr)

require(mallet)

data_url <- 'https://github.com/danielmarcelino/Tables/raw/master/en.txt'

stoplist <- repmis::source_data(data_url, sep = ",", header = TRUE)

Having downloaded the documents, let's import them from the folder. Each document is split into 1,000-word chunks. To proceed, we write the stop list file to the working directory.
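
The chunking itself can be done in a few lines; this is just a sketch, and the transcript file name is an assumption:

    # hypothetical pre-processing: split a raw transcript into 1,000-word chunk files
    raw    <- scan("ferguson-transcript.txt", what = "character", quote = "")
    chunks <- split(raw, ceiling(seq_along(raw) / 1000))
    dir.create("FergusontText/chunks", recursive = TRUE, showWarnings = FALSE)
    for (i in seq_along(chunks)) {
      writeLines(paste(chunks[[i]], collapse = " "),
                 sprintf("FergusontText/chunks/chunk_%03d.txt", i))
    }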

docs <- mallet.read.dir("FergusontText/chunks")

write.table(stoplist, file = "stoplist.txt", quote = FALSE, row.names = FALSE, col.names = FALSE)

mallet.instances <- mallet.import(docs$id, docs$text, "stoplist.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

Let's create a topic trainer object for the data:

n.topics <- 50

topic.model <- MalletLDA(n.topics)

And then load the documents:

topic.model$loadDocuments(mallet.instances)

Now we can actually get the vocabulary and a few statistics about word frequencies.

vocab <- topic.model$getVocabulary()

word.freq <- mallet.word.freqs(topic.model)

word.freq %>% arrange(desc(term.freq)) %>% head(10)

Nice, we're all set. Let's optimize the hyperparameters every 25 iterations, with a burn-in of 100 iterations before optimization starts. After this we can actually train the model. Once again, we can specify the number of iterations, or draws after the burn-in; I'm specifying 200 draws.

topic.model$setAlphaOptimization(25, 100)

topic.model$train(200)

Now we run through only a few iterations, picking the 'best' topic for each token rather than sampling from the posterior distribution.

topic.model$maximize(20)

Notice that we have a nested structure: words within topics, and topics within documents. Thus, it might be a good idea to get the probability of topics in documents and the probability of words in topics.

There is no magic here. The following functions return raw counts; since I want probabilities, I have to normalize them. I also add smoothing to avoid seeing some topics with exactly 0 probability.

doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)

topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

Now I have two matrices; let's transpose and normalize the document-topics one:

topic.docs <- t(doc.topics)

topic.docs <- topic.docs / rowSums(topic.docs)

Write the output to CSV for further analysis:

write.csv(topic.docs, "ferguson-topics.csv" )

Now we can obtain a vector with short names for the topics:

topics.labels <- rep("", n.topics)

for (topic in 1:n.topics) topics.labels[topic] <- paste(
  mallet.top.words(topic.model, topic.words[topic, ], num.top.words = 5)$words, collapse = " ")

Check out the keywords for each topic:

topics.labels %>% head(50)

write.csv(topics.labels, "ferguson-topics-lbs.csv")

Correlation matrix

Here we correlate each 1,000-word chunk against the others: a positive correlation indicates chunks with similar topic profiles.

cor.matrix <- cor(topic.docs, use="complete.obs", method="pearson")
write.csv(cor.matrix, "corr-matrix.csv")

From here, a variety of analyses can be conducted. For instance, one could approach this as a network diagram, showing which pieces of the testimony share similar patterns of discourse and which do not (see the sketch after the corrgram below).

library(corrgram)

library(ellipse)

corrgram(cor.matrix, order=NULL, lower.panel=panel.shade,
upper.panel=NULL, text.panel=panel.txt,
main="Correlated chunks of the Ferguson's grand jury testimony")

How about turning those bits into word clouds of the topics? A word cloud can be tricky, as it doesn't tell us much of anything if obvious words are included. That makes our stop-list file key for generating good word clouds. Of course the subjects' names are going to show up a lot, but a word cloud is much fancier and more informative if it shows what sort of emotional or subjective language is being used.

library(RColorBrewer)

library(wordcloud)

for (i in 1:10) {
  topic.top.words <- mallet.top.words(topic.model, topic.words[i, ], 20)
  wordcloud(topic.top.words$words, topic.top.words$weights,
            scale = c(4, .8), rot.per = 0, random.order = FALSE,
            colors = brewer.pal(5, "Dark2"))
}

We can also try a clustering plot based on shared words:

library(cluster)

hc <- hclust(dist(topic.words))

(dens <- as.dendrogram(hc))

plot(dens, edgePar=list(col = 1:2, lty = 2:3), dLeaf=1, edge.root = TRUE)

plot(hclust(dist(topic.words)), labels=topics.labels)

It seems too messy; let's create a data frame and calculate a dissimilarity matrix:

topic_docs <- data.frame(topic.docs)

names(topic_docs) <- docs$id

topic_dist <- as.matrix(daisy(t(topic_docs), metric = "euclidean", stand = TRUE))

The following does the trick of keeping only closely related documents and avoiding a dense diagram that would be difficult to interpret. Briefly, it sets a row's values to zero when they are greater than the row minimum plus the row standard deviation, so only each document's nearest neighbours are kept.

topic_dist[ sweep(topic_dist, 1, (apply(topic_dist,1,min) + apply(topic_dist,1,sd) )) > 0 ] <- 0

Finally, we can use k-means to identify groups of similar documents and then collect the document names in each cluster:

km <- kmeans(topic_dist, n.topics)

alldocs <- vector("list", length = n.topics)

for (i in 1:n.topics) {
  alldocs[[i]] <- names(km$cluster[km$cluster == i])
}