
Bayesian Forecasting Discontinuity

July 17, 2014 by Daniel Marcelino | 1 Comment | Filed in Uncategorized

I’m finalizing a paper presentation for the ABCP meeting, in which I explore poor polling forecasts in local elections in Brazil. I drew upon Jackman’s (2005) paper “Pooling the Polls” to explore “house effects” in the Brazilian context. During the analysis, however, I found myself extending his original model to fit vote intention before and after an exogenous shock: the political ads season. In Brazil, political parties receive free airtime for canvassing on broadcast media (radio and TV). The whole point is that this sometimes produces drastic changes in the vote distribution over a short period, so we can’t simply apply a standard Bayesian linear model; that would violate its linearity assumptions. To account for the advertising effect on popular support, I adapted the model so that the transition component (the random walk) breaks at the last poll before the ads season began and restarts with the first poll after it. The following chart says more about the problem. The black dots are the observed polls from the major pollsters, and the gray area shows the 95% intervals of what we usually don’t see: the daily fluctuations in vote intention.

[Chart: daily vote-intention estimates for Haddad]

Over the 58 weeks before the ads season began, the Workers’ Party candidate, Fernando Haddad, showed a weekly growth rate of 2.6%, but in the five weeks following the start of political advertising on radio and television, that rate jumped to 9.46%. To put it simply, within a week of media exposure his popular support increased by more than 5 percentage points; that is more than he had achieved in a year of “informal campaigning”.
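A minimal sketch of that broken random walk in R. Everything here (the dates, starting values, and innovation scale) is my own illustration, not the paper’s actual code: the latent daily support evolves as a Gaussian random walk, except that at the start of the ads season the chain is re-initialized from the first post-ads poll instead of carrying the pre-ads state forward.

```r
# Sketch: latent daily vote intention as a random walk that breaks at
# the start of the free advertising season (all values hypothetical).
set.seed(42)
n_days  <- 100   # days covered by the polling series
ads_day <- 60    # first day of the ads season (illustrative)
sigma   <- 0.4   # daily innovation SD, in percentage points

alpha <- numeric(n_days)  # latent support series
alpha[1] <- 25            # level at the first poll (illustrative)
for (t in 2:n_days) {
  if (t == ads_day) {
    # The break: restart the walk at the first post-ads poll instead of
    # carrying over the pre-ads state.
    alpha[t] <- 30        # first poll after ads began (illustrative)
  } else {
    alpha[t] <- alpha[t - 1] + rnorm(1, mean = 0, sd = sigma)
  }
}
```

In a full Bayesian treatment the two segments would each get their own likelihood for the observed polls plus house-effect terms, but the discontinuity itself is just this re-initialization of the transition equation.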


Parallel computing in R

July 1, 2014 by Daniel Marcelino | 1 Comment | Filed in Tutorials

Roughly a year ago I published an article about parallel computing in R here, in which I compared computation performance among 4 packages that provide R with parallel features, since R is essentially a single-threaded package.

Parallel computing is incredibly useful, but not everything is worth distributing across as many cores as possible. In fact, in cases without enough repetitions, R will gain performance through serial computation. That is, R takes time to distribute tasks across the processors and, conversely, needs time to bind the results back together afterwards. Therefore, if the time spent distributing and gathering the pieces is greater than the time needed for single-threaded computing, it isn’t worth parallelizing.

In this post I’ll perform the same experiment using the same physical resources, except that I will run it in RStudio instead of Emacs. I want to check whether the packages have improved anything significant since then.

I tested a nontrivial computation using four key R functions: the base lapply, mclapply from the “multicore” package, parLapply from the “snow” package, and sfLapply from the “snowfall” package. The last three essentially provide parallelized equivalents of lapply.

The Experiment: These packages were used to distribute tasks across the four CPUs of my MacBook Pro with 8 GB of memory. The task was to average out each column of a data frame built on the fly, repeating the procedure 100 times per trial because I didn’t want to rely on a single-round estimate. In addition, each trial demanded a different amount of memory allocation and computing time, since the matrix size varied across 1K, 10K, 100K, 1M, and 10M rows. The program I used to perform the tests is left here.
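A condensed sketch of one such trial. The column-averaging task is from the post; the helper name, the two-worker cluster, and the sizes shown are my own illustration, and I use base R’s parallel package, which nowadays bundles the “snow”-style interface used in the experiment:

```r
library(parallel)  # base R package providing parLapply and makeCluster

# The task: build a data frame on the fly and average out each column.
col_means <- function(n) {
  df <- data.frame(a = rnorm(n), b = rnorm(n))
  vapply(df, mean, numeric(1))
}

sizes <- c(1e3, 1e4)  # the full experiment also ran 1e5, 1e6, and 1e7 rows

# Serial baseline with lapply.
t_serial <- system.time(res1 <- lapply(sizes, col_means))["elapsed"]

# Distributed version: two workers, illustrative of the four-CPU setup.
cl <- makeCluster(2)
t_par <- system.time(res2 <- parLapply(cl, sizes, col_means))["elapsed"]
stopCluster(cl)
```

Comparing t_serial and t_par for each size is exactly the trade-off described above: the cluster only pays off once the per-task work outweighs the cost of shipping data to the workers and gathering the results.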

Overall, every function is doing a better job now than one year ago, but mclapply from the “multicore” package rocks! parLapply from the “snow” package comes second (a huge improvement since then). Given the cost of programming, single-core computing with the lapply function is still an alternative when datasets are very small (1K to 10K rows) and the difference in performance isn’t that large; but if your data have more than 10K rows, lapply should be your last choice.

In my former experiment, the lapply function was the way to go for matrices of up to 10K rows. The second-best alternative was then mclapply; parLapply from “snow” and sfLapply from the “snowfall” package were simply too slow. Comparing the older graph with the following one suggests that a “microrevolution” is taking place: the figures have changed in favor of distributing tasks even when the data vectors are small, say under 10K rows. Distributing across CPUs now seems to be less time-consuming than running serially.

[Chart: benchmark results for lapply, mclapply, parLapply, and sfLapply]


Do you believe in World Cup superstition?

June 25, 2014 by Daniel Marcelino | Comments Off | Filed in Uncategorized

If you believe in supernatural causality, you will love what the numbers of the World Cup have to say about which team is going to win this Cup in Brazil. According to this numerology approach, neither Brazil, Germany, nor the Netherlands will be the winner, but Uruguay. The table below shows my reasoning. It was produced using the googleVis package. Script here.


R issues with Portuguese diacritics

June 21, 2014 by Daniel Marcelino | Comments Off | Filed in Uncategorized

I started writing in Portuguese (á, é, í, ó, ú, ç, etc.) inside R for Mac, but ran into some encoding issues. I managed to fix them at once by simply typing the following command in the Terminal:

defaults write org.R-project.R force.LANG en_US.UTF-8
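If you’d rather fix this from inside R instead of the shell (a session-only workaround, as opposed to the persistent defaults setting above), setting the locale usually does the trick; "en_US.UTF-8" must be a locale installed on your system:

```r
# Switch the session locale so accented characters display correctly.
Sys.setlocale("LC_ALL", "en_US.UTF-8")

# Quick sanity check: each accented character should count as one character.
x <- "ação"
nchar(x)  # 4 when the encoding is handled properly
```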


More on TV Ads and Presidential Elections

June 4, 2014 by Daniel Marcelino | 1 Comment | Filed in Uncategorized

In my last post, I wrote about Brazil’s influential incentives for parties to coalesce based on the share of TV advertising time a party holds. I have just played around with the data I gathered to produce the following chart. Its colors are not representative of the parties, which would take a while to adjust. The interesting thing about this chart is that you can interact with the data by selecting a combination of the information you want to display. Clearly, the next step is to predict vote share (y axis) given a certain amount of advertising (x axis).
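As a first pass at that prediction step, a simple linear model would look something like this (the numbers below are fabricated for the sketch, not the dataset I gathered):

```r
set.seed(123)
# Hypothetical data: a party's share of TV advertising time (%) and its
# vote share (%); the slope and noise level are made up for illustration.
ads  <- runif(30, min = 0, max = 25)
vote <- 2 + 1.5 * ads + rnorm(30, sd = 3)

fit <- lm(vote ~ ads)                 # vote share as a function of ad time
predict(fit, newdata = data.frame(ads = 10))  # prediction at 10% ad share
```

With the real data, the x axis from the chart would feed in as the predictor; nonlinearities (diminishing returns to airtime, say) could then be handled with a smoother instead of a straight line.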
