
What are the Odds of an Independent Scotland?

August 18, 2014 by Daniel Marcelino | No Comments | Filed in Tutorials

“For things to remain the same, everything must change.” (Gattopardo by Giuseppe Tomasi di Lampedusa)

In less than a month, Scots will decide whether they want Scotland to stay in the UK or leave it. Over the last few days, I’ve noticed a variety of projections about this in the British press, but I decided to give it a try myself using an application of the Beta distribution.

The beta density function is suitable to represent outcomes like proportions or probabilities defined on the continuum between 0 and 1, and it is a very versatile distribution that we can apply to many different contexts, from baseball games to political elections. The only thing to remember is that this distribution applies well to problems involving two classes: Yes and No, but not to a higher number of them.

Though the polls often include a third category, the “Undecided” voters, the dispute is effectively between Yes and No. Although more complex models could handle this, I found it sensible to reduce the number of categories to two and carry out a simulation analysis with the Beta distribution.

Some Referendum Notes

The referendum will be held on September 18th to ask Scots to choose between two platforms: the pro-independence Yes Scotland and the pro-union Better Together. However, looking back at the polls since 2011, it is not hard to conclude that there is virtually no chance that the Yes side will make it. The polls have been fairly stable so far, with predicted support for the No side at 55-60 percent and for the Yes side at about 40 percent.

As can be seen from the raw polling data since 2011, support for Yes Scotland has always been lower than support for “No”, by an average of about 8.9 percentage points.

2014 so far — Yes: 36% vs. No: 47%; Uncertain: 17%
2013 — Yes: 32% vs. No: 49%; Uncertain: 17%
2012 — Yes: 32% vs. No: 53%; Uncertain: 15%
2011 — Yes: 38% vs. No: 42%; Uncertain: 20%

Although the Yes campaigners insist that the more people learn about independence, the more likely they are to vote for it, in this kind of Yes-or-No vote it is the No side that tends to grow over time (for instance, the Quebec referendum in 1995 and the Brazilian firearms referendum in 2005). As uncertainty comes into play in the days approaching the referendum, people tend to default to keeping the status quo.

The Beta Distribution


Based on recent pre-referendum polling, it looks like “No” will likely win by a similar margin, maybe a little higher than the average of 13 percentage points. In fact, based on the latest pre-referendum polls, this margin will be closer to 14.2 percentage points. Note that the margin between “Yes” and “No” is modeled here with a beta distribution, and we can see that the threshold of zero (0) lies far out in the left tail of the curve. Therefore, based on previous and current polls, it is very improbable that the pro-independence movement will prevail. The assumption that more information leads to a “Yes” vote does not look that plausible after all.
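The kind of simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for the post: it takes the 2014 yearly figures listed earlier (Yes 36%, No 47%), drops the undecided, assumes a hypothetical 1,000 decided respondents and a uniform Beta(1, 1) prior, and the function and parameter names are made up for the example.

```python
import random


def simulate_margin(yes_pct, no_pct, n_decided=1000, draws=20000, seed=42):
    """Draw the No-minus-Yes margin from a Beta posterior on the Yes share.

    n_decided is a hypothetical number of decided respondents; the prior
    is an assumed uniform Beta(1, 1).
    """
    rng = random.Random(seed)
    # Reduce the three poll categories to two by dropping the undecided.
    yes_share = yes_pct / (yes_pct + no_pct)
    yes_votes = round(yes_share * n_decided)
    no_votes = n_decided - yes_votes
    margins = []
    for _ in range(draws):
        # Posterior for the Yes share: Beta(1 + yes, 1 + no).
        p_yes = rng.betavariate(1 + yes_votes, 1 + no_votes)
        margins.append((1 - p_yes) - p_yes)  # No share minus Yes share
    return margins


margins = simulate_margin(36, 47)
mean_margin = sum(margins) / len(margins)
# Share of simulated outcomes in which Yes finishes ahead of No.
prob_yes_wins = sum(m < 0 for m in margins) / len(margins)
print(f"mean No-Yes margin: {mean_margin:.3f}")
print(f"share of draws where Yes wins: {prob_yes_wins:.4f}")
```

With these inputs almost all simulated margins are positive, which is what "zero lies far out in the tail" means in practice. A smaller assumed sample widens the curve and pulls zero closer to its body.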


Independence Referendum Betting Odds

It is also interesting to compare this result with what the market says about the likely outcome. On August 18th, the Scottish independence betting market was quoting 11/2 for “Yes to independence” and 1/10 for “No to independence”. In fractional odds, 1/10 marks a heavy favourite and 11/2 a long shot, so in this case the view of the betting market matches the view of the polled public at large: No is expected to win.
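Fractional odds can be turned into implied probabilities to put the market on the same scale as the polls. The sketch below assumes standard UK fractional odds, where a/b means winning a for every b staked, so the implied probability is b / (a + b). The helper name and the normalization step are illustrative choices.

```python
from fractions import Fraction


def implied_probability(fractional_odds):
    """Convert fractional odds such as '11/2' into an implied probability."""
    num, den = (int(part) for part in fractional_odds.split("/"))
    return Fraction(den, num + den)


# Odds quoted in the post for August 18th.
quotes = {"Yes": "11/2", "No": "1/10"}
raw = {side: implied_probability(odds) for side, odds in quotes.items()}

# Raw probabilities sum to more than 1 because of the bookmaker's margin;
# dividing by the total rescales them so they add up to 1.
total = sum(raw.values())
normalized = {side: float(p / total) for side, p in raw.items()}

for side in quotes:
    print(f"{side}: raw {float(raw[side]):.3f}, normalized {normalized[side]:.3f}")
```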

The script to produce the analysis is here.


Polling the Brazilian Presidential Election

August 14, 2014 by Daniel Marcelino | No Comments | Filed in Uncategorized

Analyzing polls can be particularly frustrating, as each opinion poll carries a margin of error of approximately 3 percentage points, which yields an uncertainty range of about 6 points. These days, for example, some polls give PT’s incumbent candidate, Dilma Rousseff, 38%, others 36%. That makes it difficult to tell whether the 36% is wrong or within the realm of possibility.
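To see where a figure of about 3 points comes from, here is a minimal sketch of the usual 95% margin of error for a simple random sample, z * sqrt(p(1 - p) / n). The sample size of 1,067 is a hypothetical value chosen because it gives roughly 3 points at p = 0.5. Real polls use more complex designs, so treat this as an approximation.

```python
import math


def margin_of_error(share, sample_size, z=1.96):
    """Approximate 95% margin of error for a share from a simple random sample."""
    return z * math.sqrt(share * (1 - share) / sample_size)


n = 1067  # hypothetical sample size, not taken from any actual poll
intervals = {}
for share in (0.36, 0.38):
    moe = margin_of_error(share, n)
    intervals[share] = (share - moe, share + moe)
    low, high = intervals[share]
    print(f"{share:.0%}: +/- {moe:.1%} -> [{low:.1%}, {high:.1%}]")

# The two intervals overlap, so a 36% and a 38% reading are compatible
# with the same underlying level of support.
overlap = intervals[0.36][1] > intervals[0.38][0]
print("intervals overlap:", overlap)
```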

The first conclusion from forecasting with polls is that we cannot rely on a single poll to ascertain what is going on in the campaign, or what the trends are, if any. A typical approach to reducing this uncertainty is to aggregate all polls (a poll-of-polls). Nonetheless, when there is an underlying trend in the sample of polls, this approach yields skewed results, as it gives the oldest polls the same weight as the most recent ones.

The next approach, then, is to combine recent polls only. There are difficulties with this too. First, the stronger the trend, the more polls need to be discarded, which reduces the number of polls available. What is more, the choice of which set of polls to use is inherently unclear because of the aforementioned variance of the polls themselves. In this election, journalists and pundits face the typical dilemma, as there now appears to be a trend of support moving toward the PSDB’s candidate and away from the Workers’ Party’s candidate. But how can we be sure of that?

A standard approach to interpreting polling trends is loess regression, which fits a trend to the entire dataset. Its great benefit is that it dampens the influence of outliers and other sources of random variation. However, loess regression has a big drawback: it only works well when polls are plentiful, and in the case of Brazil we simply do not have enough polls to deploy such a technique.

The solution to this problem is a technique known as the Kalman filter (KF). It is used in a wide range of applications, mostly in engineering. For example, the KF has been used for trajectory estimation (the Apollo program) and for signal recognition in FM radios. In short, the technique involves two steps: (1) predict and (2) update.

Kalman Filtering

The Kalman filter applies a smoothing algorithm similar to loess regression in that it accounts for all polling data points. However, Kalman filtering also assigns weights to poll estimates, taking into account factors such as sample size and the amount of time elapsed between polls. It further accounts for the relative spuriousness of different polls by running an autoregression on the data at the outset.

In short, the technique yields an estimate at each point in time from the polling data, based on a variety of factors. It minimizes the error from the two sources of uncertainty in polling: the true level of support for a party/candidate, which may change from day to day by some unknown amount, and the error in the polling itself.
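The predict and update steps can be illustrated with a one-dimensional filter for a single candidate. This is a simplified sketch, not the model behind the results in this post. It assumes a random-walk ("local level") state, a made-up daily process variance, sampling variance p(1 - p)/n as the polling error, and hypothetical polls. It also leaves out the autoregression step mentioned above.

```python
def kalman_poll_filter(polls, process_var_per_day=1e-5,
                       init_mean=0.5, init_var=0.25):
    """Filter a list of (day, share, sample_size) tuples sorted by day.

    process_var_per_day, init_mean and init_var are assumed tuning values.
    Returns a list of (day, filtered_mean, filtered_variance).
    """
    mean, var = init_mean, init_var
    last_day = polls[0][0]
    estimates = []
    for day, share, n in polls:
        # Predict: true support may drift, so uncertainty grows with the
        # number of days elapsed since the previous poll.
        var += process_var_per_day * (day - last_day)
        # Update: weight the new poll by its precision (larger samples
        # have smaller sampling variance and pull the estimate harder).
        obs_var = share * (1 - share) / n
        gain = var / (var + obs_var)
        mean += gain * (share - mean)
        var *= 1 - gain
        estimates.append((day, mean, var))
        last_day = day
    return estimates


# Hypothetical polls: (day of campaign, share, sample size).
polls = [(0, 0.38, 2000), (7, 0.36, 2500), (15, 0.38, 1200), (21, 0.37, 2800)]
estimates = kalman_poll_filter(polls)
for day, mean, var in estimates:
    print(f"day {day:2d}: {mean:.3f} (sd {var ** 0.5:.3f})")
```

The filtered series moves less than the raw polls do: a small poll taken shortly after a large one shifts the estimate only slightly, while a long gap between polls lets the next reading count for more.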

Results

I applied the KF to this year’s polling data, using the polling data from the previous years (2012/2013) to set the prior covariance matrix. It produces the story of the election campaign depicted in the following figures. Each point is a polling result for a candidate on that date, and the trend-line is the trend as determined by the technique.






What I can say about the election is therefore that: (1) Support for Rousseff (PT), the incumbent, has remained almost constant, with a slight upward trend pushing it towards 40%. (2) Support for Neves (PSDB) has increased, and the trend-line suggests this may continue in the short term. (3) Support for Campos/Marina (PSB) has increased (or has begun recovering toward its elevated level of May 2014). (4) Support for other candidates has remained stable, with a slight upward trend. (5) The share of voters intending to vote for none of the candidates has decreased, and the trend-line suggests this may continue.

Overall, it shows that recent commentary about a decline in support for PT/Rousseff has so far been spurious when the whole campaign period is considered. However, another poll with a good sample size, like the Datafolha poll that recently put her at 36%, might change this interpretation.

The variance for the PT candidate is wider than for the other candidates because of the dramatic drop in support the incumbent saw after June 2013. Since I’m using this variance structure, it is particularly difficult to predict the interval on election day.


Bayesian Forecasting Discontinuity

July 17, 2014 by Daniel Marcelino | 1 Comment | Filed in Uncategorized

I’m finalizing a paper presentation for the ABCP meeting, in which I explore poor polling forecasts in local elections in Brazil. I drew upon Jackman’s (2005) paper “Pooling the Polls” to explore “house effects” in the Brazilian context. However, during the analysis I found myself extending his original model to fit vote intention before and after an exogenous shock: the season of broadcast political ads. In Brazil, political parties are entitled to campaign free of charge on broadcast media (radio and TV). The point is that this sometimes produces drastic changes in the vote distribution over a short period of time, so we can’t simply apply a Bayesian linear model, because that would violate some of the linearity assumptions.

In order to account for the advertising effect on popular support, I had to adapt the model so that the transition component (the random walk) breaks at the last poll before the ads season began and restarts with the first poll after it. The following chart says more about the problem. The black dots are the polls observed by the major pollsters, and the gray area is the 95% interval of what we usually don’t see: the daily fluctuations in vote intention.


Over the 58 weeks before the ads season began, the Workers’ Party candidate, Fernando Haddad, showed a weekly growth rate of 2.6%, but in the 5 weeks following the start of political advertising on radio and television, the same rate jumped to 9.46%. To put it simply, within a week, media exposure increased his popular support by more than 5%, which is more than he could achieve in one year of “informal campaigning”.
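The broken random walk described above can be written down roughly as follows. This is a sketch in the spirit of Jackman’s pooling model, not the exact specification used in the paper. The symbols are assumptions made for illustration: \(y_i\) is the share reported by poll \(i\), taken on day \(t_i\) by pollster \(j_i\); \(\alpha_t\) is the latent vote intention; \(\delta_j\) is a house effect; \(t^{*}\) is the day of the first poll after the ads begin; \(m_0\) and \(s_0^2\) define a restart prior.

```latex
\begin{align*}
y_i &\sim \mathcal{N}\!\left(\alpha_{t_i} + \delta_{j_i},\ \sigma_i^2\right)
  && \text{measurement: poll } i \text{ with sampling variance } \sigma_i^2 \\
\alpha_t &\sim \mathcal{N}\!\left(\alpha_{t-1},\ \omega^2\right)
  && \text{transition (random walk) for } t \neq t^{*} \\
\alpha_{t^{*}} &\sim \mathcal{N}\!\left(m_0,\ s_0^2\right)
  && \text{restart: no link to } \alpha_{t^{*}-1}
\end{align*}
```

Cutting the link at \(t^{*}\) lets the latent series jump when the ads start, instead of forcing the jump to be absorbed by the day-to-day variance \(\omega^2\).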


Parallel computing in R

July 1, 2014 by Daniel Marcelino | 1 Comment | Filed in Tutorials

Roughly a year ago I published an article about parallel computing in R here, in which I compared the computational performance of 4 packages that provide R with parallel features, since R is essentially single-threaded.

Parallel computing is incredibly useful, but not everything is worth distributing across as many cores as possible. In fact, when there are not enough repetitions, R performs better with serial computation. That is because R takes time to distribute tasks across the processors and, conversely, needs time to bind the results together afterwards. Therefore, if the time for distributing and gathering the pieces is greater than the time needed for single-thread computing, parallelizing is not worth it.
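The trade-off above can be made concrete with a toy cost model. The sketch below is written in Python rather than R, and it is an idealized assumption, not a measurement. It treats parallel run time as a fixed distribute-and-gather overhead plus the serial time divided evenly across cores. All timings are made up.

```python
def parallel_time(serial_seconds, cores, overhead_seconds):
    """Idealized parallel run time: fixed overhead plus evenly split work."""
    return overhead_seconds + serial_seconds / cores


def worth_parallelizing(serial_seconds, cores, overhead_seconds):
    """True when the idealized parallel run beats the serial run."""
    return parallel_time(serial_seconds, cores, overhead_seconds) < serial_seconds


cores = 4
overhead = 0.2  # hypothetical seconds spent distributing and gathering

# Under this model the break-even serial time is overhead * cores / (cores - 1).
break_even = overhead * cores / (cores - 1)
print(f"break-even serial time: {break_even:.3f}s")

for serial in (0.05, 20.0):  # a tiny job and a large job, both hypothetical
    verdict = "parallelize" if worth_parallelizing(serial, cores, overhead) else "stay serial"
    print(f"serial {serial:>5.2f}s -> parallel {parallel_time(serial, cores, overhead):.3f}s: {verdict}")
```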

In this post I’ll perform the same experiment using the same physical resources, except that I will run it in RStudio instead of Emacs. I want to check whether the packages have improved significantly since then.

I tested a nontrivial computation using four R functions: the base lapply, mclapply from the “multicore” package, parLapply from the “snow” package, and sfLapply from the “snowfall” package. The last three essentially provide parallelized equivalents of lapply.

The Experiment: These packages were used to distribute tasks across the four CPUs of my MacBook Pro with 8 GB of memory. The task was to average each column of a data frame built on the fly, repeating this procedure 100 times for each trial because I don’t want to rely on a single-round estimate. In addition, each trial demands a different amount of memory and computing time, since the matrix size varies over 1K, 10K, 100K, 1M, and 10M rows. The program I used to perform the tests is available here.

Overall, every function is doing a better job now than one year ago, but mclapply from the “multicore” package rocks! parLapply from the “snow” package comes second (a huge improvement since then). Since parallel code carries a programming cost, single-core computing (the lapply function) is still an option when datasets are very small (1k to 10k rows) and the difference in performance is modest, but if your data have more than 10k rows, lapply will be the least desirable function to use.

In my former experiment, lapply was the way to go for matrices of up to 10k rows. The second-best alternative was then mclapply; parLapply from “snow” and sfLapply from “snowfall” were simply too slow. A comparison between the older graph and the following one suggests that a “microrevolution” is taking place: the figures for distributing tasks changed even when the data vectors are small, say <10k rows. Distributing across CPUs now seems to be less time-consuming than running serially.



Do you believe in World Cup superstition?

June 25, 2014 by Daniel Marcelino | Comments Off | Filed in Uncategorized

If you believe in supernatural causality, you will love what the numbers of the World Cup have to say about which team is going to win this Cup in Brazil. According to this numerological approach, the winner will be neither Brazil, Germany, nor the Netherlands, but Uruguay. The table below shows my reasoning. It was produced using the googleVis package. Script here.
