Every campaign cycle I do much the same thing: go to a repository, download a bunch of data, then merge and store it in an existing RData file for later analysis. I've already written about this topic some time ago, but this time I think my script has become simpler.

Set the Directory

Let's assume you're not in the same directory as your files, so you'll need to point R to where the files reside.

setwd("~/Downloads/consulta_cand_2014")

Getting a List of files

Next, it’s just a matter of getting to know your files. For this, the list.files() function is very handy, and you can see the file names right away on your screen. Here I'm looking for the "txt" files, so I want my list of files to exclude everything else, like pdf, jpg, etc.

files <- list.files(pattern= '\\.txt$')

Sometimes you may find empty files that prevent the script from running successfully. Thus, you may want to inspect the files beforehand.

info = file.info(files)
empty = rownames(info[info$size == 0, ])
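Once the empty files are identified, they can be dropped from the list before reading. The following is a minimal sketch that runs in a temporary directory; the file names a.txt and b.txt are made up for the demonstration.

```r
# Demo in a temporary directory with made-up file names
demo_dir <- file.path(tempdir(), "empty_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("1;2;3", file.path(demo_dir, "a.txt"))
file.create(file.path(demo_dir, "b.txt"))   # zero-byte file

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
info  <- file.info(files)
empty <- rownames(info[info$size == 0, ])

# Keep only the files that actually contain data
files <- setdiff(files, empty)
basename(files)
```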

Moreover, in case you have the same files in more than one format, you may want to filter out the duplicates, as in the following:

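CSVs <- list.files(pattern = '\\.csv$')
TXTs <- list.files(pattern = '\\.txt$')
# Compare base names (without extension), since "a.csv" never equals "a.txt"
dup <- tools::file_path_sans_ext(CSVs) %in% tools::file_path_sans_ext(TXTs)
mylist <- CSVs[!dup]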

Stacking files into a dataframe

The last step is to iterate "rbind" over the list of files in the working directory, putting them all together.
Notice that in the script below I've included some extra arguments to avoid problems reading the files I have. Also, this assumes all the files have the same number of columns; otherwise "rbind" won't work. In that case you may need to replace "rbind" with "smartbind" from the gtools package.

cand_br <- do.call("rbind", lapply(files, function(f) {
  read.table(f, header = FALSE, sep = ";", stringsAsFactors = FALSE,
             fileEncoding = "cp1252", fill = TRUE, blank.lines.skip = TRUE)
}))
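If the files do not all have the same columns, one base-R alternative is to pad each data frame with NA columns before stacking. This is only a sketch: rbind_fill is a hypothetical helper written for this example, not a function from gtools or any other package, and the two small data frames are invented.

```r
# Hypothetical helper: stack data frames whose column sets differ,
# filling the columns a frame lacks with NA (base R only).
rbind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA          # add each absent column as NA
    }
    df[all_cols]               # put columns in a common order
  })
  do.call("rbind", padded)
}

# Invented example data: the second frame has an extra column V3
a <- data.frame(V1 = 1:2, V2 = c("x", "y"), stringsAsFactors = FALSE)
b <- data.frame(V1 = 3L, V2 = "z", V3 = 9.5, stringsAsFactors = FALSE)
stacked <- rbind_fill(list(a, b))
```

With files read using header=FALSE the columns are named V1, V2, and so on, so matching by name amounts to matching by position.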

Uruguayan voters are about to give the Frente Amplio a third mandate this November 30th. The following graph shows what the outcome would look like if the election were held this week. The undecided voters were distributed among the parties as they would be by election day. The picture plots the probability density function (pdf) of the vote support for the FA and the PN, based on the numbers published by the major polling houses. The script can be found here.

As the picture suggests, the posterior densities are far apart from each other and their confidence regions are narrow, meaning that there is little uncertainty about the results in that region.

Rplot

Within 2 weeks, electors in Uruguay will vote in the runoff election between FA and PN. According to the polling data being published, it's very likely Uruguayans will give FA a third mandate. I ran the following forecast model, which suggests that the difference between the two parties is huge; even greater than the number of undecided voters.

Tabaré Vázquez - Frente Amplio

FA

Luis Lacalle Pou - Partido Nacional

PN

The latest polls, released tonight, suggest a numerical tie between Dilma Rousseff (PT) and Aecio Neves (PSDB) once the margin of error is taken into account. Actually, these polls point to a possible game changer for the opposition over the government, as some of them did capture whatever impact the televised debate on Friday night had.

There is still a lot of uncertainty around; roughly 5% of the electorate were still reported to be undecided. Nonetheless, when I ran my model today, it put Aecio ahead of Dilma by a small margin (< 1%). These numbers also account for wasted votes, so they will typically diverge from the official results.

DILMA

PT

AECIO

PSDB

WASTING VOTES

WASTE

House Effects

Although the machinery behind the model I'm running allows for drawing several elections from data, it's too risky to call one side or the other given the pollsters' credibility, which was certainly damaged by their poor performance 3 weeks ago. Meanwhile, I've been trying to learn the pollsters' random walk in the Brazilian campaigns, but given the small number of observations this will take a while to produce robust measures.

The following chart shows the house effects for the polls released over the runoff campaign. Ideally, a pollster would have its effect equally distributed between the positive and negative bands. Like a drunkard's walk, a pollster could stagger left and right near each party or candidate. Not surprisingly, however, the picture shows two blocks of bias. While the first 4 pollsters typically fielded more positive numbers for the government, the last 3 did so for the opposition. In addition, the house effects found for Datafolha, Veritá, and Sensus are statistically different from zero.

houseeffects

Taking a less systematic approach to the house effects, adjusting only for the sample size of the polls, regardless of the methodology employed (probability with quotas or simply quotas), Dilma appears ahead with an interval of [2.5% to 5.7%], as represented in the following distribution. This happens because polling firms with a house effect toward the government happen to sample many more people than the others, though the methodology they use to sample such a large number of voters is poorer. This election is so mercurial that a wrong decision on the precision parameter can sway the outcome from one side to the other.
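To see why the precision parameter matters so much, here is a rough sketch of pooling polls weighted only by sample size. The three polls are invented, the shares are assumed to be two-candidate shares, and the precision formula assumes simple random sampling; none of this is the actual model or data behind the interval above.

```r
# Made-up polls for illustration only (not the actual 2014 data)
polls <- data.frame(
  pollster = c("A", "B", "C"),
  n        = c(2000, 9000, 1500),
  dilma    = c(0.48, 0.53, 0.47)   # assumed share of the two-candidate vote
)

# Precision of a proportion under simple random sampling: n / (p * (1 - p))
polls$precision <- polls$n / (polls$dilma * (1 - polls$dilma))

# Precision-weighted average and its standard error
pooled    <- sum(polls$precision * polls$dilma) / sum(polls$precision)
pooled_se <- sqrt(1 / sum(polls$precision))

# Lead of Dilma over Aecio in two-way shares, with a rough 95% interval
lead     <- 2 * pooled - 1
interval <- lead + c(-1.96, 1.96) * 2 * pooled_se
```

In this toy example the pollster with 9,000 interviews dominates the pooled estimate, pulling it above the simple average of the three polls. That is the mechanism described above: whoever samples the most people drives the result when sample size is the only adjustment.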

Dilma

Although polling data are the most common source of information in an electoral campaign, there are also models that use prediction market data (the flow of traded contracts) as the source of information about who is going to win the election. The best way of predicting an election is up for debate, but models based on the wisdom of crowds have been used extensively on the web for all-purpose forecasting, including prices, sales, and disasters. Actually, the range of events on which a bet can be traded has increased over the years; for elections, it is an obvious step to take.

The debate can be framed as follows: Berg et al (2001) compared opinion polls and market-based predictions from 19 national elections, finding evidence that market predictions provide a serious alternative to opinion polls. Not surprisingly, this argument is contested. For instance, Erikson and Wlezien (2008) argue that opinion polls reflect opinion on the day they were collected, and therefore should not be naively interpreted as forecasts. That much is pretty much a consensus in the literature, but they go further and suggest that, if opinion poll data are appropriately adjusted, they will outperform market predictions.

Evidence for the Brazilian Election

This suggests that market-based prediction provides a serious alternative to opinion polls in predicting political contests. So what does the market say about the outcome of the Brazilian election so far?

The evidence used in this study, which is still under construction, was retrieved from the history of the odds offered on “Dilma” and “Aécio” by 23 bookmakers between Sep 2013 and May 2014.
The bookmaker Ipredict launched some contracts addressing the outcome of the Brazilian election. One of them says: "This contract pays $1 if the President of Brazil following the next General Election is a member of the Brazilian Workers' Party. Otherwise, this contract will close at $0." In other words, the purpose of this contract is to forecast the probability that the President of Brazil is a member of the Brazilian Workers' Party. A similar contract was in place for a member of the Brazilian Social Democracy Party (PSDB).
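For a contract that pays $1, the trading price can be read directly as a probability. Bookmaker odds need one more step. The sketch below assumes decimal odds (the format is not stated in the data description) and uses invented numbers; implied_prob is a hypothetical helper written for this example.

```r
# Convert decimal odds into implied probabilities,
# removing the bookmaker's margin by rescaling.
implied_prob <- function(decimal_odds) {
  raw <- 1 / decimal_odds   # raw implied probabilities (usually sum to more than 1)
  raw / sum(raw)            # rescale so they sum to 1
}

# Hypothetical odds from one bookmaker on one day
odds  <- c(Dilma = 1.20, Aecio = 4.50)
probs <- implied_prob(odds)
round(probs, 3)
```

Repeating this for each bookmaker and day, and then averaging across bookmakers, would give a daily series of market-based probabilities.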

app

Figure 1: Market-based probability of “Dilma” victory in the 2014 election

It can be seen that over the last few days the market has been betting heavily on Dilma. As of today, the probability of seeing her as the next president was about 80%, compared to 20% for Aecio Neves.

Opinion Polls

The obvious standard comparator to market-based predictions is the regular opinion polls, which continuously measure vote intentions for the candidates. In the following box, I show some predictions based on a Bayesian model I've been developing this year. It aggregates many polls and filters them based on sample size and the time elapsed between one poll and the next. These predictions also include wasted votes, so they will typically diverge from the official results. However, the important point is that, considering the mean of the prediction, Dilma has a 75% probability of winning this race, pretty close to the prediction market, isn't it?

Probability Intervals (95%):
               2.5%          50%      97.5%
PT       0.40575221  0.471395889 0.53077824
PSDB     0.38681539  0.450461758 0.49911252
WASTING  0.06448501  0.088751737 0.11329596
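A win probability of this kind is usually computed as the share of posterior draws in which one candidate finishes ahead of the other. The sketch below shows the mechanics on simulated draws. The Dirichlet concentration values are made up, so the resulting number is illustrative and is not meant to reproduce the 75% quoted above.

```r
set.seed(2014)

# Illustrative draws only: Dirichlet-distributed shares built from gamma variates.
# The concentration values are invented, not the model's actual posterior.
alpha   <- c(PT = 118, PSDB = 113, WASTING = 22)
n_draws <- 10000
g       <- sapply(alpha, function(a) rgamma(n_draws, shape = a))
draws   <- g / rowSums(g)          # each row of shares sums to 1

# 2.5%, 50% and 97.5% quantiles per category, in the same layout as the table above
intervals <- t(apply(draws, 2, quantile, probs = c(0.025, 0.5, 0.975)))

# Probability that PT finishes ahead of PSDB
p_win <- mean(draws[, "PT"] > draws[, "PSDB"])
```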

I’ve been looking at the polls since last year, and I never doubted the Workers’ Party would make it again, though not so easily, because of the economic downturn.