Analyzing polls can be particularly frustrating, as each opinion poll has a margin of error of approximately 3 percentage points, which gives an uncertainty range of about 6 points. These days, for example, some polls are giving PT’s incumbent candidate, Dilma Rousseff, 38%, others 36%. This makes it difficult to tell whether the 36% is wrong or simply within the realm of possibility.
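To make the arithmetic concrete, here is a minimal sketch of the usual margin-of-error formula for a single poll. It assumes simple random sampling and a hypothetical sample size of 1,000, which is not taken from any of the polls mentioned here.

```python
import math

def margin_of_error(share, sample_size, z=1.96):
    """Approximate 95% margin of error for a poll share,
    assuming simple random sampling (a simplification)."""
    return z * math.sqrt(share * (1.0 - share) / sample_size)

# 38% is the figure quoted above; the sample size is a hypothetical value.
share = 0.38
moe = margin_of_error(share, sample_size=1000)
low, high = share - moe, share + moe
print(f"{share:.0%} +/- {moe:.1%} gives [{low:.1%}, {high:.1%}]")
```

Under these assumptions a poll showing 36% sits inside the interval around 38%, so the two readings cannot be told apart by this measure alone.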
The first conclusion from forecasting with polls is that we cannot rely on a single poll to ascertain what is going on in the campaign, or what the trends are, if any. A typical approach to reduce this uncertainty is to aggregate all polls (a poll-of-polls). Nonetheless, when there is an underlying trend in the polls, this approach will yield biased results, as it gives the oldest polls the same weight as the most recent ones.
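A toy illustration of that bias, using made-up and noise-free numbers: when support is rising steadily, the equal-weight average lags well behind the latest reading.

```python
# Hypothetical polls (in %) for a candidate whose support rises steadily.
polls = [30.0, 32.0, 34.0, 36.0, 38.0, 40.0]

# Poll-of-polls: every poll gets the same weight, old or new.
pooled = sum(polls) / len(polls)

# Gap between the most recent poll and the pooled average.
lag = polls[-1] - pooled
print(f"pooled average: {pooled:.1f}%, latest poll: {polls[-1]:.1f}%, lag: {lag:.1f} points")
```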
The next approach, then, is to combine only recent polls. This has problems too. First, the stronger the trend, the more polls need to be discarded, which reduces the number of polls available. What is more, the choice of which set of polls to use is inherently unclear because of the aforementioned variance of the polls themselves. In this election, journalists and pundits face the typical dilemma: there now appears to be a trend in support towards PSDB’s candidate and away from the Workers’ Party’s candidate. But how can we be sure of that?
A standard approach to interpreting polling trends is loess regression, which fits a trend to the entire dataset. Its great benefit is that it dampens the influence of outliers and other sources of random variation. However, loess has a big drawback: it only works well when polls are plentiful, and in the case of Brazil we simply do not have enough polls to deploy such a technique.
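For readers unfamiliar with loess, the sketch below shows its core step: a weighted straight-line fit around each point, with tricube weights that fade out with distance. It is a simplified illustration (no robustness iterations), the function name and the data are hypothetical, and it is not the code behind any figure in this post.

```python
def loess_point(xs, ys, x0, frac=0.7):
    """Local linear fit at x0 using the nearest `frac` share of points."""
    k = max(2, int(round(frac * len(xs))))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
    w = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0  # too few points: fall back to a local mean
    return my + slope * (x0 - mx)

# Hypothetical data lying exactly on the line y = 2x + 1.
days = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
support = [2.0 * d + 1.0 for d in days]
fitted = loess_point(days, support, x0=3.0)
print(fitted)
```

Each local fit only sees the handful of polls inside its window, which is why the method struggles when polls are sparse.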
The solution to this problem involves a technique known as Kalman filtering. It is an iterative linear update procedure for maximum a posteriori probability estimation when the parameters to be estimated change over time according to a linear dynamical system. The Kalman filter is used in a wide range of applications, from trajectory estimation for the Apollo program to signal identification in FM radios. Kalman filtering involves two steps: (1) predict and (2) update.
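The two steps can be sketched for the simplest possible case, a single number (a candidate's support) that drifts as a random walk. This is an assumption made for illustration; the function names and all numeric values are hypothetical.

```python
def predict(mean, var, process_var):
    """Predict step: under a random walk the best guess is unchanged,
    but uncertainty about the true support grows."""
    return mean, var + process_var

def update(mean, var, obs, obs_var):
    """Update step: blend the prediction with a new poll reading."""
    gain = var / (var + obs_var)           # Kalman gain, between 0 and 1
    new_mean = mean + gain * (obs - mean)  # move towards the poll
    new_var = (1.0 - gain) * var           # uncertainty shrinks
    return new_mean, new_var

# Hypothetical prior belief: 38% support with variance 4 (points squared).
mean, var = 38.0, 4.0
mean, var = predict(mean, var, process_var=1.0)
# A hypothetical new poll shows 36% with sampling variance 5.
mean, var = update(mean, var, obs=36.0, obs_var=5.0)
print(mean, var)
```

Here the predicted variance equals the poll's variance, so the gain is one half and the estimate lands midway between the prior and the poll.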
The Kalman filter applies a smoothing algorithm similar to loess regression in that it accounts for all polling data points. However, Kalman filtering also assigns weights to poll estimates by taking into account such factors as sample size and the amount of time elapsed between polls. It also accounts for how unreliable individual polls are by running an autoregression on the data at the outset.
In short, the technique yields estimates from polling data at each point in time based on a variety of factors. It minimizes the error from the two sources of uncertainty in polling: the true level of support for a party or candidate, which may change from day to day by some unknown amount, and the polling error itself.
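A minimal sketch of how sample size and elapsed time can enter the filter, assuming a one-candidate random-walk model. The function name, the daily drift variance, the prior, and the polls are all hypothetical; this is not the model used for the figures in this post.

```python
def kalman_filter_polls(polls, prior_mean, prior_var, daily_var):
    """polls: list of (day, share, sample_size) sorted by day, shares in [0, 1].
    Returns a list of (day, filtered_mean, filtered_std)."""
    mean, var = prior_mean, prior_var
    last_day = polls[0][0]
    estimates = []
    for day, share, n in polls:
        # Predict: true support may have drifted, more so after a longer gap.
        var += daily_var * (day - last_day)
        # Polling error: sampling variance shrinks as the sample grows.
        obs_var = share * (1.0 - share) / n
        # Update: larger samples (smaller obs_var) pull the estimate harder.
        gain = var / (var + obs_var)
        mean += gain * (share - mean)
        var *= 1.0 - gain
        estimates.append((day, mean, var ** 0.5))
        last_day = day
    return estimates

# Hypothetical polls: (day of campaign, share, sample size).
polls = [(0, 0.38, 2000), (7, 0.36, 2500), (15, 0.38, 2000), (21, 0.36, 2800)]
estimates = kalman_filter_polls(polls, prior_mean=0.38, prior_var=0.03 ** 2, daily_var=0.002 ** 2)
for day, m, s in estimates:
    print(f"day {day:2d}: {m:.3f} +/- {1.96 * s:.3f}")
```

The two variance terms correspond to the two sources of uncertainty named above: `daily_var` for day-to-day movement in true support, and `obs_var` for the error in each poll.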
I applied the Kalman filter to this year’s polling data, using the polling data from previous years (2012/2013) to set the prior covariance matrix. The result is the story of the election campaign depicted in the following figures. Each point is a polling result for a candidate on that date, and the trend-line is the trend as estimated by the filter.
What I can say about the election is therefore that:
(1) Support for Rousseff (PT), the incumbent, has remained almost constant, with a slight trend pushing it towards 40%.
(2) Support for Neves (PSDB) has increased, and the trend-line suggests this may continue in the short term.
(3) Support for Campos/Marina (PSB) has increased (or begun recovering to the elevated level it had in May 2014).
(4) Support for other candidates has remained stable, with a slight upward trend.
(5) The share of voters intending to vote for none of the candidates has decreased, and the trend-line suggests this may continue.
Overall, this shows that recent commentary about a decline in PT/Rousseff support has so far been spurious when the campaign is considered as a whole. That said, another poll with a good sample size, like the recent Datafolha one that put her at 36%, might change this interpretation.
The variance for the PT candidate is wider than for the other candidates because of the sharp drop in the incumbent’s support after June 2013. Since I’m using this variance structure, it is particularly difficult to predict the interval on election day.
Tags: Bayesian, Brazilian elections, Kalman