World Cup Results (Before They Happen)

Brazil and Argentina will easily win their respective groups. The home team will knock out the Netherlands, Uruguay and Germany on their way to the final, while the representative of the southernmost O&A office will need a penalty shootout in the semis to claw their way into their first final in 24 years. The historical rivals will then meet in one of the hottest matches witnessed since 1950, the first in 64 years to be played by two Latin American teams.

This is neither fortune telling nor science fiction – it’s the prediction returned by a stochastic model that generates a distribution of outcomes for each of the matches to be played in Brazil 2014. By running a full regression analysis of international football matches since 1960, Goldman Sachs produced a set of coefficients that aim to predict the number of goals each team scores against its opponent.

Variables in this analysis include:

  • The difference in Elo ratings between the two contestants, which reflects each team’s success rate over the entire history of international football matches. This proves to be the most powerful variable in the model.
  • The average number of goals scored by the team in the last 10 mandatory international games.
  • The average number of goals received by the opposing team over the last 5 mandatory international games.
  • A country-specific dummy (that is, binary) variable indicating whether the game in question took place at a World Cup. This variable is meant to capture whether a team has a tendency to systematically outperform (such as Germany, the team that has made it to the highest number of finals) or under-perform (historically a disgraceful habit of Spain – only broken in 2010, and vigorously renewed yesterday). The model only includes this variable for countries that have played a sufficient number of post-1960 World Cup games (Brazil, Germany, Argentina, Spain, Netherlands, England, Italy and France).
  • A dummy variable indicating whether the team played in its home country.
  • A dummy variable indicating whether the team played on its home continent, with coefficients that are allowed to vary by country.
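To make the list above concrete, here is a minimal sketch of how such variables could be combined into an expected number of goals for one team. It assumes a Poisson-style regression with a log link, which the description above does not confirm, and every name and coefficient value below is a made-up placeholder rather than a Goldman Sachs estimate.

```python
import math
from dataclasses import dataclass


@dataclass
class MatchFeatures:
    """Inputs for one team in one match (field names are illustrative)."""
    elo_diff: float                # team's Elo rating minus the opponent's
    goals_scored_avg: float        # team's average goals scored, last 10 games
    opp_goals_conceded_avg: float  # opponent's average goals conceded, last 5 games
    world_cup_effect: float        # country-specific World Cup term, 0.0 if none
    home_country: bool             # playing in its own country
    home_continent: bool           # playing on its own continent


# Hypothetical coefficients, chosen only so the example runs.
# The real model lets the home-continent coefficient vary by country;
# a single value is used here to keep the sketch short.
COEF = {
    "intercept": -0.2,
    "elo_diff": 0.0015,
    "goals_scored_avg": 0.10,
    "opp_goals_conceded_avg": 0.10,
    "home_country": 0.25,
    "home_continent": 0.05,
}


def expected_goals(f: MatchFeatures, coef: dict = COEF) -> float:
    """Expected goals under the assumed log-linear (Poisson-style) form."""
    linear = (
        coef["intercept"]
        + coef["elo_diff"] * f.elo_diff
        + coef["goals_scored_avg"] * f.goals_scored_avg
        + coef["opp_goals_conceded_avg"] * f.opp_goals_conceded_avg
        + f.world_cup_effect
        + coef["home_country"] * f.home_country
        + coef["home_continent"] * f.home_continent
    )
    return math.exp(linear)


# Invented inputs for a host nation facing a weaker opponent.
host = MatchFeatures(150.0, 2.0, 1.0, 0.1, True, True)
print(round(expected_goals(host), 2))
```

The exponential keeps the expected goal count positive whatever the inputs, which is the usual reason to pick a log link for count data.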

A simulation with 100,000 draws was then generated for each game and used to predict individual results from the coefficients mentioned above. Some predictions that hit the nail right on the head (Korea 1-1 Russia and Argentina 2-1 Bosnia) seem to suggest the model is accurate.
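A simulation of this kind can be sketched in a few lines. The version below assumes each team's goals follow an independent Poisson distribution, which is a common simplification and not necessarily what Goldman Sachs did; the expected-goal values passed in are invented, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def poisson_draw(rng: random.Random, lam: float) -> int:
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_match(lam_a: float, lam_b: float,
                   draws: int = 100_000, seed: int = 2014) -> dict:
    """Simulate one match `draws` times and summarise the outcomes."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    scores = Counter()
    for _ in range(draws):
        scores[(poisson_draw(rng, lam_a), poisson_draw(rng, lam_b))] += 1
    return {
        "win_a": sum(n for (a, b), n in scores.items() if a > b) / draws,
        "draw": sum(n for (a, b), n in scores.items() if a == b) / draws,
        "win_b": sum(n for (a, b), n in scores.items() if a < b) / draws,
        "most_likely_score": scores.most_common(1)[0][0],
    }


# Invented expected goals: team A 1.6, team B 1.1.
print(simulate_match(1.6, 1.1))
```

Reporting the single most frequent scoreline is one way to turn the distribution into a headline prediction; it also shows why exact scores are hard to hit, since even the most likely scoreline usually carries a small share of the draws.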

However, maybe because statistics cannot account for the performance (or even presence) of individual players, human judgment and (sadly) the probability that spurious interests actually do exist, the success rate of these predictions has been pretty modest so far: the model called the right outcome (a draw, or the correct winner) in only 7 of 20 matches, and it hit the exact score just twice, in the two games mentioned above. It gets worse when one notices huge misses, such as sending Spain to the semi-finals when they had booked their tickets home right after game 2.

(Note that we don’t need to invoke a legendary oracle like Paul the Octopus, with 12 correct guesses out of 14 tries, to find precedents of far better success rates. Actually, pretty much everyone the author knows seems to be doing far better at predictions, which could end up with him losing numerous bets with friends.)
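For scale, the hit rates quoted above work out as follows. The helper name is just for illustration; the counts are the ones given in the text.

```python
def hit_rate(correct: int, total: int) -> float:
    """Share of correct predictions, as a percentage."""
    return 100.0 * correct / total


print(f"Model, right outcome: {hit_rate(7, 20):.0f}%")   # 35%
print(f"Model, exact score:   {hit_rate(2, 20):.0f}%")   # 10%
print(f"Paul the Octopus:     {hit_rate(12, 14):.0f}%")  # 86%
```

Picking one of the three possible outcomes (win, draw, loss) uniformly at random would be right about a third of the time, so 35% is barely above guessing.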

Anyway, despite the interest in building this type of model and in applying the same principles to data mining, the author begs for as much deviation as possible between predictions and reality, even if it means the national team waving its fans goodbye in week 2. It would mean that, to some extent, soccer is still a game in which everything boils down to passing and dribbling a ball through a rough sea of proud opponents – one that refuses, and will always refuse, to be systematized.