Using R to derive the German election manifesto word clouds

In just a month, the largest country in Europe, Germany, will run for elections. In this short guide, I will write the text of the campaign manifestos of the main parties to visualize the state of German politics.

What are the topics of German politics?

I don’t know a single person who reads election manifestos. In order, however, to get an idea of what they are talking about, I have processed the text of the current manifestos on the general elections in Germany. The overall result is summarized in the word cloud above showing which side dominates which word.

Words are sized according to how often they appear in campaign manifestos, and are colored according to the manifesto where they appear most often. The colors will be familiar to most readers, but I’ve also included the party names just in case.

The word clouds for each political party cover different topics. I draw the top 100 words of each manifest and scale the words by frequency.

Chancellor Merkel’s ruling conservative parties CDU / CSU appear to be focused on Germany (Germany) and, to a lesser extent, Europe (Europe). Otherwise, community-related words (menschen, unser, unserer) are visible.

The other ruling party, the Social Democratic SPD, seems unable to find a strong verbal focus, preferring instead to speak more or less the same about Germany (Germany) and Europe (Europe). The relatively low incidence of social justice (Gerechtigkeit) is noticeable given that the SPD tried to make it a topic for these elections.

The largest opposition party at present, the far-left DIE LINKE, speaks more than other parties about “the need to do something” (müssen) and people (menschen). However, they failed the competition to promise “more” (mehr) to the Social Democrats. Otherwise, the terminology of the welfare state turns out to be quite common (solidarische, sozial (en), pflege, arbeit).

People (menschen), the future (zukunft) and Germany (deutschland) seem to be the most common words for greenery. Surprisingly, words related to the environment (ökologische) hardly make it into the top 100 words displayed in the word cloud.

The liberal, pro-business FDP is likely to re-enter the federal parliament after a four-year hiatus. They are expected to use business vocabulary frequently (leistungen, unternehmen, wirthschaft, wettbewerb). Perhaps in order to look like a modern party, a lot of computer vocabulary is also included (digitalisierung, digitalen, daten).

The right-wing populist AFD is expected to enter the federal parliament for the first time this fall. Their admiration for this perspective seems to be reflected in the frequent reference to “general elections” (bundestagswahl). They unexpectedly lose in the competition to mention Germany (Germany) the most in front of the CDU / CSU Merkel. Note that the vocabulary associated with foreigners coming to Germany turns into their word cloud (asyl, islam, zuwanderung, türkei).

By the way, a certain Martin Schweinberger made a similar project for the last general elections in Germany, published in his blog. Compare his word cloud to mine and see how German political discourse has changed. For example, four years ago Germany (Germany) was mentioned much less, and this word was dominated by the liberal FDP. This is very far from the 2017 situation.

Next, I’ll show you how you can build these lots from the comfort of your own home. The complete script can be found on github.

Making word clouds with R

We start off by acquiring the election manifestos directly from the party websites. We put the urls in maifestos_pdf and then loop through them using sapply. sapply uses a nameless function defined on the spot. That function prepares a location dest in the tmp folder where the manifestos will be stored. It then downloads the manifesto of this loop iteration and uses pdftotext.exe (which you should download before) in order to turn the pdf into a txt file whose name is returned by the function through its last statement.

 manifestos_pdf = c('https://www.cdu.de/system/tdf/media/dokumente/170703regierungsprogramm2017.pdf',                     'https://www.spd.de/fileadmin/Dokumente/Bundesparteitag_2017/Es_ist_Zeit_fuer_mehr_Gerechtigkeit-Unser_Regierungsprogramm.pdf',                     'https://www.die-linke.de/fileadmin/download/wahlen2017/wahlprogramm2017/die_linke_wahlprogramm_2017.pdf',                     'https://www.fdp.de/sites/default/files/uploads/2017/08/07/20170807-wahlprogramm-wp-2017-v16.pdf',                     'https://www.gruene.de/fileadmin/user_upload/Dokumente/BUENDNIS_90_DIE_GRUENEN_Bundestagswahlprogramm_2017.pdf',                     'https://www.afd.de/wp-content/uploads/sites/111/2017/06/2017-06-01_AfD-Bundestagswahlprogramm_Onlinefassung.pdf')    #download and install pdftotext.exe    files_txt = sapply(manifestos_pdf, function(x) {    dest <- tempfile(pattern = substr(x, 13, 15), fileext = ".pdf") # prepare a tmp file which includes party in name    download.file(x, dest, mode = "wb") # fill the tmp file with a downloaded election manifesto    system(paste('"C:\Program Files\xpdf-tools-win-4.00\bin64\pdftotext.exe" -layout ', # make windows use this pdf-to-txt program                 """, dest, """, sep = ""), # on this election manifesto           wait = F)    sub(".pdf", ".txt", dest)}) #return name of final txt file   

These txt files allow us to do some text mining with the R library tm.

 if(!require(tm)){install.packages('tm')}  library(tm) # text mining   

We load the txt files using the Corpus() function and then turn the corpus into a so-called term document matrix which represents words as rows and documents in which these words occur as columns. Cell entries are word frequency counts. The second step is combined with some text pre-processing (turning all to lower case, removing stop words, etc.)

 txt <- Corpus(URISource(files_txt),                readerControl = list(reader = readPlain, language = 'german'))    txt.tdm <- TermDocumentMatrix(txt, control = list(removePunctuation = TRUE,                                                    stopwords = T,#I am not sure this works well in German                                                    tolower = T,                                                    stemming = F,#disabled to get full words                                                    removeNumbers = TRUE,                                                    bounds = list(global = c(3, Inf))))#only words mentioned at least three times across manifestos   

Given the variable length of the manifestos, we don’t want to give a party an advantage for just writing a lot. So, we adjust word frequencies by how many words are in the document using the prop.table() function. This is also a good moment to get proper column headers saved in party_names.

 ft_matrix_prop = prop.table(as.matrix(txt.tdm), margin = 2)#column proportions  party_names = c("CDU/CSU", "SDP", "Die Linke", "FDP", "Grüne", "AFD")  colnames(ft_matrix_prop) <- party_names # add publication ready column labels   

And that’s it in terms of data preparation. We can start plotting using the wordcloud package. The first plot showing which party dominates which word uses traditional party colours defined in party_colours and the comparison_cloud() function. Note that the maximum word count is not per party but instead for the whole plot. We add the caption into the margin using R’s mtext() function. Side 1 is the bottom and lines are counted from the inside out.

 if(!require(wordcloud)){install.packages('wordcloud')}  library(wordcloud)    party_colours = c('black', 'red', 'magenta 3', 'orange', 'dark green', 'blue')    comparison.cloud(ft_matrix_prop, max.words = 200, random.order = FALSE,                   colors = party_colours, title.size = 1.5)  mtext('@ri', side = 1, line = 4, adj = 0) # caption   

The party specific word clouds follow the same principle. However, in order to plot all words, the word size had to be reduced through the scale parameter. I added a title using R’s mtext() function to unambiguously show which party is meant.

 for (i in 1:length(party_names)) {     wordcloud(rownames(ft_matrix_prop), ft_matrix_prop[,i],              min.freq = 0.001, max.words = 100,              colors = party_colours[i],              scale = c(3.2, .4))    mtext('@ri', side = 1, line = 4, adj = 0) # caption    mtext(party_names[i], side = 3, line = 3, adj = 0.5) # title  }   

Like this post? Share it with your followers or follow me on Twitter!

What are the topics of German politics?

Making word clouds with R

Leave a Comment Cancel Reply