How close are Germany’s election manifestos in meaning?
If you want to know what a text says, there is no good substitute for reading it. But no one reads election manifestos. So, how about asking an algorithm from computational linguistics to give us a general idea of how much the election manifestos for the upcoming federal parliamentary elections in Germany agree with one another?
The algorithm is called Latent Semantic Analysis (sometimes called Latent Semantic Indexing). It knows nothing about word order or the meaning of words. It simply assumes that words that regularly appear together in text units (in this case, paragraphs) mean similar things.
This idea allows the algorithm to infer latent ("hidden") semantic ("meaning") dimensions, such as good-bad, excited-calm, or gossipy-serious. Semantic dimensions really can be anything. It all depends on the texts that you feed into the analysis.
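To make this a little more concrete, here is a toy sketch in R (with made-up counts, not the real data used below) of what the lsa package extracts from a small word-by-paragraph matrix:

library(lsa)

#made-up counts of five words in four paragraphs; rows are words, columns are paragraphs
toy <- matrix(c(2, 1, 0, 0,
                1, 2, 0, 0,
                0, 0, 2, 1,
                0, 0, 1, 2,
                1, 1, 1, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(c('steuer', 'wirtschaft', 'klima', 'umwelt', 'deutschland'),
                              paste0('paragraph', 1:4)))

toy_space <- lsa(toy, dims = 2) #two latent dimensions are plenty for this toy case
round(toy_space$tk, 2)          #word loadings on the two latent dimensions
round(toy_space$dk, 2)          #paragraph coordinates in the same two-dimensional space

The rows of toy_space$dk place each paragraph in the toy semantic space, which is exactly what happens to the manifesto paragraphs below, just with thousands of words and ten dimensions.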
I included in the analysis the election manifesto paragraphs of the six parties that are likely to be represented in the next parliament. I limited myself to the first three latent semantic dimensions in order to be able to plot the result. The importance of semantic dimensions decreases as you go down the list, so the first three are a natural subset for plotting.
The first thing you might notice is the concentration of points around the origin (0; 0; 0). Typically, these are short paragraphs with so few words that the analysis cannot determine what meaning to give them. Thus, they remain neutral.
By clicking on the legend entry for each party, you can see how much it overlaps with the other parties. While the new right-wing populist party, the AfD, is reputedly quite far from the post-war German consensus, it does not stand out in any way. Instead, the Greens are the party with the most scattered dots, that is, paragraphs with meanings that are far from all other paragraphs in the sample. However, note a caveat: the Greens also write the longest paragraphs, which allows for more extreme placement. The AfD, on the other hand, tends to write much shorter paragraphs, and thus latent semantic analysis is likely to underestimate its radicalism.
Finally, if you hide all the dots and focus only on the plus signs, you essentially see the average of each party's election manifesto paragraphs. There are different ways of looking at this, but I see six well-spaced markers, which suggests the next German parliament is likely to represent a healthy mix of different points of view.
Out of curiosity, I increased the dimension count to 10 and looked up, for each party, which other party is closest to it. The following table lists the distances in arbitrary semantic units and orders them from closest to farthest.
| CDU/CSU | SPD | Die Linke | FDP | Grüne | AfD |
|---|---|---|---|---|---|
| AfD (0.010) | AfD (0.011) | SPD (0.014) | AfD (0.028) | SPD (0.013) | CDU/CSU (0.010) |
| SPD (0.014) | Grüne (0.013) | AfD (0.016) | SPD (0.028) | CDU/CSU (0.018) | SPD (0.011) |
| Grüne (0.018) | CDU/CSU (0.014) | CDU/CSU (0.022) | Grüne (0.030) | AfD (0.021) | Die Linke (0.016) |
| Die Linke (0.022) | Die Linke (0.014) | Grüne (0.022) | CDU/CSU (0.031) | Die Linke (0.022) | Grüne (0.021) |
| FDP (0.031) | FDP (0.028) | FDP (0.035) | Die Linke (0.035) | FDP (0.030) | FDP (0.028) |
The one thing to take away from this table is that the analysis I did is very, very limited. The right-wing populist AfD lies closest to three of the other parties, while the establishment, pro-business FDP lies farthest from all of them? I am not sure this makes much sense to me.
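If you want to reproduce such a ranking yourself, something like the following sketch does it, assuming the centroids matrix computed in the walkthrough further down (parties in columns, the ten dimensions in rows):

d <- as.matrix(dist(t(centroids))) #pairwise Euclidean distances between the six party centroids
diag(d) <- NA                      #a party's zero distance to itself is not informative
sort(d[, 'CDU/CSU'])               #the other parties ordered from closest to farthest, here for the CDU/CSU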
Perhaps it helps to know what exactly each semantic dimension represents. How different words load on different dimensions can illuminate that. The first dimension has relatively strong negative loadings by the following words: für, dass, Mensch, können, uns, muss, mehr, Europa, Deutschland, stark. Hmmm, not sure what to make of this. This all seems like standard German political discourse stuff to me.
The second dimension is not much better. The following words load strongly: für (positive loading), dass, Europa, uns, Mensch, können, Deutschland, Land, mehr (negative loadings).
Finally, the third dimension makes more sense: Europa, Deutschland, international (positive loadings), Mensch, können, Kind, gut, Arbeit, mehr (negative loadings). It looks like it distinguishes foreign policy (Europe, Germany, international) from domestic policy (people, work, children).
The take-home message is: have fun with this analysis but take it all with a grain of salt. A better approach would have been to train the analysis on a huge corpus of German text (which I don’t have) using very generous computer resources (which I am unwilling to pay for) and then fold the election manifestos into the learned semantic dimensions.
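For completeness, the lsa package does provide a function for that folding-in step. The following is only a hypothetical sketch, since the large pre-trained space does not exist here (big_tdm and manifesto_tdm are made-up names):

library(lsa)
#hypothetical sketch only: big_tdm would be a weighted term-document matrix built from a
#large German reference corpus, and manifesto_tdm the manifesto paragraphs expressed in
#the same vocabulary and weighting; neither exists in this post
#big_space <- lsa(big_tdm, dims = 100)                  #train the semantic space once
#manifesto_coords <- fold_in(manifesto_tdm, big_space)  #then fold the manifestos into it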
As always, in the following I provide you with a foolproof guide to recreating the analyses and plotting the results. The script I actually used can be found on github here. Please comment below if anything should be unclear or you want to lift my spirits.
Performing Latent Semantic Analysis in R
It’s usually best to load the necessary libraries first. Should you not have one of these installed, use the install.packages() function, e.g. install.packages('plotly').
library(tm)      #text mining
library(stringr) #word counts
library(lsa)     #latent semantic analysis
library(plotly)  #interactive plotting
Next we will load the data. I went through the election manifesto pdfs by hand and turned them into machine-readable text files. In order to read them as intended, we need the little helper function readManifesto().
readManifesto = function(manifesto_address){
  #read a political manifesto, separating paragraphs into list entries
  line_break = '<<>><<>>'
  m = paste(readLines(manifesto_address), collapse = line_break)
  m = strsplit(m, split = paste(rep(line_break, 2), collapse = ''), fixed = T)
  m = gsub(line_break, ' ', m[[1]])
}
We load the txt files from github using the readManifesto() function which we just defined. Every election manifesto is now in a party-specific vector. We put these vectors together in a data frame df and remove paragraphs which are too short, i.e. shorter than minimal_paragraph_length words. The number I chose is completely arbitrary.
wd = 'https://raw.githubusercontent.com/ri/BTW2017_manifesto_analysis/master/cleaned_manifestos'
afd = readManifesto(paste(wd, "afd_cleaned.txt", sep = '/'))
cdu = readManifesto(paste(wd, "cdu_cleaned.txt", sep = '/'))
die = readManifesto(paste(wd, "die_cleaned.txt", sep = '/'))
fdp = readManifesto(paste(wd, "fdp_cleaned_ANSI.txt", sep = '/'))
gru = readManifesto(paste(wd, "gru_cleaned_ANSI.txt", sep = '/'))
spd = readManifesto(paste(wd, "spd_cleaned.txt", sep = '/'))

#remove too short paragraphs and put it all together
minimal_paragraph_length = 12 #defined in words
df <- data.frame(text = c(afd[str_count(afd, '\\w+') > minimal_paragraph_length],
                          cdu[str_count(cdu, '\\w+') > minimal_paragraph_length],
                          die[str_count(die, '\\w+') > minimal_paragraph_length],
                          fdp[str_count(fdp, '\\w+') > minimal_paragraph_length],
                          gru[str_count(gru, '\\w+') > minimal_paragraph_length],
                          spd[str_count(spd, '\\w+') > minimal_paragraph_length]),
                 view = factor(c(rep('AFD', sum(str_count(afd, '\\w+') > minimal_paragraph_length)),
                                 rep('CDU/CSU', sum(str_count(cdu, '\\w+') > minimal_paragraph_length)),
                                 rep('Die Linke', sum(str_count(die, '\\w+') > minimal_paragraph_length)),
                                 rep('FDP', sum(str_count(fdp, '\\w+') > minimal_paragraph_length)),
                                 rep('Bündnis 90/Die Grünen', sum(str_count(gru, '\\w+') > minimal_paragraph_length)),
                                 rep('SPD', sum(str_count(spd, '\\w+') > minimal_paragraph_length)))),
                 stringsAsFactors = FALSE)

#How many paragraphs are included per party
summary(df$view)
Having loaded the data, we can turn to Latent Semantic Analysis (LSA). We prepare a corpus corp using the Corpus() function and turn it into a term-document matrix while performing basic data cleaning (punctuation removal etc.). For some reason this has to be turned into an ordinary matrix td.mat with as.matrix() in order to apply a term weighting scheme and then perform the actual LSA.

The weighting scheme I chose penalizes extremely common words, e.g. der, die, das, as well as words which are so rare that they only occur in very few paragraphs. The words which LSA can work with (which occur across paragraphs) and which are typical for a given paragraph (which don't occur everywhere) get more weight.

Beware that the lsa() call will require patience. I only grab 10 dimensions from the analysis, because I am not interested in the rest. I believe that choosing more does not change the values associated with the first ten dimensions.
#prepare corpus
corp <- Corpus(VectorSource(df$text), readerControl = list(language = 'german'))

#turn corpus into TDM
corp <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                stopwords = T, #I am not sure this works well in German
                                                tolower = T,
                                                stemming = T,  #I am not sure this works well in German
                                                removeNumbers = TRUE))
td.mat <- as.matrix(corp)
td.mat.lsa <- lw_bintf(td.mat) * gw_gfidf(td.mat) # weighting: global frequency * inverse document frequency
lsaSpace <- lsa(td.mat.lsa, dims = 10) # create LSA space
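If you want to see how quickly the importance of the dimensions drops off (the claim above that the first three carry the most meaning), the singular values stored in lsaSpace$sk are one way to look at it; a small sketch, not part of the original script:

round(lsaSpace$sk / sum(lsaSpace$sk), 2) #relative weight of each of the ten dimensions
plot(lsaSpace$sk, type = 'b', xlab = 'Semantic dimension', ylab = 'Singular value')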
The matrix lsaSpace$tk holds the word loadings for each dimension. The matrix lsaSpace$dk does the same for the paragraphs.
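Something like the following sketch (not part of the original script) is how the word lists quoted further up can be pulled out:

dim1 <- sort(lsaSpace$tk[, 1]) #word loadings on the first semantic dimension, sorted
head(dim1, 10)                 #the ten strongest negative loadings
tail(dim1, 10)                 #the ten strongest positive loadings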
Interactive 3D Scatter Plot in R using plotly
And so we are ready for some 3D scatter plot action. I want the user to be able to hover the cursor over a data point and see which paragraph it is, i.e. directly read the text. Because plotly does not implement line breaks, I have to do it myself.
br_max = 90 #maximal number of characters before line break
for (i in 1:length(df$text)){ #for each paragraph
  paragraph = df$text[i]
  spaces = str_locate_all(pattern = ' ', paragraph)[[1]][,1] #locate where spaces are in paragraph
  br_idx = unique(unlist(lapply(br_max * 1:ceiling(nchar(paragraph) / br_max),
                                function(x) max(spaces[spaces < x])))) #locate future line breaks
  #implement line breaks
  paragraph_sep <- unlist(strsplit(paragraph, "")) #split string into separate characters
  paragraph_sep[br_idx] <- '<br>' #replace characters where line break should occur with html line break
  df$text[i] = paste0(paragraph_sep, collapse = '') #put the string back together
}
Prepare the data frame plotly expects to get. Include, besides the coordinates x, y, z, also the party c and the hover text t. As always, R reorganises factor levels alphabetically. The last line enforces the order we want for c.
#data frame for plotting paragraphs
points <- data.frame(x = lsaSpace$dk[,1],
                     y = lsaSpace$dk[,2],
                     z = lsaSpace$dk[,3],
                     c = df$view,
                     t = df$text)

#get colours of plot right
party_colours = c('black', 'red', 'magenta 3', 'orange', 'dark green', 'blue')
party_names = c("CDU/CSU", "SPD", "Die Linke", "FDP", "Bündnis 90/Die Grünen", "AFD")
points$c <- factor(points$c, levels = party_names) # add publication ready labels
We do something very similar for the centroids, i.e. the average coordinates of a given party’s paragraphs. Calculating a centroid is done easily enough by taking the mean along each dimension. We simply loop through dimensions. I have the weird feeling that there is a more elegant version out there which uses tidyverse code. If it is faster and shorter than 5 lines of code, please tell me!
If you are interested in the Euclidean distances between centroids in the ten-dimensional space, just run the dist() statement I commented out. Finally, tell R again that we do not want alphabetically ordered factor levels.
centroids = matrix(NA, nrow = 10, ncol = 6)
colnames(centroids) = party_names
for(i in 1:10){ #for each LSA dimension
  centroids[i,] = unlist(lapply(party_names, function(x) mean(lsaSpace$dk[df$view == x, i])))
}
#dist(t(centroids)) #the distances between party centroids

points_centroids <- data.frame(x = centroids[1,],
                               y = centroids[2,],
                               z = centroids[3,],
                               c = paste(party_names, ' centroid'))
points_centroids$c <- factor(points_centroids$c, levels = paste(party_names, ' centroid')) # add publication ready labels
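As a possible shorter alternative to the loop (a sketch, not part of the original script): base R's aggregate() computes the per-party means over all ten dimensions in one call. Note it returns parties in rows and dimensions in columns, i.e. the transpose of the centroids matrix above.

#per-party centroid coordinates in one call; parties in rows, dimensions in columns
centroids_alt <- aggregate(as.data.frame(lsaSpace$dk), by = list(party = df$view), FUN = mean)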
Now we have everything in place to start plotting. First things first: the paragraphs and centroids in 3D semantic space. plotly somehow does not allow independent color specification between add_markers() calls. So, we give it the same list of colours twice, once for the paragraphs and once for the centroids, but all in the first add_markers(). I know this is confusing.
p <- plot_ly() %>%
  add_markers(data = points,
              x = ~x, y = ~y, z = ~z,
              color = ~c,
              colors = rep(party_colours, 2), #requires repeating because plotly doesn't link paragraph and centroid legend entries
              hoverinfo = 'text',
              text = ~paste(c, ':<br>', t)) %>%
  add_markers(data = points_centroids,
              x = ~x, y = ~y, z = ~z,
              color = ~c,
              hoverinfo = 'text',
              text = ~paste(c),
              marker = list(symbol = 'cross'))
We now have a very basic 3D scatter plot which the following layout() call will prettify. I am particularly happy with the legend title hack I found online somewhere. Both the annotation (legend title) and the legend itself have the same y-position (height on the page). However, the annotation sits just above that position because yanchor = 'bottom', while the legend sits just below it because yanchor = 'top'.
p <- p %>% layout(scene = list(xaxis = list(title = 'Semantic<br>Dimension 1'),
                               yaxis = list(title = 'Semantic<br>Dimension 2'),
                               zaxis = list(title = 'Semantic<br>Dimension 3')),
                  title = 'The semantic space of German election manifestos',
                  annotations = list(list(x = 0, y = 0, #bottom left corner of frame
                                          text = '@ri',
                                          xref = 'paper', yref = 'paper',
                                          xanchor = 'left', #left aligned
                                          showarrow = F),
                                     list(x = 1.02, y = 0.8,
                                          text = 'Click to show/hide',
                                          xanchor = 'left',
                                          yanchor = 'bottom',
                                          showarrow = F)),
                  legend = list(y = 0.8, yanchor = 'top'))
#p #privately have a look at the plot in browser
Happy with what you see when calling p? Then push it to your own plotly account using plotly_POST(). Don’t forget to set up your credentials in your .Rprofile.
plotly_POST(p, filename = "BTW2017_manifesto_LSA") #push plot to plotly website
# Set up API credentials: https://plot.ly/r/getting-started
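If you would rather keep the plot local instead of pushing it to plotly's servers, htmlwidgets::saveWidget() is an alternative; a sketch, not part of the original script:

library(htmlwidgets)
saveWidget(p, file = 'BTW2017_manifesto_LSA.html', selfcontained = TRUE) #writes a standalone html file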
Like this post? Share it with your followers or follow me on Twitter!