Witch hunting in Europe: a discovery of missingness

Recently, Leeson and and Russ released a paper describing the relationship between the witch trials of Europe and the competitiveness between the papal and reformation churches of medieval Europe. It’s due to be released in the Economic Journal shortly.

The paper is a fascinating quantitative exploration: but the incredible achievement in data collection that under pins it shouldn’t be overlooked. Leeson and Russ have developed a database of over 10 000 witch trials in middle ages Europe. And they put it on Github for everyone.

I took a look at the dataset and thought that it provided a golden opportunity to do a little hunt of my own: and to show how some as-yet relatively unknown R packages can help with that.

One of our favourite books is Good Omens by Terry Pratchet and Neil Gaiman. In it, Shadwell, intrepid witchfinder stalks across the countryside (or as far as his pensioner’s ticket will take him) armed with his witchfinder’s kit. While we don’t have the pendulum of righteousness, we’ve got the packages of visualisation and data discovery.

There are four packages I think are particularly useful for data exploration and visualisation: tidyverse, skimr, naniar and visdat.

install.packages("naniar") # you only need to run these lines if you haven't installed the packages already
install.packages("skimr")
install.packages("visdat")
install.packages("tidyverse")
library(naniar) # you need to run these lines every time you use these packages
library(skimr)
## Warning: package 'skimr' was built under R version 3.4.3
library(tidyverse)
library(visdat)

The witch trial data

The data is located in a number of different formats on Github, but here we’ll use the .csv file that covers the trial data. We’ll download it directly from Github.

url <- "https://raw.githubusercontent.com/JakeRuss/witch-trials/master/data/trials.csv"

db <- read.csv(url)

The data is marvellously prepared and doesn’t need anything in the way of cleaning for this project. I will however change the name of one of the variables we’ll be using to make the process clearer. We’ll call gadm.adm0, an administrative area country instead. This is a gross oversimplification of geography in the middle ages of Europe, but it describes the location of the trial in terms that wil be most familiar to many modern users.

db$decade <- as.numeric(db$decade)
db$country <- db$gadm.adm0
db <- db[,-9]

Data discovery: structure and content

I talk alot about what I do with datasets when I first open them. Some of the specific R functions I find useful are:

summary(db)
##       year          decade        century         tried        
##  Min.   :1300   Min.   :1300   Min.   :1300   Min.   :  1.000  
##  1st Qu.:1597   1st Qu.:1590   1st Qu.:1500   1st Qu.:  1.000  
##  Median :1630   Median :1620   Median :1600   Median :  1.000  
##  Mean   :1621   Mean   :1615   Mean   :1569   Mean   :  3.953  
##  3rd Qu.:1660   3rd Qu.:1650   3rd Qu.:1600   3rd Qu.:  1.000  
##  Max.   :1850   Max.   :1850   Max.   :1800   Max.   :500.000  
##  NA's   :931                                                   
##      deaths              city              gadm.adm2   
##  Min.   :  0.000   Geneva  : 320   Mittelfranken:1303  
##  1st Qu.:  0.000   Budingen: 264   East Lothian : 516  
##  Median :  0.000   Bruges  : 121   Darmstadt    : 493  
##  Mean   :  2.296   Munich  :  80   Stuttgart    : 450  
##  3rd Qu.:  1.000   Augsburg:  75   Fife         : 363  
##  Max.   :500.000   (Other) :4867   (Other)      :6768  
##  NA's   :3826      NA's    :5213   NA's         :1047  
##              gadm.adm1         lon              lat       
##  Scotland         :3129   Min.   :-3.534   Min.   :41.38  
##  Bayern           :1388   1st Qu.: 6.154   1st Qu.:47.73  
##  Baden-Wurttemberg:1153   Median : 8.765   Median :48.53  
##  England          : 619   Mean   : 8.291   Mean   :48.55  
##  Hessen           : 519   3rd Qu.:10.489   3rd Qu.:49.76  
##  (Other)          :3973   Max.   :19.820   Max.   :54.77  
##  NA's             : 159   NA's   :5803     NA's   :5803   
##                record.source            country    
##  Goodare et al. (2003):3128   United Kingdom:3750  
##  Behringer (1987)     :1330   Germany       :3417  
##  Midelfort (1972)     :1031   Switzerland   :1272  
##  Monter (1976)        : 837   France        : 807  
##  Kieckhefer (1976)    : 529   Belgium       : 671  
##  Niess (1982)         : 512   Sweden        : 353  
##  (Other)              :3573   (Other)       : 670
glimpse(db)
## Observations: 10,940
## Variables: 12
## $ year          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ decade        <dbl> 1520, 1530, 1540, 1580, 1590, 1600, 1610, 1620, ...
## $ century       <int> 1500, 1500, 1500, 1500, 1500, 1600, 1600, 1600, ...
## $ tried         <int> 1, 1, 5, 7, 11, 6, 22, 14, 25, 39, 10, 10, 8, 8,...
## $ deaths        <int> 1, 1, 5, 5, 0, 1, 18, 8, 4, 10, 1, 4, 1, 3, 3, 0...
## $ city          <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ gadm.adm2     <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ gadm.adm1     <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ lon           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ lat           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ record.source <fctr> Madar (1990), Madar (1990), Madar (1990), Madar...
## $ country       <fctr> Estonia, Estonia, Estonia, Estonia, Estonia, Es...

There’s another recent addition which I think is fabulous for a quick look of what’s in there, and that’s from the skimr package.

skim(db)
## Skim summary statistics
##  n obs: 10940 
##  n variables: 12 
## 
## Variable type: factor 
##        variable missing complete     n n_unique
## 1          city    5213     5727 10940      906
## 2       country       0    10940 10940       19
## 3     gadm.adm1     159    10781 10940      113
## 4     gadm.adm2    1047     9893 10940      365
## 5 record.source       0    10940 10940       38
##                                  top_counts ordered
## 1    NA: 5213, Gen: 320, Bud: 264, Bru: 121   FALSE
## 2 Uni: 3750, Ger: 3417, Swi: 1272, Fra: 807   FALSE
## 3 Sco: 3129, Bay: 1388, Bad: 1153, Eng: 619   FALSE
## 4   Mit: 1303, NA: 1047, Eas: 516, Dar: 493   FALSE
## 5 Goo: 3128, Beh: 1330, Mid: 1031, Mon: 837   FALSE
## 
## Variable type: integer 
##   variable missing complete     n    mean    sd  min  p25 median  p75  max
## 1  century       0    10940 10940 1569.2  72.61 1300 1500   1600 1600 1800
## 2   deaths    3826     7114 10940    2.3  16.3     0    0      0    1  500
## 3    tried       0    10940 10940    3.95 19.26    1    1      1    1  500
## 4     year     931    10009 10940 1621.06 66.25 1300 1597   1630 1660 1850
##       hist
## 1 ▁▁▁▂▇▁▁▁
## 2 ▇▁▁▁▁▁▁▁
## 3 ▇▁▁▁▁▁▁▁
## 4 ▁▁▁▂▇▇▁▁
## 
## Variable type: numeric 
##   variable missing complete     n    mean    sd     min     p25  median
## 1   decade       0    10940 10940 1614.87 66.68 1300    1590    1620   
## 2      lat    5803     5137 10940   48.55  1.65   41.39   47.73   48.53
## 3      lon    5803     5137 10940    8.29  2.89   -3.53    6.15    8.77
##       p75     max     hist
## 1 1650    1850    ▁▁▁▂▇▅▁▁
## 2   49.76   54.78 ▁▁▂▅▇▅▁▁
## 3   10.49   19.82 ▁▁▂▆▇▃▁▁

The contents of this database indicate dates of the trial - in years, decades or century, the number of tried persons, deaths and geographic information. But also a great deal of missing data, which is very common for the historic record.

If we wanted to know about the frequency of people accused and killed over time, we can work a little dplyr magic (this chart appears in similar form in the paper):

bydecade <- db[,c(2,4,5)] %>% 
      group_by(decade) %>% 
      summarise(deathsDecade = sum(deaths, na.rm = TRUE), 
                triedDecade = sum(tried, na.rm = TRUE)) 

ggplot(bydecade, 
       aes(x = decade, y = deathsDecade))+ 
  xlab("Decade")+
  ylab("Deaths")+
  geom_line() + 
  theme_light()+
  theme(legend.position="bottom")+
  theme(
    axis.text.x=element_text(angle=45,hjust=1, size = 10))+
  ggtitle("Number of deaths over time")