Duplicate and Missing Cases



John Snow’s map of the 1854 cholera outbreak in London is a canonical example of data visualization:1

In 1992, Rusty Dodson and Waldo Tobler digitized the map. While the original data are no longer available,2 they have been preserved in Michael Friendly’s ‘HistData’ package. These data are plotted below:

Data problems

However, I would argue that there are two apparent coding errors in these data that stem from three misplaced cases.

While the data record 578 bars, only 575 of them have a unique x-y coordinate.3 Three pairs have identical coordinates: 1) 93 and 214; 2) 91 and 241; and 3) 209 and 429. Within the scheme of stacking bars to represent the number of fatalities at a given “address”, this should not occur. Each bar should have its own unique x-y coordinate. For this reason, I believe that any duplicate coordinates are likely to be coding errors.

duplicates <- HistData::Snow.deaths[(duplicated(HistData::Snow.deaths[,
  c("x", "y")])), ]

duplicates.id <- lapply(duplicates$x, function(i) {
  HistData::Snow.deaths[HistData::Snow.deaths$x == i, "case"]

HistData::Snow.deaths[unlist(duplicates.id), ]
>     case        x        y
> 93    93 12.84460 11.61027
> 214  214 12.84460 11.61027
> 91    91 12.65285 11.26382
> 241  241 12.65285 11.26382
> 209  209 12.68321 11.28437
> 429  429 12.68321 11.28437

Fortunately, a careful comparison of Snow’s map and the map generated by Dodson and Tobler’s data reveals that there are also three “missing” bars in the latter. An expedient “fix” would be to simply use the duplicates to fill in for the “missing” bars:

fatalities <- HistData::Snow.deaths

fix <- data.frame(x = c(12.56974, 12.53617, 12.33145), y = c(11.51226, 11.58107, 14.80316))

fatalities[c(91, 93, 209), c("x", "y")] <- fix

This fixed data set is available as fatalities in this package and as Snow.deaths2 in ‘HistData’ (>= ver. 0.7-8). For those interested, details about how I arrived at these values can be found in fixFatalities() and in the “note on duplicate and missing cases”, available online in this package’s GitHub repository.

  1. The map was originally published in Snow’s 1855 book, “On The Mode Of Communication Of Cholera”, and was reprinted as John Snow et. al., 1936. Snow on Cholera: Being a Reprint of Two Papers. New York: The Common Wealth Fund. You can also find the map online (a high resolution version is available at https://www.ph.ucla.edu/epi/snow/highressnowmap.html) and in many books, including Edward Tufte’s 1997 “Visual Explanations: Images and Quantities, Evidence and Narrative”.↩︎

  2. http://www.ncgia.ucsb.edu/pubs/snow/snow.html↩︎

  3. There is a lack of consensus about the actual number of cases represented in Snow’s map. For what it’s worth, I manually recounted the data on Snow’s map and the result I got matches Dodson and Tobler’s.↩︎