Cluster Detection

This ex post approach is structured quite differently from the previous ones. The initial phase is predominantly an exploration of spatial patterns using either statistical or geocomputational techniques. Once an abnormal concentration of some effect has been detected, some form of environmental simulation modeling may be used to confirm the transport and fate mechanisms from the revealed or suspected source to the receptors, as a precursor to the implementation of mitigation measures. Many of these applications center on the investigation of diseases caused by unsanitary conditions and/or pollution and have strong roots in spatial epidemiology (Lawson, 2001) and environmental engineering (Nazaroff and Alvarez-Cohen, 2001). As we shall see next, however, these techniques can also be applied to purely engineering phenomena such as pipe bursts, landslides, and subsidence. First, we begin with some principles of the approach.

The recognition of spatial patterns in events has been a cornerstone of spatial epidemiology since 1854, when John Snow (a physician to Queen Victoria) determined the source of a cholera epidemic in the Soho district of London to be a pump on Broad Street. Once the handle of the pump had been removed, the epidemic subsided. Snow’s revelation is often attributed to a mapping of the cases, but this map was actually created only after the event, for a monograph recording his observations and analysis (Snow, 1855). Nevertheless, his correct deduction of the cause arose, first, from an observation that the main cluster of deaths in this epidemic centered geographically on the said pump and, second, from his earlier hypothesis that cholera was ingested from contaminated drinking water (which until then had not been proven); otherwise he might not have focused on the pump at all. Subsequent forensic investigation of the Broad Street pump confirmed that sewage had leaked into the well, causing the contamination.

The significance of Snow’s work was not so much that he drew a map, but that he set in place an approach that is still relevant today: investigate patterns for indications of abnormality or concentration (spatially and/or temporally), hypothesize and forensically confirm the cause of the pattern, then take necessary corrective action. In modern spatial epidemiology, spatial distributions can be examined in three ways (Lawson, 2001):


  1. Nonspecific: This is a global, statistical approach that provides an assessment of the overall pattern for a complete map, usually the degree to which a mapped distribution may be characterized as being regular, random, or clustered.
  2. Specific: The aim here is to identify specifically where clusters are, should they indeed be found to exist. This can be carried out in one of two ways:

An event, such as catching a disease, the occurrence of a landslide, or a pipe burst, can be treated as a binary event (0, 1) inasmuch as either it has happened or it hasn’t. You don’t get half ill and a pipe doesn’t partially burst (unless you want to get pedantic and say it just leaks). For the purpose of a specific cluster analysis, such binary events are best treated as point data. Although there is a range of techniques for analyzing spatially aggregated data (e.g., Besag and Newell, 1991; Anselin, 1995; Ord and Getis, 1995), we will focus here on point binary events. Apart from spatial epidemiology, the analysis of such data has a long tradition in geography (Dacey, 1960; Knox, 1964; Cliff and Ord, 1981) and ecology (Clark and Evans, 1954; Greig-Smith, 1964) and has received renewed interest within GIS and geocomputational frameworks (Fotheringham and Zhan, 1996; Gatrell et al., 1996; Openshaw, 1998; Brimicombe and Tsui, 2000; Atkinson and Unwin, 2002), and more recently within spatial data mining (Miller and Han, 2001; Brimicombe, 2002; 2006; Jacquez, 2008). But what is a cluster? Unfortunately there is no standard definition but, instead, two broadly defined classes of cluster:

The first comes from the mainstream statistics of cluster analysis arising from the work of Sokal and Sneath (1963). Here, clustering is an act of grouping by statistical means which, when applied to spatial data, seeks to form a segmentation into regions or clusters that minimize within-cluster variation but maximize between-cluster variation. There is a general expectation that the spatial clustering will assign every point to exactly one cluster and is, therefore, space-filling within the geographical extent of the data (e.g., Murray and Estivill-Castro, 1998; Halls et al., 2001; Estivill-Castro and Lee, 2002). Because the result is a spatial segmentation, further analysis of this form of clustering usually leads to the aggregated data techniques cited above.
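
As a rough illustration of this first class of cluster, the sketch below groups a handful of point events with a plain k-means procedure (Lloyd's algorithm), one common way of seeking low within-cluster variation. The choice of k-means, the function names (kmeans, within_ss), the deterministic starting centroids, and the sample coordinates are all assumptions made for illustration; they are not drawn from the works cited above.

```python
import math

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm on (x, y) tuples; returns (labels, centroids)."""
    # Assumption for reproducibility: the first k points seed the centroids.
    # Practical analyses normally use random or k-means++ starts.
    centroids = [points[i] for i in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: every point joins its nearest centroid,
        # so the partition is mutually exclusive and exhaustive.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids

def within_ss(points, labels, centroids):
    """Total within-cluster sum of squared distances to the centroids."""
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))

# Hypothetical event locations forming two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, k=2)
print(labels, within_ss(points, labels, centroids))
```

Note that every point receives a label, which is the space-filling behavior described above. No point is left unassigned, however isolated it may be.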

The other class of cluster is concerned with “hotspots.” These can be loosely defined as a localized excess of some incidence rate and are typified by Openshaw’s Geographical Analysis Machine and its later developments (Openshaw et al., 1987; Openshaw, 1998). This definition of a cluster is well suited to binary event occurrences. Unlike the statistical approach, there is no expectation that all points in the data set will be uniquely assigned to a cluster. Only some of the points are identified as belonging to hotspots, and these then remain the focus of the analysis.
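
The hotspot idea can be sketched as a simple scan of overlapping circles placed on a regular grid, flagging any circle whose event count greatly exceeds what an even spread of events would give. This is only loosely in the spirit of the overlapping-circle approach and is not Openshaw's Geographical Analysis Machine itself. The function name scan_hotspots, the fixed radius, the grid step, the min_ratio threshold, and the sample events are hypothetical, and edge effects at the boundary of the study area are ignored.

```python
import math

def scan_hotspots(events, xmin, ymin, xmax, ymax, radius, step, min_ratio):
    """Return (centre, observed, expected) for circles with excess events."""
    area = (xmax - xmin) * (ymax - ymin)
    density = len(events) / area                 # events per unit area overall
    expected = density * math.pi * radius ** 2   # count per circle if evenly spread
    nx = int((xmax - xmin) / step) + 1
    ny = int((ymax - ymin) / step) + 1
    hotspots = []
    for i in range(nx):
        for j in range(ny):
            centre = (xmin + i * step, ymin + j * step)
            observed = sum(1 for e in events if math.dist(e, centre) <= radius)
            # Flag a localized excess: observed count well above expectation.
            if observed >= min_ratio * expected:
                hotspots.append((centre, observed, expected))
    return hotspots

# Hypothetical events: a tight group near (2, 2) plus scattered singletons.
events = [(2.0, 2.0), (2.2, 1.9), (1.8, 2.1), (2.1, 2.2), (1.9, 1.8),
          (7.5, 3.0), (5.0, 8.0), (9.0, 9.0), (4.0, 5.5), (8.0, 1.0)]
hot = scan_hotspots(events, 0.0, 0.0, 10.0, 10.0,
                    radius=1.0, step=1.0, min_ratio=10.0)

# Only events falling inside a flagged circle belong to a hotspot.
in_hotspot = [e for e in events
              if any(math.dist(e, c) <= 1.0 for c, _, _ in hot)]
print(hot, len(in_hotspot), "of", len(events))
```

In contrast to a space-filling segmentation, the scattered events here end up belonging to no cluster at all. A fixed ratio threshold is a crude stand-in for a proper significance test against a null hypothesis.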

With this type of clustering, the null hypothesis of no clustering is a random event occurrence free from locational constraints, which would thus form a Poisson distribution (Harvey, 1966; Bailey and Gatrell, 1995). Because the recognition of this type of cluster is in relation to some incidence rate, the significance of clustering is often evaluated against an underlying “at risk” or control population. This is a critical issue because misspecification is clearly going to lead to erroneous results. In some applications (e.g., data mining) the “at risk” population may not be identifiable at the outset, and for yet other applications (e.g., landslides, subsidence) the notion of an “at risk” population, such as all those parts of a slope that are vulnerable to failure, may have little meaning.
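
To make the role of the “at risk” population concrete, the following sketch computes how surprising an observed count would be under a Poisson null hypothesis, with the expected count derived from the share of the “at risk” population falling inside a candidate zone. The function names (expected_cases, poisson_tail) and all of the figures are hypothetical, and a uniform risk across the population is assumed.

```python
import math

def expected_cases(total_cases, total_at_risk, at_risk_in_zone):
    """Expected cases in a zone if risk were uniform across the population."""
    return total_cases / total_at_risk * at_risk_in_zone

def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    term = math.exp(-expected)        # P(X = 0)
    cumulative = 0.0
    for k in range(observed):         # accumulate P(X = 0) .. P(X = observed - 1)
        cumulative += term
        term *= expected / (k + 1)    # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - cumulative)

# Hypothetical figures: 40 cases among 10,000 people at risk overall,
# and a candidate zone holding 500 of those people and 7 of the cases.
lam = expected_cases(total_cases=40, total_at_risk=10_000, at_risk_in_zone=500)
p = poisson_tail(observed=7, expected=lam)
print(lam, p)   # expected count 2.0, tail probability of about 0.0045
```

If the “at risk” population in the zone were misspecified, say 2,000 rather than 500, the expected count would rise to 8 and the same seven cases would no longer look unusual. When many candidate zones are tested, some allowance for multiple testing is also needed before any single small probability is taken as evidence of clustering.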

Published on: 11/29/19, 2:25 PM