Blog

Initial Conceptualizations of GIS and Environmental Modeling

In broad terms, GIS have provided environmental modelers with an ideal platform for spatial data integration, parameter estimation, and cartographic visualization, while environmental modeling has allowed GIS professionals to move beyond simple inventory and thematic mapping activities (Sui and Maggio, 1999). In practice, the degree of integration between the two technologies has tended to vary from project to project. But, more to the point, what is meant by, or how to specify, any particular degree of “integration” is not so straightforward. Anselin et al. (1993) suggested a three-tiered classification based on the direction of the interaction between any two technologies: one-directional, two-directional, and dynamic two-directional (flexible data flow). Throughout the 1990s, however, it was generally accepted that for environmental simulation modeling, four levels of integration were possible, using the following terminology: independent, loosely coupled, tightly coupled, and embedded (Fedra, 1993; Karimi and Houston, 1996; McDonnell, 1996; Sui and Maggio, 1999), although definitions vary slightly among these authors.


Independent

This doesn’t really represent a level of “integration” as such, but is included for completeness to cover those situations where GIS and environmental modeling are used together, but independently, on projects to achieve some common goal. In this context, GIS might be used to replace manual map measurements as traditionally carried out by modelers. Such measurements were invariably time consuming and prone to errors. Standard GIS functionality for measuring distances and areas could be used instead. GIS could also be used in parameter estimation for lumped models where, for example, dominant classes, spatially averaged values, or interpolated values might be derived from the relevant GIS coverages. The results from GIS usage would tend to be in the form of summary tables, which would then be used as inputs to the environmental model.
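By way of a minimal, hypothetical sketch of the kind of lumped parameter estimation described above (the raster values are invented for the example, and in practice these summaries would come from standard GIS functions rather than hand-written code), a dominant class and a spatially averaged value might be derived along these lines:

    import numpy as np

    # Invented categorical coverage (e.g., land use codes) and continuous coverage
    landuse = np.array([[1, 1, 2, 3],
                        [1, 2, 2, 3],
                        [1, 1, 2, 2]])
    roughness = np.array([[0.03, 0.03, 0.10, 0.40],
                          [0.03, 0.10, 0.10, 0.40],
                          [0.03, 0.03, 0.10, 0.10]])

    # Dominant class over the catchment (a lumped categorical parameter)
    classes, counts = np.unique(landuse, return_counts=True)
    dominant_class = classes[np.argmax(counts)]

    # Spatially averaged value (a lumped continuous parameter)
    mean_roughness = float(roughness.mean())

    print(dominant_class, round(mean_roughness, 3))

Either value would then appear in the kind of summary table passed to the environmental model.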

Loosely Coupled

At this level of integration, GIS and an environmental model can share data files. GIS interaction with a dynamic simulation model is likely to be more than a once-only set of measurements or parameter determinations, particularly where the outputs of a number of scenarios may need to be visualized or further processed using GIS. Moreover, where parameter estimation is for distributed parameter models, a tabular approach to data exchange becomes extremely cumbersome. It is much better then to have some means by which both GIS and simulation models can share data files. More often than not, this entails exporting data files into some data format that is common to both GIS and environmental modeling software. This might be some formatted text file for attribute tables and raster matrices or, very popular at the time, the .dxf CAD format for vector graphics. One distinct advantage of this approach is that off-the-shelf and industry standard software can be used together, on the same computer, with a minimum of further development costs (even zero development costs if both have built-in compatible data import/export functionality).
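As a deliberately simplified sketch of loose coupling through a shared file format, the ESRI ASCII grid layout is used below purely as an example of a text format that both a GIS and a model might import and export; the function names and the random elevation values are this example’s own inventions:

    import numpy as np

    def write_ascii_grid(path, array, xll=0.0, yll=0.0, cellsize=100.0, nodata=-9999):
        """Export a raster array so another package can import it."""
        header = (f"ncols {array.shape[1]}\n"
                  f"nrows {array.shape[0]}\n"
                  f"xllcorner {xll}\n"
                  f"yllcorner {yll}\n"
                  f"cellsize {cellsize}\n"
                  f"NODATA_value {nodata}\n")
        with open(path, "w") as f:
            f.write(header)
            np.savetxt(f, array, fmt="%.4f")

    def read_ascii_grid(path):
        """Re-import the exchanged file on the other side of the coupling."""
        with open(path) as f:
            meta = dict(f.readline().split() for _ in range(6))
        array = np.loadtxt(path, skiprows=6)
        return array, meta

    # Example exchange: the GIS exports an elevation grid, the model reads it back in.
    write_ascii_grid("elevation.asc", np.random.rand(3, 4) * 100.0)
    grid, meta = read_ascii_grid("elevation.asc")

Each package simply exports to and imports from the same file, so nothing new has to be built inside either the GIS or the model itself.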

As each software package is upgraded by its vendor, it can be brought into immediate use provided that a hardware or operating system incompatibility is not introduced in doing so (for example, the latest version of the GIS software might now only run under Windows XP or Vista, while perhaps the environmental model hasn’t been upgraded since it was first compiled under Windows 3.11, or, more likely, the chip and memory on your faithful workhorse PC cannot take the upgrade; even peripherals such as an older plotter might no longer be supported in the software upgrade). It is also possible to switch software completely as a result of new developments, because of the particular characteristics of the problem to be solved, or even for compatibility with some third party (research colleagues, client, or other consultants in a consortium). On the other hand, with each package running through its own interface, it becomes necessary to carry out GIS and simulation tasks one at a time in sequence, exchanging files and switching software at each stage.

Tightly Coupled

Under this level of integration, both software packages are run through a common interface that provides seamless access to GIS functionality and to the environmental model. They may even share a common file format that avoids the need to translate files to an exchange format; if not, a file management system provides seamless data sharing. There is a development cost in creating the common interface, but it brings tangible advantages. First, off-the-shelf and industry standard software can still be used, as in the loosely coupled option, but the need to handle exchange files manually is avoided. This is important, as on large dynamic modeling projects the number of these files can run into the hundreds, making it easy to use the wrong file by mistake. This, plus avoiding the need to switch alternately from one package to the other, can save considerable time and adds flexibility in running scenarios. Incremental development of the common interface and file management may be required with each software upgrade, which may not be at the same pace or timing for the GIS and the environmental model. Also, if for the reasons given above a different GIS package or environmental model needs to be substituted, the development effort has to be carried through again. Tight coupling in this way, therefore, tends to be implemented for stable situations where a large amount of work needs to be carried out over a period of time.
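A tightly coupled arrangement might be sketched, very loosely, as a single driver that owns the exchange files and invokes each package in turn; the command names ("gis_export", "run_model") and file names below are hypothetical placeholders rather than real tools:

    import subprocess
    from pathlib import Path

    class CoupledRun:
        def __init__(self, workdir="scenario_01"):
            self.workdir = Path(workdir)
            self.workdir.mkdir(exist_ok=True)

        def exchange(self, name):
            # Every exchange file lives in one managed location, so the modeler
            # never handles file names by hand.
            return self.workdir / name

        def run_scenario(self, scenario_id):
            dem = self.exchange(f"dem_{scenario_id}.asc")
            result = self.exchange(f"flow_{scenario_id}.asc")
            # Hypothetical command-line calls standing in for the GIS and the model
            subprocess.run(["gis_export", "--layer", "dem", "--out", str(dem)], check=True)
            subprocess.run(["run_model", "--dem", str(dem), "--out", str(result)], check=True)
            return result

    # Usage would be one call per scenario instead of manual export/import at each step:
    # CoupledRun().run_scenario(1)

The point of the sketch is the single entry point and the managed file naming, which is what saves the manual shuffling of hundreds of exchange files.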

Embedded

A number of authors consider that an embedded level of integration is the same as tightly coupled, and it might indeed be so if defined as such. However, there is considerable difference between using GIS and environmental modeling through a common user interface and having either GIS functionality embedded in an environmental model or environmental modeling code embedded in a GIS package. For a start, one package tends to dominate because its interface is the only one used. Also, some of the embedded implementations can be partial, such as limited GIS functionality inside an environmental model. Embedded environmental models may also be in a simplified form. Often such embedding is carried out by vendors to make their products more attractive. Environmental simulation models are typically developed using mainstream programming languages (e.g., C++, FORTRAN, Visual Basic, Java) or advanced technical languages, such as MATLAB, which are not ideal, but offer a pragmatic solution for environmental modelers.

Published on: 11/29/19, 2:31 PM

Cluster Detection

This ex post approach is structured quite differently from the previous ones. The initial phase is predominantly exploratory of spatial patterns, using either statistical or geocomputational techniques and, once an abnormal concentration of some effect has been detected, some form of environmental simulation modeling may be used to confirm the transport and fate mechanisms from the revealed or suspected source to the receptors as a precursor to the implementation of mitigation measures. Many of these applications center on the investigation of diseases caused by unsanitary conditions and/or pollution and have strong roots in spatial epidemiology (Lawson, 2001) and environmental engineering (Nazaroff and Alvarez-Cohen, 2001). But, as we shall see next, these techniques can also be applied to purely engineering phenomena, such as, for example, pipe bursts, landslides, subsidence, and so on. But first, we begin with some principles of the approach.

Recognition of spatial patterns in events has been a cornerstone of spatial epidemiology since John Snow (a physician to Queen Victoria) in 1854 determined the source of a cholera epidemic in the Soho district of London to be a pump on Broad Street. Once the handle of the pump had been removed, the epidemic subsided. Snow’s revelation is often attributed to a mapping of the cases, but in fact this map was only created after the event for a monograph recording his observations and analysis (Snow, 1855). Nevertheless, his correct deduction of the cause arose, first, from an observation that the main cluster of deaths in this epidemic centered geographically on the said pump and, second, from his earlier hypothesis that cholera was ingested from contaminated drinking water (which up until then was not a proven hypothesis), otherwise he might not have focused on the pump at all. Subsequent forensic investigation of the Broad Street pump confirmed that sewage had leaked into the well, causing the contamination.

The significance of Snow’s work was not so much that he drew a map, but that he set in place an approach that is still relevant today: investigate patterns for indications of abnormality or concentration (spatially and/or temporally), hypothesize and forensically confirm the cause of the pattern, then take necessary corrective action. In modern spatial epidemiology, spatial distributions can be examined in three ways (Lawson, 2001):

  • Disease mapping: This concerns the use of models to characterize and uncover the overall structure of mapped disease distributions.
  • Ecological analysis: Here, explanatory factors for a disease are already known and the analysis is carried out at an aggregated spatial level to compare incidence rates with measures of the explanatory factors.
  • Disease clustering: This is of most interest to us in the context of this chapter. This concerns the detection and analysis of abnormal/unusual spatial or temporal clusters that indicate an elevated incidence or risk of a disease. Within this are a number of approaches:
  1. Nonspecific: This is a global, statistical approach that provides an assessment of the overall pattern for a complete map, usually the degree to which a mapped distribution may be characterized as being regular, random, or clustered (a minimal sketch of one such test follows this list).
  2. Specific: The aim here is to identify specifically where clusters are, should they indeed be found to exist, and this can be carried out in one of two ways:
  • Focused: This is where a putative cause is suspected or known a priori, such as pollution from a factory, which then focuses the search for clusters.
  • Nonfocused: Where there are no a priori assumptions and an exploratory search is carried out to find clusters wherever they may occur.
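As an illustration of the nonspecific, global end of this spectrum, the sketch below computes the Clark and Evans (1954) nearest-neighbour index, which compares the observed mean nearest-neighbour distance with that expected under complete spatial randomness; the event coordinates are randomly generated purely for the example:

    import numpy as np

    def clark_evans(points, area):
        """R < 1 suggests clustering, R ~ 1 randomness, R > 1 regularity."""
        pts = np.asarray(points, dtype=float)
        n = len(pts)
        # Pairwise distances; ignore the zero self-distance on the diagonal
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        observed = d.min(axis=1).mean()
        expected = 0.5 / np.sqrt(n / area)        # expectation under randomness
        return observed / expected

    events = np.random.rand(100, 2) * 1000.0      # event locations in a 1 km square
    print(clark_evans(events, area=1000.0 * 1000.0))

Values of R well below 1 suggest clustering, values near 1 randomness, and values above 1 regularity, but the index says nothing about where any clusters are; that is the job of the specific approaches.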

An event, such as catching a disease, the occurrence of a landslide, or a pipe burst, can be treated as a binary event (0, 1) inasmuch as either it has happened or it hasn’t. You don’t get half ill and a pipe doesn’t partially burst (unless you want to get pedantic and say it just leaks). Such binary events are, for the purpose of a specific cluster analysis, best treated as point data. Although there is a range of techniques for analyzing spatially aggregated data (e.g., Besag and Newell, 1991; Anselin, 1995; Ord and Getis, 1995), we will focus here on point binary events. Apart from spatial epidemiology, the analysis of such data has a long tradition in geography (Dacey, 1960; Knox, 1964; Cliff and Ord, 1981) and ecology (Clark and Evans, 1954; Greig-Smith, 1964) and has received renewed interest within GIS and geocomputational frameworks (Fotheringham and Zhan, 1996; Gatrell et al., 1996; Openshaw, 1998; Brimicombe and Tsui, 2000; Atkinson and Unwin, 2002), and more recently within spatial data mining (Miller and Han, 2001; Brimicombe, 2002; 2006; Jacquez, 2008). But what is a cluster? Unfortunately, there is no standard definition but, instead, two broadly defined classes of cluster:

The first comes from the mainstream statistics of cluster analysis arising from the work of Sokal and Sneath (1963). Thus, clustering is an act of grouping by statistical means which, when applied to spatial data, seeks to form a segmentation into regions or clusters that minimizes within-cluster variation but maximizes between-cluster variation. There is a general expectation that the spatial clustering will include all points in mutually exclusive groups and, therefore, is space-filling within the geographical extent of the data (e.g., Murray and Estivill-Castro, 1998; Halls et al., 2001; Estivill-Castro and Lee, 2002). With a spatial segmentation, further analysis of this form of clustering usually leads to the aggregated data techniques cited above.
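A minimal sketch of this first, space-filling class of clustering, using k-means purely as a generic example (the number of clusters and the coordinates are arbitrary choices for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    points = np.random.rand(200, 2) * 1000.0      # x, y coordinates of all events
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(points)
    # Every point now carries a cluster label, so the clusters partition the data
    # and fill the geographical extent, unlike the hotspot view described next.
    print(np.bincount(labels))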

The other class of cluster is concerned with “hotspots.” These can be loosely defined as a localized excess of some incidence rate and are typified by Openshaw’s Geographical Analysis Machine and its later developments (Openshaw et al., 1987; Openshaw, 1998). This definition of a cluster is well suited to binary event occurrences. Unlike the statistical approach, there is no expectation that all points in the data set will be uniquely assigned to a cluster; only some of the points need be identified as belonging to hotspots, and these then remain the focus of the analysis.

With this type of clustering, the null hypothesis of no clustering is a random event occurrence free from locational constraints, which would thus form a Poisson distribution (Harvey, 1966; Bailey and Gatrell, 1995). Because the recognition of this type of cluster is in relation to some incidence rate, the significance of clustering is often evaluated against an underlying “at risk” or control population. This is a critical issue because misspecification is clearly going to lead to erroneous results. In some applications (e.g., data mining), the “at risk” population may be identifiable at the outset, while for yet other applications (e.g., landslides, subsidence), the notion of an “at risk” population, such as all those parts of a slope that are vulnerable to failure, may have little meaning.
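To make the mechanics concrete, the following sketch runs a crude, GAM-style circular scan in which the case count inside each circle is tested against a Poisson expectation derived from an at-risk population; the data are simulated, a single radius is used, and no adjustment is made for multiple testing, so this is only a caricature of the methods cited above:

    import numpy as np
    from scipy.stats import poisson

    rng = np.random.default_rng(0)
    cases = rng.random((60, 2)) * 1000.0          # case locations (binary events)
    at_risk = rng.random((2000, 2)) * 1000.0      # at-risk/control population
    overall_rate = len(cases) / len(at_risk)

    radius = 150.0
    hotspots = []
    for cx in range(0, 1001, 100):                # circle centres on a coarse lattice
        for cy in range(0, 1001, 100):
            d_cases = np.hypot(cases[:, 0] - cx, cases[:, 1] - cy)
            d_pop = np.hypot(at_risk[:, 0] - cx, at_risk[:, 1] - cy)
            observed = int((d_cases <= radius).sum())
            expected = overall_rate * (d_pop <= radius).sum()
            if expected > 0 and observed > 0:
                p = poisson.sf(observed - 1, expected)    # P(X >= observed)
                if p < 0.01:
                    hotspots.append((cx, cy, observed, round(float(p), 4)))
    print(hotspots)

With the uniform simulated data above, few if any circles should flag as significant; seeding the case set with a deliberate concentration is a simple way to see the scan respond.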

Published on: 11/29/19, 2:25 PM

When working either in a geocomputational mode or in complex modeling situations, it is unlikely that the scientist or professional will be working with just a single tool. It is likely instead that several different tools are used. The way these are configured and pass data may have important influences on the outcome. In Chapter 6 and subsequently, we have seen how in coastal oil spill modeling there are, in fact, three models that cascade. Initially there is the hydrodynamic model, which calculates the tidal currents over the study area from bathymetric, shoreline, and tidal data using the finite element method (FEM). In the next stage, these tidal currents, together with other data such as wind and the properties of the particular type of oil, are used in the oil spill trajectory model.

The trajectory model is a routing model requiring only arithmetic calculation and, therefore, is carried out on a grid. However, this requires a reinterpolation of the tidal currents from a triangular network to a grid. As we have noted above, not only is there a choice of algorithm for reinterpolation, but there is also likely to be some level of corruption of the output data from the hydrodynamic modeling as it is transformed to a grid by the chosen interpolation algorithm. Experiments by Li (2001) have shown that errors in the bathymetry and tidal data will be propagated and amplified through the hydrodynamic modeling and affect the computed currents.

Interpolation of those currents to a grid further degrades the data by increasing the amount of variance by about 10%. This is then propagated through the next stage of modeling. These are inbuilt operational errors that are a function of the overall model structure. Because different components or modules within the overall modeling environment work in such different ways that they cannot be fully integrated, but instead remain tightly coupled, resampling or reinterpolation becomes necessary. Both model designers and model users, however, should be aware of these effects and try to limit them.
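The resampling step itself can be sketched as follows, with synthetic current speeds at irregular node positions standing in for the FEM output; this is not a reproduction of Li’s experiments, and whether the variance of the field grows or shrinks, and by how much, depends on the interpolation algorithm and the data:

    import numpy as np
    from scipy.interpolate import griddata

    rng = np.random.default_rng(1)
    nodes = rng.random((500, 2)) * 10_000.0            # FEM node coordinates (m)
    speed = (0.5 + 0.3 * np.sin(nodes[:, 0] / 2000.0)
             + 0.05 * rng.standard_normal(500))        # current speed at nodes (m/s)

    # Regular grid required by the trajectory model
    gx, gy = np.meshgrid(np.linspace(0, 10_000, 101), np.linspace(0, 10_000, 101))
    grid_speed = griddata(nodes, speed, (gx, gy), method="linear")

    v_nodes = np.nanvar(speed)
    v_grid = np.nanvar(grid_speed)
    print(f"variance at nodes: {v_nodes:.4f}, on grid: {v_grid:.4f}, "
          f"change: {100 * (v_grid - v_nodes) / v_nodes:+.1f}%")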

Another aspect of model structure that is hard to guard against is inadvertent misuse. While blunders, such as typographic errors in setting parameters, are sure to occur from time to time and will normally manifest themselves in nonintuitive outputs, there are subtle mistakes that result in believable but wrong outputs. In hydrodynamic modeling, for example, the forced tidal movement at the open boundary requires a minimum number of iterations in order for its effect to be properly calculated throughout the study area. For a large network with many thousands of elements in the triangular mesh, this may take many iterations at each time step. The model usually asks the modeler for the number of iterations that should be carried out; too many can be time consuming for a model with many thousands of time steps, but too few can give false results. Figure 9.8(a/b) shows the results of hydrodynamic modeling with an adequate number of iterations at 0 h and at 2 h. The tide is initially coming in and then starts to turn on the eastern side of the study area. In Figure 9.8(c/d), the exact same modeling has been given an insufficient number of iterations at each time step to give the correct answer.

After 2 h, the tide continues to flow in and has not turned. The result of this error on the oil spill trajectory modeling can be seen in Figure 9.9. This can be compared for half-hourly intervals against Figure 9.1(a/f), which uses the correctly simulated tidal currents. With an insufficient number of iterations, the oil spill ends up in quite a different place and may adversely affect decision making.
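One generic safeguard against this kind of misuse is to iterate to a convergence tolerance rather than to a fixed count chosen by the modeler. The sketch below shows the idea with a contrived update rule standing in for one solver sweep, since the actual hydrodynamic code is not reproduced here:

    import numpy as np

    def iterate_to_convergence(update, state, tol=1e-4, max_iter=500):
        """Apply `update` repeatedly until successive states differ by < tol."""
        for i in range(max_iter):
            new_state = update(state)
            if np.max(np.abs(new_state - state)) < tol:
                return new_state, i + 1
            state = new_state
        raise RuntimeError("solver did not converge within max_iter iterations")

    # Toy usage: a contrived update rule that converges to sqrt(2)
    demo = lambda x: 0.5 * (x + 2.0 / x)
    solution, n_iter = iterate_to_convergence(demo, np.array([1.0]))
    print(solution, n_iter)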

In agent-based modeling there is a different, but by no means less complex, set of issues in assessing the validity and usefulness of the results. In Chapter 5, we identified how agent-based models could have many thousands of agents, all programmed with microlevel behaviors, in order to study the macro patterns that emerge over time from these behaviors (Figure 5.15). Aspects for consideration include the stability or robustness of the emergent patterns, possible equifinality of emergent patterns from different initial states, boundary conditions and parameter values, different emergent patterns depending on parameter values, nonlinear responses to parameter change, and the propagation of error.

In order to illustrate a nonlinear response to incremental changes in a parameter, Figure 9.10 shows emergent patterns after 200 iterations of the Schelling three-population model implemented as cellular automata (CA). Only one parameter, the minimum neighborhood tolerance, has been changed, incremented in steps of 10%. Each final state at 200 iterations has been quantified using a global Index of Contagion (O’Neill et al., 1988), as implemented in FRAGSTATS (http://www.umass.edu/landeco/), which measures the level of aggregation (0% for random patterns, 100% where a single class occupies the whole area). As illustrated in Figure 9.10, between parameter values of 20% and 30% there is a tipping point after which a high level of clustering quickly replaces randomness. As the parameter is increased further, there is a nonlinear return toward randomness. Such a sensitivity analysis (see the next section: Issues of Calibration) is the usual approach to exploring the robustness of the solution spaces, but in models where there are many parameters, the task can quickly become intractable. Li et al. (2008) have proposed the use of agent-based services to carry out such sensitivity analysis and model calibration of multiagent models; in other words, using the power of agents to overcome the complexity of using agents. The same approach can be used to explore all parameter spaces in n-dimensions to discover all possible emergent patterns of interest.
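For readers wishing to experiment, the following is a minimal sketch of a three-population Schelling-style CA with a single minimum neighborhood tolerance parameter and a simplified relocation rule (unhappy agents move to a random empty cell). The grid size, vacancy rate, and the mean like-neighbour share used as an aggregation measure are this sketch’s own choices and only a crude stand-in for the FRAGSTATS Index of Contagion:

    import numpy as np

    rng = np.random.default_rng(0)
    SIZE, EMPTY = 30, 0.1
    grid = rng.choice([0, 1, 2, 3], size=(SIZE, SIZE),
                      p=[EMPTY] + [(1 - EMPTY) / 3] * 3)   # 0 = vacant cell

    def neighbours(g, r, c):
        # Occupied cells in the Moore neighbourhood (includes the centre cell)
        block = g[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2].ravel()
        return block[block != 0]

    def step(g, tolerance):
        # Mark agents whose share of like neighbours falls below the tolerance
        unhappy = []
        for r in range(SIZE):
            for c in range(SIZE):
                if g[r, c] == 0:
                    continue
                nb = neighbours(g, r, c)
                like = (nb == g[r, c]).sum() - 1           # exclude the cell itself
                if len(nb) > 1 and like / (len(nb) - 1) < tolerance:
                    unhappy.append((r, c))
        # Move unhappy agents to randomly chosen empty cells
        empties = list(zip(*np.where(g == 0)))
        rng.shuffle(empties)
        for (r, c), (er, ec) in zip(unhappy, empties):
            g[er, ec], g[r, c] = g[r, c], 0

    def like_neighbour_share(g):
        # Crude aggregation measure: mean proportion of like neighbours per agent
        shares = []
        for r in range(SIZE):
            for c in range(SIZE):
                if g[r, c] == 0:
                    continue
                nb = neighbours(g, r, c)
                if len(nb) > 1:
                    shares.append(((nb == g[r, c]).sum() - 1) / (len(nb) - 1))
        return float(np.mean(shares))

    for tolerance in (0.2, 0.3, 0.5, 0.8):
        g = grid.copy()
        for _ in range(200):
            step(g, tolerance)
        print(tolerance, round(like_neighbour_share(g), 3))

Sweeping the tolerance in this way gives a feel for the kind of tipping-point behavior described above, although the exact values at which it occurs will depend on the implementation details.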

Published on: 11/29/19, 2:16 PM