Geography 486 - Lesson 10
Data Uncertainty and Multivariate Representation of Data Uncertainty - Week 2
Representations of Data Uncertainty and Multivariate Data
Data Uncertainty
In a recent paper on visualizing geospatial information uncertainty, the authors suggest that good science requires statements of accuracy by which the reliability of results can both be understood and communicated.1 Uncertainty, however, covers a much broader range of doubt or inconsistency about data than can objectively be measured.2 Uncertainty is a complex and multifaceted concept, and it is this complexity that has often been ignored both by GIS cartographers trying to visualize it and by decision makers trying to use this information.3 Unfortunately, decision makers and users of spatial data have come to expect spatial data to be presented without uncertainty or ambiguity.4
Issues involving data uncertainty have been discussed in the GIS literature for quite some time. In 1992, Alan MacEachren of The Pennsylvania State University had his paper Visualizing Uncertain Information published in Cartographic Perspectives.5 This paper references a number of earlier works acknowledging that there needs to be some method of representing uncertainty through tools that allow our visual and cognitive processes to "automatically focus on the patterns depicted rather than on mentally generating those patterns."6 Further, MacEachren argues that uncertainty is a critical issue in geographic visualization because most people treat maps and data generated from computers as gospel, and are quite willing to make decisions based upon this information despite the fact that it may not be accurate.7 In fact, as we have learned throughout this class, there is much uncertainty in the data we use based upon how data is collected, how data is analyzed, and how the resulting information is presented.
A number of possible methods for displaying this uncertainty have been proposed. MacEachren suggests that size and color value are most appropriate for depicting uncertainty in numerical data; hue, shape and possibly orientation can be used for uncertainty in nominal information; and texture might work best in a binary classification scheme for either nominal or numerical data.8 In addition, MacEachren goes on to suggest that color saturation, focus, and resolution can be used to depict uncertainty.9
In his paper on Visualizing Uncertainty in Geo-spatial Data, Alex Pang acknowledges that there are basically two different ways of incorporating uncertainty into a visualization.10 First, uncertainty can be treated as an additional piece of information and mapped to transparency, haze, or blur to alter the appearance of the underlying data.11 Uncertainty is treated as if it is another layer in a Geographic Information System tool. This approach is consistent with MacEachren's earlier work.
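As a minimal sketch of this first approach, the following Python maps an uncertainty value to transparency so that less certain features fade toward the background. The function names, the 0 to 1 uncertainty scale, the white background, and the minimum opacity of 0.2 are all assumptions made for illustration; they do not come from Pang's paper.

```python
def uncertainty_to_alpha(uncertainty, min_alpha=0.2):
    """Convert uncertainty (0 = certain, 1 = fully uncertain) to opacity.

    Opacity falls linearly from 1.0 to min_alpha so that even the least
    certain feature stays faintly visible (an illustrative choice).
    """
    u = max(0.0, min(1.0, uncertainty))  # clamp to the assumed 0-1 scale
    return 1.0 - u * (1.0 - min_alpha)


def blend_over_white(rgb, alpha):
    """Composite an (r, g, b) colour over a white background at this opacity."""
    return tuple(round(alpha * c + (1.0 - alpha) * 255) for c in rgb)


# Hypothetical feature drawn in pure red with an uncertainty of 0.75.
faded = blend_over_white((255, 0, 0), uncertainty_to_alpha(0.75))
```

Here the uncertainty behaves like a separate layer: the attribute colour is chosen first, and the uncertainty only changes how strongly that colour shows through.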
Alternatively, Pang suggests that new visualization primitives and abstractions could be created that reflect uncertainty.12 With these new primitives, the uncertainty is no longer separable from the data. These new primitives, such as glyphs, could be used to represent specific mechanisms for displaying uncertainty as scalars, pairs or n-tuples, or distributions.13 Scalars can be used to depict confidence levels or errors in differences, pairs of scalars can be used to represent intervals or ranges of uncertainty, and in situations where sufficient sampling is available, a map of the actual data distribution can represent the uncertainty in the data.14
While much of the work I reviewed to this point was focused on how to depict uncertainty, Judi Thomson et al. documented efforts to create a typology for visualizing uncertainty.15 Their typology identified nine different categories of uncertainty:16
· Accuracy/error - the difference between observation and reality
· Precision - the exactness of the measurement
· Completeness - the extent to which the information or data is comprehensive
· Consistency - the extent to which the various information components agree
· Lineage - the conduit through which the information or data has passed
· Currency/timing - the temporal gaps between occurrence, information collection, and use
· Credibility - the reliability of the information source
· Subjectivity - the amount of interpretation or judgment that is included
· Interrelatedness - the source independence from other information
Thomson et al. created this typology, which both incorporates and identifies the various characterizations of uncertainty, to provide a guide for the further development of visual representations of uncertainty.17
To this point we have only addressed the concept of uncertainty. This concept can be thought of as a single dimension or variable that impacts the final design of the map or visualization (in fact it may be multi-dimensional as shown above, but for the purposes of exposition we will for the moment consider this as a single dimension or variable). The concept of uncertainty, however, only interacts with and impacts the variables under study. Uncertainty is a way of defining or describing the relative 'goodness' of the spatial data that is being used in the map visualization and is but one of many variables that should be assessed when accepting the results of an analysis.
Multivariate Representation
Multivariate representation uses a variety of variables to represent one or more attributes in a map.18 As we learned, these attributes can be displayed in one of several ways and data uncertainty can be one of the many variables used.
First, we can create a composite variable, generally from some statistical data reduction method. Kriging and Inverse Distance Weighting are examples of introducing uncertainty as a spatial component of a point observation. Additionally, we can use the visualization of a number of independent variables layered on top of each other to arrive at a new observation or a specific recommendation. A second method of displaying multivariate data is superposition, or the use of multiple different symbologies within a map. Using sets of maps displayed side by side with different variables, called small multiples, is another means of displaying multivariate data. And finally, a fourth means of displaying multivariate data involves the use of both visual and non-visual symbolization.19
With uncertainty represented as a variable, it can be displayed on a map in one of five basic ways. The traditional approach has been to place a statement describing the uncertainty, or a reliability diagram, in the margins of the map. A second method is to provide two side-by-side maps, one displaying the attribute data and the other displaying the uncertainty information. This method requires the reader to mentally make the connection between the two different representations. A variant of this approach is to present maps that can be toggled (turned on and off alternately) and that present the same attribute and uncertainty information as the side-by-side maps.20 It would appear that the holy grail of map making is to create some form of integrated symbolization that can represent both attribute and uncertainty information within the same map visualization. J. B. Krygier's paper Sound and Geographic Visualization explores the use of non-visual symbolization, in this case sound, to represent data.21 Krygier describes how sound can be used as vocal narration; as mimetic symbols (earcons); as a redundant variable; as a cue to order, or reorder, data; as an alternative to different visual patterns; as an alarm; as a means of adding non-visual data dimensions to interactive displays; and for locating sounds in a sound space.22 The fifth means of displaying uncertainty is via some form of animation that can display different degrees of uncertainty, as was demonstrated by the QuickTime movie in our lesson.23
Conclusion
As to why this discussion of uncertainty and multivariate data representation is important, there has been a body of work that attempts to assess how maps are used for decision making under conditions of uncertainty. MacEachren et al. suggest that there is general agreement that uncertainty affects the decisions that we make.24 They cite previous research by Tversky and Kahneman (1974), which found that when decisions were required without all the information needed to make an accurate decision, people reverted to heuristics on which to base their decisions.25 That is, the subjects of Tversky and Kahneman's study used stereotyped representations to arrive at conclusions, even when the stereotype was unlikely to be correct based upon what statistical information was provided.26
MacEachren et al. further suggested that experts are dependent upon statistical analyses to make decisions but that lay users will tend to ignore or misinterpret statistical probabilities and instead rely on stereotyping (heuristics) to make decisions.27 Further, the authors cited an earlier study (Cliburn et al., 2002) in which the depiction of uncertainty was seen as a drawback, not just because the decision makers wanted unambiguous information, but because the nature of uncertainty could be used to discredit the model or analysis, and hence the decision.28
Clearly, the need to express uncertainty is important, but it needs to be expressed in a way that not only provides information and context to the variables under study but helps the viewer make rational choices based on the information presented. There is obviously still much work to be done in this field.
Footnotes
1Thomson, Judi; Hetzler, Beth; MacEachren, A. M. ; Gahegan, M.: Harrap; Pavel, Misha; (in press); A Typology for Visualizing Uncertainty; Conference on Visualization and Data Analysis 2005; San Jose; CA; accessed via http://www.geovista.psu.edu/publications/2004/Thomson_VDA205_prepub_draft_July5.pdf on March 21, 2006.
2Ibid.
3MacEachren, A. M.; Robinson, A; Hopper, S.; Gardner, S; Murray, R; Gahegan, M.:Harrap; and Hetzler, E; (2005); Visualizing Geospatial Information Uncertainty: What We Know and What We Need to Know; Cartography and Geographic Information Science, Vol. 32; pp. 139-160; accessed via http://www.geovista.psu.edu/publications/2005/MacEachrenCGIS_Vol32_No3.pdf on March 21, 2006.
4Ibid.
5MacEachren, A; (1992); Visualizing Uncertain Information; Cartographic Perspectives, Vol. 13; accessed via http://www.geovista.psu.edu/publications/MacEachren/MacEachren_uncertainty_cp1992.pdf on March 21, 2006.
6Ibid.
7Ibid.
8Ibid.
9Ibid.
10Pang, Alex; Visualizing Uncertainty in Geo-spatial Data; September 20, 2001; accessed via http://www.spatial.maine.edu/~worboys/SIE565/papers/pang%20viz%20uncert.pdf on March 21, 2006.
11Ibid.
12Ibid.
13Ibid.
14Ibid.
15Thomson et al. (2005)
16Ibid.
17Ibid.
18Lesson 9: Data Uncertainty and Multivariate Representation of Data Uncertainty; accessed via https://www.e-education.psu.edu/courses/geog486/L09_compiled.html on March 9, 2006.
19Ibid.
20Ibid.
21Krygier, J. B.; Sound and Geographic Visualization; accessed via http://go.owu.edu/~jbkrygie/krygier_html/krysound.html on March 21, 2006.
22Ibid.
23Lesson 9 (2006)
24MacEachren et al. (2005)
25Ibid.
26Ibid.
27Ibid.
28Ibid.
For my Capstone Project, I did a voter turnout study for York County in Pennsylvania. I was able to obtain the York County voter record database containing turnout information back to 1990. In an effort to predict the voter turnout in this year's election, I only looked at data for the off-year (Non-Presidential) elections. I did this because voter turnout in Presidential years is far larger and not of the same composition as voter turnout in non-Presidential years. In performing this study, I faced three major issues.
First, the voter database is purged every six years in York County. That is, if a person fails to vote in any one of five successive years, they are purged from the voting rolls. If they simply haven't voted, they must re-register. However, most of the people purged from the rolls are taken off because they have died. This presented an interesting data issue when looking at data for the years prior to 2002, viz. 1998, 1994, and 1990. For each of these three years some portion of the voter turnout is not represented because these people were stricken from the rolls. Since there is no way to get around this, I made the implicit assumption that the people stricken from the rolls would be proportionally distributed across all political parties. In other words, all political parties would be equally impacted by voters being taken off the voting roll.
Second, because York County is primarily a Republican county, I chose to look only at the Republican party turnout. Had I chosen the Democratic Party to study, I would expect the results to be virtually the same, i.e. the overall trend would reflect decreasing Democratic Party turnout and increasing Republican turnout, as these represent opposite sides of the same coin.
Finally, I struggled with how to represent the data uncertainty portion of this project. I settled on predicting the Republican Party voter turnout for the upcoming election. I did this by calculating a trend value based upon a simple regression equation. I did this for each of the 159 voting districts (precincts) in the county. I used each precinct because some areas are growing more rapidly than others, and because I was curious to investigate where the greatest changes were occurring. To introduce the notion of uncertainty, I calculated a 90% confidence interval around the predicted trend value in order to get a low estimate and a high estimate of turnout for 2006, feeling that the actual 2006 results would fall somewhere between these two estimates.
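The trend and interval calculation for a single precinct can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure used in the project: the function name, the `new_observation` switch, and the sample turnout shares are hypothetical, and the critical values are the standard two-sided 90% values of Student's t for small samples.

```python
from math import sqrt

# Two-sided 90% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_90 = {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015}


def trend_with_interval(years, shares, target_year, new_observation=False):
    """Fit share = intercept + slope * year by least squares and return
    (low, expected, high) for target_year at the 90% level."""
    n = len(years)
    df = n - 2
    if df not in T_CRIT_90:
        raise ValueError("this sketch handles 3 to 7 observations")
    x_bar = sum(years) / n
    y_bar = sum(shares) / n
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, shares))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    expected = intercept + slope * target_year

    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, shares))
    s = sqrt(sse / df)

    # Standard error of the fitted trend value. Adding 1 under the root
    # widens it into a prediction interval for a single new election.
    extra = 1.0 if new_observation else 0.0
    se = s * sqrt(extra + 1.0 / n + (target_year - x_bar) ** 2 / sxx)
    margin = T_CRIT_90[df] * se
    return expected - margin, expected, expected + margin


# Hypothetical Republican shares of turnout (%) for one precinct.
years = [1990, 1994, 1998, 2002]
shares = [48.0, 51.0, 52.0, 56.0]
low, expected, high = trend_with_interval(years, shares, 2006)
```

With only four off-year elections there are two degrees of freedom, so the interval is wide. Repeating the call for each precinct would give the low, expected, and high values for the three 2006 maps.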
The data from 1990 through 2002 represents the full population of information that was available from the county data center. I chose to display this information in a sequence of maps displayed vertically on a poster-sized output. What resulted is a gradual "reddening" of York County showing the increasing Republican voter turnout over four off-year election cycles. The three predicted outcomes (low, expected, and high) for 2006 are displayed horizontally across the bottom of the poster, but in sequence with the 2002 election cycle map. Visually, I hoped to capture the notion that the 2006 outcomes would fall somewhere between the predicted low and predicted high maps.
Keeping to the popular media representation of Republican areas depicted in red and Democratic areas depicted in blue, I chose a diverging red-blue color ramp to represent turnout values. Since this was an analysis of Republican party turnout, turnout percentages greater than 50% gradually increased the red color density and those below 50% gradually increased the blue color density. While the color blue in this case does not specifically represent Democratic turnout, it does represent the turnout of all other parties in each precinct, i.e. the non-Republican turnout.
It should be noted that voter turnout was measured by analyzing the numbers from each party that actually turned out at the polls relative to each other, not the percentage of voters registered within a particular party that turned out to vote. In other words, a 50% turnout means that 50% of the voters showing up at the polls on that election day were from a specific party, in this case increasingly more Republican voters.
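A short Python sketch of this turnout measure and of a diverging red-blue ramp centered on 50% follows. The function names, the white midpoint, and the linear fade are assumptions for illustration; the actual poster may use a different ramp.

```python
def party_share(party_voters, total_voters):
    """Percentage of all voters at the polls who belong to one party."""
    if total_voters <= 0:
        raise ValueError("total_voters must be positive")
    return 100.0 * party_voters / total_voters


def diverging_color(share, midpoint=50.0):
    """Map a share (0-100) to an (r, g, b) tuple: white at the midpoint,
    saturated red at 100, saturated blue at 0."""
    share = max(0.0, min(100.0, share))  # clamp to a valid percentage
    if share >= midpoint:
        t = (share - midpoint) / (100.0 - midpoint)
        fade = round(255 * (1.0 - t))
        return (255, fade, fade)  # red deepens above the midpoint
    t = (midpoint - share) / midpoint
    fade = round(255 * (1.0 - t))
    return (fade, fade, 255)  # blue deepens below the midpoint


# Hypothetical precinct: 620 Republicans among 1,000 voters at the polls.
share = party_share(620, 1000)
color = diverging_color(share)
```

Note that the denominator is everyone who voted in the precinct, not the number of registered Republicans, which matches the definition of turnout used in this study.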
I chose to go with the large poster-sized output for my project because I wanted to display enough detail to make the analysis visually interesting and appealing. I also felt that a large format poster could be used in a variety of different forums - from a scholastic lecture setting to a political party rally.
After all was done, I am happy with the results. I don't know that I would have done anything any differently. I did experiment with using a vertical scale marker for each precinct that would show the low, expected, and high predicted values, but this symbolization was just too cluttered and difficult to read. I think the analysis shows the growing number of Republican voters in the county and that they are increasingly turning out in larger numbers to vote.
Finally, I came to the conclusion that this analysis was much like doing technical analysis of stock movements in the stock market. Technical stock analysis examines the movement of stocks to determine patterns of behavior, irrespective and independent of any fundamental analysis of the company itself. Fundamental stock analysis looks at how a company is performing by examining Securities and Exchange Commission filings, balance sheets, debt, and a host of other fundamental indicators of a company's health and future growth. In the same way, looking at these voting patterns does not take into account why voters are turning out. Though the predictions suggest a large Republican turnout in 2006 (and at least equal to that of 2002), this could turn out to be incorrect if Republicans choose to stay home because of a growing disenchantment with the direction the country is taking or because the race for Governor turns out to be less than exciting.
The PDF of my project can be downloaded by clicking here, or on the thumbnail image below.