
Posts Tagged ‘data quality’

Best Available Data: “BAD” Data?

August 14, 2017

You may have heard it said that the “Best Available Data” is sometimes “BAD” data. Why? As the acronym implies, BAD data is often used “just because it is right at your fingertips,” and is often of lower quality than the data that could be obtained with more time, planning, and effort. We have made the case in our book and in this blog for five years now that data quality actually matters, not just as a theoretical concept, but in day-to-day decision making. Data quality is particularly important in the field of GIS, where so many decisions are made by analyzing mapped information.

All of this daily-used information hinges on the quality of the original data. Compounding the issue, the temptation to settle for the easily obtained grows as the web GIS paradigm, with its ease of use and plethora of data sets, makes it ever easier to quickly add data layers and be on your way. To be sure, there are times when the easily obtained is also of acceptable or even high quality. Judging whether it is acceptable depends on the data user and that user’s needs and goals: “fitness for use.”

One intriguing and important resource for determining the quality of your data is The Bad Data Handbook, published by O’Reilly Media and written by Q. Ethan McCallum and 18 contributing authors. They wrote about their experiences, their methods, and their successes and challenges in dealing with datasets that are “bad” in some key way. The resulting 19 chapters and roughly 250 pages may tempt you to put this on your “would love to but don’t have time” pile, but I urge you to consider reading it. The book is written in an engaging manner; many parts are even funny, evident in chapter titles such as “When Databases Attack” and “Is It Just Me, or Does This Data Smell Funny?”

Despite the lively and often humorous approach, there is much practical wisdom here. For example, many of us in the GIS field can relate to being somewhat perfectionist, so the chapter “Don’t Let the Perfect Be the Enemy of the Good” is quite pertinent. In another example, the authors provide a helpful “Four Cs of Data Quality Analysis” (a rough code sketch of these checks follows the list):
1. Complete: Is everything here that’s supposed to be here?
2. Coherent: Does all of the data “add up?”
3. Correct: Are these, in fact, the right values?
4. aCcountable: Can we trace the data?
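
Although the authors present the Four Cs as questions to ask rather than as code, it is easy to imagine them as concrete checks. Below is a minimal, hypothetical sketch in Python; pandas is assumed, and the file name, column names, and thresholds are all invented for illustration:

```python
import pandas as pd

# Hypothetical input: a parcels table. Everything named here is made up.
df = pd.read_csv("parcels.csv")

# 1. Complete: is everything here that's supposed to be here?
expected_rows = 12_450  # e.g., the parcel count published by the assessor
assert len(df) == expected_rows, f"expected {expected_rows} rows, got {len(df)}"
assert df["parcel_id"].notna().all(), "missing parcel IDs"

# 2. Coherent: does all of the data "add up"?
assert (df["land_value"] + df["improvement_value"] == df["total_value"]).all(), \
    "component values do not sum to the total"

# 3. Correct: are these, in fact, the right values?
assert df["total_value"].between(0, 50_000_000).all(), "implausible valuations"

# 4. aCcountable: can we trace the data?
assert {"source_agency", "date_acquired"}.issubset(df.columns), \
    "no lineage columns; where did this data come from?"
```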

Unix administrator Sandra Henry-Stocker wrote a review of the book here. An online version of the book is available from it-ebooks.info, but in keeping with the themes of this blog, you might ask yourself whether it is fair to the author to read it from that site rather than purchasing the book. I think that purchasing the book would be well worth the investment. Don’t let the 2012 publication date, the fact that it is not GIS-focused per se, or the frequent inclusion of code put you off; this really is essential reading, or at least skimming, for everyone in the field of geotechnology.

[Image: cover of the Bad Data Handbook, by Q. Ethan McCallum and others]


Data Quality on Live Web Maps

June 19, 2017

Modern web maps and the cloud-based GIS tools and services upon which they are built continue to improve in richness of content and in data quality. But as we have noted many times in this blog and in our book, maps are representations of reality. They are extremely useful representations, to be sure, particularly in the cloud, but they are representations nonetheless. These representations depend upon the data sources, accuracy standards, map projections, completeness, processing and rendering procedures used, regulations and policies in place, and much more. A case in point is the offset between street data and satellite image data that I noticed in mid-2017 in Chengdu in southwestern China. The streets are drawn about 369 meters southeast of where they appear on the satellite image (below):

[Image: Google Maps showing streets offset southeast of the satellite image in Chengdu, China]

Puzzled, I panned the map to other locations in China.  The offsets varied, but they appeared everywhere in the country; for example, note the offset of 557 meters where a highway crosses the river at Dongyang, again to the southeast:

[Image: Google Maps showing a 557-meter offset where a highway crosses the river at Dongyang, China]

As of this writing, the offset appears in the same cardinal direction, and only in China; after examining border towns with North Korea, Vietnam, and other countries, the offset appears to stop at those borders. No offsets exist in Hong Kong or Macao. Yahoo Maps and Bing Maps both show the same types of offsets in China (Bing Maps example, below):

[Image: Bing Maps showing the same street-versus-imagery offset in China]

MapQuest, which uses an OpenStreetMap base, showed no offset. I then tested ArcGIS Online with a satellite image base and with the OpenStreetMap base, and there was no offset there, either (below). The offset is a datum issue related to national security, documented in this Wikipedia article. The same data restriction issues that we discuss in our book and in this blog touch on other aspects of geospatial data, such as fines for unauthorized surveys, the lack of geotagging on many cameras when the GPS chip detects a location within China, and the apparent unlawfulness of crowdsourced mapping efforts such as OpenStreetMap.

Furthermore, as we have noted, the satellite images are themselves processed, tiled data sets, and like other data sets they need to be critically scrutinized as well. They should not be considered “reality” despite their appearance of being the “actual” surface of the Earth. They too contain error: the tiles may have been taken on different dates or in different seasons, they may have been reprojected onto a different datum, and other data quality aspects need to be considered.

[Image: ArcGIS Online showing no offset between imagery and streets in China]
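
For the curious, the cause documented in that Wikipedia article is China’s GCJ-02 coordinate system, which applies a nonlinear shift to WGS-84 coordinates before map data may be published. The algorithm is not officially released, so the sketch below follows the widely circulated community reimplementation; treat its output as approximate, not authoritative:

```python
import math

# Parameters of the Krasovsky 1940 ellipsoid used by the transform
A = 6378245.0
EE = 0.00669342162296594323

def _transform_lat(x, y):
    ret = (-100.0 + 2.0 * x + 3.0 * y + 0.2 * y * y + 0.1 * x * y
           + 0.2 * math.sqrt(abs(x)))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(y * math.pi) + 40.0 * math.sin(y / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (160.0 * math.sin(y / 12.0 * math.pi) + 320.0 * math.sin(y * math.pi / 30.0)) * 2.0 / 3.0
    return ret

def _transform_lon(x, y):
    ret = (300.0 + x + 2.0 * y + 0.1 * x * x + 0.1 * x * y
           + 0.1 * math.sqrt(abs(x)))
    ret += (20.0 * math.sin(6.0 * x * math.pi) + 20.0 * math.sin(2.0 * x * math.pi)) * 2.0 / 3.0
    ret += (20.0 * math.sin(x * math.pi) + 40.0 * math.sin(x / 3.0 * math.pi)) * 2.0 / 3.0
    ret += (150.0 * math.sin(x / 12.0 * math.pi) + 300.0 * math.sin(x / 30.0 * math.pi)) * 2.0 / 3.0
    return ret

def wgs84_to_gcj02(lat, lon):
    """Approximate the WGS-84 -> GCJ-02 shift applied inside mainland China."""
    dlat = _transform_lat(lon - 105.0, lat - 35.0)
    dlon = _transform_lon(lon - 105.0, lat - 35.0)
    radlat = math.radians(lat)
    magic = 1 - EE * math.sin(radlat) ** 2
    sqrt_magic = math.sqrt(magic)
    dlat = (dlat * 180.0) / ((A * (1 - EE)) / (magic * sqrt_magic) * math.pi)
    dlon = (dlon * 180.0) / (A / sqrt_magic * math.cos(radlat) * math.pi)
    return lat + dlat, lon + dlon

# Roughly Chengdu; the computed shift is a few hundred meters, the same
# order of magnitude as the offsets measured on the maps above.
lat, lon = 30.66, 104.06
glat, glon = wgs84_to_gcj02(lat, lon)
dy = (glat - lat) * 111_320                                # approx. meters per degree latitude
dx = (glon - lon) * 111_320 * math.cos(math.radians(lat))  # approx. meters per degree longitude
print(f"offset near Chengdu: about {math.hypot(dx, dy):.0f} meters")
```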

Another difference between these maps is the wide variation in the amount of detail in the streets data for China. OpenStreetMap was the most complete; the other web mapping platforms offered varying levels of detail, some seriously lacking (surprisingly, even in 2017) in almost every type of street except major freeways. The streets content was much more complete in other countries.

It all comes back to identifying your end goals in using any sort of GIS or mapping package. Being critical of the data can and should be part of your decision-making process and of your choice of tools and maps. By the time you read this, the image offset problem could have been resolved. Great! But are there now new issues of concern? Data sources, methods, and quality vary considerably among different countries. Furthermore, the tools and data change frequently, along with the processing methods; being critical of the data is not something to practice just once, but rather is fundamental to everyday work with GIS.

Aqua People? Reflections on Data Collection and Quality

November 20, 2016

Data quality is a central theme of this blog and our book.  Here, we focus on quality of geospatial information, which is most often in the form of maps.  One of my favorite maps in terms of the richness of information and the choice of symbology is this “simple map of future population growth and decline” from my colleague at Esri, cartographer Jim Herries.  Jim symbolized this map with red points indicating areas that are losing population and green points indicating areas that are gaining population.  This map can be used to learn where population change is occurring, down to the local scale, and, with additional maps and resources, help people understand why it is changing and the implications of growth or decline.

But the map can also be an effective tool to help people understand issues of data collection and data quality. Pan and zoom the map until you see some rivers, lakes, or reservoirs, such as Marston Reservoir in Littleton, Colorado, shown on the map below. If you zoom in to a larger scale, you will see points of “population” in this and nearby bodies of water. Why are these points shown in certain lakes and rivers? Do they represent “aqua people” who live on houseboats or who are perpetually on water skis, or could the points be something else?

[Image: dot density population map showing dots inside Marston Reservoir]

The points are there not because people live in or on the reservoir, but because the dots are randomly assigned within the statistical area that was used. In this case, the statistical areas are census tracts or block groups, depending on the scale being examined. The same phenomenon can be seen on dot density maps at the county, state, or country level. Nor is it confined to population data: dot density maps showing soybean bushels harvested by county could also place dots in the water, as could maps of the number of cows or pigs, or even of soil chemistry from sample boreholes. In each case, the dots do not represent the actual locations where people live, animals graze, or soil was tested. They are randomly distributed within the data collection unit; here, at the largest scale, the unit is the census block group, and randomly distributing the points means that some points fall “inside” the water polygons. A simple sketch of this placement logic follows.
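
Here is a minimal sketch of how a dot density renderer might scatter points; shapely is assumed, and the block group and reservoir polygons are made up for illustration:

```python
import random
from shapely.geometry import Point, box

def random_dots(unit, count, seed=42):
    """Scatter `count` dots uniformly at random inside `unit` via rejection sampling."""
    rng = random.Random(seed)
    minx, miny, maxx, maxy = unit.bounds
    dots = []
    while len(dots) < count:
        p = Point(rng.uniform(minx, maxx), rng.uniform(miny, maxy))
        if unit.contains(p):
            dots.append(p)
    return dots

# A made-up block group with a reservoir occupying part of it
block_group = box(0, 0, 10, 10)
reservoir = box(4, 4, 7, 7)

dots = random_dots(block_group, 200)   # 200 "people" counted in this unit
wet = [d for d in dots if reservoir.contains(d)]
print(f"{len(wet)} of {len(dots)} dots landed in the reservoir")
# Unless the renderer explicitly masks out water, some dots will always land
# there: the dots encode a count per unit, not actual dwelling locations.
```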

Helping your colleagues, clients, students, or some other audience you are working with understand concepts such as these may seem insignificant but is an important part of map and data interpretation.  It can help them to better understand the web maps that we encounter on a daily basis.  It can help people understand issues and phenomena, and better enable them to think critically and spatially.  Issues of data collection, quality, and the geographic unit by which the data was collected–all of these matter.  What other examples could you use from GIS and/or web based maps such as these?

Understanding Data: It is Critical!

November 22, 2015

Think of spatial data as the fuel for your GIS engine.  It is fundamental to any spatial analysis.  On listservs, LinkedIn, GeoNet, in this blog, other blogs, and in our book, discussions about data are commonplace.  The volume of spatial data available has increased dramatically in recent years as have the formats in which that data is stored, and the means by which that data is delivered to the user—via web-mapping services, servers, portals, media, user-defined boxes, predefined tiles, and more.

In this avalanche of spatial data, it is more important than ever to encourage your users to fully understand the data they are using. Stakeholders sometimes view anything on the computer, including maps, as accurate and complete. Maps are incredibly useful, but they are inherently full of errors and distortions, from the map projection they are drawn on, to missing data, to generalized lines. Nowadays, anyone can make a digital map. Help the users of your data understand that data quality affects subsequent analysis. For example, in a lesson I frequently teach on plate tectonics, I ask students to study 2001’s largest earthquake, shown below (south of the tip of the arrow):

Using a measure tool, students determine that the earthquake occurred 4 kilometers off the coast of Peru. But then I ask them to consider that the generalized coastline was digitized at 1:30,000,000 scale. How confident can we be, based on this shoreline, that the earthquake was offshore? Consider the classic geography problem of calculating the length of the British (or any) coastline: the more detailed the scale, the longer the coastline becomes, because at larger and larger scales the coastline begins to include every cape and bay. Peru’s coastline may actually twist and turn here, so the earthquake could have occurred on the beach. The “so what” and spatial thinking discussion continues with the impacts of coastal earthquakes versus underwater quakes, and possible tsunamis.
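
A quick experiment makes the point. The sketch below (shapely is assumed, and the “shoreline” is synthetic) measures the same line after increasing amounts of generalization:

```python
import math
from shapely.geometry import LineString

# A synthetic, deliberately wiggly "shoreline" (coordinates in meters);
# a real test would use detailed shoreline data in a projected CRS.
coast = LineString([(i * 10.0, 400 * math.sin(i / 10) + 150 * math.sin(i * 0.7))
                    for i in range(1000)])

for tolerance in [0, 100, 500, 2000]:   # generalization tolerance in meters
    line = coast.simplify(tolerance) if tolerance else coast
    print(f"tolerance {tolerance:>4} m: length {line.length / 1000:6.1f} km")

# The more the shoreline is generalized, the shorter (and smoother) it gets.
# Against a 1:30,000,000 coastline, a 4 km "offshore" measurement is well
# within the generalization error of the shoreline itself.
```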

Encourage your data users – whether they are students, customers, managers, the general public, or others – to be critical of spatial data: to know its source, who produced it, when and why it was produced, the scale at which it was produced, and its content. Show them how to create and access metadata. They will then be able to critically evaluate spatial information and decide whether to use it in their present and future decision making. And it is my hope that when they produce their own data, they will tag and document it thoroughly.

Know Your Data! Lessons Learned from Mapping Lyme Disease

August 16, 2015

I have taught numerous workshops using Lyme disease case counts from 1992 to 1998 by town in the state of Rhode Island. I began with an Excel spreadsheet and used Esri Maps for Office to map and publish the data to ArcGIS Online. The results are here.
[Image: map of Lyme disease cases by town in Rhode Island]

As the first decade of the 2000s came to a close, my colleague and I wanted to update the data with information from 1999 to the present, and so we contacted the people at the Rhode Island Department of Health. They not only provided the updated data, for which we were grateful, but they also provided valuable information about the data. This information has wider implications for the data quality issues we frequently discuss on this Spatial Reserves blog.

The public health staff told us that Lyme disease surveillance is time- and resource-intensive. During the 1980s and 1990s, as funding and human resource capacity allowed, the state ramped up surveillance activities, including robust outreach to healthcare providers. Prioritizing Lyme surveillance allowed the state to obtain detailed clinical information for a large number of cases and to classify them appropriately. The decrease observed in the 2004-2005 case counts was due to personnel changes and a shift in strategy for Lyme surveillance. Resource and priority changes reduced their active provider follow-up. As a result, in the years since 2004, the state has been reporting fewer cases than in the past. They believe this decrease in cases is a result of changes to surveillance activities and not of a change in the incidence of the disease in Rhode Island.

If this isn’t the perfect example of “know your data,” I don’t know what is. Without the above information, one would surely have drawn erroneous conclusions about the spatial and temporal patterns of Lyme disease. This kind of information often does not make it into standard metadata forms, which is also a reminder that contacting the data provider is often the most helpful way of obtaining the “inside scoop” on how the data was gathered. I created a video highlighting these points. And rest assured that we made certain this information was included in the metadata when we served the updated data.

Confusion and Conflict in Boundary Law: A Recent Case along the Coast

April 27, 2015

An article in The American Surveyor by Michael J. Pallamary highlights a recent case in which, the author states, the US Supreme Court has introduced confusion and conflict into boundary law. The law was intended to settle a decades-old dispute over the location of California’s offshore boundary, a line common with the United States’ boundary in that part of the world. The conflict stemmed from 1946, when the federal government sued California for leasing land for oil drilling offshore from Long Beach, asserting that the submerged lands belonged to the federal government and not to the state. The state boundary lies, apparently, three geographical miles from the coastline.

Any geographer or GIS analyst must see red flags fly upon reading that a “coastline” is the basis for any boundary: coastlines change, sea levels change, the definition of the coastline itself means different things to different people, and, remembering the good ol’ coastline paradox, scale matters! The author, a surveyor with 50 years of professional experience who knows what he is talking about, gets right to the point: the notion of “absolute certainty” in the recent law and prior laws “appears to be derived from the misunderstood notion of significant figures and little to no understanding of geodesy.” He goes on to state that one organization in California still uses nautical miles while most other agencies use geographical miles, and points out the differences between low water, ordinary low water, mean low water, lower low water, mean high water, and, well, you get the idea. He also notes that the coordinate values in the law are expressed as both NAD 83 and WGS 84 UTM, when it is well known that these are not the same and, moreover, are in an “endless state of flux.”
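
To make the nautical-versus-geographical-mile problem concrete, here is a quick back-of-the-envelope calculation; the only assumption is using the WGS-84 equatorial radius to derive the geographical mile:

```python
import math

WGS84_A = 6378137.0  # WGS-84 semi-major axis, meters (assumed for this estimate)

# A geographical mile is one minute of arc along the equator;
# a nautical mile is exactly 1852 m by international definition.
geographical_mile = 2 * math.pi * WGS84_A / (360 * 60)
nautical_mile = 1852.0

print(f"geographical mile: {geographical_mile:.3f} m")    # ~1855.325 m
print(f"nautical mile:     {nautical_mile:.3f} m")
print(f"difference over a 3-mile boundary: "
      f"{3 * (geographical_mile - nautical_mile):.1f} m")  # ~10 m
# Roughly ten meters of ambiguity from the unit choice alone, before even
# touching tidal datums (mean low water, lower low water, ...) or the
# NAD 83 versus WGS 84 distinction.
```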

Why does all this matter? The locations of boundaries are of critical concern to our 21st Century world. In terms of coastal boundaries, yes, energy extraction remains important as it was during the 1940s, but think of additional issues: measuring and assessing property and boundaries of all sorts along coasts, zoning, natural resource protection, shipping, defense and security, responsibilities for emergency management, and a whole host of other coastal and national concerns.

I couldn’t help but sigh and scratch my head upon reading the complexities detailed in this article. It provides a good update to the discussion in our book about the intricacies of boundaries, the importance of datums, precision and accuracy, and knowing what you are doing when you work in the field of geotechnologies. In talking with the author, I understand that a petition has been filed asking that the legislation be redone. That is good news, but I wonder in how many other cases around the world boundary-related laws are passed without consulting the surveying and geodesy community. Maybe I don’t want to know the answer! So, until we get our ideal world, I think it is important for all of us in the geospatial community to keep communicating why this work matters to our 21st Century world. It is my hope that your work will be visible, so that you will be called upon from time to time by decision makers at all levels to advise them on legislation that affects us all.

[Image: contour tracing the Line of Mean High Water of the Pacific Ocean, from The American Surveyor]

I highly recommend regular reading of The American Surveyor, in addition to the Spatial Reserves blog!