
Archive for June, 2023

What to do when your data service goes missing?

June 26, 2023

As we have written extensively in our book and on this blog, the advent of data-as-services has been nothing short of revolutionary for the entire field of GIS and for GIS users. Most of these services, especially those from authoritative agencies, tend to be stable and reliable. However, they live on the web, after all, and the web is a rapidly evolving collection of technologies. The most efficient ways to serve GIS data change as time advances. Coupled with this dynamic environment is the fact that the devices we use to access these services are themselves tied to particular types and versions of operating systems. Furthermore, modern computing tools are frequently customized by data users, so my interaction with "tool x," and how it looks to me, is very likely to differ from your interaction and the way it looks to you.

Therefore, occasionally a data layer that we have been using goes offline, for any number of reasons. What actions can we take when that happens? I recently experienced this with a data set that went offline. My purpose for the data was to create an educational lesson based on it, so while nothing life-threatening was at stake, the data was critical to my specific needs. In this case it was a data set of vector bathymetry for all five of the Great Lakes of the USA and Canada. The data set I needed, which I had been reviewing over the preceding months and weeks, suddenly disappeared from all sources as a feature service on the weekend before the time I had set aside to work on it.

Several options were available to me. First, I could have downloaded raster data and created vector bathymetry from it. Second, additional research showed that I could have recreated the data from the original NOAA feature services that appear in several different locations online. Upon further investigation of that data, I found that the field names and structure for Lake Superior were different from those of the other four lakes, so I was faced with resolving that issue through table manipulation. None of these methods was impossible, but each would probably have taken several days of my time, which I didn't really have: the days I had allotted to this effort were for writing the lesson, not for manipulating the data. In the end, after several inquiries, some of my colleagues re-published the data (bless them!), and all I had to do was point to the new URL.
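For the curious, here is a minimal sketch, in Python with GeoPandas, of the kind of table manipulation that harmonizing the five lakes would have required. The file paths and the mismatched field names (e.g., "Depth_Meters" vs. "DEPTH_M") are hypothetical stand-ins; the actual NOAA schemas would need to be inspected first.

```python
# Harmonize field names across the five lake bathymetry layers, then merge.
# Paths and field names are hypothetical placeholders.
import geopandas as gpd
import pandas as pd

superior = gpd.read_file("lake_superior_bathymetry.shp")
others = [gpd.read_file(f) for f in (
    "lake_michigan_bathymetry.shp",
    "lake_huron_bathymetry.shp",
    "lake_erie_bathymetry.shp",
    "lake_ontario_bathymetry.shp",
)]

# Rename Lake Superior's divergent fields to match the other four lakes
superior = superior.rename(columns={
    "Depth_Meters": "DEPTH_M",     # hypothetical mismatch
    "Contour_Id": "CONTOUR_ID",    # hypothetical mismatch
})

# Keep only the columns shared by all five layers, then merge into one layer
shared = [c for c in superior.columns
          if all(c in lake.columns for lake in others)]
merged = gpd.GeoDataFrame(
    pd.concat([superior[shared]] + [lake[shared] for lake in others],
              ignore_index=True),
    crs=superior.crs,
)
merged.to_file("great_lakes_bathymetry.gpkg", driver="GPKG")
```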

An additional thought running through my head during this challenge was that I could have spared myself some worry if I had saved a copy of the feature service into my own ArcGIS Online account (with proper citation of the original source, of course). Then the data would not disappear, as long as I kept a copy in my own organization, in this case in ArcGIS Online. But that gives rise to yet another concern: how often should we as data users consider saving our own version of a valuable data set? If a data set is not restricted from copying, it can be done, but sometimes data sets are simply too large for us to make our own instance. Surely each data user shouldn't have to make their own copy of high-resolution satellite imagery, lidar data, UAV data, every water well in North America, or some other large vector data set. Saving our own copy of data is how we thought about data before the advent of data-as-services in the cloud. Do we really need to go back to those days? No, but there may be data so important to me that I want to own it and thereby reduce the risk of it going offline when I really need it. In the case of the data I needed, I did end up contacting the original providers. They re-published the data and I was once again happy.
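If you do decide a data set is important enough to keep your own copy, the ArcGIS API for Python makes that straightforward. Below is a minimal sketch, assuming a placeholder item ID and credentials; it is one reasonable approach, not the only one.

```python
# Save your own copy of a feature service in ArcGIS Online.
# The item ID, usernames, and titles are placeholders.
from arcgis.gis import GIS

# The organization hosting the feature service (anonymous access here)
source = GIS()
item = source.content.get("<feature-service-item-id>")

# Your own organization, where the copy will live
my_org = GIS("https://www.arcgis.com", "my_username", "my_password")

# Option 1: clone the hosted feature layer into your organization
copies = my_org.content.clone_items([item])

# Option 2: export a static snapshot you fully own (works if you own
# the item or the service permits export):
# snapshot = item.export("great_lakes_bathymetry_copy", "File Geodatabase")

# Either way, cite the original source in the new item's metadata
for copy in copies:
    copy.update(item_properties={
        "accessInformation": "Source: original data provider; "
                             "copied for archival and educational use."
    })
```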

The situation described above might be one that you, too, will encounter someday, if you haven't already. It gives rise to several key concerns. What are your thoughts about this?

–Joseph Kerski

A section of the data set I was using, which suddenly disappeared from the web as a data service. Has this ever happened to you? What strategies did you use to deal with it?

Categories: Public Domain Data

One local government’s progress in ensuring data quality

June 12, 2023

King County, in Washington State, USA, has long been a leader in the rigorous application of GIS and in serving its spatial data. I remember working with them on their TIGER file updates, while serving as a geographer for the US Census Bureau, long before most of you reading this essay were born. Suffice it to say that they really know their data, and moreover, they think about their data a lot. As the focus of our book and this blog often lands on data quality, King County's words of wisdom on this topic are worth reading. Spanning three parts (part 1, part 2, and part 3), these essays offer insights to other organizations on how King County builds quality standards into its many departments and programs, and how those standards are adhered to, checked, and updated.

The County's three levels of improving and ensuring data quality, at the time the above articles were written, were as follows:

  1. GIS Maintenance Prioritization and Data Review.
  2. Validation of Spatial Data Warehouse Objects.
  3. Quality Assessment of Metadata Completeness and Content.

How does such a large and populous county, which contains the city of Seattle and over 2 million people, prioritize which layers to update? Thinking more broadly: How does your organization decide which layers to update? And how often? Which geographic areas, or themes, or attributes?

In the case of King County, their Spatial Data Warehouse (SDW) contained over 1,500 GIS layers and tables back in 2017, when these articles were written; I assume the warehouse contains even more layers, with even more attributes and formats, now. A GIS priority initiative drove development of a system that (a) identified framework-layer status for SDW layers, (b) monitored the actual update frequency of these layers as compared to their stated update frequency, and (c) estimated the workload required to maintain each layer. This led to a prioritization of the layers so that "staffing resources can be allocated most efficiently."
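What might the monitoring in (b) look like in practice? Here is a toy sketch; the catalog structure, layer names, and frequencies are my own illustration, not King County's actual system.

```python
# Flag layers whose last update exceeds their stated update frequency.
# All names and dates below are illustrative.
from datetime import datetime, timedelta

catalog = {
    "parcels":      {"stated_freq_days": 7,   "last_updated": datetime(2017, 5, 1)},
    "street_names": {"stated_freq_days": 30,  "last_updated": datetime(2017, 3, 15)},
    "contours":     {"stated_freq_days": 365, "last_updated": datetime(2016, 1, 10)},
}

def overdue_layers(catalog, as_of):
    """Return (layer, days overdue) pairs, most overdue first."""
    report = []
    for name, info in catalog.items():
        due = info["last_updated"] + timedelta(days=info["stated_freq_days"])
        if as_of > due:
            report.append((name, (as_of - due).days))
    return sorted(report, key=lambda pair: pair[1], reverse=True)

for name, days_late in overdue_layers(catalog, datetime(2017, 6, 1)):
    print(f"{name}: {days_late} days past its stated update frequency")
```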

The second essay discusses processes and tools for validating the contents of the SDW. Multiple linked automated and manual steps help ensure good internal consistency and an anomaly-free environment. I like the way their essays describe dataset tracking as "cradle-to-grave"; as I said, these folks really do care about their data and give it the required care and feeding! This tracking occurs in four phases:

  1. Dataset notification and pre-posting.
  2. Registration, posting, and SDW check-in.
  3. SDW omission and commission error reporting (sketched below).
  4. Dataset archiving and retirement.

Yes, just like books in a library, not all datasets need to be kept up to date; some need to be purged and retired from time to time as they are replaced with newer or higher-resolution versions, while recognizing the historical value in certain layers as well.
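Phase 3 lends itself to a simple illustration. In an omission error, a registered dataset is missing from the warehouse; in a commission error, something sits in the warehouse that was never registered. The set comparison below is my own toy sketch of that idea, not King County's actual tooling.

```python
# Toy omission/commission report: compare the registration catalog
# against what is actually present in the warehouse.
registered = {"parcels", "street_names", "contours", "zoning"}
in_warehouse = {"parcels", "street_names", "hydrants"}

omissions = registered - in_warehouse    # expected but absent
commissions = in_warehouse - registered  # present but never registered

print("Omission errors:", sorted(omissions))      # ['contours', 'zoning']
print("Commission errors:", sorted(commissions))  # ['hydrants']
```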

The third essay discusses how they assess the quantitative completeness and the qualitative content of metadata created for data published to the SDW. I like that their metadata also provides contact information on whom to call if you have questions about the data; haven't we all been in that situation many times over the course of our careers? The metadata provides explanations for all of what King County calls "pesky codes" (oh, I love their terminology) stored in their data tables. King County uses the FGDC CSDGM as the structure and content guide for their metadata, and they were using Python scripts to automate the process and generate the templates long before many other organizations were doing so. But they also have a data steward add the human elements to the documentation, including the abstract, purpose, and lineage (which, as they point out, is too often neglected, and they definitely don't want to neglect it!). Their metadata even includes plenty of graphics-enabled help.
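In the spirit of that automation, here is a simplified sketch of generating an FGDC CSDGM metadata template with Python's standard library. Only a handful of the standard's elements are shown, and the placeholder text marks exactly the human elements a data steward would still need to complete.

```python
# Generate a skeletal FGDC CSDGM metadata record, leaving the human
# elements (abstract, purpose, lineage) for a data steward to complete.
import xml.etree.ElementTree as ET

def csdgm_template(title, contact_person, contact_phone):
    metadata = ET.Element("metadata")

    idinfo = ET.SubElement(metadata, "idinfo")
    citeinfo = ET.SubElement(ET.SubElement(idinfo, "citation"), "citeinfo")
    ET.SubElement(citeinfo, "title").text = title

    descript = ET.SubElement(idinfo, "descript")
    ET.SubElement(descript, "abstract").text = "TO BE COMPLETED BY DATA STEWARD"
    ET.SubElement(descript, "purpose").text = "TO BE COMPLETED BY DATA STEWARD"

    # Contact information: whom to call with questions about the data
    cntinfo = ET.SubElement(ET.SubElement(idinfo, "ptcontac"), "cntinfo")
    ET.SubElement(ET.SubElement(cntinfo, "cntperp"), "cntper").text = contact_person
    ET.SubElement(cntinfo, "cntvoice").text = contact_phone

    # Lineage lives under data quality: the element too often neglected
    lineage = ET.SubElement(ET.SubElement(metadata, "dataqual"), "lineage")
    procstep = ET.SubElement(lineage, "procstep")
    ET.SubElement(procstep, "procdesc").text = "TO BE COMPLETED BY DATA STEWARD"

    return ET.tostring(metadata, encoding="unicode")

print(csdgm_template("Hypothetical county layer", "GIS Data Steward", "555-0100"))
```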


Having spent time with the King County open data portal, here, I find it very friendly to data users. How to use the portal is nicely explained here. The portal offers a variety of formats, downloads, and streaming data services, including something you don't see very often: the ability to filter the data before you download or stream it. Clever! A sketch of what that filtering looks like programmatically appears below.

Part of the King County GIS portal.
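That filter-before-download capability can also be exercised against a feature service's REST query endpoint, as in this sketch. The URL and field names are placeholders, not an actual King County service.

```python
# Filter features server-side with a where clause, rather than downloading
# everything and filtering locally. URL and fields are placeholders.
import requests

url = "https://<host>/arcgis/rest/services/<service>/FeatureServer/0/query"
params = {
    "where": "CITY_NAME = 'Seattle'",  # the server applies the filter
    "outFields": "*",
    "f": "geojson",
}
response = requests.get(url, params=params, timeout=60)
response.raise_for_status()
filtered = response.json()
print(f"Received {len(filtered['features'])} filtered features")
```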

These innovative King County folks are realists; they state quite plainly, "Many of the layers in the SDW are updated less frequently than stated in their metadata, and less frequently than is optimal. As with many organizations, there are fewer resources available for data maintenance than are required to accomplish all maintenance on schedule." But despite the constraints they are under, their journey and work offer guidance to others seeking to serve their data users, internally and externally, with the geospatial data that is increasingly in demand.

–Joseph Kerski

Categories: Public Domain Data