Data Pipelines: A new tool to import data from 3rd Party APIs, AWS, CSVs, and more

January 8, 2024

In June 2023, ArcGIS Online was released with a new data integration capability: Data Pipelines. This application, currently in beta, provides a workbench user interface to import and clean data from cloud data stores such as AWS and Snowflake, and to read files in formats such as CSV and GeoJSON. Once you have connected to your data, you can engineer it to a desired state and then write it out to a feature layer. This new capability in ArcGIS Online is central to the themes of this Spatial Reserves data blog, and I trust it will be of interest to our readers.

According to my colleague’s post linked here, you can:

  • Connect to datasets in your external data stores, like Amazon S3 or Snowflake.
  • Ingest public data that is accessible via URL, such as datasets found in open data portals or a downloadable CSV provided by your local government.
  • Filter and clean your data using data processing tools, like Filter by attribute, Select fields, and Remove duplicates.
  • Enhance your data by joining it with information from Living Atlas layers using the Join tool, or use Arcade functions to calculate field values using the Calculate field tool.
  • Integrate and clean data in ArcGIS Online with an easy-to-use drag-and-drop interface.
  • Create reproducible, no-code data prep workflows.

As my colleague Brian Baldwin recently illustrated, he pointed Data Pipelines at the New York City 311 dataset hosted on the city's Open Data platform and built a layer from it in about two minutes. Instead of the typical workflow from an open data site, where you would have to export the data or write a Python script to do so, Brian created a data pipeline that automatically brings the data in and can even run on a recurring schedule to keep the data up to date.

To launch the tool, in ArcGIS Online, go to the app launcher (the icon that looks like a 3×3 grid of dots in the upper right of the ArcGIS Online interface) and select Data Pipelines. Then create your first data pipeline. The user interface provides a blank canvas where you construct elements in a data model that tells the app what you want to do with the data. In the UI, you then provide the:

  • Inputs–These are the connections to data sources used to read in the data you want to prepare. You can add one or multiple inputs to build your workflow. A full list of supported inputs can be found here.
  • Tools–Once you are connected to your data, you can configure tools to prepare and transform it. For example, you can filter for certain records using queries, integrate datasets using joins, merge multiple datasets together, or calculate a geometry field to enable location. A full list of the available tools can be found here.
  • Outputs–Once your data is prepared, it can be written to feature layers. You can create a new feature layer or update existing feature layers. For more detailed information on configuring data pipeline outputs, see the output feature layer documentation.
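Conceptually, the three parts above form a read, transform, write pattern. As a rough illustration only (not Data Pipelines code; the tool itself is no-code), here is a plain-Python sketch of the same Filter / Select fields / Remove duplicates / Calculate field sequence on some invented records:

```python
# Plain-Python sketch of the pipeline pattern: filter by attribute,
# select fields, remove duplicates, then calculate a new field.
# All records and field names below are invented for illustration.

def run_pipeline(records):
    # Filter by attribute: keep only open service requests
    rows = [r for r in records if r["status"] == "Open"]
    # Select fields: keep only the columns the output layer needs
    rows = [{k: r[k] for k in ("id", "complaint", "lon", "lat")} for r in rows]
    # Remove duplicates: the first occurrence of each id wins
    seen, unique = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)
    # Calculate field: derive a point geometry from the lon/lat columns
    for r in unique:
        r["geometry"] = (r["lon"], r["lat"])
    return unique

sample = [
    {"id": 1, "status": "Open",   "complaint": "Noise", "lon": -73.99, "lat": 40.73},
    {"id": 1, "status": "Open",   "complaint": "Noise", "lon": -73.99, "lat": 40.73},
    {"id": 2, "status": "Closed", "complaint": "Heat",  "lon": -73.95, "lat": 40.70},
]
result = run_pipeline(sample)
print(result)
```

In Data Pipelines the same steps are configured by dragging tools onto the canvas rather than written as code.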

Here is what the UI looks like. The drag-and-drop user experience reminds me of Model Builder in ArcGIS Pro. On an instructional note, I believe this type of diagram, and the flow between its elements, fosters spatial thinking as well as GIS skills.

A view of data pipelines in use.

The introductory blog includes a video overview. Once you are ready to get started, you can create your first data pipeline via this tutorial. The Esri Community space for Data Pipelines is another great place to learn more.

I look forward to using this tool and to hearing your reactions to it.

–Joseph Kerski

Categories: Public Domain Data

Exploring National Data Hubs

December 25, 2023

National data hubs provide data about a nation’s people, place, and economy. National portals and libraries are intended to help users gain access to basic information to keep users well informed and assist in good policy and decision making. Of interest to the readers of this blog is that increasing numbers of national portals include geospatial data in their offerings.

A selected short list of these is as follows: 

Ireland: https://www.geohive.ie/

Australia: https://digital.atlas.gov.au/

Burkina Faso SDG hub: https://sustainable-development-goals-bfdatahub.hub.arcgis.com/

The UAE: https://1map-fcsa.hub.arcgis.com/

Saudi Arabia SDG hub (very newly launched): https://www.geoportal.sa/portal/apps/sites/#/ksa-sdgs-en/

Many of the above can, in my opinion, be viewed as "best practice" examples that others could emulate, for example by using ArcGIS Hub's ease of use and functionality. The hub sites feature data but also selected data stories: among the most critical issues to a country's citizens are a healthy and stable population, a prosperous economy, and a healthy natural environment. Key to reaching these goals are programs focused on education, health care, and social protections, and initiatives to support a stable and sustainable economy as well as those focused on protecting the nation's natural resources and unique environment. These stories connect with these issues and highlight the need to collect, compile, and communicate, so that information will be available to decision makers in a wide variety of formats, including maps and geospatial data layers.

One of the most useful features of these sites, which should be of interest to the readers of this data blog, is their embedded web maps, often available at multiple scales.

I encourage you to investigate these resources and look forward to hearing your reactions. Some other data hubs are listed here. Do you know of other ArcGIS Hub sites or other data libraries focused on official statistics? If so, feel free to share them in the comments section.

–Joseph Kerski

Data Ethics Working Group Recent Details Published

December 11, 2023

A recent session organized by the Data Ethics Working Group of CODATA (for more information, see https://codata.org/initiatives/working-groups/data-ethics/) at SciDataCon 2023 and International Data Week 2023 should be of interest to the readers of this blog and our book. CODATA is the Committee on Data of the International Science Council (ISC). CODATA exists to promote global collaboration to improve the availability and usability of data for all areas of research. CODATA supports the principle that data produced by research and susceptible to be used for research should be as open as possible and as closed as necessary. CODATA also works to advance the interoperability and usability of such data: research data should be intelligently open or FAIR. We have written about FAIR principles in this blog in the past. By promoting the policy, technological, and cultural changes that are essential to promote Open Science, CODATA helps advance ISC's vision and mission of advancing science as a global public good.

The growing application of big data and artificial intelligence in scientific research raises ethical and normative challenges, particularly in relation to openness, privacy, transparency, accountability, equity, and responsibility. The Data Ethics Working Group of CODATA is working with global scholars to collaboratively establish a basic consensus for further activities and research on data ethics principles and a data ethics framework covering the whole data life cycle. This will help CODATA advance its mission of championing global open data exchange and applications in alignment with the UNESCO Recommendation on Open Science. The session explored research and scholarly practice related to data ethics to advocate for open and ethical data practices at the global level.

The sessions took the form of a lightning/ignition talk and structured panel discussion to explore issues related to the following four themes:

Theme 01: Ethics and Scientific Integrity –Joy Jang

The UNESCO Recommendation on Open Science emphasizes the role of research data in making knowledge openly available, accessible, and reusable for everyone. This thematic group focuses on data ethics and research integrity, covering the entire life cycle of research and the multiple perspectives of data users, providers, managers, funders, publishers, and other stakeholders. The transparency, quality, reusability, and impact of research are discussed, along with the management and interpretation of research data, with a focus on collaborative efforts and the role of open scholarship in supporting research integrity.

Theme 02: Ethics and Protection of Personal Data – Masanori Arita

The ethics and protection of sensitive and personal data raise critical questions about appropriate conduct, usage, management, and storage. Questions explored include the politics and political economy of data: who and what has power in the context of data, and how do these power relationships play out in different environments, such as the different markets and governments in which data is used? Modern technologies have also enabled full sequencing of personal genomes, which are not only personal but also communal, national, and even continental. Discussion includes current data policies and the pros and cons of data handling strategies.

Theme 03: Ethics and Indigenous Data Governance – Johannes John-Langba

In the era of open data and open science it is important that data on Indigenous knowledge is shared in an ethical manner. Decisions on what data is to be shared should lie with Indigenous populations themselves, ensuring their autonomy and informational self-determination. The subgroup focuses on data principles such as CARE and JUST. Moreover, both Indigenous data sovereignty and data ethics need institution building for data trustees (and similar intermediaries) which would enable selective digital disclosure. 

Theme 04: Ethics, Global Power and Economic Relations – Louise Bezuidenhout

How the UNESCO Recommendation is implemented to realise open and equitable OS in practice must account for the structural conditions shaping research at a national and individual level. Scholars in many national contexts face barriers such as lack of basic infrastructure, unsupportive national policy, the control of the research agenda by global north funders and the domination of oligopolistic publishers and Big Tech companies. At an individual level, researchers everywhere who do not fit the expected norm of a scholar (white, able bodied, male) face multiple barriers such as conscious and unconscious bias, racism, misogyny, career breaks and societal expectations about caring responsibilities.

The session was co-moderated by: Prof Johannes John-Langba & Prof Lianglin Hu, Co-Chair: Data Ethics Working Group, CODATA.

More details are here: https://www.scidatacon.org/IDW-2023-Salzburg/sessions/508/

SciDataCon 2023 was part of International Data Week 2023: A Festival of Data, taking place 23–26 October in Salzburg. IDW 2023 was hosted by the University of Salzburg through its interdisciplinary Data Science group and the Geoinformatics department, supported by the Governor of Salzburg and with assistance from the Austrian Academy of Sciences – GIScience and the European Umbrella Organization for Geographic Information.

SciDataCon 2023 high-level themes: https://www.scidatacon.org/conference/IDW-2023-Salzburg/Themes/

The full program is at https://www.scidatacon.org/IDW-2023-Salzburg/programme/

–Joseph Kerski


Spatial Data on the Web Best Practices from the Open Geospatial Consortium

November 27, 2023

The Open Geospatial Consortium recently published an update of its document "Spatial Data on the Web Best Practices," here: https://www.w3.org/TR/sdw-bp/, which I believe is relevant reading for the community of readers of this blog and anyone interested in spatial data, particularly those who are hosting data.

The document's purpose is as follows: it advises on best practices related to the publication of spatial data on the Web and the use of Web technologies as they may be applied to location. The best practices presented are intended for practitioners, including Web developers and geospatial experts, and are compiled based on evidence of real-world application. These best practices suggest a significant change of emphasis from traditional Spatial Data Infrastructures by adopting an approach based on general Web standards. As location is often the common factor across multiple datasets, spatial data is an especially useful addition to the Web of data.

Readers of this blog will note items that we have discussed here over the years, including FAIR practices, ethics, and more. The document enumerates 17 best practices.

I especially like the frequent inclusion of “why” in the document, and the efforts the editors have made to create a readable and helpful set of guidelines. It is my hope that all those creating and hosting spatial data will keep this document in close proximity as they do so.

–Joseph Kerski


The Overture Maps Foundation First Datasets Released

November 13, 2023

The Overture Maps Foundation has released its first datasets, which include 59 million points of interest (landmarks, businesses, parks, and others), 785 million (yes, million) building outlines, road network data, and administrative boundaries.

This important initiative, which is steered by several giant tech companies, "could help third-party developers use maps that don't rely on Google and Apple," The Verge's Emma Roth writes. These datasets draw on a range of sources, including the project's member companies, OpenStreetMap, and the USGS 3D Elevation Program. Given their volume and variety, they are important sources for the readers of this blog and anyone using GIS tools and geospatial data.

Overture 2023-07-26-alpha.0 is formatted in the Overture Maps schema, which is described here. The data is available in cloud-native Parquet format and stored on AWS and Azure. Users can select the data of interest and download it by following the process outlined here. As a data user, how to download and/or stream the data may not be immediately apparent, so you will need to navigate the site to determine the best way to access the information; a sample view is below.
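In practice, "selecting the data of interest" usually comes down to a thematic and bounding-box filter applied before download. Here is a minimal Python sketch of that bounding-box predicate; the place records below are invented, and a real workflow would push the same filter into a query engine such as DuckDB running against the Parquet files:

```python
# Hedged sketch: filter a handful of place records to a bounding box,
# the same predicate one would apply to the cloud-native Parquet data.
# The records below are invented for illustration.

def in_bbox(lon, lat, bbox):
    """bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

places = [
    {"name": "Cafe A", "lon": -122.41, "lat": 37.77},  # San Francisco
    {"name": "Park B", "lon": -0.12,   "lat": 51.50},  # London
]
sf_bbox = (-122.52, 37.70, -122.35, 37.83)
selected = [p for p in places if in_bbox(p["lon"], p["lat"], sf_bbox)]
print([p["name"] for p in selected])  # → ['Cafe A']
```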

To learn more, see the article "Exploring the Overture Maps places data using DuckDB, sqlite-utils and Datasette" by Simon Willison, who considers the data release "a really big deal."

–Joseph Kerski


A book review of The GIS Management Handbook, 3rd Edition

October 30, 2023

At a recent statewide GIS conference in Kentucky, I had the pleasure of meeting Peter Croswell, the President of Croswell-Schulte IT Consultants. Peter is also the author of The GIS Management Handbook, now in its 3rd Edition, published by Kessey Dewitt Publications. The book can be obtained here from URISA, and from the author’s consultancy, here. A review appeared in ArcUser not long ago, here.

As the book touches on many topics that are central to the interests of the Spatial Reserves data blog and our book, I asked Peter if I could review his book to share with our readers; he graciously agreed, and his comments and mine are below. I look forward to hearing your reactions!

From Peter:

When I began preparing the first version of The GIS Management Handbook in 2008, I thought it would be a relatively straightforward and not too time-consuming effort, relying largely on my 25+ years in GIS program and project management. It turned into much more than that, involving substantial literature research and contributions from many GIS professionals. Through the two editions that followed, the most recent being the 3rd Edition (2022), it has become an up-to-date, comprehensive, and practical guide to planning, implementing, and managing GIS programs and projects. My goal has also been to provide a useful resource for professionals, academicians, and students. With the release of the Spanish version of the book last year, the audience has expanded quite a bit to readers in Spain and Latin America.

From Joseph:

The book's central topics of how to develop, implement, and operate GIS projects from a management and program perspective are closely aligned with the themes of this blog, because data acquisition, hosting, serving, and use are an important part of any GIS management, whether in a government, nonprofit, academic, or private industry organization. In particular, Chapter 5, about funding and budgeting, and Chapter 6, about copyright, public information access, and related matters, are particularly relevant to data and its implications. The book's chapters on database design (Chapter 7) and GIS projects and management (Chapter 9) also provide not only fundamental concepts but practical, useful advice from someone who has spent his entire career in geospatial technologies.

Right away in Chapter 1 (Section 1.5.5), the author jumps into the data topic with a practical discussion of standards and open systems affecting GIS, continuing in Chapter 2 with the GIS Capability Maturity Model and the central role that data maintenance and sharing play. Data also has a key role in documenting requirements (Chapter 2), where Peter describes data types and formats, and in database design (also Chapter 2). National spatial data infrastructures, ethics, the GIS data product and service market, how to serve data, fee vs. free (Chapter 5), data license agreements, open records laws, crowdsourcing (Chapter 6), data quality management and data sources (Chapter 7), and other topics we regularly discuss in this blog can be more fully understood by reading Peter's book.

What I most like about Peter's book is the literally thousands of examples he provides across a wide variety of topics, which make the book very practical in its focus. Croswell also provides extensive references and organizations through which to investigate best practices further. The length and scope of this book are comprehensive, but Peter has risen to the challenge of keeping it up to date, relevant, and extremely useful. As such, it provides an excellent supplement to our own GIS and Public Domain Data book, and I highly recommend it.

–Joseph Kerski


IPUMS Libraries of Social Science and Health Data

October 16, 2023

Stemming from my days as a US Census Bureau geographer, I have long had deep admiration for the IPUMS data libraries. IPUMS provides census and survey data from around the world, integrated across time and space. IPUMS integration and documentation make it easy to study change, conduct comparative research, merge information across data types, and analyze individuals within family and community contexts. Data and services are available free of charge via https://ipums.org.

These data sets cover families and households, immigration and migration, income and poverty, education, ethnicity and language, marriage and cohabitation, employment, reproductive health, nutrition, care access and utilization, disability, and vaccination. Data formats include spreadsheets, shapefiles, and much more. IPUMS also includes something that many of us in GIS hold dear: the National Historical GIS, with tabular US Census data and GIS boundary files from 1790 to the present. But IPUMS goes far beyond even these amazing data sets. Indeed, IPUMS International is a data dissemination partnership between national statistical offices from around the world and IPUMS at the University of Minnesota. National statistical offices contribute their census and survey microdata and metadata, and IPUMS provides data integration and services to the national statistical office partners.

As a central theme of this blog and our book is "can this data service be trusted," it is worth noting that IPUMS has earned the CoreTrustSeal as a sustainable and trusted data repository. Nearly 500 census and survey files from more than 100 countries have been entrusted to IPUMS and are now part of the largest microdata repository in the world.

Upon entering the IPUMS site, the user sees the main libraries, including IPUMS international, IPUMS USA, IPUMS Time Use, IPUMS Global Health, IPUMS Higher Ed, IPUMS Health Surveys, IPUMS IHGIS (population, housing, and agricultural census data), IPUMS CPS (Current Population Survey microdata including monthly surveys and supplements from 1962 to the present), and IPUMS NHGIS (mentioned above–historical US Census information).

IPUMS is run from the University of Minnesota, and thus education and helpfulness are central to its mission. It therefore comes as no surprise that the IPUMS data user experience is what I would characterize as very user friendly: the user is guided through the steps of selecting a region, a sample, and a set of variables, and is then presented with a "data cart," somewhat like an Amazon.com experience but with data in your cart. It really doesn't get any more straightforward than this, right down to clearly explaining to the user, in plain language, what the chosen records will mean (shown below). I salute their GUI designers and research staff for building this incredible resource.

I encourage you to give the IPUMS resources a try!

–Joseph Kerski


Reflections on Data Equity

October 2, 2023

A central theme of our book and this blog is access to data. The phrase “data equity” is increasingly used to describe who has access to data to make informed decisions. In a recent informative article in GovLoop, equity is defined as the “consistent and systematic fair, just, and impartial treatment of all individuals (pursuant to a US Executive Order, recommendations from the equitable data working group).” And when applied to data, it can “illuminate opportunities for targeted actions that will result in demonstrably improved outcomes for underserved communities.”

To achieve an “equitable data vision”, the following practices will be employed by US federal agencies: Make disaggregated data the norm while protecting privacy, catalyze existing federal infrastructure to leverage underused data, build capacity for robust equity assessment for policymaking and program implementation, galvanize diverse partnerships across levels of government and the research community, and be accountable to the American Public.

Recommendations outlined to achieve the vision set above include: revise the Office of Management and Budget's statistical policy standards for maintaining, collecting, and presenting federal data on race and ethnicity; generate disaggregated statistical estimates, including increased funding for selected statistical agencies; catalyze existing federal government infrastructure to leverage underused data; build capacity for robust equity assessment for policymaking and program implementation; and galvanize diverse partnerships.

Included in the "being accountable" section is something I found particularly noteworthy: "Build data access tools that are user-friendly." To this statement, I say "hooray!", and indeed, advocating for user-friendly tools has been a central theme of this blog for over a decade now. Executive Order 14058 focuses on "transforming federal customer experience and service delivery to rebuild trust in government." Specifically, it refers to a "time tax": when an individual interacts with the government, obtaining the necessary information, permit, or anything else often takes needlessly long; in other words, a "time tax" is imposed on the individual. This is often much too burdensome and must be reduced.

As someone who worked in three federal agencies over 21 years, I salute this goal but realize it will not easily be achieved; the entanglements of regulations, sites created without data users in mind, funding and staffing, and other challenges are many, but we need to start somewhere. The tenets of this order need to be at the forefront of every agency creating and maintaining data portals and libraries, including geospatial data. Furthermore, the agencies need to work together: we have spent decades creating siloed data libraries within organizations. These were noble pursuits for their time, but it is a new era with new challenges, and collaboration and vision are key to making the goals set forth in these documents a reality.

What are your reactions to this article? What stories can you share about your progress in data equity in your own organization?

–Joseph Kerski


Best practices for a statewide state parcel data portal

September 18, 2023

I have long had great respect for the data portals, coordination, and community of practice in North Dakota, as I first wrote about here. I recently asked my colleague Bob Nutsch, the geospatial program coordinator for North Dakota, to share strategies that I believe will be instructive to the reader–focusing on one data layer in particular that is in high demand and yet presents challenges to coordinate and serve–parcels. Particularly if you are tasked with similar duties in your own organization, consider Bob’s words of wisdom here.

–Joseph Kerski

The goal of the North Dakota State Parcel Program is to maintain an accurate, publicly accessible statewide parcel dataset that supports the State of North Dakota's business needs. Other beneficiaries of the dataset include local governments and businesses. The dataset is maintained by aggregating parcel boundary and tax roll data from each of the state's 53 counties and/or their vendors.

To minimize the impact on county staff and their vendors, the Parcel Program is flexible, providing data contribution options for manual and scripted uploading by the data providers, as well as options for automated harvesting of data by the Parcel Program.

A preferred harvesting option is using the open data portals of the counties and/or their vendors. Over the two years the Parcel Program has been in existence, Bob and his team have seen a marked increase in the number of open data portals utilized by the providers of parcel boundary data. Advantages of sourcing data from open data portals include the following:

  • Having a familiar REST API saves our contractor significant time when setting up the Extract, Transform, and Load (ETL) process for a county that moves from another data sharing method to an open data portal and its associated REST API.
  • The familiar interface and functionality make it easy for anyone to view and navigate through the data provided by the county and/or their vendor, assisting in initial set up and subsequent debugging or QA work.
  • Less time is spent by the county and/or their data provider; e.g., they no longer have to zip up a file geodatabase or shapefile and then manually upload the dataset.
  • The “build it once, use it a bunch” benefit to the county is made possible by the county’s investment to stand up the open data portal to meet their business needs first and foremost, followed by responding to data requests of others such as those associated with the Parcel Program. Years ago, at NSGIC, I heard the Arkansas GIO use the “build it once, use it a bunch” phrase for the first time. With the use of open data portals, his description precisely fits open data platforms.
  • Open data portals provide the foundational framework and “common language” for seamlessly and easily sharing data amongst public and private entities, first envisioned by the USGS National Map years ago.

An example of using an open data portal can be found by accessing the Statewide Parcel Dataset via the Hub data portal. The parcel data can be found by browsing within the Land Records theme or simply by searching on "parcels," which shows the "Parcels TaxRoll" table and the "Parcels" map. Clicking on the "Parcels" item allows one to easily view the data. While in the viewer, one can click the Download icon to retrieve the data in a variety of formats and geographic extents, or click the "I want to use this" button, which provides a list of options including links to the REST API. That REST API can be used in web applications and in desktop GIS applications (Esri and non-Esri).
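As a sketch of what consuming such a REST API can look like, the following assembles a typical ArcGIS feature service query with the Python standard library. The endpoint URL is an illustrative assumption, not the actual North Dakota service; substitute the real URL from the portal's "I want to use this" links:

```python
from urllib.parse import urlencode

# Hypothetical ArcGIS feature service endpoint -- substitute the real
# query URL exposed by the Hub portal's REST API links.
base = "https://services.example.com/arcgis/rest/services/Parcels/FeatureServer/0/query"

params = {
    "where": "1=1",            # no attribute filter: all parcels
    "outFields": "*",          # return every field
    "f": "geojson",            # response format
    "resultRecordCount": 100,  # page size
}
url = f"{base}?{urlencode(params)}"
print(url)
```

Adding a resultOffset parameter to successive requests is the usual way to page through a large layer like a statewide parcel dataset.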

If you are using ArcGIS Pro, although you can use the REST API, it is much easier and quicker to lean on the native capabilities of ArcGIS Pro: go to the Catalog view, click the Portal tab, click the "ArcGIS Online" button, and then in the "Search ArcGIS Online" text field type "ndgishub parcels" (you may also wish to apply the "Status – Authoritative" filter to eliminate clutter). Once you have added the parcel data to your map (the related tax roll table is included), you can apply your own symbology, labeling, scales, etc.

–Bob Nutsch


The metadata editor has arrived in ArcGIS Online

September 4, 2023

As our book and this blog have a major focus on data quality, metadata has always been a topic of these essays. With the summer 2023 release of ArcGIS Online, a metadata editor (beta) was included, offering a streamlined experience for creating high-quality geospatial metadata. This new editor enables you to:

  • Quickly complete what’s needed, creating essential metadata compliant with international open standards.
  • Complete more robust documentation, including optional metadata elements.
  • Search and find metadata elements.
  • View, download, and overwrite metadata records.
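To illustrate what "essential metadata" amounts to, here is a small sketch that serializes a handful of core descriptive fields to XML. The element names are simplified assumptions for illustration only; the editor itself writes standards-compliant records for you:

```python
import xml.etree.ElementTree as ET

# Hedged sketch: a handful of essential descriptive fields for a layer.
# Element names are simplified for illustration and are NOT the exact
# schema the ArcGIS Online metadata editor produces.
record = {
    "title": "City Parcels 2023",
    "summary": "Tax parcel boundaries, updated quarterly.",
    "credits": "GIS Department, Example City",
    "useLimitation": "Public domain; verify currency before use.",
}

root = ET.Element("metadata")
for tag, text in record.items():
    ET.SubElement(root, tag).text = text

xml = ET.tostring(root, encoding="unicode")
print(xml)
```

The point of the sketch is simply that a usable record needs, at minimum, a title, a summary, attribution, and use limitations; the editor walks you through these and many optional elements.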

See my colleague's article for more: https://www.esri.com/arcgis-blog/products/arcgis-online/announcements/introducing-metadata-editor-beta/.

First, the administrator for your ArcGIS Online organization needs to enable metadata for the organization. See below for the screen that the administrator will see in this regard.

Enabling metadata editing in an ArcGIS Online organization.

Once that is done, you, as an owner of content in your organization, can use the metadata editor (beta) to create and edit metadata. First, find it under Content > Item Details > Metadata > Open in metadata editor (beta). Then document your data by exploring the options and features shown below.

Furthermore, as a core theme of this blog is how to find geospatial data, and given the widespread use of ArcGIS Online to serve data, I anticipate this metadata editor will be an enormous boon to the entire geospatial community: people will be informed as never before and able to make smart decisions about how, when, and why to use specific geospatial data sets, as documented with the help of the metadata editor. I salute my Esri colleagues for developing this tool and the user community for charting the way forward in its rich use.

–Joseph Kerski
