Search Results

Keyword: ‘big data’

Using Kaggle big data in a GIS

July 9, 2018

Kaggle is a platform for predictive modelling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task, and it is impossible to know beforehand which technique or analyst will be most effective [Wikipedia]. Over half a million people from nearly every country in the world belong to the Kaggle community, and Kaggle was acquired by Google in 2017. You can also learn about R, SQL, machine learning, and other topics on the site. Why mention Kaggle on our geospatial data blog? Kaggle hosts data sets, some of which are spatial in nature and some of which are truly "big data" (such as 9 million open image URLs). As such, it represents a source of information for the GIS analyst, researcher, and instructor.

Because the data posted to Kaggle comes from a global community with diverse interests, expect an unusual array of data sets, from chest x-rays and superheroes to air quality and birdsongs. Some data are from surveys. Many intriguing gems exist; for example, one of the data sets of interest to me as a geographer is the world happiness data, available as a CSV for three different years. The only unfortunate aspect of these tables is the lack of a country code; relying on the country name alone can present problems when joining the data to a map.
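As a quick illustration of the join problem, here is a minimal Python sketch using pandas and geopandas (not part of the original post); the file names and the "Country" and "NAME" columns are assumptions to be replaced with whatever your actual Kaggle download and boundary layer contain.

```python
# Hedged sketch: join a Kaggle happiness CSV to country polygons by name.
# File and column names below are assumptions, not the actual Kaggle schema.
import pandas as pd
import geopandas as gpd

happiness = pd.read_csv("world_happiness_2017.csv")   # assumed CSV from Kaggle
countries = gpd.read_file("world_countries.shp")      # any country boundary layer

# Name-based joins are fragile: normalize whitespace and expect to patch
# mismatches such as "United States" vs. "United States of America".
happiness["Country"] = happiness["Country"].str.strip()
joined = countries.merge(happiness, left_on="NAME", right_on="Country", how="left")

# Countries that failed to join show exactly where an ISO country code would help.
print(joined.loc[joined["Country"].isna(), "NAME"].tolist())
```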

One can also learn about data sources by spending time on the Kaggle site. For example, I learned about Uber Movement, which contains data from selected cities and points of departure; Sports Reference, which someone used to scrape 120 years of Olympic history data; and a cancer imaging archive that someone used to obtain disease type and location. Given the nature of the site, expect all sorts of oddities: my search on mountains of the world resulted in lots of "404 Not Found" errors; some data sets are documented and others not so much; and obtaining some of the data requires the user to be a programmer. Still, Kaggle is a useful and unusual source worthy of attention, and given the rapid evolution in big data and crowdsourcing, which we frequently write about on this blog, I expect that we will be seeing many more sites like this in the future.


A section of the Kaggle listing of data sets, showing the diversity of themes, scales, and sizes. 


Reflections on the Effective Use of Geospatial Big Data article

March 5, 2018

Glyn Arthur, in a thought-provoking article in GIM International entitled "Effective Use of Geospatial Big Data," raises several issues that have been running through the Spatial Reserves blog. The first is that the "heart of any geospatial analysis system is increasingly becoming the server." Glyn, a GIS professional with over 25 years of experience, then dives into one of the chief challenges that this environment brings, namely, dealing with the increasing quantity and variety of data that the world produces. Of particular importance are emerging sensor platforms, which must be incorporated into future GIS applications. The second point is the need to embrace, not avoid, the world of big data and its benefits, while recognizing the challenges it brings. The third point is to carefully consider the true costs of the data server and decision-making solution when making a purchasing decision.

Frankly, I found the "don't beat around the bush" theme of Glyn's article refreshing. This is evident in such statements as, "for mission-critical systems, purposely designed software is required, tested in the most demanding environments. Try doing it cheaper and you only end up wasting money." Glyn also points out that the "maps gone digital" attitude "disables." I think what Glyn means by this is that systems built around the view that GIS is just a digital means of doing what we used to do with paper maps will be unable to meet the needs of organizations in the future (or dare I say, even today). Server systems must move away from the "extract-transform-load" paradigm to meet the high-speed and large-data demands of users. Indeed, in this blog we have praised those portals that allow for direct streaming into GIS packages from their sites, such as here in Utah and here in North Dakota. The article also digs into the nuts and bolts of deciding what solution should be purchased, considering support, training, backwards compatibility, and the needs of the user and developer community. Glyn points out something that might not sit well with some, but I think it is relevant and needs to be grappled with by the GIS community: a weakness of open source software is that training from people with relevant qualifications and a direct relationship with the original coding team is sometimes lacking, particularly when lives and property are at stake.

Glyn cites some examples from work at Luciad with big data users such as NATO, EUROCONTROL, Oracle, and Engie Ineo. Geospatial server solutions, he argues, should be able to connect to a multitude of geographic data formats, publish data with a few clicks, and allow data to be accessed and represented in any coordinate system, including temporal and 3D data with ground elevation and moving objects.


Glyn Arthur’s article about effective use of geospatial big data is well-written and thought-provoking.

Categories: Public Domain Data

Era of Big Data is Here: But Caution Is Needed

September 25, 2017

As this blog and our book are focused on geospatial data, it makes sense that we discuss trends in data, such as laws, standards, attitudes, and tools that are gradually helping more users find the data they need more quickly. But with all of these advancements, we continue to implore decision makers to think carefully about and investigate the data sources they are using. This becomes especially critical, and at times difficult, when that data is in the "big data" category. The difficulty arises because big data is often seen as so complex that it is cited and used in an unquestioned manner.

Equally challenging, and at times troublesome, is when the algorithms based on that data go unchallenged, and when access to those algorithms is blocked to those who seek to understand who created them and what data and formulas they are based on. As these data and algorithms increasingly affect our everyday lives, this can become a major concern, as data scientist Cathy O'Neil explains in her TED talk: "the era of blind faith in big data must end."

In addition, the ability to gain information from mapping social media is amazing and has the potential to help in so many sectors of society. This was clearly evident in the usefulness of the social media posts that emergency managers in Texas and Florida, USA, mapped during the August-September 2017 hurricanes there. However, with mapping social media comes an equal if not greater need for caution, as this article points out in discussing the limitations of such data for understanding health and mitigating the flu. And from a marketing standpoint, Paul Goad cautioned here against relying on data alone.

It is easy to overlook an important point in all this discussion of data, big data, and data science. We tend to refer to these phenomena in abstract terms, but these data largely represent us: our lives, our habits, our shopping preferences, our choice of route on the way to work, the companies and organisations we work for, and so on. Perhaps what we need is less data and data science, and more humanity and humanity science. As Eric Schmidt, former CEO of Google, has said, "We must remember that technology remains a tool of humanity." How can we, and corporate giants, then use these big data archives as a tool to serve humanity?


Use caution in making decisions from data, even if you're using "big data" and algorithms derived from it. Photograph by Joseph Kerski.

Categories: Public Domain Data

Findings of the Big Data and Privacy Working Group

May 11, 2014

US White House senior counselor John Podesta recently summarized an extensive review of big data and privacy that he led. Over 90 days, he met with academic researchers, privacy advocates, regulators, technology industry representatives, advertisers, and civil rights groups. The findings were presented to the President on 1 May 2014 and summarized by Mr Podesta; the full 79-page report is also available. In the report, geospatial data is recognized as an important contributor to big data but does not receive special attention over other types of data. Nevertheless, the report provides a useful overview of the current opportunities of big data and the challenges it poses to privacy.

After discussing some of the technological trends making big data possible, the report details the opportunities it presents: saving lives (through monitoring infections in newborns), making the economy work better (through sensors in jet engines and monitoring of peak electrical demand), and making government work better (by being able to predict reimbursement fraud in insurance, for example). Next, the report raises some of the serious concerns that accompany big data, such as how to protect our privacy and how to make sure that big data does not enable civil rights protections to be circumvented.

Recommendations from the report include advancing the proposed Consumer Privacy Bill of Rights, passing national data breach legislation, extending privacy protections to non-US persons, ensuring that data collected on students in school is used for educational purposes, expanding technical expertise to stop discrimination, and amending the Electronic Communications Privacy Act. In short, the report recognizes the immense benefits that big data brings, but also the challenges, and makes specific recommendations for governments to deal with those challenges.

Categories: Public Domain Data

Geospatial Advances Drive the Big Data Problem but Also its Solution

In a recent essay, Erik Shepard claims that geospatial advances drive the big data problem but also its solution: http://www.sensysmag.com/article/features/27558-geospatial-advances-drive-big-data-problem,-solution.html. The expansion of geospatial data is estimated to be 1 exabyte per day, according to Dr. Dan Sui. Land use data, satellite and aerial imagery, transportation data, and crowd-sourced data all contribute to this expansion, but GIS also offers tools to manage the very data it is contributing to.

We discuss these issues in our book, The GIS Guide to Public Domain Data. These statements from Shepard are particularly relevant to the reflections we offer there: "Today there is a dawning appreciation of the assumptions that drive spatial analysis, and how those assumptions affect results. Questions such as what map projection is selected – does it preserve distance, direction or area? Considerations of factors such as the modifiable areal unit problem, or spatial autocorrelation."

Indeed! Today's data users have more data at their fingertips than ever before. But with that data come choices about what to use, how, and why. And those choices must be made carefully.
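To make the projection point above concrete, here is a minimal sketch (mine, not Shepard's) comparing country areas computed in Web Mercator with areas computed in an equal-area projection; the shapefile and the "NAME" column are assumptions, and any polygon layer with a defined coordinate system would do.

```python
# Hedged sketch: projection choice changes the numbers your analysis produces.
# The input file and "NAME" column are assumptions, not a specific dataset.
import geopandas as gpd

countries = gpd.read_file("world_countries.shp")  # assumed layer in EPSG:4326

# Areas in Web Mercator are badly inflated toward the poles; EPSG:6933 is an
# equal-area projection, so its areas are the ones suitable for analysis.
mercator_km2 = countries.to_crs(epsg=3857).area / 1e6
equal_area_km2 = countries.to_crs(epsg=6933).area / 1e6

comparison = countries[["NAME"]].assign(
    mercator_km2=mercator_km2,
    equal_area_km2=equal_area_km2,
)
print(comparison.head())
```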

Categories: Public Domain Data

Key statements about the importance of spatial data

August 6, 2018

Sometimes it is helpful to have some research results and quotes in support of your data advocacy efforts at your own organization, in your promotion of "why this all matters!" And, of course, of why your efforts need to be funded and supported! Here are a few key quotes about the importance of spatial data, and about what happens when the data doesn't exist.

Kathryn Sullivan, former NASA astronaut, recently commented, "The power of a map to put time and place and phenomena together, to give it to our brains through the most potent input sensor human beings have — our eyes — is a remarkable accelerator for the comprehension and engagement and use of the data that tell us what's on Earth, where are things happening on our planet," as reported in Cheney, S. (2017), "How GIS can help us understand our changing oceans," quoting her remarks at the 2017 Esri Ocean GIS Forum (https://www.devex.com/news/how-gis-can-help-us-understand-our-changing-oceans-91366).

Another extremely useful statement, "Advances in research on resilience and vulnerability are hampered by access to reliable data," can be found in Barrett, C. B., and D. D. Headey. 2014. Measuring resilience in a risky world: Why, where, how, and who? 2020 Conference Brief 1. May 17-19, Addis Ababa, Ethiopia. Washington, DC: International Food Policy Research Institute.

"The lack of data is one of the biggest obstacles to progress toward development goals" comes from the United Nations Independent Expert Advisory Group. 2014. A World That Counts: Mobilising the Data Revolution for Sustainable Development. A Report to the UN Secretary-General. New York, NY: United Nations, p. 28.

Perhaps the strongest argument for more and better data comes from ODI's report The Data Revolution: Finding the Missing Millions, in which the authors cite numerous cases where inadequate data leads to sub-optimal policy decisions. These cases "confirm some of the anecdotal evidence about the lack of good data in developing country ministries." The full citation is: Stuart, E., E. Samman, W. Avis, and T. Berliner. 2015. The data revolution: finding the missing millions. ODI Research Report 03. London: Overseas Development Institute, pp. 51. (https://www.odi.org/sites/odi.org.uk/files/odi-assets/publications-opinion-files/9604.pdf).

Based on the research that Jill Clark and I have done in this area over the past decade, I would add to the ODI statement that in developed countries, some similar challenges exist.  We have documented those in these blog essays for six years.

I sometimes use the statements from this National Academies of Sciences report and those written by vterrain.org. I have also created videos on this topic, such as here, and articles, such as this one in Directions Magazine.

My own contribution to these quotes is, “We have made much progress, to be sure, but the world’s increasingly complex and serious issues are not going to wait around another generation for us to get our data act together.”

What are the quotes and studies you are using in your own data advocacy efforts?  Please share those in the comments section.

Photograph by Joseph Kerski.

 

Reflections on Why Open Data is not as Simple as it Seems article

December 25, 2017

Sabine de Milliano, in a relevant and thoughtful article in the GIS Professional newsletter entitled "Why Open Data is Not as Simple as it Seems," eloquently raises several issues that have been running through this Spatial Reserves blog for the past five years, and takes concerns that have been voiced at just about every data and GIS conference over that time to a new level. Rather than settling on the statement that "open data is great" and leaving it at that, Ms. de Milliano points out that "open data is much more complicated than simply collaborating on work and sharing results to help humanity move forward." She recognizes the "common good" of collaboration and innovation, and the transparency that results from open data. She states that access to open data is "only possible by solving the sum of technological, economic, political, and communication challenges." Indeed.

In this blog and in our book, we have written extensively about the "fee vs. free" debate over whether government agencies should charge for their data, and Ms. de Milliano sums up the arguments on both sides. But she goes further, saying that challenges to open data range from "ethical to practical" and that there is a "large grey zone on what data should actually be shared and what should remain private." What if someone creates a map based on your open data and someone else makes a fatal decision based on an error in this derivative product? Who is accountable?

For Ms. de Milliano, the biggest challenge of open data is discoverability and accessibility. She mentions open data portals including the Copernicus Open Access Hub, Natural Earth Data, USGS Earth Explorer, and the Esri ArcGIS Hub, and we have written about many others in this blog, such as here and here. Ms. de Milliano holds an impressive set of GIS credentials and makes her points in an understandable and actionable manner. Her article also points out that despite the advent of open data, some datasets (such as SAR data) remain "knowledge intensive," meaning that only a limited number of users have sufficient technical background to process, analyze, and use them; therefore, they remain the domain of experts. I frequently touch on this point when I am teaching GIS workshops and courses, beginning with the thesis: "Despite data and technical advancements in GIS over the past 25 years, GIS is not easy. It requires technical expertise AND domain expertise." Effective use of GIS requires the user to be literate in what I see as the three legs making up "geoliteracy": content knowledge, skills, and the geographic perspective. I do not see skills solely as acquiring more competency in geotechnologies, but as equally encompassing critical thinking, dealing with data, being ethical, being organized, being a good communicator, and more.


Sabine de Milliano’s article about open data touches on many of the themes in this blog and in our book in an eloquent and thought-provoking way.

Categories: Public Domain Data

Best Available Data: “BAD” Data?

August 14, 2017

You may have heard it said that the "Best Available Data" is sometimes "BAD" data. Why? As the acronym implies, BAD data is often used "just because it is right at your fingertips," and is often of lower quality than the data that could be obtained with more time, planning, and effort. We have made the case in our book and in this blog for five years now that data quality actually matters, not just as a theoretical concept, but in day-to-day decision-making. Data quality is particularly important in the field of GIS, where so many decisions are made based on analyzing mapped information.

All of this daily-used information hinges on the quality of the original data. Compounding the issue, the temptation to settle for the easily obtained grows as the web GIS paradigm, with its ease of use and plethora of data sets, makes it ever easier to quickly add data layers and be on your way. To be sure, there are times when the easily obtained is also of acceptable or even high quality. Judging whether it is acceptable depends on the data user and that user's needs and goals: "fitness for use."

One intriguing and important resource for determining the quality of your data is The Bad Data Handbook, published by O'Reilly Media, by Q. Ethan McCallum and 18 contributing authors. They wrote about their experiences, their methods, and their successes and challenges in dealing with datasets that are "bad" in some key ways. The resulting 19 chapters and roughly 250 pages may make you want to put this on your "would love to but don't have time" pile, but I urge you to consider reading it. The book is written in an engaging manner; many parts are even funny, evident in phrases such as "When Databases Attack" and "Is It Just Me, or Does This Data Smell Funny?"

Despite the lively and often humorous approach, there is much practical wisdom here. For example, many of us in the GIS field can relate to being somewhat perfectionistic, so the chapter "Don't Let the Perfect Be the Enemy of the Good" is quite pertinent. In another example, the authors provide a helpful "Four Cs of Data Quality Analysis" (a minimal sketch of how these checks might look in practice follows the list below):
1. Complete: Is everything here that’s supposed to be here?
2. Coherent: Does all of the data “add up?”
3. Correct: Are these, in fact, the right values?
4. aCcountable: Can we trace the data?
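
Here is one way (mine, not the book's) that the Four Cs might translate into quick checks on a tabular dataset with pandas; the file name and the "station_id", "value", and "source" columns are hypothetical placeholders.

```python
# Hedged sketch of the "Four Cs" as quick pandas checks on a hypothetical table.
import pandas as pd

df = pd.read_csv("observations.csv")  # assumed input file

# Complete: is everything here that's supposed to be here?
print("Missing values per column:\n", df.isna().sum())

# Coherent: does the data "add up"? e.g., no duplicate keys.
print("Duplicate station IDs:", df["station_id"].duplicated().sum())

# Correct: are these, in fact, the right values? e.g., plausible ranges.
print("Out-of-range values:", ((df["value"] < 0) | (df["value"] > 100)).sum())

# aCcountable: can we trace the data? e.g., every row carries a source attribution.
print("Rows with no recorded source:", df["source"].isna().sum())
```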

Unix administrator Sandra Henry-Stocker wrote a review of the book here. An online version of the book is available from it-ebooks.info, but in keeping with the themes of this blog, you might wish to consider whether it is fair to the author to read it from that site rather than purchasing the book. I think that purchasing the book would be well worth the investment. Don't let the 2012 publication date, the fact that it is not GIS-focused per se, and the frequent inclusion of code put you off; this really is essential reading, or at least skimming, for all who are in the field of geotechnology.


Bad Data book by Q. Ethan McCallum and others. 

 

Connections between Geospatial Data and Becoming a Data Professional

September 25, 2016

Dr. Dawn Wright, Chief Scientist at Esri, recently shared a presentation she gave on the topic of “A Geospatial Industry Perspective on Becoming a Data Professional.”

How can GIS and Big Data be conceptualized and applied to solve problems?  How can the way we define and train data professionals move the integration of Big Data and GIS simultaneously forward?  How can GIS as a system and GIS as a science be brought together to meet the challenges we face as a global community?   What is the difference between a classic GIS researcher and a modern GIS researcher?   How and why must GIS become part of open science?

These issues and more are examined in the slides and in the thought-provoking text underneath each slide. Geographic Information Science has long welcomed strong collaborations among computer scientists, information scientists, Earth scientists, and others to solve complex scientific questions, and it therefore parallels the emergence, as well as the acceptance, of "data science."

But the researchers and developers in "data science" need to be encouraged and recruited from somewhere, and once they have arrived, they need to blaze a lifelong learning pathway. Therefore, germane to any discussion of emerging fields such as data science is how students are educated, trained, and recruited; here, as data professionals within the geospatial industry. Such discussion needs to include certification, solving problems, critical thinking, and subscribing to codes of ethics.

I submit that the integration of GIS and open science not only will be enriched by immersion in the issues that we bring up in this blog and in our book, but actually depends in large part on researchers and developers who understand such issues and can put them into practice. What issues? Issues of understanding geospatial data and knowing how to apply it to real-world problems, of scale, of data quality, of crowdsourcing, of data standards and portals, and others that we frequently raise here. Nurturing these skills and abilities in geospatial professionals is a key way of helping GIS become part of data science, and of moving GIS from being a "niche" technology or perspective to one that all data scientists use and share.


This presentation by Dr. Dawn Wright touches on the themes of data and this blog from a professional development perspective.

 

2015 and Beyond: Who will control the data?

November 17, 2015

Earlier this year Michael F. Goodchild, Emeritus Professor of Geography at the University of California at Santa Barbara, shared some thoughts about current and future GIS-related developments in an article for ArcWatch. It was interesting to note the importance attached to the issues of privacy and the volume of personal information that is now routinely captured through our browsing habits and online activities.

Prof. Goodchild sees the privacy issue as essentially one of control: what control do we as individuals have over the data that are captured about us and how those data are used? For some, the solution may be to create their own personal data stores and retreat from public forums on the Internet. For others, an increasing appreciation of the value of personal information to governments and corporations may offer a way to reclaim some control over their data. The data could be sold or traded for access to services, a trend we also commented on in a previous post.

Turning next to big data, Prof. Goodchild characterised the associated issues as the three Vs:

  • Volume—Capture, management and analysis of unprecedented volumes of data
  • Variety—Multiple data sources to locate, access, search and retrieve data from
  • Velocity—Real-time or near real-time monitoring and data collection

Together the three Vs bring a new set of challenges for data analysts, and new tools and techniques will be required to process and analyse the data. These tools will be required not only to better illustrate the patterns of current behaviour but also to predict future events more accurately, such as extreme weather, the outbreak and spread of infectious diseases, and socio-economic trends. In a recent post on GIS Lounge, Zachary Romano described one such initiative from Orbital Insight, a 'geospatial big data' company based in California. The company is developing deep learning processes that recognise patterns of human behaviour in satellite imagery, citing the number of cars in a car park as an indicator of retail sales and the presence of shadows as an indicator of construction activity. As the author noted, 'Applications of this analytical tool are theoretically endless'.

Will these new tools use satellite imagery to track changes at the level of individual properties? If so, the issue of control over personal data comes to the fore again; only this time most of us won't know which satellites are watching us, which organisations or governments control those satellites, and who is doing what with our data.