Using Kaggle big data in a GIS

Kaggle is a platform for predictive modelling and analytics competitions in which statisticians and data miners compete to produce the best models for predicting and describing the datasets uploaded by companies and users. This crowdsourcing approach relies on the fact that there are countless strategies that can be applied to any predictive modelling task and it is impossible to know beforehand which technique or analyst will be most effective [Wikipedia]. More than half a million people belong to the Kaggle community, from nearly every country in the world. Google acquired Kaggle in 2017. You can also learn about R, SQL, machine learning, and other topics on the site. Why mention Kaggle in our geospatial data blog? Kaggle hosts data sets on its site, some of which are spatial in nature, and some of which are truly “big data” (such as 9 million Open Images URLs). As such, it represents a source of information for the GIS analyst, researcher, and instructor.

Because the data posted to Kaggle comes from a global community with diverse interests, expect an unusual array of data sets, from chest x-rays and superheroes to air quality and birdsongs. Some data are from surveys. Many intriguing gems exist; for example, one of the data sets on the Kaggle site of interest to me as a geographer is the world happiness data. It is available as a CSV for three different years. The only unfortunate aspect of these tables is the lack of a country code: relying on the country name alone could present problems when joining the data to a map, because the same country may be spelled differently in the table and in the map layer.
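The join problem described above can be illustrated with a short Python sketch. It is only one possible approach, under stated assumptions: the alias table `NAME_TO_ISO3`, the helper names, the column names `Country` and `Value`, and the sample rows are all hypothetical placeholders, not the actual layout or contents of the Kaggle happiness tables. The idea is to normalize each country name, look up an ISO 3166-1 alpha-3 code, and report any names that fail to match before attempting the join in a GIS.

```python
import csv
import io

# Hypothetical alias table: several spellings map to one ISO 3166-1 alpha-3 code.
# A real project would need a much fuller list than this.
NAME_TO_ISO3 = {
    "denmark": "DNK",
    "united states": "USA",
    "united states of america": "USA",
    "south korea": "KOR",
    "republic of korea": "KOR",
    "russia": "RUS",
    "russian federation": "RUS",
}


def normalize(name):
    """Lowercase the name and collapse stray or repeated whitespace."""
    return " ".join(name.strip().lower().split())


def attach_iso3(rows, name_field="Country"):
    """Split rows into those given an 'iso3' code and the names left unmatched."""
    matched, unmatched = [], []
    for row in rows:
        code = NAME_TO_ISO3.get(normalize(row[name_field]))
        if code is None:
            unmatched.append(row[name_field])
        else:
            matched.append({**row, "iso3": code})
    return matched, unmatched


# Placeholder table (made-up values), standing in for a downloaded CSV.
SAMPLE_CSV = """Country,Value
Denmark,1
 United States ,2
South Korea,3
Atlantis,4
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    matched, unmatched = attach_iso3(rows)
    for row in matched:
        print(row["iso3"], row["Country"].strip())
    print("Needs manual review:", unmatched)
```

Once each row carries a code, the table can be joined to a country boundary layer on that code field rather than on the name, and the unmatched list shows which rows need attention first.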

One can also learn about data sources by spending time on the Kaggle site. For example, I learned about Uber Movement, which contains data from selected cities and points of departure; Sports Reference, from which someone scraped 120 years of Olympic history data; and a cancer imaging archive that someone used to obtain disease type and location. Given the nature of the site, expect all sorts of oddities: my search on mountains of the world resulted in lots of “404 Not Found” errors; some data sets are documented and others not so much; and obtaining some of the data requires the user to be a programmer. Still, Kaggle is a useful and unusual source worthy of attention, and given the rapid evolution in big data and crowdsourcing, which we frequently write about on this blog, I expect that we will be seeing many more sites like this in the future.


A section of the Kaggle listing of data sets, showing the diversity of themes, scales, and sizes. 
