Spatial on Saturdays No. 9
80% of all data is spatial? (Is this really true?)
By the numbers
80%
of all data has a spatial component (I am sure you have heard it)
78%
of articles have a georeferenced component (“How much information is geospatially referenced? Networks and cognition.” 2012)
9.9%
of data on the New York City Open Data Portal can be mapped (342/3452 datasets)
What’s in the news
Cloud providers made a splash this past week with two interesting programs. In the first, AWS opened a 14-week space tech accelerator in India that, together with other partners, will help grow space technology development in the region. In the second, Google and the Environmental Defense Fund are teaming up to map methane leaks across the globe and provide that data for analysis in Google Earth Engine.
I also came across a great blog series from The World Bank discussing “Spatial Insights into the Gender Employment Gap”. This particular article discusses the different geographic factors that impact wage gaps around the globe and how those can be paired with other indicators to create a holistic and up-to-date picture of employment gaps in different locations.
To round out a few other items, it appears that MariaDB has completely offloaded its geospatial business capabilities to CubeWerx. While there still appear to be geo capabilities in the core open source product, the website listing for MariaDB’s geospatial capabilities now redirects to the enterprise home page.
And in acquisition news, MyTraffic, a French provider of traffic analytics, has acquired GeoBlink, a retail-focused analytics solution from Spain. This appears to follow a trend of companies with focused data products, whether point-of-interest data or mobility datasets, coming together to offer a more unified solution, most notably the merger between Unacast and Gravy Analytics.
Is 80% of all data spatial?
This is the question we want to answer today. If you've worked in GIS or geospatial over the years, I am sure you have seen or even used this quote yourself to showcase and justify just how much data in our world has a spatial component and, ultimately, the importance of spatial analysis in our world today.
And I have to admit that I'm guilty of this myself. In many of my past presentations and calls with colleagues and others outside of my own company, I have cited this statistic without fully understanding where exactly it came from or, in fact, whether it is even true.
I'm also fairly certain that many of you reading this newsletter understand the importance of geospatial analytics and geospatial data in our world and have also been in the position of justifying why you should be doing that work in the first place. Yes, we all understand that location is important, but there are a lot of other things that are just as important as well. And in any given data project, how important location is can vary depending on the needs and outcomes you may have. Yet this single phrase alone has given a lot of people a lot of work over the years, as seen in this quote from this article.
The end of Hahmann, Burghardt, and Weber’s article indicates the real value of the phrase, citing a quote on Twitter by John Fagan, Head of Software Engineering at Axon Active AG and formerly of Bing Maps and Multimap: “that geo quote keeps us all in our jobs. Best not go poking around to see if it’s true.”
But today that is exactly what we are going to do. We are going to look around and see if this is true and to what degree it matters in the work that we're doing. Personally, I have a few hypotheses about this quote. First, it is somewhat true: there is likely a lot of data that has a location component to it, but exactly how much is to be determined. The second is that the value that location component provides actually varies depending on the accuracy of the location.
The third point is that this quote often underpins the issue in geospatial of having to defend or advocate for geospatial in your organization. This is something I believe a lot of people do and something that we as a geospatial community should work towards getting past. Of course, we know it's important, but why do we have to keep advocating for it?
For example, take an accurate latitude-longitude location and an IP address. If we know the level of accuracy of that latitude and longitude point, we can be very confident in the accuracy of its location. An IP address, on the other hand, can represent anything from a very accurate location, such as a Wi-Fi access point, to something as coarse as the centroid of a country if that is the only location that IP address resolves to.
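That hierarchy is easier to see with rough numbers attached. Below is a minimal illustrative sketch; the source names and accuracy radii are assumptions made for the example, not measured values.

```python
# Illustrative only: ballpark horizontal-accuracy radii (in meters) for
# common location sources. These numbers are assumptions for the example.
APPROX_ACCURACY_M = {
    "gnss_fix": 5,                   # a good GPS/GNSS point
    "wifi_access_point": 50,         # IP matched to a known access point
    "ip_city_centroid": 25_000,      # IP resolved only to a city
    "ip_country_centroid": 500_000,  # IP resolved only to a country
}

def rank_by_accuracy(sources):
    """Order location sources from most precise to least precise."""
    return sorted(sources, key=APPROX_ACCURACY_M.get)

print(rank_by_accuracy(["ip_country_centroid", "gnss_fix", "wifi_access_point"]))
# → ['gnss_fix', 'wifi_access_point', 'ip_country_centroid']
```

The sketch makes the same point in code: a coordinate pair and an IP address can both be "location data," yet the value they carry is very different.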
Now, it wasn't necessarily easy to find detailed data on where the stat came from, and the only two sources I was able to find were an article from geographyrealm.com and a lengthy exchange on GIS Stack Exchange. While these aren't two sources I would necessarily underpin an academic paper with, they certainly provide some interesting answers on where this could have originated.
Taking a look at the first article, we can see that the phrase has seemingly been passed down through different papers and sources over time to ultimately end up where it is today. The following points are all pulled directly from the article.
Some pointed to Franklin, Carl and Paula Hane, “An introduction to GIS: linking maps to databases,” as the originating source (examples: gis.stackexchange.com and Spatial Sustain)
That then drilled down into referencing a 1990 report from the Ohio Geographically Referenced Information Program (OGRIP)
The earliest date is an article written in 1987. In “Analytic Mapping and Geographic Databases”, Issue 87, published in 1992 and edited by Robert S. Biggs and G. David Garson, the authors make the statement: “Computer mapping is particularly important in government, and hence is salient to social scientists who study government policies. It is estimated that 80% of the informational needs of local government policymakers are related to geographic location.”
The article by Williams lists no sources or any indication of where the number comes from. However, a little digging into GEOMAX reveals that the program was developed in 1985 by two academics at the University of Florida. John Alexander, a professor of urban and regional planning, and Paul Zwick, a research scientist, were behind the effort to digitize maps for Alachua County, so perhaps the knowledge of where the phrase originates lies with them?
As you can see, it certainly looks like this quote and data point were passed down from generation to generation, from paper to paper, and it is still unclear where it may have originated apart from this 1985 effort. But that still doesn't give us an answer as to whether the number is in any way accurate or where it came from originally. So to that end, let's check out the GIS Stack Exchange thread to see if it has any other information for us. Once again, all points below are pulled directly from the thread, apart from bullet point number two.
The reference is: Franklin, Carl and Paula Hane, “An introduction to GIS: linking maps to databases,” Database. 15 (2) April, 1992, 17-22.
The OGRIP paper is cited again here as well.
William Huxhold’s 1991 book ‘An Introduction to Urban Geographic Information Systems’, pages 22-23: ‘A 1986 brochure (Municipality of Burnaby) published by the Municipality of Burnaby, British Columbia, reported the results of a needs analysis for an urban geographic information system (GIS) in that municipality: eighty to ninety percent of all the information collected and used was related to geography.’
I just spoke with Jack Dangermond in person and he confirmed it definitely (for all data - not just local government). It is sufficient that he backs this concept?
You couldn't be more correct. Perhaps the original quote will be found in the German study. The original number circa 1970's was more like 60% of all data (at least that is what was taught to us in grad school). The new German study confirmed it to be a minimum of 78%. if attributing the quote is absolutely needed to an originator, its the original study. If you want the most authoritative, it's Jack.
While citing one person's response as a credible source doesn't quite hit the burden of proof in my book, it does provide some validation that this has been a long-standing concept. Now, the German study listed here and above is the source of the 78% number, which represents the share of academic articles with a georeferenced component. While this could be anything with a location that is georeferenceable, it does not represent what we would generally think of as location data, that being data in tabular or raster formats.
I in no way doubt the accuracy of this German study, but I think it still doesn't meet the proof point of how the quote is used today: to argue for the necessity of geospatial analysis given the amount of data being produced with some sort of location component. Buried in the GIS Stack Exchange thread is this quote, which I found quite interesting and fitting for all the information I was able to gather on this topic.
That reminds me of this famous quote: “The trouble with quotes on the internet is that it’s difficult to discern whether or not they are genuine.” ― Abraham Lincoln
In short, it looks like there is no single source where this quote originated, and while the German research suggests there is some level of accuracy to it, this appears to be one of those things that has simply been passed down over time and entered the lexicon of GIS analysts and practitioners around the globe.
For me, the question becomes: is this something we should continue to use when we discuss spatial data? Or should we look for a more accurate representation of the data, or something completely different? In a very quick and unscientific study of the New York City Open Data website, I found that only about 10% of the data on the site was mappable, although there are likely other datasets with some sort of location component.
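For anyone curious how such a back-of-the-envelope tally might be scripted, here is a minimal sketch. The spatial datatype names and the catalog representation are assumptions for illustration; a real check would read column metadata from the portal's catalog API.

```python
# A sketch of a "can this dataset be mapped?" tally over a data catalog.
# Each dataset is represented here by the list of its column datatypes;
# the spatial type names are illustrative, not a specific portal's schema.

SPATIAL_TYPES = {"point", "line", "polygon", "multipolygon", "location"}

def is_mappable(column_datatypes):
    """A dataset counts as mappable if any column holds a spatial type."""
    return any(dt.lower() in SPATIAL_TYPES for dt in column_datatypes)

def mappable_share(datasets):
    """Fraction of datasets with at least one spatial column."""
    return sum(1 for cols in datasets if is_mappable(cols)) / len(datasets)

# The 342/3452 count from the "By the numbers" section works out to:
print(round(342 / 3452 * 100, 1))  # → 9.9
```

The interesting part is the classification rule itself: counting only explicitly spatial column types misses datasets with, say, an address or borough column, which is exactly why the 10% figure is a floor rather than a true share.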
My take on this is to drop this from our vocabulary and find better ways to justify that spatial analytics has a place in this world. There are a ton of ways that we do that today: proving the value of the outputs, time/money/effort saved, impact in the real world, etc. But simply saying that there is a ton of location data in the world doesn’t prove the point that location and geospatial are important.
On top of that, there is a clear hierarchy of value from different types of geospatial data as discussed earlier. And that is the topic I want to cover in the next newsletter, so for now, we will leave it here.
Hi Matt! I am the founder of the data search project Dateno (https://dateno.io), and we have a lot of statistics from the search index and from indexing data sources.
I would like to say that in terms of the number of geospatial datasets, it was never 80%. Maybe something between 30-60%. But in terms of data volume, if we consider climate and geoscientific data as geospatial data, then there was probably a time when geospatial data was king. Not sure about 80%, but a lot.
But since 2020 the amount of genomic data has been increasing; for example, the Chinese Academy genomic database is about 53PB and growing very fast, just like other genomic databases. So in terms of volume, genomic data and the particle physics data from CERN (369PB in 2022) could be much bigger.