Evolution of data and data analysis: A geographer’s perspective


By Linda Kaidan

How has our technology evolved from basic data concepts to big-data concepts in just 60 years?

Examining the history of data evolution and Geographic Information Systems is an excellent way to understand the progress of data storage, management and analysis as a whole. In the beginning, data was stored as written records on paper. Throughout history, maps have been significant in navigation, as they continue to be today. In the 1950s, slide rules and mechanical calculators were commonly used for calculations. Many mapping applications were completed by teams of mathematicians performing spatial and geodetic analysis that today is accomplished in microseconds.

In the 1950s, maps were simple. They had their place in vehicle routing, new development planning and locating points of interest. Early mapping did not have the advantage of computers. However, satellite and aircraft systems were early uses of computer technology. In the 1950s, a self-contained dead reckoning system was developed by government contractors. The first space programmers developed a navigation system that could function without external references. US government contractors engineered commercial inertial navigation systems in the 1970s. Using self-contained onboard systems, they were capable of computing present position, distance to waypoints, direction and heading without external navigation input (Wyatt).
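The kind of calculation such a system performs can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a flat plane rather than a curved Earth, a constant heading and speed, and hypothetical function names (`dead_reckon`, `distance_to`) that do not come from any real navigation system.

```python
import math


def dead_reckon(x, y, heading_deg, speed, hours):
    """Advance a position on a flat plane (an assumption for illustration).

    heading_deg is measured clockwise from north; speed is in distance
    units per hour. Returns the new (x, y) position.
    """
    theta = math.radians(heading_deg)
    distance = speed * hours
    # East component uses sine, north component uses cosine,
    # because the heading is measured from north rather than from east.
    return x + distance * math.sin(theta), y + distance * math.cos(theta)


def distance_to(x, y, waypoint_x, waypoint_y):
    """Straight-line distance from the present position to a waypoint."""
    return math.hypot(waypoint_x - x, waypoint_y - y)


# Travel due east (heading 90 degrees) at 10 units per hour for 2 hours.
x, y = dead_reckon(0.0, 0.0, 90.0, 10.0, 2.0)
remaining = distance_to(x, y, 20.0, 15.0)
```

No outside signal is consulted at any point: the present position is derived only from the starting point, heading, speed and elapsed time, which is the essence of dead reckoning.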

Digital Equipment Corporation built the first user-friendly computer, the PDP-1, in 1959. It introduced the PDP-11/70 in 1975. The PDP-11/70 was used to integrate satellite-derived data and imagery from remote sensing. It supported stereographic image analysis, aircraft sensor data and data analysis. In 1972, the Defense Mapping Agency began providing mapping, charting and geodetic information in paper and electronic formats to the US Defense Department (US Government Publishing Office).

ARC/INFO offered a GIS with a graphical user interface for desktop computers. ARC refers to the line segments of map elements; INFO refers to data stored in an information system. The first microcomputer-based geographic information system was released in 1982. ArcView provided a PC-based environment supporting a basic geospatial model. Georeferencing tied digital map elements to real-world geographic coordinate systems.

Data representation featured vector models of points, lines and polygons mapped to georeferenced X, Y coordinates. Topological models were used to relate map elements to each other. Spaghetti models connected independent map elements, layering them atop one another. Raster models stored data that varies continuously, such as aerial or satellite-acquired surface imagery.
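A minimal Python sketch can make the difference between vector and raster representations concrete. The class names (`Point`, `Line`, `Polygon`), the sample coordinates and the grid values are all hypothetical and chosen only for illustration; real GIS packages use far richer structures.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, float]  # a georeferenced (x, y) pair


@dataclass
class Point:
    xy: Coordinate


@dataclass
class Line:
    vertices: List[Coordinate]


@dataclass
class Polygon:
    ring: List[Coordinate]  # closed implicitly: the last vertex joins the first

    def area(self) -> float:
        """Planar area computed with the shoelace formula."""
        total = 0.0
        n = len(self.ring)
        for i in range(n):
            x1, y1 = self.ring[i]
            x2, y2 = self.ring[(i + 1) % n]
            total += x1 * y2 - x2 * y1
        return abs(total) / 2.0


# Vector data: discrete features described by their coordinates.
well = Point(xy=(2.0, 1.0))
pipe = Line(vertices=[(0.0, 0.0), (2.0, 1.0), (4.0, 3.0)])
parcel = Polygon(ring=[(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)])

# Raster data: a grid of cells, each holding a continuously varying value
# (for example, elevation or image brightness).
raster = [
    [10.0, 12.5, 13.0],
    [11.0, 14.0, 15.5],
]


def raster_value(grid, row, col):
    """Look up the value stored in one raster cell."""
    return grid[row][col]
```

Vector features carry their own geometry, so a question like "how large is this parcel?" is answered from the coordinates, while a raster answers "what is the value at this cell?" by position in the grid.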

As early as the 1980s, ARC GIS systems were put to use mapping municipal data in layers of substructures. Municipalities maintain information on public utilities such as water, electric and sewer operations, and on the detailed locations of buildings and other structures. Whenever excavations are made, the Dig Safe program retrieves data from municipal GIS systems to ensure that electrical lines, gas lines and other buried infrastructure are not damaged in the process.

ArcGIS stores data in proprietary files. It has also used ORACLE’s relational database system. Unlike the products of many technology companies that have fallen by the wayside, ArcGIS is available today as cloud-based software as a service (ARCGIS.COM). Today’s ArcGIS supports the utility industry by making GIS design affordable to cooperatives and municipalities with limited budgets, using data model templates for project implementation (EIS.COM).

Open-source Geographic Information Systems data is available to all. This includes Landsat satellite imagery and TIGER data. ArcGIS Online stores massive amounts of spatial data. GIS software is also being developed through open-source collaborative efforts and is available to the public free of charge (ESRI.COM).

ORACLE is the second-largest software company in the world. It began by offering the first database software employing highly structured relational modeling for data storage, retrieval and analysis. Though nearly 40 years have gone by since the company’s inception in 1977, ORACLE’s RDBMS is the most widely used RDBMS in the world.

Well-structured data based on relational modeling is essential for organizations such as banks and other financial institutions, consumer-based organizations and medical research companies. It will likely continue to be valuable for a very long time to come. However, with our ever-increasing capacity to create, access and store data, what is popularly called Big Data is an ever-growing and increasingly important source of information.

Big Data can be well structured, like data stored in a relational database, or unstructured, like that found in a periodical, research paper or essay. Data can also be stored in user-defined, self-describing structures such as those supported by JSON or XML.

Where does Big Data come from? It may come from sensors in manufacturing facilities, elevator banks in the World Trade Center or automated, self-contained agriculture facilities in isolated areas. It may even come from biometric data collected by sensors on our bodies. Mining and manufacturing operations monitor production and distribution processes with sensors. That data can provide real-time quality control, predict when maintenance is necessary and flag malfunctions as they occur. The result is less downtime and greater efficiency. When used properly, Big Data can answer important questions relating to safety, quality, efficiency and market trends.

Big Data can be analyzed in real time as it streams or after collection. Data Analytics is a relatively new science offering techniques to glean the most from ever increasing volumes of data. ORACLE defines Big Data by referencing:

  • Volume – the amount of data processed, which can reach into the petabytes (1 petabyte is 1,000 terabytes)
  • Velocity – the rate at which data is received and acted upon
  • Value – an ongoing discovery process fueled by business executives and analysts.  Determining value is a function of asking the right questions, identifying patterns, making perceptive assumptions, and predicting behavior.

Data Representation: Structured Versus Unstructured

Files, relational models, self-defined data and unstructured data all contribute to how data can be represented and stored. Commercial and scientific software often uses proprietary, product-specific files to store data. Photoshop uses PSD files, MS Word uses DOCX files, and standard image formats include JPG and PNG.

Relational data is modeled and stored in relational databases like MySQL, SQL Server and ORACLE. It is based on the relational model, a mathematical approach to mapping data sets developed by Edgar F. Codd in the 1970s. Such databases are sets of normalized data organized into tables, records and columns. There are well-defined relationships between database tables. Relational structures support communication, sharing, searchability, organization and reporting. Structured Query Language (SQL) offers easy access to data from programs and scripts (Techopedia).
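The idea of tables with well-defined relationships, queried from a script, can be shown with a short Python sketch. It uses SQLite through Python's built-in `sqlite3` module as a stand-in for the larger systems named above, and the `customer` and `account` tables and their contents are invented for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Two tables; account.customer_id relates each account to one customer.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE account (
           id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customer(id),
           balance REAL NOT NULL
       )"""
)

conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO account (customer_id, balance) VALUES (?, ?)",
    [(1, 100.0), (1, 250.0), (2, 75.0)],
)

# A join follows the relationship between the tables, and the
# aggregate reports one total balance per customer.
rows = conn.execute(
    """SELECT c.name, SUM(a.balance)
       FROM customer AS c
       JOIN account AS a ON a.customer_id = c.id
       GROUP BY c.id, c.name
       ORDER BY c.name"""
).fetchall()

conn.close()
```

Because each customer is stored once and referenced by key, the data stays normalized, and the same query pattern would apply, with minor dialect differences, to other relational databases.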

Unstructured data may be impossible to fit into an RDBMS. Such data often originates in mobile, social and cloud computing data feeds. By some estimates, 90 percent of the data that exists now was created in the last two years, and most of this information is unstructured. NoSQL databases and frameworks such as Hadoop offer the opportunity to capture, store and analyze vast resources of unstructured data as it is generated and to use it for business intelligence.

XML is a data format that carries both data and a description of that data’s structure. That means it’s self-defining. XML can be used by any individuals or organizations that want to share data in a consistent way and in a format that they define and choose. Clusterpoint Server is database software for high-speed storage and large-scale processing of XML and JSON data on commodity hardware clusters. It works as a schema-free, document-oriented DBMS platform with an open-source API. Clusterpoint supports immediate search and access to billions of documents and fast analytics operating on structured and unstructured data.
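The self-defining quality of XML can be seen in a small Python sketch using the standard library's `xml.etree.ElementTree` module. The element names (`stations`, `station`, `temperature`) and the values are hypothetical, chosen by the document's author just as any organization might define its own format.

```python
import xml.etree.ElementTree as ET

# The tags and attributes describe the data they enclose, so the
# structure travels together with the values.
document = """
<stations>
  <station id="S1">
    <name>Harbor</name>
    <temperature unit="C">18.5</temperature>
  </station>
  <station id="S2">
    <name>Airport</name>
    <temperature unit="C">21.0</temperature>
  </station>
</stations>
"""

root = ET.fromstring(document)

# Walk the structure by name rather than by position.
readings = {
    station.get("id"): float(station.find("temperature").text)
    for station in root.findall("station")
}
names = [station.find("name").text for station in root.findall("station")]
```

A program that has never seen this format before can still discover what each value means from the tag and attribute names alone.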

JSON stands for JavaScript Object Notation. It is a data-interchange format that machines can parse easily and humans can read and understand. It is based on a subset of the JavaScript programming language, Standard ECMA-262.
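A brief Python sketch, using the standard library's `json` module, shows how the same text serves both readers. The record itself (a sensor named `pump-7` with three readings) is made up for illustration.

```python
import json

# JSON text: readable by a person, parseable by a program.
text = '{"sensor": "pump-7", "readings": [1.5, 2.0, 2.5], "online": true}'

# Parsing turns the text into ordinary Python dictionaries, lists,
# strings, numbers and booleans.
record = json.loads(text)
average = sum(record["readings"]) / len(record["readings"])

# Serializing turns the structure back into text for exchange.
round_trip = json.dumps(record, sort_keys=True)
```

Since nearly every modern language ships with a JSON parser, the same text can be exchanged between systems written in different languages.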


Our ongoing data evolution began in the 1950s. We maximized the use of data to do amazing things like sending satellites to space and using those satellites to acquire and transmit geospatial and meteorological information. As time has gone by, each generation of technologists has built upon the work of the last. We’ve created library upon library of mathematical functions, utilities and models that allow us to build systems that are more specialized, inclusive and integrated. We’ve created an amazing toolbox of wonders that will continue to grow as long as we do. We have not only expanded our data and data analysis resources; we also have new open-source opportunities that allow us to share our data technology and further accelerate our learning process.

In such a short period of time we’ve become capable of monitoring and analyzing the oceans, climate, forest fires, epidemics, seismic events and financial catastrophes. We’ve just recently heard about new quantum computers, which are not restricted to binary values but can hold a combination of states at once. This multi-state structure supports more adaptable algorithms and may enable us to solve some problems millions of times faster, leading to solutions to previously insoluble problems. We’ve done all of this in just 60 years.

One challenge for the future will be to manage this ever-increasing information torrent and to find ways to locate and make use of everything we seek to discover. At the same time, Big Data increases threats to both personal and corporate privacy. With every click of our mouse we give up information to parties unknown.

Linda Kaidan was a Principal Software Developer at Oracle Corporation. She has also developed public utilities software at IBM and was a senior software design engineer at the Jet Propulsion Laboratory. Kaidan has a BA in Geography from the Hebrew University of Jerusalem and an MS in Computer Science from American University. Early in her career, she worked as a cartographer at the National Geospatial-Intelligence Agency. Kaidan is the author of Surviving Climate Change: Decide to Live.



