Virtually every day another article or posting proclaims the new era of “Big Data.” In my recent blog post on strategic planning, I identified big data as one of four major trends that we will need to consider in our IST strategic planning exercise. The post cited projections that by 2017 over 50,000 petabytes of video data, 15,000 petabytes of web, e-mail, and other data, and 9,000 petabytes of file sharing will be added each month to the global online collection of data. (For reference, a petabyte equals 1024 terabytes. In turn, a terabyte is approximately 17,000 hours of music, 310,000 photo files, or 500 hours of digital video.) Another comparison is provided by the website Gizmodo, which suggests that the entire written works of mankind in all languages from recorded history constitute about 50 petabytes.
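To put those conversions in perspective, here is a quick back-of-the-envelope calculation, a sketch in Python that simply multiplies out the figures quoted above (using the binary convention of 1 PB = 1024 TB and the rough per-terabyte equivalents from the parenthetical):

```python
# Quick sanity check of the storage-unit conversions cited above.
# Assumes binary prefixes (1 PB = 1024 TB), as the post does.
TB_PER_PB = 1024

# Rough per-terabyte equivalents quoted in the post.
MUSIC_HOURS_PER_TB = 17_000
PHOTOS_PER_TB = 310_000
VIDEO_HOURS_PER_TB = 500

monthly_video_pb = 50_000  # projected video data added per month by 2017

# One month of that projected video data, expressed in digital-video hours.
video_hours = monthly_video_pb * TB_PER_PB * VIDEO_HOURS_PER_TB
print(f"{video_hours:,} hours of digital video per month")
# -> 25,600,000,000 hours of digital video per month
```

Roughly 25.6 billion hours of video, each month: that is the scale the rest of this post is concerned with.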
In other words, a nearly inconceivable amount of data is being collected and generated on a daily basis. Consider these sources of big data:
- Scientific data collected by NASA’s Earth Observing System (which uses polar-orbiting and low-inclination satellites for long-term global observations of the land surface, biosphere, solid Earth, atmosphere, and oceans)
- Genomic data related to DNA sequencing, genetic mapping, and bioinformatics
- Data from the Internet of Things (currently more than 100 billion smart sensors connected to the web)
- Data collected from business activities
- Information from video, images, and voice collected by 7 billion humans with cell phones
- And much, much more!
The excitement about “big data” centers on new opportunities to “mine” the rapidly growing data. A report by the McKinsey Global Institute cites opportunities such as:

- Reduction of health care costs by measuring the actual effectiveness of medical interventions (e.g., effects of drugs, treatments, and surgery)
- Reduced fraud and errors in tax collection
- Increased organizational performance via improved real-time or “just in time” supply chain management
- Development of new markets, such as condition-based maintenance of equipment enabled by real-time sensor measurements of equipment condition
- Improved decision making based on better knowledge and monitoring of entire enterprises
Other benefits include improved understanding of our environment and the impact of human activity on the environment, and truly personalized medicine based on new knowledge of genomics and tailored treatments. Global Pulse, a United Nations initiative, conducts sentiment analysis of messages in social networks and text messages for improved understanding of political activities, economic conditions, and understanding of the spread of disease.
A key challenge for humans interacting with big data is to determine what data is worth looking at. In many cases we are most interested in “little data,” the proverbial needle in the haystack. Recent scientific history confirms this. For instance, the Higgs boson, the elementary particle theorized in 1964 as being fundamental to the makeup of the universe, was tentatively confirmed to exist in March 2013 by sifting through several months of data, with several hundred million particle collisions occurring every second, in search of a handful of telltale events.
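The needle-in-a-haystack pattern can be sketched in a few lines of Python. This is purely illustrative; the event rate and the “interesting” flag below are invented, and real detector trigger systems work very differently. The point is the shape of the computation: stream past an enormous number of events and retain only the rare ones worth a closer look.

```python
import random

# Illustrative sketch (NOT the actual LHC trigger): scan a huge stream of
# simulated events and keep only the vanishingly rare "interesting" ones.
random.seed(42)

RARE_PROBABILITY = 1e-5  # hypothetical rate of interesting events


def event_stream(n):
    """Yield n simulated collision events as (id, is_interesting) pairs."""
    for i in range(n):
        yield (i, random.random() < RARE_PROBABILITY)


# Keep only the needles; the haystack is discarded as it streams past,
# so memory use stays tiny no matter how large n grows.
needles = [i for i, interesting in event_stream(1_000_000) if interesting]
print(f"kept {len(needles)} of 1,000,000 events")
```

Because the stream is consumed lazily, the same pattern scales to data sets that could never fit in memory at once.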
In another example of little data making a big difference, penicillin was discovered in 1928 by the Scottish scientist Alexander Fleming, who accidentally observed that the mold Penicillium notatum destroys bacteria. He had noticed that a Petri dish containing a plate culture that he had mistakenly left uncovered was contaminated by a blue-green mold that inhibited the growth of the bacteria that cause food poisoning. This single “data point” eventually led to a major improvement in medicine!
Sometimes even the absence of data can be enlightening. As a former astronomer, I was always intrigued by Olbers’ paradox, named after the German astronomer Heinrich Wilhelm Olbers, who lived in the late eighteenth and early nineteenth centuries. Olbers wondered why the sky is dark at night. This is a seemingly silly question. But if we assume an infinite (or nearly infinite) universe having a near-uniform distribution of stars, then any direction we look in the night sky should ultimately end at the (very bright) surface of a star. Hence, under these assumptions, the night sky should be as bright as the surface of the sun! The absence of light thus carries real information, implying among other things that the universe is not infinitely old and static, and it leads to a number of implications, some of which are still being explored by modern astronomers and astrophysicists.
We need new tools to assist in interacting with large data sets to understand the big, the little, and even the non-existent data. The picture at the top of this blog shows 400,000 flight status data reports from aircraft taking off from and landing at the Los Angeles International airport over the course of four days. Each “point” represents an automatically generated flight report from an airplane indicating its location, time, and other status information. Interestingly, if we observe the data in three dimensions and view it from the top (as shown on the right), we can immediately see what appears to be an avoided area, or “no-fly” zone. Is this due to mountains, weather, or perhaps due to the “non-existence” of Area 51?
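The “view from the top” trick amounts to projecting the three-dimensional points onto the ground plane and looking at their density. Here is a toy sketch of the idea; the coordinates and grid size are invented, not the actual flight data. Bin the point locations into a coarse grid and flag any cell that no point ever visits:

```python
from collections import Counter

# Toy sketch of finding an "avoided area": project points onto a plane,
# bin them into a coarse grid, and report cells no point falls in.
# These (x, y) coordinates are made up for illustration.
points = [(0.1, 0.2), (0.15, 0.9), (0.8, 0.85), (0.4, 0.9)]

GRID = 2  # a 2x2 grid keeps the toy example easy to inspect

# Count how many points land in each grid cell.
counts = Counter((int(x * GRID), int(y * GRID)) for x, y in points)

# Any cell with zero points is a candidate "no-fly" region.
empty = [(i, j) for i in range(GRID) for j in range(GRID)
         if (i, j) not in counts]
print("empty cells:", empty)
# -> empty cells: [(1, 0)]
```

On the real flight data the same density view, rendered visually rather than printed, is what makes the avoided region jump out at a glance.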
To address the onslaught of big data, much ongoing research is being conducted in areas such as machine learning (development of algorithms to automatically process and “learn” patterns in data), meta-data generation (creation of “data about data,” such as semantic labeling of images and signals), data visualization and sonification, text indexing, advanced search engines, and many more. Many IST faculty members are actively engaged in these research areas, producing state-of-the-art results.
As we produce a global tsunami of data, we will need advanced tools and techniques to understand the data ocean, currents, waves, and even water (data) “droplets.”