Let’s make a distinction about the data on the Internet. I think that there are three kinds of data on the Internet: created data, derivative data, and duplicated data.
Created data is original, not found elsewhere. This original data appears in the work of statistical agencies and organizations whose mission is to create data: the US Bureau of Labor Statistics, the US Department of Commerce, Statistics Canada, the New York Stock Exchange, NASDAQ, and a long list of other organizations that survey the world (the stock exchanges are simply timely surveys of the selling prices of stocks) and report the data. Created data also includes financial reporting from companies, product catalogues, original news articles, and original writings.
Derivative data is a creative effort that blends original data from other sources into a new interpretation, and it is often of greater value than created data. For over fifteen years, people have turned to Yahoo Finance for financial data about publicly traded companies. Yahoo Finance blends created data from different sources into a single platform, increasing the chance that people find information that answers the questions they are asking. These sites also serve as portals to other places users can go to find more potential information. I say potential because the data may not answer the question.
Derivative data includes mashups, blogs, and other website content. What you are reading this moment is both created data and derivative data. The created data is the words typed. The derivative data consists of the other elements on the page, like the images, that are found elsewhere and used to provide greater context and understanding. I think most of the data found on the Internet is derivative data. LOLcats is derivative data, as is every news-reporting site.
Duplicated data is simply that: duplicated. It lives in the bodies of websites. The reposting on Facebook of photos found elsewhere on the Internet is one form of duplicated data; the reposting of quotations, news stories, and videos are others. Our capacity to duplicate data is astounding. Photographs, images, and videos are perhaps the fastest-growing classes of data on the Web.
As an experiment, open Google and search on the topic of cats. In less than a second, Google reports over 489 million web pages about cats. Wait a few days and run the search again; you might expect the resulting count to remain approximately the same. It does not.
I tested this while writing this post. I ran the first search one day, and then waited two days to do the search again. The second time I ran the search, the resulting count dropped by one million. I searched "cat images," and over 1.3 billion results appeared in the first search, dropping to 614 million two days later. When I searched for videos about cats, 419 million videos appeared on a Friday, and the count dropped to 121 million on Sunday. Only the blog search remained constant at 50.3 million blog pages that have some mention of cats.
Stop and think. The results dropped only two days later. Why? The simplest answer is that the algorithm produced a different result. Perhaps fewer people used the term "cats" on Sunday?
My only point is that the results of huge data sets are unpredictable because of the amount of data and the complexity of the algorithms used to tease sense out of such a huge pile of it. What is happening is no longer a needle-in-a-haystack problem; it is a drop-of-water-in-an-ocean kind of problem.
The exponential growth of systematized electronic data is happening because we are creating data in three measurable ways—Volume, Velocity, and Variety.
Volume builds as more business processes migrate from paper-based systems to electronic systems. More business entities are created every year, and each new business adds to the volume of data, but the migration of paper-based processes to electronic ones is the true driver. Business processes executed by phone or fax five years ago are now electronic. Five years ago, every office received faxes; now fewer than half even have a dedicated machine. Companies reward customers for completing transactions on their websites.
In the logistics world, more information moves over the web than through any other conduit. Crushing competitive pressure on dedicated value-added networks (VANs) like Kleinschmidt comes as more companies turn to AS2 to move their data. Free from the toll-taking VANs, more companies use EDI for more applications.
Velocity is building. Data flows faster as more fiber-optic cable capacity comes online. Speed in the final mile between switching station and user continues to increase as fiber replaces copper. Microwave communication, once thought to have been replaced by fiber, is enjoying a resurgence as companies add backhaul capacity between buildings on the same office campus or across cities.
Seven years ago, it could take three to five days for an event in China, such as the loading of a container, to be confirmed and updated on a computer terminal. Five years ago, the delay dropped to under three days. Now, depending on the shipper and the container freight station, the time between loading and systems update is same-day, usually within hours. Seven years ago, many domestic networks in the US were at next-day; now some supply chains are at same-minute. Technology sped up the flow of information, backed by changes in business practice that embraced the gains of automated ID and RF data communications.
Variety is perhaps the greatest influence on the exponential growth of data, and the wide variety of data available to collect is the biggest challenge to any data system. Ask Google Maps for directions in the US, and the system tells you not only how long it will take in current traffic to reach your destination, but also where the traffic slowdowns will be. Video over IP allows a guard in Pennsylvania to monitor a warehouse gate in California, inspecting the paperwork a delivery driver presents via an image-capture device. The guard can capture the driver’s picture, the license plate on the front of the truck, the trailer plate, and even the USDOT number on the side of the cab, all remotely, with the data stored for future inspection.
Just the standard operations of a company can produce what many label as big data. According to a February 25, 2010 article in The Economist, WalMart processes over a million customer transactions every hour. The article is vague about the specifics of these transactions, and about what part of WalMart’s worldwide business the count covers, but the number is plausible. WalMart’s own investor relations site reports that the company serves 200 million customers per week worldwide. With a 24/7 week of 168 hours, that works out to roughly 1.2 million customer transactions per hour.
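The back-of-the-envelope arithmetic is easy to check. The only input is the 200-million-customers-per-week figure from the investor relations site quoted above; everything else is plain arithmetic:

```python
# Sanity check on the Economist figure, using the investor-relations
# number of 200 million customers served per week.
customers_per_week = 200_000_000
hours_per_week = 24 * 7  # a 24/7 week is 168 hours

transactions_per_hour = customers_per_week / hours_per_week
print(round(transactions_per_hour))  # prints 1190476, roughly 1.2 million
```

Roughly 1.2 million customer transactions per hour, which agrees with the article's "over a million."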
A million-plus customer transactions per hour sounds like a big data job. Certainly it is a lot of volume, but there is more to consider. Each customer transaction triggers multiple systems events, including a credit card tender, a tax escrow credit, and a flash sales update. Within the transaction, the systems update the sales-by-item records for the store. Almost all of these transactions process at the local store level, and the store transfers summary data up to the host system data centers.
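The fan-out from one transaction into multiple systems events can be sketched in a few lines. The event names come from the paragraph above; the function name, the structure, and the tax rate are illustrative assumptions, not WalMart's actual system:

```python
# Illustrative sketch: one point-of-sale transaction fans out into
# several downstream system events. The 6% tax rate is a made-up value.
def events_for_sale(store_id: str, items: list[tuple[str, float]]) -> list[dict]:
    total = sum(price for _, price in items)
    events = [
        {"type": "credit_card_tender", "store": store_id, "amount": total},
        {"type": "tax_escrow_credit", "store": store_id, "amount": round(total * 0.06, 2)},
        {"type": "flash_sales_update", "store": store_id, "amount": total},
    ]
    # Sales-by-item records, updated at the local store level
    events += [{"type": "item_sale", "store": store_id, "sku": sku, "amount": price}
               for sku, price in items]
    return events

sale = events_for_sale("store-0042", [("sku-123", 4.99), ("sku-456", 12.50)])
print(len(sale))  # prints 5: one two-item purchase became five events
```

Even before the summary data moves up to the data centers, the event volume is already several times the transaction volume.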
The operational transactions are not a big data problem. The data consists of well-defined fields with known relationships between the elements. This is the world of operational information and data-processing management, something every major retailer knows how to do well.
Based on my research, a key feature of big data is the problem of multiple sources of data, and the difficulty of relating the data from one source to the data in another. What WalMart does day in and day out, processing sales transactions, is clearly well defined, and the data comes from a few systems (yes, many stores, but the systems in each store are the same, just different instances). So what WalMart does in processing transaction data is not big data by definition.
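To make the multiple-source difficulty concrete, here is a tiny hypothetical sketch. Two sources describe the same company, but with no shared key; relating them takes normalization and guesswork rather than the simple join a single well-defined system allows. The records and the `normalize` helper are both invented for illustration:

```python
# Hypothetical records for the same company from two different sources.
# There is no shared key, so relating them requires normalizing the names.
source_a = {"company": "Wal-Mart Stores, Inc.", "ticker": "WMT"}
source_b = {"name": "WALMART STORES INC", "employees": 2_100_000}

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, drop common legal suffixes; crude but illustrative
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ")
    tokens = [t for t in cleaned.split() if t not in {"inc", "corp", "co", "stores"}]
    return " ".join(tokens)

match = normalize(source_a["company"]) == normalize(source_b["name"])
print(match)  # prints True
```

A single typo or an unanticipated suffix breaks this kind of matching, which is why relating data across sources is hard in a way that in-store transaction processing is not.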
But a part of the data could be.
Consider this example. When we shop at a grocery store, the store wants to scan our customer number. We purchase our goods, and the store tracks the items we buy. If we purchase an item similar to another item the store carries, the store’s system prints a coupon at the register. How many times have you looked at the coupon and thrown it away? Is that a function of big data or predictive analytics? No, it isn’t.
Which leaves us still searching for a clearer answer to the question: what is big data?