We hear a lot about big data these days; data lakes, data reservoirs, data mining—sounds like a geography course, not technology. Over the past few years big data technology has jumped to the forefront of the innovation discussion.
What wasn’t big about our data prior to a few years ago? Oh, it was big; we simply didn’t have commercially available tools to make much sense of it.
Major search engines like Google and Yahoo had their own secret sauce for processing massive amounts of data, but the rest of us had to follow traditional approaches.
One of the first big data implementations, named Hadoop, was developed by Doug Cutting and Mike Cafarella for an open source search engine project in 2005. Search engines have the unique ability to interrogate lots of data in a very short period of time. Even more importantly, data generated on the internet (and in our businesses) has great diversity in type and structure, something that traditional technologies can't accommodate.
Big data aims to eliminate the complicated problem of having to make information look the same before it can be analyzed. For instance, say you want to calculate the median home price for each of the Mid-Atlantic states. That's an easy job if every state reported the same raw information in the same way, but that usually isn't the case. With traditional technology, painful steps must be taken to get that data into a consistent form before you can create your report.
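To make that concrete, here is a minimal sketch of the "schema-on-read" idea, with hypothetical report formats invented for illustration: each state reports home sales in a different shape, and instead of reshaping everything up front, we normalize at query time.

```python
import statistics

# Hypothetical example: three states report home sales in different shapes.
# Schema-on-read means we interpret each shape at query time rather than
# forcing every source into one rigid schema before loading it.
raw_reports = {
    "MD": [{"sale_price": 310000}, {"sale_price": 285000}],  # list of dicts
    "VA": [("2023-01-04", 402000), ("2023-01-09", 376000)],  # (date, price) tuples
    "PA": ["price=240000", "price=265000"],                   # raw strings
}

def extract_prices(records):
    """Pull a price out of whatever shape this state happens to use."""
    for record in records:
        if isinstance(record, dict):
            yield record["sale_price"]
        elif isinstance(record, tuple):
            yield record[1]
        else:  # e.g. "price=240000"
            yield int(record.split("=")[1])

medians = {state: statistics.median(extract_prices(recs))
           for state, recs in raw_reports.items()}
print(medians)  # {'MD': 297500.0, 'VA': 389000.0, 'PA': 252500.0}
```

The point is not the parsing code itself but where the work happens: the normalization lives in the query, so adding a fourth state with yet another format doesn't require migrating anything already stored.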
Big data has a lot more to offer, so here are a few more “bits” about the benefits of this awesome technology and an introduction to some of the players.
Relational Databases (Not About Genealogy)
The traditional technologies mentioned above are relational databases. This is how we have approached data storage for the past 25-30 years, and it has worked pretty well. The technology, by design, ensures consistency and accuracy of information. But there are a few problems: a relational database is not easily used for purposes other than the one it was designed for.
Rules must be followed to maintain its integrity, which makes it hard to reuse, and when you have lots and lots of data, enforcing those rules can make access very slow. Big data solves these problems by letting all different forms of information reside together, happily. Raw data collects here, ready for consumption or further processing; hence the data lake and data reservoir metaphors.
My company uses a relational database when we are storing things like account balances, payments or medical records that need to be reliably updated and maintained. We use big data to analyze large amounts of data for uses such as discovering trends, marketing to our customers or detecting fraud.
Big Data, the Movie (However, No Sequel)
SQL (Structured Query Language) is the programming language used to access the rigid data of relational databases. Big data generally comes with tools that can access and analyze information in its free form, an approach aptly named NoSQL. These tools are generally easy to use, whereas SQL is rigid and requires a bit of engineering expertise to develop correctly.
A badly developed SQL query can even take a database down, while free-form query and analysis with the NoSQL approach makes problems like this largely disappear.
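The contrast can be sketched in a few lines. This is not any specific NoSQL product's API, just an illustration with made-up data: a SQL query assumes every row fits one declared schema, while a document-style query simply tolerates records that don't have the field at all.

```python
# A rigid SQL query: it presumes an 'accounts' table exists with a
# 'balance' column, and fails outright if the schema doesn't match.
sql = """
SELECT customer_id, balance
FROM accounts
WHERE balance > 1000;
"""

# Free-form documents: the third record has no balance field at all,
# which a relational table would have rejected at load time.
documents = [
    {"customer_id": 1, "balance": 2500},
    {"customer_id": 2, "balance": 800},
    {"customer_id": 3, "last_login": "2023-05-01"},
]

# Schemaless query: skip missing fields instead of rejecting the data.
high_balances = [d for d in documents if d.get("balance", 0) > 1000]
print(high_balances)  # [{'customer_id': 1, 'balance': 2500}]
```

The trade-off runs both ways, of course: the relational schema that makes the SQL query brittle is also what guarantees every account actually has a balance.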
The ACID Test
Because big data tries not to worry about consistency and integrity, it fails the ACID test (atomicity, consistency, isolation, durability) that is expected of relational databases. That's OK, because we are not worried about reliable transactions: we are using this data for purposes such as marketing or search engine results, not wiring Aunt Betty a thousand dollars.
What is important is that this allows our big data repositories to achieve massive scale and availability. We can write copies of the same data to many different nodes in the database without worrying whether they all stay perfectly aligned. If storage fails, copies of the affected data very likely exist on other nodes, so the system keeps running and the information is still accessible.
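Here is a toy sketch of that replication idea, with node names and a replication factor invented for illustration: each write is copied to several nodes with no coordination, and a read succeeds as long as any replica survives a failure.

```python
import random

# Toy model: four storage nodes, each write copied to three of them.
NODES = {f"node{i}": {} for i in range(4)}
REPLICATION_FACTOR = 3

def write(key, value):
    # Copy the same record to multiple nodes; no locking, no coordination.
    for node in random.sample(sorted(NODES), REPLICATION_FACTOR):
        NODES[node][key] = value

def read(key):
    # Return the value from the first node that still holds a copy.
    for store in NODES.values():
        if key in store:
            return store[key]
    return None

write("user:42", {"clicks": 17})
NODES["node0"].clear()          # simulate a storage failure on one node
print(read("user:42"))          # still readable from a surviving replica
```

With three copies spread across four nodes, losing any single node still leaves at least two replicas, which is exactly the availability-over-consistency bargain described above.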
Forcing consistency and integrity slows things down, and we don’t want that, so if we need those qualities we defer to the relational approach.
Finally, we want our results fast. With errant queries, lots of data rules to follow and a need for information of the highest integrity, a database spends much of its time dealing with those concerns instead of serving up data.
Big data solutions, free of these nuisances, can be incredibly fast at sifting through mounds of data, where a traditional database would be slow, or incapable of storing that much data in the first place.
Some Big Data Players
| Player | Description |
| --- | --- |
| Hadoop | Open source big data engine |
| Cloudera | Hadoop-based service provider |
| Cassandra | Open source, highly available big data engine |
| Aerospike | Proprietary, highly available NoSQL database for big data |
The massive amounts of data we generate from internet activity, medical records, financial transactions, weather data, housing data, documents and research can be used to learn some very important things about human behavior and the human condition. It is now possible to analyze many sources of data, regardless of form, to detect things like global warming, a failing economy or a market bubble.
Big data isn't for everything. Since it is unstructured, can be inconsistent and doesn't guarantee integrity, we probably wouldn't use it to store checking account balances in a nightly bank posting. But think about massive amounts of medical data (stripped of identifying information, of course) and how it could be used to track a trending epidemic, locate sources of food poisoning, or relate individuals' health habits to disease.
Only big data can do this effectively with commercially available tools. Big data has some big promise and is probably one of the big game changers of this decade.