The source systems that create your enterprise data may be unintentionally mutating that data. This is data drift.
Without integrity, nothing works. That is particularly true in the case of data integrity.
According to a new whitepaper by StreamSets, the source systems that create your enterprise data may be unintentionally mutating that data.
The company calls this data drift and defines it as "the unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of the systems that produce the data."
Today, many sources of important data, such as mobile interactions, sensor logs and web clickstreams, are constantly changing as those systems are tweaked, updated or even re-platformed by their owners. These changes to data content, structure, behavior and meaning are similar to genetic mutations: they are unpredictable, unannounced and unending. This data drift can wreak havoc on your business.
Data drift occurs more often than businesses realize and is becoming increasingly common, explains Girish Pancha, CEO and Co-founder of StreamSets. It is a consequence of new big data sources that are growing in importance but are outside the control of the enterprise.
The StreamSets whitepaper offers three real-life examples of data drift in business operations.
A bank adds leading characters to its text-based account numbers to support a larger customer base. Existing downstream systems, unaware of the change, start to conflate the data for multiple accounts (e.g. account 0023456 and account 0123456), leading to bad information in customer service systems.
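The whitepaper summary does not say how the downstream systems conflated the accounts. One plausible mechanism, assumed here purely for illustration, is a consumer that keeps only a fixed number of trailing characters. The following Python sketch uses a hypothetical `LEGACY_WIDTH` of 5 and hypothetical helper names to show how such a consumer would merge the two accounts, and how a simple collision check could surface the problem.

```python
from collections import defaultdict

# Hypothetical: the key width the downstream system was originally built for.
LEGACY_WIDTH = 5


def legacy_key(account_number: str) -> str:
    """Keep only the trailing LEGACY_WIDTH characters, as a fixed-width consumer might."""
    return account_number[-LEGACY_WIDTH:]


def find_collisions(account_numbers):
    """Return legacy keys that are shared by more than one distinct account number."""
    groups = defaultdict(set)
    for number in account_numbers:
        groups[legacy_key(number)].add(number)
    return {key: sorted(nums) for key, nums in groups.items() if len(nums) > 1}


collisions = find_collisions(["0023456", "0123456", "0099999"])
print(collisions)  # {'23456': ['0023456', '0123456']}
```

Running a check like this on incoming keys before they reach customer service systems would flag the two accounts as colliding instead of silently merging them.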
A digital advertising company sees a spike in its revenue. Upon further investigation, it determines that the measurement was a false positive driven by a change in an upstream data field from IPv4 to IPv6 address format, which led to misinterpretation by analytic systems.
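A format change like this can be watched for with very little code. The sketch below is a minimal illustration, not the advertising company's or StreamSets' method: it uses Python's standard `ipaddress` module to count address versions in a batch and compares the mix against a baseline. The function names and the 5% `tolerance` are assumptions chosen for the example.

```python
import ipaddress
from collections import Counter


def ip_version_counts(values):
    """Count how many values parse as IPv4, IPv6, or not as an IP address at all."""
    counts = Counter()
    for value in values:
        try:
            version = ipaddress.ip_address(value).version
            counts[f"ipv{version}"] += 1
        except ValueError:
            counts["invalid"] += 1
    return counts


def format_drifted(baseline, current, tolerance=0.05):
    """True if the share of any address kind moved by more than `tolerance`."""
    base_total = sum(baseline.values()) or 1
    cur_total = sum(current.values()) or 1
    kinds = set(baseline) | set(current)
    return any(
        abs(current[kind] / cur_total - baseline[kind] / base_total) > tolerance
        for kind in kinds
    )


baseline = ip_version_counts(["192.0.2.1", "192.0.2.2"])
current = ip_version_counts(["192.0.2.3", "2001:db8::1"])
print(format_drifted(baseline, current))  # True
```

A shift in the address mix does not by itself mean the revenue figures are wrong, but it is a signal to verify how downstream analytics interpret the field before acting on them.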
Loss of Agility
A price prediction service regularly adds new external data sources to improve its predictive model and expand its market reach. These third parties often fail to conform consistently to data schema guidelines, in which case data engineers must write custom code for each data set. This slows the onboarding of new sources, ties up valuable data personnel and makes the ingest operation unreliable and difficult to diagnose when it fails.
Data drift can greatly harm your business in several ways. "It causes you to make poor decisions," Pancha said, "either because you react to analysis that has been polluted by data that has drifted or, in the more insidious false negative case, because you think everything is fine, when in fact, it isn't."
And one bad data-based decision is all it takes to introduce doubt into any big data initiative. In the aforementioned example of the digital advertising company, the false positive they experienced might have led them to increase their marketing spend inappropriately to capitalize on what they perceived as improved performance.
Although the company caught the problem before reaching that point, it took several months to do so, and it was less confident in its data going forward.
To address data drift, you need data flow management tools that are drift-aware. A drift-aware tool can, among other things:
- Coerce data to the correct data structure when a change is detected.
- Detect patterns in data values that might indicate a semantic change that needs to be addressed, ideally by being transformed before it reaches the data store so that consuming applications continue to operate unaffected.
- Sample, trap and alert on anomalous data rather than blindly passing it through to consuming systems.
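The capabilities listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how StreamSets or any particular product works: `EXPECTED_SCHEMA`, `coerce_record` and `process` are hypothetical names, and the schema is invented for the example. Records are coerced to the expected types where possible; anything that cannot be coerced, or that has missing or unexpected fields, is trapped and logged instead of being passed through.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("drift")

# Hypothetical expected schema: field name -> target type.
EXPECTED_SCHEMA = {"account_id": str, "amount": float, "clicks": int}


def coerce_record(record):
    """Coerce each expected field to its target type; return (clean, problems)."""
    clean, problems = {}, []
    for field, target in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field {field!r}")
            continue
        try:
            clean[field] = target(record[field])
        except (TypeError, ValueError):
            problems.append(
                f"cannot coerce {field!r}={record[field]!r} to {target.__name__}"
            )
    unexpected = sorted(set(record) - set(EXPECTED_SCHEMA))
    if unexpected:
        problems.append(f"unexpected fields {unexpected}")
    return clean, problems


def process(records):
    """Pass clean records through; trap anomalous ones and log an alert."""
    passed, trapped = [], []
    for record in records:
        clean, problems = coerce_record(record)
        if problems:
            trapped.append((record, problems))
            log.warning("drift suspected: %s", "; ".join(problems))
        else:
            passed.append(clean)
    return passed, trapped


passed, trapped = process([
    {"account_id": 123, "amount": "9.50", "clicks": "4"},
    {"account_id": "A1", "amount": "n/a", "clicks": 2},
    {"account_id": "B2", "amount": 1, "clicks": 3, "region": "EU"},
])
```

Here the first record is coerced and passed on, while the second (an unparseable amount) and the third (an unexpected `region` field) are trapped for review. Whether an unexpected field should block a record or merely raise an alert is a policy decision that depends on the consuming systems.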
"Most traditional data integration tools were built before big data and thus did not consider the data drift problem, and the popular big data movement tools operate at too low a level to detect and act on data patterns," Pancha points out.
"But for the growing number of businesses that need to analyze streaming data, such as for IoT, cybersecurity or customer 360 applications, data drift is a reality that must be addressed."