Big data brings with it new security concerns.
Although our data capacity is growing exponentially, we have imperfect solutions for the many security issues that affect even local, self-contained data.
Offensive hacking techniques routinely outpace defensive technology, and security remains a persistent weak point for even the largest organizations.
Political and structural issues can undermine the effectiveness of security policies; organizations are open to malicious damage from disgruntled employees, and probably even more open to damage caused by satisfied but careless ones.
Big data increases the risk. For one thing, big data breaches will be big breaches. For another, the more information you have, the more likely it is that it includes personal or sensitive information. Sources of information vary greatly, allowing multiple opportunities for infiltration. And finally, distributed computing, which is the only way to process the massive quantity of “big data”, opens up additional opportunities for data breaches.
Related Article: The Skinny on Big Data: Everything You Need to Know From Our CTO
Big data operations gather data from multiple sources, combining in-house stores with data harvested from public sources like blogs, social media, and clickstream data, then store and analyze this data, which can total in the petabytes (1 petabyte = 1024 terabytes, or roughly a million gigabytes).
Building big data infrastructure in-house is a major investment of time and money for research, hardware, software, and countless other details, so most organizations will not install their own big data infrastructure. Thus, the security of big data in the cloud must also be considered. The challenges associated with big data security and privacy can be divided into four groups: Infrastructure Security, Data Privacy, Data Management, and Integrity/Reactive Security.
Privacy Challenges and Recommendations
Maintain Security in Distributed Computing Frameworks.
The distributed computing framework, utilizing parallel computation across multiple Workers, creates opportunities for breaches of security. Identifying a malicious or unreliable Worker computer and protecting the data from these unreliable processors is a key to security solutions involving big data.
SOLUTION: There are two techniques for ensuring the trustworthiness of the worker computers:
- Trust establishment, in which Workers are stringently authenticated and granted access properties only by Masters explicitly authorized to do so. After the initial qualification, Worker properties must be rechecked periodically to ensure they continue to conform to predefined standards.
- Mandatory Access Control (MAC), in which the access of each Worker is constrained to a very limited set of tasks. In a MAC system, a user's ability to control the objects it creates is highly restricted: labels attached to every file system object define the appropriate access for that object, and every user is assigned an appropriately defined level of access.
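The label-checking logic at the heart of a MAC scheme can be sketched in a few lines; the object labels, clearance sets, and worker names below are invented for illustration, not taken from any particular framework.

```python
# Illustrative sketch of Mandatory Access Control (MAC) for Workers.
# Labels, paths, and worker names are hypothetical.

OBJECT_LABELS = {
    "/data/raw/clickstream": {"internal"},
    "/data/hr/salaries": {"internal", "confidential"},
}

WORKER_CLEARANCE = {
    "worker-01": {"internal"},
    "worker-02": {"internal", "confidential"},
}

def can_access(worker: str, obj: str) -> bool:
    """A Worker may read an object only if its clearance
    covers every label attached to the object."""
    required = OBJECT_LABELS.get(obj, set())
    granted = WORKER_CLEARANCE.get(worker, set())
    return required <= granted

print(can_access("worker-01", "/data/raw/clickstream"))  # True
print(can_access("worker-01", "/data/hr/salaries"))      # False
```

Because the labels live on the objects and the clearances are assigned centrally, no Worker can widen its own access, which is the defining property of mandatory (as opposed to discretionary) access control.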
Best Security Practices for Non-Relational Data Stores.
In finding solutions to big data management, many organizations migrate from a traditional relational database to a NoSQL (Not Only Structured Query Language) database to deal with the unstructured data.
A big gap in NoSQL database architecture is security. NoSQL solutions were originally built as solution-specific tools meant to operate within a larger framework, leaving security to the parent system. In addition, the architectural flexibility that makes NoSQL a good fit for multi-sourced data also leaves it vulnerable to attack.
SOLUTION: Data integrity should be enforced through an application or middleware layer, and encryption should be used at all times, both when data is in transit and when it is at rest. Database contents should be encrypted to protect data at rest, and client-server connections should be protected with SSL/TLS encryption so that only trusted computers can access the encrypted data.
It is sensible to monitor logs on a real-time basis to spot anomalies that may indicate an attack. Use data tagging and enforced time stamps to prevent unauthorized activity.
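Enforced time stamps and tamper-evident tagging can be approximated at the application layer with keyed hashes. The sketch below, using only Python's standard library, signs each log entry with an HMAC so that any edit to the entry or its timestamp becomes detectable; in practice the key would come from a key-management service, not a hard-coded constant.

```python
import hashlib
import hmac
import json
import time

# Assumption: in a real deployment this key is fetched from a KMS.
SECRET = b"replace-with-a-managed-key"

def make_entry(event: dict, ts=None) -> dict:
    """Attach an enforced timestamp and an HMAC tag so that tampering
    with the entry, or re-dating it, is detectable."""
    entry = dict(event, ts=ts if ts is not None else time.time())
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["tag"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return entry

def verify_entry(entry: dict) -> bool:
    """Recompute the tag over everything except the tag itself."""
    body = {k: v for k, v in entry.items() if k != "tag"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, entry["tag"])
```

A monitoring job can then reject or flag any log entry whose tag fails verification, which covers both altered fields and altered timestamps.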
To maintain performance, scalability, and the security of the overall system, NoSQL should be integrated into a framework, which will handle many of the security components for the system.
Preserve Privacy in Data Mining and Analytics.
Big data can enable “invasions of privacy, invasive marketing, decreased civil liberties, and increased state and corporate control”. The amount of information collected on each individual can be processed to provide a surprisingly complete picture. As a result, organizations that own data are legally responsible for the security and the usage policies they apply to their data.
Attempts to anonymize specific data are not successful in protecting privacy, because so much other data is available that seemingly harmless fields can be correlated and used for identification.
Users' data are also constantly in transit, being accessed by inside users and outside contractors, government agencies, and business partners sharing data for research.
SOLUTION: Privacy, for legal reasons, must be preserved whatever the cost, not only in money but also in system performance. Developing approaches include “differential privacy”, a formal, provable model that carries a great deal of system overhead, and an emerging technology known as homomorphic encryption, which allows analytics to run on encrypted data. Older, more standard measures include encryption of data within the database, access control, and stringent authorization policies. Keeping security patches up to date, another piece of standard wisdom, remains important.
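To make the differential-privacy idea concrete, here is a minimal, standard-library-only sketch of a counting query released with Laplace noise scaled to 1/epsilon. Production systems should use a vetted library rather than hand-rolled noise; this only shows the shape of the mechanism.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from Laplace(0, scale) via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon: float, rng=None) -> float:
    """A counting query has sensitivity 1 (one person changes the count
    by at most 1), so adding Laplace(1/epsilon) noise to the true count
    gives an epsilon-differentially-private release."""
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

ages = [23, 37, 45, 52, 61]
noisy = private_count(ages, lambda a: a > 40, epsilon=0.5)
```

Smaller epsilon means stronger privacy but noisier answers, which is exactly the privacy/utility trade-off the text describes as system overhead.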
An important consideration for implementing privacy policies is that legal requirements vary from country to country, and it is necessary to comply with the policies of the countries where you are active.
Related Article: How Health and Big Data Are Working Together to Save Lives
Encrypted Data-Centric Security.
There are two distinct approaches to controlling data visibility: first, controlling access to the system, and second, applying encryption to the data itself. The first method is easier and less costly to implement, but it is also less effective, because it leaves a larger “attack surface”.
If the system is breached, then the attacker has access to all the data. Deploying encryption on all data on a granular basis helps ensure that even if there is a system breach, the data itself remains protected.
SOLUTION: Identity- and attribute-based encryption can be used to enforce access control on individual objects through cryptography. Identity-based systems encrypt plain text so that only an entity with a specific identity can decrypt it. Attribute-based encryption applies the same controls based on attributes rather than identities. A fully homomorphic encryption scheme, as mentioned above, would keep the data encrypted even while it is being worked on.
Another tool to help maintain privacy is the concept of “Group Signatures”, which gives individual entities access to their own data while, publicly, they can be identified only as members of a group. Only trusted users would have access to specific identity information.
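Real attribute-based encryption enforces its policies cryptographically, which is well beyond a short example; the sketch below only illustrates the kind of policy logic such a scheme encodes, with invented attribute names.

```python
# Sketch of the policy logic behind attribute-based access control.
# A real ABE scheme enforces such a policy cryptographically at
# decryption time; here a plain recursive check illustrates the idea.

def satisfies(policy, attributes: set) -> bool:
    """Evaluate a policy tree: ("and", p1, p2, ...), ("or", p1, p2, ...),
    or a bare attribute string."""
    if isinstance(policy, str):
        return policy in attributes
    op, *clauses = policy
    if op == "and":
        return all(satisfies(c, attributes) for c in clauses)
    if op == "or":
        return any(satisfies(c, attributes) for c in clauses)
    raise ValueError(f"unknown operator: {op}")

policy = ("and", "analyst", ("or", "eu-region", "us-region"))
print(satisfies(policy, {"analyst", "eu-region"}))  # True
print(satisfies(policy, {"analyst"}))               # False
```

In an actual ABE deployment, a ciphertext carrying this policy could only be decrypted by a key whose attributes satisfy it, so the check cannot be bypassed by compromising an application-level gatekeeper.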
Granular Access Control.
When implementing big data security, it is important that privacy concerns are respected while still permitting data usage and analytics to continue. This is one of big data’s greatest security challenges, as the collection of data is useless without being able to use it; but a data privacy breach has legal and ethical implications, as well as marketplace effects. Granular access control acts on each piece of data, thus ensuring a high level of both security and usability.
There are three problems with effective implementation of granular access control:
- Keeping track of the security/secrecy requirements and policies in a cluster-computing environment;
- Keeping track of user access throughout all components of the ecosystem;
- Proper implementation of security/secrecy requirements with mandatory access control.
Choosing an appropriate level of granularity is its own challenge, and requires knowledge of the data store and analytics systems. Row-level access, because a row typically represents a single record, is often used for data derived from multiple sources. Column-level access, because a column represents a specific field for all records, is often used for sensitive elements, because the identification columns are not necessarily available to users. Cell-level access means a label is applied to every grain of information, and can support a wide range of usages and analytics. However, such a scheme must be rigorously applied in order to be effective, and in data sets this big, the overhead could be prohibitive.
SOLUTION: To cope with the complexity of tracking and implementation in big data environments, where the scale is so extensive, it is recommended to reduce the complexity of granular access controls on the application level. Instead, use the infrastructure to implement as much of the access control as possible, and adopt standards and practices that simplify whatever access controls still exist in the application level. This fits in well with the need to build a framework to support NoSQL security.
Ensure your access scheme assigns an appropriate level of granularity, balancing the size of the data store with the need for security.
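As a simplified illustration of column-level granularity, the sketch below redacts record fields against a per-role whitelist; the roles and column names are invented for the example. (Cell-level systems such as Apache Accumulo instead attach a visibility label to every individual value.)

```python
# Simplified sketch of column-level access control: each role may
# see only a whitelisted set of columns. Roles and columns are
# illustrative, not from any particular system.

COLUMN_ACL = {
    "analyst": {"age", "region", "purchases"},
    "auditor": {"age", "region", "purchases", "customer_id"},
}

def redact(record: dict, role: str) -> dict:
    """Return only the columns the role is permitted to see."""
    allowed = COLUMN_ACL.get(role, set())
    return {col: val for col, val in record.items() if col in allowed}

row = {"customer_id": "C-1042", "age": 34, "region": "EU", "purchases": 7}
print(redact(row, "analyst"))  # customer_id is withheld
```

Row- and cell-level schemes follow the same pattern, with the label attached to a record or to a single value rather than to a column, trading finer control for more label-management overhead.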
Secure Storage and Transaction Logging.
In order to deal with petabytes of data, a form of storage called auto-tiering has become a necessity. In auto-tiering, items are assigned a level of storage automatically, based on policies established by the organization. Auto-tiering opens a number of vulnerabilities due to unverified storage services and mismatched security policies. Moreover, because data is moved automatically, rarely accessed but critical information can end up on a lower tier, which typically has less security attached to it. Finally, auto-tiering maintains transaction logs of its activities, and these logs must now also be protected in order to protect the data they describe.
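A tiering policy that respects sensitivity can be sketched as follows; the tier names, access thresholds, and classification labels are illustrative assumptions, not drawn from any particular product. The point is that demotion decisions based on access frequency alone are exactly what strands critical data on a poorly secured tier.

```python
# Sketch of an auto-tiering policy that accounts for sensitivity:
# cold data is demoted to cheaper tiers, but never below the minimum
# tier its classification allows. All names and thresholds are invented.

TIERS = ["hot", "warm", "cold"]            # decreasing cost and security
MIN_TIER = {"public": "cold", "internal": "warm", "confidential": "hot"}

def assign_tier(accesses_per_day: float, classification: str) -> str:
    if accesses_per_day > 100:
        wanted = "hot"
    elif accesses_per_day > 1:
        wanted = "warm"
    else:
        wanted = "cold"
    floor = MIN_TIER[classification]
    # Keep whichever tier is higher (earlier in TIERS).
    return min(wanted, floor, key=TIERS.index)

print(assign_tier(0.2, "confidential"))  # hot: rarely accessed but critical
print(assign_tier(0.2, "public"))        # cold
```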
There are specific vulnerabilities associated with big data storage: Confidentiality and integrity, data provenance, and consistency. The need for availability presents risks as well: stored data needs to be available on demand, which requires the system to have hooks to the data that can be exploited.
Auto-tiered environments are susceptible to two types of attacks: collusion attacks, in which service providers exchange keys and access codes and thus gain access to more than the subset of data assigned to them, and rollback attacks, in which an outdated dataset is uploaded to replace the latest version.
SOLUTION: Technologies for dealing with some of these issues have become more robust in response to big data demands. Encryption is a crucial part of maintaining the confidentiality and integrity of data, and digital signatures using asymmetric encryption, regular audits, and hash chaining can help secure it. Persistent Authenticated Dictionaries (PADs), which allow queries against older versions of the structure, can assist in identifying rollback attacks. The secure untrusted data repository (SUNDR) is a network file system designed to store data securely on untrusted servers by enforcing fork consistency. Collusion attacks can be addressed with varying levels of encryption/decryption policies and digital rights management.
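The hash chaining mentioned above can be illustrated in a few lines: each link commits to the previous one, so modifying, reordering, or truncating earlier log entries breaks verification. The log entries below are invented examples.

```python
import hashlib

def chain(entries):
    """Build a hash chain over log entries: each link commits to the
    previous one, so editing or truncating history breaks verification."""
    prev = "0" * 64
    links = []
    for e in entries:
        prev = hashlib.sha256((prev + e).encode()).hexdigest()
        links.append(prev)
    return links

def verify(entries, links) -> bool:
    """Recompute the chain and compare it against the stored links."""
    return chain(entries) == links

log = ["alloc tier=hot obj=A", "demote tier=cold obj=A", "read obj=A"]
links = chain(log)
print(verify(log, links))                               # True
tampered = log[:1] + ["demote tier=cold obj=B"] + log[2:]
print(verify(tampered, links))                          # False
```

If the latest link is periodically published or countersigned, replacing the log with an older version (a rollback) is detectable because the newest link no longer matches.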
The problem is not an absence of technologies, but the absence of an all-inclusive systemic approach or framework. The result is a patchwork of security policies, rather than a unified structure in which all the parts work together to provide a sufficient level of security.
Granular Audits.
As discussed below, the goal of real-time security monitoring is to raise the alert at the first sign of trouble. However, since that doesn’t always happen due to the challenges of identifying real threats among a huge number of false alarms, it’s important to have frequent, granular audits to identify breaches after the fact. Audit information also helps identify exactly what happened so a similar attack or problem can be avoided in the future. An effective audit depends on four factors:
- Completeness of the information that is required for the audit;
- Timely access to the information;
- Integrity of the information;
- Controlled access to the information, to prevent tampering that would compromise its integrity.
SOLUTION: Start by enabling logging options for all components to ensure completeness of information. This would include applications at all layers, including operating systems. Deploy a forensic or SIEM (Security Information and Event Management) tool to collect, analyze, and process the log information. This should be done outside of the infrastructure used for the data, so it is not subject to the same risks as the data to be audited.
Related Article: Go Big or Go Home: How to Utilize Big Data for Human Resources
Data Provenance and Verification
Big data is collected from a wide variety of sources, and in enterprise settings that can mean millions of end-user machines. In this environment, the question of how trustworthy the data might be is of paramount importance. As the volume grows, so does the complexity of the provenance. Provenance information is contained in the metadata attached to each data object and provides information about the object’s creation.
In big data applications, the provenance metadata includes the provenance for the big data infrastructure itself, which is like having meta-metadata. As development in this area progresses, provenance metadata will become more complex due to large provenance graphs generated from provenance-enabled big data applications. Analytics for graphs of this size and complexity are very resource-intensive in terms of computational overhead.
The major threats to data integrity in big data applications are malfunctioning infrastructure, and attacks on infrastructure from inside or outside the organization. The provenance metadata itself must be protected as well so that audits and other detection methods can be effective in verifying data sources.
SOLUTION: Very granular access control is an important starting point for securing provenance and verification metadata. Data-independent persistence should also be satisfied when updating provenance graphs: even if a data object is removed, it may be an ancestor of other data, so its provenance record should be retained.
Furthermore, access control should be dynamic and scalable, and should use lightweight, fast authentication to reduce overhead. Secure channels between infrastructure components should be part of the architecture, and responsive, flexible revocation mechanisms should be included.
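The data-independent persistence requirement can be sketched as a small provenance graph that retains a node's provenance record when the underlying data object is deleted but other objects still descend from it; the class and node names here are illustrative.

```python
# Sketch of data-independent persistence for a provenance graph:
# deleting a data object must not delete its provenance node while
# other objects still descend from it. Names are invented.

class ProvenanceGraph:
    def __init__(self):
        self.parents = {}        # node -> set of ancestor nodes
        self.deleted = set()     # data removed, provenance retained

    def add(self, node, parents=()):
        self.parents[node] = set(parents)

    def has_descendants(self, node) -> bool:
        return any(node in ps for ps in self.parents.values())

    def remove(self, node):
        if self.has_descendants(node):
            self.deleted.add(node)   # keep the record: others derive from it
        else:
            self.parents.pop(node, None)

g = ProvenanceGraph()
g.add("raw_logs")
g.add("daily_report", parents=["raw_logs"])
g.remove("raw_logs")
print("raw_logs" in g.parents)   # True: provenance survives deletion
```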
Integrity and Reactive Security
Endpoint Input Validation and Filtering
Because of the breadth of data sources, including endpoint collection devices, a major challenge facing big data schemes is whether the data is valid from the point of input. Given the size of the data pool, how can we validate the sources? How can we be sure that a source of input data is not malicious, or simply incorrect? In addition, how can we filter out malicious or unreliable data?
Both data collection devices and programs are susceptible to attack. An infiltrator may fabricate multiple fake identities (a Sybil attack) or impersonate legitimate devices (an ID-cloning attack) and feed fake data to the collection system.
SOLUTION: Solutions must take two approaches: prevention of tampering, and detection and filtering of compromised data. However, it is virtually impossible to build a complex, extensive system that is completely resistant to tampering. There is therefore no fully secure way to ascertain the integrity of data at the point of input, but three recommendations help deal with this situation:
- The big data collection system design must take into account this inherent unreliability and the inevitability of relying on untrusted devices and try to develop the most secure data collection platforms and applications possible.
- The system should be able to identify likely Sybil and ID cloning attacks and be prepared with cost-effective ways to mitigate the attacks.
- Understanding that a determined adversary can infiltrate any existing system with false data, designers must develop detection and filtering algorithms to find and eliminate malicious input.
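One cheap heuristic for the second recommendation, flagging likely Sybil or ID-cloning activity, is to group submissions by some device fingerprint and flag fingerprints that present an implausible number of distinct IDs. The fingerprint field, ID names, and threshold below are illustrative assumptions.

```python
from collections import defaultdict

def flag_sybil_sources(submissions, max_ids_per_source: int = 2):
    """Group endpoint submissions by a device fingerprint (e.g. a
    hardware serial or network attestation) and flag fingerprints
    presenting an implausible number of distinct claimed IDs."""
    ids_by_source = defaultdict(set)
    for source, claimed_id, _payload in submissions:
        ids_by_source[source].add(claimed_id)
    return {s for s, ids in ids_by_source.items()
            if len(ids) > max_ids_per_source}

data = [
    ("fp-aa", "sensor-1", 20.1),
    ("fp-aa", "sensor-1", 20.3),
    ("fp-bb", "sensor-2", 19.8),
    ("fp-cc", "sensor-3", 55.0),
    ("fp-cc", "sensor-4", 54.0),
    ("fp-cc", "sensor-5", 56.0),  # one fingerprint, many IDs: suspicious
]
print(flag_sybil_sources(data))  # {'fp-cc'}
```

A heuristic like this only raises candidates for mitigation; a determined adversary can vary fingerprints too, which is why the third recommendation still calls for downstream detection and filtering of malicious input.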
Real-Time Security Monitoring
Real-time security monitoring is intended to alert the organization at the very first sign of an attack. However, there is an enormous amount of feedback from SIEM systems, whose aim is to provide a big-picture view of an organization’s data security in real time.
Few organizations have the resources to monitor this feedback with the kind of oversight and analysis necessary to identify real attacks from false alarms. Privacy considerations drive the need for high security, but make detection delicate, as privacy laws need to be navigated along with the analytics that will identify attacks.
SOLUTION: Big data analytics itself can be used to identify threats, including differentiating real threats from false positives. Logs can be mined for anomalous connections to the cluster, and improved analytics can help separate out the false positives.
Your big data framework needs to include analysis and monitoring tools. If they are not available within the framework, such tools can be placed in a front-end system whose primary job is to provide the analytics necessary to assess the SIEM feedback and identify threats.
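The kind of log mining described above can be sketched with a simple statistical rule: flag hosts whose connection counts deviate sharply from the mean. A real SIEM uses far richer features and models; the threshold and traffic figures here are illustrative.

```python
import statistics

def anomalous_hosts(conn_counts: dict, z_threshold: float = 1.5):
    """Flag hosts whose connection count to the cluster deviates from
    the mean by more than z_threshold standard deviations. The
    threshold is illustrative; tuning it trades missed attacks
    against false alarms."""
    counts = list(conn_counts.values())
    mean = statistics.mean(counts)
    stdev = statistics.pstdev(counts)
    if stdev == 0:
        return set()
    return {host for host, c in conn_counts.items()
            if abs(c - mean) / stdev > z_threshold}

traffic = {"10.0.0.1": 42, "10.0.0.2": 39, "10.0.0.3": 41,
           "10.0.0.4": 40, "10.0.0.9": 950}
print(anomalous_hosts(traffic))  # {'10.0.0.9'}
```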
Big Data Techniques Used for Security
This last point is worth a second look in closing. An emerging area of security uses big data analytics itself to detect threats at an early stage, applying pattern analysis to identify anomalies that may indicate a breach or a threat. This suggests that big data analytics is an effective technique with broadly extensible applications.
As in a game of chess, eventually you use the king to help in his own defense; in just that way, big data analytics can be employed to analyze situations, evaluate interactions, and even help model realistic and effective solutions to its own dangers.