When we talk to clients about Big Data, it's often assumed that a strong security infrastructure is already in place to protect all that data, whether at rest or in motion. While that assumption is starting to be true, it's important to understand where the Hadoop ecosystem is strong on security and where there is additional work to do.
- Go back 2 years and the answer to the 'Security?' question was very simple: Kerberos. While that sounds a little silly now, it really was that basic. Any requirements for Role Based Access control, encryption, audit trails / governance / compliance, intrusion detection and all the other requirements of Enterprise security were handled in a single, somewhat evasive response: 'Those aspects are covered by the operating systems and / or the networking systems.' What this actually meant was that Big Data implementations had gaping holes in security.
- We worked with one client whose Internal Audit team identified this lack of security and kept a large Big Data project on hold for a full 12 months before they would allow the Hadoop cluster to be joined in any way to the Enterprise Infrastructure.
- If we look at where Security in Hadoop stands today, customers can implement:
- Authentication using LDAP or AD (this is the Kerberos stuff).
- Encryption at rest or in motion inside the cluster.
- Role Based Access control (for example using Apache Sentry)
- Data redaction (thus avoiding the problem of admins having access to all data). This is critical when using Hadoop clusters for PCI use cases so that PII (Personally Identifiable Information) can be redacted.
- Data Governance that provides Auditing, data lineage, data life-cycle management.
- Key management to manage encryption keys, certificates etc.
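To make the data redaction item concrete, here is a minimal sketch in Python of masking PII in free text before it is exposed to users or admins. It is not how any particular Hadoop component implements redaction; the patterns `CARD_RE` and `EMAIL_RE` and the helper names `mask_card` and `redact` are hypothetical, chosen purely for illustration, and real PII detection needs far more care than two regular expressions.

```python
import re

# Hypothetical patterns, for illustration only.
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")
# A deliberately simple email pattern.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")


def mask_card(match):
    """Replace all but the last four digits of a card-like number with '*'."""
    digits = re.sub(r"\D", "", match.group())
    return "*" * (len(digits) - 4) + digits[-4:]


def redact(text):
    """Return text with card-like numbers masked and email addresses removed."""
    text = CARD_RE.sub(mask_card, text)
    return EMAIL_RE.sub("[REDACTED EMAIL]", text)


if __name__ == "__main__":
    print(redact("Card 4111 1111 1111 1111 for jane@example.com"))
```

In practice the redaction rules would be defined and enforced by the platform rather than by each application, so that no one can simply skip the call.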
The capabilities listed above are all significant advances that have been implemented in various Apache projects such as Sentry. However, it's also clear from our discussions with clients that most CIOs and CISOs still don't feel 100% comfortable with Big Data Security. This is particularly true in Europe, as a recent survey by Forrester showed. http://eandt.theiet.org/news/2015/apr/big-data-privacy.cfm
So when we talk to clients about their Big Data strategy and how they should design and architect what, for many of them, is totally new, we now ask a set of simple questions:
- Is the CISO part of the Big Data strategy team, if not why not?
- Does the client have the rights to use all the data they plan to use? We need to make sure of this as part of the Discovery process.
- Will the client implement data redaction?
- Is the client willing to encrypt everything?
- Who owns the cluster security profile?
What do you think? Is Security in Big Data a big problem? Do you think the progress in the last 2 years has allowed Hadoop implementations to catch up with more traditional designs?