Sunday, August 23, 2015

Big Data and Security - still a problem?



When we talk to clients about Big Data it's often assumed that there is a strong security infrastructure in place to secure all that data, either at rest or in motion. While that assumption is starting to become true, it's important to understand where the Hadoop ecosystem is strong on security and where there is additional work to do.


  1. Go back 2 years and the answer to the 'Security?' question was very simple - Kerberos. While that sounds a little silly now, it really was that basic. Any requirements for role-based access control, encryption, audit trails / governance / compliance, intrusion detection and all the other requirements of Enterprise security were handled with a single, somewhat evasive response: 'Those aspects are covered by the operating system and/or the networking systems.' What this actually meant was that Big Data implementations had gaping holes in security.
  2. We worked with one client whose Internal Audit team worked out this lack of security and kept a large Big Data project on hold for a full 12 months before they would allow the Hadoop cluster to be joined in any way to the Enterprise Infrastructure.
  3. If we look at where security in Hadoop stands today, customers can implement:
    1. Authentication using LDAP or AD (this is the Kerberos piece).
    2. Encryption at rest or in motion inside the cluster.
    3. Role-based access control (for example using Apache Sentry).
    4. Data redaction, which avoids the problem of admins having access to all data. This is critical when using Hadoop clusters for PCI use cases, so that PII (Personally Identifiable Information) can be redacted (a simple redaction sketch follows below).
    5. Data governance that provides auditing, data lineage and data life-cycle management.
    6. Key management for encryption keys, certificates etc.
These are all significant advances that have been implemented in various Apache projects such as Sentry. However, it's also clear from our discussions with clients that most CIOs and CISOs still don't feel 100% comfortable with Big Data security. This is particularly true in Europe, as a recent survey by Forrester showed: http://eandt.theiet.org/news/2015/apr/big-data-privacy.cfm
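To make the redaction point (item 4 above) concrete, here is a minimal sketch in plain Python, not tied to Sentry or any particular Hadoop tool, of masking PII before records land in the cluster. The field names and masking rules are hypothetical; a real deployment would do this in the ingestion pipeline or through the platform's own redaction features.

    import hashlib
    import re

    # Hypothetical PII fields to redact before data lands in the cluster.
    PII_FIELDS = {"name", "email", "card_number"}

    def redact(record):
        """Return a copy of the record with PII fields masked or tokenized."""
        clean = dict(record)
        for field in PII_FIELDS & clean.keys():
            value = str(clean[field])
            if field == "card_number":
                # PCI-style masking: keep only the last four digits.
                clean[field] = re.sub(r"\d(?=\d{4})", "*", value)
            else:
                # One-way token: joins still work, but admins never see the raw value.
                clean[field] = hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
        return clean

    print(redact({"name": "Jane Doe", "email": "jane@example.com",
                  "card_number": "4111111111111111", "basket_total": 42.50}))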

So - when we talk to clients about their Big Data strategy and how they should design and architect what, for many of them, is totally new territory, we now ask a set of simple questions :-
  1. Is the CISO part of the Big Data strategy team? If not, why not?
  2. Does the client have the rights to use all the data they plan to use? (We check this as part of the Discovery process.)
  3. Will the client implement data redaction?
  4. Is the client willing to encrypt everything?
  5. Who owns the cluster security profile?
This is not meant to be a complete list, but it forces the client to consider security at the design phase, not as an afterthought.

What do you think? Is Security in Big Data a big problem? Do you think the progress in the last 2 years has allowed Hadoop implementations to catch up with more traditional designs?


Friday, August 14, 2015

Big Data Applications - anyone out there?



If you read my post last week, you'll have seen that Big Data and Hadoop are no longer running on commodity hardware. This is a good thing, provided you are prepared to accept the higher hardware cost, because it lays the groundwork for the next big move in Big Data - off-the-shelf, high-value applications.
If you are at all involved in Big Data you know several things :-
  1. It's moving very fast: new Apache projects, new Cloudera Labs projects, and significant upgrades in real-time/streaming/analytics capabilities.
  2. Right now, Big Data is a bunch of tools - infrastructure tools, ingestion tools, analytic tools. Go to any Big Data conference and what you see is a lot of customer-specific use cases and a lot of talk about the latest 'tools'.
  3. Make any kind of search for 'Big Data Applications' or 'Hadoop Applications' or even 'Cloudera Applications' and you come up empty. What you will find is a lot of SaaS Analytics offerings that leverage a Big Data infrastructure. To me, these are not new Big Data Apps, but simply BI/Analytics apps using additional data sources. Here's an example: http://www.cio.com/article/2917433/startups/can-a-startup-democratize-big-data-apps.html It looks interesting, but it's not a Big Data app.
I believe there is a second Big Data wave coming soon that will be just as important as the ERP, CRM and Mobile/online waves were. That is - Big Data Apps that move beyond just analytics and start offering services to customers based on real-time, streaming data ingestion matched up with geo-location and customer preference information. This is not new and it's certainly not an original idea. What's new is that Big Data, and the Hadoop ecosystem in particular, can now deliver what businesses and governments have aimed for - customer/client/citizen segmentation that delivers individual services/offers/capabilities to specific individual people.
Very simple example - a multi-channel business (bricks and mortar, online, etc.) targets a micro-segment of couples under 30 with no kids, living in the top 50 income neighborhoods in the US. Every time one of those customers or potential customers passes by a multi-channel outlet (either physically or online) they receive a co-ordinated marketing campaign just for them (promotional offer, 2-for-1 offer, Groupon offer, or whatever the analytics part of the Big Data process has defined as the right offer). There are many other Apps that we will discuss in later posts.
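To make that example concrete, here is a minimal sketch using Spark Streaming (PySpark). The socket source, record layout and offer rule are all hypothetical - a real pipeline would more likely ingest location events from Kafka or Flume and push offers into a campaign system rather than printing them.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="micro-segment-offers")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # Static (customer_id, segment) pairs produced by the analytics side,
    # e.g. under-30 couples, no kids, top-50 income neighborhoods.
    segments = sc.parallelize([("cust-001", "young-affluent"),
                               ("cust-042", "young-affluent")])

    # Streaming geo-location events, one "customer_id,store_id" line each.
    events = ssc.socketTextStream("localhost", 9999) \
                .map(lambda line: tuple(line.strip().split(",")[:2]))

    # Match each event against the segment table and attach an offer.
    offers = events.transform(lambda rdd: rdd.join(segments)) \
                   .map(lambda kv: (kv[0], kv[1][0], "2-for-1 offer"))

    offers.pprint()  # In practice: push to the marketing/notification system.

    ssc.start()
    ssc.awaitTermination()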
When I review the use cases that we have developed for customers, in the beginning it was simple batch-oriented analytics for customer analysis, maybe telco infrastructure analysis. Then, maybe 2 years ago, the discussion moved to real-time analytics. What I see now is a definite move to discussing not just analytics but new services to customers (B2C, B2B, G2C etc.) that can only be developed using Big Data paradigms.
Critical in this (and in current Big Data projects) is security, and we will discuss that next week.
What do you think? Are we still in the era of Big Data tools? Or are we entering the next phase of Big Data Apps?

The Internet of Things and Big Data


We are seeing increasing interest from a number of clients in IoT. While it's still more of a hype phrase than reality, there are a number of organizations that can point to real success with IoT and Big Data. One of them is Transport for London (TfL), who use their thousands of sensors in their transport network to build actionable insights. Lauren Sager Weinstein of TfL explains more here: https://tfl.gov.uk/cdn/static/cms/documents/big-data-customer-experience.pdf
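As a very rough illustration of turning raw sensor readings into an actionable insight, here is a small plain-Python sketch that flags stations where the rolling average platform wait time gets too high. The feed, field names and thresholds are entirely made up and not based on TfL's actual systems.

    from collections import defaultdict, deque

    # Hypothetical (station_id, wait_seconds) readings from platform sensors.
    READINGS = [("kings-cross", 90), ("kings-cross", 240), ("kings-cross", 310),
                ("victoria", 60), ("victoria", 75)]

    WINDOW = 3           # readings per station to average over
    ALERT_SECONDS = 180  # flag stations where the average wait exceeds this

    windows = defaultdict(lambda: deque(maxlen=WINDOW))

    for station, wait in READINGS:
        windows[station].append(wait)
        avg = sum(windows[station]) / len(windows[station])
        if avg > ALERT_SECONDS:
            # Actionable insight: add staff, adjust service, or re-route passengers.
            print("%s: average wait %.0fs over last %d readings"
                  % (station, avg, len(windows[station])))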
To me it seems that the Public sector, and particularly State and Local Government, are the leading adopters of IoT and Big Data - an interesting idea, local government as technology leaders. We see this trend in Europe rather than the USA, in multiple regional and EU projects where governments are trying to leverage their massive arrays of sensors to improve the delivery of services to citizens.
The area where there is a lot of discussion but not a lot of real use cases is the Enterprise. This week I was at a very large retailer who is a client of ours. Clearly retailing is the next obvious sector for IoT: lots of locations, lots of customers, lots of channels, lots of assets producing data. David Dorf at Oracle put out a good list of possible use cases in January - you can read it here: http://www.forbes.com/sites/oracle/2015/01/09/how-the-internet-of-things-will-shake-up-retail-in-2015/
So - what next? It's clear IoT is going to change Big Data architectures again. Just as we get used to Spark possibly replacing MapReduce, the massive increase in data volumes and types from IoT is, for sure, going to lead to another architecture pivot.
What are your thoughts? Do you see IoT use cases in the Enterprise happening today, or is it just talk? What about the Public sector - are they the leaders in IoT?

Spark and Hadoop - replace or complement?


I recently read a survey report from Typesafe of 2,136 respondents who were asked about Spark vs Hadoop. You can see the report here for yourself (registration required) https://info.typesafe.com/COLL-20XX-Spark-Survey-Report_LP.html?lst=RW&lsd=COLL-20XX-Spark-Survey-Trends-Adoption-Report
The most interesting part of the report for me was that 78% of respondents were using Spark for fast processing of BATCH data sets! Think about that. Spark can work with HDFS as the persistent data store, and it is really good at processing streaming, transactional data - but most respondents are just using it to make batch go faster.
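For context, this is roughly what the 'Spark as faster batch' pattern looks like in practice - a minimal PySpark sketch that reads files already sitting in HDFS, aggregates them, and writes the results back. The path and record layout are hypothetical.

    from pyspark import SparkContext

    sc = SparkContext(appName="daily-sales-rollup")

    # CSV lines of the form: store_id,sku,amount
    lines = sc.textFile("hdfs:///data/sales/2015-08-14/*.csv")

    totals = (lines
              .map(lambda line: line.split(","))
              .map(lambda fields: (fields[0], float(fields[2])))  # (store_id, amount)
              .reduceByKey(lambda a, b: a + b))                   # total per store

    totals.saveAsTextFile("hdfs:///data/sales/rollups/2015-08-14")
    sc.stop()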
This is our experience too - when we talk to customers they want to consider Spark; they know they have to think about future use cases which will almost certainly involve streaming data, transactional data sets and - most importantly - real-time analytics and machine learning. But - for now, even ten years after Doug Cutting and Mike Cafarella invented Hadoop - we are still seeing the vast majority of use cases focused on batch processing. It really is - back to the 80's!
So - in my view - Spark is not replacing Hadoop but is simply complementing what is already out there. What do you think?
This Aptuz blog also summarizes neatly the Spark vs Hadoop discussion http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/ 

Big Data Appliances - good or bad?


We saw in the last week that Teradata has updated their Big Data Appliance: it can now ship with Cloudera, and has had Hortonworks available for some time. http://ht.ly/PFrgV Although appliances are appealing - simplicity, ease of management, lots of testing done by the vendor(s), and they should work more or less out of the box - I am not convinced.
First - why limit yourself to the configuration that the appliance comes in? The Teradata box is a hefty machine, and prices for Hadoop appliances typically start at $500k plus.
Second - be sure that what the Hardware vendor is shipping is actually the latest release. We have had experience of appliances that are behind the Apache versions that are already out there. This creates a lot of support headaches.
Third - look very carefully at all the proprietary stuff that is loaded on the appliance. This is classic Enterprise Software lock-in. Don't want to get locked in? Don't buy proprietary software, or at least buy solutions where you can escape. As the saying goes, 'Make sure your data is your data, not your supplier's.'
What do you think, are appliances the way to go?