Weekly blog dedicated to all things Big data, some technical, some market oriented, some vendor related, always customer oriented.
Tuesday, September 1, 2015
Hadoop Distributions - do we care which one to use?
It's almost 9 months since Hortonworks went public, and Cloudera continues to provide some insight into its growth. Along with MapR, IBM and Pivotal (EMC), Amazon's Hadoop offerings and straight Apache Hadoop, we can get a pretty clear picture of how each distribution is doing. Of course this is Open Source (mostly), so each Distribution Vendor's work gets recycled into Apache as well as making it into their own version.
When we talk to clients and they ask us 'Which Distribution do you recommend?' we give one of two answers :-
If they have nothing deployed or (more typically) have failed in-house deployments, we always recommend Cloudera. This is because it has the most enterprise-ready ecosystem, security and governance capabilities, and is easy to scale quickly.
If they already have something deployed (usually Hortonworks but sometimes Pivotal or Amazon), we tell them to keep using it and we will help them achieve a quicker time to value and/or get a return on an already existing investment.
In our experience working with clients in Telco, Retail, Finance, Government and other sectors there are 5 real factors to consider when implementing a Big Data project or trying to rescue a failed one :-
Are you buying the right hardware? I have covered in previous posts how the hardware landscape for Big Data is changing, but with the ecosystem changing so rapidly, make sure you are buying hardware that will meet both current and future needs.
Is your Big Data strategy driven by the business side of the company? Big Data projects driven by IT get stuck in all sorts of pointless discussions about - for example - which Distribution to use!
Are your use cases clearly defined?
Does the partner you are working with have real, referenced projects and customers? I can't tell you how many times we have worked with clients with failed projects who - in reality - had been paying their consulting provider for on-the-job training in Big Data without knowing it.
Get ready to scale - once a Big Data infrastructure is in place the business usually demands a rapid adoption of new use cases.
As you can see, with these 5 factors, choice of the Hadoop distribution can be an afterthought. So sure, we can all get excited about ODP, or Vora from SAP (http://fortune.com/2015/09/01/sap-to-bridge-big-data-gap/), or many other technology issues, but the most important factors in our experience are those above, and the technology is secondary.
What do you think? Are Hadoop Distributions not that important? What do you think about ODP - hype or important?