Wednesday, October 28, 2015

Big Data has a Big Problem

One of the keys to success with Big Data is having the skills to deliver Big Data projects. While some of these skills are 'soft skills', the core requirement is knowing what we are doing with complex, highly interrelated and fast-moving technology. I recently read a Capgemini report on How Successful Companies Make Big Data Operational.
The report itself made for interesting reading, but it also highlighted one big problem:
Global organizational spending on Big Data exceeded $31 billion in 2013, and is predicted to reach $114 billion in 2018.
That is growth of roughly 270% in five years. Now think about the critical issues this creates: not enough Data Scientists, admins, developers and NoSQL experts (plus many other critical skill sets) are being trained right now to come anywhere close to satisfying this demand. As long ago as 2011, McKinsey estimated that there will be a 1.5-million-person hole in the US workforce alone of managers and analysts capable of using Big Data to make effective decisions.
So, as others have pointed out, if you are not Facebook, Google, LinkedIn etc., what can you do? One interesting set of results lies in some research conducted at Microsoft earlier this year. The full report is here and a great summary is at The Register. The study effectively shows that Big Data, the technology, is not failing inside Microsoft; but, equally, the results are not yet packaged in a way that helps the consumers of the information to use it.
This is the critical point: when we talk to clients we discuss the details of the technology we believe will meet their needs, but there is absolutely no substitute for experience, and experience inside the customer's own organization is critical. For this reason, whenever a client asks for help on Big Data we always tell them: get started, get started right now, and start building your own institutional knowledge. That way you will not get trapped in a skills-shortage dead-end.
What do you think? Is the skills shortage going to stifle the adoption of Big Data? For an interesting slant on this question, take a look at this VentureBeat article from Cameron Sim at Crewspark.

Wednesday, October 14, 2015

Data wrangling - just a phase or here to stay?

There have been myriad articles, blogs and posts in the last 12-24 months about Data Wrangling. I don't intend to re-hash those here; if you want a summary of Data Wrangling, take a look at this article by Lukas Biewald: http://www.computerworld.com/article/2902920/the-data-science-ecosystem-part-2-data-wrangling.html
We have worked with many clients where Data Wrangling has been the largest part of the Professional Services engagement. Forget about all the desired outcomes of better/faster/new/amazing insights; that part only comes after we get the data into the cluster. Getting the data into a usable format for the Hadoop cluster, and then ensuring it stays that way, is usually a major piece of effort. Others have described it as 'janitorial' work, but that label hides the high level of complexity in choosing how to map the raw data into formats that the client will want to use and that the Hadoop cluster can ingest.
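To make the 'janitorial' point concrete, here is a minimal sketch of the kind of clean-up pass we mean, using Python with pandas (one of the tools mentioned below). The file and column names are invented for illustration, and writing Parquet assumes pyarrow is available:

```python
import pandas as pd

# Raw extract: inconsistent headers, string dates, stray nulls.
# (File and column names are invented for this example.)
df = pd.read_csv("raw_extract.csv")

# Normalise column names so downstream jobs can rely on them.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Parse dates, coercing unparseable values to NaT instead of failing.
df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")

# Drop rows missing the fields every downstream job needs.
df = df.dropna(subset=["event_date", "customer_id"])

# Write a columnar file the cluster can ingest directly.
df.to_parquet("clean/extract.parquet", index=False)
```

Every one of those small decisions (which columns are mandatory, how to treat bad dates) is exactly the judgement work that the 'janitorial' label hides.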
So, given that Data Wrangling is a well-known concept now, is that skill going to be required for some time, or will tools emerge (tools again....) that will semi-automate or completely automate the process?
There are some commercial products out there like Trifacta and ClearStory Data, then multiple open source tools like Tabula, DataWrangler (confused yet?) and various R packages, and you can even use Python (with pandas).
Many of these cross over into dashboarding and visualization; even Datameer could be considered a Data Wrangling tool in some ways.
The question is: will Data Wrangling, as a required skill set and, more importantly, as a major element of Big Data projects, disappear under an onslaught of products that can do it quicker and more cost-effectively?
My view is: in your dreams. The old standby of the three V's shows why. Volume is increasing, Velocity is increasing and, most importantly for Data Wrangling, Variety is going to continue to accelerate. Think about IoT, real-time streaming, transactional data, unstructured data from legacy systems, and new use cases emerging every day. For sure, the standard Data Wrangling tasks around SQL, CSV, XML, JSON etc. will be handled by products, but with the ever-growing number of data sources, and with Big Data continuing to redefine Enterprise computing, I don't think Data Wrangling is going to disappear for a while yet. Customers should be prepared to keep spending a large share of their Big Data budgets simply on getting data ready to be ingested.
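As a small illustration of why Variety is the killer, here is a hedged sketch of flattening heterogeneous nested JSON, with invented IoT-style payloads and assuming a recent pandas that exposes json_normalize at the top level. Every new source forces a fresh decision about how to map it:

```python
import pandas as pd

# Two IoT-style events from different device types: same feed,
# different payload shapes. (Payloads invented for illustration.)
events = [
    {"device": {"id": "d-17", "type": "thermostat"},
     "reading": {"temp_c": 21.5, "ts": "2015-10-14T08:00:00"}},
    {"device": {"id": "d-23", "type": "meter"},
     "reading": {"kwh": 3.2, "ts": "2015-10-14T08:00:05"}},
]

# Flatten nested records into columns; devices with different payloads
# end up with NaN in the columns they do not share.
flat = pd.json_normalize(events, sep="_")
print(flat.columns.tolist())
# ['device_id', 'device_type', 'reading_temp_c', 'reading_ts', 'reading_kwh']
```

No product can make that mapping decision for you; it depends on what the downstream consumers of the data actually need.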
Want to know more?
For a good list of free Data Wrangling tools, visit the Varonis blog: http://blog.varonis.com/free-data-wrangling-tools/
What do you think? Is Data Wrangling going to disappear, replaced by products like Trifacta?

Tuesday, October 6, 2015

Kudu - the end of MapReduce?

Kudu, Cloudera's new storage engine for Hadoop, is going to fill in the gaps in Hadoop's storage layer: almost as good as HDFS at what HDFS is good at (high-speed writes and scans) and, at the same time, almost as good as HBase at what HBase does best (random-access queries). Although it's a long way from being enterprise-ready, it's clear that Kudu can do away with architectures that we have implemented for several customers:
  • Persistence of the full data set in HDFS, and
  • A duplicate subset of that data in HBase for real-time access, analytics, GIS services etc.
Not only is this dual architecture expensive in hardware and Professional Services, but it introduces high levels of complexity that a lot of customers are uneasy about. So, when Kudu is ready, it will spell the end of HDFS and HBase for many customers.
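To see why a single engine is so attractive, here is a hedged sketch of both access patterns against one Kudu table using the kudu-python client. Kudu is still a long way from enterprise-ready, so treat the API details as indicative; the table and column names are invented:

```python
import kudu
from kudu.client import Partitioning

client = kudu.connect(host="kudu-master.example.com", port=7051)

# Define a simple table; the primary key is what enables
# HBase-style random access by key.
builder = kudu.schema_builder()
builder.add_column("sensor_id").type(kudu.int64).nullable(False).primary_key()
builder.add_column("reading").type(kudu.double)
schema = builder.build()

client.create_table("readings", schema,
                    Partitioning().add_hash_partitions(["sensor_id"], 3))
table = client.table("readings")

# Fast writes, the job HDFS does today.
session = client.new_session()
session.apply(table.new_insert({"sensor_id": 17, "reading": 21.5}))
session.flush()

# Random-access read back by key, the job HBase does today.
scanner = table.scanner()
scanner.add_predicate(table["sensor_id"] == 17)
print(scanner.open().read_all_tuples())
```

One table, one copy of the data, serving both the write-heavy and the random-read workloads that currently need HDFS plus HBase.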
So now it becomes clear that the architectural goal is to replace HDFS and HBase with Kudu and, as we have already heard plenty about, to replace MapReduce with Spark.
That gives you the target architecture 2-3 years out: Kudu and Spark replacing HDFS/MapReduce/HBase. As I and others have written in previous posts, this will then allow for full real-time streaming analytics and services on massively scalable clusters. It is this architecture that will lead to an explosion in IoT use cases.
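For a flavour of what that target architecture could look like from the developer's seat, here is a hedged sketch of Spark SQL querying a Kudu table directly, with no HDFS files or HBase in the path. It assumes the kudu-spark connector is on the classpath; the format string and option names follow the connector's documentation and may vary by release:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kudu-analytics").getOrCreate()

# Read the Kudu table as a DataFrame. (Master address and table
# name are invented for this example.)
readings = (spark.read
            .format("org.apache.kudu.spark.kudu")
            .option("kudu.master", "kudu-master.example.com:7051")
            .option("kudu.table", "readings")
            .load())

# Plain Spark SQL over storage that is both mutable and fast to scan;
# no second copy of the data in HBase required.
readings.createOrReplaceTempView("readings")
spark.sql("""
    SELECT sensor_id, avg(reading) AS avg_reading
    FROM readings
    GROUP BY sensor_id
""").show()
```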
Now, one last point: in all this change there are some dinosaurs out there. Yes EMC, I am looking at you, and you HP, and IBM and Teradata and even NetApp. These businesses are in flat or declining markets (take a look at this Forbes article for more detail: http://www.forbes.com/sites/greatspeculations/2015/01/02/how-emc-lines-up-against-netapp-hp-ibm-hitachi-in-storage-systems-market/ ). As Kudu gains traction, these older vendors with old-style technology will become less and less relevant. EMC has been reinventing itself for a while, but it will be interesting to see how the decline, or even disappearance, of Enterprise storage reshapes the landscape.
Maria Deutscher also makes some great points about Spark and the Hadoop Ecosystem over at SiliconANGLE  http://siliconangle.com/blog/2015/09/28/apache-kudu-how-cloudera-wants-to-save-hadoop-by-killing-it/