David Bennett's Big Data Blog - Excelerate Systems

Wednesday, May 3, 2017

Is Docker Ready for Prime Time?

Docker - ready for prime time or not? It's a question that has been asked (and answered) hundreds if not thousands of times already. So, rather than repeat that long and somewhat tired conversation, I want to focus on one piece of the debate: security.

There are three issues that I see with Docker and security:

Because Docker is open source, it has been widely adopted and is almost certainly already deployed inside your organization, whether you like it or not. This makes for all sorts of security nightmares that CISOs and their teams are unable to control. What if an employee introduces uncontrolled code to a mission-critical stack? What does that do for compliance, internal audit and corporate governance, not to mention liability?
Many others have covered the point about the large attack surface: thousands of containers versus hundreds of apps and VMs. The larger the attack surface, the more vulnerable your organization is to internal and external breaches.
The very flexibility that Docker, and containerization in general, provides also opens a massive security hole. What if a rogue employee or external intruder plants a container that launches an East-West attack? Good luck finding that single container among the thousands you have already deployed.
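One way to make the "single container among thousands" problem more tractable is to keep an inventory of what is running and compare it against an allowlist of approved images. The sketch below is only an illustration under stated assumptions, not a Docker feature: the `Container` record, the `APPROVED_REPOSITORIES` set and the registry names are all hypothetical, and in practice the inventory would come from whatever tooling lists the running containers on each host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """Hypothetical inventory record for one running container."""
    container_id: str
    image: str
    host: str


# Hypothetical allowlist of image repositories approved by the security team.
APPROVED_REPOSITORIES = {
    "registry.example.com/payments",
    "registry.example.com/web",
}


def repository_of(image: str) -> str:
    """Return the image reference without its tag or digest."""
    image = image.split("@", 1)[0]          # drop a digest such as @sha256:...
    last_slash = image.rfind("/")
    last_colon = image.rfind(":")
    if last_colon > last_slash:             # a colon after the last slash is a tag
        image = image[:last_colon]
    return image


def find_unapproved(containers):
    """List containers whose image repository is not on the allowlist."""
    return [
        c for c in containers
        if repository_of(c.image) not in APPROVED_REPOSITORIES
    ]


# Made-up inventory for illustration only.
inventory = [
    Container("a1", "registry.example.com/web:1.4", "host-01"),
    Container("b2", "registry.example.com/payments@sha256:abc", "host-02"),
    Container("c3", "docker.io/unknown/miner:latest", "host-02"),
]

suspects = find_unapproved(inventory)
for c in suspects:
    print(f"unapproved image {c.image} in container {c.container_id} on {c.host}")
```

An allowlist like this only catches images nobody approved; it does nothing about an approved image that has been tampered with, which is where image signing and scanning come in.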

There is plenty of advice out there on how to implement Docker security effectively - this article from Amir Jerbi, co-founder and CTO of Aqua Security, is a good basis.

In my discussions with customers about Docker it's clear that, at the Enterprise level, they are not yet comfortable adopting Docker. Typical responses include 'Maybe next year', 'Let's wait and see' and 'Who else is using Docker across their infrastructure?' All good points with limited answers. Look at the list of Docker customers at docker.com. Are these all in production? Let's hope so.

When customers ask me about Docker security I always tell them 'Be careful, move forward in a considered way and you might just end up where you expect to be. If you let it get out of control you will spend a lot of time and money getting Docker under control.' Full disclosure - we offer Docker/Containerization as a service from Alauda. We do this because Containerization as a service running on AWS or Azure is intrinsically more secure than letting Docker loose in your Datacenter.

What do you think? Is Docker ready for Prime time?

Thursday, April 7, 2016

Big Data and Security - the next big disruptor?

Last quarter I was invited to a Cloudera sales event in Las Vegas. There were some impressive stats on last year's performance, a lot of enthusiasm and, in particular, a great session from Charles Zedlewski (@zedlewski) outlining some of the product and Apache initiatives coming soon.

Two in particular have now been announced:

Open Network Insight (ONI) https://vision.cloudera.com/introducing-open-network-insight-accelerating-cybersecurity-analytics-solutions/ ONI is effectively Cloudera's move into core security functions such as threat detection.
Apache Arrow https://arrow.apache.org/ Arrow uses the latest SIMD (single instruction, multiple data) operations to optimize analytical data processing. Arrow is Cloudera's attempt to gain control of in-memory columnar data processing.

So far, so good, but these two announcements will make a huge impact in the IT security market. For some time now there has been little innovation in security. The main players are all offering incremental enhancements to technology that has been around for years.

Big Data and the Hadoop eco-system can disrupt (and already have disrupted) the ITSec market. Principally it's a cost/scale dynamic. SIEMs, Vulnerability Management, Configuration Management tools and others are essentially about reacting to events that have already happened. They also use structured metadata repositories to normalize, correlate and report. Look at any SIEM vendor's details and you will see this common theme: detect and fix something that has already happened.

With Hadoop and its various components and, in particular, the continuing path to maturity in machine learning products, this old style architecture is going to disappear. Sometime between now and 2020 the Enterprise Security Warehouse concept will be widely adopted. All data from all sources will be poured into a massive data lake (in real time, of course), with an HDFS/Kudu style repository for persistence and machine learning algorithms constantly monitoring what is happening and taking appropriate action as the threats happen, not after they happen. Gartner predicted this back in 2014 so it must be true..... http://www.gartner.com/newsroom/id/2778417
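To make "monitoring as the threats happen" a little more concrete, here is a minimal sketch of scoring events as they arrive rather than in a later batch report. It is deliberately not machine learning: a rolling mean and standard deviation stand in for whatever model a real deployment would use, and the class name, thresholds and the failed-login counts are all made up for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flag values that sit far outside a rolling baseline of recent values."""

    def __init__(self, window=20, threshold=3.0, min_history=5):
        self.history = deque(maxlen=window)  # recent values considered normal
        self.threshold = threshold           # how many standard deviations is "far"
        self.min_history = min_history       # do not judge until we have a baseline

    def observe(self, value):
        """Score one new value as it arrives; return True if it looks anomalous."""
        is_anomaly = False
        if len(self.history) >= self.min_history:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
            else:
                is_anomaly = value != mu
        if not is_anomaly:
            # Only normal values update the baseline, so an attack does not
            # quietly become the new normal.
            self.history.append(value)
        return is_anomaly


# Hypothetical stream: failed logins per minute for one account.
failed_logins_per_minute = [4, 5, 6, 5, 4, 6, 5, 90, 5]

detector = RollingAnomalyDetector()
flagged = [
    minute
    for minute, count in enumerate(failed_logins_per_minute)
    if detector.observe(count)
]
print("anomalous minutes:", flagged)
```

The point of the sketch is the shape of the loop: each event is scored on arrival against a continuously updated baseline, which is the opposite of the detect-after-the-fact SIEM pattern described above.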

In our discussions with clients we see a gradual realization, usually in the biggest clients first, that the old style Security Architectures have failed to keep up and that new architectures built on Big Data eco-systems, and machine learning in particular, offer the greatest potential for the next disruptor. Look at how Splunk has built a $600m business on just this premise but without the machine learning part.

For an alternative view of ML and Security read Matt Harrigan's (@mattharrigan) post at TechCrunch. http://techcrunch.com/2016/02/29/machine-learning-is-not-the-answer-to-better-network-security/

What do you think, is Machine learning already the big disruptor in Cyber Security?

Thursday, February 11, 2016

Machine Learning and Spark - get ready for the next big disruptor

There are lots of articles, blogs, reports and noise at the moment about Spark and machine learning - driven primarily by the rapid adoption of MLlib (Spark's general machine learning library) that is leading developers to use R and Python in particular for Advanced Analytics. For a great overview go to Infoworld - Why you should use Spark for Machine learning.

It's generally recognized that Spark has a long way to go before it is fully Enterprise ready. Almost every client I talk to follows a very familiar pattern - they want to try it for speed and scale, they try it and get disappointed, in particular by its scalability, and then decide to wait.

However, when Machine Learning comes into the discussion, Spark adoption is rapid, visible and highly successful. Customers are now recognizing the growing power of Spark MLlib, particularly with the growing number of algorithms it supports. ML has been around since 1979 and, more recently, the 'not very good' Mahout implementation has led to a lot of disappointed projects.

We don't have space here to go into the details of ML but I notice four key trends that will help customers see strong and rapid time to value in their machine learning projects :-

Customer 360 views are one of the most common Big Data use cases. Using ML, and Spark MLlib in particular, customers can leverage massive data volumes to make product recommendations to customers in real time using ads or other recommendation platforms. ML can take Recommendation and Monetization engines to a whole new level of predictability and relevance in real time.
Similarly in Mobile Networks, ML can be used to predict and manage Network Optimization - a critical cost element in Mobile Network profitability. Think about it like a river. Use ML to maximize the flow of water through the narrowest channels while maintaining speed and volume. Maximum benefit flows from predicting in near real time how the flows (Wireless traffic) should be managed.
With Geolocation services, massive data volumes and ML, Retailers can tailor specific offers to individuals. Imagine a scenario where you go into a Nordstrom's type store and the Store ML system picks up (from the Store's already installed Mobile App) that you have entered the store. As you wander round the various departments the ML system is rapidly choosing products you will be interested in (and presenting them on your mobile device) and, when you press the 'Get Help' button on your phone, the Sales assistant glides over, already armed with all your previous purchase history and a set of suggestions on what to buy. They open the conversation with 'Good Morning Mr. Bennett, let's take a look at that Emile Staub Cocotte that you looked at last time you were here'.....
Data Wrangling is still a big issue, and Machine Learning based companies like Trifacta are starting to get a lot of traction inside the Enterprise. Once large companies understand how ML apps can change their entire Big Data ecosystem, ML will become a mainstream technology during 2016.

Want to know more about Machine learning - take a look at this Infoworld slideshare
What do you think? Is Machine Learning the next big disruptor?

Tuesday, January 19, 2016

Would you like a Monopoly with your Data or just a cartel?

Two weeks ago I was fortunate to attend the CES show in Las Vegas. I experienced the joys of 3x Uber surge pricing on Thursday afternoon as well as some great discussions. Two things struck me from CES :-

Try as they might, no one can make the Internet of Things very interesting. I visited a home automation demo in the Qualcomm booth. The demo was good and you can see some of the content here in this promo from Qualcomm. But the problem is - the technology is not new, not very innovative and has been around quite a while. More interesting was the La Poste booth, where there was a platform for IoT traffic (Hub Numerique) that is being used by 20 innovative startups (from smart shoes to intelligent drinking glasses). Hub Numerique (French). Using the platform idea, La Poste takes away the complexity and frees startups to innovate.
Even though it is not very interesting, IoT is really going to take off when bandwidth makes the next step. 4G mobile networks are not powerful enough to handle the vast amounts of data that IoT will generate. 5G - which is more of an idea than an emerging standard - will provide mobile networks that are 100 times faster than 4G (download an HD movie in 1 second). So my thought is that IoT is coming, but maybe not quite as fast as the vendors would like it to. Tech Republic have a great post on 5G and its background, including the usual issue about spectrum for new wireless networks.

However, it will happen and, when it does, there will be an enormous explosion in data volumes. Already we create 5 Exabytes every two days; by 2019 it will be 5 Exabytes every day (including 1 Exabyte on mobile platforms alone). By way of comparison, from the beginning of civilization to 2003 the human race created..... 5 Exabytes. Now we create the same volume of data in 48 hours. And this growth is happening without massive IoT adoption. Cisco's annual forecast is informative for those of us who like lots of numbers. This year's will be out in February and it will be interesting to see what the growth will be to 2020.

So, with all this data, who will control it? This is a key question for markets, customers, solution providers and the whole Big Data ecosystem. In the European Union, for example, the online search market is completely dominated by Google with a 90% share, which has given birth to a new term: 'Data dominance'. Just as the old monopolies and cartels of Oil, Railways, Steel making and so on shaped the 19th and 20th Century economies, so - the theory goes - Data dominance is the key metric for commercial success in the 21st Century. It then becomes clear that even if you have a free market with limited regulation and access to as much information as you want, whoever controls that information and data controls everything. So Big Data becomes a platform for centralization and consolidation of market power.

What do you think - is Data Dominance important? Do you think we should be concerned? For a (Vodafone sponsored) survey on this question and some interesting insights into European perspectives on Big data take a look at this recently published report or the summary at Forbes.com

Monday, November 23, 2015

The Open Source Vendor game

The latest in a long line of struggling Data Warehouse companies announced its results on 5th November. Teradata had flat revenues, with flat to slightly declining revenues predicted for the year. Look at the Dell/EMC idea, HP's continuing machinations and flat revenues, IBM's problems (including the strange decision to buy .... The Weather Channel) and Oracle's well documented problems, and it's clear that the seismic shift in Enterprise software that started with AWS and continues with Hadoop, by way of open source, continues to grind down the Proprietary Enterprise software vendors.

As Mike Olson put it over 2 years ago - the Proprietary vendors now have no hope of competing. They don't have the resources, the knowledge or the skills to compete with globalized software development teams running Apache projects or similar.

So clients ask us - what does this all mean? Why is everything moving so fast? Why can't we get back on the old familiar proprietary software merry-go-round? All this uncertainty makes customers very nervous about what to do, and in some cases they therefore..... do nothing.

When a client asks this question we tell them three things :-

Look at what is happening in the critical Apache projects. Take Hadoop, or Spark, or Kudu or many others. Look at the trend in committers to see which projects have a solid and growing bench of committers.
Secondly - look at where these committers are doing their day jobs. Here is a good example of Spark. Scroll down the list and you will see it's solidly dominated by Databricks, with a good mix of UC Berkeley (where Spark was developed), Intel (watching their investment in Cloudera) and, of course, Cloudera.
Thirdly - if you are intent on looking at projects still in incubation - be careful. Hadoop Development tools seemed like a good idea but failed to get any sort of traction with the community.

The last piece of advice we give clients is to always look at what the market leaders are donating to Apache. Last week Cloudera announced they want to donate Impala and Kudu to Apache. Impala has been widely adopted and so Cloudera want the community to take over some of the development work. Kudu is a big, complex project so, now that its architecture has been broadly set, Cloudera want the bigger, more agile, more skilled worldwide community to shoulder some of the development load.

For many clients, used to easy decision making provided by Gartner and others, Open Source adoption in the enterprise, and in the data center in particular, is changing everything. Five years ago, and certainly 10 years ago, clients would insist on Roadmaps under NDA, long term support with access to source code if the vendor went out of business, long complex discussions about pricing and licensing, and entry into elite customer user groups. Not one of those types of discussion happens any more in Open Source and Big Data. There are no roadmaps (it's up to the community to direct the roadmap so join in!), source code is already freely available, and licensing is easy - it's free!

We believe clients need to adapt not just their technology options but also their vendor outlook to get the most out of the Open source world.

What do you think - are customers able to adapt to the changing market and technology landscape? Or are they stuck in a time warp - working in a style and with a market view that no longer exists?

And take a look at the Platinum sponsors of ASF - they are sponsoring for a reason of course.

Wednesday, October 28, 2015

Big Data has a Big Problem

One of the keys to success in Big Data is having the skills to make Big Data projects successful. While some of these skills are 'soft skills', the core requirement is to know what we are doing with complex, highly interrelated and fast moving technology. I recently read a Capgemini report on How Successful Companies Make Big Data Operational.

The report itself made for interesting reading but also highlighted one big problem -

Global organizational spending on Big Data exceeded $31 billion in 2013, and is predicted to reach $114 billion in 2018.

That is roughly 270% growth (close to a fourfold increase) in 5 years. Now - think about the critical issues this creates - not enough Data Scientists, Admins, Developers, NoSQL experts (plus many other critical skill sets) are being trained right now to come anywhere close to satisfying this demand. As long ago as 2011 McKinsey estimated that there would be a 1.5 million person hole in the US workforce alone of managers and analysts capable of using Big Data to make effective decisions.
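The growth implied by the two spending figures quoted above can be checked with a few lines of arithmetic. The only inputs are the $31 billion (2013) and $114 billion (2018) numbers from the report; the variable names are just for illustration.

```python
# Figures quoted from the report, in billions of US dollars.
spend_2013 = 31.0
spend_2018 = 114.0
years = 2018 - 2013

# How many times larger the 2018 figure is than the 2013 figure.
multiple = spend_2018 / spend_2013

# Total percentage growth over the whole period.
total_growth_pct = (multiple - 1) * 100

# Compound annual growth rate: the steady yearly rate that gets
# from the 2013 figure to the 2018 figure in five steps.
cagr_pct = (multiple ** (1 / years) - 1) * 100

print(f"{multiple:.2f}x overall, {total_growth_pct:.0f}% total growth, "
      f"{cagr_pct:.1f}% per year compounded")
```

That works out to about 268% total growth, or just under 30% a year compounded, which is the rate at which the supply of skilled people would have to grow simply to keep pace.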

So - as others have pointed out, if you are not Facebook/Google/LinkedIn etc, what can you do? One interesting set of results lies in some research conducted at Microsoft earlier this year. The full report is here and a great summary is at the Register. The study effectively shows that Big Data - the technology - in Microsoft is not failing but, equally, the results are not yet packaged in a way that helps the consumers of the information to use it.

This is the critical point - when we talk to clients we discuss the details of the technology we believe will meet their needs, but there is absolutely no substitute for experience. And experience inside the customer's own organization is critical. For this reason, whenever a client asks for help on Big Data we always tell them - get started, get started right now and start building your own institutional knowledge. That way you will not get trapped in a skills shortage dead-end.

What do you think? Is the skills shortage going to stifle the adoption of Big Data? For an interesting slant on this question take a look at this Venturebeat article from Cameron Sim at Crewspark.

Wednesday, October 14, 2015

Data wrangling - just a phase or here to stay?

There have been myriad articles, blogs, posts in the last 12-24 months about Data Wrangling. I don't intend to re-hash those here - if you want a summary of Data wrangling take a look at this article by Lukas Biewald http://www.computerworld.com/article/2902920/the-data-science-ecosystem-part-2-data-wrangling.html

We have worked with many clients where Data Wrangling has been the largest part of the Professional services engagement. Forget about all the desired outcomes of better/faster/new/amazing insights - that part only comes after we get the data into the cluster. Getting the data into a usable format for the Hadoop cluster and then ensuring it stays that way is usually a major piece of effort. Others have described it as 'Janitorial' work. That hides the high level of complexity in choosing how to map the raw data into formats that the client will want to use and that are therefore suitable for the Hadoop cluster to ingest.

So - given that Data Wrangling is a well known concept now - is that skill going to be required for some time or will tools emerge (tools again....) that will semi automate or automate completely the process?

There are some products out there like Trifacta and ClearStory Data, and then multiple open source tools like Tabula, DataWrangler (confused yet?) and R packages, and you can even use Python (with Pandas).

Many of these cross over into Dashboarding and Visualization - even Datameer could be considered a Data Wrangling tool in some ways.

The question is - will Data Wrangling as a required skill set, and, more importantly, as a major element of Big Data projects, disappear under an onslaught of products that can do it quicker and more cost effectively?

My view is - in your dreams. The old standby of the three V's shows why. Volumes are increasing, velocity is increasing and, most importantly for Data Wrangling - Variety is going to continue to accelerate. Think about IoT, Realtime streaming, Transactional data, unstructured data from legacy systems, new use cases emerging every day. For sure the standard Data Wrangling tasks of SQL, CSV, XML, JSON etc will be handled by products but with the ever growing number of data sources and as Big Data continues to redefine Enterprise computing I don't think Data Wrangling is going to disappear for a while yet. Customers can prepare themselves for continuing to spend a large amount of their Big Data budgets on simply getting data ready to be ingested.
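To show why even the 'standard' cases take effort, here is a small sketch of what wrangling two sources into one shape involves. It uses only the Python standard library; the sample data, field names and target schema are all hypothetical, and a real pipeline would have many more sources and many more edge cases.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical exports from two systems describing the same kind of event.
CSV_EXPORT = (
    "customer_id,amount,ts\n"
    " 42 ,19.99,2015-10-01T12:00:00\n"
    "43,,2015-10-01T12:05:00\n"
)
JSON_EVENTS = '[{"customerId": "44", "total": "7.50", "timestamp": 1443700000}]'


def from_csv(text):
    """Normalize CSV rows: trim whitespace, treat blank amounts as missing."""
    for row in csv.DictReader(io.StringIO(text)):
        amount = row["amount"].strip()
        # Assumption for this sketch: the CSV timestamps are already in UTC.
        ts = datetime.fromisoformat(row["ts"]).replace(tzinfo=timezone.utc)
        yield {
            "customer_id": int(row["customer_id"].strip()),
            "amount": float(amount) if amount else None,
            "timestamp": ts.isoformat(),
        }


def from_json(text):
    """Normalize JSON events: rename fields, convert epoch seconds to ISO 8601."""
    for event in json.loads(text):
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        yield {
            "customer_id": int(event["customerId"]),
            "amount": float(event["total"]),
            "timestamp": ts.isoformat(),
        }


# One list, one schema, ready to be written out for ingestion.
records = list(from_csv(CSV_EXPORT)) + list(from_json(JSON_EVENTS))
for record in records:
    print(record)
```

Two tiny sources already needed decisions about whitespace, missing values, field names, types and time zones. Multiply that by every new source and format, and it is clear why Variety keeps the wrangling bill high.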

Want to know more?

For a good list of free Data Wrangling tools visit the Varonis blog http://blog.varonis.com/free-data-wrangling-tools/

Ben Lorica gave a good summary in January of 2015. http://radar.oreilly.com/2015/01/lessons-from-next-generation-data-wrangling-tools.html

What do you think - is Data wrangling going to disappear and be replaced by products like Trifacta?