David Bennett's Big Data Blog - Excelerate Systems: Data wrangling

There have been myriad articles, blogs and posts about Data Wrangling in the last 12-24 months. I don't intend to rehash those here - if you want a summary of Data Wrangling, take a look at this article by Lukas Biewald: http://www.computerworld.com/article/2902920/the-data-science-ecosystem-part-2-data-wrangling.html

We have worked with many clients where Data Wrangling has been the largest part of the Professional Services engagement. Forget about all the desired outcomes of better/faster/new/amazing insights - that part only comes after we get the data into the cluster. Getting the data into a usable format for the Hadoop cluster, and then ensuring it stays that way, is usually a major piece of effort. Others have described it as 'Janitorial' work. That label hides the high level of complexity in choosing how to map the raw data into formats that the client will want to use and that are suitable for the Hadoop cluster to ingest.

So - given that Data Wrangling is a well-known concept now - is that skill going to be required for some time, or will tools emerge (tools again....) that will partly or completely automate the process?

There are some products out there like Trifacta and ClearStory Data, and multiple open source tools like Tabula, DataWrangler (confused yet?) and R packages. You can even use Python (with Pandas).

Many of these cross over into Dashboarding and Visualization - even Datameer could be considered a Data Wrangling tool in some ways.

The question is - will Data Wrangling as a required skill set and, more importantly, as a major element of Big Data projects disappear under an onslaught of products that can do it quicker and more cost-effectively?

My view is - in your dreams. The old standby of the three V's shows why. Volumes are increasing, velocity is increasing and, most importantly for Data Wrangling, Variety is going to continue to accelerate. Think about IoT, real-time streaming, transactional data, unstructured data from legacy systems and new use cases emerging every day. For sure, the standard Data Wrangling tasks involving SQL, CSV, XML, JSON etc. will be handled by products. But with the ever-growing number of data sources, and as Big Data continues to redefine Enterprise computing, I don't think Data Wrangling is going to disappear for a while yet. Customers should prepare themselves to keep spending a large share of their Big Data budgets on simply getting data ready to be ingested.
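
To make the Variety point a little more concrete, here is a minimal sketch in plain Python (standard library only, not Pandas or any of the products mentioned above). It assumes two made-up sources that describe the same kind of record in different shapes. The sample data, the field names (customer_id, custId, total, ts) and the target schema are all hypothetical, invented purely for illustration.

```python
import csv
import io
import json

# Hypothetical sample data: the same kind of record arriving in two formats,
# with different field names and different date conventions.
CSV_SOURCE = "customer_id,amount,date\n42,19.99,2015-10-01\n7,5.00,2015-10-02\n"
JSON_SOURCE = '[{"custId": "42", "total": "3.50", "ts": "2015-10-03T08:15:00"}]'


def from_csv(text):
    """Map CSV rows onto an assumed common schema: id, amount, day."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "id": int(row["customer_id"]),
            "amount": float(row["amount"]),
            "day": row["date"],
        }


def from_json(text):
    """Map JSON records onto the same schema, trimming the timestamp to a day."""
    for rec in json.loads(text):
        yield {
            "id": int(rec["custId"]),
            "amount": float(rec["total"]),
            "day": rec["ts"][:10],
        }


# Once both sources share one shape, downstream work is straightforward.
records = list(from_csv(CSV_SOURCE)) + list(from_json(JSON_SOURCE))

totals = {}
for r in records:
    totals[r["id"]] = totals.get(r["id"], 0.0) + r["amount"]
```

Every new source needs its own small mapping function like these, and somebody still has to decide what the common schema should be. That decision is the part that is hard to hand over to a tool.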

Want to know more?

For a good list of free Data Wrangling tools, visit the Varonis blog: http://blog.varonis.com/free-data-wrangling-tools/

Ben Lorica gave a good summary in January 2015: http://radar.oreilly.com/2015/01/lessons-from-next-generation-data-wrangling-tools.html

What do you think - is Data Wrangling going to disappear and be replaced by products like Trifacta?

Wednesday, October 14, 2015

Data wrangling - just a phase or here to stay?
