Big Data: What is it and why all the fuss?

Big Data seems to be the latest trending buzzword.

The term has been around for a while, but the largest corporations are now promoting Big Data products and services very strongly, so something big is on the horizon. Right now it still looks like a load of hype but, scratching beneath the surface, it seems to me that it has the potential to affect every person in society, and there’s no getting away from it. What is all the fuss about?

Big Data isn’t really just about ‘big’. Depending on who you ask, the mnemonic “V3” or “V4” summarises it well.

  • Volume – the quantity of data, and it’s big.
  • Velocity – the rate of arrival/capture of data, and that’s big too.
  • Variety – the sheer variety of data and formats to be used.
  • Veracity – the accuracy, truth or value of that data.

Volume and velocity are driving the technical aspects: relational databases are giving way to NoSQL (“not only SQL”) stores, and the relational data skills out there are no longer enough on their own.

Variety and veracity are the real challenges. Device instrumentation, social feeds, government, location, financial, voice, image and video data, plus everything captured by any (and I mean ANY) device that we use or encounter, or that monitors us and our gadgets, is being stored because some day it might be useful to a data analyst working for a start-up, a corporate or our government.
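To make the relational versus NoSQL contrast and the variety problem a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the record shapes, field names and device ids are made up, and the list of JSON strings merely stands in for a document store; it is not the API of any particular NoSQL product.

```python
import json
import sqlite3

# Relational: the schema is fixed up front, so every row has the same columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, temperature REAL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("sensor-1", 21.5))

# A record with an extra field does not fit the table without a schema change.
try:
    conn.execute(
        "INSERT INTO readings (device_id, temperature, gps) VALUES (?, ?, ?)",
        ("phone-7", 19.0, "51.5,-0.1"),
    )
    relational_accepted = True
except sqlite3.OperationalError:
    relational_accepted = False

# Document style: each record is a self-describing JSON document, so records
# of different shapes (hypothetical examples here) can sit side by side.
documents = [
    json.dumps({"device_id": "sensor-1", "temperature": 21.5}),
    json.dumps({"device_id": "phone-7", "temperature": 19.0, "gps": [51.5, -0.1]}),
    json.dumps({"user": "@example", "text": "hello", "lang": "en"}),
]

# Collect every field name that appears in any document.
fields_seen = sorted({key for doc in documents for key in json.loads(doc)})

print("Relational insert with extra column accepted:", relational_accepted)
print("Fields across all documents:", fields_seen)
```

The trade-off is that the flexible store accepts anything, so checking what the data actually means (the veracity question) moves from the database schema to whoever analyses it later.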

If you don’t know anything about Big Data, this session will provide a basic introduction to what’s happening out there, right now.

Here is the video of the webinar I presented on 28 August 2013. I mentioned some books on data analysis and tools during the talk. They aren’t listed in the video, so they are listed below.

  • Data Analysis with Open Source Tools, Philipp K. Janert – the title says it all – Python libraries (numpy, scipy, matplotlib, simpy, pycluster), R, GNU Scientific Library (GSL), Sage, C Clustering library, Berkeley DB, SQLite
  • Python for Data Analysis, Wes McKinney – Python libraries: numpy, pandas, matplotlib, IPython, SciPy
  • Interactive Data Visualization, Scott Murray – an introduction to using the D3 (Data-Driven Documents) open source JavaScript library to present data in a myriad of interesting ways
  • Visualize This, Nathan Yau – less about tools, more about how to “tell your story with data presented in creative, visual ways”