Tuesday, November 20, 2012

"Big Data" - What it is and Why it Matters

Thanks to Google, Facebook, LinkedIn, Amazon and many others, we have entered the era of "Big Data."

 

What is "Big Data?"

  • It is Google slurping-up the entire internet each day so we can ask it things like "Best pizza in New Haven" and get an immediate and credible answer.
  • It is Facebook with their one billion users who generate 500TB (yes, that is 500 terabytes) of data per day.
"Big Data" is managing data on a scale that is, frankly, hard for many of us to imagine.
Today, managing massive volumes of data is a reality because software can spread the data across many low-cost commodity servers. When I say many, I am talking about thousands!
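To make "spreading the data across many servers" a little more concrete, here is a minimal sketch of one simple way to do it: hash each record's key and use the result to pick a server. This is plain hash partitioning, chosen for illustration only. It is not the placement scheme of any particular product, and the names `server_for` and `distribute` and the `record-N` keys are made up for this example.

```python
import hashlib

def server_for(key, num_servers):
    """Pick a server index for a key; the same key always maps to the same server."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers

def distribute(keys, num_servers):
    """Assign every key to one of num_servers buckets (hypothetical servers)."""
    placement = {n: [] for n in range(num_servers)}
    for key in keys:
        placement[server_for(key, num_servers)].append(key)
    return placement

# 1,000 made-up records spread over 4 made-up servers
placement = distribute([f"record-{i}" for i in range(1000)], 4)
for server, keys in placement.items():
    print(server, len(keys))
```

Because each server holds only its own slice, each one can work on its slice at the same time as the others, which is the whole point of spreading the data out.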

At its core, "Big Data" involves the same set of actions that can be applied to data of any type and size. They are:
  • Collect the data 
  • Store the data 
  • Access and consume the data subset that we desire
  • Analyze the data
You do not need to be a “data geek” to wrap your head around these four actions. Most of us do these things on a personal level as part of our daily work. The demands and variations of how people and organizations want and need to consume data make doing these four actions a technical challenge. Now add to this data volumes on the order of hundreds of terabytes a day and the problem becomes exponentially daunting.

Additionally, think about all the input sources we have for data these days: Excel spreadsheets, Word documents and PowerPoints sitting out on network shares. Email and IM, Tweets, social media... the list goes on and on. We all work with this type of data on a daily basis. We also use internal software systems for ERP, CRM, accounting, etc. All of us, from the cubes to the glass corner offices, struggle when it comes to collating, managing and using this data to our best advantage.

 

Two Essential Components to "Big Data"

There are two essential components to "Big Data" systems.
  1. There is the "Big Data File System." We need a means of slurping in all of this data and spreading it across many servers so we can process it in parallel. Think Moore's Law in reverse.
  2. We need a means of querying it and getting back the subset we want to process. Enter into the equation a process Google terms MapReduce.
In simplest terms, MapReduce is a framework for processing massive amounts of data in parallel in the shortest time possible. For example, it is possible today to process one petabyte (1,000 terabytes) of data in a few hours. And when I say process, I mean make it ready for second-stage consumption.
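For the curious, here is a minimal single-machine sketch of the MapReduce idea using the classic word-count example. It only imitates the three steps (map, group by key, reduce) in ordinary Python. A real framework would run the map and reduce steps on many servers at once, and the function names here are my own, not part of any MapReduce API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = []
    for doc in documents:  # each document could be mapped on a different server
        pairs.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["best pizza in New Haven", "pizza pizza"])
print(counts)
```

The map step never needs to see more than one document, and the reduce step never needs to see more than one word. That independence is what lets the work be split across thousands of machines.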

 

Answering Your Questions (Back to Pizza)

You are able to get a quality answer to your pizza question so quickly because Google has mapped all the web pages of the planet and MapReduced this data into indexes that can answer such questions.
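As a toy illustration of what "MapReduced into indexes" can mean, the sketch below builds a tiny inverted index (word to list of pages) and answers a query by intersecting the page lists. This is a simplification I am assuming for illustration. It says nothing about how Google actually builds or ranks its indexes, and the page addresses and function names are invented.

```python
from collections import defaultdict

def map_page(url, text):
    """Map: emit a (word, url) pair for each distinct word on a page."""
    return [(word, url) for word in {w.lower() for w in text.split()}]

def build_index(pages):
    """Reduce: collect, for each word, the sorted list of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word, page_url in map_page(url, text):
            index[word].add(page_url)
    return {word: sorted(urls) for word, urls in index.items()}

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], []))
    for word in words[1:]:
        results &= set(index.get(word, []))
    return sorted(results)

# Hypothetical pages, for illustration only
pages = {
    "a.example/pizza": "Best pizza in New Haven",
    "b.example/tacos": "Best tacos in New Haven",
}
index = build_index(pages)
print(search(index, "pizza New Haven"))
```

The expensive work (reading every page) happens once, ahead of time. Answering a question afterward is just a few lookups, which is why the answer comes back so fast.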

What if your business could do the same?

Well it can.

Today there are a number of "Big Data" solutions on the market. Many are open source. The biggest one in the mainstream today is called Hadoop. It is essentially an open source system modeled after the Google File System and Google MapReduce. Hadoop was created by a guy named Doug Cutting. After reading the Google white papers on their file system and MapReduce, Doug set off to replicate Google's work in open source. Sigh... if we could all be this brilliant.

And there are "Big Data" services from giants like Amazon (EC2) and Google (BigQuery).

 

Mind Blowing Data Fact

90% of the electronic data in existence today did NOT exist two years ago.

You can translate this little tidbit to mean "Big Data" is not some passing fad. It is the wave of the future. "Big Data" has finally made its way into the mainstream and is today solving massive-scale data problems that SQL and relational database management systems just do not have the capacity to deal with. A special note to my friends and colleagues who live in this world: I am not asking you to abandon the RDBMS ship. But y'all need to come over to this side of the water and dip your toe in to see how it feels.

"Big Data" - What is Possible (Dreaming in Data)

Imagine slurping in all your corporate data including crawlers that search social media sites collecting what people are saying about you. Then imagine being able to query that data to ask natural language questions such as, "What are the customers saying about our product and services relative to the competition?" ...or... "What are the aggregate pain points for “x” so we can get into that market segment or solve the problems the marketplace faces?"

Think how much quicker you can respond to market forces and increase your bottom line. And, better serve your customers and your shareholders ;-).

“Big Data” is more than just feeding the corporate bottom line. It is more than just data collection and evaluation on an Orwellian scale. “Big Data” is going to allow the next Facebook to rise. Imagine a world education portal the size of Facebook (or potentially larger) where students can come to learn from anyplace at any time. Where teachers can congregate and share best practices and do what they love most: teach. Imagine a portal for artists, writers, and musicians that has galleries where they can share their work with others in the world. A site loaded with virtual museums and concert halls that integrate 3D technologies.

This is what I mean when I say “dreaming in data.”

You and "Big Data"

If you are a business of any size and "Big Data" is not on your radar, it should be. Today's "Big Data" solutions can work just as well on medium and small data.

If you work with data, and most of you out there work with data, you need to be asking your leadership when they are going to invest in "Big Data" solutions.

If you are a development house still trying to fit every problem you have into a relational database, and doing it using object-oriented data, you should be asking yourself why you aren't also using a NoSQL product like MongoDB as part of your core technology strategy for backend development.

So as not to sell you a bag of Fool's Gold...

I trust I am mostly preaching to the choir on this next point.

Collecting, managing and analyzing data is hard. I do not want you to think that you are going to go out to Apache, download Hadoop, set up a cluster of twenty rack servers in your data center, slurp in all your data, write all the MapReduce code and have answers to your most pressing questions by the end of the week.

Like any implementation this is going to take you some time and some coin. Hadoop will not cost you a dime. The companies that can help you implement a "Big Data" solution will cost you many dimes. You will also need to involve key staff members that understand your data to be a part of this journey. Finally, you will need an experienced architect that can strategically lead you into this new frontier that is rapidly changing how we do business on this planet.

Who is Into "Big Data" in Big Ways

Here is a partial list of who is involved in implementing and using "Big Data" (like Hadoop and MongoDB):

  • Craigslist
  • SAP
  • Disney
  • CNN
  • Amazon
  • eBay
  • H-P
  • Apple
  • Microsoft
  • etc...etc..etc...

What are you waiting for… to be left in the dust?

Tuesday, November 13, 2012

The Next LinkedIn/Facebook...

I don’t know what the next LinkedIn or Facebook is going to be, but I do know what one of the next LinkedIns or Facebooks is going to be.
 
It will be just as big, and could actually end up being bigger than Facebook in terms of end users.
 
It will cross borders and countries, and it will change the face of the planet as we know it today in a positive way.
 
It is going to be an education destination where anyone can come to learn anything at any time.
 
Education without bureaucracy, education without borders. It will be both synchronous and asynchronous. It can be instructor led, it can be self-paced. It can be for profit, and it can be for free.
 
It can and will ultimately be all of these things and much more than we can even imagine today.