bigdata

With the evolution and growth of digital data, there’s been a trend toward ‘Big Data’. There’s no official definition, but there are some industry rules of thumb:

If you have to ask, you probably aren’t using it

But the non-whimsical and grounded definition is probably

data that is too complex/large to process with standard database management systems or traditional analytics.

There are no strict technical delineations for what counts as big data, but there are common characteristics:

  • Volume - the most obvious is size. Users are generating immense amounts of data with their software and hardware – from obvious things like Facebook content and Tweets, to subtle things like application settings, playlists in the cloud, exercise tracking.

    If you think about software written for a traditional company, it’s designed for access by hundreds of thousands of employees at once at most, and even then the average case is significantly less than that (in the middle of the night, during non-peak hours). Google is an extreme example at the other end: its search engine processes billions of search queries daily.

  • Velocity - the rate at which we are creating this data. A relevant example is electronic trading: more trades are executed each hour now than were executed in a whole day 10 years ago. Recording this data would require several hundred ticker tapes running simultaneously.

    If you’re not familiar with Foursquare, it’s a location-based social media site that lets users ‘check in’ on their mobile devices to places they’re currently at. The platform shares this information through other social media and allows pictures, comments, and tagging. If Foursquare were to disseminate this information to retailers, it would have to process hundreds of thousands of requests simultaneously:
    • you have people checking in
    • Foursquare processing the data
    • Foursquare distributing data outwards
    • retailers pulling more (contextual) data out (user profile, has this user come to my place before?).

    This would all have to be done in a timely manner – within the first few moments of the customer coming in so the retailer can prepare appropriately – otherwise the data becomes non-veracious: a point later explained.

  • Variety - the nature of digital data (pre-Facebook) used to be very narrowly defined: a number, some text, etc. While that hasn’t changed fundamentally, the high-level abstractions have. Consider the anatomy of a tweet: it can have a hashtag “#”, a user profile link “@xyz”, the text of the tweet, a web link to another site, and a timestamp. To even attempt to categorize and organize the tweet, it must be abstracted beyond a number or a piece of text, hence the variety of data grows rapidly.
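  • Veracity - strictly speaking, the trustworthiness of the data; the aspect that matters here is timeliness, because stale data is no longer an accurate picture of the present. If the Foursquare example above didn’t make this clear, consider: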
    • Imagine having a newspaper mailed to your house using snail-mail (a few days lag)
    • You are trying to keep up with current events (trading on stock picks, checking the weather)

    The timeliness and freshness of the data can change how useful the data is in a heartbeat.

Before any kind of Big Data developments can happen – there are a few sine qua non (requirements) that must be satisfied:

  • Storage Capacity – storage has to scale cheaper and faster. Solid state disks have probably satisfied retail demand for storage for the next few years. They’re still quite expensive for enterprises, which look at cost per unit of storage, but once production scales and solid state drives become as cheap as standard magnetic drives, expect demand for controllers and infrastructure built for them to pick up.
  • Memory – computations are orders of magnitude faster when data is pulled from memory rather than from disk. There’s a very interesting three-way balancing act between memory, processing speed, and software optimization. The slowest one will largely determine the overall result, but as with storage, memory sizes and speeds have improved hugely, and memory is one of the cheaper upgrades.
  • Processors – a lot of people know Moore’s famous doubling rule, but there’s a second, lesser-known Moore’s law (also called Rock’s law).

    It states that the cost of semiconductor development increases exponentially with time.

    It’s a known fact that we’re approaching the physical limits of semiconductors, and solutions are being sought. There’s a lot of research into quantum computing and biocomputing. This shouldn’t be a big risk factor in the mid-term.

  • Networks – bandwidth and speed. Everyone should have first-hand experience with internet speeds – slow YouTube videos or webpage loads. Because the internet is the backbone of Google’s whole business model, they’ll push it as hard as they can.

I talked about Variety earlier but the nature of data requires exploring a bit more in depth:

Structured Data

From the 1970s, the growth in digital data was structured: the data had structure, patterns, and schemas for handling it. Stock trading is an obvious and effective example. The end-of-day data could look like this: price, volume, high, low, open, close. This format didn’t change no matter what stock we were looking at; there wouldn’t be 2 highs or multiple opens.
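A minimal sketch of what “structured” means here, using the fields listed above. The `symbol` field, the class name, and the sample values are assumptions added for illustration; they are not from any real data feed.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EndOfDayQuote:
    """One end-of-day row; every stock uses exactly these columns."""
    symbol: str   # assumed identifier, not in the original field list
    price: float
    volume: int
    high: float
    low: float
    open: float
    close: float


def to_row(quote: EndOfDayQuote) -> tuple:
    """Flatten a quote into a fixed-width row, in schema order."""
    return tuple(getattr(quote, f.name) for f in fields(quote))


# Hypothetical sample values, for illustration only.
quotes = [
    EndOfDayQuote("AAA", 10.5, 120_000, 10.8, 9.9, 10.0, 10.5),
    EndOfDayQuote("BBB", 54.3, 48_500, 56.0, 54.1, 55.2, 54.3),
]
column_names = [f.name for f in fields(EndOfDayQuote)]
rows = [to_row(q) for q in quotes]
```

Every row has the same columns in the same order, which is exactly what makes this kind of data easy to put in a traditional table.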

Dealing with structured data is pretty standard now, and there are plenty of resources about relational database management systems (RDBMS), along with the growth of Oracle and IBM that resulted from this boom.

Unstructured Data

Like the tweet example above, unstructured data can’t be easily put into a structure. Yes, I did say that a tweet can only contain certain things: a hashtag, a profile link, a web link, text, and a timestamp. The problem is that when you create a traditional database, you have to set limits on how many #’s and how many @’s each tweet is allowed to contain. If you set the database up for the worst-case scenario (say 10 #’s and 10 @’s), then the vast majority of the database, perhaps almost 90%, would be wasted empty space. This is simply an unacceptable and ineffective way of managing the data.
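To make the wasted-space argument concrete, here is a rough sketch that pulls the variable-length parts out of a tweet and then pads them into a worst-case, fixed-width row. The limits of 10, the regular expressions, the function names, and the sample tweets are all assumptions made up for illustration; real tweet parsing is more involved.

```python
import re

MAX_TAGS = 10      # assumed worst-case hashtags per tweet
MAX_MENTIONS = 10  # assumed worst-case mentions per tweet


def parse_tweet(text: str) -> dict:
    """Pull the variable-length parts out of a tweet's text."""
    return {
        "hashtags": re.findall(r"#(\w+)", text),
        "mentions": re.findall(r"@(\w+)", text),
        "links": re.findall(r"https?://\S+", text),
    }


def fixed_width_row(parsed: dict) -> list:
    """Pad hashtags and mentions out to the worst-case column count."""
    tags = parsed["hashtags"][:MAX_TAGS]
    mentions = parsed["mentions"][:MAX_MENTIONS]
    return (tags + [None] * (MAX_TAGS - len(tags))
            + mentions + [None] * (MAX_MENTIONS - len(mentions)))


def wasted_fraction(rows: list) -> float:
    """Share of cells across all rows that hold no value."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


# Hypothetical tweets, for illustration only.
tweets = [
    "Reading about #bigdata with @xyz http://example.com/post",
    "Lunch time",
    "#hadoop #mapreduce notes from @abc and @xyz",
]
parsed = [parse_tweet(t) for t in tweets]
rows = [fixed_width_row(p) for p in parsed]
waste = wasted_fraction(rows)
```

In this made-up sample, only 6 of the 60 hashtag and mention cells hold a value. The exact share depends entirely on the tweets you feed in, but the shape of the problem is the same: a schema sized for the worst case is mostly empty in the common case.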

It’s appropriate to point out the dichotomy between human-generated data and machine-generated data. Machine-generated data generally grows faster than human-generated data because it doesn’t require human input.

Human Generated Data

Text messages and social media: this data has to be stored and interpreted with context. Imagine picking out texts from a conversation stream when they’re not even in the right order. Analyzing this data is still done more effectively by a person than by a machine. For example, sentiment analysis of Twitter streams is a lot more effective now than it was years ago, but a person needs only a cursory glance at a text conversation to know the underlying mood and feelings.

Machine Generated Data

Here’s an amazing picture from Splunk:

[Image: MachineDataSources]

Given the breadth of the data and data sources, you can see how painstaking it would be to interface with each of them. A holistic solution (though preached by solution vendors) should raise some eyebrows.

Given this natural fragmentation, expect an influx of both mature entities and start-ups providing niche solutions. At the pace technology moves, there will be extreme fragmentation in the interim. As time goes on, though, certain companies pull ahead, refine their offerings, and understand the market better, and consolidation occurs. Start-ups run out of runway and are either acquired or leave the space (the same goes for mature companies).

We’re still early, seeing a lot of different providers for cloud-based storage, security, analytics, marketing, distribution, just to name a few.

Technical Aside

This will be a bit more technical than the above, but if you’re curious about one of the new technologies driving Big Data, keep reading.

Hadoop the Yellow Elephant

The site is not that pretty considering how big Hadoop is, but let’s delve beyond first impressions. Formally, Hadoop is an open-source (free) technology that

allows companies to collect, manage, and analyze very large, typically unstructured data sets.

This seems to conflict with our earlier notions: a structured datastore for unstructured data? Hadoop uses a divide-and-conquer approach: it splits a single task into many (thousands of) smaller sub-tasks to be processed in parallel (this goes back to the enterprise model of maximizing memory, processing power, etc. per dollar). This lets you analyze and process huge amounts of data efficiently and cost-effectively. To scale out by one more unit, you simply add a server and install the software (though at that point you’re dividing the workload among n + 1 servers rather than n, so as n gets bigger, the gain from each additional server becomes negligible).
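The n + 1 versus n point can be checked with a little arithmetic. The sketch below assumes an idealized, perfectly even split of work with no coordination overhead, which real clusters never quite achieve; the function names and the workload of 1,000 units are made up for illustration.

```python
def work_per_server(total_work: float, servers: int) -> float:
    """Idealized share of work per server, assuming a perfectly even split."""
    if servers < 1:
        raise ValueError("need at least one server")
    return total_work / servers


def gain_from_one_more(total_work: float, servers: int) -> float:
    """Reduction in per-server work when going from n to n + 1 servers."""
    return (work_per_server(total_work, servers)
            - work_per_server(total_work, servers + 1))


# Hypothetical workload of 1,000 units.
for n in (1, 10, 100):
    print(n, gain_from_one_more(1000, n))
```

Under these assumptions the gain is total / (n × (n + 1)): going from 1 server to 2 halves the per-server load, while going from 100 to 101 barely moves it.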

I used the word datastore earlier instead of database; a database is one type of data store.

The architecture has two main components:

  1. Hadoop Distributed File System (HDFS) – stores and manages data
  2. MapReduce – a programming model, made famous by Google, for handling and processing large datasets

A picture worth 1111101000 words (1000 in binary…)

[Image: Screen Shot 2013-11-07 at 2.14.18 AM]

The design is typical of technologies today:

  • when things go wrong at one server, the system doesn’t come down
  • each server can do tasks in parallel (they won’t slow each other down)
  • scalable – refer to n + 1 vs n; still better than previous scaling methods, where everything was stuffed into one really big server
  • cost – each server maximizes units per dollar; this is commonly referred to as commodity hardware, i.e. cheap

HDFS

HDFS splits the data into chunks (blocks), then distributes several copies of each across different servers. This keeps the average access time from each server rather low and ensures that no single server failure will lose data.
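The idea of splitting and replicating can be sketched in a few lines. This is a toy model, not how HDFS is actually implemented: the round-robin placement, the 4-byte block size, the server names, and all function names are assumptions chosen to keep the example small.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list, replicas: int) -> dict:
    """Assign each block to `replicas` distinct servers, round-robin."""
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replicas")
    return {
        block_id: [servers[(block_id + r) % len(servers)]
                   for r in range(replicas)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement: dict, failed: str) -> bool:
    """True if every block still has a copy on some surviving server."""
    return all(any(server != failed for server in holders)
               for holders in placement.values())


# Hypothetical tiny cluster; real blocks are far larger than 4 bytes.
data = b"abcdefghijklmnopqrstuvwxyz"
servers = ["server-1", "server-2", "server-3", "server-4"]
blocks = split_into_blocks(data, block_size=4)
placement = place_replicas(len(blocks), servers, replicas=3)
survives = all(readable_after_failure(placement, s) for s in servers)
```

With three copies of every block, losing any one server still leaves every block readable. With only one copy per block, the same check fails as soon as the server holding that block goes down.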

MapReduce

The name says it all. Map sends tasks out to different servers, and Reduce collects their results and aggregates them. So when a query goes in and says “find xyz”, Map sends the work only to the servers that hold relevant data (perhaps only 3 of 4 servers, so the first server isn’t sent the request). Reduce takes the partial results and reports them in aggregate, so the data store appears to be one large system.
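Word counting is the usual way to illustrate the pattern. The sketch below runs everything in a single process, so it only shows the shape of the map, group, and reduce steps rather than real distribution across servers; the function names and the sample chunks are made up for illustration and are not Hadoop’s API.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list:
    """Map: emit a (word, 1) pair for every word in one server's chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: list) -> dict:
    """Group the emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict) -> dict:
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical chunks, standing in for data held on three servers.
chunks = ["big data big", "data store", "big elephant"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

Each call to `map_phase` only needs its own chunk, which is why that step can run on many servers at once. The caller only ever sees the final `counts`, as if the data lived in one place.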