Big Data is how big exactly?
I see that “Big Data” has become the new buzzword with a spike of hype around it. Everyone’s jumping on it. Companies are eager to promote their products as “Big Data,” just as they were eager to be associated with Web 2.0, Service-Oriented Architectures, and all the rest. Predictably, there’s basically zero agreement on what it means.
I’ve seen “Big Data” mentioned in the context of 1TB, which I think is rather moderate sized. But worse yet, I’ve seen 100GB labeled Big Data. I’ve even seen 5GB labeled Big Data. No links — I don’t want to draw attention to them.
I don’t know what Big Data is, but the stick-of-gum-sized flash drive in my pocket holds 16GB. It’s pretty Small. I mean, I forget it’s even there; it’s definitely not Big. I don’t know where I’d draw the line, but if it fits in a commodity server’s memory, which 100GB can do easily these days, it’s not Big Data. I don’t even think that 1TB is Big. Again, it’s only twice as big as commonly available servers can fit in RAM. In fact, most things in the MySQL world aren’t Big Data if they run on a single server, and I’m not sure I’d call a large sharded data store Big Data either. That’s just a bunch of Small Data sitting next to each other. I might make an exception to my no-MySQL-allowed rule of thumb for technologies like Infobright, which starts to hit its stride in the low-to-mid tens of terabytes of data. That’s entry-level Big in my opinion. This is completely arbitrary, but I’d say 100TB is Big Data in my mind, because it is a couple orders of magnitude bigger than commodity RAM capacities. Ask me a few years from now, and I’ll probably say a petabyte.
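The rule of thumb above can be written down as a rough sketch that measures data against commodity RAM instead of against a fixed number. Everything specific here is an assumption made for illustration: the 512GB RAM figure is inferred from the remark that 1TB is about twice what a common server holds, and the function name, labels, and cutoff ratios are made up.

```python
# Assumption: "commodity RAM" is about 512GB, inferred from the remark that
# 1TB is only twice what commonly available servers can fit in memory.
COMMODITY_RAM_GB = 512


def bigness(data_gb, ram_gb=COMMODITY_RAM_GB):
    """Classify a data set by its size relative to commodity server RAM.

    The labels and cutoff ratios are illustrative, not a standard.
    """
    ratio = data_gb / ram_gb
    if ratio <= 1:
        return "small"            # fits in memory on a single server
    if ratio < 20:
        return "moderate"         # a few multiples of RAM, e.g. 1TB
    if ratio < 100:
        return "entry-level big"  # roughly the tens-of-terabytes range
    return "big"                  # a couple orders of magnitude past RAM


if __name__ == "__main__":
    for label, size_gb in [("16GB", 16), ("100GB", 100), ("1TB", 1024),
                           ("30TB", 30 * 1024), ("100TB", 100 * 1024)]:
        print(label, "->", bigness(size_gb))
```

Because the answer depends on `ram_gb` and not on a hard-coded size, the same function gives a different verdict a few years from now simply by raising the RAM figure.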
The lack of definition of Big Data is characteristic of hyped buzzwords. It’s why nobody can refute anyone’s claims. I think a good guiding principle for marketing might be “don’t associate yourself with something that you can claim despite it being unverifiable.” This might go along with “don’t brag about things your competitors can also claim.”
Edit: oh my, I just realized that one of Percona’s webinars had “Big Data” in the title. Busted. It was Continuent who proposed the webinar and picked the title, but still… the pot calls the kettle black!



I’ve heard “big data” defined as any amount of data that becomes difficult to process using the usual tools at your disposal. This means that the size of big data goes up over time since computing processing power increases over time.
It also means that the definition of big data changes depending on the tools you use.
How big your “big data” is also depends on how you use your data. If you need to do full data set analysis over terabytes of data, then most open source database tools are going to have difficulties.
If you have a terabyte of data with a hot head and an infrequently accessed tail, then operational issues like ALTER TABLE and limited incremental backup support may concern you more than parallel performance, and you might not consider this a “big data” problem.
MySQL, on the whole, is not generally seen as a platform for big data because:
a) it is a row store
b) it lacks these typical features that support big data:
   1) hash, bitmap, partial and functional indexes
   2) hash joins
   3) materialized views
   4) parallel processing
   5) RLE compression
   6) complete partitioning support
   7) good ALTER TABLE performance
My goal is to make tools that make working with big data on MySQL easier, particularly in combination with other tools.
Infobright has hash joins and it compresses data very well. I only have 128GB of fast storage, so I rarely test any data sets over 100GB.
Shard-Query can be used to scale-out horizontally partitioned data sets in parallel. This is how most database servers tackle working with big data. Usually a column store is used in combination with MPP scale-out. I have tested Infobright + Shard-Query, and they work well together.
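The general scale-out idea can be sketched as a scatter-gather aggregate: each shard computes a partial result in parallel, and the partials are then combined. This is only an illustration of the pattern, not Shard-Query's actual interface; the in-memory `SHARDS` lists and both function names are hypothetical stand-ins for real database shards.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for one horizontal partition
# of the same table, living on its own server.
SHARDS = [
    [12.0, 7.5, 3.0],
    [20.0, 1.5],
    [9.0, 9.0, 9.0, 9.0],
]


def partial_aggregate(rows):
    """Run on each shard: return (sum, count) so averages combine correctly."""
    return sum(rows), len(rows)


def parallel_average(shards):
    """Scatter the partial aggregate to every shard, then gather and combine."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    print(parallel_average(SHARDS))
```

Note that the shards return sums and counts, not averages. Averaging the per-shard averages would give the wrong answer whenever shards hold different numbers of rows.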
Flexviews can be used to add true materialized views to MySQL. These can be used to create data structures which emulate functional/partial indexes. You could use this with the CRC32 function, for example, to simulate hash indexes.
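The CRC32 trick itself can be shown independently of Flexviews. The sketch below uses Python's built-in sqlite3 and zlib modules purely so that it runs anywhere; it is not Flexviews or MySQL code, and the `pages` table, its columns, and the helper names are hypothetical. The idea is to keep a CRC32 of a long string in a small indexed integer column and search on that.

```python
import sqlite3
import zlib


def crc(text):
    """CRC32 of a string as an unsigned 32-bit integer."""
    return zlib.crc32(text.encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, url_crc INTEGER)")
# Index the compact integer checksum instead of the long string column.
conn.execute("CREATE INDEX idx_url_crc ON pages (url_crc)")

urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
conn.executemany(
    "INSERT INTO pages (url, url_crc) VALUES (?, ?)",
    [(u, crc(u)) for u in urls],
)


def find(url):
    """Look up by checksum, then recheck the string because CRC32 can collide."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE url_crc = ? AND url = ?",
        (crc(url), url),
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    print(find("http://example.com/b"))
```

The `AND url = ?` recheck matters: two different strings can share a CRC32 value, so the checksum narrows the search but cannot be trusted as the final match.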
FlexCDC can be used to feed data to Fastbit or other external indexing engines. Fastbit can be used to create compressed bitmap indexes, or Sphinx can be used for inverted indexes. Fastbit has been used to analyze scientific data sets of vast size because its WAH compressed bitmaps can be compared, logically, without decompression.
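To see why operating on compressed bitmaps is attractive, here is a simplified sketch that ANDs two run-length encoded bitmaps without ever expanding them into individual bits. This is plain run-length encoding, not Fastbit's actual WAH format, and the `rle_and` function and its `(bit, run_length)` representation are assumptions chosen for illustration.

```python
def rle_and(a, b):
    """AND two run-length encoded bitmaps without decompressing them.

    Each bitmap is a list of (bit, run_length) pairs with run_length > 0,
    and both bitmaps are assumed to cover the same total number of bits.
    """
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        # Consume the overlap of the two current runs in one step.
        step = min(rem_a, rem_b)
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)  # extend the previous run
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            if i < len(a):
                rem_a = a[i][1]
        if rem_b == 0:
            j += 1
            if j < len(b):
                rem_b = b[j][1]
    return out


if __name__ == "__main__":
    # 11111000 AND 00111111, both stored as runs
    print(rle_and([(1, 5), (0, 3)], [(0, 2), (1, 6)]))
```

The work done is proportional to the number of runs, not the number of bits, which is the property that makes this kind of index practical on very large, sparse data sets.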
TokuDB is apparently addressing the issues of ALTER TABLE performance.
Justin Swanhart
1 Apr 11 at 12:19 am
Big Data should not consider the total size of the database, but rather the working set. I can have a database with a working set of 1GB and have 10TB of archives which are never used and are only there for some government regulation. That’s not big data!
Daniël van Eeden
1 Apr 11 at 5:24 am
I think “Big Data” just comes from our own perspective. Where I work, they like cutting data if the database is perceived as slow... and then there’s that government regulation.
eka
1 Apr 11 at 8:08 am
You are raising some interesting points.
I think in the context of this post, the following panel on Big Data at one of GigaOm’s recent events could be helpful – bit.ly/f1YmY6 (it’s a link to video recording of the panel).
They also discuss how big data as a term is poorly defined and means a lot of things to different people in different contexts.
@somic
1 Apr 11 at 10:44 am
Doh, messed up the link.
bit.ly/f1YmY6
@somic
1 Apr 11 at 10:44 am
Whatever you do, do not define big data to be an arbitrary number (5GB, 100GB, 1TB, 100TB, etc.). If it’s to have any meaning that will be useful in 5 or 10 years, it has to be defined in terms of other practical, measurable things, like the amount of RAM in common servers, network bandwidth, disk I/O performance, etc.
William
1 Apr 11 at 10:56 am
William, I agree with you in part. But I don’t even think that Big Data as a meme will last that long. It’ll get over-used and people will move on to the next thing. Who talks about the Semantic Web anymore, really?
Xaprb
1 Apr 11 at 12:20 pm
Big data = lot of unused information…
Roy
5 Apr 11 at 6:58 am