Summary: Big Data Is Dead by Jordan Tigani
For over a decade, the inability to gain value from data has been blamed on its size. But data size wasn’t the problem.
Original article: Big Data Is Dead by Jordan Tigani
Motivation: data size isn’t the problem. Most data is stored but rarely queried, and almost never queried all at once. While vendors push their ability to scale (“Big Data is coming! You need to buy what I’m selling!”), practitioners struggle to solve simpler, real-world “small-data” problems.
Most people don’t have that much data
One of MotherDuck’s investors surveyed their portfolio companies about their data sizes and found that:
100 GB was the right order of magnitude for a data warehouse size
the largest B2B companies had 1 TB of data, and the largest B2C companies had 10 TB of data in their cloud data warehouses.
To understand why large data sizes are rare, it is helpful to think about where data comes from. The most critical data is usually small: new orders, customer records, or new leads. Large datasets often come from auxiliary data sources that are relevant only for a short time (high-velocity data such as logs, traces, metrics, text, …) and don’t need to be kept for very long.
Even if you have lots of data, most of it is rarely queried: storage needs grow much faster than compute needs
Thanks to scalable and reasonably fast object storage such as S3 and GCS, you can store a lot of data and decide later how to query it (there are fewer constraints on a database design). While new data is generated all the time, most analysis is done only on the most recent data.
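A minimal sketch of that pattern, with DuckDB used as one convenient example of an engine that can read Parquet straight out of object storage. The bucket name, file layout, and column names below are hypothetical; years of events may sit in the bucket, but the query only touches the most recent days:

```python
import duckdb

con = duckdb.connect()
# httpfs lets DuckDB read Parquet directly from S3/GCS-style object storage.
# (Reading a private bucket would additionally need credentials configured.)
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Hypothetical events dataset, laid out as date-partitioned Parquet files.
# Only the last 7 days are read; older partitions stay untouched in storage.
rows = con.execute("""
    SELECT event_type, count(*) AS n
    FROM read_parquet('s3://example-bucket/events/dt=*/*.parquet',
                      hive_partitioning = true)
    WHERE CAST(dt AS DATE) >= current_date - INTERVAL 7 DAY
    GROUP BY event_type
    ORDER BY n DESC
""").fetchall()
print(rows)
```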
“This bias towards storage size over compute size has a real impact in system architecture. It means that if you use scalable object stores, you might be able to use far less compute than you had anticipated. You might not even need to use distributed processing at all.”
Scanning all the data for every analytical query is wasteful, and many techniques these days reduce how much data needs to be read and processed (see the sketch after this list):
building aggregations containing important answers for reporting (no need to keep all granular data),
column projection to read only a subset of fields,
partition pruning to read only a narrow date range,
exploiting locality in the data via clustering or automatic micro-partitioning,
doing less IO at query time by computing over compressed data, projection, and predicate pushdown.
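A hedged sketch of a few of these techniques, again using DuckDB over Parquet; file names, columns, and values are hypothetical. Only the projected columns are read, the date predicate is checked against Parquet row-group statistics so non-matching data is skipped, and a rollup table is persisted so reporting never has to touch the granular rows again:

```python
import duckdb

con = duckdb.connect()

# Hypothetical fact table stored as Parquet files with an `order_date` column.
# Column projection: only order_date and amount are read from the files.
# Predicate pushdown: the date filter is evaluated against row-group min/max
# statistics, so data outside March 2024 is skipped without being read.
march = con.execute("""
    SELECT order_date, sum(amount) AS revenue
    FROM read_parquet('orders/*.parquet')
    WHERE order_date BETWEEN DATE '2024-03-01' AND DATE '2024-03-31'
    GROUP BY order_date
    ORDER BY order_date
""").fetchall()
print(march[:5])

# Aggregation table: keep the answers reporting needs, not every granular row.
con.execute("""
    CREATE OR REPLACE TABLE daily_revenue AS
    SELECT order_date, sum(amount) AS revenue
    FROM read_parquet('orders/*.parquet')
    GROUP BY order_date
""")
```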
Economic pressure incentivizes people to reduce the amount of data they process. This holds true even without a pay-per-byte-scanned pricing model (such as BigQuery’s): optimized queries can also run on a smaller Snowflake instance.
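A back-of-the-envelope illustration of that pressure; all numbers below are made-up assumptions, not figures from the article. If pruning and projection cut the scanned data from the full table to a small slice, the per-query cost under pay-per-byte pricing shrinks in roughly the same proportion, and under instance-based pricing the same reduction lets the workload fit on a smaller warehouse:

```python
# All numbers are illustrative assumptions, not real prices or table sizes.
table_tb = 10.0              # full table size in TB
scanned_fraction = 1 / 50    # slice left after partition pruning + projection
price_per_tb_scanned = 5.0   # assumed pay-per-byte rate, USD per TB scanned

full_scan_cost = table_tb * price_per_tb_scanned
pruned_cost = table_tb * scanned_fraction * price_per_tb_scanned

print(f"full scan:   ${full_scan_cost:.2f} per query")
print(f"pruned scan: ${pruned_cost:.2f} per query")
# Under instance-based pricing, the same ~50x reduction in bytes read shows up
# as being able to run the workload on a much smaller (cheaper) instance.
```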
Is Big Data Dead?
In 2004, when the MapReduce paper was published, scaling up was expensive. Today, you can spin up a relatively affordable cloud VM instance with enormous capacity within minutes, and cost scales linearly with compute power.
So, why do people end up with big data even today? When you don’t know what’s worth keeping, you hoard everything in data swamps. This leads to increased storage costs, maintenance burden (“bit rot”), and potential legal implications for storing data you shouldn’t. Deciding what can be deleted requires hard work and prioritization.
“Big Data is real, but most people may not need to worry about it.”
Some questions that you can ask to figure out if you’re a “Big Data One-Percenter”:
Are you really generating a huge amount of data? If so:
do you need to process it all at once?
is your data really too big to fit on one machine?
are you hoarding data instead of deciding how to model it and what to keep? Can something be stored in an aggregated form instead (see the sketch after this list)?
If not, use simpler tools that help you handle data at the size you actually have, not the size people try to scare you into thinking you might have someday.
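A minimal sketch of that “single machine plus aggregation” approach, with DuckDB as one example of such a simpler tool; the database file, dataset paths, and columns are hypothetical:

```python
import duckdb

# A single local database file; no cluster, no distributed runtime.
con = duckdb.connect("analytics.duckdb")

# Hypothetical raw click logs: large, high-velocity, rarely queried in full.
# Keep the rollup that reporting actually needs; the raw files can then be
# archived to cheap object storage or deleted per retention policy.
con.execute("""
    CREATE OR REPLACE TABLE clicks_daily AS
    SELECT CAST(ts AS DATE) AS click_day, page, count(*) AS views
    FROM read_parquet('raw_clicks/*.parquet')
    GROUP BY click_day, page
""")

# A typical reporting query now touches megabytes of rollup, not raw terabytes.
top_pages = con.execute("""
    SELECT page, sum(views) AS views
    FROM clicks_daily
    WHERE click_day >= current_date - INTERVAL 30 DAY
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""").fetchall()
print(top_pages)
```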
Core message & CTA
Stop worrying about data size and focus on how to use it.