What to Consider When Migrating a Data Warehouse to the Cloud

Not so long ago, establishing an enterprise data warehouse was a project that would take months or even years. These days, with cloud computing, you can simply sign up for a SaaS or PaaS offering from one of the cloud vendors, and shortly after you can start building your schemas and tables. In this article, I will discuss the key features to consider when migrating a data warehouse to the cloud and why it is a smart choice to pick one that separates storage from compute.

What does it mean to separate storage and compute?

From a single server to a data warehouse cluster

It all boils down to the difference between scale-out & scale-in and scale-up & scale-down. In older database and data warehouse solutions, storage and compute reside within a single (often large and powerful) server instance. This may work well until that single instance reaches its maximum compute or storage capacity. In such cases, to accommodate the increased workloads, you could scale up, i.e., exchange the CPU, RAM, or storage disks for ones with a larger capacity — with cloud services, this means switching to a larger instance. Analogously, if your single instance is too large, you could exchange it for a smaller one to save money, i.e., scale down. This process has two main disadvantages:

  • the scale-up & scale-down process is time-consuming and often means that your data warehouse becomes unavailable for some time
  • there is a limit to how much you can scale up due to the natural limitations of a single server instance.

MPP: Massively Parallel Processing

To mitigate this problem, data warehouse vendors started using the MPP (Massively Parallel Processing) paradigm, which allows your data warehouse to use an entire cluster of instances at once. This way, if you start reaching the maximum capacity limits, you can simply add another server instance with more storage and compute capacity to the cluster (i.e., scale out).

MPP solves the initial scalability problem to a large extent. However, it also means that your storage and compute capacities are tightly coupled across the nodes in your cluster. If you want to shut down some compute capacity at night (i.e., scale in) because almost nobody queries data during that time, you can’t, as terminating an instance would mean either losing your data or creating backups and restoring from them in the morning. If your architecture doesn’t allow you to easily scale in idle compute resources, you simply throw your money away.

The paradigm discussed here is often called Shared-Nothing architecture. This is how Wikipedia [1] defines it:

A shared-nothing architecture (SN) is a distributed-computing architecture in which each update request is satisfied by a single node (processor/memory/storage unit). The intent is to eliminate contention among nodes.

How can we make MPP with SN architecture better?

We can clearly see that the bottleneck lies in the fact that storage and compute are tightly coupled and can’t be scaled independently of each other. Ideally, we would like an architecture where we can scale the compute capacity as needed depending on our query workloads, while the storage is shared across all compute nodes. The storage should have unlimited capacity to make the architecture future-proof, and it should scale automatically as we store more and more data over time.

So instead of SN-MPP, we would like to achieve SD-MPP — Shared Data Massively Parallel Processing cluster.

This is precisely what many cloud vendors did. There are some differences in their implementations, but the goal is the same: an elastic compute layer with an independently and endlessly scalable shared storage layer.

Why does the separation of storage and compute work well for analytical queries?

Hadoop was originally designed to analyze data (i.e., to run queries that retrieve and process data) as close as possible to where it is stored. This means that if your Sales data is stored on node A, your query to retrieve that Sales data will likely also be executed on node A for performance gains. Overall, the fastest ways to get data for processing are (in this order):

  • from RAM,
  • then from SSDs,
  • and then from HDDs and object storage.

So having a separation of storage and compute may seem counterintuitive, as we move the data (storage) further away from where it is processed.

However, with high-performance columnar databases, cloud vendors apply many optimization techniques to both storage and compute (e.g., Amazon Redshift applies AQUA [4]), so that the separation of storage from compute shouldn’t have any noticeable negative impact on performance. These optimization techniques involve a combination of compression, encoding, caching, and internally moving data between object storage and SSDs.

When you run a query in a columnar in-memory database, under the hood you only load a small portion of the data from the shared storage layer (say, from object storage) into memory, while at the same time applying dictionary encoding and compression to reduce its size. This data can then be processed in the same way as if it had been loaded from a local disk with block storage.
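To make this more tangible, here is a minimal sketch (the bucket, path, and column names are made up) of how a compute engine can pull only the needed columns and rows of a columnar file out of object storage, using PyArrow in Python:

    import pyarrow.parquet as pq
    from pyarrow import fs

    # Placeholder bucket and path; assumes AWS credentials are configured.
    s3 = fs.S3FileSystem(region="eu-central-1")

    # Load only the columns (and rows) the query actually needs,
    # instead of pulling the entire table out of object storage.
    table = pq.read_table(
        "my-warehouse-bucket/sales/2023/part-0.parquet",
        columns=["order_id", "amount"],    # column pruning
        filters=[("country", "=", "DE")],  # predicate pushdown
        filesystem=s3,
    )
    print(table.num_rows)

This is only an illustration of the general idea; a managed cloud data warehouse performs this kind of pruning, encoding, and caching for you behind the scenes.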

Distributed compute engines such as Spark also support loading data directly from object storage [2], which is yet another example of the separation of compute and storage for analytical data processing.
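As a minimal sketch (the s3a:// path is a placeholder, and the Hadoop S3A connector plus AWS credentials are assumed to be configured), reading data straight from S3 with PySpark could look like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sd-mpp-demo").getOrCreate()

    # Spark (compute) reads directly from S3 (storage); no local HDFS needed.
    sales = spark.read.parquet("s3a://my-warehouse-bucket/sales/")
    sales.groupBy("country").sum("amount").show()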

What are the benefits of separating storage and compute in a data warehouse?

  • no idle compute resources — storage and compute can be scaled up and down independently of each other
  • if used with object storage (e.g., AWS S3) or with a Network File System (e.g., AWS EFS), we get virtually unlimited, highly available, and fault-tolerant storage at a low cost
  • no management of storage nodes (something you would usually have to maintain with, e.g., Hadoop cluster nodes or Amazon Redshift dense storage nodes) — with SD-MPP you (usually) only need to monitor and scale your compute nodes
  • massive cost reduction — being able to shut down some of the compute nodes at night, after seasonal peaks, or simply when they are not needed can save tons of money (see the sketch after this list)
  • making your architecture future-proof with respect to data growth — given the data growth we experience these days, it’s inevitable that the amount of stored data will increase over time. With SN-MPP it’s still possible to accommodate this growth, but at a price many companies simply can’t afford.
  • flexibility — being able to account for seasonality in your architecture: e.g., more compute is needed during specific times of the year such as Black Friday, Christmas, or a new product release
  • higher performance — you can get more done within your time: e.g., for a computationally expensive job, you can spin up an additional compute node with much more RAM and CPU capacity, finish the compute-intensive job faster, and then terminate that node without having to re-architect your entire data warehouse
  • fault tolerance — if, for some reason, all of your compute nodes go down, you won’t lose your data: you can simply launch a new compute instance and instantly get back access to your schemas and tables
  • when scaling out, there is no need to redistribute, repartition, or reindex the data in your cluster — with the tightly coupled SN-MPP architecture, repartitioning or reindexing is needed to prevent overburdening particular nodes, i.e., to prevent one node from taking on all the storage or all the compute work while the other nodes remain idle. In short, to evenly distribute storage and compute across the nodes.
  • separation of compute for different teams while still keeping your data (shared storage) in one central place accessible to everyone. For instance, you could allocate separate “virtual” compute capacity to data scientists so that their computationally expensive ML queries don’t affect other users. Not all cloud vendors support this feature (I have only seen it in Snowflake [10]).
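To illustrate the cost point from the list above: with Amazon Redshift on RA3 nodes (one concrete SD-MPP implementation), compute can be paused at night while the data stays in managed storage. A minimal sketch with boto3, where the cluster identifier is a placeholder:

    import boto3

    redshift = boto3.client("redshift")

    # Evening: pause the cluster; with RA3 nodes the data remains in
    # Redshift Managed Storage, so nothing is lost.
    redshift.pause_cluster(ClusterIdentifier="analytics-cluster")

    # Morning: resume it, and the schemas and tables are available again.
    redshift.resume_cluster(ClusterIdentifier="analytics-cluster")

While the cluster is paused, you stop paying for the compute nodes and only pay for the storage.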

How cloud vendors approached the separation of storage and compute

Below, I list only the cloud data warehouse services that use the SD-MPP architecture, i.e., that separate storage from compute. Since each of these cloud offerings differs from the others, I provide a short description of how each incorporated the shared-data paradigm into its service.

Snowflake

Snowflake pioneered and marketed (they seem to have a solid marketing budget!) the concept of a multi-cluster shared data architecture (SD-MPP). They divide it into [3]:

  • Database Storage layer — this is where data is persisted and optimized (turned into columnar form and compressed) after we load it into Snowflake. This storage is abstracted away from users — the data is only visible when running queries.
  • Query Processing layer — determines how data is processed inside virtual warehouses. This is the compute layer that we can actively manage; e.g., we can create several warehouses for specific teams (see the sketch at the end of this section).
  • Cloud Services layer — includes metadata & infrastructure management, authentication & access control, as well as query optimization.

One of the biggest selling points of Snowflake is that their SD-MPP product is cloud-agnostic — you can set it up on Amazon Web Services, Azure, or Google Cloud Platform [9].
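As a small sketch of what managing the Query Processing layer looks like in practice (credentials and warehouse names are placeholders), you could give each team its own virtual warehouse via the Snowflake Python connector, while both teams query the same shared data:

    import snowflake.connector

    conn = snowflake.connector.connect(
        user="YOUR_USER", password="YOUR_PASSWORD", account="YOUR_ACCOUNT"
    )
    cur = conn.cursor()

    # A large warehouse for the data scientists' heavy ML queries...
    cur.execute(
        "CREATE WAREHOUSE IF NOT EXISTS ds_wh WITH WAREHOUSE_SIZE = 'XLARGE' "
        "AUTO_SUSPEND = 300 AUTO_RESUME = TRUE"
    )
    # ...and a small one for BI dashboards, so neither affects the other.
    cur.execute(
        "CREATE WAREHOUSE IF NOT EXISTS bi_wh WITH WAREHOUSE_SIZE = 'SMALL' "
        "AUTO_SUSPEND = 300 AUTO_RESUME = TRUE"
    )

AUTO_SUSPEND shuts a warehouse down after 300 seconds of inactivity, which is the per-warehouse equivalent of scaling in idle compute.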

Amazon Redshift

Until December 2019, Redshift would have been considered a typical example of an SN-MPP architecture. Redshift is one of the first cloud data warehouse solutions — it has been on the market since October 2012.

AWS likely noticed that other cloud vendors were offering competing services with SD-MPP architecture (allowing for massive cost reductions due to the separation of storage and compute), or perhaps they listened to the voices of their customers. At first, AWS implemented Redshift Spectrum — a service that provides an additional compute layer to query data directly from S3. This feature lets us create External Tables (external, as they don’t live within the data warehouse — they are retrieved from S3) and join them with existing tables in the data warehouse. It offers seamless integration between the data warehouse and the data lake, but it doesn’t solve the underlying issue of the SN-MPP architecture that Redshift is built on.
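As a rough sketch (the host, role ARN, and table names are placeholders, and the external table is assumed to already be defined in the Glue Data Catalog), querying S3 data alongside warehouse tables via Spectrum could look like this from Python:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.example.eu-central-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="YOUR_PASSWORD",
    )
    cur = conn.cursor()

    # Expose the data lake as an external schema backed by the Glue catalog.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'spectrum_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
        CREATE EXTERNAL DATABASE IF NOT EXISTS
    """)

    # Join an External Table (data in S3) with a regular warehouse table.
    cur.execute("""
        SELECT c.customer_name, SUM(s.amount)
        FROM spectrum.sales s                      -- lives in S3
        JOIN customers c ON c.id = s.customer_id   -- lives in Redshift
        GROUP BY c.customer_name
    """)
    print(cur.fetchall())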

In December 2019, AWS released:

  • Amazon Redshift Managed Storage, which provides shared storage between compute nodes and scales automatically up to 8.2 PB. This storage is based on a combination of S3 and SSD disks, and AWS completely manages how data is stored and moved between S3 and the SSDs.
  • AQUA (Advanced Query Accelerator) for Redshift — a hardware-accelerated cache within the Managed Storage layer that speeds up operations. According to AWS, this makes Redshift up to 10x faster than any other cloud data warehouse [4]. Here are the slides from re:Invent [5] where they make this claim — however, they “forgot” to link the source of their performance benchmark, so I couldn’t validate it.
  • a new RA3 instance type — compute nodes from this instance family work together with Redshift Managed Storage (sketched below).
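For completeness, here is a rough sketch of launching such a cluster with boto3 (all identifiers are placeholders): with RA3, you size only the compute nodes, and Redshift Managed Storage takes care of the data behind the scenes.

    import boto3

    redshift = boto3.client("redshift")
    redshift.create_cluster(
        ClusterIdentifier="analytics-cluster",
        NodeType="ra3.4xlarge",   # RA3: compute sized independently
        ClusterType="multi-node",
        NumberOfNodes=2,          # managed storage scales on its own
        MasterUsername="admin",
        MasterUserPassword="YOUR_PASSWORD",
    )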

Amazon Athena

Amazon has another service that can be used for data warehousing and data lakes: Athena. In contrast to Redshift, Athena is a serverless option that combines the Presto engine (compute) with S3 (storage) to query data on demand from an S3-based data lake. You pay on a per-query basis, plus for the S3 storage.
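A minimal sketch of the pay-per-query model with boto3 (the database, table, and result bucket are made up): you submit SQL, Athena spins up the compute, and the results land back in S3.

    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString="SELECT country, SUM(amount) FROM sales GROUP BY country",
        QueryExecutionContext={"Database": "my_lake_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    # Queries run asynchronously; poll with get_query_execution if needed.
    print(response["QueryExecutionId"])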

IBM

IBM Db2 Warehouse on Cloud is a fully managed service with an SD-MPP architecture, plus some extras such as an AI-based query optimizer:

“Where normal query optimizers may continue to suggest the same query path even after it proves to be less effective than hoped, a machine learning query optimizer can learn from experience, mimicking neural network patterns. This helps it constantly improve as opposed to optimizing at intervals”. [6]

Side note: at the time of writing, IBM claims to offer $1,000 USD in IBM Cloud credit so that you can try out their cloud data warehouse. You can find more about it here.

SAP Data Warehouse Cloud

SAP took, in some ways, a similar approach to Snowflake, in the sense that they also offer virtual warehouses, which they call “Spaces”:

Spaces […] are isolated and can be assigned quotas for available disc space, CPU usage, runtime hours, and memory usage. [7]

They promise that within those Spaces, you can scale storage and compute independently of each other. However, the storage doesn’t seem to grow elastically, as you are asked to explicitly assign disc space quotas per Space.

Google BigQuery

BigQuery is completely serverless, so how storage and compute work is abstracted away from users. BigQuery scales storage and compute automatically without us having to do anything. Under the hood, it uses a distributed storage layer called Colossus and a compute engine called Dremel. Similarly to Amazon Athena, BigQuery uses a per-query pricing model.
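A minimal sketch with the google-cloud-bigquery client (the project, dataset, and table names are placeholders): there is no cluster to create or size; you just submit the query.

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes GCP credentials are configured
    query_job = client.query(
        "SELECT country, SUM(amount) AS revenue "
        "FROM `my-project.sales_ds.sales` GROUP BY country"
    )
    for row in query_job.result():
        print(row["country"], row["revenue"])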

The separation of storage and compute within modern data warehouse solutions blurs the lines between Data Lake and Data Warehouse

Companies tend to build data lakes to save costs — data lakes offer unlimited storage, and many data lake cloud services offer additional tools to quickly and efficiently retrieve data from the lake, often through SQL interfaces built on top of it. This way, we can obtain a storage layer (your data lake, e.g., on AWS S3) and a SQL query engine serving as a (serverless) compute layer to query the data (e.g., Amazon Athena). From an architectural perspective, it’s a concept similar to the shared data layer in a data warehouse. Using those SQL interfaces built for data lakes also often resembles using a data warehouse. In a way, this blurs the line between a data lake and a data warehouse.

Common examples that confirm this hypothesis:

  • the open-source Presto + the AWS implementation of Presto: Amazon Athena
  • Upsolver provides a SQL interface for ingesting and transforming data stored in a data lake [8]
  • the “good old” Apache Hive has been providing a SQL interface to data stored on Hadoop since as early as 2010
  • Snowflake already calls itself a cloud data platform because it combines data warehouse and data lake functionality within one product
  • AWS created Redshift Spectrum to provide the capability to query the data warehouse and the data lake together within a single service.

Conclusion

In this article, we looked at why the separation of compute from storage is crucial to make your cloud data warehouse and data lake architecture future-proof in a cost-effective way.

We looked at the history of how we reached the Shared Data Massively Parallel Processing (SD-MPP) architecture and how it has been implemented by Snowflake, Amazon, Google, SAP, and IBM.

Finally, we listed the benefits of this approach and concluded that the separation of compute from storage in the modern cloud data warehouse solutions blurs the lines between data lake and data warehouse.

Thank you for reading and feel free to follow me for the next articles.

Resources:

[1] https://en.wikipedia.org/wiki/Shared-nothing_architecture

[2] Preetam Kumar: Cutting the cord: separating data from compute in your data lake with object storage

[3] Snowflake Docs

[4] AWS Pages

[5] AWS Slides from re:Invent December 2019 PDF

[6] IBM Blog

[7] SAP Blog

[8] Upsolver

[9] Snowflake — supported vendors

[10] Snowflake Virtual Warehouses
