Google has found a way to stretch a data warehouse across multiple data centers, using an architecture its engineers developed that could pave the way for much larger, more reliable and more responsive cloud-based analysis systems.
A Mesa implementation can hold petabytes of data, update millions of rows of data per second and field trillions of queries per day, Google says. Extending Mesa across multiple data centers allows the data warehouse to keep working even if one of the data centers fails.
Google built Mesa to store and analyze critical measurement data for its Internet advertising business, but the technology could be used for other, similar data warehouse jobs, the researchers said.
"Mesa ingests data generated by upstream services, aggregates and persists the data internally, and serves the data via user queries," the researchers wrote in a paper describing Mesa.
For Google, Mesa solved a number of operational issues that traditional enterprise data warehouses and other data analysis systems could not.
For one, most commercial data warehouses do not continuously update the data sets, but more typically update them once a day or once a week. Google needed its streams of new data to be analyzed as soon as they were created.
Google also needed a strong consistency for its queries, meaning a query should produce the same result from the same source each time, no matter which data center fields the query.
Consistency is typically considered a strength of relational database systems, though relational databases can have a hard time ingesting petabytes of data. It's especially hard if the database is replicated across multiple severs in a cluster, which enterprises do to boost responsiveness and uptime. NoSQL databases, such as Cassandra, can easily ingest that much data, but Google needed a greater level of consistency than these technologies can typically offer.
The Google researchers said that no commercial or existing open-source software was able to meet all of its requirements, so they created Mesa.
Mesa relies on a number of other technologies developed by the company, including the Colossus distributed file system, the BigTable distributed data storage system and the MapReduce data analysis framework. To help with consistency, Google engineers deployed a homegrown technology called Paxos, a distributed synchronization protocol.
In addition to scalability and consistency, Mesa offers another advantage in that it can run be run on generic servers, which eliminates the need for specialized, expensive hardware. As a result, Mesa can be run as a cloud service and easily scaled up or down to meet the job requirements.