This article discusses the big data technologies that Google uses to manage and process its data and to provide services such as BigQuery and Google Analytics.
Google probably processes more information than any other company on the planet and has often had to invent its own tools to cope with that data. As a result, its technology is often said to run five to ten years ahead of the competition. Google has published several big data processing models and systems, such as MapReduce and FlumeJava, that have inspired open-source big data technologies such as Hadoop. This section covers part of the big data technology stack at Google:
- Google Mesa - Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails.
- Google File System - Google File System is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. It is widely deployed within Google as the storage platform for the generation and processing of data used by Google services as well as research and development efforts that require large data sets.
The Google File System design is the basis of Hadoop's HDFS, which is actively used by many big data tools such as HBase and Spark.
- Bigtable - Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).
- Google FlumeJava - FlumeJava is a Java library developed at Google for building data-parallel pipelines on top of MapReduce. MapReduce enables distributed computing, but not every real-life problem can be expressed as a single MapReduce task. Instead, most real-life problems require a chain of MapReduce tasks, which in turn requires intermediate glue code to pipeline those tasks. FlumeJava addresses this by letting developers express the whole pipeline as operations on parallel collections, which the library then optimizes and runs as a series of MapReduce jobs.
FlumeJava should not be confused with Apache Flume, which is an unrelated log collection service. The ideas behind FlumeJava are available in open source through Apache Crunch and, later, Apache Beam.
- Google MillWheel - MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees. MillWheel's programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. According to its authors, MillWheel's combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
- Dremel - Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. The Dremel paper presents a novel columnar storage representation for nested records and discusses experiments on few-thousand-node instances of the system.
Google offers a cloud analytics platform called BigQuery, based on Dremel, that enables companies to process huge structured data sets at lightning-fast speeds.
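To make the benefit of a columnar data layout concrete, here is a minimal sketch in Python. It is not Dremel's actual storage format, which also encodes nested and repeated fields. The records, the field names, and the helpers `to_columns` and `sum_by` are hypothetical and exist only for illustration. The point it shows is that an aggregation over two fields reads only those two columns and never touches the rest of each record.

```python
from collections import defaultdict

# Hypothetical row-oriented records (kept flat for simplicity).
rows = [
    {"country": "US", "clicks": 3, "url": "a.com"},
    {"country": "IN", "clicks": 5, "url": "b.com"},
    {"country": "US", "clicks": 7, "url": "c.com"},
]

def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    columns = defaultdict(list)
    for record in records:
        for field, value in record.items():
            columns[field].append(value)
    return dict(columns)

def sum_by(columns, key_field, value_field):
    """Sum value_field grouped by key_field, reading only those two columns."""
    totals = defaultdict(int)
    for key, value in zip(columns[key_field], columns[value_field]):
        totals[key] += value
    return dict(totals)

columns = to_columns(rows)
clicks_by_country = sum_by(columns, "country", "clicks")
print(clicks_by_country)
```

In this toy version the `url` column is never read by the query. On a real system that skipped data is what saves disk and network I/O for wide tables.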
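The FlumeJava item above notes that many problems need a chain of MapReduce tasks plus glue code between them. The following sketch, assuming a toy in-memory `map_reduce` helper written for this example, shows what such a chain looks like. It is not the FlumeJava or Hadoop API, and every function name in it is hypothetical. Stage 1 counts words, and stage 2 groups the words by their count.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce stage: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Stage 1: count occurrences of each word.
def split_words(line):
    return [(word, 1) for word in line.split()]

def count_words(word, ones):
    return (word, sum(ones))

# Stage 2: group words by how often they occur.
def key_by_count(pair):
    word, count = pair
    return [(count, word)]

def collect_words(count, words):
    return (count, sorted(words))

def run_pipeline(lines):
    """Glue code: the output of stage 1 becomes the input of stage 2."""
    word_counts = map_reduce(lines, split_words, count_words)
    return map_reduce(word_counts, key_by_count, collect_words)

lines = ["to be or not to be", "to do"]
words_by_count = run_pipeline(lines)
print(words_by_count)
```

The `run_pipeline` function is the intermediate code that a pipeline library aims to make unnecessary. With more stages, this hand-written wiring grows and becomes harder to optimize.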
- Google Mesa Whitepaper
- Google File System Whitepaper
- Google Bigtable Whitepaper
- Google FlumeJava Whitepaper
- Google MillWheel Whitepaper
- Dremel Whitepaper
Thank you for reading through the tutorial. If you have any feedback, questions, or concerns, please share them in the comments and we will get back to you as soon as possible.