This article discusses the big data tools and technologies that Facebook uses to provide its services. Facebook developed some of these frameworks and projects itself in order to manage its data and processing operations efficiently.
Every time one of the 1.2 billion people who use Facebook visits the site, they see a unique, dynamically generated home page. Several applications power this experience, and others across the site, and they require global, real-time data fetching. In this section, we discuss some of the tools, frameworks and applications that Facebook developed to meet the challenge of processing data at this scale:
- RocksDB - RocksDB is an embeddable, persistent key-value store for fast storage. It can also serve as the foundation for a client-server database, but its current focus is on embedded workloads. RocksDB builds on LevelDB so that it scales to servers with many CPU cores, uses fast storage efficiently, supports IO-bound, in-memory and write-once workloads, and stays flexible enough to allow for innovation.
Facebook built RocksDB for storing and accessing hundreds of petabytes of data and is constantly improving and overhauling its tools to make this as fast and efficient as possible.
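To make "embeddable persistent key-value store" concrete, here is a minimal sketch. It does not use RocksDB; it uses Python's standard `dbm` module as a stand-in, and the function name `demo_embedded_store`, the file name and the keys are hypothetical. The point is only the usage pattern: the store is a library opened inside the application process, with no server to connect to, and the data survives closing and reopening.

```python
import dbm
import os
import tempfile


def demo_embedded_store(path):
    """Write two keys, close the store, reopen it, and read the keys back."""
    # "c" opens the store read-write and creates it if it does not exist.
    # Everything runs inside this process; there is no database server.
    with dbm.open(path, "c") as db:
        db[b"user:1:name"] = b"alice"
        db[b"user:2:name"] = b"bob"

    # Reopening read-only shows that the data was persisted to disk.
    with dbm.open(path, "r") as db:
        return {key: db[key] for key in db.keys()}


# Stand-in only: dbm is not RocksDB, it just shares the embedded usage style.
with tempfile.TemporaryDirectory() as tmp:
    stored = demo_embedded_store(os.path.join(tmp, "demo_store"))
    print(stored)
```

A client-server database would instead put a network protocol between the application and the store, which is the alternative use the RocksDB description mentions.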
- Corona - Corona is a scheduling framework that Facebook developed to overcome the limitations of the Apache Hadoop MapReduce scheduling framework. The Hadoop MapReduce scheduling framework is responsible for two functions: cluster resource management and job tracking. Facebook noticed that it did not cope well with peak data loads. Facebook solved this problem with Corona, which separates cluster management from job tracking and so allows peak data loads to be processed efficiently.
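The separation of cluster management from job tracking can be illustrated with a toy model. This is a sketch under stated assumptions, not Corona's actual code or API: the class names `ClusterManager` and `JobTracker`, the notion of "slots", and the round-based loop are all hypothetical. The cluster manager only counts free slots, while each job gets its own tracker that follows only that job's tasks.

```python
class ClusterManager:
    """Hypothetical: hands out free slots and knows nothing about job progress."""

    def __init__(self, total_slots):
        self.free_slots = total_slots

    def request(self, wanted):
        # Grant as many slots as are free, up to the number requested.
        granted = min(wanted, self.free_slots)
        self.free_slots -= granted
        return granted

    def release(self, count):
        self.free_slots += count


class JobTracker:
    """Hypothetical: tracks the tasks of a single job and nothing else."""

    def __init__(self, name, num_tasks, cluster):
        self.name = name
        self.pending = num_tasks
        self.running = 0
        self.done = 0
        self.cluster = cluster

    def acquire(self):
        # Ask the cluster manager for one slot per pending task.
        self.running = self.cluster.request(self.pending)
        self.pending -= self.running

    def complete(self):
        # Pretend the running tasks finished, then hand the slots back.
        self.cluster.release(self.running)
        self.done += self.running
        self.running = 0

    @property
    def finished(self):
        return self.pending == 0 and self.running == 0


def run_until_done(trackers):
    """Run scheduling rounds until every job has finished; return the round count."""
    rounds = 0
    while not all(t.finished for t in trackers):
        for t in trackers:
            t.acquire()
        for t in trackers:
            t.complete()
        rounds += 1
    return rounds


cluster = ClusterManager(total_slots=4)
jobs = [JobTracker("job-a", 6, cluster), JobTracker("job-b", 3, cluster)]
rounds = run_until_done(jobs)
print(rounds, [(j.name, j.done) for j in jobs], cluster.free_slots)
```

Because per-job state lives in the trackers, the resource-allocation path in this model stays small no matter how many jobs are running.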
- Presto - Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
Facebook uses Presto for interactive queries against several internal data stores, including its 300 PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each day.
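The idea of a single query combining data from multiple sources can be shown in miniature. The sketch below does not use Presto; it uses Python's built-in `sqlite3` module with two separate in-memory databases as a stand-in. The schema name `crm`, the table names and the sample rows are hypothetical.

```python
import sqlite3

# The first in-memory database is available under the schema name "main".
conn = sqlite3.connect(":memory:")
# Attach a second, independent in-memory database under the name "crm".
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (user_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.users (user_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)
conn.executemany(
    "INSERT INTO crm.users VALUES (?, ?)",
    [(1, "US"), (2, "DE")],
)

# One SQL statement joins tables that live in two different databases.
rows = conn.execute(
    """
    SELECT u.country, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.users AS u ON u.user_id = o.user_id
    GROUP BY u.country
    ORDER BY u.country
    """
).fetchall()
print(rows)
conn.close()
```

In Presto the sources would be different systems, such as Hive and a relational database, rather than two SQLite databases, but the query shape is similar: each table is addressed through the source it lives in.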
Thank you for reading this tutorial. If you have any feedback, questions or concerns, please leave a comment and we will get back to you as soon as possible.