This article discusses the big data tools and technologies that LinkedIn uses to provide its services. Some of these frameworks and projects were developed at LinkedIn to manage its data and processing operations effectively and efficiently.
LinkedIn needs to process a huge number of events each day and provide the resulting data to data analysts, engineers, business experts, and data scientists who seek a deep understanding of the interactions within LinkedIn's professional social graph. These experts use the data to derive insights and performance metrics that lead to better business decisions on products, marketing, sales, and other functional areas.
Here are the big data tools and frameworks LinkedIn uses to solve its data and processing problems:
- Espresso - Espresso is a horizontally scalable, indexed, timeline-consistent, document-oriented, highly available NoSQL data store. As LinkedIn has grown, its requirements for primary source-of-truth data have exceeded the capabilities of a traditional RDBMS. More than a key-value store, Espresso provides consistency, lookups on secondary fields, full-text search, limited transaction support, and the ability to feed a change capture service for easy integration with the rest of the online, nearline, and offline data ecosystem.
LinkedIn is aggressively migrating many applications to use Espresso as the source of truth. Examples include member-to-member messages, social gestures such as updates, shared articles, member profiles, company profiles, news articles, and many more. Espresso already serves as the source of truth for many LinkedIn applications and tens of terabytes of primary data.
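To make the "lookups on secondary fields" idea concrete, here is a minimal in-memory sketch (this is illustrative only and does not reflect Espresso's actual API): a document store that answers both primary-key gets and queries on an indexed secondary field.

```python
# Illustrative sketch, not Espresso's real API: a tiny document store
# with primary-key access plus secondary-field indexes.
class DocumentStore:
    def __init__(self, secondary_fields):
        self.docs = {}  # primary key -> document
        # field name -> field value -> set of primary keys
        self.indexes = {f: {} for f in secondary_fields}

    def put(self, key, doc):
        self.docs[key] = doc
        for field, index in self.indexes.items():
            if field in doc:
                index.setdefault(doc[field], set()).add(key)

    def get(self, key):
        return self.docs.get(key)

    def find_by(self, field, value):
        """Secondary-field lookup: all documents whose `field` equals `value`."""
        return [self.docs[k] for k in self.indexes[field].get(value, set())]

# Hypothetical member-profile data for illustration.
store = DocumentStore(secondary_fields=["company"])
store.put("m1", {"name": "Ada", "company": "LinkedIn"})
store.put("m2", {"name": "Bob", "company": "LinkedIn"})
store.put("m3", {"name": "Cat", "company": "Acme"})
```

A plain key-value store would have to scan every document to answer `find_by("company", "LinkedIn")`; maintaining the index at write time is what makes secondary lookups cheap.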
- Apache DataFu - Apache DataFu is a collection of libraries for working with large-scale data in Hadoop and Pig. The project was inspired by the need for stable, well-tested libraries for data mining and statistics. It is used at LinkedIn in many of its offline workflows for derived data products such as "People You May Know" and "Skills and Endorsements".
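As a flavor of the statistics utilities such a library provides, here is a nearest-rank quantile function sketched in plain Python; it mirrors the idea behind DataFu's quantile UDFs (which operate on sorted bags in Pig), but the function itself is our own illustration, not DataFu code.

```python
import math

def quantile(sorted_values, q):
    """Nearest-rank quantile over a pre-sorted list.

    Illustrative stand-in for a DataFu-style quantile computation;
    like DataFu's Pig UDF, it expects its input already sorted.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    # Nearest-rank definition: the ceil(q * n)-th smallest value (1-based).
    idx = min(int(math.ceil(q * len(sorted_values))) - 1, len(sorted_values) - 1)
    return sorted_values[max(idx, 0)]

data = sorted([3, 1, 4, 1, 5, 9, 2, 6])
median = quantile(data, 0.5)
```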
- Apache Helix - Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated, and distributed resources hosted on a cluster of nodes. Helix manages the state of a resource through a pluggable distributed state machine: one defines the state machine's transition table along with the constraints for each state. Common state models shipped with Helix include Master-Slave, Online-Offline, and Leader-Standby.
LinkedIn currently uses Helix to manage its search-as-a-service clusters hosting multiple search applications, Databus, its data change capture component, and Espresso, LinkedIn's indexed, timeline-consistent, document-oriented data store.
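The pluggable state-model idea can be sketched as a transition table plus per-state count constraints. The following is a toy Python rendering of a Master-Slave model (the structure is our own simplification, not Helix's Java API):

```python
# Toy sketch of a Helix-style state model: legal transitions plus a cap
# on how many replicas may occupy each state (None = unbounded).
MASTER_SLAVE = {
    "states": {"MASTER": 1, "SLAVE": 2, "OFFLINE": None},
    "transitions": {
        ("OFFLINE", "SLAVE"),
        ("SLAVE", "MASTER"),
        ("MASTER", "SLAVE"),
        ("SLAVE", "OFFLINE"),
    },
}

def can_transition(model, current, target, state_counts):
    """Check a proposed replica transition against the state model."""
    if (current, target) not in model["transitions"]:
        return False
    cap = model["states"][target]
    return cap is None or state_counts.get(target, 0) < cap

# One master and one slave already running for this partition.
counts = {"MASTER": 1, "SLAVE": 1}
```

A controller using such a model would, for example, refuse to promote a second master but happily bring another slave online.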
- Apache Kafka - Apache Kafka is a distributed publish-subscribe messaging system that can handle all the activity stream data and processing of a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features of the modern web. Because of the throughput requirements, this data has typically been handled by "logging" and ad hoc log aggregation solutions; such solutions are viable for feeding logging data into an offline analysis system like Hadoop, but are very limiting for building real-time processing.
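The core of Kafka's model, an append-only log that multiple consumer groups read at their own pace, can be sketched in a few lines (an in-memory toy only; the real system is distributed, partitioned, and persistent):

```python
# In-memory sketch of Kafka's log-based publish-subscribe model.
class Topic:
    def __init__(self):
        self.log = []       # append-only message log
        self.offsets = {}   # consumer group -> next offset to read

    def publish(self, message):
        self.log.append(message)

    def consume(self, group, max_messages=10):
        """Each group tracks its own offset, so an offline batch consumer
        and a real-time consumer read the same log independently."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

activity = Topic()
for event in ["page_view", "search", "click"]:
    activity.publish(event)
```

Decoupling producers from consumers through the log is what lets the same activity stream feed both Hadoop and real-time consumers.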
- Apache Samza - Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
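A Samza job is essentially a task object whose process method is invoked once per incoming message, accumulating local state. The sketch below captures that shape in Python; the class and field names are invented for illustration and do not match Samza's real Java StreamTask API.

```python
# Illustrative Samza-style stream task: called once per message,
# keeping local state (which Samza would back with a changelog stream).
class PageViewCounterTask:
    def __init__(self):
        self.counts = {}  # page -> view count (local task state)

    def process(self, message):
        page = message["page"]
        self.counts[page] = self.counts.get(page, 0) + 1

task = PageViewCounterTask()
stream = [{"page": "/home"}, {"page": "/jobs"}, {"page": "/home"}]
for msg in stream:          # in Samza, the framework drives this loop
    task.process(msg)
```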
- Azkaban - Azkaban was implemented at LinkedIn to solve the problem of Hadoop job dependencies. LinkedIn had jobs that needed to run in order, from ETL jobs to data analytics products. Initially a single-server solution, Azkaban has evolved into a more robust system as the number of Hadoop users has grown over the years.
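The underlying problem is ordering a dependency graph of jobs. Python's standard library can express the idea directly (the job names here are invented; Azkaban itself declares dependencies in job property files):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
jobs = {
    "extract":   set(),
    "transform": {"extract"},
    "load":      {"transform"},
    "report":    {"load"},
}

# A scheduler like Azkaban must run jobs in an order that respects
# every dependency; a topological sort produces such an order.
order = list(TopologicalSorter(jobs).static_order())
```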
- Voldemort - Voldemort is a distributed key-value storage system. It is used at LinkedIn by numerous critical services powering a large portion of the site.
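A defining trait of such a store is that every key is mapped deterministically to a small set of replica nodes. Here is a toy sketch of that placement logic (node names and replication factor are invented; this simplifies Voldemort's actual partition-ring design):

```python
import hashlib

# Toy key placement in the spirit of a distributed key-value store:
# hash the key to a starting node, then take the next REPLICAS nodes.
NODES = ["node-a", "node-b", "node-c"]
REPLICAS = 2

def nodes_for(key):
    """Deterministically pick REPLICAS consecutive nodes for a key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    start = int(digest, 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]
```

Because the mapping is a pure function of the key, any client can compute where to read or write without consulting a central directory.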
- Norbert - Norbert is a library that provides easy cluster management and workload distribution. With Norbert, you can quickly turn a simple client/server architecture into a highly scalable one capable of handling heavy traffic.
Implemented in Scala, Norbert wraps ZooKeeper and Netty and uses Protocol Buffers for transport to make it easy to build a cluster aware application. A Java API is provided and pluggable routing strategies are supported with a consistent hash strategy provided out of the box.
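To illustrate the consistent-hash routing strategy mentioned above, here is a small Python reimplementation of the general technique (for illustration only; Norbert's actual implementation is Scala and its API differs):

```python
import bisect
import hashlib

# Generic consistent-hash router: nodes are placed at many points on a
# hash ring, and each request key routes to the next node clockwise.
class ConsistentHashRouter:
    def __init__(self, nodes, vnodes=16):
        # vnodes: virtual points per node, for smoother load spread.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.sha1(s.encode()).hexdigest(), 16)

    def route(self, request_key):
        """Pick the first ring position at or after the key's hash."""
        idx = bisect.bisect(self.keys, self._hash(request_key)) % len(self.ring)
        return self.ring[idx][1]

router = ConsistentHashRouter(["srv1", "srv2", "srv3"])
```

The payoff of consistent hashing is that adding or removing a node remaps only a small fraction of keys, instead of reshuffling everything as a plain modulo scheme would.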
- White Elephant - White Elephant is a Hadoop log aggregator and dashboard which enables visualization of Hadoop cluster utilization across users. White Elephant is compiled and tested against Hadoop 1.0.3 and should work with any 1.0.x version. Hadoop 2.0 is not yet supported.
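The aggregation at the heart of such a dashboard is a roll-up of per-job log records into per-user utilization totals. A minimal sketch, with invented record fields since the real log schema is Hadoop's:

```python
from collections import defaultdict

# Illustrative roll-up of job-log records into per-user utilization,
# the kind of aggregate a tool like White Elephant would chart.
job_records = [
    {"user": "alice", "cpu_hours": 12.0},
    {"user": "bob",   "cpu_hours": 3.5},
    {"user": "alice", "cpu_hours": 4.0},
]

usage = defaultdict(float)
for record in job_records:
    usage[record["user"]] += record["cpu_hours"]
```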
Thank you for reading through the article. If you have any feedback, questions, or concerns, please share them in the comments and we will get back to you as soon as possible.