    Getting Started with Data Lake
    Published on: 6th March 2015

    This article introduces the Data Lake concept along with its characteristics and the challenges associated with it. It also discusses how a data lake differs from a data warehouse and a data mart.

    One of the strongest use cases for Big Data technologies is analysing data to uncover the hidden patterns and information in it. For this to be effective, all the data from the source systems must be saved without any loss or tailoring. However, traditional RDBMSs and most NoSQL storage systems require data to be transformed into a specific format before it can be added, updated, or searched effectively.

    The Data Lake concept was introduced to fill this gap: data is stored in its raw state (the same state in which it exists in the source systems), without any loss or transformation. For the same reason, a Data Lake is also referred to as a Data Landing Area.

    The diagram depicts a typical process involving a Data Lake: a data loader loads documents and web log data into the data lake without any changes. Data processors can then convert the stored data into whatever form each type of analysis requires.
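
    The loader and processor roles described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (load_raw, count_status_codes), the directory layout, and the three-field log format are hypothetical and do not come from any particular Data Lake product. The loader copies the file byte for byte and checks the checksum, while the processor derives one analysis-specific view from the untouched copy.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_raw(source: Path, lake_dir: Path) -> Path:
    """Data loader (hypothetical): copy the source file into the lake unchanged."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    target = lake_dir / source.name
    shutil.copyfile(source, target)
    # The stored copy must be identical to the source: no loss, no tailoring.
    assert sha256_of(source) == sha256_of(target)
    return target


def count_status_codes(raw_log: Path) -> dict:
    """Data processor (hypothetical): build one analysis-specific view."""
    counts = {}
    for line in raw_log.read_text().splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip lines this particular analysis cannot use
        status = parts[2]
        counts[status] = counts.get(status, 0) + 1
    return counts


def demo() -> dict:
    """Write a tiny made-up web log, load it into a lake folder, analyse it."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source" / "web.log"
        source.parent.mkdir()
        source.write_text("GET /home 200\nGET /missing 404\nGET /home 200\n")
        stored = load_raw(source, Path(tmp) / "lake" / "weblogs")
        return count_status_codes(stored)


if __name__ == "__main__":
    print(json.dumps(demo(), sort_keys=True))
```

    A different analysis would add its own processor function and read the same stored file, leaving the raw copy as it is.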

    A Data Lake is a concept rather than a product. It can be implemented with any suitable technology or software that can hold data in any form and that ensures no data loss occurs, typically by using distributed storage that provides failover. One example of such a technology is Apache Hadoop, where a MapReduce job could be used to load data into its distributed file system, the Hadoop Distributed File System (HDFS).
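
    To make the idea of "distributed storage providing failover" concrete, here is a toy model of replicated storage. It is a sketch only: the ToyCluster class and its placement rule are invented for illustration and are not how HDFS is implemented. The point it shows is that when each object is written to several nodes, a read still succeeds after some of those nodes fail.

```python
class ToyCluster:
    """A toy stand-in for replicated storage; not a model of HDFS internals."""

    def __init__(self, node_count: int, replication: int) -> None:
        if replication > node_count:
            raise ValueError("replication cannot exceed node count")
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.replication = replication

    def put(self, key: str, data: bytes) -> None:
        # Place copies on `replication` consecutive nodes chosen from the key.
        start = sum(key.encode()) % len(self.nodes)
        for i in range(self.replication):
            self.nodes[(start + i) % len(self.nodes)][key] = data

    def fail(self, node_index: int) -> None:
        # Mark a node as unavailable; its data can no longer be read.
        self.alive[node_index] = False

    def get(self, key: str) -> bytes:
        # Any surviving replica can serve the read.
        for node, alive in zip(self.nodes, self.alive):
            if alive and key in node:
                return node[key]
        raise KeyError(key)


if __name__ == "__main__":
    cluster = ToyCluster(node_count=4, replication=3)
    cluster.put("web.log", b"raw bytes")
    cluster.fail(0)
    cluster.fail(1)
    print(cluster.get("web.log"))
```

    With three copies spread over four nodes, any two nodes can fail and at least one copy remains readable.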

    How is it different from Data Warehouse and Data Mart?

    Having seen what a Data Lake is, one may ask how it differs from a Data Warehouse, which is also used to store and manage enterprise data for use by data analysts and scientists. A Data Lake can likewise be compared to a Data Mart, which manages the data for a single silo or department, either by keeping completely separate data storage for that silo or by creating a view on the company-wide data warehouse.

    The corresponding diagram shows the same process of analysing documents and web logs using a data warehouse.

    As we can see, the process differs in two ways. First, an ETL (extract, transform, load) component takes the place of the data loader: in a data warehouse, input data is transformed and tailored to a pre-defined schema before it is saved. Because the schema is fixed, this ETL step in most cases results in data loss. Second, the data warehouse has no data processors, because the data is already in a pre-defined schema, ready to be consumed by data analysts.
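
    The data loss caused by a fixed schema can be illustrated with a small sketch. Everything here is assumed for the example: the two-column warehouse schema, the JSON event records, and the function names are hypothetical. The ETL function keeps only the columns the schema defines, while a later analysis over the raw records can still reach fields the schema never anticipated.

```python
import json

# Hypothetical warehouse schema: only these columns are kept at load time.
WAREHOUSE_COLUMNS = ("user_id", "url")

# Made-up raw events as they might arrive from a source system.
RAW_EVENTS = [
    '{"user_id": 1, "url": "/home", "referrer": "search", "ms": 120}',
    '{"user_id": 2, "url": "/cart", "device": "mobile"}',
]


def etl_to_warehouse(raw_lines):
    """Warehouse path: project each record onto the fixed columns."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({col: record.get(col) for col in WAREHOUSE_COLUMNS})
    return rows


def dropped_fields(raw_lines):
    """List the field names that the fixed schema discards."""
    seen = set()
    for line in raw_lines:
        seen.update(json.loads(line).keys())
    return sorted(seen - set(WAREHOUSE_COLUMNS))


def devices_from_lake(raw_lines):
    """Lake path: a later analysis can still read any raw field."""
    return [json.loads(line).get("device") for line in raw_lines]


if __name__ == "__main__":
    print(etl_to_warehouse(RAW_EVENTS))
    print(dropped_fields(RAW_EVENTS))
    print(devices_from_lake(RAW_EVENTS))
```

    In this sketch the referrer, ms, and device fields never reach the warehouse rows, whereas the raw records keep them available for any future question.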

    Benefits of Data Lake

    Companies can reap the following benefits by implementing a Data Lake -

    1. Data Consolidation - A Data Lake enables enterprises to consolidate their data, available in various forms such as videos, customer care recordings, web logs, and documents, in one place, which was not possible with the traditional data warehouse approach.
    2. Schema-less and Format-free Storage - A Data Lake stores data in its raw form, i.e. in the same format in which the source systems send it. This eliminates the need for source systems to emit data in a pre-defined schema.
    3. No Data Loss - Since a Data Lake doesn't require source systems to emit data in a pre-defined schema, source systems do not need to tailor the data. This gives data analysts and scientists access to all the data, resulting in more accurate analysis.
    4. Cost Effectiveness - A Data Lake relies on distributed storage in which commodity hardware can be used to store huge volumes of data. Procuring and leveraging commodity hardware for storage is far more cost effective than using high-end hardware.

    Challenges with Data Lake

    1. The ability to accommodate any format and type of data can sometimes turn a Data Lake into a data mess, referred to as a Data Swamp. Hence, it is important that enterprises take care while implementing a Data Lake to ensure that the data in it is properly maintained and accessible.
    2. Because a Data Lake stores raw data, each type of analysis requires data transformation and tailoring from scratch, which calls for additional processing infrastructure every time you want to analyse the data stored in the Data Lake.
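
    One common way to reduce the "data mess" risk described in the first challenge is to record a little metadata for every object as it lands. The following is a minimal sketch of such a catalog, assuming hypothetical field names (path, source, data_format) and an in-memory list. A real deployment would likely use a dedicated metadata store, which the article does not prescribe.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata kept alongside one raw object (field names are illustrative)."""
    path: str
    source: str
    data_format: str
    sha256: str
    ingested_at: str


@dataclass
class Catalog:
    """In-memory catalog; stands in for a real metadata store."""
    entries: list = field(default_factory=list)

    def register(self, path: str, source: str, data_format: str,
                 payload: bytes) -> CatalogEntry:
        # Record where the object came from, what it is, and its checksum.
        entry = CatalogEntry(
            path=path,
            source=source,
            data_format=data_format,
            sha256=hashlib.sha256(payload).hexdigest(),
            ingested_at=datetime.now(timezone.utc).isoformat(),
        )
        self.entries.append(entry)
        return entry

    def find(self, data_format: str) -> list:
        # Let analysts locate raw objects by format instead of guessing paths.
        return [e.path for e in self.entries if e.data_format == data_format]


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register("lake/weblogs/web.log", "web-server", "weblog", b"GET / 200\n")
    catalog.register("lake/docs/faq.txt", "support-team", "document", b"FAQ text")
    print(catalog.find("weblog"))
```

    The catalog leaves the raw data untouched. It only keeps the data findable, which is what "properly maintained and accessible" requires in practice.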

    Thank you for reading through the tutorial. If you have any feedback, questions, or concerns, please share them in the comments and we shall get back to you as soon as possible.
