    Task Parallelism vs Data Parallelism
    Published on: 8th May 2015

    This article introduces two forms of parallel computing: task parallelism and data parallelism.

    Abstract

    Parallel computing is a style of computation in which multiple operations are carried out simultaneously, either on one machine (by means of multi-threading) or across many machines. It works on the principle that large problems can often be divided into smaller problems, and that these smaller problems can be executed concurrently.

    Two related computing styles are also widely used: concurrent computing (a more general form of parallel computing, since the operations in progress at the same time may belong to one large problem or to entirely different problems) and distributed computing (a more specialized form of parallel computing, since it works on the principle of processing parallel operations across multiple machines).

    As the title suggests, this article covers the following two forms of parallel computing:

    1. Task Parallelism
    2. Data Parallelism

    Task Parallelism

    This form of parallelism covers the execution of computer programs across multiple processors, on the same machine or on multiple machines. It focuses on executing different operations in parallel to make full use of the available computing resources, namely processors and memory.

    One example of task parallelism is an application that creates threads for parallel processing, with each thread responsible for a different operation. The following pseudocode illustrates task parallelism:

    FOR each CPU in parallel computing environment
        Retrieve next task from task queue
        Create a thread and provide it with the retrieved task
        Start the created thread
    END FOR
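
    The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the three functions and the name run_tasks are hypothetical, chosen only to show threads performing different operations, and the standard-library ThreadPoolExecutor stands in for the "create a thread and start it" steps.

```python
from concurrent.futures import ThreadPoolExecutor
import os

# Three deliberately different operations (hypothetical tasks for illustration).
def count_words(text):
    return len(text.split())

def find_longest_word(text):
    return max(text.split(), key=len)

def to_upper(text):
    return text.upper()

def run_tasks(text):
    """Run a different operation on each worker thread and collect the results."""
    tasks = [count_words, find_longest_word, to_upper]  # plays the role of the task queue
    # At most one worker per task, capped by the CPU count reported by the OS.
    workers = min(len(tasks), os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submit() hands each task to a worker thread and returns a Future.
        futures = [pool.submit(task, text) for task in tasks]
        # result() blocks until the corresponding task has finished.
        return [future.result() for future in futures]
```

    In CPython, threads are best suited to I/O-bound tasks. For CPU-bound work, ProcessPoolExecutor from the same module offers the same submit interface and runs the tasks in separate processes.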
    
    

    Big Data frameworks that use task parallelism include Apache Storm and Apache Hadoop YARN (the latter is closer to hybrid parallelism, as it supports both task and data parallelism).

    Data Parallelism

    This form of parallelism focuses on distributing data sets across multiple computation programs. The same operation is performed on different processors, each working on its own subset of the distributed data.

    One example of data parallelism is to divide the input data into subsets and pass each subset to a thread that performs the same task on a different CPU. The following pseudocode illustrates data parallelism using a data array called d:

    chunk_size = ceil(d.length / no_of_cpus)
    lower_limit = 0
    upper_limit = 0
    FOR each CPU in parallel computing environment
        lower_limit = upper_limit + 1
        upper_limit = min(upper_limit + chunk_size, d.length)
        Create a thread and provide it with lower_limit and upper_limit data array indexes (1-based, inclusive; the range is empty when lower_limit > upper_limit)
        Start the created thread
    END FOR
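
    A runnable Python sketch of the same idea follows. It is an illustration under stated assumptions: the operation (a sum of squares) and the names chunk_bounds, sum_of_squares and parallel_sum_of_squares are hypothetical, and the index ranges are zero-based and half-open, as is usual in Python, rather than the limits used in the pseudocode.

```python
from concurrent.futures import ThreadPoolExecutor
import math
import os

def chunk_bounds(length, parts):
    """Return half-open (lower, upper) index pairs that together cover range(length)."""
    size = math.ceil(length / parts) if length else 0
    bounds = []
    lower = 0
    while lower < length:
        upper = min(lower + size, length)  # never run past the end of the data
        bounds.append((lower, upper))
        lower = upper
    return bounds

def sum_of_squares(d, lower, upper):
    # The same operation for every worker, applied to one slice of the data.
    return sum(x * x for x in d[lower:upper])

def parallel_sum_of_squares(d, no_of_cpus=None):
    """Split d into chunks, process each chunk on a worker thread, combine the results."""
    no_of_cpus = no_of_cpus or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=no_of_cpus) as pool:
        futures = [pool.submit(sum_of_squares, d, lower, upper)
                   for lower, upper in chunk_bounds(len(d), no_of_cpus)]
        return sum(future.result() for future in futures)
```

    Rounding the chunk size up and clamping the upper bound to the array length means every element is processed exactly once, even when the array length is not a multiple of the number of CPUs.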
    
    

    Big Data frameworks that use data parallelism include Apache Spark, Apache Hadoop MapReduce and Apache Hadoop YARN (the latter is closer to hybrid parallelism, as it supports both task and data parallelism).
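
    To make the idea of hybrid parallelism concrete, here is a small Python sketch that combines both forms: two different reductions (task parallelism) are each applied to chunks of the same data (data parallelism). The names split and hybrid are hypothetical, the input is assumed to be a non-empty list, and the sketch says nothing about how YARN or any other framework is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def split(d, parts):
    """Split d into at most `parts` contiguous slices of near-equal size."""
    if not d:
        return []
    size = -(-len(d) // parts)  # ceiling division
    return [d[i:i + size] for i in range(0, len(d), size)]

def hybrid(d, parts=4):
    """Return (total, maximum) of a non-empty list d, computed chunk by chunk."""
    chunks = split(d, parts)
    with ThreadPoolExecutor() as pool:
        # Two different operations are scheduled on the same chunks;
        # map() submits all the work up front and yields results in order.
        partial_sums = pool.map(sum, chunks)
        partial_maxes = pool.map(max, chunks)
        # Combine the per-chunk results into the final answers.
        return sum(partial_sums), max(partial_maxes)
```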

    Thank you for reading this tutorial. If you have any feedback, questions or concerns, please share them in the comments and we will get back to you as soon as possible.

