This article introduces two forms of parallel computing: task parallelism and data parallelism.
Parallel computing is a style of computation in which multiple operations are carried out simultaneously, either on one machine (by means of multi-threading) or on many machines. It works on the principle that a large problem can often be divided into smaller problems, and that these smaller problems can be executed concurrently.
Two related computing styles are also worth knowing. Concurrent computing is a generalized form of parallel computing, in which the simultaneous operations may be part of one large problem or of entirely different problems. Distributed computing is a specialized form of parallel computing, which works on the principle of processing parallel operations across multiple machines.
As the title suggests, this article covers the following two forms of parallel computing:
- Task Parallelism
- Data Parallelism
Task parallelism covers the execution of computer programs across multiple processors, on the same machine or on multiple machines. It focuses on executing different operations in parallel so that the available computing resources, namely processors and memory, are fully utilized.
One example of task parallelism is an application that creates threads for parallel processing, where each thread is responsible for performing a different operation. The following pseudocode illustrates task parallelism:
    FOR each CPU in parallel computing environment
        Retrieve next task from task queue
        Create a thread and provide it with the retrieved task
        Start the created thread
    END FOR
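As a rough illustration of the pseudocode above, here is a minimal Python sketch that uses the standard library's `concurrent.futures.ThreadPoolExecutor`. The three task functions (`count_words`, `sum_squares`, `reverse_text`) and the `run_tasks` helper are hypothetical names chosen for this example. The point is only that each worker thread performs a different operation.

```python
import os
from concurrent.futures import ThreadPoolExecutor


# Three unrelated operations (hypothetical examples, not from any framework).
def count_words(text):
    return len(text.split())


def sum_squares(n):
    return sum(i * i for i in range(n))


def reverse_text(text):
    return text[::-1]


def run_tasks(tasks):
    """Submit each (function, argument) pair to a pool of worker threads."""
    # Use at most one worker per CPU, and never fewer than one.
    workers = max(1, min(len(tasks), os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(func, arg) for func, arg in tasks]
        # Collect results in submission order.
        return [future.result() for future in futures]


# The "task queue": every entry is a different operation.
task_queue = [
    (count_words, "task parallelism runs different operations"),
    (sum_squares, 5),
    (reverse_text, "abc"),
]
results = run_tasks(task_queue)
print(results)  # [5, 30, 'cba']
```

Note that in standard CPython builds the global interpreter lock prevents threads from running Python bytecode at the same time, so for CPU-bound tasks one would likely swap in `ProcessPoolExecutor`, which offers the same `submit` interface.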
Big Data frameworks that utilize task parallelism include Apache Storm and Apache YARN (the latter is closer to hybrid parallelism, since it provides both task and data parallelism).
Data parallelism focuses on distributing a data set across multiple computation units. In this form, the same operation is performed on different processors, each working on its own subset of the distributed data.
One example of data parallelism is to divide the input data into subsets and pass each subset to a thread that performs the same task on a different CPU. The following pseudocode illustrates data parallelism using a data array called d:
    chunk_size = ceil(d.length / no_of_cpus)
    lower_limit = 0
    FOR each CPU in parallel computing environment
        upper_limit = min(lower_limit + chunk_size, d.length)
        Create a thread and provide it with lower_limit (inclusive) and upper_limit (exclusive) data array indexes
        Start the created thread
        lower_limit = upper_limit
    END FOR
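The same idea can be sketched in Python, again with `concurrent.futures.ThreadPoolExecutor` from the standard library. This is only an illustration under some assumptions: the operation applied to every subset is a simple sum, the index ranges are zero-based with an inclusive lower limit and an exclusive upper limit, and `split_bounds`, `sum_range`, and `parallel_sum` are hypothetical helper names.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_bounds(length, no_of_cpus):
    """Return (lower, upper) index pairs that cover range(length) exactly once."""
    chunk_size = max(1, math.ceil(length / no_of_cpus))
    return [
        (lower, min(lower + chunk_size, length))
        for lower in range(0, length, chunk_size)
    ]


def sum_range(d, lower, upper):
    """The single operation that every worker applies to its own slice."""
    return sum(d[lower:upper])


def parallel_sum(d, no_of_cpus):
    """Sum d by giving each worker thread a different slice of the array."""
    bounds = split_bounds(len(d), no_of_cpus)
    with ThreadPoolExecutor(max_workers=no_of_cpus) as pool:
        partial_sums = pool.map(lambda b: sum_range(d, b[0], b[1]), bounds)
        # Combine the per-slice results into the final answer.
        return sum(partial_sums)


d = list(range(1, 11))  # the data array: 1, 2, ..., 10
print(split_bounds(len(d), 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
total = parallel_sum(d, no_of_cpus=4)
print(total)  # 55
```

Rounding the chunk size up means the last slice may be shorter than the others, but no element is skipped or processed twice.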
Big Data frameworks that utilize data parallelism include Apache Spark, Apache MapReduce, and Apache YARN (the latter is closer to hybrid parallelism, since it provides both task and data parallelism).
Thank you for reading this tutorial. If you have any feedback, questions, or concerns, please share them in the comments and we will get back to you as soon as possible.