Monday, December 28, 2015

Parallel Programming Paradigms - MPI, OpenMP and CUDA


A side-by-side comparison of MPI, OpenMP, and CUDA:

Description
    MPI:    Message-passing API and standard for distributed programs.
    OpenMP: Multi-threaded API.
    CUDA:   Parallel computing platform and API for NVIDIA GPUs.

Compiler
    MPI:    mpicc
    OpenMP: gcc -fopenmp (links against libgomp)
    CUDA:   nvcc (LLVM-based)

Headers
    MPI:    #include <mpi.h>
    OpenMP: #include <omp.h>
    CUDA:   #include <cuda_runtime.h>

Compile/Link Flags
    MPI:    -I/usr/lib/openmpi -L/usr/lib/openmpi/…/lib -lmpi
    OpenMP: -lgomp
    CUDA:   -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart

Memory Model
    MPI:    Shared memory and distributed memory
    OpenMP: Shared-memory multiprocessing
    CUDA:   Shared memory

Concepts
    MPI:    Point-to-point messaging, broadcasting (1 to M), scatter/gather. Bindings exist for C/C++, Fortran, Java, Python, and R.
    OpenMP: An add-on in the compiler (directive-based); designed to work with C/C++ and Fortran.
    CUDA:   Well suited to data-parallel work, e.g., fast sort algorithms over large lists.
    (Minimal code sketches of each model follow at the end of the post.)

Variants
    MPI:    Open MPI, MPICH2
    CUDA:   SIMD, SIMT, SMT (CUDA itself follows the SIMT model)

Pros
    MPI:    More general solution; can run in clustered environments; distributed memory is less expensive.
    OpenMP: Easier to program; the same program can still run as serial code (no code change).
    CUDA:   Scattered reads; unified virtual memory; fast shared memory; full support for integer and bitwise operations.

Cons
    MPI:    Requires code changes from the serial to the parallel version; harder to debug; network communication can become a bottleneck.
    OpenMP: Synchronization bugs and race conditions are difficult to debug; requires compiler support; limited to shared-memory architectures; mostly used for loop parallelization.
    CUDA:   Memory copies between host and device may incur a performance hit due to bandwidth and latency; threads should run in groups of at least 32 for best performance; valid C/C++ may not always compile with nvcc; C++ RTTI is not supported in device code.

The information sources for the above comparison include, but are not limited to, open-mpi.org, openmp.org, nvidia.com, and Wikipedia.org.
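To make the comparison concrete, here are minimal code sketches of each model. They are illustrative only; the file names, sizes, and build commands are assumptions that will vary with your installation. First, MPI broadcasting (1 to M): rank 0 sends a value to every other rank in the communicator.

    /* hello_mpi.c -- rank 0 broadcasts a value to all ranks (sketch). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

        if (rank == 0)
            value = 42;                         /* only the root knows it */

        /* Broadcasting (1 to M): root rank 0 sends 'value' to everyone. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d of %d received %d\n", rank, size, value);

        MPI_Finalize();
        return 0;
    }

With Open MPI or MPICH2 installed, something like "mpicc hello_mpi.c -o hello_mpi" followed by "mpirun -np 4 ./hello_mpi" should build and run it as four processes.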

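Next, OpenMP as an add-on in the compiler: a single pragma parallelizes the loop, and if -fopenmp is omitted the pragma is ignored and the very same source still builds as a serial program (the file name and the loop itself are made up for illustration).

    /* sum_omp.c -- loop parallelization with a reduction (sketch). */
    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;
        int i;

        /* Each thread gets a chunk of the loop; partial sums are combined. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += 1.0 / (i + 1);

    #ifdef _OPENMP
        printf("max threads: %d\n", omp_get_max_threads());
    #endif
        printf("sum = %f\n", sum);
        return 0;
    }

Building with "gcc -fopenmp sum_omp.c -o sum_omp" enables the pragma; the -fopenmp switch also takes care of linking libgomp.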
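Finally, a CUDA sketch of the host/device split noted in the table: data is copied between host and device (the bandwidth/latency cost), and the kernel is launched in thread blocks sized as a multiple of the 32-thread warp. The file name and sizes are arbitrary.

    /* vadd.cu -- add two vectors on the GPU (sketch). */
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void vadd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per element */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes), *c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);

        /* Host-to-device copies: this is where bandwidth/latency costs show up. */
        cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

        /* 256 threads per block, a multiple of the 32-thread warp. */
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vadd<<<blocks, threads>>>(da, db, dc, n);

        /* Copy the result back to the host. */
        cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f (expect 3.0)\n", c[0]);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(a); free(b); free(c);
        return 0;
    }

With the CUDA toolkit installed, "nvcc vadd.cu -o vadd" should compile it; nvcc hands the host code off to the system C++ compiler.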