OpenMP Tutorial 


The advent of readily available, inexpensive multi-core processors has made parallel programming more important and more accessible than ever before. Over half of all computers sold today have more than one processor. While most new computers have two CPUs, the percentage of computers with four CPUs is steadily increasing, and this trend is expected to continue well into the future. This is where OpenMP steps in. OpenMP is a portable and scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications in the Fortran 90/95 programming language (as well as in C/C++) for platforms ranging from the desktop (e.g. a Linux desktop) to the supercomputer. The OpenMP approach is a driving force behind the push of parallel programming into the mainstream. It was adopted as an informal standard in 1997 by computer scientists who wanted a unified model on which to base programs for shared-memory systems. OpenMP is now used by many software developers; it offers significant advantages over both hand-threading and MPI (Message Passing Interface). Relatively recently, the so-called Cluster OpenMP has been introduced into the world of parallel programming by Intel Corporation. It is intended to extend OpenMP to distributed-memory systems (e.g. supercomputers) and to provide an alternative to the rather complicated MPI. Finally, the trends of the past 20 years toward ever faster networks, distributed systems, and multi-processor computer architectures (even at the desktop level) clearly show that parallelism is the future of computing. The latest OpenMP specifications can be found here.

The combination of a sophisticated Fortran compiler (e.g. Intel Fortran Compiler for Linux) with OpenMP API (application programming interface) support gives a Fortran 90/95 developer an excellent opportunity for high-end engineering software development. An OpenMP implementation allows for a decrease in the execution time of complex numerical codes (e.g. various full-wave electromagnetic models, wire-antenna theory models, finite element analysis tools, etc.) on modern shared-memory multi-core (and multi-processor) computer architectures. A significant decrease in the execution time of existing Fortran 90/95 programs can be achieved with a relatively modest investment in source code reorganization. This reorganization of the existing source code is implemented through a streamlined OpenMP approach. Additionally, new Fortran 90/95 programs can be written with the OpenMP approach in mind. This is a novel way of developing sophisticated, numerically intensive computer programs, and it is here to stay. The Fortran 90/95 programming language acknowledges this fact, embraces it, and grows in importance from this symbiosis. It is well known that Fortran 90/95 programs are true number-crunching beasts and that the language is ideally suited to the development of computationally intensive programs that involve complex numerical methods. The addition of OpenMP support to Fortran 90/95 makes it the No. 1 choice as a platform (programming language) for engineering software development. Only the C/C++ programming languages come close in this regard, mostly due to their widespread adoption (not ease of use or computational efficiency).

Introduction to Parallel Programming


An excellent overview of parallel programming (computing) methodology can be found here. A basic outline of the parallel programming paradigm is given next. Traditionally, software has been written for so-called serial computation, to be run on a single computer having a single CPU. A problem is broken into a discrete series of instructions, which are executed one after another (i.e. in series). In parallel computing, one uses multiple compute resources (e.g. multiple CPUs) to solve a given problem. Here a problem is broken into discrete parts that can be solved concurrently. Each part is further broken down into a series of instructions, and the instructions from each independent part execute simultaneously on different CPUs. The compute resource can be a single computer with multiple processors, an arbitrary number of computers connected by a network, or a combination of both. The computational problem, on the other hand, usually has two characteristics: it can be broken apart into discrete pieces of work that can be solved simultaneously, and it can be solved in less time with multiple compute resources than with a single one. Depending on the available compute resources and the characteristics of the computational problem, one proceeds with an appropriate solution method. In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Moreover, many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. Hence, considering the problem at hand and the available computational resources, one can determine the most efficient approach to the problem's parallel solution. In that regard one can define two main approaches to parallel computation: shared memory parallel programming and distributed memory parallel programming. This classification follows the classification of parallel computer architectures. A more in-depth analysis of parallel computation procedures and a wider theoretical background of the parallel computing paradigms can be found here.

Shared memory architecture

Shared memory parallel computers vary widely, but generally have in common the ability of all processors to access all memory as a global address space. Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors. Data sharing between tasks is both fast and uniform due to the proximity of memory to the CPUs. The primary disadvantage of shared memory parallel computers is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory - CPU path and, for cache coherent systems, geometrically increase traffic associated with cache / memory management. Also, it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors. In the shared memory parallel programming model, tasks share a common address space, which they read and write asynchronously. Various mechanisms such as locks / semaphores may be used to control access to the shared memory. An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks. Program development can often be simplified. This is where OpenMP comes into focus. OpenMP is an Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism. It comprises three primary API components: compiler directives, run-time library routines, and environment variables. It provides a standard among a variety of shared memory architectures / platforms and operating systems.

Parallel memory architecture

Like shared memory systems, distributed memory systems vary widely but share a common characteristic: they require a communication network to connect inter-processor memory. Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors. Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. When a processor needs access to data held by another processor, it is usually the task of the programmer to define explicitly how and when the data is communicated. Synchronization between tasks is likewise the programmer's responsibility. The main disadvantage of this approach is that the programmer is responsible for many of the details associated with data communication between processors, which can result in excessive time invested in programming. On distributed memory systems, so-called message passing models are used to obtain efficient parallel programming solutions. From a programming perspective, message passing implementations (such as MPI, the Message Passing Interface) commonly comprise a library of subroutines that are embedded in the source code. The programmer is responsible for determining all parallelism. Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines. Tasks exchange data by sending and receiving messages.

An excellent seven-part video tutorial series, An Introduction to Parallel Programming, is available from Sun Microsystems. Here are the links for convenience:
  1. Performance Tuning
  2. Multicore Processor Architectures
  3. Parallel Architectures
  4. Parallel Programming Basics
  5. Parallel Programming Models - Distributed Memory & MPI
  6. Parallel Programming Models - Shared Memory, Auto Parallel & OpenMP
  7. Hybrid Programming Module and What's Next

Other parallel programming models besides those previously mentioned certainly exist, and will continue to evolve along with the ever changing world of computer hardware and software. Here, we will concentrate exclusively on the shared memory parallel programming model, implemented through the OpenMP approach.

Introduction to OpenMP


An excellent introduction to the OpenMP parallel programming model can be found here. There is also an excellent introduction to the usage of the OpenMP API, which complements the OpenMP specifications and explains the usage of the API routines in greater detail. It is titled Parallel Programming in Fortran 95 using OpenMP and can be accessed here. The official OpenMP specifications (Fortran 90/95 and C/C++ versions) can be downloaded here.

OpenMP is based upon the existence of multiple threads in the shared memory programming paradigm. This means that a single process can have multiple, concurrent execution paths called threads. Hence, a shared memory parallel process consists of multiple threads running simultaneously. It is an explicit (not automatic) programming model, offering the programmer full control over parallelization. OpenMP uses the so-called "fork and join" model of parallel execution. All OpenMP programs begin as a single process (as serial programs do), called the master thread. The master thread executes sequentially until the first parallel region construct is encountered. Portions of the serial code are parallelized with the use of certain OpenMP constructs (API routines and compiler directives), thereby creating so-called parallel regions. When the master thread encounters a parallel region, it creates a team of parallel threads, called work threads. The statements in the (serial) program that are enclosed by the parallel region construct are then executed in parallel among the various team / work threads. When the team (work) threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread. The master thread then continues executing the remaining program statements in a standard serial fashion until it encounters another parallel region, where everything previously described repeats.

Fork and Join model
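
The fork and join behaviour described above can be seen in a few lines of Fortran. The following is only a minimal sketch; the program name fork_join_demo and the variable id are chosen for illustration, and the number of threads (and the order of the printed lines) depends on the system and the run-time settings.

```fortran
program fork_join_demo
   use omp_lib              ! interfaces of the OpenMP run-time routines
   implicit none
   integer :: id

   print *, 'Serial part: only the master thread runs here.'

   ! Fork: the master thread creates a team of threads.
   !$OMP PARALLEL PRIVATE(id)
   id = omp_get_thread_num()
   print *, 'Hello from thread', id, 'of', omp_get_num_threads()
   !$OMP END PARALLEL
   ! Join: the team synchronizes and only the master thread continues.

   print *, 'Serial part again: back to a single thread.'
end program fork_join_demo
```

Compiled without OpenMP support, the directives are plain comments, but the calls to the run-time routines would then have to be removed or guarded for the program to build.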

Most OpenMP parallelism is specified through compiler directives, which are embedded in the existing (Fortran 90/95) source code. This provides flexibility and high efficiency in writing parallel computer programs. In fact, almost every piece of existing Fortran 90/95 (and even Fortran 77) source code can be parallelized through the OpenMP approach with little additional effort. This is the main advantage of using OpenMP. It can additionally be combined with sophisticated compiler features such as auto-parallelization, auto-vectorization, loop unrolling, vector pipelining, etc. This tutorial concentrates exclusively on using OpenMP in the Fortran 90/95 programming language.

Basic OpenMP API Overview


All Fortran 90/95 OpenMP directives must begin with a sentinel. In free format source code a sentinel can appear in any column, but it may be preceded by white space only. The Fortran 90/95 OpenMP sentinel is !$OMP. The sentinel is directly followed by a reserved directive name and optional clauses. Clauses can be in any order, and can be repeated as necessary unless otherwise restricted by the standard. A valid OpenMP directive name must appear after the sentinel and before any clauses. Continuation lines, if used, must have an ampersand as the last non-blank character of the line being continued; the following line must begin with a sentinel, followed by the continuation of the directive. According to the OpenMP specifications, a comment started with an exclamation mark may follow a directive on the same line. Additionally, only one directive-name may be specified per directive. The general form of an OpenMP directive, with some of the clauses accepted by the PARALLEL directive, is given below.

!$OMP PARALLEL [clause ...]
               DEFAULT (PRIVATE | SHARED | NONE)
               FIRSTPRIVATE (var_list)
               PRIVATE (var_list)
               SHARED (var_list)
               REDUCTION (operator: list)
               ...

   block of code to be parallelized

!$OMP END PARALLEL


The block of code enclosed by the OpenMP directive-pair is executed in parallel by the team of threads. The number of threads executing the parallel region can be controlled by the user, even at runtime. Source code before and after the OpenMP directives is executed serially, by the master thread. The block of code placed directly between the two directives !$OMP PARALLEL and !$OMP END PARALLEL is said to be in the lexical extent of the directive-pair. The code included in the lexical extent plus all the code called from inside the lexical extent (e.g. in subroutines called from the parallel region) is said to be in the dynamic extent of the directive-pair.
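
A short sketch may help to tell the two extents apart and to show a continued directive line. The names work_mod, report and extent_demo are hypothetical, and the thread count of 2 is an arbitrary choice made for illustration.

```fortran
module work_mod
   use omp_lib
   implicit none
contains
   subroutine report()
      ! This code is called from inside the parallel region: it lies in
      ! the dynamic extent, but not in the lexical extent, of the pair.
      print *, 'report() executed by thread', omp_get_thread_num()
   end subroutine report
end module work_mod

program extent_demo
   use omp_lib
   use work_mod
   implicit none

   ! The number of threads can be requested at runtime, before the region.
   call omp_set_num_threads(2)

   ! A directive continued over two lines: '&' ends the first line and
   ! the continuation line starts with the sentinel again.
   !$OMP PARALLEL &
   !$OMP DEFAULT(NONE)
   call report()      ! this call statement is in the lexical extent
   !$OMP END PARALLEL
end program extent_demo
```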

Four different groups of OpenMP directives or constructs exist. Each group has a different aim, and the selection of one directive or another inside the same group depends on the nature of the problem to be solved. Therefore, it is good to understand the principles of each of these directives in order to make the correct choices. The most important group of OpenMP directives aims to divide a given piece of work into parts and to give one or more of these parts to each thread running in parallel. In this way the work, which would be done by a single thread in a serial program, is distributed over a team of threads, yielding a faster running program. This is the cornerstone of the OpenMP approach. All work-sharing constructs have an implied synchronization in their closing directives. This is in general necessary to ensure that all the information required by the code following the work-sharing construct is up to date. A brief description of the work-sharing constructs follows:

!$OMP DO / !$OMP END DO  -  This directive-pair causes the immediately following do-loop to be executed in parallel,
!$OMP SECTIONS / !$OMP END SECTIONS  -  This directive-pair allows a completely different task to be assigned to each thread (each section of code is executed only once, by one thread in the team),
!$OMP SINGLE / !$OMP END SINGLE  -  The code enclosed in this directive-pair is executed by only one of the threads in the team, namely the one that arrives first,
!$OMP WORKSHARE / !$OMP END WORKSHARE  -  This work-sharing construct targets special Fortran 95 array features, such as array notation expressions or forall and where statements (in this case no explicit do-loops are visible).
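
The four work-sharing constructs listed above can be tried out inside a single parallel region. The sketch below is illustrative only: the array names a, b, c, the size n and the program name are assumptions, and which thread executes a given section or the single block will vary from run to run.

```fortran
program worksharing_demo
   use omp_lib
   implicit none
   integer, parameter :: n = 1000
   real :: a(n), b(n), c(n)
   integer :: i

   b = 1.0

   !$OMP PARALLEL DEFAULT(NONE) SHARED(a, b, c) PRIVATE(i)

   ! DO: the iterations of the loop are divided among the threads.
   !$OMP DO
   do i = 1, n
      a(i) = 2.0 * b(i)
   end do
   !$OMP END DO

   ! SECTIONS: each section is executed once, by one thread of the team.
   !$OMP SECTIONS
   !$OMP SECTION
   print *, 'section 1 run by thread', omp_get_thread_num()
   !$OMP SECTION
   print *, 'section 2 run by thread', omp_get_thread_num()
   !$OMP END SECTIONS

   ! SINGLE: executed by whichever thread arrives first.
   !$OMP SINGLE
   print *, 'single block run by thread', omp_get_thread_num()
   !$OMP END SINGLE

   ! WORKSHARE: array syntax, no explicit do-loop is visible.
   !$OMP WORKSHARE
   c = a + b
   !$OMP END WORKSHARE

   !$OMP END PARALLEL

   print *, 'c(1) =', c(1), ' c(n) =', c(n)   ! both are 3.0
end program worksharing_demo
```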

Additionally, OpenMP provides so-called combined parallel work-sharing constructs, which are shortcuts for specifying a parallel region that contains only one work-sharing construct. The following combined parallel work-sharing constructs exist:

!$OMP PARALLEL DO / !$OMP END PARALLEL DO
!$OMP PARALLEL SECTIONS / !$OMP END PARALLEL SECTIONS
!$OMP PARALLEL WORKSHARE / !$OMP END PARALLEL WORKSHARE
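
As a minimal sketch of a combined construct, the loop below is parallelized with a single PARALLEL DO directive-pair instead of a DO construct nested in a PARALLEL region. Array names, sizes and values are arbitrary choices made for illustration.

```fortran
program parallel_do_demo
   implicit none
   integer, parameter :: n = 10000
   real :: a(n), b(n)
   integer :: i

   b = 4.0

   ! One directive opens the parallel region and shares the loop.
   !$OMP PARALLEL DO DEFAULT(NONE) SHARED(a, b) PRIVATE(i)
   do i = 1, n
      a(i) = 2.5 * b(i)
   end do
   !$OMP END PARALLEL DO

   print *, 'a(1) =', a(1)   ! 10.0
end program parallel_do_demo
```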

In certain cases it is not possible to leave each thread on its own, and it is necessary to bring the threads back into order. This is generally achieved through synchronization of the threads. Synchronization can be implied, like the one present in the closing directives of the work-sharing constructs, or explicit, requested by the programmer through dedicated directives. Hence, synchronization constructs form another group of OpenMP directives. A brief outline of the synchronization constructs follows:

!$OMP MASTER / !$OMP END MASTER  -  The code enclosed inside this directive-pair is executed only by the master thread of the team (meanwhile, all the other threads continue with their work),
!$OMP CRITICAL / !$OMP END CRITICAL  -  This directive-pair restricts access to the enclosed code to only one thread at a time; in this way the enclosed operations (e.g. updates of shared data) cannot interfere with one another,
!$OMP BARRIER  -  This directive represents an explicit synchronization between the different threads in the team; when it is encountered, each thread waits until all the other threads have reached this point,
!$OMP ATOMIC  -  When a variable can be modified by all threads in a team, it is necessary to ensure that only one thread at a time is writing / updating the memory location of that variable, otherwise unpredictable results will occur; this directive provides that protection for the single update statement that follows it,
!$OMP FLUSH  -  This directive is placed at the precise point in the code at which (explicit) data synchronization is required, so that the threads see up-to-date values of the shared variables.
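
A small sketch of these synchronization constructs at work is given below. The counters hits and total and the loop length are hypothetical; the value of total depends on how many threads form the team.

```fortran
program sync_demo
   use omp_lib
   implicit none
   integer :: hits, total, i

   hits  = 0
   total = 0

   !$OMP PARALLEL DEFAULT(NONE) SHARED(hits, total) PRIVATE(i)

   !$OMP DO
   do i = 1, 1000
      ! ATOMIC protects the single update statement that follows it.
      !$OMP ATOMIC
      hits = hits + 1
   end do
   !$OMP END DO

   ! CRITICAL: only one thread at a time executes the enclosed code.
   !$OMP CRITICAL
   total = total + omp_get_thread_num() + 1
   !$OMP END CRITICAL

   ! BARRIER: wait until every thread has added its contribution.
   !$OMP BARRIER

   ! MASTER: only the master thread prints the results.
   !$OMP MASTER
   print *, 'hits  =', hits     ! 1000, one per loop iteration
   print *, 'total =', total    ! 1 + 2 + ... + (number of threads)
   !$OMP END MASTER

   !$OMP END PARALLEL
end program sync_demo
```

Without the ATOMIC directive the updates of hits could race with one another and the printed value might be smaller than 1000.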

Another set of OpenMP directives (or clauses) is meant for controlling the data environment during parallel execution. They specify how each variable is handled, who is allowed to see its value, and who may change it. This ensures consistency among data (variables) in parallel regions. Not all of the data scope attribute clauses are allowed on all directives; the clauses that are valid on a particular directive are indicated in the OpenMP specifications. The data scope clauses are:

!$OMP THREADPRIVATE (list)  -  This directive declares global variables whose values are specific to each thread,
COPYIN (list)  -  This clause specifies that variables which have been declared as THREADPRIVATE will have their values in each thread set equal to the value in the master thread,
PRIVATE (list)  -  This clause determines which variables are going to be considered as local variables of each thread,
SHARED (list)  -  This clause determines which variables should be available to all threads inside the scope of a directive-pair (because their values are needed by all threads or because all threads have to update their values),
DEFAULT (PRIVATE | SHARED | NONE)  -  This clause determines the default setting for variables in a parallel region that are not listed in any other data scope clause (with NONE, every variable must be listed explicitly; loop indexes need not be specified),
FIRSTPRIVATE (list)  -  This clause determines that the private variables in the list inherit the value the original variable had before the starting-directive (otherwise they would have an undefined value, which is the case with the PRIVATE clause),
LASTPRIVATE (list)  -  This clause determines that the original variables included in the list will be updated with the "last" value they get inside the scope of the associated directive-pair (i.e. the value from the sequentially last iteration or section),
COPYPRIVATE (list)  -  This clause makes it possible, after a single thread inside a parallel region has executed the code enclosed in an !$OMP SINGLE / !$OMP END SINGLE directive-pair, to broadcast the value of a private variable to the other threads in the team.
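
The effect of a few of these clauses is easiest to see on a small loop. The sketch below uses hypothetical variables (offset, last_i, a) and only demonstrates SHARED, PRIVATE, FIRSTPRIVATE and LASTPRIVATE.

```fortran
program data_scope_demo
   implicit none
   integer, parameter :: n = 100
   integer :: i, offset, last_i
   integer :: a(n)

   offset = 10
   last_i = 0

   ! offset: every thread gets a private copy initialized to 10.
   ! last_i: private inside the loop; the value from the sequentially
   !         last iteration is copied back to the original variable.
   !$OMP PARALLEL DO DEFAULT(NONE) SHARED(a) PRIVATE(i) &
   !$OMP FIRSTPRIVATE(offset) LASTPRIVATE(last_i)
   do i = 1, n
      a(i)   = i + offset
      last_i = i
   end do
   !$OMP END PARALLEL DO

   print *, 'a(n)   =', a(n)      ! 110
   print *, 'last_i =', last_i    ! 100
end program data_scope_demo
```

Had offset been listed as PRIVATE instead, its value inside the loop would have been undefined.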

There are several other clauses, such as REDUCTION clause, which are explained in the OpenMP specifications. Additionally, there are clauses which are not concerned with data scoping, but have other tasks. One such clause is concerned with thread scheduling during the DO loop executions inside a parallel region.

SCHEDULE (type, chunk)  -  This clause determines how the iterations of the parallelized do-loop are distributed among the threads of the team; the first parameter, type, specifies the way in which the work is distributed over the threads, while the second, chunk (which is optional), specifies the size of the pieces of work given to each thread; four different scheduling options exist:

STATIC  -  when this option is specified, the pieces of work created from the iteration space of the do-loop are distributed evenly among the threads of the team and stay fixed for the duration of the execution; without chunk, the number of pieces of work is equal to the number of threads in the team and all pieces are approximately equal in size; if the optional parameter chunk is specified, the pieces have that fixed size and are assigned to the threads in round-robin order,
DYNAMIC  -  when this option is specified, the pieces of work created from the iteration space of the do-loop are distributed in a dynamic way, which means that as one thread finishes its piece of work, it gets a new one; the iteration space is divided into pieces of work with a size equal to chunk,
GUIDED  -  when this option is specified, the pieces of work created from the iteration space of the do-loop are also distributed in a dynamic way, but this time the pieces of work have decreasing sizes, so that their associated work is smaller and smaller as they are assigned to the different threads; the sizes decrease approximately exponentially (the exact law depends on the implementation); the optional parameter chunk here specifies the smallest number of iterations grouped into one piece of work,
RUNTIME  -  when this option is specified, the choice of the method for dividing the iteration space of the do-loop is postponed until runtime; the user can then define / modify the division of the iteration space at runtime, through the OMP_SCHEDULE environment variable.
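
The sketch below combines the REDUCTION clause mentioned earlier with two of the scheduling options. It records, for each iteration, the number of the thread that executed it; the array owner, the size n and the chunk size of 4 are assumptions made for illustration, and the printed thread numbers depend on the size of the team.

```fortran
program schedule_demo
   use omp_lib
   implicit none
   integer, parameter :: n = 16
   integer :: i, total
   integer :: owner(n)

   total = 0

   ! STATIC with chunk 4: blocks of 4 iterations, dealt out round-robin.
   ! REDUCTION: each thread sums into a private copy of total, and the
   ! copies are combined into the original variable at the end.
   !$OMP PARALLEL DO DEFAULT(NONE) SHARED(owner) PRIVATE(i) &
   !$OMP REDUCTION(+:total) SCHEDULE(STATIC, 4)
   do i = 1, n
      owner(i) = omp_get_thread_num()
      total = total + i
   end do
   !$OMP END PARALLEL DO

   print '(a, 16i3)', ' static,4 owners:', owner
   print *, 'sum of 1..16 =', total      ! 136

   ! RUNTIME: the schedule is taken from the environment at runtime.
   !$OMP PARALLEL DO DEFAULT(NONE) SHARED(owner) PRIVATE(i) SCHEDULE(RUNTIME)
   do i = 1, n
      owner(i) = omp_get_thread_num()
   end do
   !$OMP END PARALLEL DO

   print '(a, 16i3)', ' runtime  owners:', owner
end program schedule_demo
```

For the second loop, setting the environment variable OMP_SCHEDULE to, for example, "dynamic,2" before starting the program should change the pattern of thread numbers without recompiling.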

Another group of OpenMP facilities, the routines of the OpenMP run-time library, is at the programmer's disposal. The OpenMP run-time library is meant to serve as a control and query tool for the parallel execution environment, which programmers can use from inside their programs. The run-time library is a set of external procedures with clearly defined interfaces. Here are some of them:

OMP_set_num_threads  -  This subroutine sets the number of threads to be used by subsequent parallel regions,
OMP_get_num_threads  -  This function returns the number of threads currently in the team executing the parallel region from which it is called,
OMP_get_max_threads  -  This function returns the maximum number of threads that could be used to form a team for a new parallel region,
OMP_get_thread_num  -  This function returns the identification number of the current thread within the team (the master thread has number 0),
OMP_get_num_procs  -  This function returns the number of processors available to the program,
OMP_in_parallel  -  This function tells whether a given region of the program is being executed in parallel or not, etc.

These run-time library routines are well documented within the OpenMP specifications. Among those available run-time routines are also several library procedures for assessing performance (benchmarking). Yet another group of subroutines and functions deal with so called locks. These locks represent another synchronization mechanism at the OpenMP programmer's disposal. User is advised to consult the OpenMP specifications and other above mentioned resources for more in-depth explanations on using locks and other mechanisms.
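
A brief sketch of some of these routines, including a simple lock and the wall-clock timer omp_get_wtime, is given below. The variable names and the requested team size of 2 are arbitrary, and the printed values depend on the machine.

```fortran
program runtime_demo
   use omp_lib
   implicit none
   integer(kind=omp_lock_kind) :: lock
   integer :: counter
   double precision :: t0, t1

   print *, 'processors available :', omp_get_num_procs()
   print *, 'max threads          :', omp_get_max_threads()
   print *, 'in parallel region?  :', omp_in_parallel()

   call omp_set_num_threads(2)   ! request two threads for later regions
   call omp_init_lock(lock)
   counter = 0

   t0 = omp_get_wtime()
   !$OMP PARALLEL DEFAULT(NONE) SHARED(lock, counter)
   ! Only the thread holding the lock may update the shared counter.
   call omp_set_lock(lock)
   counter = counter + 1
   call omp_unset_lock(lock)
   !$OMP END PARALLEL
   t1 = omp_get_wtime()

   call omp_destroy_lock(lock)

   print *, 'threads that took the lock :', counter
   print *, 'elapsed wall-clock seconds :', t1 - t0
end program runtime_demo
```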

Simple OpenMP Example:

!$OMP PARALLEL DEFAULT(NONE) &
!$OMP SHARED(A,B,C,D,CHUNK) PRIVATE(id,num_threads)

   id = OMP_get_thread_num()
   if (id == 0) then
      num_threads = OMP_get_num_threads()
      print *, 'Number of threads =', num_threads
   end if

   print *, 'Thread',id,' starting ...'

!$OMP DO SCHEDULE(DYNAMIC,256)

   do i = 1,10000
      A(i) = 2.5 * B(i)
   end do

!$OMP END DO NOWAIT

   print *, 'Thread',id,' is done with the first do-loop.'

!$OMP SINGLE

   print *, ' Enter a value for CHUNK'
   read(*,*) CHUNK

!$OMP END SINGLE

!$OMP DO SCHEDULE(STATIC,CHUNK)

   do i = 1,20000
      C(i) = 3.2 * D(i)
   end do

!$OMP END DO

!$OMP WORKSHARE

   D = A + B

!$OMP END WORKSHARE
!$OMP END PARALLEL

In the above example, a parallel region is first constructed with a PARALLEL directive. Its DEFAULT(NONE) clause states that the treatment of all variables in the parallel region needs to be declared explicitly. This is done on the continuation line of the directive, where the arrays and CHUNK are declared SHARED while id and num_threads are declared PRIVATE. Note that a line continuation symbol (&) needs to be present at the end of the first line and that the continuation line starts with the sentinel again. The data scope clauses belong on the PARALLEL directive; the DO and WORKSHARE directives do not accept a SHARED clause. Do-loop indices are excluded from the explicit declaration (they are always private to each thread). An example of the usage of the run-time library routines is given at the opening of the parallel region. Next, a work-sharing construct is introduced in order to parallelize the following do-loop. Here, the distribution of the do-loop iterations is specified to be DYNAMIC, with pieces of 256 iterations. A NOWAIT clause at the closing directive of the first do-loop avoids the implied synchronization, which is not needed in this case. Next follows a SINGLE directive, which encloses a region of code executed by a single thread (the one that arrives first at this point). This usage of the SINGLE directive is common when one needs to manage I/O operations from inside the parallel region. The second do-loop work-sharing construct is carried out with a STATIC schedule, now with a chunk size entered by the user at runtime. There is an implied data synchronization at the closing directive of the second do-loop work-sharing construct. The third region uses a WORKSHARE directive to parallelize the array addition (the arrays involved must of course be conformable). Here, an implied do-loop is present in the statement, which might not be seen by a compiler. Again, this work-sharing construct is closed with the appropriate closing directive, where an implied data synchronization is present, and the parallel region is closed after that.
This same example could have been written in a different manner as well, by the use of combined parallel work-sharing constructs and splitting-up the source code in three distinct parallel regions.
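
A possible form of that alternative version is sketched below as a complete program, so that it can be compiled and run. It is not the original code: the declarations, the common array size n, the initial values and the fixed value of chunk (which the example above reads from the keyboard) are all assumptions made for illustration.

```fortran
program combined_version
   implicit none
   integer, parameter :: n = 20000
   real :: A(n), B(n), C(n), D(n)
   integer :: i, chunk

   B = 1.0
   D = 2.0
   chunk = 512       ! fixed here instead of being read at runtime

   ! First parallel region: dynamic schedule, pieces of 256 iterations.
   !$OMP PARALLEL DO DEFAULT(NONE) SHARED(A, B) PRIVATE(i) SCHEDULE(DYNAMIC, 256)
   do i = 1, n
      A(i) = 2.5 * B(i)
   end do
   !$OMP END PARALLEL DO

   ! Second parallel region: static schedule with the chosen chunk size.
   !$OMP PARALLEL DO DEFAULT(NONE) SHARED(C, D, chunk) PRIVATE(i) SCHEDULE(STATIC, chunk)
   do i = 1, n
      C(i) = 3.2 * D(i)
   end do
   !$OMP END PARALLEL DO

   ! Third parallel region: array syntax shared out by WORKSHARE.
   !$OMP PARALLEL WORKSHARE DEFAULT(NONE) SHARED(A, B, D)
   D = A + B
   !$OMP END PARALLEL WORKSHARE

   print *, A(1), C(1), D(1)   ! approximately 2.5, 6.4 and 3.5
end program combined_version
```

Each combined construct forks and joins its own team, so this form trades a little extra overhead for simpler directives.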

OpenMP Compiler Switches


All Fortran 90/95 OpenMP directives start with the sentinel !$OMP, which indicates to an OpenMP compliant Fortran compiler that the code following it on that source line is an OpenMP statement / directive. For a regular (non-OpenMP compliant) Fortran compiler this line is simply a comment line (since it starts with an exclamation mark). Most modern Fortran compilers, such as Intel Fortran Compiler for Linux and SunStudio 12 Fortran Compiler, include OpenMP support. In order for the compiler to recognize the OpenMP sentinels and produce a parallel executable, the programmer needs to specify certain compiler switches, typically in the associated makefile. Hence, one and the same source code file can be compiled to produce a serial program as well as a parallel version of the same program, depending on the compiler switches used. More information on Fortran makefiles can be found here. In this way, existing serial Fortran programs can be easily parallelized, while the serial version is retained and remains fully functional.
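
Keeping one source file for both builds is easier with the conditional compilation sentinel !$ defined by the OpenMP specifications (it is not discussed elsewhere in this tutorial). A line starting with !$ followed by a space is compiled only when OpenMP support is switched on and is an ordinary comment otherwise. The following is a minimal sketch with a hypothetical variable nthreads.

```fortran
program serial_or_parallel
   !$ use omp_lib
   implicit none
   integer :: nthreads

   ! Value used by the serial build.
   nthreads = 1

   ! The next line is compiled only when OpenMP support is enabled.
   !$ nthreads = omp_get_max_threads()

   print *, 'This build may use up to', nthreads, 'thread(s).'
end program serial_or_parallel
```

Built without the OpenMP switch, the program reports one thread; built with it, the reported number depends on the system.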

For example, Intel Fortran Compiler for Linux (ifort) uses the -openmp switch (newer releases of the compiler spell this switch -qopenmp). Hence, one would use, for example, the following command:

ifort -o program -O3 -xP -openmp program.f90

to obtain a parallel version of the program (found in the program.f90 source code file). In the above line, several other compiler switches have been used as well. The first switch (-o) names the executable program. Next, optimization of the code is requested with the -O3 switch, followed by another optimization targeted at the specific platform (the -xP switch). The optimizations included in the -O3 switch, among others, try to vectorize do-loops found in the source code, which can result in significant improvements in execution speed. The last compiler switch (-openmp) makes the compiler recognize the OpenMP parallel regions, introduced into the source code by the programmer, and parallelize them.

The programmer might also try the compiler's auto-parallelization option to obtain a parallel version of the executable. In the case of Intel Fortran Compiler for Linux this option (-parallel) tries to automatically parallelize do-loops found in the source code. If the do-loops are properly structured, this is very possible. There is no OpenMP here! Modern Fortran compilers do an excellent job of identifying do-loops and auto-parallelizing them. Hence, one might build a program with the following command:

ifort -o program -O3 -xP -parallel program.f90

SunStudio 12 Fortran Compiler for Linux (f95) also uses the -openmp switch, which is a synonym for the -xopenmp switch (both are correct). Hence, one would in this case use, for example, the following command:

f95 -o program -O3 -xopenmp program.f90

SunStudio 12 Fortran Compiler for Linux also supports auto-parallelization, with the -autopar switch (the -parallel switch is said to be obsolete). Hence, one might write the following compiler command:

f95 -o program -O3 -autopar program.f90

The do-loops presented in the above simple example would be automatically parallelized by the compiler (-parallel switch) even without OpenMP intervention. Hence, OpenMP should be reserved for the more difficult cases, where the compiler fails to automatically identify and parallelize loops or other parts of the source code. The programmer can intervene in these cases with OpenMP directives and force the compiler (-openmp switch) to parallelize these regions of source code. This is especially important for the advanced array features inherent in the Fortran 90/95 programming language: array sections (colon notation) and the forall and where constructs, which were introduced into the language with parallelization in mind. This is where the WORKSHARE OpenMP work-sharing construct comes in handy. It is a unique OpenMP construct reserved for the Fortran language; it does not exist for C/C++. At the same time, the C/C++ programming languages have inferior capabilities when it comes to array treatment.

Performance Analysis Tools


Fortran 90/95 developers on Linux platforms have various excellent tools (free for non-commercial use) for the performance analysis of their software. Using these tools would be a first step in the development of parallel software. Firstly, one must acknowledge the fact that modern Fortran compilers are very sophisticated. They include automatic vectorization, automatic parallelization, loop transformations and other optimization procedures. All this, combined with the fine-tuned Math Kernel Library (MKL) and other optimized library packages, results in very fast executables on Linux machines. Additional speed-ups can be obtained by using the OpenMP API. A nice overview of the stages which are necessary for the optimization of parallel software can be found here. The first step in fine-tuning the source code is to use compiler switches which allow sophisticated optimizations for the targeted architecture. The next step is performance profiling of the application using the performance analysis tools. Here, one can choose from a variety of Linux tools that are free for non-commercial use. One excellent choice is the Intel VTune Performance Analyzer for Linux. It can be obtained free of charge here. It has a GUI integrated into the Eclipse platform. Another excellent tool (also free for non-commercial use) is the Performance Profiler bundled with SunStudio 12 for Linux, developed by Sun Microsystems. (SunStudio 12 is an excellent full-featured IDE for Fortran 90/95 on the Linux platform and can be obtained here.)

A performance analysis tool gives the programmer insight into application performance from the execution-time perspective. It can identify application bottlenecks, i.e. pieces of code (functions, subroutines) where the application spends the majority of its runtime. Those parts are the natural candidates for parallelization with OpenMP. Some gain can be expected from running code in parallel, but numerous issues need to be accounted for, such as the overheads of data management and of communication between threads. Certain problems will profit more from the OpenMP approach than others. The underlying architecture also has a significant impact on the obtained performance, e.g. the number of available multi-core processors and the number of (hardware) threads per processor.
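Before (or alongside) a full profiler run, a suspected bottleneck can be timed by hand. The following is only a rough sketch under stated assumptions: the program name, the array x, its size and the summation loop are hypothetical stand-ins for a real hot spot. It uses the standard OpenMP library routines omp_get_wtime and omp_get_max_threads from the omp_lib module, so it has to be compiled with OpenMP support enabled.

```fortran
program timing_sketch
  use omp_lib                           ! omp_get_wtime, omp_get_max_threads
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5000000     ! hypothetical problem size
  real(dp), allocatable :: x(:)
  real(dp) :: t0, t_serial, t_parallel, s1, s2
  integer :: i

  allocate(x(n))
  do i = 1, n
     x(i) = 1.0_dp / real(i, dp)
  end do

  ! Baseline: serial version of the (hypothetical) hot loop.
  t0 = omp_get_wtime()
  s1 = 0.0_dp
  do i = 1, n
     s1 = s1 + sqrt(x(i))
  end do
  t_serial = omp_get_wtime() - t0

  ! Same loop, work-shared among threads with a sum reduction.
  t0 = omp_get_wtime()
  s2 = 0.0_dp
  !$omp parallel do reduction(+:s2)
  do i = 1, n
     s2 = s2 + sqrt(x(i))
  end do
  !$omp end parallel do
  t_parallel = omp_get_wtime() - t0

  print *, 'max threads        :', omp_get_max_threads()
  print *, 'serial time   [s]  :', t_serial
  print *, 'parallel time [s]  :', t_parallel
  print *, 'relative sum diff. :', abs(s1 - s2) / abs(s1)

  deallocate(x)
end program timing_sketch
```

The two sums may differ in the last digits because the reduction adds the terms in a different order. For small n the parallel version can even be slower than the serial one, which is exactly the thread-management overhead mentioned above; the number of threads can be varied through the OMP_NUM_THREADS environment variable.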

[Figure: Application tuning chart]

The most important thing to keep in mind when optimizing an application is to follow a systematic and organized approach. Several distinct steps can be identified in an application tuning methodology, and the process that goes through these steps is iterative in nature. It can be illustrated graphically with a chart. The first step is to gather performance data, as described previously, and to analyze those data to identify opportunities for improvement. This can often be carried out with tools such as the Intel VTune Performance Analyzer, to see where time is being spent, or the Intel Thread Profiler, to discover threading inefficiencies. The SunStudio 12 performance profiler, for example, could be used instead.

Once the trouble spots have been identified, one needs to generate alternatives for resolving the issue found, and to select the fix that will be applied in the current iteration. The programmer should try to solve only one issue (usually the major one found) in each iteration cycle. Otherwise, it may be impossible to tell which change had what impact on performance; this is crucial when an implemented fix degrades performance, which can happen. The fix can range from trivial to very complicated, depending on the situation at hand. Once you have decided what to change in the code and have noted it in your project record, make the appropriate changes to the source code. This is the Implement Enhancement step of the process (on the chart). Next, test the results of the implemented changes. Collect additional data, compare it to the baseline measurement, and take the time to understand the results, backtracking if necessary. This brings the cycle back to the beginning. From here, follow the cycle into the second iteration, and so on.

Creating a methodology and applying it to each individual performance tuning project requires taking this generic sequence of steps and determining how it should be realized in each specific case. The majority of the work will go into implementing the enhancements (i.e. fixing problems, resolving bottlenecks, etc.) indicated by the performance analyzer. Hence, a performance analyzer is a valuable, if not indispensable, tool for optimizing the performance of parallel applications.

Final Remarks


It is often argued that the OpenMP approach is limited to so-called "fine grain" parallelization. This is somewhat true in practice, but it is not a limitation of the OpenMP specification per se. It is up to the programmer to try to achieve so-called "coarse grain" parallelization. Much of the tedious work of parallelization can be left to the compiler itself (auto-parallelization through compiler switches), so that the programmer can concentrate on higher-level parallelization. This might even include developing novel algorithms which are more parallel-friendly (or easier to parallelize). Such algorithms have been developed, for example, for the FDTD (Finite-Difference Time-Domain) approach to the solution of Maxwell's equations. New ways of thinking about solving engineering problems might be needed, ways which include parallel notions from the beginning. Engineering software development is going through an important revision process, which from now on relies heavily on parallel programming paradigms.
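One small step toward coarser granularity is to open a single parallel region around a whole time-stepping loop, instead of creating and destroying a thread team for every inner loop. The sketch below illustrates this idea on a one-dimensional leapfrog update loosely modelled on an FDTD scheme. It is not a real FDTD solver: the array names e and h, the grid size, the number of steps, the coefficient 0.5 and the initial pulse are all hypothetical and in normalized units.

```fortran
program coarse_grain_sketch
  implicit none
  integer, parameter :: n = 2000        ! hypothetical grid size
  integer, parameter :: nsteps = 500    ! hypothetical number of time steps
  real :: e(n), h(n)
  integer :: i, step

  e = 0.0
  h = 0.0
  e(n/2) = 1.0                          ! hypothetical initial pulse

  ! One parallel region encloses the entire time loop: the thread team
  ! is created once, and every thread executes all time steps.
  !$omp parallel private(step)
  do step = 1, nsteps
     ! Update h from e; iterations are shared among the threads.
     !$omp do
     do i = 1, n - 1
        h(i) = h(i) + 0.5 * (e(i+1) - e(i))
     end do
     !$omp end do
     ! The implicit barrier above guarantees that all h values are
     ! updated before any thread reads them in the e update below.
     !$omp do
     do i = 2, n
        e(i) = e(i) + 0.5 * (h(i) - h(i-1))
     end do
     !$omp end do
     ! Implicit barrier again, before the next time step reads e.
  end do
  !$omp end parallel

  print *, 'sum(e) =', sum(e), ' sum(h) =', sum(h)
end program coarse_grain_sketch
```

The design choice here is to keep the implicit barriers of both work-shared loops, since each update depends on neighbouring values written in the previous loop. Truly coarse-grained designs go further, for instance by assigning each thread its own subdomain, but that requires restructuring the algorithm rather than just adding directives.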

Hybrid architectures, built from dozens to several hundreds of multi-core processor (and/or multi-processor) machines (SMPs) interconnected into a single large system (cluster), require new ways of developing parallel software. The so-called MPI approach has been acknowledged as the de facto standard on such distributed-memory systems. Relatively recently, a so-called Cluster OpenMP approach has been put forth by Intel. Cluster OpenMP is an implementation of OpenMP that can make use of multiple SMP machines without resorting to MPI. This has the advantage of eliminating the need to write explicit messaging code, as well as of not mixing programming paradigms. The shared memory in Cluster OpenMP is maintained across all machines through a distributed shared-memory subsystem. This approach will hopefully reduce the programmer's overhead in developing parallel applications on (Linux) cluster systems.