Fortran 90/95 & Linux OS
Fortran 90/95
development on Linux
OpenMP Tutorial
The advent of readily available, inexpensive multi-core processors has made parallel programming more important and more accessible than ever before. Over half of all computers sold today have more than one processor. While most new computers have two CPUs, the percentage of computers with four CPUs is steadily increasing, and this trend will continue well into the future. This is where OpenMP steps in. OpenMP is a portable and scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications in the Fortran 90/95 programming language (as well as in C/C++) for platforms ranging from the desktop (e.g. a Linux desktop) to the supercomputer. The OpenMP approach is a driving force behind the push of parallel programming into the mainstream. It was adopted as an informal standard in 1997 by computer scientists who wanted a unified model on which to base programs for shared memory systems. OpenMP is now used by many software developers; it offers significant advantages over both hand-threading and MPI (Message Passing Interface). Relatively recently, the so-called Cluster OpenMP has been introduced into the world of parallel programming by the Intel Corporation. It is meant to extend OpenMP to distributed memory systems (e.g. supercomputers) and to provide an alternative to the rather complicated MPI. Finally, the trends of the past 20 years toward ever faster networks, distributed systems, and multi-processor computer architectures (even at the desktop level) clearly show that parallelism is the future of computing. The latest OpenMP specifications can be found here.
The combination of a sophisticated Fortran compiler (e.g. Intel Fortran Compiler for Linux) and OpenMP API (application programming interface) support gives a Fortran 90/95 developer an excellent opportunity for high-end engineering software development. An OpenMP implementation allows for a decrease in the execution time of complex numerical codes (e.g. various full-wave electromagnetic models, wire-antenna theory models, finite element analysis tools, etc.) on modern shared-memory multi-core (and multi-processor) computer architectures. A significant decrease in the execution time of existing Fortran 90/95 programs can be achieved with a relatively modest investment in source code reorganization. This reorganization of the existing source code is implemented through a streamlined OpenMP approach. Additionally, new Fortran 90/95 programs can be written with the OpenMP approach in mind. This is a novel way of developing sophisticated, numerically intensive computer programs, and it is here to stay. The Fortran 90/95 programming language acknowledges this fact, embraces it and grows in importance from this symbiosis.
It is a well-known fact that Fortran 90/95 programs are true number-crunching beasts, and that this programming language is ideally suited for the development of computationally intensive computer programs which involve complex numerical methods. The addition of OpenMP support to Fortran 90/95 makes it the No. 1 choice as a platform (programming language) for engineering software development. Only the C/C++ programming language can come close in this regard, mostly due to its widespread adoption (not ease of use or computational efficiency).
Introduction to Parallel Programming
An excellent overview of parallel programming (computing) methodology can be found here. A basic outline of the parallel programming paradigm is given next. Traditionally, software has been written for so-called serial computation, to be run on a single computer having a single CPU. A problem is broken into a discrete series of instructions, which are executed one after another (i.e. in series). In the case of parallel computing, one uses multiple compute resources (e.g. multiple CPUs) to solve a given problem. Here a problem is broken into discrete parts which can be solved concurrently. Each part is further broken down into a series of instructions, and instructions from each independent part execute simultaneously on different CPUs. The compute resource can include a single computer with multiple processors, an arbitrary number of computers connected by a network, or a combination of both. The computational problem, on the other hand, usually demonstrates characteristics such as the ability to be broken apart into discrete pieces of work that can be solved simultaneously, and to be solved in less time with multiple compute resources than with a single compute resource. Depending on the available compute resources and the characteristics of the computational problem, one proceeds with an appropriate solution method. In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. On the other hand, many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. Hence, considering the problem at hand and the available computational resources, one can determine the most efficient approach to the problem's parallel solution. In that regard one can basically define two main approaches to parallel computation: shared memory parallel programming and distributed memory parallel programming.
This classification is carried out in line with the classification of parallel computer architectures. A more in-depth analysis of parallel computation procedures and a wider theoretical background of the parallel computing paradigms can be found here.
Shared memory parallel computers vary widely, but generally have in
common the ability for all processors to access all memory as global
address space. Multiple processors can operate independently but share
the same memory resources. Changes in a memory location effected by one
processor are visible to all other processors. Data sharing between
tasks is both fast and uniform due to the proximity of memory to CPUs.
The primary disadvantage of shared memory parallel computers is the lack of
scalability between memory and CPUs. Adding more CPUs can geometrically
increase traffic on the shared memory - CPU path and, for cache coherent
systems, geometrically increase traffic associated with cache / memory
management. Also, it
becomes increasingly difficult and expensive to design and produce
shared memory machines with ever increasing numbers of processors. In
the shared memory parallel programming model, tasks share a common
address space, which they read and write asynchronously. Various
mechanisms such as locks / semaphores may be used to control access to
the shared memory. An advantage of this model from the programmer's
point of view is that the notion of data "ownership" is lacking, so
there is no need to specify explicitly the communication of data
between tasks. Program development can often be simplified. This is
where OpenMP comes into focus. OpenMP is an Application Program
Interface (API) that may be used to explicitly direct multi-threaded,
shared memory parallelism. It comprises three primary API
components: compiler directives, run-time library routines and
environment variables. It provides a standard among a variety of shared
memory architectures / platforms and operating systems.
Like shared memory systems, distributed memory systems vary widely but share a
common characteristic. Distributed memory systems require a
communication network to connect inter-processor memory. Processors
have their own local memory. Memory addresses in one processor do not
map to another processor, so there is no concept of global address
space across all processors. Because each processor has its own local
memory, it operates independently. Changes it makes to its local memory
have no effect on the memory of other processors. When a processor
needs access to data in another processor, it is usually the task of
the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's
responsibility. The main disadvantage of this approach is that the programmer
is responsible for many of the details associated with data
communication between processors, which can result in excessive time
invested in programming the source code. On distributed memory systems,
so-called message passing models are used to obtain an efficient
parallel programming solution. From a programming perspective, message
passing implementations (such as MPI, the Message Passing Interface)
commonly comprise a library of subroutines that are embedded in source
code. The programmer is responsible for determining all parallelism.
Multiple tasks can reside on the same physical machine as well as across
an arbitrary number of machines. Tasks exchange data through
communications by sending and receiving messages.
- Performance Tuning
- Multicore Processor Architectures
- Parallel Architectures
- Parallel Programming Basics
- Parallel Programming Models - Distributed Memory & MPI
- Parallel Programming Models - Shared Memory, Auto Parallel & OpenMP
- Hybrid Programming Module and What's Next
Other parallel programming models besides those previously mentioned certainly exist, and will continue to evolve along with the ever-changing world of computer hardware and software. Here, we will concentrate exclusively on the shared memory parallel programming model, implemented through the OpenMP approach.
Introduction to OpenMP
An excellent introduction to the OpenMP parallel programming model can be found here. There is also an excellent introduction to the usage of the OpenMP API, which complements the OpenMP specifications and explains the usage of the API routines in greater detail. It is titled Parallel Programming in Fortran 95 using OpenMP and can be accessed here. The official OpenMP specifications (Fortran 90/95 and C/C++ versions) can be downloaded here.
OpenMP is based upon the existence of multiple threads in the shared memory programming paradigm. This means that a single process can have multiple, concurrent execution paths called threads. Hence, a shared memory parallel process consists of multiple threads running simultaneously. It is an explicit (not automatic) programming model, offering the programmer full control over parallelization. OpenMP uses the so-called "fork and join" model of parallel execution. This means that all OpenMP programs begin as a single process (as in serial programs), which is called the master thread. The master thread executes sequentially until the first parallel region construct is encountered. Parallel regions are created by marking portions of the serial code with OpenMP constructs (compiler directives and API calls). When the master thread encounters a parallel region, it creates a team of parallel threads, called work threads (the fork). The statements of the (serial) program that are enclosed by the parallel region construct are then executed in parallel by the threads of the team. When the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread (the join). The master thread then continues executing the remaining program statements in a standard serial fashion until it encounters another parallel region, where the whole procedure repeats.
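The fork and join model can be sketched with a minimal program. This is only an illustration under stated assumptions: the program name fork_join and the variable id are invented for the example, and the number of threads is left to the run-time defaults.

```fortran
program fork_join
   use omp_lib            ! interfaces of the OpenMP run-time library
   implicit none
   integer :: id

   print *, 'Serial part: only the master thread runs here.'

   ! Fork: the master thread creates a team of threads
!$OMP PARALLEL PRIVATE(id)
   id = omp_get_thread_num()
   print *, 'Parallel region: hello from thread', id
!$OMP END PARALLEL
   ! Join: the team has synchronized, only the master thread continues

   print *, 'Serial part again: back to the master thread.'
end program fork_join
```

The order of the lines printed inside the parallel region is not fixed, since the threads run concurrently.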

Most OpenMP parallelism is specified through compiler directives which are embedded in the existing (Fortran 90/95) source code. This provides flexibility and high efficiency in writing parallel computer programs. In fact, almost every piece of existing Fortran 90/95 (and even Fortran 77) source code can be parallelized through the OpenMP approach with little additional effort. This is the main advantage of using OpenMP. It can additionally be combined with sophisticated compiler features, such as auto-parallelization, auto-vectorization, loop unrolling, vector pipelining, etc. This tutorial concentrates exclusively on using OpenMP in the Fortran 90/95 programming language.
Basic OpenMP API Overview
All Fortran 90/95 OpenMP directives must begin with a sentinel. In free format source code a sentinel can appear in any column, but it must be preceded by white space only. The Fortran 90/95 OpenMP sentinel is !$OMP. The sentinel is directly followed by a reserved directive name and optional clauses. Clauses can appear in any order and can be repeated as necessary, unless otherwise restricted by the standard. A valid OpenMP directive name must appear after the sentinel and before any clauses. Continuation lines, if used, require an ampersand as the last non-blank character of the line being continued; the following line must then begin with a sentinel, followed by the continued directive text. Only one directive name may be specified per directive. The OpenMP specification allows a trailing comment, introduced by an exclamation mark, on a directive line, but keeping comments on separate lines is the more cautious habit. An example of the OpenMP directive is given below.
!$OMP PARALLEL [clause ...]
               DEFAULT (PRIVATE | SHARED | NONE)
               FIRSTPRIVATE (var_list)
               PRIVATE (var_list)
               SHARED (var_list)
               REDUCTION (operator: list)
               ...
   block of code to be parallelized
!$OMP END PARALLEL
The block of code enclosed by the OpenMP directive is executed in parallel
by the team of threads. The number of threads executing the parallel region
can be controlled by the user, even during runtime. Source code before
and after the OpenMP directives is executed in serial, by the master
thread. The block of code directly placed between the two directives
!$OMP PARALLEL and !$OMP END PARALLEL is said to be in the lexical
extent of the directive-pair. The code included in the lexical extent
plus all the code called from inside the lexical extent (e.g. in
subroutines called from the parallel region) is said to be in the
dynamic extent of the directive-pair.
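The difference between the lexical and the dynamic extent can be illustrated with a short sketch. The module work_mod, the subroutine scale_array and the request for four threads are assumptions made for this example only. The !$OMP DO directive inside the subroutine is a so-called orphaned directive: it is outside the lexical extent of the parallel region, but inside its dynamic extent.

```fortran
module work_mod
   implicit none
contains
   subroutine scale_array(x, n, factor)
      integer, intent(in) :: n
      real, intent(inout) :: x(n)
      real, intent(in)    :: factor
      integer :: i
      ! Orphaned directive: it binds to the parallel region
      ! from which this subroutine is called
!$OMP DO
      do i = 1, n
         x(i) = factor * x(i)
      end do
!$OMP END DO
   end subroutine scale_array
end module work_mod

program extents
   use omp_lib
   use work_mod
   implicit none
   integer, parameter :: n = 1000
   real :: a(n)

   a = 1.0
   call omp_set_num_threads(4)   ! request 4 threads (arbitrary choice)

!$OMP PARALLEL SHARED(a)
   ! Lexical extent: the statements written between the directive pair.
   ! The body of scale_array belongs to the dynamic extent only.
   call scale_array(a, n, 2.0)
!$OMP END PARALLEL

   print *, 'All elements doubled: ', all(a == 2.0)
end program extents
```

Every thread of the team calls scale_array, and the iterations of the loop inside it are divided among those threads.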
Four different groups of OpenMP directives or constructs exist. Each group has a different aim, and the selection of one directive or another inside the same group depends on the nature of the problem to be solved. Therefore, it is good to understand the principles of each of these directives in order to make the correct choices. The most important group of OpenMP directives aims to divide a given amount of work into pieces and to give one or more of these pieces to each thread running in parallel. In this way the work, which would be done by a single thread in a serial program, is distributed over a team of threads, resulting in a faster running program. This is basically the cornerstone of the OpenMP approach. All work-sharing constructs have an implied synchronization in their closing directives. This is in general necessary to ensure that all the information required by the code following the work-sharing construct is up to date. A brief description of the work-sharing constructs follows:
!$OMP DO / !$OMP END DO - This directive-pair causes the immediately following do-loop to be executed in parallel,
!$OMP SECTIONS / !$OMP END SECTIONS - This directive-pair allows a completely different task to be assigned to each thread (each section of code is executed only once, by one thread in the team),
!$OMP SINGLE / !$OMP END SINGLE - The code enclosed in this directive-pair is executed by only one of the threads in the team, namely the one that arrives first,
!$OMP WORKSHARE / !$OMP END WORKSHARE - This work-sharing construct targets special Fortran 95 array features, such as array notation expressions or forall and where statements (in this case no explicit do-loops are visible).
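A minimal sketch of three of the work-sharing constructs listed above, inside one parallel region, could look as follows. The array names and sizes are assumptions made for the illustration. The implied synchronization at each closing directive guarantees that the arrays are completely filled before they are summed, and that both results exist before they are printed.

```fortran
program worksharing
   implicit none
   integer, parameter :: n = 1000
   real :: a(n), b(n)
   real :: total_a, max_b
   integer :: i

!$OMP PARALLEL SHARED(a, b, total_a, max_b) PRIVATE(i)

   ! DO: the iterations of the loop are divided among the threads
!$OMP DO
   do i = 1, n
      a(i) = real(i)
      b(i) = 2.0 * real(i)
   end do
!$OMP END DO

   ! SECTIONS: each section is executed once, by some thread of the team
!$OMP SECTIONS
!$OMP SECTION
   total_a = sum(a)
!$OMP SECTION
   max_b = maxval(b)
!$OMP END SECTIONS

   ! SINGLE: executed by one thread only (typical for I/O)
!$OMP SINGLE
   print *, 'sum(a) =', total_a, ' maxval(b) =', max_b
!$OMP END SINGLE

!$OMP END PARALLEL
end program worksharing
```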
Additionally, OpenMP provides so-called combined parallel work-sharing constructs, which are in fact shortcuts for specifying a parallel region that contains only one work-sharing construct. The following combined parallel work-sharing constructs exist:
!$OMP PARALLEL DO / !$OMP END PARALLEL DO
!$OMP PARALLEL SECTIONS / !$OMP END PARALLEL SECTIONS
!$OMP PARALLEL WORKSHARE / !$OMP END PARALLEL WORKSHARE
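As a hedged illustration of a combined construct, the sketch below parallelizes a single loop with one !$OMP PARALLEL DO directive-pair. The arrays x and y and the constant 3.0 are arbitrary choices for the example.

```fortran
program combined
   implicit none
   integer, parameter :: n = 1000
   real :: x(n), y(n)
   integer :: i

   x = 1.0
   y = 2.0

   ! One directive-pair opens the parallel region and shares the loop
!$OMP PARALLEL DO SHARED(x, y) PRIVATE(i)
   do i = 1, n
      y(i) = y(i) + 3.0 * x(i)
   end do
!$OMP END PARALLEL DO

   print *, 'y(1) =', y(1), ' y(n) =', y(n)
end program combined
```

The same loop could be written with a separate !$OMP PARALLEL region that contains a single !$OMP DO construct; the combined form only saves two directive lines.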
In certain cases it is not possible to leave each thread on its own, and it is necessary to bring the threads back into order. This is generally achieved through synchronization of the threads. These synchronizations can be explicit, requested by a dedicated directive such as !$OMP BARRIER, or implied, like the ones present in the closing directives of the work-sharing constructs. Hence, synchronization constructs form another group of OpenMP directives. A brief outline of the synchronization constructs follows:
!$OMP MASTER / !$OMP END MASTER - The code enclosed inside this directive-pair is executed only by the master thread of the team (meanwhile, all the other threads continue with their work),
!$OMP CRITICAL / !$OMP END CRITICAL - This directive-pair restricts the access to the enclosed code to only one thread at a time; in this way it is ensured that what is done in the enclosed code is done correctly,
!$OMP BARRIER - This directive represents an explicit synchronization between the different threads in the team; when encountered, each thread waits until all the other threads have reached this point,
!$OMP ATOMIC - When a variable in use can be modified by all threads in a team, it is necessary to ensure that only one thread at a time is writing / updating the memory location of the considered variable, otherwise unpredictable results will occur; this directive provides that guarantee for the assignment statement that immediately follows it,
!$OMP FLUSH - This directive is placed at the precise point in the code at which (explicit) data synchronization is required, i.e. where each thread's view of the shared variables has to be made consistent with memory.
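The synchronization constructs can be sketched in a small program. The shared counters hits and largest_id are assumptions introduced for this example; they only serve to show an update that would be unsafe without protection.

```fortran
program synchronization
   use omp_lib
   implicit none
   integer :: id, hits, largest_id

   hits = 0
   largest_id = -1

!$OMP PARALLEL SHARED(hits, largest_id) PRIVATE(id)
   id = omp_get_thread_num()

   ! ATOMIC: protects the single update statement that follows
!$OMP ATOMIC
   hits = hits + 1

   ! CRITICAL: only one thread at a time executes the enclosed block
!$OMP CRITICAL
   if (id > largest_id) largest_id = id
!$OMP END CRITICAL

   ! BARRIER: wait until every thread has finished its updates
!$OMP BARRIER

   ! MASTER: only the master thread (number 0) reports the result
!$OMP MASTER
   print *, 'threads counted:', hits, ' largest id:', largest_id
!$OMP END MASTER
!$OMP END PARALLEL
end program synchronization
```

Without the explicit barrier the master thread could print the counters before the other threads have updated them, because the MASTER construct has no implied synchronization.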
Another set of OpenMP directives and clauses is meant for controlling the data environment during parallel execution. They specify how each variable is handled, who is allowed to see its value and who is allowed to change it. This ensures consistency among data (variables) in parallel regions. Not all of the data scope attribute clauses are allowed by all directives; the clauses that are valid on a particular directive are indicated in the OpenMP specifications. The data scope directives and clauses are as follows:
!$OMP THREADPRIVATE (list) - This directive (it is not a clause) declares global variables whose values are specific to each thread,
COPYIN (list) - This clause determines, for variables which have been declared as THREADPRIVATE, that their values in each thread will be set equal to the value in the master thread,
PRIVATE (list) - This clause determines which variables are going to be considered as local variables of each thread,
SHARED (list) - This clause determines which variables should be available to all threads inside the scope of a directive-pair (because their values are needed by all threads or because all threads have to update their values),
DEFAULT (PRIVATE | SHARED | NONE) - This clause determines the default setting for variables in a parallel region (loop indexes need not be specified),
FIRSTPRIVATE (list) - This clause determines that the variables in the list are private and inherit the value of the original variable before the starting directive (otherwise they would have an unknown value, which is the case with the PRIVATE clause),
LASTPRIVATE (list) - This clause determines that the original variables included in the list will be updated with the "last" value they get inside the scope of the associated directive-pair,
COPYPRIVATE (list) - This clause makes it possible, after a single thread inside a parallel region has executed the instructions enclosed in an !$OMP SINGLE / !$OMP END SINGLE directive-pair, to broadcast the value of a private variable to the other threads in the team.
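A short sketch of the PRIVATE, FIRSTPRIVATE and LASTPRIVATE clauses is given below. All variable names are assumptions chosen for the illustration. Since DEFAULT(NONE) is used, every variable referenced in the loop (except the loop index) has to be listed explicitly.

```fortran
program data_scope
   implicit none
   integer, parameter :: n = 8
   integer :: i, offset, tmp, last_val
   integer :: res(n)

   offset = 100     ! set before the parallel region
   tmp = -1         ! this value is not seen by the private copies

!$OMP PARALLEL DO DEFAULT(NONE) SHARED(res) PRIVATE(tmp) &
!$OMP FIRSTPRIVATE(offset) LASTPRIVATE(last_val)
   do i = 1, n
      tmp = i * i                ! private scratch variable
      res(i) = tmp + offset      ! each private offset starts at 100
      last_val = res(i)          ! the value from iteration i = n survives
   end do
!$OMP END PARALLEL DO

   print *, 'res      =', res
   print *, 'last_val =', last_val
end program data_scope
```

After the loop, last_val holds the value assigned in the sequentially last iteration (i = 8), that is 8*8 + 100 = 164, regardless of which thread executed that iteration.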
There are several other clauses, such as the REDUCTION clause, which are explained in the OpenMP specifications. Additionally, there are clauses which are not concerned with data scoping, but have other tasks. One such clause is concerned with thread scheduling during the execution of DO loops inside a parallel region.
SCHEDULE (type, chunk) - This clause determines how the iterations of the parallelized DO loop are distributed among the threads of the team; the first parameter, type, specifies the way in which the work is distributed over the threads, while the second, chunk (which is optional), specifies the size of the work given to each thread; four different scheduling options exist:
STATIC - when this option is specified, the pieces of work created from the iteration space of the do-loop are distributed evenly among the threads of the team and stay fixed for the duration of the execution; by default the number of pieces of work is equal to the number of threads in the team and all pieces are approximately equal in size; if the optional parameter chunk is specified, the size of the pieces is fixed to that amount,
DYNAMIC - when this option is specified, the pieces of work created from the iteration space of the do-loop are distributed in a dynamic way, which means that as one thread finishes its piece of work, it gets a new one; the iteration space is divided into pieces of work with a size equal to chunk,
GUIDED - when this option is specified, the pieces of work created from the iteration space of the do-loop are also distributed in a dynamic way, but this time the pieces of work have decreasing sizes, so that their associated work is smaller and smaller as they are assigned to the different threads; the decrease is of approximately exponential nature (the exact law depends on the implementation; in a typical case each new piece is a fixed fraction, e.g. one half, of the previous one); the optional parameter chunk here specifies the smallest number of iterations grouped into one piece of work,
RUNTIME - when this option is specified, the choice of the method of division of the iteration space of the do-loop is postponed until runtime; hence, by choosing this option the user can define / modify the division of the iteration space during runtime.
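The SCHEDULE clause, together with the REDUCTION clause mentioned earlier, can be sketched as follows. The loop (a partial sum of the harmonic series) and the chunk size of 1000 are assumptions made only for this illustration.

```fortran
program schedule_demo
   implicit none
   integer, parameter :: n = 100000
   integer :: i
   double precision :: total

   total = 0.0d0
   ! DYNAMIC with chunk 1000: a thread that finishes a piece of
   ! 1000 iterations fetches the next one; REDUCTION gives every
   ! thread a private partial sum and adds them up at the end
!$OMP PARALLEL DO SCHEDULE(DYNAMIC, 1000) REDUCTION(+:total)
   do i = 1, n
      total = total + 1.0d0 / dble(i)
   end do
!$OMP END PARALLEL DO
   print *, 'dynamic schedule, sum =', total

   total = 0.0d0
   ! RUNTIME: the schedule is taken from the OMP_SCHEDULE
   ! environment variable when the program runs
!$OMP PARALLEL DO SCHEDULE(RUNTIME) REDUCTION(+:total)
   do i = 1, n
      total = total + 1.0d0 / dble(i)
   end do
!$OMP END PARALLEL DO
   print *, 'runtime schedule, sum =', total
end program schedule_demo
```

For the second loop the schedule can be changed without recompiling, for example by setting OMP_SCHEDULE to "guided,500" in the shell before starting the program. The two sums may differ in the last digits, because the order of the floating point additions depends on the schedule.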
Another group of OpenMP facilities at the programmer's disposal are the routines of the OpenMP run-time library. The OpenMP run-time library is meant to serve as a control and query tool for the parallel execution environment, which the programmer can use from inside the program. The run-time library is therefore a set of external procedures with clearly defined interfaces. Here are some of them:
OMP_set_num_threads - This subroutine sets the number of threads to be used by subsequent parallel regions,
OMP_get_num_threads - This function returns the number of threads currently in the team executing the parallel region from which it is called,
OMP_get_max_threads - This function returns the maximum number of threads that can be used in the program,
OMP_get_thread_num - This function returns the identification number of the current thread within the team,
OMP_get_num_procs - This function returns the number of processors available to the program,
OMP_in_parallel - This function tells whether a given region of the program is being computed in parallel or not, etc.
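The routines listed above can be tried out with a small sketch such as the following. The request for two threads is an arbitrary choice, and the CRITICAL construct is used only to keep the output lines of different threads from interleaving.

```fortran
program runtime_query
   use omp_lib
   implicit none

   print *, 'processors available :', omp_get_num_procs()
   print *, 'max threads          :', omp_get_max_threads()
   print *, 'in parallel region?  :', omp_in_parallel()

   call omp_set_num_threads(2)   ! request two threads (arbitrary choice)

!$OMP PARALLEL
!$OMP CRITICAL
   print *, 'thread', omp_get_thread_num(), 'of', omp_get_num_threads(), &
            ' in parallel?', omp_in_parallel()
!$OMP END CRITICAL
!$OMP END PARALLEL
end program runtime_query
```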
These run-time library routines are well documented in the OpenMP specifications. Among the available run-time routines are also several library procedures for assessing performance (benchmarking). Yet another group of subroutines and functions deals with so-called locks. These locks represent another synchronization mechanism at the OpenMP programmer's disposal. The user is advised to consult the OpenMP specifications and the other resources mentioned above for more in-depth explanations of locks and other mechanisms.
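As a hedged illustration of the timing and lock routines, the sketch below protects a shared counter with a simple lock and measures the elapsed wall-clock time with omp_get_wtime. The variable names are assumptions; a real benchmark would time a far larger amount of work.

```fortran
program locks_and_timing
   use omp_lib
   implicit none
   integer(kind=omp_lock_kind) :: lck
   integer :: counter
   double precision :: t_start, t_end

   counter = 0
   call omp_init_lock(lck)        ! a lock must be initialized before use

   t_start = omp_get_wtime()      ! wall-clock time in seconds
!$OMP PARALLEL SHARED(lck, counter)
   call omp_set_lock(lck)         ! wait until the lock is free, then own it
   counter = counter + 1          ! protected update
   call omp_unset_lock(lck)       ! release the lock
!$OMP END PARALLEL
   t_end = omp_get_wtime()

   call omp_destroy_lock(lck)
   print *, 'threads that took the lock:', counter
   print *, 'elapsed seconds           :', t_end - t_start
end program locks_and_timing
```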
Simple OpenMP Example:
!$OMP PARALLEL DEFAULT(NONE) PRIVATE(id) &
!$OMP SHARED(A,B,C,D,CHUNK)
   id = OMP_get_thread_num()
!$OMP DO SCHEDULE(DYNAMIC,256)
   do i = 1,10000
      A(i) = 2.5 * B(i)
   end do
!$OMP END DO NOWAIT
!$OMP SINGLE
   print *, 'Thread',id,' is done with the first do-loop.'
!$OMP END SINGLE
!$OMP DO SCHEDULE(STATIC,CHUNK)
   do i = 1,20000
      C(i) = 3.2 * D(i)
   end do
!$OMP END DO
!$OMP WORKSHARE
   D = A + B
!$OMP END WORKSHARE
!$OMP END PARALLEL
In the above example, a parallel region is first constructed with a PARALLEL directive. Its DEFAULT(NONE) clause states that the treatment of all variables in the parallel region needs to be declared explicitly, which is done by the PRIVATE and SHARED clauses on the same directive (data sharing clauses such as SHARED belong on the PARALLEL directive; they are not accepted by the DO or WORKSHARE directives). The clause list is continued on a second line, so the line continuation symbol (&) needs to be present at the end of the first line and the sentinel at the start of the second. Do-loop indices are excluded from the DEFAULT(NONE) requirement (they are always private to each thread). An example of the usage of the run-time library routines is given in the opening of the parallel region, where each thread stores its identification number in id. Next, a work-sharing construct is introduced in order to parallelize the following do-loop. Here, the distribution of the do-loop iterations is specified to be DYNAMIC, with pieces equal to 256 iterations. A NOWAIT clause at the closing directive of the first do-loop avoids the implied synchronization, which is not needed in this case. Next follows a SINGLE directive, which encloses a region of code executed by a single thread (the one that arrives first at this point). This usage of the SINGLE directive is common when one needs to manage I/O operations from inside the parallel region. The second do-loop work-sharing construct is carried out with a STATIC schedule, now with a chunk size (CHUNK) defined / entered by the user during runtime. There is an implied synchronization at the closing directive of the second do-loop work-sharing construct. The third region uses a WORKSHARE directive to parallelize the array addition (the arrays A, B and D, whose declarations are not shown, must be conformable). Here, an implied do-loop is present in the statement, which might not be seen by a compiler. Again, this work-sharing construct is closed with the appropriate closing directive, where an implied synchronization is present, and the parallel region is closed after that.
This same example could have been written in a different manner as well, by using combined parallel work-sharing constructs and splitting up the source code into three distinct parallel regions.
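One possible form of that alternative is sketched below as a complete program. Several details are assumptions made for the sketch and do not come from the example above: all four arrays are given the common size n = 20000 so that the array addition is conformable, both loops run over the whole arrays, and chunk is set to a fixed value instead of being read at runtime.

```fortran
program combined_regions
   implicit none
   integer, parameter :: n = 20000   ! assumed common array size
   real :: A(n), B(n), C(n), D(n)
   integer :: i, chunk

   B = 1.0
   D = 2.0
   chunk = 500                        ! assumed value, fixed for the sketch

!$OMP PARALLEL DO DEFAULT(NONE) SHARED(A,B) SCHEDULE(DYNAMIC,256)
   do i = 1, n
      A(i) = 2.5 * B(i)
   end do
!$OMP END PARALLEL DO

   ! Serial code between the regions is executed by the master thread
   print *, 'First do-loop is done.'

!$OMP PARALLEL DO DEFAULT(NONE) SHARED(C,D,chunk) SCHEDULE(STATIC,chunk)
   do i = 1, n
      C(i) = 3.2 * D(i)
   end do
!$OMP END PARALLEL DO

!$OMP PARALLEL WORKSHARE DEFAULT(NONE) SHARED(A,B,D)
   D = A + B
!$OMP END PARALLEL WORKSHARE

   print *, 'C(1) =', C(1), ' D(1) =', D(1)
end program combined_regions
```

This form is shorter to write, but every region forks and joins its own team of threads, and the NOWAIT clause of the original example cannot be used on the combined constructs.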
OpenMP Compiler Switches
All Fortran 90/95 OpenMP directives start with the sentinel !$OMP, which indicates to an OpenMP compliant Fortran compiler that the code following it on that source code line is an OpenMP statement / directive. For a regular (non-OpenMP compliant) Fortran compiler this line is a comment line (since it starts with an exclamation mark). Most modern Fortran compilers, such as Intel Fortran Compiler for Linux and SunStudio 12 Fortran Compiler, include OpenMP support. In fact, the majority (if not all) of Fortran compilers now include OpenMP support. In order for the compiler to recognize the OpenMP sentinels and produce a parallel executable, the programmer needs to specify certain compiler switches, typically in the associated makefile. Hence, one and the same source code file can be compiled to produce a serial program as well as a parallel version of the same program, depending on the compiler switches used in the associated makefile. More info on Fortran makefiles can be found here. In this way, existing serial Fortran programs can be easily parallelized, while the serial version is still retained and fully functional.
For example, Intel Fortran Compiler for Linux (ifort) uses the -openmp switch (recent releases of the compiler spell it -qopenmp). Hence, one would use, for example, the following command:
ifort -o program -O3 -xP -openmp program.f90
to obtain a parallel version of the program (found in the program.f90 source code file). In the above line, several other compiler switches have been used as well. The first switch (-o) names the executable program. Next, optimisation of the source code is requested with the -O3 switch, followed by another code optimisation, targeted at the specific platform (the -xP switch). The optimisations included in the -O3 switch, among others, try to vectorize the do-loops found in the source code. This results in significant improvements in execution speed. The next compiler switch (-openmp) makes the compiler identify the OpenMP parallel regions, introduced into the source code by the programmer, and parallelize them.
The programmer might also try the compiler's auto-parallelization option (switch) to obtain a parallel version of the executable. In the case of Intel Fortran Compiler for Linux this option (-parallel) tries to automatically parallelize the do-loops found in the source code. If the do-loops are properly structured, this is very possible. There is no OpenMP here! Modern Fortran compilers do an excellent job of identifying do-loops and auto-parallelizing them. Hence, one might build the program with the -parallel switch in place of the -openmp switch.
SunStudio 12 Fortran Compiler for Linux (f95) also uses the -openmp switch, which is a synonym for the -xopenmp switch (both are correct). Hence, in this case one would pass either of the two switches on the f95 command line.
SunStudio 12 Fortran Compiler for Linux also supports auto-parallelization, with the -autopar switch (the -parallel switch is said to be obsolete), which can be used on the f95 command line in place of the OpenMP switch.
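As a sketch only, command lines using the switches named in this section might look as follows. The switches -parallel, -xopenmp and -autopar are the ones named in the text; the output name program, the source file program.f90 and the -O3 optimisation level are carried over from the ifort example and are assumptions for the other commands.

```shell
# Intel Fortran Compiler: compiler auto-parallelization instead of OpenMP
ifort -o program -O3 -parallel program.f90

# SunStudio 12 Fortran Compiler: OpenMP directives enabled
f95 -o program -O3 -xopenmp program.f90

# SunStudio 12 Fortran Compiler: compiler auto-parallelization
f95 -o program -O3 -autopar program.f90
```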
The do-loops presented in the above simple example would be automatically parallelized by the compiler (-parallel switch) even without OpenMP intervention. Hence, OpenMP should be reserved for the more difficult cases, where the compiler fails to automatically identify and parallelize loops or other parts of the source code. The programmer can intervene in these cases with OpenMP directives and force the compiler (-openmp switch) to parallelize these regions of the source code. This is especially important in the case of the advanced array features inherent in the Fortran 90/95 programming language. Those are array sections (colon notation) and the forall and where constructs, which were introduced into the language with parallelization in mind. This is where the WORKSHARE OpenMP work-sharing construct comes in handy. It is a unique OpenMP construct reserved for the Fortran language; it does not exist for the C/C++ language. At the same time, the C/C++ programming language has inferior capabilities when it comes to array treatment.
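The array features mentioned above can be combined with the WORKSHARE construct as in the following sketch. The arrays, their size and the threshold value 500.0 are assumptions made for the illustration.

```fortran
program workshare_arrays
   implicit none
   integer, parameter :: n = 1000
   real :: a(n), b(n), c(n)
   integer :: i

   a = (/ (real(i), i = 1, n) /)
   b = 0.0
   c = 0.0

!$OMP PARALLEL WORKSHARE SHARED(a, b, c)
   ! array section (colon notation)
   b(1:n/2) = 2.0 * a(1:n/2)
   ! masked array assignment
   where (a > 500.0)
      c = sqrt(a)
   elsewhere
      c = -1.0
   end where
   ! forall construct
   forall (i = 1:n) a(i) = a(i) + b(i)
!$OMP END PARALLEL WORKSHARE

   print *, 'b(1) =', b(1), ' c(n) =', c(n), ' a(1) =', a(1)
end program workshare_arrays
```

The compiler divides each array statement into units of work and completes one statement before the next one starts, so the forall construct sees the final values of b.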
Performance Analysis Tools
Fortran 90/95 developers on Linux platforms have various excellent tools (free for non-commercial use) for the performance analysis of their software. Using these tools is the first step in the development of parallel software. Firstly, one must acknowledge the fact that modern Fortran compilers are very sophisticated. They include automatic vectorization, automatic parallelization, loop transformations and other optimization procedures. All this, combined with the fine-tuned Math Kernel Library (MKL) and other optimised library packages, results in very fast executables on Linux machines. Additional speed-ups can be obtained by using the OpenMP API. A nice overview of the stages which are necessary for the optimization of parallel software can be found here. The first step in fine-tuning the source code is to use compiler switches which allow sophisticated optimizations for the targeted architecture. The next step is performance profiling of the application using performance analysis tools. Here, one can choose from a variety of free (for non-commercial use) Linux tools. One excellent choice is the Intel VTune Performance Analyzer for Linux. It can be obtained free of charge here. It has a GUI integrated into the Eclipse platform. Another excellent tool (also free for non-commercial use) is the Performance Profiler bundled with SunStudio 12 for Linux, developed by Sun Microsystems. (SunStudio 12 is an excellent full-featured IDE for Fortran 90/95 on the Linux platform and can be obtained here.)
A performance analysis tool gives the programmer insight into the application's performance from the execution time perspective. It can identify application bottlenecks, i.e. pieces of code (functions, subroutines) where the application is spending the majority of its runtime. Those parts are the natural candidates for parallelization by the OpenMP approach. Certain gains can be expected from running code in parallel, but there are numerous issues which need to be accounted for, such as overheads due to data management and communication between threads. Certain problems will profit more from the OpenMP approach than others. Also, the underlying architecture has a significant impact on the obtained performance, e.g. the number of available multi-core processors and the number of (hardware) threads per processor.
The most important thing to keep in mind when optimizing an application
is to follow a systematic and organized approach. Several distinct
steps can be identified in an application tuning methodology. The
process that goes through these steps is iterative in nature. It can
be illustrated graphically with a chart (on the left). The first step
of the process is to gather
performance data, as described previously, and analyze those data to
identify opportunities for improvement. This can often be carried out
using tools such as Intel VTune Performance Analyzer to see where time
is being spent, or Intel Thread Profiler to discover threading
inefficiencies. The SunStudio 12 performance profiler, for example,
could be used instead. Once the trouble spots have been identified, one
needs to generate alternatives for resolving the issue found and
select the fix to apply. It should be noted that the programmer
should try to solve only one issue (usually the major one found) in
each iteration cycle. Otherwise, it might be impossible to determine
which change had what performance impact (this is crucial if the
implemented fixes degrade performance, which might happen). The fix for
the problem found can range from trivial to very complicated,
depending on the situation at hand.
Once you have decided what you plan to change in the code and taken
note of it in your project record, make the appropriate changes to the
source code. This is the Implement Enhancement step of the process
(on the chart). Next, test the results of the implemented source code
changes. Collect additional data, compare it to the baseline
measurement, and take the time to understand the results, backtracking
if necessary. This brings the cycle back to the beginning. From here, follow
the cycle into the second iteration, and so on. Creating a methodology
and applying it for each individual performance tuning project requires
taking this generic sequence of steps and determining how it should be
manifested in each specific case. The majority of the work will be
invested in implementing the enhancements (i.e. fixing problems,
resolving bottlenecks, etc.) identified with the performance analyzer.
Hence, a performance analyzer is a valuable, if not indispensable, tool
for optimizing the performance of parallel applications.
Final Remarks
It is often argued that the OpenMP approach is limited to so-called "fine grain" parallelization. This is somewhat true, but it is not a limitation of the OpenMP specifications per se. It is up to the programmer to try to achieve so-called "coarse grain" parallelization. Much of the tedious work of parallelization can be left to the compiler itself (auto-parallelization through compiler switches). The programmer should be occupied with higher-level parallelization. This might even include developing novel algorithms that are more parallel-friendly (i.e. easier to parallelize). Such algorithms have been developed, for example, for the FDTD (Finite-Difference Time-Domain) approach to the solution of Maxwell's equations. New ways of thinking about solving engineering problems might be needed, ways that include parallel notions from the beginning. Engineering software development is going through an important revision process, which from now on relies heavily on parallel programming paradigms.
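The difference between the two granularities can be sketched on a small time-stepping loop. The example below is an illustration under stated assumptions, not a recommended implementation: the program name grain_demo, the arrays a, b, u, v and the three-point smoothing stencil are hypothetical, and the stencil is only a stand-in for a real update scheme (it is not an FDTD algorithm). The fine-grain variant opens and closes a parallel region around every inner loop, while the coarser variant opens a single parallel region around the whole time loop and only shares the work of the inner loops among the already running threads.

```fortran
program grain_demo
  use omp_lib                        ! omp_get_wtime
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 200000, nsteps = 200   ! hypothetical sizes
  real(dp), allocatable :: a(:), b(:), u(:), v(:)
  real(dp) :: t0, t_fine, t_coarse
  integer :: i, step

  allocate(a(n), b(n), u(n), v(n))
  call init(a)
  call init(u)
  b = 0.0_dp
  v = 0.0_dp

  ! Fine grain: a new parallel region is entered for every inner loop
  ! of every time step.
  t0 = omp_get_wtime()
  do step = 1, nsteps
     !$omp parallel do
     do i = 2, n - 1
        b(i) = 0.25_dp*a(i-1) + 0.5_dp*a(i) + 0.25_dp*a(i+1)
     end do
     !$omp end parallel do
     !$omp parallel do
     do i = 2, n - 1
        a(i) = b(i)
     end do
     !$omp end parallel do
  end do
  t_fine = omp_get_wtime() - t0

  ! Coarser grain: one parallel region encloses the whole time loop.
  ! Every thread executes the step loop; the "do" constructs split the
  ! inner iterations, and their implicit barriers keep the two phases
  ! of a time step in order.
  t0 = omp_get_wtime()
  !$omp parallel private(step)
  do step = 1, nsteps
     !$omp do
     do i = 2, n - 1
        v(i) = 0.25_dp*u(i-1) + 0.5_dp*u(i) + 0.25_dp*u(i+1)
     end do
     !$omp end do
     !$omp do
     do i = 2, n - 1
        u(i) = v(i)
     end do
     !$omp end do
  end do
  !$omp end parallel
  t_coarse = omp_get_wtime() - t0

  print '(a,f10.6,a)', 'fine grain:   ', t_fine,   ' s'
  print '(a,f10.6,a)', 'coarse grain: ', t_coarse, ' s'

  ! Both variants perform the same arithmetic on every element.
  if (maxval(abs(a - u)) <= 1.0e-12_dp) then
     print '(a)', 'results agree'
  else
     print '(a)', 'results differ'
  end if

contains

  ! Fill an array with smooth, hypothetical initial data.
  subroutine init(w)
    real(dp), intent(out) :: w(:)
    integer :: k
    do k = 1, size(w)
       w(k) = sin(0.001_dp * real(k, dp))
    end do
  end subroutine init

end program grain_demo
```

Whether the single parallel region is measurably faster depends on the compiler, the OpenMP runtime and the problem size, so the two timings should be read as an experiment, not as a guaranteed gain. A still coarser decomposition, in which each thread owns a fixed part of the domain for the entire run, follows the same idea.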
Hybrid architectures, which combine dozens to several hundred multi-core and/or multi-processor machines (SMPs) interconnected into a single large system (cluster), require new ways of parallel software development. The MPI approach has been acknowledged as the de facto standard here (on so-called distributed memory systems). Relatively recently, the so-called Cluster OpenMP approach has been put forth by Intel. Cluster OpenMP is an implementation of OpenMP that can make use of multiple SMP machines without resorting to MPI. This has the advantage of eliminating the need to write explicit messaging code, as well as of not mixing programming paradigms. The shared memory in Cluster OpenMP is maintained across all machines through a distributed shared-memory subsystem. This approach will hopefully reduce the effort required of the programmer in developing parallel applications on (Linux) cluster systems.