Fortran Programming Language

omp question

What is it that I do not understand? The program listed below produces
these execution times:

wx22gt> ./try 40000000
  num threads =  32
  do loop:  0.5078125000E-01
  where + array:  1.023437500

but, when I comment out the omp statements, the "where" statement is faster:

wx22gt> ./try 40000000
  do loop:  0.3007812500
  where + array:  0.8085937500

Operating system AIX 5.3, the compile/link statements are
xlf95_r -g -O0 -qfree=f90 -qsuffix=f=f90 -qnosave -qsmp=omp  -c try.f90
xlf95_r -g -O0 -qfree=f90 -qsuffix=f=f90 -qnosave -qsmp=omp  -o try try.o

I just started experimenting with Open MP, so I must be missing something.


module try_internal

implicit none

contains

function get_time()
     real :: get_time

     integer :: date_time(8)

     call date_and_time(values=date_time)
     get_time = date_time(5)*3600 + date_time(6)*60 + date_time(7) + &
         0.001*date_time(8)  ! milliseconds; continuation assumed, it is missing from the post
end function get_time

end module try_internal

program try
     use omp_lib
     use try_internal
     implicit none

     real, dimension(:), allocatable :: a
     character(len=256) :: buf
     integer :: n, i, k
     real :: t
     real, parameter :: eps = 0.2

     call getarg(1, buf)
     read(buf, *) n
     allocate(a(n))
     call random_number(a)
     t = get_time()
!$omp parallel
     if (omp_get_thread_num() == 1) &
         print *, 'num threads = ', omp_get_num_threads()
!$omp do
     do i = 1, n
         if (a(i) < eps) a(i) = a(i) + 1.0
     end do
!$omp end do
!$omp end parallel
     print *, 'do loop: ', get_time() - t
     call random_number(a)
     t = get_time()
!$omp parallel workshare
     where (a < eps) a = a + 1.0
!$omp end parallel workshare
     print *, 'where + array: ', get_time() - t

     stop 0

end program try

Do you have a 32-processor machine that you're running this on?  If not,
why so many threads?

- Brooks

The "bmoses-nospam" address is valid; no unmunging needed.

Yes, I do. To experiment further, I added

call omp_set_dynamic(.false.)
call omp_set_num_threads(2)

after the first !$omp parallel line and the result was the same: 32
threads. I was not sure whether the argument to omp_set_dynamic() should
really be .false. (as in "OpenMP Application Program Interface, Version
2.5"), so I tried .true. as well. Again 32 threads and no change in
execution time!
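
A note on the placement of those calls: the size of a team is fixed when the parallel region starts, so a call made after the !$omp parallel line cannot shrink the team that is already running. The sketch below is one way to request two threads, assuming the calls are moved ahead of the region; the program name set_threads and the value 2 are only illustrative. Setting the OMP_NUM_THREADS environment variable before launching the program is another option.

```fortran
program set_threads
    use omp_lib
    implicit none

    ! Request a fixed team size BEFORE the parallel region starts.
    ! The value 2 is just an example.
    call omp_set_dynamic(.false.)
    call omp_set_num_threads(2)

!$omp parallel
!$omp single
    print *, 'num threads = ', omp_get_num_threads()
!$omp end single
!$omp end parallel

    ! Alternative: put the request on the directive itself.
!$omp parallel num_threads(2)
!$omp single
    print *, 'num threads = ', omp_get_num_threads()
!$omp end single
!$omp end parallel

end program set_threads
```

The single construct makes exactly one thread print, whichever arrives first, so the message appears even when the team has only one thread.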


Using more processors doesn't necessarily speed up a program.  You have
to think carefully about caching issues.  I don't know the correct
terminology, but the basic idea is that when a CPU accesses memory it
transfers a chunk into a cache, and this chunk has a size that may
exceed the size of the element you are changing.  This speeds up
processing (on average) because you often want to perform repeated
operations on the same memory location or on adjacent locations.  This
is a complex subject of which I have only an inkling, but I do know
that this issue can have a serious impact on shared-memory
multiprocessing.  The reason is that a CPU in one thread often has to
wait for a CPU in another thread to release a chunk of memory, if the
two threads are operating on adjacent memory locations (e.g. on elements
within the same array).  This can happen when the cache chunk is bigger
than the array element size (in your case 4 bytes).

I encountered this unpleasant surprise in my OpenMP coding, and have
developed coding procedures to minimize the slowdown.  If I can't avoid
having threads trying to access adjacent elements in an array a
significant fraction of the time, I resort to the crude but effective
expedient of padding out the array elements so that they are at least a
cache-chunk apart.  For example, if the cache-chunk is M bytes, instead
of a(i) I use a((M/4)*i), where the dimension of a is now (M/4)*n.

I suggest you experiment with this, and explore the effect on timing of
different values of M.
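
The padding idea can be sketched as follows. This is only an illustration, not the code referred to above: it assumes a cache chunk (cache line) of 128 bytes and 4-byte default integers, and the names padded_counters, line_bytes, stride and counts are made up for the example. Each thread keeps a counter of its own, and consecutive counters sit a full cache line apart so that no two threads write to the same line.

```fortran
program padded_counters
    use omp_lib
    implicit none

    ! Assumption: 128-byte cache line, 4-byte default integer.
    ! Adjust line_bytes for the machine at hand.
    integer, parameter :: line_bytes = 128
    integer, parameter :: stride = line_bytes/4
    integer, parameter :: n = 1000000
    real, parameter :: eps = 0.2

    real, allocatable :: a(:)
    integer, allocatable :: counts(:)
    integer :: i, me, nthreads

    allocate(a(n))
    call random_number(a)

    ! One padded slot per thread that the next region may use.
    nthreads = omp_get_max_threads()
    allocate(counts(stride*nthreads))
    counts = 0

!$omp parallel private(me)
    ! Thread "me" only ever touches counts(stride*me + 1).
    me = omp_get_thread_num()
!$omp do
    do i = 1, n
        if (a(i) < eps) counts(stride*me + 1) = counts(stride*me + 1) + 1
    end do
!$omp end do
!$omp end parallel

    ! The unused padding elements stay zero, so a plain sum is enough.
    print *, 'elements below eps: ', sum(counts)

end program padded_counters
```

Setting line_bytes to 4 packs the counters next to each other, which gives a quick way to compare timings with and without padding. For a simple count like this one a reduction clause would be the usual choice; the manual version is shown only to make the memory layout visible.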

By the way, random_number() in a parallel loop is a real trap for young
players.  The reason is that the random number generator maintains a
global seed value - the state of the RNG.  This value is accessed and
changed by all the threads, leading to contention.  Since I do a lot of
Monte Carlo simulations I wrote my own parallel RNG, in which each
thread generates a random sequence independent of the other threads
(each has its own seed).  When I get to work I'll send you a document I
wrote on this subject, if you are interested and if you provide your
email address.
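
For illustration only, here is a minimal sketch of the one-seed-per-thread idea. It is not the generator described above: it uses a simple xorshift-style update with shifts 13, 7 and 17, the seed constants are arbitrary, and the names per_thread_rng, state and i8 are made up for the example. It has not been vetted statistically, so treat it as a picture of how the state is kept private to each thread rather than as a generator for production Monte Carlo work.

```fortran
program per_thread_rng
    use omp_lib
    implicit none

    integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integer kind
    integer, parameter :: n = 1000000

    real, allocatable :: a(:)
    integer(i8) :: state
    integer :: i

    allocate(a(n))

!$omp parallel private(state)
    ! Each thread starts from its own nonzero seed (arbitrary constants).
    state = 88172645463325252_i8 + 7919_i8*omp_get_thread_num()
!$omp do
    do i = 1, n
        ! xorshift update; the state is private, so threads never contend.
        state = ieor(state, ishft(state, 13))
        state = ieor(state, ishft(state, -7))
        state = ieor(state, ishft(state, 17))
        ! Top 24 bits of the state -> uniform real in [0, 1).
        a(i) = real(ishft(state, -40)) / 16777216.0
    end do
!$omp end do
!$omp end parallel

    print *, 'mean = ', sum(a)/n

end program per_thread_rng
```

Because no thread reads or writes another thread's state, there is no shared seed to fight over. Which values land in which elements of a depends on the number of threads and on the loop schedule.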


I understand this. The point I wanted to make was that the do loop ran
about 6 times faster (0.05s vs 0.3s) with OMP, while the equivalent
where statement was a bit slower. It looked as if the OMP overhead was
incurred, but the implied loop in the where statement and the array
arithmetic was not executed in parallel.
The call to random_number() is outside the parallel region, I think.


Sorry for going off on a tangent :-)
On Jun 6, 8:21 pm, George Trojan <george.tro@noaa.gov> wrote:

Many OpenMP-capable Fortran compilers are still poor with WORKSHARE. I
recall that Intel v 9.0 accepted the construct, but apparently did not
parallelize at all (i.e. treated it as SINGLE). Although WORKSHARE
seems the most convenient way of parallelizing array expressions
(which may be slower or faster than DO loops), apparently compiler
vendors care much more about DO, presumably because that's where most
codes can save most time.