Fortran Programming Language
omp question
What is it that I do not understand? The program listed below produces execution times

wx22gt> ./try 40000000
num threads = 32
do loop: 0.5078125000E-01
where + array: 1.023437500
STOP 0

but, when I comment out the omp statements, the "where" statement is faster:

wx22gt> ./try 40000000
do loop: 0.3007812500
where + array: 0.8085937500
STOP 0

Operating system AIX 5.3, the compile/link statements are

xlf95_r -g -O0 -qfree=f90 -qsuffix=f=f90 -qnosave -qsmp=omp -c try.f90
xlf95_r -g -O0 -qfree=f90 -qsuffix=f=f90 -qnosave -qsmp=omp -o try try.o

I just started experimenting with Open MP, so I must be missing something.

George

module try_internal
  implicit none
contains
  function get_time()
    real :: get_time
    integer :: date_time(8)
    call date_and_time(values=date_time)
    get_time = date_time(5)*3600 + date_time(6)*60 + date_time(7) + &
               date_time(8)*1.0e-3
  end function get_time
end module try_internal

program try
  use omp_lib
  use try_internal
  implicit none
  real, dimension(:), allocatable :: a
  character(len=256) :: buf
  integer :: n, i, k
  real :: t
  real, parameter :: eps = 0.2

  call getarg(1, buf)
  read(buf, *) n
  allocate(a(n))
  call random_number(a)
  t = get_time()
  !$omp parallel
  if (omp_get_thread_num() == 1) &
    print *, 'num threads = ', omp_get_num_threads()
  !$omp do
  do i = 1, n
    if (a(i) < eps) a(i) = a(i) + 1.0
  end do
  !$omp end do
  !$omp end parallel
  print *, 'do loop: ', get_time() - t
  call random_number(a)
  t = get_time()
  !$omp parallel workshare
  where (a < eps) a = a + 1.0
  !$omp end parallel workshare
  print *, 'where + array: ', get_time() - t
  stop 0
end program try
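A side note on the timer in the program above: get_time() stores seconds-of-day in a default real, which on most compilers is single precision. At values in the tens of thousands that leaves a resolution of only a few milliseconds, and the reported times (for example 0.05078125 = 13/256) look consistent with that. The function also wraps at midnight. A minimal sketch of a finer-grained timer, assuming a compiler that accepts 64-bit integer arguments to system_clock; the names wall_timer and wall_time are made up for this illustration and are not from the original post:

```fortran
module wall_timer
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! Wall-clock time in seconds from an arbitrary origin, in double precision.
  ! Returns -1 if the processor reports no clock.
  function wall_time() result(t)
    real(real64) :: t
    integer(int64) :: ticks, rate
    call system_clock(ticks, rate)
    if (rate > 0) then
      t = real(ticks, real64) / real(rate, real64)
    else
      t = -1.0_real64
    end if
  end function wall_time
end module wall_timer
```

In the test program one would declare t as real(real64), set t = wall_time() before each section and print wall_time() - t afterwards. When the code is always built with OpenMP, omp_get_wtime() from omp_lib serves the same purpose.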
George Trojan wrote:
> What is that I do not understand? The program listed below produces
> execution times
> wx22gt> ./try 40000000
> num threads = 32
> do loop: 0.5078125000E-01
> where + array: 1.023437500
> STOP 0
> but, when I comment out the omp statements, the "where" statement is faster:
> wx22gt> ./try 40000000
> do loop: 0.3007812500
> where + array: 0.8085937500
> STOP 0

Do you have a 32-processor machine that you're running this on? If not, why so many threads?

- Brooks

--
The "bmoses-nospam" address is valid; no unmunging needed.
Brooks Moses wrote:
> Do you have a 32-processor machine that you're running this on? If not,
> why so many threads?
Yes, I do. To experiment further, I added

call omp_set_dynamic(.false.)
call omp_set_num_threads(2)

after the first !$omp parallel line and the result was the same: 32 threads. I was not sure whether the argument to omp_set_dynamic() should really be .false. (as in "OpenMP Application Program Interface, Version 2.5") so I tried .true. as well. Again 32 threads and no change in execution time!

George
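One possible explanation for still seeing 32 threads: calls placed after the !$omp parallel line execute inside a team that has already been formed, and omp_set_num_threads() only influences parallel regions that start later. A minimal sketch of making the request from serial code instead; the module and function names are hypothetical, and the `!$` sentinel lines are compiled only when OpenMP is enabled, so the function simply returns 1 otherwise:

```fortran
module team_size_demo
  !$ use omp_lib
  implicit none
contains
  ! Request nreq threads from serial code, then report the team size
  ! that the next parallel region actually gets.
  function team_size_with_request(nreq) result(nteam)
    integer, intent(in) :: nreq
    integer :: nteam

    nteam = 1                           ! value when built without OpenMP
    !$ call omp_set_dynamic(.false.)    ! ask the runtime not to adjust the team size
    !$ call omp_set_num_threads(nreq)   ! applies to parallel regions started after this call

    !$omp parallel
    !$omp master
    !$ nteam = omp_get_num_threads()
    !$omp end master
    !$omp end parallel
  end function team_size_with_request
end module team_size_demo
```

Printing team_size_with_request(2) from a main program should then show whether the request is honoured. Setting the OMP_NUM_THREADS environment variable before running, or adding a num_threads(2) clause to the parallel directive, are other ways to get the same effect.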
Using more processors doesn't necessarily speed up a program. You have to think carefully about caching issues. I don't know the correct terminology, but the basic idea is that when a CPU accesses memory it transfers a chunk into a cache, and this chunk has a size that may exceed the size of the element you are changing. This speeds up processing (on average) because you often want to perform repeated operations on the same memory location or on adjacent locations. This is a complex subject of which I have only an inkling:

http://en.wikipedia.org/wiki/CPU_cache

but I do know that this issue can have a serious impact on shared-memory multiprocessing. The reason is that a CPU in one thread often has to wait for a CPU in another thread to release a chunk of memory, if the two threads are operating on adjacent memory locations (e.g. on elements within the same array). This can happen when the cache chunk is bigger than the array element size (in your case 4 bytes).

I encountered this unpleasant surprise in my OpenMP coding, and have developed coding procedures to minimize the slowdown. If I can't avoid having threads trying to access adjacent elements in an array a significant fraction of the time, I resort to the crude but effective expedient of padding out the array elements so that they are at least a cache-chunk apart. For example, if the cache-chunk is M bytes, instead of a(i) I use a((M/4)*i), where the dimension of a is now (M/4)*n. I suggest you experiment with this, and explore the effect on timing of different values of M.

By the way, random_number() in a parallel loop is a real trap for young players. The reason is that the random number generator maintains a global seed value - the state of the RNG. This value is accessed and changed by all the threads, leading to contention. Since I do a lot of Monte Carlo simulations I wrote my own parallel RNG, in which each thread generates a random sequence independent of the other threads (each has its own seed). When I get to work I'll send you a document I wrote on this subject, if you are interested and if you provide your email address.

Gib
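The padding idea described above can be sketched as follows. Each thread keeps a counter in its own column of a 2-D array, so counters written by different threads sit one assumed cache chunk apart. The 128-byte chunk size, the 4-byte default integer, and all names here are assumptions made for illustration, not values taken from this thread:

```fortran
module padded_counts
  !$ use omp_lib
  implicit none
  integer, parameter :: line_bytes = 128      ! ASSUMED cache-chunk size in bytes
  integer, parameter :: pad = line_bytes / 4  ! slots per chunk, assuming 4-byte integers
contains
  ! Count the elements of a that are below eps, one padded counter per thread.
  function count_below(a, eps) result(total)
    real, intent(in) :: a(:)
    real, intent(in) :: eps
    integer :: total
    integer, allocatable :: counts(:,:)
    integer :: i, tid, nthreads

    nthreads = 1
    !$ nthreads = omp_get_max_threads()
    allocate(counts(pad, nthreads))   ! column-major: counts(1,t) are pad integers apart
    counts = 0

    !$omp parallel private(tid, i)
    tid = 1
    !$ tid = omp_get_thread_num() + 1
    !$omp do
    do i = 1, size(a)
      if (a(i) < eps) counts(1, tid) = counts(1, tid) + 1
    end do
    !$omp end do
    !$omp end parallel

    total = sum(counts(1, :))
  end function count_below
end module padded_counts
```

Changing line_bytes and timing the result is one way to run the experiment suggested above. For a plain sum like this a reduction(+:total) clause would be the simpler choice; the padded array is used here only because it shows the layout.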
I understand this. My point was (or at least the one I wanted to make) that the do loop was running about 6 times faster (0.05s vs 0.3s) with OMP, while the equivalent where statement was a bit slower. It looked as if the OMP overhead was being paid, but the implied loop in the where statement and the array arithmetic was not executed in parallel.

The call to random_number() is outside the parallel region, I think.

George
Sorry for going off on a tangent :-)
On Jun 6, 8:21 pm, George Trojan <george.tro@noaa.gov> wrote:
> !$omp parallel workshare
> where (a < eps) a = a + 1.0
> !$omp end parallel workshare
Many OpenMP-capable Fortran compilers are still poor with WORKSHARE. I recall that Intel v 9.0 accepted the construct, but apparently did not parallelize at all (i.e. treated it as SINGLE). Although WORKSHARE seems the most convenient way of parallelizing array expressions (which may be slower or faster than DO loops), apparently compiler vendors care much more about DO, presumably because that's where most codes can save most time.
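If a compiler does treat WORKSHARE as SINGLE, one workaround is to write the masked assignment as an explicit loop under a parallel do, which is what the faster half of the original test already does. A minimal sketch; the module and procedure names are invented for this example:

```fortran
module where_as_loop
  implicit none
contains
  ! In-place equivalent of: where (a < eps) a = a + 1.0
  subroutine bump_below(a, eps)
    real, intent(inout) :: a(:)
    real, intent(in) :: eps
    integer :: i
    !$omp parallel do
    do i = 1, size(a)
      if (a(i) < eps) a(i) = a(i) + 1.0
    end do
    !$omp end parallel do
  end subroutine bump_below

  ! Convenience wrapper that returns a modified copy instead.
  function bumped(a, eps) result(b)
    real, intent(in) :: a(:)
    real, intent(in) :: eps
    real :: b(size(a))
    b = a
    call bump_below(b, eps)
  end function bumped
end module where_as_loop
```

Timing bump_below against the workshare version on the same array should show whether the compiler in question parallelizes WORKSHARE at all.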