Fortran Programming Language
what exactly does CPU_TIME measure?
In one of the assignments I have given to my students, the purpose is to time the multiplication of two matrices for dimension N = 1, ..., N_max, using the CPU_TIME intrinsic to measure the time and calculate the number of Mflop/s achieved for each N.

One of my students now asked whether the time returned by CPU_TIME also includes the time consumed by things like cache misses, page faults, etc.

As I was actually unsure of the answer myself, I went looking in the document http://j3-fortran.org/doc/year/97/97-007r2/pdf/97-007r2.pdf and the end of Note 13.8 probably answers my question:

"Most computer systems have multiple concepts of time. One common concept is that of time expended by the processor for a given program. This may or may not include system overhead, and has no obvious connection to elapsed 'wall clock' time."

So I would conclude that it depends on the particular compiler and how CPU_TIME is implemented whether or not system-overhead stuff like cache misses and page faults is included in the timings. Am I right with this assumption?

If 'yes', can anybody then tell me whether or not the following compilers include system-overhead stuff like cache misses and page faults in their CPU_TIME timings?

NAGWare f95
gfortran
g95

When I look at the results of my experiments, I would say they do, but it would be nice to get some feedback on this so I can also inform my students.

Thanks,
Bart

--
"Share what you know. Learn what you don't."
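For readers who want to reproduce the kind of measurement described in the question, here is a minimal sketch. It is not the actual assignment code: the program name, the doubling of N (rather than every N from 1 to N_max), the value of n_max, the 0.2 s minimum measurement time and the use of the MATMUL intrinsic are all assumptions made for illustration. The flop count is taken as approximately 2*N**3 per multiplication.

```fortran
program matmul_mflops
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter  :: n_max = 256          ! assumed upper bound on N
  real(dp), parameter :: min_time = 0.2_dp    ! assumed minimum CPU time per measurement (s)
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  real(dp) :: t0, t1, elapsed, checksum
  integer :: n, reps, r

  call cpu_time(t0)
  if (t0 < 0.0_dp) stop 'CPU_TIME is not supported by this processor'

  n = 16
  do while (n <= n_max)
     allocate(a(n,n), b(n,n), c(n,n))
     call random_number(a)
     call random_number(b)
     checksum = 0.0_dp
     reps = 1
     do
        ! Repeat the product until the timed interval is well above
        ! the resolution of CPU_TIME.
        call cpu_time(t0)
        do r = 1, reps
           c = matmul(a, b)
           checksum = checksum + c(1,1)   ! use the result so it is not optimized away
        end do
        call cpu_time(t1)
        elapsed = t1 - t0
        if (elapsed >= min_time) exit
        reps = reps * 2
     end do
     ! About 2*N**3 floating-point operations per matrix product.
     print '(a,i5,a,i8,a,f10.1,a,es12.4)', 'N =', n, '  reps =', reps, '  Mflop/s =', &
          2.0_dp * real(n, dp)**3 * real(reps, dp) / elapsed / 1.0e6_dp, &
          '  checksum =', checksum
     deallocate(a, b, c)
     n = n * 2
  end do
end program matmul_mflops
```

Whatever CPU_TIME charges to the process (cache misses, and possibly paging) ends up in the denominator, so the reported Mflop/s already reflects it.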
Bart Vandewoestyne wrote:
(snip) > One of my students now asked if the time returned by CPU_TIME > also includes the time that was consumed by things like > cache-misses, page-faults, etc...
Cache misses are, in all systems I know, part of instruction execution. On some systems it does make the timing for one program slightly dependent on what other programs are doing on the same system, but the effect should be pretty small.

Page faults are different. In that case, it is the OS doing work for you, and I believe it depends on the OS how that time is accounted for. Unix gives two values for process CPU time, called user and system time. It would be expected that paging time would count toward system time, but I don't know that it is required. Consider a DLL shared between two users. If that is being paged in, which one does it count toward?

The first SPARC system I used, a Sun 4/110, had no hardware multiply instruction. Multiply was done by a system call to the OS, which then did a software multiply. This counted toward system time, where on other systems multiply would count toward user time.

Implementations of CPU_TIME then have to decide what to report based on what the OS supplies.

My favorite timing system is the TSC, Time Stamp Counter, on most Intel processors. It is a 64 bit counter counting at the system clock frequency, and read using the RDTSC instruction into the EAX and EDX registers. For a 32 bit system, this usually comes out just right for a function returning a 64 bit integer value, such that the two instructions

    rdtsc
    ret

plus whatever it takes to satisfy the assembler, will result in a function returning the TSC value. This allows one to do timing over small sections of code without needing to average over a long time.

-- glen
Ultimately, the actual timing comes from the operating system. All (?) OSes I know give you two things: wall-clock time, and processor time for the current process, in which your program is running.

Most (all?) OSes also allow you to split that processor time into time spent in user mode, almost all of which is going to be in your program or in support libraries, and time spent in system mode, a large fraction of which is used by the OS operating on your program's behalf. But the latter is not necessarily so: if, for instance, another process performs event-based computing not synchronized with the clock interrupt, and always shorter than the inter-interrupt interval, it might be charged for nothing, although it is consuming the overwhelming fraction of the processor time, with your process being charged for everything.

Thus the recommendation that, for such benchmarking, the program being timed be the only active one, and that the measurement be repeated several times. SPEC CPU, for instance, requires at least three repeats.

Cache misses et al. are definitely included in this type of measurement, because the processor is typically not able to reschedule between misses, nor even to service an interrupt or such. (There are processors where this happens, but it's unlikely you are using one of them.)

Page faults and swapping are more complicated: they definitely show up as system time, but some or all of that might or might not be charged against a different (system) process.

Confused enough?

Jan
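The two notions of time described above can be compared directly from Fortran. The following is only a sketch: the program name, the workload (a sum of square roots) and the loop count are arbitrary assumptions, and the three repeats merely echo the SPEC CPU remark. CPU_TIME supplies the processor time and SYSTEM_CLOCK the elapsed time.

```fortran
program cpu_vs_wall
  use, intrinsic :: iso_fortran_env, only: dp => real64, int64
  implicit none
  integer(int64) :: c0, c1, rate
  real(dp) :: t0, t1, s
  integer :: i, rep

  do rep = 1, 3                      ! repeat the measurement several times
     call system_clock(c0, rate)     ! elapsed ("wall-clock") ticks and tick rate
     call cpu_time(t0)               ! processor time charged to this process
     s = 0.0_dp
     do i = 1, 50000000              ! arbitrary workload
        s = s + sqrt(real(i, dp))
     end do
     call cpu_time(t1)
     call system_clock(c1)
     print '(a,i2,a,f8.3,a,f8.3,a,es10.3)', 'run', rep, '  cpu =', t1 - t0, &
          ' s  wall =', real(c1 - c0, dp) / real(rate, dp), ' s  sum =', s
  end do
end program cpu_vs_wall
```

On an otherwise idle machine the two columns should be close for a single-threaded run; a noticeably larger wall time suggests the process was waiting (other active programs, paging, I/O) rather than computing.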
Bart Vandewoestyne wrote:
> So therefore I would conclude that it depends on the particular compiler
> and how CPU_TIME is implemented whether or not system-overhead stuff like
> cache-misses and page faults are included in the timings or not.
> Am I right with this assumption?
> If 'yes', can anybody then tell me whether or not the following compilers
> include system-overhead stuff like cache-misses and page faults in their
> CPU_TIME timings?
> NAGWare f95
> gfortran
> g95
GFortran uses "GetProcessTimes" on Windows and "getrusage" on Linux (if available). See also:

http://gcc.gnu.org/svn/gcc/trunk/libgfortran/intrinsics/cpu_time.c

What exactly these system calls include is not easily answered. From experiments earlier this week I can tell: if OpenMP is used, CPU_TIME returns the accumulated time spent by all threads (Linux).

Not much of a help, but a start, maybe.

Daniel
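The OpenMP observation can be checked with a small experiment along the following lines. This is a sketch, not the experiment referred to above: the program name, the loop and its length are assumptions. With gfortran it would be built with -fopenmp; without that flag the directives are treated as comments and the loop runs serially.

```fortran
program cpu_time_threads
  use, intrinsic :: iso_fortran_env, only: dp => real64, int64
  implicit none
  integer(int64) :: c0, c1, rate
  real(dp) :: t0, t1, s, cpu, wall
  integer :: i

  s = 0.0_dp
  call system_clock(c0, rate)
  call cpu_time(t0)
  !$omp parallel do reduction(+:s)
  do i = 1, 200000000                ! arbitrary workload, shared among the threads
     s = s + sqrt(real(i, dp))
  end do
  !$omp end parallel do
  call cpu_time(t1)
  call system_clock(c1)

  cpu  = t1 - t0
  wall = real(c1 - c0, dp) / real(rate, dp)
  print '(a,f8.3,a,f8.3,a,es10.3)', 'cpu =', cpu, ' s  wall =', wall, ' s  sum =', s
  ! If CPU_TIME accumulates over all threads, this ratio approaches the thread count.
  if (wall > 0.0_dp) print '(a,f6.2)', 'cpu/wall =', cpu / wall
end program cpu_time_threads
```

A ratio near 1 indicates a serial run (or per-thread accounting); a ratio near the number of threads matches the accumulated behaviour described above.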
Daniel Franke wrote:
(snip)
> GFortran uses "GetProcessTimes" on Windows and "getrusage" on linux (if
> available). See also:
> http://gcc.gnu.org/svn/gcc/trunk/libgfortran/intrinsics/cpu_time.c
> What these system calls include exactly is not answered easily. From
> experiments earlier this week I can tell: if OpenMP is used, CPU_TIME
> returns the accumulated time spent by all threads (linux).
gfortran's implementation of cpu_time is representative of normal practice. The facilities used are the best means generally provided by the OS for measuring the CPU time (total of all threads) associated with the application, including those other events as much as possible. As others pointed out, hardware-level elapsed time tick counters are often better suited for code performance tuning on a dedicated system.
glen herrmannsfeldt wrote:
(snip)
> Implementations of CPU_TIME then have to decide what to report
> based on what the OS supplies.
> My favorite timing system is the TSC, Time Stamp Counter, on
> most intel processors. It is a 64 bit counter counting at the
> system clock frequency, and read using the RDTSC instruction
> into the EAX and EDX registers. For a 32 bit system, this
> usually comes out just right for a function returning a
> 64 bit integer value, such that the two instructions
> rdtsc
> ret
> plus whatever it takes to satisfy the assembler, will result
> in a function returning the TSC value. This allows one to
> do timing over small sections of code without needing to average
> over a long time.
Excellent summary.

cpu_time() and clock cycle counters have different purposes. On most popular platforms, the resolution of the counters which cpu_time() relies on falls in the range 1/1024 second to 0.010 sec. Bus cycle counters have overheads of a few bus cycles. rdtsc, although it is scaled according to the CPU clock, actually uses the front side bus clock on current Intel platforms. By measuring raw bus ticks without correlation to outside world time or to which application is responsible, the resolution should be well under 1 microsecond.

BTW, several C compilers have a built-in __rdtsc() which avoids having to know the various instruction sequences for 32- and 64-bit platforms. I suppose a knowledgeable person could propose a gcc built-in which would work on a wider variety of platforms.
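The granularity of cpu_time() on a given platform can be estimated by calling it in a tight loop until the returned value changes. A minimal sketch follows; the program name and the choice of five samples are arbitrary assumptions. The busy-wait itself consumes processor time, so the reported value keeps advancing.

```fortran
program cpu_time_tick
  use, intrinsic :: iso_fortran_env, only: dp => real64
  implicit none
  real(dp) :: t_prev, t_now
  integer :: k

  call cpu_time(t_prev)
  if (t_prev < 0.0_dp) stop 'CPU_TIME is not supported by this processor'

  do k = 1, 5
     t_now = t_prev
     do while (t_now <= t_prev)      ! busy-wait until the reported value changes
        call cpu_time(t_now)
     end do
     print '(a,i1,a,es10.3,a)', 'step ', k, ': ', t_now - t_prev, ' s'
     t_prev = t_now
  end do
end program cpu_time_tick
```

The smallest step printed gives a rough upper bound on the resolution; a timed section should run for many such steps before its Mflop/s figure means much.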
Tim Prince wrote:
(snip)
> BTW, several C compilers have a built-in __rdtsc() which avoids having
> to know the various instruction sequences for 32- and 64-bit platforms.
> I suppose a knowledgeable person could propose a gcc built-in which
> would work on a wider variety of platforms.
I suppose Fortran system_clock() was meant to cover some of the same ground for which we use __rdtsc. It falls far short when it happens to produce a 32-bit integer result. In principle (at least in gfortran) it can be made to produce a 64-bit integer result, by supplying 64-bit integer arguments. The resolution can't be as good as a raw clock tick counter, as it goes through all the C library support, using gettimeofday() as the preferred method. In the ifort docs, it says the count_rate is adjusted to the size of integer used in the arguments. This enables a reasonable time interval to be measured with default integers, at the expense of resolution.
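The effect of the integer kind on system_clock() can be inspected directly. The sketch below (program and variable names are assumptions) queries the count rate and maximum count with 32-bit and 64-bit arguments; the values printed are implementation dependent, so none are shown here.

```fortran
program clock_kinds
  use, intrinsic :: iso_fortran_env, only: dp => real64, int32, int64
  implicit none
  integer(int32) :: c32, rate32, max32
  integer(int64) :: c64, rate64, max64

  ! Same intrinsic, different integer kinds for all three arguments.
  call system_clock(c32, rate32, max32)
  call system_clock(c64, rate64, max64)

  print '(a,i0,a,i0)', '32-bit: rate = ', rate32, ' ticks/s, count_max = ', max32
  print '(a,i0,a,i0)', '64-bit: rate = ', rate64, ' ticks/s, count_max = ', max64

  ! Longest interval measurable before the counter wraps around.
  if (rate32 > 0) print '(a,f0.1,a)', '32-bit counter wraps after ', &
       real(max32, dp) / real(rate32, dp), ' s'
  if (rate64 > 0) print '(a,es10.3,a)', '64-bit counter wraps after ', &
       real(max64, dp) / real(rate64, dp), ' s'
end program clock_kinds
```

A higher rate means finer resolution but, for a fixed integer size, a shorter interval before wrap-around, which is the trade-off described for default integers.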
Tim Prince wrote:
(snip)
> gfortran's implementation of cpu_time is representative of normal
> practice. The facilities used are the best means generally provided by
> the OS for measuring the CPU time (total of all threads) associated with
> the application, including those other events as much as possible. As
> others pointed out, hardware-level elapsed time tick counters are often
> better suited for code performance tuning on a dedicated system.
Some recent processors incorporate frequency changing as part of their power saving strategy, which makes it harder to use a hardware register to keep track of time.

If, in fact, RDTSC gives the quantity \int_0^T f\,dt, one has to differentiate the RDTSC readings and then evaluate \int_0^T dt!

-- mecej4
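Spelled out, the remark reads as follows. The symbols are introduced here for illustration only: C(T) is the counter reading after elapsed time T, and f(t) > 0 is the instantaneous clock frequency, assumed to be what drives the counter.

```latex
C(T) = \int_0^T f(t)\,dt,
\qquad
\frac{dC}{dt} = f(t),
\qquad
T = \int_0^T dt = \int_0^{C(T)} \frac{dC}{f}
```

Only when f is constant does this collapse to T = C(T)/f, which is the conversion one would like to apply to a raw tick count.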
mecej4 wrote:
(snip)
> Some recent processors incorporate frequency changing as part of their
> power saving strategy, which makes it harder to use a hardware register
> to keep track of time.
> If, in fact, RDTSC gives the quantity \int_0^T f\,dt, one has to
> differentiate the RDTSC readings and then evaluate \int_0^T dt!
Yes, but if you consider the goal of minimizing the number of clock cycles, then rdtsc makes more sense. Or, to put it another way, not all CPU seconds are equal (on a variable speed CPU) but clock cycles are.

On the other hand, rdtsc measures wall time, not CPU time. If you time small enough sections of code you will rarely be interrupted. Small sections are difficult to time with a coarse resolution clock. If the OS would keep track of rdtsc cycles allocated to a given task, then it might be even more useful.

-- glen
glen herrmannsfeldt wrote:
> mecej4 wrote:
> (snip)
>> Some recent processors incorporate frequency changing as part of their
>> power saving strategy, which makes it harder to use a hardware
>> register to keep track of time.

I see, so when my wife says "I'll be ready in 5 minutes" she's actually saving power? ;) ;) ;)

>> If, in fact, RDTSC gives the quantity \int_0^T f\,dt, one has to
>> differentiate the RDTSC readings and then evaluate \int_0^T dt!
> Yes, but if you consider the goal of minimizing the number of
> clock cycles then rdtsc makes more sense. Or, to put it
> another way, not all CPU seconds are equal (on a variable
> speed CPU) but clock cycles are.
A question. Does the frequency change also apply to memory reads or cache reads? Does it always take the memory system the same amount of cycles, or the same amount of fracto-seconds, to deliver a value? Depending on the answer, memory bound jobs might be somewhat independent of clock frequency. The same question applies to disk bound jobs. They're likely to be independent of processor speed. Dick Hendrickson
> On the other hand, rdtsc measure wall time, not CPU time.
> If you time small enough sections of code you will rarely be
> interrupted. Small sections are difficult to time with a coarse
> resolution clock. If the OS would keep track of rdtsc cycles
> allocated to a given task, then it might be even more useful.
> -- glen
In article <2tydnZXVes5fdq3bnZ2dnUVZ_oytn @comcast.com>, glen herrmannsfeldt wrote:
(snip)
> On the other hand, rdtsc measure wall time, not CPU time.
No, it measures the CPU tick count (which is why the CPU changing frequency is a problem for timing). Secondly, each CPU has its own TSC, which is not synchronized with the TSCs of other CPUs, so if your app happens to switch to another CPU on a multiprocessor -> oops.

For these reasons, RDTSC is pretty useless these days. Which is a bit of a shame, since as such it's the most accurate and lowest overhead timer available. See e.g.

http://www.ussg.iu.edu/hypermail/linux/kernel/0505.1/1463.html
http://en.wikipedia.org/wiki/RDTSC

> If you time small enough sections of code you will rarely be
> interrupted. Small sections are difficult to time with a coarse
> resolution clock.
Fancier profilers these days seem to use hw counters and sampling (e.g. oprofile), so you can get accurate timing even for small sections, provided the loop count is high enough. -- Janne Blomqvist
Janne Blomqvist wrote:
(snip) >>On the other hand, rdtsc measure wall time, not CPU time. > No, it measures the cpu tick count (which is why the CPU changing > frequency is a problem for timing). Secondly each CPU has its own TSC, > which is not synchronized with the TSC:s of other CPU:s, so if your > app happens to switch to another CPU on a multiprocessor -> oops.
I meant wall time, not CPU time, in that the clock doesn't stop when a task stops executing, as a CPU time clock would.

It is my understanding that they are supposed to be synchronized on multi-CPU systems, though it might not always happen. I have used rdtsc on multi-CPU systems, and never had any case where the value decreased.

> For these reasons, RDTSC is pretty useless these days. Which is a bit
> of a shame, since as such it's the most accurate and lowest overhead
> timer available.
I do agree that it isn't so useful as a clock, but I do believe it is still useful for finding which parts of a program are taking more clock cycles. -- glen
mecej4 wrote:
(snip)
> Some recent processors incorporate frequency changing as part of their
> power saving strategy, which makes it harder to use a hardware register
> to keep track of time.
Well, time does march on, and it has been about 3 years now that the counter for rdtsc doesn't change frequency on Intel platforms, and "64-bit" Xeon and Opteron have been the standard.