Fortran Programming Language

what exactly does CPU_TIME measure?


In one of the assignments I have given to my students, the
purpose is to time the multiplication of two matrices for dimensions
N = 1, ..., N_max, using the CPU_TIME intrinsic to measure the
time and to calculate the number of Mflop/s that is achieved for
each N.
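
For reference, the skeleton of such a timing run might look roughly like
the sketch below (the fixed N, the use of MATMUL rather than explicit
loops, and the crude flop-count formula are my own choices here, not
requirements of the assignment):

      program time_matmul
        implicit none
        integer, parameter :: n = 500          ! arbitrary size for the sketch
        real :: a(n,n), b(n,n), c(n,n)
        real :: t0, t1, mflops

        call random_number(a)
        call random_number(b)

        call cpu_time(t0)
        c = matmul(a, b)
        call cpu_time(t1)

        print *, 'c(1,1) =', c(1,1)            ! use the result so the multiply is kept

        if (t1 > t0) then
           ! crude flop count: n*n dot products of length n, 2n-1 flops each
           mflops = (2.0*real(n) - 1.0)*real(n)*real(n) / (t1 - t0) / 1.0e6
           print *, 'N =', n, '  time =', t1 - t0, '  Mflop/s =', mflops
        else
           print *, 'elapsed CPU time below CPU_TIME resolution; increase N'
        end if
      end program time_matmul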

One of my students now asked if the time returned by CPU_TIME
also includes the time that was consumed by things like
cache-misses, page-faults, etc...

As I had some doubts about the answer myself, I went looking in
the document

http://j3-fortran.org/doc/year/97/97-007r2/pdf/97-007r2.pdf

and the end of Note 13.8 probably answers my question:

"Most computer systems have multiple concepts of time. One common concept is
 that of time expended by the processor for a given program. This may or may
 not include system overhead, and has no obvious connection to elapsed 'wall
 clock' time."

I would therefore conclude that it depends on the particular compiler, and on
how CPU_TIME is implemented, whether system-overhead effects like cache misses
and page faults are included in the timings.

Am I right with this assumption?

If 'yes', can anybody then tell me whether or not the following compilers
include system-overhead effects like cache misses and page faults in their
CPU_TIME timings?

NAGWare f95
gfortran
g95

When I look at the results of my experiments, I would say they do... but
it would be nice to get some feedback on this so I can also inform my students.

Thanks,
Bart

--
        "Share what you know.  Learn what you don't."

Bart Vandewoestyne wrote:

(snip)

> One of my students now asked if the time returned by CPU_TIME
> also includes the time that was consumed by things like
> cache-misses, page-faults, etc...

Cache misses are, in all systems I know, part of instruction
execution.  On some systems they do make the timing for one
program slightly dependent on what other programs are doing
on the same system, but the effect should be pretty small.

Page faults are different.  In that case, it is the OS doing
work for you, and I believe it depends on the OS how that time
is accounted for.  Unix gives two values for process CPU time
called user and system time.  It would be expected that paging
time would count toward system time, but I don't know that it
is required.  Consider a DLL shared between two users.  If that
is being paged in, which one does it count toward?

The first SPARC system I used, a Sun 4/110, had no hardware
multiply instruction.  Multiply was done by a system call
to the OS, which then did a software multiply.  This counted
toward system time, where on other systems multiply would count
toward user time.

Implementations of CPU_TIME then have to decide what to report
based on what the OS supplies.

My favorite timing system is the TSC, Time Stamp Counter, on
most intel processors.  It is a 64 bit counter counting at the
system clock frequency, and read using the RDTSC instruction
into the EAX and EDX registers.  For a 32 bit system, this
usually comes out just right for a function returning a
64 bit integer value, such that the two instructions

     rdtsc
     ret

plus whatever it takes to satisfy the assembler, will result
in a function returning the TSC value.  This allows one to
do timing over small sections of code without needing to average
over a long time.
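
Assuming such a routine has been assembled and linked in under the name
tsc (the name, the absence of name-mangling issues, and the 64-bit result
convention are all assumptions on my part), calling it from Fortran is
then only a matter of declaring the result kind; a minimal sketch:

      program tsc_demo
        implicit none
        integer, parameter :: i8 = selected_int_kind(18)
        integer(i8), external :: tsc   ! the two-instruction routine, linked separately
        integer(i8) :: t0, t1

        t0 = tsc()
        ! ... small section of code being timed ...
        t1 = tsc()
        print *, 'elapsed clock ticks:', t1 - t0
      end program tsc_demo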

-- glen

Ultimately, the actual timing comes from the operating system. All (?) OSes I
know give you two things: wall-clock time, and processor-time for the current
process, in which your program is running. Most (all?) OSes also allow you to
measure the fraction of that time spent in user mode - almost all of which is
going to be in your program or in support libraries - and system mode, a large
fraction of which is used by the OS operating on your program's behalf. But
the latter is not necessarily so: if, for instance, another process performs
event-based computing that is not synchronized with the clock interrupt and
always finishes within the inter-interrupt interval, it may be charged for
nothing even though it is consuming the overwhelming fraction of the
processor time, with your process being charged for everything.  Hence the
recommendation that, for such benchmarking, the program being timed be the
only active one, and that the measurement be repeated several times - for
instance, SPEC CPU requires at least three runs.

Cache misses et al. are definitely included in this type of measurement,
because the processor is typically not able to reschedule during a miss, nor
even to serve an interrupt.  (There are processors where this happens, but
it's unlikely you are using one of them.) Page faults and swapping are more
complicated: they definitely show up as system time, but some or all of that
might or might not be charged against a different (system) process.

Confused enough?

        Jan

Bart Vandewoestyne wrote:
> I would therefore conclude that it depends on the particular compiler, and
> on how CPU_TIME is implemented, whether system-overhead effects like cache
> misses and page faults are included in the timings.

> Am I right with this assumption?

> If 'yes', can anybody then tell me whether or not the following compilers
> include system-overhead effects like cache misses and page faults in their
> CPU_TIME timings?

> NAGWare f95
> gfortran
> g95

GFortran uses "GetProcessTimes" on Windows and "getrusage" on linux (if
available). See also:

    http://gcc.gnu.org/svn/gcc/trunk/libgfortran/intrinsics/cpu_time.c

What these system calls include exactly is not answered easily. From
experiments earlier this week I can tell: if OpenMP is used, CPU_TIME
returns the accumulated time spent by all threads (linux).
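
A quick sketch to see this for oneself (the loop is arbitrary filler work,
and whether CPU_TIME really sums over all threads depends on the compiler
and OS, as noted above):

      program cputime_vs_wall
        implicit none
        integer, parameter :: i8 = selected_int_kind(18)
        integer(i8) :: c0, c1, rate
        real :: t0, t1, s
        integer :: i

        s = 0.0
        call cpu_time(t0)
        call system_clock(c0, rate)
      !$omp parallel do reduction(+:s)
        do i = 1, 50000000
          s = s + sqrt(real(i))
        end do
      !$omp end parallel do
        call system_clock(c1)
        call cpu_time(t1)

        print *, 'sum       =', s   ! keep the loop from being optimized away
        print *, 'cpu_time  =', t1 - t0
        print *, 'wall time =', real(c1 - c0) / real(rate)
      end program cputime_vs_wall

With several threads, cpu_time can then come out well above the wall-clock
time.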

Not much of a help, but a start, maybe.

   Daniel

gfortran's implementation of cpu_time is representative of normal
practice.  The facilities used are the best means generally provided by
the OS for measuring the CPU time (total of all threads) associated with
the application, including those other events as much as possible.  As
others pointed out, hardware-level elapsed time tick counters are often
better suited for code performance tuning on a dedicated system.

Excellent summary.

cpu_time() and clock cycle counters have different purposes.  On most
popular platforms, the resolution of the counters which cpu_time() needs
falls in the range 1/1024 second to 0.010 sec.  Bus cycle counters have
overheads of a few bus cycles.  rdtsc, although it is scaled according
to the CPU clock, actually uses the front side bus clock on current
Intel platforms.  If one measures raw bus ticks, without correlating them
to outside-world time or to which application is responsible, the
resolution should be well under 1 microsecond.
BTW, several C compilers have a built-in __rdtsc() which avoids having
to know the various instruction sequences for 32- and 64-bit platforms.
I suppose a knowledgeable person could propose a gcc built-in which
would work on a wider variety of platforms.
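
A small sketch to get a feel for that cpu_time() resolution on one's own
system (the iteration count is arbitrary, and the result is only the
apparent step size, not a documented property of the compiler):

      program cputime_resolution
        implicit none
        real :: t_prev, t_now, step
        integer :: i

        step = huge(step)
        call cpu_time(t_prev)
        do i = 1, 5000000
          call cpu_time(t_now)
          if (t_now > t_prev) then
            step = min(step, t_now - t_prev)
            t_prev = t_now
          end if
        end do
        ! if no tick was ever observed, step is still HUGE(step)
        print *, 'apparent CPU_TIME resolution (s):', step
      end program cputime_resolution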

I suppose Fortran system_clock() was meant to cover some of the same
ground for which we use __rdtsc.  It falls far short when it happens to
produce a 32-bit integer result.  In principle (at least in gfortran) it
can be made to produce a 64-bit integer result, by supplying 64-bit
integer arguments.  The resolution can't be as good as a raw clock tick
counter, as it goes through all the C library support, using
gettimeofday() as the preferred method.
In the ifort docs, it says the count_rate is adjusted to the size of
integer used in the arguments.  This enables a reasonable time interval
to be measured with default integers, at the expense of resolution.
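
A small sketch to check that with one's own compiler (whether the two
rates actually differ is, of course, implementation-dependent):

      program system_clock_rates
        implicit none
        integer, parameter :: i4 = selected_int_kind(9)
        integer, parameter :: i8 = selected_int_kind(18)
        integer(i4) :: rate_default
        integer(i8) :: rate_wide

        call system_clock(count_rate=rate_default)   ! typically the default integer kind
        call system_clock(count_rate=rate_wide)      ! 64-bit integer argument
        print *, 'count_rate with default integers:', rate_default
        print *, 'count_rate with 64-bit integers :', rate_wide
      end program system_clock_rates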

Some recent processors incorporate frequency changing as part of their
power saving strategy, which makes it harder to use a hardware register
to keep track of time.

If, in fact, RDTSC gives the quantity \int_0^T f(t) dt, where f is the
(possibly varying) clock frequency, one has to differentiate the RDTSC
readings in order to evaluate the elapsed time \int_0^T dt!

-- mecej4

mecej4 wrote:

(snip)

> Some recent processors incorporate frequency changing as part of their
> power saving strategy, which makes it harder to use a hardware register
> to keep track of time.
> If, in fact, RDTSC gives the quantity \int_0^T f(t) dt, where f is the
> (possibly varying) clock frequency, one has to differentiate the RDTSC
> readings in order to evaluate the elapsed time \int_0^T dt!

Yes, but if you consider the goal of minimizing the number of
clock cycles then rdtsc makes more sense.  Or, to put it
another way, not all CPU seconds are equal (on a variable
speed CPU) but clock cycles are.

On the other hand, rdtsc measures wall time, not CPU time.
If you time small enough sections of code you will rarely be
interrupted.  Small sections are difficult to time with a coarse
resolution clock.  If the OS would keep track of rdtsc cycles
allocated to a given task, then it might be even more useful.

-- glen


glen herrmannsfeldt wrote:
> mecej4 wrote:

> (snip)
>> Some recent processors incorporate frequency changing as part of their
>> power saving strategy, which makes it harder to use a hardware
>> register to keep track of time.

I see, so when my wife says "I'll be ready in 5 minutes" she's actually
saving power?  ;) ;)  ;)

>> If, in fact, RDTSC gives the quantity \int_0^T f(t) dt, where f is the
>> (possibly varying) clock frequency, one has to differentiate the RDTSC
>> readings in order to evaluate the elapsed time \int_0^T dt!

> Yes, but if you consider the goal of minimizing the number of
> clock cycles then rdtsc makes more sense.  Or, to put it
> another way, not all CPU seconds are equal (on a variable
> speed CPU) but clock cycles are.

A question.  Does the frequency change also apply to memory reads
or cache reads?  Does it always take the memory system the same
amount of cycles, or the same amount of fracto-seconds, to deliver a
value?  Depending on the answer, memory bound jobs might be somewhat
independent of clock frequency.

The same question applies to disk bound jobs.  They're likely to
be independent of processor speed.

Dick Hendrickson

No, it measures the CPU tick count (which is why the CPU changing
frequency is a problem for timing).  Second, each CPU has its own TSC,
which is not synchronized with the TSCs of other CPUs, so if your
app happens to switch to another CPU on a multiprocessor -> oops.

For these reasons, RDTSC is pretty useless these days. Which is a bit
of a shame, since as such it's the most accurate and lowest overhead
timer available.

See e.g.

http://www.ussg.iu.edu/hypermail/linux/kernel/0505.1/1463.html

http://en.wikipedia.org/wiki/RDTSC

> If you time small enough sections of code you will rarely be
> interrupted.  Small sections are difficult to time with a coarse
> resolution clock.

Fancier profilers these days seem to use hw counters and sampling
(e.g. oprofile), so you can get accurate timing even for small
sections, provided the loop count is high enough.

--
Janne Blomqvist

Janne Blomqvist wrote:

(snip)

>>On the other hand, rdtsc measures wall time, not CPU time.
> No, it measures the CPU tick count (which is why the CPU changing
> frequency is a problem for timing).  Second, each CPU has its own TSC,
> which is not synchronized with the TSCs of other CPUs, so if your
> app happens to switch to another CPU on a multiprocessor -> oops.

I meant wall time, not CPU time, in that the clock doesn't stop
when a task stops executing, as a CPU time clock would.

It is my understanding that they are supposed to be synchronized
on multi-CPU systems, though it might not always happen.  I have
used rdtsc on multi-CPU systems, and never had any case where the
value decreased.

> For these reasons, RDTSC is pretty useless these days. Which is a bit
> of a shame, since as such it's the most accurate and lowest overhead
> timer available.

I do agree that it isn't so useful as a clock, but I do believe
it is still useful for finding which parts of a program are
taking more clock cycles.

-- glen

Well, time does march on, and for about 3 years now the counter read by
rdtsc has not changed frequency on Intel platforms, while "64-bit" Xeon
and Opteron have been the standard.