Fortran Programming Language

Speed versus memory


I am going through and updating a computational fluid dynamics solver
from F77-style programming to "modern" Fortran (F95, not 2003, since
the latter doesn't seem to be generally available). The code will be
used to run very large jobs (upwards of 100 million grid points and
many thousands of time steps) on massively parallel architectures.

My initial impulse was to use vector processes everywhere. To some
extent this was driven by books such as "Numerical Recipes in F90" and
"Fortran Explained 90/95" which take the stance that do-loops
introduce an artificial order to the processes (do i=1, then i=2, then
i=3...) which limit a compiler's ability to generate fast executables.
Algorithms written using vector operators are also generally less
verbose, which I find makes them easier to read and check for bugs.
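
To make the contrast concrete, here is a trivial made-up fragment (the
names q, rhs, and dt are just placeholders, not from my actual solver):

   subroutine advance(n, dt, rhs, q)
     implicit none
     integer, intent(in)    :: n
     real,    intent(in)    :: dt, rhs(n)
     real,    intent(inout) :: q(n)
     integer :: i

     ! do-loop form: spells out the order i = 1, 2, 3, ...
     do i = 1, n
        q(i) = q(i) + dt*rhs(i)
     end do

     ! the equivalent whole-array form implies no element order:
     !    q = q + dt*rhs
   end subroutine advance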

However, I've noticed that writing complex algorithms (such as an
upwind quadratic interpolation routine) using vector notation requires
an absurd amount of memory. Such a routine has only two input arrays
and one output array but goes through ten intermediate arrays all of
which must have space allocated for them when using vector processes.
Even if the compiler splits the work among a large number of
processors, I have found that the routine will use thousands of times
more memory than the same routine written with do-loops (with ten
continually redefined scalars). I can't imagine this is
computationally fast, and it's limiting the size of jobs I can run.
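
As a rough sketch of what I mean (a made-up central-difference
stand-in, not the actual upwind routine), the array-operation form
needs full-size temporaries where the do-loop form only ever holds a
couple of scalars:

   subroutine diff_array(n, u, phi, flux)
     ! array-operation form: t1, t2, t3 are full-size temporaries
     implicit none
     integer, intent(in)  :: n
     real,    intent(in)  :: u(n), phi(0:n+1)
     real,    intent(out) :: flux(n)
     real, dimension(n)   :: t1, t2, t3
     t1 = phi(2:n+1) - phi(1:n)
     t2 = phi(1:n)   - phi(0:n-1)
     t3 = 0.5*(t1 + t2)
     flux = u*t3
   end subroutine diff_array

   subroutine diff_loop(n, u, phi, flux)
     ! do-loop form: the same intermediates stay scalar
     implicit none
     integer, intent(in)  :: n
     real,    intent(in)  :: u(n), phi(0:n+1)
     real,    intent(out) :: flux(n)
     integer :: i
     real    :: s1, s2
     do i = 1, n
        s1 = phi(i+1) - phi(i)
        s2 = phi(i)   - phi(i-1)
        flux(i) = u(i)*0.5*(s1 + s2)
     end do
   end subroutine diff_loop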

I have hacked together some methods using pointers and elemental
routines that seem like they might be ideal. As I understand it,
pointers don't really use up memory (just a few bytes per array), and
it seems that intermediate variables within an elemental routine are
allocated efficiently (they aren't blown up to the full array size
unless the machine can operate on all those components
simultaneously), but I have no idea how compiler- and
platform-dependent that will be.
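
Roughly, what I've hacked together looks like the toy sketch below
(the names are invented): the locals inside the elemental stay scalar,
and the pointer just aliases existing storage instead of copying it.

   module interp_mod
     implicit none
   contains
     elemental function midpoint(a, b) result(c)
       real, intent(in) :: a, b
       real :: c
       real :: d   ! scalar here; whether a compiler expands it when
                   ! vectorising is up to the implementation
       d = 0.5*(a - b)
       c = b + d
     end function midpoint
   end module interp_mod

   program demo
     use interp_mod
     implicit none
     integer, parameter :: n = 8
     real, target  :: phi(0:n+1)
     real, pointer :: p(:)
     real          :: flux(n)

     call random_number(phi)
     p => phi(1:n)                    ! alias into phi, no copy made
     flux = midpoint(p, phi(0:n-1))   ! elemental applied element-wise
     print *, flux
   end program demo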

I'm wondering if other people have any hard-and-fast rules or guiding
principles in regards to the problem of speed versus memory
allocation. Thanks.

Gabe <DrWeymo@gmail.com> wrote:
> My initial impulse was to use vector processes everywhere.

I assume from context that by "vector processes" you mean what are more
commonly called something like whole-array operations. Your "vector
processes" made me think of things like OpenMP and the like.

> To some
> extent this was driven by books such as "Numerical Recipes in F90" and
> "Fortran Explained 90/95" which take the stance that do-loops
> introduce an artificial order to the processes (do i=1, then i=2, then
> i=3...) which limit a compiler's ability to generate fast executables.

Do those texts really say that? It has been quite a while since I looked
at Numerical Recipes, but I know the authors of the "Explained" series
and I wouldn't expect any of them to say quite that.

Generalizing about performance is *VERY* dangerous in many areas,
specifically including this one. Counterexamples of almost any
generalization can be found. That being said, my general impression is
pretty much the opposite - Fortran compilers have been optimizing DO
loops for over 5 decades now. They have gotten pretty darned good at it.
Optimization of whole array expressions is a much newer field. It is
done, but things about it are still being learned and working their way
into practice.

> Algorithms written using vector operators are also generally less
> verbose, which I find makes them easier to read and check for bugs.

Sometimes. Plenty of counterexamples exist, but I think that is more
often correct than not. In any case, it is at least often correct. I do
think that, on the whole, that's a better reason than performance for
choosing the form.

> However, I've noticed that writing complex algorithms...
> ... intermediate arrays all of which must have space allocated...

And that allocation/deallocation takes time independent of memory use
issues.

> I have found that the routine will use thousands of times
> more memory than the same routine written with do-loops

Thousands of times sounds implausible. Certainly one can generate things
arbitrarily bad, but I wouldn't expect a factor of thousands. This
sounds to me like a memory leak... also known as a bug... rather than
expected behavior. I couldn't say whether the bug is in your coding of
the array expressions or in the compiler's implementation. (I have in
the past seen compiler bugs with failing to deallocate compiler
temporaries.) In either case, it sounds an awful lot like a bug.

--
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain

Richard Maine wrote:
> Gabe <DrWeymo@gmail.com> wrote:

(snip)

My understanding of the above is that DO loops work well for
serial processors, but that array operations might work better
for parallel processors.  It should be easier to vectorize an
array expression, for example.  Still, it will depend much on
how each one is actually written.

>>Algorithms written using vector operators are also generally less
>>verbose, which I find makes them easier to read and check for bugs.
> Sometimes. Plenty of counterexamples exist, but I think that is more
> often correct than not. In any case, it is at least often correct. I do
> think that, on the whole, that's a better reason than performance for
> choosing the form.

My feeling is that array operators are a good choice
for simpler operations, and DO loops for more complex
operations.  Though that is mostly considering serial
processors.

One reason is that more temporary arrays tend to be needed for
array operations.  Also, sometimes the expression
as written will require much more work even though it
looks simpler mathematically.
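
For example (a made-up fragment; a is just some real array of
extent n): the overlap between left and right sides will often force
a compiler temporary in the array form, while a DO loop run in the
right order needs none.

   subroutine shift_add_array(n, a)
     implicit none
     integer, intent(in)    :: n
     real,    intent(inout) :: a(n)
     ! a(2:n) appears on both sides, so the compiler will often
     ! evaluate the right-hand side into a full-size temporary first
     a(2:n) = a(2:n) + a(1:n-1)
   end subroutine shift_add_array

   subroutine shift_add_loop(n, a)
     implicit none
     integer, intent(in)    :: n
     real,    intent(inout) :: a(n)
     integer :: i
     ! running the index downward reproduces the array semantics
     ! with no temporary at all
     do i = n, 2, -1
        a(i) = a(i) + a(i-1)
     end do
   end subroutine shift_add_loop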

>>However, I've noticed that writing complex algorithms...
>>... intermediate arrays all of which must have space allocated...
> And that allocation/deallocation takes time independent of memory use
> issues.
>>I have found that the routine will use thousands of times
>>more memory than the same routine written with do-loops
> Thousands of times sounds implausible. Certainly one can generate things
> arbitrarily bad, but I wouldn't expect a factor of thousands. This
> sounds to me like a memory leak... also known as a bug... rather than
> expected behavior. I couldn't say whether the bug is in your coding of
> the array expressions or in the compiler's implementation. (I have in
> the past seen compiler bugs with failing to deallocate compiler
> temporaries.) In either case, it sounds an awful lot like a bug.

If it is one very large temporary array it could easily be
thousands of times more than needed for DO loops.  If it is
running on multiple processors, though, it shouldn't need so much
for each one.

-- glen

Gabe <DrWeymo@gmail.com> writes:
> My initial impulse was to use vector processes everywhere.

As I understand it, by "vector processes" you mean array notation.

> However, I've noticed that writing complex algorithms (such as an
> upwind quadratic interpolation routine) using vector notation requires
> an absurd amount of memory.

Richard already stated that performance optimisation of array notation
statements is still developing and maturing.

Regarding your initial problem, I would suggest that you write the
code using array notation because it's often more concise and (more
importantly) makes it clear whether or not execution order matters
(the same goes for the use of FORALL and WHERE constructs).
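
For instance (a made-up fragment with arbitrary names), WHERE and
FORALL state directly that the assignments carry no ordering:

   subroutine limit_and_scale(n, qmin, scale, q)
     implicit none
     integer, intent(in)    :: n
     real,    intent(in)    :: qmin, scale(n)
     real,    intent(inout) :: q(n)
     integer :: i

     ! WHERE: masked assignment with no implied element order
     where (q < qmin)
        q = qmin
     end where

     ! FORALL: index-parallel assignment, again no implied order
     forall (i = 1:n)
        q(i) = scale(i)*q(i)
     end forall
   end subroutine limit_and_scale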

To optimise the serial part of your program, you will have to do
performance measurements; if there are memory problems caused by
inefficient array syntax, you can always rewrite the critical part in
explicit DO-loop terms. Having clearly stated where execution order
matters or not helps a great deal when you're parallelising the code.

Citing Hoare: "Premature optimisation is the root of all evil"

Sebastian

Thanks for all three responses. First off, I did mean to say array
notation. Thanks for putting up with my poor Fortran vocabulary.

The point that compilers are more experienced at optimizing do-loops
is a very interesting one. "Recipes" says many times that array
notation and use of intrinsics allow compilers to make faster code.
For example, from section 22.1 "In fact the serial code above [nested
do-loop example] overspecifies the desired task...The essence of
parallel programming is not to force "into the time dimension"(ie, to
serialize) operations that naturally extend across a span of data..."
On page 1015: "One quick (if superficial) test of how much parallelism
is achieved in a Fortran 90 code is to count its do-loops..." You read
enough of these kinds of comments and you'll start thinking do-loops
are slow as mud and should be avoided like the plague. "Fortran
Explained" isn't so damning, but they still tout array operations as
the best thing since sliced bread.

And while my estimate of 1000 was just a back of the envelope thing it
seems reasonable to me. If I've got 10 dummy arrays with more than
1000 components each (even after breaking the full grid over a large
number of processors) then wouldn't that take more than 1000 times
more memory than the serial version which just has 10 scalars?

Also, does anyone have any opinion on elemental routines? I've found
that converting routines with array operations to elemental routines
is straightforward, but I don't have enough experience to know if
there's a consistent advantage in terms of memory or speed. (I've
found that my upwind quadratic interpolation algorithm uses orders of
magnitude less memory on my 250,000 grid point test case when written
with pointer arrays passed to an elemental routine.)

Gabe <DrWeymo@gmail.com> wrote:
> On page 1015: "One quick (if superficial) test of how much parallelism
> is achieved in a Fortran 90 code is to count its do-loops..." You
> read enough of these kinds of comments and you'll start thinking do-
> loops are slow as mud and should be avoided like the plague.

Wow. That quote tells me a lot... but not about DO loops. It tells me
more about the authors of NR. That's horrible advice. I've seen people
follow advice like that. They take clear and efficient code using DO
loops and turn it into convoluted and slow code using array operations.
I'm not saying array operations are bad. Sometimes they are simpler
and/or faster. But I have trouble expressing how horrible I think that
advice is.

> And while my estimate of 1000 was just a back of the envelope thing it
> seems reasonable to me. If I've got 10 dummy arrays with more than
> 1000 components each (even after breaking the full grid over a large
> number of processors) then wouldn't that take more than 1000 times
> more memory than the serial version which just has 10 scalars?

No. Because there is no code around that is going to take only the
storage of 10 scalar variables. With most current systems, that would be
40 bytes (maybe 80 if you are talking double precision). There is all
kinds of overhead and other cost. Try to find an executable program that
takes only 80 bytes.

Now that you detail it, I wonder why you are expressing this as a factor
of the memory size anyway. Seems to me that absolute numbers would make
more sense and tell you more anyway. Is that actually the size of
numbers you are talking about? Even in back-of-the-envelope terms? Is
your increase of 1,000 times something like from 40 bytes to 40,000? If
that is the case, then the answer to your question is simple. No, using
40,000 bytes will not have measurable effect on the performance of any
current system you are likely to have. If you are talking the difference
between 40,000 and 40 million, that might start to get more interesting,
though it still isn't likely to be a huge performance issue.

Sounds to me like "factor of x" just isn't the right kind of measure.

--
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain

On May 26, 3:18 pm, Gabe <DrWeymo@gmail.com> wrote:

<snip>

> Also, does anyone have any opinion on elemental routines?
> I've found
> that converting routines with array operations to elemental routines
> is straightforward, but I don't have enough experience to know if
> there's a consistent advantage in terms of memory or speed.

Elemental functions are convenient because they are generic and can
reduce the amount of duplicative code in a program, but I have never
seen evidence that they are faster than equivalent code with array
arguments.

Beliavsky <beliav@aol.com> wrote:
> Elemental functions are convenient because they are generic...

That's not in general so. Similar idea in some senses, but they aren't
technically inherently generic. You can make them generic if desired,
but that's more or less independent of their being elemental.
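
To make the distinction concrete (invented names, just a sketch): each
elemental specific below handles exactly one kind; whatever genericity
you get comes from the separate generic interface, not from ELEMENTAL
itself.

   module square_mod
     implicit none
     interface square                 ! the generic name: this is what
        module procedure square_sp    ! actually provides genericity
        module procedure square_dp
     end interface
   contains
     elemental function square_sp(x) result(y)
       real, intent(in) :: x
       real :: y
       y = x*x
     end function square_sp

     elemental function square_dp(x) result(y)
       double precision, intent(in) :: x
       double precision :: y
       y = x*x
     end function square_dp
   end module square_mod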

--
Richard Maine                    | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle           |  -- Mark Twain

> > And while my estimate of 1000 was just a back of the envelope thing it
> > seems reasonable to me. If I've got 10 dummy arrays with more than
> > 1000 components each (even after breaking the full grid over a large
> > number of processors) then wouldn't that take more than 1000 times
> > more memory than the serial version which just has 10 scalars?

> No. Because there is no code around that is going to take only the
> storage of 10 scalar variables. With most current systems, that would be
> 40 bytes (maybe 80 if you are talking double precision). There is all
> kinds of overhead and other cost. Try to find an executable program that
> takes only 80 bytes.

I think there is a misunderstanding. I mean the amount of memory used
for the temporary arrays will be ~1000 times larger than the amount of
temporary memory required for the serial version. My full code uses
around 15 full-sized variables, so the change in total memory
requirements couldn't be more than 67% (10/15). Taking overhead (and
other temporary arrays) into account we're probably talking less than
50%.

> Is that actually the size of
> numbers you are talking about? Even in back-of-the-envelope terms?

The jobs I need to eventually run will use tens of millions of grid
points in double precision, so the temporary memory requirement would
be order 1e9 bytes. Again, this might be split up over ~100
processors, so ~1e7 bytes each.