[vox] Who thinks Java is cool?

Bill Broadley bill at broadley.org
Fri Jun 17 20:19:18 PDT 2011


On 06/17/2011 11:00 AM, Norm Matloff wrote:
> Well, there's fact and there's taste.

Agreed.

>> The HPC world of parallel processing seems to be largely invested in MPI
>> which seems orders of magnitude more popular than shared memory both in
>> HPC clusters and in HPC codes.  
> 
> That was true until two or three years ago.  But then GPU became big.

True, as long as you only have one.  As soon as you have more than one
you are back to message passing... no?  In fact, isn't that required even
with many of the new video cards that actually have 2 GPUs?

> In general, one can do more tweaking in a message-passing environment.
> This is especially true if the alternative is a cache-coherent
> shared-memory setting.  That picture changes radically with GPU.  If
> one's criterion is how many different applications can get excellent
> performance/price ratio, GPU would probably win hands down.

Agreed.

> Of course, one does even better in a multi-GPU setting, linking the GPUs
> using message-passing, and currently this is the typical mode in the
> "world class" supercomputers.  (Take that with a grain of salt, as the
> applications used are rather narrow.)

Indeed, single precision is a huge win, and double precision less so
(especially with the market-leading Nvidia).  Interestingly, Nvidia's
large gap between double and single precision performance seems to have
pushed more researchers to look hard at what precision they actually
need.  Hybrid precision codes show great gains on Nvidia.  I'm glad to
see ATI and Nvidia heading towards more generally useful engines with
ECC, a unified address space, and, if the rumors are true, a real MMU
*gasp*.  Of course at the same time CPUs are getting more GPU-like:
more cores, more similar memory systems, larger vectors/larger
registers/more threads, etc.
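To make "hybrid precision" concrete, here's a toy sketch of the usual
idea (not any particular GPU code): do the bulk of the arithmetic in
fast single precision, then recover double accuracy with a cheap
double-precision correction step, i.e. classic iterative refinement.
The 2x2 system, the fixed three refinement passes, and all names are
illustrative assumptions on my part:

// Toy mixed-precision solve: bulk work in float32, residual
// correction in float64 (iterative refinement).
package main

import "fmt"

// solve32 solves the 2x2 system A x = b via the explicit inverse,
// entirely in float32 (standing in for the "fast" precision).
func solve32(a [2][2]float32, b [2]float32) [2]float64 {
    det := a[0][0]*a[1][1] - a[0][1]*a[1][0]
    return [2]float64{
        float64((a[1][1]*b[0] - a[0][1]*b[1]) / det),
        float64((-a[1][0]*b[0] + a[0][0]*b[1]) / det),
    }
}

func main() {
    A := [2][2]float64{{4, 1}, {1, 3}}
    b := [2]float64{1, 2}

    // Single-precision copy of the problem (the "GPU-fast" part).
    var A32 [2][2]float32
    for i := range A {
        for j := range A[i] {
            A32[i][j] = float32(A[i][j])
        }
    }
    b32 := [2]float32{float32(b[0]), float32(b[1])}
    x := solve32(A32, b32)

    // Refinement: residual in double, correction solve in single.
    for iter := 0; iter < 3; iter++ {
        var r [2]float64
        for i := 0; i < 2; i++ {
            r[i] = b[i] - (A[i][0]*x[0] + A[i][1]*x[1])
        }
        d := solve32(A32, [2]float32{float32(r[0]), float32(r[1])})
        x[0] += d[0]
        x[1] += d[1]
    }
    fmt.Println(x) // converges to the double-precision answer
}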

These days node bandwidth ~= GPU bandwidth, which is a huge change from
a year or two ago... at least for Intel anyway.

Anyone know how many of the Top500 systems have GPUs?  I see 3 or 4 in
the top 10.

> I'm really surprised that you say one can get speedups of multiple
> orders of magnitude on non-embarrassingly parallel problems.  I've been
> trying to think up examples of this, or find examples that others have
> implemented, for quite a while now, and I haven't found any.  If you
> know of examples, PLEASE LET ME KNOW. 

I suspect it's just a difference in definition of embarrassingly
parallel.  For me Linpack is the upper end of embarrassingly parallel.
Often GigE (embarrassingly slow) is only 10% or so less efficient than
the state-of-the-art interconnects.  Numerous scientific codes seem to
scale reasonably well on T3-style systems (direct connections to your
nearest 6 neighbors), Blue Gene, and various more traditional
IB-connected clusters.  While my experience is in the 0-800 CPU range,
scaling well into 1000-10,000 CPU runs doesn't seem unusual.

NWChem, for instance, scaled horribly on our cluster with GigE, but
often scales to 4096 CPUs with a nice interconnect.  Umm, here's a graph:
http://www.nwchem-sw.org/index.php/File:Dft-scaling-c240-pbe02.png

Granted, the scaling starts to worsen around 1024 CPUs.  I don't believe
the linked example is unusual in any way, and NWChem is rather
communications intensive; in my tests, even with old slow nodes, GigE
scaling often topped out on the order of 4-8 nodes.

VASP doesn't scale as well; optimal speedups are often around 32-64
cores, but runs often continue to improve up through 256-512 cores
before additional cores actually slow things down.

I suspect I could find more examples in the 1K-10K CPU range if
needed.  Did you mean powers of 2 or powers of 10?  Typically when you
get to 1024-4096 CPU jobs the minimum is in the 32-64 CPU range, just
because you are often limited by available RAM when you go lower (a job
whose working set needs, say, 32 nodes' worth of memory simply can't
run on fewer).

> The all-caps format here is not
> to shout at you or to challenge you, but rather because it really
> matters. 

My understanding is that 1000-10,000 CPU runs on large supercomputers
are fairly common, but the sites do ask that you demonstrate good
scaling with smaller numbers of cores before burning through
allocations with the highly parallel runs.

In fact many codes scale so well that many sites have a maximum 2-day
limit on jobs... forcing users to run on more nodes instead of for
longer.  This would be insane if codes didn't scale.

> Among other things, I'm writing a book on R programming, and I
> have a chapter on parallel R.  Things get a little more complicated in
> R, but the basic issue remains:  Can one achieve speedups of that
> magnitude on non-embarrassingly parallel problems in large-scale
> distributed systems?

Hrm, well I've used R somewhat; I have no idea how strong its message
passing is, or if it's done something stupid like rolling its own
message passing.  Sadly many programs decide to do message passing
themselves, assume TCP/IP over Ethernet, and get locked out of the
high speed/low latency interconnects.  But assuming well written code
that uses an MPI wrapper/bindings and has no surprising weaknesses, I
don't see why it couldn't scale well.  Various things will affect
scaling, and ideally R would support full MPI, including non-blocking
calls for maximum overlap of communication and computation.  I
certainly don't know of any R codes that scale that high, because R
isn't a first choice for high performance number crunching (usually a
well tested F90 numerical library is used).  Does R by chance have
bindings for any of the parallel numeric libraries, like say PETSc?
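To illustrate the overlap I mean: with non-blocking MPI
(MPI_Isend/MPI_Irecv plus MPI_Wait) you start the exchange, compute on
data you already have, and only block when the remote data is actually
needed.  A rough analogy of that pattern in Go, with a channel standing
in for the interconnect (everything here is illustrative, not R's or
MPI's actual API):

// Overlap communication and computation: post the "receive" early,
// compute locally while the message is in flight, block only at the
// point the remote data is needed.
package main

import "fmt"

func main() {
    link := make(chan []float64, 1) // stands in for a neighbor's send

    // "Neighbor" posts its boundary data asynchronously.
    go func() {
        link <- []float64{1.1, 2.2, 3.3}
    }()

    // Overlap: interior work that needs no remote data proceeds
    // while the "message" is in flight.
    interior := 0.0
    for i := 0; i < 1000; i++ {
        interior += float64(i) * 0.5
    }

    // Equivalent of MPI_Wait: block only now, when the halo is needed.
    halo := <-link
    fmt.Println(interior, halo)
}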

If there's a big difference between native Fortran numerical libraries
and R, the HPC sites might not be so happy if you are getting much less
performance for the same hours of CPU time.  Any comments on R vs native?

So unfortunately scaling isn't an easy question... and scaling well on
one supercomputer doesn't really imply scaling well on another.  MPI
does guarantee, however, that if you support it you will at least run
on practically every cluster on the planet.

>> Go has pointers, but not pointer arithmetic (unless I'm
>> misremembering my languages).
> 
> Then does it not have shared arrays?

Not sure if there's any special definition of shared arrays.  There's a
package along those lines I believe (Global Arrays); I think even
NWChem uses it.  The vendors hate it because they just want to support
MPI, and the NWChem users think they should get a special driver for
it.  Frustratingly for the vendors, some of the NWChem users are very
well funded.

But if you just mean "can multiple goroutines read/write the same
array", then yes, Go can do that.  I just have a global array for the
Mandelbrot data and all goroutines write directly to it.  So all of
them can write to a[x][y]; you can't, however, say "please write to 4
bytes past a[x][y]".
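For the curious, a minimal sketch of that pattern (my real code
differs; the image size, iteration cap, and coordinate mapping here are
made up for illustration):

// A shared global array that goroutines write into directly,
// one column per goroutine.
package main

import (
    "fmt"
    "sync"
)

const (
    width, height = 64, 64
    maxIter       = 255
)

// The shared array: every goroutine may write a[x][y] directly, but Go
// gives you no way to say "write 4 bytes past a[x][y]".
var a [width][height]int

func mandelColumn(x int, wg *sync.WaitGroup) {
    defer wg.Done()
    cr := 3.0*float64(x)/width - 2.0 // map column to the real axis
    for y := 0; y < height; y++ {
        ci := 2.0*float64(y)/height - 1.0 // map row to the imaginary axis
        var zr, zi float64
        n := 0
        for zr*zr+zi*zi < 4.0 && n < maxIter {
            zr, zi = zr*zr-zi*zi+cr, 2*zr*zi+ci
            n++
        }
        a[x][y] = n // each goroutine owns its column, so no locking needed
    }
}

func main() {
    var wg sync.WaitGroup
    for x := 0; x < width; x++ {
        wg.Add(1)
        go mandelColumn(x, &wg)
    }
    wg.Wait()
    fmt.Println(a[width/4][height/2]) // peek at one cell
}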


