[vox] Who thinks Java is cool?

Norm Matloff matloff at cs.ucdavis.edu
Fri Jun 17 20:41:29 PDT 2011


On Fri, Jun 17, 2011 at 08:19:18PM -0700, Bill Broadley wrote:
> On 06/17/2011 11:00 AM, Norm Matloff wrote:

> > That was true until two or three years ago.  But then GPUs became big.
 
> True, as long as you only have one.  As soon as you have more than one
> you are back to message passing... no?  In fact, isn't that required even
> with many of the new video cards that actually have 2 GPUs?

Not required, no.  Some people use OpenMP.  But MPI is probably more
common.
 
> Indeed, single precision is a huge win, and double precision less so
> (especially with the market leading nvidia).  Interestingly Nvidia's
> large difference in double vs single precision seems to have resulted in
> more researchers looking hard at what they need.  Hybrid precision codes
> show great gains with Nvidia.  I'm glad to see ATI and Nvidia heading
> towards more generally useful engines with ECC, unified address space
> and if the rumors are true, a real MMU *gasp*.  

The virtue of the current GPU cards is that you know what you're
getting.  People have developed extremely fine-tuned algorithms that
depend on knowing exactly how the memory will behave.  Adding an
MMU would destroy a lot of that.

> Of course at the same time CPUs are getting more GPU like, more cores,
> more similar memory systems, larger vectors/large registers/more
> threads, etc.

I suspect the Intel Larrabee will put a major dent in NVIDIA's business.

> > I'm really surprised that you say one can get speedups of multiple
> > orders of magnitude on non-embarrassingly parallel problems.  I've been
> > trying to think up examples of this, or find examples that others have
> > implemented, for quite a while now, and I haven't found any.  If you
> > know of examples, PLEASE LET ME KNOW. 
 
> I suspect it's just a difference in definition of embarrassingly
> parallel.  For me linpack is the upper end of embarrassingly parallel.
> Often GigE (embarrassingly slow) is only 10% or so less efficient than
> the state of the art interconnects.  Numerous scientific codes seem to
> scale reasonably well on the T3 designed systems (direct connections to
> your nearest 6 neighbors), blue gene, and various more traditional IB
> connected clusters. While my experience is in the 0-800 CPU range,
> scaling well into 1000-10,000 CPU runs doesn't seem unusual.
 
> Nwchem for instance scaled horribly on our cluster with GigE, but often
> scales to 4096 CPUs with a nice interconnect.  Umm, here's a graph:
> http://www.nwchem-sw.org/index.php/File:Dft-scaling-c240-pbe02.png

I don't get it, Bill.  In the best case in that picture, you're getting
only a 100X increase at 4096 processors.  And in the worst case,
performance actually starts to degrade after you hit 128 processors.

And as you said, what constitutes embarrassingly parallel is in the eye
of the beholder.  But I certainly wouldn't consider the LINPACK apps to
be in that category, at least not the iterative ones.
 
> I suspect I could find more examples in the 1k - 10K cpu range if
> needed.  Did you mean powers of 2 or powers of 10?  Typically when you

I meant powers of 10; I was simply echoing your own usage.

> > Among other things, I'm writing a book on R programming, and I
> > have a chapter on parallel R.  Things get a little more complicated in
> > R, but the basic issue remains:  Can one achieve speedups of that
> > magnitude on non-embarrassingly parallel problems in large-scale
> > distributed systems?
 
> Hrm, well I've used R somewhat, no idea how strong its message passing
> is, or if it's done something stupid like rolling its own message passing.

R has no built-in parallel processing facilities at all; parallelism is
provided through add-on packages, the most famous being Rmpi, an R
interface to you know what.  My Rdsm package adds a threads capability
to R.  But any parallel R is slow on non-embarrassingly parallel apps.

> Does R by chance have bindings for any of the parallel numeric libraries
> like say PETSc?

I seem to recall seeing something like this, but I'm not sure.

But a lot of the heavy R applications are not really numerical.
 
> If there's a big difference between native Fortran numerical libraries
> and R, the HPC sites might not be so happy if you are getting much less
> performance for the same hours of CPU time.  Any comments on R vs native?

R is written in C, with some parts in FORTRAN.  It's an interpreted
language, but just like with Python, if you make use of the functional
programming features well, it can be quite fast.
 
> MPI does guarantee, however, that if you support it you will at least
> run on practically every cluster on the planet.

I've used MPI for years, and always find it to be finicky to set up.

> But if you just mean can multiple goroutines read/write the same
> array, then yes, Go can do that.  I just have a global array for the
> mandelbrot data and all goroutines write directly to it.  So all can
> write to a[x][y]; you can't, however, say "please write to 4 bytes
> past a[x][y]".

OK, that answers my question.

My next question is whether the goroutines give you true parallelism,
or whether only one can run at a time, as the term "coroutine" would
seem to imply.
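
(For concreteness, here is roughly the shared-array pattern I understand
you to be describing; the grid size, worker count and compute() stand-in
are invented for illustration.  Whether the goroutines actually run
simultaneously depends on the runtime's GOMAXPROCS setting, which is
really what I'm asking about.)

package main

import (
    "runtime"
    "sync"
)

const n = 512 // grid size, arbitrary for illustration

// grid plays the role of the global mandelbrot array;
// every goroutine writes directly into it.
var grid [n][n]int

// compute is a stand-in for the real per-point calculation.
func compute(x, y int) int {
    return x*n + y
}

func main() {
    // Ask the runtime to use all CPUs; with the default setting the
    // goroutines may all be multiplexed onto a single OS thread.
    runtime.GOMAXPROCS(runtime.NumCPU())

    workers := 4 // arbitrary
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func(w int) {
            defer wg.Done()
            // Each worker fills a disjoint set of rows, so the
            // writes to the shared grid never overlap.
            for x := w; x < n; x += workers {
                for y := 0; y < n; y++ {
                    grid[x][y] = compute(x, y)
                }
            }
        }(w)
    }
    wg.Wait()
}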

Norm


