[vox] Who thinks Java is cool?

Bill Broadley bill at broadley.org
Fri Jun 17 23:14:13 PDT 2011


On 06/17/2011 08:41 PM, Norm Matloff wrote:
> The virtue of the current GPU cards is that you know what you're
> getting.  People have developed extremely fine-tuned algorithms that
> depend on knowing exactly how the memory will behave.  Adding an
> MMU would destroy a lot of that.

Hrm, do you think that's really that common?  Sure, there are hero codes
where a shocking amount of work is targeted at a specific CPU
architecture.  But in general these codes survive for multiple CPU
generations... especially since writing for a new architecture (like
GPUs) often means that a new GPU will be out before you are done.  Even
just today there's quite a variety of NVIDIA chips shipping: different
DP/SP ratios, different memory bus widths, and different chip
generations.

>> Of course at the same time CPUs are getting more GPU like, more cores,
>> more similar memory systems, larger vectors/large registers/more
>> threads, etc.
> 
> I suspect the Intel Larrabee will put a major dent in NVIDIA's business.

Heh, I've heard that several times over the years, and heard several
times that it's back to the drawing board.  I think Larrabee is dead,
but I certainly expect something similar from Intel eventually.
Everything shipped so far is pretty poor; even Sandy Bridge (Intel's
high-end CPU) has a pretty weak GPU.  I'm all for competition, but I've
not seen anything promisingly parallel or GPU-like from Intel.  Sandy
Bridge is a fine chip otherwise; with some careful coding I managed
6.5 flops per clock per core on SAXPY.
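
Just to be concrete about what's being measured: SAXPY is y = a*x + y in
single precision, two flops per element.  Here's a rough sketch of a
timing harness in Go (not the carefully tuned code I used; a plain loop
like this will be nowhere near 6.5 flops/clock, it just shows how the
number gets counted):

package main

import (
	"fmt"
	"time"
)

// saxpy computes y = a*x + y in single precision:
// one multiply and one add, i.e. 2 flops per element.
func saxpy(a float32, x, y []float32) {
	for i := range x {
		y[i] = a*x[i] + y[i]
	}
}

func main() {
	const n = 1 << 20 // vector length
	const reps = 1000 // repeat to get a measurable run time
	x := make([]float32, n)
	y := make([]float32, n)
	for i := range x {
		x[i] = float32(i)
		y[i] = 1.0
	}
	start := time.Now()
	for r := 0; r < reps; r++ {
		saxpy(1.0001, x, y)
	}
	secs := time.Since(start).Seconds()
	fmt.Printf("%.2f Gflop/s\n", float64(2*n*reps)/secs/1e9)
}

Divide the Gflop/s by the core clock to get flops per clock per core for
a single-threaded run.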

> I don't get it, Bill.  In the best case in that picture, you're getting
> only a 100X increase at 4096 processors.  And in the worst case,
> performance actually starts to degrade after you hit 128 processors.

Yes, some things parallelize better than others.  So there are 3 graphs
showing parts of nwhem and how well they scale.  But the researchers
trying to get something done care about one thing: how well the entire
code scales, represented by the Total line (the black circles).

So with 32 CPUs you get approximately 3000 seconds.  As mentioned, the
scaling starts to decay at around 1024 processors.  So between 32 CPUs
and 1024 CPUs (a factor of 32 more CPUs) the run time goes from 3000
seconds to approximately 110 seconds, a factor of 27 or so.  Of course
this factor of 27 is only relative to a baseline of 32 CPUs, which was
likely as small as they could go and still have enough memory for the
data they were running.  Don't you think it's reasonable to call it
32*27 or so?  Pretty close to 3 orders of magnitude.
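
(Spelling out that arithmetic, and assuming scaling below the 32-CPU
baseline is roughly linear since they apparently couldn't run any
smaller: 3000 s / 110 s is about a 27x speedup for 32x more CPUs, so
relative to a single CPU that's roughly 32 * 27 = 864x, call it 10^3.)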

Seems like all the graphs I see cover similar ranges.  So I found a page
on XT5 scaling that shows scaling from 10,000 CPUs to 150,000 CPUs,
which is only a factor of 15.  The problem might not run on anything
smaller than about 10k CPUs.  Going from 10k (hard to tell on the graph,
it might be anywhere between 5k and 10k) to 64k CPUs the scaling is
good, something like 4 Tflops to 30 Tflops.  Pretty close to linear
anyway... it's frustratingly hard to read values off their graph.  Sigh,
I wish all log-log graphs came with a table of the actual values.

Going from 60k CPUs to 150k CPUs scales more poorly, only getting up
over 50 Tflops, although it's relatively stable and not obviously
decaying.  So basically WRF seems to provide good scaling to 15K CPUs,
OK scaling to 65K CPUs, and weak (but positive) scaling to 150K CPUs.

So basically those allocating resources have to decide whether getting
50 TF instead of 30 TF is worth tripling the number of CPUs (often it's
not).

So can MPI run simulations 100, 1000, or 10,000 times faster than a
serial code?  I'd say yes.  I found some references to 100 Tflops runs,
but without information on the configuration.  Alas, my search for how
many flops a current single-CPU WRF run could expect didn't yield any
results.

BTW, the numbers I found for NVIDIA's previous generation (the 280), I
suspect from before the hybrid-mode WRF was available, show a 15%
improvement when adding a GPU to a node.

Details of the 10k -> 150k graph at:
http://ncasweb.leeds.ac.uk/onlinebooking/images/stories/wrf_pres/john_michalakes.pdf

> And as you said, what constitutes embarrassingly parallel is in the eye
> of the beholder.  But I certainly wouldn't consider the LINPACK apps to
> be in that category, at least not the iterative ones.

Heh, well I meant specifically Linpack as measured by the Top500, which
ends up being rather highly tuned and is widely ridiculed as a poor
measure of cluster performance.  GPUs have fanned the flames because now
it's much cheaper to get the bragging rights of placing high on the
list.  It boils down to this: did achieved research scale with Linpack
performance before GPUs came out?  Does it after?  There have been
clusters that scored high on the list but never achieved any useful
research.

In any case the Linpack runs used are so network-efficient that clusters
with shared/oversubscribed GigE networks and 30-50us latencies scale
only 10% worse than those with 40 Gbit networks and 1.5us latencies.  So
the reality is that a high-speed interconnect can let you scale
radically better on real research codes, but (since that money isn't
buying more flops) it significantly damages your Top500 score.

> R has no parallel processing facilities at all.  They're done through
> add-on packages, the most famous being Rmpi, an R interface to you know
> what.  My Rdsm package adds a threads capability to R.  But any parallel
> R is slow on nonembarrassingly parallel apps.

Rmpi sounds promising.

>> MPI does guarantee however that if you support it you will at least
>> run on practically every cluster on the planet.
> 
> I've used MPI for years, and always find it to be finicky to set up.

Job security ;-).  I'm quite fond of OpenMPI over mpich/lam.  Much less
finicky, and *GASP* the binaries compiled against it are
interconnect-agnostic.  Quite useful for our clusters that have some
nodes with different interconnects.  Also handy for evaluating the
price/performance of GigE vs IB without changing anything.

> My next question is whether the goroutines give you true parallelism.
> Or can only one run at a time, which the term "coroutine" would seem to
> imply?

Yes, they really run in parallel.  On my embarrassingly parallel
mandelbrot set routine I get basically perfect scaling.  Not sure
exactly why they're called goroutines; they seem to encourage abuse (way
too many threads, er, goroutines), yet I find the best performance with
one per core... much like I'd expect from a thread.  So agreed, the name
is strange; they seem to act just like I'd expect threads (with pthreads
as my baseline) to.
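
For what it's worth, here's a stripped-down sketch of the pattern I mean
(not my actual mandelbrot code, and the constants are made up): one
goroutine per core, each pulling rows off a channel.  Note that with the
current Go runtime you have to bump GOMAXPROCS yourself, otherwise all
the goroutines share one core:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// mandelRow computes escape counts for one row of the image.
func mandelRow(y, width, height, maxIter int) []int {
	out := make([]int, width)
	for x := 0; x < width; x++ {
		c := complex(3.5*float64(x)/float64(width)-2.5,
			2.0*float64(y)/float64(height)-1.0)
		z := complex(0.0, 0.0)
		i := 0
		for ; i < maxIter && real(z)*real(z)+imag(z)*imag(z) <= 4.0; i++ {
			z = z*z + c
		}
		out[x] = i
	}
	return out
}

func main() {
	ncpu := runtime.NumCPU()
	runtime.GOMAXPROCS(ncpu) // without this, only one goroutine runs at a time

	const width, height, maxIter = 2048, 2048, 1000
	rows := make(chan int, height)
	for y := 0; y < height; y++ {
		rows <- y
	}
	close(rows)

	img := make([][]int, height)
	var wg sync.WaitGroup
	for w := 0; w < ncpu; w++ { // one goroutine per core works best for me
		wg.Add(1)
		go func() {
			defer wg.Done()
			for y := range rows {
				img[y] = mandelRow(y, width, height, maxIter)
			}
		}()
	}
	wg.Wait()
	fmt.Println("computed", len(img), "rows")
}

Leave GOMAXPROCS at its default of 1 and the same code still runs fine,
but only one core gets used, which is exactly the coroutine-ish behavior
Norm is asking about.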



