[vox] Who thinks Java is cool?

Bill Broadley bill at broadley.org
Tue Jun 21 20:37:25 PDT 2011


On a somewhat related note, I found a stat from the top500.org folks
saying that 19 of the top 500 clusters (3.8%) are using "GPU technology".

I found this article mentioning some recent Intel comments about using
many cores:
http://www.theregister.co.uk/2011/06/20/intels_reinders_on_many_core_coding/

It sounds like pieces of the failed Larrabee project (announced in
2008) are being picked up under the Knights Corner/Many Integrated
Core (MIC) name, with a more-than-50-core product planned for 2012.

I definitely agree with Intel's statement "The key to getting
performance on [a MIC] is to have some data, keep it local, and nobody
else touches it".
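
That advice isn't MIC-specific, of course.  Here's a minimal Go
sketch of the idea (my own illustration, not Intel's): each worker
keeps its accumulator private and only communicates its final
partial result, so no two goroutines ever touch the same memory:

  package main

  import "fmt"

  // sumChunk keeps a worker-local accumulator; the only shared
  // state is the channel used to report the final partial sum.
  func sumChunk(chunk []float64, out chan<- float64) {
      local := 0.0 // private to this goroutine, nobody else touches it
      for _, v := range chunk {
          local += v
      }
      out <- local
  }

  func main() {
      const workers = 4
      data := make([]float64, 1<<20)
      for i := range data {
          data[i] = 1.0
      }

      out := make(chan float64, workers)
      step := len(data) / workers
      for w := 0; w < workers; w++ {
          go sumChunk(data[w*step:(w+1)*step], out)
      }

      total := 0.0
      for w := 0; w < workers; w++ {
          total += <-out
      }
      fmt.Println(total) // 1.048576e+06
  }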

On 06/18/2011 10:01 AM, Norm Matloff wrote:
> Well, I don't have an insider's information, but my impression is that
> Larrabee is very much alive.  And as you said, Intel will come out with
> something to compete with the GPU firms.

Heh, and indeed they have already made significant progress, along the
lines of:
* 4 memory channels per socket instead of 1 per system
* 30-50GB/sec per socket instead of 6-8GB/sec per system
* a few cores/threads per system to 10-20 cores/threads per socket
* peak of 8 DP flops/core/cycle, up from 2-4 DP flops/core/cycle
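
(For scale: a hypothetical 10-core socket at 3GHz sustaining 8 DP
flops/core/cycle peaks at 10 * 8 * 3 = 240 DP GFLOPS.)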

> Intel will prevail, I believe, for a number of reasons.  I would cite
> two in particular (besides its general clout, of course):

I'd have to agree, at least compared to Nvidia.  AMD seems to be
making some headway with their merged CPU/GPU, which they market as
an APU.

> 1.  The time penalty for communicating between CPU and GPU is a major
> obstacle to general performance.

Especially when it's on the wrong end of a non-memory-coherent PCIe
bus.  Directly attaching a GPU/accelerator to a QPI or HyperTransport
link would help greatly with this, and would pair well with an
on-board MMU.  Not that a GPU shouldn't have a killer memory system,
but don't hide the rest of the system (and its RAM) on the wrong side
of a high-latency PCIe link.

> 2.  The GPU memory structure is great for you physics/chemistry people,
> but not for the biologists, data miners and so on who make up the bulk
> of the market (current and potential, as it's still early).  The
> physical science market is much too small to drive the technology, as
> Cray found out.

Yup, GPUs only exist because of consumer markets willing to pay for
them.  Nvidia seems close to being self-sufficient with their Tesla
line, but only (so far) because the R&D is amortized over the
consumer market.  Designing a few billion transistors and getting
them fabricated and tested isn't cheap.

> Actually, I've become much less interested in it than before.  It adds
> its own finicky nature on top of the finicky nature of MPI. :-)   The
> most common questions on the parallel-R online discussion list deal with
> Rmpi.  I've personally gone back and forth with Rmpi's author to try to
> solve some problems, and some of them remain unsolved.

Unfortunate.  I just got my first Rmpi request for one of my clusters
today, so I'll be checking it out.

> For most users, another parallel R package, snow, is much better than
> Rmpi.  It basically does a scatter/gather operation, and in a very
> simple, convenient manner.  It's almost TOO easy.  Rmpi adds value only
> in applications in which lots of nodes need to interact directly with
> each other rather than with a manager node, which is not common in the R
> world.
> 
> There are other message-passing parallel R packages too, such as foreach
> and multicore (not what it sounds like).  But they are much more complex
> than snow, without much benefit in my view.  (I say that even though I'm
> close to the company that developed foreach and made it open source, a
> very good firm.)

Thanks for the overview.  If Rmpi proves too unreliable I'll point
the user at the packages mentioned above.
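
For the curious, the scatter/gather shape Norm describes is easy to
picture.  Here's a rough Go sketch of the pattern (not snow's actual
API, and square() is just a stand-in task): the workers only ever
talk to the manager's channels, never to each other:

  package main

  import "fmt"

  // square stands in for whatever function the user would
  // apply to each element of the scattered work.
  func square(x int) int { return x * x }

  // worker pulls tasks from the manager and pushes results
  // back; workers never communicate with each other.
  func worker(tasks <-chan int, results chan<- int) {
      for t := range tasks {
          results <- square(t)
      }
  }

  func main() {
      const n = 8
      tasks := make(chan int, n)
      results := make(chan int, n)

      for w := 0; w < 4; w++ {
          go worker(tasks, results)
      }

      // Scatter: the manager hands out the work...
      for i := 1; i <= n; i++ {
          tasks <- i
      }
      close(tasks)

      // ...gather: and collects the results.
      sum := 0
      for i := 0; i < n; i++ {
          sum += <-results
      }
      fmt.Println(sum) // 204 = 1+4+9+...+64
  }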

>> Yes.  On my embarrassingly parallel Mandelbrot set routine I get
>> basically perfect scaling.  Not sure exactly why they are called
>> goroutines,
> 
> I guess they couldn't resist the pun.

Heh, I read the Wikipedia article on coroutines, and it says:
  Coroutines are computer program components that generalize
  subroutines to allow multiple entry points for suspending and
  resuming execution at certain locations.

Not sure if that's an accepted definition in the literature, but it
does describe Go's goroutines rather well.  Channels provide the
second (and higher) entry points.
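
For what it's worth, an unbuffered channel makes this concrete: a
send blocks until the receiver is ready, so the send is effectively
the goroutine's suspend/resume point.  A small sketch (my own, just
to illustrate the generator-style usage):

  package main

  import "fmt"

  // fib behaves like a coroutine generator: each send on the
  // unbuffered channel suspends it until the consumer receives,
  // making the send a suspend/resume point.
  func fib(out chan<- int) {
      a, b := 0, 1
      for i := 0; i < 10; i++ {
          out <- a // "yield" a, then suspend until received
          a, b = b, a+b
      }
      close(out)
  }

  func main() {
      out := make(chan int) // unbuffered: send blocks until receive
      go fib(out)
      for v := range out {
          fmt.Println(v) // 0 1 1 2 3 5 8 13 21 34
      }
  }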

