[vox-tech] another gcc question (Branch Delay Slots)

Bill Broadley vox-tech@lists.lugod.org
Fri, 1 Mar 2002 03:31:22 -0800


>     In modern processors you can take this much further 
>   "parallel speculative execution" and dedicate *HUGE* amounts of chip
>   space to be able to run *both* sides of the branch operation in parallel
>   and then "forget" the results of the side that shouldn't have been run
>   once you know the outcome of the conditional test... some processors
>   can even nest this a few levels so that a branch who's both children
>   also split could in theory not stall the processor.

I'd call this a bit optimistic.  Branch prediction is common, even
across multiple branches; intermediate results are stored in shadow
registers and only committed once the branch has been verified as
correctly predicted.  These shadow registers are not architecturally
visible, so when a branch is mispredicted you just flush the pipelines,
discard the related shadow registers, and start over.

Not that there are no machines that follow both sides of a branch,
but I bet no one on vox-tech owns one.  Hrm, the only chip I know of
that has it is the IA64/Itanium; last I checked they had sold 2200 of
them, which sounded good until IBM mentioned they had 2000 of them in
a single large parallel computer.

Alas, branches average about 1 in every 7 instructions, CPU pipelines
are growing into the 10-20 stage range, and with 4-8 functional units
running you could easily lose hundreds of instructions' worth of work
on a branch mispredict.
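A quick, hedged sketch of the cost (not from the original post): sum the
elements above 128 in the same array, once in random order and once
sorted.  The work is identical, but in the sorted case the branch is
highly predictable; on most CPUs the sorted pass runs noticeably faster.
All names here (sum_big, time_sum) are illustrative, not a real API:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    65536
    #define REPS 200

    static long sum_big(const unsigned char *a, int n)
    {
        long s = 0;
        for (int i = 0; i < n; i++)
            if (a[i] > 128)        /* the branch the predictor must guess */
                s += a[i];
        return s;
    }

    static int cmp(const void *x, const void *y)
    {
        return (int)*(const unsigned char *)x - (int)*(const unsigned char *)y;
    }

    static double time_sum(const unsigned char *a, long *out)
    {
        clock_t t0 = clock();
        long s = 0;
        for (int r = 0; r < REPS; r++)
            s = sum_big(a, N);
        *out = s;
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        static unsigned char a[N];
        long s1, s2;

        srand(1);
        for (int i = 0; i < N; i++)
            a[i] = (unsigned char)(rand() & 0xff);

        double t_random = time_sum(a, &s1);   /* branch outcome ~random     */
        qsort(a, N, 1, cmp);
        double t_sorted = time_sum(a, &s2);   /* branch outcome predictable */

        /* same data, same sum -- only the branch predictability differs */
        printf("random: %.3fs  sorted: %.3fs  sums equal: %d\n",
               t_random, t_sorted, s1 == s2);
        return 0;
    }

The exact speedup depends on the CPU and compiler flags, so the program
prints the times rather than asserting a ratio.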

Current branch prediction is fairly sophisticated stuff: many CPUs
have more than one algorithm, some store prediction info in the cache
line, and some depend on interactions with the compiler, which tries
to make branches easier to predict by rearranging them to behave more
consistently.  Some strategies keep a history for each branch, and some
CPUs even implement more than one strategy and then track which is most
accurate for a given branch.
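As a hedged example of the compiler-interaction side: GCC provides
__builtin_expect, which lets the programmer tell the compiler which way
a branch usually goes so it can lay out the likely path as the
fall-through.  The likely/unlikely macro names and the count_errors
helper below are illustrative, not from the original post:

    #include <stdio.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* hypothetical helper: scan a buffer where negative (error) values
       are rare, so we hint that the error branch is the cold path */
    static int count_errors(const int *buf, int n)
    {
        int errors = 0;
        for (int i = 0; i < n; i++) {
            if (unlikely(buf[i] < 0))
                errors++;
        }
        return errors;
    }

    int main(void)
    {
        int buf[] = {1, 2, -3, 4, -5};
        printf("%d\n", count_errors(buf, 5));  /* prints 2 */
        return 0;
    }

The hint changes code layout, not semantics, so a wrong hint still runs
correctly -- it just predicts worse.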

I haven't read all of this thread, but it seems like the loop unrolling
typically goes like:

for (i = 0; i < 100; i++)
{
    do_stuff(i);
}

Crap, if do_stuff isn't much work I spend most of my time doing
the compare, test, and jump, so let's unroll:
for (i = 0; i < 100; )
{
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
}

Oh, hrm, what happens if the number of loops (100) changes to some
dynamic n?  That is a bit more difficult, but we could do something
like:

i = 0;
while ((n - i) >= 10)   /* 10 is the unroll factor */
{
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
    do_stuff(i); i++;
}

for (j = i; j < n; j++)
{
    do_stuff(j);
}
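To convince yourself the unrolled-plus-cleanup pattern really matches
the plain loop, here's a minimal, self-checking sketch (the accumulator
stand-in for do_stuff is mine, not from the post):

    #include <assert.h>

    static long acc;                       /* illustrative stand-in state */
    static void do_stuff(int i) { acc += i; }

    /* plain loop */
    static long plain(int n)
    {
        acc = 0;
        for (int i = 0; i < n; i++)
            do_stuff(i);
        return acc;
    }

    /* 10-way unrolled with a cleanup loop for the remainder */
    static long unrolled(int n)
    {
        int i = 0, j;
        acc = 0;
        while ((n - i) >= 10) {
            do_stuff(i); i++;
            do_stuff(i); i++;
            do_stuff(i); i++;
            do_stuff(i); i++;
            do_stuff(i); i++;
            do_stuff(i); i++;
            do_stuff(i); i++;
            do_stuff(i); i++;
            do_stuff(i); i++;
            do_stuff(i); i++;
        }
        for (j = i; j < n; j++)            /* leftover 0..9 iterations */
            do_stuff(j);
        return acc;
    }

    int main(void)
    {
        for (int n = 0; n < 50; n++)       /* covers every remainder */
            assert(plain(n) == unrolled(n));
        return 0;
    }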


Then a cool guy named Duff figured out:

  j = 0;
  count=100;
  n = (count + 7) / 8;
  switch (count % 8)
    {
      do
        {
    case 0:
          do_stuff(j);
          j++;
    case 7:
          do_stuff(j);
          j++;
    case 6:
          do_stuff(j);
          j++;
    case 5:
          do_stuff(j);
          j++;
    case 4:
          do_stuff(j);
          j++;
    case 3:
          do_stuff(j);
          j++;
    case 2:
          do_stuff(j);
          j++;
    case 1:
          do_stuff(j);
          j++;
        }
      while (--n > 0);
    }

Despite its strange appearance, it's actually quite efficient and
reasonably flexible.  It gracefully handles a dynamic number of loops
that doesn't have to be a multiple of the unrolling factor.

It's commonly referred to as Duff's device.
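Wrapped in a function with a self-check, it looks like this (the calls
counter is my illustrative stand-in for real work; note the device as
written needs count >= 1):

    #include <assert.h>

    static int calls;                          /* stand-in for real work */
    static void do_stuff(int i) { (void)i; calls++; }

    /* Duff's device: 8-way unrolled dispatch, valid for count >= 1 */
    static void duff(int count)
    {
        int j = 0;
        int n = (count + 7) / 8;
        switch (count % 8) {                   /* jump into the loop body */
            do {
        case 0: do_stuff(j); j++;
        case 7: do_stuff(j); j++;
        case 6: do_stuff(j); j++;
        case 5: do_stuff(j); j++;
        case 4: do_stuff(j); j++;
        case 3: do_stuff(j); j++;
        case 2: do_stuff(j); j++;
        case 1: do_stuff(j); j++;
            } while (--n > 0);
        }
    }

    int main(void)
    {
        for (int count = 1; count <= 40; count++) {
            calls = 0;
            duff(count);
            assert(calls == count);            /* exactly count calls */
        }
        return 0;
    }

Jumping from the switch into the middle of the do/while body is legal C
and is the whole trick: the remainder iterations are handled by where
the first pass starts, not by a separate cleanup loop.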

>     Now if you have a smart compiler (that knows how to reorder), 
>   and
>     the code in question has operations which can be used to fill the
>     branch delay slots, 
>   and 
>     the chip builders don't want to waste lots of chip space to be 
>     able to speculatively execute

Yes, chip builders like to design simple chips; the billion-dollar
question is how to improve useful work done per cycle with a transistor
budget that doubles every 18 months.

My current favorite technique that might actually improve this
significantly (work done per cycle hasn't changed much over the last 10
doublings of the transistor budget or so) is SMT, the ability for a CPU
to run multiple threads/processes at the same time during the same
cycle.  An Alpha (RIP) paper showed a factor of 2.5 improvement in work
done per cycle with identical CPU resources (except for the SMT
support).  Pretty amazing stuff, especially useful now that quad-issue
1.5-2.0 GHz CPUs are fairly common and memory is as far away as ever
(about 100 ns or so).

Anyways, enough rambling, later.

-- 
Bill Broadley
Mathematics/Institute of Theoretical Dynamics
UC Davis