[vox-tech] another gcc question (Branch Delay Slots)

Mike Simons vox-tech@lists.lugod.org
Thu, 28 Feb 2002 15:03:18 -0500


#Received: from www.livepenguin.com (dcn251-11.dcn.davis.ca.us [168.150.251.11])
#        by moria.simons-clan.com (8.10.0/8.10.0) with ESMTP id g1SIe7s26765
#        for <msimons@moria.simons-clan.com>; Thu, 28 Feb 2002 13:40:07 -0500
#Received: from www.livepenguin.com (livepenguin [127.0.0.1])
#        by www.livepenguin.com (Postfix) with ESMTP
#        id B925C662CD; Wed, 27 Feb 2002 23:32:01 -0800 (PST)

  Ack, I was waiting for my first message to clear the list before sending 
the second one... almost an 11-hour list delay?  (7:30 to 18:40)


On Thu, Feb 28, 2002 at 02:29:24AM -0500, Mike Simons wrote:
> > so what exactly are we optimizing here?
> 
>   Just theory... possible advantages:
>   - Making it more likely to be able to fill all of the branch delay slots 
>     on a machine with many slots.  (see other post on subject).

  With modern pipelined CPUs, conditional branches are expensive... 
  because you have to "stall the processor" and let all instructions 
  up to the condition test finish executing before you know whether 
  you are going to branch or not.


    There are a couple of ways a branch can be made less expensive.  One
  method (what MIPS does) is "branch delay slots": the X instructions
  directly following a branch are executed regardless of whether the
  branch is taken or not.  This keeps the processor doing something
  even while it waits to see if the branch will be taken... if there is
  nothing useful to fill a slot, a do-nothing instruction (NOP) has to
  be inserted.

    Another method, which I think is used in older Intel CPUs, is 
  "speculative execution"... in its most basic form the processor
  guesses whether the branch will be taken or not and just keeps
  running the instructions down one side of the branch, while keeping
  enough state to "undo" that work if, once the test condition has
  actually finished executing, the machine realizes it took the
  wrong side... note that mis-predicting a branch is often *very* 
  expensive to "clean up".
    In modern processors you can take this much further with 
  "parallel speculative execution" and dedicate *HUGE* amounts of chip
  space to running *both* sides of the branch in parallel, then
  "forget" the results of the side that shouldn't have been run once
  you know the outcome of the conditional test... some processors can
  even nest this a few levels deep, so that a branch whose two sides
  each contain another branch could in theory still not stall the
  processor.
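
  To make the mis-prediction cost a little more concrete, here is a
  small C sketch (the threshold test, the sizes, and the function name
  are made-up illustration values, not from any real benchmark): the
  "if" inside the loop is a data-dependent branch that a speculating
  CPU has to guess every iteration, and with random data it guesses
  wrong about half the time, throwing away the speculated work each
  time; sort the same data first and the exact same branch becomes
  almost perfectly predictable.
  ==========
  #include <stdio.h>
  #include <stdlib.h>

  #define N 100000

  /* Sum the elements that are >= 128.  The "if" is the conditional
   * branch the CPU speculates past: with random data it is taken
   * about half the time in no particular pattern, so mis-predictions
   * (and the clean-up work they cause) pile up; with sorted data the
   * same branch is predicted correctly almost every iteration. */
  long sum_big_elements(const int *data, int n)
  {
      long sum = 0;
      int i;

      for (i = 0; i < n; i++) {
          if (data[i] >= 128)          /* data-dependent branch */
              sum += data[i];
      }
      return sum;
  }

  int main(void)
  {
      static int data[N];
      int i;

      for (i = 0; i < N; i++)
          data[i] = rand() % 256;      /* unpredictable branch pattern */
      printf("sum = %ld\n", sum_big_elements(data, N));
      return 0;
  }
  ==========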


    Now if you have a smart compiler (that knows how to reorder), 
  and
    the code in question has operations which can be used to fill the
    branch delay slots, 
  and 
    the chip builders don't want to waste lots of chip space to be 
    able to speculatively execute

  ...Then branch delay slots are *clearly* superior to just stalling the 
  processor.

    So _if_ you have a bunch of branch delay slots, and a count-unknown 
  loop that runs many times, and everything else above holds, and three 
  planets align, then -funroll-all-loops will help (maybe even a bunch), 
  because it allows the compiler to make much better use of the branch 
  delay slots.
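
  As a rough sketch of what that unrolling buys you (plain C, with a
  made-up scale() loop standing in for the real code; this is only an
  illustration, not gcc output): after unrolling, every loop-closing
  branch is surrounded by several independent operations, which is
  exactly the kind of material a compiler can pull into the branch
  delay slots instead of emitting NOPs.
  ==========
  /* A count-unknown loop: one conditional branch per element, with
   * very little nearby work to move into the delay slot(s). */
  void scale(int *a, int n, int k)
  {
      int i;

      for (i = 0; i < n; i++)
          a[i] *= k;
  }

  /* Roughly what -funroll-all-loops does (done here by hand, 4x):
   * four independent multiplies per loop-closing branch, so the
   * compiler has plenty of instructions it can reorder into the
   * branch delay slots. */
  void scale_unrolled(int *a, int n, int k)
  {
      int i = 0;

      for (; i + 4 <= n; i += 4) {
          a[i]     *= k;
          a[i + 1] *= k;
          a[i + 2] *= k;
          a[i + 3] *= k;
      }
      for (; i < n; i++)               /* leftover elements */
          a[i] *= k;
  }
  ==========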

so this is my attempt to explain Branch Delay Slots
===========
  It has been a while since I learned MIPS assembly, so I would be amazed 
  if the samples below have the correct syntax... I do remember that the 
  MIPS I used had a single branch delay slot... but there is no reason 
  there couldn't be 10 slots following a branch on a more complex 
  processor; it's just a concept.
  
    Now there are a bunch of rules about what can be used to fill the 
  BDS, like you can't put in anything that modifies the register being 
  tested by the conditional branch... and I don't remember the rest of
  the details about what can legally fill a BDS...
  
  so for example C code like this:
    for (l = 10; l--; )
      do_something();

  in dumb mode, on a machine with a single branch delay slot:
  ==========
           store $1, 10       // l = 10;
  label_1: call do_something  // do_something();
           addi  $1, -1       // l--;
           jnez  $1, label_1  // loop back while $1 is not equal to zero
           nop                // nothing useful to fill the branch delay slot
  ==========

  what ends up running is something like:
    call do_something
    addi $1, -1
    nop
    conditional
    call do_something
    addi $1, -1
    nop
    conditional
    /* repeat... */

     
  in smart mode, we move the decrement to the top of the loop and let the
  call to do_something() occupy the branch delay slot; since the delay-slot
  instruction runs even when the branch falls through, do_something() still
  gets called the right number of times:
  ==========
           store $1, 10       // l = 10;
  label_1: addi  $1, -1       // l--; 
           jnez  $1, label_1  // loop when $1 not equal to zero
           call do_something  // do_something(); AND fill the BDS
  ==========

  this runs approximately 25% faster (three instructions per iteration 
  instead of four), because the delay slot is now filled with something 
  useful:
    addi $1, -1
    call do_something
    conditional
    addi $1, -1
    call do_something
    conditional
    /* repeat... */


  I'm not going to come up with a real complex loop to demo how this
could help a processor with many branch delay slots... so I hope the
idea makes sense.