[vox-tech] another gcc question (Branch Delay Slots)
Mike Simons
vox-tech@lists.lugod.org
Thu, 28 Feb 2002 15:03:18 -0500
#Received: from www.livepenguin.com (dcn251-11.dcn.davis.ca.us [168.150.251.11])
# by moria.simons-clan.com (8.10.0/8.10.0) with ESMTP id g1SIe7s26765
# for <msimons@moria.simons-clan.com>; Thu, 28 Feb 2002 13:40:07 -0500
#Received: from www.livepenguin.com (livepenguin [127.0.0.1])
# by www.livepenguin.com (Postfix) with ESMTP
# id B925C662CD; Wed, 27 Feb 2002 23:32:01 -0800 (PST)
Ack, I was waiting for my first message to clear the list before sending
the second one... almost 11 hour list delay? (7:30 to 18:40)
On Thu, Feb 28, 2002 at 02:29:24AM -0500, Mike Simons wrote:
> > so what exactly are we optimizing here?
>
> Just theory... possible advantages:
> - Making it more likely to be able to fill all of the branch delay slots
> on a machine with many slots. (see other post on subject).
With modern pipelining CPU's conditional branches are expensive...
because you have to "stall the processor" to let all instructions
upto the condition test complete execution before you know if you
are going to branch or not.
There are a couple of ways a branch can be made less expensive one
method (what MIPS does) "branch delay slots": when you call a branch the
X instructions directly following a branch are executed regardless of if
the branch is taken or not. This keeps the processor doing something
even while it waits to see if the branch will be taken... if there is
nothing to fill the slot a do_nothing instruction needs to be given
(NOP).
Another method which I think is used in older Intel CPU's is
"speculative execution"... in it's most basic form the processor
guesses that the branch is taken or not and just continues running
the instructions on one side of the branch, but keeping enough
state of what needs to be "undone" if when the test condition result
is actually finished execution the machine realizes it took the
wrong branch... note that mis-prediction of a branch is often *very*
expensive to "clean up".
In modern processors you can take this much further
"parallel speculative execution" and dedicate *HUGE* amounts of chip
space to be able to run *both* sides of the branch operation in parallel
and then "forget" the results of the side that shouldn't have been run
once you know the outcome of the conditional test... some processors
can even nest this a few levels so that a branch who's both children
also split could in theory not stall the processor.
Now if you have a smart compiler (that knows how to reorder),
and
the code in question has operations which can be used to fill the
branch delay slots,
and
the chip builders don't want to waste lots of chip space to be
able to speculatively execute
...Then branch delay slots are *clearly* superior just stalling the
processor.
So _if_ you have a bunch of branch delay slots, and a count unknown
loop that runs many times, everything else above, and three planets
align, then -unroll-all will help (maybe even a bunch), because it
allows the compiler to make much better use of the branch delay slots.
so this is my attempt to explain Branch Delay Slots
===========
It has been a while and I learned MIPS assembly so I would be amazed
if the sample below have the correct syntax... I do remember the MIPS
I used had a single branch delay slot... but there is no reason a
there couldn't be 10 slots following a branch for a more complex
processor, it's just a concept.
Now there are a bunch of rules about what can be used to fill the
BDS, like you can't put anything that modifies the register being
tested by the conditional branch... and I also don't remember the
details on what can fill a BDS...
so for example C code like such:
for (l = 10; l--l)
do_something();
in dumb mode while using a machine with a single branch delay slot:
==========
store $1, 11 // l = 10;
label_1: call do_something // do_something();
addi $1, -1 // l--;
jnez $1, label_1 // branch loop when $1 not equal to zero
nop // this fills the branch delay slot
==========
what ends up running is something like:
call do_something
addi $1, -1
nop
conditional
call do_something
addi $1, -1
nop
conditional
/* repeat... */
in smart mode, we shift the add to occupy the branch delay slot, and tweak
the loop counter so that the right thing happens:
==========
store $1, 10 // l = 10;
label_1: addi $1, -1 // l--;
jnez $1, label_1 // loop when $1 not equal to zero
call do_something // do_something(); AND fill the BDS
==========
this runs approximately 25% faster, because the slot is filled with
something useful...
addi $1, -1
call do_something
conditional
addi $1, -1
call do_something
conditional
/* repeat... */
I'm not going to come up with a real complex loop to demo how this
could help a processor with many branch delay slots... so I hope the
idea makes sense.