[vox-tech] another gcc question

Mike Simons vox-tech@lists.lugod.org
Thu, 28 Feb 2002 02:29:24 -0500


--h31gzZEtNLTqOjlF
Content-Type: text/plain; charset=us-ascii

On Wed, Feb 27, 2002 at 09:25:41AM -0800, Peter Jay Salzman wrote:
>        -funroll-loops
>               Perform the optimization of loop  unrolling.   This
>               is  only  done for loops whose number of iterations
>               can be determined at compile time or run time.

  /* compile time */
  for (l = 100; l--;) do_something();

  /* run time */
  for (l = rand(); l--;) do_something();

>        -funroll-all-loops
>               Perform the optimization of loop  unrolling.   This
>               is done for all loops.  This usually makes programs
>               run more slowly.

  /* never known */
  while (do_something());


  There are thousands of variations of the three cases above... 
basically you have to ask yourself if the compiler knows for sure 
how many times it will run, else if the runtime system will know 
before the loop begins how many times to run, else if each loop 
iteration will decide if the next one runs.


Disclaimers:
  - All of this is user space trial and error, I've not actually looked
    at the gcc source.
  - Also, I don't know intel or gnu assembly, so if someone who does could 
    verify these observations that would be nice... 
  - Platform used was Debian/woody gcc 2.95.4 running on a Pentium.


  For case 1 (compile time) the compiler appears to do the following:
    - find the largest number that divides evenly into the loop counter 
      between 1 and 21... if the LCD isn't 1... then do this:
        for (l = loop_counter / LCD; l--;)
        {
          do_something();
          do_something();
          /* ...repeat LCD times... */
        }
      if however LCD *is* equal to 1, then behave the same as runtime
      case below;

  For case 2 (runtime) the compiler appears to:
    - if the loop body _modifies_ the loop counter in a non-constant way
      then use case 3 below...
    - otherwise break the code into something like the following case 
      statement:

      l = some_int;
    
      while (l)
      {
        switch (l > 4 ? 4 : l)
        {
          4: do_something();
          3: do_something();
          2: do_something();
          1: do_something();
          0: ;
        }
        l -= l > 4 : 4 : l;
      }

    
  For case 3 (never known) the compiler appears to:
  - Do something like the following...

  do {
    done = do_something() || do_something() || do_something() || 
           do_something() || do_something() || do_something() || 
           do_something() || do_something();
  } while (!done);


> ok, -funroll-all-loops usually makes programs run more slowly.  it also
> clearly makes the executable larger.
> 
> so what exactly are we optimizing here?

  Just theory... possible advantages:
  - Making it more likely to be able to fill all of the branch delay slots 
    on a machine with many slots.  (see other post on subject).
  - On some architectures taking a branch is (much) more expensive than not 
    taking unrolling will help, if an "never known" loop runs a long time.


On Wed, Feb 27, 2002 at 10:03:22AM -0800, ME wrote:
> $ gcc gcc -funroll-all-loops -S sample.c
[...]
> This would lead me to believe the generated asm, code is not unrolled if I
> understand the expectation of the unrolling process. 

  You are absolutely correct.

First, you need the following:
$ gcc -O -funroll-all-loops -S sample.c

  No optimizations are done unless optimizations are turned on.  Since
-funroll-all-loops is an optimizations without -O nothing happens.


  Second, the code sample doesn't actually do anything with m,
so with -O the compiler is smart enough to just remove the loops 
all together main becomes:
========
main:
        pushl %ebp
        movl %esp,%ebp
        xorl %eax,%eax
        leave
        ret
========

  I've changed the sample code to demo the three cases listed above
(see attachment).  If you look for in the assembly code for "call" ... 
in main, "printf" marks the boundary between block types, 
"do_something" shows you how the unrolling works out.

  Don't _run_ the sample code, it is just provided for viewing of assembly
code, the while loop is endless ... something more complex would
be needed to halt that loop and would just make the assembly produced
harder to understand.

    Later,
      Mike

ps:
  I noticed that people don't appear to be using text attachements much...
hope it's not against some list policy.

--h31gzZEtNLTqOjlF
Content-Type: text/plain
Content-Disposition: attachment; filename="s.c"

#include <stdio.h>
#include <stdlib.h>

int do_something(void)
{
  return printf(".");
}

int main(void)
{
  int l;

  /* compile time decision */
  for (l = 100; l--;) do_something();
  printf("\n");

  /* runtime decision */
  for (l = rand(); l--;) do_something();
  printf("\n");

  /* always unknown */
  while (do_something());
  printf("\n");

#if 0
  l = 100;

  while (l > 0)
  {
    switch (l > 4 ? 4 : l)
    {
      case 4: do_something();
      case 3: do_something();
      case 2: do_something();
      case 1: do_something();
      case 0: ;
    }
    l -= l > 4 ? 4 : l;
  }
#endif

  return 0;
}

--h31gzZEtNLTqOjlF
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="s.s"

	.file	"s.c"
	.version	"01.01"
gcc2_compiled.:
.section	.rodata
.LC0:
	.string	"."
.text
	.align 4
.globl do_something
	.type	 do_something,@function
do_something:
	pushl %ebp
	movl %esp,%ebp
	subl $8,%esp
	addl $-12,%esp
	pushl $.LC0
	call printf
	leave
	ret
.Lfe1:
	.size	 do_something,.Lfe1-do_something
.section	.rodata
.LC1:
	.string	"\n"
.text
	.align 4
.globl main
	.type	 main,@function
main:
	pushl %ebp
	movl %esp,%ebp
	subl $20,%esp
	pushl %ebx
	movl $99,%ebx
	.p2align 4,,7
.L37:
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	call do_something
	addl $-20,%ebx
	cmpl $-1,%ebx
	jne .L37
	addl $-12,%esp
	pushl $.LC1
	call printf
	call rand
	movl %eax,%ebx
	addl $16,%esp
	subl $1,%ebx
	jc .L40
	movl %ebx,%eax
	notl %eax
	andl $3,%eax
	je .L42
	cmpl $3,%eax
	jge .L50
	cmpl $2,%eax
	jge .L51
	call do_something
	decl %ebx
.L51:
	call do_something
	decl %ebx
.L50:
	call do_something
	subl $1,%ebx
	jc .L40
	.p2align 4,,7
.L42:
	call do_something
	call do_something
	call do_something
	call do_something
	addl $-4,%ebx
	cmpl $-1,%ebx
	jne .L42
.L40:
	addl $-12,%esp
	pushl $.LC1
	call printf
	addl $16,%esp
	.p2align 4,,7
.L46:
	call do_something
	testl %eax,%eax
	jne .L46
	addl $-12,%esp
	pushl $.LC1
	call printf
	xorl %eax,%eax
	movl -24(%ebp),%ebx
	leave
	ret
.Lfe2:
	.size	 main,.Lfe2-main
	.ident	"GCC: (GNU) 2.95.4  (Debian prerelease)"

--h31gzZEtNLTqOjlF--