[vox-tech] lame question on memory allocation

Tue, 21 Jan 2003 22:31:43 -0800 (PST)

I think my comments here are mostly intended for Tim... I am piggybacking
off of Bill's comments because I am lazy.

On Tue, 21 Jan 2003, Bill Broadley wrote:

> First I'd like to mention that register size depends on exactly what
> registers your refering to.
> 
> Your average generic pentium 4 will have:
> 	32 bit integer registers
> 	80 bit floating point registers
> 	64 bit (I think) MMX registers
> 	128 bit (I think) registers.
> 
> AMD's coming out with a new cpu "RSN", that allows for all the registers
> to be >= 64 bits, and doubles the number of floating point, and integer
> registers.  The 128 bit quantity/register BTW is called a double quad-word
> I believe.

A "word" is by convention 16 bits on all x86 architectures... but by the
generic definition, a Pentium uses a 32 bit word, and on an Alpha it is
64 bits.

> Back to memory allocation, in general use what you need, try to ignore
> the page size whenever possible, even on x86's there are often more
> than 1 page size available, it's up to the OS which to use.
> 
> Don't forget that you never know when your crossing a page boundary,
> so while a char may trigger a new page just like a larger array, the
> odds are much higher for the larger array.
> 
> As far as optimization go:
> * Malloc is VERY expensive, avoid when possible, especially inside loops
> * Large malloc requests can increase performance if you efficiently track
>   the utilization, and make lots of regular small sized requests.  This
>   is often called "pooling"

To rephrase this statement: fixed-size allocation libraries are available,
and are generally much more time efficient than malloc.  Of course, they
are only useful if you need to allocate blocks of memory of a fixed size,
but you can have different "pools" of blocks with different sizes in
different pools.

> * Doing your own memory allocation library is tricky, it's very easy to
>   introduce subtle errors

An interesting exercise, but fortunately you should be able to find a
variety to use without the pain of developing from scratch.

> * Memory allocation in general is very tricky, double allocations, double
>   free's, or the dreaded memory leak are common side effects.

Isn't perl nice? Gotta love that garbage collection. ;)

> * Be very careful with array lengths, do not rely on null terminated strings
>   when possible, especially if the source of the string is a remote program,
>   service, user, etc.  Buffer overflows are the number 1 security problem.

That is, never trust input to always give you less than any particular
number of bytes... "gets()" is dangerous.

> * In general working with the largest useful unit leads to the highest 
>   performance.  I.e. copying a file 1024 double-words at a time is faster
>   then by character.  Of course you have to handle the case when the file
>   is not a multiple of 1024 double words long.
> * Be wary of trying to outguess the OS, memory hierarchies are complicated
>   and changing, it's an area of active research and sometimes changes even in
>   minor revisions to the kernel.
> 
> > I learned that a byte is 8 bits, no matter how many bits are available for
> > storage.
> > I also learned that the CPU stores both an integer and a byte in memory as a
> > word. Try
> > this test:
> > 
> > /* test1.c */
> > int main( int argc, char **argv )
> > {
> >     static char c;
> > }
> > 
> > /* test2.c */
> > int main( int argc, char **argv )
> > {
> >     static int c;
> > }
> > 
> > ls -l test1 test2 <-- the sizes are the same on my computer.
> 
> Umm, ls isn't exactly the tool for this kind of thing, if you don't use a
> variable the compiler could just not allocate it.

Assuming you compile with optimizations off, I would still not be
surprised to see these programs have the same size.  Uninitialized static
variables are not the same as initialized static variables... the compiler
typically stuffs the a number measuring the amount of memory to allocate
for uninitialized static variables into the executable file, and the
startup code that runs before main() allocates one big buffer for all of
the uninitialized static variables in the program and zeroes it out in a
loop.

Even if you made the variable initialized statics or globals, the block of
data stored in the executable image could very well be allocated in words
(16 or 32 or 64 bit), with the unused bytes in the "char" version being
ignored in the running program.

Following list is the typical memory initialization techniques used to 
comply minimally with ANSI C89, from my memory banks:

  class            initialization       mechanism
  auto             uninitialized        alter frame ptr, undefined data
  auto             initialized          alter frame ptr, code copies
                                        data every time function is called
  global           uninitialized        memory allocated on program load,
                                        undefined data
  global           initialized          memory allocated on program load,
                                        block of data copied from image
  static           uninitialized        memory allocated on program load,
                                        zeroed programmatically
  static           initialized          same as global initialized

Linux/GCC seems to hand out zeroed memory to the global uninitialized
variables, which can take the novice by surprise if (s)he ports to another
OS later.

> Often compilers allow tuning for the architecture which includes changing
> the alignments of various datatypes for maximum performance.  These rules
> have changed with the different levels of Pentiums, this is part of what
> changes when you tell the compiler which specific cpu you are targeting.  The
> code should work in all cases, but have maximum performance for the cpu
> that is targeted.  The linux kernel I believe uses these kinds of things
> as well.
> 
> Sometimes for instance it's worth 64 bit aligning something, sometimes it's
> just a waste of memory.
> 
> > > A word is the natural unit of data that can be moved from memory to a
> > > processor register. [1]
> > 
> > Right. The CPU moves words from memory to registers and back. It moves
> > memory in chunks of words because that is how it addresses them.
> 
> Er, that kind implies that a 32 bit load sends 2 requests for 1 word each,
> which is not true.  In actuality memory requests if they miss the cache
> results in a cache line load, which is usually 64-128 bytes or so.

Be careful to distinguish the x86 definition of word from the generic
definition.  By my above definition, a word is 32 bits for a Pentium,
since it has a 32 bit wide data bus, and the native registers are 32 bits
wide.

The generic definition gets used in the C language specification because C
is intended to be implemented so that the basic data types operate as
efficiently as possible no matter what architecture you are on.  The
number of bits is purposely left vague to allow the implementor
flexibility in obtaining efficiency.

> For maximum performance from cache or memory in general you reference
> the bigggest useful chunk you can.  With integers it's 32 bit unless you
> use MMX or SSE.
> 
> You might want to try say reading and writing a few arrays to and from
> memory of different sizes with different datatypes, if your careful you
> should see 2-3 plateau in performance where you can see the performance
> of L1 cache, L2 cache, and main memory seperately.
> 
> > Is this for backward compatibility for 16 bit buses? My guess is that
> > by now there's a "move a 32 bit word from memory to a register in
> > one operation" x86 instruction.
> 
> There is.

I am not sure I understand Tim's initial question here.  The 16-bit
definition of word you seem to use here derives from a desire to maintain
the ability to continue running old assembly language on newer versions of
the x86 family.

> > > It may be inefficient to move a "word" around that is not stored beginning
> > > with the first addressable byte in the data bus.
> > 
> > Hardware is not my forte, but I don't see how this can even be possible,
> > much less inefficient. What instruction addresses the middle of a word?
> 
> Well if you wanted say bits 9-25 (a word) typically you would load the 32
                                ^^ 23?
> bits, and then and it with a bit mask to get the bits you want, then shift
> it if you wish.

Except that the hardware does this much more efficiently in normal
operation.  (I could swear I have done these diagrams already on this
list...)

        | e0| e1| e2| e3|
        +---+---+---+---+
  1008  | 8 | 9 | a | b |
        +---+---+---+---+
  1004  | 4 | 5 | 6 | 7 |
        +---+---+---+---+
  1000  | 0 | 1 | 2 | 3 |
        +---+---+---+---+
        |   |   |   |   |

We could regard the 32bits beginning at address 1000 on a little-endian
processor like the x86 to be binary 000000110000001000000100000000 (3, 2,
1, and 0), or 50_462_976 in perl-decimal.  The processor can put 1000 on
the address lines and enable all four "enable" lines e0, e1, e2, and e3,
and listen to the thirty-two data lines to read the number from memory.

We could regard the 16 bits beginning at address 1004 as a "short" in C on
the x86, with bit pattern 0000010100000100 = 1284.  To read this value
from memory, the processor need only enable e0 and e1, and ignore the
other 16 bits.

Finally, we could regard the 32 bits beginning at address 1006 as one
integer: 00001001000010000000011100000110 = 151_521_030.  However, in
order to read this value into a register, the processor has to put 1004 on
the address lines and turn on e2 and e3, and route the data from the high
16 bits of the data bus to the low 16 bits of the register, ignoring the
low 16 bits of the data bus, and then put 1008 on the address lines, and
enable e0 and e1, and route the low 16 bits of the data bus to the high 16
bits of the register, ignoring the high 16 bits of the data bus.  On a
RISC architecture this much hardware would be considered a waste, so the
CPU would head for the exception handlers if asked to load a word from
address 1006. [1]

If the processor supports this operation, the inefficiency applies whether
the cpu is fetching from cache or not... though the speed impact of
hitting cache is not as great as on main memory, the speed is still at
best half what it could be.

Clearly, bytes are addressable (through combinations of enable lines and
address lines), though they may take some extra work to manipulate.

Note that other data bus widths can be implemented, and enable lines can
enable blocks of other than eight bits.  Such architectures have lost
favor recently, but the generic definitions of "byte" and "word" cover
such architectures.

[1] Strictly speaking, "1004" is never put on the address lines, because
the bottom two address lines are replaced with the four enable lines. That
is, 1004 = binary 00000000000000000000001111101100, but the least
significant bits don't really get used, so they don't bother implementing
them as address lines.  Instead, the enable lines do the job that the
least significant address lines would have done.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...2k
---------------------------------------------------------------------------