[vox-tech] lame question on memory allocation

Bill Broadley vox-tech@lists.lugod.org
Tue, 21 Jan 2003 19:51:29 -0800


First I'd like to mention that register size depends on exactly which
registers you're referring to.

Your average generic pentium 4 will have:
	32 bit integer registers
	80 bit floating point registers
	64 bit MMX registers
	128 bit SSE (XMM) registers

AMD's coming out with a new cpu "RSN" (real soon now) that allows for the
integer registers to be 64 bits, and doubles the number of floating point
and integer registers.  The 128 bit quantity/register, BTW, is called a
double quadword I believe.

Back to memory allocation: in general, use what you need and try to ignore
the page size whenever possible.  Even on x86 there is often more than one
page size available, and it's up to the OS which to use.

Don't forget that you never know when you're crossing a page boundary,
so while a char may trigger a new page just like a larger array, the
odds are much higher for the larger array.

As far as optimizations go:
* Malloc is VERY expensive; avoid it when possible, especially inside loops
* Large malloc requests can increase performance if you efficiently track
  the utilization, and make lots of regular small sized requests.  This
  is often called "pooling"
* Doing your own memory allocation library is tricky, it's very easy to
  introduce subtle errors
* Memory allocation in general is very tricky; double allocations, double
  frees, and the dreaded memory leak are common side effects.
* Be very careful with array lengths; avoid relying on null-terminated
  strings where possible, especially if the source of the string is a remote
  program, service, user, etc.  Buffer overflows are the number 1 security
  problem.
* In general, working with the largest useful unit leads to the highest
  performance.  I.e. copying a file 1024 double-words at a time is faster
  than by character.  Of course you have to handle the case when the file
  is not a multiple of 1024 double-words long.
* Be wary of trying to outguess the OS.  Memory hierarchies are complicated
  and changing; it's an area of active research and sometimes changes even in
  minor revisions to the kernel.

> I learned that a byte is 8 bits, no matter how many bits are available for
> storage.
> I also learned that the CPU stores both an integer and a byte in memory as a
> word. Try
> this test:
> 
> /* test1.c */
> int main( int argc, char **argv )
> {
>     static char c;
> }
> 
> /* test2.c */
> int main( int argc, char **argv )
> {
>     static int c;
> }
> 
> ls -l test1 test2 <-- the sizes are the same on my computer.

Umm, ls isn't exactly the tool for this kind of thing; if you don't use a
variable the compiler may simply not allocate it at all.

Often compilers allow tuning for the architecture, which includes changing
the alignments of various datatypes for maximum performance.  These rules
have changed across the different generations of Pentiums; this is part of
what changes when you tell the compiler which specific cpu you are
targeting.  The code should work in all cases, but have maximum performance
on the targeted cpu.  The linux kernel, I believe, uses these kinds of
things as well.

Sometimes, for instance, it's worth 64-bit aligning something; sometimes
it's just a waste of memory.

> > A word is the natural unit of data that can be moved from memory to a
> > processor register. [1]
> 
> Right. The CPU moves words from memory to registers and back. It moves
> memory in chunks of words because that is how it addresses them.

Er, that kind of implies that a 32 bit load sends 2 requests of 1 word
each, which is not true.  In actuality, a memory request that misses the
cache results in a cache line load, which is usually 64-128 bytes or so.

For maximum performance from cache or memory in general you reference
the biggest useful chunk you can.  With integers that's 32 bits unless you
use MMX or SSE.

You might want to try, say, reading and writing a few arrays of different
sizes to and from memory; if you're careful you should see 2-3 plateaus in
performance, where you can see the performance of L1 cache, L2 cache, and
main memory separately.

> Is this for backward compatibility for 16 bit buses? My guess is that
> by now there's a "move a 32 bit word from memory to a register in
> one operation" x86 instruction.

There is.

> > It may be inefficient to move a "word" around that is not stored beginning
> > with the first addressable byte in the data bus.
> 
> Hardware is not my forte, but I don't see how this can even be possible,
> much less inefficient. What instruction addresses the middle of a word?

Well, if you wanted, say, bits 9-25 of a word, typically you would load the
32 bits, AND it with a bit mask to get the bits you want, then shift the
result if you wish.


-- 
Bill Broadley
Mathematics
UC Davis