[vox-tech] Memory addressing?

Bill Broadley bill at broadley.org
Wed Jun 23 19:28:23 PDT 2010


On 06/23/2010 12:39 PM, Brian Lavender wrote:
> What started this is that a hash ideally provides O(1) lookups.  So,
> I say that to achieve this, one would ideally want a large array to
> store your IP addresses and a hashing function that provides good
> distribution.  Tim pointed to using a smaller array with chaining,
> because he initially thought that arrays were limited to smaller
> sizes than some simple tests later showed.  I replied that using a
> fixed-size array with chaining could cause the hash to decay into a
> linked list, which is O(n).

Indeed, the classic memory vs. access-time tradeoff for hashes.
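
Something like the following is what I'd picture for the "big array
plus a well-mixing hash" side of that tradeoff (an untested sketch of
mine, not code from the thread): count hits per IPv4 address in one
large flat table.  With far fewer distinct addresses than slots,
lookups stay close to O(1); shrink the table and collision handling
starts to dominate, heading toward the O(n) behavior Brian describes.

#include <stdio.h>
#include <stdint.h>

#define TABLE_SIZE (1u << 20)          /* ~1M slots, power of two */

static uint32_t ip_key[TABLE_SIZE];    /* stored address, 0 = empty  */
static uint32_t hits[TABLE_SIZE];      /* so 0.0.0.0 can't be stored */

static uint32_t hash_ip(uint32_t ip) { /* Fibonacci hashing: multiply
                                          by 2^32/phi, keep top bits */
    return (ip * 2654435761u) >> 12;
}

static void count(uint32_t ip) {       /* assumes far fewer distinct
                                          IPs than TABLE_SIZE slots  */
    uint32_t i = hash_ip(ip);
    while (ip_key[i] != 0 && ip_key[i] != ip)
        i = (i + 1) & (TABLE_SIZE - 1);  /* linear probe on collision */
    ip_key[i] = ip;
    hits[i]++;
}

int main(void) {
    count(0x80782E20);                 /* 128.120.46.32 */
    count(0x80782E20);
    count(0x80782E21);                 /* 128.120.46.33 */
    uint32_t i = hash_ip(0x80782E20);
    while (ip_key[i] != 0x80782E20)
        i = (i + 1) & (TABLE_SIZE - 1);
    printf("hits for 128.120.46.32: %u\n", hits[i]);   /* prints 2 */
    return 0;
}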

> Tim also noted that using a large fixed-size array may run up
> against the architectural limits of the program.  I imagine that
> having such a large array would push the lower bound of the stack up
> in memory, or, if allocated at runtime, would push the heap way down
> from the top.
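
To make the address-space point concrete, here's a toy (assuming a
typical Linux layout with an ~8 MB default stack): a 2^24-slot table
of 64-bit counters is 128 MB, no problem as a heap allocation but
hopeless as an automatic (stack) array.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SLOTS (1u << 24)

int main(void) {
    /* uint64_t counts[SLOTS];  -- 128 MB of automatic storage would
       blow the default stack limit; allocate at runtime instead     */
    uint64_t *counts = calloc(SLOTS, sizeof *counts);
    if (counts == NULL) {
        perror("calloc");
        return 1;
    }
    printf("table lives on the heap: %zu MB\n",
           (size_t)SLOTS * sizeof *counts >> 20);
    free(counts);
    return 0;
}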

I had what sounds like a similar problem for web logging.  I wanted to 
take a log entry which contained agent, IP, URL, and referer.  I posted 
the source on lugod quite a while ago.

The pseudocode was:
for each hit:
     agent_id = getid(agent_table, agent_string);
     ip_id    = getid(ip_table,    ip_string);
     url_id   = getid(url_table,   url_string);
     refer_id = getid(refer_table, refer_string);
     append_log(agent_id, ip_id, url_id, refer_id, datestamp, returncode);

Of course, getid would look up the string and, if it didn't exist, 
add it to the table and assign it the next id.
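
In C, that lookup-or-insert might look like this (a sketch with names
of my own; the real code kept a separate table per field, which I've
collapsed to a single table here):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 65536                  /* power of two */

struct node { char *str; int id; struct node *next; };
static struct node *bucket[BUCKETS];   /* chained hash: string -> id */
static int next_id = 1;

static unsigned hash_str(const char *s) {      /* FNV-1a */
    unsigned h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h & (BUCKETS - 1);
}

int getid(const char *s) {
    struct node **b = &bucket[hash_str(s)], *n;
    for (n = *b; n; n = n->next)       /* walk the chain */
        if (strcmp(n->str, s) == 0)
            return n->id;              /* seen it before */
    n = malloc(sizeof *n);             /* first sighting: add it */
    n->str = malloc(strlen(s) + 1);
    strcpy(n->str, s);
    n->id = next_id++;
    n->next = *b;
    *b = n;
    return n->id;
}

int main(void) {
    printf("%d %d %d\n", getid("Mozilla/5.0"), getid("10.0.0.1"),
           getid("Mozilla/5.0"));      /* prints: 1 2 1 */
    return 0;
}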

This resulted in several advantages:
* Made it much easier to extract info from the logs: things like how
   many hits a subset of the website had last year vs. this year, so
   we could evaluate advertising effectiveness.
* Logs were denser (less disk).
* Running top-10 reports for things like hits, bandwidth, IPs,
   agents, and 404s was easy.

While load testing on a rather old machine (a PII-400, I believe) I 
could record 1000 hits/sec even when using MySQL as the backend.  With 
a memory-resident database, or something lighter like SQLite, I'd 
expect much better.  Of course, hardware has come a long way since the 
PII-400 as well.

Such a setup might work well for an IDS as well.  With some care it 
should be relatively straightforward to make it parallel friendly.
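
For instance (pure speculation on my part, nothing from the original
code): shard the id tables by hash, so each worker owns a disjoint
slice of the key space plus its own id range and never takes a lock.

#include <stdio.h>
#include <stdint.h>

#define NWORKERS 4

static unsigned owner(const char *s) { /* same FNV-1a mix as above */
    uint32_t h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h % NWORKERS;
}

int main(void) {
    const char *keys[] = { "128.120.46.32", "128.120.46.33",
                           "Mozilla/5.0", "/index.html" };
    for (int i = 0; i < 4; i++)        /* route each key to its owner */
        printf("%-16s -> worker %u\n", keys[i], owner(keys[i]));
    return 0;
}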

With all that said, I used the big hash method in a simple little tool I 
wrote to instantly tell you where your bandwidth is going:
   http://broadley.org/bill/tcpdump.pl

The output:
tcpdump -n -c 1000 | ./tcpdump.pl | head -5
1000 packets captured
1000 packets received by filter
0 packets dropped by kernel
lines=1000 lengths=760
     613884 128.120.46.32.8649 -> 128.120.246.1.48876
     153463 128.120.46.33.8649 -> 128.120.246.1.48609
     153462 128.120.46.33.8649 -> 128.120.246.52.54912
       2684 131.215.74.22.61733 -> 128.120.246.102.9996
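
The guts of that approach also fit in a page of C if Perl isn't your
thing.  This is a rough sketch, not the actual script: it assumes the
usual "tcpdump -n" line format of "IP src > dst: ... length N" and
prints unsorted totals.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 65536                    /* power of two; assumes far
                                          fewer flows than this      */

static char      *key[SLOTS];          /* "src -> dst" strings */
static long long  bytes[SLOTS];

static unsigned hash(const char *s) {  /* FNV-1a */
    unsigned h = 2166136261u;
    while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h & (SLOTS - 1);
}

static void add(const char *k, long n) {
    unsigned i = hash(k);
    while (key[i] && strcmp(key[i], k) != 0)
        i = (i + 1) & (SLOTS - 1);     /* linear probe on collision */
    if (!key[i]) {                     /* first sighting of this pair */
        key[i] = malloc(strlen(k) + 1);
        strcpy(key[i], k);
    }
    bytes[i] += n;
}

int main(void) {
    char line[1024], src[64], dst[64], k[160];
    long len;
    while (fgets(line, sizeof line, stdin)) {
        char *p = strstr(line, " IP ");
        char *q = strstr(line, "length ");
        if (!p || !q) continue;
        if (sscanf(p + 4, "%63s > %63[^:]", src, dst) != 2) continue;
        if (sscanf(q + 7, "%ld", &len) != 1) continue;
        snprintf(k, sizeof k, "%s -> %s", src, dst);
        add(k, len);
    }
    for (unsigned i = 0; i < SLOTS; i++)
        if (key[i]) printf("%10lld %s\n", bytes[i], key[i]);
    return 0;
}

Pipe it through "sort -rn | head -5" to get the top talkers, much like
the output above.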
