[vox-tech] Memory addressing?
Bill Broadley
bill at broadley.org
Wed Jun 23 19:28:23 PDT 2010
On 06/23/2010 12:39 PM, Brian Lavender wrote:
> What started this is that a hash ideally provides O(1) lookup. So,
> I say that to achieve this, one would ideally want to have a large array
> to store your IP addresses and a hashing function that provides good
> distribution. To which Tim pointed to using a smaller array and
> chaining, because he initially thought that arrays were limited to smaller
> sizes than was observed by doing some simple tests. To which I replied,
> using a fixed-size array with chaining could cause the hash to decay
> into a linked list, which has O(n) lookup.
Indeed, the classic memory vs. access-time tradeoff for hashes.
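For concreteness, here's a minimal sketch of that tradeoff in C (my own
code, not anyone's actual implementation): with a decent hash and NBUCKETS
near the key count, lookups stay close to O(1); shrink the fixed table far
below the number of keys and every bucket decays into a long chain, i.e. a
linked-list walk.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 65536   /* fixed size; chains grow as keys outnumber buckets */

    struct node {
        char *key;
        struct node *next;   /* collision chain */
    };

    static struct node *table[NBUCKETS];

    /* djb2: a simple string hash with decent distribution */
    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Average O(1) while chains stay short; decays toward O(n)
       when the fixed table is far smaller than the key count. */
    static struct node *lookup(const char *key)
    {
        struct node *n = table[hash(key) % NBUCKETS];
        while (n && strcmp(n->key, key) != 0)
            n = n->next;
        return n;
    }

    static void insert(const char *key)
    {
        unsigned long b = hash(key) % NBUCKETS;
        struct node *n = malloc(sizeof *n);   /* sketch: no duplicate check */
        n->key = strdup(key);
        n->next = table[b];
        table[b] = n;
    }

    int main(void)
    {
        insert("128.120.46.32");
        printf("%s\n", lookup("128.120.46.32") ? "found" : "missing");
        return 0;
    }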
> And to which Tim has noted that using a
> large fixed-size array may run up against the architectural limits
> of the program. I imagine that having such a large array would push
> the lower bound of the stack up in memory, or, if allocated at runtime,
> would push the heap way down from the top.
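On the layout point, the practical distinction is usually simpler: a huge
array declared inside a function lives on the stack, which has a small
fixed limit (often 8 MB by default), while the same array from malloc or
calloc comes off the heap and is limited only by address space and RAM.
A tiny illustration, with a made-up table size:

    #include <stdlib.h>

    #define SLOTS (1 << 24)   /* made-up size: 16M ints, 64 MB */

    int main(void)
    {
        /* int counts[SLOTS];  <- on the stack this would likely
           blow the default limit and crash at startup */
        int *counts = calloc(SLOTS, sizeof *counts);   /* heap: fine, and zeroed */
        if (!counts)
            return 1;
        counts[12345]++;
        free(counts);
        return 0;
    }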
I had what sounds like a similar problem for web logging. I wanted to
take each log entry, which contained agent, IP, URL, and referer, and
store ids instead of the raw strings. I posted the source on lugod quite
a while ago.
The pseudocode was:

    for hits:
        agent_id = getid(agent, "agent string");
        ip_id    = getid(ip, "ip string");
        url_id   = getid(url, "url string");
        refer_id = getid(refer, "refer string");
        append_log(agent_id, ip_id, url_id, refer_id, datestamp, returncode);
Of course, getid would look up the value and, if it didn't exist, add
that string to the table and assign it a new id.
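A minimal sketch of that lookup-or-insert idiom, assuming an
open-addressed string table (my simplification: one global table and a
one-argument getid, where the original took the value plus a description
string):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define TABSIZE (1 << 20)   /* power of two, well above the expected key count */

    static char *keys[TABSIZE];  /* open-addressed string table */
    static int   ids[TABSIZE];
    static int   next_id = 1;

    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Return the id for key, assigning a fresh one on first sight.
       Sketch only: assumes the table never fills up. */
    static int getid(const char *key)
    {
        unsigned long i = hash(key) & (TABSIZE - 1);
        while (keys[i]) {
            if (strcmp(keys[i], key) == 0)
                return ids[i];               /* seen before: reuse its id */
            i = (i + 1) & (TABSIZE - 1);     /* linear probe to the next slot */
        }
        keys[i] = strdup(key);               /* first sight: remember it */
        ids[i] = next_id++;
        return ids[i];
    }

    int main(void)
    {
        /* prints "1 2 1": the repeated agent string maps back to id 1 */
        printf("%d %d %d\n", getid("Mozilla/4.0"), getid("10.0.0.1"),
               getid("Mozilla/4.0"));
        return 0;
    }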
This resulted in several advantages:
* Made it much easier to extract info from the logs: things like how
  many hits a subset of the website had last year vs. this year,
  so we could evaluate advertising effectiveness.
* Logs were denser (less disk space).
* Running top-10 reports for things like hits, bandwidth, IPs,
  agents, and 404s was easy.
While load testing on a rather old machine (a PII-400, I believe) I could
record 1000 hits/sec even when using MySQL as the backend. With a
memory-resident database or something lighter like SQLite I'd expect
much better. Of course, hardware has come a long way since the PII-400
as well.
Such a setup might work well for an IDS as well. With some care it
should be relatively straightforward to make it parallel friendly.
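One hedged sketch of what I mean by parallel friendly (my own toy code,
with made-up names): route each key to a worker by hash, so every shard
is owned by exactly one thread and no locking is needed.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static const char *keys[] = {
        "128.120.46.32", "128.120.46.33", "128.120.246.1", "128.120.46.32",
    };
    #define NKEYS (sizeof keys / sizeof keys[0])

    static long counts[NTHREADS];   /* stand-in for one hash table per thread */

    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Each worker only touches keys that hash to its shard, so no
       locks are needed: no two threads ever share an entry. */
    static void *worker(void *arg)
    {
        long me = (long)arg;
        for (unsigned long i = 0; i < NKEYS; i++)
            if (hash(keys[i]) % NTHREADS == (unsigned long)me)
                counts[me]++;   /* a real shard would do a hash insert here */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        for (long i = 0; i < NTHREADS; i++)
            printf("shard %ld: %ld keys\n", i, counts[i]);
        return 0;
    }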
With all that said, I used the big hash method in a simple little tool I
wrote to instantly tell you where your bandwidth is going:
http://broadley.org/bill/tcpdump.pl
The output:
tcpdump -n -c 1000 | ./tcpdump.pl | head -5
1000 packets captured
1000 packets received by filter
0 packets dropped by kernel
lines=1000 lengths=760
613884 128.120.46.32.8649 -> 128.120.246.1.48876
153463 128.120.46.33.8649 -> 128.120.246.1.48609
153462 128.120.46.33.8649 -> 128.120.246.52.54912
2684 131.215.74.22.61733 -> 128.120.246.102.9996
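The script itself is Perl, but the big-hash idea is just: key on
"src -> dst" and accumulate packet lengths. A rough C sketch of the
accumulation step (hypothetical simplified input format, not what the
actual tcpdump.pl parses):

    #include <stdio.h>
    #include <string.h>

    #define TABSIZE 65536

    struct flow {
        char key[72];    /* "src -> dst" */
        long bytes;
    };

    static struct flow table[TABSIZE];

    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Add len bytes to the flow's running total, inserting on first sight. */
    static void account(const char *src, const char *dst, long len)
    {
        char key[72];
        snprintf(key, sizeof key, "%s -> %s", src, dst);

        unsigned long i = hash(key) % TABSIZE;
        while (table[i].key[0] && strcmp(table[i].key, key) != 0)
            i = (i + 1) % TABSIZE;   /* linear probe; assumes table never fills */
        if (!table[i].key[0])
            strcpy(table[i].key, key);
        table[i].bytes += len;
    }

    int main(void)
    {
        char line[256], src[32], dst[32];
        long len;

        /* Hypothetical input, one packet per line: "src > dst: length N".
           Real tcpdump output needs more parsing, which the Perl script does. */
        while (fgets(line, sizeof line, stdin))
            if (sscanf(line, "%31s > %31[^:]: length %ld", src, dst, &len) == 3)
                account(src, dst, len);

        /* Dump totals (the real script sorts descending by bytes). */
        for (unsigned long i = 0; i < TABSIZE; i++)
            if (table[i].key[0])
                printf("%ld %s\n", table[i].bytes, table[i].key);
        return 0;
    }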