[vox-tech] Greylisting and LUGOD
Karsten M. Self
kmself at ix.netcom.com
Tue Sep 14 23:38:59 PDT 2004
on Mon, Sep 13, 2004 at 09:39:15AM -0700, Rod Roark (rod at sunsetsystems.com) wrote:
> I ran across this during my morning reading:
>
> http://projects.puremagic.com/greylisting/whitepaper.html
>
> of which there seems to be a good Postfix implementation
> (Postfix is my MTA of choice):
>
> http://isg.ee.ethz.ch/tools/postgrey/
OK, I gave that a 30 second scan. It fits in with a few of my own
activities, of which spam profiling is one.
> So I'm seriously considering putting this on my server.
>
> The effect on LUGOD would be:
>
> (1) Virtually no spam.
Good luck.
You _may_ significantly drop the spam load. Killing it outright is
unlikely. More below.
> Mostly this is of interest to the
> officers, as the mailing lists already require
> registration in order to post; however spammers might
> easily forge the FROM header to abuse this.
Note that the greylisting is based on a tuple of which at least one
element (immediate upstream IP) is difficult or impossible to reliably
forge.
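To make the tuple concrete: the whitepaper keys on (client IP, envelope sender, envelope recipient). What follows is only a minimal sketch of that check, not postgrey's implementation. The flat-file store, the function name greylist_check, the variable names, and the 300-second delay are all assumptions made for illustration.

```shell
# Hypothetical flat-file greylist; real implementations use a proper database.
GREYLIST_DB=${GREYLIST_DB:-/tmp/greylist.db}
GREYLIST_DELAY=${GREYLIST_DELAY:-300}    # seconds a new triplet must wait (assumed value)

# greylist_check CLIENT_IP SENDER RECIPIENT [NOW]
# Prints "defer" the first time a triplet is seen and on retries made before
# GREYLIST_DELAY has elapsed; prints "accept" once the delay has passed.
# NOW (epoch seconds) defaults to the current time; pass it for testing.
greylist_check() {
    key="$1|$2|$3"
    now=${4:-$(date +%s)}
    touch "$GREYLIST_DB"
    # Look up when this exact triplet was first seen, if ever.
    first_seen=$(awk -F'\t' -v k="$key" '$1 == k { print $2; exit }' "$GREYLIST_DB")
    if [ -z "$first_seen" ]; then
        printf '%s\t%s\n' "$key" "$now" >> "$GREYLIST_DB"
        echo defer
    elif [ $((now - first_seen)) -ge "$GREYLIST_DELAY" ]; then
        echo accept
    else
        echo defer
    fi
}
```

Forging the sender or recipient only creates a new triplet, which gets deferred again; the client IP is the element the spammer can't cheaply change.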
> (2) Mail from first-time posters, or from those who post
> less frequently than once per month, would likely be
> delayed by an hour or so.
Possibly.
> (3) This *might* allow me to eliminate the current blocking
> of mail from dynamic IPs.
...iff (sic) the IP isn't a candidate for blocking under other criteria.
> Comments?
Sure.
First: for a given receiving MTA, the _vast_ bulk of legit mail will
appear to come from a handful of IPs, or failing that, netblocks.
Just for kicks, I happen to have some 857+ mails in my lugod vox-tech
folder. Let's get their upstream IPs, that is: the IP from which
LUGoD's mailserver received the mail. This may not be the _ultimate_
origin, but it is the one _assured_ point of transit, and certainly has
no business, say, spewing forth spam.
OK, I'm in my vox-tech/new Maildir directory:
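# For each message, pull the Received: headers, keep the hop where
# LUGoD's server took delivery, and strip everything but the peer IP.
for f in *
do
    formail -cX "Received:" < "$f" |
        grep -m2 'by www\.livepenguin\.com' |
        grep -v 'ns1\.livepenguin\.com'
done |
    sed -e 's/by www\.live.*//' -e 's/^.*\[//' -e 's/[])]//g' |
    tee /tmp/lugod-ips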
wc -l /tmp/lugod-ips
sort -u < /tmp/lugod-ips | wc -l
Gives 857 mails from 141 IPs. OK, that's a big handful....
Let's run these through the reverse-DNS service at asn.routeviews.org
which lets us determine the ASN and CIDR associated with each IP.
'reverse_ip' is a bash shell function in my SpamTools kit which reverses
the quads of an IP for rDNS queries:
for ip in $( cat /tmp/lugod-ips )
do
    host -W 6 -R 10 -t txt "$( reverse_ip "$ip" ).asn.routeviews.org"
done | sed -e 's/^.*text //' -e 's/"//g' | tee /tmp/lugod-cidrs
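The reverse_ip helper itself isn't shown in the post. A minimal stand-in, assuming plain dotted-quad IPv4 input (this is a sketch, not the SpamTools original), could look like:

```shell
# Hypothetical stand-in for reverse_ip: reverses the quads of an IPv4
# address so it can be prepended to a DNS zone, e.g. 192.0.2.10 -> 10.2.0.192
reverse_ip() {
    echo "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 }'
}

reverse_ip 192.0.2.10
```

No input validation is done; anything that isn't four dot-separated fields will produce garbage.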
We're now down to a total of 45 ASNs, of which 41 appear more than once:
$ awk '{print $1}' /tmp/lugod-cidrs | sort | uniq -c | sort -nr | cat -n
1 160 7065
2 108 7132
3 82 5731
4 77 7961 [> half of all messages]
5 61 7018
6 60 22489
7 48 6192
8 41 4294967295 [unresolved]
9 41 26085
10 34 1698
11 34 10787
12 21 4265
13 20 15169
14 17 701
15 15 11403
16 14 26101
17 12 4355
18 11 6540
19 10 14779
20 9 6939
21 9 21566
22 9 2152
23 9 17175
24 7 7407
25 7 23310
26 7 11022
27 6 6478
28 6 29863
29 6 21844
30 6 174
31 6 14051
32 6 12076
33 4 25646
34 4 1742
35 3 6785
36 3 4151
37 3 3561
38 2 6517
39 2 26283
40 2 22799
41 2 12181
42 1 226
43 1 209
44 1 15687
45 1 14829
...which is getting to the neighborhood of what I'd consider to be "a
handful". *Half* of all mail comes from four ASNs. The "4294967295"
value, BTW, is what routeviews.org returns for an unknown IP -- the data
aren't perfect.
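The "top N account for X%" figure can be computed rather than eyeballed. A minimal sketch, assuming input in "uniq -c | sort -nr" form (count first, already sorted descending); top_share is a name made up for illustration:

```shell
# top_share N: read "count key" lines on stdin, sorted by count descending,
# and print the percentage of the total contributed by the first N lines.
top_share() {
    awk -v n="$1" '
        { total += $1; if (NR <= n) top += $1 }
        END { if (total > 0) printf "%.1f\n", 100 * top / total }
    '
}

# Usage against the file built above:
#   awk '{print $1}' /tmp/lugod-cidrs | sort | uniq -c | sort -nr | top_share 4
```

The same function works unchanged on the CIDR counts, since it only looks at the first column.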
We can also get CIDR from the string (it's the second and third columns
in my output file). Turns out the spread isn't too much more -- 64
CIDRs, of which 54 appear more than once:
$ awk '{printf( "%s/%s\n", $2, $3)}' /tmp/lugod-cidrs | sort | uniq -c |
sort -nr | cat -n
1 142 64.142.0.0/19
2 77 198.144.192.0/19
3 74 168.150.0.0/16
4 61 204.127.128.0/17
5 48 169.237.0.0/16
6 41 66.163.160.0/19 [> half of all mail]
7 41 0/0
8 34 216.57.64.0/20
9 34 207.115.32.0/19
10 33 204.127.200.0/21
11 30 69.55.224.0/20
12 28 204.127.192.0/21
13 21 216.148.224.0/22
14 21 216.148.224.0/19
15 17 207.247.0.0/16
16 16 63.192.0.0/12
17 15 69.55.238.0/24
18 15 69.55.237.0/24
19 15 66.111.0.0/20
20 15 208.201.224.0/19
21 14 66.218.64.0/19
22 11 209.210.251.0/24
23 11 207.217.0.0/16
24 10 64.233.170.0/24
25 10 64.233.160.0/19
26 10 206.190.32.0/20
27 9 212.165.128.0/17
28 9 208.184.190.0/23
29 9 130.86.0.0/16
30 8 158.222.0.0/16
31 7 63.101.96.0/21
32 7 209.239.32.0/19
33 7 209.232.0.0/15
34 7 199.233.217.0/24
35 6 69.56.128.0/17
36 6 65.54.224.0/19
37 6 38.0.0.0/8
38 6 209.151.64.0/19
39 4 64.62.128.0/18
40 4 64.62.128.0/17
41 4 24.2.32.0/19
42 4 209.79.220.0/22
43 4 134.174.0.0/16
44 3 66.120.0.0/13
45 3 64.142.64.0/19
46 3 217.157.0.0/16
47 3 209.225.0.0/18
48 3 147.49.0.0/16
49 2 66.60.128.0/18
50 2 66.54.152.0/23
51 2 66.54.128.0/17
52 2 24.207.0.0/18
53 2 216.93.192.0/19
54 2 216.86.192.0/19
55 1 67.172.160.0/19
56 1 67.169.224.0/20
57 1 66.60.130.0/24
58 1 66.60.129.0/24
59 1 65.19.128.0/18
60 1 217.16.96.0/20
61 1 207.69.200.0/24
62 1 207.159.64.0/18
63 1 207.159.120.0/24
64 1 130.221.0.0/16
The handy things about ASNs and CIDRs are:
- They aggregate beautifully. A wide range of IPs clusters into a
narrow band of CIDRs or ASNs. So both your spamhaus with a large
number of IPs trickling out a small number of spams each, and your
friendly neighborhood ISP with a few hundred white hats scattered
over a /24 or /18, cluster nicely.
- The data's a DNS query away. And the zonefiles are rsyncable.
- The spam/ham determination can be as local and specific as you want.
- Organizationally, ASNs and CIDRs both map to what's typically a
single entity with effective control over its network. How it uses
that control, and whether for good or for bad, is its business. But
the data are readily and immediately available to you.
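As an illustration of keeping the determination local, per-ASN spam share can be tallied from your own verdicts. This is a sketch only: the "ASN verdict" input format and the name asn_spam_share are assumptions, and the ASNs in the usage line are private-use numbers with made-up verdicts, not real data.

```shell
# asn_spam_share: read "ASN verdict" lines on stdin, where verdict is
# "spam" or "ham" (assumed format), and print "ASN total spam_pct" per ASN,
# worst offenders first.
asn_spam_share() {
    awk '
        { total[$1]++; if ($2 == "spam") spam[$1]++ }
        END {
            for (a in total)
                printf "%s %d %.0f\n", a, total[a], 100 * spam[a] / total[a]
        }
    ' | sort -k3,3nr -k1,1n
}

# Made-up sample: three mails from AS64512 (two spam), one ham from AS64513.
printf '64512 spam\n64512 spam\n64513 ham\n64512 ham\n' | asn_spam_share
```

Swapping the ASN column for a CIDR gives the same report at netblock granularity.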
Where I see the next generation of MTAs headed is keeping track of
sender reputation not on the basis of an individual IP's track record
(the classic DNSBL model), but on the record of blocks of IPs. If you
think about the implications of IPv6 (effectively limitless address
space), you'll *have* to utilize an aggregating tool to be able to use
reputation-based tools effectively (of course, IPv6 appears to be a ways
off for other reasons as well....).
My own data suggest that the bulk of spam, like the bulk of mail on a
list, originates from a small number of identifiable sources. One ASN
regularly accounts for between 12% and 18% of my own spam (Kornet's
4766). The top four ASNs are 25% of my spam, the top 20 or so, 50%.
Which suggests a very cheap mode of cutting into spam volumes markedly:
employing ASNs, CIDRs, or similar IP aggregates (though I'm aware of no
others) in generating reputation data, and effecting firewalling,
probabilistic rejection (you reject traffic from an ASN in direct
proportion to the probability it's spam), rate-limiting, etc.
Backing off from a black-and-white allow/deny mode gives legit mail a
fighting chance....
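Probabilistic rejection can be sketched in a few lines. The function name should_reject and its arguments are hypothetical; the spam percentage would come from whatever per-ASN reputation data you keep.

```shell
# should_reject SPAM_PCT [ROLL]
# SPAM_PCT: observed share of spam from the sender's ASN, 0-100.
# ROLL:     integer 0-99; drawn at random if omitted. Passing it
#           explicitly makes the decision reproducible for testing.
# Returns 0 (reject) when ROLL < SPAM_PCT, otherwise 1 (accept).
should_reject() {
    spam_pct=$1
    # Note: awk's srand() seeds from the clock, so calls within the
    # same second draw the same number. Good enough for a sketch.
    roll=${2:-$(awk 'BEGIN { srand(); print int(rand() * 100) }')}
    [ "$roll" -lt "$spam_pct" ]
}

if should_reject 15; then
    echo reject
else
    echo accept
fi
```

An ASN that is 15% spam loses roughly 15% of its connections; one that is 0% spam is never rejected, and one that is 100% spam always is.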
Which all sounds well and good.
The question, though, is how much spam are you getting?
There are two large-volume, well-known lists for which I'm aware of spam
stats being available, comp.risks and the debian-user mailing list.
Comp.risks declared in 2001 that it had reached the spam crossover:
even with filtering, over 50% of the mail received
in the moderator's inbox was spam. As of October 2003, with
SpamAssassin catching > 1000 spams daily, *90%* of the remaining volume
was spam:
http://catless.ncl.ac.uk/Risks/22.92.html#subj9.1
Debian-user currently rejects > 95% of all mail based on various rules.
Let's say you've got a list that receives 90% spam, and you introduce
point-of-origin filtering at the 50% cutoff (kill any aggregated network
in the first 50%ile spam contributors list).
Congratulations, you've just eliminated half your spam with a single
20-element rule, based on your own experience.
Your list also _still_ receives spam equal to 45% of its original volume.
It's a matter of both the amount of spam you can cut, and the total
volume of spam you're receiving.
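The arithmetic is worth spelling out, since what matters to the reader of the list is the mix that remains. A small sketch (remaining_spam_pct is a made-up name; the only assumption is that the filter removes spam and leaves legit mail untouched):

```shell
# remaining_spam_pct SPAM_PCT CUT_PCT
# SPAM_PCT: share of incoming mail that is spam (0-100).
# CUT_PCT:  share of that spam removed by the origin filter (0-100).
# Prints spam as a percentage of the mail left after filtering.
remaining_spam_pct() {
    awk -v s="$1" -v c="$2" 'BEGIN {
        spam_left = s * (100 - c) / 100   # surviving spam, % of original volume
        ham = 100 - s                     # legit mail, untouched by the filter
        left = spam_left + ham
        if (left == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * spam_left / left
    }'
}

remaining_spam_pct 90 50
```

For the 90% / 50% case above this prints 81.8: of every 100 original messages, 45 spam and 10 legit remain, so spam is still about four-fifths of what gets through.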
On the other hand, content/context-based filtering gets expensive both
CPU- and time-wise, particularly if you're making extensive use of DNSBLs
(they're useful data sources, but they're time-intensive). It takes me
10-20 seconds to determine spam or ham on my own system, on a high-speed
line, via SpamAssassin. I'm faster doing it manually, but I'm not going
to sit there hour after hour, day in and day out. So the machine does it.
My own read: any network in the top-50% range, or whose net mail
contribution is > 50% spam, has no business delivering legitimate
_packets_, let alone mail, and should be firewalled. I see this as a
network hygiene issue -- one of the administrators of a network
adequately policing it and ensuring that it doesn't spew crud over other
people's networks. And if they're not going to make the effort to
prevent this to their own satisfaction and needs, the rest of the Net's
welcome to take whatever measures satisfy _their_ own business needs.
So, making a long post, um, longer: reputation-based MTAs are a Good
Thing[tm], and disposition of mail at SMTP time is the Right Way To Do
It[tm]. It is not, however, the Total Solution[tm]. You're going to
need content filtering. It's a nice big step though, and you _can_ use
origin to, say, preserve your expensive filtering steps for the small
number (by volume) of points-of-origin for which you don't have a good
trust basis.
Rod, does that answer your question ;-)
Peace.
--
Karsten M. Self <kmself at ix.netcom.com> http://kmself.home.netcom.com/
What Part of "Gestalt" don't you understand?
Erin Joyce: can't get the story right, won't correct it
http://z.iwethey.org/forums/render/content/show?contentid=96625