[vox-tech] Greylisting and LUGOD

Karsten M. Self kmself at ix.netcom.com
Tue Sep 14 23:38:59 PDT 2004


on Mon, Sep 13, 2004 at 09:39:15AM -0700, Rod Roark (rod at sunsetsystems.com) wrote:
> I ran across this during my morning reading:
> 
>   http://projects.puremagic.com/greylisting/whitepaper.html
> 
> of which there seems to be a good Postfix implementation
> (Postfix is my MTA of choice):
> 
>   http://isg.ee.ethz.ch/tools/postgrey/

OK, I gave that a 30 second scan.  It fits in with a few of my own
activities, of which spam profiling is one.
 
> So I'm seriously considering putting this on my server.
> 
> The effect on LUGOD would be:
> 
> (1) Virtually no spam.  

Good luck.

You _may_ significantly drop the spam load.  Killing it outright is
unlikely.  More below.

>     Mostly this is of interest to the
>     officers, as the mailing lists already require
>     registration in order to post; however spammers might
>     easily forge the FROM header to abuse this.

Note that the greylisting is based on a tuple of which at least one
element (immediate upstream IP) is difficult or impossible to reliably
forge.
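
For anyone who hasn't read the whitepaper: the tuple is (upstream IP, envelope sender, envelope recipient), and the decision itself is small enough to sketch in shell.  This is an illustration only and is not taken from postgrey; the function name, the flat-file database format, and the 300-second delay in the usage line are all assumptions.

```shell
# Hypothetical sketch of a greylisting decision; not taken from postgrey.
# Usage: greylist_check DBFILE NOW MIN_DELAY IP SENDER RECIPIENT
# DBFILE must already exist; it holds lines of "first_seen ip sender rcpt".
# NOW and MIN_DELAY are in seconds.
greylist_check() {
    db=$1 now=$2 delay=$3 key="$4 $5 $6"
    # Look up when this exact tuple was first seen, if ever.
    first=$(awk -v k="$key" '($2 " " $3 " " $4) == k { print $1; exit }' "$db")
    if [ -z "$first" ]; then
        # Never seen: record it and tell the sender to try again later (4xx).
        echo "$now $key" >> "$db"
        echo defer
    elif [ $(( now - first )) -lt "$delay" ]; then
        # Retried too soon: keep deferring.
        echo defer
    else
        # Retried after the minimum delay, as a real MTA would: let it in.
        echo accept
    fi
}
```

Something like `greylist_check /tmp/grey.db "$(date +%s)" 300 192.0.2.1 a@example.org b@example.net` would print "defer" on first contact and "accept" on a retry five minutes later.  The sender and recipient are trivially forgeable; the IP is what makes the tuple hard to game.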
 
> (2) Mail from first-time posters, or from those who post
>     less frequently than once per month, would likely be
>     delayed by an hour or so.

Possibly.
 
> (3) This *might* allow me to eliminate the current blocking
>     of mail from dynamic IPs.

...iff (sic) the IP isn't a candidate for blocking under other criteria.
 
> Comments?

Sure.

First:  for a given receiving MTA, the _vast_ bulk of legit mail will
appear to come from a handful of IPs, or failing that, netblocks.

Just for kicks, I happen to have some 857+ mails in my lugod vox-tech
folder.  Let's get their upstream IPs, that is:  the IP from which
LUGoD's mailserver received the mail.  This may not be the _ultimate_
origin, but it is the one _assured_ point of transit, and certainly has
no business, say, spewing forth spam.

OK, I'm in my vox-tech/new Maildir directory:

    for f in *
    do
        # -c joins continued header lines; -X extracts only Received: headers
        formail -cX "Received:" < "$f" |
            grep -m2 'by www\.livepenguin\.com' |
            grep -v 'ns1\.livepenguin\.com'
    done |
        sed -e 's/by www\.live.*//' -e 's/^.*\[//' -e 's/[])]//g' |
        tee /tmp/lugod-ips

    wc -l /tmp/lugod-ips
    sort -u < /tmp/lugod-ips | wc -l

Gives 857 mails from 141 IPs.  OK, that's a big handful....

Let's run these through the reverse-DNS service at asn.routeviews.org
which lets us determine the ASN and CIDR associated with each IP.
'reverse_ip' is a bash shell function in my SpamTools kit which reverses
the quads of an IP for rDNS queries:

    for ip in $( cat /tmp/lugod-ips )
    do
        host -W 6 -R 10 -t txt "$( reverse_ip "$ip" ).asn.routeviews.org"
    done | sed -e 's/^.*text //' -e 's/"//g' | tee /tmp/lugod-cidrs
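
The 'reverse_ip' function itself isn't shown in this post.  A minimal stand-in, assuming all it has to do is flip the four octets of a dotted quad, could look like this (the body is a guess, not the SpamTools version):

```shell
# Hypothetical stand-in for the SpamTools 'reverse_ip' function.
# Prints the octets of an IPv4 dotted quad in reverse order.
reverse_ip() {
    # Split on "." and emit fields 4..1, rejoined with dots.
    echo "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 }'
}
```

So `reverse_ip 192.0.2.10` prints `10.2.0.192`, ready to have `.asn.routeviews.org` appended.  No validation is done; garbage in, garbage out.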


We're now down to a total of 45 ASNs, of which 41 appear more than once:

    $ awk '{print $1}' /tmp/lugod-cidrs  | sort | uniq -c | sort -nr | cat -n
         1      160 7065
         2      108 7132
         3       82 5731
         4       77 7961        [> half of all messages]
         5       61 7018
         6       60 22489
         7       48 6192
         8       41 4294967295  [unresolved]
         9       41 26085
        10       34 1698
        11       34 10787
        12       21 4265
        13       20 15169
        14       17 701
        15       15 11403
        16       14 26101
        17       12 4355
        18       11 6540
        19       10 14779
        20        9 6939
        21        9 21566
        22        9 2152
        23        9 17175
        24        7 7407
        25        7 23310
        26        7 11022
        27        6 6478
        28        6 29863
        29        6 21844
        30        6 174
        31        6 14051
        32        6 12076
        33        4 25646
        34        4 1742
        35        3 6785
        36        3 4151
        37        3 3561
        38        2 6517
        39        2 26283
        40        2 22799
        41        2 12181
        42        1 226
        43        1 209
        44        1 15687
        45        1 14829

...which is getting to the neighborhood of what I'd consider to be "a
handful".  *Half* of all mail comes from four ASNs.  The "4294967295"
value, BTW, is what routeviews.org returns for an unknown IP -- the data
aren't perfect.
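
To see where the halfway mark falls without adding things up by hand, a running percentage can be bolted onto the end of that pipeline.  A rough sketch, assuming `uniq -c` style input (count, then key) already sorted in descending order; the function name is made up:

```shell
# Hypothetical helper: reads "count key" lines and prints
# "key count cumulative-percent" for each, in input order.
cumulative_share() {
    awk '{ count[NR] = $1; key[NR] = $2; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += count[i]
                 printf "%s %d %.1f\n", key[i], count[i], 100 * run / total
             }
         }'
}
```

Usage would be along the lines of `awk '{print $1}' /tmp/lugod-cidrs | sort | uniq -c | sort -nr | cumulative_share`.  Note the percentages are relative to the number of lines in the file being counted, which need not equal the number of mails if some lookups return more than one record.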



We can also get CIDR from the string (it's the second and third columns
in my output file).  Turns out the spread isn't too much more -- 64
CIDRs, of which 54 appear more than once:

    $ awk '{printf( "%s/%s\n", $2, $3)}' /tmp/lugod-cidrs | sort | uniq -c |
        sort -nr | cat -n

         1      142 64.142.0.0/19
         2       77 198.144.192.0/19
         3       74 168.150.0.0/16
         4       61 204.127.128.0/17
         5       48 169.237.0.0/16
         6       41 66.163.160.0/19     [> half of all mail]
         7       41 0/0
         8       34 216.57.64.0/20
         9       34 207.115.32.0/19
        10       33 204.127.200.0/21
        11       30 69.55.224.0/20
        12       28 204.127.192.0/21
        13       21 216.148.224.0/22
        14       21 216.148.224.0/19
        15       17 207.247.0.0/16
        16       16 63.192.0.0/12
        17       15 69.55.238.0/24
        18       15 69.55.237.0/24
        19       15 66.111.0.0/20
        20       15 208.201.224.0/19
        21       14 66.218.64.0/19
        22       11 209.210.251.0/24
        23       11 207.217.0.0/16
        24       10 64.233.170.0/24
        25       10 64.233.160.0/19
        26       10 206.190.32.0/20
        27        9 212.165.128.0/17
        28        9 208.184.190.0/23
        29        9 130.86.0.0/16
        30        8 158.222.0.0/16
        31        7 63.101.96.0/21
        32        7 209.239.32.0/19
        33        7 209.232.0.0/15
        34        7 199.233.217.0/24
        35        6 69.56.128.0/17
        36        6 65.54.224.0/19
        37        6 38.0.0.0/8
        38        6 209.151.64.0/19
        39        4 64.62.128.0/18
        40        4 64.62.128.0/17
        41        4 24.2.32.0/19
        42        4 209.79.220.0/22
        43        4 134.174.0.0/16
        44        3 66.120.0.0/13
        45        3 64.142.64.0/19
        46        3 217.157.0.0/16
        47        3 209.225.0.0/18
        48        3 147.49.0.0/16
        49        2 66.60.128.0/18
        50        2 66.54.152.0/23
        51        2 66.54.128.0/17
        52        2 24.207.0.0/18
        53        2 216.93.192.0/19
        54        2 216.86.192.0/19
        55        1 67.172.160.0/19
        56        1 67.169.224.0/20
        57        1 66.60.130.0/24
        58        1 66.60.129.0/24
        59        1 65.19.128.0/18
        60        1 217.16.96.0/20
        61        1 207.69.200.0/24
        62        1 207.159.64.0/18
        63        1 207.159.120.0/24
        64        1 130.221.0.0/16

The handy things about ASNs and CIDRs are:

  - They aggregate beautifully.  A wide range of IPs clusters into a
    narrow band of CIDRs or ASNs.  So both your spamhaus with a large
    number of IPs trickling out a small number of spams each, and your
    friendly neighborhood ISP with a few hundred white hats scattered
    over a /24 or /18, cluster nicely.

  - The data's a DNS query away.  And the zonefiles are rsyncable.

  - The spam/ham determination can be as local and specific as you want.

  - Organizationally, ASNs and CIDRs both map to what's typically a
    single entity with effective control over its network.  How it uses
    that control, and whether for good or for bad, is its business.  But
    the data are readily and immediately available to you.
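
The aggregation in the first point is plain integer arithmetic once you have a CIDR in hand.  A sketch of the membership test, assuming well-formed IPv4 input and a shell with 64-bit arithmetic; both function names are invented for illustration:

```shell
# Hypothetical helpers; assume valid IPv4 input and 64-bit shell arithmetic.

# Convert a dotted quad to its 32-bit integer value.
ip_to_int() {
    echo "$1" | awk -F. '{ printf "%.0f\n", $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 }'
}

# ip_in_cidr IP NETWORK/LEN : succeeds if IP lies inside the block.
ip_in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")       # part before the slash
    len=${2#*/}                      # prefix length after the slash
    size=$(( 1 << (32 - len) ))      # number of addresses in the block
    # Two addresses share a block when they agree after dropping host bits.
    [ $(( ip / size )) -eq $(( net / size )) ]
}
```

With this, `ip_in_cidr 64.142.3.7 64.142.0.0/19` succeeds and `ip_in_cidr 64.142.64.1 64.142.0.0/19` fails, so counts, scores, or rules can be keyed on the block rather than on each address.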

Where I see the next generation of MTAs headed is keeping track of
sender reputation not on the basis of an individual IP's track record
(the classic DNSBL model), but on the record of blocks of IPs.  If you
think about the implications of IPv6 (effectively limitless address
space), you'll *have* to utilize an aggregating tool to be able to use
reputation-based tools effectively (of course, IPv6 appears to be a ways
off for other reasons as well....).



My own data suggest that the bulk of spam, like the bulk of mail on a
list, originates from a small number of identifiable sources.  One ASN
regularly accounts for between 12% and 18% of my own spam (Kornet's 4766).
The top four ASNs are 25% of my spam, the top 20 or so, 50%.

Which suggests a very cheap mode of cutting into spam volumes markedly
by employing ASNs, CIDRs, or similar IP aggregates (though I'm aware of
none) in generating reputation data, and effecting firewalling,
probabilistic rejection (you reject traffic from an ASN in direct
proportion to the probability it's spam), rate-limiting, etc.
Backing off from a black-and-white allow/deny mode gives legit mail a
fighting chance....
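
Probabilistic rejection is about a one-liner.  A sketch, assuming you already hold a spam percentage (0-100) for the sending ASN from your own stats; the function name and the optional second argument (a fixed roll, handy for testing) are made up here:

```shell
# Hypothetical sketch: should_reject SPAM_PCT [ROLL]
# Succeeds (meaning "reject") with probability SPAM_PCT percent.
# ROLL, if given, is an integer 0-99 used instead of a random draw.
should_reject() {
    pct=$1
    if [ -n "${2:-}" ]; then
        roll=$2
    else
        # srand() seeds from the clock, so draws within one second repeat;
        # good enough for a sketch, not for production.
        roll=$(awk 'BEGIN { srand(); print int(rand() * 100) }')
    fi
    [ "$roll" -lt "$pct" ]
}
```

An ASN that's 15% spam gets turned away about 15 times in 100, one that's 100% spam always, and one with a clean record never.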

Which all sounds well and good.

The question, though, is how much spam are you getting?

There are two large-volume, well-known lists for which I'm aware of spam
stats being available, comp.risks and the debian-user mailing list.
Comp.risks declared in 2001 that it had reached the spam crossover:
even with filtering, over 50% of the mail received
in the moderator's inbox was spam.  As of October 2003, with
SpamAssassin catching > 1000 spams daily, *90%* of the remaining volume
was spam:

    http://catless.ncl.ac.uk/Risks/22.92.html#subj9.1

Debian-user currently rejects > 95% of all mail based on various rules.

Let's say you've got a list that receives 90% spam, and you introduce
point-of-origin filtering at the 50% cutoff (kill any aggregated network
in the first 50%ile spam contributors list).

Congratulations, you've just eliminated half your spam with a single
20-element rule, based on your own experience.

Your list also _still_ receives spam amounting to 45% of its original
volume -- which, with the 10% ham, is about 82% of what's left.

It's a matter of both the amount of spam you can cut, and the total
volume of spam you're receiving.
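
The arithmetic, for anyone who wants to plug in their own numbers.  This is just a worked example; the function name is invented, and it assumes the cut removes spam only and no legit mail:

```shell
# Hypothetical helper: spam_after_cut SPAM_PCT CUT_PCT
# SPAM_PCT: share of incoming mail that is spam (0-100).
# CUT_PCT:  share of that spam removed by origin filtering (0-100).
# Prints: remaining spam as % of original volume, then as % of remaining mail.
spam_after_cut() {
    awk -v spam="$1" -v cut="$2" 'BEGIN {
        left  = spam * (1 - cut / 100)   # spam that still gets through
        ham   = 100 - spam               # legit mail, assumed untouched
        total = left + ham
        share = (total > 0) ? 100 * left / total : 0
        printf "%.1f %.1f\n", left, share
    }'
}
```

`spam_after_cut 90 50` prints `45.0 81.8`: the spam is halved, but it still makes up over four fifths of what lands in the inbox.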

On the other hand, content/context based filtering gets expensive both
CPU and time-wise, particularly if you're making extensive use of DNSBLs
(they're useful data sources, but they're time-intensive).   It takes me
10-20 seconds to determine spam or ham on my own system, on a high-speed
line, via SpamAssassin.  I'm faster doing it manually, but I'm not going
to sit there hour after hour, day in and day out.  So the machine does it.


My own read:  any network in the top-50% range, or whose net mail
contribution is > 50% spam, has no business delivering legitimate
_packets_, let alone mail, and should be firewalled.  I see this as a
network hygiene issue -- a matter of a network's administrators
adequately policing it and ensuring that it doesn't spew crud over other
people's networks.  And if they're not going to make the effort to
prevent this to their own satisfaction and needs, the rest of the Net's
welcome to take whatever measures satisfy _their_ own business needs.


So, making a long post, um, longer:  reputation-based MTAs are a Good
Thing[tm], and disposition of mail at SMTP time is the Right Way To Do
It[tm].   It is not, however, the Total Solution[tm].  You're going to
need content filtering.  It's a nice big step though, and you _can_ use
origin to, say, reserve your expensive filtering steps for the small
number (by volume) of points-of-origin for which you don't have a good
trust basis.
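
As a sketch of that last point: look the origin up first, and only hand the message to the slow content filter when you have no opinion on it.  The function name, the one-ASN-per-line list files, and the three verdict words are assumptions for illustration, not any MTA's actual interface:

```shell
# Hypothetical sketch: triage ASN TRUSTED_FILE BLOCKED_FILE
# Each file holds one ASN per line.  Prints one of:
#   reject  - known bad origin, refuse at SMTP time
#   accept  - known good origin, skip the expensive checks
#   filter  - no trust basis either way, run content filtering
triage() {
    asn=$1 trusted=$2 blocked=$3
    if grep -qxF "$asn" "$blocked"; then
        echo reject
    elif grep -qxF "$asn" "$trusted"; then
        echo accept
    else
        echo filter
    fi
}
```

With 4766 in the blocked file and 7065 in the trusted file, only mail from everywhere else costs you the 10-20 seconds of content filtering.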


Rod, does that answer your question ;-)



Peace.

-- 
Karsten M. Self <kmself at ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    Erin Joyce:  can't get the story right, won't correct it
     http://z.iwethey.org/forums/render/content/show?contentid=96625