[vox-tech] The Great Spam Investigation (with no tabs)

Peter Jay Salzman vox-tech@lists.lugod.org
Sun, 25 Apr 2004 09:37:51 -0700


Pfaw!

I have expandtab set whenever editing a file named "/tmp/mutt*".
However, I wrote this document offline and then !!cat'ed into vim, so
the tabs remained.  Ooops.

My sincere apologies.  This version has no tabs.




Introduction
============

For years, I've used Exim 3, just because that's what got installed when I
first installed Debian, back in the "slink and a half"/Potato days.

A few days ago, I installed Postfix 2.0, mostly because every time Rod Roark
posts an email to vox-tech about Postfix, I get jealous.  Initially, I was
going to install Exim 4, but Exim 4's configuration looked very obfuscated.
I might have stayed with Exim had there been an easy way to convert my Exim 3
configuration to Exim 4 that didn't require reading.  But alas, that looked
complicated too.  So I converted to Postfix.

It has been a while since I really looked at all the spam coming into
dirac.org, so to celebrate installing Postfix, I did a 24 hour test.  This is
the result of that test.

A bit about dirac.org.  It's a domain connected to the internet via DSL.
There are only two users: me and my wife.  Being the founder and past
president of the Linux Users' Group of Davis, and having a strong presence on
the web and USENET, I get a lot of spam.  I think it's downright cute when
she complains about the 2 or 3 pieces she gets a day...  :)

Mail at dirac.org comes from two paths:

   1. Directly to dirac.org
   2. From my school account at lifshitz.ucdavis.edu

Mail at lifshitz passes through SpamAssassin.  If it gets marked as spam, it
gets deleted out of hand.  If not, it's forwarded to dirac.org.  Therefore, in
the following statistics, spam caught on lifshitz is not included.  My real
email to spam ratio is actually lower than advertised.  Keep that in mind.

Mail at dirac.org passes through my new Postfix spam controls.  Any email
that remains gets delivered to procmail, which first sends it to Bogofilter, a
spam filter based on Bayesian statistics.  Then it goes through a few
procmail recipes that I wrote.  Then it gets delivered to my inbox.

In what follows, keep in mind that I list the tests in order.  For
example, the RBL bl.spamcop.net gets "first crack" at incoming mail, so
it's bound to catch more spam than any other RBLs.  No doubt if
cbl.abuseat.org got "first crack", that RBL would have the highest spam
catching rate.  Keep that in mind when looking at the numbers.  Procmail
and Bogofilter are both powerful; they just get the "leftovers" after
Postfix does its stuff.
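
The "first crack" effect can be illustrated with a small sketch.  This is not
the Postfix configuration used for the test; it only shows how a DNSBL lookup
name is formed (reversed IPv4 octets plus the list's zone) and how crediting
each rejection to the first list that matches skews the per-list counts.  The
`listings` data and both function names are made up for illustration.

```python
import ipaddress
from collections import Counter


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name a DNSBL client looks up: reversed octets + zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def first_match_counts(listings, zone_order):
    """Credit each client IP to the first zone (in order) that lists it."""
    counts = Counter({zone: 0 for zone in zone_order})
    for zones_listing_ip in listings.values():
        for zone in zone_order:
            if zone in zones_listing_ip:
                counts[zone] += 1
                break  # later lists never get to see this client
    return counts


# Hypothetical data: which lists would flag each (documentation-range) IP.
listings = {
    "192.0.2.1": {"bl.spamcop.net", "cbl.abuseat.org"},
    "192.0.2.2": {"bl.spamcop.net", "cbl.abuseat.org"},
    "192.0.2.3": {"cbl.abuseat.org"},
    "192.0.2.4": {"bl.spamcop.net"},
}

spamcop_first = first_match_counts(listings, ["bl.spamcop.net", "cbl.abuseat.org"])
cbl_first = first_match_counts(listings, ["cbl.abuseat.org", "bl.spamcop.net"])

print(dnsbl_query_name("192.0.2.99", "bl.spamcop.net"))
print(dict(spamcop_first))
print(dict(cbl_first))
```

With the same four hypothetical clients, whichever list is consulted first
gets credit for the overlap, which is why the per-RBL numbers below say more
about ordering than about the quality of each list.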

Lastly, note that only 4 spams to dirac.org actually made it into my inbox
within a 24 hour period.  I list their "spamicity", determined by Bogofilter.
You might wonder how "viagra" or, more accurately, "v1agra" makes its way
past Bogofilter.  I've been receiving spam that contains poetry from such
giants as John Keats and even the lyrics to "Stairway to Heaven".  I am
loath to pass those spams on to Bogofilter.

Enough yapping.  I have a lot of important work to get finished.  On to
the interesting stuff...





Raw Data
========

I) SMTP Conversation Dropped Before Spam Gets Delivered

   A) HELO rejected

      1. Sender claimed he was "dirac.org" or "localhost":        51
      2. RBL: bl.spamcop.net:                                    179
      3. RBL: list.dsbl.org:                                      20
      4. RBL: relays.ordb.org:                                     0
      5. RBL: cbl.abuseat.org:                                     7
      6. RBL: sbl.spamhaus.org:                                    0
      7. RBL: opm.blitzed.org:                                     0
      8. RBL: dul.dnsbl.sorbs.net:                                 3
   
   B) MAIL FROM rejected

      1. Sender did not use fully qualified hostname:             65
      2. Sender did not use fully qualified address:               1
      3. Sender domain does not exist:                             7

   C) RCPT TO rejected

      1. Sender attempted to have spam relayed:                    1
      2. Attempt to deliver to unknown dirac.org account:          4


II) SMTP Conversation Completed, But MTA Discards Spam Before
    Delivery to MUA.

   A) Body rule /^TVqQAAMAAAAEAAAA\/\/8AALg/, which must be
      contained in every win32 program.  Nobody should be
      sending me win32 executables, so this must be a virus:       9


III) Spam Delivered to MUA But Not Delivered To My Inbox

   A) Spam caught by Bogofilter:                                   7
   B) Spam caught by procmail rule
      * charset=.*(koi8|windows-125[01345678]|big-?5)              1


IV) Non-UCE Delivered To My Inbox

   A) Real Email (slow email day!):                               19
   B) Bounces because of a virus forging its "From:" header
      to say it came from p@dirac.org:                             5

V) UCE Delivered To My Inbox

   A) Spam delivered directly to dirac.org                         4

      1. spamicity: 0.519249, unknown language
      2. spamicity: 0.501567, unknown language
      3. spamicity: 0.919377, viagra
      4. spamicity: 0.500561, VCD's, unknown language
   
   B) UCE delivered from psalzman@lifshitz.ucdavis.edu             3
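
The two content rules in sections II and III can be tried out with a short
sketch.  It is a Python approximation, not the actual Postfix body check or
procmail recipe: the function names are invented here, and the IGNORECASE
flag is an assumption meant to mirror procmail's default case-insensitive
matching.  Decoding the start of the body pattern shows why it works: it is
the base64 form of the "MZ" bytes that begin a win32 executable.

```python
import base64
import re

# Body rule from section II: base64 text that starts an encoded win32 program.
EXE_RULE = re.compile(r"^TVqQAAMAAAAEAAAA//8AALg")

# Charset rule from section III.  IGNORECASE is an assumption that mimics
# procmail's default case-insensitive matching.
CHARSET_RULE = re.compile(
    r"charset=.*(koi8|windows-125[01345678]|big-?5)", re.IGNORECASE
)


def looks_like_win32_attachment(body_line: str) -> bool:
    """True if a body line begins with the base64-encoded executable header."""
    return EXE_RULE.match(body_line) is not None


def has_unwanted_charset(header_line: str) -> bool:
    """True if a header line declares one of the charsets the rule targets."""
    return CHARSET_RULE.search(header_line) is not None


# The first 16 base64 characters decode to 12 bytes starting with b"MZ".
decoded_prefix = base64.b64decode("TVqQAAMAAAAEAAAA")

print(decoded_prefix[:2])
print(looks_like_win32_attachment("TVqQAAMAAAAEAAAA//8AALgAAAAAAAAAQAAAAAAAAAA"))
print(has_unwanted_charset('Content-Type: text/plain; charset="koi8-r"'))
print(has_unwanted_charset("Content-Type: text/plain; charset=windows-1252"))
```

Note that the character class skips windows-1252, the common Western European
charset, so ordinary mail in that encoding is not caught by the rule.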




Results
=======

Spams will include bounce messages due to viruses forging their headers to
make it look like they're from dirac.org, as well as the uhhh.... "helpful"
messages I get from hosts that tell me that "my" email was not delivered
because it contained a virus.  I consider the idiotic administrators of these
systems to be another source of unwanted email, and therefore, not much
different from UCE.  Honestly, this is a DoS waiting to happen.  Sheesh.


Total emails sent to dirac.org:               386

   Total spams sent to dirac.org:             367

      Total spams caught                      355

         Total spam caught by Postfix:        347
            Total spam caught by RBL:         209
         Total spam caught by Bogofilter:       7
         Total spam caught by procmail:         1

      Total spams uncaught                     12

   Total "real" email delivered:               19




Email that is spam:                     95%
Email that is not spam:                  5% 

Spam caught before delivery to MUA:     95%
Spam caught before delivery to inbox:   97%
Spam delivered to my inbox:              3%    <-- what I care about

Spam caught by RBLs:                    57%    <-- nice!
Spam claiming it came from "me":        15%
Spam with improper SMTP envelope:       18%
Spam giving non-existent domain
   in SMTP envelope:                     2%    <-- dumbest of the dumb
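
The totals and percentages above can be rechecked from the raw data with a
few lines of arithmetic.  The sketch below uses only the counts listed under
"Raw Data"; the variable names are invented here, and two groupings are
assumptions inferred from the numbers rather than stated in the text: spam
"from me" is taken as the forged HELOs plus the virus bounces, and "improper
SMTP envelope" as the two not-fully-qualified cases.

```python
# Counts copied from the Raw Data section.
rbl = {
    "bl.spamcop.net": 179, "list.dsbl.org": 20, "relays.ordb.org": 0,
    "cbl.abuseat.org": 7, "sbl.spamhaus.org": 0, "opm.blitzed.org": 0,
    "dul.dnsbl.sorbs.net": 3,
}
forged_helo = 51               # claimed to be "dirac.org" or "localhost"
non_fqdn_hostname = 65
non_fqdn_address = 1
nonexistent_domain = 7
rcpt_to_rejected = 1 + 4       # relay attempt + unknown account
body_rule = 9                  # win32 executables
bogofilter = 7
procmail = 1
real_email = 19
virus_bounces = 5
uncaught_direct = 4
uncaught_via_lifshitz = 3

caught_by_rbl = sum(rbl.values())
caught_by_postfix = (forged_helo + caught_by_rbl + non_fqdn_hostname
                     + non_fqdn_address + nonexistent_domain
                     + rcpt_to_rejected + body_rule)
caught = caught_by_postfix + bogofilter + procmail
uncaught = virus_bounces + uncaught_direct + uncaught_via_lifshitz
spam = caught + uncaught
total = spam + real_email


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print("total:", total, "spam:", spam, "caught:", caught, "uncaught:", uncaught)
print("spam share of all mail:", pct(spam, total))
print("caught by Postfix:", pct(caught_by_postfix, spam))
print("caught before inbox:", pct(caught, spam))
print("reached inbox:", pct(uncaught, spam))
print("caught by RBLs:", pct(caught_by_rbl, spam))
# Assumed groupings (see the note above):
print("claiming to be me:", pct(forged_helo + virus_bounces, spam))
print("improper envelope:", pct(non_fqdn_hostname + non_fqdn_address, spam))
print("nonexistent domain:", pct(nonexistent_domain, spam))
```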



Conclusions
===========

First, I knew that I had a high spam to email ratio, but I was shocked
to see that my spam to ham ratio was nearly 20 to 1.

Second, I'm quite pleased with the results.  Postfix along with RBLs
shot down most of the crud.  Only a very small trickle passed through.
I'm convinced more than ever that Postfix + RBL is the way to go for
spam control.  This is preferable to relying on SpamAssassin,
Bogofilter and procmail as a first line of defense, since they soak up
more system resources.

As a last note, I'm nearly certain that if I had SpamAssassin installed on
dirac.org, my total spam delivered count would've been truly, truly zero.



Thanks
======

First, thanks to the authors of all the open source software that enables me
to protect my inbox and valuable time.  You guys rock.  No, seriously, you
guys are really awesome.  Thank you.

I'd like to thank Rod Roark for getting me to use Postfix in the first
place.  Henry House introduced me to Bogofilter.  Mike Egan and Henry
House introduced me to Procmail, oh so long ago.