[vox-tech] spam to defeat bayesian filtering?

ME vox-tech@lists.lugod.org
Thu, 18 Dec 2003 09:30:44 -0800 (PST)


I can't tell from your message, but many of these include content in both
HTML and plain text, and the two parts do not match (dual attachments). Mail
clients that can display either often default to HTML, so those users see
the spam, while users of text-only clients see content like this.
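
A minimal sketch of how such a mismatch could be measured, using only the
Python standard library. It is not something SpamAssassin or bogofilter does;
the function name alternative_overlap and the word-overlap measure are
assumptions made for illustration.

```python
import re
from email import message_from_string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document (script/style included)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words(text):
    """Return the set of lower-cased alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def alternative_overlap(raw_message):
    """Jaccard overlap (0.0 to 1.0) between the words of the first
    text/plain part and the first text/html part of a message.
    Returns None when either part is missing."""
    msg = message_from_string(raw_message)
    plain, html = None, None
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype == "text/plain" and plain is None:
            plain = part.get_payload(decode=True).decode("utf-8", "replace")
        elif ctype == "text/html" and html is None:
            html = part.get_payload(decode=True).decode("utf-8", "replace")
    if plain is None or html is None:
        return None
    extractor = _TextExtractor()
    extractor.feed(html)
    a, b = words(plain), words(" ".join(extractor.chunks))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

An overlap near 0.0 means the two alternatives share almost no words, which
is the pattern described above. Any cutoff for "too different" would have to
be tuned against real mail.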

We have been working to raise the bar on this. First, postmaster is
collecting domains that are known to spam (from address) as well as host
spam content on the web.

So, we have our blacklist: 6327 domains, each blacklisted with two entries
(blacklist_from *@domain and blacklist_from *@*.domain) for 12654
entries based on domain, plus 652 blacklist entries based on an IP address
in a URL or a relay host before the last trusted host, for a total of 13320
entries.
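
The double-entry scheme can be sketched as follows. This is only an
illustration of the two blacklist_from lines per domain; the function name
blacklist_entries and the example domains are hypothetical, and it is not
the script that actually builds blacklist.txt.

```python
def blacklist_entries(domains):
    """Return two SpamAssassin blacklist_from lines per domain:
    one for the bare domain and one for any subdomain of it."""
    lines = []
    for domain in domains:
        lines.append(f"blacklist_from *@{domain}")
        lines.append(f"blacklist_from *@*.{domain}")
    return lines


if __name__ == "__main__":
    # Placeholder domains, not taken from the real blacklist.
    for line in blacklist_entries(["example.com", "example.net"]):
        print(line)
```

With 6327 domains this yields the 12654 domain-based entries mentioned above.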

We publish these to inform the public of domains which provide content we
do not wish to receive:
from: http://www.passwall.com/download.html : (these are for spamassassin)
http://www.passwall.com/blacklist.txt
http://www.passwall.com/uri.txt

In addition to this, we know some domains forge their from address, but
are perfectly fine with including a URL to their site. As a result, I take
the list of domains (and IP addresses) in the blacklist and parse through
them to generate rules for spamassassin that search every URI for any
domain or IP that is in the blacklist, and then increase the spam score
for these. (The uri file above is auto-generated from the blacklist file
domains. Both are re-generated from a flat file every 24 hours.)
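
A rough sketch of that generation step, assuming the blacklist has already
been reduced to a list of host names and IP addresses. The rule-name prefix
LOCAL_URI_, the regex shape, and the default score are assumptions for
illustration; the real uri.txt may be laid out differently.

```python
def rule_name(host, prefix="LOCAL_URI_"):
    """Build a rule name from a host: non-alphanumerics become '_'."""
    cleaned = "".join(c if c.isalnum() else "_" for c in host)
    return prefix + cleaned.upper()


def uri_pattern(host):
    """Regex body matching http(s) URIs on the host or any subdomain."""
    escaped = host.replace(".", r"\.")
    return r"^https?:\/\/(?:[^\/]+\.)?" + escaped + r"(?:[\/:?]|$)"


def uri_rules(hosts, score=1.0):
    """Return uri/describe/score lines for each blacklisted host."""
    lines = []
    for host in hosts:
        name = rule_name(host)
        lines.append(f"uri {name} /{uri_pattern(host)}/i")
        lines.append(f"describe {name} URI points at blacklisted host {host}")
        lines.append(f"score {name} {score:.1f}")
    return lines


if __name__ == "__main__":
    # Placeholder hosts, not entries from the real blacklist.
    print("\n".join(uri_rules(["example.com", "192.0.2.1"])))
```

Anchoring the pattern at the host part keeps a rule for example.com from
firing on look-alikes such as notexample.com or example.com.evil.org.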

I have two more steps to perform next (not done yet, maybe over winter
break):
1) Create a list of phone numbers and a good regex pattern for custom
rules to augment scores of messages that contain these phone numbers in
the bodies.
2) Create some custom rules that consider common names in headers (like a
Received line including an rDNS name for a host with "dial" or "dsl") to
increase the spam score by maybe 0.3 or 0.4.
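
Since neither step exists yet, the following is only a sketch of what the
two checks might look like. The phone pattern assumes North American
ten-digit numbers, and the names PHONE_RE, RDNS_RE, phone_hits and
rdns_penalty, along with the sample number, are all hypothetical.

```python
import re

# Matches forms such as (530) 555-0100, 530-555-0100 and 530.555.0100.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b")

# A Received "from" host name containing "dial" or "dsl".
RDNS_RE = re.compile(r"\bfrom\s+\S*(?:dial|dsl)\S*", re.IGNORECASE)


def phone_hits(body, known_numbers):
    """Return the known numbers (digits only) that appear in body."""
    found = {"".join(m.groups()) for m in PHONE_RE.finditer(body)}
    return found & set(known_numbers)


def rdns_penalty(received_headers, penalty=0.3):
    """Return penalty if any Received header names a dial/dsl host."""
    if any(RDNS_RE.search(h) for h in received_headers):
        return penalty
    return 0.0
```

Normalizing every match to bare digits lets one list entry cover the
different ways the same number gets written in a message body.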

Use of blacklist and whitelist seems to help the bayesian filter in
spamassassin. By whitelisting everyone in my addressbook (I added a
feature to do this with the SpamAssassin SQL plugin for SquirrelMail) and
using the global blacklist of domains to score domains I know spam, the
Bayes filter seems to do a better job due to the excessive weight placed
on the "known good e-mail" and "known bad e-mail".

I stopped using spamcop to report e-mail... it takes too much time for the
benefit I receive when compared to the other techniques I use above. Over
time, there has been polarization in the ISP space. There are ISPs who know
about spam and do a fair job of enforcing the rules to dismiss customers
who send spam through them, and there are ISPs which allow any and all
content so long as the customer is willing to pay for it. As a result,
having spamcop antagonize these groups provides little benefit most of the
time. (Imagine an inverted bell curve, with most spammers at ISPs who
police their own content or at ISPs who just do not care.)

-ME

Peter Jay Salzman said:
> hi all,
>
> between bl.spamcop, ORDB and bogofilter, the only spam i'm getting these
> days are pieces like this.  it appears to be an attempt to pollute
> bayesian spam filters like bogofilter and spam assassin.
>
> i don't do automatic training anymore (convention wisdom says to train
> manually so your database doesn't drift due to false positives and false
> negatives).  i have NOT been training bogofilter on spams like this.
> i've mostly been forwarding them to spamcop and for particularly
> egregious spamhauses, dropping their IP blocks into hosts.deny (i've
> wrapped exim with tcpd).
>
> but honestly, i haven't given much thought to this.
>
> has anybody thought about these types of spams in relation to bayesian
> filters?  or perhaps read an article written by someone who's given the
> matter some thought?
>
> should we train on these types of emails or not?
>
> and if not, are there ways to combat this type of spam besides spam
> assassin's lexical parsing?
>
> one thing i've noticed about these types of spam.  they don't have
> sentences.  no punctuation, no capitalization (oops!), and no sense of
> grammar.  i'm wondering if the next tool to combat spam will look
> something like the Z-interpreter used by the old-style infocom text
> adventures.  ;-)
>
> thanks,
> pete
>
>
>
> ----- Forwarded message from Winnie  <qaxdyrr@tom.com> -----
>
> Return-path: qaxdyrr@tom.com
> Envelope-to: p@dirac.org
> Delivery-date: Thu, 18 Dec 2003 08:30:50 -0800
> Received: from 213.213.239.124.brutele.be ([213.213.239.124]
> ident=xitjpqrvop)
> 	by gabriel.localdomain with smtp (Exim 3.36 #1 (Debian))
> 	id 1AX131-0007di-00
> 	for <p@dirac.org>; Thu, 18 Dec 2003 08:30:50 -0800
> Received: from [213.213.239.124] by 101.30.124.94 with HTTP;
> 	Wed, 17 Dec 2003 22:27:50 -0600
> From: Winnie  <qaxdyrr@tom.com>
> To: p@dirac.org
> Subject: Re: HPRDD, the master threw
> Mime-Version: 1.0
> X-Mailer: mPOP Web-Mail 2.19
> X-Originating-IP: [80.28.112.19]
> Date: Thu, 18 Dec 2003 09:31:50 +0500
> Reply-To: Winnie  <qaxdyrr@tom.com>
> Content-Type: multipart/alternative;
> 	boundary="--ALT--PIML01481750780296"
> Message-Id: <KRONGLR-0009246905811@response>
> X-Bogosity: No, tests=bogofilter, spamicity=0.609282, version=0.15.8
>    int  cnt   prob  spamicity histogram
>   0.00    8 0.031480 0.009831 ########
>   0.10    1 0.176640 0.016506 #
>   0.20    1 0.253007 0.026157 #
>   0.30    6 0.373008 0.102732 ######
>   0.40    0 0.000000 0.102732
>   0.50    0 0.000000 0.102732
>   0.60    2 0.612572 0.145412 ##
>   0.70    4 0.737999 0.239923 ####
>   0.80    3 0.852553 0.311769 ###
>   0.90   21 0.987941 0.583047 #####################
>
>    Free Cable# TV
>
>    [IMG] delphi apex turtleback tartary carmichael isotopic bonito house
> atop
>    deducible radiography
>    cultivable barricade epiphysis set hettie consent patrician baritone
> meyer
>    nashua browse indecipherable steelmake caputo learn trimer wu
> tambourine
>    beef borate marina benthic allah bass descent enid avery airpark tithe
>    union million federal genteel cabaret
>
> ----- End forwarded message -----
>
> --
> Make everything as simple as possible, but no simpler.  -- Albert Einstein
> GPG Instructions: http://www.dirac.org/linux/gpg
> GPG Fingerprint: B9F1 6CF3 47C4 7CD8 D33E 70A9 A3B9 1945 67EA 951D
> _______________________________________________
> vox-tech mailing list
> vox-tech@lists.lugod.org
> http://lists.lugod.org/mailman/listinfo/vox-tech
>
>