[vox] dealing with old unscanned mail mbox and spamassassin...

ME vox@lists.lugod.org
Fri, 24 Oct 2003 18:49:39 -0700 (PDT)


keywords: spamassassin, mbox, archive, rescan

I don't know about you, but it bothers me that so much of my old archived
mail (in mbox format) uses up space on my HD. Keeping well over 10 years
of e-mail means a lot of that space is wasted on spam. I recently decided
to play around with processing my old mbox files through spamassassin to
weed out the spam.

The solution is not the "obvious" cat-ing of your old mbox through
spamassassin. Spamassassin appears to expect messages to come through one
at a time, so catting the whole mbox does not seem to have the desired
result. Also, for large mboxes, you can expect a lot of memory to be used
during the process. (I was using well over 800MB of RAM on some of my
smaller mbox files while testing this method.)

Here are some methods that work for me:
(reformail is part of many courier-imap packages)
(grepmail is part of another package)
(mboxgrep is a package by itself)

$ reformail -s spamc < INPUTFILE > OUTPUTFILE

I suppose that if you do not use spamd/spamc but instead call spamassassin
once per message, you could do this:

$ reformail -s spamassassin < INPUTFILE > OUTPUTFILE

What you end up with is another mbox in which every message has been
passed through spamassassin for scoring.
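
Before trusting the new file, it may be worth checking that no messages
went missing and seeing how many were flagged. Here is a rough sketch of
such a check. The function names mbox_count and spam_count are made up
for this example, and it assumes a plain mbox where every message starts
with a "From " line and body lines starting with "From " are escaped as
">From ":

```shell
#!/bin/sh
# Count messages by counting the "From " separator lines.
# Assumption: body lines starting with "From " are escaped (">From ").
mbox_count() {
    grep -c '^From ' "$1"
}

# Count messages whose header block contains "X-Spam-Status: Yes".
# Only header lines are checked, so quoted spam headers inside a
# message body are not counted.
spam_count() {
    awk '
        /^From /                           { in_header = 1; next }
        in_header && /^$/                  { in_header = 0; next }
        in_header && /^X-Spam-Status: Yes/ { n++ }
        END                                { print n + 0 }
    ' "$1"
}
```

If "mbox_count INPUTFILE" and "mbox_count OUTPUTFILE" print the same
number, nothing was dropped along the way, and "spam_count OUTPUTFILE"
gives a rough idea of how much there is to reclaim.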

You can then use your favorite mailer to filter based on header
information, or use procmail, or check out "mboxgrep" and use the -H flag
to search through the mbox and pull out only the messages that are not
marked with "^X-Spam-Status: Yes", dumping those into a new, clean mbox
file.
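
If none of those tools are handy, the same filtering can be roughed out
with plain awk. This is only a sketch and not what mboxgrep does
internally. The function name clean_mbox is made up, and it assumes a
plain mbox with escaped ">From " body lines. It holds one message at a
time in memory and prints it only if the headers were not marked as spam:

```shell
#!/bin/sh
# clean_mbox SCANNED_MBOX > CLEAN_MBOX
# Print only the messages whose headers lack "X-Spam-Status: Yes".
clean_mbox() {
    awk '
        # Print the buffered message unless it was flagged, then reset.
        function emit() {
            if (have_msg && !spam) printf "%s", msg
            msg = ""; spam = 0
        }
        /^From /                           { emit(); have_msg = 1; in_header = 1 }
        in_header && /^$/                  { in_header = 0 }
        in_header && /^X-Spam-Status: Yes/ { spam = 1 }
                                           { msg = msg $0 "\n" }
        END                                { emit() }
    ' "$1"
}
```

Something like "clean_mbox OUTPUTFILE > CLEANFILE" would then leave the
unflagged messages in CLEANFILE. Keep the original around until you have
looked over the result.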

Why do I keep these old messages in mbox format?
Since I seldom even look at them, I can compress the mbox file, and it
compresses much better than the many separate files of a maildir.

Anyway, I found the above useful in weeding spam from my old mboxes and
will probably be able to reclaim 800MB-1000MB of space by doing this. :-D

So, why else do this? If you are on an older, slower machine, you could
copy your mail out to a faster and better machine, process it, and then
transfer it back. (Woohoo!)

Maybe your gateway and/or mail server is weak, but your desktop machine is
more powerful. (ding, ding ding!)

I'm sure you can come up with other reasons.

HTH someone out there,
-ME