NAME
qsf - quick spam filter
SYNOPSIS
Filtering: qsf [-snrAtav] [-d DB] [-g
DB] [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM] [-X NUM]
Training: qsf -T SPAM NONSPAM [MAXROUNDS]
[-d DB]
Retraining: qsf -[m|M] [-d DB] [-w
WEIGHT] [-ayN]
Database: qsf -[p|D|R|O] [-d DB]
Database merge: qsf -E OTHERDB [-d DB]
Allowlist query: qsf -e EMAIL [-m|-M|-t]
[-d DB] [-g DB]
Denylist query: qsf -y -e EMAIL [-m -m|-M
-M|-t] [-d DB] [-g DB]
Help: qsf -[h|V]
DESCRIPTION
qsf reads a single email on standard input, and by
default outputs it on standard output. If the email is determined
to be spam, an additional header ("X-Spam: YES") will be added, and
optionally the subject line can have "[SPAM]" prepended to it.
qsf is intended to be used in a procmail(1)
recipe, in a ruleset such as this:
-
:0 wf
| qsf -ra
:0 H:
* X-Spam: YES
$HOME/mail/spam
For more examples, including sample procmail(1) recipes,
see the EXAMPLES section below.
TRAINING
Before qsf can be used properly, it needs to be trained.
A good way to train qsf is to collect a copy of all your
email into two folders - one for spam, and one for non-spam. Once
you have done this, you can use the training function, like
this:
-
qsf -aT spam-folder non-spam-folder
This will generate a database that can be used by qsf to
guess whether email received in the future is spam or not. Note
that this initial training run may take a long time, but you should
only need to do it once.
To mark a single message as spam, pipe it to
qsf with the --mark-spam or -m ("mark as
spam") option. This will update the database accordingly and
discard the email.
To mark a single message as non-spam, pipe it to
qsf with the --mark-nonspam or -M ("mark as
non-spam") option. Again, this will discard the email.
If a message has been mis-tagged, simply send it to qsf
as the opposite type, i.e. if it has been mistakenly tagged as
spam, pipe it into qsf --mark-nonspam --weight=2 to add it
to the non-spam side of the database with double the usual
weighting.
OPTIONS
The qsf options are listed below.
- -d, --database [TYPE:]FILE
- Use FILE as the spam/non-spam database. The default is
to use /var/lib/qsfdb and, if that is not available or is
read-only, $HOME/.qsfdb. This option can also be useful if
there is a system-wide database but you do not want to use it -
specifying your own here will override the default.
If you prefix the filename with a TYPE, of the form
btree:$HOME/.qsfdb, then this will specify what kind of
database FILE is, such as list, btree,
gdbm, sqlite and so on. Check the output of qsf
-V to see which database backends are available. The default is
to auto-detect the type, or, if the file does not already exist,
use list. Note that TYPE is not case-sensitive.
- -g, --global [TYPE:]FILE
- Use FILE as the default global database, instead of
/var/lib/qsfdb. If you also specify a database with
-d, then this "global" database will be used in read-only
mode in conjunction with the read-write database specified with
-d. The -g option can be used a second time to
specify a third database, which will also be used in read-only
mode. Again, the filename can optionally be prefixed with a
TYPE which specifies the database type.
- -P, --plain-map FILE
- Maintain a mapping of all database tokens to their non-hashed
counterparts in FILE, one token per line. This can be useful
if you want to be able to list the contents of your database at a
later date, for instance to get a list of email addresses in your
allow-list. Note that using this option may slow qsf down,
and only entries written to the database while this option is
active will be stored in FILE.
- -s, --subject
- Rewrite the Subject line of any email that turns out to be
spam, adding "[SPAM]" to the start of the line.
- -S, --subject-marker SUBJECT
- Instead of adding "[SPAM]", add SUBJECT to the Subject
line of any email that turns out to be spam. Implies
-s.
- -H, --header-marker MARK
- Instead of setting the X-Spam header to "YES", set it to
MARK if email turns out to be spam. This can be useful if
your email client can only search all headers for a string, rather
than one particular header (so searching for "YES" might match more
than just the output of qsf).
- -n, --no-header
- Do not add an X-Spam header to messages.
- -r, --add-rating
- Insert an additional header X-Spam-Rating which is a rating of
the "spamminess" of a message from 0 to 100; 90 and above are
counted as spam, anything under 90 is not considered spam. If
combined with -t, then the rating (0-100) will be output, on
its own, on standard output.
- -A, --asterisk
- Insert an additional header X-Spam-Level which will contain
between 0 and 20 asterisks (*), depending on the spam rating.
- -t, --test
- Instead of passing the message out on standard output, output
nothing, and exit 0 if the message is not spam, or exit 1 if the
message is spam. If combined with -r, then the spam rating
will be output on standard output.
- -a, --allowlist
- Enable the allow-list. This causes the email addresses given in
the message's "From:" and "Return-Path:" headers to be checked
against a list; if either one matches, then the message is always
treated as non-spam, regardless of what the token database says.
When specified with a retraining flag, -a -m (mark as spam)
will remove that address from the allow-list as well as marking the
message as spam, and -a -M (mark as non-spam) will add that
address to the allow-list as well as marking the message as
non-spam. The idea is that you add all of your friends to the
allow-list, and then none of their messages ever get marked as
spam.
- -y, --denylist
- Enable the deny-list. This causes the email addresses given in
the message's "From:" and "Return-Path:" headers to be checked
against a second list; if either one matches, then the message is
always treated as spam. Training works in the same way as with
-a, except that you must specify -m or -M
twice to modify the deny-list instead of the allow-list, and with
the reverse syntax: -y -m -m (mark as spam) will add that
address to the deny-list, whereas -y -M -M (mark as
non-spam) will remove that address from the deny-list. This double
specification is so that the usual retraining process never touches
the deny-list; the deny-list should be carefully maintained rather
than automatically generated.
Normally you would not need to use the deny-list.
- -L, --level, --threshold LEVEL
- Change the spam scoring threshold level which must be reached
before an email is classified as spam. The default is 90.
- -Q, --min-tokens NUM
- Only give a score if more than NUM tokens are found in
the message - otherwise the message is assumed to be non-spam, and
it is not modified in any way. The default is 0. This option might
be useful if you find that very short messages are being frequently
miscategorised.
- -e, --email, --email-only EMAIL
- Query or update the allow-list entry for the email address
EMAIL. With no other options, this will simply output "YES"
if EMAIL is in the allow-list, or "NO" if it is not. With
-t, it will not output anything, but will exit 0 (success)
if EMAIL is in the allow-list, or 1 (failure) if it is not.
With the -m (mark-spam) option, any previous allow-list
entry for EMAIL will be removed. Finally, with the -M
(mark-nonspam) option, EMAIL will be added to the allow-list
if it is not already on it.
If EMAIL is just the word MSG on its own, then an
email will be read from standard input, and the email addresses
given in the "From:" and "Return-Path:" headers will be used.
Using -e automatically switches on -a.
If you also specify -y, then the deny-list will be
operated on. Remember that -m and -M are reversed
with the deny-list.
If you specify an email address of the form @domain
(nothing before the @), then the whole domain will be
allow-listed or deny-listed.
- -v, --verbose
- Add extra X-QSF-Info headers to any filtered email,
containing error messages and so on if applicable. Specify
-v more than once to increase verbosity.
- -T, --train SPAM NONSPAM [MAXROUNDS]
- Train the database using the two mbox folders SPAM and
NONSPAM, by testing each message in each folder and updating
the database each time a message is miscategorised. This is done
several times, and may take a while to run. Specify the -a
(allow-list) flag to add every sender in the NONSPAM folder
to your allow-list as a side-effect of the training process. If
MAXROUNDS is specified, training will end after this number
of rounds if the results are still not good enough. The default is
a maximum of 200 rounds.
- -m, --mark-spam
- Instead of passing the message out on standard output, mark its
contents as spam and update the database accordingly. If the
allow-list (-a) is enabled, the message's "From:" and
"Return-Path:" addresses are removed from the allow-list. If the
deny-list (-y) is enabled and you specify -m twice,
the message's addresses are added to the deny-list instead.
- -M, --mark-nonspam
- Instead of passing the message out on standard output, mark its
contents as non-spam and update the database accordingly. If the
allow-list (-a) is enabled, the message's "From:" and
"Return-Path:" addresses are added to the allow-list (see the
-a option above). If the deny-list (-y) is enabled
and you specify -M twice, the message's addresses are
removed from the deny-list instead.
- -w, --weight WEIGHT
- When marking as spam or non-spam, update the database with a
weighting of WEIGHT per token instead of the default of 1.
Useful when correcting mistakes, eg a message that has been
mistakenly detected as spam should be marked as non-spam using a
weighting of 2, i.e. double the usual weighting, to counteract the
error.
- -D, --dump [FILE]
- Dump the contents of the database as a platform-independent
text file, suitable for archival, transfer to another machine, and
so on. The data is output on stdout or into the given
FILE.
- -R, --restore [FILE]
- Rebuild the database from scratch from the text file on stdin.
If a FILE is given, data is read from there instead of from
stdin.
- -O, --tokens
- Instead of filtering, output a list of the tokens found in the
message read from standard input, along with the number of times
each token was found. This is only useful if you want to use
qsf as a general tokeniser for use with another filtering
package.
- -E, --merge OTHERDB
- Merge the OTHERDB database into the current database.
This can be useful if you want to take one user's mailbox and merge
it into the system-wide one, for instance (this would be done by,
as root, doing qsf -d /var/lib/qsfdb -E /home/user/.qsfdb
and then removing /home/user/.qsfdb).
- -B, --benchmark SPAM NONSPAM [MAXROUNDS]
- Benchmark the training process using the two mbox folders
SPAM and NONSPAM. A temporary database is created and
trained using the first 75% of the messages in each folder, and
then the entire contents of each folder is tested to see how many
false positives and false negatives occur. Some timing information
is also displayed.
This can be used to decide which backend is best on your system.
Use -d to select a backend, eg qsf -B spam nonspam -d
GDBM - this will create a temporary database which is removed
afterwards.
The exception to this is the MySQL backend, where a full
database specification must be given (-d
MySQL:database=db;host=localhost;...) and the database table
given will not be wiped beforehand or dropped afterwards.
As with -T, if MAXROUNDS is specified, training
will never be done for more than this number of rounds; the default
is 200.
- -h, --help
- Print a usage message on standard output and exit
successfully.
- -V, --version
- Print version information, including a list of available
database backends, on standard output and exit successfully.
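The exit status and rating output described for -t and -r above can be consumed from a script. The following is a minimal Python sketch, not part of qsf: the helper name check_message and its injectable runner parameter are hypothetical, and it assumes qsf is on the PATH and prints the rating as a bare integer.

```python
import subprocess

def check_message(raw_message, runner=subprocess.run):
    """Return (is_spam, rating) for one email given as bytes.

    Runs `qsf -t -r`: per the option descriptions, -t exits 0 for
    non-spam and 1 for spam, and -r combined with -t prints the
    0-100 rating on standard output.
    """
    result = runner(["qsf", "-t", "-r"], input=raw_message,
                    stdout=subprocess.PIPE)
    text = result.stdout.decode("ascii", "replace").strip()
    # Assumption: the rating is a bare integer; anything else is
    # reported as None rather than guessed at.
    rating = int(text) if text.isdigit() else None
    # Only exit status 1 is treated as spam; other statuses are not
    # documented above, so they are treated as "not spam" here.
    return result.returncode == 1, rating
```

Passing a different runner makes it possible to exercise the helper without qsf installed.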
DEPRECATED OPTIONS
The following options are only for use with the old binary tree
database backend or old databases that haven't been upgraded to the
new format that came in with version 1.1.0.
- -N, --no-autoprune
- When marking as spam or nonspam, never automatically prune the
database. Usually the database is pruned after every 500 marks; if
you would rather --prune manually, use -N to disable
automatic pruning.
- -p, --prune
- Remove redundant entries from the database and clean it up a
little. This is automatically done after several calls to
--mark-spam or --mark-nonspam, and during training
with --train if the training takes a large number of rounds,
so it should rarely be necessary to use --prune manually
unless you are using -N / --no-autoprune.
- -X, --prune-max NUM
- When the database is being pruned, no more than NUM
entries will be considered for removal. This is to prevent CPU and
memory resources being taken over. The default is 100,000 but in
some circumstances (if you find that pruning takes too long) this
option may be used to reduce it to a more manageable number.
FILES
- /var/lib/qsfdb
- The default (system-wide) spam database. If you wish to install
qsf system-wide, this should be read-only to everyone; there
should be one user with write access who can update the spam
database with qsf --mark-spam and qsf --mark-nonspam
when necessary.
- /var/lib/qsfdb2
- A second, read-only, system-wide database. This can be useful
when installing qsf system-wide and using third-party spam
databases; the first global database can be updated with
system-specific changes, and this second database can be
periodically updated when the third-party spam database is
updated.
- $HOME/.qsfdb
- The default spam database for per-user data. Users without
write access to the system-wide database will have their data
written here, and the two databases will be read together. The
per-user database will be given a weighting equivalent to 10 times
the weighting of the global database.
NOTES
Currently, you cannot use qsf to check for spam while the
database is being updated. This means that while an update is in
progress, all email is passed through as non-spam.
There is an upper size limit of 512Kb on incoming email;
anything larger than this is just passed through as non-spam, to
avoid tying up machine resources.
The plaintext token mapping maintained by --plain-map
will never shrink, only grow. It is intended for use by
housekeeping and user interface scripts that, for instance, the
user can use to list all email addresses on their allow-list. These
scripts should take care of weeding out entries for tokens that are
no longer in the database. If you have no such scripts, there is
probably no point in using --plain-map anyway.
Avoid using the deny-list (-y) in any automated
retraining, as it can cause the filter to reject mail
unnecessarily. In general the deny-list is probably best left
unused unless explicitly required by your particular setup.
If both the allow-list and the deny-list are enabled, then email
addresses will first be checked against the deny-list, then the
allow-list, then the domain of the email address will be checked
for matching "@domain" entries in the deny-list and then in the
allow-list.
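The lookup order just described can be sketched in Python as follows. This only illustrates the ordering: the function name list_verdict is hypothetical, the lists are plain sets of strings rather than hashed database entries, and lower-casing the address is an assumption.

```python
def list_verdict(address, allow, deny):
    """Return "spam", "non-spam", or None if no list entry matched."""
    address = address.lower()  # assumption: matching ignores case
    domain = "@" + address.rpartition("@")[2]
    # Order taken from the text: the full address is checked against
    # the deny-list, then the allow-list; then the "@domain" entry is
    # checked against the deny-list, then the allow-list.
    if address in deny:
        return "spam"
    if address in allow:
        return "non-spam"
    if domain in deny:
        return "spam"
    if domain in allow:
        return "non-spam"
    return None
```

One consequence of this order is that an allow-listed address still gets through even when its whole domain is deny-listed.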
EXAMPLES
To filter all of your mail through qsf, with the
allow-list enabled and the "spam rating" header being added, add
this to your .procmailrc file:
-
:0 wf
| qsf -ra
If you want qsf to add "[SPAM]" to the subject line of
any messages it thinks are spam, do this instead:
-
:0 wf
| qsf -sra
To automatically mark any email sent to
spambox@yourdomain.com as spam (this is the "naive"
version):
-
:0 H
* ^To:.*spambox@yourdomain.com
| qsf -am
To do the same, but cleverly, so that only email to
spambox@yourdomain.com which qsf does NOT already
classify as spam gets marked as spam in the database (this stops
the database getting too heavily weighted):
-
# If sent to spambox@yourdomain.com:
:0
* ^To:.*spambox@yourdomain.com
{
:0 wf
| qsf -a
# The above two lines can be skipped if you've
# already piped the message through qsf.
# If the qsf database says it's not spam,
# mark it as spam!
:0 H
* ^X-Spam: NO
| qsf -am
}
Remove the -a option in the above examples if you don't
want to use the allow-list.
A more complicated filtering example - this will only run
qsf on messages which don't have a subject line saying "your
<something> is on fire" and which don't have a sender address
ending in "@foobar.com", meaning that messages with that subject
line OR that sender address will NEVER be marked as spam, no matter
what:
-
:0 wf
* ! ^Subject: Your .* is on fire
* ! ^From: .*@foobar.com
| qsf -ra
For more on procmail(1) recipes, see the
procmailrc(5) and procmailex(5) manual pages.
A couple of macros to add to your .muttrc file, if you
use mutt(1) as a mail user agent:
-
# Press F5 to mark a message as spam and delete it
macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"
# Press F9 to mark a message as non-spam
macro index <f9> "<pipe-message>qsf -aM\n"
macro pager <f9> "<pipe-message>qsf -aM\n"
Again, remove the -a option in the above examples if you
don't want to use the allow-list.
Note, however, that the above macros won't work when operating
on multiple tagged messages. For that, you'd need something like
this:
-
macro index <f5> ":set pipe_split\n<tag-prefix><pipe-message>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"
If you use qmail(7), then to get procmail working
with it you will need to put a line containing just
DEFAULT=./Maildir/ at the top of your ~/.procmailrc
file, so that procmail delivers to your Maildir folder
instead of trying to deliver to /var/spool/mail/$USER, and you will
need to put this in your ~/.qmail file:
-
| preline procmail
This will cause all your mail to be delivered via
procmail instead of being delivered directly into your mail
directory.
See the qmail(7) documentation for more about mail
delivery with qmail.
If you use postfix(1), you can set up a system-wide mail
filter by creating a user account for the purpose of filtering
mail, populating that account's .qsfdb, and then creating a
shell script, to run as that user, which runs qsf on stdin
and passes stdout to sendmail(8).
Doing this requires some knowledge of postfix
configuration and care needs to be taken to avoid mail loops. One
qsf user's full HOWTO is included in the doc/
directory with this package.
THE ALLOW-LIST
A feature called the "allow-list" can be switched on by
specifying the --allowlist or -a option. This causes
messages' "From:" and "Return-Path:" addresses to be checked
against a list of people you have said to allow all messages from,
and if a message's "From:" or "Return-Path:" address is in the
list, it is never marked as spam. This means you can add all your
friends to an "allow-list" and qsf will then never mis-file
their messages - a quick way to do this is to use -a with
-T (train); everyone in your non-spam folder who has sent
you an email will be added to the allow-list automatically during
training.
You can manually add and remove addresses to and from the
allow-list using the -e (email) option. For instance, to add
foo@bar.com to the allow-list, do this:
-
qsf -e foo@bar.com -M
To remove bad@nasty.com from the allow-list, do this:
-
qsf -e bad@nasty.com -m
And to see whether someone@somewhere.com is in the
allow-list or not, just do this:
-
qsf -e someone@somewhere.com
In general, you probably always want to enable the allow-list,
so always specify the -a option when using qsf. This
will automatically maintain the allow-list based on what you
classify as spam or non-spam.
The only times you might want to turn it off are when people on
your allow-list are prone to getting viruses or if a virus is
causing email to be sent to you that is pretending to be from
someone on your allow-list.
BACKUP AND RESTORE
Because the database format is platform-specific, it is a good
idea to periodically dump the database to a text file using qsf
-D so that, if necessary, it can be transferred to another
machine and restored with qsf -R later on.
Also note that since the actual contents of email messages are
never stored in the database (see TECHNICAL DETAILS), you
can safely share your qsf database with friends - simply
dump your database to a file, like this:
-
qsf -D > your-database-dump.txt
Once you have sent your-database-dump.txt to another
person, they can do this:
-
qsf -R < your-database-dump.txt
They will then have an identical database to yours.
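If the dump is to be taken periodically, it can be scripted. The following Python sketch is an illustration only: the function name dump_database and the runner parameter are hypothetical, and it relies only on the documented behaviour that qsf -D writes the dump to standard output and that -d selects the database.

```python
import subprocess

def dump_database(dump_path, db=None, runner=subprocess.run):
    """Write the output of `qsf -D` to dump_path; return the command run."""
    argv = ["qsf"]
    if db:
        argv += ["-d", db]  # optional: dump a specific database
    argv.append("-D")
    with open(dump_path, "wb") as out:
        # check=True raises CalledProcessError if qsf exits non-zero.
        runner(argv, stdout=out, check=True)
    return argv
```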
TECHNICAL DETAILS
When a message is passed to qsf, any attachments are
decoded, all HTML elements are removed, and the message text is
then broken up into "tokens", where a "token" is a single word or
URL. Each token is hashed using the MD5 algorithm (see below for
why), and that hash is then used to look up each token in the
qsf database.
For full details of which parts of an email (headers, body,
attachments, etc) are used to calculate the spam rating, see the
TOKENISATION section below.
Within the database, each token has two numbers associated with
it: the number of times that token has been seen in spam, and the
number of times it has been seen in non-spam. These two numbers,
along with the total number of spam and non-spam messages seen, are
then used to give a "spamminess" value for that particular token.
This "spamminess" value ranges from "definitely not spammy" at one
end of the scale, through "neutral" in the middle, up to
"definitely spammy" at the other end.
Once a "spamminess" value has been calculated for all of the
tokens in the message, a summary calculation is made to give an
overall "is this spam?" probability rating for the message. If the
overall probability is 0.9 or above, the message is flagged as
spam.
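This page does not give the formulas qsf uses, so the following Python sketch is a stand-in rather than qsf's own calculation. It uses a simple frequency ratio for each token and a naive Bayes combination (computed in log-odds form) for the message; only the 0.9 threshold comes from the text above. All function names are hypothetical.

```python
import math

def token_spamminess(spam_count, nonspam_count, total_spam, total_nonspam):
    """Per-token value in [0, 1]; 0.5 is neutral."""
    if spam_count + nonspam_count == 0 or total_spam == 0 or total_nonspam == 0:
        return 0.5  # unseen token or empty database: no evidence
    spam_freq = spam_count / total_spam
    nonspam_freq = nonspam_count / total_nonspam
    return spam_freq / (spam_freq + nonspam_freq)

def message_probability(values):
    """Combine per-token values into one probability (naive Bayes)."""
    log_odds = 0.0
    for p in values:
        # Clamp so one token cannot force the result to exactly 0 or 1.
        p = min(max(p, 0.01), 0.99)
        log_odds += math.log(p / (1.0 - p))
    # Clamp before exponentiating so very long messages cannot overflow.
    log_odds = min(max(log_odds, -700.0), 700.0)
    return 1.0 / (1.0 + math.exp(-log_odds))

def is_spam(values, threshold=0.9):
    """Flag the message once the overall probability reaches the threshold."""
    return message_probability(values) >= threshold
```

With this stand-in, a message with no known tokens scores 0.5 and is therefore not flagged.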
In addition to the probability test, there is the "allow-list". If
it is enabled (with the -a option) and the sender of the message is
listed in the allow-list, the whole probability check is skipped
and the message is not marked as spam.
When training the database, a message is split up into tokens as
described above, and then the numbers in the database for each
token are simply added to: if you tell qsf that a message is
spam, it adds one to the "number of times seen in spam" counter for
each token, and if you tell it a message is not spam, it adds one
to the "number of times seen in non-spam" counter for each token.
If you specify a weight, with -w, then the number you
specify is added instead of one.
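The counter update described above can be sketched in Python like this. It is an illustration under assumptions, not qsf's storage code: the database is a plain dict, the names token_key and mark are hypothetical, and a token that occurs several times in the list is counted once per occurrence.

```python
import hashlib

def token_key(token):
    """MD5 hex digest of a token; the plain text is never stored."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()

def mark(db, tokens, as_spam, weight=1):
    """Add `weight` to the spam or non-spam counter of every token."""
    for token in tokens:
        # Each entry holds [times seen in spam, times seen in non-spam].
        counts = db.setdefault(token_key(token), [0, 0])
        counts[0 if as_spam else 1] += weight
```

Correcting a false positive then corresponds to calling mark with as_spam=False and weight=2, mirroring qsf --mark-nonspam --weight=2.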
To stop the database growing uncontrollably, the database keeps
track of when a token was last used. Underused tokens are
automatically removed from the database. (The old method was to
"prune" every 500 updates).
Finally, the reason MD5 hashes were used is privacy. If the
actual tokens from the messages, and the actual email addresses in
the allow-list, were stored, you could not share a single
qsf database between multiple users because bits of
everyone's messages would be in the database - things like emailed
passwords, keywords relating to personal gossip, and so on. So a
hash is stored instead. A hash is a "one-way" function; it is easy
to turn a token into a hash but very hard (some might say
impossible) to turn a hash back into the token that created it.
This means that you end up with a database with no personal
information in it.
TOKENISATION
When a message is broken up into tokens, various parts of the
message are treated in different ways.
First, all header fields are discarded, except for the important
ones: From, Return-Path, Sender, To,
Reply-To, and Subject.
Next, any MIME-encoded attachments are decoded. Any attachments
whose MIME type starts with "text/" (i.e. HTML and text) are
tokenised, after having any HTML tags stripped. Any non-textual
attachments are replaced with their MD5 hash (such that two
identical attachments will have the same hash), and that hash is
then used as a token.
In addition to single-word tokens from textual message parts,
qsf adds doubled-up tokens so that word pairs get added to
the database. This makes the database a bit bigger (although the
automatic pruning tends to take care of that) but makes matching
more exact.
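The idea of single-word tokens plus word pairs can be sketched in Python as below. The details are assumptions made for illustration: the word pattern, the lower-casing, and joining a pair with a space are not taken from qsf, and this simple pattern would split URLs, which qsf treats as single tokens.

```python
import re

def tokenise(text):
    """Return single-word tokens followed by adjacent word-pair tokens."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Pair every word with the word that follows it.
    pairs = [first + " " + second for first, second in zip(words, words[1:])]
    return words + pairs
```

A three-word message therefore yields five tokens: three words and two pairs.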
SPECIAL FILTERS
As well as using the textual content of email to detect spam,
qsf also uses special filters which create "pseudo-tokens"
based on various rules. This means that specific patterns, not just
individual words, can be used to determine whether a message is
spam or not.
For example, if a message contains lots of words with multiple
consonants, like "ashjkbnxcsdjh", then each time a word like that
is seen the special token ".GIBBERISH-CONSONANTS." is added to the
list of tokens found in the message. If it turns out that most
messages with words that trigger this filter rule are spam, then
other messages with gibberish consonant strings will be more likely
to be flagged as spam.
Currently the special filters are:
- GTUBE
- Flags any message containing the string
XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X
as spam - useful for testing that your qsf installation is
working.
- ATTACH-SCR
- ATTACH-PIF
- ATTACH-EXE
- ATTACH-VBS
- ATTACH-VBA
- ATTACH-LNK
- ATTACH-COM
- ATTACH-BAT
- Adds a token for every attachment whose filename ends in
".scr", ".pif", ".exe", ".vbs", ".vba", ".lnk", ".com", and ".bat"
respectively (these are often viruses).
- ATTACH-GIF
- ATTACH-JPG
- ATTACH-PNG
- Adds a token for every attachment whose filename ends in
".gif", ".jpg" or ".jpeg", and ".png" respectively.
- ATTACH-DOC
- ATTACH-XLS
- ATTACH-PDF
- Adds a token for every attachment whose filename ends in
".doc", ".xls", or ".pdf" respectively (these tend to indicate a
non-spam email).
- SINGLE-IMAGE
- Adds a token if the message contains exactly one attached
image.
- MULTIPLE-IMAGES
- Adds a token if the message contains more than one attached
image.
- GIBBERISH-CONSONANTS
- Adds a token for every word found that has multiple consonants
in a row, as described above. Spam often contains strings of
gibberish.
- GIBBERISH-VOWELS
- Adds a token for every word found that has multiple vowels in a
row, eg "aeaiaiaeeio".
- GIBBERISH-FROMCONS
- Like GIBBERISH-CONSONANTS, but only for the "From:" and
"Return-Path:" addresses on their own.
- GIBBERISH-FROMVOWL
- Like GIBBERISH-VOWELS, but only for the "From:" and
"Return-Path:" addresses on their own.
- GIBBERISH-BADSTART
- Adds a token for every word that starts with a bad character
such as %.
- GIBBERISH-HYPHENS
- Adds a token for every word with more than three hyphens or
underscores in it.
- GIBBERISH-LONGWORDS
- Adds a token for every word with over 30 characters in it (but
less than 60).
- HTML-COMMENTS-IN-WORDS
- Adds a token for every HTML comment found in the middle of a
word. Spam often contains HTML inside words, like this:
w<!--dsgfhsdgjgh-->ord
- HTML-EXTERNAL-IMG
- Adds a token for every HTML <img> (image) tag found that
contains :// (i.e. it refers to an external image).
- HTML-FONT
- Adds a token for every HTML <font> tag found.
- HTML-IP-IN-URLS
- Adds a token for every URL found containing an IP address.
- HTML-INT-IN-URL
- Adds a token for every URL found containing an integer in its
hostname.
- HTML-URLENCODED-URL
- Adds a token for every URL found containing a % sign in its
hostname.
Normally, filters will just cause a token to be added, and these
tokens are processed by the normal weighting algorithm. However the
GTUBE filter will immediately flag any matching message as
spam, bypassing the token matching.
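A few of the gibberish filters listed above can be sketched in Python as follows. The hyphen and long-word limits come from the list; the run length of five consonants, the exclusion of "y" from the consonants, and the dotted names for tokens other than ".GIBBERISH-CONSONANTS." are assumptions, and pseudo_tokens is a hypothetical name.

```python
import re

# Assumption: five or more consonants in a row counts as "multiple".
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.IGNORECASE)

def pseudo_tokens(words):
    """Return one pseudo-token per rule triggered by each word."""
    tokens = []
    for word in words:
        if CONSONANT_RUN.search(word):
            tokens.append(".GIBBERISH-CONSONANTS.")
        # More than three hyphens or underscores in the word.
        if word.count("-") + word.count("_") > 3:
            tokens.append(".GIBBERISH-HYPHENS.")
        # Over 30 characters but less than 60.
        if 30 < len(word) < 60:
            tokens.append(".GIBBERISH-LONGWORDS.")
    return tokens
```

The returned pseudo-tokens would then be weighted like any other token.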
DATABASE BACKENDS
The inbuilt "list" database backend will not necessarily provide
the best performance, but is provided because using it requires no
external libraries.
If, when qsf was compiled, the correct libraries were
available, then it will be possible to use qsf with
alternative database backends. To find out which backends you have
available, run qsf -V (capital V) and read the second line
of output. To see how well a backend performs, collect some spam
and non-spam and use qsf -d BACKEND -B SPAM NONSPAM (see the
entry for -B above).
Some people find that they get the best performance out of the
gdbm backend; this is a library that is widely available on
many systems.
To efficiently share a qsf database across multiple
machines, you may find the MySQL backend useful. However, using it
is a little more complicated.
To use the MySQL backend you will need to create a table with
the fields key1, key2, token, value1,
value2 and value3. The token field must be
VARCHAR(64), and the value1, value2 and value3 fields
must each be BIGINT or INT; indexing on the token field is a good
idea. The key1 and key2 fields can be anything, but
they must be present.
For example:
-
USE mydatabase;
CREATE TABLE qsfdb (
key1 BIGINT UNSIGNED NOT NULL,
key2 BIGINT UNSIGNED NOT NULL,
token VARCHAR(64) DEFAULT '' NOT NULL,
value1 INT UNSIGNED NOT NULL,
value2 INT UNSIGNED NOT NULL,
value3 INT UNSIGNED NOT NULL,
PRIMARY KEY (key1,key2,token),
KEY (key1),
KEY (key2),
KEY (token)
);
The key1 and key2 fields allow you to have
multiple qsf databases in one table, by specifying different
key1 and key2 values on invocation.
Instead of specifying a database file with the --database
/ -d option, you must specify either a specification string
as described below, or the name of a file containing such a string
on its first line.
The specification string is as follows:
-
database=DATABASE;host=HOST;port=PORT;
user=USER;pass=PASS;table=TABLE;
key1=KEY1;key2=KEY2
This string must be all on one line, with no spaces.
- DATABASE
- is the name of the MySQL database.
- HOST
- is the hostname of the database server (eg "localhost").
- PORT
- is the TCP port to connect on (eg 3306).
- USER
- is the username to connect with.
- PASS
- is the password to connect with.
- TABLE
- is the database table to use. If a table with this name does
not exist when qsf is called in update or training mode,
then it will be created if permissions allow this to be done.
- KEY1
- is the value to use for the key1 field.
- KEY2
- is the value to use for the key2 field.
Since command lines can be seen in the process list, it is
probably best to specify a filename (eg qsf -d
mysql:qsfdb.spec) and put the specification string inside that
file.
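A specification file of this kind could be generated with a short Python helper such as the sketch below. The function names and the rejection of ";" inside values are assumptions; the field order, the single line with no spaces, and the use of the file's first line follow the description above.

```python
import os

FIELDS = ("database", "host", "port", "user", "pass", "table", "key1", "key2")

def build_spec(settings):
    """Join the settings into the one-line, space-free specification."""
    parts = []
    for name in FIELDS:
        value = str(settings[name])
        # Assumption: ';' separates fields, so it cannot appear in a value.
        if ";" in value or any(ch.isspace() for ch in value):
            raise ValueError("%s must not contain spaces or ';'" % name)
        parts.append("%s=%s" % (name, value))
    return ";".join(parts)

def write_spec(path, spec):
    """Write the string as the first line of the file at path."""
    # Mode 0600 only applies when the file is newly created; an
    # existing file keeps whatever permissions it already has.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(spec + "\n")
```

The resulting file can then be named on the command line, as in qsf -d mysql:qsfdb.spec, so that the password never appears in the process list.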
TROUBLESHOOTING
If you have problems with qsf, please check the list
below; if this does not help, go to the qsf home page and
investigate the mailing lists, or email the author.
- Nothing is being marked as spam.
-
First, use the -r option to switch on the
X-Spam-Rating header, and check that this header appears in
email passed through qsf. If it does not, then it is likely
that qsf is not being run at all - check your configuration
of procmail(1) or its equivalent.
-
If you are seeing X-Spam-Rating headers, and different
emails have different scores, then you may simply need to retrain
your database a little more. Take more spam email and pass it to
qsf -m.
-
If you are seeing X-Spam-Rating headers but they all give
the same spam rating, then the most likely reason is that
qsf is not reading any database. Make sure that whatever is
processing the email has read permissions on /var/lib/qsfdb
and/or ~/.qsfdb - and make sure that, if you are using
~/.qsfdb, what your database creator thought was ~
($HOME) is the same as it is for whatever is processing the
email.
- Retraining sometimes takes a very long time.
- With the obtree backend or 2-column MySQL or SQLite
tables, every 500th retrain (-m or -M), the database
is pruned. On some systems this may take some time, and during this
time the database is locked (except when using the MySQL or SQLite
backends). If you constantly do a lot of retraining and want to
avoid this, then use the -N option to suppress auto-pruning,
and then have a cron(8) job or something run a manual prune
(qsf -p) every now and again.
- Running qsf from procmail fails with an error.
- If you can run qsf from the command line, but in your
procmail log file you get errors about "qsf: cannot execute
binary file", then contact your system administrator for help. It
may be that incoming email is handled by a different server to the
one you normally shell into, and either they are of a different
architecture or operating system, or the mail server is not
permitted to execute user-owned binaries.
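The checks suggested under "Nothing is being marked as spam" can be partly automated. The Python sketch below reads a handful of already filtered messages and reports which of the three cases applies; the function name diagnose and its return strings are hypothetical, and the ratings are compared as plain text.

```python
from email import message_from_string

def diagnose(messages):
    """Classify a sample of filtered messages by X-Spam-Rating header."""
    ratings = []
    for text in messages:
        value = message_from_string(text)["X-Spam-Rating"]
        if value is not None:
            ratings.append(value.strip())
    if not ratings:
        return "no header"   # qsf is probably not being run at all
    if len(ratings) < 2:
        return "too few"     # nothing to compare against yet
    if len(set(ratings)) == 1:
        return "identical"   # qsf is probably not reading a database
    return "varied"          # the database may just need more training
```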
ACKNOWLEDGEMENTS
The following people have contributed suggestions, comments,
patches, and testing:
- Tom Parker <http://www.bits.bris.ac.uk/palfrey/>
Dr Kelly A. Parker
Vesselin Mladenov <http://www.antipodes.bg/>
Glyn Faulkner
Mark Reynolds
Sam Roberts
Scott Allen
Karsten Kankowski
M. Kolbl
Micha Holzmann
Jef Poskanzer <http://www.acme.com/jef/>
Clemens Fischer <http://ino-waiting.gmxhome.de/>
Nelson A. de Oliveira
Michal Vitecek
Tommy Pettersson <http://www.lysator.liu.se/~ptp/>
AUTHOR
The author:
- Andrew Wood
http://www.ivarch.com/
Project home page:
- http://www.ivarch.com/programs/qsf/
BUGS
If you find any bugs, please contact the author, either by email
or by using the contact form on the web site.
SEE ALSO
procmail(1), procmailrc(5),
procmailex(5)
Someone has written a guide to using qsf with KMail that
can be found at:
http://www.softwaredesign.co.uk/Information.SpamFilters.html
LICENSE
This is free software, distributed under the ARTISTIC 2.0
license.