Andrew | 6 Sep 2007 15:23

Idea for improving the learning stage

Hello, I would like to submit an idea which I think would improve the 
accuracy and the learning stage of any statistical spam filter.

The concept: learn where the "giveaway" is by watching user behaviour.

It basically comes down to having the filter take note of this: did the 
user need to open the email before flagging it as spam?

If the answer is "no", then concentrate your stats on the subject line 
and ignore the body (which might be full of random words used by the 
spammer to pollute the filter's database).

If the answer is "yes", the reverse applies: ignore the subject, which 
must have looked "legitimate" to the user, and concentrate on the body, 
which is what clued the user in about the email being spam.

By analyzing only the subject OR the body, you analyze only what 
actually looks like spam, thus ignoring the parts of the email that are 
there to deceive.

What do you think?

Regards,
Andrew

_______________________________________________
Bogofilter-dev mailing list
Bogofilter-dev <at> bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter-dev

Matthias Andree | 8 Sep 2007 09:35

Re: Idea for improving the learning stage

On Thu, 06 Sep 2007, Andrew wrote:

> It basically comes down to having the filter take note of this: did the 
> user need to open the email before flagging it as spam?
> 
> If the answer is "no", then concentrate your stats on the subject line 
> and ignore the body (which might be full of random words used by the 
> spammer to pollute the filter's database).
> 
> If the answer is "yes", the reverse applies: ignore the subject, which 
> must have looked "legitimate" to the user, and concentrate on the body, 
> which is what clued the user in about the email being spam.
> 
> By analyzing only the subject OR the body, you analyze only what 
> actually looks like spam, thus ignoring the parts of the email that are 
> there to deceive.

How does bogofilter, for a newly arriving mail, decide whether to look
at header or body? If we modified just the learning side, we'd still be
evaluating body and header, which might still mislead bogofilter. So,
does your suggestion imply we'll have to keep header and body databases
separate? That's certainly doable technically, but what do you do with
nested MIME messages? (Postfix, for instance, lets you specify
regexp-based filters for the message as a whole, or for headers of
embedded MIME parts, usually "attachments").

-- 
Matthias Andree

Andrew | 8 Sep 2007 12:21

Re: Idea for improving the learning stage

On Sat, 8 Sep 2007 09:35:03 +0200,
Matthias Andree <matthias.andree <at> gmx.de> wrote:

> How does bogofilter, for a newly arriving mail, decide whether to look
> at header or body?

When mail comes in, Bogofilter would always evaluate the full message in 
any case. 

It's only when the user flags a message as spam or ham that my idea 
comes into play and Bogofilter decides what to look at, based on message 
status.

Message status would tell Bogofilter what words prompted the user to 
recognize the message as spam or ham when he flagged it.

So, in the end, its database would mostly be made of those words that 
*really* were critical for the user when he recognized spam from ham.

> So, does your suggestion imply we'll have to keep header and body 
> databases separate?

I've been thinking about separate databases, but I've come to the 
conclusion that we wouldn't really need them: words that looked "spammy" 
in a subject would still look spammy in the body, and vice-versa. So, in 
my opinion, only one database would still be the way to go.

Regards,
Andrew

Matthias Andree | 8 Sep 2007 12:58

Re: Idea for improving the learning stage

Andrew <aremo <at> ngi.it> writes:

> I've been thinking about separate databases, but I've come to the 
> conclusion that we wouldn't really need them: words that looked "spammy" 
> in a subject would still look spammy in the body, and vice-versa. So, in 
> my opinion, only one database would still be the way to go.

No, body and header are orthogonal, since bogofilter tags header tokens
before registering them, and registering partial messages would only
skew the individual token probabilities by skewing .MSG_COUNT. Try bogolexer or bogoutil
dumps...

Suppose we're registering a header, we'll bump .MSG_COUNT but not
register any body tokens, so the significance of all body tokens will
slowly decrease... so I wonder if we need .BODY_COUNT and .HEADER_COUNT
or something like that to replace .MSG_COUNT. That being an incompatible
change, it cannot become part of bogofilter 1.0.X.

I understand what you're aiming at, and I'm not saying it's not useful
-- it's just that there's more to the solution than just registering
partial messages.
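
To put rough numbers on the .MSG_COUNT effect, here is a deliberately simplified sketch in Python. It treats a token's weight as "messages containing the token divided by messages registered", which is only a stand-in for bogofilter's real calculation, and the separate body counter is the hypothetical .BODY_COUNT idea from this message, not an existing feature.

```python
def frequency(token_count, msg_count):
    """Fraction of registered spam messages that contained the token."""
    return token_count / msg_count


# 100 complete spam messages registered; one body token was seen in 40 of them.
body_token_count = 40
msg_count = 100      # stands in for .MSG_COUNT
body_count = 100     # hypothetical .BODY_COUNT: messages whose body was registered
before = frequency(body_token_count, msg_count)

# Now register 100 more spam messages header-only.
# The message counter goes up, but no body token can gain a count.
msg_count += 100
after_shared_counter = frequency(body_token_count, msg_count)
after_separate_counter = frequency(body_token_count, body_count)

print(before, after_shared_counter, after_separate_counter)
```

With a single shared counter the body token looks half as spammy after the header-only registrations, although nothing about its real-world behaviour changed. With a separate counter for bodies its frequency stays where it was.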

-- 
Matthias Andree

David Relson | 8 Sep 2007 14:06

Re: Idea for improving the learning stage

On Sat, 08 Sep 2007 12:58:48 +0200
Matthias Andree wrote:

> Andrew <aremo <at> ngi.it> writes:
> 
> > I've been thinking about separate databases, but I've come to the 
> > conclusion that we wouldn't really need them: words that looked
> > "spammy" in a subject would still look spammy in the body, and
> > vice-versa. So, in my opinion, only one database would still be the
> > way to go.
> 
> No, body and header are orthogonal, since bogofilter tags header
> tokens before registering them, and registering partial messages
> would only skew the individual token probabilities by
> skewing .MSG_COUNT. Try bogolexer or bogoutil dumps...
> 
> Suppose we're registering a header, we'll bump .MSG_COUNT but not
> register any body tokens, so the significance of all body tokens
> will slowly decrease... so I wonder if we need .BODY_COUNT
> and .HEADER_COUNT or something like that to replace .MSG_COUNT. That
> being an incompatible change, it cannot become part of bogofilter
> 1.0.X.
> 
> I understand what you're aiming at, and I'm not saying it's not useful
> -- it's just that there's more to the solution than just registering
> partial messages.
> 
> -- 
> Matthias Andree

Tom Anderson | 6 Sep 2007 16:52

Re: Idea for improving the learning stage

Sounds like an interesting strategy on client-side filters which are 
integrated into the mail client, but I don't see how this could apply to 
a server-side filter.  Perhaps Bogofilter could be a link in the chain 
of such a system, but Bogofilter itself doesn't "watch" user behavior.

Tom

Andrew wrote:
> Hello, I would like to submit an idea which I think would improve the 
> accuracy and the learning stage of any statistical spam filter.
> 
> The concept: learn where the "giveaway" is by watching user behaviour.
> 
> It basically comes down to having the filter take note of this: did the 
> user need to open the email before flagging it as spam?
> 
> If the answer is "no", then concentrate your stats on the subject line 
> and ignore the body (which might be full of random words used by the 
> spammer to pollute the filter's database).
> 
> If the answer is "yes", the reverse applies: ignore the subject, which 
> must have looked "legitimate" to the user, and concentrate on the body, 
> which is what clued the user in about the email being spam.
> 
> By analyzing only the subject OR the body, you analyze only what 
> actually looks like spam, thus ignoring the parts of the email that are 
> there to deceive.
> 
> What do you think?
> 

Andrew | 6 Sep 2007 17:07

Re: Idea for improving the learning stage

On Thu, 06 Sep 2007 10:52:04 -0400,
Tom Anderson <tanderso <at> oac-design.com> wrote:

> Sounds like an interesting strategy on client-side filters which are 
> integrated into the mail client, but I don't see how this could apply to 
> a server-side filter.  Perhaps Bogofilter could be a link in the chain 
> of such a system, but Bogofilter itself doesn't "watch" user behavior.

Hi Tom, I use Bogofilter as a client-side filter: my MUA (KMail) calls 
it when new mail comes in and when I flag messages as spam or ham.

My idea could probably be implemented as a command-line switch followed 
by a read/unread flag. That way, when the MUA calls Bogofilter, it 
could include that switch and pass message status.

Regards,
Andrew

David Relson | 7 Sep 2007 01:11

Re: Idea for improving the learning stage

On Thu, 6 Sep 2007 15:07:12 +0000 (UTC)
Andrew wrote:

> On Thu, 06 Sep 2007 10:52:04 -0400,
> Tom Anderson <tanderso <at> oac-design.com> wrote:
> 
> > Sounds like an interesting strategy on client-side filters which
> > are integrated into the mail client, but I don't see how this could
> > apply to a server-side filter.  Perhaps Bogofilter could be a link
> > in the chain of such a system, but Bogofilter itself doesn't
> > "watch" user behavior.
> 
> Hi Tom, I use Bogofilter as a client-side filter: my MUA (KMail)
> calls it when new mail comes in and when I flag messages as spam or
> ham.
> 
> My idea could probably be implemented as a command-line switch
> followed by a read/unread flag. That way, when the MUA calls
> Bogofilter, it could include that switch and pass message status.
> 
> Regards,
> Andrew

Since bogofilter normally classifies a message before the MUA is used
to view it, your message had me puzzled for a while.  After thinking a
bit your idea seems more reasonable.  Assuming your MUA can pass the
read/unread state to a script, then the script would be able to
translate the MUA flag to a bogofilter training flag.  This seems to be
a technique that could be implemented for a MUA, rather than a
capability needing a change to bogofilter.

Andrew | 7 Sep 2007 03:01

Re: Idea for improving the learning stage

On Thu, 6 Sep 2007 19:11:21 -0400,
David Relson <relson <at> osagesoftware.com> wrote:

> Since bogofilter normally classifies a message before the MUA is used
> to view it, your message had me puzzled for a while.

Hi David, surely bogofilter does its check before the user sees the 
messages, but keep in mind that my idea only concerns the process of 
manually flagging the individual emails, i.e. when we "teach" bogofilter 
by calling it with the -s or -n option.

> After thinking a bit your idea seems more reasonable.  Assuming your 
> MUA can pass the read/unread state to a script, then the script would 
> be able to translate the MUA flag to a bogofilter training flag.  This 
> seems to be a technique that could be implemented for a MUA, rather 
> than a capability needing a change to bogofilter.

The client certainly needs the ability to pass message status, but 
bogofilter should then use the status to decide whether to "learn" by 
looking only at the subject line (unread message == the giveaway is in 
the subject) or by ignoring the subject and looking at the message body 
(if the user needed to open the message to understand it was spam).
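
As a rough sketch of that decision rule in Python (this is not an existing bogofilter option; the function name and the "subject"/"body" labels are made up for illustration, only -s and -n are real bogofilter flags):

```python
def training_plan(user_says_spam, was_opened):
    """Pick the bogofilter training flag and the part of the message to learn from.

    -s and -n register a message as spam or as ham.
    "subject" and "body" are hypothetical labels for a wrapper script.
    """
    flag = "-s" if user_says_spam else "-n"
    # Unread when flagged: the subject line gave the message away.
    # Opened before flagging: the subject looked fine, the body gave it away.
    part = "body" if was_opened else "subject"
    return flag, part


print(training_plan(user_says_spam=True, was_opened=False))
print(training_plan(user_says_spam=True, was_opened=True))
```

The MUA would only have to supply the two booleans; everything else could live in whatever component ends up doing the partial training.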

Cheers,
Andrew

David Relson | 7 Sep 2007 03:33

Re: Idea for improving the learning stage

On Fri, 7 Sep 2007 01:01:41 +0000 (UTC)
Andrew wrote:

> On Thu, 6 Sep 2007 19:11:21 -0400,
> David Relson <relson <at> osagesoftware.com> wrote:
> 
> > Since bogofilter normally classifies a message before the MUA is
> > used to view it, your message had me puzzled for a while.
> 
> 
> Hi David, surely bogofilter does its check before the user sees the 
> messages, but keep in mind that my idea only concerns the process of 
> manually flagging the individual emails, i.e. when we "teach"
> bogofilter by calling it with the -s or -n option.
> 
> 
> > After thinking a bit your idea seems more reasonable.  Assuming
> > your MUA can pass the read/unread state to a script, then the
> > script would be able to translate the MUA flag to a bogofilter
> > training flag.  This seems to be a technique that could be
> > implemented for a MUA, rather than a capability needing a change to
> > bogofilter.
> 
> 
> The client certainly needs the ability to pass message status, but 
> bogofilter should then use the status to decide whether to "learn" by 
> looking only at the subject line (unread message == the giveaway is
> in the subject) or by ignoring the subject and looking at the message
> body (if the user needed to open the message to understand it was
> spam).

Andrew | 7 Sep 2007 12:14

Re: Idea for improving the learning stage

On Thu, 6 Sep 2007 21:33:42 -0400,
David Relson <relson <at> osagesoftware.com> wrote:

> The intelligence you suggest belongs in a script driving bogofilter.
> With claws-mail I have two actions "classify as spam" and "classify as
> ham".  These actions forward the messages to special addresses on my
> mail server and procmail spots the messages and passes them to a
> reclassify script.  The reclassify script looks at the forwarding
> address and the message's X-Bogosity line then invokes bogofilter with
> appropriate flags.  For example, since "X-Bogosity: Spam" and "forward
> as ham" indicates a "False Positive" bogofilter gets run with "-S
> -n".  Note that all the decision making is _outside_ of bogofilter.
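
The flag selection in such a reclassify script might look roughly like the Python sketch below. Only the "-S -n" case is taken from the description above. The mirror case and the handling of "Unsure" are assumptions, as is the idea that the filter had already registered the message under its own verdict (for example through auto-update); the function name is invented.

```python
def training_flags(x_bogosity, user_says_spam):
    """Map the filter's verdict plus the user's verdict to bogofilter flags."""
    verdict = x_bogosity.strip().split(",")[0].lower()
    if verdict == "spam" and not user_says_spam:
        return ["-S", "-n"]   # false positive: unregister as spam, register as ham
    if verdict == "ham" and user_says_spam:
        return ["-N", "-s"]   # assumed mirror case: unregister as ham, register as spam
    if verdict == "unsure":
        # Assumption: nothing was registered for an unsure message, so just register it.
        return ["-s"] if user_says_spam else ["-n"]
    return []                 # filter and user agree, nothing to correct


print(training_flags("Spam, tests=bogofilter", user_says_spam=False))
```

All of this stays outside bogofilter, which only ever sees the resulting flags.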

So how could an external script tell bogofilter to "ignore the subject" 
or "ignore the body"?

Regards,
Andrew

David Relson | 7 Sep 2007 12:46

Re: Idea for improving the learning stage

On Fri, 7 Sep 2007 10:14:56 +0000 (UTC)
Andrew wrote:

> On Thu, 6 Sep 2007 21:33:42 -0400,
> David Relson <relson <at> osagesoftware.com> wrote:
> 
> > The intelligence you suggest belongs in a script driving bogofilter.
> > With claws-mail I have two actions "classify as spam" and "classify
> > as ham".  These actions forward the messages to special addresses
> > on my mail server and procmail spots the messages and passes them
> > to a reclassify script.  The reclassify script looks at the
> > forwarding address and the message's X-Bogosity line then invokes
> > bogofilter with appropriate flags.  For example, since "X-Bogosity:
> > Spam" and "forward as ham" indicates a "False Positive" bogofilter
> > gets run with "-S -n".  Note that all the decision making is
> > _outside_ of bogofilter.
> 
> 
> So how could an external script tell bogofilter to "ignore the
> subject" or "ignore the body"?
> 
> 
> Regards,
> Andrew

Bogofilter doesn't have such capabilities, nor does it need them.  If
you want part of a message to be excluded, a copy of the message needs
to be created without that part.  Tools that you should consider are
formail, awk, and grep.  
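
For comparison, the same "make a copy without that part" step can be sketched with Python's standard email module instead of formail, awk, or grep. The function names and the sample message are invented for the example, and the result would still have to be piped to bogofilter by the calling script.

```python
from email import message_from_string

RAW = (
    "From: someone@example.com\n"
    "Subject: Cheap pills\n"
    "\n"
    "Random filler words here.\n"
)


def without_subject(raw):
    """Copy of the message with every Subject header removed."""
    msg = message_from_string(raw)
    del msg["Subject"]   # deletes all occurrences, including folded ones
    return msg.as_string()


def subject_only(raw):
    """Copy that keeps only the Subject header(s) and an empty body."""
    subjects = message_from_string(raw).get_all("Subject", [])
    return "".join("Subject: " + s + "\n" for s in subjects) + "\n"


print(without_subject(RAW))
print(subject_only(RAW))
```

Because the header is parsed rather than matched line by line, a Subject that is folded over several lines is handled as one header.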

Matthias Andree | 8 Sep 2007 09:38

Re: Idea for improving the learning stage

On Fri, 07 Sep 2007, David Relson wrote:

> grep can be used for simple exclusion tasks.  For example, to exclude
> only the subject: 
> 
>    grep -v ^Subject: < message | bogofilter ...

Sorry David, but this isn't robust. If you know the MUA presents the
first Subject: header found (if the message is malformed), it's easy,
just insert "head -n1" into the middle of the pipeline. If the MUA
presents the last of several Subject: headers in malformed messages,
you'll need to first isolate the header block (sed '/^$/q' might do, but
may cause "broken pipe" errors earlier in the pipeline) and then use
tail -n1.

This isn't trivial given that spammers try to deceive filters and
everything.
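
To make the first-versus-last problem concrete, here is a small Python sketch using the standard email module. Which Subject header a given MUA displays is exactly the unknown here, so it is passed in as a parameter; the function name and the sample message are made up.

```python
from email import message_from_string

MALFORMED = (
    "From: someone@example.com\n"
    "Subject: Meeting notes\n"
    "Subject: Cheap pills\n"
    "\n"
    "Body text.\n"
)


def displayed_subject(raw, mua_shows="first"):
    """Return the Subject the user presumably saw in a message with duplicates."""
    subjects = message_from_string(raw).get_all("Subject", [])
    if not subjects:
        return ""
    return subjects[0] if mua_shows == "first" else subjects[-1]


print(displayed_subject(MALFORMED, "first"))
print(displayed_subject(MALFORMED, "last"))
```

A training script that guesses wrong here would learn from a subject line the user never looked at.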

-- 
Matthias Andree

mouss | 7 Sep 2007 23:53

Re: Idea for improving the learning stage

David Relson wrote:
> On Fri, 7 Sep 2007 10:14:56 +0000 (UTC)
> Andrew wrote:
> 
>> On Thu, 6 Sep 2007 21:33:42 -0400,
>> David Relson <relson <at> osagesoftware.com> wrote:
>>
>>> The intelligence you suggest belongs in a script driving bogofilter.
>>> With claws-mail I have two actions "classify as spam" and "classify
>>> as ham".  These actions forward the messages to special addresses
>>> on my mail server and procmail spots the messages and passes them
>>> to a reclassify script.  The reclassify script looks at the
>>> forwarding address and the message's X-Bogosity line then invokes
>>> bogofilter with appropriate flags.  For example, since "X-Bogosity:
>>> Spam" and "forward as ham" indicates a "False Positive" bogofilter
>>> gets run with "-S -n".  Note that all the decision making is
>>> _outside_ of bogofilter.
>>
>> So how could an external script tell bogofilter to "ignore the
>> subject" or "ignore the body"?
>>
>>
>> Regards,
>> Andrew
> 
> Bogofilter doesn't have such capabilities, nor does it need them.  If
> you want part of a message to be excluded, a copy of the message needs
> to be created without that part.  Tools that you should consider are
> formail, awk, and grep.  
> 

Andrew | 8 Sep 2007 00:57

Re: Idea for improving the learning stage

On Fri, 07 Sep 2007 23:53:06 +0200, mouss <mlist.only <at> free.fr> wrote:

> [body only]
> Isn't "Subject" a token, so that removing it will make it no longer 
> neutral? I mean, suppose you remove Subject from a thousand spam messages, 
> then "Subject" may become a ham sign, which it should not be.

Good point, provided that Bogofilter actually treats "Subject:" as any 
other word. If that's the case, we should pass a line that only says 
"Subject:".

> [subject only]
> and if you only train by subject, you will miss the spammy body tokens. 

But you'll also ignore possible "polluting" words in the body, while 
taking note of those words (the subject) that really prompted the user 
to flag the message as spam.

Regards,
Andrew

Matthias Andree | 8 Sep 2007 09:41

Re: Idea for improving the learning stage

On Fri, 07 Sep 2007, Andrew wrote:

> Good point, provided that Bogofilter actually treats "Subject:" as any 
> other word. If that's the case, we should pass a line that only says 
> "Subject:".

Not unless you tell it to. Else you'll see head:Subject tokens and
subj:WHATEVER for each of the tokens that was observed on Subject lines.
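
A toy tokenizer in Python shows the effect of that tagging. It only imitates the two prefixes mentioned here and is in no way bogofilter's real lexer (run bogolexer on a message to see the actual tokens); the function name is invented and folded header lines are not handled.

```python
def toy_tokens(raw):
    """Split a message into tokens, tagging header words the way described above."""
    header, _, body = raw.partition("\n\n")
    tokens = []
    for line in header.splitlines():
        name, _, value = line.partition(":")
        tokens.append("head:" + name)
        prefix = "subj:" if name.lower() == "subject" else "head:"
        tokens.extend(prefix + word for word in value.split())
    tokens.extend(body.split())   # body words stay untagged
    return tokens


print(toy_tokens("Subject: cheap pills\n\ncheap watches\n"))
# → ['head:Subject', 'subj:cheap', 'subj:pills', 'cheap', 'watches']
```

The word "cheap" ends up as two unrelated tokens, subj:cheap and cheap, so counts learned from subject lines say nothing about the same word in a body, and vice versa.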

> > [subject only]
> > and if you only train by subject, you will miss the spammy body tokens. 
> 
> 
> But you'll also ignore possible "polluting" words in the body, while 
> taking note of those words (the subject) that really prompted the user 
> to flag the message as spam.

However, I see no clean way yet to tell bogofilter to "ignore body" or
"ignore subject" when scoring a newly arriving message. Details are in
my reply to your initial suggestion message.

-- 
Matthias Andree