Zhang Jing | 27 Mar 2007 04:21

Can bogofilter be used as a Chinese spam filter?


I've read a lot of papers about ending spam, as well as Mr. Graham's "A Plan for Spam", but I have a problem and was
wondering if anyone can point me in the right direction.

I'm currently doing my senior project, designing a spam filter for Chinese email. In
bogofilter-faq.html, the section "What can I do about Asian spam?" seems to suggest that bogofilter does
not support the Chinese language. Am I right? Could you give me some suggestions if I want to use
bogofilter in a Chinese-language environment, that is, to filter Chinese spam out of Chinese mail?

Best Regards!

Yours sincerely,
Zhang Jing
_______________________________________________
Bogofilter-dev mailing list
Bogofilter-dev <at> bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter-dev

Matthias Andree | 28 Mar 2007 00:43

Re: Can bogofilter be used as a Chinese spam filter?

Zhang Jing wrote:

> I'm currently doing my senior project, designing a spam filter for Chinese email. In
> bogofilter-faq.html, the section "What can I do about Asian spam?" seems to suggest that bogofilter does
> not support the Chinese language. Am I right? Could you give me some suggestions if I want to use
> bogofilter in a Chinese-language environment, that is, to filter Chinese spam out of Chinese mail?

Zhang Jing, your "Subject" header line is not properly encoded in your
character set and displays random characters here, not your Chinese name.
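For illustration: non-ASCII text in headers such as From or Subject is normally sent as RFC 2047 "encoded words", which carry the character set along with the encoded bytes. The following is only a sketch using Python's standard email.header module, not anything bogofilter does itself; the name used is a made-up example and is not taken from this thread.

```python
from email.header import Header, decode_header

# Hypothetical display name, chosen only for illustration.
name = "张静"

# Encode the name as an RFC 2047 encoded word that declares its charset.
# The result is plain ASCII and safe to place in a mail header.
encoded = Header(name, charset="utf-8").encode()

# A receiving program reverses the process: decode_header() yields
# (bytes, charset) pairs, which are then decoded to text.
raw_bytes, charset = decode_header(encoded)[0]
roundtrip = raw_bytes.decode(charset)

print(encoded)
print(roundtrip == name)
```

A header that contains raw 8-bit bytes without this wrapping gives the reader no charset information, which is why it tends to show up as random characters.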

Anyway: the problem with Chinese is that, unlike the Indo-European languages
that we know, written Chinese has no spaces between words; it just
concatenates them until a full stop. Bogofilter is not programmed to
handle that, and will instead parse full sentences, unless they contain
more than 30 words. So it may catch common spam phrases but,
unfortunately, not individual words.

As David suggested, help with enhancing lexer.l to properly emit Chinese
words as single tokens is most welcome.
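To make the tokenization problem concrete, here is a rough sketch in Python (not bogofilter code, and not a patch for the flex lexer). It assumes overlapping character bigrams as a stand-in for real word segmentation, which is a common workaround for unsegmented text; the function name tokenize and the restriction to the basic CJK Unified Ideographs block are assumptions made for brevity.

```python
import re

# Either a run of CJK ideographs (basic block U+4E00..U+9FFF only, an
# assumption for brevity) or a run of ASCII letters and digits.
TOKEN_RUN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def tokenize(text):
    """Split text into tokens, using character bigrams for Chinese runs."""
    tokens = []
    for match in TOKEN_RUN.finditer(text):
        run = match.group()
        if "\u4e00" <= run[0] <= "\u9fff":
            if len(run) == 1:
                # A lone ideograph becomes its own token.
                tokens.append(run)
            else:
                # Overlapping two-character windows over the run.
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            # ASCII words are kept whole and lowercased.
            tokens.append(run.lower())
    return tokens


# Whitespace splitting sees the whole Chinese phrase as a single "word":
print("免费赠送 Free gift".split())
# The bigram sketch yields smaller units that can recur across messages:
print(tokenize("免费赠送 Free gift"))
```

Bigrams do not recover true word boundaries; they only give the classifier short, repeatable units. Emitting real Chinese words would need a dictionary or a statistical segmenter in front of, or inside, the lexer.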

Hope that helps.

-- 
Matthias Andree
_______________________________________________
David Relson | 27 Mar 2007 13:23

Re: Can bogofilter be used as a Chinese spam filter?

On Tue, 27 Mar 2007 10:21:25 +0800 (CST)
Zhang Jing wrote:

> 
> I've read a lot of papers about ending spam, as well as Mr. Graham's
> "A Plan for Spam", but I have a problem and was wondering if anyone
> can point me in the right direction.
> 
> I'm currently doing my senior project, designing a spam filter for
> Chinese email. In bogofilter-faq.html, the section "What can I do
> about Asian spam?" seems to suggest that bogofilter does not support
> the Chinese language. Am I right? Could you give me some suggestions
> if I want to use bogofilter in a Chinese-language environment, that
> is, to filter Chinese spam out of Chinese mail?
> 
> Best Regards!
> 
> Yours sincerely,
> Zhang Jing

Hello Zhang Jing,

About 2 years ago, Unicode support was implemented in bogofilter.  This
provides a standardized character set for use in the wordlist and for
processing messages.  How well this works with Chinese is not clear.
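As a rough illustration of why a single character set helps: Chinese mail may arrive in legacy encodings such as GB2312 or Big5, and the same word has different bytes in each. The Python sketch below is not bogofilter's actual conversion code, and the helper name to_utf8 is hypothetical. It normalizes text to UTF-8 so that one word maps to one wordlist key regardless of the charset the message declared.

```python
def to_utf8(raw: bytes, declared_charset: str) -> bytes:
    """Convert raw message bytes to UTF-8 using the declared charset.

    Falls back to lenient UTF-8 decoding when the charset name is
    unknown or the bytes do not match it (an assumption of this sketch).
    """
    try:
        text = raw.decode(declared_charset)
    except (LookupError, UnicodeDecodeError):
        text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8")


word = "中文"  # "Chinese (language)"; both characters exist in GB2312 and Big5
as_gb = word.encode("gb2312")
as_big5 = word.encode("big5")

# The legacy encodings disagree byte for byte ...
print(as_gb != as_big5)
# ... but both normalize to the same UTF-8 key.
print(to_utf8(as_gb, "gb2312") == to_utf8(as_big5, "big5"))
```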

Also, bogofilter's parser is based on a flex grammar (see file
src/lexer_v3.l).  The parser recognizes standard email headers (such as
From:, Subject:, etc.), multipart MIME messages, etc.  As these
headers are defined by RFC standards, they apply regardless of the
[...]

