Konstantin | 21 May 2012 20:59

detect an email with japanese characters


Hi,

How it is possible to detect (and filter) an email written in Japanese characters (which I cannot read anyway)?

The content-Type specifies charset="utf-8". The "From" field is apparently invalid, and may not
necessarily contain .jp

Regards,
Konstantin.
Alan Clifford | 21 May 2012 22:32

Re: detect an email with japanese characters

Konstantin wrote (at 14:59 (-0400) on Monday, 21st May, 2012):

>
> Hi,
>
> How it is possible to detect (and filter) an email written in Japanese 
> characters (which I cannot read anyway)?
>
> The content-Type specifies charset="utf-8". The "From" field is 
> apparently invalid, and may not necessarily contain .jp
>

I took a copy of this some time ago.  It might be useful

http://clifford.ac/chinese.html

The mentioned files are in chinese.zip

Alan
Konstantin | 23 May 2012 04:33

Re: detect an email with japanese characters


Great! Thank you all for suggestions.

Konstantin.

On 5/21/2012 4:32 PM, Alan Clifford wrote:
> Konstantin wrote (at 14:59 (-0400) on Monday, 21st May, 2012):
> 
>>
>> Hi,
>>
>> How it is possible to detect (and filter) an email written in Japanese characters (which I cannot read anyway)?
>>
>> The content-Type specifies charset="utf-8". The "From" field is apparently invalid, and may not
>> necessarily contain .jp
>>
> 
> I took a copy of this some time ago. It might be useful
> 
> http://clifford.ac/chinese.html
> 
> The mentioned files are in chinese.zip
> 
> 
> Alan
> 
> ____________________________________________________________
> procmail mailing list Procmail homepage: http://www.procmail.org/
> procmail <at> lists.RWTH-Aachen.de
> http://mailman.rwth-aachen.de/mailman/listinfo/procmail

Robert Bonomi | 21 May 2012 23:52

Re: detect an email with japanese characters

> From procmail-bounces <at> lists.RWTH-Aachen.de  Mon May 21 14:01:49 2012
> Date: Mon, 21 May 2012 14:59:41 -0400
> From: Konstantin <klk206 <at> panix.com>
> To: procmail <at> lists.RWTH-Aachen.de
> Subject: detect an email with japanese characters
>
>
> Hi,
>
> How it is possible to detect (and filter) an email written in Japanese characters (which I cannot read anyway)?
>
> The content-Type specifies charset="utf-8". The "From" field is apparently invalid, and may not
> necessarily contain .jp
>
> Regards,
> Konstantin.
>
>
Robert Bonomi | 22 May 2012 01:25

Re: detect an email with japanese characters


 Konstantin <klk206 <at> panix.com> wrote:
>
> Hi,
>
> How it is possible to detect (and filter) an email written in Japanese
> characters (which I cannot read anyway)?
>
> The content-Type specifies charset="utf-8". The "From" field is apparently
> invalid, and may not necessarily contain .jp

What I do is:

  a) specify a list of charsets that I understand:

     OK_CHARSET=(ASCII|DISPAY|ISO-8859-[12]|WINDOWS-125[012]|utf-8|utf8)

  b) filter anything that (1) specifies charset, and (2) does -not- have
     one of those charsets:

     :0 H
     * ^(From|To|Subject): *=\?\/[^?]*
     * ! $ MATCH ?? ${OK_CHARSET}
     $DISCARD

     :0 H
     * ^Content-Type:.*charset\/.*
     * ! $ MATCH ?? ${OK_CHARSET}
     $DISCARD
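
For readers who want to try the same allowlist idea outside procmail, here is a minimal Python sketch. It is not the recipe above, only an illustration of the approach: collect every charset a message declares (RFC 2047 encoded-words in From/To/Subject, plus the Content-Type charset of each MIME part) and flag the message if any of them is not on the list. The names OK_CHARSETS, declared_charsets and has_unknown_charset are hypothetical, and the allowlist only loosely follows OK_CHARSET.

```python
import re
from email import message_from_string

# Assumed allowlist, loosely modelled on OK_CHARSET in the recipe (lowercase).
OK_CHARSETS = {
    "ascii", "us-ascii", "iso-8859-1", "iso-8859-2",
    "windows-1250", "windows-1251", "windows-1252", "utf-8", "utf8",
}

# RFC 2047 encoded-word: =?charset?encoding?text?= (optional *language suffix).
ENCODED_WORD = re.compile(r"=\?([^?*\s]+)(?:\*[^?]*)?\?[bBqQ]\?[^?]*\?=")


def declared_charsets(raw_message):
    """Return the set of charsets named in the headers and MIME parts."""
    msg = message_from_string(raw_message)
    found = set()
    for name in ("From", "To", "Subject"):
        for value in msg.get_all(name, []):
            for match in ENCODED_WORD.finditer(str(value)):
                found.add(match.group(1).lower())
    for part in msg.walk():
        charset = part.get_content_charset()  # lowercased, or None
        if charset:
            found.add(charset)
    return found


def has_unknown_charset(raw_message):
    """True if the message declares at least one charset not on the allowlist."""
    return bool(declared_charsets(raw_message) - OK_CHARSETS)
```

As in the recipe, a message that declares no charset at all is left alone, and a message declared as utf-8 passes regardless of which script it is actually written in.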


LuKreme | 22 May 2012 03:31

Re: detect an email with japanese characters

On May 21, 2012, at 17:25, Robert Bonomi <bonomi <at> mail.r-bonomi.com> wrote:

> Note: you cannot 'safely' drop 'anything' with such a glyph in it
>     since Microsoft products routinely use several 3-byte glyphs --
>     things like 'smartquotes', dashes, etc.   (*snarl*)

Oh, it's not just MSFT, there are many high byte characters in UTF-8 that are perfectly usable and proper.
The days of 7-bit email are long behind us, and that's a good thing.
Robert Bonomi | 22 May 2012 04:44

Re: detect an email with japanese characters

> From procmail-bounces <at> lists.RWTH-Aachen.de  Mon May 21 20:34:42 2012
> Subject: Re: detect an email with japanese characters
> From: LuKreme <kremels <at> kreme.com>
> Date: Mon, 21 May 2012 19:31:42 -0600
> To: "procmail <at> lists.RWTH-Aachen.de" <procmail <at> lists.RWTH-Aachen.de>
>
> On May 21, 2012, at 17:25, Robert Bonomi <bonomi <at> mail.r-bonomi.com> wrote:
>
> > Note: you cannot 'safely' drop 'anything' with such a glyph in it
> >     since Microsoft products routinely use several 3-byte glyphs --
> >     things like 'smartquotes', dashes, etc.   (*snarl*)
>
> Oh, it's not just MSFT, there are many high byte characters in UTF-8 that
> are perfectly usable and proper. The days of 7-bit email are long behind
> us, and that's a good thing.

In 'western' usage, it is exceedingly rare to -need- anything beyond the 
so-called C0 through C3 glyph sets (roughly 256 'printable' symbols).

Microsoft is well known for its egregious MISUSE of UTF-8 multi-byte 
glyphs.  *Especially* in documents that are identified as using something 
_other_ than UTF-8.  One simply cannot 'trust' MS products to get the 
'content-type' right.  Their products are notorious for, say, _declaring_
a document as 'iso-8859-1' or 'Windows-1251', but including in that 
document a handful of UTF-8 3-byte sequences from the '0xe2', '0xe7', 
and '0xef' ranges.  
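
A rough way to spot the mismatch described above is to look for well-formed UTF-8 3-byte sequences in a payload whose declared charset is a single-byte one. The Python sketch below is only a heuristic under that assumption; the names UTF8_3BYTE and looks_mislabeled are hypothetical, and the pattern does not exclude overlong or surrogate forms.

```python
import re

# Lead byte 0xE0-0xEF followed by two continuation bytes 0x80-0xBF.
UTF8_3BYTE = re.compile(rb"[\xE0-\xEF][\x80-\xBF]{2}")


def looks_mislabeled(payload: bytes, declared_charset: str) -> bool:
    """True if a single-byte charset is declared but the raw bytes
    contain something shaped like a UTF-8 3-byte sequence."""
    name = declared_charset.lower()
    single_byte = name.startswith(("iso-8859-", "windows-125"))
    return single_byte and UTF8_3BYTE.search(payload) is not None
```

The same three bytes are technically legal in Windows-125x text, so a hit means "probably UTF-8", not "certainly UTF-8".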

For processing arbitrary e-mail from a Microsoft product, one has to
essentially throw away the declared charset, parse out the 'valid'
ASCII/ISO-8859/WINDOWS-125x/UTF-8 glyphs that one can recognize, and

Re: detect an email with japanese characters

At 19:33 2012-05-22, Konstantin wrote:

>Great! Thank you all for suggestions.

One late arrival:

Check out "furrin.rc" at:

<http://www.professional.org/procmail/spam.html>

That groups various character sets and checks for hibit characters, 
etc.  There are a number of links to references and pertinent RFCs as well.

I wrote it quite a few years ago (last time that was even altered was 
9 years ago), and it's been quite effective for me.

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.
LuKreme | 24 May 2012 00:19

Re: detect an email with japanese characters

PSE-L <at> mail.professional.org (Professional Software Engineering) spake on Tuesday 22-May-2012 <at> 21:58:42
> At 19:33 2012-05-22, Konstantin wrote:
> 
>> Great! Thank you all for suggestions.
> 
> One late arrival:
> 
> Check out "furrin.rc" at:
> 
> <http://www.professional.org/procmail/spam.html>
> 
> That groups various character sets and checks for hibit characters, etc.  There are a number of links to
> references and pertinent RFCs as well.
> 
> I wrote it quite a few years ago (last time that was even altered was 9 years ago), and it's been quite
> effective for me.

furrin.rc does a pretty decent job, but if there are no hibit characters in the subject, as there often
aren't, then it fails on utf-8 encoded foreign spam. OTOH, body checks for hibit seem a rather high price to pay.
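
If a body check is acceptable, one way to narrow "hibit" down to "actually Japanese" is to decode the text parts and measure how much of the text is kana. The Python sketch below is an illustration under stated assumptions, not part of furrin.rc: the function names and the 0.1 threshold are hypothetical, and only Hiragana and Katakana (U+3040 to U+30FF) are counted, on the assumption that Japanese text nearly always contains some kana while Chinese does not.

```python
import sys
from email import message_from_bytes


def is_kana(ch):
    # Hiragana U+3040-U+309F and Katakana U+30A0-U+30FF.
    return "\u3040" <= ch <= "\u30ff"


def kana_ratio(text):
    """Fraction of non-whitespace characters that are kana."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if is_kana(ch)) / len(chars)


def message_text(raw):
    """Decode every text/* part (base64 and quoted-printable included)."""
    msg = message_from_bytes(raw)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name: fall back to UTF-8
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def looks_japanese(raw, threshold=0.1):
    # threshold is an assumed cut-off; tune it against real mail.
    return kana_ratio(message_text(raw)) >= threshold


def main():
    """Exit status 0 if the message on stdin looks Japanese, else 1."""
    return 0 if looks_japanese(sys.stdin.buffer.read()) else 1
```

Saved as a script that ends with sys.exit(main()), this could be called from a recipe through a "?" exit-code condition. It starts an interpreter for every message it is asked about, so it belongs after the cheap header tests. It looks only at the body, not at the Subject.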

-- 
Against stupidity the gods themselves contend in vain.
