Jannik Zschiesche | 20 Jan 2012 18:15

Minor mistakes in the english vocabulary

Hello everyone,

I am working on a PHP implementation of the stemmer algorithms for English, German and Spanish (as soon as I am done, I will host it on GitHub for everyone to use).

While testing the implementation against the English vocabulary I found what I think are some mistakes. Please correct me if I am wrong.

In the vocabulary, there are the following transformations (and some more, but I don't want to flood you):



1. skies -> sky
R1: ""
R2: ""

Step 1a: replace "ies" with "i" -> ski
=> ski



2. sky -> sky
R1: ""
R2: ""

Step 1c: replace "y" with "i" -> ski
=> ski



3. succeed -> succeed
R1: "ceed"
R2: ""

Step 1b: replace "eed" with "ee" -> succee
Step 5: replace "ee" with "e" -> succe
=> succe



4. succeeds -> succeed
R1: "ceed"
R2: ""

Step 1a: remove "s" -> succeed
-> see "succeed"



5. tying -> tie
R1: "g"
R2: ""

Step 1b: remove "ing" -> ty
=> ty



6. ugly -> ugli
R1: "ly"
R2: ""

Step 1c: replace "y" with "i" -> ugli
=> ugli
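
The R1 and R2 values listed in the cases above can be cross-checked with a small sketch like the following. It is Python rather than PHP, the function names are made up for illustration, and it is deliberately simplified: it treats every "y" as a vowel and ignores the special prefixes that the full English stemmer handles when marking regions, so it should only be trusted for words like the six above.

```python
VOWELS = set("aeiouy")  # simplification: "y" is always treated as a vowel here


def region_start(word: str, start: int = 0) -> int:
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word: str) -> tuple[str, str]:
    """Return the (R1, R2) regions of `word` as strings."""
    p1 = region_start(word)        # R1 starts here
    p2 = region_start(word, p1)    # R2 is the same rule applied inside R1
    return word[p1:], word[p2:]


for w in ("skies", "sky", "succeed", "tying", "ugly"):
    print(w, r1_r2(w))
```

For the words in this message it reproduces the regions given above, for example R1 = "ceed" for "succeed" and R1 = "g" for "tying".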


Did I miss anything within the rules?




Kind Regards
Jannik Zschiesche


PS: I hope you won't receive this message twice, because I already sent it without being a subscriber.

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss <at> lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss
Olly Betts | 21 Jan 2012 07:10

Re: Minor mistakes in the english vocabulary

On Fri, Jan 20, 2012 at 06:15:58PM +0100, Jannik Zschiesche wrote:
> While testing the implementation against the english vocabulary I
> found some - what I think - mistakes. Please correct me, if I am
> wrong.
> (I used Porter2: http://snowball.tartarus.org/algorithms/english/stemmer.html)
> 
> In the vocabulary, there are the following transformations (and some
> more, but I don't want to flood you):

These cases are all in the exceptions list - see
http://snowball.tartarus.org/algorithms/english/stemmer.html and search
for the code which starts:

define exception1 as (

and:

define exception2 as (

The reasons for these exceptions are also covered there:

  The exception lists in the English stemmer are meant to be
  illustrative ('this is how it is done if you want to do it'), and were
  derived piecemeal. 

  a) The new stemmer improves on the Porter stemmer in handling short
  words ending e and y. There is however a mishandling of the four forms
  sky, skies, ski, skis, which is easily corrected by treating three of
  these words as special cases. 

  b) Similarly there is a problem with the ing form of three letter
  verbs ending ie. There are only three such verbs: die, lie and tie, so
  a special case is made for dying, lying and tying.

[...]

  e) The remaining words were included following complaints from users
  of the Porter algorithm. news is not the plural of new (noticed when
  IR systems were being set up for Reuters). Howe is a surname, and
  needs to be separated from how (noticed when doing a search for 'Sir
  Geoffrey Howe' in a demonstration at the House of Commons). succeed
  etc are not past participles, so the ed should not be removed (pointed
  out to me in an email from India). herring should not stem to her
  (another email from Russia).
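
As a rough sketch of how the two lists fit into the flow (Python, not the Snowball source): exception1 is consulted before any step runs, and exception2 is consulted after Step 1a. The dictionaries below hold only the words mentioned in this thread, not the full lists; `step_1a` is a simplified stand-in, and `later_steps` is a hypothetical placeholder for Steps 1b to 5.

```python
# Subset of exception1: whole-word replacements checked before any step.
EXCEPTION1 = {
    "skis": "ski", "skies": "sky", "sky": "sky",
    "dying": "die", "lying": "lie", "tying": "tie",
    "ugly": "ugli", "news": "news", "howe": "howe",
}

# Subset of exception2: words left unchanged once Step 1a has run.
EXCEPTION2 = {"succeed", "herring"}


def step_1a(word: str) -> str:
    """Simplified Step 1a (treats 'y' as a vowel, no apostrophe handling)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith(("ied", "ies")):
        # 'i' if preceded by more than one letter, otherwise 'ie'
        return word[:-2] if len(word) > 4 else word[:-1]
    if word.endswith(("us", "ss")):
        return word
    if word.endswith("s") and any(c in "aeiouy" for c in word[:-2]):
        return word[:-1]
    return word


def stem(word: str, later_steps=lambda w: w) -> str:
    """Apply the exception lists around Step 1a; `later_steps` is a placeholder."""
    if word in EXCEPTION1:
        return EXCEPTION1[word]
    if len(word) < 3:          # very short words are left alone
        return word
    word = step_1a(word)
    if word in EXCEPTION2:
        return word
    return later_steps(word)


for w in ("skies", "sky", "succeed", "succeeds", "tying", "ugly"):
    print(w, "->", stem(w))
```

With these lists in place the six cases from the original message come out as the vocabulary expects; "succeeds" works because Step 1a first reduces it to "succeed", which exception2 then protects.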

Cheers,
    Olly
Jannik Zschiesche | 24 Jan 2012 12:49

Re: Minor mistakes in the english vocabulary

Hi everyone,

Thanks for your answers.
Yes, I missed the most obvious spot: the list of exceptions.

I will implement them and check the vocabulary again.
I have found some differences in the German vocabulary too, but they might be (missed) exceptions as well.


@Robert Hafner:
Thank you for the offer, but I am working on a more general approach. I want to provide a foundation for implementations of all (listed) languages (mainly: en, es, de), and therefore I'd like to have a single library for all of them.


Kind Regards
Jannik Zschiesche

Robert Hafner | 24 Jan 2012 23:38

Re: Minor mistakes in the english vocabulary


@Martin: sure, feel free to!

@Jannik: I understand, and that makes sense, but I think you'd have better luck finishing a PHP backend for the Snowball compiler than porting each algorithm by hand. This is something I started before and would honestly like to pick up again once I have a bit more time. The Java backend seems like the best place to start, since PHP's OOP syntax is heavily based on Java's anyway (most of the work is cutting out type declarations and things like that).


Robert





Robert Hafner | 21 Jan 2012 11:30

Re: Minor mistakes in the english vocabulary



I have a working PHP port of the English one here if you'd like-






Martin Porter | 24 Jan 2012 12:37

Re: Minor mistakes in the english vocabulary

Olly, thanks for the further clarification.

Robert, I'll include a link to your PHP version (if I may) from
Snowball next time I do an update.

Martin

On Sat, Jan 21, 2012 at 10:30 AM, Robert Hafner <tedivm <at> tedivm.com> wrote:
>
>
> I have a working PHP Port of the english one here if you'd like-
>
> http://mortar.googlecode.com/svn/trunk/modules/Graffiti/classes/Stemmers/English.class.php
>
