Lars Aronsson | 2 Mar 00:36 2002

whitespace in search

If I search for "van eyck" (with the whitespace, but sans the quotes),
I get an ugly error message.  Searching for "eyck" actually solves my
particular problem for the moment, but the error message is still
ugly.  Is it really such a crime to want to search for a string that
includes the whitespace?  Perhaps this could be explained somewhat
more politely?

Also, if I search for "van\ eyck", the error message comes up and the
input box contains "van\\ eyck".  Maybe there is a potential risk for
a buffer overrun or something more exotic here?  I think the input
string should be quoted before being used in a regexp search.
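
A minimal sketch of the quoting suggested above, written in Python for illustration only; it is not the wiki's actual search code, and the function name literal_search is an assumption. Escaping the input first means a backslash or a dot in the query is matched literally instead of being interpreted as regexp syntax.

```python
import re


def literal_search(user_input, text):
    """Return True if user_input occurs in text, matched as a literal string.

    Hypothetical helper: re.escape() neutralises every regexp metacharacter
    in the user's input before it is compiled into a pattern.
    """
    pattern = re.compile(re.escape(user_input), re.IGNORECASE)
    return pattern.search(text) is not None
```

With this, a query such as "van\ eyck" is looked up as those exact characters, and a query such as "a.c" no longer matches "abc".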

-- 
  Lars Aronsson
  <lars@...>
  tel +46-70-7891609
  http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/

Jan Hidders | 7 Mar 17:48 2002

Re: whitespace in search

From: "Lars Aronsson" <lars@...>
>
> If I search for "van eyck" (with the whitespace, but sans the quotes),
> I get an ugly error message.

You may not realize it, but there is a little parser that parses your query
(you can now use and/or/not and brackets in your search query), and it took
quite some of my time to make this parser give you a more detailed error
message than just "syntax error". :-)

>  Is it really such a crime to want to search for a string that
> includes the whitespace?

No. The problem is that because of how the MySQL indexing works you cannot
do that anymore. For the same reason you also cannot search for anything
that is not a letter or words with less than 4 letters. I have plans on how
this might be solved, but that is not going to be easy.

> Also, if I search for "van\ eyck", the error message comes up and the
> input box contains "van\\ eyck".

That's a little bug. I forgot to unescape the letters that are escaped for
the URL. That will be fixed in the near future.

-- Jan Hidders

Lars Aronsson | 9 Mar 18:36 2002

Re: whitespace in search

Jan Hidders wrote:
> No. The problem is that because of how the MySQL indexing works you
> cannot do that anymore. For the same reason you also cannot search
> for anything that is not a letter or words with less than 4 letters.

Well, I *am* a programmer, but the minute that Julie Kemp finds out
about this, you're gonna be in big trouble.  :-)

This is a software problem that should not be exposed to the user.
The four letter limit is stupid, and should be raised to allow three
letter words.  If my search query contains a whitespace, each
separate word could still be sent to the MySQL search engine, and the
resulting hit lists can be joined so that pages containing both words
are listed ahead of pages that contain only one of the words.
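
The merging step described above could be sketched roughly as follows, in Python for illustration. The lookup callable is a hypothetical stand-in for whatever per-word search the backend offers; nothing here is taken from the actual wiki software.

```python
from collections import Counter


def ranked_hits(query, lookup):
    """Rank pages by how many of the query's words they contain.

    lookup(word) is an assumed callable returning the page titles that
    match a single word. Pages matching more words are listed first.
    """
    counts = Counter()
    for word in query.split():
        # set() so a page is counted at most once per query word
        for page in set(lookup(word)):
            counts[page] += 1
    # Most matched words first; ties broken alphabetically for stable output
    return sorted(counts, key=lambda page: (-counts[page], page))
```

For a query like "van eyck", a page matching both words would sort ahead of pages matching only "van" or only "eyck".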

-- 
  Lars Aronsson (lars@...)
  Aronsson Datateknik
  Teknikringen 1e, SE-583 30 Linuxköping, Sweden
  tel +46-70-7891609
  http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/

Jimmy Wales | 9 Mar 22:55 2002

Re: whitespace in search

Lars Aronsson wrote:
> This is a software problem that should not be exposed to the user.
> The four letter limit is stupid, and should be raised to allow three
> letter words.  If my search query contains a whitespace, each
> separate word could still be sent to the MySQL search engine, and the
> resulting hit lists can be joined so that pages containing both words
> are listed ahead of pages that contain only one of the words.

I concur.

In my own personal experience writing the search engine for Bomis and
the old fastcgi Wikipedia search engine, it is not particularly costly
to have on-the-fly scoring logic that's suitable for the problem at
hand.  Even some very ad hoc measures (scoring titles higher than the
body is very powerful for Wikipedia, for example... at Bomis, I have a
measure of "uselessness" for words that does a nice job of helping
relevancy over a raw keyword search) can be hugely beneficial in
having the search results cause joy in the searcher.

I personally wonder about the raw performance of MySQL as compared to
btree dbm files.  I could, in theory, write a perl script to go
through and index the current wikipedia database once per night, using
some of my "tricks of the trade", and make the search engine a lot
better than it is now.

However, I wonder if that's the right approach.  The downside to doing
it my old-fashioned way is that the search engine only updates as
often as I set the cron to update the search engine index.  If the
data is always "live" in the mysql database, and if it's just as fast,
then that's obviously the better way to do it.

Brion L. VIBBER | 10 Mar 03:22 2002

Re: whitespace in search

On Sat, 2002-03-09 at 09:36, Lars Aronsson wrote:
> Jan Hidders wrote:
> > No. The problem is that because of how the MySQL indexing works you
> > cannot do that anymore. For the same reason you also cannot search
> > for anything that is not a letter or words with less than 4 letters.
> 
> Well, I *am* a programmer, but the minute that Julie Kemp finds out
> about this, you're gonna be in big trouble.  :-)
> 
> This is a software problem that should not be exposed to the user.
> The four letter limit is stupid, and should be raised to allow three
> letter words.

Just three? What if I'm looking up the ancient city of "Ur" or the
fictional land of "Oz"? Or "Wen Ho Li"?

I'd rather there were no length limitation. (This can be adjusted by
recompiling mysql and re-indexing the database. Alternatively, as Jan
has suggested the index field can be munged such that two-character
words would be counted as 4 characters in the indexing.)

Rather, if we're going to eliminate "useless" search terms, we should
have a (per-language) list of such words.

>  If my search query contains a whitespace, each
> separate word could still be sent to the MySQL search engine, and the
> resulting hit lists can be joined so that pages containing both words
> are listed ahead of pages that contain only one of the words.

I concur.

Jimmy Wales | 10 Mar 18:28 2002

Re: whitespace in search

Brion L. VIBBER wrote:
> Rather, if we're going to eliminate "useless" search terms, we should
> have a (per-language) list of such words.

A useful and simple (though not perfect) measure of uselessness is how
many pages are returned for a given word.  In English, 'a', 'an' and
'the' will appear in nearly every article.  In Japanese, 'wa' and
other similar marker words will appear in nearly every article.

The more articles that are returned for a given search term, the less
informative it is.

--Jimbo

Tomasz Wegrzanowski | 10 Mar 19:31 2002

Re: whitespace in search

On Sun, Mar 10, 2002 at 09:28:46AM -0800, Jimmy Wales wrote:
> Brion L. VIBBER wrote:
> > Rather, if we're going to eliminate "useless" search terms, we should
> > have a (per-language) list of such words.
> 
> A useful and simple (though not perfect) measure of uselessness is how
> many pages are returned for a given word.  In English, 'a', 'an' and
> 'the' will appear in nearly every article.  In Japanese, 'wa' and
> other similar marker words will appear in nearly every article.

I'm wondering how search is going to work in Japanese.
Not only are some articles in romaji and others in kanji,
but kanji words usually aren't separated by whitespace,
so it might be a bit difficult.

> The more articles that are returned for a given search term, the less
> informative it is.

Only if they are not sorted.
We could just give less priority to more frequent words and more priority
to less frequent ones, or something like that. So, for example, a search for
"the foo" would score +1 point for every "the" and +10 for every "foo" (which
is 10x less frequent than "the"), and then sort the results by this score.
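
A rough sketch of this scoring idea in Python, for illustration only. The whitespace tokenizer, the function names and the in-memory pages dictionary are all assumptions; a real implementation would work against the database index. Each query word is weighted by how many times rarer it is than the most frequent query word, so in the example above "the" gets weight 1 and "foo" gets weight 10.

```python
from collections import Counter


def tokenize(text):
    # Naive assumption: lowercase and split on whitespace
    return text.lower().split()


def rank(query, pages):
    """Return page titles sorted by frequency-weighted score, best first.

    pages is an assumed dict mapping title -> article text.
    """
    corpus = Counter()
    per_page = {}
    for title, text in pages.items():
        per_page[title] = Counter(tokenize(text))
        corpus.update(per_page[title])

    # Keep only query words that occur somewhere in the corpus
    terms = {t for t in tokenize(query) if corpus[t] > 0}
    if not terms:
        return []

    most_common = max(corpus[t] for t in terms)
    # Rarer words get proportionally larger weights
    weights = {t: most_common / corpus[t] for t in terms}

    scores = {
        title: sum(w * counts[t] for t, w in weights.items())
        for title, counts in per_page.items()
    }
    hits = [title for title, score in scores.items() if score > 0]
    return sorted(hits, key=lambda title: (-scores[title], title))
```

Pages that contain none of the query words are dropped, and a page with a single "foo" can outrank a page with several occurrences of "the".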

Axel Boldt | 10 Mar 03:53 2002

Re: whitespace in search

Jimbo, the tricks that you mention (weighing titles more than text,
weighing rare words more than common ones, stopwords etc.) are already
used by mysql's fulltext index; I don't think it would make sense to
reinvent the wheel, especially since we now have real-time searching
and boolean searches, both of which are kind of nice. I am also pretty
sure that the mysql index is reasonably fast, being written in C.
They explain a bit about it at
http://www.mysql.com/doc/F/u/Fulltext_Search.html

Regarding the three letter limit: since we right now already parse
the search string for AND, OR and NOT anyway, it should be pretty easy
to remove short words from the search string, start the query without them,
and then later report the results with a warning like
  The short word "the" was ignored.
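
As a rough illustration of that idea, here is a Python sketch that works on the raw query string. The names strip_short_words and OPERATORS are hypothetical, and the sketch deliberately ignores brackets and any operator left dangling after a word is removed; a real version would more likely drop the words inside the existing query parser.

```python
MIN_WORD_LEN = 4  # minimum indexed word length mentioned in this thread
OPERATORS = {"and", "or", "not"}  # boolean keywords the parser already handles


def strip_short_words(query, min_len=MIN_WORD_LEN):
    """Split a query into (cleaned query, list of warnings).

    Words shorter than min_len are removed and reported; boolean
    operators are always kept, whatever their length.
    """
    kept, warnings = [], []
    for token in query.split():
        if token.lower() in OPERATORS or len(token) >= min_len:
            kept.append(token)
        else:
            warnings.append(f'The short word "{token}" was ignored.')
    return " ".join(kept), warnings
```

The cleaned query would be sent to the index, and the warnings shown above the results.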

Axel

Jimmy Wales | 10 Mar 18:37 2002

Re: Re: whitespace in search

Axel Boldt wrote:
> Jimbo, the tricks that you mention (weighing titles more than text,
> weighing rare words more than common ones, stopwords etc.) are already
> used by mysql's fulltext index;

I was not aware of that!  So, very good then!

> Regarding the three letter limit: since we right now already parse
> the search string for AND, OR and NOT anyway, it should be pretty easy
> to remove short words from the search string, start the query without them,
> and then later report the results with a warning like
>   The short word "the" was ignored.

I would say that this is what we should do, at a bare minimum, following
the principle of least astonishment.  Still, what if there are legitimate
2 letter searches, like for "Ur" as someone else pointed out?

--Jimbo

Axel Boldt | 10 Mar 20:15 2002

Re: whitespace in search

>Still, what if there are legitimate
>2 letter searches, like for "Ur" as someone else pointed out?

The only way to make mysql index shorter words is by recompiling and
setting MIN_WORD_LEN to a different number. They claim that changing
this variable from the current value of 4 to 0 will enlarge the index
by a factor of 20. So maybe we should go with 2. After that, the
indexes have to be rebuilt.

Axel

Jimmy Wales | 11 Mar 03:26 2002

Re: Re: whitespace in search

Axel Boldt wrote:
> The only way to make mysql index shorter words is by recompiling and
> setting MIN_WORD_LEN to a different number. They claim that changing
> this variable from the current value of 4 to 0 will enlarge the index
> by a factor of 20. So maybe we should go with 2. After that, the
> indexes have to be rebuilt.

Wow, factor of 20!  That's pretty severe.  2 seems good enough for now.

I'll try this out in a few days time.

