Sent: Sunday, November 30, 2008 3:22 AM
To: Peter Constable; LTRU list; CLDR Users
Subject: RE: [Ltru] [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)
"Peter Constable" <petercon <at> microsoft.com> wrote:
> The need for a language hierarchy (by families) is to simplify the search...
>
> An informal suggestion: while Ethnologue is not formally part of ISO 639, it is maintained so as to stay
> consistent with ISO 639, and ISO 639-3 makes use of Ethnologue as a source to clarify the denotation of its encoded
> categories. Since the Ethnologue site provides a comprehensive language-family classification, one could search on
> the Ethnologue site to find particular languages, and then follow the links provided to get to the corresponding
> ISO 639-3 entry.
That's exactly the kind of reason why we need such a classification ALSO in languages other than English. But
without a reliable codification of the families, of their hierarchy (at least a minimal classification into the most
important groups, possibly excluding finely tuned intermediate subdivisions), and, more importantly, of the
membership of the isolated languages and macrolanguages that are direct children of those families, building such a
hierarchy and making it usable is illusory.
Anyway, the fact that families ARE encoded in addition to languages, and the fact that families are themselves
arranged in a hierarchy, creates a hole that must be filled between families and languages. Filling it will clean up
the mess that was introduced in ISO 639-1/2 when exclusive (and unstable) family names were given, with varying and
non-interoperable results as to which languages get included or not in a search by family name.
Believe me, searching for terms within a complete language family, rather than within a precise language or even
just a macrolanguage, is not a far-fetched situation. Linguists do this very often, notably when looking for
etymologies; translators also look for the translated terms that were chosen in other related languages;
terminologists and advertisers or "brand builders" want to look for terms across families to check whether a newly
chosen term for a given language may be misinterpreted by less qualified translators or by readers of another
language.
Yes, it's true that encoded texts should never be tagged and indexed directly by a language family code. But family
codes are as essential as language codes for full-text searches.
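To make the search case concrete, here is a minimal sketch of how a family code could be expanded into individual
language codes before running a full-text query. The CHILDREN table is a tiny hand-written excerpt for illustration
only (it is not an official ISO 639 data file), and the expand() and search() helpers are hypothetical names. The
documents themselves stay tagged with individual language codes only.

```python
# Hypothetical, hand-written excerpt of a family hierarchy (illustration only,
# not an official ISO 639 data file). Codes with no entry are treated as leaves.
CHILDREN = {
    "ine": ["gem", "roa"],          # Indo-European -> Germanic, Romance
    "gem": ["eng", "deu", "nld"],   # Germanic -> English, German, Dutch
    "roa": ["fra", "spa", "ita"],   # Romance -> French, Spanish, Italian
}


def expand(code):
    """Return the sorted individual language codes reachable from `code`."""
    members = set()
    seen = set()
    stack = [code]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        kids = CHILDREN.get(current)
        if kids is None:
            members.add(current)    # no children listed: an individual language
        else:
            stack.extend(kids)      # a family: descend into its children
    return sorted(members)


def search(documents, term, code):
    """Return ids of documents whose language falls under `code` and whose text contains `term`."""
    languages = set(expand(code))
    needle = term.lower()
    return sorted(
        doc_id
        for doc_id, (language, text) in documents.items()
        if language in languages and needle in text.lower()
    )


documents = {
    1: ("eng", "water"),
    2: ("deu", "Wasser"),
    3: ("fra", "eau"),
}
print(expand("gem"))
print(search(documents, "wa", "gem"))
```

The point of the sketch is only that the family code appears in the query, never in the tags of the indexed texts.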
In addition, it is ESSENTIAL that the labels displayed when selecting any collective code from a list containing
ISO 639 codes of various scopes reflect the fact that this is effectively a collection of distinct languages
(so, no more labels that just display "Apache" or "Bihari").
That's exactly the reverse of the decision that CLDR made, and I do think that this is an error (on the other hand,
I support the decision to drop the word "(Other)"). If a short name is needed (without any plural mark and
without the word "languages" that generally comes with the language adjective), it should be encoded as a separate
variant in CLDR: this short name should be used only when displaying filtered lists that contain only collections.
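The labelling rule suggested above can be sketched as follows. The NAMES table, its "long"/"short" keys and the
label() helper are assumptions made up for this illustration; they do not reflect how CLDR actually stores its
data.

```python
# Hypothetical name table (illustration only, not CLDR's actual data layout).
# Collective codes carry a full name plus an optional short variant.
NAMES = {
    "apa": {"scope": "collective", "long": "Apache languages", "short": "Apache"},
    "bih": {"scope": "collective", "long": "Bihari languages", "short": "Bihari"},
    "fra": {"scope": "individual", "long": "French"},
}


def label(code, listed_codes):
    """Pick the label for `code` when it is shown in a list made of `listed_codes`."""
    entry = NAMES[code]
    only_collections = all(NAMES[c]["scope"] == "collective" for c in listed_codes)
    # The short variant is safe only when every item in the list is a collection.
    if entry["scope"] == "collective" and only_collections and "short" in entry:
        return entry["short"]
    return entry["long"]


print(label("apa", ["apa", "bih", "fra"]))   # mixed list
print(label("apa", ["apa", "bih"]))          # filtered list of collections only
```

In a mixed list the collective code keeps its full name, so it cannot be mistaken for an individual language.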
Note that isolated short language names are generally nouns, but if they are used as a complement in an expression
containing "language(s)", then they are adjectives and may be written differently (sometimes not even with the same
words, although in general the adjectives are simple derivations that still need some changes to mark the plural,
the feminine or the genitive case, depending on the language used to name the referenced language).
Note also that some English names/descriptions used by ISO 639, in the RFC 4645bis draft, or in the IANA database
for BCP 47 may contain some non-capitalizable letters. ISO 639 is wrong about these letters whenever it uses
ASCII punctuation like "!" and math symbols like "/", "//" or "=/", or an ASCII apostrophe, instead of the true Latin
click letters, or drops the apostrophe letters in a way that makes the language name ambiguous or unreadable. (Note
also that the Ethnologue lists, for some of these names but not all of them, synonyms that use only capitalizable
letters.)
The ISO 639 documents say that they are themselves normally encoded in UTF-8 (possibly using numeric character
entities in the plain-text version), meaning that these documents should support Unicode characters and should not
use any ASCII substitutes... This is also true for the HTML version displayed online on the ISO 639/RA sites
(including on SIL.org), and for the language names that were finally used in the English locale of the CLDR!
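As an illustration of the ASCII substitutes discussed above, here is a minimal sketch that maps them back to the
Unicode click letters U+01C0..U+01C3. The pairing of each ASCII form with a click letter is my assumption based on
visual resemblance, and restore_clicks() is a hypothetical helper. It is deliberately naive: it would also rewrite a
"/" or "!" that is legitimate punctuation, so it should only be applied to names already known to contain clicks.

```python
# Assumed pairing of ASCII substitutes with Unicode click letters (by resemblance).
# Longer substitutes come first so "=/" and "//" win over a single "/".
SUBSTITUTES = [
    ("=/", "\u01C2"),   # LATIN LETTER ALVEOLAR CLICK
    ("//", "\u01C1"),   # LATIN LETTER LATERAL CLICK
    ("/", "\u01C0"),    # LATIN LETTER DENTAL CLICK
    ("!", "\u01C3"),    # LATIN LETTER RETROFLEX CLICK
]


def restore_clicks(name):
    """Replace ASCII click substitutes in `name` with the Unicode click letters."""
    out = []
    i = 0
    while i < len(name):
        for ascii_form, letter in SUBSTITUTES:
            if name.startswith(ascii_form, i):
                out.append(letter)
                i += len(ascii_form)
                break
        else:
            # No substitute starts here: copy the character unchanged.
            out.append(name[i])
            i += 1
    return "".join(out)


for ascii_name in ["!Xóõ", "/Xam", "//Gana", "=/Hua"]:
    print(ascii_name, "->", restore_clicks(ascii_name))
```

Since the click letters are ordinary Unicode characters, the restored names can be stored directly in a UTF-8
document without any ASCII stand-in.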