verdy_p | 27 Nov 19:54

Re: [OT] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

"Doug Ewell" wrote:
> Warning: this is completely OT for the Unicode list.  Future discussion 
> should be on the LTRU list (ltru <at> ietf.org) or CLDR list 
> (cldr-users <at> unicode.org) as appropriate.

You have just replied to the Unicode list yourself (even though I was replying to you with a CC to the CLDR list...)

> "verdy underscore p" <verdy underscore p at wanadoo dot fr> wrote:
> 
> > If only we could have some access to ISO 639-5 data (for managing the 
> > language families instead of using the historic and badly designed 
> > language collections of ISO 639-1 (code [bh] only) and ISO 639-2...
> 
> I wish the ISO 639-5 Registration Authority, which is the same as that 
> for ISO 639-2 (Library of Congress), would set up an official 639-5 Web 
> site.  It has been a long time coming.

Well, I am still waiting (sorry, my interest in the subject is mostly personal, although I could make use of it
professionally; I can't afford to buy a copy of the published paper myself, as it is too expensive for me).

> I don't agree with characterizing 639-1 and 639-2 as "badly designed." 
> They were designed for different purposes.

Apparently not. Your description just indicates that 639-5 is effectively continuing the 639-2 (and 639-1 for
Bihari) model, and does not create what was expected (a comprehensive hierarchy similar to the Ethnologue); in
addition, 639-5 is now incompatible with 639-2 and 639-1, making it mostly unusable within the RFC 4645/4646
bis framework. For me, this means that 639-5 is already a dead standard before its publication, unless the

Kent Karlsson | 27 Nov 21:06

Re: [OT] Re: Support of ISO 639 (was: Survey Tool pre-alpha)


On 2008-11-27 19.54, "verdy_p" <verdy_p <at> wanadoo.fr> wrote:

> Apparently not. Your description just indicates that 639-5 is effectively
> continuing the 639-2 (and 639-1 for
> bihari) model, and does not create what was expected (a comprehensive
> hierarchy similar to the Ethnologue); in

I'm not sure exactly who expected that (besides yourself).

> addition, the 639-5 is now incompatible with 639-2 and 639-1, making it mostly

-2 apparently keeps the "(other)" interpretation, while -5 does not. LTRU
has adopted the -5 interpretation (inclusive), otherwise the collection
codes would stand for nothing.

> unusable within the RFC 4645/4646

The use of collection codes in language tags is dubious, like saying
"it's language group so-and-so, but information about individual language is
not available".

[...] 
> There's absolutely no integration. Or more exactly, it does not create the
> encoding framework that would allow the
> effective creation of a comprehensive hierarchy of language families.

As Randy said, out of scope for LTRU.

> It just says that they are just added as possible subtags, usable as prefixes,

verdy_p | 28 Nov 17:29

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

"Kent Karlsson" <kent.karlsson14 <at> comhem.se> wrote:
> > It just says that they are just added as possible subtags, usable as prefixes,
> > but immediately, the included list
> > of tags make these combinations of a collection subtag plus a language subtag
> 
> This is for some of the so-called macrolanguage codes. While macrolanguage
> codes are (informally) like collection codes, they are "special" collections
> (in a particular way), and they are not formally collection codes.

I must have read the current RFC 4645bis draft more closely than you: there ARE combinations of a collection code
and a language code. They are listed ONLY for the collection code [sgn] (Sign languages), within the proposed
registry as full "Tag:" elements, rather than just "Subtag-Type:" elements.

And I was NOT speaking about the case of macrolanguages (that are already correctly handled in ISO 639-3, and well
integrated in RFC 4645bis, except a few differences that should be corrected to match what ISO 639-3 indicates, but
anyway, the RFC 4647 already contains the statements needed to avoid or correct these small discrepancies).

Note: my message was not meant to be off-topic. I posted it to the LTRU list at the suggestion of Doug Ewell, the
signing author of the drafts for RFC 4645bis and 4646bis. Initially it was a post on the Unicode list (then Doug
Ewell suggested that I forward it to the Unicode CLDR list as well), which explains the "[OT]" label that remained
and the title of this thread (related to the "CLDR Survey Tool" and to its list discussing its current use of the
RFC 4645-4647 series).

Peter Constable | 29 Nov 23:44

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

As I've explained in the past on this list and various other loci, the exclusive nature of some collections
in 639-2 has all along been a problem because of the dynamic nature of the standard: not only are the
denotations fuzzy, they are unstable. Broadening the scope of those existing collections does not
introduce any compatibility issues with existing applications, given that conforming applications
can continue to treat certain collections as exclusive in that given application context if desired, and
it eliminates the general problem of instability.
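
To illustrate the point about instability, here is a minimal Python sketch. The membership set, the two code-table "versions", and the `denotes` helper are all hypothetical, invented for illustration; none of this is taken from ISO 639 or from the registry. It only shows how an exclusive ("other") reading makes a collection's denotation depend on which languages currently have their own code, while an inclusive reading does not.

```python
# Hypothetical data for illustration only; not taken from ISO 639.
FAMILY_MEMBERS = {"gem": {"English", "German", "Swabian"}}

# Two imaginary versions of a code table: languages with their own code.
CODED_V1 = {"English", "German"}
CODED_V2 = CODED_V1 | {"Swabian"}  # a later version adds one more language


def denotes(collection, language, coded, exclusive=False):
    """Return True if `collection` covers `language`.

    Inclusive reading: every member of the family is covered.
    Exclusive ("other") reading: only members lacking their own code.
    """
    if language not in FAMILY_MEMBERS.get(collection, set()):
        return False
    if exclusive:
        return language not in coded
    return True


for version, coded in (("v1", CODED_V1), ("v2", CODED_V2)):
    print(version,
          "exclusive:", denotes("gem", "Swabian", coded, exclusive=True),
          "inclusive:", denotes("gem", "Swabian", coded))
```

Under the exclusive reading the answer for the same language flips between the two versions; under the inclusive reading it stays the same. An application that still wants the exclusive behaviour can keep passing `exclusive=True` in its own context.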

CLDR and other applications of ISO 639 may experience a one-time change in the name data published with ISO
639, but it is known that ISO 639 can and not infrequently does make name changes, and that applications
must allow for that. So, this is not an exceptional problem.

Also, adding new categories to replace the existing ones is incredibly destabilizing, with
compatibility impact on all applications.

Thus, I strongly disagree with Philippe.

Peter

From: cldr-users-bounce <at> unicode.org [mailto:cldr-users-bounce <at> unicode.org] On Behalf Of verdy_p

[snip]

I remain convinced that the unexpected change of scope for most collections of ISO 639-1/2 (where their exclusive
scope was also very fuzzy, undetermined across versions of ISO 639-1/2 that constantly reduced their scope) when
converted to ISO 639-5 is a defect. And that for the future, a comprehensive list containing only inclusive
families coded distinctly from the old exclusive codes would have been better:

...
Under this scheme, there would have been fewer problems in the CLDR collections (that were updated inconsistently

Kent Karlsson | 28 Nov 18:37

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)


On 2008-11-28 17.29, "verdy_p" <verdy_p <at> wanadoo.fr> wrote:

> there ARE combinations of a collection code and a
> language code, they are listed ONLY for the collection code [sgn] (Sign
> languages), within the proposed registry as

Yes, sgn was included in that LTRU compromise. Still not sure why it was.

> full "Tag:" elements, rather than just "Subtag-Type:" elements.

No, it goes via the "Type: extlang" and "Prefix: sgn" mechanisms.

> And I was NOT speaking about the case of macrolanguages (that are already
> correctly handled in ISO 639-3, and well
> integrated in RFC 4645bis, except a few differences that should be corrected to
> match what ISO 639-3 indicates, but

No, the compromise does not handle all macrolanguages the same. Only some
were selected as extlang prefixes.

> anyway, the RFC 4647 already contains the statements needed to avoid or
> correct these small discrepancies).
[...]

> I remain convinced that the unexpected change of scope for most collections of
> ISO 639-1/2 (where their exclusive
> scope was also very fuzzy, undetermined across versions of ISO 639-1/2 that
> constantly reduced their scope) when
> converted to ISO 639-5 is a defect. And that for the future, a comprehensive

verdy_p | 28 Nov 22:31

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

> From: "Kent Karlsson" <kent.karlsson14 <at> comhem.se>
> To: "verdy_p" <verdy_p <at> wanadoo.fr>, "Doug Ewell" <doug <at> ewellic.org>
> Cc: "LTRU list" <ltru <at> ietf.org>, "CLDR Users" <cldr-users <at> unicode.org>
> Subject: Re: [Ltru] [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)
> 
> On 2008-11-28 17.29, "verdy_p" <verdy_p <at> wanadoo.fr> wrote:
> 
> > there ARE combinations of a collection code and a
> > language code, they are listed ONLY for the collection code [sgn] (Sign
> > languages), within the proposed registry as
> 
> Yes, sgn was included in that LTRU compromise. Still not sure why it was.
> 
> > full "Tag:" elements, rather than just "Subtag-Type:" elements.
> 
> No, it goes via the "Type: extlang" and "Prefix: sgn" mechanisms.

That's not what draft 07 of RFC 4645bis says. It clearly states that it is a collection, not a macrolanguage, but
included there to be treated like macrolanguages whose subtags are used in FULL "Tag:" elements for entries related
to "Type: Redundant"... This is the only exception made that allows a collection to be used in locale tags.

And I've NEVER said that language collection codes should be part of locale tags. My need for a comprehensive
hierarchy is for something else: organizing long lists of languages (and their many synonyms, including those in
languages other than the English used in BCP 47).

Peter Constable | 30 Nov 00:17

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

From: cldr-users-bounce <at> unicode.org [mailto:cldr-users-bounce <at> unicode.org] On Behalf Of verdy_p
Sent: Friday, November 28, 2008 1:31 PM

The need for a language hierarchy (by families) is to simplify the search...

An informal suggestion: while Ethnologue is not formally part of ISO 639, it is maintained so as to stay
consistent with ISO 639, and ISO 639-3 makes use of Ethnologue as a source to clarify the denotation of its
encoded categories. Since the Ethnologue site provides a comprehensive language-family
classification, one could search on the Ethnologue site to find particular languages, and then follow
the links provided to get to the corresponding ISO 639-3 entry.

For example, starting at the Ethnologue language-family index
(http://www.ethnologue.com/family_index.asp), you can follow the link for the Iroquoian family to
get the complete hierarchy of Iroquoian languages, then select the link for (say) Mohawk to get the entry
for that language, and then from there follow the link to get to the entry on the ISO 639-3 site for "moh".

Caveat: as with any hierarchical language classification, the classification used by Ethnologue is one
of several possible analyses (Ethnologue primarily follows the _International Encyclopaedia of
Linguistics_), and not all experts would necessarily posit the same hierarchy, though most linguists
would likely be somewhat familiar with that analysis and find it reasonably workable for searching by
language-family hierarchy.
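
The navigation described above (family index, then family, then language, then code) amounts to walking a tree. A minimal Python sketch follows. The `TREE` fragment is hand-written for illustration and is an assumption, not data copied from Ethnologue or ISO 639-3; the intermediate group names and the helper names are hypothetical as well.

```python
# Hand-written, illustrative fragment of a family tree. Inner dicts are
# groups; string leaves are language codes. Not copied from any source.
TREE = {
    "Iroquoian": {
        "Northern Iroquoian": {
            "Mohawk": "moh",
            "Oneida": "one",
        },
        "Southern Iroquoian": {
            "Cherokee": "chr",
        },
    },
}


def find_paths(tree, name, path=()):
    """Yield (path, code) for every language leaf called `name`."""
    for key, value in tree.items():
        here = path + (key,)
        if isinstance(value, dict):
            yield from find_paths(value, name, here)
        elif key == name:
            yield here, value


def leaf_codes(tree):
    """Collect the codes of all languages below a node, depth-first."""
    codes = []
    for value in tree.values():
        if isinstance(value, dict):
            codes.extend(leaf_codes(value))
        else:
            codes.append(value)
    return codes


for path, code in find_paths(TREE, "Mohawk"):
    print(" > ".join(path), "->", code)
print(leaf_codes(TREE["Iroquoian"]))
```

The same `leaf_codes` walk is what a search "by family" would need: it expands one family node into the individual language codes to query.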

Peter
verdy_p | 30 Nov 12:22

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

"Peter Constable" <petercon <at> microsoft.com>
> The need for a language hierarchy (by families) is to simplify the search...
> 
> An informal suggestion: while Ethnologue is not formally part of ISO 639, it is maintained so as to stay
> consistent with ISO 639, and ISO 639-3 makes use of Ethnologue as a source to clarify the denotation of its encoded
> categories. Since the Ethnologue site provides a comprehensive language-family classification, one could search on
> the Ethnologue site to find particular languages, and then follow the links provided to get to the corresponding
> ISO 639-3 entry.

That's exactly the kind of reason why we need such a classification ALSO in languages other than English. But without
a reliable codification of families, of their hierarchy (at least a minimal classification into the most important
groups, possibly excluding finely tuned intermediate subdivisions), and more importantly of the membership of
isolated languages and macrolanguages that are direct children of those families, building such a hierarchy and
making it usable is illusory.

Anyway, the fact that families ARE encoded in addition to languages, and the fact that families are hierarchized
as well, creates a hole that must be filled between families and languages (this will clean up the mess that was
introduced in ISO 639-1/2 when exclusive (and unstable) family names were given, with various and non-interoperable
results about which languages get included or not in a search of results by family names).

Believe it, searching for terms within a complete language family rather than precise language name or

Peter Constable | 1 Dec 05:48

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

> Anyway, the fact that families ARE encoded in addition to languages,
> and the fact that families are hierarchized as well, creates a hole
> that must be filled between families and languages...

If you mean that there is action needed on the part of the ISO 639 RAs or JAC, then that is out of scope for this WG
and should be taken up elsewhere (with those bodies).

If you mean that something needs to be changed in the language-subtag registry, or in 4646bis, then I don't
see any such need: ISO 639-3 provides very comprehensive coverage of languages, and there is not a lot of
likelihood that users would need to tag content for some language not covered by 639-3 but potentially
(perhaps not clearly) covered by one or more collective categories in 639-5. Also, the use case for
collective categories in language tagging is, I think, not that great.

> it is ESSENTIAL that the labels displayed when selecting any collective
> code from a list containing ISO 639 codes of various scopes MUST reflect
> the fact that this is effectively a collection of distinct languages

If you mean that it must be clear when a subtag in the Language Subtag Registry represents a collection, then
I completely agree.

> That's exactly the reverse decision that CLDR made...

That is an issue for the CLDR TC to consider and is out of scope for the LTRU WG.

Btw, can we please *not* cross-post items between LTRU and CLDR Users. (I've moved CLDR Users to bcc to that
end.) CLDR Users is an informal discussion list for users of CLDR, while LTRU is a technical working group:
IMO there cannot possibly be a discussion that would be appropriate for both lists.

> but ISO 639 is always wrong about these letters when it uses
> ASCII punctuation...

Mark Davis | 1 Dec 07:10

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

I agree with Peter, on all counts. 


In particular, it muddies the waters when you cross-post, because it becomes completely unclear what you are asking for, why, and from whom.

Mark


On Sun, Nov 30, 2008 at 20:48, Peter Constable <petercon <at> microsoft.com> wrote:
> Anyway, the fact that families ARE encoded in addition to languages,
> and the fact that families are hierarchized as well, creates a hole
> that must be filled between families and languages...

If you mean that there is action needed on the part of the ISO 639 RAs or JAC, then that is out of scope for this WG and should be taken up elsewhere (with those bodies).

If you mean that something needs to be changed in the language-subtag registry, or in 4646bis, then I don't see any such need: ISO 639-3 provides very comprehensive coverage of languages, and there is not a lot of likelihood that users would need to tag content for some language not covered by 639-3 but potentially (perhaps not clearly) covered by one or more collective categories in 639-5. Also, the use case for collective categories in language tagging is, I think, not that great.


> it is ESSENTIAL that the labels displayed when selecting any collective
> code from a list containing ISO 639 codes of various scopes MUST reflect
> the fact that this is effectively a collection of distinct languages

If you mean that it must be clear when a subtag in the Language Subtag Registry represents a collection, then I completely agree.


> That's exactly the reverse decision that CLDR made...

That is an issue for the CLDR TC to consider and is out of scope for the LTRU WG.

Btw, can we please *not* cross-post items between LTRU and CLDR Users. (I've moved CLDR Users to bcc to that end.) CLDR Users is an informal discussion list for users of CLDR, while LTRU is a technical working group: IMO there cannot possibly be a discussion that would be appropriate for both lists.


> but ISO 639 is always wrong about these letters when it uses
> ASCII punctuation...

Please address concerns regarding ISO 639 to the relevant RA.


Peter


-----Original Message-----
From: verdy_p [mailto:verdy_p <at> wanadoo.fr]
Sent: Sunday, November 30, 2008 3:22 AM
To: Peter Constable; LTRU list; CLDR Users
Subject: RE: [Ltru] [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

"Peter Constable" <petercon <at> microsoft.com>
> The need for a language hierarchy (by families) is to simplify the search...
>
> An informal suggestion: while Ethnologue is not formally part of ISO 639, it is maintained so as to stay
> consistent with ISO 639, and ISO 639-3 makes use of Ethnologue as a source to clarify the denotation of its encoded
> categories. Since the Ethnologue site provides a comprehensive language-family classification, one could search on
> the Ethnologue site to find particular languages, and then follow the links provided to get to the corresponding
> ISO 639-3 entry.

That's exactly the kind of reason why we need such a classification ALSO in languages other than English. But without
a reliable codification of families, of their hierarchy (at least a minimal classification into the most important
groups, possibly excluding finely tuned intermediate subdivisions), and more importantly of the membership of
isolated languages and macrolanguages that are direct children of those families, building such a hierarchy and
making it usable is illusory.

Anyway, the fact that families ARE encoded in addition to languages, and the fact that families are hierarchized
as well, creates a hole that must be filled between families and languages (this will clean up the mess that was
introduced in ISO 639-1/2 when exclusive (and unstable) family names were given, with various and non-interoperable
results about which languages get included or not in a search of results by family names).

Believe me, searching for terms within a complete language family, rather than under a precise language name or even
just a macrolanguage, is not an unrealistic scenario. Linguists do this very often, notably when looking for
etymologies; translators also look for translated terms that were chosen in other related languages;
terminologists and advertisers or "brand builders" want to search for terms across families to check whether a
newly chosen term for a given language may be misinterpreted by less qualified translators or readers of another
language.

Yes, it's true that encoded texts should never be tagged and indexed directly by a language family code. But family
codes are as essential as language codes for full-text searches.

In addition, it is ESSENTIAL that the labels displayed when selecting any collective code from a list containing
ISO 639 codes of various scopes MUST reflect the fact that this is effectively a collection of distinct languages
(so, no more labels that just display "Apache" or "Bihari").

That's exactly the reverse decision that CLDR made, and I do think that this is an error (on the other hand, I
support the decision to drop the word "(Other)"). If a short name is needed (without any plural mark and
without the word "languages" that generally comes with the language adjective), it should be encoded as a separate
variant in CLDR: this short name should be used only when displaying filtered lists that contain only collections.

Note that isolated short language names are generally nouns, but if they are used as a complement to an expression
containing "language(s)", then they are adjectives and may be written differently (sometimes not even with the same
words, although in general the adjectives are simple derivations that still need some changes to mark the
plural, feminine or genitive, depending on the language used to name the referenced language).

Note also that some English names/descriptions used by ISO 639 and in the RFC 4645bis draft or in the IANA database
for BCP 47 may contain some non-capitalizable letters, but ISO 639 is always wrong about these letters when it uses
ASCII punctuation like "!" and math symbols like "/", "//" or "=/" or the ASCII apostrophe instead of true Latin
click letters, or drops the apostrophe letters in a way that makes the language name ambiguous or unreadable; note
also that The Ethnologue lists, for some of them but not all of them, some synonyms using capitalizable letters only:

The ISO 639 documents say that they are themselves normally encoded with UTF-8 (possibly using numeric character
entities for the plain-text version), meaning that these documents should support Unicode characters and should not
use any ASCII substitutes... This is also true for the HTML version displayed online on the ISO 639/RA sites
(including on SIL.org), and for the language names that were finally used in the English locale of the CLDR!



_______________________________________________
Ltru mailing list
Ltru <at> ietf.org
https://www.ietf.org/mailman/listinfo/ltru

Doug Ewell | 1 Dec 06:08

Re: Support of ISO 639 (was: Survey Tool pre-alpha)

Peter Constable <petercon at microsoft dot com> wrote:

>> it is ESSENTIAL that the labels displayed when selecting any 
>> collective code from a list containing ISO 639 codes of various 
>> scopes MUST reflect the fact that this is effectively a collection of 
>> distinct languages
>
> If you mean that it must be clear when a subtag in the Language Subtag 
> Registry represents a collection, then I completely agree.

Draft-4645bis does this, as described in sections 2.2 and 2.3, and as 
seen in the included Registry:

%%
Type: language
Subtag: bh
Description: Bihari
Added: 2005-10-16
Scope: collection
%%

This does not mean that tags containing 'bh' or any other collection 
subtag would themselves carry any indication that the language subtag 
represents a collection, nor that any UI used to create language tags 
would necessarily identify collection subtags.  Someone could write a UI 
to do this, and it might be a nice feature, but we do not and cannot 
require this.
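
For anyone who does want to write such a UI, here is a minimal Python sketch that parses records in the format shown above and flags collections. It is an illustration only: the function names are hypothetical, the parser handles just simple "Field: value" lines and folded continuation lines, and appending "(collection)" to a label is one possible presentation choice, not something the draft requires.

```python
def parse_records(text):
    """Split registry text on '%%' lines and return one dict per record.

    Each field maps to a list of values, because a field such as
    Description may appear more than once in a record.
    """
    records = []
    current = {}
    last_field = None
    for line in text.splitlines():
        if line.strip() == "%%":
            if current:
                records.append(current)
            current = {}
            last_field = None
        elif line.startswith((" ", "\t")) and last_field:
            # Continuation line: fold it onto the previous field's value.
            current[last_field][-1] += " " + line.strip()
        elif ":" in line:
            name, value = line.split(":", 1)
            last_field = name.strip()
            current.setdefault(last_field, []).append(value.strip())
    if current:
        records.append(current)
    return records


def is_collection(record):
    """True when the record carries 'Scope: collection'."""
    return record.get("Scope", []) == ["collection"]


SAMPLE = """%%
Type: language
Subtag: bh
Description: Bihari
Added: 2005-10-16
Scope: collection
%%
"""

records = parse_records(SAMPLE)
label = records[0]["Description"][0]
if is_collection(records[0]):
    label += " (collection)"  # hypothetical UI convention
print(label)
```

Keeping every field as a list costs a little convenience for single-valued fields, but it means records with several Description lines need no special handling.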

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

Doug Ewell | 30 Nov 04:02

Re: Support of ISO 639 (was: Survey Tool pre-alpha)

Peter Constable <petercon at microsoft dot com> wrote:

>> The use of collection codes in language tags is dubious, like saying 
>> "it's language group so-and-so, but information about individual 
>> language is not available".
>
> I'm of the same general opinion.

So am I, when we are talking about the "traditional" uses of language 
tags, viz. tagging Web pages and e-mails, or specifying desired matches 
in Web search engines.

But there may be other applications of language tags which we haven't 
yet thought of, in which the sender could say, "This content is in some 
Austro-Asiatic language, but I don't know which one" and the receiver 
may be able to make some use of that knowledge.  I think it would be a 
mistake for us to assume that no such application exists or will exist. 
Indeed, we should hope that BCP 47 is so wonderful and so well designed 
that people find new uses for it.

Collection codes have not caused significant trouble in the past.  There 
has been no great epidemic of people tagging content as "Germanic 
(other)" when they should have chosen a subtag for a specific Germanic 
language instead, and I don't imagine the change in scope from exclusive 
to inclusive will bring on such an epidemic.  As with private-use 
subtags, I'd be opposed to making any change that would further 
discourage these.

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

Peter Constable | 1 Dec 05:17

Re: Support of ISO 639 (was: Survey Tool pre-alpha)

From: ltru-bounces <at> ietf.org [mailto:ltru-bounces <at> ietf.org] On Behalf Of Doug Ewell

>>> The use of collection codes in language tags is dubious, like saying
>>> "it's language group so-and-so, but information about individual
>>> language is not available".
>>
>> I'm of the same general opinion.
>
> So am I, when we are talking about the "traditional" uses of language
> tags, viz. tagging Web pages and e-mails, or specifying desired matches
> in Web search engines.
>
> But there may be other applications of language tags...

Which is why I included the "general" qualifier in my comment. I didn't mean to re-open this issue: we
debated this a long time ago, and the consensus taken was to continue to include collective language
categories in the registry. There's no reason to re-visit that.

Peter
Randy Presuhn | 29 Nov 05:06

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

Hi -

As co-chair...

> From: "verdy_p" <verdy_p <at> wanadoo.fr>
> To: "Kent Karlsson" <kent.karlsson14 <at> comhem.se>; "Doug Ewell" <doug <at> ewellic.org>
> Cc: "CLDR Users" <cldr-users <at> unicode.org>; "LTRU list" <ltru <at> ietf.org>
> Sent: Friday, November 28, 2008 1:31 PM
> Subject: Re: [Ltru] [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)
...
> And I've NEVER said that language collection codes should be part of locale tags. My need for a comprehensive
> hierarchy is for something else: organizing long lists of languages (and their many synonyms, including those in
> languages other than the English used in BCP 47).

This is outside the ltru WG's scope.

> The need for a language hierarchy (by families) is to simplify the search, detect name collisions in the synonyms
> or between their various localizations (and this is something that the CLDR project will need to address, because
> it contains many translated lists of language names), and the preferred name in a localization may collide with the
> preferred name in another language (or dialect) translation.

This is outside the ltru WG's scope.

> On French Wiktionary for example, the problem was detected and has caused lots of confusion, by causing lists of
> translations for some terms to become useless. And some of the chosen names (in French) were proven to be
> ambiguous, even though they were not in ISO 639 or in BCP 47.

This is outside the ltru WG's scope.

> The other problem is that languages could not be found, and several codes were allocated, most often the wrong ones
> or creating collisions with other language codes. This had to be fixed manually, but other things will be difficult
> to correct (for example words said to be in "Apache" or "Berber", which are collections and not macrolanguages, do
> not explicitly indicate the effective individual languages or even the macrolanguage in which they are written).

This is outside the ltru WG's scope.
If there is a need to request a new language code, collection code, 
macrolanguage code, etc., please take the matter to the appropriate
registration authority.

> When you want to fix this, the resolution takes much longer than the initial creation, because it requires lengthy
> verifications and disambiguating work. I think that the same problem occurs in every major library that wants to
> maintain a coherent index of their collected books and publications, or with companies providing translation
> services and trying to manage many large translation lists or repositories between various languages:

This is not our problem.

> Resolving conflicts and ambiguities is really difficult work that could have been avoided by allowing the correct
> and precise language to be selected the first time, by the person who best knows that language (but not
> necessarily its associated standard code), or who might forget or be unaware that some language names they know are
> perfect homonyms (and then ambiguous).

This is not our problem.  This is an application design issue.

> Finally the various persons knowing the same language (or a dialectal/orthographic variety of that same language)
> do not call it by the same name, for cultural or political reasons (Moldavian vs. Romanian is a perfect
> example, Alsatian vs. Alemannic vs. Swiss German is another example), and the need to display synonyms in addition
> to the "preferred" name becomes more important.

This is why multiple descriptions are permitted.
If you have need of a specific addition, the ietf-languages <at> iana.org 
list is the correct place to pursue the issue, not here.
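
As a rough illustration of how an application might use multiple descriptions, the Python sketch below builds a lookup from every description to the subtags that carry it, and reports names shared by more than one subtag. The records are made up for this example and are not copied from the registry; "qaa" and "qab" are placeholder subtags, and the function names are hypothetical.

```python
from collections import defaultdict

# Made-up records for illustration; not copied from the registry.
RECORDS = [
    {"Subtag": "ro", "Description": ["Romanian", "Moldavian"]},
    {"Subtag": "qaa", "Description": ["Example A", "Shared Name"]},
    {"Subtag": "qab", "Description": ["Example B", "Shared Name"]},
]


def build_name_index(records):
    """Map each case-folded description to the sorted subtags that use it."""
    index = defaultdict(set)
    for record in records:
        for name in record["Description"]:
            index[name.casefold()].add(record["Subtag"])
    return {name: sorted(tags) for name, tags in index.items()}


def find_collisions(index):
    """Return only the names that resolve to more than one subtag."""
    return {name: tags for name, tags in index.items() if len(tags) > 1}


index = build_name_index(RECORDS)
print(index["moldavian"])
print(find_collisions(index))
```

Whether and how to surface such collisions to users remains an application design decision.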

> For those working with Chinese (a macrolanguage encompassing many distinct individual oral languages), it is even
> more important to know what text really means in an individual language and how words are effectively used and
> understood (when the written and encoded orthography is not sufficient to allow the distinction); the problem with
> Chinese is that many of the individual languages are known by conflicting "synonyms" (and not the same in all
> languages):
> 
> Even if it makes no difference for encoding those texts, originally encoded with sinograms, it makes a huge
> difference when converting such text to exhibit the phonology (including for romanization purposes), or when
> trying to select translations, or when trying to perform full-text searches with consistent results.

Please give a specific example of the problem that concerns you here.
Several contributors on this list work with Chinese and are intimately
familiar with the issues there.

> A precise and correct identification of languages is then necessary, both for the readers/users of those texts, and
> for their creators (but this task is too difficult to do within a very huge list of 7000 "preferred" language names
> plus all their known synonyms). A comprehensive hierarchy can help narrow the search and help users become aware of
> possible conflicts or ambiguities without forcing them to read thousands of names. It also really helps to select
> which name to choose as the preferred one (for display in a list) to avoid future conflicts in a given translation
> of such a list of language names.

As a technical contributor: Non-linguists are generally quite unaware of language
classification hierarchies, or will look in the wrong place because their idea
of the hierarchy is different from a linguist's.  Two specific examples:  some
speakers of English think it is a Romance language based on the large number
of borrowings from those languages; some speakers of Vietnamese think their
language is Sino-Tibetan due to the large number of borrowings from Chinese.
Both are wrong.

As co-chair: this issue is outside the scope of the ltru working group.

> Hmmm... I'm not sure now which of the CLDR or LTRU lists is appropriate to discuss such things, so I'll
> still post this message to both lists. All of the above is related to localization in software interfaces (but
> neither BCP 47 nor the CLDR repository provides a way to manage language families correctly, and both contain
> just the strict minimum support needed for macrolanguages, ISO 639-3 providing more meaningful information for
> them, and only The Ethnologue provides the data needed for collections but without any proposed encoding, something
> that I had hoped that ISO 639-5 would have offered, and that ISO 639-6 for dialectal/orthographic varieties will
> not address at all).
...

As far as I can see, these concerns are outside the scope of the ltru working group.

Randy

verdy_p | 28 Nov 19:53

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

"Kent Karlsson" <kent.karlsson14 <at> comhem.se>
> > NB : For your information I have just built yesterday (temporarily: if it's
> > not acceptable there, I can remove it)
> > a easily navigatable and sortable version of the proposed registry that is
> > part of RFC 4645bis Draft 07 on
> > <URL:http://fr.wiktionary.org/wiki/Wiktionnaire:RFC_4645>, on a site that
> 
> (Well, note that RFC 4645 had no macrolanguage concept, nor covered the new
> codes in ISO 639-3. RFC 4645bis is what you should refer to.)

That's what I'm referring to throughout these pages, which indicate "bis"
explicitly. However, I did not give the article the title "RFC 4645bis",
thinking that this is not the definitive name for this draft RFC revision.
(I should correct the link that MediaWiki autogenerates in the middle of
the name "RFC 4645bis", which leaves "bis" separated in the rendered page,
by disabling this automatically generated link that still points to the
wrong version.)

> > currently needs a comprehensive list of
> > language families to allow searches of words across "similar" languages, and
> > an easy way to search for language
> > synonyms (or dialect names) localized in other languages than just English.
> > I've correctly stated that this version
> > is a draft with a publication date and the validity date, and a direct
> > reference to the draft text currently
> > published by the IETF.

My purpose is just to make the registry more easily searchable and
readable. I've not attempted to translate the "Description" fields, nor
even the "Comments" field; I've just translated the headers to French,
only because it is on the French Wiktionary. There are other pages that
are better tuned for information in French (including documenting and
linking language names in French with their definitions and alternative
orthographies for which references could be found): they are just there to
point to some URL that can be stable. It's a presentation that could also
be helpful for your HTML version of the draft.

For now the "bis" version is not fully released, so I don't think the page
name should adopt the "bis" name, unless it has been said that this will
be the final name. If the name changes, I don't want to redirect it again.

Will the "bis" be kept after release, or won't that be the same RFC number
(pointing directly to the revised text), or another RFC number?

Philippe.

Doug Ewell | 28 Nov 20:10

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

Just a quick comment on one point; I'll have to spend some time reading 
through all the others.

"verdy underscore p" <verdy underscore p at wanadoo dot fr> wrote:

> For now the "bis" version is not fully released, so I don't think the 
> page name should adopt the "bis" name, unless it has been said that 
> this will be the final name. If the name changes, I don't want to 
> redirect it again.
>
> Will the "bis" be kept after release, or won't that be the same RFC 
> number (pointing directly to the revised text), or another RFC number 
> ?

RFCs are never "updated" with the same number; they are superseded or 
replaced by a new RFC with a different number.  This is unlike ISO and 
other standards, which normally keep the same number through revisions.

"RFC 4645bis" means roughly "the Internet-Draft that is intended to 
supersede or replace RFC 4645."  If and when it is approved as an 
RFC, it will be assigned a new number, known only to the RFC Editor 
until the moment of publication.

So any page that describes RFC 4645bis, or reformats its content in a 
different way, should not be labeled "RFC 4645."  As you will see by 
reading RFC 4645, that is a completely different document.

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

_______________________________________________
Ltru mailing list
Ltru <at> ietf.org
https://www.ietf.org/mailman/listinfo/ltru
Randy Presuhn | 28 Nov 19:40

Re: [CLDR] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

Hi -

As co-chair -

I find very little in this thread germane to the ltru working group.
If you have *specific* technical objections to the current drafts
(which have completed WG last call - we're just waiting for the
 updates to appear so we can hand them off to the IESG)
then please provide replacement text which would address those
issues.  Unless those issues are "show stoppers" (so far it sounds
like they are not) we will probably not take any action on them.
Of course, you would be free to submit them as IETF last call
comments when we reach that stage.

Randy

Peter Constable | 29 Nov 23:30

Re: [OT] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

From: ltru-bounces <at> ietf.org [mailto:ltru-bounces <at> ietf.org] On Behalf Of Kent Karlsson

> -2 apparently keeps the "(other)" interpretation, while -5 does not. LTRU
> has adopted the -5 interpretation (inclusive), otherwise the collection
> codes would stand for nothing.

The JAC is discussing changes to -2 that would make -2 and -5 consistent, with all collections inclusive,
and allowing for conforming applications to recognize only subsets and to treat some collections as
exclusive (i.e., the "(other)" interpretation) in that application context.

> The use of collection codes in language tags is dubious, like saying
> "it's language group so-and-so, but information about individual language is
> not available".

I'm of the same general opinion.

Peter
Doug Ewell | 27 Nov 20:46

Re: [OT] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

"verdy underscore p" <verdy underscore p at wanadoo dot fr> wrote:

>> I don't agree with characterizing 639-1 and 639-2 as "badly 
>> designed."
>> They were designed for different purposes.
>
> Apparently not. Your description just indicates that 639-5 is 
> effectively continuing the 639-2 (and 639-1 for bihari) model, and 
> does not create what was expected (a comprehensive hierarchy similar 
> to the Ethnologue);

639-5 does include a hierarchy of sorts; it shows which language 
families or groups belong to other language families or groups.  It does 
not attempt to show which individual languages belong to which family or 
group, since it does not deal with individual languages at all.

> in addition, the 639-5 is now incompatible with 639-2 and 639-1, 
> making it mostly unusable within the RFC 4645/4646 bis framework).

I'm not sure what you mean by "incompatible."  There are differences 
between the code lists, but the 639-2/639-5 and 639-3 representatives 
have pledged to resolve them (though this does not happen quickly).

> For me, this means that 639-5 is already a dead standard before its 
> publication, unless the 639-2
> and 639-1 collections are completely removed, due to the changes that 
> occurred in 639-5.

To make sure I understand: You would propose to remove 'bih' and 'him', 
which are language collections appearing in 639-2, but keep the other 
collections introduced by 639-5 on the sole basis that they do not 
appear in 639-2?

The Introduction to 639-5 says, "This part of ISO 639 supplements the 
coding of language groups and language families in ISO 639-2."  The 
639-2 collections are part of 639-5; they do not form a conceptually 
different set from 639-5.

>>> Also I'm still waiting to see how ISO 639-5 can be integrated with 
>>> the RFC 4645bis and RFC 4646bis rules.
>>
>> This is clearly laid out in the two LTRU drafts:
>>
>> http://www.ietf.org/internet-drafts/draft-ietf-ltru-4646bis-18.txt
>> http://www.ietf.org/internet-drafts/draft-ietf-ltru-4645bis-07.txt
>> http://www.ewellic.org/rfc4645bis.html
>
> There's absolutely no integration. Or more exactly, it does not create 
> the encoding framework that would allow the effective creation of a 
> comprehensive hierarchy of language families.

This is not a goal of RFC 4646bis or any of its predecessors.

> It just says that they are just added as possible subtags, usable as 
> prefixes, but immediately, the included list of tags makes these 
> combinations of a collection subtag plus a language subtag accepted as 
> non-preferred aliases for the preferred tag consisting of the language 
> tag only (this is fine for me, as it effectively initiates the 
> organization as a hierarchy; however, the list is not complete enough 
> to organize the 639-3 list of macrolanguages and individual languages)

No, that's not correct.  It is not acceptable to write, say, "roa-fr" as 
an alias for "fr".

Macrolanguages are a completely different concept from language 
collections.  A macrolanguage is a single language name that is used in 
some contexts to refer to a group of related languages which are known 
by their individual names in other contexts.  For example, "Chinese" is 
a blanket term that is sometimes used to refer collectively to Mandarin, 
Cantonese, Wu, Min-Nan, and others.  In other cases, those languages are 
known by the individual names.

This is different from language collections.  Someone might say, "That 
document is written in Chinese," but nobody would say, "That document is 
written in Sino-Tibetan languages."  That is why 639-3 deals with 
macrolanguages -- indeed, they invented the term -- but does not deal 
with collections.  It is not because someone forgot to include them, or 
ignored part of the problem they set out to solve.

> Also I don't understand the need to create a prefix subtag for the 
> special language scopes of 639-3 (except as a categorization in a more 
> complete hierarchy). Well, this is not critical and does not affect my 
> own projects with them.

By "prefix subtag" I assume you are talking about a primary language 
subtag used with an extended language subtag, like "zh-cmn".

The reasons for extended language subtags are complex and the decision 
to include this feature was debated at great length, but the intended 
purpose is not to try to categorize languages into hierarchies.  Someone 
who wants to indicate that such-and-so content is in "Chinese" or 
"Mandarin," or search for content in "Chinese" or "Mandarin," has no 
need to point out that Mandarin is a member of the "Chinese languages" 
group, which in turn is a member of the "Sino-Tibetan languages" group.

So as you said, this is irrelevant to your project.
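
To make the "primary language subtag plus extended language subtag" shape 
concrete, here is a minimal Python sketch.  It assumes a deliberately 
reduced syntax (a language subtag, optionally followed by one extended 
language subtag) and a small hand-written prefix table.  The names 
split_tag and EXTLANG_PREFIX are invented for this example and are not 
taken from the drafts or from the registry.

```python
import re

# Reduced syntax for this sketch only: a 2-3 letter primary language
# subtag, optionally followed by one 3-letter extended language subtag.
# Script, region, variant and other subtags are deliberately ignored, so
# tags that use them are rejected here even if they are well-formed.
_TAG = re.compile(r"^(?P<language>[a-z]{2,3})(?:-(?P<extlang>[a-z]{3}))?$")

# Hypothetical, hand-written excerpt: extended language subtag -> the
# primary language subtag it may follow. Not copied from the registry.
EXTLANG_PREFIX = {"cmn": "zh", "yue": "zh"}


def split_tag(tag):
    """Return (language, extlang); extlang is None when absent."""
    match = _TAG.match(tag.lower())
    if match is None:
        raise ValueError("not a language[-extlang] tag: %r" % tag)
    language = match.group("language")
    extlang = match.group("extlang")
    if extlang is not None and EXTLANG_PREFIX.get(extlang) != language:
        raise ValueError("%r may not follow %r" % (extlang, language))
    return language, extlang


print(split_tag("zh-cmn"))
print(split_tag("zh"))
```

Nothing in the sketch builds or consults a hierarchy of families: the 
prefix table only records which primary language subtag an extended 
language subtag may follow.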

>> In brief, 639-5 code elements are simply added as more language 
>> subtags that represent language collections, just like existing 
>> subtags such as 'alg'.  This is very straightforward.
>
> If you say so... For me, this absolutely does not change what was 
> already in use, or attempted, without 639-5.

That's correct.  The addition of 639-5-based subtags to BCP 47 isn't 
intended to change the existing way of using these subtags.  It was 
intended to complete the existing, incomplete list.

> The 639-5 part solves absolutely no additional problem, but just 
> creates more confusion (due to its incompatibilities with 639-1/2).

You haven't demonstrated any incompatibility.

> I would have really hoped that those unstable collections of 639-1/2 
> were deprecated (grandfathered with no indication of a preferred new 
> code, due to the ambiguities, just like what has been done for 
> "cel-gaulish" in RFC 4645bis),

"Grandfathered" isn't the same as "deprecated."  Grandfathered means 
that the tag was registered under RFC 1766 or 3066, and does not follow 
the syntactic rules introduced in RFC 4646.  It has nothing to do with 
whether the tag is ambiguous, or should or should not be used.  That's 
why some grandfathered tags are also marked as deprecated, and others 
(like "cel-gaulish") are not.
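
Since the two properties are independent, a small Python sketch may make 
the distinction clearer.  The record layout and the names Record, 
is_grandfathered and is_deprecated are assumptions made for illustration, 
and the two sample records are hand-written rather than copied from the 
registry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    tag: str
    kind: str                          # e.g. "grandfathered"
    deprecated: Optional[str] = None   # a date string when deprecated


# Illustrative records only; the date below is a placeholder value.
RECORDS = [
    Record("cel-gaulish", "grandfathered"),
    Record("art-lojban", "grandfathered", deprecated="2003-09-02"),
]


def is_grandfathered(record):
    # Describes how the tag entered the registry and its syntax.
    return record.kind == "grandfathered"


def is_deprecated(record):
    # A separate question: is use of the tag discouraged?
    return record.deprecated is not None


for record in RECORDS:
    print(record.tag, is_grandfathered(record), is_deprecated(record))
```

Both sample tags answer "yes" to the first question, but only one of them 
answers "yes" to the second.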

> and that new codes were assigned to the codes that were changed to be 
> inclusive and built according to serious definitions (e.g., the 
> exclusive [ine] collection of 639-2 would have been grandfathered, and 
> a new code added in 639-5 for the inclusive Indo-european family).

(See above for the real meaning of "grandfathered.")

In RFC 4646bis and 4645bis, the existing subtags based on 639-2 
collections will take the 639-5 names.  This has the effect, for 
example, that the meaning of 'ine' will be broadened from "Indo-European 
(Other)" to "Indo-European languages."  This is explained in 
draft-4645bis, Section 2.3.

> You forget that I had hoped for 639-5 to be made interoperable with 
> 639-1/2 and so with the RFC 4645/4646 framework. For me it has 
> completely failed to do this, and I can predict that the integration of 
> 639-5 in RFC 4645/4646 will fail.

I don't follow this argument.

> But I hope that 639-5 will be rapidly corrected to solve what I 
> consider as severe bugs (or lack of analysis and decisions about its 
> policies, stability and compatibility with the rest of the 639 
> standard).

639-5 simply needs to be corrected to remove the small inconsistencies 
with other parts of 639.  The overall architecture is fine, and does not 
need to be revised.

> (and I can understand now why The Ethnologue has chosen to NOT use or 
> display the 639-5 "collection" codes for its language families !)

Everyone has their own ideas about how languages should be categorized. 
This is well known to be an academic gray area, and arguing over whose 
categorizations are "right" and whose are "wrong" is well known to be an 
academic black hole.

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

Randy Presuhn | 27 Nov 20:21

Re: [OT] Re: Support of ISO 639 (was: Survey Tool pre-alpha)

Hi -

> From: "verdy_p" <verdy_p <at> wanadoo.fr>
> To: "Doug Ewell" <doug <at> ewellic.org>
> Cc: "CLDR Users" <cldr-users <at> unicode.org>; "LTRU list" <ltru <at> ietf.org>
> Sent: Thursday, November 27, 2008 10:54 AM
> Subject: Re: [Ltru] [OT] Re: Support of ISO 639 (was: Survey Tool pre-alpha)
...
> > > Also I'm still waiting to see how ISO 639-5 can be integrated with the 
> > > RFC 4645bis and RFC 4646bis rules.
> > 
> > This is clearly laid out in the two LTRU drafts:
> > 
> > http://www.ietf.org/internet-drafts/draft-ietf-ltru-4646bis-18.txt
> > http://www.ietf.org/internet-drafts/draft-ietf-ltru-4645bis-07.txt
> > http://www.ewellic.org/rfc4645bis.html
> 
> There's absolutely no integration. Or more exactly, it does not create the encoding framework that would allow the
> effective creation of a comprehensive hierarchy of language families.

As co-chair:
That's outside the scope of the ltru working group charter.

As a technical contributor:
Anyone who's seriously worked in linguistics will recognize that
creating a "hierarchy of language families" would be open to
endless debate.  The discussion of how to handle Erzgebirgisch
on the ietf-languages <at> iana.org list shows how difficult this can
be even for what one might think would be an easy case.
I'm very glad it's out of our scope.

...
> I would have really hoped that those unstable collections of 639-1/2 were deprecated (grandfathered with no
> indication of a preferred new code, due to the ambiguities, just like what has been done for "cel-gaulish" in RFC
> 4645bis), and that new codes were assigned to the codes that were changed to be inclusive and built according to
> serious definitions (e.g., the exclusive [ine] collection of 639-2 would have been grandfathered, and a new code
> added in 639-5 for the inclusive Indo-european family).
...

As co-chair:
Generating substantial updates to 639-5 would have gone well beyond the
scope of the ltru working group.

Randy

