Bob Duncan | 27 Nov 20:54

RSS and diacritics


Greetings,

I'm getting ready to offer RSS feeds for our library's recent 
acquisitions lists and have run into a little snag:  characters with 
diacritics.  I understand why I can't use HTML character entity 
references and expect all feed readers to play nicely, so I tried 
encoding the ampersand in the HTML entity reference (a suggested fix 
that I can no longer document).  While this works great for some feed 
readers, other readers and the two major browsers display the raw 
code instead of the character with diacritical mark.

Other than displaying plain letters without diacritics, is there a 
way to code feeds so that all (or at least most) feed readers will 
display the character with the mark?  (I'd like to be able to do this in 
item titles and descriptions.)

Thanks,

Bob Duncan

~!~!~!~!~!~!~!~!~!~!~!~!~
Robert E. Duncan
Systems Librarian
Editor of IT Communications
Lafayette College
Easton, PA  18042
duncanr@...
http://www.library.lafayette.edu/ 

Jonathan Gorman | 27 Nov 21:33

Re: RSS and diacritics


---- Original message ----
>Date: Tue, 27 Nov 2007 14:56:56 -0500
>From: Bob Duncan <duncanr@...>  
>Subject: [Web4lib] RSS and diacritics  
>To: web4lib@...
>
>
>Greetings,
>
>I'm getting ready to offer RSS feeds for our library's recent 
>acquisitions lists and have run into a little snag:  characters with 
>diacritics.  I understand why I can't use HTML character entity 
>references and expect all feed readers to play nicely, so I tried 
>encoding the ampersand in the HTML entity reference (a suggested fix 
>that I can no longer document).  While this works great for some feed 
>readers, other readers and the two major browsers display the raw 
>code instead of the character with diacritical mark.
>
>Other than displaying plain letters without diacritics, is there a 
>way to code feeds so that all (or at least most) feed readers will 
>display the character with the mark?  (I'd like to be able to this in 
>item titles and descriptions.)
>
>Thanks,
>

I guess I'm a little confused.  This could possibly be several problems and there's a lot more we need to know. 
Where are you getting your information from that has diacritics?  What encoding are those diacritics?  Are
you sure the data isn't being converted or corrupted when you are querying the source?

Andrew Cunningham | 28 Nov 01:09

Re: RSS and diacritics


Jonathan Gorman wrote:
> There's a lot of software and fonts that don't have very complete character sets.  Arial Unicode so far has
> the most complete that I know of. People using a browser will have to have it set to use a unicode font to 
> see unicode characters correctly.  On top of that, there's a lot of 
> software that mishandles combining diacritics (IE 6 is one example, if I 
> recall correctly) and will never display them correctly.
> 

There are a few common misconceptions here.

All modern web browsers are Unicode-based at the core; older 8-bit 
legacy encodings are supported by transcoding to Unicode on the fly.

This has been the case since Netscape 4 and IE 3/4.

All core operating system fonts on Windows and Mac OS are Unicode-based, 
even the core fonts on Windows 98.

There are no pan-Unicode fonts. There are too many characters in Unicode 
for a single font to support them all. Fonts have physical limits.

Arial Unicode MS only supports a very old version of Unicode, and that 
incompletely. It is useful for characters with diacritics when those 
characters are precomposed characters. It is not suitable for combining 
diacritics. It doesn't have the required mark and mkmk OpenType features 
for the Latin script.

Combining diacritic support on the Windows platform requires:


Jonathan Gorman | 27 Nov 21:54

Re: RSS and diacritics


Apologies.  In rereading, I realized I misinterpreted what you were saying.  I thought you had two distinct
problems: using HTML character entities, and issues with diacritics.

The answer as far as the entities?  RSS can be a mess ;).  RSS feeds are XML.  Sadly, a widespread practice has
developed of putting "escaped HTML" in fields of RSS feeds.  There's no way to ensure that these escaping
nightmares will be parsed correctly.

HTML defines many named character entities, but XML (and therefore RSS) predefines only a handful of them.  You
can attempt to make more of them available in the RSS feed by declaring them in a DOCTYPE declaration at the
beginning of the feed.  This Wikipedia page looks like it has some examples of that: http://en.wikipedia.org/wiki/XML.

The best solution?  Not really sure.  I'd lean towards not using "escaped HTML" in my RSS feed.  Instead use
just RSS and numeric character references, which should display cleanly assuming that the feed reader isn't junk.

(And by character reference, I mean use &#x..; where .. is the hexadecimal code point of the character.)

See http://en.wikipedia.org/wiki/Character_entity_reference for a bit more information.
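
As a rough illustration of the numeric character reference approach, here is a minimal Python sketch. The helper name to_hex_ncr and the sample title are made up for this example, and it assumes the text has no characters that need XML escaping (&, <, >), which would have to be handled separately.

```python
def to_hex_ncr(text: str) -> str:
    """Replace each non-ASCII character with a hexadecimal numeric character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


title = "\u00c9cole fran\u00e7aise"  # hypothetical item title with two accented letters

# Hexadecimal references, built by the helper above.
hex_version = to_hex_ncr(title)

# Decimal references, using Python's built-in "xmlcharrefreplace" error handler.
decimal_version = title.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(hex_version)
print(decimal_version)
```

Either output is plain ASCII, so it should survive whatever encoding the feed is served in.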

Jon Gorman

---- Original message ----
>Date: Tue, 27 Nov 2007 14:56:56 -0500
>From: Bob Duncan <duncanr@...>  
>Subject: [Web4lib] RSS and diacritics  
>To: web4lib@...
>
>
>Greetings,
>

Bob Duncan | 27 Nov 23:58

Re: RSS and diacritics

At 03:56 PM 11/27/2007, Jonathan Gorman wrote:
>Apologizes, In rereading I realized I mis-interpreted what you were 
>saying.  I thought you had two distinct problems (using html 
>character entities) and issues with diacritics.

Phew!  I thought I was going to have to attempt a reply to your first 
response. ;o)

>The answer as far as the entities?  RSS can be a mess ;).  RSS feeds 
>are XML.  Sadly, a widespread practice has occurred of using 
>"escaped html" in fields of the RSS feeds.  There's no way to ensure 
>that these escaping nightmares will be parsed correctly.
>
>HTML defines some character entities, but RSS doesn't have all of 
>them.  You can attempt to add these characters to the RSS feed via 
>including them in a Doctype declaration at the beginning of the 
>feed.  This wikipedia page looks like it has some examples of that: 
>http://en.wikipedia.org/wiki/XML.
>
>The best solution?  Not really sure.  I'd lean towards not using 
>"escaped html" in my RSS feed.  Instead use just rss and the 
>character references, which should display cleanly assuming that the 
>rss feeder isn't junk.
>
>(And by character reference, I mean use &#x..; where .. is the 
>appropriate code point).

Thanks.  I think that will do it.  I was using name-based references 
(Egrave, etc.) and escaping the ampersand, which worked in most feed 
readers but not in everything capable of displaying a feed.  The 

Andrew Cunningham | 28 Nov 00:54

Re: RSS and diacritics

Which version of RSS are you using, and does its schema/DTD define the 
entities you want to use?

Re NCRs, have a look at http://www.w3.org/International/questions/qa-escapes

Bob Duncan wrote:
> At 03:56 PM 11/27/2007, Jonathan Gorman wrote:
>> Apologizes, In rereading I realized I mis-interpreted what you were 
>> saying.  I thought you had two distinct problems (using html character 
>> entities) and issues with diacritics.
> 
> Phew!  I thought I was going to have to attempt a reply to your first 
> response. ;o)
> 
>> The answer as far as the entities?  RSS can be a mess ;).  RSS feeds 
>> are XML.  Sadly, a widespread practice has occurred of using "escaped 
>> html" in fields of the RSS feeds.  There's no way to ensure that these 
>> escaping nightmares will be parsed correctly.
>>

Named entities need to be defined. XML by default only supports a small 
handful (&amp;, &lt;, &gt;, &quot; and &apos;). Most of the named entities 
in HTML don't exist in XML, unless the schema or DTD in question defines them.
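
To see the difference in practice, here is a small sketch using Python's standard xml.etree.ElementTree parser. The helper name parses and the sample snippets are invented for illustration; other XML parsers may report the error differently.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the text is accepted as well-formed XML by ElementTree."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


named = "<title>&Egrave;</title>"      # HTML named entity, not declared anywhere
numeric = "<title>&#xC8;</title>"      # numeric character reference
predefined = "<title>&amp; &lt; &gt; &quot; &apos;</title>"  # XML's built-in five

print(parses(named), parses(numeric), parses(predefined))
```

The named entity is rejected because nothing declares it, while the numeric reference and the five predefined entities are accepted.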

For XML documents it's best to use an appropriate encoding that supports 
all your character requirements rather than using entities or NCRs.
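
A minimal sketch of that approach, assuming Python 3.8 or later and its standard xml.etree.ElementTree module. The channel and item titles are placeholders, and the fragment leaves out elements (link, description) that a complete RSS 2.0 channel would need.

```python
import xml.etree.ElementTree as ET

# Build a tiny feed skeleton; the titles here are hypothetical sample data.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Recent acquisitions"
item = ET.SubElement(channel, "item")
ET.SubElement(item, "title").text = "Les Mis\u00e9rables"

# Serialize as UTF-8 bytes with an XML declaration that names the encoding,
# so the accented letter is stored directly rather than as an entity or NCR.
feed_bytes = ET.tostring(rss, encoding="utf-8", xml_declaration=True)

print(feed_bytes.decode("utf-8"))
```

The declared encoding has to match the bytes actually sent, including any charset given in the HTTP Content-Type header.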

>> HTML defines some character entities, but RSS doesn't have all of 
>> them.  You can attempt to add these characters to the RSS feed via 
>> including them in a Doctype declaration at the beginning of the feed.  

Jonathan Gorman | 28 Nov 03:57

Re: RSS and diacritics

>One other question:  which numeric reference is preferable?  For 
>example, both &#xC9; and &#201; (xC9 and 201) produce a Latin capital 
>E acute.  Are there good reasons to use one over the other?  (And is 
>either more likely than the other to be correctly rendered by 
>browsers in non-RSS situations?)

That, I must say, is for either a linguist or a character set expert to answer ;).  As a general rule, I try to
avoid combining diacritics, but that's just me.

Good luck.

Jon Gorman

Bob Rasmussen | 28 Nov 04:50

Re: RSS and diacritics

On Tue, 27 Nov 2007, Jonathan Gorman wrote:

> >One other question:  which numeric reference is preferable?  For 
> >example, both &#xC9; and &#201; (xC9 and 201) produce a Latin capital 
> >E acute.  Are there good reasons to use one over the other?  (And is 
> >either more likely than the other to be correctly rendered by 
> >browsers in non-RSS situations?)
> 
> That, I must say, is for either a linguist or a character set expert to 
> answer ;).  

I might claim to be the latter... The two are equivalent. Hexadecimal C9 
is equivalent to decimal 201. I would expect any software that handles RSS 
to handle either notation equally well.
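
For anyone who wants to check the equivalence, here is a short Python sketch using only the standard library. html.unescape stands in for whatever a feed reader does when it resolves the references; that is an assumption about the reader, not a description of any particular one.

```python
import html

# Hexadecimal C9 and decimal 201 are the same number.
same_number = int("C9", 16) == 201

# Both notations resolve to the same character, U+00C9.
from_hex = html.unescape("&#xC9;")
from_decimal = html.unescape("&#201;")

print(same_number, from_hex == from_decimal == "\u00c9")
```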

(By the way, the Windows calculator can do conversions between hex and 
decimal. Do Start:Run:Calc.)

(By the other way, the Windows character map utility is useful also. Do 
Start:Run:charmap.)

> As a general rule, I try to avoid combining diacritics, but 
> that's just me.

Just to make sure there's no confusion, these are not combining 
diacritics, they are combined. The combining equivalent would be to output 
an "E" followed by the character reference for a combining acute, which is 
hex 301 (&#x301;).

That stated, I agree, use combined if possible, not combining.
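
The difference between the two forms, and one way to convert between them, can be sketched with Python's standard unicodedata module. Normalizing to NFC before writing the feed is a suggestion added here, not something discussed earlier in the thread.

```python
import unicodedata

combined = "\u00c9"     # precomposed LATIN CAPITAL LETTER E WITH ACUTE
combining = "E\u0301"   # "E" followed by COMBINING ACUTE ACCENT

# The two strings look alike when rendered but are different code point sequences.
print(combined == combining)

# NFC composes to the combined form; NFD decomposes to the combining form.
print(unicodedata.normalize("NFC", combining) == combined)
print(unicodedata.normalize("NFD", combined) == combining)
```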

Jonathan Gorman | 28 Nov 04:52

Re: RSS and diacritics

>> There's a lot of software and fonts that don't have very complete character sets.  Arial Unicode so far has
>> the most complete that I know of. People using a browser will have to have it set to use a unicode font to 
>> see unicode characters correctly.  On top of that, there's a lot of 
>> software that mishandles combining diacritics (IE 6 is one example, if I 
>> recall correctly) and will never display them correctly.
>> 
>
>There are a few common misconceptions here.
>
>all modern web browsers are Unicode based at the core, older 8-bit 
>legacy encodings are supported by transcoding to Unicode on the fly
>
>This has been the case since Netscape 4 and IE 3/4
>

Well, yes, but if the font they're using doesn't have anything beyond the basic ASCII mapping, they're not
what I would call Unicode compatible ;).  In my defense, I wasn't saying browsers have issues, but there's a
lot of software that does.

IE 6 for a while in a default configuration did have several bugs relating to Unicode.  Heck, there are still a
lot of programming languages out there that have horrible Unicode support.  I was flabbergasted a few
years ago to see how poor the support was in Ruby, for crying out loud.

>All core operating system fonts (Windows and MacOS are Unicode based) 
>even core fonts on Windows 98.
>

Well, I'm not an expert on these things.  My main goal was to advise someone having trouble with seeing
characters appear on the webpage to use by default a font that had the widest implemented character set.  I
chose the font that came to mind, which is probably out of date ;).  I didn't think that Windows 98 core fonts,

Andrew Cunningham | 28 Nov 08:19

Re: RSS and diacritics


Jonathan Gorman wrote:

> Well, yes, but if the font they're using doesn't have anything beyond the basic ascii mapping, they're not
> what I would call unicode compatible ;).  In my defense  I wasn't saying browsers have issues, but there's a
> lot of software that does.

Core Windows fonts were designed to support WGL 4.0, which was seen as a 
necessary subset of Unicode to meet the needs of European languages. 
Documentation should be available on the Microsoft Typography site.

The reality is that fonts are developed for specific sets of languages 
and scripts. From the point of view of typographic design it isn't desirable 
to mix scripts. It is a fine art to design glyphs for one script that can 
harmonise with another script without distorting the scripts.

I'd draw a distinction between:

1) software that is not internationalised or that the developers made a mess of;
2) software based on the Windows 95 internationalization model (i.e. 
remapping Unicode to Windows code pages), although Microsoft itself moved 
away from this model with web browsing technology. Only languages 
directly supported by code pages are supported by this model;
3) software based on the Windows 2000 internationalization model, which is 
Unicode at the core and maps Unicode to code pages.

As you indicate, there is a lot of badly written code out there. But the 
approaches to handling Unicode have been around for many years. We've 
gone through various iterations of operating systems and applications 
that are Unicode based. And there are only a limited number of languages 

Thomas Dowling | 29 Nov 14:56

Re: RSS and diacritics

Jonathan Gorman wrote:
>> all modern web browsers are Unicode based at the core,...
>>
>>     
>
> Well, yes, but if the font they're using doesn't have anything beyond the basic ascii mapping, they're not
> what I would call unicode compatible...

The more adept browsers out there figured this out quite a while ago.  
If the font they're using doesn't have a glyph for the character 
requested, they pull the correct glyph from a font that does have it.  
Awkwardly, there's a less adept browser that fails to do this, that has 
about 80% market share...

CSS2 requires that browsers work their way down the list of specified 
fonts to find the right glyph, not just find a matching font name.  
IIRC, Gecko-based browsers and Opera go beyond that to find any system 
font with the right glyph.
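
The per-glyph fallback described above can be modelled with a toy Python sketch. Everything in it is hypothetical: the font names, the coverage sets and the pick_font helper are invented for illustration and say nothing about how any real browser implements font matching.

```python
from typing import Optional, Sequence

# Hypothetical coverage tables: font name -> characters the font has glyphs for.
FONT_COVERAGE = {
    "BasicLatinFont": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    "EuropeanFont": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c8\u00c9"),
    "SystemFallbackFont": {"\u0416"},  # CYRILLIC CAPITAL LETTER ZHE
}


def pick_font(ch: str, font_list: Sequence[str],
              system_fonts: Sequence[str] = ()) -> Optional[str]:
    """Return the first font that has a glyph for ch.

    The author-specified font_list is searched first, in order; system_fonts
    models the extra step of searching any other installed font.
    """
    for name in list(font_list) + list(system_fonts):
        if ch in FONT_COVERAGE.get(name, set()):
            return name
    return None


specified = ["BasicLatinFont", "EuropeanFont"]
print(pick_font("\u00c9", specified))
print(pick_font("\u0416", specified))
print(pick_font("\u0416", specified, system_fonts=["SystemFallbackFont"]))
```

The second call finds nothing because neither specified font covers the character; the third succeeds only because the search is widened to a system font.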

--

-- 
Thomas Dowling
tdowling@...

Bob Rasmussen | 29 Nov 15:42

Re: RSS and diacritics

On Thu, 29 Nov 2007, Thomas Dowling wrote:

> The more adept browsers out there figured this out quite a while ago.  If the
> font they're using doesn't have a glyph for the character requested, they pull
> the correct glyph from a font that does have it.  Awkwardly, there's a less
> adept browser that fails to do this, that has about 80% market share...
> 
> CSS2 requires that browsers work their way down the list of specified fonts to
> find the right glyph, not just find a matching font name.  IIRC, Gecko-based
> browsers and Opera go beyond that to find any system font with the right
> glyph.

As an aside, that is precisely the approach taken by Anzio, our terminal 
emulation package, and Print Wizard, our printing utility. These programs 
also take many steps to handle combining diacritics well, including 
raising the "above" diacritics where necessary to avoid collision with the 
base character.

My perception of the most common issues in regards to library systems 
displaying (and printing) diacritics and non-Latin characters:

1) Very few fonts have the combining double tilde and combining double 
ligature marks, used mostly with transliterated Russian.

2) Software does not correctly combine combining diacritics. 

3) Fonts are inconsistent in the way they specify the X-location of 
combining diacritics.

4) Library software I have worked with does not give the browsers 

Andrew Cunningham | 29 Nov 22:29

Re: RSS and diacritics

Hi Bob

Bob Rasmussen wrote:
 > On Thu, 29 Nov 2007, Thomas Dowling wrote:
 >
 >> The more adept browsers out there figured this out quite a while ago.  If the
 >> font they're using doesn't have a glyph for the character requested, they pull
 >> the correct glyph from a font that does have it.  Awkwardly, there's a less
 >> adept browser that fails to do this, that has about 80% market share...
 >>
 >> CSS2 requires that browsers work their way down the list of specified fonts to
 >> find the right glyph, not just find a matching font name.  IIRC, Gecko-based
 >> browsers and Opera go beyond that to find any system font with the right
 >> glyph.
 >
 > As an aside, that is precisely the approach taken by Anzio, our terminal
 > emulation package, and Print Wizard, our printing utility. These programs
 > also take many steps to handle combining diacritics well, including
 > raising the "above" diacritics where necessary to avoid collision with the
 > base character.
 >
 > My perception of the most common issues in regards to library systems
 > displaying (and printing) diacritics and non-Latin characters:
 >
 > 1) Very few fonts have the combining double tilde and combining 

Andrew Cunningham | 29 Nov 22:28

Re: RSS and diacritics


Thomas Dowling wrote:
> Jonathan Gorman wrote:
.
> 
> CSS2 requires that browsers work their way down the list of specified 
> fonts to find the right glyph, not just find a matching font name.  
> IIRC, Gecko-based browsers and Opera go beyond that to find any system 
> font with the right glyph.
> 

Not that simple. When using combining diacritics you need to treat Latin 
script as a complex script.

Chopping and changing fonts is more likely to break complex rendering.

And such an approach assumes that each codepoint is represented by a 
single glyph. The reality in some OpenType fonts is that each codepoint 
may have multiple glyphs, one of which is a default.

And all this is irrelevant. If the web developer wrote the page 
properly, then appropriate fonts would be referenced and, if necessary, 
help or support files would point to the non-core fonts required.

The ransom note effect in Gecko browsers shouldn't be necessary.

Just my two cents' worth, although that's no longer legal tender here ;)

As far as I'm concerned we're talking about poor web 
internationalization and poor web design practice. The weak point has 