Karl Wright (Created) (JIRA) | 17 Jan 2012 09:24

[jira] [Created] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML

Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and
generates corrupted HTML
----------------------------------------------------------------------------------------------------------------------------------

                 Key: FOR-1231
                 URL: https://issues.apache.org/jira/browse/FOR-1231
             Project: Forrest
          Issue Type: Bug
          Components: Internationalisation (i18n)
    Affects Versions: 0.9, 0.10-dev
            Reporter: Karl Wright
            Priority: Critical

We're using Forrest to generate the Apache ManifoldCF site.  We've added Japanese content.  The content
worked fine via localhost:8888, but the html images do not load properly in a browser, even though the
browser correctly presumes the page is utf-8.  It looks like many characters are handled correctly but
some are corrupted.  I've also tried the fix in FORREST-668 but this does not help.  See
http://incubator.apache.org/connectors and click on the tab in Japanese to see what I mean.  The current
source for the site can be found in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/site.

I checked out latest Forrest trunk and build that but there is no improvement.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

Karl Wright (Updated) (JIRA) | 17 Jan 2012 09:32

[jira] [Updated] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


     [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated FOR-1231:
-----------------------------

    Description: 
We're using Forrest to generate the Apache ManifoldCF site.  We've added Japanese content.  The content
worked fine via localhost:8888, but the generated html content does not load properly in a browser, even
though the browser correctly divines that the HTML page has utf-8 encoding.  It looks like many utf-8
characters are handled correctly but some are corrupted.  I've also tried the fix in FORREST-668 but this
does not help.  See http://incubator.apache.org/connectors and click on the tab in Japanese to see what I
mean.  The current source for the site can be found in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/site.

I checked out latest Forrest trunk and build that but there is no improvement.

  was:
We're using Forrest to generate the Apache ManifoldCF site.  We've added Japanese content.  The content
worked fine via localhost:8888, but the html images do not load properly in a browser, even though the
browser correctly presumes the page is utf-8.  It looks like many characters are handled correctly but
some are corrupted.  I've also tried the fix in FORREST-668 but this does not help.  See
http://incubator.apache.org/connectors and click on the tab in Japanese to see what I mean.  The current
source for the site can be found in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/site.

I checked out latest Forrest trunk and build that but there is no improvement.

    

Karl Wright (Updated) (JIRA) | 17 Jan 2012 09:34

[jira] [Updated] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


     [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated FOR-1231:
-----------------------------

    Description: 
We're using Forrest to generate the Apache ManifoldCF site.  We've added Japanese content.  The content
worked fine via localhost:8888, but the generated html content does not load properly in a browser, even
though the browser correctly divines that the HTML page has utf-8 encoding.  It looks like many utf-8
characters in the source XML are handled correctly but some are corrupted.  I've also tried the fix in
FORREST-668 but this does not help.  See http://incubator.apache.org/connectors and click on the tab in
Japanese to see what I mean.  The current source for the site can be found in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/site.

I checked out latest Forrest trunk and built and used that but there has been no improvement.

  was:
We're using Forrest to generate the Apache ManifoldCF site.  We've added Japanese content.  The content
worked fine via localhost:8888, but the generated html content does not load properly in a browser, even
though the browser correctly divines that the HTML page has utf-8 encoding.  It looks like many utf-8
characters in the source XML are handled correctly but some are corrupted.  I've also tried the fix in
FORREST-668 but this does not help.  See http://incubator.apache.org/connectors and click on the tab in
Japanese to see what I mean.  The current source for the site can be found in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/site.

I checked out latest Forrest trunk and build that but there is no improvement.

    

Karl Wright (Updated) (JIRA) | 17 Jan 2012 09:34

[jira] [Updated] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


     [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated FOR-1231:
-----------------------------

    Description: 
We're using Forrest to generate the Apache ManifoldCF site.  We've added Japanese content.  The content
worked fine via localhost:8888, but the generated html content does not load properly in a browser, even
though the browser correctly divines that the HTML page has utf-8 encoding.  It looks like many utf-8
characters in the source XML are handled correctly but some are corrupted.  I've also tried the fix in
FORREST-668 but this does not help.  See http://incubator.apache.org/connectors and click on the tab in
Japanese to see what I mean.  The current source for the site can be found in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/site.

I checked out latest Forrest trunk and build that but there is no improvement.

  was:
We're using Forrest to generate the Apache ManifoldCF site.  We've added Japanese content.  The content
worked fine via localhost:8888, but the generated html content does not load properly in a browser, even
though the browser correctly divines that the HTML page has utf-8 encoding.  It looks like many utf-8
characters are handled correctly but some are corrupted.  I've also tried the fix in FORREST-668 but this
does not help.  See http://incubator.apache.org/connectors and click on the tab in Japanese to see what I
mean.  The current source for the site can be found in: https://svn.apache.org/repos/asf/incubator/lcf/trunk/site.

I checked out latest Forrest trunk and build that but there is no improvement.

    


[jira] [Commented] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


    [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188130#comment-13188130
] 

Hitoshi Ozawa commented on FOR-1231:
------------------------------------

While we're at it, I would appreciate it if Japanese fonts could be installed as well, so that PDFs containing
Japanese also display correctly.



[jira] [Commented] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


    [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188150#comment-13188150
] 

David Crossley commented on FOR-1231:
-------------------------------------

Please ask about separate usage issues on the user mailing list.

The PDF fonts are configurable. See that plugin's docs:
http://forrest.apache.org/docs/plugins/org.apache.forrest.plugin.output.pdf/



[jira] [Commented] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


    [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188216#comment-13188216
] 

Karl Wright commented on FOR-1231:
----------------------------------

I'm told that the Japanese portion of the site is correctly generated on a system that has a default locale of
ja_JP.  Obviously, though, this is not a good solution to the problem since we cannot select different
locales when there is more than one language involved.

                


[jira] [Commented] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


    [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188245#comment-13188245
] 

Hitoshi Ozawa commented on FOR-1231:
------------------------------------

Sorry David, I thought the HTML pages were being dynamically generated on the Apache server.
It seems they are not. "forrest site" works fine on my Japanese OS.

Karl, is your system set up to use en_US.UTF-8?
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
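
Before changing these variables, it can help to see what the current shell would actually hand to a child process. The following is a minimal sketch for a POSIX shell; the function name show_locale_vars is hypothetical and only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the three variables named above,
# marking any that are not set in the current shell.
show_locale_vars() {
    # ${VAR-(unset)} expands to the literal "(unset)" only when VAR is unset.
    printf 'LC_ALL=%s\n'   "${LC_ALL-(unset)}"
    printf 'LANG=%s\n'     "${LANG-(unset)}"
    printf 'LANGUAGE=%s\n' "${LANGUAGE-(unset)}"
}

show_locale_vars
```

On most Unix-like systems the `locale` command gives a fuller picture, including the effective LC_CTYPE.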



[jira] [Commented] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


    [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188259#comment-13188259
] 

Karl Wright commented on FOR-1231:
----------------------------------

bq. Karl, is your system set up to use en_US.UTF-8?
bq. export LC_ALL=en_US.UTF-8
bq. export LANG=en_US.UTF-8
bq. export LANGUAGE=en_US.UTF-8 

I set the equivalent Windows variables, but the generated output did not change for me.  So it must be something else.

                


[jira] [Commented] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


    [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188380#comment-13188380
] 

Karl Wright commented on FOR-1231:
----------------------------------

I figured it out. What we need to do is set the Java default encoding to UTF-8. The easy way to do this is (on Windows):

set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8

 ... or on Linux: 

export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8

Doing this before a Forrest invocation causes all JVMs it brings up to have the right encoding. (It's Cocoon
that seems to be broken, by the way) 
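
If changing the variable for the whole shell session is undesirable, the same setting can be scoped to a single command. The following is a minimal sketch for a POSIX shell, assuming `forrest` is on the PATH; the helper name with_utf8_jvm is hypothetical and not part of Forrest. Note that it replaces, rather than extends, any JAVA_TOOL_OPTIONS value already in the environment for that one command.

```shell
#!/bin/sh
# Hypothetical helper: run one command with the JVM default encoding forced
# to UTF-8, leaving JAVA_TOOL_OPTIONS in the calling shell untouched.
with_utf8_jvm() {
    # env(1) sets the variable only in the environment of the child command.
    env JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8" "$@"
}

# Example usage (assumes forrest is installed):
#   with_utf8_jvm forrest site
```

Any JVM started by the wrapped command, including child JVMs, inherits the variable from that command's environment.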


Karl Wright (Updated) (JIRA) | 21 Jan 2012 11:32

[jira] [Updated] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


     [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated FOR-1231:
-----------------------------

    Attachment: FOR-1231.patch

This patch works, at least as far as generating Japanese correctly on an en_US Windows machine.

                


[jira] [Commented] (FOR-1231) Forrest does not deal properly with UTF-8 .xml content, even with the proper XML content-type header, and generates corrupted HTML


    [
https://issues.apache.org/jira/browse/FOR-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190838#comment-13190838
] 

David Crossley commented on FOR-1231:
-------------------------------------

Thanks. I was thinking of a similar patch. However, I wondered whether it would need to append this setting to any
existing JAVA_TOOL_OPTIONS and then reset it when finished.

I have applied your patch as-is. Thanks.
If someone thinks that it needs more, then please go ahead.

Regarding the Cocoon situation, I think that the doc comments refer to the fact that Cocoon/Forrest have
many supporting products handling various parts of the system. Perhaps some of those treat the encoding
differently. So this environment setting seems a good solution.
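
The append-and-reset idea raised above could look roughly like the following in a POSIX shell. This is only a sketch under stated assumptions: it is not the contents of FOR-1231.patch, and the function name run_with_appended_encoding is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the contents of FOR-1231.patch.
# Append the encoding flag to whatever JAVA_TOOL_OPTIONS already holds,
# run the given command, then put the previous value back.
run_with_appended_encoding() {
    # Remember whether the variable was set at all, and its value if so.
    if [ "${JAVA_TOOL_OPTIONS+set}" = "set" ]; then
        _had_opts=1
        _saved_opts=$JAVA_TOOL_OPTIONS
    else
        _had_opts=0
    fi

    # ${VAR:+...} contributes "<old value> " only when VAR is non-empty.
    JAVA_TOOL_OPTIONS="${JAVA_TOOL_OPTIONS:+$JAVA_TOOL_OPTIONS }-Dfile.encoding=UTF8"
    export JAVA_TOOL_OPTIONS

    "$@"
    _status=$?

    # Reset at finish: restore the old value, or unset if there was none.
    if [ "$_had_opts" -eq 1 ]; then
        JAVA_TOOL_OPTIONS=$_saved_opts
    else
        unset JAVA_TOOL_OPTIONS
    fi
    return "$_status"
}

# Example usage (assumes forrest is installed):
#   run_with_appended_encoding forrest site
```

The wrapped command's exit status is passed through, so a failing build still reports failure to the caller.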


