XWiki Platform / XWIKI-8007

OpenOffice exporter breaks document charset, rendering non-Latin text unreadable

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 4.2-milestone-2
    • Affects Version/s: 4.1.1, 4.1.2
    • Component/s: Office, Old Core
    • Labels: None
    • Difficulty: Trivial
    • Documentation: N/A

    Description

      Proposed change attached in the comments, 10/Jul/12.

      You probably use file.encoding in the export?
      http://www.mindspring.com/~mgrand/java-system-properties.htm

      I managed to intercept export_input.html and it is obviously broken.

      Its header claims

      <meta content="text/html; charset=UTF-8" http-equiv="Content-Type" /><meta content="ru" name="language" />

      But the content is in an OS-dependent SBCS, namely windows-1251 in my case, instead of UTF-8.

      You should sync them and stick to one or the other.
      Personally I am for UTF-8 everywhere, since nobody can guarantee that the user did not enter characters into the wiki page that are beyond the OS-local SBCS encoding, such as Greek or German characters.
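
      A minimal plain-Java sketch of the suspected cause (an illustration only, not XWiki code): String.getBytes() without an explicit charset falls back to the platform default (file.encoding), so on a Russian Windows machine the exported bytes come out as windows-1251 even though the meta header still advertises UTF-8.

      {{

      import java.nio.charset.Charset;
      import java.util.Arrays;

      public class EncodingMismatchDemo
      {
          public static void main(String[] args) throws Exception
          {
              // The header promises UTF-8, but the bytes are produced with the platform default charset.
              String html = "<meta content=\"text/html; charset=UTF-8\" http-equiv=\"Content-Type\" />привет";

              byte[] defaultBytes = html.getBytes();      // uses file.encoding, e.g. windows-1251
              byte[] utf8Bytes = html.getBytes("UTF-8");  // what the header actually declares

              System.out.println("Platform default charset: " + Charset.defaultCharset());
              // false whenever file.encoding is not UTF-8, i.e. the exported bytes do not match the header
              System.out.println("Same bytes? " + Arrays.equals(defaultBytes, utf8Bytes));
          }
      }

      }}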

      Attachments

        Issue Links

          Activity


            mflorea Marius Dumitru Florea added a comment -

            Fixed by using the configured XWiki encoding when creating the input stream from the HTML string.
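
            A minimal sketch of what that fix amounts to (the exact committed change may differ; treating getEncoding() on the XWiki object as the accessor for the configured encoding, e.g. "UTF-8" from xwiki.cfg, is an assumption here):

            {{

            // Sketch: use the wiki's configured encoding instead of the platform default when
            // turning the exported HTML string into bytes for the office converter.
            String encoding = context.getWiki().getEncoding(); // assumed accessor for the configured encoding
            inputStreams.put(inputFileName, new ByteArrayInputStream(html.getBytes(encoding)));

            }}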
            bdv Dmitry added a comment - - edited

            Can the context or the resulting HTML, under some conditions, have an encoding different from UTF-8?
            XSLT can specify any encoding - http://www.sagehill.net/docbookxsl/OutputEncoding.html

            How would it work with the given Java and XWiki code?
            com.xpn.xwiki.pdf.impl.PdfExportImpl.ApplyXSLT returns a String, which is mandatorily UTF-16 in Java.
            So it is probably already internally inconsistent, marking UTF-8 in the header while UTF-16 is being used...

            http://xwiki.475771.n2.nabble.com/which-HTML-parsing-libs-are-already-using-shiipped-with-XWiki-tp7580136p7580141.html
            This has an excerpt from HTML Cleaner (BSD 3-clause license) that uses a regexp to get the charset from a string.

            So a better proposal would be something like:

            {{

            public class OfficeExporter extends PdfExportImpl
            ...
            protected String guessOutputCharset(String html) throws IOException
            {
                String head = null;
                int cutoff = html.length();

                if (cutoff > 0) {
                    // Only the document head needs to be scanned for the meta charset declaration.
                    if (cutoff > 2048) {
                        head = html.substring(0, 2048);
                    } else {
                        head = html;
                    }

                    // Matches <meta http-equiv="content-type" content="text/html; charset=...">.
                    String pattern = "\\<meta\\s*http-equiv=[\\\"\\']content-type[\\\"\\']\\s*content\\s*=\\s*[\"']text/html\\s*;\\s*charset=([a-z\\d\\-]*)[\\\"\\'\\>]";
                    Matcher matcher = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(head);

                    while (matcher.find()) {
                        String candidate = matcher.group(1);
                        if (Charset.isSupported(candidate)) {
                            return candidate;
                        }
                    }
                }

                return "utf-8"; // default fall-back
            }

            -----------
            private void exportXHTML
            ...
            ++ inputStreams.put(inputFileName, new ByteArrayInputStream(html.getBytes(guessOutputCharset(html))));
            -- inputStreams.put(inputFileName, new ByteArrayInputStream(html.getBytes()));

            }}

            That would most probably require

            import java.nio.charset.Charset;
            import java.util.regex.Matcher;
            import java.util.regex.Pattern;
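
            For example (a hypothetical call, assuming the method above is in scope), sniffing the charset from an intercepted export_input.html whose header actually declares windows-1251:

            {{

            String html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1251\">...";
            String charset = guessOutputCharset(html); // "windows-1251", since Charset.isSupported() accepts it
            byte[] bytes = html.getBytes(charset);     // the bytes now match the charset declared in the header

            }}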

            bdv Dmitry added a comment - - edited

            Again, I can google JavaDocs, but I cannot get the expectations for the different variables inside your code.

            https://github.com/xwiki/xwiki-platform/blob/master/xwiki-platform-core/xwiki-platform-oldcore/src/main/java/com/xpn/xwiki/internal/export/OfficeExporter.java
            private void exportXHTML
            ... inputStreams.put(inputFileName, new ByteArrayInputStream(html.getBytes()));

            Proposal:
            1) in the line above use ....html.getBytes("utf-8")));
            2) also add a comment that this charset SHOULD match the one that comes from the XSLT transformer above and is stored in the HTML header.

            Hopefully that would fix the export to a single file. But I do not know and cannot know whether several semantically connected text files can be generated for the export.

            It might be guessed that InputStreams, OutputStreams and all the other string<->stream containers are expected to carry binary streams plus some charset for text streams (which one? a single charset, or different ones case by case?). But is there an explicitly stated requirement for those?

            Maybe it makes sense to centralize it, encapsulating the string->bytes translation into an extra method?
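
            A minimal sketch of such a centralized helper (hypothetical names, not an existing XWiki class):

            {{

            import java.io.ByteArrayInputStream;
            import java.io.InputStream;
            import java.io.UnsupportedEncodingException;

            public final class ExportStreams
            {
                /** The single charset used for every text stream the exporter produces; must match the HTML meta header. */
                public static final String EXPORT_ENCODING = "UTF-8";

                private ExportStreams()
                {
                }

                /** Centralizes the string->bytes translation so every caller uses the same charset. */
                public static InputStream toInputStream(String text) throws UnsupportedEncodingException
                {
                    return new ByteArrayInputStream(text.getBytes(EXPORT_ENCODING));
                }
            }

            }}

            The exporter would then call inputStreams.put(inputFileName, ExportStreams.toInputStream(html)); instead of building the stream with the default charset.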

            bdv Dmitry added a comment -

            Yeah, having virtually no Java experience and 250 MB of traffic per week, I really had a lot of chances to even compile and locally deploy it.

            On the algorithmic level it is obvious; however, getting to know the whole Java ecosystem around it is not.
            Even if I had unlimited traffic here and a step-by-step fool-proof instruction, I would hardly be able to fix the first malfunction of Eclipse OSGi, or Maven, or such.

            What to change is simple. Where to change it is much less so, given the whole stack of libraries and calls used.

            vmassol Vincent Massol added a comment -

            Dmitry, what you could do to help fix an issue is to issue a GitHub pull request. Since the issue is "obvious" (your word), that should be pretty simple to do.

            bdv Dmitry added a comment -

            Can you fix it?
            The change is obvious!

            You need to find the place where you create the file writer and hardcode UTF-8 there too.

            Optionally, add a remark where you write the HTTP meta header not to change the encoding out of sync with the place above.
            Or, if possible, take the encoding option from the file writer object.

            Really, when one cannot even export the entered data, let alone import it, that makes anyone just refuse to put anything into the wiki. They view it as a sink whose content cannot be saved and reused.
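
            A minimal sketch of "hardcode UTF-8 in the file writer" in plain Java (an illustration only; FileWriter itself always uses the platform default charset, so an OutputStreamWriter with an explicit charset is the usual replacement):

            {{

            import java.io.FileOutputStream;
            import java.io.OutputStreamWriter;
            import java.io.Writer;

            public class Utf8FileWriterDemo
            {
                public static void main(String[] args) throws Exception
                {
                    String html = "<meta content=\"text/html; charset=UTF-8\" http-equiv=\"Content-Type\" />привет";

                    // Wrap a FileOutputStream in an OutputStreamWriter so the charset is hardcoded
                    // to the same UTF-8 that is declared in the meta header.
                    Writer writer = new OutputStreamWriter(new FileOutputStream("export_input.html"), "UTF-8");
                    try {
                        writer.write(html);
                    } finally {
                        writer.close();
                    }
                }
            }

            }}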


            mflorea Marius Dumitru Florea added a comment -

            Each bug concerns the version specified by the "Affects Version/s:" field. We have many releases and we cannot assume that an issue with no "Affects Version/s:" affects the "most recent releases", especially when time passes and the issue was created a few months, or even years, ago.

            If you want the bug to be fixed quickly you should fill in all the fields of the JIRA issue.
            bdv Dmitry added a comment -

            By default all bugs concern the most recent releases, no?

            I put a few more notes into the Environment field, but I think that is not relevant and the problem is of a rather generic kind.
            I think the file writer is the same in any Java build of the last 5 years, and you have used that file writer since forever.

            bdv Dmitry added a comment -

            Hmmm, the necessary fields are those that disallow ticket creation if left blank, no?
            How can a newcomer guess which fields are mandatory if Jira does not tell him so?


            mflorea Marius Dumitru Florea added a comment -

            Please fill in the necessary fields. We can't fix anything if we don't know what version is affected. Thanks.

            People

              Assignee: mflorea Marius Dumitru Florea
              Reporter: bdv Dmitry
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved: