HTML Encoded Line Feed character transformed in whitespace

Details

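    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: 16.3.0-rc-1
    • Affects Version/s: 12.5-rc-1
    • Component/s: XML
    • Labels: None
    • Environment: Mozilla Firefox 91.0 and Google Chrome 92.0.4515.159
    • Tests: Unit
    • Difficulty: Unknown
    • Documentation: N/A
    • Documentation in Release Notes: N/A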

    Description

      Steps to reproduce:

      • Add an HTML-encoded line feed character, such as &#10; or an equivalent encoding of the same character, in the WYSIWYG editor

      • Save the page

      Expected results:

      • The character is still displayed after the page is saved in the WYSIWYG editor.

      Results:

      • The character is transformed into a whitespace / non-breaking space.

      When the page is saved in the Wiki editor, or in the WYSIWYG editor while in source mode, the character is not transformed.

          Activity

            [XCOMMONS-2265] HTML Encoded Line Feed character transformed in whitespace

            camil7 Clemens Robbenhaar added a comment -

            I discovered a potential issue with my pull request (or rather with the related XCOMMONS-2276): opening curly braces are no longer HTML-encoded in any HTML string: https://github.com/xwiki/xwiki-commons/pull/149#discussion_r744222776

            This might mean that under some conditions these braces are interpreted as the opening of a macro.
            I have not been able to produce any problems via the WYSIWYG editor. However, since similar explicit conversions of opening curly braces into their equivalent HTML entities happen in other places, I guess there is a reason why this behavior has been that way in the past.

            camil7 Clemens Robbenhaar added a comment -

            The pull request is now at https://github.com/xwiki/xwiki-commons/pull/149 with a request for comments, especially about the tests that now show slightly different behavior.

            mflorea Marius Dumitru Florea added a comment - This was introduced in https://github.com/xwiki/xwiki-commons/commit/79fc44ff5fc7aac167f80ebcbfa9bc19a374c559 for XCOMMONS-1938.

            mflorea Marius Dumitru Florea added a comment -

            We serialize the cleaned DOM using org.jdom.output.XMLOutputter, but we extend this class to add some custom behavior. The JDOM to serialize looks good. The method escapeElementEntities is called to escape the value of the text node:

            "&#10;"

            We override this method:

            String result = super.escapeElementEntities(text);

            // "\r" characters are automatically transformed in &#xD; but we want to keep the original \r there.
            return cleanAmpersandEscape(result).replaceAll("&#xD;", "\r");

            The base implementation produces

            "&amp;#10;"

            as expected. But then calling cleanAmpersandEscape breaks the escaping, producing back:

            "&#10;"

            This looks very bad. I don't understand why we remove the ampersand escaping. The JavaDoc doesn't explain much.
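
            The escape-then-unescape sequence described in this comment can be reproduced without JDOM. The following is a minimal sketch, not the XWiki code: escapeElementText and undoAmpersandEscape are hypothetical stand-ins for the base escapeElementEntities and for cleanAmpersandEscape, and the regular expression is only an assumption about what "removing the ampersand escaping" amounts to.

```java
import java.util.regex.Pattern;

class AmpersandEscapeSketch {
    // Hypothetical stand-in for the base escaping step: escapes the characters
    // that are special inside XML element text (and "\r", as noted in the comment above).
    static String escapeElementText(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\r", "&#xD;");
    }

    // Hypothetical stand-in for cleanAmpersandEscape: assumes it turns "&amp;"
    // back into "&" whenever what follows looks like an entity reference.
    private static final Pattern ESCAPED_ENTITY = Pattern.compile("&amp;(#?[0-9a-zA-Z]+;)");

    static String undoAmpersandEscape(String escaped) {
        return ESCAPED_ENTITY.matcher(escaped).replaceAll("&$1");
    }

    public static void main(String[] args) {
        String text = "\"&#10;\"";                       // plain text typed by the user
        String escaped = escapeElementText(text);         // "&amp;#10;" (still means the typed text)
        String broken = undoAmpersandEscape(escaped);     // "&#10;" (now means a line feed)
        System.out.println(escaped);
        System.out.println(broken);
    }
}
```

            Under these assumptions the second step returns exactly the input string, so the serialized markup carries a character reference instead of the seven characters the user typed.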

            mflorea Marius Dumitru Florea added a comment -

            The HTMLCleaner produces a ContentNode with this value / content:

            "&#10;"

            I assume this ContentNode corresponds to a DOM Text node, so its value is as expected: it should be the plain text that the user sees in the end. The code then creates a DOM Text node with this value (as is, without any changes):

            element.appendChild(document.createTextNode(content));

            The JavaDoc for https://docs.oracle.com/javase/8/docs/api/org/w3c/dom/Document.html#createTextNode-java.lang.String- is not very clear, but I'm pretty sure that the value passed when creating a text node is plain text. So the way the text node is created seems good. This means the cleaned DOM should be fine and the problem is in how we serialize the cleaned DOM.
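
            The claim that createTextNode takes plain text can be checked with the JDK's own DOM implementation. The sketch below is an illustration only: it uses the standard javax.xml APIs rather than the HTML Cleaner or JDOM, and the class and method names (TextNodeSketch, paragraphWithText, serialize) are made up for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TextNodeSketch {
    // Builds <p> with a single text node holding the given content, unchanged.
    static Element paragraphWithText(String content) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element p = document.createElement("p");
            document.appendChild(p);
            p.appendChild(document.createTextNode(content));
            return p;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes the element with the JDK's identity transformer.
    static String serialize(Element element) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(element), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element p = paragraphWithText("\"&#10;\"");
        // The DOM keeps the literal characters; nothing is decoded.
        System.out.println(p.getTextContent());
        // A correct serializer escapes the ampersand so the text round-trips.
        System.out.println(serialize(p));
    }
}
```

            Here the text node keeps the typed characters as they are, and the serializer is the component responsible for writing the ampersand as &amp;, which matches the conclusion that the problem lies in serialization.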

            mflorea Marius Dumitru Florea added a comment -

            The WYSIWYG editor uses the HTML Cleaner, which transforms this HTML received from the editor:

            <p>&quot;&amp;#10;&quot;</p>

            into

            <?xml version="1.0" encoding="UTF-8"?>
            <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
            <html><head></head><body><p>"&#10;"</p></body></html>

            This is pretty bad because it loses the intention of the editor, which is to have

            "&#10;"

            as plain text, not HTML. This is something the user has typed into the editor, which is why the ampersand is escaped. There's no reason to unescape the ampersand when cleaning the HTML.
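
            The difference in meaning between the two fragments can be made concrete by parsing each one and comparing the resulting text. This is a small sketch using the JDK's XML parser as a stand-in for a browser or the HTML Cleaner; CleanerMeaningSketch and textOf are hypothetical names introduced for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class CleanerMeaningSketch {
    // Parses a small XML fragment and returns the text content of its root element.
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // What the editor sends: the ampersand is escaped, so the text is the typed characters.
        String fromEditor = textOf("<p>&quot;&amp;#10;&quot;</p>");
        // What the cleaner returns: the reference is live, so the text contains a line feed.
        String fromCleaner = textOf("<p>\"&#10;\"</p>");

        System.out.println(fromEditor.length());   // 7 characters: "&#10;"
        System.out.println(fromCleaner.length());  // 3 characters: quote, line feed, quote
    }
}
```

            The first fragment decodes to the seven characters the user typed, while the second decodes to a quoted line feed, which is then rendered as whitespace.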

            People

              camil7 Clemens Robbenhaar
              andreic Camelia Andrei