Description
I received an XML export whose entities.xml file contains BS (BACKSPACE, code 8) characters at the end of some CDATA start markers, which makes XML parsing fail completely.
In the Confluence migrator application, it manifests as follows:
The text in red:
Exception thrown during job execution [org.xwiki.filter.FilterException: Failed to read package, at org.xwiki.contrib.confluence.filter.internal.input.ConfluenceInputFilterStream.readInternal(ConfluenceInputFilterStream.java:183), at org.xwiki.contrib.confluence.filter.internal.input.ConfluenceInputFilterStream.read(ConfluenceInputFilterStream.java:169), at org.xwiki.contrib.confluence.filter.internal.input.ConfluenceInputFilterStream.read(ConfluenceInputFilterStream.java:95), at org.xwiki.filter.input.AbstractBeanInputFilterStream.read(AbstractBeanInputFilterStream.java:79), at org.xwiki.filter.internal.job.FilterStreamConverterJob.runInternal(FilterStreamConverterJob.java:97), at org.xwiki.job.AbstractJob.runInContext(AbstractJob.java:246), at org.xwiki.job.AbstractJob.run(AbstractJob.java:223), at org.xwiki.filter.script.internal.ScriptFilterStreamConverterJob.run(ScriptFilterStreamConverterJob.java:75), at com.xwiki.confluencepro.internal.ConfluenceMigrationJob.runInternal(ConfluenceMigrationJob.java:159), at org.xwiki.job.AbstractJob.runInContext(AbstractJob.java:246), at org.xwiki.job.AbstractJob.run(AbstractJob.java:223), at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128), at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628), at java.base/java.lang.Thread.run(Thread.java:829), Caused by: org.xwiki.filter.FilterException: Failed to analyze the package index, at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.read(ConfluenceXMLPackage.java:543), at org.xwiki.contrib.confluence.filter.internal.input.ConfluenceInputFilterStream.readInternal(ConfluenceInputFilterStream.java:181), ... 
13 more, Caused by: com.ctc.wstx.exc.WstxParsingException: String ']]>' not allowed in textual content, except as the end marker of CDATA section, at [row,col {unknown-source}]: [19672539,50], at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:634), at com.ctc.wstx.sr.StreamScanner.throwWfcException(StreamScanner.java:479), at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4678), at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2970), at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1122), at org.xwiki.xml.stax.StAXUtils.skipElement(StAXUtils.java:197), at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.readProperty(ConfluenceXMLPackage.java:1130), at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.readObjectProperties(ConfluenceXMLPackage.java:849), at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.readPageObject(ConfluenceXMLPackage.java:1006), at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.readObject(ConfluenceXMLPackage.java:789), at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.createTree(ConfluenceXMLPackage.java:774), at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.read(ConfluenceXMLPackage.java:541), ... 14 more]
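A solution is to remove the character while reading the file. Fortunately, BS is an ASCII character, so in UTF-8 it is always the single byte 0x08 and that byte never occurs inside a multi-byte sequence. It can therefore be filtered at the byte level, without decoding UTF-8 and its more complex character decoding rules.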
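To illustrate the idea, here is a minimal sketch of a byte-level filter. The class name BackspaceStrippingInputStream and the sample input are made up for this example and are not taken from the Confluence filter code. It assumes the export is UTF-8 (or another ASCII-compatible encoding), where the byte 0x08 can only ever stand for the BS character.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical helper: drops every BS byte (0x08) from the wrapped stream
 * before an XML parser gets to see it. Assumes an ASCII-compatible encoding
 * such as UTF-8, where 0x08 never appears inside a multi-byte sequence.
 */
public class BackspaceStrippingInputStream extends FilterInputStream {
    private static final int BS = 0x08;

    public BackspaceStrippingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b;
        do {
            b = super.read();
        } while (b == BS); // skip BS bytes; -1 (end of stream) ends the loop
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n; // end of stream, or nothing read
            }
            // Compact the chunk in place, keeping only non-BS bytes.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                byte b = buf[off + i];
                if (b != BS) {
                    buf[off + kept++] = b;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The chunk held only BS bytes: read again instead of returning 0.
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample: a BS byte right after the CDATA start marker.
        byte[] raw = "<p><![CDATA[\bAlex]]></p>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new BackspaceStrippingInputStream(new ByteArrayInputStream(raw))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Such a stream could wrap the entities.xml input before it is handed to the StAX reader. Note that skip() is left unfiltered here, so it counts raw bytes, which should not matter for sequential parsing.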
Here's the error you could see when parsing such an XML file:
[ERROR] Tests run: 26, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.028 s <<< FAILURE! - in JUnit Vintage
[ERROR] confluencexml/cdatazerowidthspace.test [confluence+xml, filter+xml] Time elapsed: 0.051 s <<< ERROR!
org.xwiki.filter.FilterException: Failed to read package
    at org.xwiki.contrib.confluence.filter.internal.input.ConfluenceInputFilterStream.readInternal(ConfluenceInputFilterStream.java:188)
    at org.xwiki.contrib.confluence.filter.internal.input.ConfluenceInputFilterStream.read(ConfluenceInputFilterStream.java:169)
    at org.xwiki.contrib.confluence.filter.internal.input.ConfluenceInputFilterStream.read(ConfluenceInputFilterStream.java:95)
    at org.xwiki.filter.input.AbstractBeanInputFilterStream.read(AbstractBeanInputFilterStream.java:79)
    at org.xwiki.filter.test.integration.FilterTest.runTestInternal(FilterTest.java:259)
    at org.xwiki.filter.test.integration.FilterTest.execute(FilterTest.java:102)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.xwiki.filter.test.integration.FilterTestSuite$TestClassRunnerForParameters.runChild(FilterTestSuite.java:140)
    at org.xwiki.filter.test.integration.FilterTestSuite$TestClassRunnerForParameters.runChild(FilterTestSuite.java:80)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runners.Suite.runChild(Suite.java:128)
    at org.junit.runners.Suite.runChild(Suite.java:27)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
    at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
    at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
    at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:147)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:127)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:90)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:55)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:102)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:54)
    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
    at org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
    at org.junit.platform.launcher.core.SessionPerRequestLauncher.execute(SessionPerRequestLauncher.java:53)
    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150)
    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
Caused by: org.xwiki.filter.FilterException: Failed to analyze the package index
    at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.read(ConfluenceXMLPackage.java:543)
    at org.xwiki.contrib.confluence.filter.internal.input.ConfluenceInputFilterStream.readInternal(ConfluenceInputFilterStream.java:186)
    ... 55 more
Caused by: com.ctc.wstx.exc.WstxParsingException: String ']]>' not allowed in textual content, except as the end marker of CDATA section
 at [row,col {unknown-source}]: [55,58]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:634)
    at com.ctc.wstx.sr.StreamScanner.throwWfcException(StreamScanner.java:479)
    at com.ctc.wstx.sr.BasicStreamReader.readTextPrimary(BasicStreamReader.java:4678)
    at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2970)
    at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1122)
    at org.xwiki.xml.stax.StAXUtils.skipElement(StAXUtils.java:197)
    at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.readProperty(ConfluenceXMLPackage.java:1130)
    at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.readObjectProperties(ConfluenceXMLPackage.java:849)
    at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.readPageObject(ConfluenceXMLPackage.java:1006)
    at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.readObject(ConfluenceXMLPackage.java:789)
    at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.createTree(ConfluenceXMLPackage.java:774)
    at org.xwiki.contrib.confluence.filter.input.ConfluenceXMLPackage.read(ConfluenceXMLPackage.java:541)
    ... 56 more
This error is specific to the location of the BS character in the XML file. It appears that the XML parser we use actually applies the BS character and therefore removes the previous character. Since in my case the BS character is at the end of the CDATA start marker, the marker's last character is removed, and the parser complains that ']]>' is found outside a CDATA section, which is forbidden.
For the record, xmllint complains about both the disallowed character of code 8 and the ']]>' CDATA end marker appearing in regular content:
$ xmllint --format entities.xml
entities.xml:55: parser error : Unregistered error message
<property name="title"><![CDATAAlex Rodriguez]]></property>
^
entities.xml:55: parser error : PCDATA invalid Char value 8
<property name="title"><![CDATAAlex Rodriguez]]></property>
^
entities.xml:55: parser error : Sequence ']]>' not allowed in content
<property name="title"><![CDATAAlex Rodriguez]]></property>
^
entities.xml:56: parser error : Unregistered error message
<property name="lowerTitle"><![CDATAalex rodriguez]]></property>
^
entities.xml:56: parser error : PCDATA invalid Char value 8
<property name="lowerTitle"><![CDATAalex rodriguez]]></property>
^
entities.xml:56: parser error : Sequence ']]>' not allowed in content
<property name="lowerTitle"><![CDATAalex rodriguez]]></property>
^