Description
Ideally this is the job of htmlcleaner. I have asked about it in their forum: http://sourceforge.net/forum/forum.php?thread_id=2890619&forum_id=637245. While waiting for an answer, I think we have to work around it for the moment. To check the validity of my claim, consider the following HTML document:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>I AM YOUR DOCUMENT TITLE REPLACE ME</title>
    <meta http-equiv="content-type" content="application/xhtml+xml;charset=utf-8" />
    <meta http-equiv="content-style-type" content="text/css" />
  </head>
  <body>
    <table>
      <tbody>
        <tr/>
      </tbody>
    </table>
  </body>
</html>
If this HTML is submitted to http://validator.w3.org/check, the validator reports an error for the <tr/> tag, because in XHTML 1.0 Strict a tr must contain at least one td or th. We can either convert the <tr/> into <tr><td/></tr> or strip it completely. The latter choice is more logical because empty rows are not rendered in HTML.
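As a rough illustration of the "strip it" option, here is a minimal sketch that removes childless rows from an already parsed W3C DOM tree. It uses only the JDK's standard DOM API and is not based on htmlcleaner's own classes; the class name EmptyRowStripper and the methods stripEmptyRows and parse are hypothetical names chosen for this example. It also assumes a row counts as empty when it has no child elements at all.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of htmlcleaner.
public class EmptyRowStripper {

    /** Removes every tr that has no child elements and returns how many were removed. */
    public static int stripEmptyRows(Document document) {
        NodeList rows = document.getElementsByTagName("tr");
        // The NodeList is live, so collect the empty rows first and remove them afterwards.
        List<Element> emptyRows = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            if (!hasChildElement(row)) {
                emptyRows.add(row);
            }
        }
        for (Element row : emptyRows) {
            row.getParentNode().removeChild(row);
        }
        return emptyRows.size();
    }

    /** Returns true if the element has at least one child element (text and comments are ignored). */
    private static boolean hasChildElement(Element element) {
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    /** Parses a well-formed XML string into a DOM document. */
    public static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // A fragment without the DOCTYPE, so the parser does not try to fetch the DTD.
        Document document = parse("<table><tbody><tr/><tr><td>kept</td></tr></tbody></table>");
        int removed = stripEmptyRows(document);
        System.out.println("removed=" + removed
            + ", remaining=" + document.getElementsByTagName("tr").getLength());
    }
}
```

One caveat with this approach: in the sample document above the empty row is the only row, so stripping it leaves an empty tbody, which the strict DTD rejects as well. A real workaround would probably need to remove the emptied tbody (and possibly the table) too.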