Word to HTML

I got an HTML export from Word. The HTML code was very ugly and, what's worse, confused the site where I wanted to submit the text. I couldn't clean up the code with tidy, so I wrote a quick Perl hack.

The hack removes most of the garbage:

* For the "p" and "b" tags, the attributes are removed.
* The "span", "font" and "o:p" tags are removed completely (their content is kept).
* Additionally, line breaks inside <p>...</p> are removed, so each paragraph ends up on a single line.

The result of running the script isn't ideal, and some manual work is still required, but that isn't an issue.

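use strict;
use warnings;

# Slurp the whole input (file arguments or STDIN) into one string.
$/ = undef;
my $html = <> // '';

# Keep <p> and <b> but drop their attributes.
# \b stops "p" and "b" from also matching <pre>, <body> or <br>.
foreach my $tname ('p', 'b') {
  $html =~ s/<${tname}\b[^>]*>/<$tname>/gi;
}
# Remove the opening and closing span, font and o:p tags; their content stays.
foreach my $tname ('span', 'font', 'o:p') {
  $html =~ s/<\/?${tname}\b[^>]*>//gi;
}
# Put each paragraph on one line. Line breaks become a single space so that
# words on adjacent lines are not glued together.
my @parts = split /<\/p>/i, $html, -1;
my $tail = pop(@parts) // '';   # text after the last </p> gets no extra </p>
my $out = '';
foreach my $part (@parts) {
  $part =~ s/^\s+//;
  $part =~ s/\s*[\r\n]+\s*/ /g;
  $out .= $part . "</p>\n";
}
print $out, $tail;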
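
To see what the three rules do, here is a small sketch that applies them to a made-up fragment in the style of a Word export. The helper name `clean_word_html` and the sample markup are my own assumptions for illustration, not part of the script above. Unlike the script, which splits on `</p>`, this variant joins lines only inside `<p>...</p>`, using a substitution with the `/e` modifier.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the same three rules to a string.
sub clean_word_html {
    my ($html) = @_;
    # Rule 1: keep <p> and <b>, drop their attributes.
    $html =~ s/<(p|b)\b[^>]*>/<$1>/gi;
    # Rule 2: remove span, font and o:p tags, keeping their content.
    $html =~ s/<\/?(?:span|font|o:p)\b[^>]*>//gi;
    # Rule 3: put each <p>...</p> on a single line.
    $html =~ s{(<p>.*?</p>)}{ my $p = $1; $p =~ s/\s*[\r\n]+\s*/ /g; $p }gsie;
    return $html;
}

# Made-up sample fragment (an assumption, not real Word output).
my $sample = qq{<p class=MsoNormal style='margin:0'><span lang=EN-US>Hello,\n}
           . qq{<b style='mso-bidi-font-weight:normal'>world</b><o:p></o:p></span></p>\n};

print clean_word_html($sample);
# prints: <p>Hello, <b>world</b></p>
```

The `\b` after the tag name matters here: without it, the rule for "b" would also rewrite tags such as `<body>` and `<br>`.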