Word to HTML
I got an HTML export from Word. The HTML code was very ugly and, what's worse, it confused the site where I wanted to submit the text. I failed to clean up the code with tidy, so I wrote a quick Perl hack.
The hack removes most of the garbage:
* The tags "p" and "b" are kept, but their attributes are removed.
* The tags "span", "font" and "o:p" are removed completely (their content is kept).
* Additionally, each <p>...</p> paragraph is put on a single line.
The result of running the script isn't ideal, and some manual cleanup is still required. But that isn't an issue.
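use strict;
use warnings;

# Slurp the whole input (file arguments or STDIN) into one string.
$/ = undef;
my $l = <>;

# Keep <p> and <b>, but strip their attributes.
# The \b keeps "b" from also matching <br> or <body>.
foreach my $tname ('p', 'b') {
    $l =~ s/<$tname\b[^>]*>/<$tname>/gi;
}

# Drop <span>, <font> and <o:p> tags, opening and closing; their content stays.
foreach my $tname ('span', 'font', 'o:p') {
    $l =~ s/<\/?$tname\b[^>]*>//gi;
}

# Put each paragraph on one line. Line breaks become spaces so that
# words wrapped across lines are not glued together.
my @parts = split /<\/p>/i, $l, -1;
my $tail = @parts ? pop @parts : '';    # text after the last </p>
my $s = '';
foreach my $i (@parts) {
    $i =~ s/[\r\n]+/ /g;
    $s .= $i . "</p>\n";
}
print $s, $tail;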
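To see what the three rules do to a typical piece of Word output, here is a minimal sketch that wraps them in a function and runs it on a made-up sample. The names clean_word_html and one_line and the sample markup are hypothetical and not part of the original hack. The sketch also makes two assumptions of its own: a word-boundary check so that "b" does not match <br> or <body>, and line breaks turned into spaces rather than dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse line breaks (and the whitespace around them)
# into a single space.
sub one_line {
    my ($text) = @_;
    $text =~ s/\s*[\r\n]+\s*/ /g;
    return $text;
}

# Hypothetical helper: apply the three cleanup rules to a string.
sub clean_word_html {
    my ($html) = @_;

    # Rule 1: keep <p> and <b>, but drop their attributes.
    for my $tag ('p', 'b') {
        $html =~ s/<\Q$tag\E\b[^>]*>/<$tag>/gi;
    }

    # Rule 2: drop <span>, <font> and <o:p> tags; their content stays.
    for my $tag ('span', 'font', 'o:p') {
        $html =~ s/<\/?\Q$tag\E\b[^>]*>//gi;
    }

    # Rule 3: put each <p>...</p> paragraph on a single line.
    $html =~ s{(<p>.*?</p>)}{one_line($1)}gsie;

    return $html;
}

# Made-up sample in the style of a Word export.
my $sample = <<'HTML';
<p class=MsoNormal style='margin:0'><span lang=EN-US>Hello,
<b style='color:red'>Word</b>!<o:p></o:p></span></p>
<br>
HTML

print clean_word_html($sample);
# prints:
# <p>Hello, <b>Word</b>!</p>
# <br>
```

Unlike the split-based loop, this version only touches line breaks inside paragraphs and leaves everything outside <p>...</p> as it was. Regular expressions remain a rough tool for HTML, so some manual cleanup should still be expected.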