Hi,
I have documents (either 'doc' or 'docx') that contain the special character for 'greater than or equal to'. When I convert them with 'WordToHtmlConverter', those characters come out as '('.
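I tried with the latest Apache POI release, 4.1.0. My Java code is:

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.apache.poi.hwpf.HWPFDocumentCore;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.apache.poi.hwpf.converter.WordToHtmlUtils;
    import org.w3c.dom.Document;

    public class TestWordtoHtmlConverter {
        public static void main(String[] args) {
            // Close the input stream automatically once the document is loaded.
            try (FileInputStream in = new FileInputStream(args[0])) {
                HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(in);
                WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
                wordToHtmlConverter.processDocument(wordDocument);
                Document htmlDocument = wordToHtmlConverter.getDocument();

                // Serialize the generated DOM as UTF-8 HTML.
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                DOMSource domSource = new DOMSource(htmlDocument);
                StreamResult streamResult = new StreamResult(out);
                TransformerFactory tf = TransformerFactory.newInstance();
                Transformer serializer = tf.newTransformer();
                serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
                serializer.setOutputProperty(OutputKeys.INDENT, "yes");
                serializer.setOutputProperty(OutputKeys.METHOD, "html");
                serializer.transform(domSource, streamResult);
                out.close();

                // Decode with the same charset used for serialization.
                String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
                System.out.println(result);
            } catch (Exception e) {
                // Report failures instead of silently swallowing them.
                e.printStackTrace();
            }
        }
    }

Is there any way I can correctly identify these symbols? In the sample document, I am interested in getting 'bad one' right.

Thanks,
T.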
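
One direction that may be worth checking, sketched under assumptions and not verified against the sample document: Word often stores glyphs inserted from the Symbol font as raw font codes, sometimes shifted into the Unicode private use area (U+F000 to U+F0FF), and in the Adobe Symbol encoding the code 0xB3 is 'greaterequal'. If the raw code of such a run can be recovered (how to get it out of POI is outside this sketch), a small lookup could translate it to the real Unicode character. The class name SymbolFontMapper and the method toUnicode are hypothetical names, not part of POI, and the table covers only a handful of codes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Apache POI): translates a few Symbol-font
// codes to their Unicode equivalents. Only call it for text known to be
// formatted in the Symbol font; otherwise codes such as 0xB3 are ordinary
// Latin-1 characters and must not be remapped.
public class SymbolFontMapper {

    // Partial table, taken from the Adobe Symbol encoding.
    private static final Map<Integer, Character> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        SYMBOL_TO_UNICODE.put(0xA3, '\u2264'); // less than or equal to
        SYMBOL_TO_UNICODE.put(0xB1, '\u00B1'); // plus-minus
        SYMBOL_TO_UNICODE.put(0xB3, '\u2265'); // greater than or equal to
        SYMBOL_TO_UNICODE.put(0xB4, '\u00D7'); // multiplication sign
        SYMBOL_TO_UNICODE.put(0xB9, '\u2260'); // not equal to
    }

    // Returns the Unicode character for a Symbol-font code, or the input
    // unchanged when the code is not in the table.
    public static char toUnicode(char raw) {
        int code = raw;
        // Assumption: private-use code points F000..F0FF carry the font code
        // in their low byte.
        if (code >= 0xF000 && code <= 0xF0FF) {
            code -= 0xF000;
        }
        Character mapped = SYMBOL_TO_UNICODE.get(code);
        return mapped != null ? mapped : raw;
    }

    public static void main(String[] args) {
        // Print the code point, which avoids console-encoding problems.
        System.out.printf("U+%04X%n", (int) toUnicode('\uF0B3')); // prints "U+2265"
    }
}
```

The mapping is applied per character, so it could be run over the text of each affected run before or after the HTML conversion, provided the Symbol-font runs can be told apart from ordinary text.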