Hi,

When I parse a PDF or Word document with AutoDetectParser, the <ul> and <li> tags are converted to <p> tags. I need the HTML output to preserve the exact structure that is in the PDF or Word document.
I have tried two approaches so far.

First, with ToHTMLContentHandler and an IdentityHtmlMapper:

    ToHTMLContentHandler textHandler = new ToHTMLContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    // Ask the HTML parser to pass elements through unchanged
    context.set(HtmlMapper.class, new IdentityHtmlMapper());
    parser.parse(in, textHandler, metadata, context);

---------------------------------------------------------

Second, with a TransformerHandler that serializes the SAX events as HTML:

    SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
    handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
    handler.setResult(new StreamResult(writer));
    System.out.println(handler.toString());
    return handler;

In both cases the <li> tags are replaced with <p> tags that carry a class attribute, and the corresponding CSS styles do not appear in the parsed HTML output.

Any help is appreciated.

--
View this message in context: http://lucene.472066.n3.nabble.com/CSS-styles-and-ul-li-tags-been-ignored-while-parsing-tp3987555.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
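
For reference, here is a minimal sketch of the second approach as a self-contained program that uses only JDK classes (no Tika). It builds the same TransformerHandler and feeds it hand-written <ul>/<li> SAX events, to see whether the serializer itself keeps list tags. The class name HtmlSerializerCheck, the method names and the sample text "item" are made up for illustration. If the list tags survive here, the <p> substitution presumably happens earlier, in the events the parser emits; that is an assumption, not something confirmed.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper class, not part of Tika.
public class HtmlSerializerCheck {

    // Builds the same kind of HTML-serializing handler as in the second attempt.
    static TransformerHandler newHtmlHandler(StringWriter writer) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
        handler.setResult(new StreamResult(writer));
        return handler;
    }

    // Sends a one-item list through the handler and returns the serialized HTML.
    static String serializeSampleList() {
        try {
            StringWriter writer = new StringWriter();
            TransformerHandler handler = newHtmlHandler(writer);
            AttributesImpl noAttrs = new AttributesImpl();
            char[] text = "item".toCharArray();

            handler.startDocument();
            handler.startElement("", "ul", "ul", noAttrs);
            handler.startElement("", "li", "li", noAttrs);
            handler.characters(text, 0, text.length);
            handler.endElement("", "li", "li");
            handler.endElement("", "ul", "ul");
            handler.endDocument();
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        // Print the serialized markup rather than handler.toString(),
        // which only shows the handler object's identity.
        System.out.println(serializeSampleList());
    }
}
```

Note that the sketch prints the writer's contents after endDocument(); printing handler.toString(), as in my snippet above, does not show the generated HTML.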