A few additions, since <paragraph><commentRangeStart id="commentId" /><run><text>John</text></run><commentRangeEnd id="commentId" /></paragraph> is the critical thing:

    <!-- comment range, text run "John" -->
    <w:commentRangeStart w:id="0"/>
    <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
      <w:rPr><w:rtl w:val="0"/></w:rPr>
      <w:t xml:space="preserve">John</w:t>
    </w:r>
    <w:commentRangeEnd w:id="0"/>

    <xsd:element name="commentRangeStart" type="CT_MarkupRange">
      <xsd:annotation>
        <xsd:documentation>Comment Anchor Range Start</xsd:documentation>
      </xsd:annotation>
    </xsd:element>
    <xsd:element name="commentRangeEnd" type="CT_MarkupRange">
      <xsd:annotation>
        <xsd:documentation>Comment Anchor Range End</xsd:documentation>
      </xsd:annotation>
    </xsd:element>

So if performance isn't a concern here (you don't need to save
pointers to where the comment ranges are), the pseudo-code for an
XWPFComment method that gets the text that a comment refers to would
be:

    public String getRefersToText() {
        StringBuilder refersTo = new StringBuilder();
        for each CTParagraph in document:
            for each child element of the CTParagraph:
                if child element is a commentRangeStart and id==this.id
                    start appending subsequent text runs to the refersTo buffer
                    continue
                if we have found the comment range start and child element is a text run
                    append this text run to the refersTo buffer
                if child element is a commentRangeEnd and id==this.id
                    return refersTo.toString()
                    (assuming that one comment may not refer to multiple text ranges)
    }

This would require searching the entire document for every comment.

https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFDocument.java?view=markup
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFParagraph.java?view=markup

On Tue, May 9, 2017 at 11:14 PM, Javen O'Neal <one...@apache.org> wrote:
> First, if you're using Java 1.5+(?), you can use for-each loops for
> more readable code.
> for (final XWPFComment comment : adoc.getComments()) {
>     final String id = comment.getId();
>     final String author = comment.getAuthor();
>     final String text = comment.getText();
> }
>
> I don't see anything in POI right now that supports extracting the
> annotated text that a track changes comment refers to.
>
> Here's the current implementation of XWPFComment:
> https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFComment.java?view=markup
>
> Taking a look at the OOXML 2006 schemas wml.xsd (download from
> http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%201st%20edition%20Part%204%20(PDF).zip,
> extract OfficeOpenXML-Part4a.zip, extract OfficeOpenXML-XMLSchema.zip,
> open wml.xsd), I see that the comment (*.docx/word/comments.xml)
> doesn't refer to the document text.
>
> <xsd:complexType name="CT_Comment">
>   <xsd:complexContent>
>     <xsd:extension base="CT_TrackChange">
>       <xsd:sequence>
>         <xsd:group ref="EG_BlockLevelElts" minOccurs="0"
>             maxOccurs="unbounded"></xsd:group>
>       </xsd:sequence>
>       <xsd:attribute name="initials" type="ST_String" use="optional">
>         <xsd:annotation>
>           <xsd:documentation>Initials of Comment Author</xsd:documentation>
>         </xsd:annotation>
>       </xsd:attribute>
>     </xsd:extension>
>   </xsd:complexContent>
> </xsd:complexType>
>
> <xsd:complexType name="CT_TrackChange">
>   <xsd:complexContent>
>     <xsd:extension base="CT_Markup">
>       <xsd:attribute name="author" type="ST_String" use="required">
>         <xsd:annotation>
>           <xsd:documentation>Annotation Author</xsd:documentation>
>         </xsd:annotation>
>       </xsd:attribute>
>       <xsd:attribute name="date" type="ST_DateTime" use="optional">
>         <xsd:annotation>
>           <xsd:documentation>Annotation Date</xsd:documentation>
>         </xsd:annotation>
>       </xsd:attribute>
>     </xsd:extension>
>   </xsd:complexContent>
> </xsd:complexType>
>
> <xsd:complexType name="CT_Markup">
>   <xsd:attribute name="id" type="ST_DecimalNumber" use="required">
>     <xsd:annotation>
>       <xsd:documentation>Annotation Identifier</xsd:documentation>
>     </xsd:annotation>
>   </xsd:attribute>
> </xsd:complexType>
>
> Examining the zipped xml contents of a simple comment example docx
> file that I created, I see that the relationship is the other way
> around: the document refers to the comments (this ordering makes more
> sense anyways).
>
> For a simple file that I created with the text "My name is John." and
> annotated the word John with a comment with the message "Noun", here's
> what I got in CommentExample.docx/word/document.xml:
>
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <w:document xmlns....>
>   <w:body>
>     <!-- text paragraph: "My name is [[John]]." -->
>     <w:p w:rsidR="00000000" w:rsidDel="00000000" w:rsidP="00000000"
>         w:rsidRDefault="00000000" w:rsidRPr="00000000">
>       <w:pPr>
>         <w:pBdr/>
>         <w:contextualSpacing w:val="0"/>
>         <w:rPr/>
>       </w:pPr>
>
>       <!-- text run "My name is " -->
>       <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>         <w:rPr><w:rtl w:val="0"/></w:rPr>
>         <w:t xml:space="preserve">My name is </w:t>
>       </w:r>
>
>       <!-- comment range, text run "John" -->
>       <w:commentRangeStart w:id="0"/>
>       <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>         <w:rPr><w:rtl w:val="0"/></w:rPr>
>         <w:t xml:space="preserve">John</w:t>
>       </w:r>
>       <w:commentRangeEnd w:id="0"/>
>
>       <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>         <w:commentReference w:id="0"/>
>       </w:r>
>
>       <!-- text run "." -->
>       <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>         <w:rPr><w:rtl w:val="0"/></w:rPr>
>         <w:t xml:space="preserve">.</w:t>
>       </w:r>
>
>     </w:p>
>     <w:sectPr>
>       <w:pgSz w:h="15840" w:w="12240"/>
>       <w:pgMar w:bottom="1440" w:top="1440" w:left="1440"
>           w:right="1440" w:header="0"/>
>       <w:pgNumType w:start="1"/>
>     </w:sectPr>
>   </w:body>
> </w:document>
>
> So to solve your problem, you could either:
> 1. search the document.xml for all comments, looking up the comment's
> author and text using the ID that is referenced in the document
> commentRangeStart-commentRangeEnd and joining all the text contained
> between those markers
> 2. for each comment in the comment table, find the corresponding
> commentRangeStart and commentRangeEnd tags in document.xml and get the
> corresponding text that was annotated (in this example, John).
>
> If you don't already have a development environment set up, I
> encourage you to do so. Patches are greatly appreciated.
>
> On Tue, May 9, 2017 at 9:42 AM, Ramani Routray <routr...@gmail.com> wrote:
>> I have a Microsoft Word (.docx) file and am trying to retrieve the comments and
>> their associated highlighted text. Can you please help?
>>
>> Attaching picture of the sample word document and the java code for
>> extracting the comments. [ A file with a line "My name is John". The word
>> "John" is highlighted with a comment "Noun" ]
>>
>> I am able to extract the comments (Noun, Adjective). I would like to extract
>> the text associated with the comment "Noun" (Noun = John, Adjective = great)
>>
>> FileInputStream fis = new FileInputStream(new File(msWordFilePath));
>> XWPFDocument adoc = new XWPFDocument(fis);
>> XWPFWordExtractor xwe = new XWPFWordExtractor(adoc);
>> XWPFComment[] comments = adoc.getComments();
>>
>> for(int idx=0; idx < comments.length; idx++)
>> {
>>     MSWordAnnotation annot = new MSWordAnnotation();
>>     annot.setAnnotationName(comments[idx].getId());
>>     annot.setAnnotationValue(comments[idx].getText());
>>     aList.add(annot);
>> }
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
>> For additional commands, e-mail: dev-h...@poi.apache.org
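
P.S. For what it's worth, the getRefersToText() pseudo-code earlier in this message can be tried out without touching POI at all. The following is only a rough sketch under some assumptions: it uses the JDK's DOM parser rather than any POI API, it assumes you have already read word/document.xml out of the .docx zip into a String, and the names CommentRanges and annotatedText are made up for illustration. It collects the text for all comment ids in a single pass (option 1 from the quoted message), which avoids searching the whole document once per comment. It only looks at w:t elements, so tabs, line breaks and similar content inside the range are ignored.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI.
public class CommentRanges {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Maps each comment id to the w:t text found between its range start and end. */
    public static Map<String, String> annotatedText(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match the w: namespace
        Element root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(documentXml)))
                .getDocumentElement();
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        walk(root, new HashSet<>(), buffers);
        Map<String, String> result = new LinkedHashMap<>();
        buffers.forEach((id, sb) -> result.put(id, sb.toString()));
        return result;
    }

    // Depth-first walk in document order; 'open' holds ids of ranges currently open.
    private static void walk(Node node, Set<String> open, Map<String, StringBuilder> buffers) {
        if (node instanceof Element && W.equals(node.getNamespaceURI())) {
            Element e = (Element) node;
            String id = e.getAttributeNS(W, "id");
            switch (e.getLocalName()) {
                case "commentRangeStart":
                    open.add(id);
                    buffers.putIfAbsent(id, new StringBuilder());
                    return;
                case "commentRangeEnd":
                    open.remove(id);
                    return;
                case "t": // text inside a run: append to every open range
                    for (String openId : open) {
                        buffers.get(openId).append(e.getTextContent());
                    }
                    return;
                default:
                    break;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, open, buffers);
        }
    }

    public static void main(String[] args) throws Exception {
        // Trimmed-down version of the CommentExample document.xml shown in the thread.
        String xml = "<w:document xmlns:w=\"" + W + "\"><w:body><w:p>"
                + "<w:r><w:t xml:space=\"preserve\">My name is </w:t></w:r>"
                + "<w:commentRangeStart w:id=\"0\"/>"
                + "<w:r><w:t xml:space=\"preserve\">John</w:t></w:r>"
                + "<w:commentRangeEnd w:id=\"0\"/>"
                + "<w:r><w:commentReference w:id=\"0\"/></w:r>"
                + "<w:r><w:t>.</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println(annotatedText(xml)); // prints {0=John}
    }
}
```

The returned ids are the same strings that XWPFComment.getId() gives you, so joining this map against adoc.getComments() should yield the Noun = John pairing asked for. Because the walk ignores paragraph boundaries, a range that starts in one paragraph and ends in another is still captured, just without any separator between the paragraphs.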