Re: Java (Apache POI) : How to retrieve comment/annotation and associated highlight text from Microsoft Word?

Javen O'Neal Tue, 09 May 2017 23:56:39 -0700

A few additions, since <paragraph><commentRangeStart id="commentId"
/><run><text>John</text></run><commentRangeEnd id="commentId"
/></paragraph> is the critical thing:


        <!-- comment range, text run "John" -->
        <w:commentRangeStart w:id="0"/>
        <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
            <w:rPr><w:rtl w:val="0"/></w:rPr>
            <w:t xml:space="preserve">John</w:t>
        </w:r>
        <w:commentRangeEnd w:id="0"/>

      <xsd:element name="commentRangeStart" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Comment Anchor Range Start</xsd:documentation>
        </xsd:annotation>
      </xsd:element>
      <xsd:element name="commentRangeEnd" type="CT_MarkupRange">
        <xsd:annotation>
          <xsd:documentation>Comment Anchor Range End</xsd:documentation>
        </xsd:annotation>
      </xsd:element>

So if performance isn't a concern here (you don't need to save
pointers to where the comment ranges are), the pseudo-code for an
XWPFComment method that gets the text that a comment refers to would
be:

    public String getRefersToText() {
        StringBuilder refersTo = new StringBuilder();
        boolean inRange = false;
        for each CTParagraph in document:
            for each child element of the CTParagraph:
                if child element is a commentRangeStart and id==this.id
                    // subsequent text runs get appended to the refersTo buffer
                    inRange = true
                    continue
                if inRange and child element is a text run
                    append this text run to the refersTo buffer
                if child element is a commentRangeEnd and id==this.id
                    // assuming that one comment may not refer to
                    // multiple text ranges
                    return refersTo.toString()
        // no matching commentRangeEnd was found
        return refersTo.toString()
    }

This would require searching the entire document for every comment.
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFDocument.java?view=markup
https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFParagraph.java?view=markup
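
If scanning the document once per comment turns out to be too slow, a
single pass can collect the text for every comment id at once. The
following is only a rough sketch and is not POI code: it parses the
contents of word/document.xml with the JDK's DOM parser, and the names
CommentRangeIndex, index and walk are made up for illustration. It
assumes the standard WordprocessingML namespace and concatenates w:t
text only, so paragraph breaks, tabs and other non-text content inside
a range are ignored.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Apache POI.
final class CommentRangeIndex {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Maps each comment id to the text between its range start and end. */
    static Map<String, String> index(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Node root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)))
                    .getDocumentElement();
            Map<String, StringBuilder> open = new LinkedHashMap<>();
            Map<String, String> done = new LinkedHashMap<>();
            walk(root, open, done);
            return done;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document.xml", e);
        }
    }

    // Visits nodes in document order, so a range may span several paragraphs.
    private static void walk(Node node, Map<String, StringBuilder> open,
                             Map<String, String> done) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(node.getNamespaceURI())) {
            Element e = (Element) node;
            String id = e.getAttributeNS(W_NS, "id");
            switch (e.getLocalName()) {
                case "commentRangeStart": {
                    open.put(id, new StringBuilder());
                    return;
                }
                case "commentRangeEnd": {
                    StringBuilder text = open.remove(id);
                    if (text != null) {
                        done.put(id, text.toString());
                    }
                    return;
                }
                case "t": {
                    // Text belongs to every range that is currently open.
                    String text = e.getTextContent();
                    for (StringBuilder buffer : open.values()) {
                        buffer.append(text);
                    }
                    return;
                }
                default:
                    break;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, open, done);
        }
    }

    public static void main(String[] args) {
        // Trimmed-down version of the "My name is John." example.
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:t xml:space=\"preserve\">My name is </w:t></w:r>"
                + "<w:commentRangeStart w:id=\"0\"/>"
                + "<w:r><w:t xml:space=\"preserve\">John</w:t></w:r>"
                + "<w:commentRangeEnd w:id=\"0\"/>"
                + "<w:r><w:commentReference w:id=\"0\"/></w:r>"
                + "<w:r><w:t xml:space=\"preserve\">.</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        System.out.println(index(xml));
    }
}
```

For the example document the resulting map has a single entry, id "0"
mapped to "John". The ids can then be matched against
XWPFComment.getId() to pair each comment's text with the text it
annotates.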

On Tue, May 9, 2017 at 11:14 PM, Javen O'Neal <one...@apache.org> wrote:
> First, if you're using Java 1.5 or later, you can use for-each loops
> for more readable code.
> for (final XWPFComment comment : adoc.getComments()) {
>     final String id = comment.getId();
>     final String author = comment.getAuthor();
>     final String text = comment.getText();
> }
>
> I don't see anything in POI right now that makes it easy to extract
> the annotated text that a comment refers to.
>
> Here's the current implementation of XWPFComment:
> https://svn.apache.org/viewvc/poi/trunk/src/ooxml/java/org/apache/poi/xwpf/usermodel/XWPFComment.java?view=markup
>
> Taking a look at the OOXML 2006 schemas wml.xsd (download from
> http://www.ecma-international.org/publications/files/ECMA-ST/Office%20Open%20XML%201st%20edition%20Part%204%20(PDF).zip,
> extract OfficeOpenXML-Part4a.zip, extract OfficeOpenXML-XMLSchema.zip,
> open wml.xsd), I see that the comment (*.docx/word/comments.xml)
> doesn't refer to the document text.
>
>   <xsd:complexType name="CT_Comment">
>     <xsd:complexContent>
>       <xsd:extension base="CT_TrackChange">
>         <xsd:sequence>
>           <xsd:group ref="EG_BlockLevelElts" minOccurs="0"
> maxOccurs="unbounded"></xsd:group>
>         </xsd:sequence>
>         <xsd:attribute name="initials" type="ST_String" use="optional">
>           <xsd:annotation>
>             <xsd:documentation>Initials of Comment Author</xsd:documentation>
>           </xsd:annotation>
>         </xsd:attribute>
>       </xsd:extension>
>     </xsd:complexContent>
>   </xsd:complexType>
>
>   <xsd:complexType name="CT_TrackChange">
>     <xsd:complexContent>
>       <xsd:extension base="CT_Markup">
>         <xsd:attribute name="author" type="ST_String" use="required">
>           <xsd:annotation>
>             <xsd:documentation>Annotation Author</xsd:documentation>
>           </xsd:annotation>
>         </xsd:attribute>
>         <xsd:attribute name="date" type="ST_DateTime" use="optional">
>           <xsd:annotation>
>             <xsd:documentation>Annotation Date</xsd:documentation>
>           </xsd:annotation>
>         </xsd:attribute>
>       </xsd:extension>
>     </xsd:complexContent>
>   </xsd:complexType>
>
>   <xsd:complexType name="CT_Markup">
>     <xsd:attribute name="id" type="ST_DecimalNumber" use="required">
>       <xsd:annotation>
>         <xsd:documentation>Annotation Identifier</xsd:documentation>
>       </xsd:annotation>
>     </xsd:attribute>
>   </xsd:complexType>
>
> Examining the zipped xml contents of a simple comment example docx
> file that I created, I see that the relationship is the other way
> around: the document refers to the comments (this ordering makes more
> sense anyways).
>
> For a simple file that I created with the text "My name is John." and
> annotated the word John with a comment with the message "Noun", here's
> what I got in CommentExample.docx/word/document.xml:
>
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <w:document xmlns....>
> <w:body>
>     <!-- text paragraph: "My name is [[John]]." -->
>     <w:p w:rsidR="00000000" w:rsidDel="00000000" w:rsidP="00000000"
> w:rsidRDefault="00000000" w:rsidRPr="00000000">
>         <w:pPr>
>             <w:pBdr/>
>             <w:contextualSpacing w:val="0"/>
>             <w:rPr/>
>         </w:pPr>
>
>         <!-- text run "My name is " -->
>         <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>             <w:rPr><w:rtl w:val="0"/></w:rPr>
>             <w:t xml:space="preserve">My name is </w:t>
>         </w:r>
>
>         <!-- comment range, text run "John" -->
>         <w:commentRangeStart w:id="0"/>
>         <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>             <w:rPr><w:rtl w:val="0"/></w:rPr>
>             <w:t xml:space="preserve">John</w:t>
>         </w:r>
>         <w:commentRangeEnd w:id="0"/>
>
>         <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>             <w:commentReference w:id="0"/>
>         </w:r>
>
>         <!-- text run "." -->
>         <w:r w:rsidDel="00000000" w:rsidR="00000000" w:rsidRPr="00000000">
>             <w:rPr><w:rtl w:val="0"/></w:rPr>
>             <w:t xml:space="preserve">.</w:t>
>         </w:r>
>
>     </w:p>
>     <w:sectPr>
>         <w:pgSz w:h="15840" w:w="12240"/>
>         <w:pgMar w:bottom="1440" w:top="1440" w:left="1440"
> w:right="1440" w:header="0"/>
>         <w:pgNumType w:start="1"/>
>     </w:sectPr>
> </w:body>
> </w:document>
>
> So to solve your problem, you could either:
> 1. search the document.xml for all comments, looking up the comment's
> author and text using the ID that is referenced in the document
> commentRangeStart-commentRangeEnd and joining all the text contained
> between those markers
> 2. for each comment in the comment table, find the corresponding
> commentRangeStart and commentRangeEnd tags in document.xml and get the
> corresponding text that was annotated (in this example, John).
>
> If you don't already have a development environment set up, I
> encourage you to do so. Patches are greatly appreciated.
>
> On Tue, May 9, 2017 at 9:42 AM, Ramani Routray <routr...@gmail.com> wrote:
>> I have a Microsoft Word (.docx) file and am trying to retrieve the comments
>> and their associated highlighted text. Can you please help?
>>
>> Attaching picture of the sample word document and the java code for 
>> extracting the comments. [ A file with a line "My name is John". The word 
>> "John" is highlighted with a comment "Noun" ]
>>
>> I am able to extract the comments (Noun, Adjective). I would like to extract 
>> the text associated with the comment "Noun" (Noun = John, Adjective = great)
>>
>> FileInputStream fis = new FileInputStream(new File(msWordFilePath));
>>     XWPFDocument adoc = new XWPFDocument(fis);
>>     XWPFWordExtractor xwe = new XWPFWordExtractor(adoc);
>>     XWPFComment[] comments = adoc.getComments();
>>
>>
>>     for(int idx=0; idx < comments.length; idx++)
>>     {
>>         MSWordAnnotation annot = new MSWordAnnotation();
>>         annot.setAnnotationName(comments[idx].getId());
>>         annot.setAnnotationValue(comments[idx].getText());
>>         aList.add(annot);
>>
>>
>>     }
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
>> For additional commands, e-mail: dev-h...@poi.apache.org
>>
