> To clarify, I would like to manipulate text at the word level in an arbitrary 
> docx file and preserve formatting/styling. The resulting docx file should 
> still be editable, essentially preserved apart from the word manipulations.
> In terms of word manipulations, I have constructed several algorithms, for 
> example an algorithm that restores capitalization to words which may not be 
> present in the docx file. These algorithms depend on looking at neighboring 
> words for each focus word, usually a window of 1-2 words to the left and 
> right. These algorithms can be applied to all words so searching for a 
> particular word and then finding its context would not work in this 
> scenario. What is required is that every single word in the document is 
> inspected and its neighboring context (left and right words) determined. To 
> determine the left word(s) for the first word in a paragraph, it is OK to use 
> the last word(s) of the previous paragraph. Therefore the entire text 
> document can be treated by the algorithm as one continuous text.
> I have a method to split text into tokenized units such as words and 
> punctuation, but for simplicity we can just assume that the input is 
> tokenized by whitespace.
> Thanks

 
I think you need to take a two-step approach.

(1) You need an un-styled run of text to do your analysis. There is a project 
which grew out of Apache Lucene called Apache Tika. Apache Tika is all about 
getting text out of any document type. Tika depends on POI for Office Documents.

See http://poi.apache.org/text-extraction.html

Text extraction doesn't care about formatting. It should give you the text view 
you need to do your analysis.
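
For the analysis side, the windowing described in the question could look 
roughly like the sketch below. It is plain Python, not POI or Tika code. The 
name context_windows and its size parameter are made up for illustration, and 
it assumes the extracted text arrives as a list of paragraph strings that can 
be tokenized by whitespace.

```python
def context_windows(paragraphs, size=2):
    """Yield (left, word, right) for every whitespace-separated token.

    All paragraphs are flattened into one token stream, so the first word
    of a paragraph sees the last words of the previous paragraph as its
    left context.
    """
    tokens = [tok for para in paragraphs for tok in para.split()]
    for i, word in enumerate(tokens):
        left = tokens[max(0, i - size):i]      # up to `size` words before
        right = tokens[i + 1:i + 1 + size]     # up to `size` words after
        yield left, word, right


if __name__ == "__main__":
    for left, word, right in context_windows(["the quick fox", "jumps over"]):
        print(left, word, right)
```

The first and last words of the document simply get a shorter (possibly 
empty) context on one side.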

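(2) If you then grab the document.xml part of your docx (word/document.xml 
inside the zip package), you can find and modify the pieces of content, which 
sit in w:t elements inside the runs. As long as you preserve the style markup 
and replace only the characters, you should be able to do it. Be aware that a 
single word may be split across several runs.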
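
A rough sketch of this second step, using only the Python standard library 
rather than POI, might look like the following. The function names 
rewrite_xml and rewrite_docx are hypothetical. It assumes the part is UTF-8 
and uses the conventional w: namespace prefix, and it edits the raw XML with a 
regular expression so that everything outside the w:t text stays byte-for-byte 
the same. It transforms each word in isolation, so a word split across runs 
would be seen as separate fragments.

```python
import html
import re
import zipfile
from xml.sax.saxutils import escape

# Matches <w:t>...</w:t> (with optional attributes), but not <w:tab/>,
# <w:tbl>, or a self-closing <w:t/>. Assumes the "w:" prefix is used.
W_T = re.compile(r'(<w:t(?:\s[^>]*)?(?<!/)>)(.*?)(</w:t>)', re.DOTALL)


def rewrite_xml(xml, transform):
    """Apply transform(word) to every whitespace-separated word in w:t text."""
    def fix(match):
        text = html.unescape(match.group(2))            # decode &amp; etc.
        text = re.sub(r'\S+', lambda w: transform(w.group()), text)
        return match.group(1) + escape(text) + match.group(3)
    return W_T.sub(fix, xml)


def rewrite_docx(src, dst, transform):
    """Copy a docx package, rewriting only word/document.xml."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")              # assumption: UTF-8
                data = rewrite_xml(xml, transform).encode("utf-8")
            zout.writestr(item, data)
```

For example, rewrite_docx("in.docx", "out.docx", str.upper) would upper-case 
every word and leave styles, run properties, and all other parts untouched. 
Headers, footers, and footnotes live in other parts of the package and are not 
handled here.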

Others here should be able to help with the details; I only have the time and 
knowledge to suggest an approach.

Regards,
Dave

