Thanks, Dave. I am quite comfortable implementing the approach you described; in fact, I had a little play around with it last night. I was just hoping there was a more efficient way of doing it.

> Subject: Re: Questions about manipulating text or words in a docx file.
> From: [email protected]
> Date: Tue, 8 Mar 2011 17:03:54 -0800
> To: [email protected]
>
> > To clarify, I would like to manipulate text at the word level in an
> > arbitrary docx file and preserve formatting/styling. The resulting docx
> > file should still be editable, essentially preserved apart from the word
> > manipulations.
> > In terms of word manipulations, I have constructed several algorithms, for
> > example an algorithm that restores capitalization to words which may not be
> > present in the docx file. These algorithms depend on looking at neighboring
> > words for each focus word, usually a window of 1-2 words to the left and
> > right. These algorithms can be applied to all words, so searching for a
> > particular word and then finding its context would not work in this
> > scenario. What is required is that every single word in the document is
> > inspected and its neighboring context (left and right words) determined. To
> > determine the left word(s) for the first word in a paragraph, it is OK to
> > use the last word(s) of the previous paragraph. Therefore the entire text
> > document can be treated by the algorithm as one continuous text.
> > I have a method to split text into tokenized units such as words and
> > punctuation, but for simplicity we can just assume that the input is
> > tokenized by whitespace.
> > Thanks
> >
> I think you need to take a two-step approach.
>
> (1) You need an un-styled run of text to do your analysis. There is a project
> which grew out of Apache Lucene called Apache Tika. Apache Tika is all about
> getting text out of any document type. Tika depends on POI for Office
> Documents.
>
> See http://poi.apache.org/text-extraction.html
>
> Text extraction doesn't care about formatting. It should give you the text
> view you need to do your analysis.
>
> (2) If you then grab the document.xml part of your docx you can then simply
> find and modify the pieces of content. As long as you are preserving style
> and just replacing characters you should be able to do it.
>
> Others here should be able to help with the details, I only have the time and
> knowledge to suggest an approach.
>
> Regards,
> Dave
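
For anyone following the thread, here is a minimal sketch of the two steps discussed above. It is written in Python with only the standard library rather than POI, so it illustrates the idea and is not the approach Dave or POI prescribes. The function names (apply_with_context, transform_document_xml, rewrite_docx, capitalize_after_period) are made up for illustration. It assumes the text lives in <w:t> elements written with the usual "w" prefix, that document.xml is UTF-8, that tokens are split on whitespace, and that no word is split across two runs (Word does this often, so a real implementation would need to merge runs or track offsets).

```python
import html
import re
import zipfile
from xml.sax.saxutils import escape

# Assumption: Word writes WordprocessingML with the "w" prefix, so text sits
# in <w:t>...</w:t>. The pattern skips self-closing <w:t/> and tags such as
# <w:tab/> or <w:tbl>.
W_T = re.compile(r"(<w:t(?:\s[^>]*)?(?<!/)>)(.*?)(</w:t>)", re.DOTALL)
SPLIT = re.compile(r"(\s+)")  # capturing group keeps the whitespace pieces


def is_word(piece):
    return bool(piece) and not piece.isspace()


def apply_with_context(words, fn, window=2):
    """Call fn(left, word, right) for every word and collect the results."""
    out = []
    for i, word in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        out.append(fn(left, word, right))
    return out


def transform_document_xml(xml, fn, window=2):
    """Rewrite the words inside each <w:t>, leaving all other markup as-is."""
    # Pass 1: flatten the whole document into one continuous word list, so
    # the context window can reach across run and paragraph boundaries.
    texts = [html.unescape(m.group(2)) for m in W_T.finditer(xml)]
    words = [p for text in texts for p in SPLIT.split(text) if is_word(p)]
    new_words = iter(apply_with_context(words, fn, window))

    # Pass 2: put the new words back in the same positions, in document order.
    def rebuild(match):
        pieces = SPLIT.split(html.unescape(match.group(2)))
        text = "".join(next(new_words) if is_word(p) else p for p in pieces)
        return match.group(1) + escape(text) + match.group(3)

    return W_T.sub(rebuild, xml)


def rewrite_docx(src, dst, fn, window=2):
    """Copy a docx, transforming only word/document.xml (assumed UTF-8)."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                data = transform_document_xml(xml, fn, window).encode("utf-8")
            zout.writestr(item, data)


def capitalize_after_period(left, word, right):
    """Toy rule: capitalize the first word and any word after . ! or ?"""
    if not left or left[-1].endswith((".", "!", "?")):
        return word[:1].upper() + word[1:]
    return word
```

Usage would be something like rewrite_docx("in.docx", "out.docx", capitalize_after_period), where the file names are placeholders. Because only the character data inside <w:t> is touched, run properties such as bold or font stay where they were. Headers, footers, and footnotes live in other parts of the package and are not handled here.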
