To clarify, I would like to manipulate text at the word level in an arbitrary 
docx file and preserve formatting/styling. The resulting docx file should still 
be editable, essentially preserved apart from the word manipulations.
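One possible way to keep the styling intact, sketched here under assumptions: 
treat each paragraph as a list of formatting runs, edit the joined text, and 
cut the result back into the original run boundaries. This only works for 
edits that keep the text length unchanged (changing letter case, for 
instance). The function name edit_across_runs and the plain list of run 
strings are hypothetical stand-ins, not the API of any docx library.

```python
from typing import Callable, List


def edit_across_runs(run_texts: List[str], edit: Callable[[str], str]) -> List[str]:
    """Apply a length-preserving edit to the joined run text, then split the
    result back along the original run boundaries (hypothetical helper)."""
    joined = "".join(run_texts)
    edited = edit(joined)
    # If the length changes, the original run boundaries no longer line up.
    if len(edited) != len(joined):
        raise ValueError("edit changed the text length; run boundaries would shift")
    out, pos = [], 0
    for text in run_texts:
        out.append(edited[pos:pos + len(text)])
        pos += len(text)
    return out


# A word split across two differently styled runs keeps its boundaries.
print(edit_across_runs(["hel", "lo wor", "ld"], str.title))
```

Because each run keeps its own slice of the edited text, whatever formatting 
is attached to a run would stay attached to the same character positions.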
In terms of word manipulations, I have constructed several algorithms, for 
example one that restores capitalization where it is missing from the text of 
the docx file. These algorithms depend on looking at neighboring 
words for each focus word, usually a window of 1-2 words to the left and right. 
These algorithms can apply to any word, so searching for a particular word 
and then finding its context would not work in this scenario. What is required 
is that every single word in the document is inspected and its neighboring 
context (left and right words) determined. To determine the left word(s) for 
the first word in a paragraph, it is OK to use the last word(s) of the previous 
paragraph. Therefore the entire text document can be treated by the algorithm 
as one continuous text.
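To make the requirement concrete, here is a minimal sketch of the context 
lookup, assuming the paragraphs are already available as plain strings. The 
name iter_contexts and the window parameter are made up for illustration. 
Every word is visited once, and the window reaches across paragraph 
boundaries.

```python
from typing import Iterator, List, Tuple

Context = Tuple[int, int, List[str], str, List[str]]


def iter_contexts(paragraphs: List[str], window: int = 2) -> Iterator[Context]:
    """Yield (paragraph index, word index, left words, focus word, right words)
    for every word, treating all paragraphs as one continuous text."""
    # Flatten the paragraphs into one stream, remembering each word's origin.
    stream = [
        (p_idx, w_idx, word)
        for p_idx, text in enumerate(paragraphs)
        for w_idx, word in enumerate(text.split())
    ]
    words = [word for _, _, word in stream]
    for i, (p_idx, w_idx, word) in enumerate(stream):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        yield p_idx, w_idx, left, word, right


for item in iter_contexts(["the quick fox", "jumped over"]):
    print(item)
```

The paragraph and word indices are kept so that a changed word can later be 
written back to the place it came from.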
I have a method to split text into tokenized units such as words and 
punctuation, but for simplicity we can just assume that the input is tokenized 
by whitespace.
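A rough sketch of whitespace tokenization that leaves the original spacing 
untouched, so that only the words themselves change. The helper name 
rewrite_words and the transform callback are hypothetical.

```python
import re
from typing import Callable


def rewrite_words(text: str, transform: Callable[[str], str]) -> str:
    """Apply transform to every whitespace-delimited word and keep the
    original whitespace exactly as it was."""
    # The capturing group makes re.split return the separators as well.
    pieces = re.split(r"(\s+)", text)
    return "".join(
        piece if (not piece or piece.isspace()) else transform(piece)
        for piece in pieces
    )


print(repr(rewrite_words("hello  world\tagain", str.upper)))
```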
Thanks                                    
