Sorry for recycling an old email... I read this pretty late.

On Thu, Oct 30, 2014 at 3:10 AM, Shrinivasan T <tshriniva...@gmail.com>
wrote:

> On working http://FreeTamilEbooks.com project, we get many books in
> non-unicode text.
>
> We are converting the text using open-tamil library.
>
> https://github.com/arcturusannamalai/open-tamil
> https://github.com/arulalant/txt2unicode
>
> It is a python script and runs in terminal.
> It can convert only the plain text.
>

This is a nice initiative.

> The authors gives their works as MS word doc,
> with lot of formatting like Bold/Italic/Tables/Headings etc.
>

Is it .DOC or .DOCX or something else?


> We convert them to plain text and then convert to unicode.
> Now, it becomes a tired job to reformat them all the text
> as previous formatting.
>

Formatting of the text is "metadata". Once you convert the entire Word
document into plain text, you lose the metadata. You will not be able to
put it back.



> Looking for ideas on how to convert the encoding with the rich text
> without losing any formatting.
>

1. If you convert the Word document to HTML, the resulting HTML is UGLY.
You might end up with a hundred tags between two letters of the same word.
It is simply horrible :(.


For .DOC
2. I have tried this in the past. Read the Word document using Apache
POI (Java). It returns each stretch of a paragraph or sentence that shares
the same formatting as one chunk. The next format comes in a different
chunk (and so on). You can read the document this way, convert each chunk
using your txt2unicode, and put the text back in place. This way it would
be easier.

Word of caution: when the MS Word documents contained embedded images, we
were not able to read the exact order of the images in them. So, if there
are 10 images, you will not know which image goes where.
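
The chunk-by-chunk idea can be sketched in Python, independent of Apache
POI. This is only an illustration: the Chunk class, its bold/italic fields
and convert_chunks are made-up names, and the "convert" argument stands in
for whatever legacy-to-Unicode function you take from txt2unicode. Only the
text of each chunk is replaced; the formatting flags ride along untouched.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for one run of text with uniform formatting."""
    text: str
    bold: bool = False
    italic: bool = False


def convert_chunks(chunks: List[Chunk], convert: Callable[[str], str]) -> List[Chunk]:
    """Return new chunks with converted text and the original formatting."""
    # replace() copies every field of the chunk and swaps in the new text,
    # so bold/italic (the "metadata") are never touched.
    return [replace(chunk, text=convert(chunk.text)) for chunk in chunks]


if __name__ == "__main__":
    # str.upper is only a placeholder for the real encoding converter.
    document = [Chunk("heading", bold=True), Chunk(" body text"), Chunk(" note", italic=True)]
    for chunk in convert_chunks(document, str.upper):
        print(chunk)
```

With POI the chunks would come from the library instead of being built by
hand, but the loop has the same shape: read a chunk, convert its text,
write it back with its formatting left alone.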


For .DOCX
3. If you are able to get the document in .DOCX format instead of .DOC,
then do this:
--> Rename the .DOCX to .ZIP and then unzip it. You will find a
"document.xml" inside (under the "word" folder). Happy converting :-)
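
A rough sketch of the DOCX route in Python, using only the standard library
(no renaming needed, since zipfile opens a .docx directly). The function
name convert_docx and the "convert" argument are my own placeholders;
plug in your actual txt2unicode function there. The sketch rewrites only
the text nodes (w:t) of word/document.xml and copies every other part of
the archive as it is, so the bold/italic/table markup around the text stays.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main WordprocessingML elements (w:p, w:r, w:t, ...).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def convert_docx(src, dst, convert):
    """Copy the DOCX at src to dst, passing every w:t text through convert().

    src and dst may be file names or binary file-like objects.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                root = ET.fromstring(data)
                # Touch only the text; formatting lives in sibling elements.
                for node in root.iter("{%s}t" % W_NS):
                    if node.text:
                        node.text = convert(node.text)
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Styles, images, relationships etc. are copied unchanged.
            zout.writestr(item, data)
```

Usage would be something like convert_docx("book.docx", "book-unicode.docx",
my_converter), where both file names and my_converter are hypothetical. Two
caveats: ElementTree may rename namespace prefixes other than "w", which
Word may object to in real documents, and the runs will still name the old
legacy font, so the font setting probably needs to be changed to a Unicode
Tamil font as well. Treat this as a starting point and test it on a copy.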


--
Natarajan.
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc
ILUGC Mailing List Guidelines:
http://ilugc.in/mailinglist-guidelines
