Re: Document conversion engine

Michael Meeks Mon, 09 Jul 2012 01:24:33 -0700

On Mon, 2012-07-09 at 00:25 +0100, Flavio Moringa wrote:
> nice to hear from someone so "up the ranks" like you.. makes me feel
> much more important :-)


        Ho hum; we try to avoid unpleasant hierarchy as much as possible.

> I probably won't be able to do a conversion engine by myself...
> but I can definitely mess around with code...

        Great :-)

> Yes, it's definitely something I can do... I do believe that the
> harder part is getting that "large corpus of documents out
> there...". At least in my experience, I've found that it's hard
> to get users to send us documents they use... either due to privacy
> concerns or enterprise policies... But a tool like that makes a lot
> of sense

        Oh - so; getting the documents is not -that- hard; Google has a
document-type search that can be automated; just search for:

        filetype:docx

        And start scraping; as well as getting 7 million files, we get to take
advantage of Google's popularity ranking to fetch the most popular 100
(or whatever) first :-)

> For now then I'll start doing as you suggest and look in bugzilla for
> documents with conversion problems to try and compile as many examples
> as I can. Then maybe using the latest beta to do the conversion and
> see which problems are still there. Then maybe starting a Perl script
> that can scrape the OOXML files to find the most used tags... and start
> from there...

        We also have tools for dumping all the documents out of bugzilla - see
the main 'core' repository:

        bin/get-bugzilla-attachments-by-mimetype

        so really the fun piece is writing the parser & element / attribute
value parser / database to analyse what pieces are popular and provide a
pretty UI or command-line for hackers to grok that.

        It'd be just great to have that data in hand.
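
        As a rough illustration of the counting step described above, here is
a minimal sketch in Python. It assumes each OOXML file is an ordinary zip
archive of XML parts, and the function name count_tags is made up for this
example; it is not an existing LibreOffice tool. It only tallies element and
attribute names (not attribute values), and there is no database or UI.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(docx_file, elements=None, attributes=None):
    """Tally element and attribute names in every XML part of one OOXML file.

    count_tags is a hypothetical helper for illustration only.
    Names are reported in ElementTree's "{namespace}local" form.
    """
    elements = Counter() if elements is None else elements
    attributes = Counter() if attributes is None else attributes
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            # Only look at the XML parts; skip images, fonts and so on.
            if not name.endswith((".xml", ".rels")):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # skip malformed parts rather than abort the run
            for node in root.iter():
                elements[node.tag] += 1
                attributes.update(node.attrib.keys())
    return elements, attributes


if __name__ == "__main__":
    elements, attributes = Counter(), Counter()
    for path in sys.argv[1:]:
        try:
            count_tags(path, elements, attributes)
        except zipfile.BadZipFile:
            print(f"skipping {path}: not a zip archive", file=sys.stderr)
    # Print the 20 most frequent elements across all files given.
    for tag, n in elements.most_common(20):
        print(f"{n:8d}  {tag}")
```

        Run over a directory of scraped or bugzilla-dumped files, the shared
counters accumulate across documents, so the output is a first cut at "which
pieces are popular".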

        Thanks !

                Michael.

-- 
michael.me...@suse.com  <><, Pseudo Engineer, itinerant idiot

_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
