On Mon, 2012-07-09 at 00:25 +0100, Flavio Moringa wrote:
> nice to hear from someone so "up the ranks" like you.. makes me feel
> much more important :-)

Ho hum; we try to avoid unpleasant hierarchy as much as possible.

> I probably won't be able to do a conversion engine by myself...
> but I can definitely mess around with code...

Great :-)

> Yes, it's definitely something I can do... I do believe that the
> harder part is getting that " large corpus of documents out
> there...". At least as my experience goes, I've found that it's hard
> to get users to send us documents they use... either due to privacy
> questions or enterprise policies... But a tool like that makes a lot
> of sense

Oh - so; getting the documents is not -that- hard; Google has a
document-type search that can be automated; just search for:

	filetype:docx

And start scraping; as well as 7 million files, we get to take
advantage of Google's popularity ranking to get the most popular
first 100 or whatever :-)

> For now then I'll start doing as you suggest and look in bugzilla for
> documents with conversion problems to try and compile as many examples
> as I can. Then maybe using the latest beta to do the conversion and
> see which problems are still there. Then maybe starting a Perl script
> that can scrape the OOXML files to find the most used tags... and start
> from there...

We also have tools for dumping all the documents out of bugzilla -
see the main 'core' repository:

	bin/get-bugzilla-attachments-by-mimetype

so really the fun piece is writing the parser & element / attribute
value parser / database to analyse what pieces are popular and provide
a pretty UI or command-line for hackers to grok that.

It'd be just great to have that data in hand.

Thanks !

	Michael.

-- 
michael.me...@suse.com <><, Pseudo Engineer, itinerant idiot

_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
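
For illustration only, here is a minimal sketch of the element / attribute
counting step discussed above. It is written in Python rather than the Perl
mentioned in the thread, and it assumes each OOXML file is a zip archive of
XML parts whose names can simply be tallied. The function names
count_ooxml_names and survey are hypothetical; they are not part of any
existing LibreOffice tool.

```python
import io
import zipfile
from collections import Counter
import xml.etree.ElementTree as ET


def count_ooxml_names(ooxml_file):
    """Count element and attribute names in the XML parts of one OOXML file.

    ooxml_file may be a filesystem path or a binary file-like object.
    Names are reported in ElementTree's "{namespace}local" notation.
    """
    elements = Counter()
    attributes = Counter()
    with zipfile.ZipFile(ooxml_file) as archive:
        for part in archive.namelist():
            if not part.endswith((".xml", ".rels")):
                continue  # skip images, fonts and other binary parts
            try:
                root = ET.fromstring(archive.read(part))
            except ET.ParseError:
                continue  # one broken part should not stop the survey
            for node in root.iter():
                elements[node.tag] += 1
                attributes.update(node.attrib.keys())
    return elements, attributes


def survey(files):
    """Aggregate counts over many files, skipping anything unreadable."""
    all_elements, all_attributes = Counter(), Counter()
    for item in files:
        try:
            elements, attributes = count_ooxml_names(item)
        except (zipfile.BadZipFile, OSError):
            continue  # not a zip archive, or not readable at all
        all_elements.update(elements)
        all_attributes.update(attributes)
    return all_elements, all_attributes
```

Calling survey() on a list of downloaded .docx paths and then
most_common(20) on either counter would give a first rough ranking; a real
tool would presumably also record attribute values and store the results
in a database, as suggested above.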