Re: [GSoC 2014]Interested in Round trip conversion between LyX and .docx formats

Rainer M Krug Sun, 23 Feb 2014 11:30:24 -0800

stefano franchi <stefano.fran...@gmail.com> writes:

> Hi Prannoy,
>
> Welcome to LyX!

Welcome Prannoy.

>
> I am happy to hear you found LyX interesting and would like to
> contribute to our project.
> Let me remind you, though, that Google has not yet announced which
> organizations it will accept this year (the announcement will be made
> on Feb 24, i.e. tomorrow). We are hopeful we will be selected
> again, but there is no certainty.
>
> That being said, if you want to get a head start I would encourage you
> to start getting familiar with LyX's code base. A good starting point
> is our bug tracker [2]. Several bugs are marked "easyfix" and provide
> excellent entry points to begin working on the code. Developer
> documentation is available on our wiki as well at [3]. Ask questions on
> the developers list on how to proceed and be sure to check out the
> beginner developers FAQ [3].

I have nothing to add here and second everything Stefano said.

>
> For the LyX<-->Word round-trip conversion project, check out this
> thread on the devel list as a starting point:
> https://www.mail-archive.com/lyx-devel%40lists.lyx.org/msg182083.html
> The main goals of the project are discussed there. Notice how the main
> goal of the conversion (either way) is the preservation of a document's
> "semantic" information, not its formatting.
>
> Thus, the first design choice is a careful definition of what counts as
> "semantic" information in a generic LyX (and Word) document. The bullet
> points in the project page provide a first definition. This list should
> be formalized into its LyX and Word counterparts (i.e. LyX's paragraph
> environments and character styles and, similarly, styles of either kind
> for Word). Most likely, it would be best to create a simple, special
> LaTeX class/LyX layout that includes all and only the allowed styles,
> and, similarly, a Word template that includes all and only the allowed
> styles.
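
To make the "all and only the allowed styles" idea a bit more concrete, here is a minimal Python sketch. The mapping table, the function name and the selection of styles are assumptions made up for illustration, not an agreed list; the Word side uses what I believe are Word's built-in style names. The sketch relies only on the fact that paragraphs in a .lyx file start with a "\begin_layout <Name>" line.

```python
import re

# Hypothetical whitelist: LyX layout name -> Word paragraph style name.
# The selection is an assumption; the real list would be derived from the
# bullet points on the project page.
ALLOWED_STYLES = {
    "Standard": "Normal",
    "Title": "Title",
    "Section": "Heading 1",
    "Subsection": "Heading 2",
    "Quotation": "Quote",
    "Itemize": "List Bullet",
    "Enumerate": "List Number",
}

# Paragraphs in a .lyx file open with a line of the form "\begin_layout <Name>".
LAYOUT_RE = re.compile(r"^\\begin_layout\s+(.+?)\s*$", re.MULTILINE)


def unsupported_layouts(lyx_text):
    """Return the sorted layout names used in lyx_text that are not whitelisted."""
    used = set(LAYOUT_RE.findall(lyx_text))
    return sorted(used - set(ALLOWED_STYLES))


sample = (
    "\\begin_layout Title\nA title\n\\end_layout\n"
    "\\begin_layout Verse\nSome verse\n\\end_layout\n"
)
print(unsupported_layouts(sample))
```

A converter could run such a check first and refuse (or warn) when a document uses a layout outside the restricted class, instead of silently dropping semantic information.
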
>
> Rob Oakes has been working on a Word-->LyX converter in Python [5] which
> you may want to check out as well.
>
> From a technical point of view, two early design choices are:
>
> 1. Whether to start the conversion from the LyX format or from the LaTeX
>    format that LyX can output.
>
>    This is a really tricky issue. On the one hand, working from LyX is
>    much simpler, as we have direct access to the parsed data, or we can
>    leverage other tools that parse LyX's file format (e.g. eLyXer). On
>    the other hand, some crucially important information is absent from
>    LyX and is actually *produced* by LaTeX. Bibliographic references are
>    the most important example in this class: a fully "semantically"
>    formatted reference is absent from LyX. It is bibtex|biblatex + LaTeX
>    that actually produce the data. Index information is probably in this
>    category too.
>    The difficult problem is how to extract information from LaTeX's
>    output. There is an existing project, tex4ht [6], that pursues this
>    approach. The project is not actively developed now, due to the
>    untimely death of its founder, but it is still available, and it
>    actually works. tex4ht runs latex with a special style which inserts
>    parsing commands into LaTeX's DVI output. A Java program then parses
>    the special DVI output and produces HTML or ODF output. This approach
>    allows tex4ht to exploit LaTeX's own processing (including the
>    processing of index and bibliographic information), at the cost of
>    increased complexity.
>    One possibility would be to follow tex4ht's approach, while
>    simplifying as much as possible the kind of LaTeX information
>    actively supported.
>
>   One important drawback of this second strategy
>   (LyX-->LaTeX-->Word|ODF) is that LyX-only information is lost when
>   converting to LaTeX. The most important example is tracked changes.
>   Standard LaTeX has no concept of tracked changes. There are additional
>   LaTeX packages that manage changes (e.g. [7]), and we would have to
>   convert LyX's changes into that format. This of course adds an
>   additional dependency, unless the package's functionality is somehow
>   replicated by us.
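
As a rough idea of what converting LyX's changes into the format of the changes package [7] could look like, here is a small Python sketch. It assumes that tracked changes appear in the .lyx file as "\change_inserted", "\change_deleted" and "\change_unchanged" marker lines (that is how I remember the file format), and it maps them to the package's \added{} and \deleted{} commands. The function name is made up, and the author id and timestamp on the marker lines are simply ignored.

```python
def lyx_changes_to_latex(lines):
    """Wrap text that follows LyX change markers in changes-package commands.

    Assumption: `lines` holds only plain text lines and change markers;
    a real .lyx file contains many other backslash commands.
    """
    out = []
    buffer = []
    command = None  # None (unchanged text), "added" or "deleted"

    def flush():
        # Emit the buffered text, wrapped if a change is currently active.
        if buffer:
            text = " ".join(buffer)
            out.append("\\%s{%s}" % (command, text) if command else text)
            buffer.clear()

    for line in lines:
        if line.startswith("\\change_inserted"):
            flush()
            command = "added"
        elif line.startswith("\\change_deleted"):
            flush()
            command = "deleted"
        elif line.startswith("\\change_unchanged"):
            flush()
            command = None
        elif line.strip():
            buffer.append(line.strip())
    flush()
    return " ".join(out)


sample = [
    "The",
    "\\change_deleted 0 1393152624",
    "old",
    "\\change_inserted 0 1393152624",
    "new",
    "\\change_unchanged",
    "result.",
]
print(lyx_changes_to_latex(sample))
```

The same marker scan could feed a Word writer directly (insertions and deletions as Word revisions) without going through LaTeX at all, which is one more argument for reading this kind of information from the LyX file.
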

I would like to add that there is a third option (and I am not talking
about the xhtml), i.e. using the LyX format for what it can do, and using
the LaTeX document to supplement the one created from LyX. By doing
this, one can use the simple approach for basic features, and then
supplement it with the LaTeX route when needed (I guess that the most
important aspect is the bibliography).
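
A minimal Python sketch of this combined route, for the bibliography only: the citation keys are read from the .lyx file, and the formatted references are taken from the .bbl file that BibTeX writes during a LaTeX run. All function names are made up, the sample entry is invented, and the sketch assumes that citation insets carry a line of the form key "..." and that .bbl entries start with \bibitem{key} or \bibitem[label]{key}.

```python
import re

# Assumption: citation insets in a .lyx file contain a line like: key "a,b"
CITE_KEY_RE = re.compile(r'^key\s+"([^"]+)"', re.MULTILINE)
# Assumption: .bbl entries start with \bibitem{key} or \bibitem[label]{key}
BIBITEM_RE = re.compile(r"\\bibitem(?:\[[^\]]*\])?\{([^}]+)\}")


def cited_keys(lyx_text):
    """Return the citation keys found in lyx_text, in order, without repeats."""
    keys = []
    for match in CITE_KEY_RE.finditer(lyx_text):
        for key in match.group(1).split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


def parse_bbl(bbl_text):
    """Map each \\bibitem key to its formatted text with whitespace collapsed."""
    # With one capturing group, split() yields [preamble, key1, body1, key2, ...]
    parts = BIBITEM_RE.split(bbl_text)
    entries = {}
    for key, body in zip(parts[1::2], parts[2::2]):
        body = body.split("\\end{thebibliography}")[0]
        entries[key] = " ".join(body.split())
    return entries


def references_for(lyx_text, bbl_text):
    """Pair each key cited in the LyX file with its LaTeX-formatted reference."""
    formatted = parse_bbl(bbl_text)
    return [(key, formatted.get(key)) for key in cited_keys(lyx_text)]


lyx_sample = (
    "\\begin_inset CommandInset citation\n"
    "LatexCommand cite\n"
    'key "doe2014,missing"\n'
    "\n\\end_inset\n"
)
bbl_sample = (
    "\\begin{thebibliography}{1}\n\n"
    "\\bibitem{doe2014}\nJ. Doe.\n\\newblock An Example Title.\n\n"
    "\\end{thebibliography}\n"
)
print(references_for(lyx_sample, bbl_sample))
```

Keys that come back paired with None are the ones LaTeX could not resolve, so the converter knows exactly where the simple LyX route is not enough.
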

>
> 2. Whether to target Microsoft's Word XML format or the Open Document
> Format (similarly XML-based)

I would strongly argue for the Microsoft Word XML, as each conversion
creates problems and inconsistencies. That said, if the conversion from
MS Word XML to ODF and back can be done without causing problems in the
round trip (i.e. the round trip would then be lyx - ODF XML - MS XML -
ODF XML - lyx), I would argue for the more "open" format, which can be
used on more operating systems.

But there is also the (often raised) question of the move from the lyx
format to an XML-based format... I have no idea what stage the new
format is at, but one should keep this likely change in the .lyx format
in mind.

>
>   You may want to start learning about both formats. I haven't looked
>   into either in any depth yet, but my first impression is that
>   Microsoft's is more complex.
>
>
>
> Feel free to ask more questions!
>
>

Cheers,

Rainer

>
> Cheers,
>
> Stefano
>
>
> [1] http://wiki.lyx.org/GSoC/GSoCProjectIdeasFor2014
> [2] http://www.lyx.org/trac/
> [3] http://www.lyx.org/DevFAQ
> [4] http://www.lyx.org/trac/search?q=advanced+find
> [5] http://blog.oak-tree.us/index.php/2012/03/08/word2lyx01-2
> [6] https://www.tug.org/applications/tex4ht/mn.html
> [7] http://texdoc.net/texmf-dist/doc/latex/changes/changes.english.pdf
>
>
> On Sun, Feb 23, 2014 at 6:17 AM, Prannoy Pilligundla <prannoy.b...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> I am Prannoy Pilligundla, pursuing an undergraduate degree at
>> BITS-Pilani, India. I am proficient in C, Java, Python and RoR. Here is
>> the link to my Bitbucket profile: https://bitbucket.org/prannoy1994
>>
>> I had a look at the 2014 ideas page
>> (http://wiki.lyx.org/GSoC/GSoCProjectIdeasFor2014) and I am interested
>> in working on round-trip conversion between LyX and .docx formats. It
>> would be great if someone could guide me on how to start work on this.
>> I want to get accustomed to the existing code base and start
>> contributing before writing my application for GSoC 2014.
>>
>> Thanks and Regards
>> Prannoy Pilligundla

-- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, 
UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

Tel :       +33 - (0)9 53 10 27 44
Cell:       +33 - (0)6 85 62 59 98
Fax :       +33 - (0)9 58 10 27 44

Fax (D):    +49 - (0)3 21 21 25 22 44

email:      rai...@krugs.de

Skype:      RMkrug
