I thought I would make a start on rewriting the Plaintext converter
in C. I expected it to be simple compared with the HTML converter
that already exists. For one thing, there should be no customization API
for the Plaintext converter. All the conversion is done with straight
function calls rather than via hooks that can be overridden.
I looked at HTML.pm to see how this module was implemented in C as part
of the program. I noticed there were two modules, one called HTML.pm,
the other called HTMLNonXS.pm. I thought that HTML.pm might just be a
short module that would load HTMLNonXS.pm or XS code as required, but
HTML.pm contains a sizeable amount of code itself, much of it relevant
to the conversion (it's over 8000 lines of code). HTMLNonXS.pm is also a
very sizeable module, at over 5000 lines of code. So I assume the Perl
code for converting HTML is split between these two modules for some
reason.
Compare HTML.pm and HTMLNonXS.pm with Texinfo/Convert/Paragraph.pm and
Texinfo/Convert/ParagraphNonXS.pm. Paragraph.pm is a very short module.
Whatever the relationship is between HTML.pm and HTMLNonXS.pm, it is clearly
different from the relationship between Paragraph.pm and ParagraphNonXS.pm.
I thought I might have an easier time reading the ctexi2any code to see
how a converter was defined.
There were a few things I found confusing.
texi2any.c (the file with 'main') calls a 'txi_converter_setup' function
with an argument based on the output format. This is in the file
C/convert/texinfo.c.
I would have expected a file called "convert/texinfo.c" to deal with
converting to Texinfo as an output format (possibly useful for testing,
or for expanding macros). But texinfo.c appears to have code for a mixture
of different things:
/* Interface similar to the Perl modules interface for Texinfo parsing,
higher-level interface for document structure and transformations,
and interface similar to the Perl modules interface for conversion */
I don't get from this a clear idea of what this file is for.
'txi_converter_setup' calls 'converter_converter', defined in
C/convert/converter.c. This then refers to a data array,
'converter_format_data', defined in the same file:
/* table used to dispatch format specific functions.
Same purpose as inherited methods in Texinfo::Convert::Converter */
/* Should be kept in sync with enum converter_format
and TXI_CONVERSION_FORMAT_NR */
CONVERTER_FORMAT_DATA converter_format_data[] = {
  {"html", "Texinfo::Convert::HTML", &html_format_setup, 0,
   &html_converter_defaults,
   &html_converter_initialize, &html_output, &html_convert,
   &html_convert_tree, 0, &html_free_converter, &html_element_cdt_tree},
  {"rawtext", "Texinfo::Convert::Text", 0, &rawtext_converter,
   0, 0, &rawtext_output,
   &rawtext_convert, &rawtext_convert_tree, 0, 0, 0},
  {"plaintexinfo", "Texinfo::Convert::PlainTexinfo", 0, 0,
   &plaintexinfo_converter_defaults, 0, &plaintexinfo_output,
   &plaintexinfo_convert, &plaintexinfo_convert_tree, 0, 0, 0},
};
This appears to have a similar purpose to the array in texi2any.c:
static FORMAT_SPECIFICATION formats_table[] = {
  {"info", STTF_nodes_tree | STTF_floats,
   NULL, "Texinfo::Convert::Info", NULL},
  {"html", STTF_relate_index_entries_to_table_items
           | STTF_move_index_entries_after_items
           | STTF_no_warn_non_empty_parts
           | STTF_nodes_tree | STTF_floats | STTF_split
           | STTF_internal_links,
   NULL, "Texinfo::Convert::HTML", NULL},
  {"plaintext", STTF_nodes_tree | STTF_floats | STTF_split,
   NULL, "Texinfo::Convert::Plaintext", NULL},
  {"latex", STTF_floats | STTF_move_index_entries_after_items
            | STTF_no_warn_non_empty_parts,
   NULL, "Texinfo::Convert::LaTeX", NULL},
  {"docbook", STTF_move_index_entries_after_items
              | STTF_no_warn_non_empty_parts,
   NULL, "Texinfo::Convert::DocBook", NULL},
  {"epub3", 0, "html", NULL, "epub3.pm"},
  {"texinfoxml", STTF_nodes_tree,
   NULL, "Texinfo::Example::TexinfoXML", NULL},
  {"pdf", STTF_texi2dvi_format, NULL, NULL, NULL},
  {"ps", STTF_texi2dvi_format, NULL, NULL, NULL},
  {"dvi", STTF_texi2dvi_format, NULL, NULL, NULL},
  {"dvipdf", STTF_texi2dvi_format, NULL, NULL, NULL},
  {"debugtree", STTF_split,
   NULL, "Texinfo::DebugTree", NULL},
  {"textcontent", 0, NULL, "Texinfo::Convert::TextContent", NULL},
  {"plaintexinfo", 0, NULL, NULL, NULL},
  {"rawtext", 0, NULL, NULL, NULL},
  {"parse", 0, NULL, NULL, NULL},
  {"structure", STTF_nodes_tree | STTF_floats | STTF_split, NULL, NULL,
   NULL},
  {NULL, 0, NULL, NULL, NULL}
};
At the least, it appears to duplicate the association between format name
("html") and associated Perl module ("Texinfo::Convert::HTML"), although
there is no module given for "rawtext" or "plaintexinfo" in the array in
texi2any.c. 'converter_format_data' in converter.c appears only to have
the output formats with C code available.
So I suppose, if I were trying to write a Plaintext converter (and then
an Info converter), I would start by adding an entry to
'converter_format_data' and then see what other changes were needed to
surrounding code.
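To make that concrete, here is a minimal sketch of what adding such an
entry might look like. It is not the real code: SKETCH_FORMAT_DATA is a
cut-down, hypothetical stand-in for CONVERTER_FORMAT_DATA with a single
function pointer, and every sketch_* function and SF_* enum value is made
up for illustration. Only the format names and Perl module names are
taken from the tables quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for CONVERTER_FORMAT_DATA.  The real
   structure has many more function pointers, with other signatures. */
typedef struct {
  const char *default_format;
  const char *perl_converter_name;
  const char *(*convert) (const char *input); /* 0 if not implemented */
} SKETCH_FORMAT_DATA;

/* Placeholder conversion functions: they ignore their input and only
   return a label, so that the dispatch can be observed. */
static const char *
sketch_html_convert (const char *input)
{
  (void) input;
  return "html";
}

static const char *
sketch_plaintext_convert (const char *input)
{
  (void) input;
  return "plaintext";
}

/* Assumed counterpart of enum converter_format; must stay in the same
   order as the table below. */
enum sketch_format {
  SF_html,
  SF_plaintext,  /* the new entry */
  SF_FORMAT_NR
};

static const SKETCH_FORMAT_DATA sketch_format_data[] = {
  {"html", "Texinfo::Convert::HTML", &sketch_html_convert},
  {"plaintext", "Texinfo::Convert::Plaintext", &sketch_plaintext_convert},
};

/* Compile-time check that the table and the enum are in sync: the array
   size is negative, hence invalid, if the entry counts differ. */
typedef char sketch_table_in_sync[
  (sizeof (sketch_format_data) / sizeof (sketch_format_data[0])
   == SF_FORMAT_NR) ? 1 : -1];

/* Return the table entry for NAME, or NULL if there is no C converter
   registered under that name. */
const SKETCH_FORMAT_DATA *
sketch_find_format (const char *name)
{
  size_t i;
  for (i = 0;
       i < sizeof (sketch_format_data) / sizeof (sketch_format_data[0]);
       i++)
    if (!strcmp (sketch_format_data[i].default_format, name))
      return &sketch_format_data[i];
  return NULL;
}
```

With this layout, a caller would look up "plaintext", check that the
'convert' pointer is non-null, and call through it. A format with no C
implementation (e.g. "latex" in this sketch) simply has no entry, which is
how I read the current 'converter_format_data' table as well.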
I hope this message is productive and gives a sense of my difficulty in
approaching this code base. My initial impression is that the way the
conversion functions are looked up could do with some reorganization,
as the relevant code is spread over three different source files
(texi2any.c, texinfo.c and converter.c).
Like I've said in the past, I'm hopeful that this code will start to get
simpler in the future rather than more complicated as more of it is written
in a single language (C) and the need for cross-language interfacing
infrastructure is reduced.