Re: hyphenating non-english characters

2024-08-15 Thread G. Branden Robinson
Hi Gergő,

At 2024-08-11T14:57:38+0200, Gáspár Gergő wrote:
> Kind of a facepalm moment there, I realized that the .hcode request
> doesn't work since my source file is encoded in UTF-8. Making a separate
> file with the requests, encoded in latin2, and sourcing it in the
> document allowed me to fully utilize the hyphenation pattern file.

Nice!
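
For the archive, here is a rough sketch of how such a file of requests
could be generated rather than typed by hand.  It is written in Python,
not roff; the function name and the choice of letters (the nine accented
Hungarian vowels) are my assumptions, not something taken from Gergő's
file.  Each request gives the lowercase letter its own hyphenation code
and maps the uppercase letter to it.

```python
# Accented Hungarian vowels; all of them exist in Latin-2 (ISO 8859-2).
# This list is an assumption made for illustration.
HU_LETTERS = "áéíóöőúüű"


def hcode_requests(letters=HU_LETTERS):
    """Return one .hcode request per letter, as a single string.

    Each line has the form ".hcode <lower> <lower> <UPPER> <lower>".
    """
    lines = []
    for lower in letters:
        upper = lower.upper()
        lines.append(f".hcode {lower} {lower} {upper} {lower}")
    return "\n".join(lines) + "\n"
```

The returned text still has to be written out in Latin-2, for example
with `hcode_requests().encode("iso8859_2")`, before it can be sourced
the way Gergő describes.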

> So! I started work on preparing the localization file. The pattern
> file should also be formatted somehow, right? Or can we use it as-is?

As far as I know, all you need to do is convert the character encoding
and rename the file to fit groff practice ("hyphen.hu").  Our "LICENSES"
file needs an update to reflect its separate copyright, but I can take
care of that.
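
The conversion is a one-liner with iconv(1) or similar.  As a minimal
sketch in Python, assuming the upstream pattern file is UTF-8 and the
target is Latin-2 (ISO 8859-2) as in your workaround, it could look like
this.  The function name and the input file name in the comment are
hypothetical.

```python
from pathlib import Path


def reencode(src, dst, src_enc="utf-8", dst_enc="iso8859_2"):
    """Copy src to dst, converting the character encoding.

    Works on bytes so that line endings are left untouched.  Raises
    UnicodeEncodeError if a character has no representation in dst_enc.
    """
    data = Path(src).read_bytes()
    Path(dst).write_bytes(data.decode(src_enc).encode(dst_enc))


# Example call (the input name is made up):
#   reencode("hyph-hu.tex", "hyphen.hu")
```

A failed conversion raises an exception instead of silently dropping
characters, which is what you want for a pattern file.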

> I still have to figure some stuff out with building groff and the
> exact workflow of contribution, but hopefully Hungarian support can
> soon be part of groff, which fills my heart with joy!

Excellent!

> If I have any questions regarding the aforementioned contribution
> process, is it okay to send you an email?

Sure.  I summarized the process back in January.

https://lists.gnu.org/archive/html/groff/2024-01/msg00025.html

> Thanks for all the help!

My pleasure!

Regards,
Branden


Re: How subscripts/superscripts work with grohtml?

2024-08-15 Thread G. Branden Robinson
Hi Daniel,

At 2024-08-12T20:56:30-0300, Daniel Brigante wrote:
> I've been trying to make grohtml produce <sup> and <sub> html tags
> from an ms input without much success. From my current understanding,
> the grohtml device driver should detect both a vertical position
> change and a font size change from the device independent file (as it
> is described in the start_superscript/start_subscript function in the
> post-html.cpp file), that would cause the driver to emit the <sup> or
> <sub> tag.

It's possible this feature is buggy or incomplete.

> When I try to force this behaviour with \v and \s calls in my ms file,
> I get the expected behaviour if I set the output device to ps, but
> when I set it to html I only get a  tag generated by the device
> driver. I also noticed that the "V" instructions, in the device
> independent file (using -Thtml and -Z options with groff) get removed
> by (probably, I think, the html preprocessor), only remaining the s
> instruction.

I don't have much useful advice here.  I suspect, but cannot prove, that
it's simply not possible to produce high-quality HTML by instrumenting
the formatter with a state machine, which is the approach that was
taken.  It's a bold claim, I know, and Werner Lemberg and Gaius Mulley
are capable developers, but grohtml never really got out of beta state--
it kind of halted in its tracks about 20 years ago--and I am tempted to
blame insurmountable design challenges for that.

What I would do is attack the problem at the macro package level.  Since
macros (or strings, as in ms's case) are often used to render super- and
subscripts, these could inject device control commands to give a hint to
the output driver what was going on.

For example, groff ms defines strings like this.

.\" superscript
.ds par@sup-start \v'-.9m\s'\En[.s]*7u/10u'+.7m'
.als { par@sup-start
.ds par@sup-end \v'-.7m\s0+.9m'
.als } par@sup-end
.\" subscript
.ds par@sub-start \v'+.3m\s'\En[.s]*7u/10u'-.1m'
.als < par@sub-start
.ds par@sub-end \v'+.1m\s0-.3m'
.als > par@sub-end

It could just as well do this:

.ds par@sup-start \X'html: <sup>'\v'-.9m\s'\En[.s]*7u/10u'+.7m'
.ds par@sup-end   \v'-.7m\s0+.9m'\X'html: </sup>'
.ds par@sub-start \X'html: <sub>'\v'+.3m\s'\En[.s]*7u/10u'-.1m'
.ds par@sub-end   \v'+.1m\s0-.3m'\X'html: </sub>'

...which, with appropriate recognizers in post-grohtml, would, I think,
make it hard for the output driver to guess wrong.
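
To make the idea concrete, here is a rough sketch, in Python rather than
the driver's C++, of what such a recognizer has to do.  It assumes the
device control escapes show up in the device-independent output as lines
of the form "x X html: <sup>"; the function name and the set of accepted
tags are made up for illustration, and nothing like this exists in
post-grohtml today.

```python
import re

# Tags the hypothetical recognizer accepts; anything else is ignored.
KNOWN_TAGS = {"sup", "sub"}

# Assumed shape of a device control line carrying an opening or closing
# tag, e.g. "x X html: <sup>" or "x X html: </sup>".
_HTML_TAG = re.compile(r"^x X html:\s*<(/?)(\w+)>\s*$")


def find_script_tags(lines):
    """Yield (line_number, tag, is_closing) for each recognized tag line."""
    for number, line in enumerate(lines, start=1):
        match = _HTML_TAG.match(line)
        if match and match.group(2) in KNOWN_TAGS:
            yield number, match.group(2), match.group(1) == "/"
```

With explicit markers like these, the driver would only have to pair
each opening tag with its closing one instead of inferring intent from
motion and type size changes.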

There _is_ already a function in post-grohtml that attempts to recognize
super- and subscripts from context without the assistance of such tags.
But, if your experience is any indication, it's not reliable.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/devices/grohtml/post-html.cpp?h=1.23.0#n4074

See particularly line 4099.

My speculation, based solely on my imagination and a degree of
familiarity with the code base rather than insight into Werner and
Gaius's plans (beyond what can be gleaned from their whitepaper[1]) is
that it was thought that the tedium of hacking up macro packages to
inject "higher-level" markup into the device-independent output to clue
in an HTML generator would not be necessary if only good enough
heuristics were written into GNU troff (the formatter) and grohtml (the
output driver).

That might be true--but it would seem we never got heuristics that
were good enough.  Rendering tables, equations, and pic(1) diagrams as
PostScript and including them as raster images has proven particularly
painful.[2]

https://savannah.gnu.org/bugs/?60052
https://savannah.gnu.org/bugs/?62890

Furthermore, the level of demand for HTML production from "raw" groff
language input seems staggeringly low.  It doesn't seem that anyone
exists who wants to compose groff to produce HTML without availing
themselves of a macro package, and even if they don't want to use the
macro packages we ship, they can absolutely write their own macros to
do the sort of thing I spitballed above.

I don't mean to criticize, but in my opinion groff's unspectacular HTML
production story has led to some losses.  If it had been less ambitious
and focused on rendering man pages well as an initial goal, many ad hoc
grotty output scrapers would never have been written to produce HTML,
and mandoc(1) might not ever have happened.

Again, just my opinion--I wasn't there at the time.  Werner and/or Gaius
may very well have strong counterarguments that I simply haven't heard.
So if they weigh in, listen to them.

In any case I won't be tackling "groff html-ng" in the near future.  I'm
trying to finish feature changes to the formatter for 1.24 so it can
freeze, and, I hope, be released this calendar year.

Regards,
Branden

[1] https://www.gnu.org/software/groff/grohtml.pdf
[2] GNU eqn can already produce MathML.  It could use some automated
tests.


Re: Unicode support for Troff-specific symbols

2024-08-15 Thread G. Branden Robinson
Hi John,

At 2024-08-14T18:43:44+1000, John Gardner wrote:
> In Unicode 13.0, a new block was added to support graphical symbols
> used on legacy systems,[1] particularly those represented by obscure
> character encodings (like ATASCII).[2]
> 
> I'm wondering if Troff's non-representable symbols (listed below) are
> eligible additions for this block. I'm envisioning their admission to
> use names like:
> 
> \[sqrtex] radical symbol extension[3]

This one had precedent in Adobe Type 1 fonts too.

https://savannah.gnu.org/bugs/?63179

> \(ul  troff under rule

I think we spell it "underrule".

> \(ru  troff baseline rule
> \(bs  troff client logotype[4]

Heh.  Yet another computer science problem solved with a layer of
abstraction.

> Giving these characters a canonical representation in Unicode won't
> benefit documents typeset by older troff(1) implementations, but it
> would simplify documentation of future Groff releases and smooth out
> wrinkles in text copied from gropdf(1) output.

Sure.

> Does this seem realistic to anybody else?

Seems worth a try.  These characters have provenance going back to the
original CSTR #54 in 1976.

> 2. A supplementary block featuring Pac-Man and Space Invaders
>    graphics has also been approved for the upcoming Unicode 16.0
>    (published next month).

Oh, good, I successfully procrastinated updating for 15.1 for so long
that I won't have to do it.  Just skip ahead to 16.0.

> 4. I'm aware that corporate logos can't be added to the UCD (no
>    matter how well-established they appear to be), due to the
>    conflict-of-interest regarding trademarks and brand representation.
>    Hence why a more "diplomatic" name is suggested here, which
>    relegates the exact appearance of \(bs to a font-level concern.

It didn't have a stable meaning even within AT&T troff.  DWB troff, for
instance, rebranded `\(bs` as a name for the backslash glyph--what GNU
troff calls `\(rs`.[1]  Didn't help that Bell changed its logo after
1976 and would change it again when the Labs was sold to Lucent.  But
before that could happen, they killed the DWB product altogether.[2]

Regards,
Branden

[1] https://github.com/n-t-roff/DWB3.3/blob/master/postscript/devpost/R
[2] https://lists.gnu.org/archive/html/groff/2022-12/msg00097.html

