Pygments-based syntax highlighting preprocessor

2024-07-31 Thread Robin Haberkorn
Dear groffers,

as one of the things coming out of my master's thesis (written completely
in ms with Groff), here is a small preprocessor for syntax highlighting
code blocks based on Pygments:

https://github.com/rhaberkorn/groff-tools#highlight-python
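
Like other groff preprocessors it runs as a filter ahead of troff,
roughly along these lines (the script name and options below are just
placeholders for illustration; see the README for the exact invocation):

$ highlight-code thesis.ms | groff -ms -Tpdf > thesis.pdf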

By the way, I have also had Scintilla/Lexilla syntax highlighting support
for Troff lying around for years; it could improve dozens of text editors.
But I don't know when I will find the time and motivation to polish it
up.
Once you dive into Troff as a language, you realize that it's actually
unparseable and therefore impossible to syntax highlight properly. Even
after restricting the language in sane ways that will work in 99.9% of
the cases, it's still hard to highlight *correctly*. Basically you will
have to reimplement a large part of the Groff parser to do it right, as
you need intimate knowledge of the syntax of certain requests. And then,
there ideally needs to be support for the other Troff variants out
there as well. That's why I never contributed it to the Lexilla project.

Best regards,
Robin



Re: Pygments-based syntax highlighting preprocessor

2024-07-31 Thread Bento Borges Schirmer
Dear Robin,

On Wed, Jul 31, 2024 at 10:31, Robin Haberkorn
 wrote:
>
> Dear groffers,
>
> as one of the things coming out of my master's thesis (written completely
> in ms with Groff), here is a small preprocessor for syntax highlighting
> code blocks based on Pygments:
>
> https://github.com/rhaberkorn/groff-tools#highlight-python

Interesting! I might try your scripts in a few weeks.

I peeked around your repository. I noticed the hyperlinks to CSNOBOL4
are broken. What about SNOBOL4? Is it cool? I understand it acts as a
filter? And is this BNF built into SNOBOL4? Is it worth learning? Or
do you think other languages/libraries/DSLs better solve the problems
it was designed for? How do you even learn it? Is there some
tutorial, manual, or exercises about it? Is it standardized, or are
there dozens of mutually exclusive dialects? Is it compiled or
interpreted? Are there optimizing implementations? Can I call it
inside a C program as a subroutine? Can it call C functions?

Also, please do share the tricks, difficulties you faced, and the
methodologies and habits from writing your thesis in ms groff! I plan
to follow a similar path some time.

Best regards,
Bento



Re: GNU maintainership update

2024-07-31 Thread Dave Kemper
On Tue, Jul 30, 2024 at 6:45 PM G. Branden Robinson
 wrote:
> I hope you will join me in thanking him for his excellent work.

Absolutely.  Bertrand, your contributions, being mostly to the
underpinnings, may never have been as visible as Werner's when he was
maintainer, but are no less important.  Thank you for all you've done
to give groff a more solid foundation.



removing the `de` macro and the old ".pl \n(nlu" trick

2024-07-31 Thread G. Branden Robinson
[resurrecting 26-month-old thread]

Hi Doug,

At 2023-05-03T09:29:15-0500, G. Branden Robinson wrote:
> > The same argument could be made about applying .rm to any standard
> > request, and I would disagree for the same reason as above. (A
> > disappointing experimental discovery in this regard: .de seems to be
> > immune to removal.)
> 
> I have good news for you!  Removing `de` seems to work just fine for
> me in groff 1.22.4 and Git.

The example I offered to support my claim in May 2023 was not as
illustrative as it could have been; this recently came to my attention
and annoyed me.

So here's a better one.

$ nl EXPERIMENTS/remove-de-request.roff
 1  .de A
 2  foo
 3  ..
 4  .rm de
 5  .de B
 6  bar
 7  ..
 8  .A
 9  .B
10  .pl \n(nlu
$ ./build/test-groff -ww -T ascii EXPERIMENTS/remove-de-request.roff
troff:EXPERIMENTS/remove-de-request.roff:5: warning: macro 'de' not defined
troff:EXPERIMENTS/remove-de-request.roff:7: warning: macro '.' not defined
troff:EXPERIMENTS/remove-de-request.roff:9: warning: macro 'B' not defined
troff:EXPERIMENTS/remove-de-request.roff:10: warning: setting computed page 
length 0u to device vertical motion quantum
bar foo

Because defining the `B` macro didn't work, its body was interpreted
immediately, and so "bar" precedes "foo" in the output.  This would not
be the case if `de` were protected from removal.

This example also affords me the chance to point out something about my
recent arithmetic-refactoring change that led me to enforce a positive
page length.  In some degenerate cases like the above, you'll get a
warning if you use "the old `.pl \n(nlu` trick".

In the foregoing, only one output line is ever written, and at the time
we invoke the `pl` request it has not yet been broken (the text is too
short to fill the line), so the value of the `nl` register is still
zero.  (It is only reaching the end of input that causes the "bar foo"
output line to break, move the vertical position downward, trigger
(final) page ejection, and end text processing.)
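
For contrast, here is a minimal sketch of a situation where the trick
behaves as people expect (note the explicit break before `pl`):

foo bar
.br
.pl \n(nlu

Because the `br` request writes the pending output line first, `nl`
holds that line's baseline position by the time `pl` reads it, so the
page length is trimmed to exactly the text that was output.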

In my opinion, this is an acceptable diagnostic under the circumstances.
But I thought people might appreciate a heads-up about it.

Regards,
Branden



Re: Pygments-based syntax highlighting preprocessor

2024-07-31 Thread Robin Haberkorn
Hello Bento!

On Wed, 31 Jul 2024 15:19:53 -0300
Bento Borges Schirmer  wrote:

> I peeked around your repository. I noticed the hyperlinks to CSNOBOL4
> are broken. What about SNOBOL4? Is it cool? I understand it acts as a
> filter?

I updated the links. It seems to live here nowadays:
http://www.regressive.org/snobol4/csnobol4/

SNOBOL4 is a scripting language. A very old one. But it's still
interesting and worthwhile to learn because of its unique built-in
backtracking-based pattern matching language. Its descendant, Icon, is
also worth checking out.
On the downside, SNOBOL4 will torture you with its exclusive reliance on
Gotos for control flow. A rarity nowadays, but this language is
entirely "unstructured"!

> And is this BNF built into SNOBOL4?

No. I built an EBNF parser in SNOBOL4 that spits out GNU Pic code. That
was back when I wrote my bachelor's thesis.
SNOBOL4 is good for prototyping small DSLs and compilers.

> Is it worth learning? Or do you think other
> languages/libraries/DSLs better solve the problems it was designed
> for?

If you have a fetish for old and obscure languages like I do, it's
certainly worthwhile to learn. ;-)

> How do you even learn it? Is there some
> tutorial, manual, or exercises about it?

There is lots of material on SNOBOL4 on the aforementioned website:
http://www.regressive.org/snobol4/
The most important book is probably the "Green Book":
http://www.math.bas.bg/bantchev/place/snobol/gpp-2ed.pdf

> Is it standardized, or are
> there dozens of mutually exclusive dialects?

Pretty much standardized. There are only two dialects left that you
could possibly run on a modern PC: CSNOBOL4 and SPITBOL.

> Is it compiled or interpreted?

CSNOBOL4 is interpreted. But SPITBOL is actually a compiler, AFAIK the
first and oldest compiler for a loosely typed high-level scripting
language. (Mind you, this was before JIT compilation...)

> Can I call it inside a C program as a subroutine?

Not that I know of.

> Can it call C functions?

http://www.regressive.org/snobol4/csnobol4/curr/doc/snobol4ffi.3.html

> 
> Also, please do share the tricks, difficulties you faced, and the
> methodologies and habits from writing your thesis in ms groff! I plan
> to follow a similar path some time.

Mhh, better to ask something more concrete.
Use `pdfmom --roff` even with `-ms` (`-mspdf`) and gropdf (pdfmom uses
gropdf by default).
Other than that, be prepared to write quite a few custom macros.
Be sure to check out Pic and Grap; they are so much fun to play around
with. Both my bachelor's and master's theses contained only code-generated
graphics (not counting screenshots).
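
For what it's worth, the kind of invocation I mean looks roughly like
this (a sketch; thesis.ms is a placeholder, and -e, -t and -p simply run
eqn, tbl and pic, so drop or add them to match what the document uses):

$ pdfmom --roff -mspdf -e -t -p thesis.ms > thesis.pdf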

Best regards,
Robin



an observation and proposal about hyphenation codes

2024-07-31 Thread G. Branden Robinson
Hi folks,

Dave and I have been discussing hyphenation codes extensively over the
past few days; see recent bug-groff list traffic.

There is much I am coming to understand about GNU troff's hyphenation
system, and I've discovered a salient fact that no one has complained
about (as far as I know), but which seems material to the problem and
should motivate either a change to data organization inside the
formatter, or reorganization of some of the stock macro files we ship.

First, what's a hyphenation code?

As of very recently, our Texinfo manual explains them as follows.

---snip---
   For automatic hyphenation to work, the formatter must know which
letters are equivalent; for example, the letter 'E' behaves like 'e';
only the latter typically appears in hyphenation pattern files.  GNU
'troff' expects characters that participate in automatic hyphenation to
be assigned "hyphenation codes" that define these equivalence classes.
At startup, GNU 'troff' assigns hyphenation codes to the letters
'a'-'z', applies the same codes to 'A'-'Z' in one-to-one correspondence,
and assigns a code of zero to all other characters.

   The 'hcode' request extends this principle to letters outside the
Unicode basic Latin alphabet; without it, words containing such letters
won't be hyphenated properly even if the corresponding hyphenation
patterns contain them.
---end snip---

A fact I found noteworthy about how GNU troff actually sets up
hyphenation codes is that the equivalence classes it is designed to
support _are almost never used_ beyond lettercase coalescence.[1]

Instead, in our localization files, every non-basic-Latin character gets
its own bucket, occupied by upper and lowercase forms, if both exist.

Here are all the hyphenation codes we set up in all our localization
files.

cs.tmac (Czech):
.hcode á á  Á á
.hcode č č  Č č
.hcode ď ď  Ď ď
.hcode é é  É é
.hcode ě ě  Ě ě
.hcode í í  Í í
.hcode ň ň  Ň ň
.hcode ó ó  Ó ó
.hcode ř ř  Ř ř
.hcode š š  Š š
.hcode ť ť  Ť ť
.hcode ú ú  Ú ú
.hcode ů ů  Ů ů
.hcode ý ý  Ý ý
.hcode ž ž  Ž ž

de.tmac (German):
.hcode ä ä  â â  à à  á á  ã ã  å å  æ æ
.hcode ç ç
.hcode é é  è è  ë ë  ê ê
.hcode í í  ì ì  î î  ï ï
.hcode ñ ñ
.hcode ó ó  ò ò  ô ô  ö ö  ø ø
.hcode ú ú  ü ü  û û
.
.hcode Ä ä  Â â  À à  Á á  Ã ã  Å å  Æ æ
.hcode Ç ç
.hcode É é  È è  Ë ë  Ê ê
.hcode Í í  Ì ì  Î î  Ï ï
.hcode Ñ ñ
.hcode Ó ó  Ò ò  Ô ô  Ö ö  Ø ø
.hcode Ú ú  Ü ü  Û û
.
.hcode ß ß

en.tmac (English): NONE

es.tmac (Spanish):
.hcode á á  Á á
.hcode é é  É é
.hcode í í  Í í
.hcode ó ó  Ó ó
.hcode ú ú  Ú ú
.hcode ñ ñ  Ñ ñ
.hcode ü ü  Ü ü

fr.tmac (French):
.hcode à à  À à
.hcode â â  Â â
.hcode ç ç  Ç ç
.hcode è è  È è
.hcode é é  É é
.hcode ê ê  Ê ê
.hcode ë ë  Ë ë
.hcode î î  Î î
.hcode ï ï  Ï ï
.hcode ô ô  Ô ô
.hcode ù ù  Ù ù
.hcode û û  Û û
.hcode ü ü  Ü ü
.hcode ÿ ÿ  Ÿ ÿ
.hcode œ œ  Œ œ

it.tmac (Italian): NONE

ru.tmac (Russian):
.hcode а а  А а
.hcode б б  Б б
.hcode в в  В в
.hcode г г  Г г
.hcode д д  Д д
.hcode е е  Е е
.hcode ё ё  Ё ё
.hcode ж ж  Ж ж
.hcode з з  З з
.hcode и и  И и
.hcode й й  Й й
.hcode л л  Л л
.hcode л л  Л л
.hcode м м  М м
.hcode н н  Н н
.hcode о о  О о
.hcode п п  П п
.hcode р р  Р р
.hcode с с  С с
.hcode т т  Т т
.hcode у у  У у
.hcode ф ф  Ф ф
.hcode х х  Х х
.hcode ц ц  Ц ц
.hcode ч ч  Ч ч
.hcode ш ш  Ш ш
.hcode щ щ  Щ щ
.hcode ъ ъ  Ъ ъ
.hcode ы ы  Ы ы
.hcode ь ь  Ь ь
.hcode э э  Э э
.hcode ю ю  Ю ю
.hcode я я  Я я

sv.tmac (Swedish):
.hcode å å  Å å
.hcode ä ä  Ä ä
.hcode ö ö  Ö ö
.hcode é é  É é

You will observe that most languages declare hyphenation codes only for
standard letters in their alphabets.  For example, Czech omits the
Polish letter ł, even though that letter is present in the ISO 8859-2
encoding that the localization file requires.

Except German.  German goes ahead and eats every letter in the Latin-1
supplement even though many of them are unknown in pure German
orthography.  (Any language can employ loan words, of course.)

And that suggests the direction of an implementation decision that I am
questioning.

In GNU troff, hyphenation codes are _global_.  They are not dependent on
the hyphenation _language_ selected with the `hla` request, and which is
a property of a GNU troff _environment_.

This means that if, in one language, "á" should be treated as "a" for
hyphenation purposes, but as its own letter in another language, GNU
troff will not be able to hyphenate both languages correctly in the
same "run", even if distinct environments are used carefully to typeset
each of the languages.

Here's how you'd make these distinct declarations.

.hcode á a \" á participates in hyphenation just like a
.hcode á á \" á behaves as a distinct letter in hyphenation

This does suggest a workaround, if a tedious one: the document author
must write some macro that reconfigures the hyphenation codes as needed
when switching environments.
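
For instance, a minimal sketch of such macros (the names and the choice
of letter are only illustrative); each would be invoked right after
switching to the environment (and hyphenation language) used for the
corresponding text:

.de HCODE-DISTINCT
.  hcode á á \" á hyphenates as a letter of its own
..
.de HCODE-FOLDED
.  hcode á a \" á hyphenates as if it were a plain a
..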

That seems like a problem to me--or it would, except that no one has
complained about our hyphenation being bad for non-English languages.

Re: GNU maintainership update

2024-07-31 Thread Bertrand Garrigues
Hi Branden,

On Tue., Jul. 30 2024 at 06:44:42, "G. Branden Robinson"
 wrote:
> Hi folks,
>
> I've heard back from Bertrand Garrigues, and he advised that I start the
> hand-off process for GNU maintainership of the groff package/project.
>
> I have consequently contacted maintainers@gnu to initiate that process.
> Neither of us have a clear idea how long it will take, but he seemed
> confident that it wouldn't resemble an overnight procedure.

Yes, it should take a bit of time.

>
> Bertrand agreed that a 1/year cadence for groff releases sounded good,
> and said he'd be available to perform maintainer duties (mainly release
> management for 1.24.0 later this year) as long as they were (in my own
> wording) no more burdensome than those he had to handle for the groff
> 1.23.0 release.

This is doable for me, although not immediately (I'm starting my summer
vacation...)

> Bertrand's contributions of (1) migration to GNU Automake for our build
> system, (2) inclusion of gnulib as a C portability library, and (3)
> inauguration of an automated test suite for the code base have in my
> estimation delivered major benefits to the maintainability of groff; all
> were strongly forward-looking choices not necessarily easily appreciated
> directly by users.  The test harness alone has ensured that many
> bugs and mistakes never landed in the Git repository in the first place.
>
> As a developer, I would have found groff much less approachable and far
> more frustrating had it not been for Bertrand's improvements.
>
> Bertrand has been quiet for a while but his positive impact on groff is
> unmistakable to a software engineer's eyes.
>
> I hope you will join me in thanking him for his excellent work.

Thanks a lot for your kind words.  My work on the knuth-plass branch is
also interesting; I hope some day I can finish it (although it's quite
challenging to connect my code to the existing code base).

Regards,

Bertrand



Re: an observation and proposal about hyphenation codes

2024-07-31 Thread Werner LEMBERG

> A fact I found noteworthy about how GNU troff actually sets up
> hyphenation codes is that the equivalence classes it is designed to
> support _are almost never used_ beyond lettercase coalescence.[1]

Yes.  As originally intended in TeX (which groff closely follows), the
`.hcode` mechanism is used essentially for 'downcasing'.

> [1] "Almost never".  So what's an exception?
>
> tmac/ps.tmac:
>
> .fchar \[S ,] \o'S\[ac]'
> .hcode \[S ,]s
> .fchar \[s ,] \o's\[ac]'
> .hcode \[s ,]s

I can no longer remember why I mapped the two Romanian characters
to 's' for hyphenation.  From today's point of view this looks like a
mistake; I should have mentioned my reasoning in the ChangeLog entry,
but I did not...

Commit c65ea0c8f, which introduced this, is from 2003, and at that
time it was common to use both 'ş' and 'ș' equivalently for this
language (while only the latter is correct today).

> [...] You will observe that most languages declare hyphenation codes
> only for standard letters in their alphabets.  For example, Czech
> omits the Polish letter ł, even though that letter is present in the
> ISO 8859-2 encoding that the localization file requires.
>
> Except German.  German goes ahead and eats every letter in the
> Latin-1 supplement even though many of them are unknown in pure
> German orthography.  (Any language can employ loan words, of
> course.)

[Wearing the maintainer hat of the German hyphenation patterns.]

This corresponds to the setup in

  https://repo.or.cz/wortliste.git/blob/HEAD:/daten/german.tr

used to generate the German hyphenation patterns.  Many of those
letters are indeed used.  Our wordlist at

  https://repo.or.cz/wortliste.git/blob/HEAD:/wortliste

contains entries like

  Abbé;Ab-bé
  Œuvre;Œu-vre
  Señor;Se-ñor
  Strømfjord;Strøm=fjord
  Tête;Tête
  Zaïre;Za-ï-re

to name a few.

> In GNU troff, hyphenation codes are _global_.  They are not
> dependent on the hyphenation _language_ selected with the `hla`
> request, and which is a property of a GNU troff _environment_.

The reason for that is the age of groff, which closely follows TeX's
hyphenation algorithm.  Unicode had not been invented then.

> [...] the document author must write some macro that reconfigures
> the hyphenation codes as needed when switching environments.

Yes.  Again, this follows TeX, which can't change hyphenation code
values within a paragraph (i.e., the values last selected are the ones
used for hyphenating it).  IIRC, even today only LuaTeX has removed
this restriction.

> That seems like a problem to me--or it would, except that no one has
> complained about our hyphenation being bad for non-English
> languages.

The intersection of people who use groff, people who write
multilingual documents that need different scripts in a single
paragraph, and people who are typographically aware and take care of
bad or missing hyphenation, is probably very small...

> However, in the meantime, meaning for groff 1.24, I propose to move
> `hcode` definitions to where they make more sense: the character set
> macro files "koi8-r.tmac", "latin1.tmac", "latin2.tmac", and
> "latin9.tmac".  (If/when I do that, I'll need to update the
> "tmac/LOCALIZATION" file accordingly.)

Probably a good idea.  The few cases where this has to be changed
(classical example: Turkish needs 'İ' mapped to 'i' and 'I' mapped to
'ı') can be overridden in a language-specific hyphenation setup.
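
Such an override might look like this (just a sketch, following the
format of the existing `.hcode` calls, and assuming an input encoding
that makes the Turkish letters available):

.hcode i i  İ i
.hcode ı ı  I ı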


Werner


Re: GNU maintainership update

2024-07-31 Thread Bertrand Garrigues
Hi Dave,

On Wed., Jul. 31 2024 at 02:15:18, Dave Kemper wrote:
> On Tue, Jul 30, 2024 at 6:45 PM G. Branden Robinson
>  wrote:
>> I hope you will join me in thanking him for his excellent work.
>
> Absolutely.  Bertrand, your contributions, being mostly to the
> underpinnings, may never have been as visible as Werner's when he was
> maintainer, but are no less important.  Thank you for all you've done
> to give groff a more solid foundation.

Thanks too for your kind words,

Regards,

Bertrand