On 2024-10-24 11:22, Thomas Wolff via Cygwin wrote:



Am 24.10.2024 um 15:56 schrieb Brian Inglis via Cygwin:
On 2024-10-24 02:37, Thomas Wolff via Cygwin wrote:

Am 24.10.2024 um 07:01 schrieb Mark Geisert via Cygwin:
Replying to myself, I continue...

On 10/22/2024 10:33 PM, Mark Geisert via Cygwin wrote:
On 10/22/2024 8:00 PM, Backwoods BC via Cygwin wrote:
It appears that 'rev' is choking on any character \x80 or higher, but
is OK with those \x1f or smaller. It doesn't give an error or ignore
it, it just stops.

I don't have access to a Linux box so I can't see if this happens
there and nothing in the documentation suggests that this is the
correct functionality.

Test case:
printf 'no non-ASCII characters\nhex 01 >\x01< here\nhex 80 >\x80< here\nLine 4\n'|rev|rev

This is for "rev from util-linux 2.33.1"

I don't have the current version of 'rev' on my system due to not
having updated in a while. I accidentally screwed up my installation
and have been reluctant to wipe it and start over.

So, is this the expected behaviour for the current version of 'rev'
under Cygwin and/or Linux?

The current Cygwin util-linux 2.39.3-2 rev behaves in the same,
broken way.  It looks like line-ending char(s) are not being handled
correctly.   Don't know yet if it's rev itself or fgetws() being used
by rev that's busted.  I'll investigate further.  Thanks for the
report!

This is a locale issue.  In the default Cygwin locale, rev mishandles
the \x80 byte and instead of stopping with an error message it enters
an infinite loop.  I'll probably report this upstream instead of
working out a local fix.

There is a work-around: change to the "C" locale just to run rev.
    LC_ALL=C rev zzz
where zzz is a file containing your four lines.  You can also run your
original testcase with "rev" replaced by "LC_ALL=C rev" in both places.
Sorry, this is not a good workaround as it corrupts all (proper)
non-ASCII characters.
You could do e.g.
grep . | rev
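
The corruption under the "C" locale mentioned above can be illustrated with a small Python sketch. This is not how rev is implemented; the sample string and the helper name is_valid_utf8 are made up for illustration. It reverses a UTF-8 encoded line byte by byte, as a single-byte locale effectively would, and compares that with reversing it character by character.

```python
# Illustration only: byte-wise vs. character-wise reversal of one line.
line = "häx"                  # hypothetical sample with one non-ASCII character
raw = line.encode("utf-8")    # 'ä' occupies two bytes: 0xC3 0xA4

# Roughly what happens when every byte is treated as one character.
bytewise = raw[::-1]

# What one would expect when the locale matches the file encoding.
charwise = line[::-1].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(bytewise, is_valid_utf8(bytewise))
print(charwise, is_valid_utf8(charwise))
```

The byte-wise result swaps the two bytes of the multibyte character, so the output is no longer valid UTF-8, while the character-wise result is.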

Not quite, as that just matches non-empty lines, you would have to do
something more like `grep -o . ...`, but not sure that would do what
you want either.

Ah, right, so:
grep -E -e "(^$|.)" | rev
or maybe there is some more suitable tool.

The correct approach should be to match the execution locale to the
file locale, for example, `LC_ALL=...UTF-8 rev ...` which should
produce the expected results.
That's not the point. You can never be sure that there is no stray,
wrongly encoded byte in your files, and rev should definitely not
loop endlessly in that case.
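
For anyone who wants to check a file for such stray bytes before running rev, here is a minimal Python sketch. It assumes the file is meant to be UTF-8; the function name find_invalid_utf8 is hypothetical, and the sample data is the test case from the original report.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) for each byte that breaks UTF-8 decoding."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            # err.start is relative to the slice, so add pos back on.
            start = pos + err.start
            bad.append((start, data[start]))
            pos = start + 1  # resume scanning after the offending byte
    return bad


# The four-line test case from the report, as raw bytes.
sample = (b"no non-ASCII characters\nhex 01 >\x01< here\n"
          b"hex 80 >\x80< here\nLine 4\n")

for offset, value in find_invalid_utf8(sample):
    print(f"invalid byte 0x{value:02x} at offset {offset}")
```

Note that \x01 is not reported, since control characters are valid UTF-8; only the \x80 byte is flagged. The sketch re-decodes the remainder after each hit, which is simple but not efficient for large files with many errors.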

> It doesn't give an error or ignore it, it just stops.

I take that to mean that it exits without processing anything past the invalid byte and does not get stuck looping; the function seems to be written that way. My preference would be, on error, to copy the input *byte* to the output and proceed.
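
The "copy the byte and proceed" behaviour can be sketched in Python as follows. This is only an illustration under stated assumptions, not the util-linux code: the function name rev_lines is hypothetical, the input is assumed to be UTF-8, and Python's "surrogateescape" error handler stands in for the per-byte fallback.

```python
def rev_lines(data: bytes, encoding: str = "utf-8") -> bytes:
    """Reverse each line character by character, passing invalid bytes through."""
    out = []
    for line in data.split(b"\n"):
        # Undecodable bytes become lone surrogates instead of raising an error.
        text = line.decode(encoding, errors="surrogateescape")
        # Reverse by character, then map the surrogates back to the raw bytes.
        out.append(text[::-1].encode(encoding, errors="surrogateescape"))
    # Splitting and rejoining on b"\n" keeps any trailing newline in place.
    return b"\n".join(out)


# The four-line test case from the report, as raw bytes.
sample = (b"no non-ASCII characters\nhex 01 >\x01< here\n"
          b"hex 80 >\x80< here\nLine 4\n")

print(rev_lines(sample))
# Applying it twice mirrors the "|rev|rev" pipeline in the report.
print(rev_lines(rev_lines(sample)) == sample)
```

With this approach valid multibyte characters are reversed as whole characters, the stray \x80 byte is carried through unchanged, and no line after it is lost.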

The latest tweak in the repo, fgetwc_or_err(), makes all utilities just die with an error message as soon as there is *any* encoding error.

As fgetwc(3) and putwc(3) depend on LC_CTYPE, it appears to be up to the user to set that locale category appropriately, and then to pipe the output through iconv or an equivalent to convert it to the terminal's locale charset.

--
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                -- Antoine de Saint-Exupéry
