bug#53145: Reply: Re: bug#53145: "cut" can't segment Chinese characters correctly?

zendas via GNU coreutils Bug Reports Sun, 09 Jan 2022 21:15:21 -0800

zendas@Backup-Server:/tmp$ echo "你好啊" | cut -c 1-3 | od -b
0000000 344 275 240 012
0000004
zendas@Backup-Server:/tmp$ echo "你好啊" | cut -c 1 | od -b
0000000 344 012
0000002
zendas@Backup-Server:/tmp$ echo "你好啊" | cut -b 1 | od -b
0000000 344 012
0000002
zendas@Backup-Server:/tmp$ echo "你好啊" | cut -nb 1 | od -b
0000000 344 012
0000002
zendas@Backup-Server:/tmp$ echo "你好啊" | cut -c 1-3
你
zendas@Backup-Server:/tmp$ echo "你好啊" | cut -c 1
�
zendas@Backup-Server:/tmp$
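
The octal dumps above show only the first byte or the first three bytes of the line. A minimal sketch for inspecting the whole string, assuming the text is UTF-8 encoded and that `od` and `wc` from coreutils are available (the variable names are only illustrative):

```shell
# Assumption: the script and terminal use UTF-8, so each of these
# three CJK characters is encoded as 3 bytes.
s='你好啊'

# Count bytes; printf avoids the trailing newline that echo would add.
nbytes=$(printf '%s' "$s" | wc -c | tr -d ' ')
echo "bytes: $nbytes"

# Dump the bytes in hex, with spaces and newlines stripped for readability.
hex=$(printf '%s' "$s" | od -An -tx1 | tr -d ' \n')
echo "hex: $hex"
```

Three characters yielding nine bytes is consistent with `cut -c 1` returning a single, incomplete byte here.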
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐


On Monday, January 10, 2022 at 3:51 AM, zendas <[email protected]> wrote:

> Reference source:
>
> https://blog.csdn.net/m0_38110132/article/details/79883827
>
> my environment is:
>
> zendas@Backup-Server:~$ cat /etc/debian_version
>
> 11.1
>
> zendas@Backup-Server:~$ cut --version
>
> cut (GNU coreutils) 8.32
>
> Copyright (C) 2020 Free Software Foundation, Inc.
>
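> License GPLv3+: GNU GPL version 3 or later https://gnu.org/licenses/gpl.html.
>
> This is free software: you are free to change and redistribute it.
>
> There is NO WARRANTY, to the extent permitted by law.
>
> Written by David M. Ihnat, David MacKenzie, and Jim Meyering.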
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On Monday, January 10, 2022 at 3:40 AM, Bob Proulx [email protected] wrote:
>
> > zendas wrote:
> >
> > > Hello, I need to get Chinese characters from the string. I googled a
> > >
> > > lot of documents, it seems that the -c parameter of cut should be
> > >
> > > able to meet my needs, but I even directly execute the instructions
> > >
> > > on the web page, and the result is different from the
> > >
> > > demonstration. I have searched dozens of pages but the results are
> > >
> > > not the same as the demo, maybe this is a bug?
> >
> > Unfortunately the example was attached as images instead of as plain
> >
> > text. Please in the future copy and paste the example as text rather
> >
> > than as an image. As an image it is impossible to reproduce by trying
> >
> > to copy and paste the image. As an image it is impossible to search
> >
> > for the strings.
> >
> > The images were also lost somehow from the various steps in the
> >
> > mailing list pipelines with this message. First it was classified as
> >
> > spam by the anti-spam robot (SpamAssassin-Bogofilter-CRM114). I
> >
> > caught it in review and re-sent the message. That may have been the
> >
> > problem specifically with images.
> >
> > > For example:
> > >
> > > https://blog.csdn.net/xuzhangze/article/details/80930714
> > >
> > > [20180705173450701.png]
> > >
> > > the result of my attempt:
> > >
> > > [螢幕快照 2022-01-10 02:49:46.png]
> >
> > One of the two images:
> >
> > https://debbugs.gnu.org/cgi/bugreport.cgi?msg=5;bug=53145;att=3;filename=20180705173450701.png
> >
> > Second problem is that the first image shows as being corrupted. I
> >
> > can view the original however. To my eye they are similar enough that
> >
> > the one above is sufficient and I do not need to re-send the corrupted
> >
> > image.
> >
> > As to the problem you have reported it is due to lack of
> >
> > internationalization support for characters. -c is the same as -b at
> >
> > this moment.
> >
> > https://www.gnu.org/software/coreutils/manual/html_node/cut-invocation.html#cut-invocation
> >
> > ‘-c CHARACTER-LIST’
> >
> > ‘--characters=CHARACTER-LIST’
> >
> > Select for printing only the characters in positions listed in
> >
> > CHARACTER-LIST. The same as ‘-b’ for now, but internationalization
> >
> > will change that. Tabs and backspaces are treated like any other
> >
> > character; they take up 1 character. If an output delimiter is
> >
> > specified, (see the description of ‘--output-delimiter’), then
> >
> > output that string between ranges of selected bytes.
> >
> > For multi-byte UTF-8 characters the -c option will operate the same as
> >
> > the -b option as of the current version and is not suitable for
> >
> > dealing with multi-byte characters.
> >
> > $ echo '螢幕快照'
> >
> > 螢幕快照
> >
> > $ echo '螢幕快照' | cut -c 1
> >
> > ?
> >
> > $ echo '螢幕快照' | cut -c 1-3
> >
> > 螢
> >
> > $ echo '螢幕快照' | cut -b 1-3
> >
> > 螢
> >
> > If the characters are known to be 3-byte multi-byte characters then I
> >
> > might suggest using -b to work around the problem, assuming 3-byte
> >
> > characters. Eventually, when -c is coded to handle multi-byte
> >
> > characters, its byte-wise behavior will change. Using -b would avoid
> >
> > that change.
> >
> > Some operating systems have patched that specific version of utilities
> >
> > locally to add multi-byte character handling. But the patches have
> >
> > not been found acceptable for inclusion. That is why there are
> >
> > differences between different operating systems.
> >
> > Bob
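
The -b workaround suggested above can be sketched as a small wrapper that converts character positions into byte ranges. This is only a sketch: `cut3` is a hypothetical helper, not part of coreutils, and it assumes every character on the line is exactly 3 bytes in UTF-8. It gives wrong results as soon as ASCII or 4-byte characters are mixed in.

```shell
# Hypothetical helper: select characters FIRST..LAST from stdin,
# assuming each character occupies exactly 3 bytes (typical for CJK in UTF-8).
cut3() {
    first=$1
    last=$2
    # Character N spans bytes (N-1)*3+1 through N*3.
    cut -b "$(( (first - 1) * 3 + 1 ))-$(( last * 3 ))"
}

# Select the 2nd and 3rd characters (bytes 4 through 9).
echo '螢幕快照' | cut3 2 3
```

Because it relies only on -b, the wrapper would keep behaving the same way if -c later gains multi-byte support.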
