Yasuhito FUTATSUKI wrote on Wed, Jan 08, 2020 at 00:26:39 +0900:
> On 2020/01/07 9:41, Yasuhito FUTATSUKI wrote:
> > On 2020/01/07 6:52, Yasuhito FUTATSUKI wrote:
> >> By the way, it seems another issue about truncate_subject that current
> >> implementation of truncate_subject may break utf-8 multi-bytes character
> >> sequence, but I didn't reproduce it(because I always use ascii
> >> characters only for file names...).
> > I could reproduce this problem.
⋮
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-4: invalid
> continuation byte
>
> > Probably it needs something like this (but it doesn't support conbining
> > characters, and I didn't any test...):
> > [[[
> > Index: tools/hook-scripts/mailer/mailer.py
> > ===================================================================
> > --- tools/hook-scripts/mailer/mailer.py (revision 1872398)
> > +++ tools/hook-scripts/mailer/mailer.py (working copy)
> > @@ -159,7 +159,13 @@
> >        truncate_subject = 0
> >
> >      if truncate_subject and len(subject) > truncate_subject:
> > -      subject = subject[:(truncate_subject - 3)] + "..."
> > +      # To avoid breaking utf-8 multi-bytes character sequence, we should
> > +      # search the top of the sequence if the byte of the truncate point is
> > +      # secound or later part of multi-bytes character sequence.
> > +      idx = truncate_subject - 3
> > +      while 0x80 <= ord(subject[idx]) <= 0xbf:
> > +        idx -= 1
> > +      subject = subject[:idx] + "..."
> >      return subject
> >
> >    def start(self, group, params):
> > ]]]
>
> After this patch applied, the script above runs without error.
>
> However, this produces Subject line below.
>
> [[[
> Subject: r1 - =?utf-8?b?44CH44CH44CH5LiA?= =?utf-8?b?44CH44CH44CH5LiJ?=
>  =?utf-8?b?44CH44CH44CH5LqM?= =?utf-8?b?44CH44CH44CH5LqU?=
>  =?utf-8?b?44CH44CH44CH5YWt?= =?utf-8?b?44CHLi4u?=^M
> ]]]
>
> and decoded Results is
>
> "Subject: r1 - 〇〇〇一〇〇〇三〇〇〇二〇〇〇五〇〇〇六〇..."
>
> because white space(s) between encoded words are ignored.
>
> I think this is not what we want.

We shouldn't be handling a UTF-8 string bytewise.  If it needs truncating,
then we should truncate it characterwise or wordwise, not bytewise.

We shouldn't be doing the MIME-encoding ourselves.  We should just provide
the subject line to the 'email' module and let it worry about MIME encoding,
line folding, and everything else.  (This is a preëxisting problem.)

Makes sense?

Cheers,

Daniel
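
To illustrate the two points above, here is a rough sketch, not a patch
against mailer.py.  It assumes Python 3, where the subject is already a
decoded str, so slicing counts code points rather than bytes.  The names
truncate_subject_chars and build_message are made up for this example.
Slicing by code point can still separate a base character from a combining
mark, so this does not address the combining-character caveat mentioned in
the quoted patch.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def truncate_subject_chars(subject, limit):
    """Truncate a str to at most `limit` code points, ending in "...".

    Hypothetical helper: works on decoded text, so a multi-byte UTF-8
    sequence can never be cut in half.  A limit of 0 disables truncation.
    """
    if limit and len(subject) > limit:
        return subject[:max(limit - 3, 0)] + "..."
    return subject


def build_message(subject, limit=0):
    """Hand the plain str to the email package; no manual MIME encoding."""
    msg = EmailMessage(policy=policy.SMTP)
    msg["Subject"] = truncate_subject_chars(subject, limit)
    return msg


subject = "r1 - " + "〇〇〇一〇〇〇二〇〇〇三〇〇〇四〇〇〇五〇〇〇六"
truncated = truncate_subject_chars(subject, 20)

# Serialising applies RFC 2047 encoding and line folding for us.
raw = build_message(subject, limit=20).as_bytes()

# Parse the wire form back to inspect what a reader would see.
decoded = message_from_bytes(raw, policy=policy.default)["Subject"]

if __name__ == "__main__":
    print(raw.decode("ascii"))
    print(decoded)
```

The truncation happens before any encoding, and the serialised header is
whatever the email package decides to emit, so the script never has to know
where one encoded word ends and the next begins.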