Yasuhito FUTATSUKI wrote on Wed, Jan 08, 2020 at 00:26:39 +0900:
> On 2020/01/07 9:41, Yasuhito FUTATSUKI wrote:
> > On 2020/01/07 6:52, Yasuhito FUTATSUKI wrote:
> >> By the way, there seems to be another issue with truncate_subject: the
> >> current implementation may break a UTF-8 multi-byte character sequence,
> >> but I haven't reproduced it (because I always use ASCII characters only
> >> for file names...).
> 
> I could reproduce this problem.
⋮
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-4: invalid 
> continuation byte
>  
> > Probably it needs something like this (but it doesn't support combining
> > characters, and I didn't run any tests...):
> > [[[
> > Index: tools/hook-scripts/mailer/mailer.py
> > ===================================================================
> > --- tools/hook-scripts/mailer/mailer.py (revision 1872398)
> > +++ tools/hook-scripts/mailer/mailer.py (working copy)
> > @@ -159,7 +159,13 @@
> >        truncate_subject = 0
> >  
> >      if truncate_subject and len(subject) > truncate_subject:
> > -      subject = subject[:(truncate_subject - 3)] + "..."
> > +      # To avoid breaking a UTF-8 multi-byte character sequence, we should
> > +      # search for the start of the sequence if the byte at the truncation
> > +      # point is the second or a later byte of a multi-byte sequence.
> > +      idx = truncate_subject - 3
> > +      while  0x80 <= ord(subject[idx]) <= 0xbf:
> > +        idx -= 1
> > +      subject = subject[:idx] + "..."
> >      return subject
> >  
> >    def start(self, group, params):
> > ]]]
> 
> With this patch applied, the script above runs without error.
> 
> However, it produces the Subject line below.
> 
> [[[
> Subject: r1 - =?utf-8?b?44CH44CH44CH5LiA?= =?utf-8?b?44CH44CH44CH5LiJ?= 
> =?utf-8?b?44CH44CH44CH5LqM?= =?utf-8?b?44CH44CH44CH5LqU?= 
> =?utf-8?b?44CH44CH44CH5YWt?= =?utf-8?b?44CHLi4u?=^M
> ]]]
> 
> and the decoded result is
> 
> "Subject: r1 - 〇〇〇一〇〇〇三〇〇〇二〇〇〇五〇〇〇六〇..."
> 
> because whitespace between adjacent encoded words is ignored when decoding.
> I think this is not what we want.
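
For illustration, the whitespace behaviour described above can be reproduced
with the standard 'email.header' module.  This is only a sketch, assuming
Python 3; the two encoded words are copied from the Subject line quoted above,
and the variable names are made up for the example.

```python
from email.header import decode_header, make_header

# Two encoded words separated by a single space, taken from the quoted
# Subject line.
raw_subject = "=?utf-8?b?44CH44CH44CH5LiA?= =?utf-8?b?44CH44CH44CH5LiJ?="

# Whitespace that sits between two encoded words is dropped on decoding,
# so the two words are joined without a separator.
decoded_subject = str(make_header(decode_header(raw_subject)))
print(decoded_subject)
```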

We shouldn't be handling a UTF-8 string bytewise.  If it needs truncating, then
we should truncate it characterwise or wordwise, not bytewise.
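
A minimal sketch of characterwise truncation, assuming Python 3 and a subject
that has already been decoded to a Unicode str.  'truncate_subject_chars' is
a hypothetical name, not a function in mailer.py, and like the patch above it
does not treat combining characters specially.

```python
def truncate_subject_chars(subject, limit):
    """Truncate `subject` to at most `limit` characters (code points).

    `subject` is assumed to be a Unicode str, so slicing can never cut
    through the middle of a UTF-8 byte sequence.  A `limit` of 0 means
    "do not truncate", mirroring truncate_subject = 0 in the patch.
    """
    if limit and len(subject) > limit:
        subject = subject[:max(limit - 3, 0)] + "..."
    return subject


# The bytes are decoded first; truncation then works on characters.
raw = "r1 - 〇〇〇一〇〇〇三".encode("utf-8")
short = truncate_subject_chars(raw.decode("utf-8"), 10)
print(short)
```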

We shouldn't be doing the MIME-encoding ourselves.  We should just provide the
subject line to the 'email' module and let it worry about MIME encoding, line
folding, and everything else.  (This is a preëxisting problem.)
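
Something along these lines is what I have in mind.  It is a sketch only,
assuming Python 3's 'email.message.EmailMessage' API; 'build_commit_mail' is
a hypothetical helper, not existing mailer.py code.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_commit_mail(subject, body):
    """Return the serialized message as bytes.

    The Unicode subject is assigned as-is; the email package takes care
    of the RFC 2047 encoding and of header line folding on output.
    """
    msg = EmailMessage(policy=policy.SMTP)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg.as_bytes()


raw_mail = build_commit_mail("r1 - 〇〇〇一〇〇〇三", "Log message goes here.\n")

# Parsing the serialized message gives the original subject back.
parsed = message_from_bytes(raw_mail, policy=policy.default)
print(parsed["Subject"])
```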

Makes sense?

Cheers,

Daniel
