Re: [go-nuts] Efficiently switch io.Reader to another decoder on error

roger peppe Tue, 14 Jan 2025 14:41:00 -0800

Tangentially related to this thread, a while back, I wrote a Go
implementation of the base64 command that is agnostic about which encoding
it reads (and can write all the possible encodings). It can be installed
with:
go install github.com/rogpeppe/misc/cmd/base64@latest


It's arguably a little too lenient in what it accepts, but it works for me
:)

The source is here
https://github.com/rogpeppe/misc/blob/f64633da4fd4/cmd/base64/base64.go
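
For illustration only, here is a minimal sketch of one way to be lenient about the input encoding. It is not taken from the source linked above: decodeLenient is a hypothetical helper that maps the URL-safe alphabet onto the standard one, drops any '=' padding, and then decodes with base64.RawStdEncoding.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// decodeLenient is a hypothetical helper (an assumption for this sketch).
// It accepts the standard or URL-safe alphabet, with or without padding.
func decodeLenient(s string) ([]byte, error) {
	s = strings.TrimSpace(s)
	// Map the URL-safe alphabet onto the standard one.
	s = strings.NewReplacer("-", "+", "_", "/").Replace(s)
	// Drop trailing padding so the unpadded decoder accepts both forms.
	s = strings.TrimRight(s, "=")
	return base64.RawStdEncoding.DecodeString(s)
}

func main() {
	inputs := []string{
		"IkJvbmpvdXIsIGpveWV1eCBsaW9uIg==", // standard, padded
		"IkJvbmpvdXIsIGpveWV1eCBsaW9uIg",   // standard, unpadded
		"_-8=",                             // URL-safe, padded
	}
	for _, in := range inputs {
		b, err := decodeLenient(in)
		fmt.Printf("%q %v\n", b, err)
	}
}
```

Like the tool itself, this is deliberately permissive: it would also accept input that mixes the two alphabets.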

On Tue, 14 Jan 2025 at 14:53, Rory Campbell-Lange <r...@campbell-lange.net>
wrote:

> Thanks for finding that foolish error, Brian.
>
> To wrap the thread up, the implementation below seems to work ok for
> reading both base64.RawStdEncoding and base64.StdEncoding encoded data
> using the base64.RawStdEncoding decoder.
>
> Example usage:
>
>     b64 := NewB64Translator(bytes.NewReader(encodedBytes))
>     b, err := io.ReadAll(base64.NewDecoder(base64.RawStdEncoding, b64))
>
> The implementation:
>
>     type B64Translator struct {
>         br *bufio.Reader
>     }
>
>     func NewB64Translator(r io.Reader) *B64Translator {
>         return &B64Translator{
>             br: bufio.NewReader(r),
>         }
>     }
>
>     // Read reads off the buffered reader expecting base64.StdEncoding
>     // bytes with (potentially) 1-3 '=' padding characters at the end.
>     // RawStdEncoding can be used for both StdEncoded and RawStdEncoded
>     // data if the padding is removed.
>     func (b *B64Translator) Read(p []byte) (n int, err error) {
>         // Read directly into p so that bytes returned alongside an
>         // error are not lost in a scratch buffer.
>         n, err = b.br.Read(p)
>         // check if there is any padding in the last three bytes
>         tail := p[:n]
>         if n > 3 {
>             tail = p[n-3 : n]
>         }
>         c := bytes.Count(tail, []byte("="))
>         return n - c, err
>     }
>
> For larger data the "tail" approach seems to have a tiny speed improvement
> over a naive bytes.Count(b, []byte("=")) over the whole buffer.
>
> Thanks to everyone for their help.
>
> Rory
>
> On 14/01/25, 'Brian Candler' via golang-nuts (golang-nuts@googlegroups.com)
> wrote:
> > I was more or less right. The input string, which you encoded to
> > "Qm9uam91ciwgam95ZXV4IGxpb24K", contains an encoded newline at the end.
> > It's not spurious.
> >
> > Confirmed by the "echo" pipeline I gave above, or in Go itself:
> > https://go.dev/play/p/6kSxiCfCTo4
> >
> > You can also confirm it by multiplying the length of the input by 3/4
> >
> > % echo -n "Qm9uam91ciwgam95ZXV4IGxpb24K" | wc -c
> >       28
> >
> > 28*3/4 = 21
> > B o n j o u r
> > , _ j o y e u
> > x _ l i o n \n
> >
> >
> > On Tuesday, 14 January 2025 at 10:10:22 UTC Brian Candler wrote:
> >
> > > Sorry ignore that, I hadn't checked your playground link.
> > >
> > > On Tuesday, 14 January 2025 at 10:07:53 UTC Brian Candler wrote:
> > >
> > >> > As I wrote earlier, I'm trying to avoid reading the entire email
> > >> > part into memory to discover if I should use base64.StdEncoding or
> > >> > base64.RawStdEncoding.
> > >>
> > >> As I asked before, why would you ever need to use RawStdEncoding? It
> > >> just means the MIME part was invalid, most likely corrupted/truncated.
> > >>
> > >> > One odd thing is that I'm getting extraneous newlines (shown by
> > >> > stars in the output), eg:
> > >>
> > >> You are feeding two different inputs which do not differ by
> > >> truncation alone.
> > >>
> > >> % echo -n "Qm9uam91ciwgam95ZXV4IGxpb24K" | base64 -D | hexdump -c
> > >> 0000000   B   o   n   j   o   u   r   ,       j   o   y   e   u   x
> > >> 0000010   l   i   o   n  \n
> > >> 0000015
> > >>
> > >> % echo -n "IkJvbmpvdXIsIGpveWV1eCBsaW9uIg==" | base64 -D | hexdump -c
> > >> 0000000   "   B   o   n   j   o   u   r   ,       j   o   y   e   u   x
> > >> 0000010       l   i   o   n   "
> > >> 0000016
> > >>
> > >> The second one has encoded double-quotes before and after the content.
> > >>
> > >> On Monday, 13 January 2025 at 22:43:51 UTC Rory Campbell-Lange wrote:
> > >>
> > >>> As I wrote earlier, I'm trying to avoid reading the entire email
> > >>> part into memory to discover if I should use base64.StdEncoding or
> > >>> base64.RawStdEncoding.
> > >>>
> > >>> The following seems to work reasonably well:
> > >>>
> > >>> type B64Translator struct {
> > >>> br *bufio.Reader
> > >>> }
> > >>>
> > >>> func NewB64Translator(r io.Reader) *B64Translator {
> > >>> return &B64Translator{
> > >>> br: bufio.NewReader(r),
> > >>> }
> > >>> }
> > >>>
> > >>> // Read reads off the buffered reader expecting base64.StdEncoding
> > >>> // bytes with (potentially) 1-3 '=' padding characters at the end.
> > >>> // RawStdEncoding can be used for both StdEncoded and RawStdEncoded
> > >>> // data if the padding is removed.
> > >>> func (b *B64Translator) Read(p []byte) (n int, err error) {
> > >>> h := make([]byte, len(p))
> > >>> n, err = b.br.Read(h)
> > >>> if err != nil {
> > >>> return n, err
> > >>> }
> > >>> // to be optimised
> > >>> c := bytes.Count(h, []byte("="))
> > >>> copy(p, h[:n-c])
> > >>> // fmt.Println(string(h), n, string(p), n-c)
> > >>> return n - c, nil
> > >>> }
> > >>>
> > >>> https://go.dev/play/p/H6ii7Vy-8as
> > >>>
> > >>> One odd thing is that I'm getting extraneous newlines (shown by
> > >>> stars in the output), eg:
> > >>>
> > >>> --
> > >>> raw: Bonjour joyeux lion
> > >>> Qm9uam91ciwgam95ZXV4IGxpb24K
> > >>> ok: false
> > >>> decoded: Bonjour, joyeux lion* <-------------------- e.g. here
> > >>> --
> > >>> std: "Bonjour, joyeux lion"
> > >>> IkJvbmpvdXIsIGpveWV1eCBsaW9uIg==
> > >>> ok: true
> > >>> decoded: "Bonjour, joyeux lion"
> > >>> --
> > >>>
> > >>> Any thoughts on that would be gratefully received.
> > >>>
> > >>> Rory
> > >>>
> > >>>
> > >>> On 13/01/25, Rory Campbell-Lange (ro...@campbell-lange.net) wrote:
> > >>> > Thanks very much for the playground link and thoughts.
> > >>> >
> > >>> > The use case is reading base64 email parts, which could be of a
> > >>> > very large size. It is unclear when processing these parts if
> > >>> > they are base64 padded or not.
> > >>> >
> > >>> > I'm trying to avoid reading the entire email part into memory.
> > >>> > Consequently I think your earlier idea of adding padding (or
> > >>> > removing it) in a wrapper could work. Perhaps wrapping the reader
> > >>> > with another using a bufio.Reader to track bytes read and detect
> > >>> > EOF. At EOF the wrapper could add padding if needed.
> > >>> >
> > >>> > Rory
> > >>> >
> > >>> > On 13/01/25, Axel Wagner (axel.wa...@googlemail.com) wrote:
> > >>> > > Just realized: If you twist the idea around, you get something
> > >>> > > easy to implement and more correct.
> > >>> > > Instead of stripping padding if it exists, you can ensure that
> > >>> > > the body *is* padded to a multiple of 4 bytes:
> > >>> > > https://go.dev/play/p/SsPRXV9ZfoS
> > >>> > > You can then feed that to base64.StdEncoding. If the wrapped
> > >>> > > Reader returns padded Base64, this does nothing. If it returns
> > >>> > > unpadded Base64, it adds padding. If it returns incorrect
> > >>> > > Base64, it will create a padded stream that will then get
> > >>> > > rejected by the Base64 decoder.
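
A minimal sketch of this pad-to-a-multiple-of-4 idea follows. It is not the code behind the playground link; padReader and decodeAny are hypothetical names. One assumption beyond the text above: '\r' and '\n' are not counted, because the base64 stream decoder skips them and email bodies are line-wrapped.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"io"
	"strings"
)

// padReader is a hypothetical wrapper. It passes bytes through unchanged and,
// once the wrapped reader reports io.EOF, emits '=' until the number of
// base64 characters seen is a multiple of 4.
type padReader struct {
	r     io.Reader
	total int  // base64 characters seen so far (newlines excluded)
	eof   bool // the wrapped reader is exhausted
}

func (p *padReader) Read(b []byte) (int, error) {
	if len(b) == 0 {
		return 0, nil
	}
	if !p.eof {
		n, err := p.r.Read(b)
		for _, c := range b[:n] {
			if c != '\r' && c != '\n' {
				p.total++
			}
		}
		if err != io.EOF {
			return n, err
		}
		p.eof = true
		if n > 0 {
			return n, nil // deliver the data first; padding follows
		}
	}
	if p.total%4 == 0 {
		return 0, io.EOF
	}
	b[0] = '='
	p.total++
	return 1, nil
}

// decodeAny decodes padded or unpadded standard base64 as a stream.
func decodeAny(s string) ([]byte, error) {
	pr := &padReader{r: strings.NewReader(s)}
	return io.ReadAll(base64.NewDecoder(base64.StdEncoding, pr))
}

func main() {
	for _, in := range []string{
		"IkJvbmpvdXIsIGpveWV1eCBsaW9uIg==",
		"IkJvbmpvdXIsIGpveWV1eCBsaW9uIg",
		"AAA=garbage",
	} {
		b, err := decodeAny(in)
		fmt.Printf("%q %v\n", b, err)
	}
}
```

Because the padded stream goes through base64.StdEncoding, malformed input such as "AAA=garbage" is still rejected by the decoder rather than silently truncated.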
> > >>> > >
> > >>> > > On Mon, 13 Jan 2025 at 10:31, Axel Wagner
> > >>> > > <axel.wa...@googlemail.com> wrote:
> > >>> > >
> > >>> > > > Hi,
> > >>> > > >
> > >>> > > > one way to solve your problem is to wrap the body into an
> > >>> > > > io.Reader that strips off everything after the first `=` it
> > >>> > > > finds. That can then be fed to base64.RawStdEncoding. This
> > >>> > > > approach requires no extra buffering or copying and is easy
> > >>> > > > to implement: https://go.dev/play/p/CwcVz7oietI
> > >>> > > >
> > >>> > > > The downside is that this will not verify that the body is
> > >>> > > > *either* correctly padded Base64 *or* unpadded Base64. So, it
> > >>> > > > will not report an error if fed something like "AAA=garbage".
> > >>> > > > That can be remedied by buffering up to four bytes and, when
> > >>> > > > encountering an EOF, checking that there are at most three
> > >>> > > > trailing `=` and that the total length of the stream is
> > >>> > > > divisible by four. It's more finicky to implement, but it
> > >>> > > > should also be possible without any extra copies and only
> > >>> > > > requires a very small extra buffer.
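
The simple variant (stop at the first `=`) might look like the sketch below. It is not the playground code; stripReader and decodeStripped are hypothetical names. It also shows the stated downside: "AAA=garbage" decodes without an error.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"io"
	"strings"
)

// stripReader is a hypothetical wrapper that ends the stream at the first
// '=' it sees, discarding that byte and everything after it.
type stripReader struct {
	r    io.Reader
	done bool // a '=' has been seen; report EOF from now on
}

func (s *stripReader) Read(b []byte) (int, error) {
	if s.done {
		return 0, io.EOF
	}
	n, err := s.r.Read(b)
	if i := bytes.IndexByte(b[:n], '='); i >= 0 {
		s.done = true
		return i, io.EOF
	}
	return n, err
}

// decodeStripped decodes padded or unpadded input with RawStdEncoding.
func decodeStripped(s string) ([]byte, error) {
	sr := &stripReader{r: strings.NewReader(s)}
	return io.ReadAll(base64.NewDecoder(base64.RawStdEncoding, sr))
}

func main() {
	for _, in := range []string{
		"IkJvbmpvdXIsIGpveWV1eCBsaW9uIg==",
		"IkJvbmpvdXIsIGpveWV1eCBsaW9uIg",
		"AAA=garbage", // accepted: the trailing garbage is never inspected
	} {
		b, err := decodeStripped(in)
		fmt.Printf("%q %v\n", b, err)
	}
}
```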
> > >>> > > >
> > >>> > > > On Sun, 12 Jan 2025 at 22:29, Rory Campbell-Lange <
> > >>> ro...@campbell-lange.net>
> > >>> > > > wrote:
> > >>> > > >
> > >>> > > >> Thanks very much for the links, pointers and possible
> > >>> > > >> solution.
> > >>> > > >>
> > >>> > > >> Trying to read base64 standard (padded) encoded data with
> > >>> > > >> base64.RawStdEncoding can produce an error such as
> > >>> > > >>
> > >>> > > >> illegal base64 data at input byte <n>
> > >>> > > >>
> > >>> > > >> Reading base64 raw (unpadded) encoded data with
> > >>> > > >> base64.StdEncoding produces the unexpected EOF error
> > >>> > > >> (io.ErrUnexpectedEOF).
> > >>> > > >>
> > >>> > > >> I'll go with trying to read the standard encoded data up to
> > >>> > > >> maybe 1MB and then switch to base64.RawStdEncoding if I hit
> > >>> > > >> the "illegal base64 data" problem, maybe with reference to
> > >>> > > >> bufio.Reader which has most of the methods suggested below.
> > >>> > > >>
> > >>> > > >> Yes, the use of a "Rewind" method would be crucial. I guess
> > >>> > > >> this would need to:
> > >>> > > >> 1. error if more than one buffer of data has been read
> > >>> > > >> 2. else re-read from byte 0
> > >>> > > >> Thanks again very much for these suggestions.
> > >>> > > >>
> > >>> > > >> Rory
> > >>> > > >>
> > >>> > > >> On 12/01/25, robert engels (ren...@ix.netcom.com) wrote:
> > >>> > > >> > Also, see this
> > >>> > > >> > https://stackoverflow.com/questions/69753478/use-base64-stdencoding-or-base64-rawstdencoding-to-decode-base64-string-in-go
> > >>> > > >> > as I expected the error should be reported earlier than the
> > >>> > > >> > end of stream if the chosen format is wrong.
> > >>> > > >> >
> > >>> > > >> > > On Jan 12, 2025, at 2:57 PM, robert engels
> > >>> > > >> > > <ren...@ix.netcom.com> wrote:
> > >>> > > >> > >
> > >>> > > >> > > Also, this is what Gemini provided which looks basically
> > >>> > > >> > > correct - but I think encapsulating it with a Rewind()
> > >>> > > >> > > method would be easier to understand.
> > >>> > > >> > >
> > >>> > > >> > > While Go doesn't have a built-in PushbackReader like some
> > >>> > > >> > > other languages (e.g., Java), you can implement similar
> > >>> > > >> > > functionality using a custom struct and a buffer.
> > >>> > > >> > >
> > >>> > > >> > > Here's an example implementation:
> > >>> > > >> > >
> > >>> > > >> > > package main
> > >>> > > >> > >
> > >>> > > >> > > import (
> > >>> > > >> > > "bytes"
> > >>> > > >> > > "io"
> > >>> > > >> > > )
> > >>> > > >> > >
> > >>> > > >> > > type PushbackReader struct {
> > >>> > > >> > > reader io.Reader
> > >>> > > >> > > buffer *bytes.Buffer
> > >>> > > >> > > }
> > >>> > > >> > >
> > >>> > > >> > > func NewPushbackReader(r io.Reader) *PushbackReader {
> > >>> > > >> > > return &PushbackReader{
> > >>> > > >> > > reader: r,
> > >>> > > >> > > buffer: new(bytes.Buffer),
> > >>> > > >> > > }
> > >>> > > >> > > }
> > >>> > > >> > >
> > >>> > > >> > > func (p *PushbackReader) Read(b []byte) (n int, err error) {
> > >>> > > >> > > if p.buffer.Len() > 0 {
> > >>> > > >> > > return p.buffer.Read(b)
> > >>> > > >> > > }
> > >>> > > >> > > return p.reader.Read(b)
> > >>> > > >> > > }
> > >>> > > >> > >
> > >>> > > >> > > // UnreadByte pushes c back so that the next Read returns it first.
> > >>> > > >> > > func (p *PushbackReader) UnreadByte(c byte) error {
> > >>> > > >> > > return p.Unread([]byte{c})
> > >>> > > >> > > }
> > >>> > > >> > >
> > >>> > > >> > > // Unread pushes buf back in front of any bytes still pending.
> > >>> > > >> > > func (p *PushbackReader) Unread(buf []byte) error {
> > >>> > > >> > > pending := append(append([]byte(nil), buf...), p.buffer.Bytes()...)
> > >>> > > >> > > p.buffer = bytes.NewBuffer(pending)
> > >>> > > >> > > return nil
> > >>> > > >> > > }
> > >>> > > >> > >
> > >>> > > >> > > func main() {
> > >>> > > >> > > // Example usage
> > >>> > > >> > > r := NewPushbackReader(bytes.NewBufferString("Hello, World!"))
> > >>> > > >> > > buf := make([]byte, 5)
> > >>> > > >> > > r.Read(buf)          // reads "Hello"
> > >>> > > >> > > r.UnreadByte(buf[4]) // push the 'o' back
> > >>> > > >> > > r.Read(buf)          // reads 1 byte: the pushed-back 'o'
> > >>> > > >> > > }
> > >>> > > >> > >
> > >>> > > >> > > Explanation:
> > >>> > > >> > > PushbackReader struct: This struct holds the underlying
> > >>> > > >> > > io.Reader and a buffer to store the pushed-back bytes.
> > >>> > > >> > > NewPushbackReader: This function creates a new
> > >>> > > >> > > PushbackReader from an existing io.Reader.
> > >>> > > >> > > Read method: This method reads bytes from either the buffer
> > >>> > > >> > > (if it contains data) or the underlying reader.
> > >>> > > >> > > UnreadByte method: This method pushes back a single byte
> > >>> > > >> > > into the buffer.
> > >>> > > >> > > Unread method: This method pushes back a slice of bytes
> > >>> > > >> > > into the buffer.
> > >>> > > >> > > Important Considerations:
> > >>> > > >> > > The buffer size is not managed automatically. You may need
> > >>> > > >> > > to adjust the buffer size based on your use case.
> > >>> > > >> > > This implementation does not handle pushing back beyond the
> > >>> > > >> > > initially read data. If you need to support arbitrary
> > >>> > > >> > > pushback, you'll need a more complex solution.
> > >>> > > >> > >
> > >>> > > >> > > Generative AI is experimental.
> > >>> > > >> > >
> > >>> > > >> > >> On Jan 12, 2025, at 2:53 PM, Robert Engels
> > >>> > > >> > >> <ren...@ix.netcom.com> wrote:
> > >>> > > >> > >>
> > >>> > > >> > >> You can see the two pass reader here
> > >>> > > >> > >> https://stackoverflow.com/questions/20666594/how-can-i-push-bytes-into-a-reader-in-go
> > >>> > > >> > >>
> > >>> > > >> > >> But yeah, the basic premise is that you buffer the data so
> > >>> > > >> > >> you can rewind if needed.
> > >>> > > >> > >>
> > >>> > > >> > >> Are you certain it is reading to the end to return EOF? It
> > >>> > > >> > >> may be returning EOF once the parsing fails.
> > >>> > > >> > >>
> > >>> > > >> > >> Otherwise I would expect this is being decoded wrong - eg
> > >>> > > >> > >> the mime type or encoding type should tell you the correct
> > >>> > > >> > >> format before you start decoding.
> > >>> > > >> > >>
> > >>> > > >> > >>> On Jan 12, 2025, at 2:46 PM, Rory Campbell-Lange
> > >>> > > >> > >>> <ro...@campbell-lange.net> wrote:
> > >>> > > >> > >>>
> > >>> > > >> > >>> Thanks for the suggestion of a ReadSeeker to wrap an
> > >>> > > >> > >>> io.Reader.
> > >>> > > >> > >>>
> > >>> > > >> > >>> My google fu must be deserting me. I can find
> > >>> > > >> > >>> PushbackReader implementations in Java, but the only
> > >>> > > >> > >>> similar thing for Go I could find was
> > >>> > > >> > >>> https://gitlab.com/osaki-lab/iowrapper. If you have a
> > >>> > > >> > >>> specific recommendation for a ReadSeeker wrapper to an
> > >>> > > >> > >>> io.Reader that would be great to know.
> > >>> > > >> > >>>
> > >>> > > >> > >>> Since the base64 decoding error I'm looking for is an EOF,
> > >>> > > >> > >>> I guess the wrapper approach will not work when the EOF
> > >>> > > >> > >>> byte position is greater than the io.ReadSeeker buffer
> > >>> > > >> > >>> size.
> > >>> > > >> > >>>
> > >>> > > >> > >>> Rory
> > >>> > > >> > >>>
> > >>> > > >> > >>> On 12/01/25, robert engels (ren...@ix.netcom.com) wrote:
> > >>> > > >> > >>>> create a ReadSeeker that wraps the Reader providing the
> > >>> > > >> > >>>> buffering (mark & reset) - normally the buffer only needs
> > >>> > > >> > >>>> to be large enough to detect the format contained in the
> > >>> > > >> > >>>> Reader.
> > >>> > > >> > >>>>
> > >>> > > >> > >>>> You can search Google for PushbackReader in Go and you’ll
> > >>> > > >> > >>>> get a basic implementation.
> > >>> > > >> > >>>>
> > >>> > > >> > >>>>> On Jan 12, 2025, at 12:52 PM, Rory Campbell-Lange
> > >>> > > >> > >>>>> <ro...@campbell-lange.net> wrote:
> > >>> > > >> > >>> ...
> > >>> > > >> > >>>>> I'm attempting to rationalise the process [of avoiding
> > >>> > > >> > >>>>> reading email parts into byte slices] by simply wrapping
> > >>> > > >> > >>>>> the provided io.Reader with the necessary decoders to
> > >>> > > >> > >>>>> reduce memory usage and unnecessary processing.
> > >>> > > >> > >>>>>
> > >>> > > >> > >>>>> The wrapping strategy seems to work ok. However there is
> > >>> > > >> > >>>>> a particular issue in detecting base64.StdEncoding versus
> > >>> > > >> > >>>>> base64.RawStdEncoding, which requires draining the
> > >>> > > >> > >>>>> io.Reader using base64.StdEncoding and (based on the
> > >>> > > >> > >>>>> current implementation) switching to
> > >>> > > >> > >>>>> base64.RawStdEncoding if an io.ErrUnexpectedEOF is found.
> > >>> > > >> > >>>>>
> > >>> > > >> > >>
> > >>> > > >> > >>
> > >>> > > >> > >> --
> > >>> > > >> > >> You received this message because you are subscribed to
> the
> > >>> Google
> > >>> > > >> Groups "golang-nuts" group.
> > >>> > > >> > >> To unsubscribe from this group and stop receiving emails
> > >>> from it,
> > >>> > > >> send an email to golang-nuts...@googlegroups.com <mailto:
> > >>> > > >> golang-nuts...@googlegroups.com>.
> > >>> > > >> > >> To view this discussion visit
> > >>> > > >>
> > >>>
> https://groups.google.com/d/msgid/golang-nuts/DD0C1480-D237-447A-B978-78FC8951FE05%40ix.netcom.com
> > >>> > > >> <
> > >>> > > >>
> > >>>
> https://groups.google.com/d/msgid/golang-nuts/DD0C1480-D237-447A-B978-78FC8951FE05%40ix.netcom.com?utm_medium=email&utm_source=footer
> > >>> > > >> >.
> > >>> > > >> > >
> > >>> > > >> >
> > >>> > > >>
> > >>> > > >>
> > >>> > > >
> > >>> >
>
> > >>>
> > >>>
> > >>
> >
>
>

