python 3.1 - io.BufferedReader.peek() incomplete or change of behaviour.

Frederick Reeve Fri, 05 Jun 2009 13:14:29 -0700

Hello,

I have sent this message to the authors as well as to this list.  If
this is the wrong list please let me know where I should be sending
it... dev perhaps?


First the simple questions:

The two versions of io.BufferedReader.peek() behave differently.
Which one is going to stay long term?

Is the C version of the reader incomplete, or is it deliberately
changing the behavior?

Lastly, will you consider my input on the API (see below)?


Now a full explanation.  I am writing a multipart parser for HTML
form submissions in Python 3.1.  The email parser used by cgi does
not currently work, and cgi is broken at the moment, especially when
used with wsgiref.simple_server as it is currently implemented.
This is what has pushed me to write my own implementation of _part_ of
cgi.py.  My thinking is that if it works well in the end I might
submit a patch, as it needs one anyway.

My questions revolve around io.BufferedReader.peek().  There are two
implementations, one written in Python and one in C.  At least in
Python 3.1, the C one is used by default.  The version written in
Python behaves as follows:

want = min(n, self.buffer_size)
have = len(self._read_buf) - self._read_pos
if have < want or have <= 0:
    to_read = self.buffer_size - have
    current = self.raw.read(to_read)
    if current:
        self._read_buf = self._read_buf[self._read_pos:] + current
        self._read_pos = 0
return self._read_buf[self._read_pos:]

This basically means it returns at least the requested number of
bytes, up to the buffer size, when the stream can supply them, and it
performs a read on the underlying stream to get extra data if the
buffer holds less than requested (up to a full buffer).  Note that the
final return hands back everything left in the buffer, so the result
can be longer than the number of bytes requested.  I have verified that
this is how it behaves.

The C version works a little differently.  It is implemented as
follows:

    Py_ssize_t have, r;

    have = Py_SAFE_DOWNCAST(READAHEAD(self), Py_off_t, Py_ssize_t);
    /* Constraints:
       1. we don't want to advance the file position.
       2. we don't want to lose block alignment, so we can't shift the
          buffer to make some place.
       Therefore, we either return `have` bytes (if > 0), or a full
       buffer. */
    if (have > 0) {
        return PyBytes_FromStringAndSize(self->buffer + self->pos,
                                         have);
    }

    /* Fill the buffer from the raw stream, and copy it to the result. */
    _BufferedReader_reset_buf(self);
    r = _BufferedReader_fill_buffer(self);
    if (r == -1)
        return NULL;
    if (r == -2)
        r = 0;
    self->pos = 0;
    return PyBytes_FromStringAndSize(self->buffer, r);

Which basically means it returns whatever is in the buffer, period.
It will not fill the buffer any further from the raw stream to let us
peek up to one buffer size like the Python version does, and it always
returns what is in the buffer regardless of how much you request.  The
only exception is when the buffer is empty.  In that case it fills the
buffer and then returns it.  So this function can be said to guarantee
at least 1 byte unless a raw read is not possible.  The author says
they cannot shift the buffer.  That is true if file alignment is to be
retained.  Double buffers may be a solution if the Python version's
behavior is wanted.  I have not yet fully checked how buffering is
implemented.
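
The difference can be observed directly.  The sketch below is an
illustration, not part of either implementation.  It assumes CPython,
where the private _pyio module carries the pure-Python reader, and it
uses made-up data and a 16-byte buffer.  It reads 14 bytes so that 2
remain buffered, then asks each reader to peek 4.

```python
import io
import _pyio  # pure-Python io implementation shipped with CPython (private module)

DATA = bytes(range(64))   # made-up sample data
BUFFER_SIZE = 16          # small buffer so the effect is easy to see


def peek_after_partial_read(reader_cls, consumed=14, requested=4):
    """Read `consumed` bytes, then return whatever peek(requested) gives back."""
    reader = reader_cls(io.BytesIO(DATA), buffer_size=BUFFER_SIZE)
    reader.read(consumed)
    return reader.peek(requested)


c_result = peek_after_partial_read(io.BufferedReader)
py_result = peek_after_partial_read(_pyio.BufferedReader)

print("C peek(4) returned", len(c_result), "bytes")
print("Python peek(4) returned", len(py_result), "bytes")
```

If the interpreter matches the listings above, the C reader should hand
back only what is still buffered, while the Python reader should refill
first.  The exact lengths printed depend on the interpreter version.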

In writing the parser I found that being able to peek a number of bytes
was helpful, but I need to be able to peek more than 1 byte
consistently (70 in my case) to meet the RFC I am implementing.  This
meant the C version of peek would not work.  Fine, I wrote a wrapper
class that adds a buffer...  That seemed dumb, as I was already using a
buffered reader, so I detach the stream and use my wrapper.  But now
the logic and buffer handling are in the slower Python, where I would
rather not have them.  This almost defeats the purpose of the C
buffered reader implementation.  The C version is still useful for
reads of arbitrary size, but that is really all the buffered reader is
doing, and I can do block-oriented reads and buffering in my wrapper
since I have to buffer anyway.  Unless I only need a guaranteed peek of
1 byte (barring EOF, etc.), the C version doesn't seem very useful
other than for random read cases.  This is not a full explanation, of
course, but it may give you the picture as I see it.
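
For readers who want a picture of the kind of wrapper described above,
here is a minimal sketch.  It is not the wrapper from this post: the
class name PeekableReader and its methods are hypothetical, and it
assumes a blocking stream whose read() returns b'' at EOF.

```python
import io


class PeekableReader:
    """Hypothetical wrapper: peek(n) returns n bytes unless EOF comes first."""

    def __init__(self, raw):
        self._raw = raw
        self._buf = b""

    def _fill(self, n):
        # Keep reading until n bytes are buffered or the stream runs dry.
        while len(self._buf) < n:
            chunk = self._raw.read(n - len(self._buf))
            if not chunk:  # b"" at EOF (None from a non-blocking stream also stops)
                break
            self._buf += chunk

    def peek(self, n):
        # Look ahead without consuming anything.
        self._fill(n)
        return self._buf[:n]

    def read(self, n):
        # Consume up to n bytes, serving buffered data first.
        self._fill(n)
        data, self._buf = self._buf[:n], self._buf[n:]
        return data


stream = io.BytesIO(b"--boundary\r\ncontent")  # stand-in for a detached raw stream
reader = PeekableReader(stream)
head = reader.peek(70)  # the sample is shorter than 70 bytes, so this is all of it
```

The loop in _fill is what provides the guarantee, and it is also why all
of the buffer handling ends up in Python rather than in C.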

In light of the above and my questions I would like to give my input,
hopefully constructively.  This is what I think the API, including the
peek implementation, _should_ be.  I may have missed things, of course,
but nonetheless here it is:

---------------------

read(n):
Current behavior.

read1(n):
If n is greater than 0, return up to n bytes from the current buffer
contents, advancing the stream position.  If n is less than 0 or None,
return the buffer contents and advance the position.  If the buffer is
empty and EOF has not been reached, return None.  If the buffer is
empty and EOF has been reached, return b''.

peek(n):
If n is less than 0 or None, return the buffer contents without
advancing the stream position.  Otherwise return n bytes, up to the
_buffer size_ (not contents), without advancing the stream position.
If the buffer holds fewer than n bytes, buffer an additional block from
the "raw" stream beforehand.  This may require a double buffer or
similar.  If EOF is encountered during the raw read, return as much as
we can, up to n.

leftover():
Return the number (an int) of bytes in the buffer.  This is not
strictly necessary with the new implementations of peek and read1 being
like above but I thought still useful.  I could be wrong and am not
tied to this idea personally.

---------------------
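
The proposal above can be modelled in a few lines of Python.  This is
only a toy sketch under stated assumptions: the class name
ProposedReader is hypothetical, read(n) is omitted because it keeps its
current behavior, the case n == 0 is not specified above, and the
internal buffer is allowed to grow past one block in place of a real
double buffer.

```python
import io


class ProposedReader:
    """Toy model of the proposed read1/peek/leftover semantics (not real io code)."""

    def __init__(self, raw, buffer_size=16):
        self._raw = raw
        self._size = buffer_size
        self._buf = b""
        self._eof = False

    def _fetch_block(self):
        # A single raw read of at most one block; remember EOF if it is hit.
        chunk = self._raw.read(self._size)
        if chunk == b"":
            self._eof = True
        elif chunk:
            self._buf += chunk

    def leftover(self):
        # Number of bytes currently buffered.
        return len(self._buf)

    def peek(self, n=None):
        if n is None or n < 0:
            return self._buf
        want = min(n, self._size)          # capped by buffer size, not contents
        if len(self._buf) < want and not self._eof:
            self._fetch_block()            # buffer one additional block first
        return self._buf[:want]

    def read1(self, n=None):
        if not self._buf:
            return b"" if self._eof else None
        if n is None or n < 0:
            n = len(self._buf)
        data, self._buf = self._buf[:n], self._buf[n:]
        return data


reader = ProposedReader(io.BytesIO(bytes(range(20))), buffer_size=16)
first = reader.read1(4)    # buffer is empty and EOF is not yet known
ahead = reader.peek(4)     # triggers one block read from the raw stream
```

In this model read1 never touches the raw stream, so the buffer only
fills through peek.  That is one reading of the description above, not
the only possible one.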

I feel that what I, and possibly others, would want from a _buffered
reader_ is best-effort behaviour: the functions give you what you
want except when it is very bad or impossible to do so.  Very bad means
losing block alignment, and impossible in this case means reading past
EOF (or the stream being out of data).

I'm sorry, I'm probably not very good at explaining, but I do try.  I
would love to hear your input, and I would be willing to work on patches
for the C version of the buffered reader to implement this _if_ these
changes are supported by the authors and the community and _if_ the
authors will not write the changes themselves but still support them.
Regardless, I would need my questions answered if possible.

Thanks so much!

Frederick Reeve

-- 
http://mail.python.org/mailman/listinfo/python-list
