DL Neil <pythonl...@danceswithmice.info> wrote on Mon, 24 Jun 2019 at 11:18 AM:
> Yes, better to reply to list - others may 'jump in'...
>
> On 20/06/19 5:37 PM, Windson Yang wrote:
> > Thank you so much for your review DL Neil, it really helps :D. However,
> > there are some parts that still confuse me; I replied below.
>
> It's not a particularly easy topic...
>
> > DL Neil <pythonl...@danceswithmice.info> wrote on Wed, 19 Jun 2019
> > at 2:03 PM:
> >
> >     I've not gone 'back' to refer to any ComSc theory on
> >     buffer-management. Perhaps you might benefit from such?
> >
> > I just took a crash course on it, so I want to know if I understand
> > the details correctly :D
>
> ...there are so many ways one can mess-up!
>
> >     I like your use of the word "shift", so I'll continue to use it.
> >
> >     There are three separate units of data to consider - each of which
> >     could be called a "buffer". To avoid confusing (myself) I'll only
> >     call the 'middle one' that:
> >     1 the unit of data 'coming' from the data-source
> >     2 the "buffer" you are implementing
> >     3 the unit of data 'going' out to a data-destination.
> >
> > Just to make it clear, when we use `f.write('abc')` in python, (1) means
> > 'abc', (2) means the buffer handle by Python (by default 8kb), (2) means
> > the file *f* we are writing to, right?

Sorry, this is my typo, (3) means the file *f* we are writing to, right?

> No! (sorry) f.write() is an output operation, thus nr3.
>
> "f" is not a "buffer handle" but a "file handle" or more accurately a
> "file object".
>
> When we:
>
>     one_input = f.read( NRbytes )
>
> (ignoring EOF/short file and other exceptions) that many bytes will
> 'appear' in our program labelled as "one_input".
>
> However, the OpSys may have read considerably more data, depending upon
> the device(s) involved, the application, etc; eg if we ask for 2 bytes
> the operating system will read a much larger block (or applicable unit)
> of data from a disk drive.
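
A minimal sketch of the "ask for 2 bytes, get a much larger block" effect, shown one layer above the operating system: Python's own `io.BufferedReader` reads ahead in a similar way. `CountingRaw` is a hypothetical helper written only for this illustration, and the 4096-byte buffer size is an arbitrary choice; it says nothing about what any particular OpSys or disk does.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that records how many bytes each request asks for."""

    def __init__(self, data):
        super().__init__()
        self._src = io.BytesIO(data)
        self.requests = []

    def readable(self):
        return True

    def readinto(self, b):
        # len(b) is the size the buffered layer asks the raw layer to fill.
        self.requests.append(len(b))
        return self._src.readinto(b)


raw = CountingRaw(b"x" * 10000)
buffered = io.BufferedReader(raw, buffer_size=4096)

first = buffered.read(2)    # the program asks for only 2 bytes
second = buffered.read(2)   # served from the in-memory buffer

print(first, second, raw.requests)
```

On CPython the list of raw requests should show a single request for the full buffer size, even though the program only ever asked for two bytes at a time.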
>
> The same applies in reverse, with f.write( NRbytes/byte-object ), until
> we flush or close the file.
>
> Those situations account for nr1 and nr3. In the usual case, we have no
> control over the size of these buffers - and it is best not to meddle!

I agree with you.

> Hence:-
>
> >     1 and 3 may be dictated to you, eg hardware or file specifications,
> >     code requirements, etc.
> >
> >     So, data is shifted into the (2) buffer in a unit-size decided by
> >     (1) - in most use-cases each incoming unit will be the same size,
> >     but remember that the last 'unit' may/not be full-size. Similarly,
> >     data shifted out from the (2) buffer to (3).
> >
> >     The size of (1) is likely not that of (3) - otherwise why use a
> >     "buffer"? The size of (2) must be larger than (1) and larger than
> >     (2) - for reasons already illustrated.
> >
> > Is this a typo? (2) larger than (1) larger than (2)?
>
> Correct - well spotted! nr2 > nr1 and nr2 > nr3

When we run `f.write()` with 100 bytes of data, I understand why
nr2 (by default 8kb) > nr1 (100), but I'm not sure why nr2 > nr3
(file object) here?

> >     I recall learning how to use buffers with a series of hand-drawn
> >     block diagrams. Recommend you try similarly!
>
> Try this!
>
> >     Now, let's add a few critiques, as requested (interposed below):-
> >
> >     On 19/06/19 3:53 PM, Windson Yang wrote:
> >     > I'm trying to understand the workflow of how Python reads/writes
> >     > data with a buffer. I would appreciate it if someone could
> >     > review it.
> >     >
> >     > ### Read n data
> >
> >     - may need more than one read operation if the size of (3) "demands"
> >     more data than the size of (1)/one "read".
> >
> > Looks like the length of one read() depends on
> > https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1655 ?
>
> You decide how many bytes should be read. That's how much will be
> transferred from the OpSys' I/O into the Python program's space.
> The major exceptions are that if there is no (more) data available, it
> is defined as an exception (EOF = end of file), or if there are fewer
> bytes of data than requested, you will be given only the number of
> bytes of data available.
>
> >     > 1. If the data already in the buffer, return data
> >
> >     - this is a data-transfer of size (3)
> >
> >     For extra credit/an unnecessary complication (but probable
> >     speed-up!):
> >     * if the data-remaining is less than size (3) consider a read-ahead
> >     mechanism
> >
> >     > 2. If the data not in the buffer:
> >
> >     - if buffer's data-len < size (3)
> >
> >     >     1. copy all the current data from the buffer
> >
> >     * if "buffer" is my (2), then no-op
> >
> > I don't understand your point here. When we read data we would copy some
> > data from the current buffer from python, right?
> > (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1638),
> > we use `out` (which points to res) to store the data here.
>
> We're becoming confused: the original heading 'here' was "### Read n
> data" which is inconsistent with "out" and "from python".
>
> If the read operation is set to transfer (say) 2KB into the program at a
> time, but the code processes it in 100B units, then it would seem that
> after the first read, twenty process loops will run before it is
> necessary to issue another input request.
>
> In that example, the buffer (nr2) is twenty-times the length of the
> input 'buffer' (nr1).
>
> So, from the second to the twentieth iteration of the process, your
> step-1 "1. If the data already in the buffer, return data" (and thus my
> "no-op") applies!
>
> This is a major advantage of having a buffer in the first place -
> transfers within RAM are significantly faster than I/O operations!

Yes, that is what I was trying to say. Looks like I should add more
details for the code.

> >     >     2. create a new buffer object, fill the new buffer with raw
> >     >     read which read data from disk.
> >
> >     * this becomes: perform read operation and append incoming data
> >     (size (1)) to "buffer" - hence why "buffer" is larger than (1), by
> >     definition.
> >     NB if size (1) is smaller than size (3), multiple read operations
> >     may be necessary. Thus a read-loop!?
> >
> > Yes, you are right, here is a while loop
> > (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1652)
> >
> >     >     3. concat the data in the old buffer and new buffer.
> >
> >     = now no-op. Hopefully the description of 'three buffers' removes
> >     this confusion of/between buffers.
> >
> > I don't get it. When we call functions like seek(0) then
> > read(1000), we can still use the data from buffer from python, right?
>
> I fear that we are having terminology issues - see the original
> description of three 'buffers'. Which "buffer" are you talking about?
> 1 the seek/read are carried-out against a file object, which will indeed
> have its own buffer, size unknown to Python. (buffer 1)
> 2 the read(1000) operation will (on its own) allow you to populate a
> buffer within your code, 1000-bytes in length. (buffer 2)

Does the file object have its own buffer? Does it only happen when we
use Standard I/O (FILE*)? I'm not sure I used it in CPython, or maybe I
missed something.

> >     >     4. return the data
> >
> >     * make the above steps into a while-loop and there won't be a
> >     separate step here (it is the existing step 1!)
> >
> >     * build all of the above into a function/method, so that the
> >     'mainline' only has to say 'give me data'!
> >
> >     > ### Write n data
> >     > 1. If data small enough to fill into the buffer, write data to
> >     > the buffer
> >
> >     =yes, the data coming from source (1), which in this case is 'your'
> >     code may/not be sufficient to fill the output size (3). So, load it
> >     into the "buffer" (2).
> >
> >     > 2. If data can't fill into the buffer
> >     >     1. flush the data in the buffer
> >
> >     =This statement seems to suggest that if there is already some data
> >     in the buffer, it will be wiped. Not recommended!
> >
> > We check whether there is any data in the buffer; if there is, we flush
> > it to the disk
> > (https://github.com/python/cpython/blob/master/Modules/_io/bufferedio.c#L1948)
> >
> >     =Have replaced the next steps, see below for your consideration:-
> >
> >     >         1. If succeed:
> >     >             1. create a new buffer object.
> >     >             2. fill the new buffer with data return from raw write
> >     >         2. If failed:
> >     >             1. Shifting the buffer to make room for writing data
> >     >             to the buffer
> >     >             2. Buffer as much writing data as possible (may raise
> >     >             BlockingIOError)
> >     >     2. return the data
> >
> >     After above transfer from data-source (1) to "buffer" (2):
> >
> >     * if len( data in "buffer" ) >= size (3): output
> >       else: keep going
> >
> >     * output:
> >       shift size(3) from "buffer" to output
> >       retain 'the rest' in/as "buffer"
> >
> >     NB if the size (2) of data in "buffer" is/could be multiples of size
> >     (3), then the "output" function should/could become a loop, ie keep
> >     emptying the "buffer" until size (2) < size (3).
> >
> >     Finally, don't forget the special cases:
> >     What happens if we reach 'the end' (of 'input' or 'output' phase),
> >     and there is still data in (1) or (2)?
> >     Presumably, in "Read" we would discard (1), but in the case of
> >     "Write" we MUST empty "buffer" (2), even if it means the last write
> >     is of less than size (3).
> >
> > Yes, you are right, when we are writing data to the buffer and the
> > buffer is full, we have to flush it.
> >
> >     NB The 'rules' for the latter may vary between use-cases, eg add
> >     'stuffing' if the output record MUST be x-bytes long.
> >
> >     Hope this helps.
> >     Do you need to hand-code this stuff though, or is there a better way?
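
The shift-in/shift-out steps quoted above can be sketched in a few lines. This is only an illustration of that description, not the logic in CPython's bufferedio.c; `ShiftBuffer`, `out_size` and `sink` are hypothetical names invented for the example.

```python
class ShiftBuffer:
    """Hypothetical helper: collects incoming bytes and emits fixed-size units."""

    def __init__(self, out_size, sink):
        self.out_size = out_size      # size (3): unit the destination expects
        self.sink = sink              # callable that receives each outgoing unit
        self.pending = bytearray()    # the "buffer" (2)

    def write(self, data):
        # Shift one incoming unit (1) into the buffer.
        self.pending += data
        # Keep emptying while at least one full outgoing unit is available.
        while len(self.pending) >= self.out_size:
            self.sink(bytes(self.pending[:self.out_size]))
            del self.pending[:self.out_size]

    def close(self):
        # At the end of output the remainder must still be written,
        # even though it is shorter than size (3).
        if self.pending:
            self.sink(bytes(self.pending))
            self.pending.clear()


units = []
buf = ShiftBuffer(out_size=4, sink=units.append)
for chunk in (b"ab", b"cdefg", b"hij"):
    buf.write(chunk)
buf.close()
print(units)  # [b'abcd', b'efgh', b'ij']
```

For real file output there is usually no need to hand-code this: the standard library's `io.BufferedWriter` already sits between `f.write()` and the underlying raw file.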
> >
> > I'm trying to write an article for it :D
>
> Perhaps it would help to discuss the use-case you will use as the
> article's example.
>
> "I take a crash course" cf "write an article"???
>
> Web-Refs:
>
> Wikipedia: https://en.wikipedia.org/wiki/Data_buffer
>
> The PSL's IO library (?the code you've been reading):
> https://docs.python.org/3.6/library/io.html?highlight=buffer#io.TextIOBase.buffer
>
> The PSL's Readline library (which may be easier to visualise for
> desktop-type users/coders - unless you're into IoT applications and
> similar)
> https://docs.python.org/3.6/library/readline.html?highlight=buffer
>
> PSL's Buffer protocol, in case you really want to 're-invent the wheel',
> but with some possibly-helpful explanation:
> https://docs.python.org/3.6/c-api/buffer.html?highlight=buffer
>
> --
> Regards =dn
> --
> https://mail.python.org/mailman/listinfo/python-list