On 24/04/2022 09.15, Chris Angelico wrote:
> On Sun, 24 Apr 2022 at 07:13, Marco Sulla <marco.sulla.pyt...@gmail.com>
> wrote:
>>
>> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <ros...@gmail.com> wrote:
>>>>> This is quite inefficient in general.
>>>>
>>>> Why inefficient? I think that readlines() will be much slower, not
>>>> only more time consuming.
>>>
>>> It depends on which is more costly: reading the whole file (cost
>>> depends on size of file) or reading chunks and splitting into lines
>>> (cost depends on how well you guess at chunk size). If the lines are
>>> all *precisely* the same number of bytes each, you can pick a chunk
>>> size and step backwards with near-perfect efficiency (it's still
>>> likely to be less efficient than reading a file forwards, on most file
>>> systems, but it'll be close); but if you have to guess, adjust, and
>>> keep going, then you lose efficiency there.
>>
>> Emh, why chunks? My function simply reads byte per byte and compares it to
>> b"\n". When it find it, it stops and do a readline(): ...
> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> single byte at a time, then reading forwards. That is NOT going to
> play nicely with file systems or buffers.
>
> Compare reading line by line over the file with readlines() and you'll
> see how abysmal this is.
>
> If you really only need one line (which isn't what your original post
> suggested), I would recommend starting with a chunk that is likely to
> include a full line, and expanding the chunk until you have that
> newline. Much more efficient than one byte at a time.

Disagreeing with @Chris only in the sense that I use tail very frequently,
and usually in the context of server logs - but I'm talking about the Linux
implementation, not Python code! Agreeing with @Chris' assessment of the
(in)efficiency.

It is more likely than not that you will have a good idea of the length of
each line. Even if the line-length is highly variable (thinking of some of
my applications of the Python logging module!), one can still 'take a stab
at it' (a "thumb suck", as an engineer-colleague used to say - apparently
not an electrical engineer!) by remembering that lines exceeding 80
characters become less readable, and thus have likely/hopefully been split
into two.

Thus N*(80+p), where N is the number of lines desired and p is a reasonable
'safety' margin of extra bytes per line (an over-estimation allowance),
would give a good chunk size. Grab that much of the end of the file (in
binary mode), split on the line-ending, and take the last N elements from
that list - with 'recovery code' just in case the 'chunk' wasn't large
enough (sketched below).

Adding to the efficiency (of the algorithm, but not the dev-time), consider
that shorter files are likely to be more easily handled by reading serially
from the beginning. To side-step @Chris' criticism, use a generator to
produce the individual lines (lazy evaluation and low storage requirement)
and feed them into a circular queue limited to N entries. QED, as fast as
the machine's I/O, and undemanding of storage space!

Running a few timing trials should reveal the 'sweet spot', at which one
algorithm takes over from the other!

NB quite a few of IBM's (extensively researched) algorithms, which formed
utility program[me]s on mainframes, made similar algorithmic choices in the
pursuit of efficiency.

-- 
Regards,
=dn
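
To put some flesh on the bones of the above, a minimal sketch combining
both approaches - the tail_lines name, the 80+p constants, and the
SMALL_FILE threshold are all illustrative guesses, not tested against
anyone's logs:

import os
from collections import deque

AVG_LINE = 80            # guessed bytes per line - the 'thumb suck'
SAFETY = 20              # p: extra bytes per line, to over-estimate
SMALL_FILE = 64 * 1024   # below this size, just read forwards

def tail_lines(path, n):
    """Return the last n lines of the file at path, as bytes."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        if size <= SMALL_FILE:
            # Short file: read serially from the beginning. The file object
            # is a lazy line generator, and deque(..., maxlen=n) is the
            # circular queue limited to N entries.
            return list(deque(f, maxlen=n))
        chunk = n * (AVG_LINE + SAFETY)   # N*(80+p)
        while True:
            f.seek(max(size - chunk, 0))
            lines = f.read().splitlines(keepends=True)
            # 'Recovery code': the chunk probably starts mid-line, so insist
            # on one spare line beyond the n wanted - unless the whole file
            # has already been swallowed.
            if len(lines) > n or chunk >= size:
                return lines[-n:]
            chunk *= 2   # the guess was too small; retry with a bigger chunk

collections.deque(maxlen=N) gives the circular queue 'for free', and
doubling the chunk on a miss is about the simplest 'recovery code'
available. Measure before trusting the SMALL_FILE cross-over point!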