Re: [Python-Dev] My summary of the scandir (PEP 471)

2014-07-02 Thread Jonas Wielicki
On 01.07.2014 23:20, Paul Moore wrote:
> [snip]
> Please, let's stick to a low-level wrapper round the OS API for the
> first iteration of this feature. Enhancements can be added later, when
> real-world usage has proved their value.
> 
> Paul

+1 to the whole thing. That’s an ingeniously simple solution to the
issues we’re having here.

regards,
jwi

___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471: scandir(fd) and pathlib.Path(name, dir_fd=None)

2014-07-02 Thread Charles-François Natali
2014-07-01 8:44 GMT+01:00 Victor Stinner :
>
> IMO we must decide if scandir() must support or not file descriptor.
> It's an important decision which has an important impact on the API.

I don't think we should support it: it's way too complicated to use,
error-prone, and leads to messy APIs.
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471: scandir(fd) and pathlib.Path(name, dir_fd=None)

2014-07-02 Thread Victor Stinner
2014-07-02 12:51 GMT+02:00 Charles-François Natali :
> I don't think we should support it: it's way too complicated to use,
> error-prone, and leads to messy APIs.

Can you please elaborate? Which kind of issue do you see? Handling the
lifetime of the directory file descriptor?

You don't like the dir_fd parameter of os functions?

I don't have an opinion of supporting scandir(int). I asked to discuss
it in the PEP directly.

Victor
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] My summary of the scandir (PEP 471)

2014-07-02 Thread Ben Hoyt
Thanks for the effort in your response, Paul.

I'm all for KISS, but let's just slow down a bit here.

> I think that thin wrapper is needed - even
> if the various bells and whistles are useful, they can be built on top
> of a low-level version (whereas the converse is not the case).

Yes, but API design is important. For example, urllib2 has a kind of
the "thin wrapper approach", but millions of people use the 3rd-party
"requests" library because it's just so much nicer to use.

There are low-level functions in the "os" module, but there are also a
lot of higher-level functions (os.walk) and functions that smooth over
cross-platform issues (os.stat).

Detailed comments below.

> The return value is an object whose attributes correspond to the data
> the OS returns about a directory entry:
>
>   * name - the object's name
>   * full_name - the object's full name (including path)
>   * is_dir - whether the object is a directory
>   * is file - whether the object is a plain file
>   * is_symlink - whether the object is a symbolic link
>
> On Windows, the following attributes are also available
>
>   * st_size - the size, in bytes, of the object (only meaningful for files)
>   * st_atime - time of last access
>   * st_mtime - time of last write
>   * st_ctime - time of creation
>   * st_file_attributes - Windows file attribute bits (see the
> FILE_ATTRIBUTE_* constants in the stat module)

Again, this seems like a nice simple idea, but I think it's actually a
worst-of-both-worlds solution -- it has a few problems:

1) It's a nasty API to actually write code with. If you try to use it,
it gives off a "made only for low-level library authors" rather than
"designed for developers" smell. For example, here's a get_tree_size()
function I use written in both versions (original is the PEP 471
version with the addition of .full_name):

def get_tree_size_original(path):
"""Return total size of all files in directory tree at path."""
total = 0
for entry in os.scandir(path):
if entry.is_dir():
total += get_tree_size_original(entry.full_name)
else:
total += entry.lstat().st_size
return total

def get_tree_size_new(path):
"""Return total size of all files in directory tree at path."""
total = 0
for entry in os.scandir(path):
if hasattr(entry, 'is_dir') and hasattr(entry, 'st_size'):
is_dir = entry.is_dir
size = entry.st_size
else:
st = os.lstat(entry.full_name)
is_dir = stat.S_ISDIR(st.st_mode)
size = st.st_size
if is_dir:
total += get_tree_size_new(entry.full_name)
else:
total += size
return total

I know which version I'd rather write and maintain! It seems to me new
users and folks new to Python could easily write the top version, but
the bottom is longer, more complicated, and harder to get right. It
would also be very easy to write code in a way that works on Windows
but bombs hard on POSIX.

2) It seems like your assumption is that is_dir/is_file/is_symlink are
always available on POSIX via readdir. This isn't actually the case
(this was discussed in the original threads) -- if readdir() returns
dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
anyway to get it. So, as the above definition of get_tree_size_new()
shows, you have to use getattr/hasattr on everything:
is_dir/is_file/is_symlink as well as the st_* attributes.

3) It's not much different in concept to the PEP 471 version, except
that PEP 471 has a built-in .lstat() method, making the user's life
much easier. This is the sense in which it's the worst of both worlds
-- it's a far less nice API to use, but it still has the same issues
with race conditions the original does.

So thinking about this again:

First, based on the +1's to Paul's new solution, I don't think people
are too concerned about the race condition issue (attributes being
different between the original readdir and the os.stat calls). I think
this is probably fair -- if folks care, they can handle it in an
application-specific way. So that means Paul's new solution and the
original PEP 471 approach are both okay on that score.

Second, comparing PEP 471 to Nick's solution: error handling is much
more straight-forward and simple to document with the original PEP 471
approach (just try/catch around the function calls) than with Nick's
get_lstat=True approach of doing the stat() if needed inside the
iterator. To catch errors with that approach, you'd either have to do
a "while True" loop and try/catch around next(it) manually (which is
very yucky code), or we'd have to add an onerror callback, which is
somewhat less nice to use and harder to document (signature of the
callback, exception object, etc).

So given all of the above, I'm fairly strongly in favour of the
approach in the original PEP 471 due to it's easy-to-use API and
straight-forward try/catch approach to error handling. (My s

Re: [Python-Dev] My summary of the scandir (PEP 471)

2014-07-02 Thread Paul Moore
tl;dr - I agree with your points and think that the original PEP 471
proposal is fine. The details here are just clarification of why my
proposal wasn't just "use PEP 471 as written" in the first place...

On 2 July 2014 13:41, Ben Hoyt  wrote:
> 1) It's a nasty API to actually write code with. If you try to use it,
> it gives off a "made only for low-level library authors" rather than
> "designed for developers" smell. For example, here's a get_tree_size()
> function I use written in both versions (original is the PEP 471
> version with the addition of .full_name):
>
> def get_tree_size_original(path):
> """Return total size of all files in directory tree at path."""
> total = 0
> for entry in os.scandir(path):
> if entry.is_dir():
> total += get_tree_size_original(entry.full_name)
> else:
> total += entry.lstat().st_size
> return total
>
> def get_tree_size_new(path):
> """Return total size of all files in directory tree at path."""
> total = 0
> for entry in os.scandir(path):
> if hasattr(entry, 'is_dir') and hasattr(entry, 'st_size'):
> is_dir = entry.is_dir
> size = entry.st_size
> else:
> st = os.lstat(entry.full_name)
> is_dir = stat.S_ISDIR(st.st_mode)
> size = st.st_size
> if is_dir:
> total += get_tree_size_new(entry.full_name)
> else:
> total += size
> return total
>
> I know which version I'd rather write and maintain!

Fair point. But *only* because is_dir isn't guaranteed to be
available. I could debate other aspects of your translation to use my
API, but it's not relevant as my proposal was flawed in terms of
is_XXX.

> It seems to me new
> users and folks new to Python could easily write the top version, but
> the bottom is longer, more complicated, and harder to get right.

Given the is_dir point, agreed.

> It
> would also be very easy to write code in a way that works on Windows
> but bombs hard on POSIX.

You may have a point here - my Windows bias may be showing. It's
already awfully easy to write code that works on POSIX but bombs hard
on Windows (deleting open files, for example) so I find it tempting to
think "give them a taste of their own medicine" :-)

More seriously, it seems to me that the scandir API is specifically
designed to write efficient code on platforms where the OS gives
information that allows you to do so. Warping the API too much to
cater for platforms where that isn't possible seems to have the
priorities backwards. Making the API not be an accident waiting to
happen is fine, though.

And let's be careful, too. My position is that it's not too hard to
write code that works on Windows, Linux and OS X but you're right you
could miss the problem with platforms that don't even support a free
is_dir(). It's *easier* to write Windows-only code by mistake, but the
fix to cover the "big three" is pretty simple (if not hasattr, lstat).

> 2) It seems like your assumption is that is_dir/is_file/is_symlink are
> always available on POSIX via readdir. This isn't actually the case
> (this was discussed in the original threads) -- if readdir() returns
> dirent.d_type as DT_UNKNOWN, then you actually have to call os.stat()
> anyway to get it. So, as the above definition of get_tree_size_new()
> shows, you have to use getattr/hasattr on everything:
> is_dir/is_file/is_symlink as well as the st_* attributes.

Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially,
that said "everywhere" to me. It might be worth calling out
specifically some examples where it's not available without an extra
system call, just to make the point explicit.

You're right, though, that blows away the simplicity of my proposal.
The original PEP 471 seems precisely right to me, in that case.

I was only really arguing for attributes because they seem more
obviously static than a method call. And personally I don't care about
that aspect.

> 3) It's not much different in concept to the PEP 471 version, except
> that PEP 471 has a built-in .lstat() method, making the user's life
> much easier. This is the sense in which it's the worst of both worlds
> -- it's a far less nice API to use, but it still has the same issues
> with race conditions the original does.

Agreed. My intent was never to remove the race conditions, I see them
as the responsibility of the application to consider (many
applications simply won't care, and those that do will likely want a
specific solution, not a library-level compromise).

> So thinking about this again:
>
> First, based on the +1's to Paul's new solution, I don't think people
> are too concerned about the race condition issue (attributes being
> different between the original readdir and the os.stat calls). I think
> this is probably fair -- if folks care, they can handle it in an
> application-specific way. So that means Paul's new solution and the
> original PEP 471 appr

Re: [Python-Dev] My summary of the scandir (PEP 471)

2014-07-02 Thread Ben Hoyt
Thanks for the clarifications and support.

> Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially,
> that said "everywhere" to me. It might be worth calling out
> specifically some examples where it's not available without an extra
> system call, just to make the point explicit.

Good call. I'll update the wording in the PEP here and try to call out
specific examples of where is_dir() could call os.stat().

Hard-core POSIX people, do you know when readdir() d_type will be
DT_UNKNOWN on (for example) Linux or OS X? I suspect this can happen
on certain network filesystems, but I'm not sure.

> PS I'd suggest adding a "Rejected proposals" section to the PEP which
> mentions the race condition issue and points to this discussion as an
> indication that people didn't seem to see it as a problem.

Definitely agreed. I'll add this, and clarify various other issues in
the PEP, and then repost.

-Ben
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 471: scandir(fd) and pathlib.Path(name, dir_fd=None)

2014-07-02 Thread Charles-François Natali
> 2014-07-02 12:51 GMT+02:00 Charles-François Natali :
>> I don't think we should support it: it's way too complicated to use,
>> error-prone, and leads to messy APIs.
>
> Can you please elaborate? Which kind of issue do you see? Handling the
> lifetime of the directory file descriptor?

Yes, among other things. You can e.g. have a look at os.fwalk() or
shutil._rmtree_safe_fd() to see that using those *properly* is far
from being trivial.

> You don't like the dir_fd parameter of os functions?

Exactly, I think it complicates the API for little benefit (FWIW, no
other language I know of exposes them).
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] My summary of the scandir (PEP 471)

2014-07-02 Thread Nikolaus Rath
Ben Hoyt  writes:
> Thanks for the clarifications and support.
>
>> Ah, the wording in the PEP says "Linux, Windows, OS X". Superficially,
>> that said "everywhere" to me. It might be worth calling out
>> specifically some examples where it's not available without an extra
>> system call, just to make the point explicit.
>
> Good call. I'll update the wording in the PEP here and try to call out
> specific examples of where is_dir() could call os.stat().
>
> Hard-core POSIX people, do you know when readdir() d_type will be
> DT_UNKNOWN on (for example) Linux or OS X? I suspect this can happen
> on certain network filesystems, but I'm not sure.

Any fuse file system mounted by some other user and without -o
allow_other. For these entries, stat() will fail as well.


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com