On 4/25/15 5:52 AM, Jilles Tjoelker wrote:
On Fri, Apr 24, 2015 at 04:28:12PM -0400, John Baldwin wrote:
Yes, this isn't at all safe. There's no guarantee whatsoever that
the offset on the directory fd that isn't something returned by
getdirentries has any meaning. In particular, the size of the
directory entry in a random filesystem might be a different size
than the structure returned by getdirentries (since it converts
things into a FS-independent format).
This might work for UFS by accident, but this is probably why ZFS
doesn't work.
Turns out they are all broken in the face of deletes.
I managed to make a patch that fixes things in the simple case of "backing
up by one (and only one time) record", which is what samba needs to do.
Samba reads until the latest entry won't fit (buffer is full) and then
effectively pushes the last one back by doing a seekdir back one record.
It then sends the full buffer back to the client and continus reading
at the
next record, which is the one it pushed back.
It turs out that because the buffer holding the last dirent presented
is always
still resident in memory, it is ALWAYS possible to slip the 'current
entry' pointer
back ONE entry, (but no more).
This unbreaks samba, even in the case of deletes. (at least with UFS
and ZFS).
I'll make the patch available when I've tested it more and cleaned it up.
However, this might be properly fixed by the thing that ino64 is
doing where each directory entry returned by getdirentries gives
you a seek offset that you _can_ directly seek to (as opposed to
seeking to the start of the block and then walking forward N
entries until you get an inter-block entry that is the same).
The ino64 branch only reserves space for d_off and does not use it in
any way. This is appropriate since actually using d_off is a major
feature addition.
d_cookie please.. :-)
A proper d_off would still be useful even if UFS's readdir keeps masking
off the offset so a directory read always starts at the beginning of a
512-byte directory block, since this allows more distinct offset values
than safely using getdirentries()'s *basep. With d_off, one outer loop
must read at least one directory block to avoid spinning indefinitely,
while using getdirentries()'s *basep requires reading the whole
getdirentries() buffer.
not sure I follow what you mean by that.
Some Linux filesystems go further and provide a unique d_off for each
entry.
yes ZFS provides that.
we currently don't export it beyond teh VFS, but there is code that is
supposed to
supply a cookie for each entry. It could be put into the dirent
structure should
we want to do that. Currently it is put into a separate memory array
in parallel
to the dirents, except that we give it a NULL pointer, which means "we
don't want
cookies thanks" .
We really need to do something because the current system is really
broken.
And the fact that dirent has *32 bit* inode number in it was a
shock.. I'd presumed
that had gone the way of the dinosaurs and dodo.
I think 11 needs to have a new dirent structure given out by a new
syscall.
(old one still present for compat reasons). Whether we need a
readdir64() and friends
I have not yet decided. Maybe it's time to bump libc's number again :-)
Another idea would be to store the last d_ino instead of dd_loc into the
struct ddloc. On seekdir(), this would seek to loc_seek as before and
skip entries until that d_ino is found, or to the start of the buffer if
not found (and possibly return some entries again that should not be
returned, but Samba copes with that).
I didn't see it coping with that.. an earlier version of my patch
suffered from duplicate entries in samba.
it is also dangerous because multiple files linked to the same inode
would get confused.
it may be possible however to link the telldir cookies to a HASH of
the filename
and inode number.
(assuming of course we don't have cookies directly from the fiesystem
itself.)
the current scheme of seeking to a buffer boundary and seeking forward in
the buffer looking for a matching entry would become quite robust
using this scheme,
assuming that the lseek actually worked on that file system.
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"