from:"Eric Sandeen"

Re: [PATCH] btrfs-progs: fix one bracket issue in mkfs.btrfs manpage

2013-03-21 Thread Eric Sandeen

On 3/21/13 2:17 AM, zwu.ker...@gmail.com wrote:
> From: Zhi Yong Wu 
> 
>   In "[ \fB\-f\fP\fI ]", the "\fI" will result in the front half "["of
> "[ -f ]" doesn't the back half "]"; When you issue the command
> "man mkfs.btrfs", you will see the difference.
> 
> Signed-off-by: Zhi Yong Wu 

Whoops, yes. Thanks for spotting that.

Reviewed-by: Eric Sandeen 

> ---
>  man/mkfs.btrfs.8.in | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man/mkfs.btrfs.8.in b/man/mkfs.btrfs.8.in
> index cdccd6a..a3f1503 100644
> --- a/man/mkfs.btrfs.8.in
> +++ b/man/mkfs.btrfs.8.in
> @@ -6,7 +6,7 @@ mkfs.btrfs \- create a btrfs filesystem
>  [ \fB\-A\fP\fI alloc-start\fP ]
>  [ \fB\-b\fP\fI byte-count\fP ]
>  [ \fB\-d\fP\fI data-profile\fP ]
> -[ \fB\-f\fP\fI ]
> +[ \fB\-f\fP ]
>  [ \fB\-l\fP\fI leafsize\fP ]
>  [ \fB\-L\fP\fI label\fP ]
>  [ \fB\-m\fP\fI metadata profile\fP ]
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Eric Sandeen

On 10/23/12 5:19 PM, Theodore Ts'o wrote:
> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
>>
>> It is now quite clear that this is a bug introduced by one or more of
>> the post-3.6.1 ext4 patches (which have all been backported at least to
>> 3.5, so the problem is probably there too).
>>
>> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: 
>> group 202, 1583 clusters in bitmap, 1675 in gd
>> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 
>> 0). There's a risk of filesystem corruption in case of system crash.
>>
> 
> I think I've found the problem.  I believe the commit at fault is commit
> 14b4ed22a6 (upstream commit eeecef0af5e):
> 
> jbd2: don't write superblock when if its empty
> 
> which first appeared in v3.6.2.
> 
> The reason why the problem happens rarely is that the effect of the
> buggy commit is that if the journal's starting block is zero, we fail
> to truncate the journal when we unmount the file system.  This can
> happen if we mount and then unmount the file system fairly quickly,
> before the log has a chance to wrap.After the first time this has
> happened, it's not a disaster, since when we replay the journal, we'll
> just replay some extra transactions.  But if this happens twice, the
> oldest valid transaction will still not have gotten updated, but some
> of the newer transactions from the last mount session will have gotten
> written by the very latest transacitons, and when we then try to do
> the extra transaction replays, the metadata blocks can end up getting
> very scrambled indeed.

I'm stumped by this; maybe Ted can see if I'm missing something.

(and Nix, is there anything special about your fs?  Any nondefault
mkfs or mount options, external journal, inordinately large fs, or
anything like that?)

The suspect commit added this in jbd2_mark_journal_empty():

/* Is it already empty? */
if (sb->s_start == 0) {
read_unlock(&journal->j_state_lock);
return;
}

thereby short circuiting the function.

But Ted's suggestion that mounting the fs, doing a little work, and
unmounting before we wrap would lead to this doesn't make sense to
me.  When I do a little work, s_start is at 1, not 0.  We start
the journal at s_first:

load_superblock()
journal->j_first = be32_to_cpu(sb->s_first);

And when we wrap the journal, we wrap back to j_first:

jbd2_journal_next_log_block():
if (journal->j_head == journal->j_last)
journal->j_head = journal->j_first;

and j_first comes from s_first, which is set at journal creation
time to be "1" for an internal journal.

So s_start == 0 sure looks special to me; so far I can only see that
we get there if we've been through jbd2_mark_journal_empty() already,
though I'm eyeballing jbd2_journal_get_log_tail() as well.

Ted's proposed patch seems harmless but so far I don't understand
what problem it fixes, and I cannot recreate getting to
jbd2_mark_journal_empty() with a dirty log and s_start == 0.

-Eric

> *Sigh*.  My apologies for not catching this when I reviewed this
> patch.  I believe the following patch should fix the bug; once it's
> reviewed by other ext4 developers, I'll push this to Linus ASAP.
> 
>   - Ted
> 
> commit 26de1ba5acc39f0ab57ce1ed523cb128e4ad73a4
> Author: Theodore Ts'o 
> Date:   Tue Oct 23 18:15:22 2012 -0400
> 
> jbd2: fix a potential fs corrupting bug in jbd2_mark_journal_empty
> 
> Fix a potential file system corrupting bug which was introduced by
> commit eeecef0af5ea4efd763c9554cf2bd80fc4a0efd3: jbd2: don't write
> superblock when if its empty.
> 
> We should only skip writing the journal superblock if there is nothing
> to do --- not just when s_start is zero.
> 
> This has caused users to report file system corruptions in ext4 that
> look like this:
> 
> EXT4-fs error (device sdb3): ext4_mb_generate_buddy:741: group 436, 22902 
> clusters in bitmap, 22901 in gd
> JBD2: Spotted dirty metadata buffer (dev = sdb3, blocknr = 0). There's a 
> risk of filesystem corruption in case of system crash.
> 
> after the file system has been corrupted.
> 
> Signed-off-by: "Theodore Ts'o" 
> Cc: sta...@vger.kernel.org
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 0f16edd..0064181 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1351,18 +1351,20 @@ void jbd2_journal_update_sb_log_tail(journal_t 
> *journal, tid_t tail_tid,
>  static void jbd2_mark_journal_empty(journal_t *journal)
>  {
>   journal_superblock_t *sb = journal->j_superblock;
> + __be32  new_tail_sequence;
>  
>   BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
>   read_lock(&journal->j_state_lock);
> - /* Is it already empty? */
> - if (sb->s_start == 0) {
> + new_tail_sequence = cpu_to_be32(journal->j_tail_sequence);
> + /* Nothi

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Eric Sandeen

On 10/23/12 3:57 PM, Nix wrote:



> (I'd provide more sample errors, but this bug has been eating
> newly-written logs in /var all day, so not much has survived.)
> 
> I rebooted into 3.6.1 rescue mode and fscked everything: lots of
> orphans, block group corruption and cross-linked files. The problems did
> not recur upon booting from 3.6.1 into 3.6.1 again. It is quite clear
> that metadata changes made in 3.6.3 are not making it to disk reliably,
> thus leading to corrupted filesystems marked clean on reboot into other
> kernels: pretty much every file appended to in 3.6.3 loses some or all
> of its appended data, and newly allocated blocks often end up
> cross-linked between multiple files.
> 
> The curious thing is this doesn't affect every filesystem: for a while
> it affected only /var, and now it's affecting only /var and /home. The
> massive writes to the ext4 filesystem mounted on /usr/src seem to have
> gone off without incident: fsck reports no problems.
> 
> 
> The only unusual thing about the filesystems on this machine are that
> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
> 'nobarrier': 

I should have read more.  :(  More questions follow:

* Does the Areca have a battery backed write cache?
* Are you crashing or rebooting cleanly?
* Do you see log recovery messages in the logs for this filesystem?

> the full set of options for all my ext4 filesystems are:
> 
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

ok journal_async_commit is off the reservation a bit; that's really not
tested, and Jan had serious reservations about its safety.

* Can you reproduce this w/o journal_async_commit?

-Eric

> If there's anything I can do to help, I'm happy to do it, once I've
> restored my home directory from backup :(
> 
> 
> tune2fs output for one of the afflicted filesystems (after fscking):
> 
> tune2fs 1.42.2 (9-Apr-2012)
> Filesystem volume name:   home
> Last mounted on:  /home
> Filesystem UUID:  95bd22c2-253c-456f-8e36-b6cfb9ecd4ef
> Filesystem magic number:  0xEF53
> Filesystem revision #:1 (dynamic)
> Filesystem features:  has_journal ext_attr resize_inode dir_index 
> filetype needs_recovery extent flex_bg sparse_super large_file huge_file 
> uninit_bg dir_nlink extra_isize
> Filesystem flags: signed_directory_hash
> Default mount options:(none)
> Filesystem state: clean
> Errors behavior:  Continue
> Filesystem OS type:   Linux
> Inode count:  3276800
> Block count:  13107200
> Reserved block count: 655360
> Free blocks:  5134852
> Free inodes:  3174777
> First block:  0
> Block size:   4096
> Fragment size:4096
> Reserved GDT blocks:  20
> Blocks per group: 32768
> Fragments per group:  32768
> Inodes per group: 8192
> Inode blocks per group:   512
> RAID stripe width:16
> Flex block group size:64
> Filesystem created:   Tue May 26 21:29:41 2009
> Last mount time:  Tue Oct 23 21:32:07 2012
> Last write time:  Tue Oct 23 21:32:07 2012
> Mount count:  2
> Maximum mount count:  20
> Last checked: Tue Oct 23 21:22:16 2012
> Check interval:   15552000 (6 months)
> Next check after: Sun Apr 21 21:22:16 2013
> Lifetime writes:  1092 GB
> Reserved blocks uid:  0 (user root)
> Reserved blocks gid:  0 (group root)
> First inode:  11
> Inode size:   256
> Required extra isize: 28
> Desired extra isize:  28
> Journal inode:8
> First orphan inode:   1572907
> Default directory hash:   half_md4
> Directory Hash Seed:  a201983d-d8a3-460b-93ca-eb7804b62c23
> Journal backup:   inode blocks
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-23 Thread Eric Sandeen

On 10/23/12 11:15 PM, Nix wrote:
> On 24 Oct 2012, Eric Sandeen uttered the following:
> 
>> On 10/23/12 3:57 PM, Nix wrote:
>>> The only unusual thing about the filesystems on this machine are that
>>> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
>>> 'nobarrier': 
>>
>> I should have read more.  :(  More questions follow:
>>
>> * Does the Areca have a battery backed write cache?
> 
> Yes (though I'm not powering off, just rebooting). Battery at 100% and
> happy, though the lack of power-off means it's not actually getting
> used, since the cache is obviously mains-backed as well.
> 
>> * Are you crashing or rebooting cleanly?
> 
> Rebooting cleanly, everything umounted happily including /home and /var.
> 
>> * Do you see log recovery messages in the logs for this filesystem?
> 
> My memory says yes, but nothing seems to be logged when this happens
> (though with my logs on the first filesystem damaged by this, this is
> rather hard to tell, they're all quite full of NULs by now).
> 
> I'll double-reboot tomorrow via the faulty kernel and check, unless I
> get asked not to in the interim. (And then double-reboot again to fsck
> everything...)
> 
>>> the full set of options for all my ext4 filesystems are:
>>>
>>> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
>>> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>>
>> ok journal_async_commit is off the reservation a bit; that's really not
>> tested, and Jan had serious reservations about its safety.
> 
> OK, well, I've been 'testing' it for years :) No problems until now. (If
> anything, I was more concerned about journal_checksum. I thought that
> had actually been implicated in corruption before now...)

It had, but I fixed it AFAIK; OTOH, we turned it off by default
after that episode.

>> * Can you reproduce this w/o journal_async_commit?
> 
> I can try!

Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
journal_async_commit, but that doesn't mean this isn't a regression.

Thanks for the answers... onward.  :)

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Eric Sandeen

On 10/24/2012 12:23 AM, Theodore Ts'o wrote:
> On Tue, Oct 23, 2012 at 11:27:09PM -0500, Eric Sandeen wrote:
>>
>> Ok, fair enough.  If the BBU is working, nobarrier is ok; I don't trust
>> journal_async_commit, but that doesn't mean this isn't a regression.
> 
> Note that Toralf has reported almost exactly the same set of symptoms,
> but he's using an external USB stick --- and as far as I know he
> wasn't using nobarrier and/or the journal_async_commit.  Toralf, can
> you confirm what, if any, mount options you were using when you saw
> it.
> 
> I've been looking at this some more, and there's one other thing that
> the short circuit code does, which is neglects setting the
> JBD2_FLUSHED flag, which is used by the commit code to know when it
> needs to reset the s_start fields in the superblock when we make our
> next commit.  However, this would only happen if the short circuit
> code is getting hit some time other than when the file system is
> getting unmounted --- and that's what Eric and I can't figure out how
> it might be happening.  Journal flushes outside of an unmount does
> happen as part of online resizing, the FIBMAP ioctl, or when the file
> system is frozen.  But it didn't sound like Toralf or Nix was using
> any of those features.  (Toralf, Nix, please correct me if my
> assumptions here is wrong).

If I freeze w/ anything in the log, then s_start !=0 and we proceed
normally.  If I re-freeze w/o anything in the log, it's already set to
FLUSHED (which makes sense) so not re-setting it doesn't matter.  So I
don't see that that's an issue.

As for FIBMAP I think we only do journal_flush if it's data=journal.

In other news, Phoronix is on the case, so expect escalating freakouts ;)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-24 Thread Eric Sandeen

On 10/24/2012 02:49 PM, Nix wrote:
> On 24 Oct 2012, Theodore Ts'o spake thusly:
>> Toralf, Nix, if you could try applying this patch (at the end of this
>> message), and let me know how and when the WARN_ON triggers, and if it
>> does, please send the empty_bug_workaround plus the WARN_ON(1) report.
>> I know about the case where a file system is mounted and then
>> immediately unmounted, but we don't think that's the problematic case.
>> If you see any other cases where WARN_ON is triggering, it would be
>> really good to know
> 
> Confirmed, it triggers. Traceback below.
> 



The warn on triggers, but I can't tell - did the corruption still occur
with Ted's patch?

-Eric

> 
> OK. That umount of local filesystems sprayed your added
> empty bug workaround and WARN_ONs so many times that nearly all of them
> scrolled off the screen -- and because syslogd was dead by now and this
> is where my netconsole logs go, they're lost. I suspect every single
> umounted filesystem sprayed one of these (and this happened long before
> any reboot-before-we're-done).
> 
> But I did the old trick of camera-capturing the last one (which was
> probably /boot, which has never got corrupted because I hardly ever
> write anything to it at all). I hope it's more useful than nothing. (I
> can rearrange things to umount /var last, and try again, if you think
> that a specific warning from an fs known to get corrupted is especially
> likely to be valuable.)
> 
> So I see, for one umount at least (and the chunk of the previous one
> that scrolled offscreen is consistent with this):
> 
> jbd2_mark_journal_empty bug workaround (21218, 21219)
> [obscured by light] at fs/jbd2/journal.c:1364 jbd2_mark_journal_empty+06c/0xbd
> ...
> [addresses omitted for sanity: traceback only]
> warn_slowpath_common+0x83/0x9b
> warn_slowpath_null+0x1a/0x1c
> jbd2_mark_journal_empty+06c/0xbd
> jbd2_journal_destroy+0x183/0x20c
> ? abort_exclusive_wait+0x8e/0x8e
> ext4_put_super+0x6c/0x316
> ? evict_inodes+0xe6/0xf1
> generic_shutdown_super+0x59/0xd1
> ? free_vfsmnt+0x18/0x3c
> kill_block_super+0x27/0x6a
> deactivate_locked_super+0x26/0x57
> deactivate_super+0x3f/0x43
> mntput_no_expire+0x134/0x13c
> sys_umount+0x308/0x33a
> system_call_fastpath+0x16/0x1b

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: EXT4-fs error w/ external USB drive

2012-10-25 Thread Eric Sandeen

On 10/25/12 1:20 PM, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 06:39:30PM +0200, Toralf Förster wrote:
>> After a lot of file operations (Gentoo emerging, kernel build, git
>> pulls, ...) I s2disk the system (that with the external USB drive)
>> yesterday, wake it up today, rebooted it -
>> and had to manually repair the file system, because the automatic fsck
>> gave up.
> 
> OK, I'm going to send another patch series which I'd hope you could
> test to see if reduces the rate at which this happens.
> 
>> Nevertheless there's another Linux system I have (64bit RH EL,internal
>> drive), where with kernel 3.5.4-1.el6.elrepo.x86_64 EXT4 errors occurred.
>> I attached the whole appropriate section of /var/log/message.
> 
> I don't have easy access to the RHEL kernel sources, and so I don't

Just FWIW, that's a 3rd party kernel, not something Red Hat
ships.  (see "elrepo")

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: xfs [_fsr] probs in 2.6.24.0

2008-02-12 Thread Eric Sandeen

Linda Walsh wrote:
> 
> David Chinner wrote:
>> Filesystem bugs rarely hang systems hard like that - more likely is
>> a hardware or driver problem. And neither of the lockdep reports
>> below are likely to be responsible for a system wide, no-response
>> hang.
> ---
>   "Ish", the 32-bitter, has been the only hard-hanger.

4k stacks?

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: xfs [_fsr] probs in 2.6.24.0

2008-02-12 Thread Eric Sandeen

Linda Walsh wrote:
> 
> Eric Sandeen wrote:
>> Linda Walsh wrote:
>>> David Chinner wrote:
>>>> Filesystem bugs rarely hang systems hard like that - more likely is
>>>> a hardware or driver problem. And neither of the lockdep reports
>>>> below are likely to be responsible for a system wide, no-response
>>>> hang.
>>> ---
>>> "Ish", the 32-bitter, has been the only hard-hanger.
>> 4k stacks?
> 
>   But but but...almost from the day they were introduced.  And
> these are more recent probs. Has stack usage increased for some reason,
> :-(.  I do have the option to detect stack-overflow turned on as well
> -- guess it doesn't work so well?

Resource requirements grow over time, film at 11? :)

the checker is a random thing, it checks only on interrupts; it won't
always hit.  you could try CONFIG_DEBUG_STACK_USAGE too, each thread
prints max stack used when it exits, to see if you're getting close on
normal usage.

Or just use 8k.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] ext3 freeze feature

2008-02-15 Thread Eric Sandeen

Takashi Sato wrote:
> Hi,
> 
> Christoph Hellwig wrote:
>> On Fri, Feb 08, 2008 at 08:26:57AM -0500, Andreas Dilger wrote:
>>> You may as well make the common ioctl the same as the XFS version,
>>> both by number and parameters, so that applications which already
>>> understand the XFS ioctl will work on other filesystems.
>> Yes.  In facy you should be able to lift the implementations of
>> XFS_IOC_FREEZE and XFS_IOC_THAW to generic code, there's nothing
>> XFS-specific in there.
> 
> According to Documentation/ioctl-number.txt,
> XFS_IOC_XXXs (_IOWR('X', aa, bb)) are defined for XFS like below.
> From Documentation/ioctl-number.txt:
> 
> CodeSeq#Include FileComments
> 
> : :
> 'X' all linux/xfs_fs.h
> 

It also says:

'f' 00-1F   linux/ext2_fs.h

and yet include/linux.h has:

#define FS_IOC_GETFLAGS _IOR('f', 1, long)
#define FS_IOC_SETFLAGS _IOW('f', 2, long)

as generic vfs ioctls.  These ioctls started out as
EXT2_IOC_SETFLAGS/EXT2_IOC_GETFLAGS but they were generically useful,
other filesystems picked them up, and they were "elevated" to the vfs.

> So XFS_IOC_FREEZE and XFS_IOC_THAW cannot be lifted to generic code simply.

It would be a simple matter of changing the documentation, I think.

> I think we should create new generic numbers for freeze and thaw
> like FIBMAP as followings.
> linux/fs.h:
> #define FIFREEZE _IO(0x00,3)
> #define FITHAW   _IO(0x00,4)
> 
> And xfs_freeze calls XFS_IOC_FREEZE with a magic number 1, but what is 1?

Looks like it's called "level" but it's probably a holdover, it doesn't
look like it's used.

> Instead, I'd like the sec to timeout on freeze API in order to thaw
> the filesystem automatically.  It can prevent a filesystem from staying
> frozen forever.
> (Because a freezer may cause a deadlock by accessing the frozen filesystem.)

I'm still not very comfortable with the timeout; if you un-freeze on a
timer, how do you know that the work for which you needed the fileystem
frozen is complete?  How would you know if your snapshot was good if
there's a possibility that the fs unfroze while it was being taken?

Thanks,
-Eric

> Any comments are very welcome.
> 
> Cheers, Takashi 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V2] perf: Fix parallel build

2012-09-20 Thread Eric Sandeen

On 9/20/12 9:07 PM, Namhyung Kim wrote:
> Hi again,
> 
> On Thu, 20 Sep 2012 20:12:50 -0500, Eric Sandeen wrote:
>> -$(OUTPUT)util/parse-events-flex.c: util/parse-events.l
>> +$(OUTPUT)util/parse-events-flex.c: util/parse-events.l 
>> util/parse-events-bison.c
> 
> I realized that the generated *-bison.c files should have $(OUTPUT)
> prefix since they can be under another directory than *.[ly] files.
> 
> 
>>  $(QUIET_FLEX)$(FLEX) --header-file=$(OUTPUT)util/parse-events-flex.h 
>> $(PARSER_DEBUG_FLEX) -t util/parse-events.l > 
>> $(OUTPUT)util/parse-events-flex.c
>>  
>>  $(OUTPUT)util/parse-events-bison.c: util/parse-events.y
>>  $(QUIET_BISON)$(BISON) -v util/parse-events.y -d $(PARSER_DEBUG_BISON) 
>> -o $(OUTPUT)util/parse-events-bison.c
>>  
>> -$(OUTPUT)util/pmu-flex.c: util/pmu.l
>> +$(OUTPUT)util/pmu-flex.c: util/pmu.l util/pmu-bison.c
> 
> Ditto.

Sorry.  Too long since I worked on Makefiles :(

V3 soon ;)

-Eric

> Thanks,
> Namhyung
> 
> 
>>  $(QUIET_FLEX)$(FLEX) --header-file=$(OUTPUT)util/pmu-flex.h -t 
>> util/pmu.l > $(OUTPUT)util/pmu-flex.c
>>  
>>  $(OUTPUT)util/pmu-bison.c: util/pmu.y

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH V3] perf: Fix parallel build

2012-09-20 Thread Eric Sandeen

Parallel builds of perf were failing for me on a 32p box, with:

* new build flags or prefix
util/pmu.l:7:23: error: pmu-bison.h: No such file or directory

...

make: *** [util/pmu-flex.o] Error 1
make: *** Waiting for unfinished jobs

This can pretty quickly be seen by adding a sleep in front of
the bison calls in tools/perf/Makefile and running make -j4 on a
smaller box i.e.:

sleep 10; $(QUIET_BISON)$(BISON) -v util/pmu.y -d -o 
$(OUTPUT)util/pmu-bison.c

Adding the following dependencies fixes it for me.

Signed-off-by: Eric Sandeen 
---

V2: Fix other bison dependency caught by Namhyung Kim , 
thanks.
V3: Add $(OUTPUT) to generated deps, thanks again.

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 35655c3..01325cd 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -221,13 +221,13 @@ export PERL_PATH
 FLEX = flex
 BISON= bison
 
-$(OUTPUT)util/parse-events-flex.c: util/parse-events.l
+$(OUTPUT)util/parse-events-flex.c: util/parse-events.l 
$(OUTPUT)util/parse-events-bison.c
$(QUIET_FLEX)$(FLEX) --header-file=$(OUTPUT)util/parse-events-flex.h 
$(PARSER_DEBUG_FLEX) -t util/parse-events.l > $(OUTPUT)util/parse-events-flex.c
 
 $(OUTPUT)util/parse-events-bison.c: util/parse-events.y
$(QUIET_BISON)$(BISON) -v util/parse-events.y -d $(PARSER_DEBUG_BISON) 
-o $(OUTPUT)util/parse-events-bison.c
 
-$(OUTPUT)util/pmu-flex.c: util/pmu.l
+$(OUTPUT)util/pmu-flex.c: util/pmu.l $(OUTPUT)util/pmu-bison.c
$(QUIET_FLEX)$(FLEX) --header-file=$(OUTPUT)util/pmu-flex.h -t 
util/pmu.l > $(OUTPUT)util/pmu-flex.c
 
 $(OUTPUT)util/pmu-bison.c: util/pmu.y


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Eric Sandeen

On 10/24/12 3:17 PM, Jan Kara wrote:
> On Tue 23-10-12 19:57:09, Eric Sandeen wrote:
>> On 10/23/12 5:19 PM, Theodore Ts'o wrote:
>>> On Tue, Oct 23, 2012 at 09:57:08PM +0100, Nix wrote:
>>>>
>>>> It is now quite clear that this is a bug introduced by one or more of
>>>> the post-3.6.1 ext4 patches (which have all been backported at least to
>>>> 3.5, so the problem is probably there too).
>>>>
>>>> [   60.290844] EXT4-fs error (device dm-3): ext4_mb_generate_buddy:741: 
>>>> group 202, 1583 clusters in bitmap, 1675 in gd
>>>> [   60.291426] JBD2: Spotted dirty metadata buffer (dev = dm-3, blocknr = 
>>>> 0). There's a risk of filesystem corruption in case of system crash.
>>>>
>>>
>>> I think I've found the problem.  I believe the commit at fault is commit
>>> 14b4ed22a6 (upstream commit eeecef0af5e):
>>>
>>> jbd2: don't write superblock when if its empty
>>>
>>> which first appeared in v3.6.2.
>>>
>>> The reason why the problem happens rarely is that the effect of the
>>> buggy commit is that if the journal's starting block is zero, we fail
>>> to truncate the journal when we unmount the file system.  This can
>>> happen if we mount and then unmount the file system fairly quickly,
>>> before the log has a chance to wrap.After the first time this has
>>> happened, it's not a disaster, since when we replay the journal, we'll
>>> just replay some extra transactions.  But if this happens twice, the
>>> oldest valid transaction will still not have gotten updated, but some
>>> of the newer transactions from the last mount session will have gotten
>>> written by the very latest transacitons, and when we then try to do
>>> the extra transaction replays, the metadata blocks can end up getting
>>> very scrambled indeed.
>>
>> I'm stumped by this; maybe Ted can see if I'm missing something.
>>
>> (and Nix, is there anything special about your fs?  Any nondefault
>> mkfs or mount options, external journal, inordinately large fs, or
>> anything like that?)
>>
>> The suspect commit added this in jbd2_mark_journal_empty():
>>
>> /* Is it already empty? */
>> if (sb->s_start == 0) {
>> read_unlock(&journal->j_state_lock);
>> return;
>> }
>>
>> thereby short circuiting the function.
>>
>> But Ted's suggestion that mounting the fs, doing a little work, and
>> unmounting before we wrap would lead to this doesn't make sense to
>> me.  When I do a little work, s_start is at 1, not 0.  We start
>> the journal at s_first:
>>
>> load_superblock()
>>  journal->j_first = be32_to_cpu(sb->s_first);
>>
>> And when we wrap the journal, we wrap back to j_first:
>>
>> jbd2_journal_next_log_block():
>> if (journal->j_head == journal->j_last)
>> journal->j_head = journal->j_first;
>>
>> and j_first comes from s_first, which is set at journal creation
>> time to be "1" for an internal journal.
>>
>> So s_start == 0 sure looks special to me; so far I can only see that
>> we get there if we've been through jbd2_mark_journal_empty() already,
>> though I'm eyeballing jbd2_journal_get_log_tail() as well.
>>
>> Ted's proposed patch seems harmless but so far I don't understand
>> what problem it fixes, and I cannot recreate getting to
>> jbd2_mark_journal_empty() with a dirty log and s_start == 0.
>   Agreed. I rather thing we might miss journal->j_flags |= JBD2_FLUSHED
> when shortcircuiting jbd2_mark_journal_empty(). But I still don't exactly
> see how that would cause the corruption...

Agreed, except so far I cannot see any way to get here with s_start == 0
without ALREADY having JBD2_FLUSHED set.  Can you?

Anyway, I think the problem is still poorly understood; lots of random facts
floating about, and a pretty weird usecase with nonstandard/dangerous mount
options.  I do want to figure out what regressed (if anything) but so far
this investigation doesn't seem very methodical.

-Eric

>   Honza
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-26 Thread Eric Sandeen

On 10/23/12 3:57 PM, Nix wrote:
> [Bruce, Trond, I fear it may be hard for me to continue chasing this NFS
>  lockd crash as long as ext4 on 3.6.3 is hosing my filesystems like
>  this. Apologies.]



> The only unusual thing about the filesystems on this machine are that
> they have hardware RAID-5 (using the Areca driver), so I'm mounting with
> 'nobarrier': the full set of options for all my ext4 filesystems are:
> 
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,
> usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota

Out of curiosity, when I test log replay with the journal_checksum option, I
almost always get something like:

[  999.917805] JBD2: journal transaction 84121 on dm-1-8 is corrupt.
[  999.923904] EXT4-fs (dm-1): error loading journal

after a simulated crash & log replay.

Do you see anything like that in your logs?



Thanks,
-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Eric Sandeen

On 10/27/12 7:45 AM, Nix wrote:
> [nfs people purged from Cc]
> 
> On 27 Oct 2012, Theodore Ts'o verbalised:
> 
>> Huh?  It's not turned on by default.  If you mount with no mount
>> options, journal checksums are *not* turned on.
> 
> ?! it's turned on for me, and though I use weird mount options I don't
> use that one:

journal_async_commit implies journal_checksum:

{Opt_journal_async_commit, (EXT4_MOUNT_JOURNAL_ASYNC_COMMIT |
EXT4_MOUNT_JOURNAL_CHECKSUM), MOPT_SET},

journal_checksum seems to have broken, at least for me, between 3.3 & 3.4, I 
think I've narrowed down the commit but not sure yet what the flaw is, will 
investigate & report back later.  Busy Saturday.

So anyway, turning on journal_async_commit (notionally unsafe) enables 
journal_checksum (apparently broken).

-Eric

> /dev/main/var   /varext4
> defaults,nobarrier,usrquota,grpquota,nosuid,nodev,relatime,journal_async_commit,commit=30,user_xattr,acl
>  1  2
> Default mount options:(none)
> /dev/mapper/main-var /var ext4 
> rw,nosuid,nodev,relatime,journal_checksum,journal_async_commit,nobarrier,quota,usrquota,grpquota,commit=30,stripe=16,data=ordered,usrquota,grpquota
>  0 0
> 
> ...
> 
> Ah! it's turned on by journal_async_commit. OK, that alone argues
> against use of journal_async_commit, tested or not, and I'd not have
> turned it on if I'd noticed that.
> 
> (So, the combinations I'll be trying for effect on this bug are:
> 
>  journal_async_commit (as now)
>  journal_checksum
>  none
> 
> Technically to investigate all possibilities we should try
> journal_async_commit,no_journal_checksum, but this seems so unlikely to
> have ever been tested by anyone that it's not worth looking into...)
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Eric Sandeen

On 10/27/12 1:47 PM, Nix wrote:
> On 27 Oct 2012, Theodore Ts'o said:
> 
>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>>> against use of journal_async_commit, tested or not, and I'd not have
>>> turned it on if I'd noticed that.
>>>
>>> (So, the combinations I'll be trying for effect on this bug are:
>>>
>>>  journal_async_commit (as now)
>>>  journal_checksum
>>>  none
>>
>> Can you also check and see whether the presence or absence of
>> "nobarrier" makes a difference?
> 
> Done. (Also checked the effect of your patches posted earlier this week:
> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
> I was carrying out, umount -l'ing /var as the very last thing I did
> before /sbin/reboot -f.)
> 
> nobarrier makes a difference that I, at least, did not expect:
> 
> [no options]No corruption
> 
> nobarrier   No corruption
> 
>   journal_checksum  Corruption
> Corrupted transaction, journal aborted
> 
> nobarrier,journal_checksum  Corruption
> Corrupted transaction, journal aborted
> 
>   journal_async_commit  Corruption
> Corrupted transaction, journal aborted
> 
> nobarrier,journal_async_commit  Corruption
> No corrupted transaction or aborted journal

That's what we needed.  Woulda been great a few days ago ;)

In my testing journal_checksum is broken, and my bisection seems to
implicate

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31
Author: Theodore Ts'o 
Date:   Mon Feb 6 20:12:03 2012 -0500

ext4: fold ext4_claim_inode into ext4_new_inode

as the culprit.  I haven't had time to look into why, yet.

-Eric

> I didn't expect the last case at all, and it adequately explains why you
> are mostly seeing corrupted journal messages in your tests but I was
> not. It also explains why when I saw this for the first time I was able
> to mount the resulting corrupted filesystem read-write and corrupt it
> further before I noticed that anything was wrong.
> 
> It is also clear that journal_checksum and all that relies on it is
> worse than useless right now, as Eric reported while I was testing this.
> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
> kernels, if CONFIG_BROKEN existed anymore, which it doesn't.
> 
> It's a shame journal_async_commit depends on a broken feature: it might
> be notionally unsafe but on some of my systems (without nobarrier or
> flashy caching controllers) it was associated with a noticeable speedup
> of metadata-heavy workloads -- though that was way back in 2009...
> however, "safety first" definitely applies in this case.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-27 Thread Eric Sandeen

On 10/27/12 4:19 PM, Eric Sandeen wrote:
> On 10/27/12 1:47 PM, Nix wrote:
>> On 27 Oct 2012, Theodore Ts'o said:
>>
>>> On Sat, Oct 27, 2012 at 01:45:25PM +0100, Nix wrote:
>>>> Ah! it's turned on by journal_async_commit. OK, that alone argues
>>>> against use of journal_async_commit, tested or not, and I'd not have
>>>> turned it on if I'd noticed that.
>>>>
>>>> (So, the combinations I'll be trying for effect on this bug are:
>>>>
>>>>  journal_async_commit (as now)
>>>>  journal_checksum
>>>>  none
>>>
>>> Can you also check and see whether the presence or absence of
>>> "nobarrier" makes a difference?
>>
>> Done. (Also checked the effect of your patches posted earlier this week:
>> no effect, I'm afraid, certainly not under the fails-even-on-3.6.1 test
>> I was carrying out, umount -l'ing /var as the very last thing I did
>> before /sbin/reboot -f.)
>>
>> nobarrier makes a difference that I, at least, did not expect:
>>
>> [no options]No corruption
>>
>> nobarrier   No corruption
>>
>>   journal_checksum  Corruption
>> Corrupted transaction, journal aborted
>> 
>> nobarrier,journal_checksum  Corruption
>> Corrupted transaction, journal aborted
>>
>>   journal_async_commit  Corruption
>> Corrupted transaction, journal aborted
>>
>> nobarrier,journal_async_commit  Corruption
>> No corrupted transaction or aborted journal
> 
> That's what we needed.  Woulda been great a few days ago ;)
> 
> In my testing journal_checksum is broken, and my bisection seems to
> implicate
> 
> commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31
> Author: Theodore Ts'o 
> Date:   Mon Feb 6 20:12:03 2012 -0500
> 
> ext4: fold ext4_claim_inode into ext4_new_inode
> 
> as the culprit.  I haven't had time to look into why, yet.

It looks like the inode_bitmap_bh is being modified outside a transaction:

ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);

It needs a get_write_access / handle_dirty_metadata pair around it.

A hacky patch like this seems to work but it was done 5mins before I have
to be out the door to dinner so it probably needs cleanup or at least
scrutiny.

[PATCH] ext4: fix unjournaled inode bitmap modification

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
such that the inode bitmap was being modified outside a transaction,
which could lead to corruption, and was discovered when journal_checksum
found a bad checksum in the journal.

Signed-off-by: Eric Sandeen 
---

--- ialloc.c.reverted2  2012-10-27 17:31:20.351537073 -0500
+++ ialloc.c2012-10-27 17:40:18.643553576 -0500
@@ -669,6 +669,10 @@
inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
if (!inode_bitmap_bh)
goto fail;
+   BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+   err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+   if (err)
+   goto fail;
 
 repeat_in_this_group:
ino = ext4_find_next_zero_bit((unsigned long *)
@@ -690,6 +694,10 @@
ino++;  /* the inode bitmap is zero-based */
if (!ret2)
goto got; /* we grabbed the inode! */
+   BUFFER_TRACE(inode_bitmap_bh, "call 
ext4_handle_dirty_metadata");
+   err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+   if (err)
+   goto fail;
if (ino < EXT4_INODES_PER_GROUP(sb))
goto repeat_in_this_group;
}




> -Eric
> 
>> I didn't expect the last case at all, and it adequately explains why you
>> are mostly seeing corrupted journal messages in your tests but I was
>> not. It also explains why when I saw this for the first time I was able
>> to mount the resulting corrupted filesystem read-write and corrupt it
>> further before I noticed that anything was wrong.
>>
>> It is also clear that journal_checksum and all that relies on it is
>> worse than useless right now, as Eric reported while I was testing this.
>> It should probably be marked CONFIG_BROKEN in future 3.[346].* stable
>> kernels, if CONFIG_BROKEN existed anymore, which it doesn't.
>>
>> It's a shame journal_async_commit depends on a broken fe

[PATCH] ext4: fix unjournaled inode bitmap modification

2012-10-27 Thread Eric Sandeen

commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed
ext4_new_inode() such that the inode bitmap was being modified
outside a transaction, which could lead to corruption, and was
discovered when journal_checksum found a bad checksum in the
journal during log replay.

Nix ran into this when using the journal_async_commit mount
option, which enables journal checksumming.  The ensuing
journal replay failures due to the bad checksums led to
filesystem corruption reported as the now infamous
"Apparent serious progressive ext4 data corruption bug"

I've tested this by mounting with journal_checksum and
running fsstress then dropping power; I've also tested by
hacking DM to create snapshots w/o first quiescing, which
allows me to test journal replay repeatedly w/o actually
power-cycling the box.  Without the patch I hit a journal
checksum error every time.  With this fix it survives
many iterations.

Signed-off-by: Eric Sandeen 
cc: Nix 
---

A little more going on here to try to properly handle error
cases & moving to the next group; despite
ext4_handle_release_buffer being a no-op, I've tried
to sprinkle it in at the right places.  Double checking
on review is probably a fine idea ;)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4facdd2..1d18fba 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -706,10 +706,17 @@ got_group:
continue;
}
 
-   brelse(inode_bitmap_bh);
+   if (inode_bitmap_bh) {
+   ext4_handle_release_buffer(handle, inode_bitmap_bh);
+   brelse(inode_bitmap_bh);
+   }
inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
if (!inode_bitmap_bh)
goto fail;
+   BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
+   err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
+   if (err)
+   goto fail;
 
 repeat_in_this_group:
ino = ext4_find_next_zero_bit((unsigned long *)
@@ -734,10 +741,16 @@ repeat_in_this_group:
if (ino < EXT4_INODES_PER_GROUP(sb))
goto repeat_in_this_group;
}
+   ext4_handle_release_buffer(handle, inode_bitmap_bh);
err = -ENOSPC;
goto out;
 
 got:
+   BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+   err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
+   if (err)
+   goto fail;
+
/* We may have to initialize the block bitmap if it isn't already */
if (ext4_has_group_desc_csum(sb) &&
gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
@@ -771,11 +784,6 @@ got:
goto fail;
}
 
-   BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
-   err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
-   if (err)
-   goto fail;
-
BUFFER_TRACE(group_desc_bh, "get_write_access");
err = ext4_journal_get_write_access(handle, group_desc_bh);
if (err)
@@ -823,11 +831,6 @@ got:
}
ext4_unlock_group(sb, group);
 
-   BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
-   err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
-   if (err)
-   goto fail;
-
BUFFER_TRACE(group_desc_bh, "call ext4_handle_dirty_metadata");
err = ext4_handle_dirty_metadata(handle, NULL, group_desc_bh);
if (err)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-28 Thread Eric Sandeen

On 10/28/12 8:00 PM, Theodore Ts'o wrote:
> On Sat, Oct 27, 2012 at 05:42:07PM -0500, Eric Sandeen wrote:
>>
>> It looks like the inode_bitmap_bh is being modified outside a transaction:
>>
>> ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>>
>> It needs a get_write_access / handle_dirty_metadata pair around it.
> 
> Oops.   Nice catch!!
> 
> The patch isn't quite right, though.  

Yeah, I knew it wasn't ;)  I did resend 
[PATCH] ext4: fix unjournaled inode bitmap modification
which is a bit more involved.

> We only want to call
> ext4_journal_get_write_access() when we know that there is an available
> bit in the bitmap.  (We could still lose the race, but in that case
> the winner of the race probably grabbed the bitmap block first.)
> 
> Also, we only need to call ext4_handle_dirty_metadata() if we
> successfully grab the bit in the bitmap.
> 
> So I suggest this patch instead:

That'll get_write_access on the same buffer over and over, I suppose
it's ok, but the patch I sent tries to minimize that, and call
ext4_handle_release_buffer if we're not going to use it (which is
a no-op today anyway and not normally used I guess...)

If ext4_handle_release_buffer() is dead code now, and repeated calls
via repeat_in_this_group: are no big deal, then your version looks fine.

-Eric

> commit 087eda81f1ac6a6a0394f781b44f1d555d8f64c6
> Author: Eric Sandeen 
> Date:   Sun Oct 28 20:59:57 2012 -0400
> 
> ext4: fix unjournaled inode bitmap modification
> 
> commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 modified this function
> such that the inode bitmap was being modified outside a transaction,
> which could lead to corruption, and was discovered when journal_checksum
> found a bad checksum in the journal.
> 
> Reported-by: Nix 
> Signed-off-by: Eric Sandeen 
> Signed-off-by: "Theodore Ts'o" 
> Cc: sta...@vger.kernel.org
> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 4facdd2..575afac 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -725,6 +725,10 @@ repeat_in_this_group:
>  "inode=%lu", ino + 1);
>   continue;
>   }
> + BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
> + err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
> + if (err)
> + goto fail;
>   ext4_lock_group(sb, group);
>   ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>   ext4_unlock_group(sb, group);
> @@ -738,6 +742,11 @@ repeat_in_this_group:
>   goto out;
>  
>  got:
> + BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> + err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
> + if (err)
> + goto fail;
> +
>   /* We may have to initialize the block bitmap if it isn't already */
>   if (ext4_has_group_desc_csum(sb) &&
>   gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Apparent serious progressive ext4 data corruption bug in 3.6.3 (and other stable branches?)

2012-10-28 Thread Eric Sandeen

On 10/28/12 9:34 PM, Theodore Ts'o wrote:
> On Sun, Oct 28, 2012 at 09:24:19PM -0500, Eric Sandeen wrote:
>> Yeah, I knew it wasn't ;)  I did resend 
>> [PATCH] ext4: fix unjournaled inode bitmap modification
>> which is a bit more involved.
> 
> Yeah, sorry, I didn't see your updated patch at first, since this mail
> thread is one complicated tangle.  :-(
> 
>> That'll get_write_access on the same buffer over and over, I suppose
>> it's ok, but the patch I sent tries to minimize that, and call
>> ext4_handle_release_buffer if we're not going to use it (which is
>> a no-op today anyway and not normally used I guess...)
> 
> Well, it's really rare that we will go through that loop more than
> once; it only happens if we have multiple processes race against each
> other trying to grab the same inode.
> 
>> If ext4_handle_release_buffer() is dead code now, and repeated calls
>> via repeat_in_this_group: are no big deal, then your version looks fine.
> 
> Yeah, I think it's pretty much dead code.  At least, I can't think of
> a good reason why we would want to actually try to handle
> ext4_handle_release_buffer() to claw back the transaciton credit.  And
> if we do, we'll have to do a sweep through the entire ext4 codebase
> anyway.

Yeah, seems that way.

Then your simpler version is probably the way to go.

Thanks,
-Eric

>   - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH -v3] ext4: fix unjournaled inode bitmap modification

2012-10-28 Thread Eric Sandeen

On 10/28/12 9:30 PM, Theodore Ts'o wrote:
> On Sat, Oct 27, 2012 at 11:23:57PM -0500, Eric Sandeen wrote:
>> A little more going on here to try to properly handle error
>> cases & moving to the next group; despite
>> ext4_handle_release_buffer being a no-op, I've tried
>> to sprinkle it in at the right places.  Double checking
>> on review is probably a fine idea ;)
> 
> Sorry, I didn't see your newer version of your patch.  I'm not
> convinced it's worth it to try to get the calls to
> ext4_handle_release_buffer() right.  There are plenty of other places
> where we're not calling ext4_handle_release_buffer(), and I'm not
> convinced it would ever be useful to make it be something other than a
> no-op.  

Fair enough, I went a little overboard.

> In order to make it be useful, we'd have to enforce a rule
> that every single get_write_access() was matched with either a
> handle_dirty_metadata() or a handle_release_buffer().  That would be
> tricky; worse, we'd have to keep track of a refcount on each bh, which
> would cost us on the scalability front.  The main benefit would be
> that might be able to be able to reclaim bh's where we called
> get_write_access() and then changed our mind, but that's relatively
> rare, and I think it's easier to simply be more careful about calling
> get_write_acceess() until we're sure we're going to need write access.
> 
> Hence in my version of the patch, I've waited until right before the
> call to ext4_lock_group() before calling get_write_access().  Note
> that it's safe to call get_write_access() on a bh twice; the second
> time the jbd2 layer will notice that the bh is already a part of the
> transaction.

Yeah, I guess that's the norm.

So on the one hand you delay calling it until we're sure we need
it; OTOH it's no big deal if it does get called twice :)

> Also, leaving out the calls to ext4_handle_release_buffer() makes the
> patch easier to understand and reason about.

Fair enough.

> What do you think of this version?

Looks fine, tests fine.  Ship it ;)

-Eric

>   - Ted
> 
> commit 67d725143e9e7ea458a0c1c4a6625657c3dc7ba2
> Author: Eric Sandeen 
> Date:   Sun Oct 28 22:24:57 2012 -0400
> 
> ext4: fix unjournaled inode bitmap modification
> 
> commit 119c0d4460b001e44b41dcf73dc6ee794b98bd31 changed
> ext4_new_inode() such that the inode bitmap was being modified
> outside a transaction, which could lead to corruption, and was
> discovered when journal_checksum found a bad checksum in the
> journal during log replay.
> 
> Nix ran into this when using the journal_async_commit mount
> option, which enables journal checksumming.  The ensuing
> journal replay failures due to the bad checksums led to
> filesystem corruption reported as the now infamous
> "Apparent serious progressive ext4 data corruption bug"
> 
> [ Changed by tytso to only call ext4_journal_get_write_access() only
>   when we're fairly certain that we're going to allocate the inode. ]
> 
> I've tested this by mounting with journal_checksum and
> running fsstress then dropping power; I've also tested by
> hacking DM to create snapshots w/o first quiescing, which
> allows me to test journal replay repeatedly w/o actually
> power-cycling the box.  Without the patch I hit a journal
> checksum error every time.  With this fix it survives
> many iterations.
> 
> Reported-by: Nix 
> Signed-off-by: Eric Sandeen 
> Signed-off-by: "Theodore Ts'o" 
> Cc: sta...@vger.kernel.org
> 
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 4facdd2..3a100e7 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -725,6 +725,10 @@ repeat_in_this_group:
>  "inode=%lu", ino + 1);
>   continue;
>   }
> + BUFFER_TRACE(inode_bitmap_bh, "get_write_access");
> + err = ext4_journal_get_write_access(handle, inode_bitmap_bh);
> + if (err)
> + goto fail;
>   ext4_lock_group(sb, group);
>   ret2 = ext4_test_and_set_bit(ino, inode_bitmap_bh->b_data);
>   ext4_unlock_group(sb, group);
> @@ -738,6 +742,11 @@ repeat_in_this_group:
>   goto out;
>  
>  got:
> + BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> + err = ext4_handle_dirty_metadata(handle, NULL, inode_bitmap_bh);
> + if (err)
> + goto fail;
&g

Re: ext4: fix memory leak in xattr code.

2013-09-20 Thread Eric Sandeen

On 9/20/13 9:29 AM, Dave Jones wrote:
> If we take the 2nd retry path in ext4_expand_extra_isize_ea, we potentionally
> return from the function without having freed these allocations.
> If we don't do the return, we over-write the previous allocation pointers, 
> so we leak either way.
> 
> Spotted with Coverity.
> 
> Signed-off-by: Dave Jones 

Looks right to me.

Reviewed-by: Eric Sandeen 

Thanks!

-Eric

> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index c081e34..f3a6220 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1350,6 +1350,8 @@ retry:
>   s_min_extra_isize) {
>   tried_min_extra_isize++;
>   new_extra_isize = s_min_extra_isize;
> + kfree(is);
> + kfree(bs);
>   goto retry;
>   }
>   error = -1;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-10 Thread Eric Sandeen

On 9/10/13 4:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes.  I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored.  Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
> 
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux.  If
> it turns out that SELinux's use of xattrs have gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux...  even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.

On my RHEL6 system,

# find / -xdev -exec getfattr --only-values -m security.* {} 2>/dev/null \; | 
wc -c
11082179

bytes of names for:

# df -i /
FilesystemInodes   IUsed   IFree IUse% Mounted on
/dev/mapper/vg_bp05-lv_root
 3276800  280785 29960159% /

280785 inodes used,

so:
11082179/280785 = ~39.5 bytes per value on average, plus:

# echo -n "security.selinux" | wc -c
16

16 bytes for the name is only about 55-56 bytes per selinux attr on average.

So nope, not "especially piggy" on average.

Another way to do it is this; list all possible file contexts, and make
a histogram of sizes:

# for CONTEXT in `semanage fcontext -l | awk '{print $NF}' `; do echo -n 
$CONTEXT | wc -c; done | sort -n | uniq -c
  1 7
 33 8
356 26
 14 27
 14 28
 37 29
 75 30
237 31
295 32
425 33
324 34
445 35
548 36
229 37
193 38
181 39
259 40
 81 41
108 42
 96 43
 55 44
 55 45
 16 46
 41 47
 23 48
 28 49
 36 50
 10 51
 10 52
  5 54
  2 57

so a 57 byte value is max, but there aren't many of the larger values.

Above doesn't tell us the prevalence of various contexts on the actual system,
but they are all under 100 bytes in any case.

-Eric

>- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Eric Sandeen

On 9/11/13 6:30 AM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>
>> Above doesn't tell us the prevalence of various contexts on the actual 
>> system,
>> but they are all under 100 bytes in any case.
> 
> OK, so in other words, on your system i_file_acl and i_file_acl_high
> (which is where we store the block # for the external xattr block),
> should always be zero for all inodes, yes?

Oh, hum - ok, so that would have been the better thing to check (or at
least an additional thing).

# find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c

Finds quite a lot that claim to have external blocks, but it seems broken:

# filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
Filesystem type is: ef53
File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 
4096)
 ext logical physical expected length flags
   0   0 32212996252 100 not_aligned,inline
/var/lib/yum/repos/x86_64/6Server/epel: 1 extent found

So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

And yet it's a small attr:

# getfattr -m - -d /var/lib/yum/repos/x86_64/6Server/epel
getfattr: Removing leading '/' from absolute path names
# file: var/lib/yum/repos/x86_64/6Server/epel
security.selinux="unconfined_u:object_r:rpm_var_lib_t:s0"

And debugfs thinks it's stored within the inode:

debugfs:  stat var/lib/yum/repos/x86_64/6Server/epel
Inode: 1968466   Type: directoryMode:  0755   Flags: 0x8
Generation: 2728788146Version: 0x:0001
User: 0   Group: 0   Size: 4096
File ACL: 0Directory ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0Number: 0Size: 0
 ctime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
 atime: 0x522fc8fa:62eb2d90 -- Tue Sep 10 20:35:54 2013
 mtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
crtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  selinux = "unconfined_u:object_r:rpm_var_lib_t:s0\000" (39)
EXTENTS:
(0): 7873422

sooo seems like filefrag -x is buggy and can't be trusted.  :(

> Thavatchai, can you check to see whether or not this is true on your
> system?  You can use debugfs on the file system, and then use the
> "stat" command to sample various inodes in your system.  Or I can make
> a version of e2fsck which counts the number of inodes with external
> xattr blocks --- it sounds like this is something we should do anyway.
> 
> One difference might be that Eric ran this test on RHEL 6, and
> Thavatchai is using an upstream kernel, so maybe this is bloat has
> been added recently?

It's a userspace policy so the kernel shouldn't matter... "bloat" would
only come from new, longer contexts (outside the kernel).

> The reason why I'm pushing here is that mbcache shouldn't be showing
> up in the profiles at all if there is no external xattr block.  And so
> if newer versions of SELinux (like Adnreas, I've been burned by
> SELinux too many times in the past, so I don't use SELinux on any of
> my systems) is somehow causing mbcache to get triggered, we should
> figure this out and understand what's really going on.

selinux, from an fs allocation behavior perspective, is simply setxattr.

And as I showed earlier, name+value for all of the attrs set by at least RHEL6
selinux policy are well under 100 bytes.

(Add in a bunch of other non-selinux xattrs, and you'll go out of a 256b inode,
sure, but selinux on its own should not).

> Sigh, I suppose I should figure out how to create a minimal KVM setup
> which uses SELinux just so I can see what the heck is going on

http://fedoraproject.org/en/get-fedora ;)

-Eric

>- Ted
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Eric Sandeen

On 9/11/13 11:49 AM, Eric Sandeen wrote:
> On 9/11/13 6:30 AM, Theodore Ts'o wrote:
>> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>>
>>> Above doesn't tell us the prevalence of various contexts on the actual 
>>> system,
>>> but they are all under 100 bytes in any case.
>>
>> OK, so in other words, on your system i_file_acl and i_file_acl_high
>> (which is where we store the block # for the external xattr block),
>> should always be zero for all inodes, yes?
> 
> Oh, hum - ok, so that would have been the better thing to check (or at
> least an additional thing).
> 
> # find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq 
> -c
> 
> Finds quite a lot that claim to have external blocks, but it seems broken:
> 
> # filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
> Filesystem type is: ef53
> File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, 
> blocksize 4096)
>  ext logical physical expected length flags
>0   0 32212996252 100 not_aligned,inline
> /var/lib/yum/repos/x86_64/6Server/epel: 1 extent found
> 
> So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

Oh, this is the special-but-not-documented "print inline extents in bytes
not blocks"  :(

I'll send a patch to ignore inline extents on fiemap calls to make this
easier, but in the meantime, neither my RHEL6 root nor my F17 root have
any out-of-inode selinux xattrs on 256-byte-inode filesystems.

So selinux alone should not be exercising mbcache much, if at all, w/ 256 byte
inodes.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Eric Sandeen

On 9/11/13 3:32 PM, David Lang wrote:
> On Wed, 11 Sep 2013, Eric Sandeen wrote:
> 
>>> The reason why I'm pushing here is that mbcache shouldn't be showing
>>> up in the profiles at all if there is no external xattr block.  And so
>>> if newer versions of SELinux (like Adnreas, I've been burned by
>>> SELinux too many times in the past, so I don't use SELinux on any of
>>> my systems) is somehow causing mbcache to get triggered, we should
>>> figure this out and understand what's really going on.
>>
>> selinux, from an fs allocation behavior perspective, is simply setxattr.
> 
> what you are missing is that Ted is saying that unless you are using xattrs, 
> the mbcache should not show up at all.
> 
> The fact that you are using SElinux, and SELinux sets the xattrs is
> what makes this show up on your system, but other people who don't
> use SELinux (and so don't have any xattrs set) don't see the same
> bottleneck.

Sure, I understand that quite well.  But Ted was also saying that perhaps 
selinux had "gotten piggy" and was causing this.  I've showed that it hasn't.

This matters because unless the selinux xattrs go out of the inode into their 
own block, mbcache should STILL not come into it at all.

And for attrs < 100 bytes, they stay in the inode.  And on inspection, my 
SELinux boxes have no external attr blocks allocated.

mbcache only handles extended attributes that live in separately-allocated 
blocks, and selinux does not cause that on its own.

Soo... selinux by itself should not be triggering any mbcache codepaths.

Ted suggested that "selinux had gotten piggy" so I checked, and showed that it 
hadn't.  That's all.

So at this point I think it's up to Mak to figure out why on his system, aim7 
is triggering mbcache codepaths.

-Eric

> David Lang

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Eric Sandeen

On 9/11/13 3:36 PM, Thavatchai Makphaibulchoke wrote:

> I seem to be seeing the same thing as Eric is seeing.

...
> For both filesystems, the security xattr are about 32.17 and 34.87 bytes 
> respectively.
...

Can you triple-check the inode size on your fs, for good measure?

dumpe2fs -h /dev/whatever | grep "Inode size"

> I also see a similar problem with filefrag.

turns out it's not a problem, it's an undocumented & surprising "feature."  :(

/* For inline data all offsets should be in bytes, not blocks */
if (fm_extent->fe_flags & FIEMAP_EXTENT_DATA_INLINE)
blk_shift = 0;

because ... ?  (the commit which added it didn't mention anything about it).

But I guess it does mean for at least those files, the xattr data is actually 
inline.

> At this point, I'm not sure why we get into the mbcache path when
> SELinux is enabled. As mentioned in one my earlier replies to
> Andreas, I did see actual calls into ext4_xattr_cache.
> 
> There seems to be one difference between 3.11 kernel and 2.6 kernel
> in set_inode_init_security(). There is an additional attempt to
> initialize evm xattr. But I do not seem to be seeing any evm xattr in
> any file.
> 
> I will continue to try to find out how we get into the mbcache path.
> Please let me know if anyone has any suggestion.

Sorry we got so far off the thread of the original patches.

But it seems like a mystery worth solving.

Perhaps in ext4_xattr_set_handle() you can instrument the case(s) where it
gets into ext4_xattr_block_set().

Or most simply, just printk inode number in ext4_xattr_block_set() so
you can look at them later via debugfs.

And in here,

} else {
error = ext4_xattr_ibody_set(handle, inode, &i, &is);
if (!error && !bs.s.not_found) {
i.value = NULL;
error = ext4_xattr_block_set(handle, inode, &i, &bs);
} else if (error == -ENOSPC) {
if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
error = ext4_xattr_block_find(inode, &i, &bs);
if (error)
goto cleanup;
}
error = ext4_xattr_block_set(handle, inode, &i, &bs);

maybe print out in the ext4_xattr_ibody_set error case what the size of
of the xattr is, and probably inode number again for good measure, to get
an idea of what's causing it to fail to land in the inode?

-Eric

> 
> Thanks,
> Mak.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Eric Sandeen

On 12/7/12 3:14 PM, Theodore Ts'o wrote:
> On Fri, Dec 07, 2012 at 02:30:19PM -0500, Steven Rostedt wrote:
>> How is this similar? By adding this bit, we removed incentive from a
>> group of developers that have the means to fix the real issue at hand
>> (the performance problem with ext4). Thus, it means that they have a work
>> around that's good enough for them, but the rest of us suffer.
> 
> That assumes that there **is** a way to claw back the performance
> loss, and Chris Mason has demonstrated the performance hit exists with
> xfs as well (950 MB/s vs. 400 MB/s; that's more than a factor of two).

But he has not demonstrated that it can't be improved in XFS; I don't
think that anyone in the XFS community has even begun to look at
whether it can be improved ...

> Sometimes, you have to make the engineering tradeoffs.  That's why
> we're engineers, for goodness sakes.  Sometimes, it's just not
> possible to square the circle.

... so this strikes me as a bit premature.

> I don't believe that the technique of forcing people who need that
> performance to suffer in order to induce them to try to engineer a
> solution which may or may not exist is really the best or fairest way
> to go about things.

Have we exhausted efforts to improve ext4 as well?  Have we even identified
the performance bottlenecks yet via profiling?

What this seems to be is behavior nobody has asked for (expose
other users' stale data) in the name of solving a performance problem
(fine-grained conversion of unwritten extents comes at a non-negligible cost).

-Eric

>   - Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Eric Sandeen

On 12/7/12 3:57 PM, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 02:49:04PM -0700, Ric Wheeler wrote:
>> On 12/07/2012 04:43 PM, Chris Mason wrote:
>>> On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
 On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
> Persistent trim is what I had in mind, but there are other ideas that do
> imply a change in behavior as well.  Can we safely assume this feature
> won't matter on spinning media?  New features like persistent
> trim do make it much easier to solve securely, and using a bit for it
> means we can toss back an error to the app if the underlying storage
> isn't safe.
 We originally implemented no hide stale for spinning media.  Some
 folks have claimed that for XFS their superior technology means that
 no hide stale doesn't buy them anything for HDD's.  I'm not entirely
 sure I buy this, since if you need to update metadata, it means at
 least one extra seek for each random write into 4k preallocated space,
 and 7200 RPM disks only have about 200 seeks per second.
>>> True, 7200 RPM disks are slow, but even allowing them to expose stale
>>> data just makes them a little less slow.
>>>
>>> I know it's against the rules to pretend that disks don't matter.  But
>>> really, once you're doing random IO into a spindle you've given up on
>>> performance anyway.
>>>
>>> -chris
>>
>> That's right.
>>
>> And equally true, once you have moved the disk heads to that track, you can 
>> write a lot as cheaply as a little (i.e., do 1MB instead of 4KB). That will 
>> also 
>> avoid fragmentation of the extents.
> 
> When you do a 4K write, you have to remember that you've written just
> those 4K.  When you do a 1MB write, you have to remember that you've
> written just that 1MB.  It's the same operation, except with the 1MB
> you've also had to setup all the bios and send down the zeros, and do
> the proper locking to make sure you're not sending zeros down over
> some concurrent IO.
> 
> The 1MB setup is actually more work, but it does greatly reduce the
> amount of time the workload needs to run before it goes into a steady
> state.  For smaller files it may work well, but for larger ones I don't
> think it will be enough.

Ext4 already does this, actually, I think - see s_extent_max_zeroout_kb
and how it's used.

/* If extent is less than s_max_zeroout_kb, zeroout directly */

It's not a tunable (*gasp* ;)) but it's currently set to "32" as in
32 kb.  Would be fun to bump that up and see how your test goes.

-Eric

> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-07 Thread Eric Sandeen

On 12/7/12 3:57 PM, Chris Mason wrote:
> On Fri, Dec 07, 2012 at 02:49:04PM -0700, Ric Wheeler wrote:
>> On 12/07/2012 04:43 PM, Chris Mason wrote:
>>> On Fri, Dec 07, 2012 at 02:27:43PM -0700, Theodore Ts'o wrote:
 On Fri, Dec 07, 2012 at 04:09:32PM -0500, Chris Mason wrote:
> Persistent trim is what I had in mind, but there are other ideas that do
> imply a change in behavior as well.  Can we safely assume this feature
> won't matter on spinning media?  New features like persistent
> trim do make it much easier to solve securely, and using a bit for it
> means we can toss back an error to the app if the underlying storage
> isn't safe.
 We originally implemented no hide stale for spinning media.  Some
 folks have claimed that for XFS their superior technology means that
 no hide stale doesn't buy them anything for HDD's.  I'm not entirely
 sure I buy this, since if you need to update metadata, it means at
 least one extra seek for each random write into 4k preallocated space,
 and 7200 RPM disks only have about 200 seeks per second.
>>> True, 7200 RPM disks are slow, but even allowing them to expose stale
>>> data just makes them a little less slow.
>>>
>>> I know it's against the rules to pretend that disks don't matter.  But
>>> really, once you're doing random IO into a spindle you've given up on
>>> performance anyway.
>>>
>>> -chris
>>
>> That's right.
>>
>> And equally true, once you have moved the disk heads to that track, you can 
>> write a lot as cheaply as a little (i.e., do 1MB instead of 4KB). That will 
>> also 
>> avoid fragmentation of the extents.
> 
> When you do a 4K write, you have to remember that you've written just
> those 4K.  When you do a 1MB write, you have to remember that you've
> written just that 1MB.  It's the same operation, except with the 1MB
> you've also had to setup all the bios and send down the zeros, and do
> the proper locking to make sure you're not sending zeros down over
> some concurrent IO.
> 
> The 1MB setup is actually more work, but it does greatly reduce the
> amount of time the workload needs to run before it goes into a steady
> state.  For smaller files it may work well, but for larger ones I don't
> think it will be enough.

Ext4 already does this, actually, I think - see s_extent_max_zeroout_kb
and how it's used.

/* If extent is less than s_max_zeroout_kb, zeroout directly */

It's not a tunable (*gasp* ;)) but it's currently set to "32" as in
32 kb.  Would be fun to bump that up and see how your test goes.

-Eric

> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] xfs: return -EINVAL instead of -EUCLEAN when mounting non-xfs

2012-12-29 Thread Eric Sandeen

On 12/29/12 5:16 PM, Sergei Trofimovich wrote:
> It fixes boot panic when trying to boot from btrfs filesystem.
> kernel tries to mount as xfs and gets fatal -EUCLEAN:
> 
> [0.17] VFS: Cannot open root device "ubda" or unknown-block(98,0): 
> error -117
> [0.17] Please append a correct "root=" boot option; here are the 
> available partitions:
> [0.17] 6200 1048576 ubda  driver: uml-blkdev
> [0.17] Kernel panic - not syncing: VFS: Unable to mount root fs on 
> unknown-block(98,0)
> 
> init/do_mounts.c expects only -EINVAL as 'retry another' option.
> Fixes regression introduced by commit 98021821a502db347bd9c7671b6e8ce07ea6

yeah, that should work; great minds think alike ;)  Our patches
crossed in the ether I guess.

XFS uses EWRONGFS as an alias for EINVAL internally in these cases,
so maybe we should stick with that for consistency, *shrug*

-Eric

> Signed-off-by: Sergei Trofimovich 
> CC: Ben Myers 
> CC: Alex Elder 
> CC: x...@oss.sgi.com
> CC: linux-kernel@vger.kernel.org
> CC: Dave Chinner 
> CC: Phil White 
> ---
>  fs/xfs/xfs_mount.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index da50846..379cac1 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -641,41 +641,41 @@ xfs_sb_read_verify(
>  /*
>   * We may be probed for a filesystem match, so we may not want to emit
>   * messages when the superblock buffer is not actually an XFS superblock.
>   * If we find an XFS superblock, the run a normal, noisy mount because we are
>   * really going to mount it and want to know about errors.
>   */
>  static void
>  xfs_sb_quiet_read_verify(
>   struct xfs_buf  *bp)
>  {
>   struct xfs_sb   sb;
>  
>   xfs_sb_from_disk(&sb, XFS_BUF_TO_SBP(bp));
>  
>   if (sb.sb_magicnum == XFS_SB_MAGIC) {
>   /* XFS filesystem, verify noisily! */
>   xfs_sb_read_verify(bp);
>   return;
>   }
>   /* quietly fail */
> - xfs_buf_ioerror(bp, EFSCORRUPTED);
> + xfs_buf_ioerror(bp, EINVAL);
>  }
>  
>  static void
>  xfs_sb_write_verify(
>   struct xfs_buf  *bp)
>  {
>   xfs_sb_verify(bp);
>  }
>  
>  const struct xfs_buf_ops xfs_sb_buf_ops = {
>   .verify_read = xfs_sb_read_verify,
>   .verify_write = xfs_sb_write_verify,
>  };
>  
>  static const struct xfs_buf_ops xfs_sb_quiet_buf_ops = {
>   .verify_read = xfs_sb_quiet_read_verify,
>   .verify_write = xfs_sb_write_verify,
>  };
>  
>  /*
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 3.9.4 Oops running xfstests (WAS Re: 3.9.3: Oops running xfstests)

2013-05-30 Thread Eric Sandeen

On 5/30/13 10:03 PM, CAI Qian wrote:
> OK, so the minimal workload to trigger this I found so far was to
> run trinity, ltp and then xfstests. I have been able to easily
> reproduced on 3 servers so far, and I'll post full logs here for
> LKML and linux-mm as this may unrelated to XFS only. As far as
> I can tell from the previous testing results, this has never been
> reproduced back in 3.9 GA time. This seems also been reproduced
> on 3.10-rc3 mostly on s390x so far.
> CAI Qian
> 

Can you hit it w/o trinity?   I ask because trinity's stated
goal is to fuzz and corrupt, right - so it's quite possible
that blowing up later in xfs is a side effect?

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/3] ext4: introduce two new ioctls

2013-06-23 Thread Eric Sandeen

On Jun 23, 2013, at 12:01 PM, Andreas Dilger  wrote:

> On 2013-06-23, at 0:07, Namjae Jeon  wrote:
> 
>> From: Namjae Jeon 
>> 
>> This patch series introduces 2 new ioctls for ext4.
>> 
>> Truncate_block_range ioctl truncates blocks from source file.
> 
> How is this different from fallocate(FALLOC_FL_PUNCH_HOLE)?  That is already 
> in existing kernels, and portable across multiple filesystems. 
> 
Same question.  Punch hole should do this already...


>> Transfer_block_range ioctl transfers data blocks from source file
>> and append them at the end of destination file.
> 
> There is already a similar ioctl for defragmenting files. Is it possible to 
> use that, or does it have different semantics?
> 
>> Ioctl1:EXT4_IOC_TRUNCATE_BLOCK_RANGE:
>> This ioctl truncates a range of data blocks from file.
>> It is useful to remove easily and quickly the garbage data
>> at the middle of file.
>> 
>> e.g. we have a movie file and there is long advertisement in movie file.
>> user want to remove only advertisement range.
> 
> While this works in theory, there is very little chance that the movie data 
> will align exactly to filesystem block boundaries. 
> 
Or more importantly on compression codec boundaries.   Wouldn't this look like 
corruption at playback time?

Eric

> Cheers, Andreas 
> 
>> 1) Movie file (8GB), There is the adverisement of 500MB size.
>> 
>> | |   | |
>> |   a) range  | b) Advertisement  | c) range| 
>> | | (unneeded data)   | |
>> |_|___|_| 
>> 
>> 2) Currently if user want to remove portion b), the conventional way
>>   would be to copy a) and c) (7.5GB) to new file by reading data from
>>   original file and writing to new file, followed up by  delete original
>>   file and rename new file. It will take long time.
>>   When we measure time, it takes around 3 minutes.
>> 
>> 3) If we use EXT4_IOC_TRUNCATE_BLOCK_RANGE, we can have garbage data removed
>>   in less than a second. Also, no need to perform deletion and rename.
>> ___
>> | ||
>> |   a) range  | c) range   | 
>> | ||
>> |_|| 
>> 
>> 
>> #define EXT4_IOC_TRUNCATE_BLOCK_RANGE  _IOW('f', 18, struct truncate_range)
>> struct truncate_range {
>>  __u32 start_block;
>>  __u32 length;
>> };
>> 
>> example =>
>> Originally the file "abc" has the below extent tree:
>> debugfs:  ex abc
>> Level Entries   LogicalPhysical Length Flags
>> 0/ 0   1/  3 0 - 4  33615 -  33619  5 
>> 0/ 0   2/  3 5 - 9  33855 -  33859  5 
>> 0/ 0   3/  310 -14  69657 -  69661  5
>> 
>> ls -lh abc
>> -rw-r--r--1 root 0  60.0K Jan  1 00:01 abc
>> 
>> du -h abc
>> 60.0Kabc
>> 
>> e4_truncate_block_range abc 2 10
>> Return:
>> : Success
>> 
>> After executing truncate_block_range ioctl, the extent tree:
>> ex abc
>> Level Entries   LogicalPhysical Length Flags
>> 0/ 0   1/  2 0 - 1  33615 -  33616  2 
>> 0/ 0   2/  2 2 - 4  69659 -  69661  3 
>> 
>> ls -lh abc
>> -rw-r--r--1 root 0  20.0K Jan  1 00:08 abc
>> 
>> du -h abc
>> 20.0Kabc
>> 
>> This ioctl works in 2 parts:
>> 1) remove _only_ data blocks that resides within specified range.
>> If the entire range is a hole than nothing is removed.
>> 
>> 2) update file's logical block offsets ranging from block number
>> "start_block + length" to last logical block of file such that
>> lblk_number = lblk_number - length;
>> This is done by updating starting block of all the extents that
>> resides within the range.
>> 
>> If "start_block + length" is already equal to the last block of file
>> than no block is updated. This case is similar to convential truncate.
>> 
>> In the above example:
>> The data blocks ranging from [2 - 11] have been removed
>> and the logical offsets of the file beyond block number 12 till last block 
>> of file are updated by subtracting length from each of logical numbers.
>> This gives a contiguous logical space to the file.
>> Also, the logical size and disksize of the file are updated accordingly.
>> 
>> Ioctl2:EXT4_IOC_TRANSFER_BLOCK_RANGE:
>> This ioctl transfers a range of data blocks from source file and append
>> them at the end of the destination file.
>> This is not actual data transfer but only metadata is moved.
>> 
>> 
>> | |   | |
>> |   a) range  | b) range  | c) range|
>> | |   |

Re: [PATCH 01/12] Btrfs: Remove superfluous casts from u64 to unsigned long long

2013-08-20 Thread Eric Sandeen

On 8/20/13 6:20 AM, Geert Uytterhoeven wrote:
> u64 is "unsigned long long" on all architectures now, so there's no need to
> cast it when formatting it using the "ll" length modifier.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
>  fs/btrfs/backref.c  |   24 ++--
>  fs/btrfs/check-integrity.c  |  294 
> ++-
>  fs/btrfs/compression.c  |5 +-
>  fs/btrfs/ctree.c|6 +-
>  fs/btrfs/delayed-inode.c|   10 +-
>  fs/btrfs/dev-replace.c  |2 +-
>  fs/btrfs/disk-io.c  |   25 ++--
>  fs/btrfs/extent-tree.c  |   47 +++
>  fs/btrfs/extent_io.c|   30 ++---
>  fs/btrfs/file-item.c|4 +-
>  fs/btrfs/free-space-cache.c |6 +-
>  fs/btrfs/inode.c|   24 ++--
>  fs/btrfs/ioctl.c|   10 +-
>  fs/btrfs/ordered-data.c |   11 +-
>  fs/btrfs/print-tree.c   |   80 +---
>  fs/btrfs/qgroup.c   |   10 +-
>  fs/btrfs/relocation.c   |7 +-
>  fs/btrfs/root-tree.c|3 +-
>  fs/btrfs/scrub.c|   16 +--
>  fs/btrfs/super.c|   10 +-
>  fs/btrfs/transaction.c  |3 +-
>  fs/btrfs/volumes.c  |   19 ++-
>  22 files changed, 222 insertions(+), 424 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 8bc5e8c..c1a518e 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -295,10 +295,9 @@ static int __resolve_indirect_ref(struct btrfs_fs_info 
> *fs_info,
>   ret = btrfs_search_old_slot(root, &ref->key_for_search, path, time_seq);
>   pr_debug("search slot in root %llu (level %d, ref count %d) returned "
>"%d for key (%llu %u %llu)\n",
> -  (unsigned long long)ref->root_id, level, ref->count, ret,
> -  (unsigned long long)ref->key_for_search.objectid,
> -  ref->key_for_search.type,
> -  (unsigned long long)ref->key_for_search.offset);
> +  ref->root_id, level, ref->count, ret,
> +  ref->key_for_search.objectid, ref->key_for_search.type,
> +  ref->key_for_search.offset);
>   if (ret < 0)
>   goto out;
>  
> @@ -1326,8 +1325,7 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, 
> u64 logical,
>found_key->type != BTRFS_METADATA_ITEM_KEY) ||
>   found_key->objectid > logical ||
>   found_key->objectid + size <= logical) {
> - pr_debug("logical %llu is not within any extent\n",
> -  (unsigned long long)logical);
> + pr_debug("logical %llu is not within any extent\n", logical);

...

Many platforms use "long" for u64, so without these casts, they'll generate
compile-time printf-format warnings.  See int-l64.h, and the various
arch/*/include/uapi/asm/types.h files that include it - ia64, ppc, etc.

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 01/12] Btrfs: Remove superfluous casts from u64 to unsigned long long

2013-08-20 Thread Eric Sandeen

On 8/20/13 9:16 AM, Eric Sandeen wrote:
> On 8/20/13 6:20 AM, Geert Uytterhoeven wrote:
>> u64 is "unsigned long long" on all architectures now, so there's no need to
>> cast it when formatting it using the "ll" length modifier.
>>
>> Signed-off-by: Geert Uytterhoeven 
>> ---
>>  fs/btrfs/backref.c  |   24 ++--
>>  fs/btrfs/check-integrity.c  |  294 
>> ++-
>>  fs/btrfs/compression.c  |5 +-
>>  fs/btrfs/ctree.c|6 +-
>>  fs/btrfs/delayed-inode.c|   10 +-
>>  fs/btrfs/dev-replace.c  |2 +-
>>  fs/btrfs/disk-io.c  |   25 ++--
>>  fs/btrfs/extent-tree.c  |   47 +++
>>  fs/btrfs/extent_io.c|   30 ++---
>>  fs/btrfs/file-item.c|4 +-
>>  fs/btrfs/free-space-cache.c |6 +-
>>  fs/btrfs/inode.c|   24 ++--
>>  fs/btrfs/ioctl.c|   10 +-
>>  fs/btrfs/ordered-data.c |   11 +-
>>  fs/btrfs/print-tree.c   |   80 +---
>>  fs/btrfs/qgroup.c   |   10 +-
>>  fs/btrfs/relocation.c   |7 +-
>>  fs/btrfs/root-tree.c|3 +-
>>  fs/btrfs/scrub.c|   16 +--
>>  fs/btrfs/super.c|   10 +-
>>  fs/btrfs/transaction.c  |3 +-
>>  fs/btrfs/volumes.c  |   19 ++-
>>  22 files changed, 222 insertions(+), 424 deletions(-)
>>
>> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
>> index 8bc5e8c..c1a518e 100644
>> --- a/fs/btrfs/backref.c
>> +++ b/fs/btrfs/backref.c
>> @@ -295,10 +295,9 @@ static int __resolve_indirect_ref(struct btrfs_fs_info 
>> *fs_info,
>>  ret = btrfs_search_old_slot(root, &ref->key_for_search, path, time_seq);
>>  pr_debug("search slot in root %llu (level %d, ref count %d) returned "
>>   "%d for key (%llu %u %llu)\n",
>> - (unsigned long long)ref->root_id, level, ref->count, ret,
>> - (unsigned long long)ref->key_for_search.objectid,
>> - ref->key_for_search.type,
>> - (unsigned long long)ref->key_for_search.offset);
>> + ref->root_id, level, ref->count, ret,
>> + ref->key_for_search.objectid, ref->key_for_search.type,
>> + ref->key_for_search.offset);
>>  if (ret < 0)
>>  goto out;
>>  
>> @@ -1326,8 +1325,7 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, 
>> u64 logical,
>>   found_key->type != BTRFS_METADATA_ITEM_KEY) ||
>>  found_key->objectid > logical ||
>>  found_key->objectid + size <= logical) {
>> -pr_debug("logical %llu is not within any extent\n",
>> - (unsigned long long)logical);
>> +pr_debug("logical %llu is not within any extent\n", logical);
> 
> ...
> 
> Many platforms use "long" for u64, so without these casts, they'll generate
> compile-time printf-format warnings.  See int-l64.h, and the various
> arch/*/include/uapi/asm/types.h files that include it - ia64, ppc, etc.

Sorry, should have read your commit message first ;)  I didn't know that this
had changed...

-Eric

> -Eric
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/7] uselex.rb as a tiny tool to find dead code

2013-08-07 Thread Eric Sandeen

On 8/7/13 4:43 PM, Sergei Trofimovich wrote:

...

> Meet uselex.rb: one-file script to parse 'nm' output:
> 
> https://github.com/trofi/uselex/blob/master/uselex.rb

Nice to meet you!  I think I've met your close relative,
ftp://ftp.samba.org/pub/unpacked/junkcode/findstatic.pl  :)

#!/usr/bin/perl -w
# find a list of fns and variables in the code that could be static
# usually called with something like this:
#findstatic.pl `find . -name "*.o"`
# Andrew Tridgell 

(just FWIW)

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/1] ext4: Fix Opts: (null)

2013-07-23 Thread Eric Sandeen

On 7/22/13 5:24 PM, Jóhann B. Guðmundsson wrote:
> null null null no more Opts: (null) but something that actually makes sense to
> human beings...

It's not clear to me how this changes the (null) output...
Have you tested it?  What's the difference in output?

-Eric

> Signed-off-by: Jóhann B. Guðmundsson 
> ---
>  fs/ext4/super.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 85b3dd6..ef141b7 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4088,8 +4088,8 @@ no_journal:
>"the device does not support discard");
>   }
>  
> - ext4_msg(sb, KERN_INFO, "mounted filesystem with%s. "
> -  "Opts: %s%s%s", descr, sbi->s_es->s_mount_opts,
> + ext4_msg(sb, KERN_INFO, "mounted filesystem with%s "
> +  "%s%s%s mount option(s)", descr, sbi->s_es->s_mount_opts,
>*sbi->s_es->s_mount_opts ? "; " : "", orig_data);
>  
>   if (es->s_error_count)
> @@ -4866,7 +4866,7 @@ static int ext4_remount(struct super_block *sb, int 
> *flags, char *data)
>   }
>  #endif
>  
> - ext4_msg(sb, KERN_INFO, "re-mounted. Opts: %s", orig_data);
> + ext4_msg(sb, KERN_INFO, "re-mounted %s mount option(s)", orig_data);

>   kfree(orig_data);
>   return 0;
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: Tree for Aug 13 (xfs)

2013-08-13 Thread Eric Sandeen

On 8/13/13 11:59 AM, Randy Dunlap wrote:
> On 08/13/13 01:28, Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20130812:
>>
> 
> on i386:
> 
> fs/built-in.o: In function `xfs_log_calc_minimum_size':
> (.text+0x1797a9): undefined reference to `__udivdi3'
> 
> 

See:

[PATCH] xfs: call roundup_64() to calculate the min_logblks

on the xfs list:

From: Jie Liu 

Replace roundup() with roundup_64() as we calculate min_logblks
with 64-bit divisions.  Hence, call roundup() will cause the
following error while compiling a 32-bit kernel:

fs/built-in.o: In function `xfs_log_calc_minimum_size':
fs/xfs/xfs_log_rlimit.c:140: undefined reference to `__udivdi3'

Reported-by: Fengguang Wu 
Cc: Dave Chinner 
Signed-off-by: Jie Liu 
---
 fs/xfs/xfs_log_rlimit.c |8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log_rlimit.c b/fs/xfs/xfs_log_rlimit.c
index 6b17ef4..bbcec0b 100644
--- a/fs/xfs/xfs_log_rlimit.c
+++ b/fs/xfs/xfs_log_rlimit.c
@@ -136,10 +136,12 @@ xfs_log_calc_minimum_size(
 * Also, the log size should be a multiple of the log stripe unit, round
 * it up to lsunit boundary if lsunit is specified.
 */
-   if (lsunit)
-   min_logblks = roundup(BTOBB(max_logres), lsunit) + 2 * lsunit;
-   else
+   if (lsunit) {
+   min_logblks = roundup_64(BTOBB(max_logres), lsunit) +
+ 2 * lsunit;
+   } else
min_logblks = BTOBB(max_logres) + 2 * BBSIZE;
min_logblks *= XFS_MIN_LOG_FACTOR;
+
return XFS_BB_TO_FSB(mp, min_logblks);
 }
-- 

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [GIT PULL] ext4 updates for 3.9

2013-03-01 Thread Eric Sandeen

On 2/27/13 2:58 PM, Dmitry Monakhov wrote:
> On Wed, 27 Feb 2013 14:29:07 -0500, Theodore Ts'o  wrote:
>> On Wed, Feb 27, 2013 at 02:19:23PM -0500, Dave Jones wrote:
>>>
>>> Looks like it's fixed here too.
>>>
>>> How did this make it through -next without anyone hitting it ?
>>>
>>> I can't remember how many years ago I last bought a disk < 1TB,
>>> and I can't be alone.  Or is everyone all about SSDs these days?
>>
>> I use LVM, so I have a number of volues which are smaler than 512GB,
>> but very few which are actually larger than 1TB.  And none on my test
>> boxes.  I was running the bleeding edge ext4 code on my laptop as for
>> dogfooding purposes, but I have an 80GB mSATA SSD and a 500GB HDD on
>> my X230 laptop (it requires a thin laptop drive, and 7mm drives don't
>> come any bigger, alas).
>>
>>> Is anyone running xfstests or similar on linux-next regularly ?
>>
>> I run xfstests on the ext4 tree, and I ran it on ext4 plus Linus's tip
>> before I submitted a pull request.  The problem is that XFSTESTS is
>> S-L-O-W if you use large partitions, so typically I use a 5GB
> Indeed. That's why i give-up rotated disks and run xfstest only on SSD
> or brd module 
>> partition sizes for my test runs.  Normally we're worried about race
>> condition bugs, not something as bone-headed as a bitmasking problem,
>> so it makes sense to use a smaller disk for most of your testing.
>> (Some folks do their xfstests run on SSD's or tmpfs image files, again
>> for speed reasons, and it's unlikely they would be big enough.)
>>
>> So what we probably need to do is to have a separate set of tests
>> using a loopback mount, and perhaps an artificially created file
>> system which has a large percentage of the blocks in the middle of the
>> file system busied out, to make efficient testing of these sorts of
>> bugs more efficient.  As I said, I'm thinking about how's the best way
>> to improve our testing regime to catch such problems the next time around.
> Amazing idea. Something like:
> 
> #dd if=/dev/zero of=/tmp/fs.img bs=1M seek=200 count=1
> #mkfs.ext4 -m0 -i4096000 /tmp/fs.img
> #mount /tmp/fs.img /mnt/ -oloop
> #for ((i=0; i < 2000; i++));do   fallocate -l $((1024*1024*1024)) /mnt/f$i 
> ;done
> #for ((i=0; i < 2000; i++));do   truncate -s $((1023*1024*1024)) /mnt/f$i 
> ;done
> 
> As result file system image has 2gb of free space wich is fragmented to ~2000
> chunks 1Mb each. But image itself is quite small
> # df /mnt
> Filesystem  1K-blocks   Used Available Use% Mounted on
> /dev/loop0 2047678076 2045679228   1998848 100% /mnt
> # du -sch /tmp/fs.img 
> 242M /tmp/fs.img
> 242M total
> 
> Later we can simply run xfstest/fio/fsx on this image.
> I'll prepare new xfstest based on that idea. But the only disadvantage
> is that loop dev has bottleneck, all requests will be serialized on i_mutex.

Before anyone does too much work, it would be worth revisiting
dchinner's
[PATCH 0/10] xfstests: rework large filesystem testing
series from July 2012 to see if it meets the needs already.

It almost got all reviews, with one sticking point left, AFAICT.

-Eric



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: EXT4 regression caused 4eec7

2013-05-13 Thread Eric Sandeen

On 5/12/13 4:01 AM, Dmitry Monakhov wrote:
> In fact '4eec70' are vexing because I have reviewed and tested this patch 
> before
> it was marked as Review-by, but missed the bug. This is because xfstests
> was executed manually logs was full of warnings but tainted flag was not
> checked at the end. 

Can you elaborate on this?  What was logged, and is it something we could
try to pick up post-test in xfstests?

Thanks,
-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: EXT4 regression caused 4eec7

2013-05-13 Thread Eric Sandeen

On 5/13/13 12:01 PM, Jan Kara wrote:
> On Mon 13-05-13 11:34:12, Eric Sandeen wrote:
>> On 5/12/13 4:01 AM, Dmitry Monakhov wrote:
>>> In fact '4eec70' are vexing because I have reviewed and tested this patch 
>>> before
>>> it was marked as Review-by, but missed the bug. This is because xfstests
>>> was executed manually logs was full of warnings but tainted flag was not
>>> checked at the end. 
>>
>> Can you elaborate on this?  What was logged, and is it something we could
>> try to pick up post-test in xfstests?
>   Generally I think it might be useful if xfstests would fail / warn if
> kernel became tainted during the test (e.g. due to WARN_ON or oops, or
> something like that). It should be even relatively easy to implement
> (just compare /proc/sys/kernel/tainted before and after each test).
> 
>   Honza
> 

Ah, right.  That should be easy, I'll see if I can cook that up.

Thanks,
-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: EXT4 regression caused 4eec7

2013-05-14 Thread Eric Sandeen

On 5/14/13 2:11 AM, Dmitry Monakhov wrote:
> On Mon, 13 May 2013 12:09:22 -0500, Eric Sandeen  wrote:
>> On 5/13/13 12:01 PM, Jan Kara wrote:
>>> On Mon 13-05-13 11:34:12, Eric Sandeen wrote:
>>>> On 5/12/13 4:01 AM, Dmitry Monakhov wrote:
>>>>> In fact '4eec70' are vexing because I have reviewed and tested this patch 
>>>>> before
>>>>> it was marked as Review-by, but missed the bug. This is because xfstests
>>>>> was executed manually logs was full of warnings but tainted flag was not
>>>>> checked at the end. 
>>>>
>>>> Can you elaborate on this?  What was logged, and is it something we could
>>>> try to pick up post-test in xfstests?
>>>   Generally I think it might be useful if xfstests would fail / warn if
>>> kernel became tainted during the test (e.g. due to WARN_ON or oops, or
>>> something like that). It should be even relatively easy to implement
>>> (just compare /proc/sys/kernel/tainted before and after each test).
>>>
>>> Honza
>>>
>>
>> Ah, right.  That should be easy, I'll see if I can cook that up.
> Also we can use abrt's kernel-oops handler to collect messages.

I sent a pretty simple patch to just check the sysctl to the xfs list
yesterday.

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: v3.10: unmount won't work

2013-05-14 Thread Eric Sandeen

On 5/14/13 2:12 PM, Toralf Förster wrote:
> At a stable 32 bit stable Gentoo with kernel v3.10-rc1-113-ga2c7a54 I cannot 
> umount an (EXT4) fs
> which was created in a file located in a tmpfs partition and loop mounted  :
> 
> That file system was used to hold victims files shared via NFSv4 to an user 
> mode linux 
> on which trinity was used to fuzz testing a patched UML guest kernel.
> 
> n22 ~ # mount
> ...
> nfsd on /proc/fs/nfsd type nfsd (rw,nodev,noexec,nosuid)
> /mnt/ramdisk/disk0 on /mnt/trinity type ext4 (rw)

Is your "mount" looking at /etc/mtab ot at /proc/mounts?

What does /proc/mounts say, does it contain this device or mountpoint?

> n22 ~ # umount /mnt/trinity
> umount: /mnt/trinity: not mounted

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: XFS assertion from truncate. (3.10-rc2)

2013-05-24 Thread Eric Sandeen

On 5/24/13 3:03 AM, Dave Chinner wrote:

> Right, patch below should fix the problem.
> 
> What a frustrating bug. Now, where's my bottle of scotch?

In your pantry, Dave.  Next to the others! ;)

-Eric

> Cheers,
> 
> Dave.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] perf: Fix parallel build

2012-09-20 Thread Eric Sandeen

Parallel builds of perf were failing for me on a 32p box, with:

* new build flags or prefix
util/pmu.l:7:23: error: pmu-bison.h: No such file or directory

...

make: *** [util/pmu-flex.o] Error 1
make: *** Waiting for unfinished jobs

This can pretty quickly be seen by adding a sleep in front of
the bison call in tools/perf/Makefile and running make -j4 on a
smaller box:

sleep 10; $(QUIET_BISON)$(BISON) -v util/pmu.y -d -o 
$(OUTPUT)util/pmu-bison.c

Adding the following dependency fixes it for me:

Signed-off-by: Eric Sandeen 
---

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index bad726a..6c389d9 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -219,7 +219,7 @@ $(OUTPUT)util/parse-events-flex.c: util/parse-events.l
 $(OUTPUT)util/parse-events-bison.c: util/parse-events.y
$(QUIET_BISON)$(BISON) -v util/parse-events.y -d $(PARSER_DEBUG_BISON) 
-o $(OUTPUT)util/parse-events-bison.c
 
-$(OUTPUT)util/pmu-flex.c: util/pmu.l
+$(OUTPUT)util/pmu-flex.c: util/pmu.l util/pmu-bison.c
$(QUIET_FLEX)$(FLEX) --header-file=$(OUTPUT)util/pmu-flex.h -t 
util/pmu.l > $(OUTPUT)util/pmu-flex.c
 
 $(OUTPUT)util/pmu-bison.c: util/pmu.y

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] perf: Fix parallel build

2012-09-20 Thread Eric Sandeen

On 9/20/12 7:24 PM, Namhyung Kim wrote:
> Hi Eric,
> 
> On Thu, 20 Sep 2012 18:53:01 -0500, Eric Sandeen wrote:
>> Parallel builds of perf were failing for me on a 32p box, with:
>>
>> * new build flags or prefix
>> util/pmu.l:7:23: error: pmu-bison.h: No such file or directory
>>
>> ...
>>
>> make: *** [util/pmu-flex.o] Error 1
>> make: *** Waiting for unfinished jobs
>>
>> This can pretty quickly be seen by adding a sleep in front of
>> the bison call in tools/perf/Makefile and running make -j4 on a
>> smaller box:
>>
>>  sleep 10; $(QUIET_BISON)$(BISON) -v util/pmu.y -d -o 
>> $(OUTPUT)util/pmu-bison.c
>>
>> Adding the following dependency fixes it for me:
>>
>> Signed-off-by: Eric Sandeen 
>> ---
>>
>> diff --git a/tools/perf/Makefile b/tools/perf/Makefile
>> index bad726a..6c389d9 100644
>> --- a/tools/perf/Makefile
>> +++ b/tools/perf/Makefile
>> @@ -219,7 +219,7 @@ $(OUTPUT)util/parse-events-flex.c: util/parse-events.l
>>  $(OUTPUT)util/parse-events-bison.c: util/parse-events.y
>>  $(QUIET_BISON)$(BISON) -v util/parse-events.y -d $(PARSER_DEBUG_BISON) 
>> -o $(OUTPUT)util/parse-events-bison.c
>>  
>> -$(OUTPUT)util/pmu-flex.c: util/pmu.l
>> +$(OUTPUT)util/pmu-flex.c: util/pmu.l util/pmu-bison.c
>>  $(QUIET_FLEX)$(FLEX) --header-file=$(OUTPUT)util/pmu-flex.h -t 
>> util/pmu.l > $(OUTPUT)util/pmu-flex.c
>>  
>>  $(OUTPUT)util/pmu-bison.c: util/pmu.y
> 
> I guess the $(OUTPUT)util/parse-events-flex.c: line has the same
> problem.  Could you check and submit a patch for that too?
> 
> Thanks,
> Namhyung
> 

Whoops you are right, will resent V2 shortly, thanks.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH V2] perf: Fix parallel build

2012-09-20 Thread Eric Sandeen

Parallel builds of perf were failing for me on a 32p box, with:

* new build flags or prefix
util/pmu.l:7:23: error: pmu-bison.h: No such file or directory

...

make: *** [util/pmu-flex.o] Error 1
make: *** Waiting for unfinished jobs

This can pretty quickly be seen by adding a sleep in front of
the bison calls in tools/perf/Makefile and running make -j4 on a
smaller box i.e.:

sleep 10; $(QUIET_BISON)$(BISON) -v util/pmu.y -d -o 
$(OUTPUT)util/pmu-bison.c

Adding the following dependencies fixes it for me.

Signed-off-by: Eric Sandeen 
---

V2: Fix other bison dependency caught by Namhyung Kim , 
thanks.

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index 35655c3..8d883de 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -221,13 +221,13 @@ export PERL_PATH
 FLEX = flex
 BISON= bison
 
-$(OUTPUT)util/parse-events-flex.c: util/parse-events.l
+$(OUTPUT)util/parse-events-flex.c: util/parse-events.l 
util/parse-events-bison.c
$(QUIET_FLEX)$(FLEX) --header-file=$(OUTPUT)util/parse-events-flex.h 
$(PARSER_DEBUG_FLEX) -t util/parse-events.l > $(OUTPUT)util/parse-events-flex.c
 
 $(OUTPUT)util/parse-events-bison.c: util/parse-events.y
$(QUIET_BISON)$(BISON) -v util/parse-events.y -d $(PARSER_DEBUG_BISON) 
-o $(OUTPUT)util/parse-events-bison.c
 
-$(OUTPUT)util/pmu-flex.c: util/pmu.l
+$(OUTPUT)util/pmu-flex.c: util/pmu.l util/pmu-bison.c
$(QUIET_FLEX)$(FLEX) --header-file=$(OUTPUT)util/pmu-flex.h -t 
util/pmu.l > $(OUTPUT)util/pmu-flex.c
 
 $(OUTPUT)util/pmu-bison.c: util/pmu.y

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Upgraded from 3.4 to 3.5.1 kernel: machine does not boot

2012-08-12 Thread Eric Sandeen

On 8/10/12 11:14 PM, Justin Piszcz wrote:
> On Fri, Aug 10, 2012 at 7:07 PM, Justin Piszcz
>>
>> Hi,
>>
>> Found the root cause, the 3.5.1 kernel cannot mount my ext4 filesystem
>> (60TB).

You are a brave man running ext4 at 60T, but thank you for testing :)

Backing out 8aeb00ff85ad25453765dd339b408c0087db1527 from 3.5.1
(952fc18ef9ec707ebdc16c0786ec360295e5ff15 upstream) probably helps?

>From a quick look, I think that essentially has a :

for (i = 0; i < ngroups; i++) {

for (j = 0; j < ngroups; j++) {

}
}

type nested loop going on; for a filesystem this big it's going to take almost
literally forever, if I read it right.

-Eric

>> The 3.4 kernel works fine.
>>
>> This is proven by commenting out the filesystem in /etc/fstab with
>> 3.5.1, and all is OK.
>>
>> --
>>
>> Hi again,
>>
>> I tested with linux-3.6-rc1:
>>
>> The same problem, here is what I get from the strace:
>>
>> irectory)
>> 4434  readlink("/dev", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid argument)
>> 4434  readlink("/dev/sda1", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid
>> argument)
>> 4434  readlink("/r1", 0x7fff3b05c670, 4096) = -1 EINVAL (Invalid argument)
>> 4434  getuid()  = 0
>> 4434  geteuid() = 0
>> 4434  getgid()  = 0
>> 4434  getegid() = 0
>> 4434  prctl(PR_GET_DUMPABLE)= 1
>> 4434  lstat("/etc/mtab", {st_mode=S_IFLNK|0777, st_size=12, ...}) = 0
>> 4434  getuid()  = 0
>> 4434  geteuid() = 0
>> 4434  getgid()  = 0
>> 4434  getegid() = 0
>> 4434  prctl(PR_GET_DUMPABLE)= 1
>> 4434  stat("/run", {st_mode=S_IFDIR|0755, st_size=820, ...}) = 0
>> 4434  lstat("/run/mount/utab", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
>> 4434  open("/run/mount/utab", O_RDWR|O_CREAT, 0644) = 3
>> 4434  close(3)  = 0
>> 4434  mount("/dev/sda1", "/r1", "ext4", MS_MGC_VAL|MS_NOATIME, NULL
>>
>> --
>>
>> (w/ 3.6-rc1)
>>
>> [   89.868843] mount   R  running task0  4434   4433
>> 0x0009
>> [   89.868847]  880c246b7b68 816c9279 880c246b7aa8
>> 880c246b7fd8
>> [   89.868851]  880c246b7fd8 4000 88062720cdb0
>> 880c246862d0
>> [   89.868855]  000116c0 880623a863c0 880623a863c0
>> 
>> [   89.868855] Call Trace:
>> [   89.868858]  [] ? __schedule+0x299/0x770
>> [   89.868860]  [] ? __schedule+0x299/0x770
>> [   89.868864]  [] ? ext4_get_group_desc+0x49/0xb0
>> [   89.868868]  [] ? ext4_calculate_overhead+0x131/0x3e0
>> [   89.868871]  [] ? ext4_fill_super+0x1a4b/0x28d0
>> [   89.868875]  [] ? mount_bdev+0x1a1/0x1e0
>> [   89.868877]  [] ? ext4_calculate_overhead+0x3e0/0x3e0
>> [   89.868880]  [] ? ext4_mount+0x10/0x20
>> [   89.868882]  [] ? mount_fs+0x1b/0xd0
>> [   89.868885]  [] ? vfs_kern_mount+0x6f/0x110
>> [   89.86]  [] ? do_kern_mount+0x4f/0x100
>> [   89.868890]  [] ? do_mount+0x2fe/0x8a0
>> [   89.868894]  [] ? strndup_user+0x53/0x70
>> [   89.868896]  [] ? sys_mount+0x90/0xe0
>> [   89.868899]  [] ? tracesys+0xd4/0xd9
>>
>> Justin.
>>
>>
>>
> 
> CC: linux-ext4
> 
> Any ideas here (kernel 3.4 and below can mount 60TB ext4 no issues)
> but > 3.5.1 (did not try 3.5) cannot mount the filesystem.
> 
> Justin.
> 
> Justin.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] reduce large do_mount stack usage with noinlines

2008-02-06 Thread Eric Sandeen

do_mount() uses a whopping 616 bytes of stack on x86_64 in 
2.6.24-mm1, largely thanks to gcc inlining the various helper 
functions.

noinlining these can slim it down a lot; on my box this patch
gets it down to 168, which is mostly the struct nameidata nd;
left on the stack.

These functions are called only as do_mount() helpers;
none of them should be in any path that would see a performance
benefit from inlining...

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
---

Index: linux-2.6.24-mm1/fs/namespace.c
===
--- linux-2.6.24-mm1.orig/fs/namespace.c
+++ linux-2.6.24-mm1/fs/namespace.c
@@ -1296,7 +1296,7 @@ out_unlock:
 /*
  * recursively change the type of the mountpoint.
  */
-static int do_change_type(struct nameidata *nd, int flag)
+static noinline int do_change_type(struct nameidata *nd, int flag)
 {
struct vfsmount *m, *mnt = nd->path.mnt;
int recurse = flag & MS_REC;
@@ -1320,7 +1320,7 @@ static int do_change_type(struct nameida
 /*
  * do loopback mount.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static noinline int do_loopback(struct nameidata *nd, char *old_name, int 
recurse)
 {
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
@@ -1387,7 +1387,7 @@ static int change_mount_flags(struct vfs
  * If you've mounted a non-root directory somewhere and want to do remount
  * on it - tough luck.
  */
-static int do_remount(struct nameidata *nd, int flags, int mnt_flags,
+static noinline int do_remount(struct nameidata *nd, int flags, int mnt_flags,
  void *data)
 {
int err;
@@ -1425,7 +1425,7 @@ static inline int tree_contains_unbindab
return 0;
 }
 
-static int do_move_mount(struct nameidata *nd, char *old_name)
+static noinline int do_move_mount(struct nameidata *nd, char *old_name)
 {
struct nameidata old_nd, parent_nd;
struct vfsmount *p;
@@ -1505,7 +1505,7 @@ out:
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
  */
-static int do_new_mount(struct nameidata *nd, char *type, int flags,
+static noinline int do_new_mount(struct nameidata *nd, char *type, int flags,
int mnt_flags, char *name, void *data)
 {
struct vfsmount *mnt;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] reduce large do_mount stack usage with noinlines

2008-02-06 Thread Eric Sandeen

Andrew Morton wrote:

> Does the patch actually help?  I mean, if a() calls b() and both use N
> bytes of locals, our worst-case stack usage remains ~2N whether or not b()
> was inlined in a()?  In fact, uninlining makes things a little worse due to
> callframe stuff.

I think it does.

[linux-2.6.24-mm1]$ make fs/namespace.o > /dev/null
[linux-2.6.24-mm1]$ objdump -d fs/namespace.o | scripts/checkstack.pl
x86_64 | grep do_mount
0x2307 do_mount [namespace.o]:  616
[linux-2.6.24-mm1]$ quilt push
Applying patch patches/do_mount_stack
patching file fs/namespace.c

Now at patch patches/do_mount_stack
[linux-2.6.24-mm1]$ make fs/namespace.o > /dev/null
[linux-2.6.24-mm1]$ objdump -d fs/namespace.o | scripts/checkstack.pl
x86_64 | grep do_mount
0x2a8b do_mount [namespace.o]:  168

So clearly that one function is reduced.  But it's more than that

I guess the problem is a() calls b() or c() or d() or e() or f(), and
gcc adds up all that stack usage, or seems to, and we get more like 6N
regardless of the path taken.

For example, 2 of the helper functions, once un-inlined, are:

0x1fd9 do_move_mount [namespace.o]: 288
0x1e94 do_loopback [namespace.o]:   168

so it looks like we do carry that baggage even if we go the
do_new_mount() path for example.

>> -static int do_change_type(struct nameidata *nd, int flag)
>> +static noinline int do_change_type(struct nameidata *nd, int flag)
> 
> There's no way for the reader to work out why this is here, so I do think
> it should be commented somewhere.

Ok, good point, will resend... want a comment on each, or perhaps above
do_mount?  I suppose on each.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] reduce large do_mount stack usage with noinlines

2008-02-06 Thread Eric Sandeen

Arjan van de Ven wrote:
> On Wed, 6 Feb 2008 14:34:57 -0800
> Andrew Morton <[EMAIL PROTECTED]> wrote:
> 
>> Does the patch actually help?  I mean, if a() calls b() and both use N
>> bytes of locals, our worst-case stack usage remains ~2N whether or
>> not b() was inlined in a()?  In fact, uninlining makes things a
>> little worse due to callframe stuff.
> 
> it gets interesting at the three-way..
> if a() calls b() and then calls c(), and they all use N,
> the total usage is now 3N not 2N.

*nod*

It'd be nice if it could be max of (a,b,c) though.  Or maybe compilers
don't work that way.  :)

> (although current gcc is already somewhat smarter about this, and 3N might 
> actually be 2N for some cases)

on x86, gcc 4.1.2, do_mount goes from 360 to 112 bytes w/ the patch I sent.

with gcc 4.3, it goes from 364 to 104.

-ERic
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] reduce large do_mount stack usage with noinlines

2008-02-06 Thread Eric Sandeen

(updated with comments about the noinlines, and a fix for
 an >80 char line)

do_mount() uses a whopping 616 bytes of stack on x86_64 in 
2.6.24-mm1, largely thanks to gcc inlining the various helper 
functions.

noinlining these can slim it down a lot; on my box this patch
gets it down to 168, which is mostly the struct nameidata nd;
left on the stack.

These functions are called only as do_mount() helpers;
none of them should be in any path that would see a performance
benefit from inlining...

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
---

Index: linux-2.6.24-mm1/fs/namespace.c
===
--- linux-2.6.24-mm1.orig/fs/namespace.c
+++ linux-2.6.24-mm1/fs/namespace.c
@@ -1295,8 +1295,9 @@ out_unlock:
 
 /*
  * recursively change the type of the mountpoint.
+ * noinline this do_mount helper to save do_mount stack space.
  */
-static int do_change_type(struct nameidata *nd, int flag)
+static noinline int do_change_type(struct nameidata *nd, int flag)
 {
struct vfsmount *m, *mnt = nd->path.mnt;
int recurse = flag & MS_REC;
@@ -1319,8 +1320,10 @@ static int do_change_type(struct nameida
 
 /*
  * do loopback mount.
+ * noinline this do_mount helper to save do_mount stack space.
  */
-static int do_loopback(struct nameidata *nd, char *old_name, int recurse)
+static noinline int do_loopback(struct nameidata *nd, char *old_name,
+   int recurse)
 {
struct nameidata old_nd;
struct vfsmount *mnt = NULL;
@@ -1386,8 +1389,9 @@ static int change_mount_flags(struct vfs
  * change filesystem flags. dir should be a physical root of filesystem.
  * If you've mounted a non-root directory somewhere and want to do remount
  * on it - tough luck.
+ * noinline this do_mount helper to save do_mount stack space.
  */
-static int do_remount(struct nameidata *nd, int flags, int mnt_flags,
+static noinline int do_remount(struct nameidata *nd, int flags, int mnt_flags,
  void *data)
 {
int err;
@@ -1425,7 +1429,10 @@ static inline int tree_contains_unbindab
return 0;
 }
 
-static int do_move_mount(struct nameidata *nd, char *old_name)
+/*
+ * noinline this do_mount helper to save do_mount stack space.
+ */
+static noinline int do_move_mount(struct nameidata *nd, char *old_name)
 {
struct nameidata old_nd, parent_nd;
struct vfsmount *p;
@@ -1504,8 +1511,9 @@ out:
 /*
  * create a new mount for userspace and request it to be added into the
  * namespace's tree
+ * noinline this do_mount helper to save do_mount stack space.
  */
-static int do_new_mount(struct nameidata *nd, char *type, int flags,
+static noinline int do_new_mount(struct nameidata *nd, char *type, int flags,
int mnt_flags, char *name, void *data)
 {
struct vfsmount *mnt;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] reduce large do_mount stack usage with noinlines

2008-02-06 Thread Eric Sandeen

Andrew Morton wrote:
> On Wed, 06 Feb 2008 17:11:38 -0600
> Eric Sandeen <[EMAIL PROTECTED]> wrote:
> 
>>  /*
>>   * recursively change the type of the mountpoint.
>> + * noinline this do_mount helper to save do_mount stack space.
>>   */
>> -static int do_change_type(struct nameidata *nd, int flag)
>> +static noinline int do_change_type(struct nameidata *nd, int flag)
> 
> What we could do here is defined a new noinline_because_of_stack_suckiness
> and use that.  Reasons:
> 
> - self-documenting, so we don't need to comment each site
> 
> - can be made a no-op for suitable __GNUC__ values if gcc ever fixes this
> 
> - our children we can go through and delete them all when nobody is using
>   the offending gcc versions any more.
> 
> what thinkest thou?

Yes, sounds very good to me.  I'm sure there are more places which could
use this.

Or perhaps there are some magic flags to gcc so that it doesn't inline
so aggressively in situations like this?  IOW is it gcc breakage, or
design - but maybe with defaults that don't fit well in the limited
stack... (well, I suppose the "for N mutually exclusive helper functions
each with stack usage S, use about N*S stack when inlining them all" bit
probably isn't a feature.)

FWIW the XFS team finally just wrested control from GCC.
http://oss.sgi.com/archives/xfs/2006-11/msg00195.html

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] reduce large do_mount stack usage with noinlines

2008-02-07 Thread Eric Sandeen

Andrew Morton wrote:
> On Wed, 06 Feb 2008 17:11:38 -0600
> Eric Sandeen <[EMAIL PROTECTED]> wrote:
> 
>>  /*
>>   * recursively change the type of the mountpoint.
>> + * noinline this do_mount helper to save do_mount stack space.
>>   */
>> -static int do_change_type(struct nameidata *nd, int flag)
>> +static noinline int do_change_type(struct nameidata *nd, int flag)
> 
> What we could do here is defined a new noinline_because_of_stack_suckiness
> and use that.

Something like:

Index: linux-2.6.24-mm1/include/linux/compiler-gcc.h
===
--- linux-2.6.24-mm1.orig/include/linux/compiler-gcc.h
+++ linux-2.6.24-mm1/include/linux/compiler-gcc.h
@@ -53,3 +53,9 @@
 #define  noinline  __attribute__((noinline))
 #define __attribute_const____attribute__((__const__))
 #define __maybe_unused __attribute__((unused))
+
+/*
+ * When gcc inlines multiple functions into a parent function,
+ * the stack space used sometimes increases excessively...
+ */
+#define noinline_stackspacenoinline


?

I couldn't think of a great name for it.  There are several noinline
users throughout the kernel with stackspace related comments, so if
desired, I could sprinkle this around.  I'm not very pleased with it
aesthetically though. :)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] reduce large do_mount stack usage with noinlines

2008-02-08 Thread Eric Sandeen

Andi Kleen wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
>>>   */
>>> -static int do_change_type(struct nameidata *nd, int flag)
>>> +static noinline int do_change_type(struct nameidata *nd, int flag)
>> What we could do here is defined a new noinline_because_of_stack_suckiness
>> and use that.  Reasons:
>>
>> - self-documenting, so we don't need to comment each site
>>
>> - can be made a no-op for suitable __GNUC__ values if gcc ever fixes this
> 
> In theory it should be already fixed; iirc Richard H. (cc'ed) added
> code for this somewhere in 4.x. Don't quite remember which x, likely 
> either 1 or 2.
> 
> e.g. if I do a quick test here on gcc 4.2 then it definitely
> reuses stack slots between inlines. As you can see only ~100 bytes
> are allocated, not ~200.

On gcc 4.1.2 and 4.3 (fedora flavors) I don't see it re-used in
do_mount, though... *shrug*

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: WARNING: at drivers/tty/tty_buffer.c:476 flush_to_ldisc()

2013-01-19 Thread Eric Sandeen

On 1/19/13 4:44 PM, Sedat Dilek wrote:
> Hi Dave,
> 
> I suspected after initial testing a problem in TTY and applied two patches.
> After more testing the root cause was a problem in JBD2.
> A patch from Eric helped!

oh, excellent ;)

Helped what, exactly?

> Follow the thread in [1] for more details.
> 
> If you are on Linux-Next (next-20130118) you need the following three patches.
> 
> Ilya Zykov (2):
>   tty: Correct tty buffer flush.
>   tty: Add driver unthrottle in ioctl(...,TCFLSH,..).
> 
> Eric Sandeen (1):
>   jbd2: don't wake kjournald unnecessarily

Hum, I would not have expected this to *matter*, I thought it was just a perf
optimization.  Hm.  What do you think it fixed, exactly?

(sorry, have not followed that thread, busy elsewhere ATM)

-eric

> Hope this helps you.
> 
> Regards,
> - Sedat -
> 
> [0] http://marc.info/?t=13586202374&r=1&w=2
> [1] http://marc.info/?l=linux-serial&m=135860500714909&w=2
> [2] 
> http://git.kernel.org/?p=linux/kernel/git/gregkh/tty.git;a=commitdiff;h=a1bf9584429d61b7096f93ae09325e1ba538e9e8
> [3] http://patchwork.ozlabs.org/patch/207237/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: jbd2: don't wake kjournald unnecessarily

2013-01-22 Thread Eric Sandeen

On 1/22/13 5:50 PM, Jan Kara wrote:
> On Mon 21-01-13 18:11:30, Ted Tso wrote:
>> On Tue, Jan 22, 2013 at 12:04:32AM +0100, Sedat Dilek wrote:
>>>
>>> Beyond the FUSE/LOOP fun, will you apply this patch to your linux-next GIT 
>>> tree?
>>>
>>> Feel free to add...
>>>
>>>  Tested-by: Sedat Dilek 
>>>
>>> A similiar patch for JBD went through your tree into mainline (see [1] and 
>>> [2]).
>>
>> I'm not at all convinced that this patch has anything to do with your
>> problem.  I don't see how it could affect things, and I believe you
>> mentioned that you saw the problem even with this patch applied?  (I'm
>> not sure; some of your messages which you sent were hard to
>> understand, and you mentioned something about trying to send messages
>> when low on sleep :-).
>>
>> In any case, the reason why I haven't pulled this patch into the ext4
>> tree is because I was waiting for Eric and some of the performance
>> team folks at Red Hat to supply some additional information about why
>> this commit was making a difference in performance for a particular
>> proprietary, closed source benchmark.
>   Just a small correction - it was aim7 AFAIK which isn't closed source
> (anymore). You can download it from SourceForge
> (http://sourceforge.net/projects/aimbench/files/aim-suite7/Initial%20release/).
> Now I have some reservations about what the benchmark does but historically
> it has found quite a few issues for us as well.
> 
>> I'm very suspicious about applying patches under the "cargo cult"
>> school of programming.  ("We don't understand why it makes a
>> difference, but it seems to be good, so bombs away!" :-)
>   Well, neither am I ;) But it is obvious the patch speeds up
> log_start_commit() by 'a bit' (taking spinlock, disabling irqs, ...). And
> apparently 'a bit' is noticeable for particular workload on a particular
> machine - commit statistics Eric provided showed that clearly. I'd still be
> happier if Eric also told us how much log_start_commit() calls there were
> so that one could verify that 'a bit' could indeed multiply to a measurable
> difference. But given how simple the patch is, I gave away after a while
> and just merged it...

I am still trying to get our perf guys to collect that data, FWIW...
I will send it when I get it.  I bugged them again today.  :)

(Just to be sure: I was going to measure the wakeups the old way, and the
avoided wakeups with the new change; sound ok?)

-Eric

>   Honza
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: jbd2: don't wake kjournald unnecessarily

2013-01-23 Thread Eric Sandeen

On 1/23/13 3:44 AM, Jan Kara wrote:
> On Tue 22-01-13 19:37:46, Eric Sandeen wrote:
>> On 1/22/13 5:50 PM, Jan Kara wrote:
>>> On Mon 21-01-13 18:11:30, Ted Tso wrote:
>>>> On Tue, Jan 22, 2013 at 12:04:32AM +0100, Sedat Dilek wrote:
>>>>>
>>>>> Beyond the FUSE/LOOP fun, will you apply this patch to your linux-next 
>>>>> GIT tree?
>>>>>
>>>>> Feel free to add...
>>>>>
>>>>>  Tested-by: Sedat Dilek 
>>>>>
>>>>> A similiar patch for JBD went through your tree into mainline (see [1] 
>>>>> and [2]).
>>>>
>>>> I'm not at all convinced that this patch has anything to do with your
>>>> problem.  I don't see how it could affect things, and I believe you
>>>> mentioned that you saw the problem even with this patch applied?  (I'm
>>>> not sure; some of your messages which you sent were hard to
>>>> understand, and you mentioned something about trying to send messages
>>>> when low on sleep :-).
>>>>
>>>> In any case, the reason why I haven't pulled this patch into the ext4
>>>> tree is because I was waiting for Eric and some of the performance
>>>> team folks at Red Hat to supply some additional information about why
>>>> this commit was making a difference in performance for a particular
>>>> proprietary, closed source benchmark.
>>>   Just a small correction - it was aim7 AFAIK which isn't closed source
>>> (anymore). You can download it from SourceForge
>>> (http://sourceforge.net/projects/aimbench/files/aim-suite7/Initial%20release/).
>>> Now I have some reservations about what the benchmark does but historically
>>> it has found quite a few issues for us as well.
>>>
>>>> I'm very suspicious about applying patches under the "cargo cult"
>>>> school of programming.  ("We don't understand why it makes a
>>>> difference, but it seems to be good, so bombs away!" :-)
>>>   Well, neither am I ;) But it is obvious the patch speeds up
>>> log_start_commit() by 'a bit' (taking spinlock, disabling irqs, ...). And
>>> apparently 'a bit' is noticeable for particular workload on a particular
>>> machine - commit statistics Eric provided showed that clearly. I'd still be
>>> happier if Eric also told us how much log_start_commit() calls there were
>>> so that one could verify that 'a bit' could indeed multiply to a measurable
>>> difference. But given how simple the patch is, I gave away after a while
>>> and just merged it...
>>
>> I am still trying to get our perf guys to collect that data, FWIW...
>> I will send it when I get it.  I bugged them again today.  :)
>>
>> (Just to be sure: I was going to measure the wakeups the old way, and the
>> avoided wakeups with the new change; sound ok?)
>   Yes, that would be what I'm interested in.

Holy cow, this is much more than I expected, but here's what they report:

old JBD: AIM7 jobs/min 97624.39;  got 78193 jbd wakeups
new JBD: AIM7 jobs/min 85929.43;  got 6306999 jbd wakeups, 6264684 extra wakeups

The "extra wakeups" were hacked in like:

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index d492d57..3e0c4eb 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -433,15 +433,25 @@ int __log_space_left(journal_t *journal)
return left;
 }
 
+unsigned long jbd_wakeups;
+unsigned long jbd_extra_wakeups;
+
 /*
  * Called under j_state_lock.  Returns true if a transaction commit was 
started.
  */
 int __log_start_commit(journal_t *journal, tid_t target)
 {
/*
-* Are we already doing a recent enough commit?
+* The only transaction we can possibly wait upon is the
+* currently running transaction (if it exists).  Otherwise,
+* the target tid must be an old one.
 */
-   if (!tid_geq(journal->j_commit_request, target)) {
+   if (/* journal->j_commit_request != target && <--- ERS: Undo "fix" */
+   journal->j_running_transaction &&
+   journal->j_running_transaction->t_tid == target) {
+   /* if we already have the right target, this is extra */
+   if (journal->j_commit_request == target)
+   jbd_extra_wakeups++;
/*
 * We want a new commit: OK, mark the request and wakup the
 * commit thread.  We do _not_ do the commit ourselves.
@@ -451,9 +461,17 @@ int __log_start_commit(journal_t *journal, tid_t target)

[PATCH] include falloc.h in header-y

2008-02-22 Thread Eric Sandeen

Include falloc.h in header-y; it defines a flag
for the fallocate sysctl.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---

--- linux.orig/include/linux/Kbuild
+++ linux/include/linux/Kbuild
@@ -61,6 +61,7 @@ header-y += efs_fs_sb.h
 header-y += elf-fdpic.h
 header-y += elf-em.h
 header-y += fadvise.h
+header-y += falloc.h
 header-y += fd.h
 header-y += fdreg.h
 header-y += fib_rules.h

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [xfs-masters] filesystem corruption on xfs after 2.6.25-rc1 (bisected, powerpc related?)

2008-02-25 Thread Eric Sandeen

Gaudenz Steinlin wrote:
> Hi
> 
> Since upgrading to 2.6.25-rc1 I see filesystem corruption on my XFS
> filesystem. I can reproduce this by doing "git reset --hard v2.6.25-rc1"
> on a git checkout which is on some other revision. Git outputs strange
> error messages (like file xxx is a directory when xxx really is a file)
> and sometimes the filesystem "hangs" (I can no longer do any operations
> on it even from another shell). If I reboot with a working kernel and
> check the filesystem xfs_check reports many errors. I also see the
> problem when doing other (not related to git) operations on the
> filesystem. Git reset is just the easiest way to reproduce it.
> 
> I was able to track this corruption down to commit
> a69b176df246d59626e6a9c640b44c0921fa4566 ([XFS] Use the generic bitops
> rather than implementing them ourselves.) using git bisect.
> 
> Reverting edd319dc527733e61eec5bdc9ce20c94634b6482 ([XFS] Fix
> xfs_lowbit64) to avoid merge conflicts and the faulty commit on top of
> 2.6.25-rc3 fixes the problem.

If you're feeling motivated, maybe you can narrow it down to which of
the changes - xfs_highbit32, xfs_highbit64, xfs_lowbit32, or
xfs_lowbit64 - is causing the problem?  (or maybe they all are ...)

Or maybe someone looking at the commit can immediately see the
problem... but I can't :)

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [xfs-masters] filesystem corruption on xfs after 2.6.25-rc1 (bisected, powerpc related?)

2008-02-25 Thread Eric Sandeen

Rafael J. Wysocki wrote:
> On Monday, 25 of February 2008, Eric Sandeen wrote:


>> If you're feeling motivated, maybe you can narrow it down to which of
>> the changes - xfs_highbit32, xfs_highbit64, xfs_lowbit32, or
>> xfs_lowbit64 - is causing the problem?  (or maybe they all are ...)
>>
>> Or maybe someone looking at the commit can immediately see the
>> problem... but I can't :)
> 
> Well, IMO a reproducible filesystem corruption is a serious enough issue
> for reverting all of the commits in question.

I'm not suggesting a partial revert; I just wonder which part of the
change is causing the problem, as part of the debugging process.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] ext3 freeze feature ver 0.2

2008-02-26 Thread Eric Sandeen

Takashi Sato wrote:

> o Elevate XFS ioctl numbers (XFS_IOC_FREEZE and XFS_IOC_THAW) to the VFS
>   As Andreas Dilger and Christoph Hellwig advised me, I have elevated
>   them to include/linux/fs.h as below.
> #define FIFREEZE_IOWR('X', 119, int)
> 　  #define FITHAW  _IOWR('X', 120, int)
>   The ioctl numbers used by XFS applications don't need to be changed.
>   But my following ioctl for the freeze needs the parameter
>   as the timeout period.  So if XFS applications don't want the timeout
>   feature as the current implementation, the parameter needs to be
>   changed 1 (level?) into 0.

So, existing xfs applications calling the xfs ioctl now will behave
differently, right?  We can only keep the same ioctl number if the
calling semantics are the same.  Keeping the same number but changing
the semantics is harmful, IMHO

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [xfs-masters] Re: filesystem corruption on xfs after 2.6.25-rc1 (bisected, powerpc related?)

2008-02-26 Thread Eric Sandeen

Gaudenz Steinlin wrote:
> On Tue, Feb 26, 2008 at 01:13:56AM +0100, Rafael J. Wysocki wrote:
>> On Tuesday, 26 of February 2008, Christoph Hellwig wrote:
>>> On Tue, Feb 26, 2008 at 12:52:56AM +0100, Rafael J. Wysocki wrote:
> I'm not suggesting a partial revert; I just wonder which part of the
> change is causing the problem, as part of the debugging process.
> 
> I debuged this a bit further by testing the 4 changed functions
> individually. The problem only occurs with the new version of
> xfs_lowbit64. 

FWIW, Dave & I did some testing/debugging on 32-bit powerpc, and it is
indeed only xfs_lowbit64 which is doing the wrong thing on that arch,
because generic find_next_bit is doing the wrong thing on big-endian
32-bit systems, for sizes > 32 bits, near as I can tell.

Rather than reverting it all, I think just changing xfs_lowbit64 back to:

int
xfs_lowbit64(
   __uint64_t  v)
{
   __uint32_t  w = (__uint32_t)v;
   int n = 0;

   if (w) {/* lower bits */
   n = ffs(w);
   } else {/* upper bits */
   w = (__uint32_t)(v >> 32);
   if (w && (n = ffs(w)))
   n += 32;
   }
   return n - 1;
}

for now should fix it (this is essentially just ffs64())

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

"kaweth" usb ethernet driver in 2.4?

2001-02-03 Thread Eric Sandeen


I'm wondering about the status of the "kaweth" Kawasaki LSI KL5KUSB100
USB to Ethernet Controller driver for 2.4.  According to
http://www.hiru.aoba.yokohama.jp/%7eura/USB/usbether.html, this chipset
is used in the 3Com USB Network Adapter, Linksys USB10T, D-Link DSB-650,
SMC 2102USB, Netgear EA101, and a few other USB ethernet adapters.

The driver is included with the USB stuff for 2.2, but not in 2.4.

It also doesn't seem to work in 2.2.  :)  The original development of
this driver was going on at http://drivers.rd.ilan.net/kaweth/ but there
have been no updates for quite some time.

Donald Becker had a driver at one point as well, and there's still a
link at http://www.scyld.com/usb/ethernet.html, but the link is broken
now, and I don't have the code.

I have one of these beasts, and I'd like to get it working - I haven't
done any USB _or_ ethernet work for Linux, so it'll be an uphill climb
fro me.  Anybody else have one, and have some time to collaborate?  :) 
I have some of the chipset documentation, too, FWIW.

Thanks,
-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: "kaweth" usb ethernet driver in 2.4?

2001-02-04 Thread Eric Sandeen

Michael Rothwell wrote:

> > It also doesn't seem to work in 2.2.  :)  The original development of
> > this driver was going on at http://drivers.rd.ilan.net/kaweth/ but there
> > have been no updates for quite some time.
> 
> Well, it doesn't work you _you_ on 2.2, obviously. But it works for us
> and other people. Can you provide any information to diagnose the
> problem you're having?

I'm sorry, I should have provided that info.  I searched the 'net, and
saw many people with problems on several mailing lists, and saw no
solutions, or reports of success, so I assumed that it was a widespread,
possibly even known, problem with the driver.

I'll preface this by saying that Brad Hards sent me an updated version
that works much better, at least as long as the module is loaded before
the device is plugged in.

But here's what I get with a 2.2 kernel and the stock driver:

Kawasaki USB->Ethernet Driver loading... 
usb.c: registered new driver kaweth 
usb.c: USB new device connect, assigned device number 2 
Kawasaki Device Probe (Device number:2): 0x0846:0x1001 
Device at c2192600 
Descriptor length: 12 type: 1 
NetGear EA-101 connected... 
Reading kaweth configuration 
Request type: c0  Request: 0  Value: 0 Index: 0 Length: 12 
usb-uhci.c: interrupt, status 2, frame# 1929 
kaweth control message failed (urb addr: c2c05ca0) 
usb_control/bulk_msg: timeout 
usb-uhci.c: interrupt, status 2, frame# 851 
Actual length: 0, length 18 
Resetting... 
usb-uhci.c: interrupt, status 2, frame# 854 
Downloading firmware at c48abb6c to kaweth device at c31be000... 
Firmware length: 3838 
Request type: 40  Request: ff  Value: 0 Index: 0 Length: efe 
usb-uhci.c: interrupt, status 2, frame# 871 
kaweth control message failed (urb addr: c213ab60) 
usb-uhci.c: interrupt, status 2, frame# 877 
usb-uhci.c: interrupt, status 2, frame# 882 
Actual length: 0, length 3838 
Error downloading firmware (-110), no net device created
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: "kaweth" usb ethernet driver in 2.4?

2001-02-05 Thread Eric Sandeen


Ok, so the problem with the current driver is that it will attempt to
load the firmware even if firmware is already loaded.  This will hang
the device.

The trick is to look at the device release number
(dev->descriptor.bcdDevice) - if no firmware is present, it returns
0x0002, if firmware is present, it returns 0x0202.

The second trick is to explicitly call usb_get_device_descriptor()
before checking this, as it seems to be cached otherwise.  Brad and I
had a simultaneous revelation on this one.  :)

Adding this test before the (firmware download, fix download, firmware
trigger) sequence makes things work much more reliably, although it
still fails occasionally if the device is plugged in after the driver is
loaded - fails on usb_get_device_descriptor() for some reason... perhaps
a short delay before re-reading the desriptor might help?  (just
guessing here...)

Patches sent to Brad & Michael.

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

[PATCH] updates for KLSI usb->ethernet

2001-02-06 Thread Eric Sandeen


This patch, against 2.4.1-ac4, does the following for the KLSI
USB->ethernet adapter:

(patch at http://lager.dyndns.org/kaweth/KLSI-2.4.1-ac4.patch.bz2)

o Fixes firmware downloading.  If firmware is already loaded
  and an attempt is made to download it again, the device
  will hang.  This will happen on a warm boot. Driver now 
  checks the bcdDevice value, which changes after firmware 
  is loaded.  It does this via usb_get_device_descriptor() 
  to avoid caching.  If device already has firmware, it will 
  skip the download.

o Reports bcdDevice revision in debugging messages

o Updates firmware revision, fresh from KLSI

o Actually _uses_ interrupt parameter passed to
  kaweth_trigger_firmware()

o added function prototype for 
  kaweth_internal_control_msg() to avoid warning

o spells "receive" correctly.  :)

There is another way to handle the firmware download check - there is a
chunk of firmware which can be downloaded that causes the device to
disconnect, wait, then reconnect to the USB bus.  When it reappears, it
has the new bcdDevice value in the descriptor.

This might be a better way to go, so that the device descriptor doesn't
silently change.  I've also seen some errors when I try to re-read the
device descriptor with usb_get_device_descriptor(), for some reason.

Any thoughts on what would be more correct, 

a) device descriptor silently changes
b) device magically disconnects/reconnects on its own

Both seem a bit odd, but take your pick.  :)

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [NEW][PATCH] updates for KLSI

2001-02-08 Thread Eric Sandeen


Greg KH wrote:

> Silently changing descriptor ids while connected is just asking for
> trouble :)

Ok, fair enough - here's a patch against 2.4.1-ac6 which causes the
device to disconnect/reconnect after firmware load (and a couple other
minor kernel logging & formatting cleanups, plus one more device ID).

-Eric

--- linux/drivers/usb/kaweth.c.ac6  Thu Feb  8 10:47:40 2001
+++ linux/drivers/usb/kaweth.c  Thu Feb  8 11:08:16 2001
@@ -131,6 +131,7 @@
{ USB_DEVICE(0x066b, 0x2202) }, /* Linksys USB10T */ 
{ USB_DEVICE(0x1645, 0x0005) }, /* Entrega E45 */ 
{ USB_DEVICE(0x2001, 0x4000) }, /* D-link DSB-650C */
+   { USB_DEVICE(0x07b8, 0x4000) }, /* D-Link DU-E10 */
{} /* Null terminator */
 };
 
@@ -647,9 +648,10 @@
if (net->flags & IFF_PROMISC) {
packet_filter_bitmap |= KAWETH_PACKET_FILTER_PROMISCUOUS;
} 
-   else if ((net->mc_count) ||  (net->flags & IFF_ALLMULTI)) {
+   else if ((net->mc_count) || (net->flags & IFF_ALLMULTI)) {
packet_filter_bitmap |= KAWETH_PACKET_FILTER_ALL_MULTICAST;
}
+
kaweth->packet_filter_bitmap = packet_filter_bitmap;
netif_wake_queue(net);
 }
@@ -721,11 +723,7 @@
const eth_addr_t bcast_addr = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };
int result = 0;
 
-   result = usb_get_device_descriptor(dev);
-   if (result < 0)
-   kaweth_err("Error re-loading device descriptor");
-   
-   kaweth_dbg("Kawasaki Device Probe (Device number:%d): 0x%4.4x:0x%4.4x revision 
0x%4.4x",
+   kaweth_dbg("Kawasaki Device Probe (Device number:%d): 0x%4.4x:0x%4.4x:0x%4.4x",
 dev->devnum, 
 (int)dev->descriptor.idVendor, 
 (int)dev->descriptor.idProduct,
@@ -754,52 +752,65 @@
kaweth_reset(kaweth);
 
/*
-* If we got bcdDevice 0x0202, firmware is already present,
-* don't try to download again.
+* If high byte of bcdDevice is nonzero, firmware is already
+* downloaded. Don't try to do it again, or we'll hang the device.
 */
 
-   if ((int)dev->descriptor.bcdDevice != 0x0202) {
-
-   if((result = kaweth_download_firmware(kaweth, 
+   if (dev->descriptor.bcdDevice >> 8) {
+   kaweth_info("Firmware present in device.");
+   } else {
+   /* Download the firmware */
+   kaweth_info("Downloading firmware...");
+   if ((result = kaweth_download_firmware(kaweth, 
  kaweth_new_code, 
  len_kaweth_new_code, 
  100, 
- 2)) < 0){
-   kaweth_err("Error downloading firmware (%d), no net device 
created", result);
+ 2)) < 0) {
+   kaweth_err("Error downloading firmware (%d)", result);
kfree(kaweth);
return NULL;
}
 
-   if((result = kaweth_download_firmware(kaweth, 
+   if ((result = kaweth_download_firmware(kaweth, 
  kaweth_new_code_fix, 
  len_kaweth_new_code_fix, 
  100, 
  3)) < 0) {
-   kaweth_err("Error downloading firmware fix (%d), no net device 
created", result);
+   kaweth_err("Error downloading firmware fix (%d)", result);
kfree(kaweth);
return NULL;
}
 
-   if((result = kaweth_trigger_firmware(kaweth, 100)) < 0){
-   kaweth_err("Error triggering firmware (%d), no net device 
created\n", result);
+   if ((result = kaweth_download_firmware(kaweth,
+ kaweth_trigger_code,
+ len_kaweth_trigger_code,
+ 126,
+ 2)) < 0) {
+   kaweth_err("Error downloading trigger code (%d)", result);
kfree(kaweth);
return NULL;
}
 
-   udelay(1000);
+   if ((result = kaweth_download_firmware(kaweth,
+ kaweth_trigger_code_fix,
+ len_kaweth_trigger_code_fix,
+ 126,
+ 3)) < 0) {
+   kaweth_err("Error downloading trigger code fix (%d)", res

Re: [xfs-masters] Re: [PATCH] pull XFS support out of Kconfig submenu

2005-08-18 Thread Eric Sandeen


Chris Wedgwood wrote:

On Wed, Aug 17, 2005 at 10:45:48PM +0200, Jesper Juhl wrote:



It seems slightly odd to me that XFS support should be in a separate
submenu, when all the other filesystems are not using submenus but
are directly selectable from the Filesystems menu.



XFS also has an out-of-tree version.  Making it a submenu is probably
to make maintenance easier (ie. replace files, not merge).



Where the Kconfig is vs. where the menu appears are 2 different things 
though.  The latest kernel has our own Kconfig in fs/xfs, and fs/Kconfig 
just does:


source "fs/xfs/Kconfig"

This does facilitate swapping in a devel version of fs/xfs via a 
symlink, etc.


However, fs/xfs/Kconfig does still start with

menu "XFS support"

which puts it in a submenu, unlike every other fs.

I have no problem with removing the submenu.

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [xfs-masters] Re: [PATCH] pull XFS support out of Kconfig submenu

2005-08-18 Thread Eric Sandeen


Eric Sandeen wrote:

I have no problem with removing the submenu.


Hm, however, if this is a purely cosmetic thing, let's go all the way 
and format the xfs options like the others, with indentation etc:


 config XFS_RT
-   bool "Realtime support (EXPERIMENTAL)"
+   bool "  XFS Realtime support (EXPERIMENTAL)"


May be simpler to just make this change internally & let Nathan push it 
out.  I do agree that it looks better.  :)


-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] (UPDATED3) ecryptfs: check for existing key_tfm at mount time

2008-01-07 Thread Eric Sandeen

(ugh, remove extra mutex_unlock I missed when changing locking
per Andrew's suggestion...)

Jeff Moyer pointed out that a mount; umount loop of ecryptfs,
with the same cipher & other mount options, created a new 
ecryptfs_key_tfm_cache item each time, and the cache could
grow quite large this way.

Looking at this with mhalcrow, we saw that ecryptfs_parse_options()
unconditionally called ecryptfs_add_new_key_tfm(), which is what
was adding these items.

Refactor ecryptfs_get_tfm_and_mutex_for_cipher_name() to create a 
new helper function, ecryptfs_tfm_exists(), which checks for the 
cipher on the cached key_tfm_list, and sets a pointer
to it if it exists.  This can then be called from 
ecryptfs_parse_options(), and new key_tfm's can be added only when
a cached one is not found.

With list locking changes suggested by akpm.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
---

Index: linux-2.6.24-rc3/fs/ecryptfs/crypto.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/crypto.c
+++ linux-2.6.24-rc3/fs/ecryptfs/crypto.c
@@ -1780,7 +1780,7 @@ out:
 
 struct kmem_cache *ecryptfs_key_tfm_cache;
 static struct list_head key_tfm_list;
-static struct mutex key_tfm_list_mutex;
+struct mutex key_tfm_list_mutex;
 
 int ecryptfs_init_crypto(void)
 {
@@ -1789,6 +1789,11 @@ int ecryptfs_init_crypto(void)
return 0;
 }
 
+/**
+ * ecryptfs_destroy_crypto - free all cached key_tfms on key_tfm_list
+ *
+ * Called only at module unload time
+ */
 int ecryptfs_destroy_crypto(void)
 {
struct ecryptfs_key_tfm *key_tfm, *key_tfm_tmp;
@@ -1812,6 +1817,8 @@ ecryptfs_add_new_key_tfm(struct ecryptfs
struct ecryptfs_key_tfm *tmp_tfm;
int rc = 0;
 
+   BUG_ON(!mutex_is_locked(&key_tfm_list_mutex));
+
tmp_tfm = kmem_cache_alloc(ecryptfs_key_tfm_cache, GFP_KERNEL);
if (key_tfm != NULL)
(*key_tfm) = tmp_tfm;
@@ -1838,13 +1845,50 @@ ecryptfs_add_new_key_tfm(struct ecryptfs
(*key_tfm) = NULL;
goto out;
}
-   mutex_lock(&key_tfm_list_mutex);
list_add(&tmp_tfm->key_tfm_list, &key_tfm_list);
-   mutex_unlock(&key_tfm_list_mutex);
 out:
return rc;
 }
 
+/**
+ * ecryptfs_tfm_exists - Search for existing tfm for cipher_name.
+ * @cipher_name: the name of the cipher to search for
+ * @key_tfm: set to corresponding tfm if found
+ *
+ * Searches for cached key_tfm matching @cipher_name
+ * Must be called with &key_tfm_list_mutex held
+ * Returns 1 if found, with @key_tfm set
+ * Returns 0 if not found, with @key_tfm set to NULL
+ */
+int ecryptfs_tfm_exists(char *cipher_name, struct ecryptfs_key_tfm **key_tfm)
+{
+   struct ecryptfs_key_tfm *tmp_key_tfm;
+
+   BUG_ON(!mutex_is_locked(&key_tfm_list_mutex));
+
+   list_for_each_entry(tmp_key_tfm, &key_tfm_list, key_tfm_list) {
+   if (strcmp(tmp_key_tfm->cipher_name, cipher_name) == 0) {
+   if (key_tfm)
+   (*key_tfm) = tmp_key_tfm;
+   return 1;
+   }
+   }
+   if (key_tfm)
+   (*key_tfm) = NULL;
+   return 0;
+}
+
+/**
+ * ecryptfs_get_tfm_and_mutex_for_cipher_name
+ *
+ * @tfm: set to cached tfm found, or new tfm created
+ * @tfm_mutex: set to mutex for cached tfm found, or new tfm created
+ * @cipher_name: the name of the cipher to search for and/or add
+ *
+ * Sets pointers to @tfm & @tfm_mutex matching @cipher_name.
+ * Searches for cached item first, and creates new if not found.
+ * Returns 0 on success, non-zero if adding new cipher failed
+ */
 int ecryptfs_get_tfm_and_mutex_for_cipher_name(struct crypto_blkcipher **tfm,
   struct mutex **tfm_mutex,
   char *cipher_name)
@@ -1854,22 +1898,17 @@ int ecryptfs_get_tfm_and_mutex_for_ciphe
 
(*tfm) = NULL;
(*tfm_mutex) = NULL;
+
mutex_lock(&key_tfm_list_mutex);
-   list_for_each_entry(key_tfm, &key_tfm_list, key_tfm_list) {
-   if (strcmp(key_tfm->cipher_name, cipher_name) == 0) {
-   (*tfm) = key_tfm->key_tfm;
-   (*tfm_mutex) = &key_tfm->key_tfm_mutex;
-   mutex_unlock(&key_tfm_list_mutex);
+   if (!ecryptfs_tfm_exists(cipher_name, &key_tfm)) {
+   rc = ecryptfs_add_new_key_tfm(&key_tfm, cipher_name, 0);
+   if (rc) {
+   printk(KERN_ERR "Error adding new key_tfm to list; "
+   "rc = [%d]\n", rc);
goto out;
}
}
mutex_unlock(&key_tfm_list_mutex);
-   rc = ecryptfs_add_new_key_tfm(&key_tfm, cipher_name, 0);
-   if (rc) {
-   printk(KERN_ERR "Error ad

[PATCH] ecryptfs: make show_options reflect actual mount options

2007-12-07 Thread Eric Sandeen

Change ecryptfs_show_options to reflect the actual mount
options in use.  Note that this does away with the "dir="
output, which is not a valid mount option and appears to be
unused.

Mount options such as "ecryptfs_verbose" and
"ecryptfs_xattr_metadata" are somewhat indeterminate for
a given fs, but in any case the reported mount options can
be used in a new mount command to get the same behavior.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
Acked-by: Michael Halcrow <[EMAIL PROTECTED]>

---

Index: linux/fs/ecryptfs/super.c
===
--- linux.orig/fs/ecryptfs/super.c
+++ linux/fs/ecryptfs/super.c
@@ -157,32 +157,42 @@ static void ecryptfs_clear_inode(struct 
 /**
  * ecryptfs_show_options
  *
- * Prints the directory we are currently mounted over.
- * Returns zero on success; non-zero otherwise
+ * Prints the mount options for a given superblock.
+ * Returns zero; does not fail.
  */
 static int ecryptfs_show_options(struct seq_file *m, struct vfsmount *mnt)
 {
struct super_block *sb = mnt->mnt_sb;
-   struct dentry *lower_root_dentry = ecryptfs_dentry_to_lower(sb->s_root);
-   struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(sb->s_root);
-   char *tmp_page;
-   char *path;
-   int rc = 0;
-
-   tmp_page = (char *)__get_free_page(GFP_KERNEL);
-   if (!tmp_page) {
-   rc = -ENOMEM;
-   goto out;
-   }
-   path = d_path(lower_root_dentry, lower_mnt, tmp_page, PAGE_SIZE);
-   if (IS_ERR(path)) {
-   rc = PTR_ERR(path);
-   goto out;
+   struct ecryptfs_mount_crypt_stat *mount_crypt_stat =
+   &ecryptfs_superblock_to_private(sb)->mount_crypt_stat;
+   struct ecryptfs_global_auth_tok *walker;
+
+   mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex);
+   list_for_each_entry(walker,
+   &mount_crypt_stat->global_auth_tok_list,
+   mount_crypt_stat_list) {
+   seq_printf(m, ",ecryptfs_sig=%s", walker->sig);
}
-   seq_printf(m, ",dir=%s", path);
-   free_page((unsigned long)tmp_page);
-out:
-   return rc;
+   mutex_unlock(&mount_crypt_stat->global_auth_tok_list_mutex);
+
+   /* Note this is global and probably shouldn't be a mount option */
+   if (ecryptfs_verbosity)
+   seq_printf(m, ",ecryptfs_debug=%d\n", ecryptfs_verbosity);
+
+   seq_printf(m, ",ecryptfs_cipher=%s",
+   mount_crypt_stat->global_default_cipher_name);
+
+   if (mount_crypt_stat->global_default_cipher_key_size)
+   seq_printf(m, ",ecryptfs_key_bytes=%d",
+  mount_crypt_stat->global_default_cipher_key_size);
+   if (mount_crypt_stat->flags & ECRYPTFS_PLAINTEXT_PASSTHROUGH_ENABLED)
+   seq_printf(m, ",ecryptfs_passthrough");
+   if (mount_crypt_stat->flags & ECRYPTFS_XATTR_METADATA_ENABLED)
+   seq_printf(m, ",ecryptfs_xattr_metadata");
+   if (mount_crypt_stat->flags & ECRYPTFS_ENCRYPTED_VIEW_ENABLED)
+   seq_printf(m, ",ecryptfs_encrypted_view");
+
+   return 0;
 }
 
 struct super_operations ecryptfs_sops = {



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ecryptfs: initialize new auth_tokens before teardown

2007-12-11 Thread Eric Sandeen

ecryptfs_destroy_mount_crypt_stat() checks whether each 
auth_tok->global_auth_tok_key is nonzero and if so puts that
key.  However, in some early mount error paths nothing has initialized
the pointer, and we try to key_put() garbage.  Running the bad cipher 
tests in the testsuite exposes this, and it's happy with the following 
change.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---

Index: linux/fs/ecryptfs/keystore.c
===
--- linux.orig/fs/ecryptfs/keystore.c
+++ linux/fs/ecryptfs/keystore.c
@@ -1851,7 +1851,7 @@ ecryptfs_add_global_auth_tok(struct ecry
struct ecryptfs_global_auth_tok *new_auth_tok;
int rc = 0;
 
-   new_auth_tok = kmem_cache_alloc(ecryptfs_global_auth_tok_cache,
+   new_auth_tok = kmem_cache_zalloc(ecryptfs_global_auth_tok_cache,
GFP_KERNEL);
if (!new_auth_tok) {
rc = -ENOMEM;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.24-rc8 oops ext3_clear_inode+0x25/0xa0

2008-01-19 Thread Eric Sandeen

Soeren Sonnenburg wrote:
> Dear all,
> 
> I've just got this oops (causing the machine to hang finally)...
> 
> Any ideas?
> Soeren

I've seen an awful lot of oopses out there on this path,
kswapd->shrink_icache_memory; some get a little further and oops in
ext3_discard_reservation.

A few were chalked up to bad memory, but others were not.  Do you happen
to use suspend/resume?

Thanks to kerneloops.org... :)

All code

   0:   12 11   adc(%ecx),%dl
   2:   f8  clc
   3:   ff 66 90jmp*0xff90(%esi)
   6:   83 ec 0csub$0xc,%esp
   9:   89 1c 24mov%ebx,(%esp)
   c:   8d 98 60 ff ff ff   lea0xff60(%eax),%ebx
  12:   89 74 24 04 mov%esi,0x4(%esp)
  16:   89 c6   mov%eax,%esi
  18:   89 7c 24 08 mov%edi,0x8(%esp)
  1c:   8b 53 70mov0x70(%ebx),%edx
  1f:   8b 7b 54mov0x54(%ebx),%edi
  22:   85 d2   test   %edx,%edx
  24:   74 16   je 0x3c
  26:   83 fa ffcmp$0x,%edx
  29:   74 11   je 0x3c
  2b:*  f0 ff 0alock decl (%edx) <-- trapping instruction

Looks like it blew up in (inlined) posix_acl_release(), I think
EXT3_I(inode)->i_acl passed to it was 66e88e66, in %edx.

I think %edi is the i_block_alloc_info, 0f01c883, which also looks
crunchy.  Use after free perhaps?

> BUG: unable to handle kernel paging request at virtual address 66e88e66

Nice symmetric number, anyway.  :)

I've seen enough of these now, something real seems to be going on but I
don't know what yet.

-Eric

> printing eip: c01fac85 *pde =  
> Oops: 0002 [#1] PREEMPT SMP 
> Modules linked in: hci_usb hidp rfcomm l2cap bluetooth tun cpufreq_stats 
> coretemp xfrm_user xfrm4_tunnel tunnel4 ipcomp esp4 ah4 aes_generic hfsplus 
> binfmt_misc fuse ebtable_broute bridge llc ebtable_nat ebtable_filter 
> ebtables eeprom applesmc hwmon input_polldev snd_hda_intel snd_pcm_oss 
> snd_mixer_oss snd_pcm snd_timer appletouch evdev i2c_i801 snd soundcore 
> snd_page_alloc sky2 video intel_agp output agpgart
> 
> Pid: 205, comm: kswapd0 Not tainted (2.6.24-rc8-sonne #7)
> EIP: 0060:[] EFLAGS: 00010213 CPU: 1
> EIP is at ext3_clear_inode+0x25/0xa0
> EAX: c008f0a0 EBX: c008f000 ECX:  EDX: 66e88e66
> ESI: c008f0a0 EDI: 0f01c883 EBP: 004d ESP: f7d29ebc
>  DS: 007b ES: 007b FS: 00d8 GS:  SS: 0068
> Process kswapd0 (pid: 205, ti=f7d28000 task=f7fdf540 task.ti=f7d28000)
> Stack: c008f0a0  f7d29ef8 c0192d62 004d c008f0a0 c008f0a8 
> c019309a 
>e98b9ac8 0080 0080 f7d29ef8 c01932ec  0080 
> c008f2b0 
>ea1bdcd8 0002d438 013f c04ac24c 00d0 c0166e4c 2e0b 
>  
> Call Trace:
>  [] clear_inode+0x62/0x140
>  [] dispose_list+0x1a/0xe0
>  [] shrink_icache_memory+0x18c/0x250
>  [] shrink_slab+0x12c/0x1a0
>  [] kswapd+0x32d/0x4d0
>  [] autoremove_wake_function+0x0/0x40
>  [] complete+0x3d/0x60
>  [] kswapd+0x0/0x4d0
>  [] kthread+0x42/0x70
>  [] kthread+0x0/0x70
>  [] kernel_thread_helper+0x7/0x14
>  ===
> Code: 12 11 f8 ff 66 90 83 ec 0c 89 1c 24 8d 98 60 ff ff ff 89 74 24 04 89 c6 
> 89 7c 24 08 8b 53 70 8b 7b 54 85 d2 74 16 83 fa ff 74 11  ff 0a 0f 94 c0 
> 84 c0 75 51 c7 43 70 ff ff ff ff 8b 53 74 85 
> EIP: [] ext3_clear_inode+0x25/0xa0 SS:ESP 0068:f7d29ebc
> ---[ end trace 8dd028de7ae6e34e ]---
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] 2.4: fix memory corruption from misinterpreted bad_inode_ops return values

2008-01-24 Thread Eric Sandeen

Willy Tarreau wrote:
> Hi Dann,
> 
> On Wed, Jan 23, 2008 at 11:12:12PM -0700, dann frazier wrote:
>> This is a 2.4 backport of a linux-2.6 change by Eric Sandeen
>> (commit be6aab0e9fa6d3c6d75aa1e38ac972d8b4ee82b8)
>>
>> CVE-2006-5753 was assigned for this issue.
>>
>> I've built and boot-tested this, but I'm not sure how to exercise
>> these codepaths.
> 
> I have no idea either. Let's consider that if nobody on the list knows
> how to do so, I'll merge it since you did not notice any regression.
> 
> Thanks,
> Willy
> 

Sorry... here you go.  Forgot to post this sooner.  I hit it with
this on 2.6.x


#include 
#include 
#include 

static int return_EIO(void)
{
return -EIO;
}

int main(int argc, char ** argv)
{
ssize_t error;
ssize_t realerror = -EIO;
ssize_t (*fn_ptr)(void);

fn_ptr = (void *)return_EIO;

error = (ssize_t)fn_ptr();
printf("and... error is %ld, should be %ld\n", error, realerror);
return 0;
}

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] 2.4: fix memory corruption from misinterpreted bad_inode_ops return values

2008-01-24 Thread Eric Sandeen

dann frazier wrote:

> Thanks Eric. Sounds like my comment about exercising these code paths
> wasn't too clear - the comments with your patch do make the issue
> clear, and this program demonstrates the void cast promotion issue
> well. I'm just not sure of a good way to demonstrate that my backport
> of this patch doesn't break anything for 2.4.

Ugh, no, that was my fault, I blindly copied & pasted something from a
bug which I thought was a testcase, but isn't.  Sorry!

I originally saw this problem on an fsfuzzed filesystem; I don't think I
still have that image around, though.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ecryptfs: check for existing key_tfm at mount time

2007-12-20 Thread Eric Sandeen

Jeff Moyer pointed out that a mount; umount loop of ecryptfs,
with the same cipher & other mount options, created a new 
ecryptfs_key_tfm_cache item each time, and the cache could
grow quite large this way.

Looking at this with mhalcrow, we saw that ecryptfs_parse_options()
unconditionally called ecryptfs_add_new_key_tfm(), which is what
was adding these items.

Refactor ecryptfs_get_tfm_and_mutex_for_cipher_name() to create a 
new helper function, ecryptfs_tfm_exists(), which checks for the 
cipher on the cached key_tfm_list, and sets a pointer
to it if it exists.  This can then be called from 
ecryptfs_parse_options(), and new key_tfm's can be added only when
a cached one is not found.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
---

Index: linux-2.6.24-rc3/fs/ecryptfs/crypto.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/crypto.c
+++ linux-2.6.24-rc3/fs/ecryptfs/crypto.c
@@ -1868,6 +1868,33 @@ out:
return rc;
 }
 
+/**
+ * ecryptfs_tfm_exists - Search for existing tfm for cipher_name.
+ * @cipher_name: the name of the cipher to search for
+ * @key_tfm: set to corresponding tfm if found
+ *
+ * Returns 1 if found, with key_tfm set
+ * Returns 0 if not found, key_tfm set to NULL
+ */
+int ecryptfs_tfm_exists(char *cipher_name, struct ecryptfs_key_tfm **key_tfm)
+{
+   struct ecryptfs_key_tfm *tmp_key_tfm;
+
+   mutex_lock(&key_tfm_list_mutex);
+   list_for_each_entry(tmp_key_tfm, &key_tfm_list, key_tfm_list) {
+   if (strcmp(tmp_key_tfm->cipher_name, cipher_name) == 0) {
+   mutex_unlock(&key_tfm_list_mutex);
+   if (key_tfm)
+   (*key_tfm) = tmp_key_tfm;
+   return 1;
+   }
+   }
+   mutex_unlock(&key_tfm_list_mutex);
+   if (key_tfm)
+   (*key_tfm) = NULL;
+   return 0;
+}
+
 int ecryptfs_get_tfm_and_mutex_for_cipher_name(struct crypto_blkcipher **tfm,
   struct mutex **tfm_mutex,
   char *cipher_name)
@@ -1877,22 +1904,15 @@ int ecryptfs_get_tfm_and_mutex_for_ciphe
 
(*tfm) = NULL;
(*tfm_mutex) = NULL;
-   mutex_lock(&key_tfm_list_mutex);
-   list_for_each_entry(key_tfm, &key_tfm_list, key_tfm_list) {
-   if (strcmp(key_tfm->cipher_name, cipher_name) == 0) {
-   (*tfm) = key_tfm->key_tfm;
-   (*tfm_mutex) = &key_tfm->key_tfm_mutex;
-   mutex_unlock(&key_tfm_list_mutex);
+
+   if (!ecryptfs_tfm_exists(cipher_name, &key_tfm)) {
+   rc = ecryptfs_add_new_key_tfm(&key_tfm, cipher_name, 0);
+   if (rc) {
+   printk(KERN_ERR "Error adding new key_tfm to list; "
+   "rc = [%d]\n", rc);
goto out;
}
}
-   mutex_unlock(&key_tfm_list_mutex);
-   rc = ecryptfs_add_new_key_tfm(&key_tfm, cipher_name, 0);
-   if (rc) {
-   printk(KERN_ERR "Error adding new key_tfm to list; rc = [%d]\n",
-  rc);
-   goto out;
-   }
(*tfm) = key_tfm->key_tfm;
(*tfm_mutex) = &key_tfm->key_tfm_mutex;
 out:
Index: linux-2.6.24-rc3/fs/ecryptfs/ecryptfs_kernel.h
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/ecryptfs_kernel.h
+++ linux-2.6.24-rc3/fs/ecryptfs/ecryptfs_kernel.h
@@ -623,6 +623,7 @@ ecryptfs_add_new_key_tfm(struct ecryptfs
 size_t key_size);
 int ecryptfs_init_crypto(void);
 int ecryptfs_destroy_crypto(void);
+int ecryptfs_tfm_exists(char *cipher_name, struct ecryptfs_key_tfm **key_tfm);
 int ecryptfs_get_tfm_and_mutex_for_cipher_name(struct crypto_blkcipher **tfm,
   struct mutex **tfm_mutex,
   char *cipher_name);
Index: linux-2.6.24-rc3/fs/ecryptfs/main.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/main.c
+++ linux-2.6.24-rc3/fs/ecryptfs/main.c
@@ -410,9 +410,11 @@ static int ecryptfs_parse_options(struct
if (!cipher_key_bytes_set) {
mount_crypt_stat->global_default_cipher_key_size = 0;
}
-   rc = ecryptfs_add_new_key_tfm(
-   NULL, mount_crypt_stat->global_default_cipher_name,
-   mount_crypt_stat->global_default_cipher_key_size);
+   if (!ecryptfs_tfm_exists(mount_crypt_stat->global_default_cipher_name,
+NULL))
+   rc = ecryptfs_add_new_key_tfm(
+   NULL, mount_crypt_stat->global_default_cipher_name,
+   mount_cry

[PATCH] (UPDATED) ecryptfs: check for existing key_tfm at mount time

2007-12-22 Thread Eric Sandeen

Andrew Morton wrote:

> It would all look a lot more solid if this locking was retained and both
> ecryptfs_tfm_exists() and ecryptfs_add_new_key_tfm() were designed to be
> called under key_tfm_list_mutex.

Hmm good point, sorry for missing that.  How's this look?

===

Jeff Moyer pointed out that a mount; umount loop of ecryptfs,
with the same cipher & other mount options, created a new 
ecryptfs_key_tfm_cache item each time, and the cache could
grow quite large this way.

Looking at this with mhalcrow, we saw that ecryptfs_parse_options()
unconditionally called ecryptfs_add_new_key_tfm(), which is what
was adding these items.

Refactor ecryptfs_get_tfm_and_mutex_for_cipher_name() to create a 
new helper function, ecryptfs_tfm_exists(), which checks for the 
cipher on the cached key_tfm_list, and sets a pointer
to it if it exists.  This can then be called from 
ecryptfs_parse_options(), and new key_tfm's can be added only when
a cached one is not found.

With list locking changes suggested by akpm.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
---



Index: linux-2.6.24-rc3/fs/ecryptfs/crypto.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/crypto.c
+++ linux-2.6.24-rc3/fs/ecryptfs/crypto.c
@@ -1812,6 +1812,11 @@ int ecryptfs_init_crypto(void)
return 0;
 }
 
+/**
+ * ecryptfs_destroy_crypto - free all cached key_tfms on key_tfm_list
+ *
+ * Called only at module unload time
+ */
 int ecryptfs_destroy_crypto(void)
 {
struct ecryptfs_key_tfm *key_tfm, *key_tfm_tmp;
@@ -1835,6 +1840,8 @@ ecryptfs_add_new_key_tfm(struct ecryptfs
struct ecryptfs_key_tfm *tmp_tfm;
int rc = 0;
 
+   BUG_ON(!mutex_is_locked(&key_tfm_list_mutex));
+
tmp_tfm = kmem_cache_alloc(ecryptfs_key_tfm_cache, GFP_KERNEL);
if (key_tfm != NULL)
(*key_tfm) = tmp_tfm;
@@ -1861,13 +1868,51 @@ ecryptfs_add_new_key_tfm(struct ecryptfs
(*key_tfm) = NULL;
goto out;
}
-   mutex_lock(&key_tfm_list_mutex);
list_add(&tmp_tfm->key_tfm_list, &key_tfm_list);
-   mutex_unlock(&key_tfm_list_mutex);
 out:
return rc;
 }
 
+/**
+ * ecryptfs_tfm_exists - Search for existing tfm for cipher_name.
+ * @cipher_name: the name of the cipher to search for
+ * @key_tfm: set to corresponding tfm if found
+ *
+ * Searches for cached key_tfm matching @cipher_name
+ * Must be called with &key_tfm_list_mutex held
+ * Returns 1 if found, with @key_tfm set
+ * Returns 0 if not found, with @key_tfm set to NULL
+ */
+int ecryptfs_tfm_exists(char *cipher_name, struct ecryptfs_key_tfm **key_tfm)
+{
+   struct ecryptfs_key_tfm *tmp_key_tfm;
+
+   BUG_ON(!mutex_is_locked(&key_tfm_list_mutex));
+
+   list_for_each_entry(tmp_key_tfm, &key_tfm_list, key_tfm_list) {
+   if (strcmp(tmp_key_tfm->cipher_name, cipher_name) == 0) {
+   mutex_unlock(&key_tfm_list_mutex);
+   if (key_tfm)
+   (*key_tfm) = tmp_key_tfm;
+   return 1;
+   }
+   }
+   if (key_tfm)
+   (*key_tfm) = NULL;
+   return 0;
+}
+
+/**
+ * ecryptfs_get_tfm_and_mutex_for_cipher_name
+ *
+ * @tfm: set to cached tfm found, or new tfm created
+ * @tfm_mutex: set to mutex for cached tfm found, or new tfm created
+ * @cipher_name: the name of the cipher to search for and/or add
+ *
+ * Sets pointers to @tfm & @tfm_mutex matching @cipher_name.
+ * Searches for cached item first, and creates new if not found.
+ * Returns 0 on success, non-zero if adding new cipher failed
+ */
 int ecryptfs_get_tfm_and_mutex_for_cipher_name(struct crypto_blkcipher **tfm,
   struct mutex **tfm_mutex,
   char *cipher_name)
@@ -1877,22 +1922,17 @@ int ecryptfs_get_tfm_and_mutex_for_ciphe
 
(*tfm) = NULL;
(*tfm_mutex) = NULL;
+
mutex_lock(&key_tfm_list_mutex);
-   list_for_each_entry(key_tfm, &key_tfm_list, key_tfm_list) {
-   if (strcmp(key_tfm->cipher_name, cipher_name) == 0) {
-   (*tfm) = key_tfm->key_tfm;
-   (*tfm_mutex) = &key_tfm->key_tfm_mutex;
-   mutex_unlock(&key_tfm_list_mutex);
+   if (!ecryptfs_tfm_exists(cipher_name, &key_tfm)) {
+   rc = ecryptfs_add_new_key_tfm(&key_tfm, cipher_name, 0);
+   if (rc) {
+   printk(KERN_ERR "Error adding new key_tfm to list; "
+   "rc = [%d]\n", rc);
goto out;
}
}
mutex_unlock(&key_tfm_list_mutex);
-   rc = ecryptfs_add_new_key_tfm(&key_tfm, cipher_name, 0);
-   if (rc)

Re: [PATCH] (UPDATED) ecryptfs: check for existing key_tfm at mount time

2007-12-22 Thread Eric Sandeen

Andrew Morton wrote:
> On Sat, 22 Dec 2007 11:42:37 -0600 Eric Sandeen <[EMAIL PROTECTED]> wrote:
> 
>> Andrew Morton wrote:
>>
>>> It would all look a lot more solid if this locking was retained and both
>>> ecryptfs_tfm_exists() and ecryptfs_add_new_key_tfm() were designed to be
>>> called under key_tfm_list_mutex.
>> Hmm good point, sorry for missing that.  How's this look?
> 
> key_tfm_list_mutex gets used in fs/ecryptfs/main.c but it is static in
> fs/ecryptfs/crypto.c
> 

Ah crud that was a bunk-ism in -mm that I missed.

I'll send another updated patch soon.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] (UPDATED2) ecryptfs: check for existing key_tfm at mount time

2007-12-23 Thread Eric Sandeen

Jeff Moyer pointed out that a mount; umount loop of ecryptfs,
with the same cipher & other mount options, created a new 
ecryptfs_key_tfm_cache item each time, and the cache could
grow quite large this way.

Looking at this with mhalcrow, we saw that ecryptfs_parse_options()
unconditionally called ecryptfs_add_new_key_tfm(), which is what
was adding these items.

Refactor ecryptfs_get_tfm_and_mutex_for_cipher_name() to create a 
new helper function, ecryptfs_tfm_exists(), which checks for the 
cipher on the cached key_tfm_list, and sets a pointer
to it if it exists.  This can then be called from 
ecryptfs_parse_options(), and new key_tfm's can be added only when
a cached one is not found.

With list locking changes suggested by akpm.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
---

Index: linux-2.6.24-rc3/fs/ecryptfs/crypto.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/crypto.c
+++ linux-2.6.24-rc3/fs/ecryptfs/crypto.c
@@ -1778,7 +1778,7 @@ out:
 
 struct kmem_cache *ecryptfs_key_tfm_cache;
 static struct list_head key_tfm_list;
-static struct mutex key_tfm_list_mutex;
+struct mutex key_tfm_list_mutex;
 
 int ecryptfs_init_crypto(void)
 {
@@ -1787,6 +1787,11 @@ int ecryptfs_init_crypto(void)
return 0;
 }
 
+/**
+ * ecryptfs_destroy_crypto - free all cached key_tfms on key_tfm_list
+ *
+ * Called only at module unload time
+ */
 int ecryptfs_destroy_crypto(void)
 {
struct ecryptfs_key_tfm *key_tfm, *key_tfm_tmp;
@@ -1810,6 +1815,8 @@ ecryptfs_add_new_key_tfm(struct ecryptfs
struct ecryptfs_key_tfm *tmp_tfm;
int rc = 0;
 
+   BUG_ON(!mutex_is_locked(&key_tfm_list_mutex));
+
tmp_tfm = kmem_cache_alloc(ecryptfs_key_tfm_cache, GFP_KERNEL);
if (key_tfm != NULL)
(*key_tfm) = tmp_tfm;
@@ -1835,13 +1842,51 @@ ecryptfs_add_new_key_tfm(struct ecryptfs
(*key_tfm) = NULL;
goto out;
}
-   mutex_lock(&key_tfm_list_mutex);
list_add(&tmp_tfm->key_tfm_list, &key_tfm_list);
-   mutex_unlock(&key_tfm_list_mutex);
 out:
return rc;
 }
 
+/**
+ * ecryptfs_tfm_exists - Search for existing tfm for cipher_name.
+ * @cipher_name: the name of the cipher to search for
+ * @key_tfm: set to corresponding tfm if found
+ *
+ * Searches for cached key_tfm matching @cipher_name
+ * Must be called with &key_tfm_list_mutex held
+ * Returns 1 if found, with @key_tfm set
+ * Returns 0 if not found, with @key_tfm set to NULL
+ */
+int ecryptfs_tfm_exists(char *cipher_name, struct ecryptfs_key_tfm **key_tfm)
+{
+   struct ecryptfs_key_tfm *tmp_key_tfm;
+
+   BUG_ON(!mutex_is_locked(&key_tfm_list_mutex));
+
+   list_for_each_entry(tmp_key_tfm, &key_tfm_list, key_tfm_list) {
+   if (strcmp(tmp_key_tfm->cipher_name, cipher_name) == 0) {
+   mutex_unlock(&key_tfm_list_mutex);
+   if (key_tfm)
+   (*key_tfm) = tmp_key_tfm;
+   return 1;
+   }
+   }
+   if (key_tfm)
+   (*key_tfm) = NULL;
+   return 0;
+}
+
+/**
+ * ecryptfs_get_tfm_and_mutex_for_cipher_name
+ *
+ * @tfm: set to cached tfm found, or new tfm created
+ * @tfm_mutex: set to mutex for cached tfm found, or new tfm created
+ * @cipher_name: the name of the cipher to search for and/or add
+ *
+ * Sets pointers to @tfm & @tfm_mutex matching @cipher_name.
+ * Searches for cached item first, and creates new if not found.
+ * Returns 0 on success, non-zero if adding new cipher failed
+ */
 int ecryptfs_get_tfm_and_mutex_for_cipher_name(struct crypto_blkcipher **tfm,
   struct mutex **tfm_mutex,
   char *cipher_name)
@@ -1851,22 +1896,17 @@ int ecryptfs_get_tfm_and_mutex_for_ciphe
 
(*tfm) = NULL;
(*tfm_mutex) = NULL;
+
mutex_lock(&key_tfm_list_mutex);
-   list_for_each_entry(key_tfm, &key_tfm_list, key_tfm_list) {
-   if (strcmp(key_tfm->cipher_name, cipher_name) == 0) {
-   (*tfm) = key_tfm->key_tfm;
-   (*tfm_mutex) = &key_tfm->key_tfm_mutex;
-   mutex_unlock(&key_tfm_list_mutex);
+   if (!ecryptfs_tfm_exists(cipher_name, &key_tfm)) {
+   rc = ecryptfs_add_new_key_tfm(&key_tfm, cipher_name, 0);
+   if (rc) {
+   printk(KERN_ERR "Error adding new key_tfm to list; "
+   "rc = [%d]\n", rc);
goto out;
}
}
mutex_unlock(&key_tfm_list_mutex);
-   rc = ecryptfs_add_new_key_tfm(&key_tfm, cipher_name, 0);
-   if (rc) {
-   printk(KERN_ERR "Error adding new key_tfm to list; rc =

Re: [PATCH] UPDATED: hfs: handle more on-disk corruptions without oopsing

2007-12-23 Thread Eric Sandeen

Roman Zippel wrote:
> Hi,
> 
> On Thursday 20 December 2007, Eric Sandeen wrote:
> 
>> Index: linux-2.6.24-rc3/fs/hfs/brec.c
>> ===
>> --- linux-2.6.24-rc3.orig/fs/hfs/brec.c
>> +++ linux-2.6.24-rc3/fs/hfs/brec.c
>> @@ -44,10 +44,21 @@ u16 hfs_brec_keylen(struct hfs_bnode *no
>>  recoff = hfs_bnode_read_u16(node, node->tree->node_size - (rec 
>> + 1) *
>> 2); if (!recoff)
>>  return 0;
>> -if (node->tree->attributes & HFS_TREE_BIGKEYS)
>> +if (node->tree->attributes & HFS_TREE_BIGKEYS) {
>>  retval = hfs_bnode_read_u16(node, recoff) + 2;
>> -else
>> +if (retval > node->tree->max_key_len + 2) {
>> +printk(KERN_ERR "hfs: keylen %d too large\n",
>> +retval);
>> +retval = HFS_BAD_KEYLEN;
>> +}
>> +} else {
>>  retval = (hfs_bnode_read_u8(node, recoff) | 1) + 1;
>> +if (retval > node->tree->max_key_len + 1) {
>> +printk(KERN_ERR "hfs: keylen %d too large\n",
>> +retval);
>> +retval = HFS_BAD_KEYLEN;
>> +}
>> +}
>>  }
>>  return retval;
>>  }
> 
> You can reuse 0 as failure value, a key has to be of nonzero size.

Ok.  Based on the other 0 returns I wasn't sure if they were considered
real errors or not... but also ISTR I ran into problems with a simple 0
return; I probably just to be sure need the callers check for it.

>> Index: linux-2.6.24-rc3/fs/hfs/btree.c
>> ===
>> --- linux-2.6.24-rc3.orig/fs/hfs/btree.c
>> +++ linux-2.6.24-rc3/fs/hfs/btree.c
>> @@ -81,6 +81,17 @@ struct hfs_btree *hfs_btree_open(struct
>>  goto fail_page;
>>  if (!tree->node_count)
>>  goto fail_page;
>> +if ((id == HFS_EXT_CNID) && (tree->max_key_len != HFS_MAX_EXT_KEYLEN)) {
>> +printk(KERN_ERR "hfs: invalid extent max_key_len %d\n",
>> +tree->max_key_len);
>> +goto fail_page;
>> +}
>> +if ((id == HFS_CAT_CNID) && (tree->max_key_len != HFS_MAX_CAT_KEYLEN)) {
>> +printk(KERN_ERR "hfs: invalid catalog max_key_len %d\n",
>> +tree->max_key_len);
>> +goto fail_page;
>> +}
>> +
>>  tree->node_size_shift = ffs(size) - 1;
>>  tree->pages_per_bnode = (tree->node_size + PAGE_CACHE_SIZE - 1) >>
>> PAGE_CACHE_SHIFT;
>>
> 
> I'd prefer a switch statement here.

Ok, I'd thought about doing it that way... :)

> It would be nice if you could do the same changes for hfsplus, so both stay 
> in 
> sync.

Yep, wanted to first see if it'd fly for HFS...

Thanks for the feedback,
-Eric

> Thanks.
> 
> bye, Roman

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [xfs-masters] [Patch 7/8] FS: Remove 'TOPDIR' from Makefiles

2008-01-01 Thread Eric Sandeen

WANG Cong wrote:
> TOPDIR is obsolete, use objtree instead.
> This patch removes TOPDIR from all fs/ Makefiles.

> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 49e3e7e..d1d3d49 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -1 +1 @@
> -include $(TOPDIR)/fs/xfs/Makefile-linux-$(VERSION).$(PATCHLEVEL)
> +include $(objtree)/fs/xfs/Makefile-linux-$(VERSION).$(PATCHLEVEL)

FWIW $(TOPDIR) is already banished from the latest xfs build code:

http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/Makefile

and the patch is in -mm too via git:

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc6/2.6.24-rc6-mm1/broken-out/git-xfs.patch

Thanks,

-Eric

p.s. what is $(objtree) exactly?  I don't see it mentioned in
Documentation/kbuild except as one line in an example... I thought
$(obj) and $(src) should be used outside of the core kbuild
infrastructure, and in this case wouldn't it be $(src) anyway?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] [KBUILD] fix external module install path

2008-01-01 Thread Eric Sandeen

Sam Ravnborg wrote:

> Hi Eric.
> 
> Took a look at this now and fixed it like this.
> Let me know if you see other issues.
> 
> [I know this is more than 6 months ago you reported it - sorry]
> 
>   Sam

Wow, blast from the past; if you'd done it a bit earlier it would have
been in the same year at least!  *grin*

Anyway, thanks for fixing!  This just bit me the other day, actually, so
will be nice to have fixed.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] UPDATED2: hfs: handle more on-disk corruptions without oopsing

2008-01-02 Thread Eric Sandeen

Roman, with this on top does it look better to you?
I'll get hfsplus done in a bit.

Thanks,
-Eric

-

Fix up previous hfs fsfuzzer patch to address Roman's comments.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---

diff -u linux-2.6.24-rc3/fs/hfs/brec.c linux-2.6.24-rc3/fs/hfs/brec.c
--- linux-2.6.24-rc3/fs/hfs/brec.c
+++ linux-2.6.24-rc3/fs/hfs/brec.c
@@ -49,14 +49,14 @@
if (retval > node->tree->max_key_len + 2) {
printk(KERN_ERR "hfs: keylen %d too large\n",
retval);
-   retval = HFS_BAD_KEYLEN;
+   retval = 0;
}
} else {
retval = (hfs_bnode_read_u8(node, recoff) | 1) + 1;
if (retval > node->tree->max_key_len + 1) {
printk(KERN_ERR "hfs: keylen %d too large\n",
retval);
-   retval = HFS_BAD_KEYLEN;
+   retval = 0;
}
}
}
diff -u linux-2.6.24-rc3/fs/hfs/bfind.c linux-2.6.24-rc3/fs/hfs/bfind.c
--- linux-2.6.24-rc3/fs/hfs/bfind.c
+++ linux-2.6.24-rc3/fs/hfs/bfind.c
@@ -52,7 +52,7 @@
rec = (e + b) / 2;
len = hfs_brec_lenoff(bnode, rec, &off);
keylen = hfs_brec_keylen(bnode, rec);
-   if (keylen == HFS_BAD_KEYLEN) {
+   if (keylen == 0) {
res = -EINVAL;
goto fail;
}
@@ -71,7 +71,7 @@
if (rec != e && e >= 0) {
len = hfs_brec_lenoff(bnode, e, &off);
keylen = hfs_brec_keylen(bnode, e);
-   if (keylen == HFS_BAD_KEYLEN) {
+   if (keylen == 0) {
res = -EINVAL;
goto fail;
}
@@ -207,7 +207,7 @@
 
len = hfs_brec_lenoff(bnode, fd->record, &off);
keylen = hfs_brec_keylen(bnode, fd->record);
-   if (keylen == HFS_BAD_KEYLEN) {
+   if (keylen == 0) {
res = -EINVAL;
goto out;
}
diff -u linux-2.6.24-rc3/fs/hfs/hfs.h linux-2.6.24-rc3/fs/hfs/hfs.h
--- linux-2.6.24-rc3/fs/hfs/hfs.h
+++ linux-2.6.24-rc3/fs/hfs/hfs.h
@@ -28,8 +28,6 @@
 #define HFS_MAX_NAMELEN128
 #define HFS_MAX_VALENCE32767U
 
-#define HFS_BAD_KEYLEN 0xFF
-
 /* Meanings of the drAtrb field of the MDB,
  * Reference: _Inside Macintosh: Files_ p. 2-61
  */
diff -u linux-2.6.24-rc3/fs/hfs/btree.c linux-2.6.24-rc3/fs/hfs/btree.c
--- linux-2.6.24-rc3/fs/hfs/btree.c
+++ linux-2.6.24-rc3/fs/hfs/btree.c
@@ -81,15 +81,23 @@
goto fail_page;
if (!tree->node_count)
goto fail_page;
-   if ((id == HFS_EXT_CNID) && (tree->max_key_len != HFS_MAX_EXT_KEYLEN)) {
-   printk(KERN_ERR "hfs: invalid extent max_key_len %d\n",
-   tree->max_key_len);
-   goto fail_page;
-   }
-   if ((id == HFS_CAT_CNID) && (tree->max_key_len != HFS_MAX_CAT_KEYLEN)) {
-   printk(KERN_ERR "hfs: invalid catalog max_key_len %d\n",
-   tree->max_key_len);
-   goto fail_page;
+   switch(id) {
+   case HFS_EXT_CNID:
+   if (tree->max_key_len != HFS_MAX_EXT_KEYLEN) {
+   printk(KERN_ERR "hfs: invalid extent max_key_len %d\n",
+   tree->max_key_len);
+   goto fail_page;
+   }
+   break;
+   case HFS_CAT_CNID:
+   if (tree->max_key_len != HFS_MAX_CAT_KEYLEN) {
+   printk(KERN_ERR "hfs: invalid catalog max_key_len %d\n",
+   tree->max_key_len);
+   goto fail_page;
+   }
+   break;
+   default:
+   BUG();
}
 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6.24 patch] let EXT4DEV_FS depend on BROKEN

2008-01-02 Thread Eric Sandeen

Adrian Bunk wrote:
> Most people and all distributions use CONFIG_EXPERIMENTAL=y simply 
> because too many options (including options required for hardware 
> support) depend on it.
> 
> Compare e.g.:
> - "Marvell SATA support (HIGHLY EXPERIMENTAL)"
> - "Provide NFSv4 client support (EXPERIMENTAL)"
> - "Ext4dev/ext4 extended fs support development (EXPERIMENTAL)"

   tristate "Snapshot target (EXPERIMENTAL)"
   depends on BLK_DEV_DM && EXPERIMENTAL

   tristate "Mirror target (EXPERIMENTAL)"
   depends on BLK_DEV_DM && EXPERIMENTAL

...

It does seem that it might be a good goal to revisit options marked
EXPERIMENTAL, and see if they still should be marked as such, rather
than removing the option altogether.

init/Kconfig describes things in "EXPERIMENTAL" as "alpha-test" - I bet
there are a few things which have moved beyond this, but are still
marked as such.

-Eric


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-11-30 Thread Eric Sandeen

Mingming Cao wrote:
> [PATCH] jbd2 stats through procfs
> 
> The patch below updates the jbd stats patch to 2.6.20/jbd2.
> The initial patch was posted by Alex Tomas in December 2005
> (http://marc.info/?l=linux-ext4&m=113538565128617&w=2).
> It provides statistics via procfs such as transaction lifetime and size.
> 
> [ This probably should be rewritten to use debugfs?   -- Ted]
> 
> Signed-off-by: Johann Lombardi <[EMAIL PROTECTED]>

I've started going through this one to clean it up to the point where it
can go forward.  It's been sitting at the top of the unstable portion of
the patch queue for long enough, I think :)

I've converted the msecs to jiffies until the user boundary, changed the
union #defines as suggested by Andrew, and various other little issues etc.

Remaining to do is a generic time-difference calculator (instead of
jbd2_time_diff), and looking into whether it should be made a config
option; I tend to think it should, but it's fairly well sprinkled
through the code, so I'll see how well that works.

Also did we ever decided if this should go to debugfs?

Thanks,

-Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] fix up ext2_fs.h for userspace after reservations backport

2007-11-26 Thread Eric Sandeen

From: Tobias Poschwatta <[EMAIL PROTECTED]>

In commit a686cd898bd999fd026a51e90fb0a3410d258ddb:

 "Val's cross-port of the ext3 reservations code into ext2."

include/linux/ext2_fs.h got a new function whose return value is only
defined if __KERNEL__ is defined. Putting #ifdef __KERNEL__ around the
function seems to help, patch below.

BR, Tobias

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---

Index: linux-2.6.24-rc1/include/linux/ext2_fs.h
===
--- linux-2.6.24-rc1.orig/include/linux/ext2_fs.h
+++ linux-2.6.24-rc1/include/linux/ext2_fs.h
@@ -563,11 +563,13 @@ enum {
 ~EXT2_DIR_ROUND)
 #define EXT2_MAX_REC_LEN   ((1<<16)-1)
 
+#ifdef __KERNEL__
 static inline ext2_fsblk_t
 ext2_group_first_block_no(struct super_block *sb, unsigned long group_no)
 {
return group_no * (ext2_fsblk_t)EXT2_BLOCKS_PER_GROUP(sb) +
le32_to_cpu(EXT2_SB(sb)->s_es->s_first_data_block);
 }
+#endif
 
 #endif /* _LINUX_EXT2_FS_H */

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] fix up ext2_fs.h for userspace after reservations backport

2007-11-27 Thread Eric Sandeen

Andrew Morton wrote:

> I did this instead:

(moved offending function from ext2_fs.h to ext2.h)

Look like a better plan, thanks.

-Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] address hfs on-disk corruption robustness review comments

2008-01-14 Thread Eric Sandeen

Address Roman's review comments for the previously sent on-disk
corruption hfs robustness patch.

I still owe a patch for hfsplus.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---



Index: linux-2.6.24-rc6-mm1/fs/hfs/bfind.c
===
--- linux-2.6.24-rc6-mm1.orig/fs/hfs/bfind.c
+++ linux-2.6.24-rc6-mm1/fs/hfs/bfind.c
@@ -52,9 +52,9 @@ int __hfs_brec_find(struct hfs_bnode *bn
rec = (e + b) / 2;
len = hfs_brec_lenoff(bnode, rec, &off);
keylen = hfs_brec_keylen(bnode, rec);
-   if (keylen == HFS_BAD_KEYLEN) {
+   if (keylen == 0) {
res = -EINVAL;
-   goto done;
+   goto fail;
}
hfs_bnode_read(bnode, fd->key, off, keylen);
cmpval = bnode->tree->keycmp(fd->key, fd->search_key);
@@ -71,9 +71,9 @@ int __hfs_brec_find(struct hfs_bnode *bn
if (rec != e && e >= 0) {
len = hfs_brec_lenoff(bnode, e, &off);
keylen = hfs_brec_keylen(bnode, e);
-   if (keylen == HFS_BAD_KEYLEN) {
+   if (keylen == 0) {
res = -EINVAL;
-   goto done;
+   goto fail;
}
hfs_bnode_read(bnode, fd->key, off, keylen);
}
@@ -83,6 +83,7 @@ done:
fd->keylength = keylen;
fd->entryoffset = off + keylen;
fd->entrylength = len - keylen;
+fail:
return res;
 }
 
@@ -206,7 +207,7 @@ int hfs_brec_goto(struct hfs_find_data *
 
len = hfs_brec_lenoff(bnode, fd->record, &off);
keylen = hfs_brec_keylen(bnode, fd->record);
-   if (keylen == HFS_BAD_KEYLEN) {
+   if (keylen == 0) {
res = -EINVAL;
goto out;
}
Index: linux-2.6.24-rc6-mm1/fs/hfs/brec.c
===
--- linux-2.6.24-rc6-mm1.orig/fs/hfs/brec.c
+++ linux-2.6.24-rc6-mm1/fs/hfs/brec.c
@@ -49,14 +49,14 @@ u16 hfs_brec_keylen(struct hfs_bnode *no
if (retval > node->tree->max_key_len + 2) {
printk(KERN_ERR "hfs: keylen %d too large\n",
retval);
-   retval = HFS_BAD_KEYLEN;
+   retval = 0;
}
} else {
retval = (hfs_bnode_read_u8(node, recoff) | 1) + 1;
if (retval > node->tree->max_key_len + 1) {
printk(KERN_ERR "hfs: keylen %d too large\n",
retval);
-   retval = HFS_BAD_KEYLEN;
+   retval = 0;
}
}
}
Index: linux-2.6.24-rc6-mm1/fs/hfs/btree.c
===
--- linux-2.6.24-rc6-mm1.orig/fs/hfs/btree.c
+++ linux-2.6.24-rc6-mm1/fs/hfs/btree.c
@@ -81,15 +81,23 @@ struct hfs_btree *hfs_btree_open(struct 
goto fail_page;
if (!tree->node_count)
goto fail_page;
-   if ((id == HFS_EXT_CNID) && (tree->max_key_len != HFS_MAX_EXT_KEYLEN)) {
-   printk(KERN_ERR "hfs: invalid extent max_key_len %d\n",
-   tree->max_key_len);
-   goto fail_page;
-   }
-   if ((id == HFS_CAT_CNID) && (tree->max_key_len != HFS_MAX_CAT_KEYLEN)) {
-   printk(KERN_ERR "hfs: invalid catalog max_key_len %d\n",
-   tree->max_key_len);
-   goto fail_page;
+   switch(id) {
+   case HFS_EXT_CNID:
+   if (tree->max_key_len != HFS_MAX_EXT_KEYLEN) {
+   printk(KERN_ERR "hfs: invalid extent max_key_len %d\n",
+   tree->max_key_len);
+   goto fail_page;
+   }
+   break;
+   case HFS_CAT_CNID:
+   if (tree->max_key_len != HFS_MAX_CAT_KEYLEN) {
+   printk(KERN_ERR "hfs: invalid catalog max_key_len %d\n",
+   tree->max_key_len);
+   goto fail_page;
+   }
+   break;
+   default:
+   BUG();
}
 
tree->node_size_shift = ffs(size) - 1;
Index: linux-2.6.24-rc6-mm1/fs/hfs/hfs.h
===
--- linux-2.6.24-rc6-mm1.orig/fs/hfs/hfs.h
+++ linux-2.6.24-rc6-mm1/fs/hfs/hfs.h
@@ -28,8 +28,6 @@
 #define HFS_MAX_NAMELEN128
 #define HFS_MAX_VALENCE32767U
 
-#define HFS_BAD_KEYLEN 0xFF
-

Re: fs/hfs/btree.c: new NULL dereference

2008-01-14 Thread Eric Sandeen

Adrian Bunk wrote:
> The Coverity checker spotted the following NULL dereference introduced 
> by commit cf0594625083111ae522496dc1c256f7476939c2:

Oops.  Thanks, Adrian.

Patch sent on a different thread, following another fixup patch for
Roman - you're cc'd.

Thanks,

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] hfs: fix coverity-found null deref

2008-01-14 Thread Eric Sandeen

Fix potential null deref introduced by commit
cf0594625083111ae522496dc1c256f7476939c2
http://bugzilla.kernel.org/show_bug.cgi?id=9748

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---

Index: linux-2.6.24-rc6-mm1/fs/hfs/btree.c
===
--- linux-2.6.24-rc6-mm1.orig/fs/hfs/btree.c
+++ linux-2.6.24-rc6-mm1/fs/hfs/btree.c
@@ -61,7 +61,7 @@ struct hfs_btree *hfs_btree_open(struct 
mapping = tree->inode->i_mapping;
page = read_mapping_page(mapping, 0, NULL);
if (IS_ERR(page))
-   goto free_tree;
+   goto free_inode;
 
/* Load the header */
head = (struct hfs_btree_header_rec *)(kmap(page) + sizeof(struct 
hfs_bnode_desc));
@@ -107,11 +107,12 @@ struct hfs_btree *hfs_btree_open(struct 
page_cache_release(page);
return tree;
 
- fail_page:
+fail_page:
page_cache_release(page);
- free_tree:
+free_inode:
tree->inode->i_mapping->a_ops = &hfs_aops;
iput(tree->inode);
+free_tree:
kfree(tree);
return NULL;
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] address hfs on-disk corruption robustness review comments

2008-01-14 Thread Eric Sandeen

Andrew Morton wrote:
> On Mon, 14 Jan 2008 15:15:04 -0600
> Eric Sandeen <[EMAIL PROTECTED]> wrote:
> 
>> Address Roman's review comments for the previously sent on-disk
>> corruption hfs robustness patch.
>>
>> I still owe a patch for hfsplus.
>>
>> Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>
> 
> That isn't a changlog.  Please send a description of this patch?

Follows.

---

Per maintainer's request, use 0 as a failure value, rather than
making a new macro HFS_BAD_KEYLEN, and use a switch statement
instead of if's.

Add new fail: target to __hfs_brec_find to skip assignments using
bad values when exiting with a failure.

---

Sorry 'bout that.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Eric Sandeen

Alan Cox wrote:
>> Writeback cache on disk in iteself is not bad, it only gets bad if the
>> disk is not engineered to save all its dirty cache on power loss,
>> using the disk motor as a generator or alternatively a small battery.
>> It would be awfully nice to know which brands fail here, if any,
>> because writeback cache is a big performance booster.
> 
> AFAIK no drive saves the cache. The worst case cache flush for drives is
> several seconds with no retries and a couple of minutes if something
> really bad happens.
> 
> This is why the kernel has some knowledge of barriers and uses them to
> issue flushes when needed.

Problem is, ext3 has barriers off by default so it's not saving most people.

And then if you turn them on, but have your filesystem on an lvm device,
lvm strips them out again.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ecryptfs: set s_blocksize from lower fs in sb

2007-12-14 Thread Eric Sandeen

eCryptfs wasn't setting s_blocksize in it's superblock; just pick
it up from the lower FS.  Having an s_blocksize of 0 made things
like "filefrag" which call FIGETBSZ unhappy.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---

Index: linux-2.6.24-rc3/fs/ecryptfs/main.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/main.c
+++ linux-2.6.24-rc3/fs/ecryptfs/main.c
@@ -523,6 +523,7 @@ static int ecryptfs_read_super(struct su
lower_mnt = nd.mnt;
ecryptfs_set_superblock_lower(sb, lower_root->d_sb);
sb->s_maxbytes = lower_root->d_sb->s_maxbytes;
+   sb->s_blocksize = lower_root->d_sb->s_blocksize;
ecryptfs_set_dentry_lower(sb->s_root, lower_root);
ecryptfs_set_dentry_lower_mnt(sb->s_root, lower_mnt);
rc = ecryptfs_interpose(lower_root, sb->s_root, sb, 0);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ext3: issue warning when bad inode found via ext3_lookup

2007-12-14 Thread Eric Sandeen

I have a hand-crafted bad filesystem image which has corruption:

[EMAIL PROTECTED] ~]# ls mnt/dir
file1  file2  file3  file4  file5
[EMAIL PROTECTED] ~]# ls mnt/dir/file4 
ls: cannot access mnt/dir/file4: No such file or directory
[EMAIL PROTECTED] ~]# ls -l mnt/dir
ls: cannot access mnt/dir/file4: No such file or directory
total 8
drwxr-xr-x 2 root root 1024 2007-09-04 13:36 file1
drwxr-xr-x 2 root root 1024 2007-09-04 13:36 file2
drwxr-xr-x 2 root root 1024 2007-09-04 13:36 file3
d? ? ??   ?? file4
drwxr-xr-x 2 root root 1024 2007-09-04 13:36 file5

e2fsck also knows it's corrupted:
Pass 2: Checking directory structure
Entry 'file4' in /dir (2049) has deleted/unused inode 13.  Clear? no

Entry 'file4' in /dir (2049) has an incorrect filetype (was 2, should be 1).
Fix? no

Pass 3: Checking directory connectivity
Unconnected directory inode 2053 (/dir/???)

BUT there are no kernel messages logged anywhere because ext3_read_inode
silently makes a bad_inode in this case, so that stale NFS filehandles
aren't noisy.  However, when we encounter such a problem after a by-name
lookup, I think a warning is appropriate, as it indicates filesystem
corruption.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]
---

Index: linux-2.6.24-rc3/fs/ext3/namei.c
===
--- linux-2.6.24-rc3.orig/fs/ext3/namei.c
+++ linux-2.6.24-rc3/fs/ext3/namei.c
@@ -1049,6 +1049,9 @@ static struct dentry *ext3_lookup(struct
return ERR_PTR(-EACCES);
 
if (is_bad_inode(inode)) {
+   ext3_warning(inode->i_sb, __FUNCTION__,
+"bad inode %lu in dir #%lu",
+ inode->i_ino, dir->i_ino);
iput(inode);
return ERR_PTR(-ENOENT);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ext4: issue warning when bad inode found via ext4_lookup

2007-12-14 Thread Eric Sandeen

I have a hand-crafted bad filesystem image which has corruption:

[EMAIL PROTECTED] ~]# ls mnt/dir
file1  file2  file3  file4  file5
[EMAIL PROTECTED] ~]# ls mnt/dir/file4 
ls: cannot access mnt/dir/file4: No such file or directory
[EMAIL PROTECTED] ~]# ls -l mnt/dir
ls: cannot access mnt/dir/file4: No such file or directory
total 8
drwxr-xr-x 2 root root 1024 2007-09-04 13:36 file1
drwxr-xr-x 2 root root 1024 2007-09-04 13:36 file2
drwxr-xr-x 2 root root 1024 2007-09-04 13:36 file3
d? ? ??   ?? file4
drwxr-xr-x 2 root root 1024 2007-09-04 13:36 file5

e2fsck also knows it's corrupted:
Pass 2: Checking directory structure
Entry 'file4' in /dir (2049) has deleted/unused inode 13.  Clear? no

Entry 'file4' in /dir (2049) has an incorrect filetype (was 2, should be 1).
Fix? no

Pass 3: Checking directory connectivity
Unconnected directory inode 2053 (/dir/???)

BUT there are no kernel messages logged anywhere because ext4_read_inode
silently makes a bad_inode in this case, so that stale NFS filehandles
aren't noisy.  However, when we encounter such a problem after a by-name
lookup, I think a warning is appropriate, as it indicates filesystem
corruption.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]
---

Index: linux-2.6.24-rc3/fs/ext4/namei.c
===
--- linux-2.6.24-rc3.orig/fs/ext4/namei.c
+++ linux-2.6.24-rc3/fs/ext4/namei.c
@@ -1050,6 +1050,9 @@ static struct dentry *ext4_lookup(struct
return ERR_PTR(-EACCES);
 
if (is_bad_inode(inode)) {
+   ext4_warning(inode->i_sb, __FUNCTION__,
+"bad inode %lu in dir #%lu",
+inode->i_ino, dir->i_ino);
iput(inode);
return ERR_PTR(-ENOENT);
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] export iov_shorten for ext4's use

2007-12-14 Thread Eric Sandeen

ext4 needs to deal with 2 different max file offsets for block- and 
extent-allocated file formats, whereas the s_maxbytes scheme can only deal 
with one.  So, for block-allocated files, we must catch and fix up
too-large offsets from within the filesystem.

Having iov_shorten exported allows such things as:

if (pos + length > sbi->s_bitmap_maxbytes) {
nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
  sbi->s_bitmap_maxbytes - pos);
}

to fix up too-large writes to these files in ext4_file_write().

This patch is currently living in the ext4 patch queue.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---

Index: linux-2.6.24-rc3/fs/read_write.c
===
--- linux-2.6.24-rc3.orig/fs/read_write.c
+++ linux-2.6.24-rc3/fs/read_write.c
@@ -451,6 +451,8 @@ unsigned long iov_shorten(struct iovec *
return seg;
 }
 
+EXPORT_SYMBOL(iov_shorten);
+
 ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov,
unsigned long nr_segs, size_t len, loff_t *ppos, iov_fn_t fn)
 {



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ecryptfs: fix fsx data corruption problems

2007-12-16 Thread Eric Sandeen

ecryptfs in 2.6.24-rc3 wasn't surviving fsx for me at all,
dying after 4 ops.  Generally, encountering problems with stale
data and improperly zeroed pages.  An extending truncate + write
for example would expose stale data.

With the changes below I got to a million ops and beyond with all
mmap ops disabled - mmap still needs work.  (A version of this
patch on a RHEL5 kernel ran for over 110 million fsx ops)

I added a few comments as well, to the best of my understanding
as I read through the code.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---


Index: linux-2.6.24-rc3/fs/ecryptfs/mmap.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/mmap.c
+++ linux-2.6.24-rc3/fs/ecryptfs/mmap.c
@@ -263,14 +263,13 @@ out:
return 0;
 }
 
+/* This function must zero any hole we create */
 static int ecryptfs_prepare_write(struct file *file, struct page *page,
  unsigned from, unsigned to)
 {
int rc = 0;
+   loff_t prev_page_end_size;
 
-   if (from == 0 && to == PAGE_CACHE_SIZE)
-   goto out;   /* If we are writing a full page, it will be
-  up to date. */
if (!PageUptodate(page)) {
rc = ecryptfs_read_lower_page_segment(page, page->index, 0,
  PAGE_CACHE_SIZE,
@@ -283,22 +282,32 @@ static int ecryptfs_prepare_write(struct
} else
SetPageUptodate(page);
}
-   if (page->index != 0) {
-   loff_t end_of_prev_pg_pos =
-   (((loff_t)page->index << PAGE_CACHE_SHIFT) - 1);
 
-   if (end_of_prev_pg_pos > i_size_read(page->mapping->host)) {
+   prev_page_end_size = ((loff_t)page->index << PAGE_CACHE_SHIFT);
+
+   /*
+* If creating a page or more of holes, zero them out via truncate.
+* Note, this will increase i_size.
+*/
+   if (page->index != 0) {
+   if (prev_page_end_size > i_size_read(page->mapping->host)) {
rc = ecryptfs_truncate(file->f_path.dentry,
-  end_of_prev_pg_pos);
+  prev_page_end_size);
if (rc) {
printk(KERN_ERR "Error on attempt to "
   "truncate to (higher) offset [%lld];"
-  " rc = [%d]\n", end_of_prev_pg_pos, rc);
+  " rc = [%d]\n", prev_page_end_size, rc);
goto out;
}
}
-   if (end_of_prev_pg_pos + 1 > i_size_read(page->mapping->host))
-   zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
+   }
+   /*
+* Writing to a new page, and creating a small hole from start of page?
+* Zero it out.
+*/
+   if ((i_size_read(page->mapping->host) == prev_page_end_size) &&
+   (from != 0)) {
+   zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
}
 out:
return rc;
Index: linux-2.6.24-rc3/fs/ecryptfs/read_write.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/read_write.c
+++ linux-2.6.24-rc3/fs/ecryptfs/read_write.c
@@ -124,6 +124,10 @@ int ecryptfs_write(struct file *ecryptfs
loff_t pos;
int rc = 0;
 
+   /*
+* if we are writing beyond current size, then start pos
+* at the current size - we'll fill in zeros from there.
+*/
if (offset > ecryptfs_file_size)
pos = ecryptfs_file_size;
else
@@ -137,6 +141,7 @@ int ecryptfs_write(struct file *ecryptfs
if (num_bytes > total_remaining_bytes)
num_bytes = total_remaining_bytes;
if (pos < offset) {
+   /* remaining zeros to write, up to destination offset */
size_t total_remaining_zeros = (offset - pos);
 
if (num_bytes > total_remaining_zeros)
@@ -167,17 +172,27 @@ int ecryptfs_write(struct file *ecryptfs
}
}
ecryptfs_page_virt = kmap_atomic(ecryptfs_page, KM_USER0);
+
+   /*
+* pos: where we're now writing, offset: where the request was
+* If current pos is before request, we are filling zeros
+* If we are at or beyond request, we are writing the *data*
+* If we're in a fresh page beyond eof, zero it in either case
+*/
+   if (pos < offset || !start_offset

Re: [PATCH] export iov_shorten for ext4's use

2007-12-17 Thread Eric Sandeen

(And, take 2... follow the coding style on export declarations...)

ext4 needs to deal with 2 different max file offsets for block- and 
extent-allocated file formats, whereas the s_maxbytes scheme can only deal 
with one.  So, for block-allocated files, we must catch and fix up
too-large offsets from within the filesystem.

Having iov_shorten exported allows such things as:

if (pos + length > sbi->s_bitmap_maxbytes) {
nr_segs = iov_shorten((struct iovec *)iov, nr_segs,
  sbi->s_bitmap_maxbytes - pos);
}

to fix up too-large writes to these files in ext4_file_write().

This patch is currently living in the ext4 patch queue.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---


Index: linux-2.6.24-rc3/fs/read_write.c
===
--- linux-2.6.24-rc3.orig/fs/read_write.c
+++ linux-2.6.24-rc3/fs/read_write.c
@@ -450,6 +450,7 @@ unsigned long iov_shorten(struct iovec *
}
return seg;
 }
+EXPORT_SYMBOL(iov_shorten)
 
 ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov,
unsigned long nr_segs, size_t len, loff_t *ppos, iov_fn_t fn)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ecryptfs: fix string overflow on long cipher names

2007-12-18 Thread Eric Sandeen

Passing a cipher name > 32 chars on mount results in an overflow
when the cipher name is printed, because the last character
in the struct ecryptfs_key_tfm's cipher_name string was never 
zeroed.

Signed-off-by: Eric Sandeen <[EMAIL PROTECTED]>

---

Index: linux-2.6.24-rc3/fs/ecryptfs/crypto.c
===
--- linux-2.6.24-rc3.orig/fs/ecryptfs/crypto.c
+++ linux-2.6.24-rc3/fs/ecryptfs/crypto.c
@@ -1847,6 +1847,7 @@ ecryptfs_add_new_key_tfm(struct ecryptfs
mutex_init(&tmp_tfm->key_tfm_mutex);
strncpy(tmp_tfm->cipher_name, cipher_name,
ECRYPTFS_MAX_CIPHER_NAME_SIZE);
+   tmp_tfm->cipher_name[ECRYPTFS_MAX_CIPHER_NAME_SIZE] = '\0';
tmp_tfm->key_size = key_size;
rc = ecryptfs_process_key_cipher(&tmp_tfm->key_tfm,
 tmp_tfm->cipher_name,



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 >

1 - 100 of 348 matches

Mail list logo