[PATCH] [char] random: fix priming of last_data

2013-03-19 Thread Jarod Wilson
Commit ec8f02da9e added priming of last_data per fips requirements.
Unfortuantely, it did so in a way that can lead to multiple threads all
incrementing nbytes, but only one actually doing anything with the extra
data, which leads to some fun  random corruption and panics.

The fix is to simply do everything needed to prime last_data in a single
shot, so there's no window for multiple cpus to increment nbytes -- in
fact, we won't even increment or decrement nbytes anymore, we'll just
extract the needed EXTRACT_BYTES one time per pool and then carry on with
the normal routine.

All these changes have been tested across multiple hosts and architectures
where panics were previously encoutered. The code changes are are strictly
limited to areas only touched when when booted in fips mode.

This change should also go into 3.8-stable, to make the myriads of fips
users on 3.8.x happy.

Tested-by: Jan Stancek 
Tested-by: Jan Stodola 
CC: Herbert Xu 
CC: Neil Horman 
CC: "David S. Miller" 
CC: Matt Mackall 
CC: "Theodore Ts'o" 
CC: linux-cry...@vger.kernel.org
CC: sta...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/char/random.c |   30 +++---
 1 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 85e81ec..15e5a2b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -953,10 +953,23 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
 {
ssize_t ret = 0, i;
__u8 tmp[EXTRACT_SIZE];
+   unsigned long flags;
 
/* if last_data isn't primed, we need EXTRACT_SIZE extra bytes */
-   if (fips_enabled && !r->last_data_init)
-   nbytes += EXTRACT_SIZE;
+   if (fips_enabled) {
+   spin_lock_irqsave(&r->lock, flags);
+   if (!r->last_data_init) {
+   r->last_data_init = true;
+   spin_unlock_irqrestore(&r->lock, flags);
+   trace_extract_entropy(r->name, EXTRACT_SIZE,
+ r->entropy_count, _RET_IP_);
+   xfer_secondary_pool(r, EXTRACT_SIZE);
+   extract_buf(r, tmp);
+   spin_lock_irqsave(&r->lock, flags);
+   memcpy(r->last_data, tmp, EXTRACT_SIZE);
+   }
+   spin_unlock_irqrestore(&r->lock, flags);
+   }
 
trace_extract_entropy(r->name, nbytes, r->entropy_count, _RET_IP_);
xfer_secondary_pool(r, nbytes);
@@ -966,19 +979,6 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
extract_buf(r, tmp);
 
if (fips_enabled) {
-   unsigned long flags;
-
-
-   /* prime last_data value if need be, per fips 140-2 */
-   if (!r->last_data_init) {
-   spin_lock_irqsave(&r->lock, flags);
-   memcpy(r->last_data, tmp, EXTRACT_SIZE);
-   r->last_data_init = true;
-   nbytes -= EXTRACT_SIZE;
-   spin_unlock_irqrestore(&r->lock, flags);
-   extract_buf(r, tmp);
-   }
-
spin_lock_irqsave(&r->lock, flags);
if (!memcmp(tmp, r->last_data, EXTRACT_SIZE))
panic("Hardware RNG duplicated output!\n");
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [char] random: fix priming of last_data

2013-03-19 Thread Jarod Wilson
On Tue, Mar 19, 2013 at 12:18:09PM -0400, Jarod Wilson wrote:
> Commit ec8f02da9e added priming of last_data per fips requirements.
> Unfortuantely, it did so in a way that can lead to multiple threads all
> incrementing nbytes, but only one actually doing anything with the extra
> data, which leads to some fun  random corruption and panics.
> 
> The fix is to simply do everything needed to prime last_data in a single
> shot, so there's no window for multiple cpus to increment nbytes -- in
> fact, we won't even increment or decrement nbytes anymore, we'll just
> extract the needed EXTRACT_BYTES one time per pool and then carry on with
 ^
 EXTRACT_SIZE

Error in the description, if anyone actually cares. Oops.

> the normal routine.
> 
> All these changes have been tested across multiple hosts and architectures
> where panics were previously encoutered. The code changes are are strictly
> limited to areas only touched when when booted in fips mode.
> 
> This change should also go into 3.8-stable, to make the myriads of fips
> users on 3.8.x happy.

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/9] ieee1394: sbp2: add INQUIRY delay workaround

2008-02-11 Thread Jarod Wilson

Stefan Richter wrote:

Add the same workaround as found in fw-sbp2 for feature parity and
compatibility of the workarounds module parameter.

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>


Same deal as 2/9.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

--
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/9] firewire: fw-sbp2: add INQUIRY delay workaround

2008-02-11 Thread Jarod Wilson

Stefan Richter wrote:

Several different SBP-2 bridges accept a login early while the IDE
device is still powering up.  They are therefore unable to respond to
SCSI INQUIRY immediately, and the SCSI core has to retry the INQUIRY.
One of these retries is typically successful, and all is well.

But in case of Momobay FX-3A, the INQUIRY retries tend to fail entirely.
This can usually be avoided by waiting a little while after login before
letting the SCSI core send the INQUIRY.  The old sbp2 driver handles
this more gracefully for as yet unknown reasons (perhaps because it
waits for fetch agent resets to complete, unlike fw-sbp2 which quickly
proceeds after requesting the agent reset).  Therefore the workaround is
not as much necessary for sbp2.

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>


Looks reasonable to me. I'm guessing there are some other devices out 
there that may well benefit from this (once added to the workarounds 
table, of course). Interesting that this is an encore appearance for a 
DViCO Momobay device...


Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

--
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/9] firewire: log GUID of new devices

2008-02-11 Thread Jarod Wilson

Stefan Richter wrote:

This should help to interpret user reports.  E.g. one can look up the
vendor OUI (first three bytes of the GUID) and thus tell what is what.

Also simplifies the math in the GUID sysfs attribute.

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>


Rather liking this one, makes it much easier to figure out what's what 
without having to poke around in config roms.


Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

--
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect

2008-02-11 Thread Jarod Wilson

Stefan Richter wrote:

Jarod Wilson wrote:

Stefan Richter wrote:

+static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
+{
+struct fw_card *card =
fw_device(lu->tgt->unit->device.parent)->card;
+
+if (!atomic_read(&lu->tgt->dont_block) &&
+lu->generation != card->generation &&
+atomic_cmpxchg(&lu->blocked, 0, 1) == 0) {

Just to be absolutely sure, we don't need any barriers here to ensure we
get the right generations, do we?


I didn't think so.  But I will carefully look at it again later this
week.  The function definitely must not block the device when the
generation is current.  We look at two data fields here which makes this
even more problematic.  Could be that we need locks after all.


I didn't see anything else earlier on that guaranteed we got the current 
generation in both cases, but I didn't look exhaustively, and you know 
this code a lot better than I do, so I definitely defer to your better 
judgment, just wanted to make sure it had been considered.




Also, this isn't expected to let I/O survive a disk being unplugged
briefly, then plugged back in, is it?


No, this only tells the SCSI core to not bother fw-sbp2's
.queuecommand() with new commands before reconnect.  This will
mysteriously convince the SCSI core to not put the device offline too
quickly and will stabilize application client behavior thanks to
considerably fewer command retries.


Okay, that's what I thought and what I observed in operation.


To survive real or perceived temporary unplugs ("perceived" unplugs can
happen if a third node is slowly plugged in or out), we need to do
something in fw-device.c.  We have to keep the fw_device around after
node removal event until a timeout, to check newly added devices whether
they are in fact one of the undead devices, and to revive that one
rather than creating a new one.


Gotcha. So basically, a temporary table of devices recently "removed", 
which will expire to full/actual removal after some relatively short 
interval, but which we'll also check for matches of "newly" connected 
devices to see if we should cancel removing the device and just pretend 
like it never actually went away. Right? I wonder if there's any sort of 
guidance on this sort of thing in the firewire specs... I'll make an 
effort to search for relevant info, unless you've already got it.



(I recall that being discussed, but I think it was as a 'would be
nice to do in the future' thing).


I realized now that it is a 'need it sooner than later' thing because of
these "perceived" unplugs.  We need this feature at least with a minimal
timeout, otherwise people will sometimes lose connection to their
devices (the scsi_device will be destroyed and a new one created) when
they plug a 3rd or 4th or nth node.  As mentioned in another post, this
is an actual regression for those who migrated from ieee1394 to fw-core.
But fear not, it looks like I will have a prolonged weekend.  :-)


Heh, sounds good. I missed out on my entire past weekend (and half a 
week of work) due to family illnesses. Hoping to throw a bunch of time 
at further firewire work this week though.


--
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 6/9] firewire: fw-sbp2: don't add scsi_device twice

2008-02-11 Thread Jarod Wilson

Stefan Richter wrote:

When a reconnect failed but re-login succeeded, __scsi_add_device was
called again.

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>


Was the spurious __scsi_add_device simply failing, or was it causing 
other problems as well? I can't remember if I've hit that or not. In 
either case, not calling it when we know we don't need to makes sense.


Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>


--
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] random: prime last_data value per fips requirements

2012-11-05 Thread Jarod Wilson
The value stored in last_data must be primed for FIPS 140-2 purposes. Upon
first use, either on system startup or after an RNDCLEARPOOL ioctl, we
need to take an initial random sample, store it internally in last_data,
then pass along the value after that to the requester, so that consistency
checks aren't being run against stale and possibly known data.

CC: Herbert Xu 
CC: "David S. Miller" 
CC: Neil Horman 
CC: Matt Mackall 
CC: linux-cry...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/char/random.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b86eae9..24d17b8 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -437,6 +437,7 @@ struct entropy_store {
int entropy_count;
int entropy_total;
unsigned int initialized:1;
+   bool last_data_init;
__u8 last_data[EXTRACT_SIZE];
 };
 
@@ -967,6 +968,15 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
if (fips_enabled) {
unsigned long flags;
 
+   /* prime last_data value if need be, per fips 140-2 */
+   if (!r->last_data_init) {
+   spin_lock_irqsave(&r->lock, flags);
+   memcpy(r->last_data, tmp, EXTRACT_SIZE);
+   r->last_data_init = true;
+   spin_unlock_irqrestore(&r->lock, flags);
+   continue;
+   }
+
spin_lock_irqsave(&r->lock, flags);
if (!memcmp(tmp, r->last_data, EXTRACT_SIZE))
panic("Hardware RNG duplicated output!\n");
@@ -1086,6 +1096,7 @@ static void init_std_data(struct entropy_store *r)
 
r->entropy_count = 0;
r->entropy_total = 0;
+   r->last_data_init = false;
mix_pool_bytes(r, &now, sizeof(now), NULL);
for (i = r->poolinfo->POOLBYTES; i > 0; i -= sizeof(rv)) {
if (!arch_get_random_long(&rv))
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] random: prime last_data value per fips requirements

2012-11-06 Thread Jarod Wilson
On Tue, Nov 06, 2012 at 07:05:23AM -0500, Neil Horman wrote:
> On Mon, Nov 05, 2012 at 04:00:10PM -0500, Jarod Wilson wrote:
> > The value stored in last_data must be primed for FIPS 140-2 purposes. Upon
> > first use, either on system startup or after an RNDCLEARPOOL ioctl, we
> > need to take an initial random sample, store it internally in last_data,
> > then pass along the value after that to the requester, so that consistency
> > checks aren't being run against stale and possibly known data.
> > 
> > CC: Herbert Xu 
> > CC: "David S. Miller" 
> > CC: Neil Horman 
> > CC: Matt Mackall 
> > CC: linux-cry...@vger.kernel.org
> > Signed-off-by: Jarod Wilson 
> > ---
> >  drivers/char/random.c |   11 +++
> >  1 files changed, 11 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index b86eae9..24d17b8 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -437,6 +437,7 @@ struct entropy_store {
> > int entropy_count;
> > int entropy_total;
> > unsigned int initialized:1;
> > +   bool last_data_init;
> > __u8 last_data[EXTRACT_SIZE];
> >  };
> >  
> > @@ -967,6 +968,15 @@ static ssize_t extract_entropy(struct entropy_store 
> > *r, void *buf,
> > if (fips_enabled) {
> > unsigned long flags;
> >  
> > +   /* prime last_data value if need be, per fips 140-2 */
> > +   if (!r->last_data_init) {
> > +   spin_lock_irqsave(&r->lock, flags);
> > +   memcpy(r->last_data, tmp, EXTRACT_SIZE);
> > +   r->last_data_init = true;
> > +   spin_unlock_irqrestore(&r->lock, flags);
> > +   continue;
> Continue?  Is that left over from earlier work?  Or did you have some other
> purpose in mind for it?

The continue takes you back to the top of the while loop for another
extract_buf() call, but continue could simply be replaced with another
extract_buf() call, so we don't have to restart the loop and check
last_data_init again. Otherwise, we're going to fail the memcmp and panic,
because tmp and r->last_data will be identical.

> Also, not that its in a hot path or anything, but it might be nice to
> consolodate this code such that you only lock and unlock r->flags once instead
> of twice here.

I thought about that, but figured it would be more trouble and possibly
more code to execute than it was worth in the normal case. But I can spin
up a v2 that tries to be a bit cleaner here.

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] random: prime last_data value per fips requirements

2012-11-06 Thread Jarod Wilson
The value stored in last_data must be primed for FIPS 140-2 purposes. Upon
first use, either on system startup or after an RNDCLEARPOOL ioctl, we
need to take an initial random sample, store it internally in last_data,
then pass along the value after that to the requester, so that consistency
checks aren't being run against stale and possibly known data.

v2: streamline code flow a bit, eliminating extra loop and spinlock in the
case where we need to prime, and account for the extra primer bits.

CC: Herbert Xu 
CC: "David S. Miller" 
CC: Neil Horman 
CC: Matt Mackall 
CC: linux-cry...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/char/random.c |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b86eae9..6bdd57f 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -437,6 +437,7 @@ struct entropy_store {
int entropy_count;
int entropy_total;
unsigned int initialized:1;
+   bool last_data_init;
__u8 last_data[EXTRACT_SIZE];
 };
 
@@ -957,6 +958,10 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
ssize_t ret = 0, i;
__u8 tmp[EXTRACT_SIZE];
 
+   /* if last_data isn't primed, we need EXTRACT_SIZE extra bytes */
+   if (fips_enabled && !r->last_data_init)
+   nbytes += EXTRACT_SIZE;
+
trace_extract_entropy(r->name, nbytes, r->entropy_count, _RET_IP_);
xfer_secondary_pool(r, nbytes);
nbytes = account(r, nbytes, min, reserved);
@@ -968,6 +973,15 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
unsigned long flags;
 
spin_lock_irqsave(&r->lock, flags);
+
+   /* prime last_data value if need be, per fips 140-2 */
+   if (!r->last_data_init) {
+   memcpy(r->last_data, tmp, EXTRACT_SIZE);
+   r->last_data_init = true;
+   nbytes -= EXTRACT_SIZE;
+   extract_buf(r, tmp);
+   }
+
if (!memcmp(tmp, r->last_data, EXTRACT_SIZE))
panic("Hardware RNG duplicated output!\n");
memcpy(r->last_data, tmp, EXTRACT_SIZE);
@@ -1086,6 +1100,7 @@ static void init_std_data(struct entropy_store *r)
 
r->entropy_count = 0;
r->entropy_total = 0;
+   r->last_data_init = false;
mix_pool_bytes(r, &now, sizeof(now), NULL);
for (i = r->poolinfo->POOLBYTES; i > 0; i -= sizeof(rv)) {
if (!arch_get_random_long(&rv))
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3] random: prime last_data value per fips requirements

2012-11-06 Thread Jarod Wilson
The value stored in last_data must be primed for FIPS 140-2 purposes. Upon
first use, either on system startup or after an RNDCLEARPOOL ioctl, we
need to take an initial random sample, store it internally in last_data,
then pass along the value after that to the requester, so that consistency
checks aren't being run against stale and possibly known data.

v2: streamline code flow a bit, eliminating extra loop and spinlock in the
case where we need to prime, and account for the extra primer bits.

v3: extract_buf() can't be called with spinlock already held, so bring
back some extra lock/unlock calls.

CC: Herbert Xu 
CC: "David S. Miller" 
CC: Neil Horman 
CC: Matt Mackall 
CC: linux-cry...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/char/random.c |   17 +
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b86eae9..d0139df 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -437,6 +437,7 @@ struct entropy_store {
int entropy_count;
int entropy_total;
unsigned int initialized:1;
+   bool last_data_init;
__u8 last_data[EXTRACT_SIZE];
 };
 
@@ -957,6 +958,10 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
ssize_t ret = 0, i;
__u8 tmp[EXTRACT_SIZE];
 
+   /* if last_data isn't primed, we need EXTRACT_SIZE extra bytes */
+   if (fips_enabled && !r->last_data_init)
+   nbytes += EXTRACT_SIZE;
+
trace_extract_entropy(r->name, nbytes, r->entropy_count, _RET_IP_);
xfer_secondary_pool(r, nbytes);
nbytes = account(r, nbytes, min, reserved);
@@ -967,6 +972,17 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
if (fips_enabled) {
unsigned long flags;
 
+
+   /* prime last_data value if need be, per fips 140-2 */
+   if (!r->last_data_init) {
+   spin_lock_irqsave(&r->lock, flags);
+   memcpy(r->last_data, tmp, EXTRACT_SIZE);
+   r->last_data_init = true;
+   nbytes -= EXTRACT_SIZE;
+   spin_unlock_irqrestore(&r->lock, flags);
+   extract_buf(r, tmp);
+   }
+
spin_lock_irqsave(&r->lock, flags);
if (!memcmp(tmp, r->last_data, EXTRACT_SIZE))
panic("Hardware RNG duplicated output!\n");
@@ -1086,6 +1102,7 @@ static void init_std_data(struct entropy_store *r)
 
r->entropy_count = 0;
r->entropy_total = 0;
+   r->last_data_init = false;
mix_pool_bytes(r, &now, sizeof(now), NULL);
for (i = r->poolinfo->POOLBYTES; i > 0; i -= sizeof(rv)) {
if (!arch_get_random_long(&rv))
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/9] firewire-sbp2: misc hotplug related patches

2008-02-05 Thread Jarod Wilson
On Sunday 03 February 2008 05:00:54 pm Stefan Richter wrote:
> Here is various stuff to hopefully improve fw-sbp2's behavior during bus
> resets.  The main piece is patch 9/9 which considerably raises the
> chance that ongoing I/O survives plugging and unplugging of other
> devices on the same bus as the device which services the I/O.
>
> The other patches are basically side products of patch 9/9 but contain
> quite useful fixes as well.
>
> I got quite good results with several OxSemi based SBP-2 devices

I've got one setup on which this doesn't seem to help much... Two firewire 
drives (both ox911 bridge, v4.0 firmware) hooked to a system, both of which 
are recognized, logged into, etc., on startup. However, pretty much without 
fail, at least one of them has to perform a reconnection. That claims to 
succeed, but the device isn't actually usable when this happens -- 
fdisk /dev/sdx fails with 'unable to read /dev/sdx'.

Example dmesg output when one of the two drives has to be reconnected:

firewire_core: created device fw0: GUID 00023c0031037366, S400
scsi6 : SBP-2 IEEE-1394
firewire_core: created device fw1: GUID 0050c501e001c394, S400
firewire_sbp2: fw1.0: logged in to LUN  (0 retries)
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
scsi 6:0:0:0: Direct-Access-RBC ST312002 6A   8.01 PQ: 0 ANSI: 4
firewire_core: created device fw2: GUID 00010800f605, S800
sd 6:0:0:0: [sdc] 234441648 512-byte hardware sectors (120034 MB)
sd 6:0:0:0: [sdc] Write Protect is off
sd 6:0:0:0: [sdc] Mode Sense: 00 00 00 00
sd 6:0:0:0: [sdc] Asking for cache data failed
firewire_core: created device fw3: GUID d10080a575eb, S400
sd 6:0:0:0: [sdc] Assuming drive cache: write through
sd 6:0:0:0: [sdc] READ CAPACITY failed
sd 6:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY 
driverbyte=DRIVER_OK,SUGGEST_OK
sd 6:0:0:0: [sdc] Sense not available.
sd 6:0:0:0: [sdc] Write Protect is off
sd 6:0:0:0: [sdc] Mode Sense: 00 00 00 00
sd 6:0:0:0: [sdc] Asking for cache data failed
sd 6:0:0:0: [sdc] Assuming drive cache: write through
sd 6:0:0:0: [sdc] Attached SCSI disk
scsi7 : SBP-2 IEEE-1394
firewire_sbp2: fw1.0: reconnected to LUN  (0 retries)
firewire_core: created device fw4: GUID 0050c501e00b23e9, S400
firewire_core: phy config: card 2, new root=ffc1, gap_count=5
firewire_sbp2: fw4.0: logged in to LUN  (0 retries)
scsi 7:0:0:0: Direct-Access-RBC ST312002 2A   8.01 PQ: 0 ANSI: 4
sd 7:0:0:0: [sdd] 234441648 512-byte hardware sectors (120034 MB)
sd 7:0:0:0: [sdd] Write Protect is off
sd 7:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
sd 7:0:0:0: [sdd] 234441648 512-byte hardware sectors (120034 MB)
sd 7:0:0:0: [sdd] Write Protect is off
sd 7:0:0:0: [sdd] Mode Sense: 11 00 00 00
sd 7:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA
 sdd: sdd1
sd 7:0:0:0: [sdd] Attached SCSI disk

fw device decoder ring:
fw0 = fw400 card
fw1 = /dev/sdc (120G HD in ox911 case, hooked to fw0)
fw2 = fw800 card
fw3 = fw400 card
fw4 = /dev/sdd (120G HD in ox911 case, hooked to fw3)


# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raid0] 
md4 : active raid1 sdd1[1]
  117218176 blocks [2/1] [_U]

# fdisk /dev/sdc

Unable to read /dev/sdc


Given the READ CAPACITY failed and DID_BUS_BUSY messages for sdc (and lack of 
notice about its partitions), it sort of looks like we never set the disk up 
correctly in the first place, and we're subsequently just reconnecting to 
that failed setup... So the reconnect code may be doing the right thing, and 
the real problem I'm looking at is us screwing up the setup of the device in 
the first place, for some reason...


-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/9] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed

2008-02-08 Thread Jarod Wilson
On Wednesday 06 February 2008 04:09:47 pm Stefan Richter wrote:
> fw-sbp2 is unable to reconnect while performing __scsi_add_device
> because there is only a single workqueue thread context available for
> both at the moment.  This should be fixed eventually.
>
> Until then, take care that __scsi_add_device does not return success
> even though the SCSI high-level driver probing failed (sd READ_CAPACITY
> and friends) due to bus reset.  The trick to do so is to use a different
> error indicator in the command completion as long as __scsi_add_device
> did not return.
>
> An actual failure of __scsi_add_device is easy to handle, but an
> incomplete execution of __scsi_add_device with an sdev returned would
> remain undetected and leave the SBP-2 target unusable.
>
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>
> ---
>
> Jarod, does this work like I assume and fixes your setup of two OXFW911
> based disks?

Well, it results in the dmesg spew saying "sd 6:0:0:0: [sdc] Result: 
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK" -- i.e., 
DID_NO_CONNECT instead of DID_BUS_BUSY, but other than that, no change in 
behavior, sdc remains unusable just as before.


-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 11/9 update] firewire: fw-sbp2: enforce a retry of __scsi_add_device if bus generation changed

2008-02-10 Thread Jarod Wilson
On Friday 08 February 2008 04:33:29 pm Stefan Richter wrote:
> fw-sbp2 is unable to reconnect while performing __scsi_add_device
> because there is only a single workqueue thread context available for
> both at the moment.  This should be fixed eventually.
>
> An actual failure of __scsi_add_device is easy to handle, but an
> incomplete execution of __scsi_add_device with an sdev returned would
> remain undetected and leave the SBP-2 target unusable.
>
> Therefore we use a workaround:  If there was a bus reset during
> __scsi_add_device (i.e. during the SCSI probe), we remove the new sdev
> immediately, log out, and attempt login and SCSI probe again.
>
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

Now *this* does the trick. I get the 'READ CAPACITY failed' as before, 
then 'firewire_sbp2: fw1.0: error status: 0:4', followed by a new login and 
SCSI probe, both of which are successful this time, disk is available for use 
and all that good stuff.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>


-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 9/9] firewire: fw-sbp2: fix I/O errors during reconnect

2008-02-11 Thread Jarod Wilson

Stefan Richter wrote:

While fw-sbp2 takes the necessary time to reconnect to a logical unit
after bus reset, the SCSI core keeps sending new commands.  They are all
immediately completed with host busy status, and application clients or
filesystems will break quickly.  The SCSI device might even be taken
offline:  http://bugzilla.kernel.org/show_bug.cgi?id=9734

The only remedy seems to be to block the SCSI device until reconnect.
Alas the SCSI core has no useful API to block only one logical unit i.e.
the scsi_device, therefore we block the entire Scsi_Host.  This
currently corresponds to an SBP-2 target.  In case of targets with
multiple logical units, we need to satisfy the dependencies between
logical units by carefully tracking the blocking state of the target and
its units.  We block all logical units of a target as soon as one of
them needs to be blocked, and keep them blocked until all of them are
ready to be unblocked.

Furthermore, as the history of the old sbp2 driver has shown, the
scsi_block_requests() API is a minefield with high potential of
deadlocks.  We therefore take extra measures to keep logical units
unblocked during __scsi_add_device() and during shutdown.

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>



+/*
+ * Blocks lu->tgt if all of the following conditions are met:
+ *   - Login, INQUIRY, and high-level SCSI setup of all logical units of the
+ * target have been successfully finished (indicated by dont_block == 0).
+ *   - The lu->generation is stale.  sbp2_reconnect will unblock lu later.
+ */
+static void sbp2_conditionally_block(struct sbp2_logical_unit *lu)
+{
+   struct fw_card *card = fw_device(lu->tgt->unit->device.parent)->card;
+
+   if (!atomic_read(&lu->tgt->dont_block) &&
+   lu->generation != card->generation &&
+   atomic_cmpxchg(&lu->blocked, 0, 1) == 0) {


Just to be absolutely sure, we don't need any barriers here to ensure we 
get the right generations, do we?


Also, this isn't expected to let I/O survive a disk being unplugged 
briefly, then plugged back in, is it? (I recall that being discussed, 
but I think it was as a 'would be nice to do in the future' thing).


--
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 7/9] firewire: fw-sbp2: logout and login after failed reconnect

2008-02-11 Thread Jarod Wilson

Stefan Richter wrote:

If fw-sbp2 was too late with requesting the reconnect, the target would
reject this.  In this case, log out before attempting the reconnect.
Else several firmwares will deny the re-login because they somehow
didn't invalidate the old login.

Also, don't retry reconnects in this situation.  The retries won't
succeed either.

These changes improve chances for successful re-login and shorten the
period during which the logical unit is inaccessible.

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>


Good stuff, easy to see how these will help our chances.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

--
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/9] firewire: fw-sbp2: log bus_id at management request failures

2008-02-11 Thread Jarod Wilson

Stefan Richter wrote:

for easier readable logs if more than one SBP-2 device is present.

Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>


Agree this makes for easier to follow logs, esp. with lots of devices 
(not just spb2).


Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

--
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] MAINTAINERS: Remove Jarod Wilson and orphan LIRC drivers

2013-02-15 Thread Jarod Wilson
On Wed, Feb 13, 2013 at 09:36:04AM +0300, Dan Carpenter wrote:
> On Tue, Feb 12, 2013 at 01:20:36PM -0800, Joe Perches wrote:
> > His email bounces and he hasn't done work on
> > these sections in a couple of years.
> > 
> 
> I've added him to the CC list.
> 
> Can we just update MAINTAINERS with the correct email address?  It's
> been useful to CC him on stuff.

Domain reg lapsed, never saw the notice, but its all better now. Yes, I've
been inactive on this stuff of late, lots going on in my personal life
that has reduced free time for upstream work to nearly nil. I hope to get
back into it once things settle down, which should be happening relativley
soon...

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/4] firewire: fw-sbp2: use device generation, not card generation

2008-01-24 Thread Jarod Wilson
Stefan Richter wrote:
> There was a small window where a login or reconnect job could use an
> already updated card generation with an outdated node ID.  We have to
> use the fw_device.generation here, not the fw_card.generation, because
> the generation must never be newer than the node ID when we emit a
> transaction.  This cannot be guaranteed with fw_card.generation.
> 
> Furthermore, the target's and initiator's node IDs can be obtained from
> fw_device and fw_card.  Dereferencing their underlying topology objects
> is not necessary.
> 
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

Verified in concert with parts 2 and 3 as well as with 2, 3 and 4, to fix
'giving up on config rom' issues on multiple system and drive combinations
that were previously affected.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

-- 
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/4] firewire: fw-cdev: use device generation, not card generation

2008-01-24 Thread Jarod Wilson
Stefan Richter wrote:
> We have to use the fw_device.generation here, not the fw_card.generation,
> because the generation must never be newer than the node ID when we emit
> a transaction.  This cannot be guaranteed with fw_card.generation.
> 
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

Verified in concert with parts 1 and 3 as well as with 1, 3 and 4, to fix
'giving up on config rom' issues on multiple system and drive combinations
that were previously affected.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>


-- 
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] firewire: enforce access order between generation and node ID

2008-01-24 Thread Jarod Wilson
Stefan Richter wrote:
> fw_device.node_id and fw_device.generation are accessed without mutexes.
> We have to ensure that all readers will get to see node_id updates
> before generation updates.
> 
> An earlier incarnation of this patch fixes an inability to recognize
> devices after "giving up on config rom",
> https://bugzilla.redhat.com/show_bug.cgi?id=429950
> 
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

Verified in concert with parts 1 and 2 as well as with 1, 2 and 4, to fix
'giving up on config rom' issues on multiple system and drive combinations
that were previously affected.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>



-- 
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/4] firewire: fw-core: react on bus resets while the config ROM is being fetched

2008-01-24 Thread Jarod Wilson
Stefan Richter wrote:
> read_rom() obtained a fresh new fw_device.generation for each read
> transaction.  Hence it was able to continue reading in the middle of the
> ROM even if a bus reset happened.  However the device may have modified
> the ROM during the reset.  We would end up with a corrupt fetched ROM
> image then.
> 
> Although all of this is quite unlikely, it is not impossible.
> Therefore we now restart reading the ROM if the bus generation changed.
> 
> Side note:  The barrier in read_rom(), inserted by patch "firewire:
> enforce access order between generation and node ID" is not necessary
> anymore because the sequence of calls
>   fw_device_init() ->
>   read_bus_info_block() ->
>   read_rom()
>   read_rom()
>   read_rom()
>   ...
> will take care that generation is read before node_id, won't it?
> 
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

Based on a quick read through the code path, coupled with empirical evidence,
yes, it appears safe to remove the barrier in read_rom().

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

-- 
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/4] firewire: fw-core: react on bus resets while the config ROM is being fetched

2008-01-24 Thread Jarod Wilson
Jarod Wilson wrote:
> Stefan Richter wrote:
>> read_rom() obtained a fresh new fw_device.generation for each read
>> transaction.  Hence it was able to continue reading in the middle of the
>> ROM even if a bus reset happened.  However the device may have modified
>> the ROM during the reset.  We would end up with a corrupt fetched ROM
>> image then.
>>
>> Although all of this is quite unlikely, it is not impossible.
>> Therefore we now restart reading the ROM if the bus generation changed.
>>
>> Side note:  The barrier in read_rom(), inserted by patch "firewire:
>> enforce access order between generation and node ID" is not necessary
>> anymore because the sequence of calls
>>  fw_device_init() ->
>>  read_bus_info_block() ->
>>  read_rom()
>>  read_rom()
>>  read_rom()
>>  ...
>> will take care that generation is read before node_id, won't it?
>>
>> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>
> 
> Based on a quick read through the code path, coupled with empirical evidence,
> yes, it appears safe to remove the barrier in read_rom().

Crap. My earlier 'empirical evidence' seems to have been happy coincidence.
After a fresh boot, I'm consistently hitting 'failed to read config rom'
failures with this patch applied. Simply putting the barrier back in gets
things working again though. Interestingly, subsequent reloading of firewire-*
modules less the barrier also tend to work until the system is rebooted.


-- 
Jarod Wilson
[EMAIL PROTECTED]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/4 update] firewire: fw-core: react on bus resets while the config ROM is being fetched

2008-01-25 Thread Jarod Wilson
On Friday 25 January 2008 11:53:49 am Stefan Richter wrote:
> read_rom() obtained a fresh new fw_device.generation for each read
> transaction.  Hence it was able to continue reading in the middle of the
> ROM even if a bus reset happened.  However the device may have modified
> the ROM during the reset.  We would end up with a corrupt fetched ROM
> image then.
>
> Although all of this is quite unlikely, it is not impossible.
> Therefore we now restart reading the ROM if the bus generation changed.
>
> Note, the memory barrier in read_rom() is still necessary according to
> tests by Jarod Wilson, despite of the ->generation access being moved up
> in the call chain.
>
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>
> ---
>
> First posted on 2007-11-01.  Update:  Barrier stays.

This is essentially what I've been beating on locally, and I've yet to hit 
another config rom read failure with it.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>


-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] firewire: fw-sbp2: don't retry login or reconnect after unplug

2008-01-28 Thread Jarod Wilson
On Sunday 27 January 2008 01:14:44 pm Stefan Richter wrote:
> If a device is being unplugged while fw-sbp2 had a login or reconnect on
> schedule, it would take about half a minute to shut the fw_unit down:
>
> Jan 27 18:34:54 stein firewire_sbp2: logged in to fw2.0 LUN  (0
> retries) 
> Jan 27 18:34:59 stein firewire_sbp2: sbp2_scsi_abort
> Jan 27 18:34:59 stein scsi 25:0:0:0: Device offlined - not ready after
> error recovery Jan 27 18:35:01 stein firewire_sbp2: orb reply timed out,
> rcode=0x11 Jan 27 18:35:06 stein firewire_sbp2: orb reply timed out,
> rcode=0x11 Jan 27 18:35:12 stein firewire_sbp2: orb reply timed out,
> rcode=0x11 Jan 27 18:35:17 stein firewire_sbp2: orb reply timed out,
> rcode=0x11 Jan 27 18:35:22 stein firewire_sbp2: orb reply timed out,
> rcode=0x11 Jan 27 18:35:27 stein firewire_sbp2: orb reply timed out,
> rcode=0x11 Jan 27 18:35:32 stein firewire_sbp2: orb reply timed out,
> rcode=0x11 Jan 27 18:35:32 stein firewire_sbp2: failed to login to fw2.0
> LUN  Jan 27 18:35:32 stein firewire_sbp2: released fw2.0
>
> After this patch, typically only a few seconds spent in __scsi_add_device
> remain:
>
> Jan 27 19:05:50 stein firewire_sbp2: logged in to fw2.0 LUN  (0
> retries) 
> Jan 27 19:05:56 stein firewire_sbp2: sbp2_scsi_abort
> Jan 27 19:05:56 stein scsi 33:0:0:0: Device offlined - not ready after
> error recovery Jan 27 19:05:56 stein firewire_sbp2: released fw2.0
>
> The benefit of this is negligible on simple setups.  But on buses with
> several devices, we should avoid any unnecessary blockade of fw-sbp2's
> workqueue thread.
>
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

Looks good, makes sense to me, I've hit this one a few times myself. Even if 
we do end up still having the workqueue sleep time, nuking all the 
unnecessary spew and saving a few cycles is definitely worth it.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>


-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH update] firewire: fix "kobject_add failed for fw* with -EEXIST"

2008-01-28 Thread Jarod Wilson
On Sunday 27 January 2008 12:20:40 pm Stefan Richter wrote:
> There is a race between shutdown and creation of devices:  fw-core may
> attempt to add a device with the same name of an already existing
> device.  http://bugzilla.kernel.org/show_bug.cgi?id=9828
>
> Impact of the bug:  Happens rarely, forces the user to unplug and replug
> the new device to get it working.

If you're crazy enough to set up a software raid array on two firewire drives 
that end up contending for the fwX device, its much worse than simply having 
to unplug and replug though, since all hell breaks loose at the fs level and 
the array level.

We may have another issue there though, as when this happened to me, the md 
layer apparently never noticed (after ~6 hours) that one of the array members 
had disappeared -- not sure if that's firewire's fault or md's though... This 
will presumably avoid this situation entirely, but worth noting that there 
may still be somewhere we need to better communicate status to an upper 
layer.

> The fix moves deregistration of the minor number and device_unregister()
> into a common rw_sem protected section.
>
> We also move the ref count increment from fw_device_op_open into an
> rw_sem protected section with the lookup of the device, so that the
> device pointer can't become invalid between lookup and usage.
>
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

Looks straight-forward enough, and I'll give these a spin shortly and see if I 
can reproduce the situation I was hitting with my raid array...


-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH update] firewire: fix "kobject_add failed for fw* with -EEXIST"

2008-01-28 Thread Jarod Wilson
On Monday 28 January 2008 01:54:14 pm Stefan Richter wrote:
> Jarod Wilson wrote:
> > We may have another issue there though, as when this happened to me, the
> > md layer apparently never noticed (after ~6 hours) that one of the array
> > members had disappeared -- not sure if that's firewire's fault or md's
> > though... This will presumably avoid this situation entirely, but worth
> > noting that there may still be somewhere we need to better communicate
> > status to an upper layer.
>
> I don't know how md ticks, so I have no idea what might have happened
> there.

It looks like firewire is doing the right thing, unregistering the fw* device, 
and the SCSI layer is subsequently removing the appropriate /dev/sd* nodes, 
but for whatever reason, md hasn't a clue this has happened. I can reproduce 
this particular part of the problem by bringing the array up, and then simply 
pulling the firewire cable on one of the drives in the array...

> Somewhat related:  What if
>   - we lose connection to disk "A", represented by scsi_device "a",
>   - the SCSI core sets "a" offline,
>   - we gain connection to disk "A" again (i.e. it only shortly
> disappeared from the bus from firewire-core's and -sbp2's point
> of view),
>   - and firewire-sbp2 adds it as scsi_device "b", even before SCSI
> core got rid of "a"?
> No big problem for stand-alone volumes (unless it happens when the
> volume is in use), but maybe trouble for md managed volumes.

That does appear to be the case. If I reconnect the drive I disconnected, 
which was originally /dev/sdb, it comes back up as /dev/sdd now. So 
apparently, the scsi layer is at least bright enough to see that someone (md) 
is still trying to use /dev/sdb, but I'm clueless as to why md doesn't have 
any idea that /dev/sdb actually went away. :\

> To smooth such issues out, my longer term goal was to allow brief
> periods of disconnection in (firewire-)sbp2.  I.e. the SCSI core
> wouldn't notice that "A"/"a" went away, it would only notice that "a"
> wasn't accessible for a short time.  I think the Fibre Channel drivers
> already support this.  The ieee1394 driver even has a "limbo" for
> devices which went away, in order to remember them until they come back,
> but sbp2 doesn't use this feature.  (Nobody did the work to enhance sbp2
> to utilize the feature.)
>
> BTW, if you unplug and replug a FireWire disk under Mac OS X fairly
> quickly, OS X will pretend that nothing happened and let the user
> continue using the disk if he hadn't "ejected" it before the brief
> connection loss.

Certainly sounds like a feature we'd benefit from having in this particular 
case...

> Anyhow, we have a few more urgent problems to solve in firewire-sbp2's
> reconnection handling before we can think about such extras.

Very true... Perhaps I'll just file this one away a bit down the TODO list for 
now... ;)

-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH update] firewire: fix "kobject_add failed for fw* with -EEXIST"

2008-01-28 Thread Jarod Wilson
On Sunday 27 January 2008 12:20:40 pm Stefan Richter wrote:
> There is a race between shutdown and creation of devices:  fw-core may
> attempt to add a device with the same name of an already existing
> device.  http://bugzilla.kernel.org/show_bug.cgi?id=9828
>
> Impact of the bug:  Happens rarely, forces the user to unplug and replug
> the new device to get it working.
>
> The fix moves deregistration of the minor number and device_unregister()
> into a common rw_sem protected section.
>
> We also move the ref count increment from fw_device_op_open into an
> rw_sem protected section with the lookup of the device, so that the
> device pointer can't become invalid between lookup and usage.
>
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

While I've got remaining issues with firewire mdraid, this does appear fix the 
problem of devices racing the grab the same fw* device node, and it seems 
obvious that should indeed be the case from the code changes.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] firewire: fail open() quickly if the node doesn't exist anymore

2008-01-28 Thread Jarod Wilson
On Sunday 27 January 2008 12:21:56 pm Stefan Richter wrote:
> Scenario:  Process A keeps the character device file of node N open.
> N is being unplugged.  File /dev/fwN won't be destroyed as long as A
> doesn't close it.  Now, process B opens /dev/fwN as well.  Previously
> it would succeed but be unable to do any IO on it of course.  With this
> patch, process B's open() will fail immediately with -ENODEV.
>
> Signed-off-by: Stefan Richter <[EMAIL PROTECTED]>

Makes perfect sense to me, no problems with it in cursory testing.

Signed-off-by: Jarod Wilson <[EMAIL PROTECTED]>

-- 
Jarod Wilson
[EMAIL PROTECTED]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net] e1000e: fix PTP on e1000_pch_lpt variants

2016-07-19 Thread Jarod Wilson
I've got reports that the Intel I-218V NIC in Intel NUC5i5RYH systems used
as a PTP slave experiences random ~10 hour clock jumps, which are resolved
if the same workaround for the 82574 and 82583 is employed. Switching from
an if to a select, because the list of NIC types could well grow further
and we'd already have to wrap the conditionals.

CC: Jeff Kirsher 
CC: intel-wired-...@lists.osuosl.org
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 2b2e2f8..866fea0 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4335,7 +4335,10 @@ static cycle_t e1000e_cyclecounter_read(const struct 
cyclecounter *cc)
systim = (cycle_t)systimel;
systim |= (cycle_t)systimeh << 32;
 
-   if ((hw->mac.type == e1000_82574) || (hw->mac.type == e1000_82583)) {
+   switch (hw->mac.type) {
+   case e1000_82574:
+   case e1000_82583:
+   case e1000_pch_lpt:
u64 time_delta, rem, temp;
u32 incvalue;
int i;
@@ -4360,6 +4363,9 @@ static cycle_t e1000e_cyclecounter_read(const struct 
cyclecounter *cc)
(rem == 0))
break;
}
+   break;
+   default:
+   break;
}
return systim;
 }
-- 
1.8.3.1



Re: [Intel-wired-lan] [PATCH net] e1000e: fix PTP on e1000_pch_lpt variants

2016-07-20 Thread Jarod Wilson
On Tue, Jul 19, 2016 at 08:49:03PM +, Rustad, Mark D wrote:
> Jarod Wilson  wrote:
> 
> >I've got reports that the Intel I-218V NIC in Intel NUC5i5RYH systems used
> >as a PTP slave experiences random ~10 hour clock jumps, which are resolved
> >if the same workaround for the 82574 and 82583 is employed. Switching from
> >an if to a select, because the list of NIC types could well grow further
> >and we'd already have to wrap the conditionals.
> >
> >CC: Jeff Kirsher 
> >CC: intel-wired-...@lists.osuosl.org
> >CC: net...@vger.kernel.org
> >Signed-off-by: Jarod Wilson 
> >---
> > drivers/net/ethernet/intel/e1000e/netdev.c | 8 +++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> >diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> >b/drivers/net/ethernet/intel/e1000e/netdev.c
> >index 2b2e2f8..866fea0 100644
> >--- a/drivers/net/ethernet/intel/e1000e/netdev.c
> >+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> >@@ -4335,7 +4335,10 @@ static cycle_t
> >e1000e_cyclecounter_read(const struct cyclecounter *cc)
> > systim = (cycle_t)systimel;
> > systim |= (cycle_t)systimeh << 32;
> >
> >-if ((hw->mac.type == e1000_82574) || (hw->mac.type == e1000_82583)) {
> >+switch (hw->mac.type) {
> >+case e1000_82574:
> >+case e1000_82583:
> >+case e1000_pch_lpt:
> > u64 time_delta, rem, temp;
> > u32 incvalue;
> > int i;
> 
> I don't think that it is acceptable to declare local variables
> inside a switch statement quite like this. At a minimum, a new block
> needs to be opened to allow the declarations.

Gah, sorry, I think testing was done with an if, made a late change to a
switch without doing sufficient re-testing. I'll fix that up and re-test.

-- 
Jarod Wilson
ja...@redhat.com



[PATCH net-next 1/2] e1000e: factor out systim sanitization

2016-07-23 Thread Jarod Wilson
This is prepatory work for an expanding list of adapter families that have
occasional ~10 hour clock jumps when being used for PTP. Factor out the
sanitization function and convert to a switch statement, rather than using
an if.

CC: Jeff Kirsher 
CC: intel-wired-...@lists.osuosl.org
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 71 ++
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 41f32c0..955b294 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4303,6 +4303,42 @@ void e1000e_reinit_locked(struct e1000_adapter *adapter)
 }
 
 /**
+ * e1000e_sanitize_systim - sanitize raw cycle counter reads
+ * @hw: pointer to the HW structure
+ * @systim: cycle_t value read, sanitized and returned
+ *
+ * Errata for 82574/82583 possible bad bits read from SYSTIMH/L:
+ * check to see that the time is incrementing at a reasonable
+ * rate and is a multiple of incvalue.
+ **/
+static cycle_t e1000e_sanitize_systim(struct e1000_hw *hw, cycle_t systim)
+{
+   u64 time_delta, rem, temp;
+   cycle_t systim_next;
+   u32 incvalue;
+   int i;
+
+   incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK;
+   for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) {
+   /* latch SYSTIMH on read of SYSTIML */
+   systim_next = (cycle_t)er32(SYSTIML);
+   systim_next |= (cycle_t)er32(SYSTIMH) << 32;
+
+   time_delta = systim_next - systim;
+   temp = time_delta;
+   /* VMWare users have seen incvalue of zero, don't div / 0 */
+   rem = incvalue ? do_div(temp, incvalue) : (time_delta != 0);
+
+   systim = systim_next;
+
+   if ((time_delta < E1000_82574_SYSTIM_EPSILON) && (rem == 0))
+   break;
+   }
+
+   return systim;
+}
+
+/**
  * e1000e_cyclecounter_read - read raw cycle counter (used by time counter)
  * @cc: cyclecounter structure
  **/
@@ -4312,7 +4348,7 @@ static cycle_t e1000e_cyclecounter_read(const struct 
cyclecounter *cc)
 cc);
struct e1000_hw *hw = &adapter->hw;
u32 systimel, systimeh;
-   cycle_t systim, systim_next;
+   cycle_t systim;
/* SYSTIMH latching upon SYSTIML read does not work well.
 * This means that if SYSTIML overflows after we read it but before
 * we read SYSTIMH, the value of SYSTIMH has been incremented and we
@@ -4335,32 +4371,13 @@ static cycle_t e1000e_cyclecounter_read(const struct 
cyclecounter *cc)
systim = (cycle_t)systimel;
systim |= (cycle_t)systimeh << 32;
 
-   if ((hw->mac.type == e1000_82574) || (hw->mac.type == e1000_82583)) {
-   u64 time_delta, rem, temp;
-   u32 incvalue;
-   int i;
-
-   /* errata for 82574/82583 possible bad bits read from SYSTIMH/L
-* check to see that the time is incrementing at a reasonable
-* rate and is a multiple of incvalue
-*/
-   incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK;
-   for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) {
-   /* latch SYSTIMH on read of SYSTIML */
-   systim_next = (cycle_t)er32(SYSTIML);
-   systim_next |= (cycle_t)er32(SYSTIMH) << 32;
-
-   time_delta = systim_next - systim;
-   temp = time_delta;
-   /* VMWare users have seen incvalue of zero, don't div / 
0 */
-   rem = incvalue ? do_div(temp, incvalue) : (time_delta 
!= 0);
-
-   systim = systim_next;
-
-   if ((time_delta < E1000_82574_SYSTIM_EPSILON) &&
-   (rem == 0))
-   break;
-   }
+   switch (hw->mac.type) {
+   case e1000_82574:
+   case e1000_82583:
+   systim = e1000e_sanitize_systim(hw, systim);
+   break;
+   default:
+   break;
}
return systim;
 }
-- 
1.8.3.1



[PATCH net-next 2/2] e1000e: fix PTP on e1000_pch_lpt variants

2016-07-23 Thread Jarod Wilson
I've got reports that the Intel I-218V NIC in Intel NUC5i5RYH systems used
as a PTP slave experiences random ~10 hour clock jumps, which are resolved
if the same workaround for the 82574 and 82583 is employed.

Reported-by: Rupesh Patel 
CC: Jeff Kirsher 
CC: intel-wired-...@lists.osuosl.org
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 955b294..206bd6a 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4374,6 +4374,7 @@ static cycle_t e1000e_cyclecounter_read(const struct 
cyclecounter *cc)
switch (hw->mac.type) {
case e1000_82574:
case e1000_82583:
+   case e1000_pch_lpt:
systim = e1000e_sanitize_systim(hw, systim);
break;
default:
-- 
1.8.3.1



[PATCH v2 net-next 0/2] e1000e: fix PTP on e1000_pch_variants

2016-07-23 Thread Jarod Wilson
This little series factors out the systim sanitization code first, then
adds e1000_pch_lpt as a new case in the switch that calls the sanitize
function, fixing PTP clock issues I've had reported against an Intel
I-218V NIC in an Intel NUC5ik5RYH system.

Jarod Wilson (2):
  e1000e: factor out systim sanitization
  e1000e: fix PTP on e1000_pch_lpt variants

Note: this series replaces the previously submitted singleton patch that
was, er, broken, titled the same.

Reported-by: Rupesh Patel 
CC: Jeff Kirsher 
CC: intel-wired-...@lists.osuosl.org
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 

 drivers/net/ethernet/intel/e1000e/netdev.c | 72 +++---
 1 file changed, 45 insertions(+), 27 deletions(-)

-- 
1.8.3.1



Re: [Intel-wired-lan] [PATCH net-next v3 1/2] e1000e: factor out systim sanitization

2016-07-27 Thread Jarod Wilson
On Wed, Jul 27, 2016 at 02:09:13PM +, Avargil, Raanan wrote:
>> This is prepatory work for an expanding list of adapter families that have 
>> occasional ~10 hour clock jumps when being used for PTP. Factor out the 
>> sanitization function and convert to using a feature (bug) flag, per 
>> suggestion from Jesse Brandeburg.
>> 
>> Littering functional code with device-specific checks is much messier than 
>> simply checking a flag, and having device-specific init set flags as needed.
>> There are probably a number of other cases in the e1000e code that 
>> could/should be converted similarly.
> 
> Looks ok to me.
> Adding Chris who asked what happens if we reach the max retry counter 
> (E1000_MAX_82574_SYSTIM_REREAD)?
> This counter is set to 50. 
> Can you, for testing purposes, decreased this value (or even set it to 0) and 
> see what happens?

Unfortunately, I don't have direct access to the affected hardware myself,
so I'd have to prep a test build, hand it off to someone and play relay. I
could do that, but it'd have some lag and possible multiple round-trips...
Anyone inside Intel have hardware handy to test on? :p

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net-next 2/2] net: deprecate eth_change_mtu, remove usage

2016-09-14 Thread Jarod Wilson
On Wed, Sep 14, 2016 at 05:59:14AM +0800, kbuild test robot wrote:
> Hi Jarod,
> 
> [auto build test ERROR on net-next/master]
> 
> url:
> https://github.com/0day-ci/linux/commits/Jarod-Wilson/net-centralize-net_device-min-max-MTU-checking/20160913-042130
> config: mips-xway_defconfig (attached as .config)
> compiler: mips-linux-gnu-gcc (Debian 5.4.0-6) 5.4.0 20160609
> reproduce:
> wget 
> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>  -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=mips 
> 
> All error/warnings (new ones prefixed by >>):
> 
>drivers/net/ethernet/lantiq_etop.c: In function 'ltq_etop_change_mtu':
> >> drivers/net/ethernet/lantiq_etop.c:524:7: error: 'ret' undeclared (first 
> >> use in this function)
>  if (!ret) {
>   ^
>drivers/net/ethernet/lantiq_etop.c:524:7: note: each undeclared identifier 
> is reported only once for each function it appears in
> >> drivers/net/ethernet/lantiq_etop.c:534:1: warning: control reaches end of 
> >> non-void function [-Wreturn-type]
> }
> ^

Crap. Did build-test, but only x86_64, so I missed this apparently
mips-specific breakage. Locally fixed up, but waiting to see if anyone has
any other feedback about these two patches before I go ahead with v2. Also
have a ~15-part drivers/net/ethernet/ cleanup series ready and waiting,
pending some traction with the core infra.

-- 
Jarod Wilson
ja...@redhat.com



[PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-06 Thread Jarod Wilson
From: Uwe Koziolek 

With some very finicky switch hardware, active backup bonding can get into
a situation where we play ping-pong between interfaces, trying to get one
to come up as the active slave. There seems to be an issue with the
switch's arp replies either taking too long, or simply getting lost, so we
wind up unable to get any interface up and active. Sometimes, the issue
sorts itself out after a while, sometimes it doesn't.

Testing with num_grat_arp has proven fruitless, but sending an additional
arp on curr_arp_slave if we're still in the arp_interval timeslice in
bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
this hardware combination.

[jarod: manufacturing of changelog, addition of modparam gating]
CC: Jay Vosburgh 
CC: Andy Gospodarek 
CC: Veaceslav Falico 
CC: net...@vger.kernel.org
Signed-off-by: Uwe Koziolek 
Signed-off-by: Jarod Wilson 
---
v2: add code comment as to why change is needed
v3: fix wrapping of comments
v4: [jarod] add module parameter gating of code addition

 drivers/net/bonding/bond_main.c | 24 
 include/net/bonding.h   |  1 +
 2 files changed, 25 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 90f2615..72ab512 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -95,6 +95,7 @@ static int miimon;
 static int updelay;
 static int downdelay;
 static int use_carrier = 1;
+static int arp_slow_switch;
 static char *mode;
 static char *primary;
 static char *primary_reselect;
@@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link 
down, "
 module_param(use_carrier, int, 0);
 MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; 
"
  "0 for off, 1 for on (default)");
+module_param(arp_slow_switch, int, 0);
+MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
+ "caches that are slow to update; "
+ "0 for off (default), 1 for on");
 module_param(mode, charp, 0);
 MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
   "1 for active-backup, 2 for balance-xor, "
@@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
return should_notify_rtnl;
}
 
+   /* Sometimes the forwarding tables of the switches are not update
+* fast enough, so the first arp response after a slave change is
+* received on the wrong slave.
+*
+* The arp requests will be retried 2 times on the same slave.
+*/
+   if (arp_slow_switch &&
+   bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+   bond_arp_send_all(bond, curr_arp_slave);
+   return should_notify_rtnl;
+   }
+
bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
 
bond_for_each_slave_rcu(bond, slave, iter) {
@@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
use_carrier = 1;
}
 
+   if ((arp_slow_switch != 0) && (arp_slow_switch != 1)) {
+   pr_warn("Warning: arp_slow_switch module parameter (%d), not of 
valid value (0/1), so it was set to 1\n",
+   arp_slow_switch);
+   arp_slow_switch = 1;
+   }
+
if (num_peer_notif < 0 || num_peer_notif > 255) {
pr_warn("Warning: num_grat_arp/num_unsol_na (%d) not in range 
0-255 so it was reset to 1\n",
num_peer_notif);
@@ -4516,6 +4539,7 @@ static int bond_check_params(struct bond_params *params)
params->updelay = updelay;
params->downdelay = downdelay;
params->use_carrier = use_carrier;
+   params->arp_slow_switch = arp_slow_switch;
params->lacp_fast = lacp_fast;
params->primary[0] = 0;
params->primary_reselect = primary_reselect_value;
diff --git a/include/net/bonding.h b/include/net/bonding.h
index c1740a2..208d31c 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -120,6 +120,7 @@ struct bond_params {
int arp_validate;
int arp_all_targets;
int use_carrier;
+   int arp_slow_switch;
int fail_over_mac;
int updelay;
int downdelay;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-06 Thread Jarod Wilson

Jarod Wilson wrote:

From: Uwe Koziolek

With some very finicky switch hardware, active backup bonding can get into
a situation where we play ping-pong between interfaces, trying to get one
to come up as the active slave. There seems to be an issue with the
switch's arp replies either taking too long, or simply getting lost, so we
wind up unable to get any interface up and active. Sometimes, the issue
sorts itself out after a while, sometimes it doesn't.

Testing with num_grat_arp has proven fruitless, but sending an additional
arp on curr_arp_slave if we're still in the arp_interval timeslice in
bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
this hardware combination.

[jarod: manufacturing of changelog, addition of modparam gating]
CC: Jay Vosburgh
CC: Andy Gospodarek
CC: Veaceslav Falico
CC: net...@vger.kernel.org
Signed-off-by: Uwe Koziolek
Signed-off-by: Jarod Wilson
---
v2: add code comment as to why change is needed
v3: fix wrapping of comments
v4: [jarod] add module parameter gating of code addition

  drivers/net/bonding/bond_main.c | 24 
  include/net/bonding.h   |  1 +
  2 files changed, 25 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 90f2615..72ab512 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -95,6 +95,7 @@ static int miimon;
  static int updelay;
  static int downdelay;
  static int use_carrier= 1;
+static int arp_slow_switch;
  static char *mode;
  static char *primary;
  static char *primary_reselect;
@@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link 
down, "
  module_param(use_carrier, int, 0);
  MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; 
"
  "0 for off, 1 for on (default)");
+module_param(arp_slow_switch, int, 0);
+MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
+ "caches that are slow to update; "
+ "0 for off (default), 1 for on");
  module_param(mode, charp, 0);
  MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
   "1 for active-backup, 2 for balance-xor, "
@@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
return should_notify_rtnl;
}

+   /* Sometimes the forwarding tables of the switches are not update
+* fast enough, so the first arp response after a slave change is
+* received on the wrong slave.
+*
+* The arp requests will be retried 2 times on the same slave.
+*/
+   if (arp_slow_switch &&


This here should actually be bond->params.arp_slow_switch, but I'd like 
to hear first if a module parameter gating this change is even a 
remotely acceptable idea. It'd keep the logic identical in the default 
case though, and still allow for people like Uwe that need it to deploy 
the work-around.


Though I'm slightly curious if this problem does NOT manifest by simply 
setting a larger arp_interval. Early on, I thought I'd heard that other 
intervals had been tried with the same results, but a comment in this 
thread suggested maybe only 500 had been tried.


--
Jarod Wilson
ja...@redhat.com


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Intel-wired-lan] [PATCH] igb: add more checks for disconnected adapter

2015-10-06 Thread Jarod Wilson

Alexander Duyck wrote:

On 09/21/2015 09:14 PM, Jarod Wilson wrote:

Alexander Duyck wrote:

On 09/21/2015 10:11 AM, Jarod Wilson wrote:

Some pci changes upcoming in 4.3 seem to cause additional disconnects,
which can happen at unfortuitous times for igb, leading to issues
such as
this, where the disconnect happened just before
igb_configure_tx_ring():


...

Note: this is a follow-up patch in addition to the previously submitted
"igb: don't unmap NULL hw_addr"

drivers/net/ethernet/intel/igb/igb_main.c | 15 +--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
b/drivers/net/ethernet/intel/igb/igb_main.c
index 6369f9e..7060edf 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -952,6 +952,11 @@ static int igb_request_msix(struct igb_adapter
*adapter)
if (err)
goto err_out;

+ if (E1000_REMOVED(hw->hw_addr)) {
+ err = -EIO;
+ goto err_free;
+ }
+
for (i = 0; i < adapter->num_q_vectors; i++) {
struct igb_q_vector *q_vector = adapter->q_vector[i];




Instead of using E1000_REMOVED we should just replace the
adapter->hw.hw_addr in the setup of the itr_register with
adapter->io_addr like you did for Rx/Tx below.


I just tried that, and it reliably blows up horrifically, wedging the
machine to the point where all I could get was a screen shot with my
phone thus far, when we jump from igb_request_msix() to
igb_configure_msix() to igb_assign_vector() and finally to
igb_write_ivar(), at least as best as I can tell from what I was able to
see in the trace remnants still on-screen.


Take a look at array_rd32, that is buggy and doesn't match up with the
rd32 implementation. If you fix that then the blow-up should go away.

You shouldn't need to worry about itr_register since it will only get
triggered if the interrupt can be enabled and that shouldn't be able to
happen if the device is not present.

...

Just switching to adapter->io_addr everywhere seems to not work as noted
above. :\ Note that I'm also chasing this from the other end with the
author of the pci patches that seem to have triggered this, so the real
bug might be over in pci-land, but hardening against explosions in igb
still seems like a worthwhile effort here.


I am pretty sure array_rd32 is the problem. If you fix that then I
suspect you should quit seeing any further issues.


Okay, I've reworked array_rd32 a bit to call igb_rd32 as well, and with 
that and some other minor tweaks, nothing blowing up. The hardware still 
doesn't work when hotplugged, due to disappearing off the pci bus 
temporarily, but I think that's a pci hotplug issue, not igb's fault. 
Things seem to behave just fine if the device is connected at boot. 
Updated patch forthcoming for dissection...


--
Jarod Wilson
ja...@redhat.com


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] igb: improve handling of disconnected adapters

2015-10-06 Thread Jarod Wilson
810]  [] netlink_unicast+0x14e/0x220
[  414.855821]  [] netlink_sendmsg+0x31a/0x390
[  414.855833]  [] sock_sendmsg+0x38/0x50
[  414.855843]  [] ___sys_sendmsg+0x27e/0x2a0
[  414.855855]  [] ? sysctl_head_finish+0x3f/0x50
[  414.855866]  [] ? proc_put_long+0xb0/0xb0
[  414.855877]  [] ? proc_sys_call_handler+0x79/0xc0
[  414.855890]  [] ? lockref_put_or_lock+0x4c/0x80
[  414.855902]  [] __sys_sendmsg+0x42/0x80
[  414.855913]  [] SyS_sendmsg+0x12/0x20
[  414.855924]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[  414.855935] Code: c1 49 89 4e 30 49 8b 85 c0 05 00 00 48 85 c0 0f 84 0e 01 
00 00 81 c2 10 38 00 00 48 63 d2 48 01 d0 31 d2 89 10 49 8b 46 30 31 d2 <89> 10 
41 8b 95 44 06 00 00 b8 14 01 10 02 83
fa 05 74 0b 83 fa
[  414.856037] RIP  [] igb_configure_tx_ring+0x14c/0x250 [igb]
[  414.856052]  RSP 
[  414.856057] CR2: 3818
[  414.872327] ---[ end trace e97522c0c584ea70 ]---

This can at least be reduced to a harmless disconnect if we do some
additional checking of device presence, similar to ixgbe, and use a saved
io_addr instead of hw_addr, which we may have made NULL.

[ 8010.562550] igb :15:00.0: enabling device ( -> 0002)
[ 8010.597402] pps pps0: new PPS source ptp1
[ 8010.597406] igb :15:00.0: added PHC on eth0
[ 8010.597407] igb :15:00.0: Intel(R) Gigabit Ethernet Network Connection
[ 8010.597409] igb :15:00.0: eth0: (PCIe:2.5Gb/s:Width x1) e8:ea:6a:00:1b:2a
[ 8010.597543] igb :15:00.0: eth0: PBA No: 000200-000
[ 8010.597545] igb :15:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx 
queue(s)
[ 8010.600468] igb :15:00.0 enp21s0: renamed from eth0
[ 8010.619354] IPv6: ADDRCONF(NETDEV_UP): enp21s0: link is not ready
[ 8010.663999] igb :15:00.0 enp21s0: PCIe link lost, device now detached
[ 8011.012427] igb :15:00.0: Unable to allocate memory for vectors

CC: Mark Rustad 
CC: Jeff Kirsher 
CC: Alexand Duyck 
CC: intel-wired-...@lists.osuosl.org
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/ethernet/intel/igb/e1000_regs.h |  7 +++
 drivers/net/ethernet/intel/igb/igb_main.c   | 14 +++---
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_regs.h 
b/drivers/net/ethernet/intel/igb/e1000_regs.h
index 4af2870..0227f0b 100644
--- a/drivers/net/ethernet/intel/igb/e1000_regs.h
+++ b/drivers/net/ethernet/intel/igb/e1000_regs.h
@@ -368,7 +368,7 @@
 
 struct e1000_hw;
 
-u32 igb_rd32(struct e1000_hw *hw, u32 reg);
+u32 igb_rd32(struct e1000_hw *hw, u32 reg, u32 offset);
 
 /* write operations, indexed using DWORDS */
 #define wr32(reg, val) \
@@ -378,15 +378,14 @@ do { \
writel((val), &hw_addr[(reg)]); \
 } while (0)
 
-#define rd32(reg) (igb_rd32(hw, reg))
+#define rd32(reg) (igb_rd32(hw, reg, 0))
 
 #define wrfl() ((void)rd32(E1000_STATUS))
 
 #define array_wr32(reg, offset, value) \
wr32((reg) + ((offset) << 2), (value))
 
-#define array_rd32(reg, offset) \
-   (readl(hw->hw_addr + reg + ((offset) << 2)))
+#define array_rd32(reg, offset) (igb_rd32(hw, reg, offset << 2))
 
 /* DMA Coalescing registers */
 #define E1000_PCIEMISC 0x05BB8 /* PCIE misc config register */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index d645d65..3b21f85 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -743,7 +743,7 @@ static void igb_cache_ring_register(struct igb_adapter 
*adapter)
}
 }
 
-u32 igb_rd32(struct e1000_hw *hw, u32 reg)
+u32 igb_rd32(struct e1000_hw *hw, u32 reg, u32 offset)
 {
struct igb_adapter *igb = container_of(hw, struct igb_adapter, hw);
u8 __iomem *hw_addr = ACCESS_ONCE(hw->hw_addr);
@@ -752,10 +752,10 @@ u32 igb_rd32(struct e1000_hw *hw, u32 reg)
if (E1000_REMOVED(hw_addr))
return ~value;
 
-   value = readl(&hw_addr[reg]);
+   value = readl(&hw_addr[reg + offset]);
 
/* reads should not return all F's */
-   if (!(~value) && (!reg || !(~readl(hw_addr {
+   if (!(~value) && (!(reg + offset) || !(~readl(hw_addr {
struct net_device *netdev = igb->netdev;
hw->hw_addr = NULL;
netif_device_detach(netdev);
@@ -959,7 +959,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
 
vector++;
 
-   q_vector->itr_register = hw->hw_addr + E1000_EITR(vector);
+   q_vector->itr_register = adapter->io_addr + E1000_EITR(vector);
 
if (q_vector->rx.ring && q_vector->tx.ring)
sprintf(q_vector->name, "%s-TxRx-%u", netdev->name,
@@ -1230,7 +1230,7 @@ static int igb_alloc_q_vector(struct igb_adapter *adapter,
q_vector->tx.work_limit = adapter->tx_work_limit;
 
/* initialize ITR configuration */
-   q_vector->itr_regis

[PATCH v3] igb: improve handling of disconnected adapters

2015-10-07 Thread Jarod Wilson
810]  [] netlink_unicast+0x14e/0x220
[  414.855821]  [] netlink_sendmsg+0x31a/0x390
[  414.855833]  [] sock_sendmsg+0x38/0x50
[  414.855843]  [] ___sys_sendmsg+0x27e/0x2a0
[  414.855855]  [] ? sysctl_head_finish+0x3f/0x50
[  414.855866]  [] ? proc_put_long+0xb0/0xb0
[  414.855877]  [] ? proc_sys_call_handler+0x79/0xc0
[  414.855890]  [] ? lockref_put_or_lock+0x4c/0x80
[  414.855902]  [] __sys_sendmsg+0x42/0x80
[  414.855913]  [] SyS_sendmsg+0x12/0x20
[  414.855924]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[  414.855935] Code: c1 49 89 4e 30 49 8b 85 c0 05 00 00 48 85 c0 0f 84 0e 01 
00 00 81 c2 10 38 00 00 48 63 d2 48 01 d0 31 d2 89 10 49 8b 46 30 31 d2 <89> 10 
41 8b 95 44 06 00 00 b8 14 01 10 02 83
fa 05 74 0b 83 fa
[  414.856037] RIP  [] igb_configure_tx_ring+0x14c/0x250 [igb]
[  414.856052]  RSP 
[  414.856057] CR2: 3818
[  414.872327] ---[ end trace e97522c0c584ea70 ]---

This can at least be reduced to a harmless disconnect if we do some
additional checking of device presence, similar to ixgbe, and use a saved
io_addr instead of hw_addr, which we may have made NULL, and clean up
array_rd32 so that it uses igb_rd32 like rd32 does, per Alexander Duyck's
suggestion.

[ 8010.562550] igb :15:00.0: enabling device ( -> 0002)
[ 8010.597402] pps pps0: new PPS source ptp1
[ 8010.597406] igb :15:00.0: added PHC on eth0
[ 8010.597407] igb :15:00.0: Intel(R) Gigabit Ethernet Network Connection
[ 8010.597409] igb :15:00.0: eth0: (PCIe:2.5Gb/s:Width x1) e8:ea:6a:00:1b:2a
[ 8010.597543] igb :15:00.0: eth0: PBA No: 000200-000
[ 8010.597545] igb :15:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx 
queue(s)
[ 8010.600468] igb :15:00.0 enp21s0: renamed from eth0
[ 8010.619354] IPv6: ADDRCONF(NETDEV_UP): enp21s0: link is not ready
[ 8010.663999] igb :15:00.0 enp21s0: PCIe link lost, device now detached
[ 8011.012427] IPv6: ADDRCONF(NETDEV_UP): enp21s0: link is not ready

The device is still not functional at this point, but I believe that's a
pci hotplug issue, not an igb issue.

CC: Mark Rustad 
CC: Jeff Kirsher 
CC: Alexander Duyck 
CC: intel-wired-...@lists.osuosl.org
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
note: this patch is dependent on the previously submitted "igb: don't unmap 
hw_addr
  if its NULL", which added io_addr.

v3: don't modify igb_rd32, just pass reg + ((offset) <<2) as suggested by
Alexander Duyck.

 drivers/net/ethernet/intel/igb/e1000_regs.h | 3 +--
 drivers/net/ethernet/intel/igb/igb_main.c   | 8 
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_regs.h 
b/drivers/net/ethernet/intel/igb/e1000_regs.h
index 4af2870..e7cb23f 100644
--- a/drivers/net/ethernet/intel/igb/e1000_regs.h
+++ b/drivers/net/ethernet/intel/igb/e1000_regs.h
@@ -385,8 +385,7 @@ do { \
 #define array_wr32(reg, offset, value) \
wr32((reg) + ((offset) << 2), (value))
 
-#define array_rd32(reg, offset) \
-   (readl(hw->hw_addr + reg + ((offset) << 2)))
+#define array_rd32(reg, offset) (igb_rd32(hw, reg + ((offset) << 2)))
 
 /* DMA Coalescing registers */
 #define E1000_PCIEMISC 0x05BB8 /* PCIE misc config register */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index d645d65..b05d1cb 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -959,7 +959,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
 
vector++;
 
-   q_vector->itr_register = hw->hw_addr + E1000_EITR(vector);
+   q_vector->itr_register = adapter->io_addr + E1000_EITR(vector);
 
if (q_vector->rx.ring && q_vector->tx.ring)
sprintf(q_vector->name, "%s-TxRx-%u", netdev->name,
@@ -1230,7 +1230,7 @@ static int igb_alloc_q_vector(struct igb_adapter *adapter,
q_vector->tx.work_limit = adapter->tx_work_limit;
 
/* initialize ITR configuration */
-   q_vector->itr_register = adapter->hw.hw_addr + E1000_EITR(0);
+   q_vector->itr_register = adapter->io_addr + E1000_EITR(0);
q_vector->itr_val = IGB_START_ITR;
 
/* initialize pointer to rings */
@@ -3284,7 +3284,7 @@ void igb_configure_tx_ring(struct igb_adapter *adapter,
 tdba & 0xULL);
wr32(E1000_TDBAH(reg_idx), tdba >> 32);
 
-   ring->tail = hw->hw_addr + E1000_TDT(reg_idx);
+   ring->tail = adapter->io_addr + E1000_TDT(reg_idx);
wr32(E1000_TDH(reg_idx), 0);
writel(0, ring->tail);
 
@@ -3640,7 +3640,7 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
 ring->count * sizeof(union e1000_adv_rx_desc));
 
/* initialize head and tail */
-   ring->tail = hw->hw_addr + E1000_RDT(reg_idx);
+   ring->tail = 

Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-07 Thread Jarod Wilson

Nikolay Aleksandrov wrote:

On 10/06/2015 09:53 PM, Jarod Wilson wrote:

From: Uwe Koziolek

With some very finicky switch hardware, active backup bonding can get into
a situation where we play ping-pong between interfaces, trying to get one
to come up as the active slave. There seems to be an issue with the
switch's arp replies either taking too long, or simply getting lost, so we
wind up unable to get any interface up and active. Sometimes, the issue
sorts itself out after a while, sometimes it doesn't.

Testing with num_grat_arp has proven fruitless, but sending an additional
arp on curr_arp_slave if we're still in the arp_interval timeslice in
bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
this hardware combination.

[jarod: manufacturing of changelog, addition of modparam gating]
CC: Jay Vosburgh
CC: Andy Gospodarek
CC: Veaceslav Falico
CC: net...@vger.kernel.org
Signed-off-by: Uwe Koziolek
Signed-off-by: Jarod Wilson
---
v2: add code comment as to why change is needed
v3: fix wrapping of comments
v4: [jarod] add module parameter gating of code addition


Hi all,
As Andy already stated I'm not a fan of such workarounds either but it's
necessary sometimes so if this is going to be actually considered then a
few things need to be fixed. Please make this a proper bonding option
which can be changed at runtime and not only via a module parameter.


Okay, I can give that a shot, however...


Now, I saw that you've only tested with 500 ms, can't this be fixed by using
a different interval ? This seems like a very specific problem to have a
whole new option for.


...I'll wait until we've heard confirmation from Uwe that intervals 
other than 500ms don't fix things.



I really want to say fix the switch but I know that's not an option. :-)


Yeah, unfortunately not!


A few minor nits below,


  drivers/net/bonding/bond_main.c | 24 
  include/net/bonding.h   |  1 +
  2 files changed, 25 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 90f2615..72ab512 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -95,6 +95,7 @@ static int miimon;
  static int updelay;
  static int downdelay;
  static int use_carrier= 1;
+static int arp_slow_switch;
  static char *mode;
  static char *primary;
  static char *primary_reselect;
@@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link 
down, "
  module_param(use_carrier, int, 0);
  MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; 
"
  "0 for off, 1 for on (default)");
+module_param(arp_slow_switch, int, 0);
+MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
+ "caches that are slow to update; "
+ "0 for off (default), 1 for on");
  module_param(mode, charp, 0);
  MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
   "1 for active-backup, 2 for balance-xor, "
@@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
return should_notify_rtnl;
}

+   /* Sometimes the forwarding tables of the switches are not update

^ s/update/updated/


D'oh. Fixed locally.


@@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
use_carrier = 1;
}

+   if ((arp_slow_switch != 0) &&  (arp_slow_switch != 1)) {

^^ no need for the extra ()


Copy-pasta from use_carrier checks right above it. Never quite sure if I 
should stick with the same possibly sub-optimal formatting conventions 
already in the file, try to fix them while also fixing bugs, or just mix 
styles...




+   pr_warn("Warning: arp_slow_switch module parameter (%d), not of 
valid value (0/1), so it was set to 1\n",
+   arp_slow_switch);
+   arp_slow_switch = 1;

^^ please default to old behaviour in this case (0)


Will do.



--
Jarod Wilson
ja...@redhat.com


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net/bonding: send arp in interval if no active slave

2015-08-31 Thread Jarod Wilson

On 2015-08-17 4:51 PM, Uwe Koziolek wrote:

On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:

Uwe Koziolek  wrote:


On2015-08-17 07:12 PM,Jarod Wilson wrote:

...

Uwe, can you perhaps further enlighten us as to what num_grat_arp
settings were tried that didn't help? I'm still of the mind that if
num_grat_arp *didn't* help, we probably need to do something keyed off
num_grat_arp.

The bonding slaves are connected to high available switches, each of the
slaves is connected to a different switch. If the bond is starting, only
the selected slave sends one arp-request. If a matching arp_response was
received, this slave and the bond is going into state up, sending the
gratitious arps...
But if you got no arp reply the next slave was selected.
With most of the newer switches, not overloaded, or with other software
bugs, or with a single switch configuration, you would get a arp
response
on the first arp request.
But in case of high availability configuration with non perfect switches
like HP ProCurve 54xx, also with some Cisco models, you may not get a
response on the first arp request.

I have seen network snoops, there the switches are not responding to the
first arp request on slave 1, the second arp request was sent on slave 2
but the response was received on slave one,  and all following arp
requests are anwsered on the wrong slave for a longer time.

Could you elaborate on the exact "high availability
configuration" here, including the model(s) of switch(es) involved?

Is this some kind of race between the switch or switches
updating the forwarding tables and the bond flip flopping between the
slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
the forwarding table, but (for whatever reason) there is no reply.  ARP
on slave 2 is sent (using the same source MAC, unless you set
fail_over_mac), but forwarding tables still send that MAC to slave 1, so
reply is sent there.

High availability:
2 managed switches with routing capabilities have an interconnect.
One slave of a bonding interface is connected to the first switch, the
second slave is connected to the other switch.
The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as i
remember also HP E 3500 and  E 3800 are also
affected, for the affected Cisco models I can't answer today.
Affected single switch configurations was not seen.

Yes, race conditions with delayed upgrades of the forwarding tables is a
well matching explanation for the problem.


The proposed change sents up to 3 arp requests on a down bond using the
same slave, delayed by arp_interval.
Using problematic switches i have seen the the arp response on the right
slave at latest on the second arp request. So the bond is going into
state
up.

How does it works:
The bonds in up state are handled on the beginning of bond_ab_arp_probe
procedure, the other part of this procedure is handling the slave
change.
The proposed change is bypassing the slave change for 2 additional calls
of bond_ab_arp_probe.
Now the retries are not only for an up bond available, they are also
implemented for a down bond.

Does this delay failover or bringup on switches that are not
"problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
this impact failover / recovery times?

-J

It depends.
failover times are not impacted, this is handled different.
Only the transition from a down bonding interface (bond and all slaves
are down) to the state up can be increased by up to 2 times arp_interval,
If the selected interface did not came up .If well working switches are
used, and everything other is also ok, there are no impacts.


Jay, any further thoughts on this given Uwe's reply? Uwe, did you have a 
chance to get affected Cisco model numbers too?


--
Jarod Wilson
ja...@redhat.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-12 Thread Jarod Wilson

Jay Vosburgh wrote:

Jarod Wilson  wrote:


Jarod Wilson wrote:
...

As Andy already stated I'm not a fan of such workarounds either but it's
necessary sometimes so if this is going to be actually considered then a
few things need to be fixed. Please make this a proper bonding option
which can be changed at runtime and not only via a module parameter.

Is there any particular userspace tool that would need some updating, or
is adding the sysfs knobs sufficient here? I think I've got all the sysfs
stuff thrown together now, but still need to test.


Most (all?) bonding options should be configurable via iproute
(netlink) now.


D'oh, of course. I've done the kernel-side netlink bits now too, and 
started looking at the iproute source. However...




Now, I saw that you've only tested with 500 ms, can't this be fixed by
using
a different interval ? This seems like a very specific problem to have a
whole new option for.

...I'll wait until we've heard confirmation from Uwe that intervals
other than 500ms don't fix things.

Okay, so I believe the "only tested with 500ms" was in reference to
testing with Uwe's initial patch. I do have supporting evidence in a
bugzilla report that shows upwards of 5000ms still experience the problem
here.


I did set up some switches and attempt to reproduce this
yesterday; I daisy-chained three switches (two Cisco and an HP) together
and connected the bonded interfaces to the "end" switches.  I tried
various ARP targets (the switch, hosts on various points of the switch)
and varying arp_intervals and was unable to reproduce the problem.

As I understand it, the working theory is something like this:

- host with two bonded interfaces, A and B.  For active-backup
mode, the interfaces have been assigned the same MAC address.

- switch has MAC for B in its forwarding table

- bonding goes from down to up, and thinks all its slaves are
down, and starts the "curr_arp_slave" search for an active
arp_ip_target.  In this case, it starts with A, and sends an ARP from A.

As an aside, I'm not 100% clear on what exactly is going on in
the "bonding goes from down to up" transition; this seems to be key in
reproducing the issue.

- switch sees source mac coming from port A, starts to update
its forwarding table

- meanwhile, switch forwards ARP request, and receives ARP
reply, which it forwards to port B.  Bonding drops this, as the slave is
inactive.

- switch finishes updating forwarding table, MAC is now assigned
to port A.

- bonding now tries sending on port B, and the cycle repeats.

If this is what's taking place, then the arp_interval itself is
irrelevant, the race is between the switch table update and the
generation of the ARP reply.

Also, presuming the above is what's going on, we could modify
the ARP "curr_arp_slave" logic a bit to resolve this without requiring
any magic knobs.


I really like this idea. Still trying to grasp exactly how we get into 
this situation and what everything looks like as we hop through the 
various bond_ab_arp_* functions though.



For example, we could change the "drop on inactive" logic to
recognise the "curr_arp_slave" search and accept the unicast ARP reply,
and perhaps make that receiving slave the next curr_arp_slave
automatically.


Nothing ever actually getting picked as curr_arp_slave does appear to be 
the problem, so that does sound like it could do the trick.



I also wonder if the fail_over_mac option would affect this
behavior, as it would cause the slaves to keep their MAC address for the
duration, so the switch would not see the MAC move from port to port.


Not sure if that's an option for the particular environment, but we 
could certainly ask Uwe to give it a try.



Another thought would be to have the curr_arp_slave cycle
through the slaves in random order, but that could create
non-deterministic results even when things are working correctly.


I'd say avoid this route if at all possible, would rather not make 
things less predictable.


--
Jarod Wilson
ja...@redhat.com


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] igb: improve handling of disconnected adapters

2015-10-19 Thread Jarod Wilson
Clean up array_rd32 so that it uses igb_rd32 the same as rd32, per the
suggestion of Alexander Duyck, and use io_addr in more places, so that
we don't have the need to call E1000_REMOVED (which simply looks for a
null hw_addr) nearly as much.

CC: Mark Rustad 
CC: Jeff Kirsher 
CC: Alexander Duyck 
CC: intel-wired-...@lists.osuosl.org
CC: net...@vger.kernel.org
Acked-by: Alexander Duyck 
Signed-off-by: Jarod Wilson 
---
Note: this patch is rebased on Jeff's next-queue/dev-queue branch, which
already had an earlier revision of this applied, so I've essentially
reverted a2675ab and applied the revised version of this, squashed them
together, and here is the end result, which matches the version Alex acked.

 drivers/net/ethernet/intel/igb/e1000_regs.h |  3 +--
 drivers/net/ethernet/intel/igb/igb_main.c   | 15 ++-
 2 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_regs.h 
b/drivers/net/ethernet/intel/igb/e1000_regs.h
index 0fdcd4d..21d9d02 100644
--- a/drivers/net/ethernet/intel/igb/e1000_regs.h
+++ b/drivers/net/ethernet/intel/igb/e1000_regs.h
@@ -386,8 +386,7 @@ do { \
 #define array_wr32(reg, offset, value) \
wr32((reg) + ((offset) << 2), (value))
 
-#define array_rd32(reg, offset) \
-   (readl(hw->hw_addr + reg + ((offset) << 2)))
+#define array_rd32(reg, offset) (igb_rd32(hw, reg + ((offset) << 2)))
 
 /* DMA Coalescing registers */
 #define E1000_PCIEMISC 0x05BB8 /* PCIE misc config register */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 044a23e..68006a5 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -954,17 +954,12 @@ static int igb_request_msix(struct igb_adapter *adapter)
if (err)
goto err_out;
 
-   if (E1000_REMOVED(hw->hw_addr)) {
-   err = -EIO;
-   goto err_free;
-   }
-
for (i = 0; i < adapter->num_q_vectors; i++) {
struct igb_q_vector *q_vector = adapter->q_vector[i];
 
vector++;
 
-   q_vector->itr_register = hw->hw_addr + E1000_EITR(vector);
+   q_vector->itr_register = adapter->io_addr + E1000_EITR(vector);
 
if (q_vector->rx.ring && q_vector->tx.ring)
sprintf(q_vector->name, "%s-TxRx-%u", netdev->name,
@@ -1206,9 +1201,6 @@ static int igb_alloc_q_vector(struct igb_adapter *adapter,
if (txr_count > 1 || rxr_count > 1)
return -ENOMEM;
 
-   if (E1000_REMOVED(adapter->hw.hw_addr))
-   return -EIO;
-
ring_count = txr_count + rxr_count;
size = sizeof(struct igb_q_vector) +
   (sizeof(struct igb_ring) * ring_count);
@@ -1238,7 +1230,7 @@ static int igb_alloc_q_vector(struct igb_adapter *adapter,
q_vector->tx.work_limit = adapter->tx_work_limit;
 
/* initialize ITR configuration */
-   q_vector->itr_register = adapter->hw.hw_addr + E1000_EITR(0);
+   q_vector->itr_register = adapter->io_addr + E1000_EITR(0);
q_vector->itr_val = IGB_START_ITR;
 
/* initialize pointer to rings */
@@ -3281,9 +3273,6 @@ void igb_configure_tx_ring(struct igb_adapter *adapter,
u64 tdba = ring->dma;
int reg_idx = ring->reg_idx;
 
-   if (E1000_REMOVED(adapter->io_addr))
-   return;
-
/* disable the queue */
wr32(E1000_TXDCTL(reg_idx), 0);
wrfl();
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH net-next] net/core: initial support for stacked dev feature toggles

2015-10-23 Thread Jarod Wilson
There are some netdev features that make little sense to toggle on and
off in a stacked device setup on only one device in the stack. The prime
example is a bonded connection, where it really doesn't make sense to
disable LRO on the master, but not on any of the slaves, nor does it
really make sense to be able to shut LRO off on a slave when its still
enabled on the master.

The strategy here is to add a section near the end of
netdev_fix_features() that looks for upper and lower netdevs, then make
sure certain feature flags match both up and down the stack. At present,
only the LRO flag is included.

This has been successfully tested with bnx2x, qlcnic and netxen network
cards as slaves in a bond interface. Turning LRO on or off on the master
also turns it on or off on each of the slaves, new slaves are added with
LRO in the same state as the master, and LRO can't be toggled on the
slaves.

Also, this should largely remove the need for dev_disable_lro(), and most,
if not all, of its call sites can be replaced by simply making sure
NETIF_F_LRO isn't included in the relevant device's feature flags.

Note that this patch is driven by bug reports from users saying it was
confusing that bonds and slaves had different settings for the same
features, and while it won't be 100% in sync if a lower device doesn't
support a feature like LRO, I think this is a good step in the right
direction.

CC: "David S. Miller" 
CC: Eric Dumazet 
CC: Jay Vosburgh 
CC: Veaceslav Falico 
CC: Andy Gospodarek 
CC: Jiri Pirko 
CC: Nikolay Aleksandrov 
CC: Michal Kubecek 
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 net/core/dev.c | 57 +
 1 file changed, 57 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1225b4b..26f4e2d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6261,9 +6261,57 @@ static void rollback_registered(struct net_device *dev)
list_del(&single);
 }
 
+static netdev_features_t netdev_sync_upper_features(struct net_device *lower,
+   struct net_device *upper, netdev_features_t features)
+{
+   netdev_features_t want = upper->wanted_features & lower->hw_features;
+
+   if (!(upper->wanted_features & NETIF_F_LRO)
+   && (features & NETIF_F_LRO)) {
+   netdev_info(lower, "Dropping LRO, upper dev %s has it off.\n",
+  upper->name);
+   features &= ~NETIF_F_LRO;
+   } else if ((want & NETIF_F_LRO) && !(features & NETIF_F_LRO)) {
+   netdev_info(lower, "Keeping LRO, upper dev %s has it on.\n",
+  upper->name);
+   features |= NETIF_F_LRO;
+   }
+
+   return features;
+}
+
+static void netdev_sync_lower_features(struct net_device *upper,
+   struct net_device *lower, netdev_features_t features)
+{
+   netdev_features_t want = features & lower->hw_features;
+
+   if (!(features & NETIF_F_LRO) && (lower->features & NETIF_F_LRO)) {
+   netdev_info(upper, "Disabling LRO on lower dev %s.\n",
+  lower->name);
+   upper->wanted_features &= ~NETIF_F_LRO;
+   lower->wanted_features &= ~NETIF_F_LRO;
+   netdev_update_features(lower);
+   if (unlikely(lower->features & NETIF_F_LRO))
+   netdev_WARN(upper, "failed to disable LRO on %s!\n",
+   lower->name);
+   } else if ((want & NETIF_F_LRO) && !(lower->features & NETIF_F_LRO)) {
+   netdev_info(upper, "Enabling LRO on lower dev %s.\n",
+  lower->name);
+   upper->wanted_features |= NETIF_F_LRO;
+   lower->wanted_features |= NETIF_F_LRO;
+   netdev_update_features(lower);
+   if (unlikely(!(lower->features & NETIF_F_LRO)))
+   netdev_WARN(upper, "failed to enable LRO on %s!\n",
+   lower->name);
+   }
+}
+
 static netdev_features_t netdev_fix_features(struct net_device *dev,
netdev_features_t features)
 {
+   struct net_device *upper, *lower;
+   struct list_head *iter;
+
/* Fix illegal checksum combinations */
if ((features & NETIF_F_HW_CSUM) &&
(features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {
@@ -6318,6 +6366,15 @@ static netdev_features_t netdev_fix_features(struct 
net_device *dev,
}
}
 
+   /* some features should be kept in sync with upper devices */
+   upper = netdev_master_upper_dev_get(dev);
+   if (upper)
+   features = netdev_sync_upper_features(dev, upper, features);
+
+   /* lower devices need so

[PATCH] crypto/testmgr: mark rfc4106(gcm(aes)) as fips_allowed

2015-01-23 Thread Jarod Wilson
This gcm variant is popular for ipsec use, and there are folks who would
like to use it while in fips mode. Mark it with fips_allowed=1 to
facilitate that.

CC: LKML 
CC: Stephan Mueller 
Signed-off-by: Jarod Wilson 
---
 crypto/testmgr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 235b1ff..758d028 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3293,6 +3293,7 @@ static const struct alg_test_desc alg_test_descs[] = {
}, {
.alg = "rfc4106(gcm(aes))",
.test = alg_test_aead,
+   .fips_allowed = 1,
.suite = {
.aead = {
.enc = {
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/6] block: export blkdev_reread_part()

2015-04-06 Thread Jarod Wilson
On Mon, Apr 06, 2015 at 12:40:12AM +0800, Ming Lei wrote:
> On Mon, Apr 6, 2015 at 12:12 AM, Christoph Hellwig  wrote:
> >> +/*
> >> + * This is exported as API for block driver, can be called
> >> + * with requiring bd_mutex or not.
> >> + */
> >> +int __blkdev_reread_part(struct block_device *bdev, bool lock)
> >>  {
> >>   struct gendisk *disk = bdev->bd_disk;
> >>   int res;
> >> @@ -159,12 +163,14 @@ static int blkdev_reread_part(struct block_device 
> >> *bdev)
> >>   return -EINVAL;
> >>   if (!capable(CAP_SYS_ADMIN))
> >>   return -EACCES;
> >> - if (!mutex_trylock(&bdev->bd_mutex))
> >> + if (lock && !mutex_trylock(&bdev->bd_mutex))
> >>   return -EBUSY;
> >
> > Please don't add funtions that do conditional locking, instead move
> > all the code into blkdev_reread_part_nolock, and then wrap it:
> >
> > int blkdev_reread_part(struct block_device *bdev)
> > {
> > if (!mutex_trylock(&bdev->bd_mutex))
> > return -EBUSY;
> > blkdev_reread_part_nolock(bdev);
> > mutex_unlock(&bdev->bd_mutex);
> > }
> 
> Yes, it is more clean, but with extra acquiring lock cost for the
> failure cases, especially when we replace trylock with mutex_lock().

I was working on a version of this myself over the past few days, I
actually removed blkdev_reread_part() entirely, renamed
fs/partition-generic.c::reread_partitions() to __reread_partitions(), then
moved the locking from blkdev_reread_part() into a new reread_partitions()
that wrapped around __reread_partitions(). Same difference, I guess.

> > Please also add a lockdep_assert_held to blkdev_reread_part_nolock to
> > ensure callers actually do hold the lock.
> 
> Good point!

Looks like fs/block_dev.c::__blkdev_get() is the only thing that would be
calling the _nolock variant of whichever route, as it handles bd_mutex
acquisition within __blkdev_get().

As an aside, there's a piece of that function that could be worth
duplicating over into loop.c as well:

if (bdev->bd_invalidated) {
if (!ret)
    rescan_partitions(bdev);
else if (ret == -ENOMEDIUM)
invalidate_partitions(disk, bdev);

Might this possibly be put to use to help with the problem commit
8761a3dc1f07b163414e2215a2cadbb4cfe2a107 was trying to solve?

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/6] block: dasd_genhd: convert to blkdev_reread_part

2015-04-06 Thread Jarod Wilson
On Sun, Apr 05, 2015 at 03:24:47PM +0800, Ming Lei wrote:
> Also remove the obsolete comment.
> 
> Signed-off-by: Ming Lei 
> ---
>  drivers/s390/block/dasd_genhd.c |9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/s390/block/dasd_genhd.c b/drivers/s390/block/dasd_genhd.c
> index 90f39f7..2af4619 100644
> --- a/drivers/s390/block/dasd_genhd.c
> +++ b/drivers/s390/block/dasd_genhd.c
> @@ -116,14 +116,11 @@ int dasd_scan_partitions(struct dasd_block *block)
> rc);
>   return -ENODEV;
>   }
> - /*
> -  * See fs/partition/check.c:register_disk,rescan_partitions
> -  * Can't call rescan_partitions directly. Use ioctl.
> -  */
> - rc = ioctl_by_bdev(bdev, BLKRRPART, 0);
> +
> + rc = blkdev_reread_part(bdev);
>   while (rc == -EBUSY && retry > 0) {
>   schedule();
> - rc = ioctl_by_bdev(bdev, BLKRRPART, 0);
> + rc = blkdev_reread_part(bdev);
>   retry--;
>   DBF_DEV_EVENT(DBF_ERR, block->base,
> "scan partitions error, retry %d rc %d",

Note: patch 6/6 in the series makes this whole while() loops pointless,
since the possibility of the -EBUSY return goes away.

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/6] block: export blkdev_reread_part()

2015-04-06 Thread Jarod Wilson
On Mon, Apr 06, 2015 at 09:42:27AM -0400, Jarod Wilson wrote:
> On Mon, Apr 06, 2015 at 12:40:12AM +0800, Ming Lei wrote:
> > On Mon, Apr 6, 2015 at 12:12 AM, Christoph Hellwig  
> > wrote:
> > >> +/*
> > >> + * This is exported as API for block driver, can be called
> > >> + * with requiring bd_mutex or not.
> > >> + */
> > >> +int __blkdev_reread_part(struct block_device *bdev, bool lock)
> > >>  {
> > >>   struct gendisk *disk = bdev->bd_disk;
> > >>   int res;
> > >> @@ -159,12 +163,14 @@ static int blkdev_reread_part(struct block_device 
> > >> *bdev)
> > >>   return -EINVAL;
> > >>   if (!capable(CAP_SYS_ADMIN))
> > >>   return -EACCES;
> > >> - if (!mutex_trylock(&bdev->bd_mutex))
> > >> + if (lock && !mutex_trylock(&bdev->bd_mutex))
> > >>   return -EBUSY;
> > >
> > > Please don't add funtions that do conditional locking, instead move
> > > all the code into blkdev_reread_part_nolock, and then wrap it:
> > >
> > > int blkdev_reread_part(struct block_device *bdev)
> > > {
> > > if (!mutex_trylock(&bdev->bd_mutex))
> > > return -EBUSY;
> > > blkdev_reread_part_nolock(bdev);
> > > mutex_unlock(&bdev->bd_mutex);
> > > }
> > 
> > Yes, it is more clean, but with extra acquiring lock cost for the
> > failure cases, especially when we replace trylock with mutex_lock().
> 
> I was working on a version of this myself over the past few days, I
> actually removed blkdev_reread_part() entirely, renamed
> fs/partition-generic.c::reread_partitions() to __reread_partitions(), then

Sorry, that was block/partition-generic.c, not fs/.

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/6] block: dasd_genhd: convert to blkdev_reread_part

2015-04-06 Thread Jarod Wilson
On Mon, Apr 06, 2015 at 09:46:55AM -0400, Jarod Wilson wrote:
> On Sun, Apr 05, 2015 at 03:24:47PM +0800, Ming Lei wrote:
> > Also remove the obsolete comment.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/s390/block/dasd_genhd.c |9 +++--
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/s390/block/dasd_genhd.c 
> > b/drivers/s390/block/dasd_genhd.c
> > index 90f39f7..2af4619 100644
> > --- a/drivers/s390/block/dasd_genhd.c
> > +++ b/drivers/s390/block/dasd_genhd.c
> > @@ -116,14 +116,11 @@ int dasd_scan_partitions(struct dasd_block *block)
> >   rc);
> > return -ENODEV;
> > }
> > -   /*
> > -* See fs/partition/check.c:register_disk,rescan_partitions
> > -* Can't call rescan_partitions directly. Use ioctl.
> > -*/
> > -   rc = ioctl_by_bdev(bdev, BLKRRPART, 0);
> > +
> > +   rc = blkdev_reread_part(bdev);
> > while (rc == -EBUSY && retry > 0) {
> > schedule();
> > -   rc = ioctl_by_bdev(bdev, BLKRRPART, 0);
> > +   rc = blkdev_reread_part(bdev);
> > retry--;
> > DBF_DEV_EVENT(DBF_ERR, block->base,
> >   "scan partitions error, retry %d rc %d",
> 
> Note: patch 6/6 in the series makes this whole while() loops pointless,
> since the possibility of the -EBUSY return goes away.

Minor clarification: the -EBUSY due to the trylock, which is why that
retry loop exists, goes away. You *could* still get an -EBUSY through
blkdev_reread_part()->rescan_partitions()->drop_partitions() if
bdev->bd_part_count is non-zero.

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] block/loop: improve reliability of partition scanning

2015-03-31 Thread Jarod Wilson
If losetup is called with the -P option, it sets a flag to have the
resulting loop block device scanned for partitions. Unfortunately, due
to the way flags are passed in from userspace, there's first a
loop_set_fd() call, which does no partition scanning, then a
loop_set_status() call, where the partition scanning should kick in.
However, particularly on a system with slow I/O (such as a file-backed
vm), there's a race between the loop_set_status() call and udev poking the
device, which leads to partition scanning failing with an -EBUSY (passed
up from block/ioctl.c's blkdev_reread_part()) because the block_device's
bd_mutex is already held by udev calling blkdev_open(), which grabs
bd_mutex, and then in turn calls lo_open(), which then in turn tries to
grab lo_ctl_mutex, which we're holding in all loop ioctls.

To combat this, if we discover bd_mutex is locked, we know partition
scanning will fail, and its probably because of udev, so we can
temporarily drop the lo_ctl_mutex ourselves to try to let udev do its
thing, then grab it back, and hopefully then successfully scan partitions.

Testing shows a definite improvement to partition scanning success when
calling losetup -fP file-image over and over (with matching losetup -D
too, of course), but still not to 100% success, I'm still getting the
occasional failure, which is typically due to an -EBUSY trying to rescan
partitions on loop device removal.

CC: Jens Axboe 
CC: Ming Lei 
CC: Mike Galbraith 
CC: Kent Overstreet 
CC: Mikulas Patocka 
Signed-off-by: Jarod Wilson 
---
 drivers/block/loop.c | 48 
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d1f168b..b30e32c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -75,6 +75,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "loop.h"
 
 #include 
@@ -529,6 +530,45 @@ static int loop_flush(struct loop_device *lo)
 }
 
 /*
+ * Re-reading partitions can fail with an -EBUSY return from block/ioctl.c's
+ * blkdev_reread_part(), which calls mutex_trylock on the bd_mutex. Now, udev
+ * is calling blkdev_open, which first grabs bd_mutex, then lo_ctl_mutex via
+ * lo_open, which occasionally happens before partition scanning, and will
+ * prevent partition scanning from ever being successful unless we give up
+ * the lo_ctl_mutex temporarily.
+ */
+static void loop_reread_partitions(struct loop_device *lo,
+  struct block_device *bdev)
+{
+   int rc;
+   int retry = 5;
+
+   pr_debug("%s: firing for loop%d (%s)\n",
+__func__, lo->lo_number, lo->lo_file_name);
+
+   /*
+* If no lo_device, we were (probably) called from loop_clr_fd(), and
+* retries never seem to help, so don't retry.
+*/
+   if (!lo->lo_device)
+   retry = 1;
+
+   while (mutex_is_locked(&bdev->bd_mutex) && retry > 0) {
+   mutex_unlock(&lo->lo_ctl_mutex);
+   msleep(50);
+   mutex_lock(&lo->lo_ctl_mutex);
+   retry--;
+   pr_debug("%s: unlocked lo_ctl temporarily (retries left: %d)\n",
+__func__, retry);
+   }
+
+   rc = ioctl_by_bdev(bdev, BLKRRPART, 0);
+   if (rc)
+   pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
+   __func__, lo->lo_number, lo->lo_file_name, rc);
+}
+
+/*
  * loop_change_fd switched the backing store of a loopback device to
  * a new file. This is useful for operating system installers to free up
  * the original file and in High Availability environments to switch to
@@ -576,7 +616,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
 
fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   ioctl_by_bdev(bdev, BLKRRPART, 0);
+   loop_reread_partitions(lo, bdev);
return 0;
 
  out_putf:
@@ -807,7 +847,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
if (part_shift)
lo->lo_flags |= LO_FLAGS_PARTSCAN;
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   ioctl_by_bdev(bdev, BLKRRPART, 0);
+   loop_reread_partitions(lo, bdev);
 
/* Grab the block_device to prevent its destruction after we
 * put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
@@ -920,7 +960,7 @@ static int loop_clr_fd(struct loop_device *lo)
/* This is safe: open() is still holding a reference. */
module_put(THIS_MODULE);
if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
-   ioctl_by_bdev(bdev, BLKRRPART, 0);
+   loop_reread_partitions(lo, bdev);
lo->lo_flags = 0;
if (!part_shift)
lo-&

Re: [PATCH] block/loop: improve reliability of partition scanning

2015-04-01 Thread Jarod Wilson
On Wed, Apr 01, 2015 at 08:41:46AM -0700, Christoph Hellwig wrote:
> Please take a look at this thread and my suggestions in it:
> 
> https://lkml.org/lkml/2015/1/26/137
> https://lkml.org/lkml/2015/1/26/564

Aw crap, I should have looked for prior attempts, it does seem to be a
current issue for a number of folks. I'll take my patch, David's patch,
Ming's patch and your comments all back to the drawing board for a bit and
see what I can come up with...

Thanks much,

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/7] block: reread partitions improvements

2015-04-07 Thread Jarod Wilson
For loop, the root cause is one ABBA and one AA lock dependency
issue, and the two are fixed by patch 2 and patch 3 each.

Another reason is from the trylock in blkdev_reread_part(), which
may cause partition scanning failure too sometimes when another task
is holding the bd_mutex. In the discussion[1], both Tejun and Christoph
suggests to replace the trylock with mutex_lock in blkdev_reread_part(),
also Christoph suggests to export blkdev_reread_part.

Following the discussion, this patchset exports blkdev_reread_part(), and
introduces lockless __blkdev_reread_part() for fixing loop's AA lock issue.
Then ioctl_by_bdev(BLKRRPART) in loop, nbd and dasd is replaced with
blkdev_reread_part(). In the last patch, trylock in blkdev_reread_part()
is replaced with mutex_lock, and some analysis is provided about the
conversion.

This is a slightly reworked set from what Ming sent a few days ago,
based on Christoph's feedback. I've tested this out quite succesfully
in the scenario that prompted my first patch as well.


[1], https://lkml.org/lkml/2015/1/26/137
[2], https://lkml.org/lkml/2015/3/31/888

Jarod Wilson (3):
  block: export blkdev_reread_part() and __blkdev_reread_part()
  block: loop: fix another reread part failure
  s390/block/dasd: remove obsolete while -EBUSY loop

Ming Lei (4):
  block: loop: don't hold lo_ctl_mutex in lo_open
  block: nbd: convert to blkdev_reread_part()
  block: dasd_genhd: convert to blkdev_reread_part
  block: replace trylock with mutex_lock in blkdev_reread_part()

 block/ioctl.c   | 34 +++
 drivers/block/loop.c| 60 ++---
 drivers/block/loop.h|  1 +
 drivers/block/nbd.c |  2 +-
 drivers/s390/block/dasd_genhd.c | 22 +--
 include/linux/fs.h  |  3 +++
 6 files changed, 90 insertions(+), 32 deletions(-)

CC: Christoph Hellwig 
CC: Jens Axboe 
CC: Tejun Heo 
CC: Alexander Viro 
CC: Markus Pargmann 
CC: Stefan Weinhuber 
CC: Stefan Haberland 
CC: Sebastian Ott 
CC: Fabian Frederick 
CC: Ming Lei 
CC: David Herrmann 
CC: Mike Galbraith 
CC: Andrew Morton 
CC: Peter Zijlstra 
CC: nbd-gene...@lists.sourceforge.net
CC: linux-s...@vger.kernel.org

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/7] block: export blkdev_reread_part() and __blkdev_reread_part()

2015-04-07 Thread Jarod Wilson
This patch exports blkdev_reread_part() for block drivers, also
introduce __blkdev_reread_part(), a lockless version.

For some drivers, such as loop, a reread of partitions can be run
from the release path, and bd_mutex may already be held prior to
calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce a lockless
path for use in such cases.

CC: Christoph Hellwig 
CC: Jens Axboe 
CC: Tejun Heo 
CC: Alexander Viro 
CC: Markus Pargmann 
CC: Stefan Weinhuber 
CC: Stefan Haberland 
CC: Sebastian Ott 
CC: Fabian Frederick 
CC: Ming Lei 
CC: David Herrmann 
CC: Mike Galbraith 
CC: Andrew Morton 
CC: Peter Zijlstra 
CC: nbd-gene...@lists.sourceforge.net
CC: linux-s...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 block/ioctl.c  | 26 +++---
 include/linux/fs.h |  3 +++
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 7d8befd..64a4fcb 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -150,21 +150,41 @@ static int blkpg_ioctl(struct block_device *bdev, struct 
blkpg_ioctl_arg __user
}
 }
 
-static int blkdev_reread_part(struct block_device *bdev)
+/*
+ * This is an exported API for the block driver, and will not
+ * acquire bd_mutex, leaving it up to the caller to handle
+ * any necessary locking.
+ */
+int __blkdev_reread_part(struct block_device *bdev)
 {
struct gendisk *disk = bdev->bd_disk;
-   int res;
 
if (!disk_part_scan_enabled(disk) || bdev != bdev->bd_contains)
return -EINVAL;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
+
+   return rescan_partitions(disk, bdev);
+}
+EXPORT_SYMBOL(__blkdev_reread_part);
+
+/*
+ * This is an exported API for the block driver, and will
+ * acquire bd_mutex. Make sure you aren't calling it with
+ * bd_mutex already held, or we'll return -EBUSY.
+ */
+int blkdev_reread_part(struct block_device *bdev)
+{
+   int res;
+
if (!mutex_trylock(&bdev->bd_mutex))
return -EBUSY;
-   res = rescan_partitions(disk, bdev);
+   res = __blkdev_reread_part(bdev);
mutex_unlock(&bdev->bd_mutex);
+
return res;
 }
+EXPORT_SYMBOL(blkdev_reread_part);
 
 static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 uint64_t len, int secure)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f4131e8..11398e3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2245,6 +2245,9 @@ extern struct block_device *blkdev_get_by_path(const char 
*path, fmode_t mode,
 extern struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode,
  void *holder);
 extern void blkdev_put(struct block_device *bdev, fmode_t mode);
+extern int __blkdev_reread_part(struct block_device *bdev);
+extern int blkdev_reread_part(struct block_device *bdev);
+
 #ifdef CONFIG_SYSFS
 extern int bd_link_disk_holder(struct block_device *bdev, struct gendisk 
*disk);
 extern void bd_unlink_disk_holder(struct block_device *bdev,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 7/7] s390/block/dasd: remove obsolete while -EBUSY loop

2015-04-07 Thread Jarod Wilson
With the mutex_trylock bit gone from blkdev_reread_part(), the retry logic
in dasd_scan_partitions() shouldn't be necessary.

CC: Christoph Hellwig 
CC: Jens Axboe 
CC: Tejun Heo 
CC: Alexander Viro 
CC: Markus Pargmann 
CC: Stefan Weinhuber 
CC: Stefan Haberland 
CC: Sebastian Ott 
CC: Fabian Frederick 
CC: Ming Lei 
CC: David Herrmann 
CC: Mike Galbraith 
CC: Andrew Morton 
CC: Peter Zijlstra 
CC: nbd-gene...@lists.sourceforge.net
CC: linux-s...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/s390/block/dasd_genhd.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/s390/block/dasd_genhd.c b/drivers/s390/block/dasd_genhd.c
index 8026585..783d826 100644
--- a/drivers/s390/block/dasd_genhd.c
+++ b/drivers/s390/block/dasd_genhd.c
@@ -99,9 +99,8 @@ void dasd_gendisk_free(struct dasd_block *block)
 int dasd_scan_partitions(struct dasd_block *block)
 {
struct block_device *bdev;
-   int retry, rc;
+   int rc;
 
-   retry = 5;
bdev = bdget_disk(block->gdp, 0);
if (!bdev) {
DBF_DEV_EVENT(DBF_ERR, block->base, "%s",
@@ -118,14 +117,8 @@ int dasd_scan_partitions(struct dasd_block *block)
}
 
rc = blkdev_reread_part(bdev);
-   while (rc == -EBUSY && retry > 0) {
-   schedule();
-   rc = blkdev_reread_part(bdev);
-   retry--;
-   DBF_DEV_EVENT(DBF_ERR, block->base,
- "scan partitions error, retry %d rc %d",
- retry, rc);
-   }
+   DBF_DEV_EVENT(DBF_ERR, block->base,
+ "scan partitions error, rc %d", rc);
 
/*
 * Since the matching blkdev_put call to the blkdev_get in
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/7] block: loop: don't hold lo_ctl_mutex in lo_open

2015-04-07 Thread Jarod Wilson
From: Ming Lei 

The lo_ctl_mutex is held for running all ioctl handlers, and
in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for
rereading partitions, which requires bd_mutex.

So it is easy to cause failure because trylock(bd_mutex) may
fail inside blkdev_reread_part(), and follows the lock context:

blkid or other application:
->open()
->mutex_lock(bd_mutex)
->lo_open()
->mutex_lock(lo_ctl_mutex)

losetup(set fd ioctl):
->mutex_lock(lo_ctl_mutex)
->ioctl_by_bdev(BLKRRPART)
->trylock(bd_mutex)

This patch trys to eliminate the ABBA lock dependency by removing
lo_ctl_mutext in lo_open() with the following approach:

1) introduce lo_open_mutex to protect lo_refcnt and avoid acquiring
lo_ctl_mutex in lo_open():
- for open vs. add/del loop, no any problem because of loop_index_mutex
- lo_open_mutex is used for syncing open() and loop_clr_fd()
- both open() and release() have been serialized by bd_mutex already

2) don't hold lo_ctl_mutex for decreasing/checking lo_refcnt in
lo_release(), then lo_ctl_mutex is only required for the last release.

CC: Christoph Hellwig 
CC: Jens Axboe 
CC: Tejun Heo 
CC: Alexander Viro 
CC: Markus Pargmann 
CC: Stefan Weinhuber 
CC: Stefan Haberland 
CC: Sebastian Ott 
CC: Fabian Frederick 
CC: Ming Lei 
CC: David Herrmann 
CC: Mike Galbraith 
CC: Andrew Morton 
CC: Peter Zijlstra 
CC: nbd-gene...@lists.sourceforge.net
CC: linux-s...@vger.kernel.org
Signed-off-by: Ming Lei 
Signed-off-by: Jarod Wilson 
---
 drivers/block/loop.c | 32 ++--
 drivers/block/loop.h |  1 +
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d1f168b..81a6bc1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -879,14 +879,18 @@ static int loop_clr_fd(struct loop_device *lo)
 * /do something like mkfs/losetup -d  causing the losetup -d
 * command to fail with EBUSY.
 */
+   mutex_lock(&lo->lo_open_mutex);
if (lo->lo_refcnt > 1) {
+   mutex_unlock(&lo->lo_open_mutex);
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
mutex_unlock(&lo->lo_ctl_mutex);
return 0;
}
 
-   if (filp == NULL)
+   if (filp == NULL) {
+   mutex_unlock(&lo->lo_open_mutex);
return -EINVAL;
+   }
 
spin_lock_irq(&lo->lo_lock);
lo->lo_state = Lo_rundown;
@@ -919,6 +923,15 @@ static int loop_clr_fd(struct loop_device *lo)
lo->lo_state = Lo_unbound;
/* This is safe: open() is still holding a reference. */
module_put(THIS_MODULE);
+
+   /*
+* Unlock open_mutex for avoiding -EBUSY of rereading part:
+* - try to acquire bd_mutex from reread part
+* - another task is opening the loop with holding bd_mutex
+*   and trys to acquire open_mutex
+*/
+   mutex_unlock(&lo->lo_open_mutex);
+
if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
ioctl_by_bdev(bdev, BLKRRPART, 0);
lo->lo_flags = 0;
@@ -1376,9 +1389,9 @@ static int lo_open(struct block_device *bdev, fmode_t 
mode)
goto out;
}
 
-   mutex_lock(&lo->lo_ctl_mutex);
+   mutex_lock(&lo->lo_open_mutex);
lo->lo_refcnt++;
-   mutex_unlock(&lo->lo_ctl_mutex);
+   mutex_unlock(&lo->lo_open_mutex);
 out:
mutex_unlock(&loop_index_mutex);
return err;
@@ -1387,13 +1400,16 @@ out:
 static void lo_release(struct gendisk *disk, fmode_t mode)
 {
struct loop_device *lo = disk->private_data;
-   int err;
+   int err, ref;
 
-   mutex_lock(&lo->lo_ctl_mutex);
+   mutex_lock(&lo->lo_open_mutex);
+   ref = --lo->lo_refcnt;
+   mutex_unlock(&lo->lo_open_mutex);
 
-   if (--lo->lo_refcnt)
+   if (ref)
goto out;
 
+   mutex_lock(&lo->lo_ctl_mutex);
if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
/*
 * In autoclear mode, stop the loop thread
@@ -1646,6 +1662,7 @@ static int loop_add(struct loop_device **l, int i)
disk->flags |= GENHD_FL_NO_PART_SCAN;
disk->flags |= GENHD_FL_EXT_DEVT;
mutex_init(&lo->lo_ctl_mutex);
+   mutex_init(&lo->lo_open_mutex);
lo->lo_number   = i;
spin_lock_init(&lo->lo_lock);
disk->major = LOOP_MAJOR;
@@ -1763,11 +1780,14 @@ static long loop_control_ioctl(struct file *file, 
unsigned int cmd,
mutex_unlock(&lo->lo_ctl_mutex);
break;
}
+   mutex_lock(&lo->lo_open_mute

[PATCH 5/7] block: dasd_genhd: convert to blkdev_reread_part

2015-04-07 Thread Jarod Wilson
From: Ming Lei 

Also remove the obsolete comment.

CC: Christoph Hellwig 
CC: Jens Axboe 
CC: Tejun Heo 
CC: Alexander Viro 
CC: Markus Pargmann 
CC: Stefan Weinhuber 
CC: Stefan Haberland 
CC: Sebastian Ott 
CC: Fabian Frederick 
CC: Ming Lei 
CC: David Herrmann 
CC: Mike Galbraith 
CC: Andrew Morton 
CC: Peter Zijlstra 
CC: nbd-gene...@lists.sourceforge.net
CC: linux-s...@vger.kernel.org
Signed-off-by: Ming Lei 
Signed-off-by: Jarod Wilson 
---
 drivers/s390/block/dasd_genhd.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/s390/block/dasd_genhd.c b/drivers/s390/block/dasd_genhd.c
index 90f39f7..8026585 100644
--- a/drivers/s390/block/dasd_genhd.c
+++ b/drivers/s390/block/dasd_genhd.c
@@ -116,14 +116,11 @@ int dasd_scan_partitions(struct dasd_block *block)
  rc);
return -ENODEV;
}
-   /*
-* See fs/partition/check.c:register_disk,rescan_partitions
-* Can't call rescan_partitions directly. Use ioctl.
-*/
-   rc = ioctl_by_bdev(bdev, BLKRRPART, 0);
+
+   rc = blkdev_reread_part(bdev);
while (rc == -EBUSY && retry > 0) {
schedule();
-   rc = ioctl_by_bdev(bdev, BLKRRPART, 0);
+   rc = blkdev_reread_part(bdev);
retry--;
DBF_DEV_EVENT(DBF_ERR, block->base,
  "scan partitions error, retry %d rc %d",
@@ -138,7 +135,7 @@ int dasd_scan_partitions(struct dasd_block *block)
 * dasd_generic_set_offline). As long as the partition
 * detection is running no offline should be allowed. That
 * is why the assignment to device->bdev is done AFTER
-* the BLKRRPART ioctl.
+* the blkdev_reread_part() call.
 */
block->bdev = bdev;
return 0;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 6/7] block: replace trylock with mutex_lock in blkdev_reread_part()

2015-04-07 Thread Jarod Wilson
From: Ming Lei 

The only possible problem of using mutex_lock() instead of trylock
is about deadlock.

If there aren't any locks held before calling blkdev_reread_part(lock),
deadlock can't be caused by this conversion.

If there are locks held before calling blkdev_reread_part(lock),
and if these locks arn't required in open, close handler and I/O
path, deadlock shouldn't be caused too.

Both user space's ioctl(BLKRRPART) and md_setup_drive() from
init/do_mounts_md.c belongs to the 1st case, so the conversion is safe
for the two cases.

For loop, the previous patches in this pathset has fixed the ABBA lock
dependency, so the conversion is OK.

For nbd, tx_lock is held when calling the function:

- both open and release won't hold the lock
- when blkdev_reread_part() is run, I/O thread has been stopped
already, so tx_lock won't be acquired in I/O path at that time.
- so the conversion won't cause deadlock for nbd

For dasd, both dasd_open(), dasd_release() and request function don't
acquire any mutex/semphone, so the conversion should be safe.

[jarod: update for modifications earlier in series.]

CC: Christoph Hellwig 
CC: Jens Axboe 
CC: Tejun Heo 
CC: Alexander Viro 
CC: Markus Pargmann 
CC: Stefan Weinhuber 
CC: Stefan Haberland 
CC: Sebastian Ott 
CC: Fabian Frederick 
CC: Ming Lei 
CC: David Herrmann 
CC: Mike Galbraith 
CC: Andrew Morton 
CC: Peter Zijlstra 
CC: nbd-gene...@lists.sourceforge.net
CC: linux-s...@vger.kernel.org
Signed-off-by: Ming Lei 
Signed-off-by: Jarod Wilson 
---
 block/ioctl.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 64a4fcb..47c8e6d 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -170,15 +170,19 @@ EXPORT_SYMBOL(__blkdev_reread_part);
 
 /*
  * This is an exported API for the block driver, and will
- * acquire bd_mutex. Make sure you aren't calling it with
- * bd_mutex already held, or we'll return -EBUSY.
+ * acquire bd_mutex.
+ *
+ * Make sure held locks aren't required in open()/close()
+ * handlers and I/O paths to avoid an ABBA deadlock:
+ * - bd_mutex is held before calling a block driver's open/close
+ *   handler
+ * - reading a partition table may submit I/O to the block device
  */
 int blkdev_reread_part(struct block_device *bdev)
 {
int res;
 
-   if (!mutex_trylock(&bdev->bd_mutex))
-   return -EBUSY;
+   mutex_lock(&bdev->bd_mutex);
res = __blkdev_reread_part(bdev);
mutex_unlock(&bdev->bd_mutex);
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/7] block: loop: fix another reread part failure

2015-04-07 Thread Jarod Wilson
loop_clr_fd() can be run piggyback with lo_release(), and
under this situation, reread partition may always fail because
of bd_mutex which has been held in release path.

This patch detects the situation by the reference count, and
call blkdev_reread_part_nolock() to avoid acquiring the lock again.

In the meantime, this patch switches to new kernel APIs
of blkdev_reread_part() and blkdev_reread_part_nolock().

[jarod: this is a merger of my original patch, Ming's patch, and some
reworking for my reworked first patch in the series.]

CC: Christoph Hellwig 
CC: Jens Axboe 
CC: Tejun Heo 
CC: Alexander Viro 
CC: Markus Pargmann 
CC: Stefan Weinhuber 
CC: Stefan Haberland 
CC: Sebastian Ott 
CC: Fabian Frederick 
CC: Ming Lei 
CC: David Herrmann 
CC: Mike Galbraith 
CC: Andrew Morton 
CC: Peter Zijlstra 
CC: nbd-gene...@lists.sourceforge.net
CC: linux-s...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/block/loop.c | 28 
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 81a6bc1..ab5c678 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -528,6 +528,26 @@ static int loop_flush(struct loop_device *lo)
return loop_switch(lo, NULL);
 }
 
+static void loop_reread_partitions(struct loop_device *lo,
+  struct block_device *bdev)
+{
+   int rc;
+   bool in_release;
+
+   mutex_lock(&lo->lo_open_mutex);
+   in_release = lo->lo_refcnt == 0;
+   mutex_unlock(&lo->lo_open_mutex);
+
+   /* bd_mutex has been held already in release path */
+   if (in_release)
+   rc = __blkdev_reread_part(bdev);
+   else
+   rc = blkdev_reread_part(bdev);
+   if (rc)
+   pr_warn("%s: partition scan of loop%d (%s) failed (rc=%d)\n",
+   __func__, lo->lo_number, lo->lo_file_name, rc);
+}
+
 /*
  * loop_change_fd switched the backing store of a loopback device to
  * a new file. This is useful for operating system installers to free up
@@ -576,7 +596,7 @@ static int loop_change_fd(struct loop_device *lo, struct 
block_device *bdev,
 
fput(old_file);
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   ioctl_by_bdev(bdev, BLKRRPART, 0);
+   loop_reread_partitions(lo, bdev);
return 0;
 
  out_putf:
@@ -807,7 +827,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
if (part_shift)
lo->lo_flags |= LO_FLAGS_PARTSCAN;
if (lo->lo_flags & LO_FLAGS_PARTSCAN)
-   ioctl_by_bdev(bdev, BLKRRPART, 0);
+   loop_reread_partitions(lo, bdev);
 
/* Grab the block_device to prevent its destruction after we
 * put /dev/loopXX inode. Later in loop_clr_fd() we bdput(bdev).
@@ -933,7 +953,7 @@ static int loop_clr_fd(struct loop_device *lo)
mutex_unlock(&lo->lo_open_mutex);
 
if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
-   ioctl_by_bdev(bdev, BLKRRPART, 0);
+   loop_reread_partitions(lo, bdev);
lo->lo_flags = 0;
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
@@ -1008,7 +1028,7 @@ loop_set_status(struct loop_device *lo, const struct 
loop_info64 *info)
 !(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
lo->lo_flags |= LO_FLAGS_PARTSCAN;
lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;
-   ioctl_by_bdev(lo->lo_device, BLKRRPART, 0);
+   loop_reread_partitions(lo, lo->lo_device);
}
 
lo->lo_encrypt_key_size = info->lo_encrypt_key_size;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/7] block: nbd: convert to blkdev_reread_part()

2015-04-07 Thread Jarod Wilson
From: Ming Lei 

CC: Christoph Hellwig 
CC: Jens Axboe 
CC: Tejun Heo 
CC: Alexander Viro 
CC: Markus Pargmann 
CC: Stefan Weinhuber 
CC: Stefan Haberland 
CC: Sebastian Ott 
CC: Fabian Frederick 
CC: Ming Lei 
CC: David Herrmann 
CC: Mike Galbraith 
CC: Andrew Morton 
CC: Peter Zijlstra 
CC: nbd-gene...@lists.sourceforge.net
CC: linux-s...@vger.kernel.org
Signed-off-by: Ming Lei 
Signed-off-by: Jarod Wilson 
---
 drivers/block/nbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index a98c41f..7e9d26e 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -734,7 +734,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct 
nbd_device *nbd,
bdev->bd_inode->i_size = 0;
set_capacity(nbd->disk, 0);
if (max_part > 0)
-   ioctl_by_bdev(bdev, BLKRRPART, 0);
+   blkdev_reread_part(bdev);
if (nbd->disconnect) /* user requested, ignore socket errors */
return 0;
return nbd->harderror;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7] block: loop: don't hold lo_ctl_mutex in lo_open

2015-04-08 Thread Jarod Wilson
On Wed, Apr 08, 2015 at 02:50:59PM +0800, Ming Lei wrote:
> Hi Jarod,
> 
> On Wed, Apr 8, 2015 at 2:23 PM, Jarod Wilson  wrote:
> > From: Ming Lei 
> >
> > The lo_ctl_mutex is held for running all ioctl handlers, and
> > in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for
> > rereading partitions, which requires bd_mutex.
> >
> > So it is easy to cause failure because trylock(bd_mutex) may
> > fail inside blkdev_reread_part(), and follows the lock context:
> >
> > blkid or other application:
> > ->open()
> > ->mutex_lock(bd_mutex)
> > ->lo_open()
> > ->mutex_lock(lo_ctl_mutex)
> >
> > losetup(set fd ioctl):
> > ->mutex_lock(lo_ctl_mutex)
> > ->ioctl_by_bdev(BLKRRPART)
> > ->trylock(bd_mutex)
> >
> > This patch trys to eliminate the ABBA lock dependency by removing
> > lo_ctl_mutext in lo_open() with the following approach:
> >
> > 1) introduce lo_open_mutex to protect lo_refcnt and avoid acquiring
> > lo_ctl_mutex in lo_open():
> 
> It is a bit quick since I said the lo_open_mutex can be removed,
> and Christoph agreed that too.
> 
> So looks we still need to post another version, :-)

Ah. I missed that bit. Just trying to keep up momentum.

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7] block: loop: don't hold lo_ctl_mutex in lo_open

2015-04-08 Thread Jarod Wilson
On Wed, Apr 08, 2015 at 09:40:34AM -0400, Jarod Wilson wrote:
> On Wed, Apr 08, 2015 at 02:50:59PM +0800, Ming Lei wrote:
> > Hi Jarod,
> > 
> > On Wed, Apr 8, 2015 at 2:23 PM, Jarod Wilson  wrote:
> > > From: Ming Lei 
> > >
> > > The lo_ctl_mutex is held for running all ioctl handlers, and
> > > in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for
> > > rereading partitions, which requires bd_mutex.
> > >
> > > So it is easy to cause failure because trylock(bd_mutex) may
> > > fail inside blkdev_reread_part(), and follows the lock context:
> > >
> > > blkid or other application:
> > > ->open()
> > > ->mutex_lock(bd_mutex)
> > > ->lo_open()
> > > ->mutex_lock(lo_ctl_mutex)
> > >
> > > losetup(set fd ioctl):
> > > ->mutex_lock(lo_ctl_mutex)
> > > ->ioctl_by_bdev(BLKRRPART)
> > > ->trylock(bd_mutex)
> > >
> > > This patch trys to eliminate the ABBA lock dependency by removing
> > > lo_ctl_mutext in lo_open() with the following approach:
> > >
> > > 1) introduce lo_open_mutex to protect lo_refcnt and avoid acquiring
> > > lo_ctl_mutex in lo_open():
> > 
> > It is a bit quick since I said the lo_open_mutex can be removed,
> > and Christoph agreed that too.
> > 
> > So looks we still need to post another version, :-)
> 
> Ah. I missed that bit. Just trying to keep up momentum.

Are you working on that atomic_t version then, or should I dive into it?

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/7] block: export blkdev_reread_part() and __blkdev_reread_part()

2015-04-08 Thread Jarod Wilson
On Wed, Apr 08, 2015 at 05:03:25PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 08, 2015 at 10:50:56PM +0800, Ming Lei wrote:
> > > +/*
> > > + * This is an exported API for the block driver, and will not
> > > + * acquire bd_mutex, leaving it up to the caller to handle
> > > + * any necessary locking.
> > 
> > Actually, the function is introduced and should be used in case
> > that bd_mutex has been held already, such as clearing fd in
> > loop release().
> > 
> > > + */
> > > +int __blkdev_reread_part(struct block_device *bdev)
> > >  {
> > > struct gendisk *disk = bdev->bd_disk;
> > >
> 
>   lockdep_assert_held(&bdev->bd_mutex);
> 
> is an excellent means of avoiding that comment and verifying its
> actually true :-)

Ah, yes, that was actually suggested by Christoph as well, I was too hasty
shoving something back out the door on multiple counts.

> > > + */
> > > +int blkdev_reread_part(struct block_device *bdev)
> > > +{
> > > +   int res;
> > > +
> > > if (!mutex_trylock(&bdev->bd_mutex))
> > > return -EBUSY;
> 
> Is that really needed? It seems rather poor form.

It goes away later in the series and gets converted to a straight
mutex_lock().

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7] block: loop: don't hold lo_ctl_mutex in lo_open

2015-04-08 Thread Jarod Wilson
On Wed, Apr 08, 2015 at 10:20:22PM +0800, Ming Lei wrote:
> On Wed, Apr 8, 2015 at 10:00 PM, Jarod Wilson  wrote:
> > On Wed, Apr 08, 2015 at 09:40:34AM -0400, Jarod Wilson wrote:
> >> On Wed, Apr 08, 2015 at 02:50:59PM +0800, Ming Lei wrote:
> >> > Hi Jarod,
> >> >
> >> > On Wed, Apr 8, 2015 at 2:23 PM, Jarod Wilson  wrote:
> >> > > From: Ming Lei 
> >> > >
> >> > > The lo_ctl_mutex is held for running all ioctl handlers, and
> >> > > in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for
> >> > > rereading partitions, which requires bd_mutex.
> >> > >
> >> > > So it is easy to cause failure because trylock(bd_mutex) may
> >> > > fail inside blkdev_reread_part(), and follows the lock context:
> >> > >
> >> > > blkid or other application:
> >> > > ->open()
> >> > > ->mutex_lock(bd_mutex)
> >> > > ->lo_open()
> >> > > ->mutex_lock(lo_ctl_mutex)
> >> > >
> >> > > losetup(set fd ioctl):
> >> > > ->mutex_lock(lo_ctl_mutex)
> >> > > ->ioctl_by_bdev(BLKRRPART)
> >> > > ->trylock(bd_mutex)
> >> > >
> >> > > This patch trys to eliminate the ABBA lock dependency by removing
> >> > > lo_ctl_mutext in lo_open() with the following approach:
> >> > >
> >> > > 1) introduce lo_open_mutex to protect lo_refcnt and avoid acquiring
> >> > > lo_ctl_mutex in lo_open():
> >> >
> >> > It is a bit quick since I said the lo_open_mutex can be removed,
> >> > and Christoph agreed that too.
> >> >
> >> > So looks we still need to post another version, :-)
> >>
> >> Ah. I missed that bit. Just trying to keep up momentum.
> >
> > Are you working on that atomic_t version then, or should I dive into it?
> 
> It has been ready, and I will post them out once it passes my tests.

Okay, I'll hold tight then. Sorry for the noise. :p

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1 0/7] block: reread partitions changes and fix for loop

2015-04-08 Thread Jarod Wilson
On Wed, Apr 08, 2015 at 11:52:57PM +0800, Ming Lei wrote:
> Hi Guys,
> 
> Recently there are several reports about loop partition scanning
> failure[1][2].
> 
> For loop, the root cause is one ABBA and one AA lock dependency
> issue, and the two are fixed by patch 2 and patch 3 each.
> 
> Another reason is from the trylock in blkdev_reread_part(), which
> may cause partition scanning failure too sometimes when another task
> is holding the bd_mutex. In the discussion[1], both Tejun and Christoph
> suggests to replace the trylock with mutex_lock in blkdev_reread_part(),
> also Christoph suggests to export blkdev_reread_part.
> 
> Following the discussion, this patchset exports blkdev_reread_part(), and
> introduces __blkdev_reread_part() for fixing loop's AA lock issue.
> Then ioctl_by_bdev(BLKRRPART) in loop, nbd and dasd is replaced with
> blkdev_reread_part(). In the last patch, trylock in blkdev_reread_part()
> is replaced with mutex_lock, and some analysis is provided about the 
> conversion.
> 
> V1:
>   - introduce __blkdev_reread_part(), and use lockdep_assert_held()(1/7)
>   - replace lo_open_mutex with atomic reference count, plus freezing 
> queue(2/7)
>   - add comment about detecting release path(3/7)
>   - remove dead code in dasd(7/7)

I tinkered with the atomic replacement and added the lockdep assert on my
end as well, and have been testing. My end results were nearly identical
to this, and testing for my particular use case still looks good.

For the set:

Tested-by: Jarod Wilson 
Acked-by: Jarod Wilson 

Thanks much!

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v1 7/7] s390/block/dasd: remove obsolete while -EBUSY loop

2015-04-08 Thread Jarod Wilson
On Wed, Apr 08, 2015 at 07:32:24PM +0200, Sebastian Ott wrote:
> On Wed, 8 Apr 2015, Ming Lei wrote:
> > From: Jarod Wilson 
> > 
> > With the mutex_trylock bit gone from blkdev_reread_part(), the retry logic
> > in dasd_scan_partitions() shouldn't be necessary.
> > 
> > CC: Christoph Hellwig 
> > CC: Jens Axboe 
> > CC: Tejun Heo 
> > CC: Alexander Viro 
> > CC: Markus Pargmann 
> > CC: Stefan Weinhuber 
> > CC: Stefan Haberland 
> > CC: Sebastian Ott 
> > CC: Fabian Frederick 
> > CC: Ming Lei 
> > CC: David Herrmann 
> > CC: Andrew Morton 
> > CC: Peter Zijlstra 
> > CC: nbd-gene...@lists.sourceforge.net
> > CC: linux-s...@vger.kernel.org
> > Signed-off-by: Jarod Wilson 
> > ---
> >  drivers/s390/block/dasd_genhd.c |   13 +++--
> >  1 file changed, 3 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/s390/block/dasd_genhd.c 
> > b/drivers/s390/block/dasd_genhd.c
> > index 2af4619..189ea2f 100644
> > --- a/drivers/s390/block/dasd_genhd.c
> > +++ b/drivers/s390/block/dasd_genhd.c
> > @@ -99,9 +99,8 @@ void dasd_gendisk_free(struct dasd_block *block)
> >  int dasd_scan_partitions(struct dasd_block *block)
> >  {
> > struct block_device *bdev;
> > -   int retry, rc;
> > +   int rc;
> > 
> > -   retry = 5;
> > bdev = bdget_disk(block->gdp, 0);
> > if (!bdev) {
> > DBF_DEV_EVENT(DBF_ERR, block->base, "%s",
> > @@ -118,14 +117,8 @@ int dasd_scan_partitions(struct dasd_block *block)
> > }
> > 
> > rc = blkdev_reread_part(bdev);
> > -   while (rc == -EBUSY && retry > 0) {
> > -   schedule();
> > -   rc = blkdev_reread_part(bdev);
> > -   retry--;
> > -   DBF_DEV_EVENT(DBF_ERR, block->base,
> > - "scan partitions error, retry %d rc %d",
> > - retry, rc);
> > -   }
> > +   DBF_DEV_EVENT(DBF_ERR, block->base,
> > + "scan partitions error, rc %d", rc);
> 
> Could you please change that to only write the debug message in the error
> case. Other than that, both dasd patches look good.

D'oh, absolutely. Ming, do you want me to send an updated patch 7 along in
reply here, or do you want to handle it and/or repost the entire set?

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs/proc: allow larger /proc//cmdline output

2015-04-13 Thread Jarod Wilson
On Fri, Apr 10, 2015 at 01:45:20PM -0700, Andrew Morton wrote:
> On Fri, 10 Apr 2015 17:11:40 +0300 Alexey Dobriyan  
> wrote:
> 
> > > Just trying to be conservative and keep people from doing anything too
> > > insane, but I didn't really have any particularly good reason beyond that
> > > for capping it. I'll remove the upper bound next iteration.
> > 
> > There is no need for next iteration. Andrew has been ignoring real fix for 
> > more
> > than a month now!
> 
> I haven't been ignoring it.  I'm desperately hoping that the patch fairy
> will come up with something which is less large and complex :(

Shudder-inducing though it may be, I do have to say I'm quite happy with
the resulting behavior with Alexey's patch. It's both more
memory-efficient and more user-friendly, despite itself.

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH try #3] proc: fix PAGE_SIZE limit of /proc/$PID/cmdline

2015-04-13 Thread Jarod Wilson
On Sat, Apr 11, 2015 at 01:09:06AM +0300, Alexey Dobriyan wrote:
> On Fri, Apr 10, 2015 at 02:01:32PM -0400, Jarod Wilson wrote:
> > On Fri, Apr 10, 2015 at 05:13:29PM +0300, Alexey Dobriyan wrote:
> > > /proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
> > > 
> > >   $ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
> > > 
> > > However, command line size was never limited to PAGE_SIZE but to 128 KB 
> > > and
> > > relatively recently limitation was removed altogether.
> > > 
> > > People noticed and are asking questions:
> > > http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
> > > 
> > > seq file interface is not OK, because it kmalloc's for whole output and
> > > open + read(, 1) + sleep will pin arbitrary amounts of kernel memory.
> > > To not do that, limit must be imposed which is incompatible with
> > > arbitrary sized command lines.
> > > 
> > > I apologize for hairy code, but this it direct consequence of command line
> > > layout in memory and hacks to support things like "init [3]".
> > > 
> > > The loops are "unrolled" otherwise it is either macros which hide
> > > control flow or functions with 7-8 arguments with equal line count.
> > 
> > That definitely qualifies as hairy. How big of a problem is it really in
> > practice if we continued using seq_file though? This only happens when
> > someone actually accesses /proc/$PID/cmdline, no? And if they're doing
> > that, they probably want that info, so is it so terrible if memory is held
> > on to for a bit? We're only talking about a few kB. That said, properly
> > walking the entire cmdline without having to specify an arbitrary limit
> > ahead of time does sound slightly more end-user-friendly. I'll give this
> > patch a spin here.
> 
> Well, it's 8 MB at least because of kmalloc and more when it starts
> to vmalloc, so either you increase but keep the limit, or allow
> to pin semi-arbitrary amount of kernel memory IF you want to stay
> with seqfile. My patch requires just 1 page plus whatever g_u_p
> requires.

Okay, I've tested this out some. Its definitely more user-friendly than
having to require a boot param, and as a bonus, its even more
memory-efficient. Yes, its a bit fugly, but such is life sometimes...

Though I do wonder if this should perhaps be a helper in mm/util.c like
get_cmdline, maybe replacing get_cmdline or adding an alternative that
gives you everything, rather than an arbitrarily limited length. I only
see one other place actually using get_cmdline so far.

Tested-by: Jarod Wilson 
Acked-by: Jarod Wilson 

-- 
Jarod Wilson
ja...@redhat.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH try #4] proc: fix PAGE_SIZE limit of /proc/$PID/cmdline

2015-05-26 Thread Jarod Wilson
On May 26, 2015, at 5:24 PM, Alexey Dobriyan  wrote:
> 
>> On Tue, May 26, 2015 at 04:42:36PM -0400, Jarod Wilson wrote:
>>> On 5/8/2015 8:28 AM, Alexey Dobriyan wrote:
>>> /proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
>>> 
>>>$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
>>> 
>>> However, command line size was never limited to PAGE_SIZE but to 128 KB and
>>> relatively recently limitation was removed altogether.
>>> 
>>> People noticed and ask questions:
>>> http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
>>> 
>>> seq file interface is not OK, because it kmalloc's for whole output and
>>> open + read(, 1) + sleep will pin arbitrary amounts of kernel memory.
>>> To not do that, limit must be imposed which is incompatible with
>>> arbitrary sized command lines.
>>> 
>>> I apologize for hairy code, but this it direct consequence of command line
>>> layout in memory and hacks to support things like "init [3]".
>>> 
>>> The loops are "unrolled" otherwise it is either macros which hide
>>> control flow or functions with 7-8 arguments with equal line count.
>>> 
>>> There should be real setproctitle(2) or something.
>>> 
>>> Signed-off-by: Alexey Dobriyan 
>>> Tested-by: Jarod Wilson 
>>> Acked-by: Jarod Wilson 
>> 
>> Should have tested on more than just x86, it appears. We've started 
>> hammering on this internally across all arches, and its exploded 
>> multiple times on ppc64 now:
>> 
>> [ 2717.074699] [ cut here ]
>> [ 2717.074787] kernel BUG at fs/proc/base.c:244!
> 
>> OE--   3.10.0-255.el7.ppc64.debug #1
> 
> Which BUG_ON is this?
> 
>BUG_ON(*pos < 0);
>BUG_ON(arg_start > arg_end);
>BUG_ON(env_start > env_end);

Ah, sorry, right, might not be exactly the same with the back-up ported 
version... It was the env_start > env_end one.

-- 
Jarod Wilson--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Update][PATCH] PCIe / hotplug: Drop pointless ACPI-based "slot detection" check

2015-06-11 Thread Jarod Wilson

On 5/21/2015 9:21 PM, Rafael J. Wysocki wrote:

On Thursday, May 21, 2015 11:11:46 AM Bjorn Helgaas wrote:

On Tue, May 19, 2015 at 03:27:58PM +0200, Rafael J. Wysocki wrote:

From: Rafael J. Wysocki 

Jarod Wilson reports that the expresscard hotplug setup doesn't work
on HP ZBook G2.  The problem turns out to be the ACPI-based "slot
detection" code called from pciehp_probe() which tries to use some
questionable heuristics based on what ACPI objects are present for
the PCIe port device at hand to figure out whether or not to register
a hotplug slot for that port.

That code is used if there is at least one PCIe port having an ACPI
device configuration object related to hotplug (such as _EJ0 or _RMV)
and the Thunderbolt port on the affected machine has _RMV.  Of course,
Thunderbolt and PCIe native hotplug need not be mutually exclusive
(as they aren't on the machine in question), so that rule is simply
incorrect.

Moreover, the ACPI-based "slot detection" check does not add any
value if pciehp_probe() is called at all and the service type of the
device object it has been called for is PCIE_PORT_SERVICE_HP, because
PCIe hotplug services are only registered if the _OSC handshake in
acpi_pci_root_add() allows the kernel to control the PCIe native
hotplug feature.  No more checks need to be carried out to decide
whether or not to register a native PCIe hotlug slot in that case.

For the above reasons, make pciehp_probe() check if it has been
called for the right service type and drop the pointless ACPI-based
"slot detection" check from it.  Also remove the entire code whose
only user is that check (the entire pciehp_acpi.c file goes away
as a result) and drop function headers related to it from the
internal PCIeHP header file.

Link: http://marc.info/?t=14316321932&r=1&w=2
Link: https://bugzilla.kernel.org/show_bug.cgi?id=98581
Reported-by: Jarod Wilson 
Signed-off-by: Rafael J. Wysocki 


This is awesome!  Applied to pci/hotplug for v4.2, with Jarod's
reviewed/tested-by.


Thanks!


Looks like I didn't test enough. I can't explain WHY, but with this 
applied, now thunderbolt hot unplug of a network adapter goes haywire, 
where prior to the patch, it worked just fine. Still looking into it...


--
Jarod Wilson
ja...@redhat.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Update][PATCH] PCIe / hotplug: Drop pointless ACPI-based "slot detection" check

2015-06-11 Thread Jarod Wilson

On 6/11/2015 1:05 PM, Jarod Wilson wrote:

On 5/21/2015 9:21 PM, Rafael J. Wysocki wrote:

On Thursday, May 21, 2015 11:11:46 AM Bjorn Helgaas wrote:

On Tue, May 19, 2015 at 03:27:58PM +0200, Rafael J. Wysocki wrote:

From: Rafael J. Wysocki 

Jarod Wilson reports that the expresscard hotplug setup doesn't work
on HP ZBook G2.  The problem turns out to be the ACPI-based "slot
detection" code called from pciehp_probe() which tries to use some
questionable heuristics based on what ACPI objects are present for
the PCIe port device at hand to figure out whether or not to register
a hotplug slot for that port.

That code is used if there is at least one PCIe port having an ACPI
device configuration object related to hotplug (such as _EJ0 or _RMV)
and the Thunderbolt port on the affected machine has _RMV.  Of course,
Thunderbolt and PCIe native hotplug need not be mutually exclusive
(as they aren't on the machine in question), so that rule is simply
incorrect.

Moreover, the ACPI-based "slot detection" check does not add any
value if pciehp_probe() is called at all and the service type of the
device object it has been called for is PCIE_PORT_SERVICE_HP, because
PCIe hotplug services are only registered if the _OSC handshake in
acpi_pci_root_add() allows the kernel to control the PCIe native
hotplug feature.  No more checks need to be carried out to decide
whether or not to register a native PCIe hotlug slot in that case.

For the above reasons, make pciehp_probe() check if it has been
called for the right service type and drop the pointless ACPI-based
"slot detection" check from it.  Also remove the entire code whose
only user is that check (the entire pciehp_acpi.c file goes away
as a result) and drop function headers related to it from the
internal PCIeHP header file.

Link: http://marc.info/?t=14316321932&r=1&w=2
Link: https://bugzilla.kernel.org/show_bug.cgi?id=98581
Reported-by: Jarod Wilson 
Signed-off-by: Rafael J. Wysocki 


This is awesome!  Applied to pci/hotplug for v4.2, with Jarod's
reviewed/tested-by.


Thanks!


Looks like I didn't test enough. I can't explain WHY, but with this
applied, now thunderbolt hot unplug of a network adapter goes haywire,
where prior to the patch, it worked just fine. Still looking into it...


Filed bug, dmesg spew can be found in the bug.

https://bugzilla.kernel.org/show_bug.cgi?id=99841

--
Jarod Wilson
ja...@redhat.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Update][PATCH] PCIe / hotplug: Drop pointless ACPI-based "slot detection" check

2015-06-11 Thread Jarod Wilson

On 6/11/2015 5:16 PM, Rafael J. Wysocki wrote:

On Thursday, June 11, 2015 04:38:12 PM Jarod Wilson wrote:

On 6/11/2015 1:05 PM, Jarod Wilson wrote:

On 5/21/2015 9:21 PM, Rafael J. Wysocki wrote:

On Thursday, May 21, 2015 11:11:46 AM Bjorn Helgaas wrote:

On Tue, May 19, 2015 at 03:27:58PM +0200, Rafael J. Wysocki wrote:

From: Rafael J. Wysocki 

Jarod Wilson reports that the expresscard hotplug setup doesn't work
on HP ZBook G2.  The problem turns out to be the ACPI-based "slot
detection" code called from pciehp_probe() which tries to use some
questionable heuristics based on what ACPI objects are present for
the PCIe port device at hand to figure out whether or not to register
a hotplug slot for that port.

That code is used if there is at least one PCIe port having an ACPI
device configuration object related to hotplug (such as _EJ0 or _RMV)
and the Thunderbolt port on the affected machine has _RMV.  Of course,
Thunderbolt and PCIe native hotplug need not be mutually exclusive
(as they aren't on the machine in question), so that rule is simply
incorrect.

Moreover, the ACPI-based "slot detection" check does not add any
value if pciehp_probe() is called at all and the service type of the
device object it has been called for is PCIE_PORT_SERVICE_HP, because
PCIe hotplug services are only registered if the _OSC handshake in
acpi_pci_root_add() allows the kernel to control the PCIe native
hotplug feature.  No more checks need to be carried out to decide
whether or not to register a native PCIe hotlug slot in that case.

For the above reasons, make pciehp_probe() check if it has been
called for the right service type and drop the pointless ACPI-based
"slot detection" check from it.  Also remove the entire code whose
only user is that check (the entire pciehp_acpi.c file goes away
as a result) and drop function headers related to it from the
internal PCIeHP header file.

Link: http://marc.info/?t=14316321932&r=1&w=2
Link: https://bugzilla.kernel.org/show_bug.cgi?id=98581
Reported-by: Jarod Wilson 
Signed-off-by: Rafael J. Wysocki 


This is awesome!  Applied to pci/hotplug for v4.2, with Jarod's
reviewed/tested-by.


Thanks!


Looks like I didn't test enough. I can't explain WHY, but with this
applied, now thunderbolt hot unplug of a network adapter goes haywire,
where prior to the patch, it worked just fine. Still looking into it...


Filed bug, dmesg spew can be found in the bug.

https://bugzilla.kernel.org/show_bug.cgi?id=99841


If it worked for you previously, can you possibly try to re-create that
configuration and set of patches applied and retest then?


I tried the current Fedora 4.1-rc7 build first, everything was fine, 
then patched in JUST the one patch, and it went belly up. I'll add some 
extra debugging spew to a build both with and without the one patch, to 
see what differences there are in devices pciehp is claiming and follow 
up in the bug on your other info requests.


--
Jarod Wilson
ja...@redhat.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] ethernet/sfc: mark state UNINIT after unregister

2015-06-12 Thread Jarod Wilson
Without this change, modprobe -r sfc hits the BUG_ON() in
efx_pci_remove_main(). Best as I can tell, this was just an oversight,
efx->state gets set to STATE_UNINIT in the error path of
efx_register_netdev() just after unregister_netdevice(), and the same
should happen in efx_unregister_netdev() after its unregister_netdevice()
call. Now I can load and unload no problem.

CC: Solarflare linux maintainers 
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/ethernet/sfc/efx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 0c42ed9..f3eaade 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -2448,6 +2448,7 @@ static void efx_unregister_netdev(struct efx_nic *efx)
 #endif
device_remove_file(&efx->pci_dev->dev, &dev_attr_phy_type);
unregister_netdev(efx->net_dev);
+   efx->state = STATE_UNINIT;
}
 }
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ethernet/sfc: mark state UNINIT after unregister

2015-06-15 Thread Jarod Wilson

On 6/15/2015 9:49 AM, Edward Cree wrote:

On 12/06/15 19:51, Jarod Wilson wrote:

Without this change, modprobe -r sfc hits the BUG_ON() in
efx_pci_remove_main(). Best as I can tell, this was just an oversight,
efx->state gets set to STATE_UNINIT in the error path of
efx_register_netdev() just after unregister_netdevice(), and the same
should happen in efx_unregister_netdev() after its unregister_netdevice()
call. Now I can load and unload no problem.

CC: Solarflare linux maintainers 
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
  drivers/net/ethernet/sfc/efx.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c
index 0c42ed9..f3eaade 100644
--- a/drivers/net/ethernet/sfc/efx.c
+++ b/drivers/net/ethernet/sfc/efx.c
@@ -2448,6 +2448,7 @@ static void efx_unregister_netdev(struct efx_nic *efx)
  #endif
   device_remove_file(&efx->pci_dev->dev, &dev_attr_phy_type);
   unregister_netdev(efx->net_dev);
+ efx->state = STATE_UNINIT;
   }
  }


This isn't quite the right place, efx->state changes are supposed to be 
serialised by the RTNL lock.
Our out-of-tree driver has this in efx_pci_remove, just after the 
efx_disable_interrupts(efx) call and before rtnl_unlock() (see patch below).  
I'd suggest that's the change we should make, but I haven't tested it yet.

For reference, the "oversight" was in my e7fef9b45ae188066bb6eb3dde8310d33c2f7d5e 
"sfc: add sysfs entry to control MCDI tracing" which accidentally took our out-of-tree 
version of efx_unregister_netdev().  Before that the code was as in Jarod's patch.


Ah, I did notice the rtnl_lock calls before some other occasions where 
state was set, but was thinking the lock might not be necessary if we're 
that far along the teardown path with the netdev already unregistered. 
You know the code far better than I do though, and your version works 
just fine here too.


Feel free to add this (along with your signed-off-by, of course):

Reviewed-by: Jarod Wilson 

--
Jarod Wilson
ja...@redhat.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] net/bonding: fix propagation of user-specified bond MAC

2015-06-05 Thread Jarod Wilson
Its possible for users to specify their own MAC address for a bonded link,
and this used to work, until sometime in 2013...

First, commit 409cc1f8a changed a condition to set the bond's mac to a
slave device's, dropping the is_zero_ether_addr() check in favor of using
bond->dev_addr_from_first.

Next, commit 6c8c4e4c2 added a bond->slave_cnt == 0 condition.

Then, commit 97a1e6396 removed dev_addr_from_first and keyed off of
bond->dev->addr_assign_type.

The other contitional in the check to call bond_set_dev_addr() has gone
 through some permutations, finally landing at the following check:

if (!bond_has_slaves(bond) &&
bond->dev->addr_assign_type == NET_ADDR_RANDOM)
bond_set_dev_addr(bond->dev, slave_dev);

When the bond is initially brought up, with no active slaves, it gets
assigned a random address, and addr_assign_type is set to NET_ADDR_RANDOM.
Next up though, the user can provide their own MAC, which ultimately calls
bond_set_mac_address(). However, because addr_assign_type isn't touched,
the above conditions are still met, and the slave's MAC overwrites the
user-provided MAC.

The simple fix is to set addr_assign_type = NET_ADDR_SET at the tail end
of bond_set_mac_address() doing its thing, and user-specified MAC
addresses no longer get overwritten.

Nb: this is slightly tricky to test on current Fedora, as nmcli seems to
be braindead when it comes to setting a MAC address for a bond. I can do a
"nmcli con edit bond0", set ethernet.mac-address "xx:yy:zz:aa:bb:cc", but
it doesn't ever seem to do anything, and it doesn't persist to the next
boot. Manual tinkering was required to verify the issue and the fix using
ip link set commands.

CC: Jay Vosburgh 
CC: Veaceslav Falico 
CC: Andy Gospodarek 
CC: net...@vger.kernel.org (open list:BONDING DRIVER)
Signed-off-by: Jarod Wilson 
---
 drivers/net/bonding/bond_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 19eb990..2aa5b5f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3544,6 +3544,7 @@ static int bond_set_mac_address(struct net_device 
*bond_dev, void *addr)
 
/* success */
memcpy(bond_dev->dev_addr, sa->sa_data, bond_dev->addr_len);
+   bond_dev->addr_assign_type = NET_ADDR_SET;
return 0;
 
 unwind:
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH net] bonding: show saner speed for broadcast mode

2020-08-12 Thread Jarod Wilson
Broadcast mode bonds transmit a copy of all traffic simultaneously out of
all interfaces, so the "speed" of the bond isn't really the aggregate of
all interfaces, but rather, the speed of the lowest active interface.

Also, the type of the speed field is u32, not unsigned long, so adjust
that accordingly, as required to make min() function here without
complaining about mismatching types.

Fixes: bb5b052f751b ("bond: add support to read speed and duplex via ethtool")
CC: Jay Vosburgh 
CC: Veaceslav Falico 
CC: Andy Gospodarek 
CC: "David S. Miller" 
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/bonding/bond_main.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5ad43aaf76e5..c853ca67058c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4552,13 +4552,23 @@ static netdev_tx_t bond_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
return ret;
 }
 
+static u32 bond_mode_bcast_speed(struct slave *slave, u32 speed)
+{
+   if (speed == 0 || speed == SPEED_UNKNOWN)
+   speed = slave->speed;
+   else
+   speed = min(speed, slave->speed);
+
+   return speed;
+}
+
 static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev,
   struct ethtool_link_ksettings *cmd)
 {
struct bonding *bond = netdev_priv(bond_dev);
-   unsigned long speed = 0;
struct list_head *iter;
struct slave *slave;
+   u32 speed = 0;
 
cmd->base.duplex = DUPLEX_UNKNOWN;
cmd->base.port = PORT_OTHER;
@@ -4570,8 +4580,13 @@ static int bond_ethtool_get_link_ksettings(struct 
net_device *bond_dev,
 */
bond_for_each_slave(bond, slave, iter) {
if (bond_slave_can_tx(slave)) {
-   if (slave->speed != SPEED_UNKNOWN)
-   speed += slave->speed;
+   if (slave->speed != SPEED_UNKNOWN) {
+   if (BOND_MODE(bond) == BOND_MODE_BROADCAST)
+   speed = bond_mode_bcast_speed(slave,
+ speed);
+   else
+   speed += slave->speed;
+   }
if (cmd->base.duplex == DUPLEX_UNKNOWN &&
slave->duplex != DUPLEX_UNKNOWN)
cmd->base.duplex = slave->duplex;
-- 
2.20.1



Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-12 Thread Jarod Wilson
On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
> > Jarod Wilson  wrote:
> > 
> > >This comes from an end-user request, where they're running multiple VMs on
> > >hosts with bonded interfaces connected to some interest switch topologies,
> > >where 802.3ad isn't an option. They're currently running a proprietary
> > >solution that effectively achieves load-balancing of VMs and bandwidth
> > >utilization improvements with a similar form of transmission algorithm.
> > >
> > >Basically, each VM has it's own vlan, so it always sends its traffic out
> > >the same interface, unless that interface fails. Traffic gets split
> > >between the interfaces, maintaining a consistent path, with failover still
> > >available if an interface goes down.
> > >
> > >This has been rudimetarily tested to provide similar results, suitable for
> > >them to use to move off their current proprietary solution.
> > >
> > >Still on the TODO list, if these even looks sane to begin with, is
> > >fleshing out Documentation/networking/bonding.rst.
> > 
> > I'm sure you're aware, but any final submission will also need
> > to include netlink and iproute2 support.
> 
> I believe everything for netlink support is already included, but I'll
> double-check that before submitting something for inclusion consideration.

I'm not certain if what you actually meant was that I'd have to patch
iproute2 as well, which I've definitely stumbled onto today, but it's a
2-line patch, and everything seems to be working fine with it:

$ sudo ip link set bond0 type bond xmit_hash_policy 5
$ ip -d link show bond0
11: bond0:  mtu 1500 qdisc noop state DOWN mode 
DEFAULT group default qlen 1000
link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 
maxmtu 65535
bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 
use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any 
primary_reselect always fail_over_mac none xmit_hash_policy vlansrc resend_igmp 
1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 
packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 
addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 
65535
$ grep Hash /proc/net/bonding/bond0
Transmit Hash Policy: vlansrc (5)

Nothing bad seems to happen on an older kernel if one tries to set the new
hash, you just get told that it's an invalid argument.

I *think* this is all ready for submission then, so I'll get both the kernel
and iproute2 patches out soon.

-- 
Jarod Wilson
ja...@redhat.com



Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-12 Thread Jarod Wilson
On Tue, Jan 12, 2021 at 01:39:10PM -0800, Jay Vosburgh wrote:
> Jarod Wilson  wrote:
> 
> >On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
> >> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
> >> > Jarod Wilson  wrote:
> >> > 
> >> > >This comes from an end-user request, where they're running multiple VMs 
> >> > >on
> >> > >hosts with bonded interfaces connected to some interest switch 
> >> > >topologies,
> >> > >where 802.3ad isn't an option. They're currently running a proprietary
> >> > >solution that effectively achieves load-balancing of VMs and bandwidth
> >> > >utilization improvements with a similar form of transmission algorithm.
> >> > >
> >> > >Basically, each VM has it's own vlan, so it always sends its traffic out
> >> > >the same interface, unless that interface fails. Traffic gets split
> >> > >between the interfaces, maintaining a consistent path, with failover 
> >> > >still
> >> > >available if an interface goes down.
> >> > >
> >> > >This has been rudimetarily tested to provide similar results, suitable 
> >> > >for
> >> > >them to use to move off their current proprietary solution.
> >> > >
> >> > >Still on the TODO list, if these even looks sane to begin with, is
> >> > >fleshing out Documentation/networking/bonding.rst.
> >> > 
> >> >  I'm sure you're aware, but any final submission will also need
> >> > to include netlink and iproute2 support.
> >> 
> >> I believe everything for netlink support is already included, but I'll
> >> double-check that before submitting something for inclusion consideration.
> >
> >I'm not certain if what you actually meant was that I'd have to patch
> >iproute2 as well, which I've definitely stumbled onto today, but it's a
> >2-line patch, and everything seems to be working fine with it:
> 
>   Yes, that's what I meant.
> 
> >$ sudo ip link set bond0 type bond xmit_hash_policy 5
> 
>   Does the above work with the text label (presumably "vlansrc")
> as well as the number, and does "ip link add test type bond help" print
> the correct text for XMIT_HASH_POLICY?

All of the above looks correct to me, output below. Before submitting...
Could rename it from vlansrc to vlan+srcmac or some variation thereof if
it's desired. I tried to keep it relatively short, but it's perhaps a bit
less succinct like I have it now, and other modes include a +.

$ sudo modprobe bonding mode=2 max_bonds=1 xmit_hash_policy=0
$ sudo ip link set bond0 type bond xmit_hash_policy vlansrc
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v4.18.0-272.el8.vstx.x86_64

Bonding Mode: load balancing (xor)
Transmit Hash Policy: vlansrc (5)
MII Status: down
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

$ sudo ip link add test type bond help
Usage: ... bond [ mode BONDMODE ] [ active_slave SLAVE_DEV ]
[ clear_active_slave ] [ miimon MIIMON ]
[ updelay UPDELAY ] [ downdelay DOWNDELAY ]
[ peer_notify_delay DELAY ]
[ use_carrier USE_CARRIER ]
[ arp_interval ARP_INTERVAL ]
[ arp_validate ARP_VALIDATE ]
[ arp_all_targets ARP_ALL_TARGETS ]
[ arp_ip_target [ ARP_IP_TARGET, ... ] ]
[ primary SLAVE_DEV ]
[ primary_reselect PRIMARY_RESELECT ]
[ fail_over_mac FAIL_OVER_MAC ]
[ xmit_hash_policy XMIT_HASH_POLICY ]
[ resend_igmp RESEND_IGMP ]
[ num_grat_arp|num_unsol_na NUM_GRAT_ARP|NUM_UNSOL_NA ]
[ all_slaves_active ALL_SLAVES_ACTIVE ]
[ min_links MIN_LINKS ]
[ lp_interval LP_INTERVAL ]
[ packets_per_slave PACKETS_PER_SLAVE ]
[ tlb_dynamic_lb TLB_DYNAMIC_LB ]
[ lacp_rate LACP_RATE ]
        [ ad_select AD_SELECT ]
[ ad_user_port_key PORTKEY ]
[ ad_actor_sys_prio SYSPRIO ]
[ ad_actor_system LLADDR ]

BONDMODE := 
balance-rr|active-backup|balance-xor|broadcast|802.3ad|balance-tlb|balance-alb
ARP_VALIDATE := none|active|backup|all
ARP_ALL_TARGETS := any|all
PRIMARY_RESELECT := always|better|failure
FAIL_OVER_MAC := none|active|follow
XMIT_HASH_POLICY := layer2|layer2+3|layer3+4|encap2+3|encap3+4|vlansrc
LACP_RATE := slow|fast
AD_SELECT := stable|bandwidth|count


-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net-next v3] bonding: add a vlan+srcmac tx hashing option

2021-01-18 Thread Jarod Wilson
On Mon, Jan 18, 2021 at 04:10:38PM -0700, David Ahern wrote:
> On 1/15/21 12:21 PM, Jarod Wilson wrote:
> > diff --git a/Documentation/networking/bonding.rst 
> > b/Documentation/networking/bonding.rst
> > index adc314639085..36562dcd3e1e 100644
> > --- a/Documentation/networking/bonding.rst
> > +++ b/Documentation/networking/bonding.rst
> > @@ -951,6 +951,19 @@ xmit_hash_policy
> > packets will be distributed according to the encapsulated
> > flows.
> >  
> > +   vlan+srcmac
> > +
> > +   This policy uses a very rudimentary vland ID and source mac
> 
> s/vland/vlan/
> 
> > +   ID hash to load-balance traffic per-vlan, with failover
> 
> drop ID on this line; just 'source mac'.

Bah. Crap. Didn't test documentation, clearly. Or proof-read it. Will fix
in v4. Hopefully, nothing else to change though...

-- 
Jarod Wilson
ja...@redhat.com



[PATCH net-next v4] bonding: add a vlan+srcmac tx hashing option

2021-01-18 Thread Jarod Wilson
This comes from an end-user request, where they're running multiple VMs on
hosts with bonded interfaces connected to some interest switch topologies,
where 802.3ad isn't an option. They're currently running a proprietary
solution that effectively achieves load-balancing of VMs and bandwidth
utilization improvements with a similar form of transmission algorithm.

Basically, each VM has it's own vlan, so it always sends its traffic out
the same interface, unless that interface fails. Traffic gets split
between the interfaces, maintaining a consistent path, with failover still
available if an interface goes down.

Unlike bond_eth_hash(), this hash function is using the full source MAC
address instead of just the last byte, as there are so few components to
the hash, and in the no-vlan case, we would be returning just the last
byte of the source MAC as the hash value. It's entirely possible to have
two NICs in a bond with the same last byte of their MAC, but not the same
MAC, so this adjustment should guarantee distinct hashes in all cases.

This has been rudimetarily tested to provide similar results to the
proprietary solution it is aiming to replace. A patch for iproute2 is also
posted, to properly support the new mode there as well.

Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: Thomas Davis 
Cc: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
v2: verified netlink interfaces working, added Documentation, changed
tx hash mode name to vlan+mac for consistency and clarity.
v3: drop inline from hash function, use full source MAC, not just the
last byte, expand explanation in patch description, extend hash name to
vlan+srcmac.
v4: fix documentation issues pointed out by David Ahern

 Documentation/networking/bonding.rst | 13 +++
 drivers/net/bonding/bond_main.c  | 34 ++--
 drivers/net/bonding/bond_options.c   | 13 ++-
 include/linux/netdevice.h|  1 +
 include/uapi/linux/if_bonding.h  |  1 +
 5 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/bonding.rst 
b/Documentation/networking/bonding.rst
index adc314639085..36562dcd3e1e 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -951,6 +951,19 @@ xmit_hash_policy
packets will be distributed according to the encapsulated
flows.
 
+   vlan+srcmac
+
+   This policy uses a very rudimentary vlan ID and source mac
+   hash to load-balance traffic per-vlan, with failover
+   should one leg fail. The intended use case is for a bond
+   shared by multiple virtual machines, all configured to
+   use their own vlan, to give lacp-like functionality
+   without requiring lacp-capable switching hardware.
+
+   The formula for the hash is simply
+
+   hash = (vlan ID) XOR (source MAC vendor) XOR (source MAC dev)
+
The default value is layer2.  This option was added in bonding
version 2.6.3.  In earlier versions of bonding, this parameter
does not exist, and the layer2 policy is the only policy.  The
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5fe5232cc3f3..d4bc4d4e953b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -164,7 +164,7 @@ module_param(xmit_hash_policy, charp, 0);
 MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 
802.3ad hashing method; "
   "0 for layer 2 (default), 1 for layer 3+4, "
   "2 for layer 2+3, 3 for encap layer 2+3, "
-  "4 for encap layer 3+4");
+  "4 for encap layer 3+4, 5 for vlan+srcmac");
 module_param(arp_interval, int, 0);
 MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
 module_param_array(arp_ip_target, charp, NULL, 0);
@@ -1434,6 +1434,8 @@ static enum netdev_lag_hash bond_lag_hash_type(struct 
bonding *bond,
return NETDEV_LAG_HASH_E23;
case BOND_XMIT_POLICY_ENCAP34:
return NETDEV_LAG_HASH_E34;
+   case BOND_XMIT_POLICY_VLAN_SRCMAC:
+   return NETDEV_LAG_HASH_VLAN_SRCMAC;
default:
return NETDEV_LAG_HASH_UNKNOWN;
}
@@ -3494,6 +3496,27 @@ static bool bond_flow_ip(struct sk_buff *skb, struct 
flow_keys *fk,
return true;
 }
 
+static u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
+{
+   struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+   u32 srcmac_vendor = 0, srcmac_dev = 0;
+   u16 vlan;
+   int i;
+
+   for (i = 0; i < 3; i++)
+   srcmac_vendor = (srcmac_vendor << 8) | mac_hdr->h_source[i];
+

Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-07 Thread Jarod Wilson
On Mon, Dec 28, 2020 at 11:11:45AM +0100, Jiri Pirko wrote:
> Fri, Dec 18, 2020 at 08:30:33PM CET, ja...@redhat.com wrote:
> >This comes from an end-user request, where they're running multiple VMs on
> >hosts with bonded interfaces connected to some interest switch topologies,
> >where 802.3ad isn't an option. They're currently running a proprietary
> >solution that effectively achieves load-balancing of VMs and bandwidth
> >utilization improvements with a similar form of transmission algorithm.
> >
> >Basically, each VM has it's own vlan, so it always sends its traffic out
> >the same interface, unless that interface fails. Traffic gets split
> >between the interfaces, maintaining a consistent path, with failover still
> >available if an interface goes down.
> >
> >This has been rudimetarily tested to provide similar results, suitable for
> >them to use to move off their current proprietary solution.
> >
> >Still on the TODO list, if these even looks sane to begin with, is
> >fleshing out Documentation/networking/bonding.rst.
> 
> Jarod, did you consider using team driver instead ? :)

That's actually one of the things that was suggested, since team I believe
already has support for this, but the user really wants to use bonding.
We're finding that a lot of users really still prefer bonding over team.

-- 
Jarod Wilson
ja...@redhat.com



Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-07 Thread Jarod Wilson
On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
> Jarod Wilson  wrote:
> 
> >This comes from an end-user request, where they're running multiple VMs on
> >hosts with bonded interfaces connected to some interest switch topologies,
> >where 802.3ad isn't an option. They're currently running a proprietary
> >solution that effectively achieves load-balancing of VMs and bandwidth
> >utilization improvements with a similar form of transmission algorithm.
> >
> >Basically, each VM has it's own vlan, so it always sends its traffic out
> >the same interface, unless that interface fails. Traffic gets split
> >between the interfaces, maintaining a consistent path, with failover still
> >available if an interface goes down.
> >
> >This has been rudimetarily tested to provide similar results, suitable for
> >them to use to move off their current proprietary solution.
> >
> >Still on the TODO list, if these even looks sane to begin with, is
> >fleshing out Documentation/networking/bonding.rst.
> 
>   I'm sure you're aware, but any final submission will also need
> to include netlink and iproute2 support.

I believe everything for netlink support is already included, but I'll
double-check that before submitting something for inclusion consideration.

> >diff --git a/drivers/net/bonding/bond_main.c 
> >b/drivers/net/bonding/bond_main.c
> >index 5fe5232cc3f3..151ce8c7a56f 100644
> >--- a/drivers/net/bonding/bond_main.c
> >+++ b/drivers/net/bonding/bond_main.c
> >@@ -164,7 +164,7 @@ module_param(xmit_hash_policy, charp, 0);
> > MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 
> > 802.3ad hashing method; "
> >"0 for layer 2 (default), 1 for layer 3+4, "
> >"2 for layer 2+3, 3 for encap layer 2+3, "
> >-   "4 for encap layer 3+4");
> >+   "4 for encap layer 3+4, 5 for vlan+srcmac");
> > module_param(arp_interval, int, 0);
> > MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
> > module_param_array(arp_ip_target, charp, NULL, 0);
> >@@ -1434,6 +1434,8 @@ static enum netdev_lag_hash bond_lag_hash_type(struct 
> >bonding *bond,
> > return NETDEV_LAG_HASH_E23;
> > case BOND_XMIT_POLICY_ENCAP34:
> > return NETDEV_LAG_HASH_E34;
> >+case BOND_XMIT_POLICY_VLAN_SRCMAC:
> >+return NETDEV_LAG_HASH_VLAN_SRCMAC;
> > default:
> > return NETDEV_LAG_HASH_UNKNOWN;
> > }
> >@@ -3494,6 +3496,20 @@ static bool bond_flow_ip(struct sk_buff *skb, struct 
> >flow_keys *fk,
> > return true;
> > }
> > 
> >+static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
> >+{
> >+struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
> >+u32 srcmac = mac_hdr->h_source[5];
> >+u16 vlan;
> >+
> >+if (!skb_vlan_tag_present(skb))
> >+return srcmac;
> >+
> >+vlan = skb_vlan_tag_get(skb);
> >+
> >+return srcmac ^ vlan;
> 
>   For the common configuration wherein multiple VLANs are
> configured atop a single interface (and thus by default end up with the
> same MAC address), this seems like a fairly weak hash.  The TCI is 16
> bits (12 of which are the VID), but only 8 are used from the MAC, which
> will be a constant.
> 
>   Is this algorithm copying the proprietary solution you mention?

I've not actually seen the code in question, so I can't be 100% certain,
but this is exactly how it was described to me, and testing seems to bear
out that it behaves at least similarly enough for the user. They like
simplicity, and the very basic hash suits their needs, which are basically
just getting some load-balancing with failover w/o having to have lacp,
running on some older switches that can't do lacp.

-- 
Jarod Wilson
ja...@redhat.com



Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-08 Thread Jarod Wilson
On Fri, Jan 08, 2021 at 02:12:56PM +0100, Jiri Pirko wrote:
> Fri, Jan 08, 2021 at 12:58:13AM CET, ja...@redhat.com wrote:
> >On Mon, Dec 28, 2020 at 11:11:45AM +0100, Jiri Pirko wrote:
> >> Fri, Dec 18, 2020 at 08:30:33PM CET, ja...@redhat.com wrote:
> >> >This comes from an end-user request, where they're running multiple VMs on
> >> >hosts with bonded interfaces connected to some interest switch topologies,
> >> >where 802.3ad isn't an option. They're currently running a proprietary
> >> >solution that effectively achieves load-balancing of VMs and bandwidth
> >> >utilization improvements with a similar form of transmission algorithm.
> >> >
> >> >Basically, each VM has it's own vlan, so it always sends its traffic out
> >> >the same interface, unless that interface fails. Traffic gets split
> >> >between the interfaces, maintaining a consistent path, with failover still
> >> >available if an interface goes down.
> >> >
> >> >This has been rudimetarily tested to provide similar results, suitable for
> >> >them to use to move off their current proprietary solution.
> >> >
> >> >Still on the TODO list, if these even looks sane to begin with, is
> >> >fleshing out Documentation/networking/bonding.rst.
> >> 
> >> Jarod, did you consider using team driver instead ? :)
> >
> >That's actually one of the things that was suggested, since team I believe
> >already has support for this, but the user really wants to use bonding.
> >We're finding that a lot of users really still prefer bonding over team.
> 
> Do you know the reason, other than "nostalgia"?

I've heard a few different reasons that come to mind:

1) nostalgia is definitely one -- "we know bonding here"
2) support -- "the things I'm running say I need bonding to properly
support failover in their environment". How accurate this is, I don't
actually know.
3) monitoring -- "my monitoring solution knows about bonding, but not
about team". This is probably easily fixed, but may or may not be in the
user's direct control.
4) footprint -- "bonding does the job w/o team's userspace footprint".
I think this one is kind of hard for team to do anything about, bonding
really does have a smaller userspace footprint, which is a plus for
embedded type applications and high-security environments looking to keep
things as minimal as possible.

I think I've heard a few "we tried team years ago and it didn't work" as
well, which of course is ridiculous as a reason not to try something again,
since a lot can change in a few years in this world.

-- 
Jarod Wilson
ja...@redhat.com



[PATCH net-next v2] bonding: add a vlan+mac tx hashing option

2021-01-13 Thread Jarod Wilson
This comes from an end-user request, where they're running multiple VMs on
hosts with bonded interfaces connected to some interest switch topologies,
where 802.3ad isn't an option. They're currently running a proprietary
solution that effectively achieves load-balancing of VMs and bandwidth
utilization improvements with a similar form of transmission algorithm.

Basically, each VM has it's own vlan, so it always sends its traffic out
the same interface, unless that interface fails. Traffic gets split
between the interfaces, maintaining a consistent path, with failover still
available if an interface goes down.

This has been rudimetarily tested to provide similar results, suitable for
them to use to move off their current proprietary solution. A patch for
iproute2 is forthcoming as well, to properly support the new mode there as
well.

Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: Thomas Davis 
Cc: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
v2: verified netlink interfaces working, added Documentation, changed
tx hash mode name to vlan+mac for consistency and clarity.

 Documentation/networking/bonding.rst | 13 +
 drivers/net/bonding/bond_main.c  | 27 +--
 drivers/net/bonding/bond_options.c   |  1 +
 include/linux/netdevice.h|  1 +
 include/uapi/linux/if_bonding.h  |  1 +
 5 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/bonding.rst 
b/Documentation/networking/bonding.rst
index adc314639085..c78ceb7630a0 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -951,6 +951,19 @@ xmit_hash_policy
packets will be distributed according to the encapsulated
flows.
 
+   vlan+mac
+
+   This policy uses a very rudimentary vland ID and source mac
+   ID hash to load-balance traffic per-vlan, with failover
+   should one leg fail. The intended use case is for a bond
+   shared by multiple virtual machines, all configured to
+   use their own vlan, to give lacp-like functionality
+   without requiring lacp-capable switching hardware.
+
+   The formula for the hash is simply
+
+   hash = (vlan ID) XOR (source MAC)
+
The default value is layer2.  This option was added in bonding
version 2.6.3.  In earlier versions of bonding, this parameter
does not exist, and the layer2 policy is the only policy.  The
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5fe5232cc3f3..766c09a553c1 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -164,7 +164,7 @@ module_param(xmit_hash_policy, charp, 0);
 MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 
802.3ad hashing method; "
   "0 for layer 2 (default), 1 for layer 3+4, "
   "2 for layer 2+3, 3 for encap layer 2+3, "
-  "4 for encap layer 3+4");
+  "4 for encap layer 3+4, 5 for vlan+mac");
 module_param(arp_interval, int, 0);
 MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
 module_param_array(arp_ip_target, charp, NULL, 0);
@@ -1434,6 +1434,8 @@ static enum netdev_lag_hash bond_lag_hash_type(struct 
bonding *bond,
return NETDEV_LAG_HASH_E23;
case BOND_XMIT_POLICY_ENCAP34:
return NETDEV_LAG_HASH_E34;
+   case BOND_XMIT_POLICY_VLAN_SRCMAC:
+   return NETDEV_LAG_HASH_VLAN_SRCMAC;
default:
return NETDEV_LAG_HASH_UNKNOWN;
}
@@ -3494,6 +3496,20 @@ static bool bond_flow_ip(struct sk_buff *skb, struct 
flow_keys *fk,
return true;
 }
 
+static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
+{
+   struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+   u32 srcmac = mac_hdr->h_source[5];
+   u16 vlan;
+
+   if (!skb_vlan_tag_present(skb))
+   return srcmac;
+
+   vlan = skb_vlan_tag_get(skb);
+
+   return srcmac ^ vlan;
+}
+
 /* Extract the appropriate headers based on bond's xmit policy */
 static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
  struct flow_keys *fk)
@@ -3501,10 +3517,14 @@ static bool bond_flow_dissect(struct bonding *bond, 
struct sk_buff *skb,
bool l34 = bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34;
int noff, proto = -1;
 
-   if (bond->params.xmit_policy > BOND_XMIT_POLICY_LAYER23) {
+   switch (bond->params.xmit_policy) {
+   case BOND_XMIT_POLICY_ENCAP23:
+   case BOND_XMIT_POLICY_ENCAP34:
 

Re: [PATCH net-next v2] bonding: add a vlan+mac tx hashing option

2021-01-14 Thread Jarod Wilson
On Wed, Jan 13, 2021 at 05:58:18PM -0800, Jakub Kicinski wrote:
> On Wed, 13 Jan 2021 17:35:48 -0500 Jarod Wilson wrote:
> > This comes from an end-user request, where they're running multiple VMs on
> > hosts with bonded interfaces connected to some interest switch topologies,
> > where 802.3ad isn't an option. They're currently running a proprietary
> > solution that effectively achieves load-balancing of VMs and bandwidth
> > utilization improvements with a similar form of transmission algorithm.
> > 
> > Basically, each VM has it's own vlan, so it always sends its traffic out
> > the same interface, unless that interface fails. Traffic gets split
> > between the interfaces, maintaining a consistent path, with failover still
> > available if an interface goes down.
> > 
> > This has been rudimetarily tested to provide similar results, suitable for
> > them to use to move off their current proprietary solution. A patch for
> > iproute2 is forthcoming as well, to properly support the new mode there as
> > well.
> 
> > Signed-off-by: Jarod Wilson 
> > ---
> > v2: verified netlink interfaces working, added Documentation, changed
> > tx hash mode name to vlan+mac for consistency and clarity.
> > 
> >  Documentation/networking/bonding.rst | 13 +
> >  drivers/net/bonding/bond_main.c  | 27 +--
> >  drivers/net/bonding/bond_options.c   |  1 +
> >  include/linux/netdevice.h|  1 +
> >  include/uapi/linux/if_bonding.h  |  1 +
> >  5 files changed, 41 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/networking/bonding.rst 
> > b/Documentation/networking/bonding.rst
> > index adc314639085..c78ceb7630a0 100644
> > --- a/Documentation/networking/bonding.rst
> > +++ b/Documentation/networking/bonding.rst
> > @@ -951,6 +951,19 @@ xmit_hash_policy
> > packets will be distributed according to the encapsulated
> > flows.
> >  
> > +   vlan+mac
> > +
> > +   This policy uses a very rudimentary vland ID and source mac
> > +   ID hash to load-balance traffic per-vlan, with failover
> > +   should one leg fail. The intended use case is for a bond
> > +   shared by multiple virtual machines, all configured to
> > +   use their own vlan, to give lacp-like functionality
> > +   without requiring lacp-capable switching hardware.
> > +
> > +   The formula for the hash is simply
> > +
> > +   hash = (vlan ID) XOR (source MAC)
> 
> But in the code it's only using one byte of the MAC, currently.
> 
> I think that's fine for the particular use case but should we call out
> explicitly in the commit message why it's considered sufficient?
> 
> Someone can change it later, if needed, but best if we spell out the
> current motivation.

In truth, this code started out as a copy of bond_eth_hash(), which also
only uses the last byte, though of both source and destination macs. In
the typical use case for the requesting user, the bond is formed from two
onboard NICs, which typically have adjacent mac addresses, i.e.,
AA:BB:CC:DD:EE:01 and AA:BB:CC:DD:EE:02, so only the last byte is really
relevant to hash differently, but in thinking about it, a replacement NIC
because an onboard one died could have the same last byte, and maybe we
ought to just go full source mac right off the go here.

Something like this instead maybe:

static u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
{
struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
u32 srcmac = 0;
u16 vlan;
int i;

for (i = 0; i < ETH_ALEN; i++)
srcmac = (srcmac << 8) | mac_hdr->h_source[i];

if (!skb_vlan_tag_present(skb))
return srcmac;

vlan = skb_vlan_tag_get(skb);

return vlan ^ srcmac;
}

Then the documentation is spot-on, and we're future-proof, though
marginally less performant in calculating the hash, which may have been a
consideration when the original function was written, but is probably
basically irrelevant w/modern systems...

> > The default value is layer2.  This option was added in bonding
> > version 2.6.3.  In earlier versions of bonding, this parameter
> > does not exist, and the layer2 policy is the only policy.  The
> 
> > +static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
> 
> Can we drop the inline? It's a static function called once.

Works for me. That was also inherited by copying bond_eth_hash(). :)

> > +{
> > +   struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_hea

Re: [PATCH net-next v2] bonding: add a vlan+mac tx hashing option

2021-01-14 Thread Jarod Wilson
On Thu, Jan 14, 2021 at 01:23:14PM -0800, Jakub Kicinski wrote:
> On Thu, 14 Jan 2021 16:11:41 -0500 Jarod Wilson wrote:
> > In truth, this code started out as a copy of bond_eth_hash(), which also
> > only uses the last byte, though of both source and destination macs. In
> > the typical use case for the requesting user, the bond is formed from two
> > onboard NICs, which typically have adjacent mac addresses, i.e.,
> > AA:BB:CC:DD:EE:01 and AA:BB:CC:DD:EE:02, so only the last byte is really
> > relevant to hash differently, but in thinking about it, a replacement NIC
> > because an onboard one died could have the same last byte, and maybe we
> > ought to just go full source mac right off the go here.
> > 
> > Something like this instead maybe:
> > 
> > static u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
> > {
> > struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
> > u32 srcmac = 0;
> > u16 vlan;
> > int i;
> > 
> > for (i = 0; i < ETH_ALEN; i++)
> > srcmac = (srcmac << 8) | mac_hdr->h_source[i];
> > 
> > if (!skb_vlan_tag_present(skb))
> > return srcmac;
> > 
> > vlan = skb_vlan_tag_get(skb);
> > 
> > return vlan ^ srcmac;
> > }
> > 
> > Then the documentation is spot-on, and we're future-proof, though
> > marginally less performant in calculating the hash, which may have been a
> > consideration when the original function was written, but is probably
> > basically irrelevant w/modern systems...
> 
> No preference, especially if bond_eth_hash() already uses the last byte.
> Just make sure the choice is explained in the commit message.

I've sold myself on using the full MAC, because if there's no vlan tag
present, mac is the only thing used for the hash, increasing the chances
of getting the same hash for two different interfaces, which won't happen
if we've got the full MAC. Of course, I'm not sure why someone would be
using this xmit hash outside of the very particular use-case that includes
VLANs, but people do strange things...

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net] wireless/nl80211: fix wdev_id may be used uninitialized

2021-03-15 Thread Jarod Wilson
On Fri, Mar 12, 2021 at 4:04 PM Kalle Valo  wrote:
>
> Jarod Wilson  writes:
>
> > Build currently fails with -Werror=maybe-uninitialized set:
> >
> > net/wireless/nl80211.c: In function '__cfg80211_wdev_from_attrs':
> > net/wireless/nl80211.c:124:44: error: 'wdev_id' may be used
> > uninitialized in this function [-Werror=maybe-uninitialized]
>
> Really, build fails? Is -Werror enabled by default now? I hope not.

Don't think so. But we (Red Hat) build all our kernels with a fair
amount of extra error-checking enabled.

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread

2020-12-09 Thread Jarod Wilson
On Tue, Dec 8, 2020 at 2:38 PM Jakub Kicinski  wrote:
>
> On Sat,  5 Dec 2020 18:43:54 -0500 Jarod Wilson wrote:
> > I'm seeing a system get stuck unable to bring a downed interface back up
> > when it's got an updelay value set, behavior which ceased when logging
> > spew was removed from bond_miimon_inspect(). I'm monitoring logs on this
> > system over another network connection, and it seems that the act of
> > spewing logs at all there increases rtnl lock contention, because
> > instrumented code showed bond_mii_monitor() never able to succeed in it's
> > attempts to call rtnl_trylock() to actually commit link state changes,
> > leaving the downed link stuck in BOND_LINK_DOWN. The system in question
> > appears to be fine with the log spew being moved to
> > bond_commit_link_state(), which is called after the successful
> > rtnl_trylock().
>
> But it's not called under rtnl_lock AFAICT. So something else is also
> spewing messages?
>
> While bond_commit_link_state() _is_ called under the lock. So you're
> increasing the retry rate, by putting the slow operation under the
> lock, is that right?

Partially, yes. I probably should have tagged this with RFC instead of
PATCH, tbh. My theory was that the log spew, being sent out *other*
network interfaces when monitoring the system or remote syslog or ssh
was potentially causing some rtnl_lock() calls, so not spewing until
after actually being able to grab the lock would lessen the problem
w/actually acquiring the lock, but I ... don't know offhand how to
verify that theory.


> Also isn't bond_commit_link_state() called from many more places?
> So we're adding new prints, effectively?

Ah. Crap. Yes. bond_set_slave_link_state() is called quite a few
places, and that in turn calls bond_commit_link_state().


> > I'm actually wondering if perhaps we ultimately need/want
> > some bond-specific lock here to prevent racing with bond_close() instead
> > of using rtnl, but this shift of the output appears to work. I believe
> > this started happening when de77ecd4ef02 ("bonding: improve link-status
> > update in mii-monitoring") went in, but I'm not 100% on that.
> >
> > The addition of a case BOND_LINK_BACK in bond_miimon_inspect() is somewhat
> > separate from the fix for the actual hang, but it eliminates a constant
> > "invalid new link 3 on slave" message seen related to this issue, and it's
> > not actually an invalid state here, so we shouldn't be reporting it as an
> > error.
>
> Let's make it a separate patch, then.

Sounds like Jay is confident that bit is valid, and I shouldn't be
ending up in that state, unless something else is going wrong.

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread

2020-12-09 Thread Jarod Wilson
On Tue, Dec 8, 2020 at 3:35 PM Jay Vosburgh  wrote:
>
> Jarod Wilson  wrote:
>
> >I'm seeing a system get stuck unable to bring a downed interface back up
> >when it's got an updelay value set, behavior which ceased when logging
> >spew was removed from bond_miimon_inspect(). I'm monitoring logs on this
> >system over another network connection, and it seems that the act of
> >spewing logs at all there increases rtnl lock contention, because
> >instrumented code showed bond_mii_monitor() never able to succeed in it's
> >attempts to call rtnl_trylock() to actually commit link state changes,
> >leaving the downed link stuck in BOND_LINK_DOWN. The system in question
> >appears to be fine with the log spew being moved to
> >bond_commit_link_state(), which is called after the successful
> >rtnl_trylock(). I'm actually wondering if perhaps we ultimately need/want
> >some bond-specific lock here to prevent racing with bond_close() instead
> >of using rtnl, but this shift of the output appears to work. I believe
> >this started happening when de77ecd4ef02 ("bonding: improve link-status
> >update in mii-monitoring") went in, but I'm not 100% on that.
>
> We use RTNL not to avoid deadlock with bonding itself, but
> because the "commit" side undertakes actions which require RTNL, e.g.,
> various events will eventually call netdev_lower_state_changed.
>
> However, the RTNL acquisition is a trylock to avoid the deadlock
> with bond_close.  Moving that out of line here (e.g., putting the commit
> into another work queue event or the like) has the same problem, in that
> bond_close needs to wait for all of the work queue events to finish, and
> it holds RTNL.

Ah, okay, it wasn't clear to me that we actually do need RTNL here,
I'd thought it was just for the deadlock avoidance with bond_close,
based on the comments in the source.

> Also, a dim memory says that the various notification messages
> were mostly placed in the "inspect" phase and not the "commit" phase to
> avoid doing printk-like activities with RTNL held.  As a general
> principle, I don't think we want to add more verbiage under RTNL.

Yeah, that's fair.

> >The addition of a case BOND_LINK_BACK in bond_miimon_inspect() is somewhat
> >separate from the fix for the actual hang, but it eliminates a constant
> >"invalid new link 3 on slave" message seen related to this issue, and it's
> >not actually an invalid state here, so we shouldn't be reporting it as an
> >error.
>
> Do you mean bond_miimon_commit here and not bond_miimon_inspect
> (which already has a case for BOND_LINK_BACK)?

Whoops, yes.

> In principle, bond_miimon_commit should not see _BACK or _FAIL
> state as a new link state, because those states should be managed at the
> bond_miimon_inspect level (as they are the result of updelay and
> downdelay).  These states should not be "committed" in the sense of
> causing notifications or doing actions that require RTNL.
>
> My recollection is that the "invalid new link" messages were the
> result of a bug in de77ecd4ef02, which was fixed in 1899bb325149
> ("bonding: fix state transition issue in link monitoring"), but maybe
> the RTNL problem here induces that in some other fashion.
>
> Either way, I believe this message is correct as-is.

Hm, okay, definitely seeing this message pop up regularly when a link
recovers, using a fairly simple reproducer:

slave1=p6p1
slave2=p6p2

modprobe -rv bonding
modprobe -v bonding mode=2 miimon=100 updelay=200
ip link set bond0 up
ifenslave bond0 $slave1 $slave2
sleep 10

while :
do
  ip link set $slave1 down
   sleep 1
   ip link set $slave1 up
   sleep 1
done

I wasn't actually seeing the problem until I was running a 'watch -n 1
"dmesg | tail -n 50"' or similar in a remote ssh session on the host.

I should add the caveat that this was also initially seen on an older
kernel, but with a fairly up-to-date bonding driver, which does
include both de77ecd4ef02 and 1899bb325149. I'm going to keep prodding
w/a more recent upstreamier kernel, and see if I can get a better idea
of what's actually going on.

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net v3] bonding: fix feature flag setting at init time

2020-12-03 Thread Jarod Wilson
On Thu, Dec 3, 2020 at 11:50 AM Jakub Kicinski  wrote:
>
> On Wed,  2 Dec 2020 19:43:57 -0500 Jarod Wilson wrote:
> >   bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL | NETIF_F_GSO_UDP_L4;
> > -#ifdef CONFIG_XFRM_OFFLOAD
> > - bond_dev->hw_features |= BOND_XFRM_FEATURES;
> > -#endif /* CONFIG_XFRM_OFFLOAD */
> >   bond_dev->features |= bond_dev->hw_features;
> >   bond_dev->features |= NETIF_F_HW_VLAN_CTAG_TX | 
> > NETIF_F_HW_VLAN_STAG_TX;
> >  #ifdef CONFIG_XFRM_OFFLOAD
> > - /* Disable XFRM features if this isn't an active-backup config */
> > - if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)
> > - bond_dev->features &= ~BOND_XFRM_FEATURES;
> > + bond_dev->hw_features |= BOND_XFRM_FEATURES;
> > + /* Only enable XFRM features if this is an active-backup config */
> > + if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP)
> > + bond_dev->features |= BOND_XFRM_FEATURES;
> >  #endif /* CONFIG_XFRM_OFFLOAD */
>
> This makes no functional change, or am I reading it wrong?

You are correct, there's ultimately no functional change there, it
primarily just condenses the code down to a single #ifdef block, and
doesn't add and then remove BOND_XFRM_FEATURES from
bond_dev->features, instead omitting it initially and only adding it
when in AB mode. I'd poked at the code in that area while trying to
get to the bottom of this, thought it made it more understandable, so
I left it in, but ultimately, it's not necessary to fix the problem
here.

-- 
Jarod Wilson
ja...@redhat.com



[PATCH net-next] bonding: set xfrm feature flags more sanely

2020-12-05 Thread Jarod Wilson
We can remove one of the ifdef blocks here, and instead of setting both
the xfrm hw_features and features flags, then unsetting the the features
flags if not in AB, wait to set the features flags if we're actually in AB
mode.

Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: Thomas Davis 
Cc: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/bonding/bond_main.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e0880a3840d7..5fe5232cc3f3 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4746,15 +4746,13 @@ void bond_setup(struct net_device *bond_dev)
NETIF_F_HW_VLAN_CTAG_FILTER;
 
bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
-#ifdef CONFIG_XFRM_OFFLOAD
-   bond_dev->hw_features |= BOND_XFRM_FEATURES;
-#endif /* CONFIG_XFRM_OFFLOAD */
bond_dev->features |= bond_dev->hw_features;
bond_dev->features |= NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX;
 #ifdef CONFIG_XFRM_OFFLOAD
-   /* Disable XFRM features if this isn't an active-backup config */
-   if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)
-   bond_dev->features &= ~BOND_XFRM_FEATURES;
+   bond_dev->hw_features |= BOND_XFRM_FEATURES;
+   /* Only enable XFRM features if this is an active-backup config */
+   if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP)
+   bond_dev->features |= BOND_XFRM_FEATURES;
 #endif /* CONFIG_XFRM_OFFLOAD */
 }
 
-- 
2.28.0



[PATCH net v4] bonding: fix feature flag setting at init time

2020-12-05 Thread Jarod Wilson
Don't try to adjust XFRM support flags if the bond device isn't yet
registered. Bad things can currently happen when netdev_change_features()
is called without having wanted_features fully filled in yet. This code
runs both on post-module-load mode changes, as well as at module init
time, and when run at module init time, it is before register_netdevice()
has been called and filled in wanted_features. The empty wanted_features
led to features also getting emptied out, which was definitely not the
intended behavior, so prevent that from happening.

Originally, I'd hoped to stop adjusting wanted_features at all in the
bonding driver, as it's documented as being something only the network
core should touch, but we actually do need to do this to properly update
both the features and wanted_features fields when changing the bond type,
or we get to a situation where ethtool sees:

esp-hw-offload: off [requested on]

I do think we should be using netdev_update_features instead of
netdev_change_features here though, so we only send notifiers when the
features actually changed.

Fixes: a3b658cfb664 ("bonding: allow xfrm offload setup post-module-load")
Reported-by: Ivan Vecera 
Suggested-by: Ivan Vecera 
Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: Thomas Davis 
Cc: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
v2: rework based on further testing and suggestions from ivecera
v3: add helper function, remove goto
v4: drop hunk not directly related to fix, clean up ifdeffery

 drivers/net/bonding/bond_options.c | 22 +++---
 include/net/bonding.h  |  2 --
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/bonding/bond_options.c 
b/drivers/net/bonding/bond_options.c
index 9abfaae1c6f7..a4e4e15f574d 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -745,6 +745,19 @@ const struct bond_option *bond_opt_get(unsigned int option)
return &bond_opts[option];
 }
 
+static void bond_set_xfrm_features(struct net_device *bond_dev, u64 mode)
+{
+   if (!IS_ENABLED(CONFIG_XFRM_OFFLOAD))
+   return;
+
+   if (mode == BOND_MODE_ACTIVEBACKUP)
+   bond_dev->wanted_features |= BOND_XFRM_FEATURES;
+   else
+   bond_dev->wanted_features &= ~BOND_XFRM_FEATURES;
+
+   netdev_update_features(bond_dev);
+}
+
 static int bond_option_mode_set(struct bonding *bond,
const struct bond_opt_value *newval)
 {
@@ -767,13 +780,8 @@ static int bond_option_mode_set(struct bonding *bond,
if (newval->value == BOND_MODE_ALB)
bond->params.tlb_dynamic_lb = 1;
 
-#ifdef CONFIG_XFRM_OFFLOAD
-   if (newval->value == BOND_MODE_ACTIVEBACKUP)
-   bond->dev->wanted_features |= BOND_XFRM_FEATURES;
-   else
-   bond->dev->wanted_features &= ~BOND_XFRM_FEATURES;
-   netdev_change_features(bond->dev);
-#endif /* CONFIG_XFRM_OFFLOAD */
+   if (bond->dev->reg_state == NETREG_REGISTERED)
+   bond_set_xfrm_features(bond->dev, newval->value);
 
/* don't cache arp_validate between modes */
bond->params.arp_validate = BOND_ARP_VALIDATE_NONE;
diff --git a/include/net/bonding.h b/include/net/bonding.h
index d9d0ff3b0ad3..adc3da776970 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -86,10 +86,8 @@
 #define bond_for_each_slave_rcu(bond, pos, iter) \
netdev_for_each_lower_private_rcu((bond)->dev, pos, iter)
 
-#ifdef CONFIG_XFRM_OFFLOAD
 #define BOND_XFRM_FEATURES (NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM | \
NETIF_F_GSO_ESP)
-#endif /* CONFIG_XFRM_OFFLOAD */
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
 extern atomic_t netpoll_block_tx;
-- 
2.28.0



Re: [PATCH net v3] bonding: fix feature flag setting at init time

2020-12-05 Thread Jarod Wilson
On Thu, Dec 3, 2020 at 11:45 AM Jakub Kicinski  wrote:
...
> nit: let's narrow down the ifdef-enery
>
> no need for the ifdef here, if the helper looks like this:
>
> +static void bond_set_xfrm_features(struct net_device *bond_dev, u64 mode)
> +{
> +#ifdef CONFIG_XFRM_OFFLOAD
> +   if (mode == BOND_MODE_ACTIVEBACKUP)
> +   bond_dev->wanted_features |= BOND_XFRM_FEATURES;
> +   else
> +   bond_dev->wanted_features &= ~BOND_XFRM_FEATURES;
> +
> +   netdev_update_features(bond_dev);
> +#endif /* CONFIG_XFRM_OFFLOAD */
> +}
>
> Even better:
>
> +static void bond_set_xfrm_features(struct net_device *bond_dev, u64 mode)
> +{
> +   if (!IS_ENABLED(CONFIG_XFRM_OFFLOAD))
> +   return;
> +
> +   if (mode == BOND_MODE_ACTIVEBACKUP)
> +   bond_dev->wanted_features |= BOND_XFRM_FEATURES;
> +   else
> +   bond_dev->wanted_features &= ~BOND_XFRM_FEATURES;
> +
> +   netdev_update_features(bond_dev);
> +}
>
> (Assuming BOND_XFRM_FEATURES doesn't itself hide under an ifdef.)

It is, but doesn't need to be. I can mix these changes in as well.

-- 
Jarod Wilson
ja...@redhat.com



[PATCH net] bonding: reduce rtnl lock contention in mii monitor thread

2020-12-05 Thread Jarod Wilson
I'm seeing a system get stuck unable to bring a downed interface back up
when it's got an updelay value set, behavior which ceased when logging
spew was removed from bond_miimon_inspect(). I'm monitoring logs on this
system over another network connection, and it seems that the act of
spewing logs at all there increases rtnl lock contention, because
instrumented code showed bond_mii_monitor() never able to succeed in it's
attempts to call rtnl_trylock() to actually commit link state changes,
leaving the downed link stuck in BOND_LINK_DOWN. The system in question
appears to be fine with the log spew being moved to
bond_commit_link_state(), which is called after the successful
rtnl_trylock(). I'm actually wondering if perhaps we ultimately need/want
some bond-specific lock here to prevent racing with bond_close() instead
of using rtnl, but this shift of the output appears to work. I believe
this started happening when de77ecd4ef02 ("bonding: improve link-status
update in mii-monitoring") went in, but I'm not 100% on that.

The addition of a case BOND_LINK_BACK in bond_miimon_inspect() is somewhat
separate from the fix for the actual hang, but it eliminates a constant
"invalid new link 3 on slave" message seen related to this issue, and it's
not actually an invalid state here, so we shouldn't be reporting it as an
error.

CC: Mahesh Bandewar 
CC: Jay Vosburgh 
CC: Veaceslav Falico 
CC: Andy Gospodarek 
CC: "David S. Miller" 
CC: Jakub Kicinski 
CC: Thomas Davis 
CC: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/bonding/bond_main.c | 26 ++
 include/net/bonding.h   | 38 +
 2 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 47afc5938c26..cdb6c64f16b6 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2292,23 +2292,13 @@ static int bond_miimon_inspect(struct bonding *bond)
bond_propose_link_state(slave, BOND_LINK_FAIL);
commit++;
slave->delay = bond->params.downdelay;
-   if (slave->delay) {
-   slave_info(bond->dev, slave->dev, "link status 
down for %sinterface, disabling it in %d ms\n",
-  (BOND_MODE(bond) ==
-   BOND_MODE_ACTIVEBACKUP) ?
-   (bond_is_active_slave(slave) ?
-"active " : "backup ") : "",
-  bond->params.downdelay * 
bond->params.miimon);
-   }
+
fallthrough;
case BOND_LINK_FAIL:
if (link_state) {
/* recovered before downdelay expired */
bond_propose_link_state(slave, BOND_LINK_UP);
slave->last_link_up = jiffies;
-   slave_info(bond->dev, slave->dev, "link status 
up again after %d ms\n",
-  (bond->params.downdelay - 
slave->delay) *
-  bond->params.miimon);
commit++;
continue;
}
@@ -2330,19 +2320,10 @@ static int bond_miimon_inspect(struct bonding *bond)
commit++;
slave->delay = bond->params.updelay;
 
-   if (slave->delay) {
-   slave_info(bond->dev, slave->dev, "link status 
up, enabling it in %d ms\n",
-  ignore_updelay ? 0 :
-  bond->params.updelay *
-  bond->params.miimon);
-   }
fallthrough;
case BOND_LINK_BACK:
if (!link_state) {
bond_propose_link_state(slave, BOND_LINK_DOWN);
-   slave_info(bond->dev, slave->dev, "link status 
down again after %d ms\n",
-  (bond->params.updelay - 
slave->delay) *
-  bond->params.miimon);
commit++;
continue;
}
@@ -2456,6 +2437,11 @@ static void bond_miimon_commit(struct bonding *bond)
 
continue;
 
+   case BOND_LINK_BACK:
+   bond_propose_link_state(slave, BOND_LINK_NOCHANGE);
+

[RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2020-12-18 Thread Jarod Wilson
This comes from an end-user request, where they're running multiple VMs on
hosts with bonded interfaces connected to some interest switch topologies,
where 802.3ad isn't an option. They're currently running a proprietary
solution that effectively achieves load-balancing of VMs and bandwidth
utilization improvements with a similar form of transmission algorithm.

Basically, each VM has it's own vlan, so it always sends its traffic out
the same interface, unless that interface fails. Traffic gets split
between the interfaces, maintaining a consistent path, with failover still
available if an interface goes down.

This has been rudimetarily tested to provide similar results, suitable for
them to use to move off their current proprietary solution.

Still on the TODO list, if these even looks sane to begin with, is
fleshing out Documentation/networking/bonding.rst.

Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: Thomas Davis 
Cc: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/bonding/bond_main.c| 27 +--
 drivers/net/bonding/bond_options.c |  1 +
 include/linux/netdevice.h  |  1 +
 include/uapi/linux/if_bonding.h|  1 +
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5fe5232cc3f3..151ce8c7a56f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -164,7 +164,7 @@ module_param(xmit_hash_policy, charp, 0);
 MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 
802.3ad hashing method; "
   "0 for layer 2 (default), 1 for layer 3+4, "
   "2 for layer 2+3, 3 for encap layer 2+3, "
-  "4 for encap layer 3+4");
+  "4 for encap layer 3+4, 5 for vlan+srcmac");
 module_param(arp_interval, int, 0);
 MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
 module_param_array(arp_ip_target, charp, NULL, 0);
@@ -1434,6 +1434,8 @@ static enum netdev_lag_hash bond_lag_hash_type(struct 
bonding *bond,
return NETDEV_LAG_HASH_E23;
case BOND_XMIT_POLICY_ENCAP34:
return NETDEV_LAG_HASH_E34;
+   case BOND_XMIT_POLICY_VLAN_SRCMAC:
+   return NETDEV_LAG_HASH_VLAN_SRCMAC;
default:
return NETDEV_LAG_HASH_UNKNOWN;
}
@@ -3494,6 +3496,20 @@ static bool bond_flow_ip(struct sk_buff *skb, struct 
flow_keys *fk,
return true;
 }
 
+static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
+{
+   struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+   u32 srcmac = mac_hdr->h_source[5];
+   u16 vlan;
+
+   if (!skb_vlan_tag_present(skb))
+   return srcmac;
+
+   vlan = skb_vlan_tag_get(skb);
+
+   return srcmac ^ vlan;
+}
+
 /* Extract the appropriate headers based on bond's xmit policy */
 static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
  struct flow_keys *fk)
@@ -3501,10 +3517,14 @@ static bool bond_flow_dissect(struct bonding *bond, 
struct sk_buff *skb,
bool l34 = bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34;
int noff, proto = -1;
 
-   if (bond->params.xmit_policy > BOND_XMIT_POLICY_LAYER23) {
+   switch (bond->params.xmit_policy) {
+   case BOND_XMIT_POLICY_ENCAP23:
+   case BOND_XMIT_POLICY_ENCAP34:
memset(fk, 0, sizeof(*fk));
return __skb_flow_dissect(NULL, skb, &flow_keys_bonding,
  fk, NULL, 0, 0, 0, 0);
+   default:
+   break;
}
 
fk->ports.ports = 0;
@@ -3556,6 +3576,9 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff 
*skb)
skb->l4_hash)
return skb->hash;
 
+   if (bond->params.xmit_policy == BOND_XMIT_POLICY_VLAN_SRCMAC)
+   return bond_vlan_srcmac_hash(skb);
+
if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
!bond_flow_dissect(bond, skb, &flow))
return bond_eth_hash(skb);
diff --git a/drivers/net/bonding/bond_options.c 
b/drivers/net/bonding/bond_options.c
index a4e4e15f574d..9826fe46fca1 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -101,6 +101,7 @@ static const struct bond_opt_value bond_xmit_hashtype_tbl[] 
= {
{ "layer2+3", BOND_XMIT_POLICY_LAYER23, 0},
{ "encap2+3", BOND_XMIT_POLICY_ENCAP23, 0},
{ "encap3+4", BOND_XMIT_POLICY_ENCAP34, 0},
+   { "vlansrc",  BOND_XMIT_POLICY_VLAN_SRCMAC,  0},
{ NULL,   -1,   0},
 };
 
diff --g

Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread

2020-12-18 Thread Jarod Wilson
On Tue, Dec 8, 2020 at 3:35 PM Jay Vosburgh  wrote:
>
> Jarod Wilson  wrote:
...
> >The addition of a case BOND_LINK_BACK in bond_miimon_commit() is somewhat
> >separate from the fix for the actual hang, but it eliminates a constant
> >"invalid new link 3 on slave" message seen related to this issue, and it's
> >not actually an invalid state here, so we shouldn't be reporting it as an
> >error.
...
> In principle, bond_miimon_commit should not see _BACK or _FAIL
> state as a new link state, because those states should be managed at the
> bond_miimon_inspect level (as they are the result of updelay and
> downdelay).  These states should not be "committed" in the sense of
> causing notifications or doing actions that require RTNL.
>
> My recollection is that the "invalid new link" messages were the
> result of a bug in de77ecd4ef02, which was fixed in 1899bb325149
> ("bonding: fix state transition issue in link monitoring"), but maybe
> the RTNL problem here induces that in some other fashion.
>
> Either way, I believe this message is correct as-is.

For reference, with 5.10.1 and this script:

#!/bin/sh

slave1=ens4f0
slave2=ens4f1

modprobe -rv bonding
modprobe -v bonding mode=2 miimon=100 updelay=200
ip link set bond0 up
ifenslave bond0 $slave1 $slave2
sleep 5

while :
do
ip link set $slave1 down
sleep 1
ip link set $slave1 up
sleep 1
done

I get this repeating log output:

[ 9488.262291] sfc :05:00.0 ens4f0: link up at 1Mbps
full-duplex (MTU 1500)
[ 9488.339508] bond0: (slave ens4f0): link status up, enabling it in 200 ms
[ 9488.339511] bond0: (slave ens4f0): invalid new link 3 on slave
[ 9488.547643] bond0: (slave ens4f0): link status definitely up, 1
Mbps full duplex
[ 9489.276614] bond0: (slave ens4f0): link status definitely down,
disabling slave
[ 9490.273830] sfc :05:00.0 ens4f0: link up at 1Mbps
full-duplex (MTU 1500)
[ 9490.315540] bond0: (slave ens4f0): link status up, enabling it in 200 ms
[ 9490.315543] bond0: (slave ens4f0): invalid new link 3 on slave
[ 9490.523641] bond0: (slave ens4f0): link status definitely up, 1
Mbps full duplex
[ 9491.356526] bond0: (slave ens4f0): link status definitely down,
disabling slave
[ 9492.285249] sfc :05:00.0 ens4f0: link up at 1Mbps
full-duplex (MTU 1500)
[ 9492.291522] bond0: (slave ens4f0): link status up, enabling it in 200 ms
[ 9492.291523] bond0: (slave ens4f0): invalid new link 3 on slave
[ 9492.499604] bond0: (slave ens4f0): link status definitely up, 1
Mbps full duplex
[ 9493.331594] bond0: (slave ens4f0): link status definitely down,
disabling slave

"invalid new link 3 on slave" is there every single time.

Side note: I'm not actually able to reproduce the repeating "link
status up, enabling it in 200 ms" and never recovering from a downed
link on this host, no clue why it's so reproducible w/another system.

-- 
Jarod Wilson
ja...@redhat.com



[PATCH net v2] bonding: fix feature flag setting at init time

2020-12-02 Thread Jarod Wilson
Don't try to adjust XFRM support flags if the bond device isn't yet
registered. Bad things can currently happen when netdev_change_features()
is called without having wanted_features fully filled in yet. Basically,
this code was racing against register_netdevice() filling in
wanted_features, and when it got there first, the empty wanted_features
led to features also getting emptied out, which was definitely not the
intended behavior, so prevent that from happening.

Originally, I'd hoped to stop adjusting wanted_features at all in the
bonding driver, as it's documented as being something only the network
core should touch, but we actually do need to do this to properly update
both the features and wanted_features fields when changing the bond type,
or we get to a situation where ethtool sees:

esp-hw-offload: off [requested on]

I do think we should be using netdev_update_features instead of
netdev_change_features here though, so we only send notifiers when the
features actually changed.

v2: rework based on further testing and suggestions from ivecera

Fixes: a3b658cfb664 ("bonding: allow xfrm offload setup post-module-load")
Reported-by: Ivan Vecera 
Suggested-by: Ivan Vecera 
Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: Thomas Davis 
Cc: net...@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/bonding/bond_main.c| 10 --
 drivers/net/bonding/bond_options.c |  6 +-
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e0880a3840d7..5fe5232cc3f3 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4746,15 +4746,13 @@ void bond_setup(struct net_device *bond_dev)
NETIF_F_HW_VLAN_CTAG_FILTER;
 
bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
-#ifdef CONFIG_XFRM_OFFLOAD
-   bond_dev->hw_features |= BOND_XFRM_FEATURES;
-#endif /* CONFIG_XFRM_OFFLOAD */
bond_dev->features |= bond_dev->hw_features;
bond_dev->features |= NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX;
 #ifdef CONFIG_XFRM_OFFLOAD
-   /* Disable XFRM features if this isn't an active-backup config */
-   if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)
-   bond_dev->features &= ~BOND_XFRM_FEATURES;
+   bond_dev->hw_features |= BOND_XFRM_FEATURES;
+   /* Only enable XFRM features if this is an active-backup config */
+   if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP)
+   bond_dev->features |= BOND_XFRM_FEATURES;
 #endif /* CONFIG_XFRM_OFFLOAD */
 }
 
diff --git a/drivers/net/bonding/bond_options.c 
b/drivers/net/bonding/bond_options.c
index 9abfaae1c6f7..19205cfac751 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -768,11 +768,15 @@ static int bond_option_mode_set(struct bonding *bond,
bond->params.tlb_dynamic_lb = 1;
 
 #ifdef CONFIG_XFRM_OFFLOAD
+   if (bond->dev->reg_state != NETREG_REGISTERED)
+   goto noreg;
+
if (newval->value == BOND_MODE_ACTIVEBACKUP)
bond->dev->wanted_features |= BOND_XFRM_FEATURES;
else
bond->dev->wanted_features &= ~BOND_XFRM_FEATURES;
-   netdev_change_features(bond->dev);
+   netdev_update_features(bond->dev);
+noreg:
 #endif /* CONFIG_XFRM_OFFLOAD */
 
/* don't cache arp_validate between modes */
-- 
2.28.0



Re: [PATCH net v2] bonding: fix feature flag setting at init time

2020-12-02 Thread Jarod Wilson
On Wed, Dec 2, 2020 at 12:53 PM Jakub Kicinski  wrote:
>
> On Wed,  2 Dec 2020 12:30:53 -0500 Jarod Wilson wrote:
> > + if (bond->dev->reg_state != NETREG_REGISTERED)
> > + goto noreg;
> > +
> >   if (newval->value == BOND_MODE_ACTIVEBACKUP)
> >   bond->dev->wanted_features |= BOND_XFRM_FEATURES;
> >   else
> >   bond->dev->wanted_features &= ~BOND_XFRM_FEATURES;
> > - netdev_change_features(bond->dev);
> > + netdev_update_features(bond->dev);
> > +noreg:
>
> Why the goto?

Seemed cleaner to prevent an extra level of indentation of the code
following the goto and before the label, but I'm not that attached to
it if it's not wanted for coding style reasons.

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net v2] bonding: fix feature flag setting at init time

2020-12-02 Thread Jarod Wilson
On Wed, Dec 2, 2020 at 12:55 PM Jay Vosburgh  wrote:
>
> Jarod Wilson  wrote:
>
> >Don't try to adjust XFRM support flags if the bond device isn't yet
> >registered. Bad things can currently happen when netdev_change_features()
> >is called without having wanted_features fully filled in yet. Basically,
> >this code was racing against register_netdevice() filling in
> >wanted_features, and when it got there first, the empty wanted_features
> >led to features also getting emptied out, which was definitely not the
> >intended behavior, so prevent that from happening.
>
> Is this an actual race?  Reading Ivan's prior message, it sounds
> like it's an ordering problem (in that bond_newlink calls
> register_netdevice after bond_changelink).

Sorry, yeah, this is not actually a race condition, just an ordering
issue, bond_check_params() gets called at init time, which leads to
bond_option_mode_set() being called, and does so prior to
bond_create() running, which is where we actually call
register_netdevice().

> The change to bond_option_mode_set tests against reg_state, so
> presumably it wants to skip the first(?) time through, before the
> register_netdevice call; is that right?

Correct. Later on, when the bonding driver is already loaded, and
parameter changes are made, bond_option_mode_set() gets called and if
the mode changes to or from active-backup, we do need/want this code
to run to update wanted and features flags properly.


-- 
Jarod Wilson
ja...@redhat.com



  1   2   3   4   5   >