Re: [PATCH 2/2] mmc: Add mmc_force_detect_change_begin / _end functions
Hi Ulf,

On Wed, Aug 30, 2017 at 03:43:49PM +0200, Ulf Hansson wrote:
> On 30 August 2017 at 14:44, Hans de Goede wrote:
> > Hi,
> >
> > On 21-07-17 16:35, Quentin Schulz wrote:
> >>
> >> From: Hans de Goede
> >>
> >> Some sdio devices have a multiple stage bring-up process. Specifically
> >> the esp8089 (for which an out of tree driver is available) loads firmware
> >> on the first call to its sdio-drivers' probe function and then resets
> >> the device causing it to reboot from its RAM with the new firmware.
> >>
> >> When this sdio device reboots it comes back up in 1 bit 400 KHz mode
> >> again, and we need to walk through the whole ios negotiation and sdio
> >> setup again.
> >>
> >> There are 2 problems with this:
> >>
> >> 1) Typically these devices are soldered onto some (ARM) tablet / SBC
> >> PCB and as such are described in devicetree as "non-removable", which
> >> causes the mmc-core to scan them only once and not poll for the device
> >> dropping off the bus. Normally this is the right thing to do but in the
> >> esp8089 example we need the mmc-core to notice the module has disconnected
> >> (since it is now in 1 bit mode again it will not talk to the host in 4 bit
> >> mode). This can be worked around by using "broken-cd" in devicetree
> >> instead of "non-removable", but that is not a proper fix since the device
> >> really is non-removable.
> >>
> >> 2) When the mmc-core detects the device has disconnected it will poweroff
> >> the device, causing the RAM loaded firmware to be lost. This can be worked
> >> around in devicetree by using regulator-always-on (and avoiding the use of
> >> mmc-pwrseq), but again that is more of a hack than a proper fix.
> >>
> >> This commit fixes 1) by adding a mmc_force_detect_change function which
> >> will cause scanning for device removal / insertion until a new device is
> >> detected. 2) is fixed by a keep_power flag to the mmc_force_detect_change
> >> function which when set causes the mmc-core to keep the power to the
> >> device on during the rescan.
> >>
> >> Cc: Icenowy Zheng
> >> Cc: Maxime Ripard
> >> Cc: Chen-Yu Tsai
> >> Signed-off-by: Hans de Goede
> >
> >
> > So when I posted this patch quite a while back, there was some discussion
> > about this and a consensus that this is not the right solution.
> >
> > So first of all let's describe the problem:
> >
> > The esp8089 sdio wifi chip is really an ARM core with a wifi phy
> > connected to it (as many wifi chipsets are).
> >
> > But this one comes up in some really generic sdio capable boot-loader
> > mode and we need to feed it firmware and then reboot it into the
> > new firmware.
> >
> > The reboot is where the problem happens. It seems to fall back
> > from the negotiated 4 wire sdio mode to single wire spi mode then.
> >
> > The out of tree version of the driver deals with this by not setting
> > the non-removable flag as well as setting the broken_cd flag so that
> > the mmc core polls the device; after the reboot the poll fails
> > because the mmc-controller and the esp8089 are using a different
> > number of wires so the mmc-cmd the poll uses times out.
> >
> > After which the esp8089 driver's remove function gets called, and
> > the mmc stack re-discovers the esp8089 by restarting the whole
> > number of wires (and speed) used negotiation. After which the
> > esp8089 driver's probe function gets called (again) and on
> > firmware loading it has set a global flag, so now it actually
> > acts as a wifi driver rather than trying to load the firmware
> > a second time.
> >
> > Since I did not want to rely on broken_cd polling I came up
> > with the hack which is this patch.
> >
> > So when this patch was first discussed we came to the conclusion
> > that what we really need is some sort of mmc_reprobe_device
> > function which the driver can call from probe which will
> > redo the number of wires (and speed) used negotiation,
> > while keeping the sdio_function device as is so that probe can
> > simply continue after this and we also don't need the ugly
> > global flag.
> >
> > The idea would be for this function to be some wrapper
> > around mmc_init_card() which resets the ios settings as is
> > normally done on remove and then calls mmc_init_card()
> > passing in the existing card the same way as is done
> > on resume, so that the existing card / sdio_function
> > devices get reused.
> >
> > IIRC Ulf would look into writing this mmc_reprobe_device
> > function and then I would test it with the esp8089, but
> > Ulf never got around to writing the function and I ended
> > up working on other things too.
>
> Thanks for the summary!
>
> Just to let you know, I haven't forgotten about this problem. I am
> planning for a major update of the SDIO for power management support,
> within a not too far future.
> The issue described above is then also one of the things I plan
> to look into.
>

I'd like to know if any progress has been made on that problem (I may have missed patche
[PATCH] staging: fsl-dpaa2/eth: Defer probing if no MC portal available
MC portals may not be available at the initial probing attempt
due to dependencies on other modules.

Check the return value of the MC portal allocation function and
defer probing in case it's not available yet. For all other error
cases the behaviour stays the same.

Signed-off-by: Ioana Radulescu
Suggested-by: Nipun Gupta
---
 drivers/staging/fsl-dpaa2/ethernet/dpaa2-eth.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/fsl-dpaa2/ethernet/dpaa2-eth.c b/drivers/staging/fsl-dpaa2/ethernet/dpaa2-eth.c
index 2817e67..e4c2804 100644
--- a/drivers/staging/fsl-dpaa2/ethernet/dpaa2-eth.c
+++ b/drivers/staging/fsl-dpaa2/ethernet/dpaa2-eth.c
@@ -2440,7 +2440,10 @@ static int dpaa2_eth_probe(struct fsl_mc_device *dpni_dev)
 	err = fsl_mc_portal_allocate(dpni_dev, FSL_MC_IO_ATOMIC_CONTEXT_PORTAL,
 				     &priv->mc_io);
 	if (err) {
-		dev_err(dev, "MC portal allocation failed\n");
+		if (err == -ENXIO)
+			err = -EPROBE_DEFER;
+		else
+			dev_err(dev, "MC portal allocation failed\n");
 		goto err_portal_alloc;
 	}
 
-- 
2.7.4

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
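The error handling in the hunk above can be sketched outside the kernel as below. This is only an illustration under stated assumptions: `EPROBE_DEFER` is defined locally because the kernel-internal value (517) is not part of userspace `errno.h`, and `map_portal_alloc_error()`, `should_log_portal_error()` and `check_portal_error_mapping()` are hypothetical names, not functions from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Kernel-internal errno; defined here only so the sketch builds in userspace. */
#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517
#endif

/*
 * Hypothetical helper mirroring the patched branch in dpaa2_eth_probe():
 * -ENXIO from the portal allocator is treated as "no portal yet", so the
 * probe is deferred; every other value is passed through unchanged.
 */
static int map_portal_alloc_error(int err)
{
	if (err == -ENXIO)
		return -EPROBE_DEFER;
	return err;
}

/* The patch keeps the dev_err() only for errors that are not deferred. */
static bool should_log_portal_error(int err)
{
	return err != 0 && err != -ENXIO;
}

/* Small self-check that doubles as a usage example. */
static void check_portal_error_mapping(void)
{
	assert(map_portal_alloc_error(-ENXIO) == -EPROBE_DEFER);
	assert(map_portal_alloc_error(-ENOMEM) == -ENOMEM);
	assert(!should_log_portal_error(-ENXIO));
	assert(should_log_portal_error(-ENOMEM));
}
```

Staying silent on the deferral path presumably avoids an error message for a probe attempt that is expected to be retried once the missing dependency shows up.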
Re: [PATCH 2/2] mmc: Add mmc_force_detect_change_begin / _end functions
On 8 February 2018 at 15:59, Quentin Schulz wrote:
> Hi Ulf,
>
> On Wed, Aug 30, 2017 at 03:43:49PM +0200, Ulf Hansson wrote:
>> On 30 August 2017 at 14:44, Hans de Goede wrote:
>> > Hi,
>> >
>> > On 21-07-17 16:35, Quentin Schulz wrote:
>> >>
>> >> From: Hans de Goede
>> >>
>> >> Some sdio devices have a multiple stage bring-up process. Specifically
>> >> the esp8089 (for which an out of tree driver is available) loads firmware
>> >> on the first call to its sdio-drivers' probe function and then resets
>> >> the device causing it to reboot from its RAM with the new firmware.
>> >>
>> >> When this sdio device reboots it comes back up in 1 bit 400 KHz mode
>> >> again, and we need to walk through the whole ios negotiation and sdio
>> >> setup again.
>> >>
>> >> There are 2 problems with this:
>> >>
>> >> 1) Typically these devices are soldered onto some (ARM) tablet / SBC
>> >> PCB and as such are described in devicetree as "non-removable", which
>> >> causes the mmc-core to scan them only once and not poll for the device
>> >> dropping off the bus. Normally this is the right thing to do but in the
>> >> esp8089 example we need the mmc-core to notice the module has disconnected
>> >> (since it is now in 1 bit mode again it will not talk to the host in 4 bit
>> >> mode). This can be worked around by using "broken-cd" in devicetree
>> >> instead of "non-removable", but that is not a proper fix since the device
>> >> really is non-removable.
>> >>
>> >> 2) When the mmc-core detects the device has disconnected it will poweroff
>> >> the device, causing the RAM loaded firmware to be lost. This can be worked
>> >> around in devicetree by using regulator-always-on (and avoiding the use of
>> >> mmc-pwrseq), but again that is more of a hack than a proper fix.
>> >>
>> >> This commit fixes 1) by adding a mmc_force_detect_change function which
>> >> will cause scanning for device removal / insertion until a new device is
>> >> detected. 2) is fixed by a keep_power flag to the mmc_force_detect_change
>> >> function which when set causes the mmc-core to keep the power to the
>> >> device on during the rescan.
>> >>
>> >> Cc: Icenowy Zheng
>> >> Cc: Maxime Ripard
>> >> Cc: Chen-Yu Tsai
>> >> Signed-off-by: Hans de Goede
>> >
>> >
>> > So when I posted this patch quite a while back, there was some discussion
>> > about this and a consensus that this is not the right solution.
>> >
>> > So first of all let's describe the problem:
>> >
>> > The esp8089 sdio wifi chip is really an ARM core with a wifi phy
>> > connected to it (as many wifi chipsets are).
>> >
>> > But this one comes up in some really generic sdio capable boot-loader
>> > mode and we need to feed it firmware and then reboot it into the
>> > new firmware.
>> >
>> > The reboot is where the problem happens. It seems to fall back
>> > from the negotiated 4 wire sdio mode to single wire spi mode then.
>> >
>> > The out of tree version of the driver deals with this by not setting
>> > the non-removable flag as well as setting the broken_cd flag so that
>> > the mmc core polls the device; after the reboot the poll fails
>> > because the mmc-controller and the esp8089 are using a different
>> > number of wires so the mmc-cmd the poll uses times out.
>> >
>> > After which the esp8089 driver's remove function gets called, and
>> > the mmc stack re-discovers the esp8089 by restarting the whole
>> > number of wires (and speed) used negotiation. After which the
>> > esp8089 driver's probe function gets called (again) and on
>> > firmware loading it has set a global flag, so now it actually
>> > acts as a wifi driver rather than trying to load the firmware
>> > a second time.
>> >
>> > Since I did not want to rely on broken_cd polling I came up
>> > with the hack which is this patch.
>> >
>> > So when this patch was first discussed we came to the conclusion
>> > that what we really need is some sort of mmc_reprobe_device
>> > function which the driver can call from probe which will
>> > redo the number of wires (and speed) used negotiation,
>> > while keeping the sdio_function device as is so that probe can
>> > simply continue after this and we also don't need the ugly
>> > global flag.
>> >
>> > The idea would be for this function to be some wrapper
>> > around mmc_init_card() which resets the ios settings as is
>> > normally done on remove and then calls mmc_init_card()
>> > passing in the existing card the same way as is done
>> > on resume, so that the existing card / sdio_function
>> > devices get reused.
>> >
>> > IIRC Ulf would look into writing this mmc_reprobe_device
>> > function and then I would test it with the esp8089, but
>> > Ulf never got around to writing the function and I ended
>> > up working on other things too.
>>
>> Thanks for the summary!
>>
>> Just to let you know, I haven't forgotten about this problem. I am
>> planning for a major update of the SDIO for power management support,
>> within a not too far future.
>> The issue describ
[PATCH] scsi: storvsc: missing error code in storvsc_probe()
From: Long Li

This patch backports upstream commit
ca8dc694045e9aa248e9916e0f614deb0494cb3d for 4.14-stable.

commit ca8dc694045e9aa248e9916e0f614deb0494cb3d:

We should set the error code if fc_remote_port_add() fails.

Cc: #v4.12+
Fixes: daf0cd445a21 ("scsi: storvsc: Add support for FC rport.")
Signed-off-by: Dan Carpenter
Reviewed-by: Cathy Avery
Acked-by: K. Y. Srinivasan
Signed-off-by: Martin K. Petersen
Signed-off-by: Long Li
---
 drivers/scsi/storvsc_drv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 5e7200f..c17ccb9 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1826,8 +1826,10 @@ static int storvsc_probe(struct hv_device *device,
 		fc_host_node_name(host) = stor_device->node_name;
 		fc_host_port_name(host) = stor_device->port_name;
 		stor_device->rport = fc_remote_port_add(host, 0, &ids);
-		if (!stor_device->rport)
+		if (!stor_device->rport) {
+			ret = -ENOMEM;
 			goto err_out3;
+		}
 	}
 #endif
 	return 0;
-- 
2.7.4
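The class of bug this patch addresses can be reproduced in a few lines of plain C. The sketch below is not the storvsc code: `fake_port_add()`, `probe_buggy()`, `probe_fixed()` and `check_probe_error_code()` are hypothetical stand-ins, and it assumes that `ret` still holds 0 from earlier successful steps when the failing call is reached, which is what the commit message implies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for fc_remote_port_add(): returns NULL on failure. */
static void *fake_port_add(int should_fail)
{
	static int port;

	return should_fail ? NULL : &port;
}

/* Before the fix: the error path is taken but ret is never updated. */
static int probe_buggy(int should_fail)
{
	int ret = 0;	/* assumed: earlier setup steps succeeded */

	if (!fake_port_add(should_fail))
		goto err_out;
	return 0;

err_out:
	/* cleanup would run here */
	return ret;	/* reports success although probing failed */
}

/* After the fix: the error code is set before jumping to the error label. */
static int probe_fixed(int should_fail)
{
	int ret = 0;

	if (!fake_port_add(should_fail)) {
		ret = -ENOMEM;
		goto err_out;
	}
	return 0;

err_out:
	/* cleanup would run here */
	return ret;
}

/* Small self-check that doubles as a usage example. */
static void check_probe_error_code(void)
{
	assert(probe_buggy(1) == 0);		/* failure is silently lost */
	assert(probe_fixed(1) == -ENOMEM);	/* failure reaches the caller */
	assert(probe_fixed(0) == 0);
}
```

In the buggy variant the caller sees a successful probe even though the cleanup path ran, so later code may operate on resources that have already been torn down.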
Re: [PATCH 41/80] staging: lustre: lmv: separate master object with master stripe
On Tue, Aug 16 2016, James Simmons wrote:

>
> +static inline bool
> +lsm_md_eq(const struct lmv_stripe_md *lsm1, const struct lmv_stripe_md *lsm2)
> +{
> +	int idx;
> +
> +	if (lsm1->lsm_md_magic != lsm2->lsm_md_magic ||
> +	    lsm1->lsm_md_stripe_count != lsm2->lsm_md_stripe_count ||
> +	    lsm1->lsm_md_master_mdt_index != lsm2->lsm_md_master_mdt_index ||
> +	    lsm1->lsm_md_hash_type != lsm2->lsm_md_hash_type ||
> +	    lsm1->lsm_md_layout_version != lsm2->lsm_md_layout_version ||
> +	    !strcmp(lsm1->lsm_md_pool_name, lsm2->lsm_md_pool_name))
> +		return false;

Hi James and all,
 This patch (8f18c8a48b736c2f in linux) is different from the
 corresponding patch in lustre-release (60e07b972114df).

In that patch, the last clause in the 'if' condition is

+	    strcmp(lsm1->lsm_md_pool_name,
+		   lsm2->lsm_md_pool_name) != 0)

Whoever converted it to "!strcmp()" inverted the condition. This is a
perfect example of why I absolutely *loathe* the "!strcmp()" construct!!

This causes many tests in the 'sanity' test suite to return
-ENOMEM (that had me puzzled for a while!!).
This seems to suggest that no-one has been testing the mainline linux
lustre.
It also seems to suggest that there is a good chance that there
are other bugs that have crept in while no-one has really been caring.
Given that the sanity test suite doesn't complete for me, but just
hangs (in test_27z I think), that seems particularly likely.

So my real question - to anyone interested in lustre for mainline linux
- is: can we actually trust this code at all?
I'm seriously tempted to suggest that we just
  rm -r drivers/staging/lustre

drivers/staging is great for letting the community work on code that has
been "thrown over the wall" and is not openly developed elsewhere, but
that is not the case for lustre. lustre has (or seems to have) an open
development process. Having on-going development happen both there and
in drivers/staging seems a waste of resources.

Might it make sense to instead start cleaning up the code in
lustre-release so as to make it meet the upstream kernel standards.
Then when the time is right, the kernel code can be moved *out* of
lustre-release and *in* to linux. Then development can continue in
Linux (just like it does with other Linux filesystems).

An added bonus of this is that there is an obvious path to getting
server support in mainline Linux. The current situation of client-only
support seems weird given how interdependent the two are.

What do others think? Is there any chance that the current lustre in
Linux will ever be more than a poor second-cousin to the external
lustre-release. If there isn't, should we just discard it now and move
on?

Thanks,
NeilBrown
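The inversion described in this message can be shown with a reduced example. This is a sketch, not the lustre code: `struct stripe_md` is a hypothetical two-field stand-in for `struct lmv_stripe_md`, and the function names are made up for illustration. The only fact relied on is that `strcmp()` returns 0 when the two strings are equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Reduced, hypothetical stand-in for struct lmv_stripe_md. */
struct stripe_md {
	unsigned int magic;
	char pool_name[16];
};

/* Form used in lustre-release: differing pool names mean "not equal". */
static bool md_eq_correct(const struct stripe_md *a, const struct stripe_md *b)
{
	if (a->magic != b->magic ||
	    strcmp(a->pool_name, b->pool_name) != 0)
		return false;
	return true;
}

/* Form that landed in staging: !strcmp() is true when the names MATCH. */
static bool md_eq_inverted(const struct stripe_md *a, const struct stripe_md *b)
{
	if (a->magic != b->magic ||
	    !strcmp(a->pool_name, b->pool_name))
		return false;
	return true;
}

/* Small self-check that doubles as a usage example. */
static void check_pool_name_compare(void)
{
	struct stripe_md x = { .magic = 1, .pool_name = "pool_a" };
	struct stripe_md y = { .magic = 1, .pool_name = "pool_a" };
	struct stripe_md z = { .magic = 1, .pool_name = "pool_b" };

	assert(md_eq_correct(&x, &y));		/* identical: equal */
	assert(!md_eq_correct(&x, &z));		/* names differ: not equal */
	assert(!md_eq_inverted(&x, &y));	/* identical, yet "not equal" */
	assert(md_eq_inverted(&x, &z));		/* names differ, yet "equal" */
}
```

With the inverted form, two descriptors that are identical in every field compare as different, and descriptors that differ only in pool name compare as equal.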
Re: [PATCH 41/80] staging: lustre: lmv: separate master object with master stripe
> On Feb 8, 2018, at 8:39 PM, NeilBrown wrote:
>
> On Tue, Aug 16 2016, James Simmons wrote:

my that’s an old patch

>
>>
>> +static inline bool
>> +lsm_md_eq(const struct lmv_stripe_md *lsm1, const struct lmv_stripe_md *lsm2)
>> +{
>> +	int idx;
>> +
>> +	if (lsm1->lsm_md_magic != lsm2->lsm_md_magic ||
>> +	    lsm1->lsm_md_stripe_count != lsm2->lsm_md_stripe_count ||
>> +	    lsm1->lsm_md_master_mdt_index != lsm2->lsm_md_master_mdt_index ||
>> +	    lsm1->lsm_md_hash_type != lsm2->lsm_md_hash_type ||
>> +	    lsm1->lsm_md_layout_version != lsm2->lsm_md_layout_version ||
>> +	    !strcmp(lsm1->lsm_md_pool_name, lsm2->lsm_md_pool_name))
>> +		return false;
>
> Hi James and all,
> This patch (8f18c8a48b736c2f in linux) is different from the
> corresponding patch in lustre-release (60e07b972114df).
>
> In that patch, the last clause in the 'if' condition is
>
> +	    strcmp(lsm1->lsm_md_pool_name,
> +		   lsm2->lsm_md_pool_name) != 0)
>
> Whoever converted it to "!strcmp()" inverted the condition. This is a
> perfect example of why I absolutely *loathe* the "!strcmp()" construct!!
>
> This causes many tests in the 'sanity' test suite to return
> -ENOMEM (that had me puzzled for a while!!).

huh? I am not seeing anything of the sort and I was running sanity
all the time until a recent pause (but going to resume).

> This seems to suggest that no-one has been testing the mainline linux
> lustre.
> It also seems to suggest that there is a good chance that there
> are other bugs that have crept in while no-one has really been caring.
> Given that the sanity test suite doesn't complete for me, but just
> hangs (in test_27z I think), that seems particularly likely.

Works for me, here’s a run from earlier today on 4.15.0:

== sanity test 27z: check SEQ/OID on the MDT and OST filesystems = 16:43:58 (1518126238)
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0169548 s, 61.8 MB/s
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.02782 s, 75.4 MB/s
check file /mnt/lustre/d27z.sanity/f27z.sanity-1
FID seq 0x20401, oid 0x4640 ver 0x0
LOV seq 0x20401, oid 0x4640, count: 1
want: stripe:0 ost:0 oid:314/0x13a seq:0
Stopping /mnt/lustre-ost1 (opts:) on centos6-17
pdsh@fedora1: centos6-17: ssh exited with exit code 1
pdsh@fedora1: centos6-17: ssh exited with exit code 1
pdsh@fedora1: centos6-17: ssh exited with exit code 1
Starting ost1: -o loop /tmp/lustre-ost1 /mnt/lustre-ost1
Failed to initialize ZFS library: 256
h2tcp: deprecated, use h2nettype instead
centos6-17.localnet: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super all -lnet -lnd -pinger 16
pdsh@fedora1: centos6-17: ssh exited with exit code 1
pdsh@fedora1: centos6-17: ssh exited with exit code 1
Started lustre-OST
/mnt/lustre-ost1/O/0/d26/314: parent=[0x20401:0x4640:0x0] stripe=0 stripe_size=0 stripe_count=0
check file /mnt/lustre/d27z.sanity/f27z.sanity-2
FID seq 0x20401, oid 0x4642 ver 0x0
LOV seq 0x20401, oid 0x4642, count: 2
want: stripe:0 ost:1 oid:1187/0x4a3 seq:0
Stopping /mnt/lustre-ost2 (opts:) on centos6-17
pdsh@fedora1: centos6-17: ssh exited with exit code 1
pdsh@fedora1: centos6-17: ssh exited with exit code 1
pdsh@fedora1: centos6-17: ssh exited with exit code 1
Starting ost2: -o loop /tmp/lustre-ost2 /mnt/lustre-ost2
Failed to initialize ZFS library: 256
h2tcp: deprecated, use h2nettype instead
centos6-17.localnet: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super all -lnet -lnd -pinger 16
pdsh@fedora1: centos6-17: ssh exited with exit code 1
pdsh@fedora1: centos6-17: ssh exited with exit code 1
Started lustre-OST0001
/mnt/lustre-ost2/O/0/d3/1187: parent=[0x20401:0x4642:0x0] stripe=0 stripe_size=0 stripe_count=0
want: stripe:1 ost:0 oid:315/0x13b seq:0
got: objid=0 seq=0 parent=[0x20401:0x4642:0x0] stripe=1
Resetting fail_loc on all nodes...done.
16:44:32 (1518126272) waiting for centos6-16 network 5 secs ...
16:44:32 (1518126272) network interface is UP
16:44:33 (1518126273) waiting for centos6-17 network 5 secs ...
16:44:33 (1518126273) network interface is UP

> So my real question - to anyone interested in lustre for mainline linux
> - is: can we actually trust this code at all?

Absolutely. Seems that you just stumbled upon a corner case that was not
being hit by people that do the testing, so you have something unique about
your setup, I guess.

> I'm seriously tempted to suggest that we just
> rm -r drivers/staging/lustre
>
> drivers/staging is great for letting the community work on code that has
> been "thrown over the wall" and is not openly developed elsewhere, but
> that is not the case for lustre. lustre has (or seems to have) an open
> development process. Having on-going development happen both there and
> in drivers/staging seems a waste of resources.

It is a bit of
Re: [PATCH 41/80] staging: lustre: lmv: separate master object with master stripe
On Thu, Feb 08 2018, Oleg Drokin wrote:

>> On Feb 8, 2018, at 8:39 PM, NeilBrown wrote:
>>
>> On Tue, Aug 16 2016, James Simmons wrote:
>
> my that’s an old patch
>
>>
...
>>
>> Whoever converted it to "!strcmp()" inverted the condition. This is a
>> perfect example of why I absolutely *loathe* the "!strcmp()" construct!!
>>
>> This causes many tests in the 'sanity' test suite to return
>> -ENOMEM (that had me puzzled for a while!!).
>
> huh? I am not seeing anything of the sort and I was running sanity
> all the time until a recent pause (but going to resume).

That does surprise me - I reproduce it every time.
I have two VMs running a SLE12-SP2 kernel with patches from
lustre-release applied. These are servers. They have 2 3G virtual disks
each.
I have two other VMs running current mainline. These are clients.

I guess your 'recent pause' included the period between v4.15-rc1
(8e55b6fd0660) and v4.15-rc6 (a93639090a27) - a full month when lustre
wouldn't work at all :-(

>
>> This seems to suggest that no-one has been testing the mainline linux
>> lustre.
>> It also seems to suggest that there is a good chance that there
>> are other bugs that have crept in while no-one has really been caring.
>> Given that the sanity test suite doesn't complete for me, but just
>> hangs (in test_27z I think), that seems particularly likely.
>
> Works for me, here’s a run from earlier today on 4.15.0:

Well that's encouraging .. I haven't looked into this one yet - I'm not
even sure where to start.

>
>> So my real question - to anyone interested in lustre for mainline linux
>> - is: can we actually trust this code at all?
>
> Absolutely. Seems that you just stumbled upon a corner case that was not
> being hit by people that do the testing, so you have something unique about
> your setup, I guess.
>
>> I'm seriously tempted to suggest that we just
>> rm -r drivers/staging/lustre
>>
>> drivers/staging is great for letting the community work on code that has
>> been "thrown over the wall" and is not openly developed elsewhere, but
>> that is not the case for lustre. lustre has (or seems to have) an open
>> development process. Having on-going development happen both there and
>> in drivers/staging seems a waste of resources.
>
> It is a bit of a waste of resources, but there are some other things here.
> E.g. we cannot have any APIs with no users in the kernel.
> Also some people like to have in-kernel modules coming with their distros
> (there were some users that used staging client on ubuntu as their
> setup).
>
> Instead the plan was to clean up the staging client into acceptable state,
> move it out of staging, bring in all the missing features and then
> drop the client (more or less) from the lustre-release.

That sounds like a great plan. Any idea why it didn't happen?
It seems there is a lot of upstream work mixed in with the clean up, and
I don't think that really helps anyone.
Is it at all realistic that the client might be removed from
lustre-release? That might be a good goal to work towards.

>
>> Might it make sense to instead start cleaning up the code in
>> lustre-release so as to make it meet the upstream kernel standards.
>> Then when the time is right, the kernel code can be moved *out* of
>> lustre-release and *in* to linux. Then development can continue in
>> Linux (just like it does with other Linux filesystems).
>
> While we can be cleaning lustre in lustre-release, there are some things
> we cannot do as easily, e.g. decoupling Lustre client from the server.
> Also it would not attract any reviews from all the janitor or
> (more importantly) Al Viro and other people with a sharp eyes.
>
>> An added bonus of this is that there is an obvious path to getting
>> server support in mainline Linux. The current situation of client-only
>> support seems weird given how interdependent the two are.
>
> Given the pushback Lustre client was given I have no hope Lustre server
> will get into mainline in my lifetime.

Even if it is horrible it would be nice to have it in staging... I guess
the changes required to ext4 prohibit that... I don't suppose it can be
made to work with mainline ext4 in a reduced-functionality-and-performance
way??
I think it would be a lot easier to motivate forward progress if there
were a credible end goal of everything being in mainline.

>
>> What do others think? Is there any chance that the current lustre in
>> Linux will ever be more than a poor second-cousin to the external
>> lustre-release. If there isn't, should we just discard it now and move
>> on?
>
>
> I think many useful cleanups and fixes came from the staging tree at
> the very least.
> The biggest problem with it all is that we are in staging tree so
> we cannot bring it to parity much. And we are in staging tree because
> there’s a whole bunch of “cleanups” requested that take a lot of effort
> (in both implementing them and then in finding other ways of achieving
> things that were done in old ways before)
Re: [lustre-devel] [PATCH 41/80] staging: lustre: lmv: separate master object with master stripe
> On Feb 8, 2018, at 10:10 PM, NeilBrown wrote:
>
> On Thu, Feb 08 2018, Oleg Drokin wrote:
>
>>> On Feb 8, 2018, at 8:39 PM, NeilBrown wrote:
>>>
>>> On Tue, Aug 16 2016, James Simmons wrote:
>>
>> my that’s an old patch
>>
>>>
> ...
>>>
>>> Whoever converted it to "!strcmp()" inverted the condition. This is a
>>> perfect example of why I absolutely *loathe* the "!strcmp()" construct!!
>>>
>>> This causes many tests in the 'sanity' test suite to return
>>> -ENOMEM (that had me puzzled for a while!!).
>>
>> huh? I am not seeing anything of the sort and I was running sanity
>> all the time until a recent pause (but going to resume).
>
> That does surprised me - I reproduce it every time.
> I have two VMs running a SLE12-SP2 kernel with patches from
> lustre-release applied. These are servers. They have 2 3G virtual disks
> each.
> I have two over VMs running current mainline. These are clients.
>
> I guess your 'recent pause' included between v4.15-rc1 (8e55b6fd0660)
> and v4.15-rc6 (a93639090a27) - a full month when lustre wouldn't work at
> all :-(

More than that, but I am pretty sure James Simmons is running tests all the time too
(he has a different config, I only have tcp).

>>> This seems to suggest that no-one has been testing the mainline linux
>>> lustre.
>>> It also seems to suggest that there is a good chance that there
>>> are other bugs that have crept in while no-one has really been caring.
>>> Given that the sanity test suite doesn't complete for me, but just
>>> hangs (in test_27z I think), that seems particularly likely.
>>
>> Works for me, here’s a run from earlier today on 4.15.0:
>
> Well that's encouraging .. I haven't looked into this one yet - I'm not
> even sure where to start.

m… debug logs for example (greatly neutered in staging tree, but still useful)?
try lctl dk and see what’s in there.

>> Instead the plan was to clean up the staging client into acceptable state,
>> move it out of staging, bring in all the missing features and then
>> drop the client (more or less) from the lustre-release.
>
> That sounds like a great plan. Any idea why it didn't happen?

Because meeting open-ended demands is hard and certain demands sound like
“throw away your X and rewrite it from scratch" (e.g. everything IB-related).

Certain things that sound useless (like the debug subsystem in Lustre)
are very useful when you have 10k nodes in a cluster and need to selectively
pull stuff from a run to debug a complicated cross-node interaction.
I asked NFS people how they do it and they don’t have anything that scales;
it usually involves reducing the problem to a much smaller set of nodes first.

> It seems there is a lot of upstream work mixed in with the clean up, and
> I don't think that really helps anyone.

I don’t understand what you mean here.

> Is it at all realistic that the client might be removed from
> lustre-release? That might be a good goal to work towards.

Assuming we can bring the whole functionality over - sure.

Of course there’d still be some separate development place and we would
need to create patches (new features?) for like SuSE and other distros
and for testing of server features, I guess, but that could be just that -
a side branch somewhere I hope.

It’s not that we are super glad to chase every kernel that vendors put out;
of course it would be much easier if the kernels already included
a very functional Lustre client.

>>> Might it make sense to instead start cleaning up the code in
>>> lustre-release so as to make it meet the upstream kernel standards.
>>> Then when the time is right, the kernel code can be moved *out* of
>>> lustre-release and *in* to linux. Then development can continue in
>>> Linux (just like it does with other Linux filesystems).
>>
>> While we can be cleaning lustre in lustre-release, there are some things
>> we cannot do as easily, e.g. decoupling Lustre client from the server.
>> Also it would not attract any reviews from all the janitor or
>> (more importantly) Al Viro and other people with a sharp eyes.
>>
>>> An added bonus of this is that there is an obvious path to getting
>>> server support in mainline Linux. The current situation of client-only
>>> support seems weird given how interdependent the two are.
>>
>> Given the pushback Lustre client was given I have no hope Lustre server
>> will get into mainline in my lifetime.
>
> Even if it is horrible it would be nice to have it in staging... I guess
> the changes required to ext4 prohibit that... I don't suppose it can be
> made to work with mainline ext4 in a reduced-functionality-and-performance
> way??

We support unpatched ZFS as a server too! ;)
(and if somebody invests the time into it, there was some half-baked btrfs
backend too I think).
That said nobody here believes in any success of pushing Lustre server into
mainline.
It would just be easier to push the whole server into userspace (And there
was a project like this in the past, now abandoned because i