Re: Improving AIO cancellation
On 02/08/13 04:42, Anatol Pomozov wrote:
> Hi,
>
> At Google we have several applications that heavily use asynchronous IO. One thing our userspace developers need is effective AIO cancellation. You might say "sure, use the io_cancel syscall". Well, while it cancels AIO requests, it does so ineffectively. Currently (I am looking at linux-next) io_cancel only marks the kiocb as cancelled; the bios will still be issued to the device even after the kiocb was cancelled. Say you have a congested device and want to cancel some AIO requests - io_cancel will not make the situation any better. We would like to see more resource-effective AIO cancellation. I had a discussion with Ted Ts'o and Kent Overstreet about improving this situation and would like to share the ideas with you, the Linux community.
>
> Once direct async IO has been submitted, the request can be at one of several stages:
> 1) Sitting in the kernel request queue of a congested device
> 2) Sent to the device and sitting in the device queue (if NCQ is enabled)
> 3) Executing on the device
>
> Ideally we would be able to cancel an IO request at any of these stages, but currently we are especially interested in case #1. I do not know whether cancellation at stages #2 and #3 is possible and/or reasonable.
>
> BTW, AIO cancellation makes sense only for direct IO. Buffered AIO will end up in the page cache soon and the kiocb will be marked as completed; later (maybe much later) writeback will flush those buffers to disk, but that can no longer be cancelled.
>
> Yet another thing to remember is md/RAID. Some RAID types guarantee stripe consistency: when md splits a WRITE across disks, either all or none of the child requests should complete. If we do a partial write then the on-disk data becomes inconsistent.
>
> Ted and Kent suggested the following solution: every time we make forward progress with a request/bio, we check its status; if the user has cancelled the request we simply skip the bio. This covers case #1. A draft implementation looks like this. struct bio needs some way to query the current status of the kiocb that generated the bio, so we add a pointer to a bool flag:
>
>     struct bio {
>             bool *cancelled;
>     }
>
> In the async DIO codepath this pointer is initialized with the flag in struct kiocb:
>
>     bio->cancelled = &kiocb->cancelled;
>
> except in md: if it is RAID5 and we perform a WRITE request, we do not initialize this pointer. Whenever we make forward progress with a request/bio, we check its cancellation status:
>
>     if (bio->cancelled && *bio->cancelled)
>             goto do_not_process_bio_because_it_cancelled;
>
> So to cancel a kiocb we do
>
>     kiocb->cancelled = true;
>
> and none of the bios created from the request will be sent to the device anymore.
>
> The solution seems straightforward, but I would like to hear whether there are other ways to make AIO cancellation better. Does the suggested implementation look good? Are there better solutions? What about cancelling requests that have already been sent to the device? If the proposal is fine then I will start implementing it.

Hello Anatol,

Have you already noticed this message: http://marc.info/?l=linux-fsdevel&m=136024044202362 ?

Bart.
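A self-contained userspace sketch of the kiocb/bio linkage drafted above may make the idea easier to experiment with (all names are illustrative; this is not actual kernel code):

    #include <stdbool.h>

    /* Models the submitter's kiocb with its cancellation flag. */
    struct kiocb_sketch {
            bool cancelled;
    };

    /* Models a bio. cancelled stays NULL for bios that must not be
     * skipped, e.g. RAID5 writes whose siblings may already be on disk. */
    struct bio_sketch {
            bool *cancelled;
    };

    /* The check performed wherever the bio would make forward progress. */
    static bool bio_is_cancelled(const struct bio_sketch *bio)
    {
            return bio->cancelled && *bio->cancelled;
    }

    int main(void)
    {
            struct kiocb_sketch iocb = { .cancelled = false };
            struct bio_sketch bio = { .cancelled = &iocb.cancelled };

            iocb.cancelled = true;  /* what io_cancel() would do ... */
            /* ... after which every forward-progress point skips the bio. */
            return bio_is_cancelled(&bio) ? 0 : 1;
    }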
Re: How to online remove an error scsi disk from the system?
On 02/01/13 07:13, Tao Ma wrote:
> In our production system we have several SATA disks attached to one machine. When one of the disks fails, jbd2 (yes, we use ext4) hangs forever and we get messages like the one below in /var/log/messages. It seems to me that the I/O sent to the SCSI layer is never returned with -EIO, which surprises me a little (there should be a timeout somewhere, right?). We have tried echo "offline" > /sys/block/sdl/device/state, but it doesn't work. So is there any way for us to make the SCSI device return all outstanding I/O requests with -EIO so that all the end_io callbacks are invoked accordingly? Am I missing something here?

Please note that I'm not familiar with SAS. But I found this in drivers/scsi/scsi_proc.c:

 * proc_scsi_write - handle writes to /proc/scsi/scsi
 * @file: not used
 * @buf: buffer to write
 * @length: length of buf, at most PAGE_SIZE
 * @ppos: not used
 *
 * Description: this provides a legacy mechanism to add or remove
 * devices by Host, Channel, ID, and Lun.  To use,
 * "echo 'scsi add-single-device 0 1 2 3' > /proc/scsi/scsi" or
 * "echo 'scsi remove-single-device 0 1 2 3' > /proc/scsi/scsi" with
 * "0 1 2 3" replaced by the Host, Channel, Id, and Lun.

Bart.
Re: [PATCH 0/5] Fix bugs in ib_srp patches for H.A. purposes
On 08/31/12 20:00, dongsu.p...@profitbricks.com wrote:
> This patchset aims at fixing bugs that have been discovered in our own
> SRP test environment so far. These patches are based on your patchset v4,
> "Make ib_srp better suited for H.A. purposes" (09 Aug 2012).
>
> The 5th patch, "fix an error accessing invalid memory in
> rport_dev_loss_timedout", includes your suggestion (30 Aug 2012).
>
> You can also pull the following git repo to get the patches.
>
> git://github.com/advance38/linux.git srp-ha
>
> Our test setup consists of two systems: kernel 3.2.15 with SCST v4193 on
> the target, and kernel 3.2.8 with ib_srp-ha on the initiator. Although I
> rebased the patches again onto 3.6-rc3, I suppose there will be no
> significant differences.
>
> All of the known critical issues seem to have been resolved according to
> our internal tests over the last weeks.

Thanks for the testing and the feedback. Let's hope that the ib_srp maintainer will be able to review these patches soon.

For the linux-scsi and LKML readers: I've sent my feedback about this patch series to the linux-rdma mailing list only.

Bart.
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 2:54 PM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> On Wed, 30 Jan 2008 14:10:47 +0100 "Bart Van Assche" <[EMAIL PROTECTED]> wrote:
> > On Jan 30, 2008 11:56 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> > > Sorry, I can't say. I don't know much about iSER. But it seems that Pete and Robin can get the better I/O performance - line speed ratio with STGT.
> >
> > Robin Humble was using a DDR InfiniBand network, while my tests were performed with an SDR InfiniBand network. Robin's results can't be directly compared to my results.
>
> I know that you use different hardware. I used the word 'ratio'.

Let's start with summarizing the relevant numbers from Robin's measurements and my own measurements.

Maximum bandwidth of the underlying physical medium: 2000 MB/s for a DDR 4x InfiniBand network and 1000 MB/s for an SDR 4x InfiniBand network.

Maximum bandwidth reported by the OFED ib_write_bw test program: 1473 MB/s for Robin's setup and 933 MB/s for my setup. These numbers match published ib_write_bw results (see e.g. figure 11 in http://www.usenix.org/events/usenix06/tech/full_papers/liu/liu_html/index.html or chapter 7 in http://www.voltaire.com/ftp/rocks/HCA-4X0_Linux_GridStack_4.3_Release_Notes_DOC-00171-A00.pdf).

Throughput measured for communication via STGT + iSER to a remote RAM disk via direct I/O with dd: 800 MB/s for writing and 751 MB/s for reading in Robin's setup, and 647 MB/s for writing and 589 MB/s for reading in my setup.

From this we can compute the ratio of I/O performance to ib_write_bw bandwidth: 54 % for writing and 51 % for reading in Robin's setup, and 69 % for writing and 63 % for reading in my setup. In other words, the utilization of the available bandwidth was slightly better in my setup than in Robin's setup. This is no surprise -- the faster a communication link is, the harder it is to use all of the available bandwidth.

So why did you state that in Robin's tests the I/O performance to line speed ratio was better than in my tests?

Bart Van Assche.
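For clarity, the percentages above follow directly from dividing the measured dd throughput by the ib_write_bw throughput of the same setup:

    ratio = (dd throughput) / (ib_write_bw throughput)

    Robin's setup: write 800 / 1473 = 54 %, read 751 / 1473 = 51 %
    My setup:      write 647 /  933 = 69 %, read 589 /  933 = 63 %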
Re: Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> Since this particular code is located in a non-data-path critical section, the kernel vs. user discussion is a wash. If we are talking about the data path, yes, the relevance of dd tests in kernel designs is suspect :p. For those IB testers who are interested, perhaps having a look with disktest from the Linux Test Project would give a better comparison between the two implementations on an RDMA-capable fabric like IB for best-case performance. I think everyone is interested in seeing just how much data path overhead exists between userspace and kernel space in typical and heavy workloads, and if this overhead can be minimized to make userspace a better option for some of this very complex code.

I can run disktest on the same setups I ran dd on. This will take some time however. Disktest is new to me -- any hints with regard to suitable combinations of command line parameters are welcome. The most recent version I could find on http://ltp.sourceforge.net/ is ltp-20071231.

Bart Van Assche.
Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 5:25 PM, Joe Landman <[EMAIL PROTECTED]> wrote:
> Vladislav Bolkhovitin wrote:
> > Actually, I don't know what kind of conclusions it is possible to draw from disktest's results (maybe only how throughput gets bigger or smaller with an increasing number of threads?). It's a good stress test tool, but not more.
>
> Unfortunately, I agree. Bonnie++, dd tests, and a few others come far closer to "real world" tests than disktest and iozone, the latter of which does more to test the speed of RAM cache and system call performance than actual IO.

I have run some tests with Bonnie++, but found out that on a fast network like IB the filesystem used for the test has a really big impact on the test results.

If anyone has a suggestion for a better test than dd to compare the performance of SCSI storage protocols, please let me know.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 6:14 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> Also, with STGT being a pretty new design which has not undergone a lot of optimization, perhaps profiling both pieces of code against similar tests would give us a better idea of where the userspace bottlenecks reside. Also, the overhead involved with traditional iSCSI for bulk IO from kernel / userspace would be a key concern for a much larger set of users, as iSER and SRP on IB is a pretty small userbase and will probably remain small for the near future.

Two important trends in data center technology are server consolidation and storage consolidation. Among others, every web hosting company can profit from a fast storage solution. I wouldn't call this a small user base.

Regarding iSER and SRP on IB: InfiniBand is today the most economic solution for a fast storage network. I do not know which technology will be the most popular for storage consolidation within a few years -- this can be SRP, iSER, IPoIB + SDP, FCoE (Fibre Channel over Ethernet) or maybe yet another technology. No matter which technology becomes the most popular for storage applications, there will be a need for high-performance storage software.

References:
* Michael Feldman, Battle of the Network Fabrics, HPCwire, December 2006, http://www.hpcwire.com/hpc/1145060.html
* NetApp, Reducing Data Center Power Consumption Through Efficient Storage, February 2007, http://www.netapp.com/ftp/wp-reducing-datacenter-power-consumption.pdf

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> The PyX storage engine supports a scatterlist linked list algorithm that ...

Which parts of the PyX source code are licensed under the GPL and which parts are closed source? A Google query for PyX + iSCSI showed information about licensing deals. Licensing deals can only be closed for software that is not entirely licensed under the GPL.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 7:15 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> I meant small referring to storage on IB fabrics, which has usually been in research and national lab settings, with some other vendors offering IB as an alternative storage fabric for those who [w,c]ould not wait for 10 Gb/sec copper Ethernet and Direct Data Placement to come online. These types of numbers compare to, say, traditional iSCSI, which is getting used all over the place these days in areas I won't bother listing here.

InfiniBand has several advantages over 10 Gbit/s Ethernet (the list below probably isn't complete):

- Lower latency. Communication latency is not only determined by the latency of a switch. The whole InfiniBand protocol stack was designed with low latency in mind. Low latency is really important for database software that accesses storage over a network.

- High availability is implemented at the network layer. Suppose that a group of servers has dual-port network interfaces and is interconnected via a so-called dual star topology. With an InfiniBand network, failover in case of a single failure (link or switch) is handled without any operating system or application intervention. With Ethernet, failover in case of a single failure must be handled either by the operating system or by the application.

- You do not have to use iSER or SRP to use the bandwidth of an InfiniBand network effectively. The SDP (Sockets Direct Protocol) makes it possible for applications to benefit from RDMA via the very classic IPv4 Berkeley sockets interface. An SDP implementation in software is already available today via OFED. iperf reports 470 MB/s in single-threaded tests and 975 MB/s in a performance test with two threads on an SDR 4x InfiniBand network. These tests were performed with the OFED 1.2.5.4 SDP implementation. It is possible that future SDP implementations will perform even better. (Note: I could not yet get iSCSI over SDP working.)

We should leave the choice of networking technology open -- both Ethernet and InfiniBand have specific advantages.

See also: InfiniBand Trade Association, InfiniBand Architecture Specification Release 1.2.1, http://www.infinibandta.org/specs/register/publicspec/

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 1, 2008 11:39 AM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> On Fri, 2008-02-01 at 09:11 +0100, Bart Van Assche wrote:
> > On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> > > The PyX storage engine supports a scatterlist linked list algorithm that ...
> >
> > Which parts of the PyX source code are licensed under the GPL and which parts are closed source? A Google query for PyX + iSCSI showed information about licensing deals. Licensing deals can only be closed for software that is not entirely licensed under the GPL.
>
> I was using the name PyX to give an historical context to the discussion. ...

Regarding the PyX target code: I have found a link via which I can download a free 30-day demo. This means that a company is earning money via this target code and that the source code is not licensed under the GPL. This is fine, but it also means that today the PyX target code is not a candidate for inclusion in the Linux kernel, and that it is unlikely that all of the PyX target code (kernelspace + userspace) will be made available under the GPL soon.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 1, 2008 1:05 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> All of the kernel and C userspace code is open source, available from linux-iscsi.org and licensed under the GPL.

I found a statement on a web page that the ERL2 implementation is not included in the GPL version (http://zaal.org/iscsi/index.html). The above implies that this statement is incorrect. Tomo, are you the maintainer of this web page?

I'll try to measure the performance of the LIO target stack on the same setup on which I ran the other performance tests.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 4, 2008 1:27 PM, Vladislav Bolkhovitin <[EMAIL PROTECTED]> wrote:
> So, James, what is your opinion on the above? Or does the overall SCSI target project's simplicity not matter much to you, and you think it's fine to duplicate the Linux page cache in user space in order to keep the in-kernel part of the project as small as possible?

It's too early to draw conclusions about performance. I'm currently performing more measurements, and the results are not easy to interpret. My plan is to measure the following:
* Setup: target with a RAM disk of 2 GB as backing storage.
* Throughput reported by dd and xdd (direct I/O).
* Transfers with dd/xdd in units of 1 KB to 1 GB (the smallest transfer size that can be specified to xdd is 1 KB).
* Target SCSI software to be tested: IETD iSCSI via IPoIB, STGT iSCSI via IPoIB, STGT iSER, SCST iSCSI via IPoIB, SCST SRP, LIO iSCSI via IPoIB.

The reason I chose dd/xdd for these tests is that I want to measure the performance of the communication protocols, and I am assuming that this performance can be modeled by the following formula:

  (transfer time in s) = (transfer setup latency in s)
                         + (transfer size in MB) / (bandwidth in MB/s)

Measuring the time needed for transfers with varying block size makes it possible to compute the constants in the above formula via linear regression.

One difficulty I already encountered is that the performance of the Linux IPoIB implementation varies a lot under high load (http://bugzilla.kernel.org/show_bug.cgi?id=9883). Another issue I have to look further into is that dd and xdd report different results for very large block sizes (> 1 MB).

Bart Van Assche.
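For illustration, the linear regression mentioned above amounts to an ordinary least-squares fit of t = a + b*s, where a is the setup latency and 1/b the bandwidth. A minimal sketch in C (not the actual analysis code used for these measurements):

    #include <stddef.h>

    /*
     * Least-squares fit of t = a + b * s. With s the transfer size in MB
     * and t the transfer time in s, a is the setup latency in s and 1/b
     * the bandwidth in MB/s. Returns 0 on success, -1 if degenerate.
     */
    static int fit_latency_bandwidth(const double *s, const double *t,
                                     size_t n, double *latency,
                                     double *bandwidth)
    {
            double sum_s = 0, sum_t = 0, sum_ss = 0, sum_st = 0, a, b, d;
            size_t i;

            for (i = 0; i < n; i++) {
                    sum_s  += s[i];
                    sum_t  += t[i];
                    sum_ss += s[i] * s[i];
                    sum_st += s[i] * t[i];
            }
            d = n * sum_ss - sum_s * sum_s;
            if (d == 0)
                    return -1;
            b = (n * sum_st - sum_s * sum_t) / d;   /* slope = 1/bandwidth */
            if (b == 0)
                    return -1;
            a = (sum_t - b * sum_s) / n;            /* intercept = latency */
            *latency = a;
            *bandwidth = 1.0 / b;
            return 0;
    }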
Re: [Cbe-oss-dev] LIO Target iSCSI/SE PS3-Linux / FC8 builds
On Feb 4, 2008 5:44 PM, Marc Dietrich <[EMAIL PROTECTED]> wrote:
> ...
> Anyway, here's a quick and ugly fix for the ARCH detection code, tested on ps3.
> ...

Architecture detection is indeed broken in LIO. Would it be possible to use the standard config.guess script instead of the custom LIO arch detection script? config.guess is included with, among others, automake and libtool.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 4, 2008 11:57 PM, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> Networked block devices are attractive because the concepts and implementation are simpler than those of networked filesystems... but usually you want to run some sort of filesystem on top. At that point you might as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your networked block device (and associated complexity).

Running a filesystem on top of iSCSI results in better performance than NFS, especially if the NFS client conforms to the NFS standard (= synchronous writes). By searching the web for the keywords NFS, iSCSI and performance I found the following (six year old) document: http://www.technomagesinc.com/papers/ip_paper.html. A quote from its conclusion:

  Our results, generated by running some of industry standard benchmarks,
  show that iSCSI significantly outperforms NFS for situations when
  performing streaming, database like accesses and small file transactions.

Bart Van Assche.
Re: kara_am
On Feb 5, 2008 1:10 PM, am kara <[EMAIL PROTECTED]> wrote:
> I am trying to understand IPsec and found some related subfolders

Please read this book (or a similar book) first:

  Understanding Linux Network Internals
  by Christian Benvenuti
  Publisher: O'Reilly, December 2005
  ISBN: 0-596-00255-6, 1062 pages

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
Regarding the performance tests I promised to perform: although until now I have only been able to run two tests (STGT + iSER versus SCST + SRP), the results are interesting. I will run the remaining test cases during the next days.

About the test setup: dd and xdd were used to transfer 2 GB of data between an initiator system and a target system via direct I/O over an SDR InfiniBand network (1 GB/s). The block size varied between 512 bytes and 1 GB, and was always a power of two.

Expected results:
* The measurement results are consistent with the numbers I published earlier.
* During data transfers all data is transferred in blocks between 4 KB and 32 KB in size (according to the SCST statistics).
* For small and medium block sizes (<= 32 KB) transfer times can be modeled very well by the following formula: (transfer time) = (setup latency) + (bytes transferred)/(bandwidth). The correlation numbers are very close to one.
* The latency and bandwidth parameters depend on the test tool (dd versus xdd), on the kind of test performed (reading versus writing), on the SCSI target and on the communication protocol.
* When using RDMA (iSER or SRP), SCST has a lower latency and a higher bandwidth than STGT. Results from linear regression for block sizes <= 32 KB:

  Test                    Latency (us)  Bandwidth (MB/s)  Correlation
  STGT+iSER, read,  dd        64              560            0.95
  STGT+iSER, read,  xdd       65              556            0.94
  STGT+iSER, write, dd        53              394            0.71
  STGT+iSER, write, xdd       54              445            0.59
  SCST+SRP,  read,  dd        39              657            0.83
  SCST+SRP,  read,  xdd       41              668            0.87
  SCST+SRP,  write, dd        52              449            0.62
  SCST+SRP,  write, xdd       52              516            0.77

Results that I did not expect:
* A block transfer size of 1 MB is not enough to measure the maximal throughput. The maximal throughput is only reached at much larger block sizes (about 10 MB for SCST + SRP and about 100 MB for STGT + iSER).
* There is one case where dd and xdd results are inconsistent: when reading via SCST + SRP with block sizes of about 1 MB.
* For block sizes > 64 KB the measurements differ from the model. This is probably because all initiator-target transfers happen in blocks of 32 KB or less.

For the details and some graphs, see also http://software.qlayer.com/display/iSCSI/Measurements.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 5, 2008 6:10 PM, Erez Zilber <[EMAIL PROTECTED]> wrote:
> One may claim that STGT should have lower performance than SCST because its data path runs in userspace. However, your results show that for non-IB transports they both show the same numbers. Furthermore, with IB there shouldn't be any additional difference between the 2 targets because data transfer from userspace is as efficient as data transfer from kernel space.
>
> The only explanation that I see is that fine tuning for iSCSI & iSER is required. As was already mentioned in this thread, with SDR you can get ~900 MB/sec with iSER (on STGT).

My most recent measurements also show that one can get 900 MB/s with STGT + iSER on an SDR IB network, but only for very large block sizes (>= 100 MB). A quote from Linus Torvalds is relevant here (February 5, 2008):

  Block transfer sizes over about 64kB are totally irrelevant for 99% of all people.

Please read my e-mail (posted earlier today) with a comparison between SCST and STGT for 4 KB - 64 KB block transfer sizes.

Bart Van Assche.
Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
On 10/22/12 21:01, Jeff Moyer wrote:
> All of the infrastructure is available to allocate a request_queue on a particular numa node, but it isn't being utilized at all. Wire up the sd driver to allocate the request_queue on the HBA's local numa node.
>
> This is a request for comments and testing (I've built and booted it, nothing more). I believe that this should be a performance win, but I have no numbers to back it up as yet. Suggestions for workloads to test are welcome.
>
> Cheers,
> Jeff
>
> Signed-off-by: Jeff Moyer
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index da36a3a..7986483 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>  	struct request_queue *q;
>  	struct device *dev = shost->dma_dev;
>
> -	q = blk_init_queue(request_fn, NULL);
> +	q = blk_init_queue_node(request_fn, NULL,
> +				dev_to_node(&shost->shost_dev));
>  	if (!q)
>  		return NULL;

Are you sure this approach will always result in the queue being allocated on the same NUMA node as the HCA? If e.g. a user triggers LUN scanning via sysfs, the above code may be invoked on another NUMA node than the node to which the HCA is connected. Also, if you have a look at e.g. scsi_request_fn() or scsi_device_unbusy() you will see that, in order to avoid inter-node traffic, it is important to allocate the sdev and shost data structures on the same NUMA node.

How about the following approach (a sketch of such an interface follows below)?
- Add a variant of scsi_host_alloc() that allows the caller to specify on which NUMA node to allocate the shost structure and that also stores the identity of that node in the shost structure.
- Modify __scsi_alloc_queue() such that it allocates the sdev structure on the same NUMA node as the shost structure.
- Modify the SCSI LLD of your choice such that it uses the new scsi_host_alloc() call. Depending on what is appropriate, the NUMA node on which to allocate the shost could be specified by the user or could be identical to the NUMA node of the HCA controlled by the SCSI LLD (see e.g. /sys/devices/pci*/*/numa_node).

Please keep in mind that a single PCIe bus may have a minimal distance to more than one NUMA node. See e.g. the diagram at the top of page 8 in http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf for a system diagram of a NUMA system where each PCIe bus has a minimal distance to two different NUMA nodes.

Bart.
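The interface suggested above could look as follows (a hypothetical prototype; nothing with this name exists in the kernel at the time of writing):

    /*
     * Variant of scsi_host_alloc() that allocates the shost structure on
     * NUMA node 'node' and records that node in the shost structure, so
     * that dependent allocations can be made on the same node.
     */
    struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *sht,
                                           int privsize, int node);

__scsi_alloc_queue() could then allocate on the stored node, e.g. via blk_init_queue_node(request_fn, NULL, shost->numa_node).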
Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
On 10/23/12 18:52, Jeff Moyer wrote:
> Bart Van Assche writes:
> > Please keep in mind that a single PCIe bus may have a minimal distance to more than one NUMA node. See e.g. the diagram at the top of page 8 in http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf for a system diagram of a NUMA system where each PCIe bus has a minimal distance to two different NUMA nodes.
>
> That's an interesting configuration. I wonder what the numa_node sysfs file contains for such systems -- do you know? I'm not sure how we could allow this to be user-controlled at probe time. Did you have a specific mechanism in mind? Module parameters? Something else?

As far as I can see in drivers/pci/pci-sysfs.c, the numa_node sysfs attribute contains a single number, even for a topology like the one described above.

With regard to user control of the NUMA node: I'm not sure how to solve this in general. But for the ib_srp driver this should be easy to do: SCSI host creation is triggered by sending a login string to a sysfs attribute ("add_target"). It wouldn't take much time to add a parameter to that login string that specifies the NUMA node.

Bart.
Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel
On Feb 6, 2008 1:11 AM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> I have always observed the case with LIO SE/iSCSI target mode ...

Hello Nicholas,

Are you sure that the LIO-SE kernel module source code is ready for inclusion in the mainstream Linux kernel? As you know, I tried to test the LIO-SE iSCSI target. Already while configuring the target I encountered a kernel crash that froze the whole system. I can reproduce this kernel crash easily, and I reported it 11 days ago on the LIO-SE mailing list (February 4, 2008). One of the call stacks I posted shows a crash in mempool_alloc() called from jbd. In other words, the crash is most likely the result of memory corruption caused by LIO-SE.

Because I was curious to know why it took so long to fix such a severe crash, I started browsing through the LIO-SE source code. Analysis of the LIO-SE kernel module source code taught me that this crash is not a coincidence. Dynamic memory allocation (kmalloc()/kfree()) in the LIO-SE kernel module is complex and hard to verify: there are 412 memory allocation/deallocation calls in the current version of the LIO-SE kernel module source code, which is a lot. Additionally, because of the complexity of the memory handling in LIO-SE, it is not possible to verify the correctness of the memory handling by analyzing a single function at a time. In my opinion this makes the LIO-SE source code hard to maintain.

Furthermore, the LIO-SE kernel module source code does not follow conventions that have proven their value in the past, like grouping all error handling at the end of a function (an example of that convention follows below). As could be expected, the consequence is that error handling is not correct in several functions, resulting in memory leaks in case of an error. Some examples of functions in which error handling is clearly incorrect:
* transport_allocate_passthrough()
* iscsi_do_build_list()

Bart Van Assche.
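For reference, the error-handling convention referred to above keeps the success path linear and groups all cleanup behind labels at the end of the function, e.g. (a generic userspace sketch, not LIO-SE code; in the kernel this would use kmalloc()/kfree() and -ENOMEM):

    #include <stdlib.h>

    static int setup_two_buffers(void **a, void **b)
    {
            *a = malloc(64);
            if (!*a)
                    goto err;
            *b = malloc(64);
            if (!*b)
                    goto err_free_a;
            return 0;           /* success path stays linear */

    err_free_a:                 /* cleanup runs in reverse order */
            free(*a);
            *a = NULL;
    err:
            return -1;
    }

With this structure each allocation has exactly one matching cleanup site, which makes leaks much easier to spot during review.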
Re: Integration of SCST in the mainstream Linux kernel
On Thu, Feb 7, 2008 at 2:45 PM, Vladislav Bolkhovitin <[EMAIL PROTECTED]> wrote:
> Bart Van Assche wrote:
> > Since the focus of this thread shifted somewhat in the last few messages, I'll try to summarize what has been discussed so far:
> > - There were a number of participants who joined this discussion spontaneously. This suggests that there is considerable interest in networked storage and iSCSI.
> > - It has been motivated why iSCSI makes sense as a storage protocol (compared to ATA over Ethernet and Fibre Channel over Ethernet).
> > - The direct I/O performance results for block transfer sizes below 64 KB are a meaningful benchmark for storage target implementations.
> > - It has been discussed whether an iSCSI target should be implemented in user space or in kernel space. It is clear now that an implementation in the kernel can be made faster than a user space implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804). Regarding existing implementations, measurements have, among other things, shown that SCST is faster than STGT (30% with the following setup: iSCSI via IPoIB and direct I/O block transfers with a size of 512 bytes).
> > - It has been discussed which iSCSI target implementation should be in the mainstream Linux kernel. There is no agreement on this subject yet. The short-term options are as follows:
> > 1) Do not integrate any new iSCSI target implementation in the mainstream Linux kernel.
> > 2) Add one of the existing in-kernel iSCSI target implementations to the kernel, e.g. SCST or PyX/LIO.
> > 3) Create a new in-kernel iSCSI target implementation that combines the advantages of the existing iSCSI kernel target implementations (IETD, STGT, SCST and PyX/LIO).
> > As an iSCSI user, I prefer option (3). The big question is whether the various storage target authors agree with this?
>
> I tend to agree, with some important notes:
>
> 1. IET should be excluded from this list; iSCSI-SCST is IET updated for the SCST framework with a lot of bugfixes and improvements.
>
> 2. I think everybody will agree that a Linux iSCSI target should work over some standard SCSI target framework. Hence the choice gets narrower: SCST vs STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) into the mainline, because of a lot of code duplication. Nicholas could decide to move to either existing framework (although, frankly, I don't think there's a possibility for an in-kernel iSCSI target on top of a user space SCSI target framework) and if he decides to go with SCST, I'll be glad to offer my help and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better one should win.

If I understood the above correctly, regarding a kernel space iSCSI target implementation only LIO-SE and SCST should be considered. What I know today about these Linux iSCSI target implementations is as follows:
* SCST performs slightly better than LIO-SE, and LIO-SE performs slightly better than STGT (both with regard to latency and with regard to bandwidth).
* The coding style of SCST is closer to the Linux kernel coding style than the coding style of the LIO-SE project.
* The structure of SCST is closer to what Linus expects than the structure of LIO-SE (i.e., authentication handled in userspace, data transfer handled by the kernel -- LIO-SE handles both in kernel space).
* Until now I did not encounter any strange behavior in SCST.
The issues I encountered with LIO-SE are being resolved via the LIO-SE mailing list (http://groups.google.com/group/linux-iscsi-target-dev).

It would take too much effort to develop a new kernel space iSCSI target from scratch -- we should start from either LIO-SE or SCST. My opinion is that the best approach is to start with integrating SCST in the mainstream kernel, and that the more advanced features from LIO-SE that are not yet in SCST can be ported from LIO-SE to the SCST framework.

Nicholas, do you think the structure of SCST is powerful enough to be extended with LIO-SE's powerful features like ERL-2?

Bart Van Assche.
Re: BUG/ spinlock lockup, 2.6.24
2008/2/15 Denys Fedoryshchenko <[EMAIL PROTECTED]>:
> I have random crashes, at least once per week. It is very difficult to catch the error message, and only recently did I set up netconsole. Now I got a crash, but there is no traceback and only a single line, mentioned before, came over netconsole.

Have you already run memtest? You can run memtest by booting from the Knoppix CD-ROM or DVD. Most Linux distributions have also included memtest on their bootable distribution CDs/DVDs.

Bart Van Assche.
Re: [patch 01/10] scsi: make __scsi_alloc_queue numa-aware
On 10/30/12 21:14, Jeff Moyer wrote:
> Pass the numa node id set in the Scsi_Host on to blk_init_queue_node in order to keep all allocations local to the numa node the device is closest to.
>
> Signed-off-by: Jeff Moyer
> ---
>  drivers/scsi/scsi_lib.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index da36a3a..8662a09 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>  	struct request_queue *q;
>  	struct device *dev = shost->dma_dev;
>
> -	q = blk_init_queue(request_fn, NULL);
> +	q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
>  	if (!q)
>  		return NULL;

Hello Jeff,

I haven't seen the patch that introduces numa_node in struct Scsi_Host, nor the cover letter of this patch series. Have these been posted on the linux-scsi mailing list?

Bart.
Re: [patch,v2 01/10] scsi: add scsi_host_alloc_node
On 11/02/12 22:45, Jeff Moyer wrote:
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 593085a..7d7ad8b 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
>   **/
>  struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>  {
> +	return scsi_host_alloc_node(sht, privsize, -1);

Using NUMA_NO_NODE here might improve readability.

> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index 4908480..a1b5c8e 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -733,6 +733,12 @@ struct Scsi_Host {
>  	struct device *dma_dev;
>
> +	/*
> +	 * Numa node this device is closest to, used for allocating
> +	 * data structures locally.
> +	 */
> +	int numa_node;

Have you considered using #ifdef CONFIG_NUMA / #endif here? I've noticed that all other numa_node members in structures under include/ have this.

Bart.
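In other words, something along these lines (a sketch of the suggested declaration, mirroring how other numa_node members under include/ are declared):

    struct Scsi_Host {
            /* ... existing members ... */
    #ifdef CONFIG_NUMA
            /*
             * NUMA node this device is closest to, used for allocating
             * data structures locally; NUMA_NO_NODE when unspecified.
             */
            int numa_node;
    #endif
    };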
Re: [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node
On 11/02/12 22:45, Jeff Moyer wrote:
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 2936b44..4db6973 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
>   * NULL on failure
>   */
>  static struct scsi_cmnd *
> -scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
> +scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
> +			int node)
>  {
>  	struct scsi_cmnd *cmd;
>
> -	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
> +	cmd = kmem_cache_alloc_node(pool->cmd_slab,
> +			gfp_mask | pool->gfp_mask | __GFP_ZERO,
> +			node);
>  	if (!cmd)
>  		return NULL;
>
> -	cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
> -					     gfp_mask | pool->gfp_mask);
> +	cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
> +			gfp_mask | pool->gfp_mask | __GFP_ZERO,
> +			node);

It's not clear to me why __GFP_ZERO has been added to the allocation flags?

Bart.
Re: [patch,v2 05/10] sd: use alloc_disk_node
On 11/02/12 22:45, Jeff Moyer wrote:
> Signed-off-by: Jeff Moyer
> ---
>  drivers/scsi/sd.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 12f6fdf..8deb915 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
>  	if (!sdkp)
>  		goto out;
>
> -	gd = alloc_disk(SD_MINORS);
> +	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
>  	if (!gd)
>  		goto out_free;

shost->numa_node can be another NUMA node than dev_to_node(dev). Have you considered using shost->numa_node here?

Bart.
Re: [patch,v2 05/10] sd: use alloc_disk_node
On 11/05/12 15:12, Jeff Moyer wrote:
> Bart Van Assche writes:
> > On 11/02/12 22:45, Jeff Moyer wrote:
> > > -	gd = alloc_disk(SD_MINORS);
> > > +	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
> >
> > shost->numa_node can be another NUMA node than dev_to_node(dev). Have you considered using shost->numa_node here?
>
> It can? How?

E.g. if the LLD allows the user to specify the value of numa_node and passes that value to scsi_host_alloc_node() (see also http://lkml.org/lkml/2012/10/23/477 for further information).

> Just so I'm clear, you're suggesting I use the scsi_device's host pointer to get to the Scsi_Host, and that *will* be filled in at this point, right?

As far as I can see the sdev->host pointer is set in scsi_alloc_sdev(), and that happens before sd_probe() is invoked.

Bart.
Re: [PATCH 1/2] virtio-scsi: fix parsing of hotplug/hot-unplug LUN number
On 07/26/12 09:21, James Bottomley wrote:
> On Thu, 2012-07-26 at 11:04 +0200, Paolo Bonzini wrote:
> > On 26/07/2012 10:52, James Bottomley wrote:
> > > > +static unsigned int virtscsi_get_lun(u8 *lun_bytes)
> > > > +{
> > > > +	unsigned int lun = (lun_bytes[2] << 8) | lun_bytes[3];
> > > > +	return lun & 16383;
> > > > +}
> > >
> > > Why are you rolling your own incomplete version of scsilun_to_int here?
> >
> > Because scsilun_to_int does not do the AND, so it would have exactly the same bug I'm fixing.
>
> It's not a bug ... it's the encoding. All the other devices use this too. Ideally we should have switched to 64 bit lun numbers for the encoding to be exact, but nothing so far has gone over 32 bits. If we don't encode the address method as part of the lun number, we don't get the reverse transform right and the addressing often fails.
>
> That does mean that arrays that use address method = 1 in REPORT LUNS have their lun numbers start at 16384.

Has it already been considered to modify scsilun_to_int() such that LUN numbers start at zero even for addressing method 1? This is what e.g. the function scst_unpack_lun() already does. See also http://scst.svn.sourceforge.net/viewvc/scst/trunk/scst/src/scst_lib.c?revision=HEAD&view=markup.

Bart.
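For background, a decoder that strips the SAM addressing-method bits, in the spirit of scst_unpack_lun(), could look like the sketch below. It is simplified to the single-level peripheral and flat-space methods; lun_bytes corresponds to the two bytes that the hunk above reads as lun_bytes[2] and lun_bytes[3]:

    #include <stdint.h>

    /* Returns the LUN with the addressing-method bits stripped, or -1 for
     * addressing methods this sketch does not handle. */
    static int unpack_lun_sketch(const uint8_t *lun_bytes)
    {
            switch (lun_bytes[0] >> 6) {
            case 0: /* peripheral device addressing method */
                    return lun_bytes[1];
            case 1: /* flat space addressing method */
                    return ((lun_bytes[0] & 0x3f) << 8) | lun_bytes[1];
            default: /* logical unit / extended addressing */
                    return -1;
            }
    }

With such a decoder, the LUN numbers of arrays that use addressing method 1 start at zero instead of at 16384.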
Re: [GIT PULL] please pull infiniband.git
On 07/10/13 16:38, Roland Dreier wrote:
> On Wed, Jul 10, 2013 at 7:35 AM, Sebastian Riemer wrote:
> > I've checked the commits on that tag and the following commit is not what we've agreed on:
>
> Sorry about that. The discussion was long and complex and I probably made a mistake in applying the patches. Please send me a patch to fix the driver to what it should be, and I will merge it ASAP.

I will send such a patch in reply to this e-mail but without CC-ing the LKML. Sebastian, if you could review that patch, that would be appreciated.

Bart.
Re: [PATCH] prevent CVE-2012-2372 rds-ping causes kernel panic
On 10/22/12 20:12, Jay Fenlason wrote:
> If you use rds-ping on the local IP address of some InfiniBand HCAs (QLogic, possibly others), the machine will panic with a BUG_ON due to an overly restrictive check. Loosen the restriction a bit.
>
> This should have been sent a while ago (it was first noticed in https://bugzilla.redhat.com/show_bug.cgi?id=803936 and patched in kernel-2.6.32-275.el6), but I got confused about its embargo status and lost it.
>
> Signed-off-by: Jay Fenlason
>
> diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
> index e590949..7920c85 100644
> --- a/net/rds/ib_send.c
> +++ b/net/rds/ib_send.c
> @@ -544,7 +544,7 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
>  	int flow_controlled = 0;
>  	int nr_sig = 0;
>
> -	BUG_ON(off % RDS_FRAG_SIZE);
> +	BUG_ON(!conn->c_loopback && off % RDS_FRAG_SIZE);
>  	BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
>
>  	/* Do not send cong updates to IB loopback */

Hello Jay,

The recommended approach for checking preconditions is to use WARN_ON_ONCE(), not BUG_ON(). Linus explained this last month in a message posted on the LKML (http://lkml.org/lkml/2012/9/27/461).

Bart.
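As an illustration, the recommended pattern would turn the fatal assertion in the hunk above into a one-time warning plus an error return, roughly as follows (untested sketch; whether -EINVAL is an appropriate return value for the callers of rds_ib_xmit() would need to be verified):

    /* Warn once and fail the transmit instead of panicking the machine. */
    if (WARN_ON_ONCE(!conn->c_loopback && off % RDS_FRAG_SIZE))
            return -EINVAL;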
Re: [patch,v3 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node
On 11/09/12 20:18, Jeff Moyer wrote:
> -	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
> +	cmd = kmem_cache_alloc_node(pool->cmd_slab,
> +			gfp_mask | pool->gfp_mask | __GFP_ZERO,
> +			node);

Hello Jeff,

Is it necessary to add __GFP_ZERO here? And if so, why?

Thanks,

Bart.
Re: [patch,v3 00/10] make I/O path allocations more numa-friendly
On 11/09/12 20:17, Jeff Moyer wrote:
> This patch set makes memory allocations for data structures used in the I/O path more numa friendly by allocating them from the same numa node as the storage device. I've only converted a handful of drivers at this point. My testing showed that, for workloads where the I/O processes were not tied to the numa node housing the device, a speedup of around 6% was observed. When the I/O processes were tied to the numa node of the device, there was no measurable difference in my test setup. Given my relatively low-end setup[1], I wouldn't be surprised if others could show a more significant performance advantage.
>
> Comments would be greatly appreciated.

Sorry, but I'm not familiar with any of the SCSI LLDs modified via this patch series. I am fine, however, with the SCSI core patches in this series. So if you want you can add the following to the first five patches in this series:

Reviewed-by: Bart Van Assche

Bart.
[PATCH] kbuild: Unbreak cleaning external module build directory
Avoid that "$(MAKE) -C $(KDIR) M=$(PWD) clean" triggers the following error message when invoked from inside the Makefile of an external kernel module:

  rm: cannot remove 'System.map': Permission denied
  make[1]: *** [clean] Error 1

Cc: Sam Ravnborg
Cc: Arnaud Lacombe
Cc: Nick Bowler
Cc: Richard Weinberger
Cc: Michal Marek
Cc:
Signed-off-by: Bart Van Assche
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 9f6ca12..46b9be5 100644
--- a/Makefile
+++ b/Makefile
@@ -1252,7 +1252,7 @@ scripts: ;
 endif # KBUILD_EXTMOD

 clean: $(clean-dirs)
-	$(Q)$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean
+	$(Q)$(if $(KBUILD_EXTMOD),,$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean)
 	$(call cmd,rmdirs)
 	$(call cmd,rmfiles)
 	@find $(if $(KBUILD_EXTMOD), $(KBUILD_EXTMOD), .) $(RCS_FIND_IGNORE) \
--
1.7.10.4
Re: [PATCH] timer: Fix jiffies wrap behavior of round_jiffies*()
On 05/30/13 01:01, Andrew Morton wrote:
> On Tue, 21 May 2013 20:43:50 +0200 Bart Van Assche wrote:
> > Make sure that the round_jiffies*() functions return a time that is in the future when the jiffies counter is about to wrap.
>
> Actually "when the jiffies counter has recently wrapped". I assume this was found by inspection?

Hello Andrew,

You are correct, this was found via source code inspection. I started reviewing the round_jiffies*() implementation because I was chasing a bug in a kernel driver that uses one of these functions.

> > -		return original;
> > -	return j;
> > +	return time_is_after_jiffies(j) ? j : original;
> >  }
>
> Your email client mangles patches, btw.

Sorry. I will take more care in the future.

Bart.
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On 06/25/13 05:18, Matthew Wilcox wrote:
> On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:
> > I'm wondering, how will this scheme work if the IO completion latency is a lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 100 usecs or more?
>
> There's clearly a threshold at which it stops making sense, and our current NAND-based SSDs are almost certainly on the wrong side of that threshold! I can't wait for one of the "post-NAND" technologies to make it to market in some form that makes it economical to use in an SSD.
>
> The problem is that some of the people who are looking at those technologies are crazy. They want to "bypass the kernel" and "do user space I/O" because "the kernel is too slow". This patch is part of an effort to show them how crazy they are. And even if it doesn't convince them, at least users who refuse to rewrite their applications to take advantage of magical userspace I/O libraries will see real performance benefits.

Recently I attended an interesting talk about this subject in which it was proposed not only to bypass the kernel for access to high-IOPS devices but also to allow byte addressability for block devices. The slides that accompanied that talk, which include a performance comparison with the traditional block driver API, can be found here: Bernard Metzler, On Suitability of High-Performance Networking API for Storage, OFA Int'l Developer Workshop, April 24, 2013 (http://www.openfabrics.org/ofa-documents/presentations/doc_download/559-on-suitability-of-high-performance-networking-api-for-storage.html).

This approach leaves the choice between polling and interrupt-based completion notification to the user of the new API, something the Linux InfiniBand RDMA verbs API already allows today.

Bart.
Re: [PATCH] scsi_prep_fn() check for empty queue
On 06/26/13 11:02, Maxim Uvarov wrote:
> This fixes:
>
> end_request: I/O error, dev sdc, sector 976576
> rport-0:0-3: blocked FC remote port time out: removing target and saving binding
> BUG: unable to handle kernel NULL pointer dereference at 0400
> IP: [] scsi_prep_state_check+0xe/0x99
>  [] scsi_setup_blk_pc_cmnd+0x1b/0x115
>  [] scsi_prep_fn+0x29/0x3b
>  [] blk_peek_request+0xe1/0x1b3
>  [] scsi_request_fn+0x3a/0x4d2
>  [] __generic_unplug_device+0x32/0x36
>  [] blk_execute_rq_nowait+0x77/0x9e
>  [] blk_execute_rq+0xa6/0xde
>  [] ? printk+0x41/0x46
>  [] ? get_rdac_req+0x81/0xe8 [scsi_dh_rdac]
>  [] send_mode_select+0x29f/0x489 [scsi_dh_rdac]
>  [] ? probe_workqueue_execution+0xb1/0xce
>  [] worker_thread+0x1a9/0x237
>  [] ? send_mode_select+0x0/0x489 [scsi_dh_rdac]
>  [] ? autoremove_wake_function+0x0/0x39
>  [] ? worker_thread+0x0/0x237
>  [] kthread+0x7f/0x87
>  [] child_rip+0xa/0x20
>  [] ? kthread+0x0/0x87
>  [] ? child_rip+0x0/0x20
>
> Signed-off-by: Maxim Uvarov
> ---
>  drivers/scsi/scsi_lib.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 86d5220..8e89ed9 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1295,6 +1295,9 @@ int scsi_prep_fn(struct request_queue *q, struct request *req)
>  	struct scsi_device *sdev = q->queuedata;
>  	int ret = BLKPREP_KILL;
>
> +	if (!sdev)
> +		return ret;
> +
>  	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>  		ret = scsi_setup_blk_pc_cmnd(sdev, req);
>  	return scsi_prep_return(q, req, ret);

Sorry, but this patch does not look like a proper fix to me. What you probably need is a scsi_device_get() call in scsi_dh_rdac.c somewhere before the queue_work(kmpath_rdacd, &ctlr->ms_work) call and a scsi_device_put() call once send_mode_select() has finished using the sdev (see the sketch below).

Bart.
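A rough sketch of that suggestion (untested; the exact context in scsi_dh_rdac.c may differ):

    /* Before queueing the deferred mode select, pin the scsi_device: */
    if (scsi_device_get(sdev))
            return;         /* device is already being torn down */
    queue_work(kmpath_rdacd, &ctlr->ms_work);

    /* ... and at the end of send_mode_select(), once the sdev is no
     * longer used: */
    scsi_device_put(sdev);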
Re: I/O blocked while dirty pages are being flushed
On 03/24/13 06:12, Fredrik Tolf wrote:
> While this flush is running, I find that many a process goes into disk sleep waiting for the flush to complete. This includes the process manipulating the mmapped file whenever it tries to redirty a page currently waiting to be flushed, but also, for instance, programs that write() to log files (since, I guess, the buffer page backing the last written portion of the log file is being flushed).

Have you already encountered this article: Jonathan Corbet, The trouble with stable pages, March 13, 2012 (http://lwn.net/Articles/486311/)?

Bart.
Re: [PATCH V7 4/5] virtio-scsi: introduce multiqueue support
On 03/23/13 12:28, Wanlong Gao wrote:

+static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
+					       struct virtio_scsi_target_state *tgt)
+{
+	struct virtio_scsi_vq *vq;
+	unsigned long flags;
+	u32 queue_num;
+
+	spin_lock_irqsave(&tgt->tgt_lock, flags);
+
+	/*
+	 * The memory barrier after atomic_inc_return matches
+	 * the smp_read_barrier_depends() in virtscsi_req_done.
+	 */
+	if (atomic_inc_return(&tgt->reqs) > 1)
+		vq = ACCESS_ONCE(tgt->req_vq);
+	else {
+		queue_num = smp_processor_id();
+		while (unlikely(queue_num >= vscsi->num_queues))
+			queue_num -= vscsi->num_queues;
+
+		tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+	}
+
+	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
+	return vq;
+}

Is there any reason why the smp_processor_id() % vscsi->num_queues computation in virtscsi_pick_vq() has been implemented as a loop instead of as an arithmetic operation ? If so, I would appreciate it if that could be explained in a comment.

Thanks,

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
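For reference, the arithmetic form the question refers to would presumably look like this (a sketch, not a proposed patch):

	/* Same result as the subtraction loop above, as long as
	 * num_queues is non-zero: */
	queue_num = smp_processor_id() % vscsi->num_queues;
	tgt->req_vq = vq = &vscsi->req_vqs[queue_num];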
Re: [GIT PULL] Final round of SCSI updates for the 3.8+ merge window
On 03/01/13 10:19, James Bottomley wrote:
> This is an assorted set of stragglers into the merge window with driver
> updates for qla2xxx, megaraid_sas, storvsc and ufs. It also includes
> pulls of the uapi tree (all the remaining SCSI pieces) and the fcoe
> tree (updates to fcoe and libfc)

Hello James,

I'm curious to know your opinion about the patch called "Fix race between starved list processing and device removal" (http://article.gmane.org/gmane.linux.scsi/80090/match=scsi_run_queue) ? It was posted for the first time several months ago and has been ack-ed and tested by other people. So far I haven't seen any comments from you on that patch.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] timer: Fix jiffies wrap behavior of round_jiffies*()
Make sure that the round_jiffies*() functions return a time that is in the future when the jiffies counter is about to wrap.

Signed-off-by: Bart Van Assche
Cc: Thomas Gleixner
Cc: Arjan van de Ven
Cc: Stephen Rothwell
---
 kernel/timer.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/timer.c b/kernel/timer.c
index 15ffdb3..aa8b964 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -149,9 +149,7 @@ static unsigned long round_jiffies_common(unsigned long j, int cpu,
 	/* now that we have rounded, subtract the extra skew again */
 	j -= cpu * 3;
 
-	if (j <= jiffies) /* rounding ate our timeout entirely; */
-		return original;
-	return j;
+	return time_is_after_jiffies(j) ? j : original;
 }
 
 /**
-- 
1.7.10.4
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
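For readers who do not have jiffies.h at hand: time_is_after_jiffies(j) is defined in include/linux/jiffies.h in terms of time_before(), and the time_before()/time_after() macros handle jiffies wrap through signed subtraction. Simplified (the typecheck() wrappers of the real macros are omitted):

#define time_after(a, b)		((long)((b) - (a)) < 0)
#define time_before(a, b)		time_after(b, a)
#define time_is_after_jiffies(a)	time_before(jiffies, a)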
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/02/12 08:41, Chanho Min wrote:
> This patch is to fix a oops from a torn down device. When
> scsi_run_queue process starved queues, scsi_request_fn can race with
> scsi_remove_device. In this case, rarely, scsi_request_fn release the
> last reference and set sdev->request_queue to NULL. It result in
> NULL-pointer dereference when spin_unlock is tried with (NULL)->
> queue_lock. We need to add an extra reference to the device on both
> sides of the __blk_run_queue to hold reference until scsi_request_fn
> is finished.

Good catch. So far I haven't been able to trigger this issue in my tests. So it would be appreciated if you could verify whether the patch below helps (patch is based on 3.6-rc1):

---
 drivers/scsi/scsi_sysfs.c | 8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 093d4f6..59e523c 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -348,7 +348,6 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
 	starget->reap_ref++;
 	list_del(&sdev->siblings);
 	list_del(&sdev->same_target_siblings);
-	list_del(&sdev->starved_entry);
 	spin_unlock_irqrestore(sdev->host->host_lock, flags);
 
 	cancel_work_sync(&sdev->event_work);
@@ -956,6 +955,8 @@ int scsi_sysfs_add_sdev(struct scsi_device *sdev)
 void __scsi_remove_device(struct scsi_device *sdev)
 {
 	struct device *dev = &sdev->sdev_gendev;
+	struct Scsi_Host *shost = sdev->host;
+	unsigned long flags;
 
 	if (sdev->is_visible) {
 		if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
@@ -977,6 +978,11 @@ void __scsi_remove_device(struct scsi_device *sdev)
 	blk_cleanup_queue(sdev->request_queue);
 	cancel_work_sync(&sdev->requeue_work);
 
+	spin_lock_irqsave(shost->host_lock, flags);
+	if (!list_empty(&sdev->starved_entry))
+		list_del(&sdev->starved_entry);
+	spin_unlock_irqrestore(shost->host_lock, flags);
+
 	if (sdev->host->hostt->slave_destroy)
 		sdev->host->hostt->slave_destroy(sdev);
 	transport_destroy_device(dev);
-- 
1.7.7
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/04/12 16:46, Mike Christie wrote:
> I think we have to have scsi-ml do a get_device when a sdev is added to
> the starved entry and then do a put_device when it is removed (must do
> these under the host lock for the starved entry case too). I am not sure
> if that is just a hack/papering-over of the problem and there are more
> issues like this.

That would result in a more complex patch than the patch at the start of this thread, wouldn't it ? Also, IMHO it would help to document which functions in the scsi-ml are called with an sdev reference and which ones are not. That would make the scsi-ml code easier to verify for issues like the one reported at the start of this thread.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/04/12 22:36, Mike Christie wrote:
> On 08/04/2012 03:18 PM, Bart Van Assche wrote:
>> On 08/04/12 16:46, Mike Christie wrote:
>>> I think we have to have scsi-ml do a get_device when a sdev is added to
>>> the starved entry and then do a put_device when it is removed (must do
>>> these under the host lock for the starved entry case too). I am not sure
>>> if that is just a hack/papering-over of the problem and there are more
>>> issues like this.
>>
>> That would result in a more complex patch than the patch at the start of
>> this thread, wouldn't it ?
>
> Yaah, but the original patch in this thread is still racey isn't it?

Indeed. How about the patch below ? Scsi devices are removed from starved_list after blk_cleanup_queue() and before put_device(). That guarantees that inside scsi_run_queue() get_device() under host lock will succeed.

---
 drivers/scsi/scsi_lib.c   | 5 +
 drivers/scsi/scsi_sysfs.c | 7 ++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ffd7773..bd7daec 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -452,10 +452,15 @@ static void scsi_run_queue(struct request_queue *q)
 			continue;
 		}
 
+		get_device(&sdev->sdev_gendev);
 		spin_unlock(shost->host_lock);
+
 		spin_lock(sdev->request_queue->queue_lock);
 		__blk_run_queue(sdev->request_queue);
 		spin_unlock(sdev->request_queue->queue_lock);
+
+		put_device(&sdev->sdev_gendev);
+
 		spin_lock(shost->host_lock);
 	}
 	/* put any unprocessed entries back */
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 093d4f6..44f232e 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -348,7 +348,6 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
 	starget->reap_ref++;
 	list_del(&sdev->siblings);
 	list_del(&sdev->same_target_siblings);
-	list_del(&sdev->starved_entry);
 	spin_unlock_irqrestore(sdev->host->host_lock, flags);
 
 	cancel_work_sync(&sdev->event_work);
@@ -956,6 +955,8 @@ int scsi_sysfs_add_sdev(struct scsi_device *sdev)
 void __scsi_remove_device(struct scsi_device *sdev)
 {
 	struct device *dev = &sdev->sdev_gendev;
+	struct Scsi_Host *shost = sdev->host;
+	unsigned long flags;
 
 	if (sdev->is_visible) {
 		if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
@@ -977,6 +978,10 @@ void __scsi_remove_device(struct scsi_device *sdev)
 	blk_cleanup_queue(sdev->request_queue);
 	cancel_work_sync(&sdev->requeue_work);
 
+	spin_lock_irqsave(shost->host_lock, flags);
+	list_del(&sdev->starved_entry);
+	spin_unlock_irqrestore(shost->host_lock, flags);
+
 	if (sdev->host->hostt->slave_destroy)
 		sdev->host->hostt->slave_destroy(sdev);
 	transport_destroy_device(dev);
-- 
1.7.7
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/07/12 08:53, Chanho Min wrote:
> On Tue, Aug 7, 2012 at 2:56 AM, Bart Van Assche wrote:
>> Indeed. How about the patch below ? Scsi devices are removed from
>> starved_list after blk_cleanup_queue() and before put_device(). That
>> guarantees that inside scsi_run_queue() get_device() under host lock
>> will succeed.
>
> Thanks, IMHO, it's harmless and the simple way to solve this issue.
> But, I think the second half of your patches are not required, extra
> referecne is might suffice?

I'm afraid that without the second half of that patch the following race is still possible:
- The sdev reference count drops to zero while scsi_run_queue() is in progress and while that sdev is on the starved_list of its SCSI host; the scsi_device_dev_release_usercontext() call is scheduled but not yet executed.
- scsi_run_queue() takes that sdev off the local starved_list.
- scsi_run_queue() calls get_device() and that call fails since the sdev reference count is zero.
- scsi_device_dev_release_usercontext() gets scheduled and frees the sdev.
- scsi_run_queue() proceeds and calls __blk_run_queue() on a freed queue, which is what we were trying to avoid.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Kernel hangs in truncate_inode_pages()
Hello,

If I log in with the open-iscsi initiator to an iSCSI target and power down the target, the initiator hangs in truncate_inode_pages(). This happens systematically with kernel 3.6-rc3 and it's something I've never seen with kernel 3.4 or before. Has anyone else already run into this issue ?

From the output of echo w >/proc/sysrq-trigger:

SysRq : Show Blocked State
  task                 PC     stack  pid    father
udisks-part-id  D 810f7530 0 30803 10057 0x0006
 8801063c39a8 0046 880126955d60 0002
 8801063c3fd8 880126955d60 8801063c3fd8 8801063c3fd8
 81813420 880126955d60 8801063c39a8 88012fa13990
Call Trace:
 [] ? __lock_page+0x70/0x70
 [] schedule+0x29/0x70
 [] io_schedule+0x8f/0xd0
 [] sleep_on_page+0xe/0x20
 [] __wait_on_bit_lock+0x5a/0xc0
 [] ? find_get_pages+0x2a/0x1c0
 [] ? release_pages+0x1af/0x200
 [] __lock_page+0x67/0x70
 [] ? autoremove_wake_function+0x40/0x40
 [] truncate_inode_pages_range+0x456/0x480
 [] ? trace_hardirqs_on+0xd/0x10
 [] ? on_each_cpu_cond+0xaa/0xf0
 [] truncate_inode_pages+0x15/0x20
 [] kill_bdev+0x2f/0x40
 [] __blkdev_put+0x76/0x1d0
 [] blkdev_put+0x5d/0x180
 [] blkdev_close+0x25/0x30
 [] __fput+0xcc/0x240
 [] fput+0xe/0x10
 [] task_work_run+0x69/0xa0
 [] do_exit+0x78f/0x930
 [] ? get_signal_to_deliver+0xf0/0x6f0
 [] ? _raw_spin_lock_irq+0x17/0x60
 [] ? _raw_spin_unlock_irq+0x30/0x50
 [] do_group_exit+0x4e/0xc0
 [] get_signal_to_deliver+0x228/0x6f0
 [] do_signal+0x3c/0x620
 [] ? sysret_signal+0x5/0x3d
 [] do_notify_resume+0x6d/0xb0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] int_signal+0x12/0x17

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel hangs in truncate_inode_pages()
On 08/24/12 17:24, Mike Christie wrote:
> On 08/24/2012 08:13 AM, Bart Van Assche wrote:
>> If I log in with the open-iscsi initiator to an iSCSI target and power
>> down the target, the initiator hangs in truncate_inode_pages(). This
>> happens systematically with kernel 3.6-rc3 and it's something I've
>> never seen with kernel 3.4 or before. Has anyone else already run into
>> this issue ?
>
> I have not seen this issue before and have not seen it reported.
>
> Are you doing IO the disk at the time of power down, or are you doing a
> iscsiadm -m node ... --logout command?

There was no I/O load test running at the time of the target power down. The call stack I posted is what I see before logout. And this is what I get after iscsi logout:

BUG: soft lockup - CPU#3 stuck for 23s! [iscsid:8589]
Modules linked in: ... [last unloaded: crc32c]
irq event stamp: 918076162
hardirqs last enabled at (918076161): [] _raw_spin_unlock_irqrestore+0x3f/0x70
hardirqs last disabled at (918076162): [] apic_timer_interrupt+0x67/0x80
softirqs last enabled at (917955754): [] __do_softirq+0x142/0x2a0
softirqs last disabled at (917955749): [] call_softirq+0x1c/0x30
CPU 3
Pid: 8589, comm: iscsid Not tainted 3.6.0-rc3+ #1 Gigabyte Technology Co., Ltd. Z68X-UD3H-B3/Z68X-UD3H-B3
RIP: 0010:[] [] _raw_spin_unlock_irqrestore+0x41/0x70
RSP: 0018:8800ae7e5928 EFLAGS: 0286
RAX: 880106002eb0 RBX: 810956a2 RCX: 0006
RDX: 0040 RSI: 880106003498 RDI: 880106002eb0
RBP: 8800ae7e5938 R08: R09: 0001
R10: 0002 R11: R12: 0006
R13: 0007 R14: 880106002eb0 R15: 880106003498
FS: 7fa8cfc22700() GS:88012fac() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 7fa8cf26617c CR3: bc87c000 CR4: 000407e0
DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
Process iscsid (pid: 8589, threadinfo 8800ae7e4000, task 880106002eb0)
Stack:
 88011af0c000 88001ea0fff0 8800ae7e5988 a00aaafc
 0286 88001ea1 8800ae7e5968 88008ae44a28
 88008ae44800 88008ae44818 0001
Call Trace:
 [] scsi_remove_target+0x1bc/0x1f0 [scsi_mod]
 [] __iscsi_unbind_session+0xc6/0x1a0 [scsi_transport_iscsi]
 [] iscsi_remove_session+0x105/0x1e0 [scsi_transport_iscsi]
 [] iscsi_destroy_session+0x16/0x60 [scsi_transport_iscsi]
 [] iscsi_session_teardown+0xa9/0xd0 [libiscsi]
 [] iscsi_sw_tcp_session_destroy+0x50/0x70 [iscsi_tcp]
 [] iscsi_if_rx+0xab8/0xf60 [scsi_transport_iscsi]
 [] netlink_unicast+0x1ad/0x230
 [] netlink_sendmsg+0x2fb/0x350
 [] ? sock_update_classid+0xc8/0x140
 [] sock_sendmsg+0xa2/0xe0
 [] ? might_fault+0x9c/0xb0
 [] ? might_fault+0x53/0xb0
 [] ? verify_iovec+0x56/0xd0
 [] __sys_sendmsg+0x382/0x390
 [] ? might_fault+0x9c/0xb0
 [] ? might_fault+0x53/0xb0
 [] ? lg_local_unlock+0x23/0x50
 [] ? cp_new_stat+0x116/0x130
 [] sys_sendmsg+0x49/0x90
 [] system_call_fastpath+0x16/0x1b

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/07/12 08:53, Chanho Min wrote:
> In addition, Is it ironic that we are careful to use put_device at
> scsi_request_fn?. If we trigger the ->remove(),
> It occur a oops. What about the removal of unlock/lock as patch bellow?
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 4037fd5..8d9eccd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1608,11 +1608,7 @@ out_delay:
>  	if (sdev->device_busy == 0)
>  		blk_delay_queue(q, SCSI_QUEUE_DELAY);
>  out:
> -	/* must be careful here...if we trigger the ->remove() function
> -	 * we cannot be holding the q lock */
> -	spin_unlock_irq(q->queue_lock);
>  	put_device(&sdev->sdev_gendev);
> -	spin_lock_irq(q->queue_lock);
>  }

As far as I can see the comment in the above code was added before scsi_device_dev_release() was moved to user context, so it might be outdated. See also http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=65110b2168950a19cc78b5027ed18cb811fbdae8.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/08/12 03:42, Chanho Min wrote: > Thank you for the explanation. It look correct. Let's check one more thing. > What If __scsi_remove_device doesn't release device? : reference count > is more than 2. > So We lost starved_list but device is exist. Is there any issue about this? As far as I can see that scenario will also be handled correctly by the proposed patch. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Feb 18, 2008 10:43 AM, Erez Zilber <[EMAIL PROTECTED]> wrote:
> If you use a high value for FirstBurstLength, all (or most) of your data
> will be sent as unsolicited data-out PDUs. These PDUs don't use the RDMA
> engine, so you miss the advantage of IB.

Hello Erez,

Did you notice the e-mail Roland Dreier wrote on February 6, 2008 ? This is what Roland wrote:
> I think the confusion here is caused by a slight misuse of the term
> "RDMA". It is true that all data is always transported over an
> InfiniBand connection when iSER is used, but not all such transfers
> are one-sided RDMA operations; some data can be transferred using
> send/receive operations.

Or: data sent during the first burst is not transferred via one-sided remote memory reads or writes but via two-sided send/receive operations. At least on my setup, these operations are as fast as one-sided remote memory reads or writes. As an example, I obtained the following numbers on my setup (SDR 4x network):
ib_write_bw: 933 MB/s.
ib_read_bw: 905 MB/s.
ib_send_bw: 931 MB/s.

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
SMP-related kernel memory leak
Hello, I noticed that the amount of memory used by the Linux kernel steadily increases over time on SMP systems (x86 architecture, 32-bit kernel). This problem disappears when I add maxcpus=1 to the kernel command line. I have observed this behavior both on the 2.6.22.18 and 2.6.24.2 kernels. Did anyone notice anything similar ? See also: http://bugzilla.kernel.org/show_bug.cgi?id=9991 Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP-related kernel memory leak
On Feb 19, 2008 7:18 PM, Oliver Pinter <[EMAIL PROTECTED]> wrote: > On 2/19/08, Bart Van Assche <[EMAIL PROTECTED]> wrote: > > I noticed that the amount of memory used by the Linux kernel steadily > > increases over time on SMP systems (x86 architecture, 32-bit kernel). > > This problem disappears when I add maxcpus=1 to the kernel command > > line. I have observed this behavior both on the 2.6.22.18 and 2.6.24.2 > > kernels. Did anyone notice anything similar ? > > > > See also: http://bugzilla.kernel.org/show_bug.cgi?id=9991 > > this patch fixed them http://lkml.org/lkml/2008/2/18/405 ? Thanks for the hint. If I interpreted the 2.6.24 changelog correctly this patch is already included with 2.6.24 ? The problem still occurs with 2.6.24.2. I am currently trying to find the minimal kernel config which still triggers this problem. Any other hints for finding the cause of this issue are welcome of course. Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Feb 20, 2008 8:34 AM, Erez Zilber <[EMAIL PROTECTED]> wrote: > Bart Van Assche wrote: > > Or: data sent during the first burst is not transferred via one-sided > > remote memory reads or writes but via two-sided send/receive > > operations. At least on my setup, these operations are as fast as > > one-sided remote memory reads or writes. As an example, I obtained the > > following numbers on my setup (SDR 4x network); > > ib_write_bw: 933 MB/s. > > ib_read_bw: 905 MB/s. > > ib_send_bw: 931 MB/s. > > According to these numbers one can think that you don't need RDMA at > all, just send iSCSI PDUs over IB. Sorry, but you are misinterpreting what I wrote. > The benchmarks that you use are > synthetic IB benchmarks that are not equivalent to iSCSI over iSER. They > just send IB packets. I'm not surprised that you got more or less the > same performance because, AFAIK, ib_send_bw doesn't copy data (unlike > iSCSI that has to copy data that is sent/received without RDMA). I agree that ib_write_bw / ib_read_bw / ib_send_bw performance results are not equivalent to iSCSI over iSER. The reason that I included these performance results was to illustrate that two-sided data transfers over IB are about as fast as one-sided data transfers. > When you use RDMA with iSCSI (i.e. iSER), you don't need to create iSCSI > PDUs and process them. The CPU is not busy as it is with iSCSI over TCP > because no data copies are required. Another advantage is that you don't > need header/data digest because the IB HW does that. As far as I know, when using iSER, the FirstBurstLength bytes of data are sent via two-sided data transfers, and there is no CPU intervention required to transfer the data itself over the IB network. Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: Merging of completely unreviewed drivers
On Fri, Feb 22, 2008 at 2:46 AM, David Newall <[EMAIL PROTECTED]> wrote: > Krzysztof Halasa wrote: > > Perhaps we should increase line length limit, 132 should be fine. > > Especially useful with long printk() lines and long arithmetic > > expressions. > > Yes; or even longer. 80 characters might have made sense on a screen > when the alternative was 80 characters on a punched card, but on a > modern computer it's very restrictive. That's especially true with the > deep indents that you quickly get in C. Even short lines often need to > be split when you put a few tabs in front of them, and that makes > comprehension that bit harder, not to mention looks ugly. There is a reason to limit line length: scientific research has shown that readability of regular texts is optimal for a line length between 55 and 65 characters. My experience is that the readability of source code decreases when the lines are very long (more than 160 characters). Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH][SCSI] remove the queue unlock in scsi_requset_fn
On 08/10/12 03:22, Chanho Min wrote: > We don't need to unlock the queue before put_device in scsi_request_fn() It looks like there is a typo in the patch subject ? Also, you can omit "[SCSI]" from the patch subject - AFAIK James has a script that inserts that text automatically. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] remove the queue unlock in scsi_requset_fn
On 08/16/12 01:35, Chanho Min wrote: >> functions will occur in line. I also don't see why the sdev reference >> couldn't drop to zero here. > scsi_request_fn is called under the lock of request_queue->queue_lock. > If we drop the sdev reference to zero here, > scsi_device_dev_release_usercontext is > invoked and make request_queue to NULL. When caller of scsi_request_fn try to > unlock request_queue->queue_lock, the oops is occurred. Whether or not your patch is applied, if the put_device() call in scsi_request_fn() decreases the sdev reference count to zero, the scsi_request_fn() caller will still try to unlock the queue lock after scsi_request_fn() finished and hence will trigger a use-after-free. I'm afraid the only real solution is to modify the SCSI and/or block layers such that scsi_remove_device() can't finish while scsi_request_fn() is in progress. And once that is guaranteed the get_device() / put_device() pair can be left out from scsi_request_fn(). Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
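A rough sketch of the caller-side pattern that makes this use-after-free possible (simplified from blk_run_queue()-style callers, not verbatim kernel code):

	spin_lock_irq(q->queue_lock);
	q->request_fn(q);		/* scsi_request_fn(); may drop the
					 * last sdev reference via
					 * put_device() ... */
	spin_unlock_irq(q->queue_lock);	/* ... after which q->queue_lock may
					 * already belong to freed memory */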
Re: Integration of SCST in the mainstream Linux kernel
On Feb 5, 2008 6:50 PM, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> For remotely accessing data, iSCSI+fs is quite simply more overhead than
> a networked fs. With iSCSI you are doing
>
> local VFS -> local blkdev -> network
>
> whereas a networked filesystem is
>
> local VFS -> network

There are use cases that can be solved better via iSCSI and a filesystem than via a network filesystem. One such use case is when deploying a virtual machine whose data is stored on a network server: in that case there is only one user of the data (so there are no locking issues) and the filesystem and the block device each run in another operating system: the filesystem runs inside the virtual machine and iSCSI either runs in the hypervisor or in the native OS.

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Feb 5, 2008 6:01 PM, Erez Zilber <[EMAIL PROTECTED]> wrote: > > Using such large values for FirstBurstLength will give you poor > performance numbers for WRITE commands (with iSER). FirstBurstLength > means how much data should you send as unsolicited data (i.e. without > RDMA). It means that your WRITE commands were sent without RDMA. Sorry, but I'm afraid you got this wrong. When the iSER transport is used instead of TCP, all data is sent via RDMA, including unsolicited data. If you have look at the iSER implementation in the Linux kernel (source files under drivers/infiniband/ulp/iser), you will see that all data is transferred via RDMA and not via TCP/IP. Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel
Since the focus of this thread shifted somewhat in the last few messages, I'll try to summarize what has been discussed so far:
- A number of participants joined this discussion spontaneously. This suggests that there is considerable interest in networked storage and iSCSI.
- It has been explained why iSCSI makes sense as a storage protocol (compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64 KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented in user space or in kernel space. It is clear now that an implementation in the kernel can be made faster than a user space implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804). Regarding existing implementations, measurements have among other things shown that SCST is faster than STGT (30% with the following setup: iSCSI via IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in the mainstream Linux kernel. There is no agreement on this subject yet. The short-term options are as follows:
  1) Do not integrate any new iSCSI target implementation in the mainstream Linux kernel.
  2) Add one of the existing in-kernel iSCSI target implementations to the kernel, e.g. SCST or PyX/LIO.
  3) Create a new in-kernel iSCSI target implementation that combines the advantages of the existing iSCSI kernel target implementations (IETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the various storage target authors agree with this ?

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] remove the queue unlock in scsi_requset_fn
On 08/16/12 07:52, Bart Van Assche wrote:
> On 08/16/12 01:35, Chanho Min wrote:
>>> functions will occur in line. I also don't see why the sdev reference
>>> couldn't drop to zero here.
>> scsi_request_fn is called under the lock of request_queue->queue_lock.
>> If we drop the sdev reference to zero here,
>> scsi_device_dev_release_usercontext is
>> invoked and make request_queue to NULL. When caller of scsi_request_fn try to
>> unlock request_queue->queue_lock, the oops is occurred.
>
> Whether or not your patch is applied, if the put_device() call in
> scsi_request_fn() decreases the sdev reference count to zero, the
> scsi_request_fn() caller will still try to unlock the queue lock after
> scsi_request_fn() finished and hence will trigger a use-after-free. I'm
> afraid the only real solution is to modify the SCSI and/or block layers
> such that scsi_remove_device() can't finish while scsi_request_fn() is
> in progress. And once that is guaranteed the get_device() / put_device()
> pair can be left out from scsi_request_fn().

(replying to my own e-mail)

How about the patch below ?

[PATCH] Fix device removal race

If the put_device() call in scsi_request_fn() drops the sdev refcount to zero then the spin_lock_irq() call after the put_device() call triggers a use-after-free. Avoid that by making sure that blk_cleanup_queue() can only finish after all active scsi_request_fn() calls have returned.

---
 block/blk-core.c        |  1 +
 drivers/scsi/scsi_lib.c | 10 ++
 include/linux/blkdev.h  |  5 +
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4b4dbdf..0e4da3b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -388,6 +388,7 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
 			__blk_run_queue(q);
 
 		drain |= q->nr_rqs_elvpriv;
+		drain |= q->request_fn_active;
 
 		/*
 		 * Unfortunately, requests are queued at and tracked from
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ffd7773..10bb348 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1514,9 +1514,7 @@ static void scsi_request_fn(struct request_queue *q)
 	struct scsi_cmnd *cmd;
 	struct request *req;
 
-	if(!get_device(&sdev->sdev_gendev))
-		/* We must be tearing the block queue down already */
-		return;
+	q->request_fn_active++;
 
 	/*
 	 * To start with, we keep looping until the queue is empty, or until
@@ -1626,11 +1624,7 @@ out_delay:
 	if (sdev->device_busy == 0)
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 out:
-	/* must be careful here...if we trigger the ->remove() function
-	 * we cannot be holding the q lock */
-	spin_unlock_irq(q->queue_lock);
-	put_device(&sdev->sdev_gendev);
-	spin_lock_irq(q->queue_lock);
+	q->request_fn_active--;
 }
 
 u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4e72a9d..11c1987 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -377,6 +377,11 @@ struct request_queue {
 	unsigned int		nr_sorted;
 	unsigned int		in_flight[2];
 
+	/*
+	 * Number of active request_fn() calls for those request_fn()
+	 * implementations that unlock the queue_lock, e.g. scsi_request_fn().
+	 */
+	unsigned int		request_fn_active;
 
 	unsigned int		rq_timeout;
 	struct timer_list	timeout;
-- 
1.7.7
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] IB/mlx4: silence GCC warning
On 10/29/12 10:50, Paul Bolle wrote:
> On Wed, 2012-10-10 at 09:23 +0200, Jack Morgenstein wrote:
>> You could use:
>> u16 uninitialized_var(vlan);
>> instead.
>
> I guess we'd better just wait and see whether uninitialized_var()
> survives before discussing your suggestion (see the thread starting at
> https://lkml.org/lkml/2012/10/26/508 ).
>
>> Although this in the special QP data flow, I still prefer to avoid
>> adding extra code (even setting initial values at procedure entry).
>> The line above will also do the job. "uninitialized_var" is used
>> elsewhere in the driver. See, for example, mlx4_ib_post_send() in the
>> same file (qp.c).

(replying to an e-mail of a few months ago)

If there are no further objections I'd like to see this patch go upstream. It fixes an annoying compiler warning and I don't think that this patch has a negative performance impact.

Thanks,

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
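For reference, uninitialized_var() was at that time defined for gcc builds in include/linux/compiler-gcc*.h essentially as a self-assignment, so the suggested line expands as follows:

#define uninitialized_var(x) x = x

	u16 uninitialized_var(vlan);	/* expands to: u16 vlan = vlan;
					 * silences the warning without
					 * generating extra code */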
Re: [RESEND PATCH] scsi: make struct scsi_varlen_cdb_hdr packed
On 10/11/12 11:15, James Hogan wrote:

The struct scsi_varlen_cdb_hdr is expected to be exactly 10 bytes when used in struct osd_cdb_head, but it isn't marked as packed. Some architectures will round the struct size up which triggers BUILD_BUG_ON compile errors in osd_initiator.c when the outer structs are unexpected sizes. This is fixed by marking struct scsi_varlen_cdb_hdr as __packed.

Signed-off-by: James Hogan
---
 include/scsi/scsi.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 66216c1..3beaef3 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -198,7 +198,7 @@ struct scsi_varlen_cdb_hdr {
 	__u8 additional_cdb_length;	/* total cdb length - 8 */
 	__be16 service_action;
 	/* service specific data follows */
-};
+} __packed;
 
 static inline unsigned scsi_varlen_cdb_length(const void *hdr)

Hello James,

Are you aware that __packed can also be used on individual struct members and that doing so has a lower performance penalty than using the __packed attribute on an entire struct ? See e.g. for an example.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
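A sketch of the member-level alternative mentioned above (the field list is abbreviated, and whether this alone avoids the struct size rounding James describes depends on the architecture):

struct scsi_varlen_cdb_hdr {
	/* ... leading __u8 members elided ... */
	__u8 additional_cdb_length;		/* total cdb length - 8 */
	__be16 service_action __packed;		/* pack only this member
						 * instead of the whole
						 * struct */
	/* service specific data follows */
};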
Re: Deadlock observed in SCSI device removal
On 12/15/12 05:22, Amit Sahrawat wrote: We are using kernel version 3.0.33, and the patch which is mentioned for the above problem - is already applied in the kernel. If the kernel reports that io_schedule() hangs that usually means that an I/O request was not finished properly. It would help if you could verify whether or not you can reproduce this issue with kernel 3.6.10. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: Merging of completely unreviewed drivers
On Fri, Feb 22, 2008 at 7:54 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote: > If a patch or if a file has a clean _style_, bugs and deeper > structural problems often stand out like a sore thumb. But if the > code is peppered with random style noise, it's a lot harder (for me > at least) to notice real bugs. I can notice bugs in a squeeky clean > code base about 5 times easier than in a noisy codebase. This effect > alone makes checkpatch indispensible for the scheduler and for > arch/x86. I also appreciate style uniformity in kernel code. My (limited) experience with checkpatch is that most checkpatch complaints are easy to resolve. Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Integration of SCST in the mainstream Linux kernel
As you probably know there is a trend in enterprise computing towards networked storage. This is illustrated by the emergence during the past few years of standards like SRP (SCSI RDMA Protocol), iSCSI (Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different pieces of software are necessary to make networked storage possible: initiator software and target software. As far as I know there exist three different SCSI target implementations for Linux:
- The iSCSI Enterprise Target Daemon (IETD, http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST, http://scst.sourceforge.net/).

Since I was wondering which SCSI target software would be best suited for an InfiniBand network, I started evaluating the STGT and SCST SCSI target implementations. Apparently the performance difference between STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks, but the SCST target software outperforms the STGT software on an InfiniBand network. See also the following thread for the details: http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.

About the design of the SCST software: while one of the goals of the STGT project was to keep the in-kernel code minimal, the SCST project implements the whole SCSI target in kernel space. SCST is implemented as a set of new kernel modules; only minimal changes to the existing kernel are necessary before the SCST kernel modules can be used. This is the same approach that will be followed in the very near future in the OpenSolaris kernel (see also http://opensolaris.org/os/project/comstar/). More information about the design of SCST can be found here: http://scst.sourceforge.net/doc/scst_pg.html.

My impression is that both the STGT and SCST projects are well designed, well maintained and have a considerable user base. According to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to STGT with respect to features, performance, maturity, stability, and number of existing target drivers. Unfortunately the SCST kernel code lives outside the kernel tree, which makes SCST harder to use than STGT.

As an SCST user, I would like to see the SCST kernel code integrated in the mainstream kernel because of its excellent performance on an InfiniBand network. Since the SCST project comprises about 14 KLOC, reviewing the SCST code will take considerable time. Who will do this reviewing work ? And with regard to the comments made by the reviewers: Vladislav, do you have the time to carry out the modifications requested by the reviewers ? I expect, among other things, that reviewers will ask to move SCST's configuration pseudofiles from procfs to sysfs.

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.25
On Jan 18, 2008 1:11 AM, Roland Dreier <[EMAIL PROTECTED]> wrote: > Anyway, here are all the pending things that I'm aware of. As usual, > if something isn't already in my tree and isn't listed below, I > probably missed it or dropped it by mistake. Please remind me again > in that case. Are there any plans to merge the SDP (Sockets Direct Protocol) implementation ? Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] I2C: add support for the PCF8575 chip
From: Bart Van Assche

Add support for the PCF8575 I2C chip.

Signed-off-by: Bart Van Assche ([EMAIL PROTECTED])
---
diff -uprN -X orig/linux-2.6.22.9/Documentation/dontdiff orig/linux-2.6.22.9/drivers/i2c/chips/Kconfig linux-2.6.22.9/drivers/i2c/chips/Kconfig
--- orig/linux-2.6.22.9/drivers/i2c/chips/Kconfig	2007-09-26 20:03:01.0 +0200
+++ linux-2.6.22.9/drivers/i2c/chips/Kconfig	2007-10-05 09:27:19.0 +0200
@@ -41,13 +41,27 @@ config SENSORS_PCF8574
 	default n
 	help
 	  If you say yes here you get support for Philips PCF8574 and
-	  PCF8574A chips.
+	  PCF8574A chips. These chips are 8-bit I/O expanders for the I2C bus.
 
 	  This driver can also be built as a module. If so, the module
 	  will be called pcf8574.
 
-	  These devices are hard to detect and rarely found on mainstream
-	  hardware. If unsure, say N.
+	  These devices are hard to detect automatically and are rarely found
+	  on mainstream hardware. If unsure, say N.
+
+config SENSORS_PCF8575
+	tristate "Philips PCF8575"
+	default n
+	help
+	  If you say yes here you get support for Philips PCF8575 chip.
+	  This chip is a 16-bit I/O expander for the I2C bus. Several other
+	  chip manufacturers sell equivalent chips, e.g. Texas Instruments.
+
+	  This driver can also be built as a module. If so, the module
+	  will be called pcf8575.
+
+	  This device is hard to detect automatically and is rarely found on
+	  mainstream hardware. If unsure, say N.
 
 config SENSORS_PCA9539
 	tristate "Philips PCA9539 16-bit I/O port"
diff -uprN -X orig/linux-2.6.22.9/Documentation/dontdiff orig/linux-2.6.22.9/drivers/i2c/chips/Makefile linux-2.6.22.9/drivers/i2c/chips/Makefile
--- orig/linux-2.6.22.9/drivers/i2c/chips/Makefile	2007-09-26 20:03:01.0 +0200
+++ linux-2.6.22.9/drivers/i2c/chips/Makefile	2007-10-05 09:16:21.0 +0200
@@ -9,6 +9,7 @@ obj-$(CONFIG_SENSORS_MAX6875) += max6875
 obj-$(CONFIG_SENSORS_M41T00) += m41t00.o
 obj-$(CONFIG_SENSORS_PCA9539) += pca9539.o
 obj-$(CONFIG_SENSORS_PCF8574) += pcf8574.o
+obj-$(CONFIG_SENSORS_PCF8575) += pcf8575.o
 obj-$(CONFIG_SENSORS_PCF8591) += pcf8591.o
 obj-$(CONFIG_ISP1301_OMAP) += isp1301_omap.o
 obj-$(CONFIG_TPS65010) += tps65010.o
diff -uprN -X orig/linux-2.6.22.9/Documentation/dontdiff orig/linux-2.6.22.9/drivers/i2c/chips/pcf8575.c linux-2.6.22.9/drivers/i2c/chips/pcf8575.c
--- orig/linux-2.6.22.9/drivers/i2c/chips/pcf8575.c	2007-10-05 09:44:02.0 +0200
+++ linux-2.6.22.9/drivers/i2c/chips/pcf8575.c	2007-10-05 11:10:46.0 +0200
@@ -0,0 +1,305 @@
+/*
+  pcf8575.c - Part of lm_sensors, Linux kernel modules for hardware
+  monitoring
+  Copyright (c) 2000 Frodo Looijaard <[EMAIL PROTECTED]>,
+  Philip Edelbrock <[EMAIL PROTECTED]>,
+  Dan Eaton <[EMAIL PROTECTED]>
+  Ported to Linux 2.6 by Aurelien Jarno <[EMAIL PROTECTED]> with
+  the help of Jean Delvare <[EMAIL PROTECTED]>
+
+  Copyright (C) 2006 Michael Hennerich, Analog Devices Inc.
+  <[EMAIL PROTECTED]>
+  based on the pcf8574.c
+
+  Ported from ucLinux to mainstream Linux kernel by Bart Van Assche.
+
+  About the pcf8575: the pcf8575 is an 16-bit I/O expander for the I2C bus
+  produced by a.o. Philips Semiconductors.
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+*/
+
+
+#include
+#include
+#include
+#include
+
+
+/* Addresses to scan */
+static unsigned short normal_i2c[] = { I2C_CLIENT_END };
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 20)
+static unsigned short normal_i2c_range[] = { I2C_CLIENT_END };
+#endif
+
+/* Insmod parameters */
+I2C_CLIENT_INSMOD;
+
+
+/* Each client has this additional data */
+struct pcf8575_data {
+	struct i2c_client client;
+
+	u16 write;			/* Remember last written value */
+	u8 buf[3];
+};
+
+static int pcf8575_attach_adapter(struct i2c_adapter *adapter);
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 0)
+static int pcf8575_detect(struct i2c_adapter *adapter, int address, int kind);
+#else
+static int pcf8575_detect(struct i2c_adapter *adapter,
+			  int addr, unsigned short flags, int kind);
+#endif
+static int pcf8575_detach_client(struct i2c_c
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 12:32 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote: > > iSER has parameters to limit the maximum size of RDMA (it needs to > repeat RDMA with a poor configuration)? Please specify which parameters you are referring to. As you know I had already repeated my tests with ridiculously high values for the following iSER parameters: FirstBurstLength, MaxBurstLength and MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block size specified to dd). Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Jan 29, 2008 9:42 PM, James Bottomley <[EMAIL PROTECTED]> wrote:
> > As an SCST user, I would like to see the SCST kernel code integrated
> > in the mainstream kernel because of its excellent performance on an
> > InfiniBand network. Since the SCST project comprises about 14 KLOC,
> > reviewing the SCST code will take considerable time. Who will do this
> > reviewing work ? And with regard to the comments made by the
> > reviewers: Vladislav, do you have the time to carry out the
> > modifications requested by the reviewers ? I expect a.o. that
> > reviewers will ask to move SCST's configuration pseudofiles from
> > procfs to sysfs.
>
> The two target architectures perform essentially identical functions, so
> there's only really room for one in the kernel. Right at the moment,
> it's STGT. Problems in STGT come from the user<->kernel boundary which
> can be mitigated in a variety of ways. The fact that the figures are
> pretty much comparable on non IB networks shows this.

Are you saying that users who need an efficient iSCSI implementation should switch to OpenSolaris ? The OpenSolaris COMSTAR project involves the migration of the existing OpenSolaris iSCSI target daemon from userspace to their kernel. The OpenSolaris developers are spending time on this because they expect a significant performance improvement.

> I really need a whole lot more evidence than at worst a 20% performance
> difference on IB to pull one implementation out and replace it with
> another. Particularly as there's no real evidence that STGT can't be
> tweaked to recover the 20% even on IB.

My measurements on a 1 GB/s InfiniBand network have shown that the current SCST implementation is able to read data via direct I/O at a rate of 811 MB/s (via SRP) and that the current STGT implementation is able to transfer data at a rate of 589 MB/s (via iSER). That's a performance difference of 38%. And even more important, the I/O latency of SCST is significantly lower than that of STGT. This is very important for database workloads -- the I/O pattern caused by database software is close to random I/O, and database software needs low-latency I/O in order to run efficiently.

In the thread with the title "Performance of SCST versus STGT" on the SCST-devel / STGT-devel mailing lists not only the raw performance numbers were discussed but also which further performance improvements are possible. It became clear that the SCST performance can be improved further by implementing a well known optimization (zero-copy I/O). Fujita Tomonori explained in the same thread that it is possible to improve the performance of STGT further, but that this would require a lot of effort (implementing asynchronous I/O in the kernel and also implementing a new caching mechanism using pre-registered buffers). See also: http://sourceforge.net/mailarchive/forum.php?forum_name=scst-devel&viewmonth=200801&viewday=17

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 11:56 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> On Wed, 30 Jan 2008 09:38:04 +0100
> "Bart Van Assche" <[EMAIL PROTECTED]> wrote:
>> Please specify which parameters you are referring to. As you know I
>
> Sorry, I can't say. I don't know much about iSER. But seems that Pete
> and Robin can get the better I/O performance - line speed ratio with
> STGT.

Robin Humble was using a DDR InfiniBand network, while my tests were performed with an SDR InfiniBand network. Robin's results can't be directly compared to my results. Pete Wyckoff's results (http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf) are hard to interpret. I have asked Pete which of the numbers in his test can be compared with what I measured, but Pete did not reply.

> The version of OpenIB might matters too. For example, Pete said that
> STGT reads loses about 100 MB/s for some transfer sizes due to the
> OpenIB version difference or other unclear reasons.
>
> http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135

Pete wrote about a degradation from 600 MB/s to 500 MB/s for reads with STGT+iSER. In my tests I measured 589 MB/s for reads (direct I/O), which matches the better result obtained by Pete. Note: the InfiniBand kernel modules I used were those from the 2.6.22.9 kernel, not from the OFED distribution.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 5:34 PM, James Bottomley <[EMAIL PROTECTED]> wrote: > > On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote: > > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote: > > > > > > iSER has parameters to limit the maximum size of RDMA (it needs to > > > repeat RDMA with a poor configuration)? > > > > Please specify which parameters you are referring to. As you know I > > had already repeated my tests with ridiculously high values for the > > following iSER parameters: FirstBurstLength, MaxBurstLength and > > MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block > > size specified to dd). > > the 1Mb block size is a bit of a red herring. Unless you've > specifically increased the max_sector_size and are using an sg_chain > converted driver, on x86 the maximum possible transfer accumulation is > 0.5MB. I did not publish the results, but I have also done tests with other block sizes. The other sizes I tested were between 0.1MB and 10MB. The performance difference for these other sizes compared to a block size of 1MB was small (smaller than the variance between individual tests results). Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 5:22 PM, James Bottomley <[EMAIL PROTECTED]> wrote:
> ...
> Deciding what lives in userspace and what should be in the kernel lies
> at the very heart of architectural decisions. However, the argument
> that "it should be in the kernel because that would make it faster" is
> pretty much a discredited one. To prevail on that argument, you have to
> demonstrate that there's no way to enable user space to do the same
> thing at the same speed. Further, it was the same argument used the
> last time around when the STGT vs SCST investigation was done. Your own
> results on non-IB networks show that both architectures perform at the
> same speed. That tends to support the conclusion that there's something
> specific about IB that needs to be tweaked or improved for STGT to get
> it to perform correctly.

You should know that given two different implementations in software of the same communication protocol, differences in latency and throughput become more visible as the network latency gets lower and the throughput gets higher. That's why conclusions can only be drawn from the InfiniBand numbers, and not from the 1 Gbit/s Ethernet numbers. Assuming that there is something specific in STGT with regard to InfiniBand is speculation.

> Furthermore, if you have already decided before testing that SCST is
> right and that STGT is wrong based on the architectures, it isn't
> exactly going to increase my confidence in your measurement methodology
> claiming to show this, now is it?

I did not draw any conclusions from the architecture -- the only data I based my conclusions on were my own performance measurements.

> ...
> These are both features being independently worked on, are they not?
> Even if they weren't, the combination of the size of SCST in kernel plus
> the problem of having to find a migration path for the current STGT
> users still looks to me to involve the greater amount of work.

My proposal was to have both the SCST kernel code and the STGT kernel code in the mainstream Linux kernel. This would make it easier for current STGT users to evaluate SCST. It's too early to choose one of the two projects -- this choice can be made later on.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is gcc thread-unsafe?
On 10/25/07, Linus Torvalds wrote:
> The gcc developers seem to have had a total disregard for what people
> want or need, and every time some code generation issue comes up, there's
> a lot of people on the list that do language-lawyering, rather than admit
> that there might be a problem.

Please make a proposal for how gcc should be modified instead of just shooting at the gcc people -- the root cause here is the way the C/C++ memory model is defined. (Note: I'm not in any way involved in gcc development.)

You can find my proposal to improve gcc here: http://gcc.gnu.org/ml/gcc/2007-10/msg00465.html

Bart Van Assche.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is gcc thread-unsafe?
On 10/26/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Fri, 26 Oct 2007, Linus Torvalds wrote:
>> On Fri, 26 Oct 2007, Bart Van Assche wrote:
>>>
>>> You can find my proposal to improve gcc here:
>>> http://gcc.gnu.org/ml/gcc/2007-10/msg00465.html
>>
>> Btw, I think this is fine per se, but putting "__attribute__((acquire))"
>> on the functions that acquire a lock does seem to be problematic, in that
>> quite often you might well want to inline those things. How would you
>> handle that?
>
> Thinking some more about this, you really have two cases:
>
> - full functions taking/releasing locks (possibly conditionally, ie
>   with something like trylock and/or based on argument values).
>
>   You simply *cannot* require these to be marked, because the locking may
>   have been done indirectly. Yes, you can mark things like
>   "pthread_mutex_trylock()" as being an acquire-function, but the fact
>   is, users will then wrap these things in *other* functions, and return
>   their return values.
>
>   Ergo: a compiler *must* assume that a function call that it
>   didn't inline involves locking. There's no point in adding some
>   gcc-specific attributes to system header files, because it's not going
>   to fix anything in any portable program.

You have a point here.

> - inline assembly (together with, potentially, compiler primitives).
>   That's the only other way to reliably do locking from C.
>
>   This one gcc could certainly extend on. But would there really be any
>   upside? It would be easier/better to say that inline assembly (at least
>   if it clobbers memory or is volatile) has the same serialization issues
>   as a function call.

A problem is that the serialization properties defined for functions in the C standard only apply to volatile variables, not to non-volatile variables. But for asm statements this can be solved by adding "memory" to the list of clobbered registers -- this will prevent any reordering of manipulations of non-volatile variables and asm statements. Andrew, do you know whether gcc currently contains any optimization that interchanges the order of accesses to non-volatile variables and function calls ?

> So the fact is, any compiler that turns
>
>	if (conditional)
>		x++;
>
> into an unconditional write to x (where 'x' is potentially visible to the
> outside - global visibility or has had its address taken) is just broken.
>
> No ifs, buts or maybes about it. You simply cannot do that optimization,
> because there is no way for the compiler to know whether the conditional
> implies that you hold a lock or not.

I agree with the above, but I see this as a different issue -- it wasn't my intention to solve this with my proposal for acquire and release attributes.

Bart Van Assche.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
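Spelled out, the invalid transformation Linus describes looks roughly like this:

	/* Source code: */
	if (conditional)
		x++;

	/* Invalid 'optimization' (introduces an unconditional store): */
	tmp = x;
	if (conditional)
		tmp++;
	x = tmp;	/* writes x even when the condition is false,
			 * racing with other threads that legitimately
			 * own x at that moment */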
Re: Is gcc thread-unsafe?
On 10/30/07, Andrew Haley <[EMAIL PROTECTED]> wrote:
> David Schwartz writes:
>> Can we get some kind of consensus that 'optimizations' that add
>> writes to any object that the programmer might have taken the
>> address of are invalid on any platform that supports memory
>> protection?
>
> That's what the proposed standard language says, kinda-sorta. There's
> an informal description at
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html.

There is other important information in the cited text. Among other things, it is explained that register promotion of potentially shared variables can introduce data races. In other words: register promotion can introduce bugs in multithreaded software when compiled with optimization enabled. Are there any register promotion transformations implemented in gcc that can introduce data races in multithreaded software? This is very important information both for kernel developers and for developers of multithreaded userspace applications.

Another conclusion from the cited text is that, in contrast with what was stated before on the gcc mailing list, it is not required to declare thread-shared variables volatile if that thread-shared data is consistently protected by calls to locking functions.

Bart Van Assche.
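To make the register promotion hazard concrete, here is a hedged illustration in the spirit of the examples in N2338 and Boehm's paper; it is not code taken from gcc:

int x;	/* shared; writers hold a lock, this loop may run without it */

void scale(const int *factor, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (factor[i] != 0)
			x *= factor[i];	/* x is only touched conditionally */
}

/*
 * Register promotion may effectively rewrite the loop as:
 *
 *	int r = x;			// unconditional load
 *	for (i = 0; i < n; i++)
 *		if (factor[i] != 0)
 *			r *= factor[i];
 *	x = r;				// unconditional store
 *
 * If every factor[i] is zero the original code never writes x, so a
 * concurrent update of x by another thread was legal; the promoted
 * version overwrites that update with a store the programmer never
 * wrote -- a compiler-introduced data race.
 */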
Re: Is gcc thread-unsafe?
On 11/2/07, Andrew Haley <[EMAIL PROTECTED]> wrote:
> Bart Van Assche writes:
>> On 10/30/07, Andrew Haley <[EMAIL PROTECTED]> wrote:
>>> That's what the proposed standard language says, kinda-sorta. There's
>>> an informal description at
>>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html.
>>
>> There is other important information in the cited text. Among other
>> things, it is explained that register promotion of potentially shared
>> variables can introduce data races. In other words: register promotion
>> can introduce bugs in multithreaded software when compiled with
>> optimization enabled. Are there any register promotion transformations
>> implemented in gcc that can introduce data races in multithreaded
>> software?
>
> I expect so. We're going to have to audit this whole enormous code
> base to find them all and take them out.
>
> Note that some of these optimizations have been around since gcc 3.4.

Has it already been decided who will do this audit, and when this audit will happen? Has a target date been set for when this audit should be complete, or is the completion of this audit a requirement for the release of a specific gcc version? And if register promotion transformations that can introduce data races do exist in gcc, which optimization levels enable these transformations?

Bart Van Assche.
Re: Is gcc thread-unsafe?
On 11/4/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Sun, 4 Nov 2007, Bart Van Assche wrote:
>> Has it already been decided who will do this audit, and when this
>> audit will happen? Has a target date been set when this audit should
>> be complete, or is the completion of this audit a requirement for the
>> release of a specific gcc version?
>
> I am told that the gcc people realized that was indeed a bug (people were
> able to show problems even in non-threaded environments with mprotect()),
> and have now fixed it in the current gcc sources. That still leaves the
> old versions with potential problems, but I think it makes it much less
> interesting to audit for these things.
>
> Linus

What I understood from the gcc mailing list is that a patch has been applied to the gcc sources that solves the issue with speculative stores that was already discussed here on the LKML (http://gcc.gnu.org/ml/gcc/2007-10/msg00554.html). But the issue I am referring to is a different one: namely that a compiler optimization called register promotion can introduce data races. Hans J. Boehm gives a clear explanation of this -- see also paragraph 4.3 in http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf or http://portal.acm.org/citation.cfm?id=1064978.1065042.

Bart Van Assche.
Re: [PATCH] add u64 number parser
On 07/22/16 17:23, James Smart wrote:
> +	buf = kmalloc(len + 1, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +	memcpy(buf, s->from, len);
> +	buf[len] = '\0';

Hello James,

Have you considered combining the above kmalloc() and memcpy() calls into a single kasprintf(GFP_KERNEL, "%.*s", len, s->from) call?

Bart.
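For reference, a sketch of what the suggested replacement could look like, assuming len fits in an int (the "%.*s" precision argument must be an int):

	/* kasprintf() allocates and formats in one step; "%.*s" copies
	 * at most 'len' bytes and the result is NUL-terminated. */
	buf = kasprintf(GFP_KERNEL, "%.*s", (int)len, s->from);
	if (!buf)
		return -ENOMEM;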
Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
On 09/19/16 03:38, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
>> CC-ing linux-bl...@vger.kernel.org
>>
>> I'm not sure I see how this helps. That probably means I'm not
>> considering the right scenario. Could you elaborate on when having
>> multiple hardware queues to choose from a given CPU will provide a
>> benefit?
>
> No, I do not keep in mind any particular scenario besides common
> sense. Just an assumption deeper queues are better (in this RFC
> a virtual combined queue consisting of multiple h/w queues).
>
> Apparently, there could be positive effects only in systems where
> # of queues / # of CPUs > 1 or # of queues / # of cores > 1. But
> I do not happen to have ones. If I had numbers this would not be
> the RFC and I probably would not have posted in the first place ;)
>
> Would it be possible to give it a try on your hardware?

Hello Alexander,

It is your task to measure the performance impact of these patches, not Keith's. BTW, I'm not convinced that multiple hardware queues per CPU will result in a performance improvement. I have not yet seen any SSD for which a queue depth above 512 results in better performance than a queue depth equal to 512. Which applications do you think will generate and sustain a queue depth above 512? Additionally, my experience from another high-performance context (RDMA) is that reducing the number of queues can result in higher IOPS due to fewer interrupts per I/O.

Bart.
Re: [PATCH v3 2/3] block: convert to device_add_disk()
On 06/17/2016 06:34 PM, Dan Williams wrote:
> diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
> index ef6b4d960bad..4e7cb5df8e43 100644
> --- a/arch/um/drivers/ubd_kern.c
> +++ b/arch/um/drivers/ubd_kern.c
> @@ -823,12 +823,11 @@ static int ubd_disk_register(int major, u64 size, int unit,
>  		ubd_devs[unit].pdev.dev.release = ubd_device_release;
>  		dev_set_drvdata(&ubd_devs[unit].pdev.dev, &ubd_devs[unit]);
>  		platform_device_register(&ubd_devs[unit].pdev);
> -		disk->driverfs_dev = &ubd_devs[unit].pdev.dev;
>  	}
>
>  	disk->private_data = &ubd_devs[unit];
>  	disk->queue = ubd_devs[unit].queue;
> -	add_disk(disk);
> +	device_add_disk(&ubd_devs[unit].pdev.dev, disk);
>
>  	*disk_out = disk;
>  	return 0;

Have you reviewed this patch yourself? I think the above change is an unintended behavior change.

Bart.
Re: [PATCH v3 1/3] block: introduce device_add_disk()
On 06/17/2016 06:34 PM, Dan Williams wrote:
> In preparation for removing the ->driverfs_dev member of a gendisk, add
> an API that takes the parent device as a parameter to add_disk(). For
> now this maintains the status quo of WARN()ing on failure, but does not
> return an error code.

Reviewed-by: Bart Van Assche
Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
On 06/20/2016 02:22 PM, Christoph Hellwig wrote:
> On Thu, Jun 16, 2016 at 05:39:07PM +0200, Bart Van Assche wrote:
>> On 06/16/2016 05:20 PM, Christoph Hellwig wrote:
>>> On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
>>>> My concern is that I doubt that there is an interrupt assignment
>>>> scheme that works optimally for all workloads. Hence my request to
>>>> preserve the ability to modify interrupt affinity from user space.
>>>
>>> I'd say let's do such an interface incrementally based on the use
>>> case - especially after we get networking over to use common code to
>>> distribute the interrupts.
>>>
>>> If you were doing something like this with the current blk-mq code it
>>> wouldn't work very well due to the fact that you'd have a mismatch
>>> between the assigned interrupt and the blk-mq queue mapping anyway.
>
> It might be a good idea to start brainstorming how we'd want to handle
> this change - we'd basically need a per-device notification that the
> interrupt mapping changes so that we can rebuild the queue mapping,
> which is somewhat similar to the lib/cpu_rmap.c code used by a few
> networking drivers. This would also help with dealing with cpu hotplug
> events that change the cpu mapping.

A notification mechanism that reports interrupt mapping changes will definitely help. What would also help is an API that allows drivers to query the MSI-X IRQ of an adapter that is nearest given a cpumask, e.g. hctx->cpumask. Another function can then map that IRQ into an index in the range 0..n-1 where n is the number of MSI-X interrupts for that adapter. Every blk-mq/scsi-mq driver will need this functionality to decide which IRQ to associate with a block layer hctx.

Bart.
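A hedged sketch of the kind of helper described above; all example_* names are hypothetical and this is not an existing kernel API:

/*
 * Hypothetical helper: return the index (0..msix_count-1) of the MSI-X
 * vector whose affinity mask overlaps the given cpumask, e.g. a blk-mq
 * hctx->cpumask.
 */
static int example_msix_index_for_cpumask(const struct example_adapter *ha,
					  const struct cpumask *mask)
{
	int i;

	for (i = 0; i < ha->msix_count; i++)
		if (cpumask_intersects(&ha->msix_affinity[i], mask))
			return i;
	return 0;	/* fall back to the first vector */
}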
Re: [PATCH] scsi: libfc: fix seconds_since_last_reset calculation
On 06/17/2016 05:47 PM, Arnd Bergmann wrote:
> -	jiffies_to_timespec(jiffies, &v0);
> -	jiffies_to_timespec(lport->boot_time, &v1);
> -	fc_stats->seconds_since_last_reset = (v0.tv_sec - v1.tv_sec);
> +	fc_stats->seconds_since_last_reset = (lport->boot_time - jiffies) / HZ;

Does this patch introduce a 64-bit division? There might still be 32-bit users of this code.

Bart.
Re: [PATCH] scsi: libfc: fix seconds_since_last_reset calculation
On 06/20/2016 04:48 PM, Arnd Bergmann wrote:
> On Monday, June 20, 2016 3:54:06 PM CEST Bart Van Assche wrote:
>> On 06/17/2016 05:47 PM, Arnd Bergmann wrote:
>>> -	jiffies_to_timespec(jiffies, &v0);
>>> -	jiffies_to_timespec(lport->boot_time, &v1);
>>> -	fc_stats->seconds_since_last_reset = (v0.tv_sec - v1.tv_sec);
>>> +	fc_stats->seconds_since_last_reset = (lport->boot_time - jiffies) / HZ;
>>
>> Does this patch introduce a 64-bit division? There might still be
>> 32-bit users of this code.
>
> No, both lport->boot_time and jiffies are 'unsigned long'.

Ah, you're right. Hence:

Reviewed-by: Bart Van Assche
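One more point about jiffies arithmetic that the thread does not touch on: wrap-around. Unsigned subtraction handles a single wrap correctly as long as the difference is computed as 'now - start', as in the sketch below (note that the quoted patch computes 'start - now', the reverse operand order):

	/* Wrap-safe elapsed time since the port was (re)initialized:
	 * with unsigned arithmetic, 'jiffies - start' stays correct
	 * even if the jiffies counter wrapped once in between. */
	unsigned long elapsed = jiffies - lport->boot_time;

	fc_stats->seconds_since_last_reset = elapsed / HZ;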
Re: [RFC] IB/srp: Remove create_workqueue
On 06/20/2016 08:59 PM, Tejun Heo wrote:
> On Tue, Jun 07, 2016 at 01:00:13PM -0700, Bart Van Assche wrote:
>>>> srp_remove_wq is used for SRP target port removal work only. This
>>>> work is neither queued from inside a shrinker nor by the page
>>>> writeback code so I think it is safe to drop WQ_MEM_RECLAIM.
>>>
>>> It should be able to use system_wq then.
>>
>> No. I have tried that but that resulted in a deadlock. See also
>> commit bcc059103591 for the details.
>
> So, create_workqueue() limits concurrency to 1 per cpu and if you have
> a dependency between two work items and they get scheduled on the same
> cpu they can deadlock. system_wq doesn't have that restriction and
> should be fine, AFAICS.

Agreed, as long as WQ_DFL_ACTIVE is not reduced from its current value (256) to a very low value (e.g. 1 or 2). This assumption should be documented but I'm not sure what the best way is to document this...

Bart.
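A hedged sketch of the concurrency difference Tejun describes; the example_* names are illustrative and not taken from the SRP driver:

#include <linux/workqueue.h>

/*
 * With max_active = 1, two dependent work items that land on the same
 * CPU can deadlock against each other; with max_active = 0 the queue
 * uses WQ_DFL_ACTIVE (currently 256) concurrent work items per CPU, so
 * a work item that waits for another work item does not block it.
 * Error unwinding is omitted for brevity.
 */
static struct workqueue_struct *narrow_wq;	/* concurrency 1 */
static struct workqueue_struct *wide_wq;	/* concurrency WQ_DFL_ACTIVE */

static int __init example_init(void)
{
	narrow_wq = alloc_workqueue("example_narrow", WQ_MEM_RECLAIM, 1);
	wide_wq = alloc_workqueue("example_wide", 0, 0);
	if (!narrow_wq || !wide_wq)
		return -ENOMEM;
	return 0;
}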
Re: [dm-devel] [RFC][PATCH] dm: Remove dm_bufio_cond_resched()
On 09/23/16 00:34, Peter Zijlstra wrote:
> Is anybody still using PREEMPT_NONE? Most workloads also care about
> latency to some extent. Lots of code has explicit cond_resched() and
> doesn't worry.

From a SLES11 system:

$ grep PREEMPT_NONE /boot/config-3.0.101-0.47.67-default
CONFIG_PREEMPT_NONE=y

From a SLES12 system:

$ grep CONFIG_PREEMPT_NONE /boot/config-4.4.16-56-default
CONFIG_PREEMPT_NONE=y

Bart.
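For context, the pattern Peter refers to looks like this (an illustrative sketch, not the actual dm-bufio code):

	/* Under CONFIG_PREEMPT_NONE the kernel never preempts code
	 * running in kernel mode, so long loops have to yield the CPU
	 * explicitly to keep scheduling latency bounded: */
	list_for_each_entry(b, &client->buffers, list) {
		process_buffer(b);	/* potentially long-running */
		cond_resched();		/* explicit scheduling point */
	}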
[PATCH] sched: Change READ_ONCE(jiffies) into jiffies
A quote from Documentation/memory-barriers.txt:

	For example, because 'jiffies' is marked volatile, it is never
	necessary to say READ_ONCE(jiffies).

The reason for this is that READ_ONCE() and WRITE_ONCE() are implemented as volatile casts, which have no effect when the argument is already marked volatile. Hence change all 'READ_ONCE(jiffies)' occurrences into 'jiffies'.

Signed-off-by: Bart Van Assche
Cc: Peter Zijlstra
Cc: Oleg Nesterov
---
 kernel/sched/core.c    | 2 +-
 kernel/sched/cputime.c | 4 ++--
 kernel/sched/fair.c    | 8 ++++----
 kernel/sched/wait.c    | 6 ++++--
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44817c6..edb811e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3100,7 +3100,7 @@ void scheduler_tick(void)
 u64 scheduler_tick_max_deferment(void)
 {
 	struct rq *rq = this_rq();
-	unsigned long next, now = READ_ONCE(jiffies);
+	unsigned long next, now = jiffies;
 
 	next = rq->last_sched_tick + HZ;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..9aecaad 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -691,7 +691,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static cputime_t vtime_delta(struct task_struct *tsk)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long now = jiffies;
 
 	if (time_before(now, (unsigned long)tsk->vtime_snap))
 		return 0;
@@ -701,7 +701,7 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long now = jiffies;
 	cputime_t delta, other;
 
 	/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34..db5ae54 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4802,7 +4802,7 @@ static void cpu_load_update_idle(struct rq *this_rq)
 	if (weighted_cpuload(cpu_of(this_rq)))
 		return;
 
-	cpu_load_update_nohz(this_rq, READ_ONCE(jiffies), 0);
+	cpu_load_update_nohz(this_rq, jiffies, 0);
 }
 
 /*
@@ -4828,7 +4828,7 @@ void cpu_load_update_nohz_start(void)
  */
 void cpu_load_update_nohz_stop(void)
 {
-	unsigned long curr_jiffies = READ_ONCE(jiffies);
+	unsigned long curr_jiffies = jiffies;
 	struct rq *this_rq = this_rq();
 	unsigned long load;
@@ -4851,7 +4851,7 @@ static void cpu_load_update_periodic(struct rq *this_rq, unsigned long load)
 {
 #ifdef CONFIG_NO_HZ_COMMON
 	/* See the mess around cpu_load_update_nohz(). */
-	this_rq->last_load_update_tick = READ_ONCE(jiffies);
+	this_rq->last_load_update_tick = jiffies;
 #endif
 	cpu_load_update(this_rq, load, 1);
 }
@@ -4864,7 +4864,7 @@ void cpu_load_update_active(struct rq *this_rq)
 	unsigned long load = weighted_cpuload(cpu_of(this_rq));
 
 	if (tick_nohz_tick_stopped())
-		cpu_load_update_nohz(this_rq, READ_ONCE(jiffies), load);
+		cpu_load_update_nohz(this_rq, jiffies, load);
 	else
 		cpu_load_update_periodic(this_rq, load);
 }
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f15d6b6..af4959d 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -601,7 +601,8 @@ EXPORT_SYMBOL(bit_wait_io);
 
 __sched int bit_wait_timeout(struct wait_bit_key *word, int mode)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long now = jiffies;
+
 	if (time_after_eq(now, word->timeout))
 		return -EAGAIN;
 	schedule_timeout(word->timeout - now);
@@ -613,7 +614,8 @@ EXPORT_SYMBOL_GPL(bit_wait_timeout);
 
 __sched int bit_wait_io_timeout(struct wait_bit_key *word, int mode)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long now = jiffies;
+
 	if (time_after_eq(now, word->timeout))
 		return -EAGAIN;
 	io_schedule_timeout(word->timeout - now);
-- 
2.10.0
Re: [PATCH] sched: Change READ_ONCE(jiffies) into jiffies
On 09/26/16 01:55, Peter Zijlstra wrote:
> On Sun, Sep 25, 2016 at 07:08:45PM -0700, Bart Van Assche wrote:
>> A quote from Documentation/memory-barriers.txt:
>>
>>	For example, because 'jiffies' is marked volatile, it is never
>>	necessary to say READ_ONCE(jiffies).
>>
>> The reason for this is that READ_ONCE() and WRITE_ONCE() are
>> implemented as volatile casts, which have no effect when the argument
>> is already marked volatile. Hence change all 'READ_ONCE(jiffies)'
>> occurrences into 'jiffies'.
>
> So I'm not a huge fan of this patch. Yes it's technically pointless,
> but it does convey intent. See the READ_ONCE() as a comment if you
> will.

Hello Peter,

Are you aware that the scheduler code is the only code that uses READ_ONCE() to read 'jiffies', and that even in the scheduler not all 'jiffies' reads use READ_ONCE()? This patch makes the scheduler code more consistent and easier to read without affecting the generated code.

Bart.
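The "volatile cast" wording can be made concrete with a simplified sketch of READ_ONCE(); the real definition in <linux/compiler.h> is more elaborate:

/* One-word essence of READ_ONCE(): force a volatile access. */
#define SKETCH_READ_ONCE(x) (*(const volatile typeof(x) *)&(x))

extern volatile unsigned long jiffies;

void example(void)
{
	unsigned long a = SKETCH_READ_ONCE(jiffies); /* volatile load */
	unsigned long b = jiffies;	/* also a volatile load, since
					 * jiffies is already declared
					 * volatile: the cast adds nothing */
	(void)a; (void)b;
}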
Re: [PATCH 1/2] wd719x: Remove last declaration using DEFINE_PCI_DEVICE_TABLE
On 08/31/16 09:19, Joe Perches wrote: > Convert it to the preferred const struct pci_device_id instead. Reviewed-by: Bart Van Assche
Re: Block-level access
On 09/01/2016 02:48 PM, Alex Austin wrote:
> CC'ing linux-scsi and linux-block. Also, please CC me in replies.
>
> On Thu, Sep 1, 2016 at 4:43 PM, Alex Austin wrote:
>> Hello,
>>
>> What is the most performant way to directly interface with an attached
>> hard drive? I've so far used read()/write() on /dev/sd_ but I find
>> error handling exceedingly difficult, as I don't always even get
>> errors reported, even if the open() call includes O_DIRECT. I've also
>> used ioctl(SG_IO), but find that it's extremely slow due to the lack
>> of queuing support in the API. Is there a mid-level API that will get
>> me decent error handling while allowing command queuing, or do I just
>> need to make multiple threads all doing separate SG_IO ioctls?

The most efficient way to interface with a hard drive depends on the I/O pattern. Are you aware that buffered I/O performs read-ahead? Have you already tried to tune the read-ahead setting (blockdev --getra / blockdev --setra)? BTW, if you need an example of how to queue SG_IO, you are welcome to have a look at the fio source code. You will probably be interested in the bsg API. See e.g. https://lwn.net/Articles/174469/.

Bart.
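To make the SG_IO part concrete, here is a minimal synchronous example -- a sketch with most error handling omitted, not production code; fd must refer to an sg or block device opened by the caller. For queued submission the bsg interface mentioned above is a better starting point:

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Send TEST UNIT READY via SG_IO; returns the SCSI status byte or -1. */
int test_unit_ready(int fd)
{
	unsigned char cdb[6] = { 0x00 };	/* TEST UNIT READY */
	unsigned char sense[32];
	struct sg_io_hdr hdr;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.dxfer_direction = SG_DXFER_NONE;	/* no data transfer */
	hdr.sbp = sense;			/* sense buffer */
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 5000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &hdr) < 0)
		return -1;
	return hdr.status;
}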
Re: Observing Softlockup's while running heavy IOs
On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:
> I reduced the ISR workload by one third in order to reduce the time
> that is spent per CPU in interrupt context; even then I am observing
> softlockups. As I mentioned before, only the same single CPU in the set
> of CPUs (enabled in affinity_hint) is busy with handling the interrupts
> from the corresponding IRQx.
>
> I have done the below experiment in the driver to limit these
> softlockups/hardlockups, but I am not sure whether it is reasonable to
> do this in the driver.
>
> Experiment: if CPUx is continuously busy in the same ISR context with
> handling the I/O work of the remote CPUs (enabled in the corresponding
> IRQ's affinity_hint), amounting to 1/4th of the HBA queue depth, then
> enable a flag called 'change_smp_affinity' for this IRQ. Also create a
> thread which polls this flag for every IRQ (enabled by the driver)
> every second. If this thread sees that the flag is enabled for any IRQ,
> it writes the next CPU number from the CPUs enabled in the IRQ's
> affinity_hint to the IRQ's smp_affinity procfs attribute using the
> call_usermodehelper() API. This is to make sure that interrupts are not
> processed by the same single CPU all the time, and to make the other
> CPUs handle the interrupts if the current CPU is continuously busy with
> handling the other CPUs' I/O interrupts.
>
> For example, consider a system which has 8 logical CPUs and one MSI-X
> vector enabled (called IRQ 120) in the driver, and an HBA queue depth
> of 8K. Then the IRQ's procfs attributes will be:
>
>     IRQ# 120, affinity_hint=0xff, smp_affinity=0x00
>
> After starting heavy I/O, we will observe that only CPU0 is busy with
> handling the interrupts. This experimental driver changes smp_affinity
> to the next CPU number, i.e. 0x01 (using the command 'echo 0x01 >
> /proc/irq/120/smp_affinity', issued through the call_usermodehelper()
> API), if it observes that CPU0 is continuously processing more than 2K
> of I/O replies for the other CPUs, i.e. for CPU1 to CPU7.
>
> Is it OK to do this kind of thing in the driver?

Hello Sreekanth,

To me this sounds like something that should be implemented in the I/O chipset on the motherboard. If you have a look at the Intel Software Developer Manuals you will see that logical destination mode supports round-robin interrupt delivery. However, the Linux kernel selects physical destination mode on systems with more than eight logical CPUs (see also arch/x86/kernel/apic/apic_flat_64.c). I'm not sure the maintainers of the interrupt subsystem would welcome code that emulates round-robin interrupt delivery. So your best option is probably to minimize the amount of work that is done in interrupt context and to move as much work as possible out of interrupt context, in such a way that it can be spread over multiple CPU cores, e.g. by using queue_work_on().

Bart.
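A hedged sketch of the queue_work_on() approach; all example_* names are illustrative and not taken from any existing driver:

struct example_reply_batch {
	struct work_struct work;	/* set up with INIT_WORK() */
	/* ... reply descriptors claimed by the ISR ... */
};

static void example_complete_replies(struct work_struct *work)
{
	struct example_reply_batch *rb =
		container_of(work, struct example_reply_batch, work);

	/* Complete the I/O requests described by rb here, in process
	 * context, instead of in the hard interrupt handler. */
}

static irqreturn_t example_isr(int irq, void *arg)
{
	/* Hypothetical: detach the pending replies from the hardware
	 * reply queue so they can be processed later. */
	struct example_reply_batch *rb = example_claim_replies(arg);

	if (!rb)
		return IRQ_NONE;

	/* Run the completion work on the CPU that submitted the I/O
	 * (or according to some other policy) so that a single CPU is
	 * not stuck with all of the completion processing. */
	queue_work_on(example_target_cpu(rb), system_wq, &rb->work);
	return IRQ_HANDLED;
}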
Re: [PATCH] trivial treewide: Convert dev_set_uevent_suppress argument to bool
On 09/01/16 13:11, Joe Perches wrote:
> Assigning an int to a bitfield:1 can lose precision.
> Change the caller argument uses from 1/0 to true/false.

Hello Joe,

Can you clarify how assigning 0 or 1 to a one-bit bitfield can cause a loss of precision?

Thanks,

Bart.
Re: [PATCH] trivial treewide: Convert dev_set_uevent_suppress argument to bool
On 09/01/16 17:51, Joe Perches wrote:
> On Fri, 2016-09-02 at 00:47 +0000, Bart Van Assche wrote:
>> On 09/01/16 13:11, Joe Perches wrote:
>>> Assigning an int to a bitfield:1 can lose precision.
>>> Change the caller argument uses from 1/0 to true/false.
>>
>> Hello Joe,
>
> Hi Bart.
>
>> Can you clarify how assigning 0 or 1 to a one-bit bitfield can cause a
>> loss of precision?
>
> There are no existing defects.
>
> Using 1/0 is not a loss of precision, it's just
> changing to use bool avoids potential errors and
> promotes consistency.
>
> Other uses of this function already use true/false.

Hello Joe,

In the patch description you refer to loss of precision. However, your patch does not address any loss of precision issues. So I think that the patch description is misleading and could be made more clear.

Bart.
Re: [PATCH] iscsi-target: fix spelling mistake "Unsolicitied" -> "Unsolicited"
On 09/02/2016 07:32 AM, Colin King wrote: Trivial fix to spelling mistakes in pr_debug message and comments Reviewed-by: Bart Van Assche
Re: [PATCH] trivial treewide: Convert dev_set_uevent_suppress argument to bool
On 09/02/2016 08:41 AM, Joe Perches wrote:
> On Fri, 2016-09-02 at 13:41 +0000, Bart Van Assche wrote:
>> On 09/01/16 17:51, Joe Perches wrote:
>>> On Fri, 2016-09-02 at 00:47 +0000, Bart Van Assche wrote:
>>>> On 09/01/16 13:11, Joe Perches wrote:
>>>>> Assigning an int to a bitfield:1 can lose precision.
>>>>> Change the caller argument uses from 1/0 to true/false.
>>>>
>>>> Can you clarify how assigning 0 or 1 to a one-bit bitfield can
>>>> cause a loss of precision?
>>>
>>> There are no existing defects.
>>>
>>> Using 1/0 is not a loss of precision, it's just
>>> changing to use bool avoids potential errors and
>>> promotes consistency.
>>>
>>> Other uses of this function already use true/false.
>>
>> In the patch description you refer to loss of precision. However,
>> your patch does not address any loss of precision issues. So I think
>> that the patch description is misleading and could be made more clear.
>
> I tend towards terse being better than verbose.
> The original patch description says "no change to objects".
> What would you suggest?

Hello Joe,

How about the following:

	dev_set_uevent_suppress() expects a boolean as second argument.
	Make this clear by passing true/false instead of 1/0 as the
	second argument.

Bart.
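For what it's worth, the one case where a one-bit bitfield really does behave surprisingly is the signed one, as this small userspace illustration shows; it is a generic sketch, unrelated to the field behind dev_set_uevent_suppress():

#include <stdbool.h>
#include <stdio.h>

struct flags {
	int s : 1;		/* signed: can only hold 0 and -1 */
	unsigned int u : 1;	/* unsigned: holds 0 and 1 */
	bool b : 1;		/* bool: holds false and true */
};

int main(void)
{
	struct flags f = { .s = 1, .u = 1, .b = true };

	/* On typical ABIs this prints "-1 1 1": the value 1 does not
	 * survive the round trip through the signed one-bit field. */
	printf("%d %u %d\n", f.s, f.u, f.b);
	return 0;
}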
Re: [PATCH] sched: Avoid that __wait_on_bit_lock() hangs
On 08/04/2016 07:09 AM, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 02:51:23PM -0700, Bart Van Assche wrote:
>> So I started testing the patch below that should fix the same hang but
>> without triggering any wait list corruption.
>>
>> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>> index f15d6b6..4e3f651 100644
>> --- a/kernel/sched/wait.c
>> +++ b/kernel/sched/wait.c
>> @@ -282,7 +282,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
>>  	spin_lock_irqsave(&q->lock, flags);
>>  	if (!list_empty(&wait->task_list))
>>  		list_del_init(&wait->task_list);
>> -	else if (waitqueue_active(q))
>> +	if (waitqueue_active(q))
>>  		__wake_up_locked_key(q, mode, key);
>>  	spin_unlock_irqrestore(&q->lock, flags);
>>  }
>
> So the problem with this patch is that it will violate the nr_exclusive
> semantics in that it can result in too many wakeups -- which is a much
> less severe (typically harmless) issue.
>
> We now always wake up the next waiter, even if there wasn't an actual
> wakeup we raced against. And if we then also get a wakeup, we can end
> up with 2 woken tasks (instead of the nr_exclusive=1).
>
> Now, since wait loops must all deal with spurious wakeups, this ends up
> as harmless overhead.

How about adding a fifth argument to abort_exclusive_wait() that indicates whether or not the "if (waitqueue_active(q)) __wake_up_locked_key(q, mode, key)" code should be executed? __wait_event() could pass "condition" as the fifth argument when calling abort_exclusive_wait().

> But I'd still like to understand where we lose the wakeup.

My assumption is that __wake_up_common() and signal delivery happen concurrently, that __wake_up_common() wakes up bit_wait_io() and that signal delivery happens after bit_wait_io() has been woken up but before it tests the signal pending state.

Bart.
[PATCH v2 0/3] sched: Avoid that __wait_on_bit_lock() hangs
Hello Ingo,

This patch series addresses an issue that my SRP driver test software can trigger reproducibly, namely that __wait_on_bit_lock() hangs. Please review and apply these patches whenever this is convenient for you.

The changes compared to the first version of this series are:
* Reworked the abort_exclusive_wait() changes.
* Added two patches, namely a patch that introduces a local variable and a patch (nr. 3) that eliminates again the spurious wakeups introduced by patch 1.

These patches have been tested on top of kernel v4.7 with kernel debugging enabled (detection of list corruption, lockdep, SLUB debugging, kmemleak, ...). See also https://lkml.org/lkml/2016/8/3/289.

Thanks,

Bart.
[PATCH v2 2/3] sched: Introduce a local variable in abort_exclusive_wait()
This patch does not change any functionality.

Signed-off-by: Bart Van Assche
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Oleg Nesterov
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Neil Brown
Cc: Michael Shaver
---
 kernel/sched/wait.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index fa12939..dcc7693 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -429,18 +429,20 @@ int __sched
 __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 			wait_bit_action_f *action, unsigned mode)
 {
+	struct wait_bit_key *const key = &q->key;
+
 	do {
 		int ret;
 
 		prepare_to_wait_exclusive(wq, &q->wait, mode);
-		if (!test_bit(q->key.bit_nr, q->key.flags))
+		if (!test_bit(key->bit_nr, key->flags))
 			continue;
-		ret = action(&q->key, mode);
+		ret = action(key, mode);
 		if (!ret)
 			continue;
-		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
+		abort_exclusive_wait(wq, &q->wait, mode, key);
 		return ret;
-	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
+	} while (test_and_set_bit(key->bit_nr, key->flags));
 	finish_wait(wq, &q->wait);
 	return 0;
 }
-- 
2.9.2
[PATCH v2 1/3] sched: Avoid that __wait_on_bit_lock() hangs
If delivery of a signal and __wake_up_common() happen concurrently it is possible that the signal is delivered after __wake_up_common() woke up the affected task and before bit_wait_io() checks whether a signal is pending. Avoid that the next waiter is not woken up if this happens.

This patch fixes the following hang:

INFO: task systemd-udevd:10111 blocked for more than 480 seconds.
      Not tainted 4.7.0-dbg+ #1
Call Trace:
 [] schedule+0x37/0x90
 [] schedule_timeout+0x27f/0x470
 [] io_schedule_timeout+0x9f/0x110
 [] bit_wait_io+0x16/0x60
 [] __wait_on_bit_lock+0x49/0xa0
 [] __lock_page+0xb9/0xc0
 [] truncate_inode_pages_range+0x3e0/0x760
 [] truncate_inode_pages+0x10/0x20
 [] kill_bdev+0x30/0x40
 [] __blkdev_put+0x71/0x360
 [] blkdev_put+0x49/0x170
 [] blkdev_close+0x20/0x30
 [] __fput+0xe8/0x1f0
 [] fput+0x9/0x10
 [] task_work_run+0x83/0xb0
 [] do_exit+0x3ee/0xc40
 [] do_group_exit+0x4b/0xc0
 [] get_signal+0x2ca/0x940
 [] do_signal+0x23/0x660
 [] exit_to_usermode_loop+0x73/0xb0
 [] syscall_return_slowpath+0xb0/0xc0
 [] entry_SYSCALL_64_fastpath+0xa6/0xa8

Fixes: 777c6c5f1f6e ("wait: prevent exclusive waiter starvation")
References: https://lkml.org/lkml/2012/8/24/185
References: http://www.spinics.net/lists/raid/msg53056.html
Signed-off-by: Bart Van Assche
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Oleg Nesterov
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Neil Brown
Cc: Michael Shaver
Cc: # v2.6.29+
---
 kernel/sched/wait.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f15d6b6..fa12939 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -266,12 +266,16 @@ EXPORT_SYMBOL(finish_wait);
  * the wait descriptor from the given waitqueue if still
  * queued.
  *
- * Wakes up the next waiter if the caller is concurrently
- * woken up through the queue.
- *
- * This prevents waiter starvation where an exclusive waiter
- * aborts and is woken up concurrently and no one wakes up
- * the next waiter.
+ * Wakes up the next waiter to prevent waiter starvation
+ * when an exclusive waiter aborts and is woken up
+ * concurrently and no one wakes up the next waiter. Note:
+ * even when a signal is pending it is possible that
+ * __wake_up_common() wakes up the current thread and hence
+ * that @wait has been removed from the wait queue @q. Hence
+ * test whether there are more waiters on the wait queue
+ * even if @wait is not on the wait queue @q. This approach
+ * will cause a spurious wakeup if a signal is delivered and
+ * no other thread calls __wake_up_common() concurrently.
  */
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 			unsigned int mode, void *key)
@@ -282,7 +286,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	spin_lock_irqsave(&q->lock, flags);
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
-	else if (waitqueue_active(q))
+	if (waitqueue_active(q))
 		__wake_up_locked_key(q, mode, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
-- 
2.9.2
[PATCH v2 3/3] sched: Avoid that abort_exclusive_wait() triggers spurious wakeups
Patch "sched: Avoid that __wait_on_bit_lock() hangs" made spurious wakeups possible. Avoid to trigger such spurious wakeups. Signed-off-by: Bart Van Assche Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Oleg Nesterov Cc: Andrew Morton Cc: Johannes Weiner Cc: Neil Brown Cc: Michael Shaver --- include/linux/wait.h | 5 +++-- kernel/sched/wait.c | 21 - 2 files changed, 11 insertions(+), 15 deletions(-) diff --git a/include/linux/wait.h b/include/linux/wait.h index 27d7a0a..5034fce 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -282,7 +282,7 @@ wait_queue_head_t *bit_waitqueue(void *, int); __ret = __int; \ if (exclusive) {\ abort_exclusive_wait(&wq, &__wait, \ -state, NULL); \ + condition, state, NULL);\ goto __out; \ } \ break; \ @@ -976,7 +976,8 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state); void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state); long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state); void finish_wait(wait_queue_head_t *q, wait_queue_t *wait); -void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key); +void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, + bool wake_up_next, unsigned int mode, void *key); long wait_woken(wait_queue_t *wait, unsigned mode, long timeout); int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key); int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key); diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index dcc7693..c5a913f 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -259,6 +259,7 @@ EXPORT_SYMBOL(finish_wait); * abort_exclusive_wait - abort exclusive waiting in a queue * @q: waitqueue waited on * @wait: wait descriptor + * @wake_up_next: Whether or not to wake up another exclusive waiter * @mode: runstate of the waiter to be woken * @key: key to identify a wait bit queue or %NULL * @@ -266,19 +267,12 @@ EXPORT_SYMBOL(finish_wait); * the wait descriptor from the given waitqueue if still * queued. * - * Wakes up the next waiter to prevent waiter starvation - * when an exclusive waiter aborts and is woken up - * concurrently and no one wakes up the next waiter. Note: - * even when a signal is pending it is possible that - * __wake_up_common() wakes up the current thread and hence - * that @wait has been removed from the wait queue @q. Hence - * test whether there are more waiters on the wait queue - * even if @wait is not on the wait queue @q. This approach - * will cause a spurious wakeup if a signal is delivered and - * no other thread calls __wake_up_common() concurrently. + * Wakes up the next waiter if @wake_up_next is set to prevent + * waiter starvation when an exclusive waiter aborts and is + * woken up concurrently and no one wakes up the next waiter. 
*/ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, - unsigned int mode, void *key) + bool wake_up_next, unsigned int mode, void *key) { unsigned long flags; @@ -286,7 +280,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, spin_lock_irqsave(&q->lock, flags); if (!list_empty(&wait->task_list)) list_del_init(&wait->task_list); - if (waitqueue_active(q)) + if (wake_up_next && waitqueue_active(q)) __wake_up_locked_key(q, mode, key); spin_unlock_irqrestore(&q->lock, flags); } @@ -440,7 +434,8 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q, ret = action(key, mode); if (!ret) continue; - abort_exclusive_wait(wq, &q->wait, mode, key); + abort_exclusive_wait(wq, &q->wait, + test_bit(key->bit_nr, key->flags), mode, key); return ret; } while (test_and_set_bit(key->bit_nr, key->flags)); finish_wait(wq, &q->wait); -- 2.9.2
Re: [PATCH v2 1/3] sched: Avoid that __wait_on_bit_lock() hangs
On 08/05/16 16:09, Bart Van Assche wrote:
> If delivery of a signal and __wake_up_common() happen concurrently it
> is possible that the signal is delivered after __wake_up_common() woke
> up the affected task and before bit_wait_io() checks whether a signal
> is pending. Avoid that the next waiter is not woken up if this happens.

(replying to my own e-mail)

Although this patch works reliably in my tests, the above description does not explain the hang. I will resend this patch with a better description.

Bart.