Re: Improving AIO cancellation
On 02/08/13 04:42, Anatol Pomozov wrote:
> Hi,
>
> At Google we have several applications that heavily use asynchronous IO. One thing our userspace developers need is effective AIO cancellation. You might say "sure, use the io_cancel syscall". Well, while it cancels AIO requests, it does so ineffectively. Currently (I am looking at linux-next) io_cancel only marks the kiocb as cancelled; the bios will still be issued to the device even after the kiocb was cancelled. Say you have a congested device and want to cancel some AIO requests - io_cancel will not make the situation any better. We would like to see more resource-effective AIO cancellation. I had a discussion with Ted Ts'o and Kent Overstreet about improving this situation and would like to share the ideas with you, the Linux community.
>
> Once direct async IO has been submitted, the request can be at one of several stages:
> 1) Sitting in the kernel request queue of a congested device
> 2) Sent to the device and sitting in the device queue (if NCQ is enabled)
> 3) Executing on the device
>
> Ideally we would be able to cancel an IO request at any of these stages, but currently we are especially interested in case #1. I do not know whether cancellation at stages #2 and #3 is possible and/or reasonable.
>
> BTW, AIO cancellation makes sense only for direct IO. Buffered AIO will end up in the page cache soon and the kiocb will be marked as completed; later (maybe much later) writeback will flush those buffers to disk, but that can no longer be cancelled.
>
> Yet another thing to remember is md/RAID. Some RAID types guarantee stripe consistency: when md splits a WRITE across disks, either all or none of the child requests should complete. If we do a partial write then the on-disk data becomes inconsistent.
>
> Ted and Kent suggested the following solution: every time we make forward progress with a request/bio, we check its status; if the user has cancelled the request we simply skip the bio. This covers case #1. A draft implementation looks like this. struct bio needs some way to query the current status of the kiocb that generated the bio, so we add a pointer to a bool flag:
>
>     struct bio {
>             bool *cancelled;
>     }
>
> In the async DIO codepath this pointer is initialized with the flag in struct kiocb:
>
>     bio->cancelled = &kiocb->cancelled;
>
> except in md: if it is RAID5 and we perform a WRITE request, we do not initialize this pointer. Whenever we make forward progress with a request/bio, we check its cancellation status:
>
>     if (bio->cancelled && *bio->cancelled)
>             goto do_not_process_bio_because_it_cancelled;
>
> So to cancel a kiocb we do
>
>     kiocb->cancelled = true;
>
> and none of the bios created from the request will be sent to the device anymore.
>
> The solution seems straightforward, but I would like to hear whether there are other ways to make AIO cancellation better. Does the suggested implementation look good? Are there better solutions? What about cancelling requests that have already been sent to the device? If the proposal is fine then I will start implementing it.

Hello Anatol,

Have you already noticed this message: http://marc.info/?l=linux-fsdevel&m=136024044202362 ?

Bart.
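A self-contained userspace sketch of the kiocb/bio linkage drafted above may make the idea easier to experiment with (all names are illustrative; this is not actual kernel code):

    #include <stdbool.h>

    /* Models the submitter's kiocb with its cancellation flag. */
    struct kiocb_sketch {
            bool cancelled;
    };

    /* Models a bio. cancelled stays NULL for bios that must not be
     * skipped, e.g. RAID5 writes whose siblings may already be on disk. */
    struct bio_sketch {
            bool *cancelled;
    };

    /* The check performed wherever the bio would make forward progress. */
    static bool bio_is_cancelled(const struct bio_sketch *bio)
    {
            return bio->cancelled && *bio->cancelled;
    }

    int main(void)
    {
            struct kiocb_sketch iocb = { .cancelled = false };
            struct bio_sketch bio = { .cancelled = &iocb.cancelled };

            iocb.cancelled = true;  /* what io_cancel() would do ... */
            /* ... after which every forward-progress point skips the bio. */
            return bio_is_cancelled(&bio) ? 0 : 1;
    }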
Re: How to online remove an error scsi disk from the system?
On 02/01/13 07:13, Tao Ma wrote:
> In our production system we have several SATA disks attached to one machine. When one of the disks fails, jbd2 (yes, we use ext4) hangs forever and we get messages like the one below in /var/log/messages. It seems to me that the I/O sent to the SCSI layer is never returned with -EIO, which surprises me a little (there should be a timeout somewhere, right?). We have tried echo "offline" > /sys/block/sdl/device/state, but it doesn't work. So is there any way for us to make the SCSI device return all outstanding I/O requests with -EIO so that all the end_io callbacks are invoked accordingly? Am I missing something here?

Please note that I'm not familiar with SAS. But I found this in drivers/scsi/scsi_proc.c:

 * proc_scsi_write - handle writes to /proc/scsi/scsi
 * @file: not used
 * @buf: buffer to write
 * @length: length of buf, at most PAGE_SIZE
 * @ppos: not used
 *
 * Description: this provides a legacy mechanism to add or remove
 * devices by Host, Channel, ID, and Lun.  To use,
 * "echo 'scsi add-single-device 0 1 2 3' > /proc/scsi/scsi" or
 * "echo 'scsi remove-single-device 0 1 2 3' > /proc/scsi/scsi" with
 * "0 1 2 3" replaced by the Host, Channel, Id, and Lun.

Bart.
Re: [PATCH 0/5] Fix bugs in ib_srp patches for H.A. purposes
On 08/31/12 20:00, dongsu.p...@profitbricks.com wrote:
> This patchset aims at fixing bugs that have been discovered in our own
> SRP test environment so far. These patches are based on your patchset v4,
> "Make ib_srp better suited for H.A. purposes" (09 Aug 2012).
>
> The 5th patch, "fix an error accessing invalid memory in
> rport_dev_loss_timedout", includes your suggestion (30 Aug 2012).
>
> You can also pull the following git repo to get the patches.
>
> git://github.com/advance38/linux.git srp-ha
>
> Our test setup consists of two systems: kernel 3.2.15 with SCST v4193 on
> the target, and kernel 3.2.8 with ib_srp-ha on the initiator. Although I
> rebased the patches again onto 3.6-rc3, I suppose there will be no
> significant differences.
>
> All of the known critical issues seem to have been resolved according to
> our internal tests over the last weeks.

Thanks for the testing and the feedback. Let's hope that the ib_srp maintainer will be able to review these patches soon.

For the linux-scsi and LKML readers: I've sent my feedback about this patch series to the linux-rdma mailing list only.

Bart.
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 2:54 PM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> On Wed, 30 Jan 2008 14:10:47 +0100 "Bart Van Assche" <[EMAIL PROTECTED]> wrote:
> > On Jan 30, 2008 11:56 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> > > Sorry, I can't say. I don't know much about iSER. But it seems that Pete and Robin can get the better I/O performance - line speed ratio with STGT.
> >
> > Robin Humble was using a DDR InfiniBand network, while my tests were performed with an SDR InfiniBand network. Robin's results can't be directly compared to my results.
>
> I know that you use different hardware. I used the word 'ratio'.

Let's start with summarizing the relevant numbers from Robin's measurements and my own measurements.

Maximum bandwidth of the underlying physical medium: 2000 MB/s for a DDR 4x InfiniBand network and 1000 MB/s for an SDR 4x InfiniBand network.

Maximum bandwidth reported by the OFED ib_write_bw test program: 1473 MB/s for Robin's setup and 933 MB/s for my setup. These numbers match published ib_write_bw results (see e.g. figure 11 in http://www.usenix.org/events/usenix06/tech/full_papers/liu/liu_html/index.html or chapter 7 in http://www.voltaire.com/ftp/rocks/HCA-4X0_Linux_GridStack_4.3_Release_Notes_DOC-00171-A00.pdf).

Throughput measured for communication via STGT + iSER to a remote RAM disk via direct I/O with dd: 800 MB/s for writing and 751 MB/s for reading in Robin's setup, and 647 MB/s for writing and 589 MB/s for reading in my setup.

From this we can compute the ratio of I/O performance to ib_write_bw bandwidth: 54 % for writing and 51 % for reading in Robin's setup, and 69 % for writing and 63 % for reading in my setup. In other words, the utilization of the available bandwidth was slightly better in my setup than in Robin's setup. This is no surprise -- the faster a communication link is, the harder it is to use all of the available bandwidth.

So why did you state that in Robin's tests the I/O performance to line speed ratio was better than in my tests?

Bart Van Assche.
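For clarity, the percentages above follow directly from dividing the measured dd throughput by the ib_write_bw throughput of the same setup:

    ratio = (dd throughput) / (ib_write_bw throughput)

    Robin's setup: write 800 / 1473 = 54 %, read 751 / 1473 = 51 %
    My setup:      write 647 /  933 = 69 %, read 589 /  933 = 63 %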
Re: Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> Since this particular code is located in a non-data-path critical section, the kernel vs. user discussion is a wash. If we are talking about the data path, yes, the relevance of dd tests in kernel designs is suspect :p. For those IB testers who are interested, perhaps having a look with disktest from the Linux Test Project would give a better comparison between the two implementations on an RDMA-capable fabric like IB for best-case performance. I think everyone is interested in seeing just how much data path overhead exists between userspace and kernel space in typical and heavy workloads, and if this overhead can be minimized to make userspace a better option for some of this very complex code.

I can run disktest on the same setups I ran dd on. This will take some time however. Disktest is new to me -- any hints with regard to suitable combinations of command line parameters are welcome. The most recent version I could find on http://ltp.sourceforge.net/ is ltp-20071231.

Bart Van Assche.
Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 5:25 PM, Joe Landman <[EMAIL PROTECTED]> wrote:
> Vladislav Bolkhovitin wrote:
> > Actually, I don't know what kind of conclusions it is possible to draw from disktest's results (maybe only how throughput gets bigger or smaller with an increasing number of threads?). It's a good stress test tool, but not more.
>
> Unfortunately, I agree. Bonnie++, dd tests, and a few others come far closer to "real world" tests than disktest and iozone, the latter of which does more to test the speed of RAM cache and system call performance than actual IO.

I have run some tests with Bonnie++, but found out that on a fast network like IB the filesystem used for the test has a really big impact on the test results.

If anyone has a suggestion for a better test than dd to compare the performance of SCSI storage protocols, please let me know.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 6:14 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> Also, with STGT being a pretty new design which has not undergone a lot of optimization, perhaps profiling both pieces of code against similar tests would give us a better idea of where the userspace bottlenecks reside. Also, the overhead involved with traditional iSCSI for bulk IO from kernel / userspace would be a key concern for a much larger set of users, as iSER and SRP on IB is a pretty small userbase and will probably remain small for the near future.

Two important trends in data center technology are server consolidation and storage consolidation. Among others, every web hosting company can profit from a fast storage solution. I wouldn't call this a small user base.

Regarding iSER and SRP on IB: InfiniBand is today the most economic solution for a fast storage network. I do not know which technology will be the most popular for storage consolidation within a few years -- this can be SRP, iSER, IPoIB + SDP, FCoE (Fibre Channel over Ethernet) or maybe yet another technology. No matter which technology becomes the most popular for storage applications, there will be a need for high-performance storage software.

References:
* Michael Feldman, Battle of the Network Fabrics, HPCwire, December 2006, http://www.hpcwire.com/hpc/1145060.html
* NetApp, Reducing Data Center Power Consumption Through Efficient Storage, February 2007, http://www.netapp.com/ftp/wp-reducing-datacenter-power-consumption.pdf

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> The PyX storage engine supports a scatterlist linked list algorithm that ...

Which parts of the PyX source code are licensed under the GPL and which parts are closed source? A Google query for PyX + iSCSI showed information about licensing deals. Licensing deals can only be closed for software that is not entirely licensed under the GPL.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Jan 31, 2008 7:15 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> I meant small referring to storage on IB fabrics, which has usually been in research and national lab settings, with some other vendors offering IB as an alternative storage fabric for those who [w,c]ould not wait for 10 Gb/sec copper Ethernet and Direct Data Placement to come online. These types of numbers compare to, say, traditional iSCSI, which is getting used all over the place these days in areas I won't bother listing here.

InfiniBand has several advantages over 10 Gbit/s Ethernet (the list below probably isn't complete):

- Lower latency. Communication latency is not only determined by the latency of a switch. The whole InfiniBand protocol stack was designed with low latency in mind. Low latency is really important for database software that accesses storage over a network.

- High availability is implemented at the network layer. Suppose that a group of servers has dual-port network interfaces and is interconnected via a so-called dual star topology. With an InfiniBand network, failover in case of a single failure (link or switch) is handled without any operating system or application intervention. With Ethernet, failover in case of a single failure must be handled either by the operating system or by the application.

- You do not have to use iSER or SRP to use the bandwidth of an InfiniBand network effectively. The SDP (Sockets Direct Protocol) makes it possible for applications to benefit from RDMA via the very classic IPv4 Berkeley sockets interface. An SDP implementation in software is already available today via OFED. iperf reports 470 MB/s in single-threaded tests and 975 MB/s in a performance test with two threads on an SDR 4x InfiniBand network. These tests were performed with the OFED 1.2.5.4 SDP implementation. It is possible that future SDP implementations will perform even better. (Note: I could not yet get iSCSI over SDP working.)

We should leave the choice of networking technology open -- both Ethernet and InfiniBand have specific advantages.

See also: InfiniBand Trade Association, InfiniBand Architecture Specification Release 1.2.1, http://www.infinibandta.org/specs/register/publicspec/

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 1, 2008 11:39 AM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> On Fri, 2008-02-01 at 09:11 +0100, Bart Van Assche wrote:
> > On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> > > The PyX storage engine supports a scatterlist linked list algorithm that ...
> >
> > Which parts of the PyX source code are licensed under the GPL and which parts are closed source? A Google query for PyX + iSCSI showed information about licensing deals. Licensing deals can only be closed for software that is not entirely licensed under the GPL.
>
> I was using the name PyX to give an historical context to the discussion. ...

Regarding the PyX target code: I have found a link via which I can download a free 30-day demo. This means that a company is earning money via this target code and that the source code is not licensed under the GPL. This is fine, but it also means that today the PyX target code is not a candidate for inclusion in the Linux kernel, and that it is unlikely that all of the PyX target code (kernelspace + userspace) will be made available under the GPL soon.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 1, 2008 1:05 PM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> All of the kernel and C userspace code is open source, available from linux-iscsi.org and licensed under the GPL.

I found a statement on a web page that the ERL2 implementation is not included in the GPL version (http://zaal.org/iscsi/index.html). The above implies that this statement is incorrect. Tomo, are you the maintainer of this web page?

I'll try to measure the performance of the LIO target stack on the same setup on which I ran the other performance tests.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 4, 2008 1:27 PM, Vladislav Bolkhovitin <[EMAIL PROTECTED]> wrote:
> So, James, what is your opinion on the above? Or does the overall SCSI target project's simplicity not matter much to you, and you think it's fine to duplicate the Linux page cache in user space in order to keep the in-kernel part of the project as small as possible?

It's too early to draw conclusions about performance. I'm currently performing more measurements, and the results are not easy to interpret. My plan is to measure the following:
* Setup: target with a RAM disk of 2 GB as backing storage.
* Throughput reported by dd and xdd (direct I/O).
* Transfers with dd/xdd in units of 1 KB to 1 GB (the smallest transfer size that can be specified to xdd is 1 KB).
* Target SCSI software to be tested: IETD iSCSI via IPoIB, STGT iSCSI via IPoIB, STGT iSER, SCST iSCSI via IPoIB, SCST SRP, LIO iSCSI via IPoIB.

The reason I chose dd/xdd for these tests is that I want to measure the performance of the communication protocols, and I am assuming that this performance can be modeled by the following formula:

  (transfer time in s) = (transfer setup latency in s)
                         + (transfer size in MB) / (bandwidth in MB/s)

Measuring the time needed for transfers with varying block size makes it possible to compute the constants in the above formula via linear regression.

One difficulty I already encountered is that the performance of the Linux IPoIB implementation varies a lot under high load (http://bugzilla.kernel.org/show_bug.cgi?id=9883). Another issue I have to look further into is that dd and xdd report different results for very large block sizes (> 1 MB).

Bart Van Assche.
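For illustration, the linear regression mentioned above amounts to an ordinary least-squares fit of t = a + b*s, where a is the setup latency and 1/b the bandwidth. A minimal sketch in C (not the actual analysis code used for these measurements):

    #include <stddef.h>

    /*
     * Least-squares fit of t = a + b * s. With s the transfer size in MB
     * and t the transfer time in s, a is the setup latency in s and 1/b
     * the bandwidth in MB/s. Returns 0 on success, -1 if degenerate.
     */
    static int fit_latency_bandwidth(const double *s, const double *t,
                                     size_t n, double *latency,
                                     double *bandwidth)
    {
            double sum_s = 0, sum_t = 0, sum_ss = 0, sum_st = 0, a, b, d;
            size_t i;

            for (i = 0; i < n; i++) {
                    sum_s  += s[i];
                    sum_t  += t[i];
                    sum_ss += s[i] * s[i];
                    sum_st += s[i] * t[i];
            }
            d = n * sum_ss - sum_s * sum_s;
            if (d == 0)
                    return -1;
            b = (n * sum_st - sum_s * sum_t) / d;   /* slope = 1/bandwidth */
            if (b == 0)
                    return -1;
            a = (sum_t - b * sum_s) / n;            /* intercept = latency */
            *latency = a;
            *bandwidth = 1.0 / b;
            return 0;
    }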
Re: [Cbe-oss-dev] LIO Target iSCSI/SE PS3-Linux / FC8 builds
On Feb 4, 2008 5:44 PM, Marc Dietrich <[EMAIL PROTECTED]> wrote:
> ...
> Anyway, here's a quick and ugly fix for the ARCH detection code, tested on ps3.
> ...

Architecture detection is indeed broken in LIO. Would it be possible to use the standard config.guess script instead of the custom LIO arch detection script? config.guess is included with, among others, automake and libtool.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 4, 2008 11:57 PM, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> Networked block devices are attractive because the concepts and implementation are simpler than those of networked filesystems... but usually you want to run some sort of filesystem on top. At that point you might as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your networked block device (and associated complexity).

Running a filesystem on top of iSCSI results in better performance than NFS, especially if the NFS client conforms to the NFS standard (= synchronous writes). By searching the web for the keywords NFS, iSCSI and performance I found the following (six year old) document: http://www.technomagesinc.com/papers/ip_paper.html. A quote from its conclusion:

  Our results, generated by running some of industry standard benchmarks,
  show that iSCSI significantly outperforms NFS for situations when
  performing streaming, database like accesses and small file transactions.

Bart Van Assche.
Re: kara_am
On Feb 5, 2008 1:10 PM, am kara <[EMAIL PROTECTED]> wrote:
> I am trying to understand IPsec and found some related subfolders

Please read this book (or a similar book) first:

  Understanding Linux Network Internals
  by Christian Benvenuti
  Publisher: O'Reilly, December 2005
  ISBN: 0-596-00255-6, 1062 pages

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
Regarding the performance tests I promised to perform: although until now I have only been able to run two tests (STGT + iSER versus SCST + SRP), the results are interesting. I will run the remaining test cases during the next days.

About the test setup: dd and xdd were used to transfer 2 GB of data between an initiator system and a target system via direct I/O over an SDR InfiniBand network (1 GB/s). The block size varied between 512 bytes and 1 GB, and was always a power of two.

Expected results:
* The measurement results are consistent with the numbers I published earlier.
* During data transfers all data is transferred in blocks between 4 KB and 32 KB in size (according to the SCST statistics).
* For small and medium block sizes (<= 32 KB) transfer times can be modeled very well by the following formula: (transfer time) = (setup latency) + (bytes transferred)/(bandwidth). The correlation numbers are very close to one.
* The latency and bandwidth parameters depend on the test tool (dd versus xdd), on the kind of test performed (reading versus writing), on the SCSI target and on the communication protocol.
* When using RDMA (iSER or SRP), SCST has a lower latency and a higher bandwidth than STGT. Results from linear regression for block sizes <= 32 KB:

  Test                    Latency (us)  Bandwidth (MB/s)  Correlation
  STGT+iSER, read,  dd        64              560            0.95
  STGT+iSER, read,  xdd       65              556            0.94
  STGT+iSER, write, dd        53              394            0.71
  STGT+iSER, write, xdd       54              445            0.59
  SCST+SRP,  read,  dd        39              657            0.83
  SCST+SRP,  read,  xdd       41              668            0.87
  SCST+SRP,  write, dd        52              449            0.62
  SCST+SRP,  write, xdd       52              516            0.77

Results that I did not expect:
* A block transfer size of 1 MB is not enough to measure the maximal throughput. The maximal throughput is only reached at much larger block sizes (about 10 MB for SCST + SRP and about 100 MB for STGT + iSER).
* There is one case where dd and xdd results are inconsistent: when reading via SCST + SRP with block sizes of about 1 MB.
* For block sizes > 64 KB the measurements differ from the model. This is probably because all initiator-target transfers happen in blocks of 32 KB or less.

For the details and some graphs, see also http://software.qlayer.com/display/iSCSI/Measurements.

Bart Van Assche.
Re: Integration of SCST in the mainstream Linux kernel
On Feb 5, 2008 6:10 PM, Erez Zilber <[EMAIL PROTECTED]> wrote:
> One may claim that STGT should have lower performance than SCST because its data path runs in userspace. However, your results show that for non-IB transports they both show the same numbers. Furthermore, with IB there shouldn't be any additional difference between the 2 targets because data transfer from userspace is as efficient as data transfer from kernel space.
>
> The only explanation that I see is that fine tuning for iSCSI & iSER is required. As was already mentioned in this thread, with SDR you can get ~900 MB/sec with iSER (on STGT).

My most recent measurements also show that one can get 900 MB/s with STGT + iSER on an SDR IB network, but only for very large block sizes (>= 100 MB). A quote from Linus Torvalds is relevant here (February 5, 2008):

  Block transfer sizes over about 64kB are totally irrelevant for 99% of all people.

Please read my e-mail (posted earlier today) with a comparison between SCST and STGT for 4 KB - 64 KB block transfer sizes.

Bart Van Assche.
Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
On 10/22/12 21:01, Jeff Moyer wrote:
> All of the infrastructure is available to allocate a request_queue on a particular numa node, but it isn't being utilized at all. Wire up the sd driver to allocate the request_queue on the HBA's local numa node.
>
> This is a request for comments and testing (I've built and booted it, nothing more). I believe that this should be a performance win, but I have no numbers to back it up as yet. Suggestions for workloads to test are welcome.
>
> Cheers,
> Jeff
>
> Signed-off-by: Jeff Moyer
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index da36a3a..7986483 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1664,7 +1664,8 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>  	struct request_queue *q;
>  	struct device *dev = shost->dma_dev;
>
> -	q = blk_init_queue(request_fn, NULL);
> +	q = blk_init_queue_node(request_fn, NULL,
> +				dev_to_node(&shost->shost_dev));
>  	if (!q)
>  		return NULL;

Are you sure this approach will always result in the queue being allocated on the same NUMA node as the HCA? If e.g. a user triggers LUN scanning via sysfs, the above code may be invoked on another NUMA node than the node to which the HCA is connected. Also, if you have a look at e.g. scsi_request_fn() or scsi_device_unbusy() you will see that, in order to avoid inter-node traffic, it is important to allocate the sdev and shost data structures on the same NUMA node.

How about the following approach (a sketch of such an interface follows below)?
- Add a variant of scsi_host_alloc() that allows the caller to specify on which NUMA node to allocate the shost structure and that also stores the identity of that node in the shost structure.
- Modify __scsi_alloc_queue() such that it allocates the sdev structure on the same NUMA node as the shost structure.
- Modify the SCSI LLD of your choice such that it uses the new scsi_host_alloc() call. Depending on what is appropriate, the NUMA node on which to allocate the shost could be specified by the user or could be identical to the NUMA node of the HCA controlled by the SCSI LLD (see e.g. /sys/devices/pci*/*/numa_node).

Please keep in mind that a single PCIe bus may have a minimal distance to more than one NUMA node. See e.g. the diagram at the top of page 8 in http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf for a system diagram of a NUMA system where each PCIe bus has a minimal distance to two different NUMA nodes.

Bart.
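The interface suggested above could look as follows (a hypothetical prototype; nothing with this name exists in the kernel at the time of writing):

    /*
     * Variant of scsi_host_alloc() that allocates the shost structure on
     * NUMA node 'node' and records that node in the shost structure, so
     * that dependent allocations can be made on the same node.
     */
    struct Scsi_Host *scsi_host_alloc_node(struct scsi_host_template *sht,
                                           int privsize, int node);

__scsi_alloc_queue() could then allocate on the stored node, e.g. via blk_init_queue_node(request_fn, NULL, shost->numa_node).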
Re: [patch/rfc/rft] sd: allocate request_queue on device's local numa node
On 10/23/12 18:52, Jeff Moyer wrote:
> Bart Van Assche writes:
> > Please keep in mind that a single PCIe bus may have a minimal distance to more than one NUMA node. See e.g. the diagram at the top of page 8 in http://bizsupport1.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf for a system diagram of a NUMA system where each PCIe bus has a minimal distance to two different NUMA nodes.
>
> That's an interesting configuration. I wonder what the numa_node sysfs file contains for such systems -- do you know? I'm not sure how we could allow this to be user-controlled at probe time. Did you have a specific mechanism in mind? Module parameters? Something else?

As far as I can see in drivers/pci/pci-sysfs.c, the numa_node sysfs attribute contains a single number, even for a topology like the one described above.

With regard to user control of the NUMA node: I'm not sure how to solve this in general. But for the ib_srp driver this should be easy to do: SCSI host creation is triggered by sending a login string to a sysfs attribute ("add_target"). It wouldn't take much time to add a parameter to that login string that specifies the NUMA node.

Bart.
Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel
On Feb 6, 2008 1:11 AM, Nicholas A. Bellinger <[EMAIL PROTECTED]> wrote:
> I have always observed the case with LIO SE/iSCSI target mode ...

Hello Nicholas,

Are you sure that the LIO-SE kernel module source code is ready for inclusion in the mainstream Linux kernel? As you know, I tried to test the LIO-SE iSCSI target. Already while configuring the target I encountered a kernel crash that froze the whole system. I can reproduce this kernel crash easily, and I reported it 11 days ago on the LIO-SE mailing list (February 4, 2008). One of the call stacks I posted shows a crash in mempool_alloc() called from jbd. In other words, the crash is most likely the result of memory corruption caused by LIO-SE.

Because I was curious to know why it took so long to fix such a severe crash, I started browsing through the LIO-SE source code. Analysis of the LIO-SE kernel module source code taught me that this crash is not a coincidence. Dynamic memory allocation (kmalloc()/kfree()) in the LIO-SE kernel module is complex and hard to verify: there are 412 memory allocation/deallocation calls in the current version of the LIO-SE kernel module source code, which is a lot. Additionally, because of the complexity of the memory handling in LIO-SE, it is not possible to verify the correctness of the memory handling by analyzing a single function at a time. In my opinion this makes the LIO-SE source code hard to maintain.

Furthermore, the LIO-SE kernel module source code does not follow conventions that have proven their value in the past, like grouping all error handling at the end of a function (an example of that convention follows below). As could be expected, the consequence is that error handling is not correct in several functions, resulting in memory leaks in case of an error. Some examples of functions in which error handling is clearly incorrect:
* transport_allocate_passthrough()
* iscsi_do_build_list()

Bart Van Assche.
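For reference, the error-handling convention referred to above keeps the success path linear and groups all cleanup behind labels at the end of the function, e.g. (a generic userspace sketch, not LIO-SE code; in the kernel this would use kmalloc()/kfree() and -ENOMEM):

    #include <stdlib.h>

    static int setup_two_buffers(void **a, void **b)
    {
            *a = malloc(64);
            if (!*a)
                    goto err;
            *b = malloc(64);
            if (!*b)
                    goto err_free_a;
            return 0;           /* success path stays linear */

    err_free_a:                 /* cleanup runs in reverse order */
            free(*a);
            *a = NULL;
    err:
            return -1;
    }

With this structure each allocation has exactly one matching cleanup site, which makes leaks much easier to spot during review.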
Re: Integration of SCST in the mainstream Linux kernel
On Thu, Feb 7, 2008 at 2:45 PM, Vladislav Bolkhovitin <[EMAIL PROTECTED]> wrote:
> Bart Van Assche wrote:
> > Since the focus of this thread shifted somewhat in the last few messages, I'll try to summarize what has been discussed so far:
> > - There were a number of participants who joined this discussion spontaneously. This suggests that there is considerable interest in networked storage and iSCSI.
> > - It has been motivated why iSCSI makes sense as a storage protocol (compared to ATA over Ethernet and Fibre Channel over Ethernet).
> > - The direct I/O performance results for block transfer sizes below 64 KB are a meaningful benchmark for storage target implementations.
> > - It has been discussed whether an iSCSI target should be implemented in user space or in kernel space. It is clear now that an implementation in the kernel can be made faster than a user space implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804). Regarding existing implementations, measurements have, among other things, shown that SCST is faster than STGT (30% with the following setup: iSCSI via IPoIB and direct I/O block transfers with a size of 512 bytes).
> > - It has been discussed which iSCSI target implementation should be in the mainstream Linux kernel. There is no agreement on this subject yet. The short-term options are as follows:
> > 1) Do not integrate any new iSCSI target implementation in the mainstream Linux kernel.
> > 2) Add one of the existing in-kernel iSCSI target implementations to the kernel, e.g. SCST or PyX/LIO.
> > 3) Create a new in-kernel iSCSI target implementation that combines the advantages of the existing iSCSI kernel target implementations (IETD, STGT, SCST and PyX/LIO).
> > As an iSCSI user, I prefer option (3). The big question is whether the various storage target authors agree with this?
>
> I tend to agree, with some important notes:
>
> 1. IET should be excluded from this list; iSCSI-SCST is IET updated for the SCST framework with a lot of bugfixes and improvements.
>
> 2. I think everybody will agree that a Linux iSCSI target should work over some standard SCSI target framework. Hence the choice gets narrower: SCST vs STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) into the mainline, because of a lot of code duplication. Nicholas could decide to move to either existing framework (although, frankly, I don't think there's a possibility for an in-kernel iSCSI target on top of a user space SCSI target framework) and if he decides to go with SCST, I'll be glad to offer my help and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better one should win.

If I understood the above correctly, regarding a kernel space iSCSI target implementation only LIO-SE and SCST should be considered. What I know today about these Linux iSCSI target implementations is as follows:
* SCST performs slightly better than LIO-SE, and LIO-SE performs slightly better than STGT (both with regard to latency and with regard to bandwidth).
* The coding style of SCST is closer to the Linux kernel coding style than the coding style of the LIO-SE project.
* The structure of SCST is closer to what Linus expects than the structure of LIO-SE (i.e., authentication handled in userspace, data transfer handled by the kernel -- LIO-SE handles both in kernel space).
* Until now I did not encounter any strange behavior in SCST.
The issues I encountered with LIO-SE are being resolved via the LIO-SE mailing list (http://groups.google.com/group/linux-iscsi-target-dev).

It would take too much effort to develop a new kernel space iSCSI target from scratch -- we should start from either LIO-SE or SCST. My opinion is that the best approach is to start with integrating SCST in the mainstream kernel, and that the more advanced features from LIO-SE that are not yet in SCST can be ported from LIO-SE to the SCST framework.

Nicholas, do you think the structure of SCST is powerful enough to be extended with LIO-SE's powerful features like ERL-2?

Bart Van Assche.
Re: BUG/ spinlock lockup, 2.6.24
2008/2/15 Denys Fedoryshchenko <[EMAIL PROTECTED]>:
> I have random crashes, at least once per week. It is very difficult to catch the error message, and only recently did I set up netconsole. Now I got a crash, but there is no traceback and only a single line, mentioned before, came over netconsole.

Have you already run memtest? You can run memtest by booting from the Knoppix CD-ROM or DVD. Most Linux distributions have also included memtest on their bootable distribution CDs/DVDs.

Bart Van Assche.
Re: [patch 01/10] scsi: make __scsi_alloc_queue numa-aware
On 10/30/12 21:14, Jeff Moyer wrote:
> Pass the numa node id set in the Scsi_Host on to blk_init_queue_node in order to keep all allocations local to the numa node the device is closest to.
>
> Signed-off-by: Jeff Moyer
> ---
>  drivers/scsi/scsi_lib.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index da36a3a..8662a09 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1664,7 +1664,7 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>  	struct request_queue *q;
>  	struct device *dev = shost->dma_dev;
>
> -	q = blk_init_queue(request_fn, NULL);
> +	q = blk_init_queue_node(request_fn, NULL, shost->numa_node);
>  	if (!q)
>  		return NULL;

Hello Jeff,

I haven't seen the patch that introduces numa_node in struct Scsi_Host, nor the cover letter of this patch series. Have these been posted on the linux-scsi mailing list?

Bart.
Re: [patch,v2 01/10] scsi: add scsi_host_alloc_node
On 11/02/12 22:45, Jeff Moyer wrote:
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 593085a..7d7ad8b 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -336,16 +336,25 @@ static struct device_type scsi_host_type = {
>   **/
>  struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>  {
> +	return scsi_host_alloc_node(sht, privsize, -1);

Using NUMA_NO_NODE here might improve readability.

> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index 4908480..a1b5c8e 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -733,6 +733,12 @@ struct Scsi_Host {
>  	struct device *dma_dev;
>
> +	/*
> +	 * Numa node this device is closest to, used for allocating
> +	 * data structures locally.
> +	 */
> +	int numa_node;

Have you considered using #ifdef CONFIG_NUMA / #endif here? I've noticed that all other numa_node members in structures under include/ have this.

Bart.
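In other words, something along these lines (a sketch of the suggested declaration, mirroring how other numa_node members under include/ are declared):

    struct Scsi_Host {
            /* ... existing members ... */
    #ifdef CONFIG_NUMA
            /*
             * NUMA node this device is closest to, used for allocating
             * data structures locally; NUMA_NO_NODE when unspecified.
             */
            int numa_node;
    #endif
    };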
Re: [patch,v2 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node
On 11/02/12 22:45, Jeff Moyer wrote:
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 2936b44..4db6973 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -173,16 +173,20 @@ static DEFINE_MUTEX(host_cmd_pool_mutex);
>   * NULL on failure
>   */
>  static struct scsi_cmnd *
> -scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask)
> +scsi_pool_alloc_command(struct scsi_host_cmd_pool *pool, gfp_t gfp_mask,
> +			int node)
>  {
>  	struct scsi_cmnd *cmd;
>
> -	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
> +	cmd = kmem_cache_alloc_node(pool->cmd_slab,
> +			gfp_mask | pool->gfp_mask | __GFP_ZERO,
> +			node);
>  	if (!cmd)
>  		return NULL;
>
> -	cmd->sense_buffer = kmem_cache_alloc(pool->sense_slab,
> -					     gfp_mask | pool->gfp_mask);
> +	cmd->sense_buffer = kmem_cache_alloc_node(pool->sense_slab,
> +			gfp_mask | pool->gfp_mask | __GFP_ZERO,
> +			node);

It's not clear to me why __GFP_ZERO has been added to the allocation flags?

Bart.
Re: [patch,v2 05/10] sd: use alloc_disk_node
On 11/02/12 22:45, Jeff Moyer wrote:
> Signed-off-by: Jeff Moyer
> ---
>  drivers/scsi/sd.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 12f6fdf..8deb915 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2714,7 +2714,7 @@ static int sd_probe(struct device *dev)
>  	if (!sdkp)
>  		goto out;
>
> -	gd = alloc_disk(SD_MINORS);
> +	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
>  	if (!gd)
>  		goto out_free;

shost->numa_node can be another NUMA node than dev_to_node(dev). Have you considered using shost->numa_node here?

Bart.
Re: [patch,v2 05/10] sd: use alloc_disk_node
On 11/05/12 15:12, Jeff Moyer wrote:
> Bart Van Assche writes:
> > On 11/02/12 22:45, Jeff Moyer wrote:
> > > -	gd = alloc_disk(SD_MINORS);
> > > +	gd = alloc_disk_node(SD_MINORS, dev_to_node(dev));
> >
> > shost->numa_node can be another NUMA node than dev_to_node(dev). Have you considered using shost->numa_node here?
>
> It can? How?

E.g. if the LLD allows the user to specify the value of numa_node and passes that value to scsi_host_alloc_node() (see also http://lkml.org/lkml/2012/10/23/477 for further information).

> Just so I'm clear, you're suggesting I use the scsi_device's host pointer to get to the Scsi_Host, and that *will* be filled in at this point, right?

As far as I can see the sdev->host pointer is set in scsi_alloc_sdev(), and that happens before sd_probe() is invoked.

Bart.
Re: [PATCH 1/2] virtio-scsi: fix parsing of hotplug/hot-unplug LUN number
On 07/26/12 09:21, James Bottomley wrote:
> On Thu, 2012-07-26 at 11:04 +0200, Paolo Bonzini wrote:
> > On 26/07/2012 10:52, James Bottomley wrote:
> > > > +static unsigned int virtscsi_get_lun(u8 *lun_bytes)
> > > > +{
> > > > +	unsigned int lun = (lun_bytes[2] << 8) | lun_bytes[3];
> > > > +	return lun & 16383;
> > > > +}
> > >
> > > Why are you rolling your own incomplete version of scsilun_to_int here?
> >
> > Because scsilun_to_int does not do the AND, so it would have exactly the same bug I'm fixing.
>
> It's not a bug ... it's the encoding. All the other devices use this too. Ideally we should have switched to 64 bit lun numbers for the encoding to be exact, but nothing so far has gone over 32 bits. If we don't encode the address method as part of the lun number, we don't get the reverse transform right and the addressing often fails.
>
> That does mean that arrays that use address method = 1 in REPORT LUNS have their lun numbers start at 16384.

Has it already been considered to modify scsilun_to_int() such that LUN numbers start at zero even for addressing method 1? This is what e.g. the function scst_unpack_lun() already does. See also http://scst.svn.sourceforge.net/viewvc/scst/trunk/scst/src/scst_lib.c?revision=HEAD&view=markup.

Bart.
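For background, a decoder that strips the SAM addressing-method bits, in the spirit of scst_unpack_lun(), could look like the sketch below. It is simplified to the single-level peripheral and flat-space methods; lun_bytes corresponds to the two bytes that the hunk above reads as lun_bytes[2] and lun_bytes[3]:

    #include <stdint.h>

    /* Returns the LUN with the addressing-method bits stripped, or -1 for
     * addressing methods this sketch does not handle. */
    static int unpack_lun_sketch(const uint8_t *lun_bytes)
    {
            switch (lun_bytes[0] >> 6) {
            case 0: /* peripheral device addressing method */
                    return lun_bytes[1];
            case 1: /* flat space addressing method */
                    return ((lun_bytes[0] & 0x3f) << 8) | lun_bytes[1];
            default: /* logical unit / extended addressing */
                    return -1;
            }
    }

With such a decoder, the LUN numbers of arrays that use addressing method 1 start at zero instead of at 16384.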
Re: [GIT PULL] please pull infiniband.git
On 07/10/13 16:38, Roland Dreier wrote:
> On Wed, Jul 10, 2013 at 7:35 AM, Sebastian Riemer wrote:
> > I've checked the commits on that tag and the following commit is not what we've agreed on:
>
> Sorry about that. The discussion was long and complex and I probably made a mistake in applying the patches. Please send me a patch to fix the driver to what it should be, and I will merge it ASAP.

I will send such a patch in reply to this e-mail but without CC-ing the LKML. Sebastian, if you could review that patch, that would be appreciated.

Bart.
Re: [PATCH] prevent CVE-2012-2372 rds-ping causes kernel panic
On 10/22/12 20:12, Jay Fenlason wrote:
> If you use rds-ping on the local IP address of some InfiniBand HCAs (QLogic, possibly others), the machine will panic with a BUG_ON due to an overly restrictive check. Loosen the restriction a bit.
>
> This should have been sent a while ago (it was first noticed in https://bugzilla.redhat.com/show_bug.cgi?id=803936 and patched in kernel-2.6.32-275.el6), but I got confused about its embargo status and lost it.
>
> Signed-off-by: Jay Fenlason
>
> diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
> index e590949..7920c85 100644
> --- a/net/rds/ib_send.c
> +++ b/net/rds/ib_send.c
> @@ -544,7 +544,7 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
>  	int flow_controlled = 0;
>  	int nr_sig = 0;
>
> -	BUG_ON(off % RDS_FRAG_SIZE);
> +	BUG_ON(!conn->c_loopback && off % RDS_FRAG_SIZE);
>  	BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
>
>  	/* Do not send cong updates to IB loopback */

Hello Jay,

The recommended approach for checking preconditions is to use WARN_ON_ONCE(), not BUG_ON(). Linus explained this last month in a message posted on the LKML (http://lkml.org/lkml/2012/9/27/461).

Bart.
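As an illustration, the recommended pattern would turn the fatal assertion in the hunk above into a one-time warning plus an error return, roughly as follows (untested sketch; whether -EINVAL is an appropriate return value for the callers of rds_ib_xmit() would need to be verified):

    /* Warn once and fail the transmit instead of panicking the machine. */
    if (WARN_ON_ONCE(!conn->c_loopback && off % RDS_FRAG_SIZE))
            return -EINVAL;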
Re: [patch,v3 04/10] scsi: allocate scsi_cmnd-s from the device's local numa node
On 11/09/12 20:18, Jeff Moyer wrote:
> -	cmd = kmem_cache_zalloc(pool->cmd_slab, gfp_mask | pool->gfp_mask);
> +	cmd = kmem_cache_alloc_node(pool->cmd_slab,
> +			gfp_mask | pool->gfp_mask | __GFP_ZERO,
> +			node);

Hello Jeff,

Is it necessary to add __GFP_ZERO here? And if so, why?

Thanks,

Bart.
Re: [patch,v3 00/10] make I/O path allocations more numa-friendly
On 11/09/12 20:17, Jeff Moyer wrote:
> This patch set makes memory allocations for data structures used in the I/O path more numa friendly by allocating them from the same numa node as the storage device. I've only converted a handful of drivers at this point. My testing showed that, for workloads where the I/O processes were not tied to the numa node housing the device, a speedup of around 6% was observed. When the I/O processes were tied to the numa node of the device, there was no measurable difference in my test setup. Given my relatively low-end setup[1], I wouldn't be surprised if others could show a more significant performance advantage.
>
> Comments would be greatly appreciated.

Sorry, but I'm not familiar with any of the SCSI LLDs modified via this patch series. I am fine, however, with the SCSI core patches in this series. So if you want you can add the following to the first five patches in this series:

Reviewed-by: Bart Van Assche

Bart.
[PATCH] kbuild: Unbreak cleaning external module build directory
Avoid that "$(MAKE) -C $(KDIR) M=$(PWD) clean" triggers the following error message when invoked from inside the Makefile of an external kernel module:

  rm: cannot remove 'System.map': Permission denied
  make[1]: *** [clean] Error 1

Cc: Sam Ravnborg
Cc: Arnaud Lacombe
Cc: Nick Bowler
Cc: Richard Weinberger
Cc: Michal Marek
Cc:
Signed-off-by: Bart Van Assche
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 9f6ca12..46b9be5 100644
--- a/Makefile
+++ b/Makefile
@@ -1252,7 +1252,7 @@ scripts: ;
 endif # KBUILD_EXTMOD

 clean: $(clean-dirs)
-	$(Q)$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean
+	$(Q)$(if $(KBUILD_EXTMOD),,$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean)
 	$(call cmd,rmdirs)
 	$(call cmd,rmfiles)
 	@find $(if $(KBUILD_EXTMOD), $(KBUILD_EXTMOD), .) $(RCS_FIND_IGNORE) \
--
1.7.10.4
Re: [PATCH] timer: Fix jiffies wrap behavior of round_jiffies*()
On 05/30/13 01:01, Andrew Morton wrote:
> On Tue, 21 May 2013 20:43:50 +0200 Bart Van Assche wrote:
> > Make sure that the round_jiffies*() functions return a time that is in the future when the jiffies counter is about to wrap.
>
> Actually "when the jiffies counter has recently wrapped". I assume this was found by inspection?

Hello Andrew,

You are correct, this was found via source code inspection. I started reviewing the round_jiffies*() implementation because I was chasing a bug in a kernel driver that uses one of these functions.

> > -		return original;
> > -	return j;
> > +	return time_is_after_jiffies(j) ? j : original;
> >  }
>
> Your email client mangles patches, btw.

Sorry. I will take more care in the future.

Bart.
Re: RFC: Allow block drivers to poll for I/O instead of sleeping
On 06/25/13 05:18, Matthew Wilcox wrote:
> On Mon, Jun 24, 2013 at 10:07:51AM +0200, Ingo Molnar wrote:
> > I'm wondering, how will this scheme work if the IO completion latency is a lot more than the 5 usecs in the testcase? What if it takes 20 usecs or 100 usecs or more?
>
> There's clearly a threshold at which it stops making sense, and our current NAND-based SSDs are almost certainly on the wrong side of that threshold! I can't wait for one of the "post-NAND" technologies to make it to market in some form that makes it economical to use in an SSD.
>
> The problem is that some of the people who are looking at those technologies are crazy. They want to "bypass the kernel" and "do user space I/O" because "the kernel is too slow". This patch is part of an effort to show them how crazy they are. And even if it doesn't convince them, at least users who refuse to rewrite their applications to take advantage of magical userspace I/O libraries will see real performance benefits.

Recently I attended an interesting talk about this subject in which it was proposed not only to bypass the kernel for access to high-IOPS devices but also to allow byte addressability for block devices. The slides that accompanied that talk, which include a performance comparison with the traditional block driver API, can be found here: Bernard Metzler, On Suitability of High-Performance Networking API for Storage, OFA Int'l Developer Workshop, April 24, 2013 (http://www.openfabrics.org/ofa-documents/presentations/doc_download/559-on-suitability-of-high-performance-networking-api-for-storage.html).

This approach leaves the choice between polling and interrupt-based completion notification to the user of the new API, something the Linux InfiniBand RDMA verbs API already allows today.

Bart.
Re: [PATCH] scsi_prep_fn() check for empty queue
On 06/26/13 11:02, Maxim Uvarov wrote:
> This fixes:
>
> end_request: I/O error, dev sdc, sector 976576
> rport-0:0-3: blocked FC remote port time out: removing target and saving binding
> BUG: unable to handle kernel NULL pointer dereference at 0400
> IP: [] scsi_prep_state_check+0xe/0x99
>  [] scsi_setup_blk_pc_cmnd+0x1b/0x115
>  [] scsi_prep_fn+0x29/0x3b
>  [] blk_peek_request+0xe1/0x1b3
>  [] scsi_request_fn+0x3a/0x4d2
>  [] __generic_unplug_device+0x32/0x36
>  [] blk_execute_rq_nowait+0x77/0x9e
>  [] blk_execute_rq+0xa6/0xde
>  [] ? printk+0x41/0x46
>  [] ? get_rdac_req+0x81/0xe8 [scsi_dh_rdac]
>  [] send_mode_select+0x29f/0x489 [scsi_dh_rdac]
>  [] ? probe_workqueue_execution+0xb1/0xce
>  [] worker_thread+0x1a9/0x237
>  [] ? send_mode_select+0x0/0x489 [scsi_dh_rdac]
>  [] ? autoremove_wake_function+0x0/0x39
>  [] ? worker_thread+0x0/0x237
>  [] kthread+0x7f/0x87
>  [] child_rip+0xa/0x20
>  [] ? kthread+0x0/0x87
>  [] ? child_rip+0x0/0x20
>
> Signed-off-by: Maxim Uvarov
> ---
>  drivers/scsi/scsi_lib.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 86d5220..8e89ed9 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1295,6 +1295,9 @@ int scsi_prep_fn(struct request_queue *q, struct request *req)
>  	struct scsi_device *sdev = q->queuedata;
>  	int ret = BLKPREP_KILL;
>
> +	if (!sdev)
> +		return ret;
> +
>  	if (req->cmd_type == REQ_TYPE_BLOCK_PC)
>  		ret = scsi_setup_blk_pc_cmnd(sdev, req);
>  	return scsi_prep_return(q, req, ret);

Sorry, but this patch does not look like a proper fix to me. What you probably need is a scsi_device_get() call in scsi_dh_rdac.c somewhere before the queue_work(kmpath_rdacd, &ctlr->ms_work) call and a scsi_device_put() call once send_mode_select() has finished using the sdev (see the sketch below).

Bart.
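A rough sketch of that suggestion (untested; the exact context in scsi_dh_rdac.c may differ):

    /* Before queueing the deferred mode select, pin the scsi_device: */
    if (scsi_device_get(sdev))
            return;         /* device is already being torn down */
    queue_work(kmpath_rdacd, &ctlr->ms_work);

    /* ... and at the end of send_mode_select(), once the sdev is no
     * longer used: */
    scsi_device_put(sdev);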
Re: I/O blocked while dirty pages are being flushed
On 03/24/13 06:12, Fredrik Tolf wrote:
> While this flush is running, I find that many a process goes into disk sleep waiting for the flush to complete. This includes the process manipulating the mmapped file whenever it tries to redirty a page currently waiting to be flushed, but also, for instance, programs that write() to log files (since, I guess, the buffer page backing the last written portion of the log file is being flushed).

Have you already encountered this article: Jonathan Corbet, The trouble with stable pages, March 13, 2012 (http://lwn.net/Articles/486311/)?

Bart.
Re: [PATCH V7 4/5] virtio-scsi: introduce multiqueue support
On 03/23/13 12:28, Wanlong Gao wrote:

+static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
+					       struct virtio_scsi_target_state *tgt)
+{
+	struct virtio_scsi_vq *vq;
+	unsigned long flags;
+	u32 queue_num;
+
+	spin_lock_irqsave(&tgt->tgt_lock, flags);
+
+	/*
+	 * The memory barrier after atomic_inc_return matches
+	 * the smp_read_barrier_depends() in virtscsi_req_done.
+	 */
+	if (atomic_inc_return(&tgt->reqs) > 1)
+		vq = ACCESS_ONCE(tgt->req_vq);
+	else {
+		queue_num = smp_processor_id();
+		while (unlikely(queue_num >= vscsi->num_queues))
+			queue_num -= vscsi->num_queues;
+
+		tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
+	}
+
+	spin_unlock_irqrestore(&tgt->tgt_lock, flags);
+	return vq;
+}

Is there any reason why the smp_processor_id() % vscsi->num_queues computation in virtscsi_pick_vq() has been implemented as a loop instead of as an arithmetic operation ? If so, I would appreciate it if that could be explained in a comment.

Thanks,

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
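For reference, the arithmetic form the question refers to would presumably look like this (a sketch, not a proposed patch):

	/* Same result as the subtraction loop above, as long as
	 * num_queues is non-zero: */
	queue_num = smp_processor_id() % vscsi->num_queues;
	tgt->req_vq = vq = &vscsi->req_vqs[queue_num];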
Re: [GIT PULL] Final round of SCSI updates for the 3.8+ merge window
On 03/01/13 10:19, James Bottomley wrote:
> This is an assorted set of stragglers into the merge window with driver
> updates for qla2xxx, megaraid_sas, storvsc and ufs. It also includes
> pulls of the uapi tree (all the remaining SCSI pieces) and the fcoe
> tree (updates to fcoe and libfc)

Hello James,

I'm curious to know your opinion about the patch called "Fix race between starved list processing and device removal" (http://article.gmane.org/gmane.linux.scsi/80090/match=scsi_run_queue) ? It was posted for the first time several months ago and has been ack-ed and tested by other people. So far I haven't seen any comments from you on that patch.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] timer: Fix jiffies wrap behavior of round_jiffies*()
Make sure that the round_jiffies*() functions return a time that is in the future when the jiffies counter is about to wrap.

Signed-off-by: Bart Van Assche
Cc: Thomas Gleixner
Cc: Arjan van de Ven
Cc: Stephen Rothwell
---
 kernel/timer.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/timer.c b/kernel/timer.c
index 15ffdb3..aa8b964 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -149,9 +149,7 @@ static unsigned long round_jiffies_common(unsigned long j, int cpu,
 	/* now that we have rounded, subtract the extra skew again */
 	j -= cpu * 3;
 
-	if (j <= jiffies) /* rounding ate our timeout entirely; */
-		return original;
-	return j;
+	return time_is_after_jiffies(j) ? j : original;
 }
 
 /**
-- 
1.7.10.4
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
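For readers who do not have jiffies.h at hand: time_is_after_jiffies(j) is defined in include/linux/jiffies.h in terms of time_before(), and the time_before()/time_after() macros handle jiffies wrap through signed subtraction. Simplified (the typecheck() wrappers of the real macros are omitted):

#define time_after(a, b)		((long)((b) - (a)) < 0)
#define time_before(a, b)		time_after(b, a)
#define time_is_after_jiffies(a)	time_before(jiffies, a)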
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/02/12 08:41, Chanho Min wrote:
> This patch is to fix a oops from a torn down device. When
> scsi_run_queue process starved queues, scsi_request_fn can race with
> scsi_remove_device. In this case, rarely, scsi_request_fn release the
> last reference and set sdev->request_queue to NULL. It result in
> NULL-pointer dereference when spin_unlock is tried with (NULL)->
> queue_lock. We need to add an extra reference to the device on both
> sides of the __blk_run_queue to hold reference until scsi_request_fn
> is finished.

Good catch. So far I haven't been able to trigger this issue in my tests. So it would be appreciated if you could verify whether the patch below helps (patch is based on 3.6-rc1):

---
 drivers/scsi/scsi_sysfs.c | 8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 093d4f6..59e523c 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -348,7 +348,6 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
 	starget->reap_ref++;
 	list_del(&sdev->siblings);
 	list_del(&sdev->same_target_siblings);
-	list_del(&sdev->starved_entry);
 	spin_unlock_irqrestore(sdev->host->host_lock, flags);
 
 	cancel_work_sync(&sdev->event_work);
@@ -956,6 +955,8 @@ int scsi_sysfs_add_sdev(struct scsi_device *sdev)
 void __scsi_remove_device(struct scsi_device *sdev)
 {
 	struct device *dev = &sdev->sdev_gendev;
+	struct Scsi_Host *shost = sdev->host;
+	unsigned long flags;
 
 	if (sdev->is_visible) {
 		if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
@@ -977,6 +978,11 @@ void __scsi_remove_device(struct scsi_device *sdev)
 	blk_cleanup_queue(sdev->request_queue);
 	cancel_work_sync(&sdev->requeue_work);
 
+	spin_lock_irqsave(shost->host_lock, flags);
+	if (!list_empty(&sdev->starved_entry))
+		list_del(&sdev->starved_entry);
+	spin_unlock_irqrestore(shost->host_lock, flags);
+
 	if (sdev->host->hostt->slave_destroy)
 		sdev->host->hostt->slave_destroy(sdev);
 	transport_destroy_device(dev);
-- 
1.7.7
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/04/12 16:46, Mike Christie wrote:
> I think we have to have scsi-ml do a get_device when a sdev is added to
> the starved entry and then do a put_device when it is removed (must do
> these under the host lock for the starved entry case too). I am not sure
> if that is just a hack/papering-over of the problem and there are more
> issues like this.

That would result in a more complex patch than the patch at the start of this thread, wouldn't it ? Also, IMHO it would help to document which functions in the scsi-ml are called with an sdev reference and which ones are not. That would make the scsi-ml code easier to verify for issues like the one reported at the start of this thread.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/04/12 22:36, Mike Christie wrote:
> On 08/04/2012 03:18 PM, Bart Van Assche wrote:
>> On 08/04/12 16:46, Mike Christie wrote:
>>> I think we have to have scsi-ml do a get_device when a sdev is added to
>>> the starved entry and then do a put_device when it is removed (must do
>>> these under the host lock for the starved entry case too). I am not sure
>>> if that is just a hack/papering-over of the problem and there are more
>>> issues like this.
>>
>> That would result in a more complex patch than the patch at the start of
>> this thread, wouldn't it ?
>
> Yaah, but the original patch in this thread is still racey isn't it?

Indeed. How about the patch below ? Scsi devices are removed from starved_list after blk_cleanup_queue() and before put_device(). That guarantees that inside scsi_run_queue() get_device() under host lock will succeed.

---
 drivers/scsi/scsi_lib.c   | 5 +
 drivers/scsi/scsi_sysfs.c | 7 ++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ffd7773..bd7daec 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -452,10 +452,15 @@ static void scsi_run_queue(struct request_queue *q)
 			continue;
 		}
 
+		get_device(&sdev->sdev_gendev);
 		spin_unlock(shost->host_lock);
+
 		spin_lock(sdev->request_queue->queue_lock);
 		__blk_run_queue(sdev->request_queue);
 		spin_unlock(sdev->request_queue->queue_lock);
+
+		put_device(&sdev->sdev_gendev);
+
 		spin_lock(shost->host_lock);
 	}
 	/* put any unprocessed entries back */
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 093d4f6..44f232e 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -348,7 +348,6 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
 	starget->reap_ref++;
 	list_del(&sdev->siblings);
 	list_del(&sdev->same_target_siblings);
-	list_del(&sdev->starved_entry);
 	spin_unlock_irqrestore(sdev->host->host_lock, flags);
 
 	cancel_work_sync(&sdev->event_work);
@@ -956,6 +955,8 @@ int scsi_sysfs_add_sdev(struct scsi_device *sdev)
 void __scsi_remove_device(struct scsi_device *sdev)
 {
 	struct device *dev = &sdev->sdev_gendev;
+	struct Scsi_Host *shost = sdev->host;
+	unsigned long flags;
 
 	if (sdev->is_visible) {
 		if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
@@ -977,6 +978,10 @@ void __scsi_remove_device(struct scsi_device *sdev)
 	blk_cleanup_queue(sdev->request_queue);
 	cancel_work_sync(&sdev->requeue_work);
 
+	spin_lock_irqsave(shost->host_lock, flags);
+	list_del(&sdev->starved_entry);
+	spin_unlock_irqrestore(shost->host_lock, flags);
+
 	if (sdev->host->hostt->slave_destroy)
 		sdev->host->hostt->slave_destroy(sdev);
 	transport_destroy_device(dev);
-- 
1.7.7
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/07/12 08:53, Chanho Min wrote:
> On Tue, Aug 7, 2012 at 2:56 AM, Bart Van Assche wrote:
>> Indeed. How about the patch below ? Scsi devices are removed from
>> starved_list after blk_cleanup_queue() and before put_device(). That
>> guarantees that inside scsi_run_queue() get_device() under host lock
>> will succeed.
>
> Thanks, IMHO, it's harmless and the simple way to solve this issue.
> But, I think the second half of your patches are not required, extra
> referecne is might suffice?

I'm afraid that without the second half of that patch the following race is still possible:
- The sdev reference count drops to zero while scsi_run_queue() is in progress and while that sdev is on the starved_list of its SCSI host; the scsi_device_dev_release_usercontext() call is scheduled but not yet executed.
- scsi_run_queue() takes that sdev off the local starved_list.
- scsi_run_queue() calls get_device() and that call fails since the sdev reference count is zero.
- scsi_device_dev_release_usercontext() gets scheduled and frees the sdev.
- scsi_run_queue() proceeds and calls __blk_run_queue() on a freed queue, which is what we were trying to avoid.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Kernel hangs in truncate_inode_pages()
Hello,

If I log in with the open-iscsi initiator to an iSCSI target and power down the target, the initiator hangs in truncate_inode_pages(). This happens systematically with kernel 3.6-rc3 and it's something I've never seen with kernel 3.4 or before. Has anyone else already run into this issue ?

From the output of echo w >/proc/sysrq-trigger:

SysRq : Show Blocked State
  task                 PC     stack  pid    father
udisks-part-id  D 810f7530 0 30803 10057 0x0006
 8801063c39a8 0046 880126955d60 0002
 8801063c3fd8 880126955d60 8801063c3fd8 8801063c3fd8
 81813420 880126955d60 8801063c39a8 88012fa13990
Call Trace:
 [] ? __lock_page+0x70/0x70
 [] schedule+0x29/0x70
 [] io_schedule+0x8f/0xd0
 [] sleep_on_page+0xe/0x20
 [] __wait_on_bit_lock+0x5a/0xc0
 [] ? find_get_pages+0x2a/0x1c0
 [] ? release_pages+0x1af/0x200
 [] __lock_page+0x67/0x70
 [] ? autoremove_wake_function+0x40/0x40
 [] truncate_inode_pages_range+0x456/0x480
 [] ? trace_hardirqs_on+0xd/0x10
 [] ? on_each_cpu_cond+0xaa/0xf0
 [] truncate_inode_pages+0x15/0x20
 [] kill_bdev+0x2f/0x40
 [] __blkdev_put+0x76/0x1d0
 [] blkdev_put+0x5d/0x180
 [] blkdev_close+0x25/0x30
 [] __fput+0xcc/0x240
 [] fput+0xe/0x10
 [] task_work_run+0x69/0xa0
 [] do_exit+0x78f/0x930
 [] ? get_signal_to_deliver+0xf0/0x6f0
 [] ? _raw_spin_lock_irq+0x17/0x60
 [] ? _raw_spin_unlock_irq+0x30/0x50
 [] do_group_exit+0x4e/0xc0
 [] get_signal_to_deliver+0x228/0x6f0
 [] do_signal+0x3c/0x620
 [] ? sysret_signal+0x5/0x3d
 [] do_notify_resume+0x6d/0xb0
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] int_signal+0x12/0x17

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Kernel hangs in truncate_inode_pages()
On 08/24/12 17:24, Mike Christie wrote:
> On 08/24/2012 08:13 AM, Bart Van Assche wrote:
>> If I log in with the open-iscsi initiator to an iSCSI target and power
>> down the target, the initiator hangs in truncate_inode_pages(). This
>> happens systematically with kernel 3.6-rc3 and it's something I've
>> never seen with kernel 3.4 or before. Has anyone else already run into
>> this issue ?
>
> I have not seen this issue before and have not seen it reported.
>
> Are you doing IO the disk at the time of power down, or are you doing a
> iscsiadm -m node ... --logout command?

There was no I/O load test running at the time of the target power down. The call stack I posted is what I see before logout. And this is what I get after iscsi logout:

BUG: soft lockup - CPU#3 stuck for 23s! [iscsid:8589]
Modules linked in: ... [last unloaded: crc32c]
irq event stamp: 918076162
hardirqs last enabled at (918076161): [] _raw_spin_unlock_irqrestore+0x3f/0x70
hardirqs last disabled at (918076162): [] apic_timer_interrupt+0x67/0x80
softirqs last enabled at (917955754): [] __do_softirq+0x142/0x2a0
softirqs last disabled at (917955749): [] call_softirq+0x1c/0x30
CPU 3
Pid: 8589, comm: iscsid Not tainted 3.6.0-rc3+ #1 Gigabyte Technology Co., Ltd. Z68X-UD3H-B3/Z68X-UD3H-B3
RIP: 0010:[] [] _raw_spin_unlock_irqrestore+0x41/0x70
RSP: 0018:8800ae7e5928 EFLAGS: 0286
RAX: 880106002eb0 RBX: 810956a2 RCX: 0006
RDX: 0040 RSI: 880106003498 RDI: 880106002eb0
RBP: 8800ae7e5938 R08: R09: 0001
R10: 0002 R11: R12: 0006
R13: 0007 R14: 880106002eb0 R15: 880106003498
FS: 7fa8cfc22700() GS:88012fac() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 7fa8cf26617c CR3: bc87c000 CR4: 000407e0
DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
Process iscsid (pid: 8589, threadinfo 8800ae7e4000, task 880106002eb0)
Stack:
 88011af0c000 88001ea0fff0 8800ae7e5988 a00aaafc
 0286 88001ea1 8800ae7e5968 88008ae44a28
 88008ae44800 88008ae44818 0001
Call Trace:
 [] scsi_remove_target+0x1bc/0x1f0 [scsi_mod]
 [] __iscsi_unbind_session+0xc6/0x1a0 [scsi_transport_iscsi]
 [] iscsi_remove_session+0x105/0x1e0 [scsi_transport_iscsi]
 [] iscsi_destroy_session+0x16/0x60 [scsi_transport_iscsi]
 [] iscsi_session_teardown+0xa9/0xd0 [libiscsi]
 [] iscsi_sw_tcp_session_destroy+0x50/0x70 [iscsi_tcp]
 [] iscsi_if_rx+0xab8/0xf60 [scsi_transport_iscsi]
 [] netlink_unicast+0x1ad/0x230
 [] netlink_sendmsg+0x2fb/0x350
 [] ? sock_update_classid+0xc8/0x140
 [] sock_sendmsg+0xa2/0xe0
 [] ? might_fault+0x9c/0xb0
 [] ? might_fault+0x53/0xb0
 [] ? verify_iovec+0x56/0xd0
 [] __sys_sendmsg+0x382/0x390
 [] ? might_fault+0x9c/0xb0
 [] ? might_fault+0x53/0xb0
 [] ? lg_local_unlock+0x23/0x50
 [] ? cp_new_stat+0x116/0x130
 [] sys_sendmsg+0x49/0x90
 [] system_call_fastpath+0x16/0x1b

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/07/12 08:53, Chanho Min wrote:
> In addition, Is it ironic that we are careful to use put_device at
> scsi_request_fn?. If we trigger the ->remove(),
> It occur a oops. What about the removal of unlock/lock as patch bellow?
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 4037fd5..8d9eccd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1608,11 +1608,7 @@ out_delay:
>  	if (sdev->device_busy == 0)
>  		blk_delay_queue(q, SCSI_QUEUE_DELAY);
>  out:
> -	/* must be careful here...if we trigger the ->remove() function
> -	 * we cannot be holding the q lock */
> -	spin_unlock_irq(q->queue_lock);
>  	put_device(&sdev->sdev_gendev);
> -	spin_lock_irq(q->queue_lock);
>  }

As far as I can see the comment in the above code was added before scsi_device_dev_release() was moved to user context, so it might be outdated. See also http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=65110b2168950a19cc78b5027ed18cb811fbdae8.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fix NULL-pointer dereference on scsi_run_queue
On 08/08/12 03:42, Chanho Min wrote: > Thank you for the explanation. It look correct. Let's check one more thing. > What If __scsi_remove_device doesn't release device? : reference count > is more than 2. > So We lost starved_list but device is exist. Is there any issue about this? As far as I can see that scenario will also be handled correctly by the proposed patch. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Feb 18, 2008 10:43 AM, Erez Zilber <[EMAIL PROTECTED]> wrote:
> If you use a high value for FirstBurstLength, all (or most) of your data
> will be sent as unsolicited data-out PDUs. These PDUs don't use the RDMA
> engine, so you miss the advantage of IB.

Hello Erez,

Did you notice the e-mail Roland Dreier wrote on February 6, 2008 ? This is what Roland wrote:
> I think the confusion here is caused by a slight misuse of the term
> "RDMA". It is true that all data is always transported over an
> InfiniBand connection when iSER is used, but not all such transfers
> are one-sided RDMA operations; some data can be transferred using
> send/receive operations.

Or: data sent during the first burst is not transferred via one-sided remote memory reads or writes but via two-sided send/receive operations. At least on my setup, these operations are as fast as one-sided remote memory reads or writes. As an example, I obtained the following numbers on my setup (SDR 4x network):
ib_write_bw: 933 MB/s.
ib_read_bw: 905 MB/s.
ib_send_bw: 931 MB/s.

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
SMP-related kernel memory leak
Hello, I noticed that the amount of memory used by the Linux kernel steadily increases over time on SMP systems (x86 architecture, 32-bit kernel). This problem disappears when I add maxcpus=1 to the kernel command line. I have observed this behavior both on the 2.6.22.18 and 2.6.24.2 kernels. Did anyone notice anything similar ? See also: http://bugzilla.kernel.org/show_bug.cgi?id=9991 Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP-related kernel memory leak
On Feb 19, 2008 7:18 PM, Oliver Pinter <[EMAIL PROTECTED]> wrote: > On 2/19/08, Bart Van Assche <[EMAIL PROTECTED]> wrote: > > I noticed that the amount of memory used by the Linux kernel steadily > > increases over time on SMP systems (x86 architecture, 32-bit kernel). > > This problem disappears when I add maxcpus=1 to the kernel command > > line. I have observed this behavior both on the 2.6.22.18 and 2.6.24.2 > > kernels. Did anyone notice anything similar ? > > > > See also: http://bugzilla.kernel.org/show_bug.cgi?id=9991 > > this patch fixed them http://lkml.org/lkml/2008/2/18/405 ? Thanks for the hint. If I interpreted the 2.6.24 changelog correctly this patch is already included with 2.6.24 ? The problem still occurs with 2.6.24.2. I am currently trying to find the minimal kernel config which still triggers this problem. Any other hints for finding the cause of this issue are welcome of course. Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Feb 20, 2008 8:34 AM, Erez Zilber <[EMAIL PROTECTED]> wrote: > Bart Van Assche wrote: > > Or: data sent during the first burst is not transferred via one-sided > > remote memory reads or writes but via two-sided send/receive > > operations. At least on my setup, these operations are as fast as > > one-sided remote memory reads or writes. As an example, I obtained the > > following numbers on my setup (SDR 4x network); > > ib_write_bw: 933 MB/s. > > ib_read_bw: 905 MB/s. > > ib_send_bw: 931 MB/s. > > According to these numbers one can think that you don't need RDMA at > all, just send iSCSI PDUs over IB. Sorry, but you are misinterpreting what I wrote. > The benchmarks that you use are > synthetic IB benchmarks that are not equivalent to iSCSI over iSER. They > just send IB packets. I'm not surprised that you got more or less the > same performance because, AFAIK, ib_send_bw doesn't copy data (unlike > iSCSI that has to copy data that is sent/received without RDMA). I agree that ib_write_bw / ib_read_bw / ib_send_bw performance results are not equivalent to iSCSI over iSER. The reason that I included these performance results was to illustrate that two-sided data transfers over IB are about as fast as one-sided data transfers. > When you use RDMA with iSCSI (i.e. iSER), you don't need to create iSCSI > PDUs and process them. The CPU is not busy as it is with iSCSI over TCP > because no data copies are required. Another advantage is that you don't > need header/data digest because the IB HW does that. As far as I know, when using iSER, the FirstBurstLength bytes of data are sent via two-sided data transfers, and there is no CPU intervention required to transfer the data itself over the IB network. Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: Merging of completely unreviewed drivers
On Fri, Feb 22, 2008 at 2:46 AM, David Newall <[EMAIL PROTECTED]> wrote: > Krzysztof Halasa wrote: > > Perhaps we should increase line length limit, 132 should be fine. > > Especially useful with long printk() lines and long arithmetic > > expressions. > > Yes; or even longer. 80 characters might have made sense on a screen > when the alternative was 80 characters on a punched card, but on a > modern computer it's very restrictive. That's especially true with the > deep indents that you quickly get in C. Even short lines often need to > be split when you put a few tabs in front of them, and that makes > comprehension that bit harder, not to mention looks ugly. There is a reason to limit line length: scientific research has shown that readability of regular texts is optimal for a line length between 55 and 65 characters. My experience is that the readability of source code decreases when the lines are very long (more than 160 characters). Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH][SCSI] remove the queue unlock in scsi_requset_fn
On 08/10/12 03:22, Chanho Min wrote: > We don't need to unlock the queue before put_device in scsi_request_fn() It looks like there is a typo in the patch subject ? Also, you can omit "[SCSI]" from the patch subject - AFAIK James has a script that inserts that text automatically. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] remove the queue unlock in scsi_requset_fn
On 08/16/12 01:35, Chanho Min wrote: >> functions will occur in line. I also don't see why the sdev reference >> couldn't drop to zero here. > scsi_request_fn is called under the lock of request_queue->queue_lock. > If we drop the sdev reference to zero here, > scsi_device_dev_release_usercontext is > invoked and make request_queue to NULL. When caller of scsi_request_fn try to > unlock request_queue->queue_lock, the oops is occurred. Whether or not your patch is applied, if the put_device() call in scsi_request_fn() decreases the sdev reference count to zero, the scsi_request_fn() caller will still try to unlock the queue lock after scsi_request_fn() finished and hence will trigger a use-after-free. I'm afraid the only real solution is to modify the SCSI and/or block layers such that scsi_remove_device() can't finish while scsi_request_fn() is in progress. And once that is guaranteed the get_device() / put_device() pair can be left out from scsi_request_fn(). Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
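A rough sketch of the caller-side pattern that makes this use-after-free possible (simplified from blk_run_queue()-style callers, not verbatim kernel code):

	spin_lock_irq(q->queue_lock);
	q->request_fn(q);		/* scsi_request_fn(); may drop the
					 * last sdev reference via
					 * put_device() ... */
	spin_unlock_irq(q->queue_lock);	/* ... after which q->queue_lock may
					 * already belong to freed memory */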
Re: Integration of SCST in the mainstream Linux kernel
On Feb 5, 2008 6:50 PM, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> For remotely accessing data, iSCSI+fs is quite simply more overhead than
> a networked fs. With iSCSI you are doing
>
> local VFS -> local blkdev -> network
>
> whereas a networked filesystem is
>
> local VFS -> network

There are use cases that can be solved better via iSCSI and a filesystem than via a network filesystem. One such use case is when deploying a virtual machine whose data is stored on a network server: in that case there is only one user of the data (so there are no locking issues) and the filesystem and the block device each run in another operating system: the filesystem runs inside the virtual machine and iSCSI either runs in the hypervisor or in the native OS.

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Feb 5, 2008 6:01 PM, Erez Zilber <[EMAIL PROTECTED]> wrote: > > Using such large values for FirstBurstLength will give you poor > performance numbers for WRITE commands (with iSER). FirstBurstLength > means how much data should you send as unsolicited data (i.e. without > RDMA). It means that your WRITE commands were sent without RDMA. Sorry, but I'm afraid you got this wrong. When the iSER transport is used instead of TCP, all data is sent via RDMA, including unsolicited data. If you have look at the iSER implementation in the Linux kernel (source files under drivers/infiniband/ulp/iser), you will see that all data is transferred via RDMA and not via TCP/IP. Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel
Since the focus of this thread shifted somewhat in the last few messages, I'll try to summarize what has been discussed so far:
- A number of participants joined this discussion spontaneously. This suggests that there is considerable interest in networked storage and iSCSI.
- It has been explained why iSCSI makes sense as a storage protocol (compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64 KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented in user space or in kernel space. It is clear now that an implementation in the kernel can be made faster than a user space implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804). Regarding existing implementations, measurements have among other things shown that SCST is faster than STGT (30% with the following setup: iSCSI via IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in the mainstream Linux kernel. There is no agreement on this subject yet. The short-term options are as follows:
  1) Do not integrate any new iSCSI target implementation in the mainstream Linux kernel.
  2) Add one of the existing in-kernel iSCSI target implementations to the kernel, e.g. SCST or PyX/LIO.
  3) Create a new in-kernel iSCSI target implementation that combines the advantages of the existing iSCSI kernel target implementations (IETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the various storage target authors agree with this ?

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND] remove the queue unlock in scsi_requset_fn
On 08/16/12 07:52, Bart Van Assche wrote:
> On 08/16/12 01:35, Chanho Min wrote:
>>> functions will occur in line. I also don't see why the sdev reference
>>> couldn't drop to zero here.
>> scsi_request_fn is called under the lock of request_queue->queue_lock.
>> If we drop the sdev reference to zero here,
>> scsi_device_dev_release_usercontext is
>> invoked and make request_queue to NULL. When caller of scsi_request_fn try to
>> unlock request_queue->queue_lock, the oops is occurred.
>
> Whether or not your patch is applied, if the put_device() call in
> scsi_request_fn() decreases the sdev reference count to zero, the
> scsi_request_fn() caller will still try to unlock the queue lock after
> scsi_request_fn() finished and hence will trigger a use-after-free. I'm
> afraid the only real solution is to modify the SCSI and/or block layers
> such that scsi_remove_device() can't finish while scsi_request_fn() is
> in progress. And once that is guaranteed the get_device() / put_device()
> pair can be left out from scsi_request_fn().

(replying to my own e-mail)

How about the patch below ?

[PATCH] Fix device removal race

If the put_device() call in scsi_request_fn() drops the sdev refcount to zero then the spin_lock_irq() call after the put_device() call triggers a use-after-free. Avoid that by making sure that blk_cleanup_queue() can only finish after all active scsi_request_fn() calls have returned.

---
 block/blk-core.c        |  1 +
 drivers/scsi/scsi_lib.c | 10 ++
 include/linux/blkdev.h  |  5 +
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4b4dbdf..0e4da3b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -388,6 +388,7 @@ void blk_drain_queue(struct request_queue *q, bool drain_all)
 			__blk_run_queue(q);
 
 		drain |= q->nr_rqs_elvpriv;
+		drain |= q->request_fn_active;
 
 		/*
 		 * Unfortunately, requests are queued at and tracked from
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ffd7773..10bb348 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1514,9 +1514,7 @@ static void scsi_request_fn(struct request_queue *q)
 	struct scsi_cmnd *cmd;
 	struct request *req;
 
-	if(!get_device(&sdev->sdev_gendev))
-		/* We must be tearing the block queue down already */
-		return;
+	q->request_fn_active++;
 
 	/*
 	 * To start with, we keep looping until the queue is empty, or until
@@ -1626,11 +1624,7 @@ out_delay:
 	if (sdev->device_busy == 0)
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 out:
-	/* must be careful here...if we trigger the ->remove() function
-	 * we cannot be holding the q lock */
-	spin_unlock_irq(q->queue_lock);
-	put_device(&sdev->sdev_gendev);
-	spin_lock_irq(q->queue_lock);
+	q->request_fn_active--;
 }
 
 u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4e72a9d..11c1987 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -377,6 +377,11 @@ struct request_queue {
 	unsigned int		nr_sorted;
 	unsigned int		in_flight[2];
 
+	/*
+	 * Number of active request_fn() calls for those request_fn()
+	 * implementations that unlock the queue_lock, e.g. scsi_request_fn().
+	 */
+	unsigned int		request_fn_active;
 
 	unsigned int		rq_timeout;
 	struct timer_list	timeout;
-- 
1.7.7
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] IB/mlx4: silence GCC warning
On 10/29/12 10:50, Paul Bolle wrote:
> On Wed, 2012-10-10 at 09:23 +0200, Jack Morgenstein wrote:
>> You could use:
>> u16 uninitialized_var(vlan);
>> instead.
>
> I guess we'd better just wait and see whether uninitialized_var()
> survives before discussing your suggestion (see the thread starting at
> https://lkml.org/lkml/2012/10/26/508 ).
>
>> Although this in the special QP data flow, I still prefer to avoid
>> adding extra code (even setting initial values at procedure entry).
>> The line above will also do the job. "uninitialized_var" is used
>> elsewhere in the driver. See, for example, mlx4_ib_post_send() in the
>> same file (qp.c).

(replying to an e-mail of a few months ago)

If there are no further objections I'd like to see this patch go upstream. It fixes an annoying compiler warning and I don't think that this patch has a negative performance impact.

Thanks,

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
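For reference, uninitialized_var() was at that time defined for gcc builds in include/linux/compiler-gcc*.h essentially as a self-assignment, so the suggested line expands as follows:

#define uninitialized_var(x) x = x

	u16 uninitialized_var(vlan);	/* expands to: u16 vlan = vlan;
					 * silences the warning without
					 * generating extra code */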
Re: [RESEND PATCH] scsi: make struct scsi_varlen_cdb_hdr packed
On 10/11/12 11:15, James Hogan wrote:

The struct scsi_varlen_cdb_hdr is expected to be exactly 10 bytes when used in struct osd_cdb_head, but it isn't marked as packed. Some architectures will round the struct size up which triggers BUILD_BUG_ON compile errors in osd_initiator.c when the outer structs are unexpected sizes. This is fixed by marking struct scsi_varlen_cdb_hdr as __packed.

Signed-off-by: James Hogan
---
 include/scsi/scsi.h | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 66216c1..3beaef3 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -198,7 +198,7 @@ struct scsi_varlen_cdb_hdr {
 	__u8 additional_cdb_length;	/* total cdb length - 8 */
 	__be16 service_action;
 	/* service specific data follows */
-};
+} __packed;
 
 static inline unsigned scsi_varlen_cdb_length(const void *hdr)

Hello James,

Are you aware that __packed can also be used on individual struct members and that doing so has a lower performance penalty than using the __packed attribute on an entire struct ? See e.g. for an example.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
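A sketch of the member-level alternative mentioned above (the field list is abbreviated, and whether this alone avoids the struct size rounding James describes depends on the architecture):

struct scsi_varlen_cdb_hdr {
	/* ... leading __u8 members elided ... */
	__u8 additional_cdb_length;		/* total cdb length - 8 */
	__be16 service_action __packed;		/* pack only this member
						 * instead of the whole
						 * struct */
	/* service specific data follows */
};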
Re: Deadlock observed in SCSI device removal
On 12/15/12 05:22, Amit Sahrawat wrote: We are using kernel version 3.0.33, and the patch which is mentioned for the above problem - is already applied in the kernel. If the kernel reports that io_schedule() hangs that usually means that an I/O request was not finished properly. It would help if you could verify whether or not you can reproduce this issue with kernel 3.6.10. Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: Merging of completely unreviewed drivers
On Fri, Feb 22, 2008 at 7:54 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote: > If a patch or if a file has a clean _style_, bugs and deeper > structural problems often stand out like a sore thumb. But if the > code is peppered with random style noise, it's a lot harder (for me > at least) to notice real bugs. I can notice bugs in a squeeky clean > code base about 5 times easier than in a noisy codebase. This effect > alone makes checkpatch indispensible for the scheduler and for > arch/x86. I also appreciate style uniformity in kernel code. My (limited) experience with checkpatch is that most checkpatch complaints are easy to resolve. Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Integration of SCST in the mainstream Linux kernel
As you probably know there is a trend in enterprise computing towards networked storage. This is illustrated by the emergence during the past few years of standards like SRP (SCSI RDMA Protocol), iSCSI (Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different pieces of software are necessary to make networked storage possible: initiator software and target software. As far as I know there exist three different SCSI target implementations for Linux:
- The iSCSI Enterprise Target Daemon (IETD, http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST, http://scst.sourceforge.net/).

Since I was wondering which SCSI target software would be best suited for an InfiniBand network, I started evaluating the STGT and SCST SCSI target implementations. Apparently the performance difference between STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks, but the SCST target software outperforms the STGT software on an InfiniBand network. See also the following thread for the details: http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.

About the design of the SCST software: while one of the goals of the STGT project was to keep the in-kernel code minimal, the SCST project implements the whole SCSI target in kernel space. SCST is implemented as a set of new kernel modules; only minimal changes to the existing kernel are necessary before the SCST kernel modules can be used. This is the same approach that will be followed in the very near future in the OpenSolaris kernel (see also http://opensolaris.org/os/project/comstar/). More information about the design of SCST can be found here: http://scst.sourceforge.net/doc/scst_pg.html.

My impression is that both the STGT and SCST projects are well designed, well maintained and have a considerable user base. According to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to STGT with respect to features, performance, maturity, stability, and number of existing target drivers. Unfortunately the SCST kernel code lives outside the kernel tree, which makes SCST harder to use than STGT.

As an SCST user, I would like to see the SCST kernel code integrated in the mainstream kernel because of its excellent performance on an InfiniBand network. Since the SCST project comprises about 14 KLOC, reviewing the SCST code will take considerable time. Who will do this reviewing work ? And with regard to the comments made by the reviewers: Vladislav, do you have the time to carry out the modifications requested by the reviewers ? I expect, among other things, that reviewers will ask to move SCST's configuration pseudofiles from procfs to sysfs.

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] InfiniBand/RDMA merge plans for 2.6.25
On Jan 18, 2008 1:11 AM, Roland Dreier <[EMAIL PROTECTED]> wrote: > Anyway, here are all the pending things that I'm aware of. As usual, > if something isn't already in my tree and isn't listed below, I > probably missed it or dropped it by mistake. Please remind me again > in that case. Are there any plans to merge the SDP (Sockets Direct Protocol) implementation ? Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] I2C: add support for the PCF8575 chip
From: Bart Van Assche

Add support for the PCF8575 I2C chip.

Signed-off-by: Bart Van Assche ([EMAIL PROTECTED])
---
diff -uprN -X orig/linux-2.6.22.9/Documentation/dontdiff orig/linux-2.6.22.9/drivers/i2c/chips/Kconfig linux-2.6.22.9/drivers/i2c/chips/Kconfig
--- orig/linux-2.6.22.9/drivers/i2c/chips/Kconfig	2007-09-26 20:03:01.0 +0200
+++ linux-2.6.22.9/drivers/i2c/chips/Kconfig	2007-10-05 09:27:19.0 +0200
@@ -41,13 +41,27 @@ config SENSORS_PCF8574
 	default n
 	help
 	  If you say yes here you get support for Philips PCF8574 and
-	  PCF8574A chips.
+	  PCF8574A chips. These chips are 8-bit I/O expanders for the I2C bus.
 
 	  This driver can also be built as a module. If so, the module
 	  will be called pcf8574.
 
-	  These devices are hard to detect and rarely found on mainstream
-	  hardware. If unsure, say N.
+	  These devices are hard to detect automatically and are rarely found
+	  on mainstream hardware. If unsure, say N.
+
+config SENSORS_PCF8575
+	tristate "Philips PCF8575"
+	default n
+	help
+	  If you say yes here you get support for Philips PCF8575 chip.
+	  This chip is a 16-bit I/O expander for the I2C bus. Several other
+	  chip manufacturers sell equivalent chips, e.g. Texas Instruments.
+
+	  This driver can also be built as a module. If so, the module
+	  will be called pcf8575.
+
+	  This device is hard to detect automatically and is rarely found on
+	  mainstream hardware. If unsure, say N.
 
 config SENSORS_PCA9539
 	tristate "Philips PCA9539 16-bit I/O port"
diff -uprN -X orig/linux-2.6.22.9/Documentation/dontdiff orig/linux-2.6.22.9/drivers/i2c/chips/Makefile linux-2.6.22.9/drivers/i2c/chips/Makefile
--- orig/linux-2.6.22.9/drivers/i2c/chips/Makefile	2007-09-26 20:03:01.0 +0200
+++ linux-2.6.22.9/drivers/i2c/chips/Makefile	2007-10-05 09:16:21.0 +0200
@@ -9,6 +9,7 @@ obj-$(CONFIG_SENSORS_MAX6875) += max6875
 obj-$(CONFIG_SENSORS_M41T00) += m41t00.o
 obj-$(CONFIG_SENSORS_PCA9539) += pca9539.o
 obj-$(CONFIG_SENSORS_PCF8574) += pcf8574.o
+obj-$(CONFIG_SENSORS_PCF8575) += pcf8575.o
 obj-$(CONFIG_SENSORS_PCF8591) += pcf8591.o
 obj-$(CONFIG_ISP1301_OMAP) += isp1301_omap.o
 obj-$(CONFIG_TPS65010) += tps65010.o
diff -uprN -X orig/linux-2.6.22.9/Documentation/dontdiff orig/linux-2.6.22.9/drivers/i2c/chips/pcf8575.c linux-2.6.22.9/drivers/i2c/chips/pcf8575.c
--- orig/linux-2.6.22.9/drivers/i2c/chips/pcf8575.c	2007-10-05 09:44:02.0 +0200
+++ linux-2.6.22.9/drivers/i2c/chips/pcf8575.c	2007-10-05 11:10:46.0 +0200
@@ -0,0 +1,305 @@
+/*
+  pcf8575.c - Part of lm_sensors, Linux kernel modules for hardware
+  monitoring
+  Copyright (c) 2000 Frodo Looijaard <[EMAIL PROTECTED]>,
+  Philip Edelbrock <[EMAIL PROTECTED]>,
+  Dan Eaton <[EMAIL PROTECTED]>
+  Ported to Linux 2.6 by Aurelien Jarno <[EMAIL PROTECTED]> with
+  the help of Jean Delvare <[EMAIL PROTECTED]>
+
+  Copyright (C) 2006 Michael Hennerich, Analog Devices Inc.
+  <[EMAIL PROTECTED]>
+  based on the pcf8574.c
+
+  Ported from ucLinux to mainstream Linux kernel by Bart Van Assche.
+
+  About the pcf8575: the pcf8575 is an 16-bit I/O expander for the I2C bus
+  produced by a.o. Philips Semiconductors.
+
+  This program is free software; you can redistribute it and/or modify
+  it under the terms of the GNU General Public License as published by
+  the Free Software Foundation; either version 2 of the License, or
+  (at your option) any later version.
+
+  This program is distributed in the hope that it will be useful,
+  but WITHOUT ANY WARRANTY; without even the implied warranty of
+  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+  GNU General Public License for more details.
+
+  You should have received a copy of the GNU General Public License
+  along with this program; if not, write to the Free Software
+  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+*/
+
+
+#include
+#include
+#include
+#include
+
+
+/* Addresses to scan */
+static unsigned short normal_i2c[] = { I2C_CLIENT_END };
+#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 20)
+static unsigned short normal_i2c_range[] = { I2C_CLIENT_END };
+#endif
+
+/* Insmod parameters */
+I2C_CLIENT_INSMOD;
+
+
+/* Each client has this additional data */
+struct pcf8575_data {
+	struct i2c_client client;
+
+	u16 write;			/* Remember last written value */
+	u8 buf[3];
+};
+
+static int pcf8575_attach_adapter(struct i2c_adapter *adapter);
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 0)
+static int pcf8575_detect(struct i2c_adapter *adapter, int address, int kind);
+#else
+static int pcf8575_detect(struct i2c_adapter *adapter,
+			  int addr, unsigned short flags, int kind);
+#endif
+static int pcf8575_detach_client(struct i2c_c
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 12:32 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote: > > iSER has parameters to limit the maximum size of RDMA (it needs to > repeat RDMA with a poor configuration)? Please specify which parameters you are referring to. As you know I had already repeated my tests with ridiculously high values for the following iSER parameters: FirstBurstLength, MaxBurstLength and MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block size specified to dd). Bart Van Assche. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Jan 29, 2008 9:42 PM, James Bottomley <[EMAIL PROTECTED]> wrote:
> > As an SCST user, I would like to see the SCST kernel code integrated
> > in the mainstream kernel because of its excellent performance on an
> > InfiniBand network. Since the SCST project comprises about 14 KLOC,
> > reviewing the SCST code will take considerable time. Who will do this
> > reviewing work ? And with regard to the comments made by the
> > reviewers: Vladislav, do you have the time to carry out the
> > modifications requested by the reviewers ? I expect a.o. that
> > reviewers will ask to move SCST's configuration pseudofiles from
> > procfs to sysfs.
>
> The two target architectures perform essentially identical functions, so
> there's only really room for one in the kernel. Right at the moment,
> it's STGT. Problems in STGT come from the user<->kernel boundary which
> can be mitigated in a variety of ways. The fact that the figures are
> pretty much comparable on non IB networks shows this.

Are you saying that users who need an efficient iSCSI implementation should switch to OpenSolaris ? The OpenSolaris COMSTAR project involves the migration of the existing OpenSolaris iSCSI target daemon from userspace to their kernel. The OpenSolaris developers are spending time on this because they expect a significant performance improvement.

> I really need a whole lot more evidence than at worst a 20% performance
> difference on IB to pull one implementation out and replace it with
> another. Particularly as there's no real evidence that STGT can't be
> tweaked to recover the 20% even on IB.

My measurements on a 1 GB/s InfiniBand network have shown that the current SCST implementation is able to read data via direct I/O at a rate of 811 MB/s (via SRP) and that the current STGT implementation is able to transfer data at a rate of 589 MB/s (via iSER). That's a performance difference of 38%. And even more important, the I/O latency of SCST is significantly lower than that of STGT. This is very important for database workloads -- the I/O pattern caused by database software is close to random I/O, and database software needs low-latency I/O in order to run efficiently.

In the thread with the title "Performance of SCST versus STGT" on the SCST-devel / STGT-devel mailing lists not only the raw performance numbers were discussed but also which further performance improvements are possible. It became clear that the SCST performance can be improved further by implementing a well known optimization (zero-copy I/O). Fujita Tomonori explained in the same thread that it is possible to improve the performance of STGT further, but that this would require a lot of effort (implementing asynchronous I/O in the kernel and also implementing a new caching mechanism using pre-registered buffers). See also: http://sourceforge.net/mailarchive/forum.php?forum_name=scst-devel&viewmonth=200801&viewday=17

Bart Van Assche.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 11:56 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> On Wed, 30 Jan 2008 09:38:04 +0100
> "Bart Van Assche" <[EMAIL PROTECTED]> wrote:
>> Please specify which parameters you are referring to. As you know I
>
> Sorry, I can't say. I don't know much about iSER. But seems that Pete
> and Robin can get the better I/O performance - line speed ratio with
> STGT.

Robin Humble was using a DDR InfiniBand network, while my tests were performed with an SDR InfiniBand network. Robin's results can't be directly compared to my results. Pete Wyckoff's results (http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf) are hard to interpret. I have asked Pete which of the numbers in his test can be compared with what I measured, but Pete did not reply.

> The version of OpenIB might matters too. For example, Pete said that
> STGT reads loses about 100 MB/s for some transfer sizes due to the
> OpenIB version difference or other unclear reasons.
>
> http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135

Pete wrote about a degradation from 600 MB/s to 500 MB/s for reads with STGT+iSER. In my tests I measured 589 MB/s for reads (direct I/O), which matches the better result obtained by Pete. Note: the InfiniBand kernel modules I used were those from the 2.6.22.9 kernel, not from the OFED distribution.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 5:34 PM, James Bottomley <[EMAIL PROTECTED]> wrote: > > On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote: > > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote: > > > > > > iSER has parameters to limit the maximum size of RDMA (it needs to > > > repeat RDMA with a poor configuration)? > > > > Please specify which parameters you are referring to. As you know I > > had already repeated my tests with ridiculously high values for the > > following iSER parameters: FirstBurstLength, MaxBurstLength and > > MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block > > size specified to dd). > > the 1Mb block size is a bit of a red herring. Unless you've > specifically increased the max_sector_size and are using an sg_chain > converted driver, on x86 the maximum possible transfer accumulation is > 0.5MB. I did not publish the results, but I have also done tests with other block sizes. The other sizes I tested were between 0.1MB and 10MB. The performance difference for these other sizes compared to a block size of 1MB was small (smaller than the variance between individual tests results). Bart. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
On Jan 30, 2008 5:22 PM, James Bottomley <[EMAIL PROTECTED]> wrote:
> ...
> Deciding what lives in userspace and what should be in the kernel lies
> at the very heart of architectural decisions. However, the argument
> that "it should be in the kernel because that would make it faster" is
> pretty much a discredited one. To prevail on that argument, you have to
> demonstrate that there's no way to enable user space to do the same
> thing at the same speed. Further, it was the same argument used the
> last time around when the STGT vs SCST investigation was done. Your own
> results on non-IB networks show that both architectures perform at the
> same speed. That tends to support the conclusion that there's something
> specific about IB that needs to be tweaked or improved for STGT to get
> it to perform correctly.

You should know that given two different implementations in software of the same communication protocol, differences in latency and throughput become more visible as the network latency gets lower and the throughput gets higher. That's why conclusions can only be drawn from the InfiniBand numbers, and not from the 1 Gbit/s Ethernet numbers. Assuming that there is something specific in STGT with regard to InfiniBand is speculation.

> Furthermore, if you have already decided before testing that SCST is
> right and that STGT is wrong based on the architectures, it isn't
> exactly going to increase my confidence in your measurement methodology
> claiming to show this, now is it?

I did not draw any conclusions from the architecture -- the only data I based my conclusions on were my own performance measurements.

> ...
> These are both features being independently worked on, are they not?
> Even if they weren't, the combination of the size of SCST in kernel plus
> the problem of having to find a migration path for the current STGT
> users still looks to me to involve the greater amount of work.

My proposal was to have both the SCST kernel code and the STGT kernel code in the mainstream Linux kernel. This would make it easier for current STGT users to evaluate SCST. It's too early to choose one of the two projects -- this choice can be made later on.

Bart.
-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is gcc thread-unsafe?
On 10/25/07, Linus Torvalds wrote:
> The gcc developers seem to have had a total disregard for what people
> want or need, and every time some code generation issue comes up, there's
> a lot of people on the list that do language-lawyering, rather than admit
> that there might be a problem.

Please make a proposal for how gcc should be modified instead of just shooting at the gcc people -- the root cause here is the way the C/C++ memory model is defined. (Note: I'm not in any way involved in gcc development.)

You can find my proposal to improve gcc here: http://gcc.gnu.org/ml/gcc/2007-10/msg00465.html

Bart Van Assche.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is gcc thread-unsafe?
On 10/26/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Fri, 26 Oct 2007, Linus Torvalds wrote:
>> On Fri, 26 Oct 2007, Bart Van Assche wrote:
>>>
>>> You can find my proposal to improve gcc here:
>>> http://gcc.gnu.org/ml/gcc/2007-10/msg00465.html
>>
>> Btw, I think this is fine per se, but putting "__attribute__((acquire))"
>> on the functions that acquire a lock does seem to be problematic, in that
>> quite often you might well want to inline those things. How would you
>> handle that?
>
> Thinking some more about this, you really have two cases:
>
> - full functions taking/releasing locks (possibly conditionally, ie
>   with something like trylock and/or based on argument values).
>
>   You simply *cannot* require these to be marked, because the locking may
>   have been done indirectly. Yes, you can mark things like
>   "pthread_mutex_trylock()" as being an acquire-function, but the fact
>   is, users will then wrap these things in *other* functions, and return
>   their return values.
>
>   Ergo: a compiler *must* assume that a function call that it
>   didn't inline involves locking. There's no point in adding some
>   gcc-specific attributes to system header files, because it's not going
>   to fix anything in any portable program.

You have a point here.

> - inline assembly (together with, potentially, compiler primitives).
>   That's the only other way to reliably do locking from C.
>
>   This one gcc could certainly extend on. But would there really be any
>   upside? It would be easier/better to say that inline assembly (at least
>   if it clobbers memory or is volatile) has the same serialization issues
>   as a function call.

A problem is that the serialization properties defined for functions in the C standard only apply to volatile variables, not to non-volatile variables. But for asm statements this can be solved by adding "memory" to the list of clobbered registers -- this will prevent any reordering of manipulations of non-volatile variables and asm statements. Andrew, do you know whether gcc currently contains any optimization that interchanges the order of accesses to non-volatile variables and function calls ?

> So the fact is, any compiler that turns
>
>	if (conditional)
>		x++;
>
> into an unconditional write to x (where 'x' is potentially visible to the
> outside - global visibility or has had its address taken) is just broken.
>
> No ifs, buts or maybes about it. You simply cannot do that optimization,
> because there is no way for the compiler to know whether the conditional
> implies that you hold a lock or not.

I agree with the above, but I see this as a different issue -- it wasn't my intention to solve this with my proposal for acquire and release attributes.

Bart Van Assche.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
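Spelled out, the invalid transformation Linus describes looks roughly like this:

	/* Source code: */
	if (conditional)
		x++;

	/* Invalid 'optimization' (introduces an unconditional store): */
	tmp = x;
	if (conditional)
		tmp++;
	x = tmp;	/* writes x even when the condition is false,
			 * racing with other threads that legitimately
			 * own x at that moment */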
Re: Is gcc thread-unsafe?
On 10/30/07, Andrew Haley <[EMAIL PROTECTED]> wrote:
> David Schwartz writes:
>> Can we get some kind of consensus that 'optimizations' that add
>> writes to any object that the programmer might have taken the
>> address of are invalid on any platform that supports memory
>> protection?
>
> That's what the proposed standard language says, kinda-sorta. There's
> an informal description at
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html.

There is other important information in the cited text. Among other things, it is explained that register promotion of potentially shared variables can introduce data races. In other words: register promotion can introduce bugs in multithreaded software when compiled with optimization enabled. Are there any register promotion transformations implemented in gcc that can introduce data races in multithreaded software? This is very important information both for kernel developers and for developers of multithreaded userspace applications.

Another conclusion from the cited text is that, in contrast with what was stated before on the gcc mailing list, it is not required to declare thread-shared variables volatile if that thread-shared data is consistently protected by calls to locking functions.

Bart Van Assche.
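To make the register promotion hazard concrete, here is a hedged illustration in the spirit of the examples in N2338 and Boehm's paper; it is not code taken from gcc:

int x;	/* shared; writers hold a lock, this loop may run without it */

void scale(const int *factor, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (factor[i] != 0)
			x *= factor[i];	/* x is only touched conditionally */
}

/*
 * Register promotion may effectively rewrite the loop as:
 *
 *	int r = x;			// unconditional load
 *	for (i = 0; i < n; i++)
 *		if (factor[i] != 0)
 *			r *= factor[i];
 *	x = r;				// unconditional store
 *
 * If every factor[i] is zero the original code never writes x, so a
 * concurrent update of x by another thread was legal; the promoted
 * version overwrites that update with a store the programmer never
 * wrote -- a compiler-introduced data race.
 */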
Re: Is gcc thread-unsafe?
On 11/2/07, Andrew Haley <[EMAIL PROTECTED]> wrote:
> Bart Van Assche writes:
>> On 10/30/07, Andrew Haley <[EMAIL PROTECTED]> wrote:
>>> That's what the proposed standard language says, kinda-sorta. There's
>>> an informal description at
>>> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2338.html.
>>
>> There is other important information in the cited text. Among other
>> things, it is explained that register promotion of potentially shared
>> variables can introduce data races. In other words: register promotion
>> can introduce bugs in multithreaded software when compiled with
>> optimization enabled. Are there any register promotion transformations
>> implemented in gcc that can introduce data races in multithreaded
>> software?
>
> I expect so. We're going to have to audit this whole enormous code
> base to find them all and take them out.
>
> Note that some of these optimizations have been around since gcc 3.4.

Has it already been decided who will do this audit, and when this audit will happen? Has a target date been set for when this audit should be complete, or is the completion of this audit a requirement for the release of a specific gcc version? And if register promotion transformations that can introduce data races do exist in gcc, which optimization levels enable these transformations?

Bart Van Assche.
Re: Is gcc thread-unsafe?
On 11/4/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Sun, 4 Nov 2007, Bart Van Assche wrote:
>> Has it already been decided who will do this audit, and when this
>> audit will happen? Has a target date been set when this audit should
>> be complete, or is the completion of this audit a requirement for the
>> release of a specific gcc version?
>
> I am told that the gcc people realized that was indeed a bug (people were
> able to show problems even in non-threaded environments with mprotect()),
> and have now fixed it in the current gcc sources. That still leaves the
> old versions with potential problems, but I think it makes it much less
> interesting to audit for these things.
>
> Linus

What I understood from the gcc mailing list is that a patch has been applied to the gcc sources that solves the issue with speculative stores that was already discussed here on the LKML (http://gcc.gnu.org/ml/gcc/2007-10/msg00554.html). But the issue I am referring to is a different one: namely that a compiler optimization called register promotion can introduce data races. Hans J. Boehm gives a clear explanation of this -- see also paragraph 4.3 in http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf or http://portal.acm.org/citation.cfm?id=1064978.1065042.

Bart Van Assche.
Re: [PATCH] add u64 number parser
On 07/22/16 17:23, James Smart wrote:
> +	buf = kmalloc(len + 1, GFP_KERNEL);
> +	if (!buf)
> +		return -ENOMEM;
> +	memcpy(buf, s->from, len);
> +	buf[len] = '\0';

Hello James,

Have you considered combining the above kmalloc() and memcpy() calls into a single kasprintf(GFP_KERNEL, "%.*s", len, s->from) call?

Bart.
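For reference, a sketch of what the suggested replacement could look like, assuming len fits in an int (the "%.*s" precision argument must be an int):

	/* kasprintf() allocates and formats in one step; "%.*s" copies
	 * at most 'len' bytes and the result is NUL-terminated. */
	buf = kasprintf(GFP_KERNEL, "%.*s", (int)len, s->from);
	if (!buf)
		return -ENOMEM;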
Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
On 09/19/16 03:38, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
>> CC-ing linux-bl...@vger.kernel.org
>>
>> I'm not sure I see how this helps. That probably means I'm not
>> considering the right scenario. Could you elaborate on when having
>> multiple hardware queues to choose from a given CPU will provide a
>> benefit?
>
> No, I do not keep in mind any particular scenario besides common
> sense. Just an assumption deeper queues are better (in this RFC
> a virtual combined queue consisting of multiple h/w queues).
>
> Apparently, there could be positive effects only in systems where
> # of queues / # of CPUs > 1 or # of queues / # of cores > 1. But
> I do not happen to have ones. If I had numbers this would not be
> the RFC and I probably would not have posted in the first place ;)
>
> Would it be possible to give it a try on your hardware?

Hello Alexander,

It is your task to measure the performance impact of these patches, not Keith's. BTW, I'm not convinced that multiple hardware queues per CPU will result in a performance improvement. I have not yet seen any SSD for which a queue depth above 512 results in better performance than a queue depth equal to 512. Which applications do you think will generate and sustain a queue depth above 512? Additionally, my experience from another high-performance context (RDMA) is that reducing the number of queues can result in higher IOPS due to fewer interrupts per I/O.

Bart.
Re: [PATCH v3 2/3] block: convert to device_add_disk()
On 06/17/2016 06:34 PM, Dan Williams wrote:
> diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
> index ef6b4d960bad..4e7cb5df8e43 100644
> --- a/arch/um/drivers/ubd_kern.c
> +++ b/arch/um/drivers/ubd_kern.c
> @@ -823,12 +823,11 @@ static int ubd_disk_register(int major, u64 size, int unit,
>  		ubd_devs[unit].pdev.dev.release = ubd_device_release;
>  		dev_set_drvdata(&ubd_devs[unit].pdev.dev, &ubd_devs[unit]);
>  		platform_device_register(&ubd_devs[unit].pdev);
> -		disk->driverfs_dev = &ubd_devs[unit].pdev.dev;
>  	}
>
>  	disk->private_data = &ubd_devs[unit];
>  	disk->queue = ubd_devs[unit].queue;
> -	add_disk(disk);
> +	device_add_disk(&ubd_devs[unit].pdev.dev, disk);
>
>  	*disk_out = disk;
>  	return 0;

Have you reviewed this patch yourself? I think the above change is an unintended behavior change.

Bart.
Re: [PATCH v3 1/3] block: introduce device_add_disk()
On 06/17/2016 06:34 PM, Dan Williams wrote:
> In preparation for removing the ->driverfs_dev member of a gendisk, add
> an API that takes the parent device as a parameter to add_disk(). For
> now this maintains the status quo of WARN()ing on failure, but does not
> return an error code.

Reviewed-by: Bart Van Assche
Re: [PATCH 02/13] irq: Introduce IRQD_AFFINITY_MANAGED flag
On 06/20/2016 02:22 PM, Christoph Hellwig wrote:
> On Thu, Jun 16, 2016 at 05:39:07PM +0200, Bart Van Assche wrote:
>> On 06/16/2016 05:20 PM, Christoph Hellwig wrote:
>>> On Wed, Jun 15, 2016 at 09:36:54PM +0200, Bart Van Assche wrote:
>>>> My concern is that I doubt that there is an interrupt assignment
>>>> scheme that works optimally for all workloads. Hence my request to
>>>> preserve the ability to modify interrupt affinity from user space.
>>>
>>> I'd say let's do such an interface incrementally based on the use
>>> case - especially after we get networking over to use common code to
>>> distribute the interrupts.
>>>
>>> If you were doing something like this with the current blk-mq code it
>>> wouldn't work very well due to the fact that you'd have a mismatch
>>> between the assigned interrupt and the blk-mq queue mapping anyway.
>
> It might be a good idea to start brainstorming how we'd want to handle
> this change - we'd basically need a per-device notification that the
> interrupt mapping changes so that we can rebuild the queue mapping,
> which is somewhat similar to the lib/cpu_rmap.c code used by a few
> networking drivers. This would also help with dealing with cpu hotplug
> events that change the cpu mapping.

A notification mechanism that reports interrupt mapping changes will definitely help. What would also help is an API that allows drivers to query the MSI-X IRQ of an adapter that is nearest given a cpumask, e.g. hctx->cpumask. Another function can then map that IRQ into an index in the range 0..n-1 where n is the number of MSI-X interrupts for that adapter. Every blk-mq/scsi-mq driver will need this functionality to decide which IRQ to associate with a block layer hctx.

Bart.
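A hedged sketch of the kind of helper described above; all example_* names are hypothetical and this is not an existing kernel API:

/*
 * Hypothetical helper: return the index (0..msix_count-1) of the MSI-X
 * vector whose affinity mask overlaps the given cpumask, e.g. a blk-mq
 * hctx->cpumask.
 */
static int example_msix_index_for_cpumask(const struct example_adapter *ha,
					  const struct cpumask *mask)
{
	int i;

	for (i = 0; i < ha->msix_count; i++)
		if (cpumask_intersects(&ha->msix_affinity[i], mask))
			return i;
	return 0;	/* fall back to the first vector */
}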
Re: [PATCH] scsi: libfc: fix seconds_since_last_reset calculation
On 06/17/2016 05:47 PM, Arnd Bergmann wrote:
> -	jiffies_to_timespec(jiffies, &v0);
> -	jiffies_to_timespec(lport->boot_time, &v1);
> -	fc_stats->seconds_since_last_reset = (v0.tv_sec - v1.tv_sec);
> +	fc_stats->seconds_since_last_reset = (lport->boot_time - jiffies) / HZ;

Does this patch introduce a 64-bit division? There might still be 32-bit users of this code.

Bart.
Re: [PATCH] scsi: libfc: fix seconds_since_last_reset calculation
On 06/20/2016 04:48 PM, Arnd Bergmann wrote:
> On Monday, June 20, 2016 3:54:06 PM CEST Bart Van Assche wrote:
>> On 06/17/2016 05:47 PM, Arnd Bergmann wrote:
>>> -	jiffies_to_timespec(jiffies, &v0);
>>> -	jiffies_to_timespec(lport->boot_time, &v1);
>>> -	fc_stats->seconds_since_last_reset = (v0.tv_sec - v1.tv_sec);
>>> +	fc_stats->seconds_since_last_reset = (lport->boot_time - jiffies) / HZ;
>>
>> Does this patch introduce a 64-bit division? There might still be
>> 32-bit users of this code.
>
> No, both lport->boot_time and jiffies are 'unsigned long'.

Ah, you're right. Hence:

Reviewed-by: Bart Van Assche
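One more point about jiffies arithmetic that the thread does not touch on: wrap-around. Unsigned subtraction handles a single wrap correctly as long as the difference is computed as 'now - start', as in the sketch below (note that the quoted patch computes 'start - now', the reverse operand order):

	/* Wrap-safe elapsed time since the port was (re)initialized:
	 * with unsigned arithmetic, 'jiffies - start' stays correct
	 * even if the jiffies counter wrapped once in between. */
	unsigned long elapsed = jiffies - lport->boot_time;

	fc_stats->seconds_since_last_reset = elapsed / HZ;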
Re: [RFC] IB/srp: Remove create_workqueue
On 06/20/2016 08:59 PM, Tejun Heo wrote:
> On Tue, Jun 07, 2016 at 01:00:13PM -0700, Bart Van Assche wrote:
>>>> srp_remove_wq is used for SRP target port removal work only. This
>>>> work is neither queued from inside a shrinker nor by the page
>>>> writeback code so I think it is safe to drop WQ_MEM_RECLAIM.
>>>
>>> It should be able to use system_wq then.
>>
>> No. I have tried that but that resulted in a deadlock. See also
>> commit bcc059103591 for the details.
>
> So, create_workqueue() limits concurrency to 1 per cpu and if you have
> a dependency between two work items and they get scheduled on the same
> cpu they can deadlock. system_wq doesn't have that restriction and
> should be fine, AFAICS.

Agreed, as long as WQ_DFL_ACTIVE is not reduced from its current value (256) to a very low value (e.g. 1 or 2). This assumption should be documented but I'm not sure what the best way is to document this...

Bart.
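A hedged sketch of the concurrency difference Tejun describes; the example_* names are illustrative and not taken from the SRP driver:

#include <linux/workqueue.h>

/*
 * With max_active = 1, two dependent work items that land on the same
 * CPU can deadlock against each other; with max_active = 0 the queue
 * uses WQ_DFL_ACTIVE (currently 256) concurrent work items per CPU, so
 * a work item that waits for another work item does not block it.
 * Error unwinding is omitted for brevity.
 */
static struct workqueue_struct *narrow_wq;	/* concurrency 1 */
static struct workqueue_struct *wide_wq;	/* concurrency WQ_DFL_ACTIVE */

static int __init example_init(void)
{
	narrow_wq = alloc_workqueue("example_narrow", WQ_MEM_RECLAIM, 1);
	wide_wq = alloc_workqueue("example_wide", 0, 0);
	if (!narrow_wq || !wide_wq)
		return -ENOMEM;
	return 0;
}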
Re: [dm-devel] [RFC][PATCH] dm: Remove dm_bufio_cond_resched()
On 09/23/16 00:34, Peter Zijlstra wrote:
> Is anybody still using PREEMPT_NONE? Most workloads also care about
> latency to some extent. Lots of code has explicit cond_resched() and
> doesn't worry.

From a SLES11 system:

$ grep PREEMPT_NONE /boot/config-3.0.101-0.47.67-default
CONFIG_PREEMPT_NONE=y

From a SLES12 system:

$ grep CONFIG_PREEMPT_NONE /boot/config-4.4.16-56-default
CONFIG_PREEMPT_NONE=y

Bart.
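For context, the pattern Peter refers to looks like this (an illustrative sketch, not the actual dm-bufio code):

	/* Under CONFIG_PREEMPT_NONE the kernel never preempts code
	 * running in kernel mode, so long loops have to yield the CPU
	 * explicitly to keep scheduling latency bounded: */
	list_for_each_entry(b, &client->buffers, list) {
		process_buffer(b);	/* potentially long-running */
		cond_resched();		/* explicit scheduling point */
	}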
[PATCH] sched: Change READ_ONCE(jiffies) into jiffies
A quote from Documentation/memory-barriers.txt:

	For example, because 'jiffies' is marked volatile, it is never
	necessary to say READ_ONCE(jiffies).

The reason for this is that READ_ONCE() and WRITE_ONCE() are implemented as volatile casts, which have no effect when the argument is already marked volatile. Hence change all 'READ_ONCE(jiffies)' occurrences into 'jiffies'.

Signed-off-by: Bart Van Assche
Cc: Peter Zijlstra
Cc: Oleg Nesterov
---
 kernel/sched/core.c    | 2 +-
 kernel/sched/cputime.c | 4 ++--
 kernel/sched/fair.c    | 8 ++++----
 kernel/sched/wait.c    | 6 ++++--
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44817c6..edb811e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3100,7 +3100,7 @@ void scheduler_tick(void)
 u64 scheduler_tick_max_deferment(void)
 {
 	struct rq *rq = this_rq();
-	unsigned long next, now = READ_ONCE(jiffies);
+	unsigned long next, now = jiffies;
 
 	next = rq->last_sched_tick + HZ;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..9aecaad 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -691,7 +691,7 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static cputime_t vtime_delta(struct task_struct *tsk)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long now = jiffies;
 
 	if (time_before(now, (unsigned long)tsk->vtime_snap))
 		return 0;
@@ -701,7 +701,7 @@ static cputime_t vtime_delta(struct task_struct *tsk)
 
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long now = jiffies;
 	cputime_t delta, other;
 
 	/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34..db5ae54 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4802,7 +4802,7 @@ static void cpu_load_update_idle(struct rq *this_rq)
 	if (weighted_cpuload(cpu_of(this_rq)))
 		return;
 
-	cpu_load_update_nohz(this_rq, READ_ONCE(jiffies), 0);
+	cpu_load_update_nohz(this_rq, jiffies, 0);
 }
 
 /*
@@ -4828,7 +4828,7 @@ void cpu_load_update_nohz_start(void)
  */
 void cpu_load_update_nohz_stop(void)
 {
-	unsigned long curr_jiffies = READ_ONCE(jiffies);
+	unsigned long curr_jiffies = jiffies;
 	struct rq *this_rq = this_rq();
 	unsigned long load;
@@ -4851,7 +4851,7 @@ static void cpu_load_update_periodic(struct rq *this_rq, unsigned long load)
 {
 #ifdef CONFIG_NO_HZ_COMMON
 	/* See the mess around cpu_load_update_nohz(). */
-	this_rq->last_load_update_tick = READ_ONCE(jiffies);
+	this_rq->last_load_update_tick = jiffies;
 #endif
 	cpu_load_update(this_rq, load, 1);
 }
@@ -4864,7 +4864,7 @@ void cpu_load_update_active(struct rq *this_rq)
 	unsigned long load = weighted_cpuload(cpu_of(this_rq));
 
 	if (tick_nohz_tick_stopped())
-		cpu_load_update_nohz(this_rq, READ_ONCE(jiffies), load);
+		cpu_load_update_nohz(this_rq, jiffies, load);
 	else
 		cpu_load_update_periodic(this_rq, load);
 }
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f15d6b6..af4959d 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -601,7 +601,8 @@ EXPORT_SYMBOL(bit_wait_io);
 
 __sched int bit_wait_timeout(struct wait_bit_key *word, int mode)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long now = jiffies;
+
 	if (time_after_eq(now, word->timeout))
 		return -EAGAIN;
 	schedule_timeout(word->timeout - now);
@@ -613,7 +614,8 @@ EXPORT_SYMBOL_GPL(bit_wait_timeout);
 
 __sched int bit_wait_io_timeout(struct wait_bit_key *word, int mode)
 {
-	unsigned long now = READ_ONCE(jiffies);
+	unsigned long now = jiffies;
+
 	if (time_after_eq(now, word->timeout))
 		return -EAGAIN;
 	io_schedule_timeout(word->timeout - now);
-- 
2.10.0
Re: [PATCH] sched: Change READ_ONCE(jiffies) into jiffies
On 09/26/16 01:55, Peter Zijlstra wrote:
> On Sun, Sep 25, 2016 at 07:08:45PM -0700, Bart Van Assche wrote:
>> A quote from Documentation/memory-barriers.txt:
>>
>>	For example, because 'jiffies' is marked volatile, it is never
>>	necessary to say READ_ONCE(jiffies).
>>
>> The reason for this is that READ_ONCE() and WRITE_ONCE() are
>> implemented as volatile casts, which have no effect when the argument
>> is already marked volatile. Hence change all 'READ_ONCE(jiffies)'
>> occurrences into 'jiffies'.
>
> So I'm not a huge fan of this patch. Yes it's technically pointless,
> but it does convey intent. See the READ_ONCE() as a comment if you
> will.

Hello Peter,

Are you aware that the scheduler code is the only code that uses READ_ONCE() to read 'jiffies', and that even in the scheduler not all 'jiffies' reads use READ_ONCE()? This patch makes the scheduler code more consistent and easier to read without affecting the generated code.

Bart.
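The "volatile cast" wording can be made concrete with a simplified sketch of READ_ONCE(); the real definition in <linux/compiler.h> is more elaborate:

/* One-word essence of READ_ONCE(): force a volatile access. */
#define SKETCH_READ_ONCE(x) (*(const volatile typeof(x) *)&(x))

extern volatile unsigned long jiffies;

void example(void)
{
	unsigned long a = SKETCH_READ_ONCE(jiffies); /* volatile load */
	unsigned long b = jiffies;	/* also a volatile load, since
					 * jiffies is already declared
					 * volatile: the cast adds nothing */
	(void)a; (void)b;
}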
Re: [PATCH 1/2] wd719x: Remove last declaration using DEFINE_PCI_DEVICE_TABLE
On 08/31/16 09:19, Joe Perches wrote: > Convert it to the preferred const struct pci_device_id instead. Reviewed-by: Bart Van Assche
Re: Block-level access
On 09/01/2016 02:48 PM, Alex Austin wrote:
> CC'ing linux-scsi and linux-block. Also, please CC me in replies.
>
> On Thu, Sep 1, 2016 at 4:43 PM, Alex Austin wrote:
>> Hello,
>>
>> What is the most performant way to directly interface with an attached
>> hard drive? I've so far used read()/write() on /dev/sd_ but I find
>> error handling exceedingly difficult, as I don't always even get
>> errors reported, even if the open() call includes O_DIRECT. I've also
>> used ioctl(SG_IO), but find that it's extremely slow due to the lack
>> of queuing support in the API. Is there a mid-level API that will get
>> me decent error handling while allowing command queuing, or do I just
>> need to make multiple threads all doing separate SG_IO ioctls?

The most efficient way to interface with a hard drive depends on the I/O pattern. Are you aware that buffered I/O performs read-ahead? Have you already tried to tune the read-ahead setting (blockdev --getra / blockdev --setra)? BTW, if you need an example of how to queue SG_IO, you are welcome to have a look at the fio source code. You will probably be interested in the bsg API. See e.g. https://lwn.net/Articles/174469/.

Bart.
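To make the SG_IO part concrete, here is a minimal synchronous example -- a sketch with most error handling omitted, not production code; fd must refer to an sg or block device opened by the caller. For queued submission the bsg interface mentioned above is a better starting point:

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Send TEST UNIT READY via SG_IO; returns the SCSI status byte or -1. */
int test_unit_ready(int fd)
{
	unsigned char cdb[6] = { 0x00 };	/* TEST UNIT READY */
	unsigned char sense[32];
	struct sg_io_hdr hdr;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.cmd_len = sizeof(cdb);
	hdr.cmdp = cdb;
	hdr.dxfer_direction = SG_DXFER_NONE;	/* no data transfer */
	hdr.sbp = sense;			/* sense buffer */
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 5000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &hdr) < 0)
		return -1;
	return hdr.status;
}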
Re: Observing Softlockup's while running heavy IOs
On 09/01/2016 03:31 AM, Sreekanth Reddy wrote:
> I reduced the ISR workload by one third in order to reduce the time
> that is spent per CPU in interrupt context; even then I am observing
> softlockups. As I mentioned before, only the same single CPU in the set
> of CPUs (enabled in affinity_hint) is busy with handling the interrupts
> from the corresponding IRQx.
>
> I have done the below experiment in the driver to limit these
> softlockups/hardlockups, but I am not sure whether it is reasonable to
> do this in the driver.
>
> Experiment: if CPUx is continuously busy in the same ISR context with
> handling the I/O work of the remote CPUs (enabled in the corresponding
> IRQ's affinity_hint), amounting to 1/4th of the HBA queue depth, then
> enable a flag called 'change_smp_affinity' for this IRQ. Also create a
> thread which polls this flag for every IRQ (enabled by the driver)
> every second. If this thread sees that the flag is enabled for any IRQ,
> it writes the next CPU number from the CPUs enabled in the IRQ's
> affinity_hint to the IRQ's smp_affinity procfs attribute using the
> call_usermodehelper() API. This is to make sure that interrupts are not
> processed by the same single CPU all the time, and to make the other
> CPUs handle the interrupts if the current CPU is continuously busy with
> handling the other CPUs' I/O interrupts.
>
> For example, consider a system which has 8 logical CPUs and one MSI-X
> vector enabled (called IRQ 120) in the driver, and an HBA queue depth
> of 8K. Then the IRQ's procfs attributes will be:
>
>     IRQ# 120, affinity_hint=0xff, smp_affinity=0x00
>
> After starting heavy I/O, we will observe that only CPU0 is busy with
> handling the interrupts. This experimental driver changes smp_affinity
> to the next CPU number, i.e. 0x01 (using the command 'echo 0x01 >
> /proc/irq/120/smp_affinity', issued through the call_usermodehelper()
> API), if it observes that CPU0 is continuously processing more than 2K
> of I/O replies for the other CPUs, i.e. for CPU1 to CPU7.
>
> Is it OK to do this kind of thing in the driver?

Hello Sreekanth,

To me this sounds like something that should be implemented in the I/O chipset on the motherboard. If you have a look at the Intel Software Developer Manuals you will see that logical destination mode supports round-robin interrupt delivery. However, the Linux kernel selects physical destination mode on systems with more than eight logical CPUs (see also arch/x86/kernel/apic/apic_flat_64.c). I'm not sure the maintainers of the interrupt subsystem would welcome code that emulates round-robin interrupt delivery. So your best option is probably to minimize the amount of work that is done in interrupt context and to move as much work as possible out of interrupt context, in such a way that it can be spread over multiple CPU cores, e.g. by using queue_work_on().

Bart.
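A hedged sketch of the queue_work_on() approach; all example_* names are illustrative and not taken from any existing driver:

struct example_reply_batch {
	struct work_struct work;	/* set up with INIT_WORK() */
	/* ... reply descriptors claimed by the ISR ... */
};

static void example_complete_replies(struct work_struct *work)
{
	struct example_reply_batch *rb =
		container_of(work, struct example_reply_batch, work);

	/* Complete the I/O requests described by rb here, in process
	 * context, instead of in the hard interrupt handler. */
}

static irqreturn_t example_isr(int irq, void *arg)
{
	/* Hypothetical: detach the pending replies from the hardware
	 * reply queue so they can be processed later. */
	struct example_reply_batch *rb = example_claim_replies(arg);

	if (!rb)
		return IRQ_NONE;

	/* Run the completion work on the CPU that submitted the I/O
	 * (or according to some other policy) so that a single CPU is
	 * not stuck with all of the completion processing. */
	queue_work_on(example_target_cpu(rb), system_wq, &rb->work);
	return IRQ_HANDLED;
}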
Re: [PATCH] trivial treewide: Convert dev_set_uevent_suppress argument to bool
On 09/01/16 13:11, Joe Perches wrote:
> Assigning an int to a bitfield:1 can lose precision.
> Change the caller argument uses from 1/0 to true/false.

Hello Joe,

Can you clarify how assigning 0 or 1 to a one-bit bitfield can cause a loss of precision?

Thanks,

Bart.
Re: [PATCH] trivial treewide: Convert dev_set_uevent_suppress argument to bool
On 09/01/16 17:51, Joe Perches wrote:
> On Fri, 2016-09-02 at 00:47 +0000, Bart Van Assche wrote:
>> On 09/01/16 13:11, Joe Perches wrote:
>>> Assigning an int to a bitfield:1 can lose precision.
>>> Change the caller argument uses from 1/0 to true/false.
>>
>> Hello Joe,
>
> Hi Bart.
>
>> Can you clarify how assigning 0 or 1 to a one-bit bitfield can cause a
>> loss of precision?
>
> There are no existing defects.
>
> Using 1/0 is not a loss of precision, it's just
> changing to use bool avoids potential errors and
> promotes consistency.
>
> Other uses of this function already use true/false.

Hello Joe,

In the patch description you refer to loss of precision. However, your patch does not address any loss of precision issues. So I think that the patch description is misleading and could be made more clear.

Bart.
Re: [PATCH] iscsi-target: fix spelling mistake "Unsolicitied" -> "Unsolicited"
On 09/02/2016 07:32 AM, Colin King wrote: Trivial fix to spelling mistakes in pr_debug message and comments Reviewed-by: Bart Van Assche
Re: [PATCH] trivial treewide: Convert dev_set_uevent_suppress argument to bool
On 09/02/2016 08:41 AM, Joe Perches wrote:
> On Fri, 2016-09-02 at 13:41 +0000, Bart Van Assche wrote:
>> On 09/01/16 17:51, Joe Perches wrote:
>>> On Fri, 2016-09-02 at 00:47 +0000, Bart Van Assche wrote:
>>>> On 09/01/16 13:11, Joe Perches wrote:
>>>>> Assigning an int to a bitfield:1 can lose precision.
>>>>> Change the caller argument uses from 1/0 to true/false.
>>>>
>>>> Can you clarify how assigning 0 or 1 to a one-bit bitfield can
>>>> cause a loss of precision?
>>>
>>> There are no existing defects.
>>>
>>> Using 1/0 is not a loss of precision, it's just
>>> changing to use bool avoids potential errors and
>>> promotes consistency.
>>>
>>> Other uses of this function already use true/false.
>>
>> In the patch description you refer to loss of precision. However,
>> your patch does not address any loss of precision issues. So I think
>> that the patch description is misleading and could be made more clear.
>
> I tend towards terse being better than verbose.
> The original patch description says "no change to objects".
> What would you suggest?

Hello Joe,

How about the following:

	dev_set_uevent_suppress() expects a boolean as second argument.
	Make this clear by passing true/false instead of 1/0 as the
	second argument.

Bart.
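For what it's worth, the one case where a one-bit bitfield really does behave surprisingly is the signed one, as this small userspace illustration shows; it is a generic sketch, unrelated to the field behind dev_set_uevent_suppress():

#include <stdbool.h>
#include <stdio.h>

struct flags {
	int s : 1;		/* signed: can only hold 0 and -1 */
	unsigned int u : 1;	/* unsigned: holds 0 and 1 */
	bool b : 1;		/* bool: holds false and true */
};

int main(void)
{
	struct flags f = { .s = 1, .u = 1, .b = true };

	/* On typical ABIs this prints "-1 1 1": the value 1 does not
	 * survive the round trip through the signed one-bit field. */
	printf("%d %u %d\n", f.s, f.u, f.b);
	return 0;
}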
Re: [PATCH] sched: Avoid that __wait_on_bit_lock() hangs
On 08/04/2016 07:09 AM, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 02:51:23PM -0700, Bart Van Assche wrote:
>> So I started testing the patch below that should fix the same hang but
>> without triggering any wait list corruption.
>>
>> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>> index f15d6b6..4e3f651 100644
>> --- a/kernel/sched/wait.c
>> +++ b/kernel/sched/wait.c
>> @@ -282,7 +282,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
>>  	spin_lock_irqsave(&q->lock, flags);
>>  	if (!list_empty(&wait->task_list))
>>  		list_del_init(&wait->task_list);
>> -	else if (waitqueue_active(q))
>> +	if (waitqueue_active(q))
>>  		__wake_up_locked_key(q, mode, key);
>>  	spin_unlock_irqrestore(&q->lock, flags);
>>  }
>
> So the problem with this patch is that it will violate the nr_exclusive
> semantics in that it can result in too many wakeups -- which is a much
> less severe (typically harmless) issue.
>
> We now always wake up the next waiter, even if there wasn't an actual
> wakeup we raced against. And if we then also get a wakeup, we can end
> up with 2 woken tasks (instead of the nr_exclusive=1).
>
> Now, since wait loops must all deal with spurious wakeups, this ends up
> as harmless overhead.

How about adding a fifth argument to abort_exclusive_wait() that indicates whether or not the "if (waitqueue_active(q)) __wake_up_locked_key(q, mode, key)" code should be executed? __wait_event() could pass "condition" as the fifth argument when calling abort_exclusive_wait().

> But I'd still like to understand where we lose the wakeup.

My assumption is that __wake_up_common() and signal delivery happen concurrently, that __wake_up_common() wakes up bit_wait_io() and that signal delivery happens after bit_wait_io() has been woken up but before it tests the signal pending state.

Bart.
[PATCH v2 0/3] sched: Avoid that __wait_on_bit_lock() hangs
Hello Ingo,

This patch series addresses an issue that my SRP driver test software can trigger reproducibly, namely that __wait_on_bit_lock() hangs. Please review and apply these patches whenever this is convenient for you.

The changes compared to the first version of this series are:
* Reworked the abort_exclusive_wait() changes.
* Added two patches, namely a patch that introduces a local variable and a patch (nr. 3) that eliminates again the spurious wakeups introduced by patch 1.

These patches have been tested on top of kernel v4.7 with kernel debugging enabled (detection of list corruption, lockdep, SLUB debugging, kmemleak, ...). See also https://lkml.org/lkml/2016/8/3/289.

Thanks,

Bart.
[PATCH v2 2/3] sched: Introduce a local variable in abort_exclusive_wait()
This patch does not change any functionality.

Signed-off-by: Bart Van Assche
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Oleg Nesterov
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Neil Brown
Cc: Michael Shaver
---
 kernel/sched/wait.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index fa12939..dcc7693 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -429,18 +429,20 @@ int __sched
 __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 			wait_bit_action_f *action, unsigned mode)
 {
+	struct wait_bit_key *const key = &q->key;
+
 	do {
 		int ret;
 
 		prepare_to_wait_exclusive(wq, &q->wait, mode);
-		if (!test_bit(q->key.bit_nr, q->key.flags))
+		if (!test_bit(key->bit_nr, key->flags))
 			continue;
-		ret = action(&q->key, mode);
+		ret = action(key, mode);
 		if (!ret)
 			continue;
-		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
+		abort_exclusive_wait(wq, &q->wait, mode, key);
 		return ret;
-	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
+	} while (test_and_set_bit(key->bit_nr, key->flags));
 	finish_wait(wq, &q->wait);
 	return 0;
 }
-- 
2.9.2
[PATCH v2 1/3] sched: Avoid that __wait_on_bit_lock() hangs
If delivery of a signal and __wake_up_common() happen concurrently it is possible that the signal is delivered after __wake_up_common() woke up the affected task and before bit_wait_io() checks whether a signal is pending. Avoid that the next waiter is not woken up if this happens.

This patch fixes the following hang:

INFO: task systemd-udevd:10111 blocked for more than 480 seconds.
      Not tainted 4.7.0-dbg+ #1
Call Trace:
 [] schedule+0x37/0x90
 [] schedule_timeout+0x27f/0x470
 [] io_schedule_timeout+0x9f/0x110
 [] bit_wait_io+0x16/0x60
 [] __wait_on_bit_lock+0x49/0xa0
 [] __lock_page+0xb9/0xc0
 [] truncate_inode_pages_range+0x3e0/0x760
 [] truncate_inode_pages+0x10/0x20
 [] kill_bdev+0x30/0x40
 [] __blkdev_put+0x71/0x360
 [] blkdev_put+0x49/0x170
 [] blkdev_close+0x20/0x30
 [] __fput+0xe8/0x1f0
 [] fput+0x9/0x10
 [] task_work_run+0x83/0xb0
 [] do_exit+0x3ee/0xc40
 [] do_group_exit+0x4b/0xc0
 [] get_signal+0x2ca/0x940
 [] do_signal+0x23/0x660
 [] exit_to_usermode_loop+0x73/0xb0
 [] syscall_return_slowpath+0xb0/0xc0
 [] entry_SYSCALL_64_fastpath+0xa6/0xa8

Fixes: 777c6c5f1f6e ("wait: prevent exclusive waiter starvation")
References: https://lkml.org/lkml/2012/8/24/185
References: http://www.spinics.net/lists/raid/msg53056.html
Signed-off-by: Bart Van Assche
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Oleg Nesterov
Cc: Andrew Morton
Cc: Johannes Weiner
Cc: Neil Brown
Cc: Michael Shaver
Cc: # v2.6.29+
---
 kernel/sched/wait.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f15d6b6..fa12939 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -266,12 +266,16 @@ EXPORT_SYMBOL(finish_wait);
  * the wait descriptor from the given waitqueue if still
  * queued.
  *
- * Wakes up the next waiter if the caller is concurrently
- * woken up through the queue.
- *
- * This prevents waiter starvation where an exclusive waiter
- * aborts and is woken up concurrently and no one wakes up
- * the next waiter.
+ * Wakes up the next waiter to prevent waiter starvation
+ * when an exclusive waiter aborts and is woken up
+ * concurrently and no one wakes up the next waiter. Note:
+ * even when a signal is pending it is possible that
+ * __wake_up_common() wakes up the current thread and hence
+ * that @wait has been removed from the wait queue @q. Hence
+ * test whether there are more waiters on the wait queue
+ * even if @wait is not on the wait queue @q. This approach
+ * will cause a spurious wakeup if a signal is delivered and
+ * no other thread calls __wake_up_common() concurrently.
  */
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 			unsigned int mode, void *key)
@@ -282,7 +286,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 	spin_lock_irqsave(&q->lock, flags);
 	if (!list_empty(&wait->task_list))
 		list_del_init(&wait->task_list);
-	else if (waitqueue_active(q))
+	if (waitqueue_active(q))
 		__wake_up_locked_key(q, mode, key);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
-- 
2.9.2
[PATCH v2 3/3] sched: Avoid that abort_exclusive_wait() triggers spurious wakeups
Patch "sched: Avoid that __wait_on_bit_lock() hangs" made spurious wakeups possible. Avoid to trigger such spurious wakeups. Signed-off-by: Bart Van Assche Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Oleg Nesterov Cc: Andrew Morton Cc: Johannes Weiner Cc: Neil Brown Cc: Michael Shaver --- include/linux/wait.h | 5 +++-- kernel/sched/wait.c | 21 - 2 files changed, 11 insertions(+), 15 deletions(-) diff --git a/include/linux/wait.h b/include/linux/wait.h index 27d7a0a..5034fce 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -282,7 +282,7 @@ wait_queue_head_t *bit_waitqueue(void *, int); __ret = __int; \ if (exclusive) {\ abort_exclusive_wait(&wq, &__wait, \ -state, NULL); \ + condition, state, NULL);\ goto __out; \ } \ break; \ @@ -976,7 +976,8 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state); void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state); long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state); void finish_wait(wait_queue_head_t *q, wait_queue_t *wait); -void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key); +void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, + bool wake_up_next, unsigned int mode, void *key); long wait_woken(wait_queue_t *wait, unsigned mode, long timeout); int woken_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key); int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key); diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index dcc7693..c5a913f 100644 --- a/kernel/sched/wait.c +++ b/kernel/sched/wait.c @@ -259,6 +259,7 @@ EXPORT_SYMBOL(finish_wait); * abort_exclusive_wait - abort exclusive waiting in a queue * @q: waitqueue waited on * @wait: wait descriptor + * @wake_up_next: Whether or not to wake up another exclusive waiter * @mode: runstate of the waiter to be woken * @key: key to identify a wait bit queue or %NULL * @@ -266,19 +267,12 @@ EXPORT_SYMBOL(finish_wait); * the wait descriptor from the given waitqueue if still * queued. * - * Wakes up the next waiter to prevent waiter starvation - * when an exclusive waiter aborts and is woken up - * concurrently and no one wakes up the next waiter. Note: - * even when a signal is pending it is possible that - * __wake_up_common() wakes up the current thread and hence - * that @wait has been removed from the wait queue @q. Hence - * test whether there are more waiters on the wait queue - * even if @wait is not on the wait queue @q. This approach - * will cause a spurious wakeup if a signal is delivered and - * no other thread calls __wake_up_common() concurrently. + * Wakes up the next waiter if @wake_up_next is set to prevent + * waiter starvation when an exclusive waiter aborts and is + * woken up concurrently and no one wakes up the next waiter. 
*/ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, - unsigned int mode, void *key) + bool wake_up_next, unsigned int mode, void *key) { unsigned long flags; @@ -286,7 +280,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, spin_lock_irqsave(&q->lock, flags); if (!list_empty(&wait->task_list)) list_del_init(&wait->task_list); - if (waitqueue_active(q)) + if (wake_up_next && waitqueue_active(q)) __wake_up_locked_key(q, mode, key); spin_unlock_irqrestore(&q->lock, flags); } @@ -440,7 +434,8 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q, ret = action(key, mode); if (!ret) continue; - abort_exclusive_wait(wq, &q->wait, mode, key); + abort_exclusive_wait(wq, &q->wait, + test_bit(key->bit_nr, key->flags), mode, key); return ret; } while (test_and_set_bit(key->bit_nr, key->flags)); finish_wait(wq, &q->wait); -- 2.9.2
Re: [PATCH v2 1/3] sched: Avoid that __wait_on_bit_lock() hangs
On 08/05/16 16:09, Bart Van Assche wrote:
> If delivery of a signal and __wake_up_common() happen concurrently it
> is possible that the signal is delivered after __wake_up_common() woke
> up the affected task and before bit_wait_io() checks whether a signal
> is pending. Avoid that the next waiter is not woken up if this happens.

(replying to my own e-mail)

Although this patch works reliably in my tests, the above description does not explain the hang. I will resend this patch with a better description.

Bart.