[PATCH V2 1/2] scsi_debug: unify scsi_level in proc and sysfs
From: Sha Zhengju There are several interfaces that show the scsi_level value of scsi_debug, but they are not consistent, e.g.: 1) #cat /sys/bus/pseudo/drivers/scsi_debug/scsi_level 5 2) #cat /proc/scsi/scsi_debug/7 scsi_debug adapter driver, version 1.82 [20100324] num_tgts=1, shared (ram) size=1024 MB, opts=0x0, every_nth=0(curr:0) delay=1, max_luns=1, scsi_level=5 sector_size=512 bytes, cylinders=130, heads=255, sectors=63 number of aborts=0, device_reset=0, bus_resets=0, host_resets=0 dix_reads=0 dix_writes=0 dif_errors=0 3) #cat /sys/bus/scsi/devices/7\:0\:0\:0/scsi_level (or lsscsi -l) 6 According to the description in scsi.h, "struct scsi_device::scsi_level values. For SCSI devices other than those prior to SCSI-2 (i.e. over 12 years old) this value is (resp[2] + 1) where 'resp' is a byte array of the response to an INQUIRY". For scsi_debug, the resp[2] of INQUIRY is 5 which indicates using SPC-3, but the sysfs scsi_level should show 6. So introduce a new scsi_debug_inquiry_vers to only represent the version value, and correct the scsi_debug_scsi_level to be resp[2] + 1. Signed-off-by: Sha Zhengju --- v2: - introduce a new variable to represent resp[2] of INQUIRY. [Doug Gilbert] drivers/scsi/scsi_debug.c |9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c index 80b8b10..210bd10 100644 --- a/drivers/scsi/scsi_debug.c +++ b/drivers/scsi/scsi_debug.c @@ -110,7 +110,7 @@ static const char * scsi_debug_version_date = "20100324"; #define DEF_PHYSBLK_EXP 0 #define DEF_PTYPE 0 #define DEF_REMOVABLE false -#define DEF_SCSI_LEVEL 5/* INQUIRY, byte2 [5->SPC-3] */ +#define DEF_INQUIRY_VERS 5/* INQUIRY VERSION, byte2 [5->SPC-3] */ #define DEF_SECTOR_SIZE 512 #define DEF_UNMAP_ALIGNMENT 0 #define DEF_UNMAP_GRANULARITY 1 @@ -181,7 +181,8 @@ static int scsi_debug_opt_blks = DEF_OPT_BLKS; static int scsi_debug_opts = DEF_OPTS; static int scsi_debug_physblk_exp = DEF_PHYSBLK_EXP; static int scsi_debug_ptype = DEF_PTYPE; /* SCSI peripheral type (0==disk) */ -static int scsi_debug_scsi_level = DEF_SCSI_LEVEL; +static int scsi_debug_inquiry_vers = DEF_INQUIRY_VERS; +static int scsi_debug_scsi_level = DEF_INQUIRY_VERS + 1; /* scsi_level is resp[2]+1 */ static int scsi_debug_sector_size = DEF_SECTOR_SIZE; static int scsi_debug_virtual_gb = DEF_VIRTUAL_GB; static int scsi_debug_vpd_use_hostno = DEF_VPD_USE_HOSTNO; @@ -927,7 +928,7 @@ static int resp_inquiry(struct scsi_cmnd * scp, int target, } /* drops through here for a standard inquiry */ arr[1] = scsi_debug_removable ? 0x80 : 0; /* Removable disk */ - arr[2] = scsi_debug_scsi_level; + arr[2] = scsi_debug_inquiry_vers; arr[3] = 2;/* response_data_format==2 */ arr[4] = SDEBUG_LONG_INQ_SZ - 5; arr[5] = scsi_debug_dif ? 1 : 0; /* PROTECT bit */ @@ -2811,7 +2812,7 @@ MODULE_PARM_DESC(opts, "1->noise, 2->medium_err, 4->timeout, 8->recovered_err...
MODULE_PARM_DESC(physblk_exp, "physical block exponent (def=0)"); MODULE_PARM_DESC(ptype, "SCSI peripheral type(def=0[disk])"); MODULE_PARM_DESC(removable, "claim to have removable media (def=0)"); -MODULE_PARM_DESC(scsi_level, "SCSI level to simulate(def=5[SPC-3])"); +MODULE_PARM_DESC(scsi_level, "SCSI level to simulate(def=6[SPC-3])"); MODULE_PARM_DESC(sector_size, "logical block size in bytes (def=512)"); MODULE_PARM_DESC(unmap_alignment, "lowest aligned thin provisioning lba (def=0)"); MODULE_PARM_DESC(unmap_granularity, "thin provisioning granularity in blocks (def=1)"); -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
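For readers following the arithmetic in the commit message above, a small worked example may help. The constant values below are the scsi_level definitions described in scsi.h (worth double-checking against the tree in use); the helper function is purely illustrative and not part of either patch.

/*
 * Illustrative only: scsi_level values as documented in include/scsi/scsi.h.
 * For SCSI-2 and later devices, scsi_level == (INQUIRY byte 2) + 1.
 *
 *   SCSI_UNKNOWN 0, SCSI_1 1, SCSI_1_CCS 2, SCSI_2 3,
 *   SCSI_3 4 (SPC), SCSI_SPC_2 5, SCSI_SPC_3 6
 */
static int scsi_level_from_inquiry(const unsigned char *inq)
{
	unsigned char version = inq[2] & 0x07;	/* scsi_debug reports 5 (SPC-3) */

	return version + 1;			/* 5 + 1 == 6 == SCSI_SPC_3 */
}

Before the patch, the driver's own interfaces printed the raw VERSION byte (5) while the SCSI midlayer's sysfs attribute showed 6; after it, both report 6.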
[PATCH 2/2] scsi_scan: introduce a new variable to represent INQUIRY VERSION field
From: Sha Zhengju As per Doug Gilbert's suggestion, introduce a new variable to represent the resp[2] of standard INQUIRY data. For different scsi devices, the Linux scsi_level is computed based on the resp[2] value. This will make the code easier to understand. Signed-off-by: Sha Zhengju --- drivers/scsi/scsi_scan.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index 307a811..421d162 100644 --- a/drivers/scsi/scsi_scan.c +++ b/drivers/scsi/scsi_scan.c @@ -546,6 +546,7 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result, int response_len = 0; int pass, count, result; struct scsi_sense_hdr sshdr; + char inquiry_version; *bflags = 0; @@ -705,10 +706,16 @@ static int scsi_probe_lun(struct scsi_device *sdev, unsigned char *inq_result, * device is attached at LUN 0 (SCSI_SCAN_TARGET_PRESENT) so * non-zero LUNs can be scanned. */ - sdev->scsi_level = inq_result[2] & 0x07; - if (sdev->scsi_level >= 2 || - (sdev->scsi_level == 1 && (inq_result[3] & 0x0f) == 1)) - sdev->scsi_level++; + /* For SCSI devices that equal or after SCSI_2, or is SCSI_1_CCS, +* its scsi_level is (inquiry_version + 1). +*/ + inquiry_version = inq_result[2] & 0x07; + if (inquiry_version >= 2 || + (inquiry_version == 1 && (inq_result[3] & 0x0f) == 1)) + sdev->scsi_level = inquiry_version + 1; + else + sdev->scsi_level = inquiry_version; /* SCSI_1 */ + sdev->sdev_target->scsi_level = sdev->scsi_level; return 0; -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] bfa: set correct command return code
> -----Original Message----- > From: Hannes Reinecke [mailto:h...@suse.de] > Sent: Tuesday, January 14, 2014 2:56 PM > To: James Bottomley > Cc: linux-scsi@vger.kernel.org; Vijaya Mohan Guvva; Hannes Reinecke > Subject: [PATCH] bfa: set correct command return code > > For various error conditions the bfa driver just returns 'DID_ERROR', which > carries no information at all about the actual source of error. > This patch updates the error handling to return a correct error code, > depending on the type of error that occurred. Thanks Hannes, Changes look good to me. Acked-by: Vijaya Mohan Guvva -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]
From: walt > On 01/21/2014 01:51 AM, David Laight wrote: > > From: Sarah Sharp > >> On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote: > > ... > >>> A guess... > >>> > >>> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead > >>> of xhci_td_remainder(). > >> > David, I tried the one-liner below, which changed nothing AFAICS, but > then I'm not sure it's the change you intended: ... > /* Set the TRB length, TD size, and interrupter fields. */ > - if (xhci->hci_version < 0x100) { > + if (xhci->hci_version > 0x100) { > remainder = xhci_td_remainder( > urb->transfer_buffer_length - > running_total); So my wild guess wasn't right. Can't win them all. David
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > One topic that has been lurking forever at the edges is the current > 4k limitation for file system block sizes. Some devices in > production today and others coming soon have larger sectors and it > would be interesting to see if it is time to poke at this topic > again. > Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither were never merged for those of us that were not around at the time and who may not have the chance to dive through mailing list archives between now and March. FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 69201] New: qla2xxx: Low-latency storage triggers lock contention
https://bugzilla.kernel.org/show_bug.cgi?id=69201 Bug ID: 69201 Summary: qla2xxx: Low-latency storage triggers lock contention Product: SCSI Drivers Version: 2.5 Kernel Version: 3.12.7 Hardware: All OS: Linux Tree: Mainline Status: NEW Severity: enhancement Priority: P1 Component: QLOGIC QLA2XXX Assignee: scsi_drivers-qla2...@kernel-bugs.osdl.org Reporter: bvanass...@acm.org Regression: No Running a fio test on an initiator system with an 8 Gb/s QLogic FC adapter revealed a bottleneck in the qla2xxx initiator driver - lock contention on ha->hardware_lock. The test that revealed this is as follows: - On a target system with 4 CPU threads (Intel i5), an 8 Gb/s QLogic FC HBA and kernel 3.12.7, download the SCST trunk r5194, build it in release mode, load the brd kernel module and configure SCST such that it exports /dev/ram[0123] via the vdisk_blockio driver. Set the vdisk_blockio parameter threads_num to 2. Export these four RAM disks as LUNs 0..3. - On an initiator system with 12 CPU threads (Intel Core i7 with hyperthreading enabled), an 8 Gb/s QLogic HBA and kernel 3.12.7, run the following fio job (where /dev/sd[cdef] corresponds to the SCST LUNs): fio --bs=4K --ioengine=libaio --rw=randrw --buffered=0 --numjobs=12 \ --iodepth=16 --iodepth_batch=8 --iodepth_batch_complete=8 \ --thread --loops=$((2**31)) --runtime=60 --group_reporting\ --gtod_reduce=1 --invalidate=1\ $(for d in /dev/sd[cdef]; do echo --name=$d --filename=$d; done) - While this fio job is running, run the following commands: perf record -ag sleep 10 perf report --stdio >perf-report-fc.txt The perf report shows that quite some time is spent in the spin_lock_irqsave() call invoked from qla24xx_dif_start_scsi(). Does this mean that this test revealed lock contention on ha->hardware_lock ? -- You are receiving this mail because: You are watching the assignee of the bug. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
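For readers unfamiliar with the driver, the pattern the bug report points at looks roughly like the sketch below. This is not the actual qla2xxx code (the structure and helper names here are made up for illustration); it only shows why a single host-wide lock taken on every submission becomes the hot spot once the backing store is fast enough.

#include <linux/spinlock.h>

/* Hypothetical stand-in for qla_hw_data: the real driver keeps one
 * ha->hardware_lock that request/response ring accesses take with
 * spin_lock_irqsave().
 */
struct sketch_hw_data {
	spinlock_t hardware_lock;
	/* request ring, doorbell registers, ... */
};

static void queue_io_sketch(struct sketch_hw_data *ha)
{
	unsigned long flags;

	/*
	 * Every I/O from every CPU serializes here.  With four RAM-disk
	 * backed LUNs and 12 submitting threads, time spent waiting for
	 * this lock can dominate the profile instead of the storage itself.
	 */
	spin_lock_irqsave(&ha->hardware_lock, flags);
	/* build IOCB entries on the shared request ring, ring the doorbell */
	spin_unlock_irqrestore(&ha->hardware_lock, flags);
}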
[Bug 69201] qla2xxx: Low-latency storage triggers lock contention
https://bugzilla.kernel.org/show_bug.cgi?id=69201 --- Comment #1 from Bart Van Assche --- Created attachment 123011 --> https://bugzilla.kernel.org/attachment.cgi?id=123011&action=edit perf report --stdio output -- You are receiving this mail because: You are watching the assignee of the bug. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: linux rdma 3.14 merge plans
On 1/22/2014 2:43 AM, Roland Dreier wrote: On Tue, Jan 21, 2014 at 2:00 PM, Or Gerlitz wrote: Roland, ping! the signature patches were posted > three months ago. We deserve a response from the maintainer that goes beyond "I need to think on that". Responsiveness was stated by Linus to be the #1 requirement from kernel maintainers. Hi Roland, I'll try to respond here. removing LKML and adding Linux-scsi. Or, I'm not sure what response you're after from me. Linus has also said that maintainers should say "no" a lot more (http://lwn.net/Articles/571995/) so maybe you want me to say, "No, I won't merge this patch set, since it adds a bunch of complexity to support a feature no one really cares about." 1. I disagree that no one cares about DIF/DIX. We are witnessing growing interest in this, especially for RDMA. 2. We put a lot of effort into avoiding complexity here and making the plug-in as simple as possible. An application that chooses to use DIF will implement only 3 steps: a. allocate a signature-enabled MR. b. register the signature-enabled MR with DIF attributes (via post_send) and then do RDMA. c. check the MR status after the transaction is completed (a _lightweight_ verb that can be called from interrupt context). Is that it? (And yes I am skeptical about this stuff — I work at an enterprise storage company and even here it's hard to find anyone who cares about DIF/DIX, especially offload features that stop it from being end-to-end) 1. RDMA verbs are _NOT_ stopping DIF from being end-to-end. The OS (or SCSI in our specific case) passes the LLD 2 scatterlists: data {block1, block2, block3,...} and protection {DIF1, DIF2, DIF3}. The LLD is required to verify the data integrity (block guards) and to interleave over the wire {block1, DIF1, block2, DIF2}. You must support that in HW; would you rather iSER/SRP use giant copies to interleave by themselves, or have iSER/SRP compute the CRC for each data block in case the OS asked the LLD to INSERT DIF? RDMA storage ULPs are transports - they should have no business with data processing. 2. HW DIF offload also gives you protection across the PCI bus. The data validation is done (hopefully offloaded) also when data+protection are written to the back-end device, so end-to-end is preserved. 3. SAS & FC have T10-PI offload. This is just adding RDMA into the game. With this set of verbs, iSER, SRP and FCoE initiators and targets will be able to support T10-PI. I'm sure you're not expecting me to say, "Sure, I'll merge it without understanding the problem it's solving Problem: T10-PI offload support for RDMA-based initiators. Supporting end-to-end data integrity while sustaining high RDMA performance. or how it's doing that," How it's doing that: - We introduce a new type of memory region that possesses protection attributes suited for data integrity offload. - We introduce a new fast registration method that can bind all the relevant info for verify/generate of protection information: * describe if/how to interleave data with protection. * describe what method of data integrity is used (DIF type X, CRC, XOR...) and the seeds that HW should start calculation from. * describe how to verify the data. - We introduce a new lightweight check of the data-integrity status to check if there were any integrity errors and get information on them. Note: We made the MR allocation routine generic enough to lay a framework to unite all MR allocation methods (get_dma_mr, alloc_fast_reg_mr, reg_phys, reg_user_mr, fmrs, and probably more in the future...).
We defined ib_create_mr that can actually get mr_init_attr which can be easily extended, as opposed to the specific calls that exist today. So I would say this even reduces complexity. Hope this helps, Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
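To make the three steps listed above concrete, here is a rough consumer-side sketch. It assumes the verbs take the shape posted in the series (ib_create_mr() with an ib_mr_init_attr carrying IB_MR_SIGNATURE_EN, and ib_check_mr_status() for the post-completion check); the exact names and fields are whatever finally gets merged, and the signature-handover registration work request itself is omitted.

#include <rdma/ib_verbs.h>

/* Step (a): allocate a signature-enabled MR.  Step (b) would post a fast
 * registration WR carrying the DIF attributes (block size, guard type,
 * ref/app tag seeds, check mask) followed by the RDMA operation.  Step (c)
 * is the lightweight status check after completion.
 */
static int pi_io_sketch(struct ib_pd *pd)
{
	struct ib_mr_init_attr mr_attr = {
		.max_reg_descriptors	= 2,		/* data + protection */
		.flags			= IB_MR_SIGNATURE_EN,
	};
	struct ib_mr_status mr_status;
	struct ib_mr *sig_mr;
	int ret;

	sig_mr = ib_create_mr(pd, &mr_attr);		/* step a */
	if (IS_ERR(sig_mr))
		return PTR_ERR(sig_mr);

	/* step b: register with DIF attributes via post_send, then do RDMA */

	ret = ib_check_mr_status(sig_mr, IB_MR_CHECK_SIG_STATUS, &mr_status);
	if (!ret && (mr_status.fail_status & IB_MR_CHECK_SIG_STATUS))
		pr_err("PI error at offset %llu\n",
		       (unsigned long long)mr_status.sig_err.sig_err_offset);

	ib_dereg_mr(sig_mr);
	return ret;
}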
Re: [PATCH-v2 10/17] target: Add protection SGLs to target_submit_cmd_map_sgls
On 1/22/2014 12:17 AM, Nicholas A. Bellinger wrote: On Sun, 2014-01-19 at 14:12 +0200, Sagi Grimberg wrote: On 1/19/2014 4:44 AM, Nicholas A. Bellinger wrote: From: Nicholas Bellinger This patch adds support to target_submit_cmd_map_sgls() for accepting 'sgl_prot' + 'sgl_prot_count' parameters for DIF protection information. Note the passed parameters are stored at se_cmd->t_prot_sg and se_cmd->t_prot_nents respectively. Also, update tcm_loop and vhost-scsi fabrics usage of target_submit_cmd_map_sgls() to take into account the new parameters. I didn't see that you added protection allocation to transports that does not use target_submit_cmd_map_sgls() - which happens to be iSCSI/iSER/SRP :( Don't you think that prot SG allocation should be added also to target_alloc_sgl()? by then se_cmd should contain the protection attributes and this routine can know if it needs to allocate prot_sg as well. This is how I used it... Yes, this specific bit was left out for the moment as no code in the patch for v3.14 actually uses it.. I'm planning to add it to for-next -> v3.15 code as soon as the merge window closes. --nab Yes, that makes sense to me. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
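A minimal sketch of the allocation path Sagi describes, assuming target_alloc_sgl() keeps its current (sgl, nents, length, zero_page) signature and that se_cmd grows the t_prot_sg/t_prot_nents/prot_length fields this series adds. It only shows where the protection SGL allocation would live; the actual change queued for v3.15 may look different.

/* Hypothetical helper: allocate the protection SGL alongside the data SGL,
 * so fabrics that do not go through target_submit_cmd_map_sgls()
 * (iSCSI/iSER/SRP) still get se_cmd->t_prot_sg populated.
 */
static int alloc_cmd_sgls_sketch(struct se_cmd *cmd)
{
	int ret;

	ret = target_alloc_sgl(&cmd->t_data_sg, &cmd->t_data_nents,
			       cmd->data_length, false);
	if (ret < 0)
		return ret;

	if (cmd->se_dev->dev_attrib.pi_prot_type) {
		ret = target_alloc_sgl(&cmd->t_prot_sg, &cmd->t_prot_nents,
				       cmd->prot_length, false);
		if (ret < 0)
			return ret;
	}
	return 0;
}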
Re: [PATCH-v2 11/17] target/iblock: Add blk_integrity + BIP passthrough support
On 1/22/2014 3:52 AM, Martin K. Petersen wrote: "Sagi" == Sagi Grimberg writes: Sagi> Please remind me why we ignore IP-CSUM guard type again? MKP, Sagi> will this be irrelevant for the initiator as well? if so, I don't Sagi> see a reason to expose this in RDMA verbs. I don't see much use for IP checksum for the target. You are required by SBC to use T10 CRC on the wire so there is no point in converting to IP checksum in the backend. My impending patches will allow you to pass through PI with T10 CRC to a device with an IP checksum block integrity profile (i.e. the choice of checksum is a per-bio bip flag instead of an HBA-enforced global). OK, so IP checksum support still makes sense. Thanks! Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
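For context on the guard-type discussion: the protection information itself is the same 8-byte tuple after every logical block regardless of the guard algorithm, which is why a per-bio flag (rather than an HBA-wide setting) is enough to let a T10 CRC on the wire coexist with an IP-checksum profile on the backend. The struct below only illustrates that layout; the field names are not taken from any particular header.

#include <linux/types.h>

/* Illustrative layout of the 8 bytes of T10 PI appended to each block. */
struct pi_tuple_sketch {
	__be16 guard_tag;	/* CRC16 on the wire per SBC; an IP-checksum
				 * profile differs only in how this is computed */
	__be16 app_tag;		/* application tag, 0xffff acts as an escape */
	__be32 ref_tag;		/* typically the low 32 bits of the LBA (Type 1) */
};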
Re: [PATCH-v2 12/17] target/file: Add DIF protection init/format support
On 1/22/2014 12:28 AM, Nicholas A. Bellinger wrote: On Sun, 2014-01-19 at 14:31 +0200, Sagi Grimberg wrote: On 1/19/2014 4:44 AM, Nicholas A. Bellinger wrote: From: Nicholas Bellinger This patch adds support for DIF protection init/format support into the FILEIO backend. It involves using a separate $FILE.protection for storing PI that is opened via fd_init_prot() using the common pi_prot_type attribute. The actual formatting of the protection is done via fd_format_prot() using the common pi_prot_format attribute, that will populate the initial PI data based upon the currently configured pi_prot_type. Based on original FILEIO code from Sagi. Nice! See comments below... v1 changes: - Fix sparse warnings in fd_init_format_buf (Fengguang) Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Sagi Grimberg Cc: Or Gerlitz Signed-off-by: Nicholas Bellinger --- drivers/target/target_core_file.c | 137 + drivers/target/target_core_file.h |4 ++ 2 files changed, 141 insertions(+) diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c index 0e34cda..119d519 100644 --- a/drivers/target/target_core_file.c +++ b/drivers/target/target_core_file.c @@ -700,6 +700,140 @@ static sector_t fd_get_blocks(struct se_device *dev) dev->dev_attrib.block_size); } +static int fd_init_prot(struct se_device *dev) +{ + struct fd_dev *fd_dev = FD_DEV(dev); + struct file *prot_file, *file = fd_dev->fd_file; + struct inode *inode; + int ret, flags = O_RDWR | O_CREAT | O_LARGEFILE | O_DSYNC; + char buf[FD_MAX_DEV_PROT_NAME]; + + if (!file) { + pr_err("Unable to locate fd_dev->fd_file\n"); + return -ENODEV; + } + + inode = file->f_mapping->host; + if (S_ISBLK(inode->i_mode)) { + pr_err("FILEIO Protection emulation only supported on" + " !S_ISBLK\n"); + return -ENOSYS; + } + + if (fd_dev->fbd_flags & FDBD_HAS_BUFFERED_IO_WCE) + flags &= ~O_DSYNC; + + snprintf(buf, FD_MAX_DEV_PROT_NAME, "%s.protection", +fd_dev->fd_dev_name); + + prot_file = filp_open(buf, flags, 0600); + if (IS_ERR(prot_file)) { + pr_err("filp_open(%s) failed\n", buf); + ret = PTR_ERR(prot_file); + return ret; + } + fd_dev->fd_prot_file = prot_file; + + return 0; +} + +static void fd_init_format_buf(struct se_device *dev, unsigned char *buf, + u32 unit_size, u32 *ref_tag, u16 app_tag, + bool inc_reftag) +{ + unsigned char *p = buf; + int i; + + for (i = 0; i < unit_size; i += dev->prot_length) { + *((u16 *)&p[0]) = 0xffff; + *((__be16 *)&p[2]) = cpu_to_be16(app_tag); + *((__be32 *)&p[4]) = cpu_to_be32(*ref_tag); + + if (inc_reftag) + (*ref_tag)++; + + p += dev->prot_length; + } +} + +static int fd_format_prot(struct se_device *dev) +{ + struct fd_dev *fd_dev = FD_DEV(dev); + struct file *prot_fd = fd_dev->fd_prot_file; + sector_t prot_length, prot; + unsigned char *buf; + loff_t pos = 0; + u32 ref_tag = 0; + int unit_size = FDBD_FORMAT_UNIT_SIZE * dev->dev_attrib.block_size; + int rc, ret = 0, size, len; + bool inc_reftag = false; + + if (!dev->dev_attrib.pi_prot_type) { + pr_err("Unable to format_prot while pi_prot_type == 0\n"); + return -ENODEV; + } + if (!prot_fd) { + pr_err("Unable to locate fd_dev->fd_prot_file\n"); + return -ENODEV; + } + + switch (dev->dev_attrib.pi_prot_type) { redundant - see below.
+ case TARGET_DIF_TYPE3_PROT: + ref_tag = 0xffffffff; + break; + case TARGET_DIF_TYPE2_PROT: + case TARGET_DIF_TYPE1_PROT: + inc_reftag = true; + break; + default: + break; + } + + buf = vzalloc(unit_size); + if (!buf) { + pr_err("Unable to allocate FILEIO prot buf\n"); + return -ENOMEM; + } + + prot_length = (dev->transport->get_blocks(dev) + 1) * dev->prot_length; + size = prot_length; + + pr_debug("Using FILEIO prot_length: %llu\n", +(unsigned long long)prot_length); + + for (prot = 0; prot < prot_length; prot += unit_size) { + + fd_init_format_buf(dev, buf, unit_size, &ref_tag, 0xffff, + inc_reftag); I didn't send you my latest patches (my fault...). T10-PI format should only place escape values throughout the protection file (fill it with 0xff). So I guess in this case fd_init_format_buf() boils down to memset(buf, 0xff, unit_size) once
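Following Sagi's comment, the formatting helper above would no longer need the per-type ref_tag/app_tag handling at all; a T10-PI format only has to seed escape values. A minimal sketch of that simplification (not the final mainline helper):

#include <linux/string.h>
#include <linux/types.h>

/* Seed every PI tuple with escape values (all 0xff bytes): guard, app and
 * ref tags at 0xff.. mean "not checked" until real PI is written.
 */
static void fd_format_buf_sketch(unsigned char *buf, u32 unit_size)
{
	memset(buf, 0xff, unit_size);
}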
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 04:34 AM, Mel Gorman wrote: On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors and it would be interesting to see if it is time to poke at this topic again. Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither were never merged for those of us that were not around at the time and who may not have the chance to dive through mailing list archives between now and March. FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly. I have a somewhat hazy memory of Andrew warning us that touching this code takes us into dark and scary places. ric -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote: > On 01/22/2014 04:34 AM, Mel Gorman wrote: > >On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > >>One topic that has been lurking forever at the edges is the current > >>4k limitation for file system block sizes. Some devices in > >>production today and others coming soon have larger sectors and it > >>would be interesting to see if it is time to poke at this topic > >>again. > >> > >Large block support was proposed years ago by Christoph Lameter > >(http://lwn.net/Articles/232757/). I think I was just getting started > >in the community at the time so I do not recall any of the details. I do > >believe it motivated an alternative by Nick Piggin called fsblock though > >(http://lwn.net/Articles/321390/). At the very least it would be nice to > >know why neither were never merged for those of us that were not around > >at the time and who may not have the chance to dive through mailing list > >archives between now and March. > > > >FWIW, I would expect that a show-stopper for any proposal is requiring > >high-order allocations to succeed for the system to behave correctly. > > > > I have a somewhat hazy memory of Andrew warning us that touching > this code takes us into dark and scary places. > That is a light summary. As Andrew tends to reject patches with poor documentation in case we forget the details in 6 months, I'm going to guess that he does not remember the details of a discussion from 7ish years ago. This is where Andrew swoops in with a dazzling display of his eidetic memory just to prove me wrong. Ric, are there any storage vendor that is pushing for this right now? Is someone working on this right now or planning to? If they are, have they looked into the history of fsblock (Nick) and large block support (Christoph) to see if they are candidates for forward porting or reimplementation? I ask because without that person there is a risk that the discussion will go as follows Topic leader: Does anyone have an objection to supporting larger block sizes than the page size? Room: Send patches and we'll talk. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 09:34 AM, Mel Gorman wrote: On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote: On 01/22/2014 04:34 AM, Mel Gorman wrote: On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors and it would be interesting to see if it is time to poke at this topic again. Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither were never merged for those of us that were not around at the time and who may not have the chance to dive through mailing list archives between now and March. FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly. I have a somewhat hazy memory of Andrew warning us that touching this code takes us into dark and scary places. That is a light summary. As Andrew tends to reject patches with poor documentation in case we forget the details in 6 months, I'm going to guess that he does not remember the details of a discussion from 7ish years ago. This is where Andrew swoops in with a dazzling display of his eidetic memory just to prove me wrong. Ric, are there any storage vendor that is pushing for this right now? Is someone working on this right now or planning to? If they are, have they looked into the history of fsblock (Nick) and large block support (Christoph) to see if they are candidates for forward porting or reimplementation? I ask because without that person there is a risk that the discussion will go as follows Topic leader: Does anyone have an objection to supporting larger block sizes than the page size? Room: Send patches and we'll talk. I will have to see if I can get a storage vendor to make a public statement, but there are vendors hoping to see this land in Linux in the next few years. I assume that anyone with a shipping device will have to at least emulate the 4KB sector size for years to come, but that there might be a significant performance win for platforms that can do a larger block. Note that windows seems to suffer from the exact same limitation, so we are not alone here with the vm page size/fs block size entanglement ric -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 09:34 +, Mel Gorman wrote: > On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > One topic that has been lurking forever at the edges is the current > > 4k limitation for file system block sizes. Some devices in > > production today and others coming soon have larger sectors and it > > would be interesting to see if it is time to poke at this topic > > again. > > > > Large block support was proposed years ago by Christoph Lameter > (http://lwn.net/Articles/232757/). I think I was just getting started > in the community at the time so I do not recall any of the details. I do > believe it motivated an alternative by Nick Piggin called fsblock though > (http://lwn.net/Articles/321390/). At the very least it would be nice to > know why neither were never merged for those of us that were not around > at the time and who may not have the chance to dive through mailing list > archives between now and March. > > FWIW, I would expect that a show-stopper for any proposal is requiring > high-order allocations to succeed for the system to behave correctly. > My memory is that Nick's work just didn't have the momentum to get pushed in. It all seemed very reasonable though, I think our hatred of buffered heads just wasn't yet bigger than the fear of moving away. But, the bigger question is how big are the blocks going to be? At some point (64K?) we might as well just make a log structured dm target and have a single setup for both shingled and large sector drives. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote: > On 01/22/2014 09:34 AM, Mel Gorman wrote: > >On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote: > >>On 01/22/2014 04:34 AM, Mel Gorman wrote: > >>>On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > One topic that has been lurking forever at the edges is the current > 4k limitation for file system block sizes. Some devices in > production today and others coming soon have larger sectors and it > would be interesting to see if it is time to poke at this topic > again. > > >>>Large block support was proposed years ago by Christoph Lameter > >>>(http://lwn.net/Articles/232757/). I think I was just getting started > >>>in the community at the time so I do not recall any of the details. I do > >>>believe it motivated an alternative by Nick Piggin called fsblock though > >>>(http://lwn.net/Articles/321390/). At the very least it would be nice to > >>>know why neither were never merged for those of us that were not around > >>>at the time and who may not have the chance to dive through mailing list > >>>archives between now and March. > >>> > >>>FWIW, I would expect that a show-stopper for any proposal is requiring > >>>high-order allocations to succeed for the system to behave correctly. > >>> > >>I have a somewhat hazy memory of Andrew warning us that touching > >>this code takes us into dark and scary places. > >> > >That is a light summary. As Andrew tends to reject patches with poor > >documentation in case we forget the details in 6 months, I'm going to guess > >that he does not remember the details of a discussion from 7ish years ago. > >This is where Andrew swoops in with a dazzling display of his eidetic > >memory just to prove me wrong. > > > >Ric, are there any storage vendor that is pushing for this right now? > >Is someone working on this right now or planning to? If they are, have they > >looked into the history of fsblock (Nick) and large block support (Christoph) > >to see if they are candidates for forward porting or reimplementation? > >I ask because without that person there is a risk that the discussion > >will go as follows > > > >Topic leader: Does anyone have an objection to supporting larger block > > sizes than the page size? > >Room: Send patches and we'll talk. > > > > I will have to see if I can get a storage vendor to make a public > statement, but there are vendors hoping to see this land in Linux in > the next few years. What about the second and third questions -- is someone working on this right now or planning to? Have they looked into the history of fsblock (Nick) and large block support (Christoph) to see if they are candidates for forward porting or reimplementation? Don't get me wrong, I'm interested in the topic but I severely doubt I'd have the capacity to research the background of this in advance. It's also unlikely that I'd work on it in the future without throwing out my current TODO list. In an ideal world someone will have done the legwork in advance of LSF/MM to help drive the topic. -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Tue, 2014-01-21 at 22:04 -0500, Ric Wheeler wrote: > One topic that has been lurking forever at the edges is the current 4k > limitation for file system block sizes. Some devices in production today and > others coming soon have larger sectors and it would be interesting to see if > it > is time to poke at this topic again. > > LSF/MM seems to be pretty much the only event of the year that most of the > key > people will be present, so should be a great topic for a joint session. But the question is what will the impact be. A huge amount of fuss was made about 512->4k. Linux was totally ready because we had variable block sizes and our page size is 4k. I even have one pure 4k sector drive that works in one of my test systems. However, the result was the market chose to go the physical/logical route because of other Operating System considerations, all 4k drives expose 512 byte sectors and do RMW internally. For us it becomes about layout and alignment, which we already do. I can't see how going to 8k or 16k would be any different from what we've already done. In other words, this is an already solved problem. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 15:14 +, Chris Mason wrote: > On Wed, 2014-01-22 at 09:34 +, Mel Gorman wrote: > > On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > > One topic that has been lurking forever at the edges is the current > > > 4k limitation for file system block sizes. Some devices in > > > production today and others coming soon have larger sectors and it > > > would be interesting to see if it is time to poke at this topic > > > again. > > > > > > > Large block support was proposed years ago by Christoph Lameter > > (http://lwn.net/Articles/232757/). I think I was just getting started > > in the community at the time so I do not recall any of the details. I do > > believe it motivated an alternative by Nick Piggin called fsblock though > > (http://lwn.net/Articles/321390/). At the very least it would be nice to > > know why neither were never merged for those of us that were not around > > at the time and who may not have the chance to dive through mailing list > > archives between now and March. > > > > FWIW, I would expect that a show-stopper for any proposal is requiring > > high-order allocations to succeed for the system to behave correctly. > > > > My memory is that Nick's work just didn't have the momentum to get > pushed in. It all seemed very reasonable though, I think our hatred of > buffered heads just wasn't yet bigger than the fear of moving away. > > But, the bigger question is how big are the blocks going to be? At some > point (64K?) we might as well just make a log structured dm target and > have a single setup for both shingled and large sector drives. There is no real point. Even with 4k drives today using 4k sectors in the filesystem, we still get 512 byte writes because of journalling and the buffer cache. The question is what would we need to do to support these devices and the answer is "try to send IO in x byte multiples x byte aligned" this really becomes an ioscheduler problem, not a supporting large page problem. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
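As a concrete illustration of "send IO in x byte multiples x byte aligned": the block layer already lets a driver advertise a large physical sector while keeping a smaller logical one, using the existing queue-limit helpers. The sketch below uses a hypothetical 16K-sector device; whether these hints are sufficient, or whether the page cache and filesystem block size must grow too, is exactly the question being debated in this thread.

#include <linux/blkdev.h>

/* Sketch: advertise a 16K-sector device without changing the addressable
 * (logical) block size, so correctly aligned writers avoid device-side RMW.
 */
static void setup_large_sector_queue(struct request_queue *q)
{
	blk_queue_logical_block_size(q, 4096);		/* addressable unit */
	blk_queue_physical_block_size(q, 16384);	/* internal sector size */
	blk_queue_io_min(q, 16384);			/* minimum preferred I/O */
	blk_queue_io_opt(q, 16384);			/* optimal I/O size */
}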
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 11:03 AM, James Bottomley wrote: On Wed, 2014-01-22 at 15:14 +, Chris Mason wrote: On Wed, 2014-01-22 at 09:34 +, Mel Gorman wrote: On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: One topic that has been lurking forever at the edges is the current 4k limitation for file system block sizes. Some devices in production today and others coming soon have larger sectors and it would be interesting to see if it is time to poke at this topic again. Large block support was proposed years ago by Christoph Lameter (http://lwn.net/Articles/232757/). I think I was just getting started in the community at the time so I do not recall any of the details. I do believe it motivated an alternative by Nick Piggin called fsblock though (http://lwn.net/Articles/321390/). At the very least it would be nice to know why neither were never merged for those of us that were not around at the time and who may not have the chance to dive through mailing list archives between now and March. FWIW, I would expect that a show-stopper for any proposal is requiring high-order allocations to succeed for the system to behave correctly. My memory is that Nick's work just didn't have the momentum to get pushed in. It all seemed very reasonable though, I think our hatred of buffered heads just wasn't yet bigger than the fear of moving away. But, the bigger question is how big are the blocks going to be? At some point (64K?) we might as well just make a log structured dm target and have a single setup for both shingled and large sector drives. There is no real point. Even with 4k drives today using 4k sectors in the filesystem, we still get 512 byte writes because of journalling and the buffer cache. I think that you are wrong here James. Even with 512 byte drives, the IO's we send down tend to be 4k or larger. Do you have traces that show this and details? The question is what would we need to do to support these devices and the answer is "try to send IO in x byte multiples x byte aligned" this really becomes an ioscheduler problem, not a supporting large page problem. James Not that simple. The requirement of some of these devices are that you *never* send down a partial write or an unaligned write. Also keep in mind that larger block sizes allow us to track larger files with smaller amounts of metadata which is a second win. Ric -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v2 02/17] target: Add DIF CHECK_CONDITION ASC/ASCQ exception cases
On 1/19/2014 4:44 AM, Nicholas A. Bellinger wrote: From: Nicholas Bellinger This patch adds support for DIF related CHECK_CONDITION ASC/ASCQ exception cases into transport_send_check_condition_and_sense(). This includes: LOGICAL BLOCK GUARD CHECK FAILED LOGICAL BLOCK APPLICATION TAG CHECK FAILED LOGICAL BLOCK REFERENCE TAG CHECK FAILED that used by DIF TYPE1 and TYPE3 failure cases. Cc: Martin K. Petersen Cc: Christoph Hellwig Cc: Hannes Reinecke Cc: Sagi Grimberg Cc: Or Gerlitz Signed-off-by: Nicholas Bellinger --- drivers/target/target_core_transport.c | 30 ++ include/target/target_core_base.h |3 +++ 2 files changed, 33 insertions(+) diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 18c828d..fa4fc04 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -2674,6 +2674,36 @@ transport_send_check_condition_and_sense(struct se_cmd *cmd, buffer[SPC_ASC_KEY_OFFSET] = 0x1d; buffer[SPC_ASCQ_KEY_OFFSET] = 0x00; break; + case TCM_LOGICAL_BLOCK_GUARD_CHECK_FAILED: + /* CURRENT ERROR */ + buffer[0] = 0x70; + buffer[SPC_ADD_SENSE_LEN_OFFSET] = 10; + /* ILLEGAL REQUEST */ + buffer[SPC_SENSE_KEY_OFFSET] = ILLEGAL_REQUEST; + /* LOGICAL BLOCK GUARD CHECK FAILED */ + buffer[SPC_ASC_KEY_OFFSET] = 0x10; + buffer[SPC_ASCQ_KEY_OFFSET] = 0x01; + break; + case TCM_LOGICAL_BLOCK_APP_TAG_CHECK_FAILED: + /* CURRENT ERROR */ + buffer[0] = 0x70; + buffer[SPC_ADD_SENSE_LEN_OFFSET] = 10; + /* ILLEGAL REQUEST */ + buffer[SPC_SENSE_KEY_OFFSET] = ILLEGAL_REQUEST; + /* LOGICAL BLOCK APPLICATION TAG CHECK FAILED */ + buffer[SPC_ASC_KEY_OFFSET] = 0x10; + buffer[SPC_ASCQ_KEY_OFFSET] = 0x02; + break; + case TCM_LOGICAL_BLOCK_REF_TAG_CHECK_FAILED: + /* CURRENT ERROR */ + buffer[0] = 0x70; + buffer[SPC_ADD_SENSE_LEN_OFFSET] = 10; + /* ILLEGAL REQUEST */ + buffer[SPC_SENSE_KEY_OFFSET] = ILLEGAL_REQUEST; + /* LOGICAL BLOCK REFERENCE TAG CHECK FAILED */ + buffer[SPC_ASC_KEY_OFFSET] = 0x10; + buffer[SPC_ASCQ_KEY_OFFSET] = 0x03; + break; Hey Nic, I think we missed the failed LBA here. AFAICT According to SPC-4, a DIF error should be accompanied by Information sense-data descriptor with the (first) failed sector in the information field. This means that this routine should be ready to accept a u32 bad_sector or something. I'm not sure how much of a must it really is. Let me prepare a patch... Sagi. case TCM_LOGICAL_UNIT_COMMUNICATION_FAILURE: default: /* CURRENT ERROR */ diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h index d98048b..0336d70 100644 --- a/include/target/target_core_base.h +++ b/include/target/target_core_base.h @@ -205,6 +205,9 @@ enum tcm_sense_reason_table { TCM_OUT_OF_RESOURCES= R(0x12), TCM_PARAMETER_LIST_LENGTH_ERROR = R(0x13), TCM_MISCOMPARE_VERIFY = R(0x14), + TCM_LOGICAL_BLOCK_GUARD_CHECK_FAILED= R(0x15), + TCM_LOGICAL_BLOCK_APP_TAG_CHECK_FAILED = R(0x16), + TCM_LOGICAL_BLOCK_REF_TAG_CHECK_FAILED = R(0x17), #undef R }; -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
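To make the suggestion above concrete: with the fixed-format sense buffers built in this patch (response code 0x70), reporting the first failed sector means setting the VALID bit and filling the 4-byte INFORMATION field. The helper below is only a sketch of that idea; the eventual patch may use descriptor-format sense with an Information descriptor instead, and the plumbing of bad_sector into transport_send_check_condition_and_sense() is not shown.

#include <asm/unaligned.h>
#include <linux/types.h>

/* Sketch: record the first failed LBA in fixed-format sense data. */
static void set_sense_info_sketch(unsigned char *buffer, u32 bad_sector)
{
	buffer[0] |= 0x80;				/* VALID bit */
	put_unaligned_be32(bad_sector, &buffer[3]);	/* INFORMATION, bytes 3..6 */
}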
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 15:19 +, Mel Gorman wrote: > On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote: > > On 01/22/2014 09:34 AM, Mel Gorman wrote: > > >On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote: > > >>On 01/22/2014 04:34 AM, Mel Gorman wrote: > > >>>On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > One topic that has been lurking forever at the edges is the current > > 4k limitation for file system block sizes. Some devices in > > production today and others coming soon have larger sectors and it > > would be interesting to see if it is time to poke at this topic > > again. > > > > >>>Large block support was proposed years ago by Christoph Lameter > > >>>(http://lwn.net/Articles/232757/). I think I was just getting started > > >>>in the community at the time so I do not recall any of the details. I do > > >>>believe it motivated an alternative by Nick Piggin called fsblock though > > >>>(http://lwn.net/Articles/321390/). At the very least it would be nice to > > >>>know why neither were never merged for those of us that were not around > > >>>at the time and who may not have the chance to dive through mailing list > > >>>archives between now and March. > > >>> > > >>>FWIW, I would expect that a show-stopper for any proposal is requiring > > >>>high-order allocations to succeed for the system to behave correctly. > > >>> > > >>I have a somewhat hazy memory of Andrew warning us that touching > > >>this code takes us into dark and scary places. > > >> > > >That is a light summary. As Andrew tends to reject patches with poor > > >documentation in case we forget the details in 6 months, I'm going to guess > > >that he does not remember the details of a discussion from 7ish years ago. > > >This is where Andrew swoops in with a dazzling display of his eidetic > > >memory just to prove me wrong. > > > > > >Ric, are there any storage vendor that is pushing for this right now? > > >Is someone working on this right now or planning to? If they are, have they > > >looked into the history of fsblock (Nick) and large block support > > >(Christoph) > > >to see if they are candidates for forward porting or reimplementation? > > >I ask because without that person there is a risk that the discussion > > >will go as follows > > > > > >Topic leader: Does anyone have an objection to supporting larger block > > > sizes than the page size? > > >Room: Send patches and we'll talk. > > > > > > > I will have to see if I can get a storage vendor to make a public > > statement, but there are vendors hoping to see this land in Linux in > > the next few years. > > What about the second and third questions -- is someone working on this > right now or planning to? Have they looked into the history of fsblock > (Nick) and large block support (Christoph) to see if they are candidates > for forward porting or reimplementation? I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 11:45 -0500, Ric Wheeler wrote: > On 01/22/2014 11:03 AM, James Bottomley wrote: > > On Wed, 2014-01-22 at 15:14 +, Chris Mason wrote: > >> On Wed, 2014-01-22 at 09:34 +, Mel Gorman wrote: > >>> On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > One topic that has been lurking forever at the edges is the current > 4k limitation for file system block sizes. Some devices in > production today and others coming soon have larger sectors and it > would be interesting to see if it is time to poke at this topic > again. > > >>> Large block support was proposed years ago by Christoph Lameter > >>> (http://lwn.net/Articles/232757/). I think I was just getting started > >>> in the community at the time so I do not recall any of the details. I do > >>> believe it motivated an alternative by Nick Piggin called fsblock though > >>> (http://lwn.net/Articles/321390/). At the very least it would be nice to > >>> know why neither were never merged for those of us that were not around > >>> at the time and who may not have the chance to dive through mailing list > >>> archives between now and March. > >>> > >>> FWIW, I would expect that a show-stopper for any proposal is requiring > >>> high-order allocations to succeed for the system to behave correctly. > >>> > >> My memory is that Nick's work just didn't have the momentum to get > >> pushed in. It all seemed very reasonable though, I think our hatred of > >> buffered heads just wasn't yet bigger than the fear of moving away. > >> > >> But, the bigger question is how big are the blocks going to be? At some > >> point (64K?) we might as well just make a log structured dm target and > >> have a single setup for both shingled and large sector drives. > > There is no real point. Even with 4k drives today using 4k sectors in > > the filesystem, we still get 512 byte writes because of journalling and > > the buffer cache. > > I think that you are wrong here James. Even with 512 byte drives, the IO's we > send down tend to be 4k or larger. Do you have traces that show this and > details? It's mostly an ext3 journalling issue ... and it's only metadata and mostly the ioschedulers can elevate it into 4k chunks, so yes, most of our writes are 4k+, so this is a red herring, yes. > > > > The question is what would we need to do to support these devices and > > the answer is "try to send IO in x byte multiples x byte aligned" this > > really becomes an ioscheduler problem, not a supporting large page > > problem. > > > > James > > > > Not that simple. > > The requirement of some of these devices are that you *never* send down a > partial write or an unaligned write. But this is the million dollar question. That was originally going to be the requirement of the 4k sector devices but look what happened in the market. > Also keep in mind that larger block sizes allow us to track larger > files with > smaller amounts of metadata which is a second win. Larger file block sizes are completely independent from larger device block sizes (we can have 16k file block sizes on 4k or even 512b devices). The questions on larger block size devices are twofold: 1. If manufacturers tell us that they'll only support I/O on the physical sector size, do we believe them, given that they said this before on 4k and then backed down. All the logical vs physical sector stuff is now in T10 standards, why would they try to go all physical again, especially as they've now all written firmware that does the necessary RMW? 2. 
If we agree they'll do RMW in Firmware again, what do we have to do to take advantage of larger sector sizes beyond what we currently do in alignment and chunking? There may still be issues in FS journal and data layouts. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote: > On Wed, 2014-01-22 at 15:19 +, Mel Gorman wrote: > > On Wed, Jan 22, 2014 at 09:58:46AM -0500, Ric Wheeler wrote: > > > On 01/22/2014 09:34 AM, Mel Gorman wrote: > > > >On Wed, Jan 22, 2014 at 09:10:48AM -0500, Ric Wheeler wrote: > > > >>On 01/22/2014 04:34 AM, Mel Gorman wrote: > > > >>>On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > > One topic that has been lurking forever at the edges is the current > > > 4k limitation for file system block sizes. Some devices in > > > production today and others coming soon have larger sectors and it > > > would be interesting to see if it is time to poke at this topic > > > again. > > > > > > >>>Large block support was proposed years ago by Christoph Lameter > > > >>>(http://lwn.net/Articles/232757/). I think I was just getting started > > > >>>in the community at the time so I do not recall any of the details. I > > > >>>do > > > >>>believe it motivated an alternative by Nick Piggin called fsblock > > > >>>though > > > >>>(http://lwn.net/Articles/321390/). At the very least it would be nice > > > >>>to > > > >>>know why neither were never merged for those of us that were not around > > > >>>at the time and who may not have the chance to dive through mailing > > > >>>list > > > >>>archives between now and March. > > > >>> > > > >>>FWIW, I would expect that a show-stopper for any proposal is requiring > > > >>>high-order allocations to succeed for the system to behave correctly. > > > >>> > > > >>I have a somewhat hazy memory of Andrew warning us that touching > > > >>this code takes us into dark and scary places. > > > >> > > > >That is a light summary. As Andrew tends to reject patches with poor > > > >documentation in case we forget the details in 6 months, I'm going to > > > >guess > > > >that he does not remember the details of a discussion from 7ish years > > > >ago. > > > >This is where Andrew swoops in with a dazzling display of his eidetic > > > >memory just to prove me wrong. > > > > > > > >Ric, are there any storage vendor that is pushing for this right now? > > > >Is someone working on this right now or planning to? If they are, have > > > >they > > > >looked into the history of fsblock (Nick) and large block support > > > >(Christoph) > > > >to see if they are candidates for forward porting or reimplementation? > > > >I ask because without that person there is a risk that the discussion > > > >will go as follows > > > > > > > >Topic leader: Does anyone have an objection to supporting larger block > > > > sizes than the page size? > > > >Room: Send patches and we'll talk. > > > > > > > > > > I will have to see if I can get a storage vendor to make a public > > > statement, but there are vendors hoping to see this land in Linux in > > > the next few years. > > > > What about the second and third questions -- is someone working on this > > right now or planning to? Have they looked into the history of fsblock > > (Nick) and large block support (Christoph) to see if they are candidates > > for forward porting or reimplementation? > > I really think that if we want to make progress on this one, we need > code and someone that owns it. Nick's work was impressive, but it was > mostly there for getting rid of buffer heads. If we have a device that > needs it and someone working to enable that device, we'll go forward > much faster. Do we even need to do that (eliminate buffer heads)? 
We cope with 4k sector only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself which allows us to do only 4k chunked writes, so we could keep the bh system and just alter the granularity of the page cache. The other question is if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in what would altering the granularity of the page cache buy us? James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v2 03/17] target/sbc: Add DIF setup in sbc_check_prot + sbc_parse_cdb
On 1/22/2014 12:48 AM, Nicholas A. Bellinger wrote: + cmd->prot_handover = PROT_SEPERATED; I know that we are not planning to support interleaved mode at the moment, But I think that the protection handover type is the backstore preference and should be taken from se_dev. But it is not that important for now... Yeah, I figured since the RDMA pieces needed the handover type defined in some form, it made sense to include PROT_SEPERATED hardcoded here, but stopped short of adding se_dev->prot_handler for the first round merge. --nab Actually they don't, I just added them in iSER code to demonstrate the HW ability. If we are not planning to support that (although as MKP mentioned it might be useful in some cases), you can remove that for now and we can add it in the future - iSER can ignore it for now (I'll refactor the patches). Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: > On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote: [ I like big sectors and I cannot lie ] > > > I really think that if we want to make progress on this one, we need > > code and someone that owns it. Nick's work was impressive, but it was > > mostly there for getting rid of buffer heads. If we have a device that > > needs it and someone working to enable that device, we'll go forward > > much faster. > > Do we even need to do that (eliminate buffer heads)? We cope with 4k > sector only devices just fine today because the bh mechanisms now > operate on top of the page cache and can do the RMW necessary to update > a bh in the page cache itself which allows us to do only 4k chunked > writes, so we could keep the bh system and just alter the granularity of > the page cache. > We're likely to have people mixing 4K drives and on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this. >From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata. > The other question is if the drive does RMW between 4k and whatever its > physical sector size, do we need to do anything to take advantage of > it ... as in what would altering the granularity of the page cache buy > us? The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] First round of SCSI updates for the 3.13+ merge window
This patch set is a lot of driver updates for qla4xxx, bfa, hpsa, qla2xxx. It also removes the aic7xxx_old driver (which has been deprecated for nearly a decade) and adds support for deadlines in error handling. The patch is available here: git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git scsi-for-linus The short changelog is: Aaron Lu (1): sr: use block layer runtime PM Adheer Chandravanshi (5): qla4xxx: Recreate chap data list during get chap operation qla4xxx: Add support for ISCSI_PARAM_LOCAL_IPADDR sysfs attr libiscsi: Add local_ipaddr parameter in iscsi_conn struct scsi_transport_iscsi: Export ISCSI_PARAM_LOCAL_IPADDR attr for iscsi_connection qla4xxx: Update driver version to 5.04.00-k2 Akinobu Mita (1): scsi_debug: simplify creation and destruction of driver attribute files Alan (1): mac_scsi: Fix crash on out of memory Armen Baloyan (5): qla2xxx: Replace a constant with a macro definition for host->canqueue assigmnment. qla2xxx: Add changes to obtain ISPFX00 adapters product information in accordance with firmware update. qla2xxx: Add logic to abort BSG commands for ISPFX00. qla2xxx: Fix issue with not displaying node name after system reboot. qla2xxx: Print proper QLAFX00 product name at probe. Atul Deshmukh (1): qla2xxx: Clear RISC INT reg only for an event and not always while polling. Ben Collins (1): megaraid: Use resource_size_t for PCI resources, not long Bodo Stroesser (1): st: fix enlarge_buffer Chad Dupuis (7): qla2xxx: Only complete dcbx_comp and lb_portup_comp for virtual port index 0. qla2xxx: Use the correct mailbox registers when acknowledging an IDC request on ISP8044. qla2xxx: Disable adapter when we encounter a PCI disconnect. qla2xxx: Refactor shutdown code so some functionality can be reused. Revert "qla2xxx: Ramp down queue depth for attached SCSI devices when driver resources are low." qla2xxx: Add BPM support for ISP25xx. qla2xxx: Honor execute firmware failures. Dan Carpenter (1): qla4xxx: overflow in qla4xxx_set_chap_entry() Douglas Gilbert (1): MAINTAINERS: update sg entry Felipe Pena (1): lpfc: Fix wrong assignment in lpfc_debugfs.c Geert Uytterhoeven (1): sd: Do not call do_div() with a 64-bit divisor Geyslan G. Bem (1): be2iscsi: fix memory leak in error path Hannes Reinecke (3): Update documentation Unlock accesses to eh_deadline improved eh timeout handler Harish Zunjarrao (3): qla4xxx: Add support for additional network parameters settings iscsi_transport: Additional parameters for network settings iscsi_transport: Remove net param enum values James Bottomley (1): Fix erratic device offline during EH Joe Carnuccio (3): qla2xxx: Fix undefined behavior in call to snprintf(). qla2xxx: Add BSG interface for read/write serdes register. qla2xxx: Correctly set mailboxes for extended init control block. Lalit Chandivade (2): qla4xxx: Add host statistics support scsi_transport_iscsi: Add host statistics support Matt Gates (1): hpsa: allow SCSI mid layer to handle unit attention Meelis Roos (1): qla1280: Annotate timer on stack so object debug does not complain Mike Miller (1): hpsa: remove P822se PCI ID Paul Gortmaker (1): aci7xxx_old: delete decade+ obsolete driver Ren Mingxin (1): Set the minimum valid value of 'eh_deadline' as 0 Saurav Kashyap (4): qla2xxx: Fix warning reported by smatch. qla2xxx: Update the driver version to 8.06.00.12-k. qla2xxx: Adding MAINTAINERS for qla2xxx FC-SCSI driver qla2xxx: Don't consider the drivers knocked out of IDC participation for future reset recovery process. 
Sawan Chandak (3): qla2xxx: Reset nic_core_reset_owner on moving from COLD to READY for ISP8044 qla2xxx: Use scnprintf() instead of snprintf() in the sysfs handlers. qla2xxx: Disable INTx interrupt for ISP82XX Stephen M. Cameron (11): hpsa: do not require board "not ready" status after hard reset hpsa: enable unit attention reporting hpsa: rename scsi prefetch field hpsa: use workqueue instead of kernel thread for lockup detection hpsa: prevent stalled i/o hpsa: cap CCISS_PASSTHRU at 20 concurrent commands. hpsa: add MSA 2040 to list of external target devices hpsa: fix memory leak in CCISS_BIG_PASSTHRU ioctl hpsa: remove unneeded include of seq_file.h hpsa: add 5 second delay after doorbell reset hpsa: do not attempt to flush the cache on locked up controllers Vijaya Mohan Guvva (8): bfa: Driver version upgrade to 3.2.23.0 bfa: change FC_ELS_TOV to 20sec bfa: Observed auto D-port mode instead of manual bfa: Fix for bcu or hcm faa query hang bfa: LUN discovery issue in direct attach mode bfa: Register port with SCSI even on
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote: > On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: > > On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote: > > [ I like big sectors and I cannot lie ] I think I might be sceptical, but I don't think that's showing in my concerns ... > > > I really think that if we want to make progress on this one, we need > > > code and someone that owns it. Nick's work was impressive, but it was > > > mostly there for getting rid of buffer heads. If we have a device that > > > needs it and someone working to enable that device, we'll go forward > > > much faster. > > > > Do we even need to do that (eliminate buffer heads)? We cope with 4k > > sector only devices just fine today because the bh mechanisms now > > operate on top of the page cache and can do the RMW necessary to update > > a bh in the page cache itself which allows us to do only 4k chunked > > writes, so we could keep the bh system and just alter the granularity of > > the page cache. > > > > We're likely to have people mixing 4K drives and size here> on the same box. We could just go with the biggest size and > use the existing bh code for the sub-pagesized blocks, but I really > hesitate to change VM fundamentals for this. If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem. > From a pure code point of view, it may be less work to change it once in > the VM. But from an overall system impact point of view, it's a big > change in how the system behaves just for filesystem metadata. Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it. > > The other question is if the drive does RMW between 4k and whatever its > > physical sector size, do we need to do anything to take advantage of > > it ... as in what would altering the granularity of the page cache buy > > us? > > The real benefit is when and how the reads get scheduled. We're able to > do a much better job pipelining the reads, controlling our caches and > reducing write latency by having the reads done up in the OS instead of > the drive. I agree with all of that, but my question is still can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today. If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is can the FS make use of this layout information *without* changing the page cache granularity? Only if you answer me "no" to this do I think we need to worry about changing page cache granularity. Realistically, if you look at what the I/O schedulers output on a standard (spinning rust) workload, it's mostly large transfers. Obviously these are misalgned at the ends, but we can fix some of that in the scheduler. Particularly if the FS helps us with layout. My instinct tells me that we can fix 99% of this with layout on the FS + io schedulers ... the remaining 1% goes to the drive as needing to do RMW in the device, but the net impact to our throughput shouldn't be that great. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 01:13 PM, James Bottomley wrote: On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote: On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote: [ I like big sectors and I cannot lie ] I think I might be sceptical, but I don't think that's showing in my concerns ... I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster. Do we even need to do that (eliminate buffer heads)? We cope with 4k sector only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself which allows us to do only 4k chunked writes, so we could keep the bh system and just alter the granularity of the page cache. We're likely to have people mixing 4K drives and on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this. If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem. From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata. Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it. The other question is if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in what would altering the granularity of the page cache buy us? The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive. I agree with all of that, but my question is still can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today. If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is can the FS make use of this layout information *without* changing the page cache granularity? Only if you answer me "no" to this do I think we need to worry about changing page cache granularity. Realistically, if you look at what the I/O schedulers output on a standard (spinning rust) workload, it's mostly large transfers. Obviously these are misalgned at the ends, but we can fix some of that in the scheduler. Particularly if the FS helps us with layout. My instinct tells me that we can fix 99% of this with layout on the FS + io schedulers ... the remaining 1% goes to the drive as needing to do RMW in the device, but the net impact to our throughput shouldn't be that great. James I think that the key to having the file system work with larger sectors is to create them properly aligned and use the actual, native sector size as their FS block size. Which is pretty much back the original challenge. Teaching each and every file system to be aligned at the storage granularity/minimum IO size when that is larger than the physical sector size is harder I think. 
ric -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
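As an aside on "the actual, native sector size": the block layer already exports the sizes and alignment hints this thread keeps referring to, so it is easy to check what any given drive claims (the device name below is just an example):

# 512e drives report 512/4096 here, 4Kn drives 4096/4096
cat /sys/block/sda/queue/logical_block_size
cat /sys/block/sda/queue/physical_block_size
cat /sys/block/sda/queue/minimum_io_size
cat /sys/block/sda/queue/optimal_io_size

# the same topology in one view
lsblk -t /dev/sda
blockdev --getss --getpbsz /dev/sda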
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote: > On 01/22/2014 01:13 PM, James Bottomley wrote: > > On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote: > >> On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: > >>> On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote: > >> [ I like big sectors and I cannot lie ] > > I think I might be sceptical, but I don't think that's showing in my > > concerns ... > > > I really think that if we want to make progress on this one, we need > code and someone that owns it. Nick's work was impressive, but it was > mostly there for getting rid of buffer heads. If we have a device that > needs it and someone working to enable that device, we'll go forward > much faster. > >>> Do we even need to do that (eliminate buffer heads)? We cope with 4k > >>> sector only devices just fine today because the bh mechanisms now > >>> operate on top of the page cache and can do the RMW necessary to update > >>> a bh in the page cache itself which allows us to do only 4k chunked > >>> writes, so we could keep the bh system and just alter the granularity of > >>> the page cache. > >>> > >> We're likely to have people mixing 4K drives and >> size here> on the same box. We could just go with the biggest size and > >> use the existing bh code for the sub-pagesized blocks, but I really > >> hesitate to change VM fundamentals for this. > > If the page cache had a variable granularity per device, that would cope > > with this. It's the variable granularity that's the VM problem. > > > >> From a pure code point of view, it may be less work to change it once in > >> the VM. But from an overall system impact point of view, it's a big > >> change in how the system behaves just for filesystem metadata. > > Agreed, but only if we don't do RMW in the buffer cache ... which may be > > a good reason to keep it. > > > >>> The other question is if the drive does RMW between 4k and whatever its > >>> physical sector size, do we need to do anything to take advantage of > >>> it ... as in what would altering the granularity of the page cache buy > >>> us? > >> The real benefit is when and how the reads get scheduled. We're able to > >> do a much better job pipelining the reads, controlling our caches and > >> reducing write latency by having the reads done up in the OS instead of > >> the drive. > > I agree with all of that, but my question is still can we do this by > > propagating alignment and chunk size information (i.e. the physical > > sector size) like we do today. If the FS knows the optimal I/O patterns > > and tries to follow them, the odd cockup won't impact performance > > dramatically. The real question is can the FS make use of this layout > > information *without* changing the page cache granularity? Only if you > > answer me "no" to this do I think we need to worry about changing page > > cache granularity. > > > > Realistically, if you look at what the I/O schedulers output on a > > standard (spinning rust) workload, it's mostly large transfers. > > Obviously these are misalgned at the ends, but we can fix some of that > > in the scheduler. Particularly if the FS helps us with layout. My > > instinct tells me that we can fix 99% of this with layout on the FS + io > > schedulers ... the remaining 1% goes to the drive as needing to do RMW > > in the device, but the net impact to our throughput shouldn't be that > > great. 
> > > > James > > > > I think that the key to having the file system work with larger > sectors is to > create them properly aligned and use the actual, native sector size as > their FS > block size. Which is pretty much back the original challenge. Only if you think laying out stuff requires block size changes. If a 4k block filesystem's allocation algorithm tried to allocate on a 16k boundary for instance, that gets us a lot of the performance without needing a lot of alteration. It's not even obvious that an ignorant 4k layout is going to be so bad ... the RMW occurs only at the ends of the transfers, not in the middle. If we say 16k physical block and average 128k transfers, probabilistically we misalign on 6 out of 31 sectors (or 19% of the time). We can make that better by increasing the transfer size (it comes down to 10% for 256k transfers). > Teaching each and every file system to be aligned at the storage > granularity/minimum IO size when that is larger than the physical > sector size is > harder I think. But you're making assumptions about needing larger block sizes. I'm asking what can we do with what we currently have? Increasing the transfer size is a way of mitigating the problem with no FS support whatever. Adding alignment to the FS layout algorithm is another. When you've done both of those, I think you're already at the 99% aligned case, which is "do we need to bother any more" territory for me. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
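To make the arithmetic above easy to poke at, here is a toy userspace helper (not kernel code, and not a claim about James's exact percentages) that counts how many physical blocks a transfer covers only partially, i.e. the ones a drive would have to read-modify-write:

#include <stdio.h>
#include <stdint.h>

/* Count the physical blocks that a transfer, expressed in logical
 * sectors, covers only partially: those are the RMW candidates. */
static uint64_t partial_phys_blocks(uint64_t start, uint64_t len,
				    uint64_t sect_per_phys)
{
	uint64_t first, last, touched, full_start, full_end, full;

	if (len == 0)
		return 0;
	first = start / sect_per_phys;
	last = (start + len - 1) / sect_per_phys;
	touched = last - first + 1;
	/* physical blocks lying entirely inside [start, start + len) */
	full_start = (start + sect_per_phys - 1) / sect_per_phys;
	full_end = (start + len) / sect_per_phys;
	full = full_end > full_start ? full_end - full_start : 0;
	return touched - full;
}

int main(void)
{
	/* 128k transfer (32 x 4k sectors) over 16k physical blocks
	 * (4 sectors each): an aligned start needs no RMW, a start one
	 * sector off needs RMW on the head and tail blocks. */
	printf("%llu\n", (unsigned long long)partial_phys_blocks(0, 32, 4));  /* 0 */
	printf("%llu\n", (unsigned long long)partial_phys_blocks(1, 32, 4));  /* 2 */
	return 0;
}

Averaging that over the possible starting offsets of a transfer mix is how figures like the 19% above get estimated.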
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote: > On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote: > > > We're likely to have people mixing 4K drives and > size here> on the same box. We could just go with the biggest size and > > use the existing bh code for the sub-pagesized blocks, but I really > > hesitate to change VM fundamentals for this. > > If the page cache had a variable granularity per device, that would cope > with this. It's the variable granularity that's the VM problem. Agreed. But once we go variable granularity we're basically talking the large order allocation problem. > > > From a pure code point of view, it may be less work to change it once in > > the VM. But from an overall system impact point of view, it's a big > > change in how the system behaves just for filesystem metadata. > > Agreed, but only if we don't do RMW in the buffer cache ... which may be > a good reason to keep it. > > > > The other question is if the drive does RMW between 4k and whatever its > > > physical sector size, do we need to do anything to take advantage of > > > it ... as in what would altering the granularity of the page cache buy > > > us? > > > > The real benefit is when and how the reads get scheduled. We're able to > > do a much better job pipelining the reads, controlling our caches and > > reducing write latency by having the reads done up in the OS instead of > > the drive. > > I agree with all of that, but my question is still can we do this by > propagating alignment and chunk size information (i.e. the physical > sector size) like we do today. If the FS knows the optimal I/O patterns > and tries to follow them, the odd cockup won't impact performance > dramatically. The real question is can the FS make use of this layout > information *without* changing the page cache granularity? Only if you > answer me "no" to this do I think we need to worry about changing page > cache granularity. Can it mostly work? I think the answer is yes. If not we'd have a lot of miserable people on top of raid5/6 right now. We can always make a generic r/m/w engine in DM that supports larger sectors transparently. > > Realistically, if you look at what the I/O schedulers output on a > standard (spinning rust) workload, it's mostly large transfers. > Obviously these are misalgned at the ends, but we can fix some of that > in the scheduler. Particularly if the FS helps us with layout. My > instinct tells me that we can fix 99% of this with layout on the FS + io > schedulers ... the remaining 1% goes to the drive as needing to do RMW > in the device, but the net impact to our throughput shouldn't be that > great. There are a few workloads where the VM and the FS would team up to make this fairly miserable Small files. Delayed allocation fixes a lot of this, but the VM doesn't realize that fileA, fileB, fileC, and fileD all need to be written at the same time to avoid RMW. Btrfs and MD have setup plugging callbacks to accumulate full stripes as much as possible, but it still hurts. Metadata. These writes are very latency sensitive and we'll gain a lot if the FS is explicitly trying to build full sector IOs. I do agree that its very likely these drives are going to silently rmw in the background for us. Circling back to what we might talk about at the conference, Ric do you have any ideas on when these drives might hit the wild? 
-chris -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 01:35 PM, James Bottomley wrote: On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote: On 01/22/2014 01:13 PM, James Bottomley wrote: On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote: On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote: On Wed, 2014-01-22 at 17:02 +, Chris Mason wrote: [ I like big sectors and I cannot lie ] I think I might be sceptical, but I don't think that's showing in my concerns ... I really think that if we want to make progress on this one, we need code and someone that owns it. Nick's work was impressive, but it was mostly there for getting rid of buffer heads. If we have a device that needs it and someone working to enable that device, we'll go forward much faster. Do we even need to do that (eliminate buffer heads)? We cope with 4k sector only devices just fine today because the bh mechanisms now operate on top of the page cache and can do the RMW necessary to update a bh in the page cache itself which allows us to do only 4k chunked writes, so we could keep the bh system and just alter the granularity of the page cache. We're likely to have people mixing 4K drives and on the same box. We could just go with the biggest size and use the existing bh code for the sub-pagesized blocks, but I really hesitate to change VM fundamentals for this. If the page cache had a variable granularity per device, that would cope with this. It's the variable granularity that's the VM problem. From a pure code point of view, it may be less work to change it once in the VM. But from an overall system impact point of view, it's a big change in how the system behaves just for filesystem metadata. Agreed, but only if we don't do RMW in the buffer cache ... which may be a good reason to keep it. The other question is if the drive does RMW between 4k and whatever its physical sector size, do we need to do anything to take advantage of it ... as in what would altering the granularity of the page cache buy us? The real benefit is when and how the reads get scheduled. We're able to do a much better job pipelining the reads, controlling our caches and reducing write latency by having the reads done up in the OS instead of the drive. I agree with all of that, but my question is still can we do this by propagating alignment and chunk size information (i.e. the physical sector size) like we do today. If the FS knows the optimal I/O patterns and tries to follow them, the odd cockup won't impact performance dramatically. The real question is can the FS make use of this layout information *without* changing the page cache granularity? Only if you answer me "no" to this do I think we need to worry about changing page cache granularity. Realistically, if you look at what the I/O schedulers output on a standard (spinning rust) workload, it's mostly large transfers. Obviously these are misalgned at the ends, but we can fix some of that in the scheduler. Particularly if the FS helps us with layout. My instinct tells me that we can fix 99% of this with layout on the FS + io schedulers ... the remaining 1% goes to the drive as needing to do RMW in the device, but the net impact to our throughput shouldn't be that great. James I think that the key to having the file system work with larger sectors is to create them properly aligned and use the actual, native sector size as their FS block size. Which is pretty much back the original challenge. Only if you think laying out stuff requires block size changes. 
If a 4k block filesystem's allocation algorithm tried to allocate on a 16k boundary for instance, that gets us a lot of the performance without needing a lot of alteration. The key here is that we cannot assume that writes happen only during allocation/append mode. Unless the block size enforces it, we will have non-aligned, small block IO done to allocated regions that won't get coalesced. It's not even obvious that an ignorant 4k layout is going to be so bad ... the RMW occurs only at the ends of the transfers, not in the middle. If we say 16k physical block and average 128k transfers, probabalistically we misalign on 6 out of 31 sectors (or 19% of the time). We can make that better by increasing the transfer size (it comes down to 10% for 256k transfers. This really depends on the nature of the device. Some devices could produce very erratic performance or even (not today, but some day) reject the IO. Teaching each and every file system to be aligned at the storage granularity/minimum IO size when that is larger than the physical sector size is harder I think. But you're making assumptions about needing larger block sizes. I'm asking what can we do with what we currently have? Increasing the transfer size is a way of mitigating the problem with no FS support whatever. Adding alignment to the FS layout algorithm is another. When you've done both of those, I think you're already at the 99% aligned case, which is "do we need to bot
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 18:37 +, Chris Mason wrote: > On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote: > > On Wed, 2014-01-22 at 18:02 +, Chris Mason wrote: [agreement cut because it's boring for the reader] > > Realistically, if you look at what the I/O schedulers output on a > > standard (spinning rust) workload, it's mostly large transfers. > > Obviously these are misalgned at the ends, but we can fix some of that > > in the scheduler. Particularly if the FS helps us with layout. My > > instinct tells me that we can fix 99% of this with layout on the FS + io > > schedulers ... the remaining 1% goes to the drive as needing to do RMW > > in the device, but the net impact to our throughput shouldn't be that > > great. > > There are a few workloads where the VM and the FS would team up to make > this fairly miserable > > Small files. Delayed allocation fixes a lot of this, but the VM doesn't > realize that fileA, fileB, fileC, and fileD all need to be written at > the same time to avoid RMW. Btrfs and MD have setup plugging callbacks > to accumulate full stripes as much as possible, but it still hurts. > > Metadata. These writes are very latency sensitive and we'll gain a lot > if the FS is explicitly trying to build full sector IOs. OK, so these two cases I buy ... the question is can we do something about them today without increasing the block size? The metadata problem, in particular, might be block independent: we still have a lot of small chunks to write out at fractured locations. With a large block size, the FS knows it's been bad and can expect the rolled up newspaper, but it's not clear what it could do about it. The small files issue looks like something we should be tackling today since writing out adjacent files would actually help us get bigger transfers. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [LSF/MM TOPIC] [ATTEND] scsi-mq
James, I'd like to attend to participate in the EH, MQ, and T10 PI RDMA discussions. -- james s On 1/16/2014 11:29 AM, Sagi Grimberg wrote: On 1/16/2014 1:05 AM, Nicholas A. Bellinger wrote: Hi all, I'd like to discuss the current state of scsi-mq prototype code. And now that blk-mq is upstream for v3.13, exploring the remaining TODO items towards an initial scsi-mq merge sometime before 2015 is upon us. The benefits of scsi-mq remain unchanged: - Utilizes blk-mq's native per-cpu primitive + NUMA local friendly queuing of pre-allocated struct request descriptor memory - Eliminates all fast-path memory allocations in SCSI-core + optionally the underlying SCSI LLDs - Avoids fast-path Scsi_Host->host_lock + request_queue->queue_lock accesses in submission + completion paths These benefits have been discussed in greater detail in [1], and the latest alpha quality code is available at [2] below. The current TODO items include: - A plan for per device SCSI error handling - Proper scsi_device->sdev_gendev reference counting - Queuing fairness across multiple scsi-mq devices per host - Support for > 1 nr_hw_queues + conversion of qla2xxx + lpfc LLDs that support native hardware multiqueue Thank you, --nab References: [1]: [ATTEND] scsi-mq prototype discussion http://marc.info/?l=linux-scsi&m=137358831329753&w=2 [2]: scsi-mq WIP updated to v3.13-rc3 http://marc.info/?l=linux-scsi&m=138782535731722&w=2 +1 I would be happy to join this discussion, I think it is also important to think about the interaction with iSCSI and LLDs. Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[LSF/MM ATTEND] EH, MQ, RDMA T10-PI
Hello, I would like to attend LSF/MM 2014. I would like to continue discussions held on the eh enhancements, multi-queue, and addition of T10-PI to RDMA/iSER. I'm currently the maintainer of the Emulex lpfc driver and bring years of scsi, driver, firmware, and asic experience. thank you. -- James Smart -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [usb-storage] Re: usb disk recognized but fails
Here is the usbmon log. The disk works with TV and Windows. No error messages. Could it be under-powered? Thanks, Milan 88020e734600 2269960175 S Ci:3:001:0 s a3 00 0001 0004 4 < 88020e734600 2269960196 C Ci:3:001:0 0 4 = 0001 88020e734600 2269960200 S Ci:3:001:0 s a3 00 0002 0004 4 < 88020e734600 2269960207 C Ci:3:001:0 0 4 = 01010100 88020e734600 2269960209 S Co:3:001:0 s 23 01 0010 0002 0 88020e734600 2269960213 C Co:3:001:0 0 0 88020e734600 2269960215 S Ci:3:001:0 s a3 00 0003 0004 4 < 88020e734600 2269960218 C Ci:3:001:0 0 4 = 0001 88020e734600 2269960219 S Ci:3:001:0 s a3 00 0004 0004 4 < 88020e734600 2269960222 C Ci:3:001:0 0 4 = 0001 88021041eb40 2270063011 S Ii:3:001:1 -115:2048 4 < 8801fdbb2e40 2270063029 S Ci:3:001:0 s a3 00 0002 0004 4 < 8801fdbb2e40 2270063049 C Ci:3:001:0 0 4 = 0101 88020e734600 2270063117 S Co:3:001:0 s 23 03 0004 0002 0 88020e734600 2270063129 C Co:3:001:0 0 0 8801fdbb2300 2270116350 S Ci:3:001:0 s a3 00 0002 0004 4 < 8801fdbb2300 2270116368 C Ci:3:001:0 0 4 = 1105 88020e734900 2270169657 S Ci:3:001:0 s a3 00 0002 0004 4 < 88020e734900 2270169685 C Ci:3:001:0 0 4 = 03051000 8801fdbb2900 2270223011 S Co:3:001:0 s 23 01 0014 0002 0 8801fdbb2900 2270223028 C Co:3:001:0 0 0 88020e734b40 2270236320 S Ci:3:004:0 s 80 06 0100 0008 8 < 88020e734b40 2270236397 C Ci:3:004:0 0 8 = 12010002 0040 8801fdbb2f00 2270236408 S Ci:3:004:0 s 80 06 0100 0012 18 < 8801fdbb2f00 2270236477 C Ci:3:004:0 0 18 = 12010002 0040 cd141661 50010103 0201 8801fdbb2f00 2270236489 S Ci:3:004:0 s 80 06 0200 0009 9 < 8801fdbb2f00 2270236561 C Ci:3:004:0 0 9 = 09022000 010100c0 01 8801fdbb2f00 2270236570 S Ci:3:004:0 s 80 06 0200 0020 32 < 8801fdbb2f00 2270236697 C Ci:3:004:0 0 32 = 09022000 010100c0 01090400 00020806 5705 81020002 00070502 02000200 8801fdbb2f00 2270236710 S Ci:3:004:0 s 80 06 0300 00ff 255 < 8801fdbb2f00 2270236772 C Ci:3:004:0 0 4 = 04030904 8801fdbb2f00 2270236777 S Ci:3:004:0 s 80 06 0303 0409 00ff 255 < 8801fdbb2f00 2270236901 C Ci:3:004:0 0 48 = 30035500 53004200 20003200 2e003000 20002000 53004100 54004100 20004200 8801fdbb2f00 2270236911 S Ci:3:004:0 s 80 06 0301 0409 00ff 255 < 8801fdbb2f00 2270236999 C Ci:3:004:0 0 26 = 1a035300 75007000 65007200 20005400 6f007000 20002000 2000 8801fdbb2f00 2270237009 S Ci:3:004:0 s 80 06 0302 0409 00ff 255 < 8801fdbb2f00 2270237099 C Ci:3:004:0 0 26 = 1a034d00 36003100 31003600 30003100 38005600 45003100 3500 8801fdbb2a80 2270237390 S Co:3:004:0 s 00 09 0001 0 8801fdbb2a80 2270237436 C Co:3:004:0 0 0 8801fdbb2c00 2271239716 S Ci:3:004:0 s a1 fe 0001 1 < 8801fdbb2c00 2271239790 C Ci:3:004:0 0 1 = 00 8801fdbb2c00 2271239949 S Bo:3:004:2 -115 31 = 55534243 0100 2400 8612 0024 00 8801fdbb2c00 227124 C Bo:3:004:2 0 31 > 880202622540 2271240013 S Bi:3:004:1 -115 36 < 880202622540 2271240187 C Bi:3:004:1 0 36 = 1f4d4149 57444320 57443634 30304250 56542d30 3048585a 54302020 8801fdbb2c00 2271240213 S Bi:3:004:1 -115 13 < 8801fdbb2c00 2271240225 C Bi:3:004:1 0 13 = 55534253 0100 00 8801fdbb2c00 2271240591 S Bo:3:004:2 -115 31 = 55534243 0200 0600 00 8801fdbb2c00 2271240683 C Bo:3:004:2 0 31 > 8801fdbb2c00 2271240730 S Bi:3:004:1 -115 13 < 8801fdbb2c00 2271240791 C Bi:3:004:1 0 13 = 55534253 0200 00 8801fdbb2c00 2271240860 S Bo:3:004:2 -115 31 = 55534243 0300 0800 8a25 00 8801fdbb2c00 2271240879 C Bo:3:004:2 0 31 > 8801ee74d180 2271240888 S Bi:3:004:1 -115 8 < 8801ee74d180 2271240955 C Bi:3:004:1 0 8 = 4a8582af 0200 8801fdbb2c00 2271240967 S Bi:3:004:1 -115 13 < 8801fdbb2c00 2271240992 C Bi:3:004:1 0 13 = 55534253 0300 00 8801fdbb2c00 2271241041 
S Bo:3:004:2 -115 31 = 55534243 0400 c000 861a 003f00c0 00 8801fdbb2c00 2271241055 C Bo:3:004:2 0 31 > 8801ee74d180 2271241062 S Bi:3:004:1 -115 192 < 8801ee74d180 2271241097 C Bi:3:004:1 -121 4 = 0300 8801fdbb2c00 2271241107 S Bi:3:004:1 -115 13 < 8801fdbb2c00 2271241146 C Bi:3:004:1 0 13 = 55534253 0400 00 8801fdbb2c00 2271241199 S Bo:3:004:2 -115 31 = 55534243 0500 c000 861a 003f00c0 00 8801fdbb2c00 2271241216 C Bo:3:004:2 0 31 > 8801ee74d180 2271241224 S Bi:3:004:1 -115 192 < 8801ee74d180 2271241270 C Bi:3:004:1 -121 4 = 0300 8801fdbb2c00 2271241281 S Bi:3:004:1 -115 13 < 8801fdbb2c00 2271241309 C Bi:3:004:1 0 13 = 55534253 0500 00 8801fdbb2c00 2271242377 S Bo:3:004:2 -115 31 = 55534243 0600 0600
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 13:39 -0500, Ric Wheeler wrote: > On 01/22/2014 01:35 PM, James Bottomley wrote: > > On Wed, 2014-01-22 at 13:17 -0500, Ric Wheeler wrote: [...] > >> I think that the key to having the file system work with larger > >> sectors is to > >> create them properly aligned and use the actual, native sector size as > >> their FS > >> block size. Which is pretty much back the original challenge. > > Only if you think laying out stuff requires block size changes. If a 4k > > block filesystem's allocation algorithm tried to allocate on a 16k > > boundary for instance, that gets us a lot of the performance without > > needing a lot of alteration. > > The key here is that we cannot assume that writes happen only during > allocation/append mode. But that doesn't matter at all, does it? If the file is sector aligned, then the write is aligned. If the write is short on a large block fs, well we'd just have to do the RMW in the OS anyway ... is that any better than doing it in the device? > Unless the block size enforces it, we will have non-aligned, small > block IO done > to allocated regions that won't get coalesced. We always get that if it's the use pattern ... the question merely becomes who bears the burden of RMW. > > It's not even obvious that an ignorant 4k layout is going to be so > > bad ... the RMW occurs only at the ends of the transfers, not in the > > middle. If we say 16k physical block and average 128k transfers, > > probabalistically we misalign on 6 out of 31 sectors (or 19% of the > > time). We can make that better by increasing the transfer size (it > > comes down to 10% for 256k transfers. > > This really depends on the nature of the device. Some devices could > produce very > erratic performance Yes, we get that today with misaligned writes to the 4k devices. > or even (not today, but some day) reject the IO. I really doubt this. All 4k drives today do RMW ... I don't see that changing any time soon. > >> Teaching each and every file system to be aligned at the storage > >> granularity/minimum IO size when that is larger than the physical > >> sector size is > >> harder I think. > > But you're making assumptions about needing larger block sizes. I'm > > asking what can we do with what we currently have? Increasing the > > transfer size is a way of mitigating the problem with no FS support > > whatever. Adding alignment to the FS layout algorithm is another. When > > you've done both of those, I think you're already at the 99% aligned > > case, which is "do we need to bother any more" territory for me. > > > > I would say no, we will eventually need larger file system block sizes. > > Tuning and getting 95% (98%?) of the way there with alignment and IO > scheduler > does help a lot. That is what we do today and it is important when > looking for > high performance. > > However, this is more of a short term work around for a lack of a > fundamental > ability to do the right sized file system block for a specific class > of device. > As such, not a crisis that must be solved today, but rather something > that I > think is definitely worth looking at so we can figure this out over > the next > year or so. But this, I think, is the fundamental point for debate. If we can pull alignment and other tricks to solve 99% of the problem is there a need for radical VM surgery? Is there anything coming down the pipe in the future that may move the devices ahead of the tricks? 
James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
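For what "adding alignment to the FS layout algorithm" can mean with today's tooling: the existing RAID stripe knobs already bias allocation toward a larger boundary, which is the same favour a 16k-physical-block drive would want. A sketch, with purely illustrative sizes and device name:

# ext4: 4k blocks, but hint the allocator that the device likes 16k chunks
mkfs.ext4 -b 4096 -E stride=4,stripe-width=4 /dev/sdb1

# XFS: the same idea via stripe unit / stripe width
mkfs.xfs -b size=4096 -d su=16k,sw=1 /dev/sdb1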
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On 01/22/2014 01:37 PM, Chris Mason wrote: > Circling back to what we might talk about at the conference, Ric do you > have any ideas on when these drives might hit the wild? > > -chris I will poke at vendors to see if we can get someone to make a public statement, but I cannot do that for them. Ric -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley wrote: > But this, I think, is the fundamental point for debate. If we can pull > alignment and other tricks to solve 99% of the problem is there a need > for radical VM surgery? Is there anything coming down the pipe in the > future that may move the devices ahead of the tricks? I expect it would be relatively simple to get large blocksizes working on powerpc with 64k PAGE_SIZE. So before diving in and doing huge amounts of work, perhaps someone can do a proof-of-concept on powerpc (or ia64) with 64k blocksize. That way we'll at least have an understanding of what the potential gains will be. If the answer is "1.5%" then poof - go off and do something else. (And the gains on powerpc would be an upper bound - unlike powerpc, x86 still has to fiddle around with 16x as many pages and perhaps order-4 allocations(?)) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
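If anyone wants to try the proof-of-concept Andrew describes, the setup on a 64k PAGE_SIZE kernel is small, since ext4 and XFS both accept block sizes up to the page size. A sketch (device name illustrative; this only mounts where PAGE_SIZE >= blocksize, e.g. ppc64 with 64k pages):

mkfs.xfs -f -b size=65536 /dev/sdb1
# or
mkfs.ext4 -b 65536 /dev/sdb1

mount /dev/sdb1 /mnt
stat -f -c 'block size: %s' /mnt     # should report 65536

Benchmarking the same workload against a 4k-block filesystem on the same device would then put a number on the potential gain.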
Re: [usb-storage] Re: usb disk recognized but fails
On Wed, 22 Jan 2014, Milan Svoboda wrote: > Here is the usbmon log. The disk works with TV and Windows. No error messages. > Could it be under-powered? That's possible, but not if it works under Windows on the same computer. The first reset occurred when the disk failed to respond correctly to a GET WINDOW command: > 8801fdbb2c00 2271440149 S Bo:3:004:2 -115 31 = 55534243 5900 0002 > 80001024 2480fe01 00ec 00 > 8801fdbb2c00 2271440171 C Bo:3:004:2 0 31 > > 8800d19fd240 2271440179 S Bi:3:004:1 -115 512 < > 8800d19fd240 2271444020 C Bi:3:004:1 -121 13 = 55534253 5900 > 00 > 8801fdbb2c00 2271444044 S Bi:3:004:1 -115 13 < > 8801fdbb2c00 2278599765 C Bi:3:004:1 -104 0 It sent the status when it should have sent the data. That's definitely a bug in the disk or its USB interface. A bit later we see repeated attempts to read 8 blocks starting at block 1250258607. In the first attempt, the disk returned 8 blocks of data but sent an error status. When asked for the error information, the disk sent "No Information": > 8801fdbb2c00 2278816851 S Bo:3:004:2 -115 31 = 55534243 5d00 0010 > 8a28 004a856e af08 00 > 8801fdbb2c00 2278816886 C Bo:3:004:2 0 31 > > 8800c22c69c0 2278816911 S Bi:3:004:1 -115 4096 < > 8800c22c69c0 2278817038 C Bi:3:004:1 0 4096 = > > 8801fdbb2c00 2278817085 S Bi:3:004:1 -115 13 < > 8801fdbb2c00 2289976173 C Bi:3:004:1 -32 0 > 8801fdbb2c00 2289976229 S Co:3:004:0 s 02 01 0081 0 > 8801fdbb2c00 2289976257 C Co:3:004:0 0 0 > 8801fdbb2c00 2289976273 S Bi:3:004:1 -115 13 < > 8801fdbb2c00 2289976335 C Bi:3:004:1 0 13 = 55534253 5d00 01 > 8801fdbb2c00 2289976378 S Bo:3:004:2 -115 31 = 55534243 5e00 1200 > 8603 0012 00 > 8801fdbb2c00 2289976419 C Bo:3:004:2 0 31 > > 8800c0afccc0 2289976428 S Bi:3:004:1 -115 18 < > 8800c0afccc0 2290123415 C Bi:3:004:1 0 18 = 7000 000a > > 8801fdbb2c00 2290123463 S Bi:3:004:1 -115 13 < > 8801fdbb2c00 2290123486 C Bi:3:004:1 0 13 = 55534253 5e00 00 In the second attempt, the disk returned only 1 block of data, and again, no error information: > 8801fdbb2c00 2290129737 S Bo:3:004:2 -115 31 = 55534243 5f00 0010 > 8a28 004a856e af08 00 > 8801fdbb2c00 2290129782 C Bo:3:004:2 0 31 > > 8800d4a51b40 2290129803 S Bi:3:004:1 -115 4096 < > 8800d4a51b40 2301288990 C Bi:3:004:1 -32 512 = > > 8801fdbb2c00 2301289039 S Co:3:004:0 s 02 01 0081 0 > 8801fdbb2c00 2301289068 C Co:3:004:0 0 0 > 8801fdbb2c00 2301289086 S Bi:3:004:1 -115 13 < > 8801fdbb2c00 2301289153 C Bi:3:004:1 0 13 = 55534253 5f00 01 > 8801fdbb2c00 2301289189 S Bo:3:004:2 -115 31 = 55534243 6000 1200 > 8603 0012 00 > 8801fdbb2c00 2301289223 C Bo:3:004:2 0 31 > > 8800d188c780 2301289263 S Bi:3:004:1 -115 18 < > 8800d188c780 2301299693 C Bi:3:004:1 0 18 = 7000 000a > > 8801fdbb2c00 2301299741 S Bi:3:004:1 -115 13 < > 8801fdbb2c00 2301299765 C Bi:3:004:1 0 13 = 55534253 6000 00 This continued a few more times until the computer gave up. Maybe there is something wrong with one particular block at that address on the disk. Do you have a USB-2 port on the computer you can plug the disk into, instead of USB-3? Alan Stern -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
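For readers not fluent in usbmon output: each of the 31-byte bulk-out transfers quoted above is a Bulk-Only Transport Command Block Wrapper (the kernel's own definition is struct bulk_cb_wrap in include/linux/usb/storage.h; the sketch below uses illustrative names, and note the usbmon dump collapses runs of zero bytes, so offsets come from the spec rather than a byte-for-byte read of the log):

#include <stdint.h>

struct cbw {                              /* 31 bytes on the wire */
	uint32_t signature;               /* 0x43425355, i.e. "USBC" little-endian */
	uint32_t tag;                     /* echoed back in the status wrapper */
	uint32_t data_transfer_length;    /* bytes expected in the data stage */
	uint8_t  flags;                   /* 0x80 = device-to-host */
	uint8_t  lun;
	uint8_t  length;                  /* number of valid CDB bytes below */
	uint8_t  cdb[16];
};

/* The repeatedly failing command decodes as a READ(10):
 *   cdb[0]    = 0x28                       READ(10)
 *   cdb[2..5] = 0x4a 0x85 0x6e 0xaf        LBA 1250258607
 *   cdb[7..8] = 0x00 0x08                  8 blocks = 4096 bytes
 * which matches the 4096-byte bulk-in the host queues right after it. */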
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 2014-01-22 at 11:50 -0800, Andrew Morton wrote: > On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley > wrote: > > > But this, I think, is the fundamental point for debate. If we can pull > > alignment and other tricks to solve 99% of the problem is there a need > > for radical VM surgery? Is there anything coming down the pipe in the > > future that may move the devices ahead of the tricks? > > I expect it would be relatively simple to get large blocksizes working > on powerpc with 64k PAGE_SIZE. So before diving in and doing huge > amounts of work, perhaps someone can do a proof-of-concept on powerpc > (or ia64) with 64k blocksize. Maybe 5 drives in raid5 on MD, with 4K coming from each drive. Well-aligned 16K IO will work; everything else will be about the same as an rmw from a single drive. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
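Chris's mock-up is cheap to build with md; something along these lines should behave like the device being modelled (device names are illustrative and --chunk is in KiB):

# 5-way RAID5 with a 4k chunk: a full stripe is 4 data chunks = 16k,
# so well-aligned 16k writes avoid the parity read-modify-write and
# anything smaller pays the RMW cost, much like a 16k-sector drive.
mdadm --create /dev/md0 --level=5 --raid-devices=5 --chunk=4 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1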
Re: Persistent reservation behaviour/compliance with redundant controllers
2014/1/7 James Bottomley > > On Mon, 2014-01-06 at 23:53 +0100, Matthias Eble wrote: > > 2014/1/6 Lee Duncan : > > > On 12/25/2013 03:00 PM, Matthias Eble wrote: > > >> Here's the dmmp map > > >> 360002aca6e6b dm-6 3PARdata,VV > > >> size=2.0T features='0' hwhandler='0' wp=rw > > >> `-+- policy='round-robin 0' prio=1 status=active > > >> |- 3:0:1:4 sdg 8:96active ready running > > >> |- 3:0:3:4 sdl 8:176 active ready running > > >> |- 5:0:3:4 sdbg 67:160 active ready running > > >> `- 5:0:1:4 sdce 69:32 active ready running > > >> > > >> There can only be two registrations at a time: (sdg XOR sdl) and (sdbg > > >> XOR sdce) > > >> Now my question is: Does this comply to the standard? > > >> > > > > > > I _believe_ the problem is that you are re-registering the same > > > I_T_Nexus through /dev/sdl, your second attempt at registration, as you > > > did when you used /dev/sdg, your original registration. > > > > > > Can sdg and sdl be the same I_T_Nexus at a time? > > Right now, they are handled like that. > > In my understanding, every scsi disk device represents an I_T_Nexus. > > No, every SCSI disk is an I_T_L nexus. There's no actual device object > in Linux for an I_T nexus. Hi All, I'd like to document the progress and findings in lots of off-list emails with HP's t10 members. Maybe someone on the net will face the same problem. First of all, the SPC wording isn't 100% precise. For most commands, the Lun context is implicit. So if the standards state "I_T Nexus", I_T_L Nexuses are meant, as the reservation commands are always lun specific. That said, PR-registrations need to be done for every I_T_L Nexus -> every single dmmp path (/dev/sdX) So we started to test the behaviour of the 3Par system. It seems that there are some quirks in the 3Par implementation. The error that led to my initial question is that the target port identifier isn't included in the target's reservation handling. Thus all PR commands from one host port are considered the same. Regardless of the target port over which they were received. (As seen in attached commands #5 or #6 after issuing #2 ) Note that the investigations haven't been finished. 
For those who are interested, here are the findings (verbose output stripped): 1.# sg_persist --in --read-keys /dev/sdl 3PARdata VV3122 Peripheral device type: disk PR generation=0x44, there are NO registered reservation keys register via sdl: 2.# sg_persist -vvv -d /dev/sdl --no-inquiry --out --register --param-sark=0x420480a0296c PR out: command (Register) successful test for scp3r23 table 33 compliance (same key on registered I_T Nexus should succeed): False 3.# sg_persist -vvv -d /dev/sdl --no-inquiry --out --register --param-sark=0x420480a0296c persistent reserve out: scsi status: Reservation Conflict PR out: command failed now with a *different key* (should conflict): True 4.# sg_persist -vvv -d /dev/sdl --no-inquiry --out --register --param-sark=0x420480a0296d persistent reserve out: scsi status: Reservation Conflict PR out: command failed Same behaviour using another path/I_T_L Nexus (should succeed in both cases): 5.# sg_persist -vvv -d /dev/sdg --no-inquiry --out --register --param-sark=0x420480a0296c persistent reserve out: scsi status: Reservation Conflict PR out: command failed 6.# sg_persist -vvv -d /dev/sdg --no-inquiry --out --register --param-sark=0x420480a0296d persistent reserve out: scsi status: Reservation Conflict PR out: command failed Unregister via sdg :-/ 7.# sg_persist -vvv -d /dev/sdg --no-inquiry --out --register --param-rk=0x420480a0296c PR out: command (Register) successful Additionally, read-full-status service action and ALL_TG_PT are not supported, right now. That's it for now. Thanks for your replies, Matthias -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
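For completeness, on an array that keys reservations on the full I_T_L nexus the registration step simply repeats once per path of the multipath map; the key and path names below are the ones from this thread, used purely as an example:

KEY=0x420480a0296c
for path in sdg sdl sdbg sdce; do
        sg_persist --out --register --param-sark=$KEY /dev/$path
done

# every registered I_T_L nexus should now appear in READ KEYS
sg_persist --in --read-keys /dev/sdg

multipath-tools' mpathpersist can drive the same sequence against the dm device itself, which saves enumerating the paths by hand.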
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
> "Ric" == Ric Wheeler writes: Ric> I will have to see if I can get a storage vendor to make a public Ric> statement, but there are vendors hoping to see this land in Linux Ric> in the next few years. I assume that anyone with a shipping device Ric> will have to at least emulate the 4KB sector size for years to Ric> come, but that there might be a significant performance win for Ric> platforms that can do a larger block. I am aware of two companies that already created devices with 8KB logical blocks and expected Linux to work. I had to do some explaining. I agree with Ric that this is something we'll need to address sooner rather than later. -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
> "James" == James Bottomley writes: >> or even (not today, but some day) reject the IO. James> I really doubt this. All 4k drives today do RMW ... I don't see James> that changing any time soon. All consumer grade 4K phys drives do RMW. It's a different story for enterprise drives. The vendors appear to be divided between 4Kn and 512e with RMW mitigation. -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed 22-01-14 09:00:33, James Bottomley wrote: > On Wed, 2014-01-22 at 11:45 -0500, Ric Wheeler wrote: > > On 01/22/2014 11:03 AM, James Bottomley wrote: > > > On Wed, 2014-01-22 at 15:14 +, Chris Mason wrote: > > >> On Wed, 2014-01-22 at 09:34 +, Mel Gorman wrote: > > >>> On Tue, Jan 21, 2014 at 10:04:29PM -0500, Ric Wheeler wrote: > > One topic that has been lurking forever at the edges is the current > > 4k limitation for file system block sizes. Some devices in > > production today and others coming soon have larger sectors and it > > would be interesting to see if it is time to poke at this topic > > again. > > > > >>> Large block support was proposed years ago by Christoph Lameter > > >>> (http://lwn.net/Articles/232757/). I think I was just getting started > > >>> in the community at the time so I do not recall any of the details. I do > > >>> believe it motivated an alternative by Nick Piggin called fsblock though > > >>> (http://lwn.net/Articles/321390/). At the very least it would be nice to > > >>> know why neither were never merged for those of us that were not around > > >>> at the time and who may not have the chance to dive through mailing list > > >>> archives between now and March. > > >>> > > >>> FWIW, I would expect that a show-stopper for any proposal is requiring > > >>> high-order allocations to succeed for the system to behave correctly. > > >>> > > >> My memory is that Nick's work just didn't have the momentum to get > > >> pushed in. It all seemed very reasonable though, I think our hatred of > > >> buffered heads just wasn't yet bigger than the fear of moving away. > > >> > > >> But, the bigger question is how big are the blocks going to be? At some > > >> point (64K?) we might as well just make a log structured dm target and > > >> have a single setup for both shingled and large sector drives. > > > There is no real point. Even with 4k drives today using 4k sectors in > > > the filesystem, we still get 512 byte writes because of journalling and > > > the buffer cache. > > > > I think that you are wrong here James. Even with 512 byte drives, the IO's > > we > > send down tend to be 4k or larger. Do you have traces that show this and > > details? > > It's mostly an ext3 journalling issue ... and it's only metadata and > mostly the ioschedulers can elevate it into 4k chunks, so yes, most of > our writes are 4k+, so this is a red herring, yes. ext3 (similarly as ext4) does block level journalling meaning that it journals *only* full blocks. So an ext3/4 filesystem with 4 KB blocksize will never journal anything else than full 4 KB blocks. So I'm not sure where this 512-byte writes idea came from.. > > Also keep in mind that larger block sizes allow us to track larger > > files with > > smaller amounts of metadata which is a second win. > > Larger file block sizes are completely independent from larger device > block sizes (we can have 16k file block sizes on 4k or even 512b > devices). The questions on larger block size devices are twofold: > > 1. If manufacturers tell us that they'll only support I/O on the > physical sector size, do we believe them, given that they said > this before on 4k and then backed down. All the logical vs > physical sector stuff is now in T10 standards, why would they > try to go all physical again, especially as they've now all > written firmware that does the necessary RMW? > 2. 
If we agree they'll do RMW in Firmware again, what do we have to > do to take advantage of larger sector sizes beyond what we > currently do in alignment and chunking? There may still be > issues in FS journal and data layouts. I also believe drives will support smaller-than-blocksize writes. But supporting larger fs blocksize can sometimes be beneficial for other reasons (think performance with specialized workloads because amount of metadata is smaller, fragmentation is smaller, ...). Currently ocfs2, ext4, and possibly others go through the hoops to support allocating file data in chunks larger than fs blocksize - at the first sight that should be straightforward but if you look at the code you find out there are nasty corner cases which make it pretty ugly. And each fs doing these large data allocations currently invents its own way to deal with the problems. So providing some common infrastructure for dealing with blocks larger than page size would definitely relieve some pain. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
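One concrete instance of the "allocate file data in chunks larger than fs blocksize" machinery Jan mentions is ext4's bigalloc clusters, e.g. 4k blocks allocated in 64k clusters (sketch; device name illustrative):

mkfs.ext4 -b 4096 -O bigalloc -C 65536 /dev/sdc1
dumpe2fs -h /dev/sdc1 | grep -i 'cluster size'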
Re: [usb-storage] Re: usb disk recognized but fails
> >> 8801fdbb2c00 2290129737 S Bo:3:004:2 -115 31 = 55534243 5f00 >> 0010 8a28 004a856e af08 00 >> 8801fdbb2c00 2290129782 C Bo:3:004:2 0 31 > >> 8800d4a51b40 2290129803 S Bi:3:004:1 -115 4096 < >> 8800d4a51b40 2301288990 C Bi:3:004:1 -32 512 = >> >> 8801fdbb2c00 2301289039 S Co:3:004:0 s 02 01 0081 0 >> 8801fdbb2c00 2301289068 C Co:3:004:0 0 0 >> 8801fdbb2c00 2301289086 S Bi:3:004:1 -115 13 < >> 8801fdbb2c00 2301289153 C Bi:3:004:1 0 13 = 55534253 5f00 01 >> 8801fdbb2c00 2301289189 S Bo:3:004:2 -115 31 = 55534243 6000 >> 1200 8603 0012 00 >> 8801fdbb2c00 2301289223 C Bo:3:004:2 0 31 > >> 8800d188c780 2301289263 S Bi:3:004:1 -115 18 < >> 8800d188c780 2301299693 C Bi:3:004:1 0 18 = 7000 000a >> >> 8801fdbb2c00 2301299741 S Bi:3:004:1 -115 13 < >> 8801fdbb2c00 2301299765 C Bi:3:004:1 0 13 = 55534253 6000 00 > >This continued a few more times until the computer gave up. Maybe >there is something wrong with one particular block at that address on >the disk. I tried to run fdisk /dev/sdb which obivously failed but it tried to access sectors 0, 1, 2, 3 which resulted in kernel log messages reporting invalid sectors. I tried to connect the disk with Windows and it was connected immediatelly, the partition is visible, I can read files, create files... No signs of problems visible... I don't know if it tells anything, but Windows reports the disk as USB 2.0 SATA bridge even when connected to USB-3 ports. > >Do you have a USB-2 port on the computer you can plug the disk into, >instead of USB-3? > I tried it, but it is only one and separated on the other side of laptop so I could connect only one connector and the result is the same. Milan Svoboda >Alan Stern > > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 60758] module scsi_wait_scan not found kernel panic on boot
https://bugzilla.kernel.org/show_bug.cgi?id=60758 --- Comment #54 from Akemi Yagi --- (In reply to Lin Feng from comment #53) > though it's something about the virtio driver (my guest uses virtio as the storage > driver), looking into this commit it is mainly about C code changes, not > about whether modules are compiled. Also I have checked the compiled modules; in both > cases (with and without this commit) we get the virtio_blk.ko module. But the > difference is that with this commit virtio_blk.ko isn't packed into the > initramfs. > > However, between the two cases there are no environmental changes, with exactly > the same config, same dracut, same gcc, everything... So I don't know why > dracut doesn't pack virtio_blk.ko into the initramfs, and even funnier, I > find that it packed floppy.ko instead. > (In my case, as a workaround, we can compile the virtio modules into the kernel or use > another disk bus driver such as IDE or USB instead.) > > But one thing I don't understand, can someone tell me why: will dracut look > through the kernel tree (C code) to find some useful information to pack the > final initramfs? Thank you for this extensive analysis. I, too, was having the same problem on my KVM guest that uses virtio (host=RHEL 6.5 and guest=CentOS 6.5). Just to reconfirm your findings, I created the initramfs with a '--add-drivers virtio_blk' option and the kernel booted just fine. It would indeed be great if we could find out why dracut fails to pick up some particular modules.
Re: [usb-storage] Re: usb disk recognized but fails
On Wed, 22 Jan 2014, Milan Svoboda wrote: > >This continued a few more times until the computer gave up. Maybe > >there is something wrong with one particular block at that address on > >the disk. > > I tried to run fdisk /dev/sdb which obviously failed but it tried to access > sectors 0, 1, 2, 3 which resulted in kernel log > messages reporting invalid sectors. > I tried to connect the disk with Windows and it was connected immediately, > the partition is visible, I can read files, create files... > No signs of problems visible... Maybe Windows doesn't try to access that problematic disk block. > I don't know if it tells anything, but Windows reports the disk as USB 2.0 > SATA bridge even when connected to USB-3 ports. > > > > >Do you have a USB-2 port on the computer you can plug the disk into, > >instead of USB-3? > > > > I tried it, but it is only one and separated on the other side of laptop so I > could connect only one connector and the > result is the same. Which connector did you plug in: the data cable or the power cable? Alan Stern
Re: [PATCH-v2 12/17] target/file: Add DIF protection init/format support
On Wed, 2014-01-22 at 12:12 +0200, Sagi Grimberg wrote: > On 1/22/2014 12:28 AM, Nicholas A. Bellinger wrote: > > On Sun, 2014-01-19 at 14:31 +0200, Sagi Grimberg wrote: > >> On 1/19/2014 4:44 AM, Nicholas A. Bellinger wrote: > >>> From: Nicholas Bellinger > >>> > >>> This patch adds support for DIF protection init/format support into > >>> the FILEIO backend. > >>> > >>> It involves using a separate $FILE.protection for storing PI that is > >>> opened via fd_init_prot() using the common pi_prot_type attribute. > >>> The actual formatting of the protection is done via fd_format_prot() > >>> using the common pi_prot_format attribute, that will populate the > >>> initial PI data based upon the currently configured pi_prot_type. > >>> > >>> Based on original FILEIO code from Sagi. > >> Nice! see comments below... > >> > >>> +static void fd_init_format_buf(struct se_device *dev, unsigned char *buf, > >>> +u32 unit_size, u32 *ref_tag, u16 app_tag, > >>> +bool inc_reftag) > >>> +{ > >>> + unsigned char *p = buf; > >>> + int i; > >>> + > >>> + for (i = 0; i < unit_size; i += dev->prot_length) { > >>> + *((u16 *)&p[0]) = 0x; > >>> + *((__be16 *)&p[2]) = cpu_to_be16(app_tag); > >>> + *((__be32 *)&p[4]) = cpu_to_be32(*ref_tag); > >>> + > >>> + if (inc_reftag) > >>> + (*ref_tag)++; > >>> + > >>> + p += dev->prot_length; > >>> + } > >>> +} > >>> + > >>> +static int fd_format_prot(struct se_device *dev) > >>> +{ > >>> + struct fd_dev *fd_dev = FD_DEV(dev); > >>> + struct file *prot_fd = fd_dev->fd_prot_file; > >>> + sector_t prot_length, prot; > >>> + unsigned char *buf; > >>> + loff_t pos = 0; > >>> + u32 ref_tag = 0; > >>> + int unit_size = FDBD_FORMAT_UNIT_SIZE * dev->dev_attrib.block_size; > >>> + int rc, ret = 0, size, len; > >>> + bool inc_reftag = false; > >>> + > >>> + if (!dev->dev_attrib.pi_prot_type) { > >>> + pr_err("Unable to format_prot while pi_prot_type == 0\n"); > >>> + return -ENODEV; > >>> + } > >>> + if (!prot_fd) { > >>> + pr_err("Unable to locate fd_dev->fd_prot_file\n"); > >>> + return -ENODEV; > >>> + } > >>> + > >>> + switch (dev->dev_attrib.pi_prot_type) { > >> redundant - see below. > >>> + case TARGET_DIF_TYPE3_PROT: > >>> + ref_tag = 0x; > >>> + break; > >>> + case TARGET_DIF_TYPE2_PROT: > >>> + case TARGET_DIF_TYPE1_PROT: > >>> + inc_reftag = true; > >>> + break; > >>> + default: > >>> + break; > >>> + } > >>> + > >>> + buf = vzalloc(unit_size); > >>> + if (!buf) { > >>> + pr_err("Unable to allocate FILEIO prot buf\n"); > >>> + return -ENOMEM; > >>> + } > >>> + > >>> + prot_length = (dev->transport->get_blocks(dev) + 1) * dev->prot_length; > >>> + size = prot_length; > >>> + > >>> + pr_debug("Using FILEIO prot_length: %llu\n", > >>> + (unsigned long long)prot_length); > >>> + > >>> + for (prot = 0; prot < prot_length; prot += unit_size) { > >>> + > >>> + fd_init_format_buf(dev, buf, unit_size, &ref_tag, 0x, > >>> +inc_reftag); > >> I didn't send you my latest patches (my fault...). T10-PI format should > >> only place > >> escape values throughout the protection file (fill it with 0xff). So I > >> guess in this case > >> fd_init_format_buf() boils down to memset(buf, 0xff, unit_size) once > >> before the loop > >> and just loop until prot_length writing buf, no need to address > >> apptag/reftag... > > Yeah, was thinking about just formatting with escape values as mentioned > > above, but thought it might be useful to keep around for pre-populating > > apptag + reftag values for testing purposes.
> > > > --nab > > > > OK, but maybe it is better to do that under some debug configuration > rather than always do that. > With the apptag escape in place from the format the host is going to ignore the other areas, so this shouldn't really matter. If it turns out to be an issue, we can just drop this code later. --nab
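
A minimal user-space sketch of the escape-value format Sagi describes, i.e. one memset of 0xff reused for every chunk written to the separate protection file (this is not the target code; the file name, tuple size, and capacity are illustrative):

/* prot-format.c: sketch of the "escape value" format: fill the
 * protection file with 0xff so every PI tuple reads as the
 * apptag/reftag escape and is ignored until real PI is written.
 * Not the FILEIO backend code; names and sizes are illustrative.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PROT_TUPLE_SIZE   8             /* 2B guard + 2B apptag + 4B reftag */
#define FORMAT_CHUNK      (2048 * 512)  /* illustrative unit size */

int main(void)
{
	unsigned long long nr_blocks = 262144;   /* illustrative capacity */
	unsigned long long prot_len = nr_blocks * PROT_TUPLE_SIZE;
	unsigned long long done;
	unsigned char *buf;
	FILE *f;

	buf = malloc(FORMAT_CHUNK);
	if (!buf)
		return 1;
	memset(buf, 0xff, FORMAT_CHUNK);    /* one memset, reused for every chunk */

	f = fopen("file.protection", "wb"); /* illustrative $FILE.protection */
	if (!f) {
		free(buf);
		return 1;
	}
	for (done = 0; done < prot_len; done += FORMAT_CHUNK) {
		size_t len = prot_len - done < FORMAT_CHUNK ?
			     (size_t)(prot_len - done) : FORMAT_CHUNK;
		if (fwrite(buf, 1, len, f) != len)
			break;
	}
	fclose(f);
	free(buf);
	return 0;
}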
Re: [PATCH-v2 02/17] target: Add DIF CHECK_CONDITION ASC/ASCQ exception cases
On Wed, 2014-01-22 at 18:44 +0200, Sagi Grimberg wrote: > On 1/19/2014 4:44 AM, Nicholas A. Bellinger wrote: > > From: Nicholas Bellinger > > > > This patch adds support for DIF related CHECK_CONDITION ASC/ASCQ > > exception cases into transport_send_check_condition_and_sense(). > > > > This includes: > > > >LOGICAL BLOCK GUARD CHECK FAILED > >LOGICAL BLOCK APPLICATION TAG CHECK FAILED > >LOGICAL BLOCK REFERENCE TAG CHECK FAILED > > > > that are used by DIF TYPE1 and TYPE3 failure cases. > > > > Cc: Martin K. Petersen > > Cc: Christoph Hellwig > > Cc: Hannes Reinecke > > Cc: Sagi Grimberg > > Cc: Or Gerlitz > > Signed-off-by: Nicholas Bellinger > > --- > > drivers/target/target_core_transport.c | 30 > > ++ > > include/target/target_core_base.h |3 +++ > > 2 files changed, 33 insertions(+) > > > > diff --git a/drivers/target/target_core_transport.c > > b/drivers/target/target_core_transport.c > > index 18c828d..fa4fc04 100644 > > --- a/drivers/target/target_core_transport.c > > +++ b/drivers/target/target_core_transport.c > > @@ -2674,6 +2674,36 @@ transport_send_check_condition_and_sense(struct > > se_cmd *cmd, > > buffer[SPC_ASC_KEY_OFFSET] = 0x1d; > > buffer[SPC_ASCQ_KEY_OFFSET] = 0x00; > > break; > > + case TCM_LOGICAL_BLOCK_GUARD_CHECK_FAILED: > > + /* CURRENT ERROR */ > > + buffer[0] = 0x70; > > + buffer[SPC_ADD_SENSE_LEN_OFFSET] = 10; > > + /* ILLEGAL REQUEST */ > > + buffer[SPC_SENSE_KEY_OFFSET] = ILLEGAL_REQUEST; > > + /* LOGICAL BLOCK GUARD CHECK FAILED */ > > + buffer[SPC_ASC_KEY_OFFSET] = 0x10; > > + buffer[SPC_ASCQ_KEY_OFFSET] = 0x01; > > + break; > > + case TCM_LOGICAL_BLOCK_APP_TAG_CHECK_FAILED: > > + /* CURRENT ERROR */ > > + buffer[0] = 0x70; > > + buffer[SPC_ADD_SENSE_LEN_OFFSET] = 10; > > + /* ILLEGAL REQUEST */ > > + buffer[SPC_SENSE_KEY_OFFSET] = ILLEGAL_REQUEST; > > + /* LOGICAL BLOCK APPLICATION TAG CHECK FAILED */ > > + buffer[SPC_ASC_KEY_OFFSET] = 0x10; > > + buffer[SPC_ASCQ_KEY_OFFSET] = 0x02; > > + break; > > + case TCM_LOGICAL_BLOCK_REF_TAG_CHECK_FAILED: > > + /* CURRENT ERROR */ > > + buffer[0] = 0x70; > > + buffer[SPC_ADD_SENSE_LEN_OFFSET] = 10; > > + /* ILLEGAL REQUEST */ > > + buffer[SPC_SENSE_KEY_OFFSET] = ILLEGAL_REQUEST; > > + /* LOGICAL BLOCK REFERENCE TAG CHECK FAILED */ > > + buffer[SPC_ASC_KEY_OFFSET] = 0x10; > > + buffer[SPC_ASCQ_KEY_OFFSET] = 0x03; > > + break; > > Hey Nic, > > I think we missed the failed LBA here. AFAICT, according to SPC-4, a DIF > error should be accompanied by an Information sense-data descriptor with > the (first) failed > sector in the information field. This means that this routine should be > ready to accept a > u32 bad_sector or something. I'm not sure how much of a must it really is. > > Let me prepare a patch... > Ah yes, good catch. This is what se_cmd->block_num was intended for. Care to add these assignments to target_core_sbc.c:sbc_dif_verify_* failures as well? --nab
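
A minimal sketch of what carrying the first failed LBA in the sense data could look like, using the fixed-format layout already assumed by this patch (VALID bit in byte 0, INFORMATION field in bytes 3-6, ASC/ASCQ in bytes 12-13 per SPC-4). This is not the actual follow-up patch; the helper name and sample LBA are illustrative:

/* dif-sense.c: sketch of stuffing the first failed LBA into the
 * INFORMATION field of fixed-format sense data for the DIF
 * guard/apptag/reftag CHECK CONDITION cases.  Offsets follow the
 * fixed-format layout in SPC-4; not the target patch itself.
 * LBAs wider than 32 bits would need descriptor-format sense instead.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define ILLEGAL_REQUEST 0x05

static void build_dif_sense(uint8_t *buffer, uint8_t ascq, uint32_t bad_lba)
{
	memset(buffer, 0, 18);
	buffer[0] = 0x70 | 0x80;       /* current error, VALID bit set */
	buffer[2] = ILLEGAL_REQUEST;   /* sense key */
	buffer[3] = bad_lba >> 24;     /* INFORMATION: first failed LBA */
	buffer[4] = bad_lba >> 16;
	buffer[5] = bad_lba >> 8;
	buffer[6] = bad_lba;
	buffer[7] = 10;                /* additional sense length */
	buffer[12] = 0x10;             /* ASC: logical block check failed */
	buffer[13] = ascq;             /* 0x01 guard, 0x02 apptag, 0x03 reftag */
}

int main(void)
{
	uint8_t sense[18];
	int i;

	build_dif_sense(sense, 0x03, 12345);   /* illustrative reftag failure */
	for (i = 0; i < 18; i++)
		printf("%02x ", sense[i]);
	printf("\n");
	return 0;
}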
Re: [PATCH-v2 03/17] target/sbc: Add DIF setup in sbc_check_prot + sbc_parse_cdb
On Wed, 2014-01-22 at 20:00 +0200, Sagi Grimberg wrote: > On 1/22/2014 12:48 AM, Nicholas A. Bellinger wrote: > > + cmd->prot_handover = PROT_SEPERATED; > >> I know that we are not planning to support interleaved mode at the > >> moment, but I think > >> that the protection handover type is the backstore preference and should > >> be taken from se_dev. > >> But it is not that important for now... > >> > > Yeah, I figured since the RDMA pieces needed the handover type defined > > in some form, it made sense to include PROT_SEPERATED hardcoded here, > > but stopped short of adding se_dev->prot_handler for the first round > > merge. > > > > --nab > > > > Actually they don't, I just added them in iSER code to demonstrate the > HW ability. > If we are not planning to support that (although as MKP mentioned it > might be useful in some cases), > you can remove that for now and we can add it in the future - iSER can > ignore it for now (I'll refactor the patches). > I'll likely leave this in for the initial merge to avoid rebasing target-pending/for-next now that the merge window is open. Let's drop this bit in a separate incremental patch. --nab
Re: [LSF/MM ATTEND] interest in blk-mq, scsi-mq, dm-cache, dm-thinp, dm-*
On 1/22/14 1:54 AM, "Bart Van Assche" wrote: >On 01/16/14 23:35, Giridhar Malavali wrote: >> On 1/10/14 10:27 AM, "Mike Snitzer" wrote: >>> I would like to attend to participate in discussions related to topics >>> listed in the subject. As a maintainer of DM I'd be interested to >>> learn/discuss areas that should become a development focus in the >>>months >>> following LSF. >> >> +1 for scsi-mq. > >(removed lsf-pc from CC-list) > >Hello Giridhar, > >I definitely appreciate QLogic's continuous efforts to improve >performance of their initiator and target drivers. Via your e-mail I >learned that QLogic is looking at scsi-mq, which is great news. However, >the following issue might have to be addressed before the qla2xxx driver >can fully take advantage of the scsi-mq work: >https://bugzilla.kernel.org/show_bug.cgi?id=69201. It would be >appreciated if someone could have a look at the measurements described >in that bugzilla entry. One of our internal runs has indicated lock contention issues and we are working on this actively. We will be posting the patches once we are done with it. Thanks for bringing this to our attention. -- Giri > >Thanks, > >Bart.
[LSF/MM ATTEND] Interested in blk-mq, scsi-mq, iSER, T10 PI
James, I'd like to attend LSF/MM 2014. I have been trying out blk-mq / scsi-mq on the Emulex offloaded iSCSI solution. Also interested in T10 PI, iSER and RDMA. I am also the maintainer of the be2iscsi driver. Thanks Jay
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, 22 Jan 2014, Chris Mason wrote: On Wed, 2014-01-22 at 11:50 -0800, Andrew Morton wrote: On Wed, 22 Jan 2014 11:30:19 -0800 James Bottomley wrote: But this, I think, is the fundamental point for debate. If we can pull alignment and other tricks to solve 99% of the problem, is there a need for radical VM surgery? Is there anything coming down the pipe in the future that may move the devices ahead of the tricks? I expect it would be relatively simple to get large blocksizes working on powerpc with 64k PAGE_SIZE. So before diving in and doing huge amounts of work, perhaps someone can do a proof-of-concept on powerpc (or ia64) with 64k blocksize. Maybe 5 drives in raid5 on MD, with 4K coming from each drive. Well aligned 16K IO will work, everything else will be about the same as an RMW from a single drive. I think this is the key point to think about here. How will these new hard drive large block sizes differ from RAID stripes and SSD eraseblocks? In all of these cases there are very clear advantages to doing the writes in properly sized and aligned chunks that correspond with the underlying structure to avoid the RMW overhead. It's extremely unlikely that drive manufacturers will produce drives that won't work with any existing OS, so they are going to support smaller writes in firmware. If they don't, they won't be able to sell their drives to anyone running existing software. Given the Enterprise software upgrade cycle compared to the expanding storage needs, whatever they ship will have to work on OS and firmware releases that happened several years ago. I think what is needed is some way to be able to get a report on how many RMW cycles have to happen. Then people can work on ways to reduce this number and measure the results. I don't know if md and dm are currently smart enough to realize that the entire stripe is being overwritten and avoid the RMW cycle. If they can't, I would expect that once we start measuring it, they will gain such support. David Lang
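
A minimal sketch of the full-stripe test behind this point (not from the thread; the RAID geometry and sample writes are illustrative): a write skips the parity read-modify-write only if it starts on a stripe boundary and covers whole stripes.

/* stripe-check.c: with N data drives and a per-drive chunk size, a
 * write only avoids the parity read-modify-write if it starts on a
 * stripe boundary and covers whole stripes.  Geometry is illustrative
 * (5-drive RAID5 => 4 data drives, 4 KB chunks, 16 KB full stripe).
 */
#include <stdio.h>
#include <stdbool.h>

struct stripe_geom {
	unsigned int chunk_bytes;   /* per-drive chunk */
	unsigned int data_drives;   /* drives minus parity */
};

static bool is_full_stripe_write(const struct stripe_geom *g,
				 unsigned long long offset,
				 unsigned long long len)
{
	unsigned long long stripe =
		(unsigned long long)g->chunk_bytes * g->data_drives;

	return len && (offset % stripe == 0) && (len % stripe == 0);
}

int main(void)
{
	struct stripe_geom g = { .chunk_bytes = 4096, .data_drives = 4 };
	unsigned long long cases[][2] = {
		{ 0,     16384 },   /* aligned full stripe: no RMW */
		{ 16384, 32768 },   /* two full stripes: no RMW */
		{ 4096,  16384 },   /* misaligned: RMW on two stripes */
		{ 0,     4096  },   /* partial stripe: RMW */
	};
	unsigned int i;

	for (i = 0; i < sizeof(cases) / sizeof(cases[0]); i++)
		printf("write at %llu len %llu: %s\n", cases[i][0], cases[i][1],
		       is_full_stripe_write(&g, cases[i][0], cases[i][1]) ?
		       "full-stripe, no RMW" : "needs RMW");
	return 0;
}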
Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going beyond 4096 bytes
On Wed, Jan 22, 2014 at 06:46:11PM -0800, David Lang wrote: > It's extremely unlikely that drive manufacturers will produce drives > that won't work with any existing OS, so they are going to support > smaller writes in firmware. If they don't, they won't be able to > sell their drives to anyone running existing software. Given the > Enterprise software upgrade cycle compared to the expanding storage > needs, whatever they ship will have to work on OS and firmware > releases that happened several years ago. I've been talking to a number of HDD vendors, and while most of the discussion has been about SMR, the topic of 64k sectors did come up recently. In the opinion of at least one drive vendor, the pressure for 64k sectors will start increasing (roughly paraphrasing that vendor's engineer, "it's a matter of physics"), and it might not be surprising if in 2 or 3 years we start seeing drives with 64k sectors. Like with 4k sector drives, it's likely that, at least initially, said drives will have an emulation mode where sub-64k writes will require a read-modify-write cycle. What I told that vendor was that if this were the case, he should seriously consider submitting a topic proposal to the LSF/MM, since if he wants those drives to be well supported, we need to start thinking about what changes might be necessary at the VM and FS layers now. So hopefully we'll see a topic proposal from that HDD vendor in the next couple of days. The bottom line is that I'm pretty well convinced that, like SMR drives, 64k sector drives will be coming, and it's not something we can duck. It might not come as quickly as the HDD vendor community might like --- I remember attending an IDEMA conference in 2008 where they confidently predicted that 4k sector drives would be the default in 2 years, and it took a wee bit longer than that. But nevertheless, looking at the most likely roadmap and trajectory of hard drive technology, these are two things that will very likely be coming down the pike, and it would be best if we start thinking about how to engage with these changes constructively sooner rather than putting it off and then getting caught behind the eight-ball later. Cheers, - Ted