On 04/28/2017 10:35 PM, Denis V. Lunev wrote: > On 04/28/2017 09:31 PM, Eric Blake wrote: >> On 04/28/2017 11:17 AM, Denis V. Lunev wrote: >>> Hello, All! >>> >>> Recently we have experienced problems with very slow >>> bdrv_get_block_status call, which is massive called f.e. >>> from mirror_run(). >>> >>> The problem was narrowed down to slow lseek64() system >>> call, which can take 1-2-3 seconds. >> I'm guessing you meant one-to-three (the range), not one-two-three >> (three separate digits), and just had an unfortunate abbreviation of >> 'to' turning into the phonetically-similar '2'. >> > from 1 to 3, you are correct > > >>> root@s158 ~]# strace -f -p 77048 -T -e lseek >>> Process 77048 attached with 14 threads >>> [pid 77053] lseek(23, 6359613440, SEEK_DATA) = 6360276992 <1.250323> >> That sounds like a bug in your choice of filesystem. It's been >> mentioned before that lseek has some pathetically poor behavior (I think >> tmpfs was one of the culprits), but I maintain that it is better to >> hammer on the kernel folks to fix the poor behavior than it is to have >> to implement user-space workarounds in every single affected program. >> >> That said, a workaround of being able to request the avoidance of lseek >> has been brought up on this list before. >> >> >>> The problem comes from this branch of the code >>> >>> bdrv_co_get_block_status >>> ....... >>> if (bs->file && >>> (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) && >>> (ret & BDRV_BLOCK_OFFSET_VALID)) { >>> int file_pnum; >>> >>> ret2 = bdrv_co_get_block_status(bs->file, ret >> BDRV_SECTOR_BITS, >>> *pnum, &file_pnum); >>> if (ret2 >= 0) { >>> /* Ignore errors. This is just providing extra information, it >>> * is useful but not necessary. >>> */ >>> if (!file_pnum) { >>> /* !file_pnum indicates an offset at or beyond the EOF; >>> it is >>> * perfectly valid for the format block driver to point >>> to such >>> * offsets, so catch it and mark everything as zero */ >>> ret |= BDRV_BLOCK_ZERO; >>> } else { >>> /* Limit request to the range reported by the protocol >>> driver */ >>> *pnum = file_pnum; >>> ret |= (ret2 & BDRV_BLOCK_ZERO); >>> } >>> } >>> } >> I don't see anything wrong with this code. It's nice to know that we >> have data (qcow2 says we have allocated bytes due to this layer, so >> don't refer to the backing files), but when the underlying file can ALSO >> tell us that the underlying protocol is sparse and we are on a hole, >> then we know that we have BDRV_BLOCK_ZERO condition which can in turn >> enable other optimizations in quite a bit of the stack. It IS valid to >> return BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO (we don't have to limit >> ourselves to BDRV_BLOCK_DATA), and we should probably do so any time >> that such computation is cheap. >> >> I agree with your analysis that on a poor filesystem where lseek() >> behavior sucks that it is no longer cheap to determine where the holes are. >> >> But the proper workaround for those filesystems should NOT be made by >> undoing this part of bdrv_co_get_block_status(), but should rather be >> fixed in file-posix.c by the addition of a user-controllable knob on >> whether to skip lseek(). In other words, if we're going to work around >> the poor filesystem performance, the workaround should live in >> file-posix.c, not in the generic block/io.c. Once the workaround is >> enabled, the 'ret2 = bdrv_co_get_block_status()' nested call should be >> lightning fast (because file-posix.c would no longer be using lseek when >> you've requested the workaround). >> >>> Frankly speaking, this optimization should not give much. >>> If upper layer format (QCOW2) says that we have data >>> here, then nowadays in 99.9% we do have the data. >> You are correct that we have BDRV_BLOCK_DATA; but it IS useful to ALSO >> know if that data reads back as all zeros (so we can skip reading it). >> Just because qcow2 reports something as data does NOT rule out whether >> the data still reads as zeros. >> >>> Meanwhile this branch can cause problems. We would >>> need block cleared entirely to get the benefit for most >>> cases in mirror and backup operations. >>> >>> At my opinion it worth to drop this at all. >>> >>> Guys, do you have any opinion? >> Again, my opinion is to not change this part of block/io.c. Rather, >> work should be expended on individual protocol drivers to quit wasting >> unnecessary time on reporting BDRV_BLOCK_ZERO if the time spent to >> determine that is not worth the effort (as it is when using lseek() on >> inefficient filesystems). It is always safe for a protocol driver to >> report BDRV_BLOCK_DATA instead of BDRV_BLOCK_ZERO, if asking the >> question is too expensive (treating holes as data may pessimize other >> operations, but that's okay if the penalty for asking is worse than what >> optimizations are able to regain). >> >>> Den >>> >>> P.S. The kernel is one based on RedHat 3.10.0-514. The same >>> problem was observed in 3.10.0-327 too. >> Red Hat is always two words (I'm allowed to point that out, as they >> employ me ;). But if it really is a Red Hat kernel problem, be sure to >> use your support contract to point out the kernel developers that they >> really need to fix lseek() - and you'll need to give details on which >> filesystem you're using (as not all filesystems have that abysmal >> performance). >> pls ignore above letter completely. I have pressed 'Send' in the wrong Window. Correct answer is above in the thread.
Den