In zoned btrfs, a region that was once written and then freed is not usable
until we reset the underlying zone. So we need to distinguish such
unusable space from usable free space.
Therefore, we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the
The implementation of fitrim depends on the space cache, which is not used
and is disabled for zoned btrfs' extent allocator. So the current code does
not work with zoned btrfs. In the future, we can implement fitrim for zoned
btrfs by enabling the space cache (but only for fitrim) or scanning the exten
This commit implements a sequential extent allocator for the ZONED mode.
This allocator just needs to check if there is enough space in the block
group. Therefore, the allocator never manages bitmaps or clusters. Also, add
ASSERTs to the corresponding functions.
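The allocation rule above can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the btrfs implementation: `struct zoned_bg` and `zoned_bg_alloc()` are hypothetical names, and locking and space reservation are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified block group: only what a sequential allocator needs. */
struct zoned_bg {
	uint64_t start;        /* logical start of the block group */
	uint64_t length;       /* size of the block group in bytes */
	uint64_t alloc_offset; /* bytes allocated so far, i.e. the write pointer */
};

/*
 * Sequential allocation: the only question is whether the request fits
 * between the write pointer and the end of the block group. There are no
 * bitmaps, no clusters and no free-space search.
 */
static bool zoned_bg_alloc(struct zoned_bg *bg, uint64_t num_bytes,
			   uint64_t *found)
{
	assert(bg->alloc_offset <= bg->length);

	if (bg->length - bg->alloc_offset < num_bytes)
		return false; /* does not fit; the caller must look elsewhere */

	*found = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes; /* advance the write pointer */
	return true;
}
```

In this sketch the write pointer is the only state, so a request that does not fit simply fails and the caller tries another block group.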
Actually, with zone append writing, it
This is a preparation for the next patch. This commit splits
alloc_log_tree() into the part that allocates the tree structure (which
remains in alloc_log_tree()) and the part that allocates the tree node
(moved into btrfs_alloc_log_tree_node()). The latter part is also exported
to be used in the next patch.
Reviewed-by: Josef Bacik
This is the 1/3 patch to enable tree-log in ZONED mode.
The tree-log feature does not work in ZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which is different timing fro
This is the 3/3 patch to enable tree-log in ZONED mode.
The allocation order of nodes of "fs_info->log_root_tree" and nodes of
"root->log_root" is not the same as the order in which they are written.
As a result, writing them causes unaligned write errors.
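A toy model makes this failure mode concrete. It assumes a sequential-write-required zone that accepts a regular write only at its write pointer; `struct seq_zone` and `zone_write()` are hypothetical, and returning `-EIO` for a write that is not at the write pointer is an assumption standing in for the device's unaligned write error.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of one sequential-write-required zone. */
struct seq_zone {
	uint64_t start; /* first byte of the zone */
	uint64_t size;  /* zone size in bytes */
	uint64_t wp;    /* write pointer (absolute byte address) */
};

/* A regular write is accepted only if it starts exactly at the write pointer. */
static int zone_write(struct seq_zone *z, uint64_t pos, uint64_t len)
{
	assert(z->wp >= z->start && z->wp <= z->start + z->size);

	if (pos != z->wp)
		return -EIO;    /* assumed error for a write not at the write pointer */
	if (len > z->start + z->size - z->wp)
		return -ENOSPC; /* would run past the end of the zone */
	z->wp += len;
	return 0;
}
```

If two nodes are allocated at offsets 0 and 16K but the node at 16K is written first, that first write is rejected, which is why the allocation order has to match the write order.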
This patch reorders the allocation of them by delaying alloc
This is the 2/3 patch to enable tree-log in ZONED mode.
Since we can start more than one log transaction per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at the
time of log transaction commit. The n
This is the 4/4 patch to implement device-replace in ZONED mode.
Even after the copying is done, the write pointers of the source device and
the destination device may not be synchronized. For example, when the last
allocated extent is freed before the device-replace process, the extent is
not copied, lea
This is the 1/4 patch to support device-replace in ZONED mode.
We have two types of I/Os during the device-replace process. One is an I/O
to "copy" (by the scrub functions) all the device extents on the source
device to the destination device. The other one is an I/O to "clone" (by
handle_ops_on_
This is the 3/4 patch to implement device-replace in ZONED mode.
This commit implements copying. It tracks the write pointer during the
device-replace process. Since device-replace's copying is smart enough to
copy only used extents on the source device, we have to fill the gaps to
honor the sequential write rule in the
To serialize allocation and submit_bio, we introduced a mutex around them. As
a result, preallocation must be completely disabled to avoid a deadlock.
Since the current relocation process relies on preallocation to move file
data extents, it must be handled in another way. In ZONED mode, we just truncat
When btrfs finds a checksum error and the file system has a mirror of the
damaged data, btrfs reads the correct data from the mirror and writes it to
the damaged blocks. This repair, however, violates the sequential write
required rule.
We can consider three methods to repair an IO failure
This is the 2/4 patch to implement device-replace in ZONED mode.
In zoned mode, a block group must be either copied (from the source device
to the destination device) or cloned (to both devices).
This commit implements the cloning part. If a block group targeted by an IO
is marked to copy, we sho
This commit enables zone append writing for zoned btrfs. When using zone
append, a bio is issued to the start of a target zone and the device
decides where to place it inside the zone. Upon completion, the device
reports the actual written position back to the host.
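The host/device contract described above can be modeled in a few lines. This is a hypothetical user-space sketch (`struct append_zone` and `zone_append()` are made-up names), assuming the device always places appended data at its current write pointer and returns that position on completion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a zone that accepts zone append commands. */
struct append_zone {
	uint64_t start; /* zone start; every append is addressed here */
	uint64_t size;  /* zone size in bytes */
	uint64_t wp;    /* current write pointer (absolute) */
};

/*
 * The host only names the zone (its start address). The device places the
 * data at its current write pointer and reports that position back.
 * Returns the written position, or UINT64_MAX if the data does not fit.
 */
static uint64_t zone_append(struct append_zone *z, uint64_t zone_start,
			    uint64_t len)
{
	assert(zone_start == z->start);

	if (len > z->start + z->size - z->wp)
		return UINT64_MAX;
	uint64_t written = z->wp;
	z->wp += len;
	return written;
}
```

Since the caller learns the real position only from the return value, anything that records where the data lives has to be updated at completion time.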
Three parts are necessary to enable zo
If more than one IO is issued for one file extent, these IOs can be written
to separate regions on a device. Since we cannot map one file extent to
such separate areas, we need to follow the "one IO == one ordered extent"
rule.
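One way to picture this rule is a check run before growing a bio. In this hypothetical sketch (`struct ordered_range` and `can_extend_bio()` are illustrative names, not btrfs API), a range may join the bio only if it is contiguous with it and stays inside the single ordered extent the bio belongs to; anything else has to go into a new bio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ordered extent covering [file_offset, file_offset + len). */
struct ordered_range {
	uint64_t file_offset;
	uint64_t len;
};

/*
 * The bio currently covers [bio_start, bio_start + bio_len) of the file.
 * Adding [off, off + len) is allowed only if the result stays contiguous
 * and inside the one ordered extent the bio belongs to.
 */
static bool can_extend_bio(const struct ordered_range *oe, uint64_t bio_start,
			   uint64_t bio_len, uint64_t off, uint64_t len)
{
	uint64_t oe_end = oe->file_offset + oe->len;

	assert(bio_start >= oe->file_offset && bio_start + bio_len <= oe_end);

	if (off != bio_start + bio_len)
		return false;       /* not contiguous with the bio */
	return off + len <= oe_end; /* must not cross the ordered extent's end */
}
```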
The normal buffered, uncompressed, non-preallocated write path (used
We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using the logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.
We cannot add a mutex around al
As with buffered IO, enable zone append writing for direct IO when it's
used on a zoned block device.
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
fs/btrfs/inode.c | 18 ++
1 file changed, 18 insertions(+)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c6
When truncating a file, file buffers which have already been allocated but
not yet written may be truncated. Truncating these buffers could break the
sequential write pattern in a block group if, for example, the truncated
blocks are followed by blocks allocated to another file. To
avoid t
In ZONED mode, btrfs uses the per-FS zoned_meta_io_lock to serialize the
metadata write IOs.
Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by the async checksum workers, as these workers are per CPU
and not per zone.
To preserve write BIO ordering, we can disable
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset if a block group contains a conventional
zone.
Instead, we can use the end of the last allocated extent in the
block group as the allocation offset.
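This fallback can be sketched as a scan for the highest extent end. btrfs reads the extent tree for this; the sketch below assumes the allocated extents are already available as a plain array, and `struct extent` and `conv_alloc_offset()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocated extent inside a block group (logical addresses). */
struct extent {
	uint64_t start;
	uint64_t len;
};

/*
 * With no hardware write pointer, derive the allocation offset from the
 * highest end of any allocated extent, relative to the block group start.
 * An empty block group yields 0.
 */
static uint64_t conv_alloc_offset(uint64_t bg_start, uint64_t bg_len,
				  const struct extent *ext, size_t n)
{
	uint64_t last_end = bg_start;

	for (size_t i = 0; i < n; i++) {
		uint64_t end = ext[i].start + ext[i].len;

		assert(ext[i].start >= bg_start && end <= bg_start + bg_len);
		if (end > last_end)
			last_end = end;
	}
	return last_end - bg_start;
}
```

Holes below the last extent are not reused in this model; allocation simply continues after the last allocated extent.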
For a new block group, we cannot calcul
From: Johannes Thumshirn
A following patch will add another caller of
btrfs_lookup_ordered_extent() from a bio endio context.
btrfs_lookup_ordered_extent() uses spin_lock_irq(), which unconditionally
disables interrupts. Change this to spin_lock_irqsave() so interrupts
aren't disabled and re-enab
A zoned device has its own hardware restrictions, e.g. max_zone_append_size,
when using REQ_OP_ZONE_APPEND. To follow these restrictions, use
bio_add_zone_append_page() instead of bio_add_page(). We need the target
device to use bio_add_zone_append_page(), so this commit reads the chunk
information to memoiz
From: Johannes Thumshirn
In zoned mode, cache whether a block group is on a sequential-write-only zone.
On sequential-write-only zones, we can use REQ_OP_ZONE_APPEND for writing
data, so provide btrfs_use_zone_append() to figure out whether I/O is
targeting a sequential-write-only zone and we can
From: Johannes Thumshirn
To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.
This commit ensures that a bio under construction does not span across an
ordered extent.
Signed-off-by: Johannes Thumshirn
Signed-off-by: Naohiro
For a zone append write, the device decides the location the data is
written to. Therefore, we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent maps
to a contiguous region on disk, we need to maintain a "one bio == one
ordered extent" rule.
btrfs_rmap_block() currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.
This commit extends the function to match a specified device. The old
functionality of querying all devices is left intact by specifying NULL as
the target device.
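The effect of the optional device filter can be illustrated with a simplified reverse mapping. The sketch assumes a mirrored chunk (DUP/RAID1-like) where each stripe holds a full copy, so no RAID0/RAID10 striping math is modeled; `struct dev`, `struct stripe` and `rmap_block()` are hypothetical and do not reproduce the real btrfs_rmap_block() signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device and stripe descriptions. */
struct dev {
	int devid;
};

struct stripe {
	const struct dev *dev; /* device holding this copy */
	uint64_t physical;     /* physical start of the stripe on that device */
};

/*
 * Reverse-map a physical address to logical addresses of the chunk. With
 * target == NULL every stripe is considered; otherwise only stripes on the
 * target device match. Returns the number of results written to out[].
 */
static size_t rmap_block(uint64_t chunk_logical, uint64_t stripe_len,
			 const struct stripe *stripes, size_t nr_stripes,
			 const struct dev *target, uint64_t physical,
			 uint64_t *out)
{
	size_t found = 0;

	assert(stripe_len > 0);
	for (size_t i = 0; i < nr_stripes; i++) {
		const struct stripe *s = &stripes[i];

		if (target && s->dev != target)
			continue; /* filtered out: not the requested device */
		if (physical < s->physical ||
		    physical - s->physical >= stripe_len)
			continue; /* address is not inside this stripe */
		out[found++] = chunk_logical + (physical - s->physical);
	}
	return found;
}
```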
We pass block_de
For a ZONED volume, a block group maps to a zone of the device. For
deleted unused block groups, the zone of the block group can be reset to
rewind the zone write pointer to the start of the zone.
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
fs/btrfs/block-group.c | 8 ++--
fs
Tree-manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that their pages are
not uselessly written out. On ZONED volumes, however, such an optimization
blocks the following IOs, as the cancellation of the write-out of the freed
blocks br
ZONED btrfs uses REQ_OP_ZONE_APPEND bios for writing to actual devices. Let
btrfs_end_bio() and btrfs_op be aware of it.
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
fs/btrfs/disk-io.c | 4 ++--
fs/btrfs/inode.c | 10 +-
fs/btrfs/volumes.c | 8
fs/btrfs/volumes.
This commit extracts the part that adds a page to a bio from
submit_extent_page(). The page is added only when the bio_flags are the
same, the page is contiguous with the bio, and the added page fits in the
same stripe as the pages already in the bio.
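The ordering of these checks can be sketched as a small predicate. This is a simplified, hypothetical model: `struct bio_state` and `can_add_page()` are illustrative names, compressed bios are ignored, and the stripe test is a plain division by a fixed stripe length standing in for btrfs_bio_fits_in_stripe().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of a bio under construction. */
struct bio_state {
	unsigned long bio_flags;
	uint64_t logical; /* logical address of the first byte in the bio */
	uint64_t size;    /* bytes already in the bio */
};

/*
 * Can a page of page_len bytes at logical address pos be appended?
 * The cheap checks (flags, contiguity) run first, so the stripe check is
 * skipped whenever they already fail.
 */
static bool can_add_page(const struct bio_state *bio, unsigned long bio_flags,
			 uint64_t pos, uint64_t page_len, uint64_t stripe_len)
{
	assert(stripe_len > 0 && page_len > 0);

	if (bio->bio_flags != bio_flags)
		return false; /* different kind of bio */
	if (bio->logical + bio->size != pos)
		return false; /* not contiguous with the bio */
	/* Stand-in stripe check: first and last byte in the same stripe. */
	return bio->logical / stripe_len == (pos + page_len - 1) / stripe_len;
}
```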
The condition checks are reordered to allow an early return, avoiding the
possibly heavy btrfs_bio_fits_in_st
Since the allocation info of a tree-log node is not recorded in the extent
tree, calculate_alloc_pointer() cannot detect the node, so the pointer can
be over a tree node.
Replaying the log calls btrfs_remove_free_space() for each node in the log
tree. So, advance the pointer past the node.
Reviewed
This final patch adds the ZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount ZONED-flagged file
systems.
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
fs/btrfs/ctree.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/f
Zoned btrfs must allocate blocks at the zones' write pointer. The device's
write pointer position can be mapped to a logical address within a block
group. This commit adds "alloc_offset" to track the logical address.
This logical address is populated in btrfs_load_block_group_zone_info()
from writ
Add a check in verify_one_dev_extent() to check if a device extent on a
zoned block device is aligned to the respective zone boundary.
Signed-off-by: Naohiro Aota
Reviewed-by: Anand Jain
Reviewed-by: Josef Bacik
---
fs/btrfs/volumes.c | 14 ++
1 file changed, 14 insertions(+)
diff
This commit implements a zoned chunk/dev_extent allocator. The zoned
allocator aligns the device extents to zone boundaries, so that a zone
reset affects only the device extent and does not change the state of
blocks in the neighboring device extents.
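Aligning device extents to zone boundaries comes down to rounding and a divisibility check, which is also what a validity check on a zoned device extent needs. A minimal sketch, assuming a power-of-two zone size; `zone_align_up()` and `dev_extent_zone_aligned()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round a device offset up to the next zone boundary (power-of-two zone size). */
static uint64_t zone_align_up(uint64_t pos, uint64_t zone_size)
{
	assert(zone_size != 0 && (zone_size & (zone_size - 1)) == 0);
	return (pos + zone_size - 1) & ~(zone_size - 1);
}

/* A device extent is zone-aligned if both its start and its length are. */
static bool dev_extent_zone_aligned(uint64_t start, uint64_t len,
				    uint64_t zone_size)
{
	return start % zone_size == 0 && len % zone_size == 0;
}
```

With both ends on zone boundaries, resetting the zones of one device extent cannot touch data belonging to its neighbors.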
Also, it checks that a region allocation is not o
From: Johannes Thumshirn
Run zoned btrfs mode on non-zoned devices. This is done by "slicing
up" the block device into statically sized chunks and faking a conventional
zone on each of them. The emulated zone size is determined from the size of
the device extent.
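The slicing can be sketched as plain arithmetic. The names are hypothetical, the zone size is taken as an input (in btrfs it comes from the device extent size), and this sketch simply ignores a trailing partial slice at the end of the device; how btrfs handles that case is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical report for one emulated zone. */
struct emu_zone {
	uint64_t start;
	uint64_t len;
	int conventional; /* always 1: emulated zones have no write pointer */
};

/* Number of whole emulated zones that fit on a device of dev_size bytes. */
static uint64_t emu_nr_zones(uint64_t dev_size, uint64_t zone_size)
{
	assert(zone_size > 0);
	return dev_size / zone_size;
}

/* Describe emulated zone number idx: a static slice of the device. */
static struct emu_zone emu_report_zone(uint64_t idx, uint64_t zone_size)
{
	struct emu_zone z = {
		.start = idx * zone_size,
		.len = zone_size,
		.conventional = 1,
	};
	return z;
}
```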
This is mainly aimed at testing parts of t
From: Johannes Thumshirn
Don't set the zoned flag in fs_info when encountering the
BTRFS_FEATURE_INCOMPAT_ZONED on mount. The zoned flag in fs_info is in a
union together with the zone_size, so setting it too early will result in
setting an incorrect zone_size as well.
Once the correct zone_size
From: Johannes Thumshirn
Since we have no write pointer in conventional zones, we cannot determine
the allocation offset from it. Instead, we set the allocation offset after
the highest addressed extent. This is done by reading the extent tree in
btrfs_load_block_group_zone_info().
However, this
This series adds zoned block device support to btrfs. Some of the patches
in the previous series are already merged as preparation patches.
This series is also available on GitHub.
Kernel: https://github.com/naota/linux/tree/btrfs-zoned-v13
Userland: https://github.com/naota/btrfs-progs/tree/btrfs
This is a preparation patch to implement zone emulation on a regular device.
To emulate zoned mode on a regular (non-zoned) device, we need to decide on
an emulated zone size. Instead of making it a compile-time static value,
we'll make it configurable at mkfs time. Since we have one zone == one device
From: Johannes Thumshirn
Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
is intended to be used by file systems that directly add pages to a bio
instead of using bio_iov_iter_get_pages().
Cc: Jens Axboe
Reviewed-by: Christoph Hellwig
Reviewed-by: Josef Bacik
Signed-of
Zoned btrfs puts a superblock at the beginning of SB logging zones
if the zone is conventional. This difference causes a chicken-and-egg
problem for emulated zoned mode. Since the device is a regular
(non-zoned) device, we cannot know whether the btrfs is regular or emulated
zoned while we read the
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) so that it is not split. bio_iov_iter_get_pages()
builds such a restricted bio using __bio_iov_append_get_pages() if
bio_op(bio) == REQ_OP_ZONE_APPEND.
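The size restriction amounts to a bound check before each addition. This hypothetical helper (`zone_append_can_add()` is not a kernel function) models only the `max_zone_append_sectors` limit in 512-byte sectors; the block layer also enforces other limits, such as segment counts, which are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors */

/*
 * A zone append bio must not be split, so it may only grow while its total
 * size stays within the device's max_zone_append_sectors limit.
 */
static bool zone_append_can_add(uint64_t bio_bytes, uint64_t add_bytes,
				uint32_t max_zone_append_sectors)
{
	uint64_t limit = (uint64_t)max_zone_append_sectors << SECTOR_SHIFT;

	assert(bio_bytes <= limit);
	return bio_bytes + add_bytes <= limit;
}
```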
To utilize it, we need to set the bio_op before calling
bio_iov
Currently, btrfs uses the page Private2 bit to indicate whether we have an
ordered extent for the page range.
But its lifespan is not consistent: during the regular writeback path,
there are two locations that clear the same PagePrivate2:
T - Page marked Dirty
|
+ - Page marked Private2, t
[Oops. Part of the cover letter is missing again. The cover-letter
file has it all. I am not sure why it happened.
Below, I am just sending it by email.]
v4:
Add rb from Josef in patch 1 and 3.
In patch 1/3, use fs_info instead of device->fs_devices->fs_info.
Drop round-robin policy beca
On Fri 22 Jan 2021 at 01:07, David Sterba wrote:
Adding Davidlohr to CC as it's about reverting his patch.
In d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t")
the seqcount_mutex_t was added as an annotation for lockdep, so by
reverting we'd lose that again.
On Thu, Jan 21, 20
On 2021/1/22 7:55 AM, chainofflowers wrote:
Hi Qu,
it happened again. This time on my /home partition.
I rebooted from an external disk and ran btrfs check without first going
through btrfs scrub, and this is the output, no errors:
--
[manjaro oc]# btrf
Hi Qu,
it happened again. This time on my /home partition.
I rebooted from an external disk and ran btrfs check without first going
through btrfs scrub, and this is the output, no errors:
--
[manjaro oc]# btrfs check /dev/mapper/OMO
Opening filesystem to ch
Hi btrfs-gurus,
I'm running a simple reflink/snapshot/COW scalability test at the
moment. It is just a loop that does "fio overwrite of 10,000 4kB
random direct IOs in a 4GB file; snapshot" and I want to check a
couple of things I'm seeing with btrfs. The fio config file is appended
to the email.
Fir
On Thu, Jan 21, 2021 at 07:16:05PM +0100, Goffredo Baroncelli wrote:
> On 1/20/21 5:02 PM, Josef Bacik wrote:
> > On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
> > >
> > > Hi all,
> > >
> > > This is an RFC; I wrote this patch because I find the idea interesting
> > > even though it adds more co
Hi all,
On 20.01.2021 at 0:12, Zygo Blaxell wrote:
> With the 4 device types we can trivially specify this arrangement.
>
> The sorted device lists would be:
>
> Metadata sort order Data sort order
> metadata only (3) data only (2)
> metadata pr
On 1/20/21 5:02 PM, Josef Bacik wrote:
On 1/17/21 1:54 PM, Goffredo Baroncelli wrote:
Hi all,
This is an RFC; I wrote this patch because I find the idea interesting
even though it adds more complication to the chunk allocator.
The basic idea is to store the metadata chunk in the faster disks
On Thu, Jan 21, 2021 at 06:10:36PM +0800, Anand Jain wrote:
>
>
> On 20/1/21 8:14 pm, David Sterba wrote:
> > On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
> >> The read policy type latency routes the read IO based on the historical
> >> average wait-time experienced by the read IOs
Adding Davidlohr to CC as it's about reverting his patch.
In d5c8238849e7 ("btrfs: convert data_seqcount to seqcount_mutex_t")
the seqcount_mutex_t was added as an annotation for lockdep, so by reverting
we'd lose that again.
On Thu, Jan 21, 2021 at 07:39:10PM +0800, Su Yue wrote:
> while running xfst
On Wed, Jan 20, 2021 at 12:25:25PM +0200, Nikolay Borisov wrote:
> Those constants are really used internally by zstd and including
> linux/zstd.h into users results in the following warnings:
>
> In file included from fs/btrfs/zstd.c:19:
> ./include/linux/zstd.h:798:21: warning: ‘ZSTD_skippableHe
On Thu, Jan 21, 2021 at 02:13:54PM +0800, Qu Wenruo wrote:
> [BUG]
> There is a long existing bug in the last parameter of
> btrfs_add_ordered_extent(), in commit 771ed689d2cd ("Btrfs: Optimize
> compressed writeback and reads") back to 2008.
>
> In that ancient commit btrfs_add_ordered_extent() e
On Thu, Jan 21, 2021 at 04:19:47PM +0800, Yang Li wrote:
> Fix below warnings reported by coccicheck:
> ./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing
> functions is not needed.
>
> Reported-by: Abaci Robot
> Signed-off-by: Yang Li
Added to misc-next, thanks.
On Wed, Jan 20, 2021 at 12:25:15PM +0200, Nikolay Borisov wrote:
> This fixes following W=1 warnings:
>
> fs/btrfs/file-item.c:27: warning: Cannot understand * @inode: the inode we
> want to update the disk_i_size for
> on line 27 - I thought it was a doc line
> fs/btrfs/file-item.c:65: warnin
On Mon, Jan 18, 2021 at 11:20:21PM +0100, David Sterba wrote:
> Hi,
>
> btrfs-progs version 5.10 have been released.
I got a report that the static build is broken. It's caused by libmount, which
has some internal functions with the same names as in progs
(canonicalize_path, parse_size). I don't have
While running xfstests on a 32-bit test box, many tests failed because of
warnings in dmesg. One of those warnings (btrfs/003):
[ 66.441305] [ cut here ]
[ 66.441317] WARNING: CPU: 6 PID: 9251 at incl
On 2021/1/21 6:32 PM, Filipe Manana wrote:
On Thu, Jan 21, 2021 at 6:27 AM Qu Wenruo wrote:
[BUG]
There is a long existing bug in the last parameter of
btrfs_add_ordered_extent(), in commit 771ed689d2cd ("Btrfs: Optimize
compressed writeback and reads") back to 2008.
In that ancient commit
On 20/1/21 9:54 pm, Michal Rostecki wrote:
On Wed, Jan 20, 2021 at 08:30:56PM +0800, Anand Jain wrote:
I ran fio tests again, now with dstat in another window. I don't
notice any such stalls; the read numbers stayed continuous until fio
finished. Could you please check with the below f
On Thu, Jan 21, 2021 at 6:27 AM Qu Wenruo wrote:
>
> [BUG]
> There is a long existing bug in the last parameter of
> btrfs_add_ordered_extent(), in commit 771ed689d2cd ("Btrfs: Optimize
> compressed writeback and reads") back to 2008.
>
> In that ancient commit btrfs_add_ordered_extent() expects t
On 21/1/21 4:19 pm, Yang Li wrote:
Fix below warnings reported by coccicheck:
./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing
functions is not needed.
Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
fs/btrfs/raid56.c | 3 +--
1 file changed, 1 insertion(+), 2 deleti
On 20/1/21 3:52 pm, Anand Jain wrote:
This is a preparatory patch and introduces a new device flag
'read_preferred', readable and writable via the sysfs interface.
Signed-off-by: Anand Jain
---
v4: -
There is rb from Josef for this patch in v3.
Could you please add it?
Thanks, Anand
v2: C style fixe
On 20/1/21 8:14 pm, David Sterba wrote:
On Tue, Jan 19, 2021 at 11:52:05PM -0800, Anand Jain wrote:
The read policy type latency routes the read IO based on the historical
average wait-time experienced by the read IOs through the individual
device. This patch obtains the historical read IO st
On 1/20/21 7:01 PM, Julian Calaby wrote:
> Hi Chaitanya,
>
> On Tue, Jan 19, 2021 at 5:01 PM Chaitanya Kulkarni
> wrote:
>> Hi,
>>
>> This is a *compile only RFC* which adds a generic helper to initialize
>> the various fields of the bio that is repeated all the places in
>> file-systems, block la
Fix below warnings reported by coccicheck:
./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing
functions is not needed.
Reported-by: Abaci Robot
Signed-off-by: Yang Li
---
fs/btrfs/raid56.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/btrfs/raid56.c