[
https://issues.apache.org/jira/browse/IGNITE-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Rakov updated IGNITE-6532:
-------------------------------
Description:
Modern databases (Oracle, MySQL) work with the storage drive at the physical
level, creating their own partition table and filesystem.
Ignite Persistent Store works with regular files. It appends new pages to the
partition file as they are allocated and written at checkpoint. These new
pages can form one or several fragments at the filesystem level.
As a result, after weeks of uptime, partition files can contain a huge number
of fragments. There have been reports of about 1,200,000 fragments in the
index.bin file on an XFS filesystem.
We can work around this by preallocating files in bigger chunks, e.g. 1000
pages at a time. On the other hand, early allocation increases LFS size
overhead, so we should choose a reasonable allocation heuristic.
Allocation should be performed at the native level. Simply writing a byte at
position (file_size + page_size * 1000) won't do it, because XFS (and other
filesystems as well) has an optimization for that case: the skipped range
becomes a hole in a sparse file, and no blocks are actually allocated.
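A sketch of what native-level allocation could look like from Java, using
posix_fallocate(2) via JNA (the binding, the reflection hack for the raw file
descriptor, and all class names are assumptions, not Ignite's actual code; a
JNI binding would work the same way). On XFS, fallocate reserves contiguous
unwritten extents without writing zeroes, which is exactly what is needed here.
{code:java}
import com.sun.jna.Library;
import com.sun.jna.Native;

import java.io.FileDescriptor;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.lang.reflect.Field;

/** Sketch only: assumes Linux/glibc, 64-bit off_t and JNA 5.x on classpath. */
public final class NativePreallocate {
    /** Minimal libc binding; posix_fallocate has been in POSIX since 2001. */
    public interface CLib extends Library {
        CLib INSTANCE = Native.load("c", CLib.class);

        int posix_fallocate(int fd, long offset, long len);
    }

    /** Extends the file to newLen bytes of physically allocated space. */
    public static void preallocate(RandomAccessFile raf, long newLen) throws Exception {
        long oldLen = raf.length();
        if (newLen <= oldLen)
            return;

        // No public accessor exposes the raw fd; reflection on
        // java.io.FileDescriptor is the usual OpenJDK workaround
        // (newer JDKs may need --add-opens java.base/java.io).
        Field fdField = FileDescriptor.class.getDeclaredField("fd");
        fdField.setAccessible(true);
        int fd = fdField.getInt(raf.getFD());

        // Unlike a write past EOF, this forces real block allocation.
        // posix_fallocate returns the error number directly on failure.
        int err = CLib.INSTANCE.posix_fallocate(fd, oldLen, newLen - oldLen);
        if (err != 0)
            throw new IOException("posix_fallocate failed, errno=" + err);
    }
}
{code}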
Related article about filesystem internals:
https://blog.codecentric.de/en/2017/04/xfs-possible-memory-allocation-deadlock-kmem_alloc/
> Introduce preallocation in LFS files to avoid high fragmentation on filesystem level
> ------------------------------------------------------------------------------------
>
> Key: IGNITE-6532
> URL: https://issues.apache.org/jira/browse/IGNITE-6532
> Project: Ignite
> Issue Type: Bug
> Components: persistence
> Affects Versions: 2.2
> Reporter: Ivan Rakov
> Fix For: 2.4