Hi Guozhang, thanks for your response. I have done some further investigation.
The difference I see between the two versions is the following. In 1.1, this is the stat of the RocksDB MANIFEST file for one of the partitions:

root@fooapp-6c4649dd68-wzrxk:/data# stat fooapp/2_5/rocksdb/foo-store/MANIFEST-000007
  File: fooapp/2_5/rocksdb/foo-store/MANIFEST-000007
  Size: 159        Blocks: 16         IO Block: 4096   regular file
Device: d3h/211d   Inode: 44831568    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-03-01 19:56:46.000000000 +0000
Modify: 2019-03-01 19:56:46.000000000 +0000
Change: 2019-03-01 19:56:46.000000000 +0000
 Birth: -

and ls -ls:

8 -rw-r--r--. 1 root root 275 Mar  1 19:57 MANIFEST-000007

Then in 2.1 this is the equivalent manifest file:

root@fooapp-7678bbcfb7-wtjjn:/data# stat fooapp/2_5/rocksdb/foo-store/MANIFEST-000011
  File: fooapp/2_5/rocksdb/foo-store/MANIFEST-000011
  Size: 160        Blocks: 8200       IO Block: 4096   regular file
Device: 10006eh/1048686d   Inode: 185032005   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-03-01 19:32:42.000000000 +0000
Modify: 2019-03-01 19:32:42.000000000 +0000
Change: 2019-03-01 19:32:42.000000000 +0000
 Birth: -

and ls -ls:

4100 -rw-r--r--. 1 root root 160 Mar  1 19:32 MANIFEST-000011

I believe this is what is driving the actual disk usage: each of these manifest files occupies about 4MB on the filesystem despite containing only ~160 bytes, and that quickly adds up. https://github.com/facebook/rocksdb/blob/v5.14.2/include/rocksdb/options.h#L589 implies this should always have been the case, so it looks like the older version wasn't preallocating (properly?) but the newer one is. I can also see in the OPTIONS file that the newer version has introduced the manifest_preallocation_size=4194304 configuration, whereas it's absent in the 5.7.3 file.

At rest this application has 508 MANIFEST-* files (it has a lot of individual state stores), so preallocation alone accounts for about 2GB (significantly greater than the total state dir size). That much is manageable. In 1.1, during start-up there can be as many as 670 MANIFEST files; in 2.1 the number of MANIFEST files grows exceptionally fast, peaking as high as 8811 in one test run before dropping back down. It has probably been higher in some cases, but this was a clean start with a decent initial rebalance delay. Both versions are running in Docker with identical configurations, and we have not touched the defaults set for RocksDB by Kafka Streams.

So it looks like it should be possible to tune the manifest preallocation size down in RocksDB directly, given none of our manifest files need to be this large: the largest requires about 4.7KB of actual disk, and many are much smaller. My next line of enquiry is why the number of MANIFEST files during restore grows so high, so fast, in this new version. 4.7KB * 8811 files is a far more manageable total, so addressing either the manifest size or the number of manifests would bring this back into manageable territory for us, though I'd like to tackle both. rocksdb.config.setter is, I expect, the way forward on the manifest preallocation size; a rough sketch of what I have in mind is below.

I don't know if anyone can give an indication of what a typical MANIFEST size would be, or how it scales relative to, say, partition/segment size? Are you able to shed some light on why this newer version is creating so many MANIFEST files? I can look into it further, but I suspect either each instance is creating stores for more partitions than it will actually handle, or there are duplicate / redundant / cleaned-up stores/MANIFEST files hanging around.
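To tackle the preallocation side, this is roughly the kind of RocksDBConfigSetter I have in mind. It is only a sketch based on my own assumptions: the class name and the 64KB value are mine (picked to sit comfortably above our largest observed ~4.7KB manifest), not anything recommended by Kafka Streams or RocksDB.

    import java.util.Map;
    import org.apache.kafka.streams.state.RocksDBConfigSetter;
    import org.rocksdb.Options;

    // Sketch: shrink the MANIFEST preallocation from the ~4MB default we see with
    // the newer RocksDB down to a small value. 64KB is an arbitrary guess on my
    // part, chosen only because our largest manifest is ~4.7KB on disk.
    public class SmallManifestConfigSetter implements RocksDBConfigSetter {
        @Override
        public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
            options.setManifestPreallocationSize(64 * 1024L);
        }
    }

It would then be registered on the streams configuration with props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, SmallManifestConfigSetter.class). Whether a value that small is safe for much larger stores is exactly the kind of thing I was hoping someone could comment on.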
Thanks,
Adrian

On 01/03/2019, 18:08, "Guozhang Wang" <wangg...@gmail.com> wrote:

Hello Adrian,

What you described does sound weird to me. I'm not aware of any regressions in RocksDB disk usage from 1.1 to 2.1. Could you file a JIRA ticket with more details such as state dir snapshots, your code snippet and configs etc. so we can find a way to reproduce it?

Guozhang

On Fri, Mar 1, 2019 at 5:54 AM Adrian McCague <adrian.mcca...@zopa.com> wrote:

> Hi,
>
> We are in the process of attempting to upgrade from Kafka Streams 1.1.0 to
> 2.1.0(-cp1), however we get some wildly different behaviour with regards to
> disk usage between these two versions.
>
> An update that uses existing state data exhibits the same behaviour as
> starting with completely clean state data directories.
>
> With 1.1.0 the same topologies restoring the same state stores will at most
> use 275MB.
> With 2.1.0 they will quickly take up to 60GB – I extended the volumes to see
> just how far it would go, already magnitudes larger than before.
>
> The only remotely interesting information in the logs is the following
> exceptions, which (other than the explicit disk space exception) I presume
> are related to running out of disk space:
>
> org.apache.kafka.streams.errors.ProcessorStateException: task [0_7]
> Exception caught while trying to restore state from
> foo-v4-KSTREAM-JOINTHIS-0000000012-store-changelog-7
>   at org.apache.kafka.streams.processor.internals.ProcessorStateManager.updateStandbyStates(ProcessorStateManager.java:185)
>   at org.apache.kafka.streams.processor.internals.StandbyTask.update(StandbyTask.java:188)
>   at org.apache.kafka.streams.processor.internals.StreamThread.maybeUpdateStandbyTasks(StreamThread.java:1114)
>   at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:895)
>   at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:777)
>   at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:747)
> Caused by: java.lang.NullPointerException: null
> java.lang.NullPointerException: null
>   at org.apache.kafka.streams.state.internals.RocksDBStore.toggleDbForBulkLoading(RocksDBStore.java:229)
>   at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.getWriteBatches(RocksDBSegmentedBytesStore.java:230)
>   at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.restoreAllInternal(RocksDBSegmentedBytesStore.java:208)
>   at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore$RocksDBSegmentsBatchingRestoreCallback.restoreAll(RocksDBSegmentedBytesStore.java:262)
>   at org.apache.kafka.streams.processor.internals.StateRestoreCallbackAdapter.lambda$adapt$0(StateRestoreCallbackAdapter.java:42)
>   at org.apache.kafka.streams.processor.internals.CompositeRestoreListener.restoreBatch(CompositeRestoreListener.java:89)
>   at org.apache.kafka.streams.processor.internals.StateRestorer.restore(StateRestorer.java:83)
>   at org.apache.kafka.streams.processor.internals.StoreChangelogReader.processNext(StoreChangelogReader.java:310)
>   at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:92)
>   at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:321)
>   at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:839)
>   at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:777)
>   at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:747)
>
> org.apache.kafka.streams.errors.ProcessorStateException: Error opening store
> KSTREAM-JOINTHIS-0000000018-store.1551343860000 at location
> /data/foo/0_1/KSTREAM-JOINTHIS-0000000018-store/KSTREAM-JOINTHIS-0000000018-store.1551343860000
>   at org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:158)
>   at org.apache.kafka.streams.state.internals.Segment.openDB(Segment.java:45)
>   at org.apache.kafka.streams.state.internals.Segments.getOrCreateSegment(Segments.java:101)
>   at org.apache.kafka.streams.state.internals.Segments.getOrCreateSegmentIfLive(Segments.java:81)
>   at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.getWriteBatches(RocksDBSegmentedBytesStore.java:224)
>   at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore.restoreAllInternal(RocksDBSegmentedBytesStore.java:208)
>   at org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStore$RocksDBSegmentsBatchingRestoreCallback.restoreAll(RocksDBSegmentedBytesStore.java:262)
>   at org.apache.kafka.streams.processor.internals.StateRestoreCallbackAdapter.lambda$adapt$0(StateRestoreCallbackAdapter.java:42)
>   at org.apache.kafka.streams.processor.internals.CompositeRestoreListener.restoreBatch(CompositeRestoreListener.java:89)
>   at org.apache.kafka.streams.processor.internals.StateRestorer.restore(StateRestorer.java:83)
>   at org.apache.kafka.streams.processor.internals.StoreChangelogReader.processNext(StoreChangelogReader.java:310)
>   at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:92)
>   at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:321)
>   at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:839)
>   at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:777)
>   at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:747)
> Caused by: org.rocksdb.RocksDBException: While appending to file:
> /data/foo/0_1/KSTREAM-JOINTHIS-0000000018-store/KSTREAM-JOINTHIS-0000000018-store.1551343860000/MANIFEST-000001:
> No space left on device
>   at org.rocksdb.RocksDB.open(Native Method)
>   at org.rocksdb.RocksDB.open(RocksDB.java:231)
>   at org.apache.kafka.streams.state.internals.RocksDBStore.openDB(RocksDBStore.java:156)
>   ... 15 common frames omitted
>
> It feels likely that this is some serious amount of preallocation happening
> that didn't use to?
> I have eliminated
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8202261 as being an
> issue – we see this behaviour with 8u141, 8u181 and 8u202.
>
> Any help or avenues of investigation are welcome.
>
> Thanks
> Adrian
>
> Disclaimer
>
> This e-mail has been sent to you by Zopa.
>
> Zopa Limited is authorised and regulated by the Financial Conduct Authority,
> and entered on the Financial Services Register under firm registration number
> 718925. Zopa Bank Limited is authorised by the Prudential Regulation Authority
> and regulated by the Financial Conduct Authority and the Prudential Regulation
> Authority, and entered on the Financial Services Register (800542).
>
> Registered Office: Zopa Limited (05197592) and Zopa Bank Limited (10627575)
> are both incorporated in England & Wales and have their registered office at:
> 1st Floor, Cottons Centre, Tooley Street, London, SE1 2QG.
-- 
-- Guozhang