nfsantos commented on code in PR #1595:
URL: https://github.com/apache/jackrabbit-oak/pull/1595#discussion_r1694820881


##########
oak-run-commons/src/main/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/FlatFileStoreIterator.java:
##########
@@ -19,34 +19,44 @@
 
 package org.apache.jackrabbit.oak.index.indexer.document.flatfile;
 
-import static org.apache.jackrabbit.guava.common.collect.Iterators.concat;
-import static org.apache.jackrabbit.guava.common.collect.Iterators.singletonIterator;
-
-import java.io.Closeable;
-import java.util.Iterator;
-import java.util.Set;
-
+import org.apache.jackrabbit.guava.common.collect.AbstractIterator;
+import org.apache.jackrabbit.oak.commons.IOUtils;
 import org.apache.jackrabbit.oak.index.indexer.document.NodeStateEntry;
 import org.apache.jackrabbit.oak.index.indexer.document.flatfile.linkedList.FlatFileBufferLinkedList;
 import org.apache.jackrabbit.oak.index.indexer.document.flatfile.linkedList.NodeStateEntryList;
 import org.apache.jackrabbit.oak.index.indexer.document.flatfile.linkedList.PersistedLinkedList;
+import org.apache.jackrabbit.oak.index.indexer.document.flatfile.linkedList.PersistedLinkedListV2;
 import org.apache.jackrabbit.oak.index.indexer.document.flatfile.pipelined.ConfigHelper;
 import org.apache.jackrabbit.oak.spi.blob.BlobStore;
 import org.apache.jackrabbit.oak.spi.state.NodeState;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-import org.apache.jackrabbit.guava.common.collect.AbstractIterator;
+import java.io.Closeable;
+import java.util.Iterator;
+import java.util.Set;
+
+import static org.apache.jackrabbit.guava.common.collect.Iterators.concat;
+import static org.apache.jackrabbit.guava.common.collect.Iterators.singletonIterator;
 
 class FlatFileStoreIterator extends AbstractIterator<NodeStateEntry> implements Iterator<NodeStateEntry>, Closeable {
-    private static final Logger log = LoggerFactory.getLogger(FlatFileStoreIterator.class);
+    private static final Logger LOG = LoggerFactory.getLogger(FlatFileStoreIterator.class);
 
     static final String BUFFER_MEM_LIMIT_CONFIG_NAME = "oak.indexer.memLimitInMB";
     // by default, use the PersistedLinkedList
     private static final int DEFAULT_BUFFER_MEM_LIMIT_IN_MB = 0;
-    static final String PERSISTED_LINKED_LIST_CACHE_SIZE = "oak.indexer.persistedLinkedList.cacheSize";
-    static final int DEFAULT_PERSISTED_LINKED_LIST_CACHE_SIZE = 1000;
 
+    public static final String PERSISTED_LINKED_LIST_CACHE_SIZE = "oak.indexer.persistedLinkedList.cacheSize";
+    public static final int DEFAULT_PERSISTED_LINKED_LIST_CACHE_SIZE = 1000;
+
+    public static final String PERSISTED_LINKED_LIST_V2_CACHE_SIZE = "oak.indexer.persistedLinkedListV2.cacheSize";
+    public static final int DEFAULT_PERSISTED_LINKED_LIST_V2_CACHE_SIZE = 10000;
+
+    public static final String PERSISTED_LINKED_LIST_V2_MEMORY_CACHE_SIZE_MB = "oak.indexer.persistedLinkedListV2.cacheMaxSizeMB";
+    public static final int DEFAULT_PERSISTED_LINKED_LIST_V2_MEMORY_CACHE_SIZE_MB = 8;
+
+    public static final String PERSISTED_LINKED_LIST_USE_V2 = "oak.indexer.persistedLinkedList.useV2";
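The new V2 settings in the diff might be consumed along these lines. This is only an illustrative sketch: the real wiring in Oak goes through `ConfigHelper` (visible in the imports), whereas this sketch uses the plain JDK system-property accessors, and `describeBuffer` is a made-up helper name.

```java
// Sketch of reading the system properties introduced in the diff.
// Property names and defaults match the new constants; everything
// else here is illustrative, not the actual Oak code.
public class LinkedListConfigSketch {
    static final String PERSISTED_LINKED_LIST_USE_V2 =
            "oak.indexer.persistedLinkedList.useV2";
    static final String PERSISTED_LINKED_LIST_V2_CACHE_SIZE =
            "oak.indexer.persistedLinkedListV2.cacheSize";
    static final String PERSISTED_LINKED_LIST_V2_MEMORY_CACHE_SIZE_MB =
            "oak.indexer.persistedLinkedListV2.cacheMaxSizeMB";

    static String describeBuffer() {
        // Boolean.getBoolean returns false when the property is unset,
        // so the V1 list stays the default.
        boolean useV2 = Boolean.getBoolean(PERSISTED_LINKED_LIST_USE_V2);
        int cacheSize = Integer.getInteger(PERSISTED_LINKED_LIST_V2_CACHE_SIZE, 10000);
        int cacheMB = Integer.getInteger(PERSISTED_LINKED_LIST_V2_MEMORY_CACHE_SIZE_MB, 8);
        return useV2
                ? "PersistedLinkedListV2[maxEntries=" + cacheSize + ", maxMB=" + cacheMB + "]"
                : "PersistedLinkedList";
    }

    public static void main(String[] args) {
        System.out.println(describeBuffer());
    }
}
```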

Review Comment:
   Both of these design choices are based on lessons learned while developing the Pipelined strategy, which internally also keeps several buffers that should be as large as possible while staying within memory limits.
   
   - 8 MB default - This is intentionally low to avoid failures when running with small heaps. In the past, I had test failures on some CI agents because they were configured with only 512 MB of RAM and I had set a default buffer size that took up a significant portion of the available memory. The intention is that any real deployment of Oak will raise this value to 128 MB or more, but by default I prefer to be very conservative. And 8 MB is not that small: in my experiments, 1000 entries usually take less than 1 MB of RAM, and 1000 is the default cache size of the previous implementation.
   - Limit on both size and count - This accounts for the case where all the entries are very small. The memory estimation is imperfect because it does not take into account the memory required by the internal structures of the HashMap, or any other per-entry overhead that we may be missing. Once again, I had unexpected OOMEs with the buffers used in the Pipelined strategy when I did not set a limit on the number of entries. Having a limit on the entry count also gives us another lever to adjust in case we find unexpected issues in production.
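
The dual-limit idea described above can be sketched as follows. This is a toy illustration, not the actual PersistedLinkedListV2 code: the entry type, the class name, and the per-entry byte estimate are all simplified assumptions.

```java
import java.util.ArrayDeque;

// Illustrative sketch of a buffer that caps BOTH entry count and
// estimated memory, so that an imperfect size estimate for many tiny
// entries cannot blow past the heap limit on its own.
class DualLimitBuffer {
    private final int maxEntries;
    private final long maxBytes;
    private final ArrayDeque<String> cache = new ArrayDeque<>();
    private long estimatedBytes;

    DualLimitBuffer(int maxEntries, long maxBytes) {
        this.maxEntries = maxEntries;
        this.maxBytes = maxBytes;
    }

    void add(String entry) {
        cache.addLast(entry);
        estimatedBytes += entry.length(); // crude per-entry estimate
        // Evict the oldest entries until BOTH limits are satisfied again.
        while (cache.size() > maxEntries || estimatedBytes > maxBytes) {
            estimatedBytes -= cache.removeFirst().length();
        }
    }

    int size() { return cache.size(); }
    long estimatedBytes() { return estimatedBytes; }
}
```

The eviction loop checks both conditions with a logical OR, which is the point of the design: whichever limit trips first wins, and the entry-count cap acts as a safety net when the byte estimate undercounts.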



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
