jpountz commented on PR #13337:
URL: https://github.com/apache/lucene/pull/13337#issuecomment-2090882979
<details>
<summary>I created the following benchmark to simulate lookups in a terms
dictionary that cannot fit in the page cache.</summary>
```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.NIOFSDirectory;
public class PrefetchBench {

  private static final int NUM_TERMS = 3;
  private static final long FILE_SIZE = 100L * 1024 * 1024 * 1024; // 100GB
  private static final int NUM_BYTES = 16;

  public static int DUMMY;

  public static void main(String[] args) throws IOException {
    Path filePath = Paths.get(args[0]);
    Path dirPath = filePath.getParent();
    String fileName = filePath.getFileName().toString();
    Random r = ThreadLocalRandom.current();
    try (Directory dir = new NIOFSDirectory(dirPath)) {
      if (Arrays.asList(dir.listAll()).contains(fileName) == false) {
        try (IndexOutput out = dir.createOutput(fileName, IOContext.DEFAULT)) {
          byte[] buf = new byte[8192]; // 8kB write buffer
          for (long i = 0; i < FILE_SIZE; i += buf.length) {
            r.nextBytes(buf);
            out.writeBytes(buf, buf.length);
          }
        }
      }
      for (boolean dataFitsInCache : new boolean[] { false, true }) {
        try (IndexInput i0 = dir.openInput(fileName, IOContext.DEFAULT)) {
          byte[][] b = new byte[NUM_TERMS][];
          for (int i = 0; i < NUM_TERMS; ++i) {
            b[i] = new byte[NUM_BYTES];
          }
          IndexInput[] inputs = new IndexInput[NUM_TERMS];
          if (dataFitsInCache) {
            // 16MB slice that should easily fit in the page cache
            inputs[0] = i0.slice("slice", 0, 16 * 1024 * 1024);
          } else {
            inputs[0] = i0;
          }
          for (int i = 1; i < NUM_TERMS; ++i) {
            inputs[i] = inputs[0].clone();
          }
          final long length = inputs[0].length();
          List<Long>[] latencies = new List[2];
          latencies[0] = new ArrayList<>();
          latencies[1] = new ArrayList<>();
          for (int iter = 0; iter < 10_000; ++iter) {
            final boolean prefetch = (iter & 1) == 0;
            final long start = System.nanoTime();
            for (IndexInput ii : inputs) {
              final long offset = r.nextLong(length - NUM_BYTES);
              ii.seek(offset);
              if (prefetch) {
                ii.prefetch();
              }
            }
            for (int i = 0; i < NUM_TERMS; ++i) {
              inputs[i].readBytes(b[i], 0, b[i].length);
            }
            final long end = System.nanoTime();
            // Prevent the JVM from optimizing away the reads
            DUMMY = Arrays.stream(b).mapToInt(Arrays::hashCode).sum();
            latencies[iter & 1].add((end - start) / 1024); // ns -> ~us
          }
          latencies[0].sort(null);
          latencies[1].sort(null);
          System.out.println("Data " + (dataFitsInCache ? "fits" : "does not fit") + " in the page cache");
          long prefetchP50 = latencies[0].get(latencies[0].size() / 2);
          long prefetchP90 = latencies[0].get(latencies[0].size() * 9 / 10);
          long noPrefetchP50 = latencies[1].get(latencies[1].size() / 2);
          long noPrefetchP90 = latencies[1].get(latencies[1].size() * 9 / 10);
          System.out.println("  With prefetching:    P50=" + prefetchP50 + "us P90=" + prefetchP90 + "us");
          System.out.println("  Without prefetching: P50=" + noPrefetchP50 + "us P90=" + noPrefetchP90 + "us");
        }
      }
    }
  }
}
```
</details>
It assumes 3 terms that need to read 16 bytes each from the terms
dictionary. We compare the time it takes to read these 16 bytes using today's
sequential approach vs. taking advantage of the new `prefetch()` API, both in
the case when the terms dictionary fits in the page cache, and when it doesn't.
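The core idea is to separate issuing the I/O from consuming it: announce all offsets first, then do the blocking reads, so the kernel can service the misses in parallel instead of one after another. Outside Lucene, the same two-phase pattern can be sketched with plain NIO — here `AsynchronousFileChannel` stands in for `prefetch()`, and the temp file and offsets are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Future;

public class OverlappedReads {
  public static void main(String[] args) throws Exception {
    // Hypothetical stand-in for a terms dictionary: 1MB of random bytes.
    Path tmp = Files.createTempFile("prefetch-demo", ".bin");
    byte[] data = new byte[1 << 20];
    new Random(42).nextBytes(data);
    Files.write(tmp, data);
    long[] offsets = { 0, 512 * 1024, 1_000_000 }; // one offset per "term"
    try (AsynchronousFileChannel ch =
        AsynchronousFileChannel.open(tmp, StandardOpenOption.READ)) {
      // Phase 1: issue all reads up front, analogous to calling
      // prefetch() on every IndexInput before any readBytes().
      List<ByteBuffer> bufs = new ArrayList<>();
      List<Future<Integer>> pending = new ArrayList<>();
      for (long offset : offsets) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        bufs.add(buf);
        pending.add(ch.read(buf, offset));
      }
      // Phase 2: block on the results; the three I/Os can overlap
      // instead of paying three sequential cache misses.
      for (int i = 0; i < offsets.length; i++) {
        int n = pending.get(i).get();
        System.out.println("read " + n + " bytes at offset " + offsets[i]);
      }
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```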
The benchmark prints the following on my machine:
```
Data does not fit in the page cache
With prefetching: P50=103us P90=144us
Without prefetching: P50=226us P90=247us
Data fits in the page cache
With prefetching: P50=14us P90=83us
Without prefetching: P50=3us P90=73us
```
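For reference, the P50/P90 figures are plain order statistics over each latency list: sort ascending, then index at `size / 2` and `size * 9 / 10`. A minimal standalone sketch of that extraction, using made-up latencies of 1..100us:

```java
import java.util.ArrayList;
import java.util.List;

public class PercentileDemo {
  public static void main(String[] args) {
    // Hypothetical latencies in microseconds, 100 down to 1.
    List<Long> latencies = new ArrayList<>();
    for (long v = 100; v >= 1; --v) {
      latencies.add(v);
    }
    latencies.sort(null); // ascending, as in the benchmark
    long p50 = latencies.get(latencies.size() / 2);      // index 50 of 0..99
    long p90 = latencies.get(latencies.size() * 9 / 10); // index 90 of 0..99
    System.out.println("P50=" + p50 + "us P90=" + p90 + "us");
  }
}
```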
Using `prefetch()` adds a bit of overhead (~10us) to each lookup in the
best-case scenario when data fits in the page cache, but saves a non-negligible
amount of time (~100us) when data does not fit in the page cache. Note that
these numbers would need to be multiplied by the number of segments to search.
Also, this benchmark uses 3 terms, while some queries have many more. I'd still
need to test what happens at higher query concurrency, but this looks promising
to me.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]