[ https://issues.apache.org/jira/browse/LUCENE-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-2537:
-------------------------------

    Attachment: FileCopyTest.java

I wrote a test which compares the FileChannel API to intermediate buffer copies. 
The test runs each method 3 times and reports the best time of each. It can be 
run w/ different file and chunk sizes.
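
For reference, below is a rough sketch of the two approaches the test times. This 
is only an illustration of the idea -- the class and method names are made up, and 
it is not the code in the attached FileCopyTest.java:

{code}
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.FileChannel;

// Hypothetical helper class -- a sketch of the two copy methods being compared.
public class CopyMethods {

  // FileChannel approach: transfer directly between channels, one chunk at a time.
  static void channelCopy(File src, File dst, long chunkSize) throws IOException {
    FileInputStream fis = new FileInputStream(src);
    FileOutputStream fos = new FileOutputStream(dst);
    try {
      FileChannel in = fis.getChannel();
      FileChannel out = fos.getChannel();
      long pos = 0;
      long size = in.size();
      while (pos < size) {
        // transferFrom may copy fewer bytes than requested, so loop on the returned count
        pos += out.transferFrom(in, pos, Math.min(chunkSize, size - pos));
      }
    } finally {
      fis.close();
      fos.close();
    }
  }

  // Intermediate-buffer approach: read into a byte[] of chunkSize, then write it out.
  static void bufferCopy(File src, File dst, int chunkSize) throws IOException {
    InputStream in = new FileInputStream(src);
    OutputStream out = new FileOutputStream(dst);
    try {
      byte[] buf = new byte[chunkSize];
      int len;
      while ((len = in.read(buf)) != -1) {
        out.write(buf, 0, len);
      }
    } finally {
      in.close();
      out.close();
    }
  }
}
{code}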

Here are the results of copying a 1GB file using different chunk sizes (the 
chunk is used as the intermediate buffer size as well).

Machine spec:
* Linux, 64-bit (IBM) JVM
* 2xQuad (+hyper-threading) - 16 cores overall
* 16GB RAM
* SAS HD

||Chunk Size||FileChannel||Intermediate Buffer||Diff||
|64K|1865|1528|{color:red}-18%{color}|
|128K|1660|1526|{color:red}-9%{color}|
|512K|1514|1493|{color:red}-2%{color}|
|1M|1552|2072|{color:green}+33%{color}|
|2M|1488|1559|{color:green}+5%{color}|
|4M|1596|1831|{color:green}+13%{color}|
|16M|1563|1964|{color:green}+21%{color}|
|64M|1494|2442|{color:green}+39%{color}|
|128M|1469|2445|{color:green}+40%{color}|

For small buffer sizes, the intermediate byte[] copy is preferable. However, the 
FileChannel method performs pretty much consistently, regardless of the buffer 
size (except for the first run), while the byte[] approach degrades a lot as the 
buffer size increases.

I think, given these results, we can use the FileChannel method w/ a chunk size 
of 4 (or even 2) MB, to be on the safe side and not eat up too much RAM?
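
Concretely, the copy loop I have in mind looks something like the sketch below. 
This is only an illustration -- the ChunkedCopy/CHUNK_SIZE names and the 4 MB 
value are placeholders, not committed code:

{code}
import java.io.IOException;
import java.nio.channels.FileChannel;

// Sketch only: the class/constant names and the 4 MB default are made up.
final class ChunkedCopy {

  static final long CHUNK_SIZE = 1 << 22; // 4 MB

  // Copy all of 'input' into 'output' in CHUNK_SIZE pieces, advancing by the
  // number of bytes transferFrom() actually moved.
  static void copy(FileChannel input, FileChannel output) throws IOException {
    long pos = 0;
    long size = input.size();
    while (pos < size) {
      // clamp the count so the last request never exceeds the bytes remaining
      pos += output.transferFrom(input, pos, Math.min(CHUNK_SIZE, size - pos));
    }
  }
}
{code}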

> FSDirectory.copy() impl is unsafe
> ---------------------------------
>
>                 Key: LUCENE-2537
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2537
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Store
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>             Fix For: 3.1, 4.0
>
>         Attachments: FileCopyTest.java
>
>
> There are a couple of issues with it:
> # FileChannel.transferFrom documents that it may not copy the number of bytes 
> requested, however we don't check the return value. So we need to fix the code 
> to loop until all bytes have been copied.
> # When calling addIndexes() w/ very large segments (a few hundred MB in size), 
> I ran into the following exception (Java 1.6 -- Java 1.5's exception was 
> cryptic):
> {code}
> Exception in thread "main" java.io.IOException: Map failed
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:770)
>     at 
> sun.nio.ch.FileChannelImpl.transferToTrustedChannel(FileChannelImpl.java:450)
>     at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:523)
>     at org.apache.lucene.store.FSDirectory.copy(FSDirectory.java:450)
>     at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:3019)
> Caused by: java.lang.OutOfMemoryError: Map failed
>     at sun.nio.ch.FileChannelImpl.map0(Native Method)
>     at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:767)
>     ... 7 more
> {code}
> I changed the impl to something like this:
> {code}
> long numWritten = 0;
> long numToWrite = input.size();
> long bufSize = 1 << 26;
> while (numWritten < numToWrite) {
>   numWritten += output.transferFrom(input, numWritten, bufSize);
> }
> {code}
> And the code successfully adds the indexes. This code uses 64MB chunks, 
> however that might be too large for some applications, so we definitely need 
> a smaller one. The question is how small a chunk we can use without hurting 
> performance. It would be great if we could make it configurable, however since 
> that API is called by other APIs, such as addIndexes, I'm not sure it's easily 
> controllable.
> Also, I read somewhere (can't remember now where) that on Linux the native 
> impl is better and does copy in chunks. So perhaps we should make a 
> Linux-specific impl?
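
On the configurability and Linux questions raised in the description quoted 
above, a rough sketch of what the knobs might look like (every name and default 
here is hypothetical, not an existing Lucene API):

{code}
// Sketch only -- all names and defaults are made up for illustration.
public class CopyConfig {

  // Simple OS check, if a Linux-specific code path were ever wanted.
  public static final boolean LINUX =
      System.getProperty("os.name").toLowerCase().startsWith("linux");

  // Default chunk size used by the copy loop; callers could override it,
  // even though code paths like addIndexes() would just use the default.
  private long copyChunkSize = 1 << 22; // 4 MB

  public void setCopyChunkSize(long copyChunkSize) {
    this.copyChunkSize = copyChunkSize;
  }

  public long getCopyChunkSize() {
    return copyChunkSize;
  }
}
{code}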

