I have created these as a first step. Animesh, feel free to submit PRs for these. I will look into your microbenchmarks soon.
1. ARROW-3497 [Java] Add user documentation for achieving better performance <https://jira.apache.org/jira/browse/ARROW-3497>
2. ARROW-3496 [Java] Add microbenchmark code to Java <https://jira.apache.org/jira/browse/ARROW-3496>
3. ARROW-3495 [Java] Optimize bit operations performance <https://jira.apache.org/jira/browse/ARROW-3495>
4. ARROW-3493 [Java] Document BOUNDS_CHECKING_ENABLED <https://jira.apache.org/jira/browse/ARROW-3493>

On Thu, Oct 11, 2018 at 10:00 AM Li Jin <ice.xell...@gmail.com> wrote:

> Hi Wes and Animesh,
>
> Thanks for the analysis and discussion. I am happy to look into this. I will create some JIRAs soon.
>
> Li
>
> On Thu, Oct 11, 2018 at 5:49 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hey Animesh,
>>
>> Thank you for doing this analysis. If you'd like to share some of the analysis more broadly, e.g. on the Apache Arrow blog or social media, let us know.
>>
>> Seems like there might be a few follow-ups here for the Arrow Java community:
>>
>> * Documentation about achieving better performance
>> * Writing some microperformance benchmarks
>> * Making some improvements to the code to facilitate better performance
>>
>> Feel free to create some JIRA issues. Are any Java developers interested in digging a little more into these issues?
>>
>> Thanks,
>> Wes
>>
>> On Tue, Oct 9, 2018 at 7:18 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> >
>> > Hi Wes and all,
>> >
>> > Here is another round of updates.
>> >
>> > Quick recap: previously we established that for 1kB binary blobs, Arrow can deliver > 160 Gbps from in-memory buffers.
>> >
>> > In this round I looked at the performance of materializing integers. In my benchmarks, I found that with careful optimization/code rewriting we can push the performance of integer reading from 5.42 Gbps/core to 13.61 Gbps/core (~2.5x). The peak performance with 16 cores scales up to 110+ Gbps. The key things to do are:
>> >
>> > 1) Disable memory access checks in Arrow and Netty buffers. This gave a significant performance boost. However, for such an important performance flag, it is very poorly documented ("drill.enable_unsafe_memory_access=true").
>> >
>> > 2) Materialize values from the validity and value direct buffers instead of calling the getInt() function on the IntVector. This is implemented as a new unsafe reader type (https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L31).
>> >
>> > 3) Optimize the bitmap operation that checks whether a bit is set (https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L23).
>> >
>> > A detailed write-up of these steps is available here: https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-09-arrow-int.md
>> >
>> > I have 2 follow-up questions:
>> >
>> > 1) Regarding the `isSet` function, why does it have to calculate the number of bits set (https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseFixedWidthVector.java#L797)? Wouldn't just checking whether the result of the AND operation is zero or not be sufficient? Like what I did: https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/ArrowReaderUnsafe.java#L28
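(For reference, a minimal sketch of the zero/non-zero check being proposed here, written against a plain byte[] validity bitmap rather than Arrow's own classes; the helper name and signature are illustrative only.)

    // Illustrative only: test one validity bit with a single AND.
    // A non-zero result already answers "is this slot set?"; there is
    // no need to count how many bits survive the mask.
    static boolean isBitSet(byte[] validityBitmap, int index) {
      int byteIndex = index >>> 3;      // byte that holds the bit
      int bitMask = 1 << (index & 7);   // position of the bit within that byte
      return (validityBitmap[byteIndex] & bitMask) != 0;
    }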
>> >
>> > 2) What is the reason behind the bitmap generation optimization here: https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BitVectorHelper.java#L179? At the point when this function is called, the bitmap vector has already been read from storage and contains the right values (either all null, all set, or whatever). Generating this mask here for the special cases where the values are all NULL or all set (which was the case in my benchmark) can be slower than just returning what was read from storage.
>> >
>> > Collectively, optimizing these two bitmap operations gives more than 1 Gbps of gains in my benchmarking code.
>> >
>> > Cheers,
>> > --
>> > Animesh
>> >
>> > On Thu, Oct 4, 2018 at 12:52 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > > See e.g.
>> > >
>> > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/ipc-read-write-test.cc#L222
>> > >
>> > > On Thu, Oct 4, 2018 at 6:48 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> > > >
>> > > > Primarily, to write in C++ the same microbenchmark that I have in Java, for table reading and value materialization. So, just an example of equivalent ArrowFileReader code in C++. Unit tests are a good starting point, thanks for the tip :)
>> > > >
>> > > > On Thu, Oct 4, 2018 at 12:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> > > >
>> > > > > > 3. Are there examples of Arrow C++ read/write code that I can have a look at?
>> > > > >
>> > > > > What kind of code are you looking for? I would direct you to relevant unit tests that exhibit certain functionality, but it depends on what you are trying to do.
>> > > > >
>> > > > > On Wed, Oct 3, 2018 at 9:45 AM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> > > > > >
>> > > > > > Hi all - a quick update on the performance investigation:
>> > > > > >
>> > > > > > - I spent some time looking at the performance profile for a binary blob column (1024 bytes of byte[]) and found a few favorable settings for delivering up to 168 Gbps from an in-memory reading benchmark on 16 cores. These settings (NUMA, JVM settings, Arrow holder API, batch size, etc.) are documented here: https://github.com/animeshtrivedi/blog/blob/master/post/2018-10-03-arrow-binary.md
>> > > > > > - These settings also help to improve the last number I reported (but not by much) for the in-memory TPC-DS store_sales table, from ~39 Gbps up to ~45-47 Gbps (note: this number is just an in-memory benchmark, i.e., w/o any networking or storage links).
>> > > > > >
>> > > > > > A few follow-up questions that I have:
>> > > > > > 1. Arrow reads a batch size worth of data in one go. Are there any recommended batch sizes? In my investigation, small batch sizes help with a better cache profile but increase the number of instructions required (more looping). Larger ones do the opposite.
>> > > > > > Somehow, ~10 MB/thread seems to be the best-performing configuration, which is also a bit counterintuitive, as for 16 threads this leads to a 160 MB memory footprint. Maybe this is also tied to the memory management logic, which is my next question.
>> > > > > > 2. Arrow uses Netty's memory manager. (i) What are decent Netty memory management settings for the "io.netty.allocator.*" parameters? I have not found any decent write-up on them. (ii) Is there a provision for an ArrowBuf to be re-used once a batch is consumed? As it looks now, each read allocates a new buffer for the whole batch.
>> > > > > > 3. Are there examples of Arrow C++ read/write code that I can have a look at?
>> > > > > >
>> > > > > > Cheers,
>> > > > > > --
>> > > > > > Animesh
>> > > > > >
>> > > > > > On Wed, Sep 19, 2018 at 8:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> > > > > >
>> > > > > > > On Wed, Sep 19, 2018 at 2:13 PM Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > Hi Johan, Wes, and Jacques - many thanks for your comments:
>> > > > > > > >
>> > > > > > > > @Johan -
>> > > > > > > > 1. I also do not suspect that there is any inherent drawback in Java or C++ due to the Arrow format. I mentioned C++ because Wes pointed out that the Java routines are not the most optimized ones (yet!). And naturally one would expect better performance in a native language with all the pointer/memory/SIMD instruction optimizations that you mentioned. As far as I know, the off-heap buffers are managed in ArrowBuf, which implements an abstract Netty class. But there is nothing unusual, i.e., Netty-specific, about these unsafe routines; they are used by many projects. Though there is a cost associated with materializing on-heap Java values from off-heap memory regions. I need to benchmark that more carefully.
>> > > > > > > >
>> > > > > > > > 2. When you say "I've so far always been able to get similar performance numbers" - do you mean the same performance as my case 3, where 16 cores drive close to 40 Gbps, or the same performance between your C++ and Java benchmarks? Do you have some write-up? I would be interested to read up :)
>> > > > > > > >
>> > > > > > > > 3. "Can you get to 100 Gbps starting from primitive arrays in Java" -> that is a good idea. Let me try and report back.
>> > > > > > > >
>> > > > > > > > @Wes -
>> > > > > > > >
>> > > > > > > > Is there some benchmark template for C++ routines I can have a look at? I would be happy to get some input from Java-Arrow experts on how to write these benchmarks more efficiently. I will have a closer look at the JIRA tickets that you mentioned.
>> > > > > > > > >> > > > > > > > So, for now I am focusing on the case 3, which is about >> > > establishing >> > > > > > > > performance when reading from a local in-memory I/O stream >> that I >> > > > > > > > implemented ( >> > > > > > > > >> > > > > > > >> > > > > >> > > >> https://github.com/animeshtrivedi/benchmarking-arrow/blob/master/src/main/java/com/github/animeshtrivedi/benchmark/MemoryIOChannel.java >> > > > > > > ). >> > > > > > > > In this case I first read data from parquet files, convert >> them >> > > into >> > > > > > > Arrow, >> > > > > > > > and write-out to a MemoryIOChannel, and then read back from >> it. >> > > So, >> > > > > the >> > > > > > > > performance has nothing to do with Crail or HDFS in the >> case 3. >> > > > > Once, I >> > > > > > > > establish the base performance in this setup (which is >> around ~40 >> > > > > Gbps >> > > > > > > with >> > > > > > > > 16 cores) I will add Crail to the mix. Perhaps Crail I/O >> streams >> > > can >> > > > > take >> > > > > > > > ArrowBuf as src/dst buffers. That should be doable. >> > > > > > > >> > > > > > > Right, in any case what you are testing is the performance of >> > > using a >> > > > > > > particular Java accessor layer to JVM off-heap Arrow memory >> to sum >> > > the >> > > > > > > non-null values of each column. I'm not sure that a single >> > > bandwidth >> > > > > > > number produced by this benchmark is very informative for >> people >> > > > > > > contemplating what memory format to use in their system due >> to the >> > > > > > > current state of the implementation (Java) and workload >> measured >> > > > > > > (summing the non-null values with a naive algorithm). I would >> guess >> > > > > > > that a C++ version with raw pointers and a loop-unrolled, >> > > branch-free >> > > > > > > vectorized sum is going to be a lot faster. >> > > > > > > >> > > > > > > > >> > > > > > > > @Jacques - >> > > > > > > > >> > > > > > > > That is a good point that "Arrow's implementation is more >> > > focused on >> > > > > > > > interacting with the structure than transporting it". >> However, >> > > in any >> > > > > > > > distributed system one needs to move data/structure around >> - as >> > > far >> > > > > as I >> > > > > > > > understand that is another goal of the project. My >> investigation >> > > > > started >> > > > > > > > within the context of Spark/SQL data processing. Spark >> converts >> > > > > incoming >> > > > > > > > data into its own in-memory UnsafeRow representation for >> > > processing. >> > > > > So >> > > > > > > > naturally the performance of this data ingestion pipeline >> cannot >> > > > > > > outperform >> > > > > > > > the read performance of the used file format. I benchmarked >> > > Parquet, >> > > > > ORC, >> > > > > > > > Avro, JSON (for the specific TPC-DS store_sales table). And >> then >> > > > > > > curiously >> > > > > > > > benchmarked Arrow as well because its design choices are a >> better >> > > > > fit for >> > > > > > > > modern high-performance RDMA/NVMe/100+Gbps devices I am >> > > > > investigating. >> > > > > > > From >> > > > > > > > this point of view, I am trying to find out can Arrow be >> the file >> > > > > format >> > > > > > > > for the next generation of storage/networking devices (see >> Apache >> > > > > Crail >> > > > > > > > project) delivering close to the hardware speed >> reading/writing >> > > > > rates. 
>> > > > > > > > As Wes pointed out, a C++ library implementation should be memory-IO bound, so what would it take to deliver the same performance in Java ;) (and then, from across the network).
>> > > > > > > >
>> > > > > > > > I hope this makes sense.
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > > --
>> > > > > > > > Animesh
>> > > > > > > >
>> > > > > > > > On Wed, Sep 19, 2018 at 6:28 PM Jacques Nadeau <jacq...@apache.org> wrote:
>> > > > > > > >
>> > > > > > > > > My big question is: what is the use case, and how/what are you trying to compare? Arrow's implementation is more focused on interacting with the structure than transporting it. Generally speaking, when we're working with Arrow data we frequently are just interacting with memory locations and doing direct operations. If you have a layer that supports that type of semantic, create a movement technique that depends on that. Arrow doesn't force a particular API, since the data itself is defined by its in-memory layout, so if you have a custom use or pattern, just work with the in-memory structures.
>> > > > > > > > >
>> > > > > > > > > On Wed, Sep 19, 2018 at 7:49 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > hi Animesh,
>> > > > > > > > > >
>> > > > > > > > > > Per Johan's comments, the C++ library is essentially going to be IO/memory bandwidth bound since you're interacting with raw pointers.
>> > > > > > > > > >
>> > > > > > > > > > I'm looking at your code:
>> > > > > > > > > >
>> > > > > > > > > > private void consumeFloat4(FieldVector fv) {
>> > > > > > > > > >   Float4Vector accessor = (Float4Vector) fv;
>> > > > > > > > > >   int valCount = accessor.getValueCount();
>> > > > > > > > > >   for (int i = 0; i < valCount; i++) {
>> > > > > > > > > >     if (!accessor.isNull(i)) {
>> > > > > > > > > >       float4Count += 1;
>> > > > > > > > > >       checksum += accessor.get(i);
>> > > > > > > > > >     }
>> > > > > > > > > >   }
>> > > > > > > > > > }
>> > > > > > > > > >
>> > > > > > > > > > You'll want to get a Java-Arrow expert from Dremio to advise you on the fastest way to iterate over this data -- my understanding is that much code in Dremio interacts with the wrapped Netty ArrowBuf objects rather than going through the higher-level APIs. You're also dropping performance because memory mapping is not yet implemented in Java; see https://issues.apache.org/jira/browse/ARROW-3191.
>> > > > > > > > > >
>> > > > > > > > > > Furthermore, the IPC reader class you are using could be made more efficient.
I described the problem in >> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-3192 -- >> this >> > > will be >> > > > > > > > > > required as soon as we have the ability to do memory >> mapping >> > > in >> > > > > Java >> > > > > > > > > > >> > > > > > > > > > Could Crail use the Arrow data structures in its runtime >> > > rather >> > > > > than >> > > > > > > > > > copying? If not, how are Crail's runtime data structures >> > > > > different? >> > > > > > > > > > >> > > > > > > > > > - Wes >> > > > > > > > > > On Wed, Sep 19, 2018 at 9:19 AM Johan Peltenburg - EWI >> > > > > > > > > > <j.w.peltenb...@tudelft.nl> wrote: >> > > > > > > > > > > >> > > > > > > > > > > Hello Animesh, >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > I browsed a bit in your sources, thanks for sharing. >> We >> > > have >> > > > > > > performed >> > > > > > > > > > some similar measurements to your third case in the >> past for >> > > > > C/C++ on >> > > > > > > > > > collections of various basic types such as primitives >> and >> > > > > strings. >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > I can say that in terms of consuming data from the >> Arrow >> > > format >> > > > > > > versus >> > > > > > > > > > language native collections in C++, I've so far always >> been >> > > able >> > > > > to >> > > > > > > get >> > > > > > > > > > similar performance numbers (e.g. no drawbacks due to >> the >> > > Arrow >> > > > > > > format >> > > > > > > > > > itself). Especially when accessing the data through >> Arrow's >> > > raw >> > > > > data >> > > > > > > > > > pointers (and using for example std::string_view-like >> > > > > constructs). >> > > > > > > > > > > >> > > > > > > > > > > In C/C++ the fast data structures are engineered in >> such a >> > > way >> > > > > > > that as >> > > > > > > > > > little pointer traversals are required and they take up >> an as >> > > > > small >> > > > > > > as >> > > > > > > > > > possible memory footprint. Thus each memory access is >> > > relatively >> > > > > > > > > efficient >> > > > > > > > > > (in terms of obtaining the data of interest). The same >> can >> > > > > > > absolutely be >> > > > > > > > > > said for Arrow, if not even more efficient in some cases >> > > where >> > > > > object >> > > > > > > > > > fields are of variable length. >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > In the JVM case, the Arrow data is stored off-heap. >> This >> > > > > requires >> > > > > > > the >> > > > > > > > > > JVM to interface to it through some calls to Unsafe >> hidden >> > > under >> > > > > the >> > > > > > > > > Netty >> > > > > > > > > > layer (but please correct me if I'm wrong, I'm not an >> expert >> > > on >> > > > > > > this). >> > > > > > > > > > Those calls are the only reason I can think of that >> would >> > > > > degrade the >> > > > > > > > > > performance a bit compared to a pure JAva case. I don't >> know >> > > if >> > > > > the >> > > > > > > > > Unsafe >> > > > > > > > > > calls are inlined during JIT compilation. If they >> aren't they >> > > > > will >> > > > > > > > > increase >> > > > > > > > > > access latency to any data a little bit. 
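(As a rough, standalone illustration of the kind of call Johan is describing -- a sketch only; Arrow and Netty route off-heap access through their own wrappers, such as Netty's PlatformDependent, rather than grabbing Unsafe directly like this.)

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    // Sketch: the sort of raw off-heap read/write that ultimately sits
    // underneath the JVM-side ArrowBuf accessors.
    public class OffHeapReadSketch {
      private static final Unsafe UNSAFE = loadUnsafe();

      private static Unsafe loadUnsafe() {
        try {
          Field f = Unsafe.class.getDeclaredField("theUnsafe");
          f.setAccessible(true);
          return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
          throw new RuntimeException(e);
        }
      }

      public static void main(String[] args) {
        long addr = UNSAFE.allocateMemory(4); // off-heap allocation, no object header
        UNSAFE.putInt(addr, 42);              // write 4 bytes at the raw address
        int value = UNSAFE.getInt(addr);      // read them back, no bounds check
        System.out.println(value);
        UNSAFE.freeMemory(addr);
      }
    }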
>> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > I don't have a similar machine so it's not easy to >> relate >> > > my >> > > > > > > numbers to >> > > > > > > > > > yours, but if you can get that data consumed with 100 >> Gbps >> > > in a >> > > > > pure >> > > > > > > Java >> > > > > > > > > > case, I don't see any reason (resulting from Arrow >> format / >> > > > > off-heap >> > > > > > > > > > storage) why you wouldn't be able to get at least really >> > > close. >> > > > > Can >> > > > > > > you >> > > > > > > > > get >> > > > > > > > > > to 100 Gbps starting from primitive arrays in Java with >> your >> > > > > > > consumption >> > > > > > > > > > functions in the first place? >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > I'm interested to see your progress on this. >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > Kind regards, >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > Johan Peltenburg >> > > > > > > > > > > >> > > > > > > > > > > ________________________________ >> > > > > > > > > > > From: Animesh Trivedi <animesh.triv...@gmail.com> >> > > > > > > > > > > Sent: Wednesday, September 19, 2018 2:08:50 PM >> > > > > > > > > > > To: dev@arrow.apache.org; d...@crail.apache.org >> > > > > > > > > > > Subject: [JAVA] Arrow performance measurement >> > > > > > > > > > > >> > > > > > > > > > > Hi all, >> > > > > > > > > > > >> > > > > > > > > > > A week ago, Wes and I had a discussion about the >> > > performance >> > > > > of the >> > > > > > > > > > > Arrow/Java implementation on the Apache Crail >> (Incubating) >> > > > > mailing >> > > > > > > > > list ( >> > > > > > > > > > > >> > > > > > > >> > > >> http://mail-archives.apache.org/mod_mbox/crail-dev/201809.mbox/browser >> > > > > > > > > ). >> > > > > > > > > > In >> > > > > > > > > > > a nutshell: I am investigating the performance of >> various >> > > file >> > > > > > > formats >> > > > > > > > > > > (including Arrow) on high-performance NVMe and >> > > > > RDMA/100Gbps/RoCE >> > > > > > > > > setups. >> > > > > > > > > > I >> > > > > > > > > > > benchmarked how long does it take to materialize >> values >> > > (ints, >> > > > > > > longs, >> > > > > > > > > > > doubles) of the store_sales table, the largest table >> in the >> > > > > TPC-DS >> > > > > > > > > > dataset >> > > > > > > > > > > stored on different file formats. Here is a write-up >> on >> > > this - >> > > > > > > > > > > >> > > https://crail.incubator.apache.org/blog/2018/08/sql-p1.html. I >> > > > > > > found >> > > > > > > > > > that >> > > > > > > > > > > between a pair of machine connected over a 100 Gbps >> link, >> > > Arrow >> > > > > > > (using >> > > > > > > > > > as a >> > > > > > > > > > > file format on HDFS) delivered close to ~30 Gbps of >> > > bandwidth >> > > > > with >> > > > > > > all >> > > > > > > > > 16 >> > > > > > > > > > > cores engaged. Wes pointed out that (i) Arrow is >> in-memory >> > > IPC >> > > > > > > format, >> > > > > > > > > > and >> > > > > > > > > > > has not been optimized for storage interfaces/APIs >> like >> > > HDFS; >> > > > > (ii) >> > > > > > > the >> > > > > > > > > > > performance I am measuring is for the java >> implementation. >> > > > > > > > > > > >> > > > > > > > > > > Wes, I hope I summarized our discussion correctly. 
>> > > > > > > > > > > >> > > > > > > > > > > That brings us to this email where I promised to >> follow up >> > > on >> > > > > the >> > > > > > > Arrow >> > > > > > > > > > > mailing list to understand and optimize the >> performance of >> > > > > > > Arrow/Java >> > > > > > > > > > > implementation on high-performance devices. I wrote a >> small >> > > > > > > stand-alone >> > > > > > > > > > > benchmark ( >> > > > > https://github.com/animeshtrivedi/benchmarking-arrow) >> > > > > > > with >> > > > > > > > > > three >> > > > > > > > > > > implementations of WritableByteChannel, >> SeekableByteChannel >> > > > > > > interfaces: >> > > > > > > > > > > >> > > > > > > > > > > 1. Arrow data is stored in HDFS/tmpfs - this gives me >> ~30 >> > > Gbps >> > > > > > > > > > performance >> > > > > > > > > > > 2. Arrow data is stored in Crail/DRAM - this gives me >> > > ~35-36 >> > > > > Gbps >> > > > > > > > > > > performance >> > > > > > > > > > > 3. Arrow data is stored in on-heap byte[] - this >> gives me >> > > ~39 >> > > > > Gbps >> > > > > > > > > > > performance >> > > > > > > > > > > >> > > > > > > > > > > I think the order makes sense. To better understand >> the >> > > > > > > performance of >> > > > > > > > > > > Arrow/Java we can focus on the option 3. >> > > > > > > > > > > >> > > > > > > > > > > The key question I am trying to answer is "what would >> it >> > > take >> > > > > for >> > > > > > > > > > > Arrow/Java to deliver 100+ Gbps of performance"? Is it >> > > > > possible? If >> > > > > > > > > yes, >> > > > > > > > > > > then what is missing/or mis-interpreted by me? If >> not, then >> > > > > where >> > > > > > > is >> > > > > > > > > the >> > > > > > > > > > > performance lost? Does anyone have any performance >> > > measurements >> > > > > > > for C++ >> > > > > > > > > > > implementation? if they have seen/expect better >> numbers. >> > > > > > > > > > > >> > > > > > > > > > > As a next step, I will profile the read path of >> Arrow/Java >> > > for >> > > > > the >> > > > > > > > > option >> > > > > > > > > > > 3. I will report my findings. >> > > > > > > > > > > >> > > > > > > > > > > Any thoughts and feedback on this investigation are >> very >> > > > > welcome. >> > > > > > > > > > > >> > > > > > > > > > > Cheers, >> > > > > > > > > > > -- >> > > > > > > > > > > Animesh >> > > > > > > > > > > >> > > > > > > > > > > PS~ Cross-posting on the d...@crail.apache.org list as >> > > well. >> > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > >> > > >> >
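(For context on case 3 above: a minimal sketch of an on-heap, byte[]-backed read channel. This is a simplified stand-in, not the actual MemoryIOChannel from the benchmark repository; write support and error handling are omitted.)

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SeekableByteChannel;

    // Simplified stand-in for reading Arrow data back from an on-heap byte[].
    public class InMemoryReadChannel implements SeekableByteChannel {
      private final byte[] data;
      private long position = 0;
      private boolean open = true;

      public InMemoryReadChannel(byte[] data) { this.data = data; }

      @Override public int read(ByteBuffer dst) {
        if (position >= data.length) return -1;                        // end of stream
        int n = (int) Math.min(dst.remaining(), data.length - position);
        dst.put(data, (int) position, n);                              // copy into caller's buffer
        position += n;
        return n;
      }

      @Override public int write(ByteBuffer src) throws IOException {
        throw new IOException("read-only channel");
      }

      @Override public long position() { return position; }
      @Override public SeekableByteChannel position(long newPosition) { position = newPosition; return this; }
      @Override public long size() { return data.length; }
      @Override public SeekableByteChannel truncate(long size) { throw new UnsupportedOperationException(); }
      @Override public boolean isOpen() { return open; }
      @Override public void close() { open = false; }
    }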