Hello everyone, I solved the issue with my writer. Now everything works, including HDFS file reads and writes. I also wrote a Parquet-to-Arrow converter (on HDFS) that works fine.
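For anyone hitting the same HDFS-channel gap discussed later in this thread: any OutputStream (including Hadoop's FSDataOutputStream) can be adapted to the WritableByteChannel that ArrowFileWriter expects via java.nio.channels.Channels. A minimal stdlib-only sketch (the class name and the ByteArrayOutputStream stand-in are illustrative, not from the poster's code):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ChannelAdapter {
    // Wrap any OutputStream (e.g. HDFS's FSDataOutputStream) as a
    // WritableByteChannel, the type ArrowFileWriter's constructor expects.
    static WritableByteChannel toChannel(OutputStream out) {
        return Channels.newChannel(out);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an HDFS stream in this self-contained demo.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel channel = toChannel(sink)) {
            channel.write(ByteBuffer.wrap("arrow".getBytes("UTF-8")));
        }
        System.out.println(sink.toString("UTF-8")); // prints arrow
    }
}
```

The same trick works for reads with Channels.newChannel(InputStream), though ArrowFileReader additionally needs a seekable channel, so a small custom SeekableByteChannel wrapper over FSDataInputStream is still required there.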
I noticed that the Arrow javadocs are still at the 0.7 release. Can someone please update them?

FWIW: I wrote a blog post about how to read and write Arrow files in Java -
https://github.com/animeshtrivedi/blog/blob/master/post/2017-12-26-arrow.md
The corresponding code is at https://github.com/animeshtrivedi/ArrowExample

Thanks,
--
Animesh

On Wed, Dec 20, 2017 at 4:35 PM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:

> I think the null pointer exception happens due to some issue in my new
> writer (which uses my implementation of the WritableByteChannel
> interface)... let me narrow it down first.
>
> The basic code, which does not use my writer's implementation, seems to
> work. That is the code on GitHub; I have not pushed the new writer
> implementation yet.
>
> Thanks
> --
> Animesh
>
> On 20 Dec 2017 14:51, "Animesh Trivedi" <animesh.triv...@gmail.com> wrote:
>
> Wes, Emilio, Siddharth - many thanks for the helpful replies and comments!
>
> I managed to upgrade the code to the 0.8 API. I have to say that the 0.8
> API is much more intuitive ;) I will summarize my code example with some
> documentation in a blog post soon (and post it here too).
>
> - Is there first-class support for reading/writing files on HDFS? HDFS's
> FSData[Output/Input]Stream classes do not implement the
> [Read/Writ]ableByteChannel interfaces required to instantiate the Arrow
> file readers and writers. I have already implemented something that works
> for me, but would it not make sense to have these facilities as utilities
> in the Arrow code?
>
> My example code runs fine on a small example of 10 rows spread over
> multiple batches, but it fails to read anything larger. I have not
> verified whether it worked on 0.7, or at what row count it starts to
> fail. The writes look fine as far as I can tell. For example, when I
> write and then read TPC-DS data (the store_sales table with ints, longs,
> and doubles), I get
>
> [...]
> Reading the arrow file : ./store_sales.arrow
> File size : 3965838890 schema is Schema<ss_sold_date_sk: Int(32, true),
> ss_sold_time_sk: Int(32, true), ss_item_sk: Int(32, true),
> ss_customer_sk: Int(32, true), ss_cdemo_sk: Int(32, true),
> ss_hdemo_sk: Int(32, true), ss_addr_sk: Int(32, true),
> ss_store_sk: Int(32, true), ss_promo_sk: Int(32, true),
> ss_ticket_number: Int(64, true), ss_quantity: Int(32, true),
> ss_wholesale_cost: FloatingPoint(DOUBLE), ss_list_price: FloatingPoint(DOUBLE),
> ss_sales_price: FloatingPoint(DOUBLE), ss_ext_discount_amt: FloatingPoint(DOUBLE),
> ss_ext_sales_price: FloatingPoint(DOUBLE), ss_ext_wholesale_cost: FloatingPoint(DOUBLE),
> ss_ext_list_price: FloatingPoint(DOUBLE), ss_ext_tax: FloatingPoint(DOUBLE),
> ss_coupon_amt: FloatingPoint(DOUBLE), ss_net_paid: FloatingPoint(DOUBLE),
> ss_net_paid_inc_tax: FloatingPoint(DOUBLE), ss_net_profit: FloatingPoint(DOUBLE)>
> Number of arrow blocks are 19
> java.lang.NullPointerException
>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:256)
>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:242)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.readRecordBatch(ArrowFileReader.java:162)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:113)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:139)
>     at com.github.animeshtrivedi.arrowexample.ArrowRead.makeRead(ArrowRead.java:82)
>     at com.github.animeshtrivedi.arrowexample.ArrowRead.main(ArrowRead.java:217)
>
> Some context: the file size is 3965838890 bytes, and the schema read from
> the file is correct.
> The code where it fails is doing something like:
>
>     System.out.println("File size : " + arrowFile.length() +
>         " schema is " + root.getSchema().toString());
>     List<ArrowBlock> arrowBlocks = arrowFileReader.getRecordBlocks();
>     System.out.println("Number of arrow blocks are " + arrowBlocks.size());
>     for (int i = 0; i < arrowBlocks.size(); i++) {
>         ArrowBlock rbBlock = arrowBlocks.get(i);
>         if (!arrowFileReader.loadRecordBatch(rbBlock)) {
>             throw new IOException("Expected to read record batch");
>         }
>
> The stack trace comes from here:
> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowRead.java#L82
>
> Any idea what might be happening?
>
> Thanks,
> --
> Animesh
>
> On Tue, Dec 19, 2017 at 7:03 PM, Siddharth Teotia <siddha...@dremio.com> wrote:
>
>> From Arrow 0.8 onwards, the second step ("grab the corresponding mutator
>> and accessor objects by calls to getMutator(), getAccessor()") is not
>> needed. In fact, it is not even there.
>>
>> On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <siddha...@dremio.com> wrote:
>>
>> > Hi Animesh,
>> >
>> > Firstly, I would suggest switching over to the Arrow 0.8 release asap,
>> > since you are writing Java programs and the API usage has changed
>> > drastically. The new APIs are much simpler, with good javadocs and
>> > detailed internal comments.
>> >
>> > If you are writing a stop-gap implementation then it is probably fine
>> > to continue with the old version, but for the long term the new API
>> > usage is recommended.
>> >
>> > - Create an instance of the vector. Note that this doesn't allocate
>> >   any memory for the elements in the vector.
>> > - Grab the corresponding mutator and accessor objects via getMutator()
>> >   and getAccessor(). (Not needed from Arrow 0.8 onwards.)
>> > - Allocate memory:
>> >   - *allocateNew()* - allocates memory for a default number of
>> >     elements in the vector. This is applicable to both fixed-width
>> >     and variable-width vectors.
>> >   - *allocateNew(valueCount)* - for fixed-width vectors. Use this
>> >     method if you already know the number of elements to store in the
>> >     vector.
>> >   - *allocateNew(bytes, valueCount)* - for variable-width vectors.
>> >     Use this method if you already know the total size (in bytes) of
>> >     all the variable-width elements you will be storing in the vector.
>> >     For example, if you are going to store 1024 elements in the vector
>> >     and the total size across all variable-width elements is under 1MB,
>> >     you can call allocateNew(1024*1024, 1024).
>> > - Populate the vector:
>> >   - Use the *set() or setSafe()* APIs in the mutator interface. From
>> >     Arrow 0.8 onwards, you can use these APIs directly on the vector
>> >     instance; the mutator/accessor are removed.
>> >   - The difference between set() and the corresponding setSafe() API
>> >     is that the latter internally takes care of expanding the vector's
>> >     buffer(s) to store the new data.
>> >   - Each set() API has a corresponding setSafe() API.
>> > - Call setValueCount() with the number of elements you populated in
>> >   the vector.
>> > - Retrieve elements from the vector:
>> >   - Use the get() and getObject() APIs in the accessor interface.
>> >     Again, from Arrow 0.8 onwards you can use these APIs directly on
>> >     the vector.
>> > - With respect to the usage of setInitialCapacity():
>> >   - Say your application always calls allocateNew(). This is likely to
>> >     over-allocate memory, because allocateNew() assumes a default
>> >     value count to begin with.
>> >   - In this case, if you call setInitialCapacity() followed by
>> >     allocateNew(), the latter skips the default allocation and
>> >     allocates exactly the value capacity you specified in
>> >     setInitialCapacity().
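The steps above can be sketched end-to-end. This is a minimal sketch against the post-refactor API where set/get live on the vector itself; it uses the class name IntVector from later releases (in 0.8 the equivalent class was NullableIntVector), and the vector name "ss_quantity" is just an illustrative label:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class VectorLifecycle {
    public static void main(String[] args) {
        // The allocator owns all off-heap memory; vectors are closed before it.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("ss_quantity", allocator)) {
            vector.setInitialCapacity(1024); // hint, so allocateNew() skips the default sizing
            vector.allocateNew();            // allocate buffers for the hinted capacity
            for (int i = 0; i < 1024; i++) {
                vector.setSafe(i, i * 2);    // setSafe() grows buffers if needed; set() does not
            }
            vector.setValueCount(1024);      // must be called before reading values back
            System.out.println(vector.get(10)); // prints 20
        }
    }
}
```

The try-with-resources ordering matters: the vector's buffers must be released before the allocator is closed, or the allocator will report a leak.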
>> >
>> > I would highly recommend taking a look at
>> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
>> > It has lots of examples of populating a vector, retrieving from a
>> > vector, using setInitialCapacity(), using the set() and setSafe()
>> > methods, and combinations of them, to help understand when things can
>> > go wrong.
>> >
>> > Hopefully this helps. Meanwhile, we will try to add an internal README
>> > on the usage of vectors.
>> >
>> > Thanks,
>> > Siddharth
>> >
>> > On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
>> >
>> >> This has probably changed with the Java code refactor, but I've posted
>> >> some answers inline, to the best of my understanding.
>> >>
>> >> Thanks,
>> >>
>> >> Emilio
>> >>
>> >> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>> >>
>> >>> Thanks Wes for your help.
>> >>>
>> >>> Based on some code reading, I managed to code up a basic working
>> >>> example. The code is here:
>> >>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
>> >>>
>> >>> However, I do have some questions about the concepts in Arrow.
>> >>>
>> >>> 1. ArrowBlock is the unit of reading/writing; one ArrowBlock is
>> >>> essentially the amount of data one must hold in memory at a time. Is
>> >>> my understanding correct?
>> >>>
>> >> yes
>> >>
>> >>> 2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor
>> >>> classes in the ValueVector interface - both are implemented by all
>> >>> supported data types. What is the relationship between the two, and
>> >>> when is one supposed to be used over the other? I only use the
>> >>> Mutator/Accessor classes in my code.
>> >>>
>> >> The writer/reader interfaces are parallel implementations that make
>> >> some things easier, but they don't cover all available functionality
>> >> (for example, fixed-size lists, nested lists, some dictionary
>> >> operations, etc.). However, you should be able to accomplish everything
>> >> using mutators/accessors.
>> >>
>> >>> 3. What are the "safe" variant functions in the Mutator's code? I
>> >>> could not understand what they are meant to achieve.
>> >>>
>> >> The safe methods ensure that the vector is large enough to set the
>> >> value. You can use the unsafe versions if you know that your vector has
>> >> already allocated enough space for your data.
>> >>
>> >>> 4. What are MinorTypes?
>> >>>
>> >> Minor types are a representation of the different vector types. I
>> >> believe they are being de-emphasized in favor of FieldTypes, as minor
>> >> types don't contain enough information to represent all vectors.
>> >>
>> >>> 5. For a writer, what is a dictionary provider? For example, in the
>> >>> Integration.java code, the reader is given as the dictionary provider
>> >>> for the writer. But is it something more than just:
>> >>>
>> >>>     DictionaryProvider.MapDictionaryProvider provider =
>> >>>         new DictionaryProvider.MapDictionaryProvider();
>> >>>     ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider,
>> >>>         fileOutputStream.getChannel());
>> >>>
>> >> The dictionary provider is an interface for looking up dictionary
>> >> values. When reading a file, the reader itself has already read the
>> >> dictionaries and thus serves as the provider.
>> >>
>> >>> 6. I am not entirely sure about the sequence of calls one needs to
>> >>> make to write via mutators.
>> >>> For example, if I code something like
>> >>>
>> >>>     NullableIntVector intVector = (NullableIntVector) fieldVector;
>> >>>     NullableIntVector.Mutator mutator = intVector.getMutator();
>> >>>     [.write num values]
>> >>>     mutator.setValueCount(num);
>> >>>
>> >>> then this works for primitive types, but not for the VarBinary type.
>> >>> There I have to set the capacity first:
>> >>>
>> >>>     NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) fieldVector;
>> >>>     varBinaryVector.setInitialCapacity(items);
>> >>>     varBinaryVector.allocateNew();
>> >>>     NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
>> >>>
>> >> The method calls are not very well documented - I would suggest looking
>> >> at the reader/writer implementations to see which calls are required
>> >> for which vector types. Generally, variable-length vectors (lists, var
>> >> binary, etc.) behave differently than fixed-width vectors (ints, longs,
>> >> etc.).
>> >>
>> >>> Examples of these are here:
>> >>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
>> >>> (the writeField[???] functions).
>> >>>
>> >>> Thank you very much,
>> >>> --
>> >>> Animesh
>> >>>
>> >>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>>
>> >>>> hi Animesh,
>> >>>>
>> >>>> I suggest you try the ArrowStreamReader/Writer or
>> >>>> ArrowFileReader/Writer classes. See
>> >>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
>> >>>> for example working code for this.
>> >>>>
>> >>>> - Wes
>> >>>>
>> >>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi
>> >>>> <animesh.triv...@gmail.com> wrote:
>> >>>>
>> >>>>> Hi all,
>> >>>>>
>> >>>>> It might be a trivial question, so please let me know if I am
>> >>>>> missing something.
>> >>>>>
>> >>>>> I am trying to write and read files in the Arrow format in Java.
>> >>>>> My data is a simple flat schema with primitive types, and I already
>> >>>>> have the data in Java. So my questions are:
>> >>>>>
>> >>>>> 1. Is this possible, or am I fundamentally missing something about
>> >>>>> what Arrow can or cannot do (or is designed to do)? I assume that an
>> >>>>> efficient in-memory columnar data format should work with files too.
>> >>>>> 2. Can you point me to a working example, or a starting example?
>> >>>>> Intuitively, I am looking for a way to define a schema and
>> >>>>> write/read column vectors to/from files, as one does with Parquet
>> >>>>> or ORC.
>> >>>>>
>> >>>>> I tried to locate some working examples using the
>> >>>>> ArrowFile[Reader/Writer] classes in the Maven tests, but so far I am
>> >>>>> not sure where to start.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> --
>> >>>>> Animesh
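For readers landing on this thread looking for the working example the original question asks for, here is a minimal file write/read round trip. It is a sketch against a recent Arrow Java API: class names such as IntVector postdate the 0.8 release discussed above (where it was NullableIntVector), and details may differ across versions:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Collections;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class ArrowRoundTrip {
    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("example", ".arrow");
        // Define a one-column schema: a nullable 32-bit signed int.
        Schema schema = new Schema(Collections.singletonList(
            Field.nullable("ints", new ArrowType.Int(32, true))));

        // Write one record batch of 10 rows.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            IntVector vector = (IntVector) root.getVector("ints");
            vector.allocateNew(10);
            for (int i = 0; i < 10; i++) vector.setSafe(i, i);
            root.setRowCount(10);
            try (FileOutputStream out = new FileOutputStream(file);
                 ArrowFileWriter writer = new ArrowFileWriter(root, null, out.getChannel())) {
                writer.start();
                writer.writeBatch();
                writer.end();
            }
        }

        // Read the batches back; the file reader needs a seekable channel.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileInputStream in = new FileInputStream(file);
             ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {
                System.out.println(root.getRowCount()); // prints 10
            }
        }
    }
}
```

The null second argument to ArrowFileWriter is the dictionary provider; as Emilio notes above, it is only needed when dictionary-encoded vectors are written.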