I'll take a look at updating the site docs today. Thanks for pointing this out!
On Wed, Dec 27, 2017 at 4:57 AM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:

> Hello everyone,
>
> I solved the issue with my writer. Now everything is working fine, including HDFS file reads and writes. I also wrote a Parquet-to-Arrow converter (on HDFS) that works fine.
>
> I noticed that the Arrow javadocs are still at the 0.7 release. Can someone please update them?
>
> FWIW: I wrote a blog post about how to read and write Arrow files in Java:
> https://github.com/animeshtrivedi/blog/blob/master/post/2017-12-26-arrow.md
>
> The corresponding code is at https://github.com/animeshtrivedi/ArrowExample
>
> Thanks,
> --
> Animesh
>
> On Wed, Dec 20, 2017 at 4:35 PM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>
>> I think the null pointer exception happens due to some issue in my new writer (which uses my implementation of the writable byte-channel interface)... let me narrow it down first.
>>
>> The basic code, which does not use my writer implementation, seems to work. That is the code on GitHub; I have not pushed the new writer implementation yet.
>>
>> Thanks
>> --
>> Animesh
>>
>> On 20 Dec 2017 14:51, "Animesh Trivedi" <animesh.triv...@gmail.com> wrote:
>>
>> Wes, Emilio, Siddharth - many thanks for the helpful replies and comments!
>>
>> I managed to upgrade the code to the 0.8 API. I have to say that the 0.8 API is much more intuitive ;) I will summarize my code example with some documentation in a blog post soon (and post it here too).
>>
>> - Is there first-class support for reading/writing files on HDFS? FSData[Output/Input]Stream from HDFS do not implement the [Read/Writ]ableByteChannel interfaces required to instantiate the Arrow file readers and writers. I have implemented something that works for me, but I wonder whether it would make sense to have these facilities as utilities in the Arrow code.
>>
>> However, my example code runs fine on a small example of 10 rows with multiple batches.
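[Editor's note] A JDK-only sketch of the channel shim Animesh mentions: `java.nio.channels.Channels.newChannel` can view any `OutputStream` (including Hadoop's `FSDataOutputStream`) as the `WritableByteChannel` that `ArrowFileWriter` expects. A `ByteArrayOutputStream` stands in for the HDFS stream here so the snippet runs without Hadoop. Note this covers only the write side: `ArrowFileReader` needs a seekable channel to find the file footer, so on the read side a small adapter over `FSDataInputStream.seek()` (like the one Animesh wrote) is still required.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class HdfsChannelSketch {

    // Any OutputStream (e.g. Hadoop's FSDataOutputStream) can be viewed as a
    // WritableByteChannel, which is what ArrowFileWriter's constructor takes.
    static WritableByteChannel asChannel(OutputStream out) {
        return Channels.newChannel(out);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an HDFS output stream, so this runs without Hadoop.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel ch = asChannel(sink)) {
            ch.write(ByteBuffer.wrap(new byte[]{'A', 'R', 'R', 'O', 'W'}));
        }
        System.out.println(sink.size() + " bytes written"); // 5 bytes written
    }
}
```

The same one-liner works in reverse for plain sequential reads (`Channels.newChannel(inputStream)`), just not for the random access the file reader needs.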
>> But it fails to read anything larger. I have not verified whether this worked on the 0.7 version, or at what row count it starts to fail. The writes are fine as far as I can tell. For example, I am writing and then reading TPC-DS data (the store_sales table with ints, longs, and doubles) and I get:
>>
>> [...]
>> Reading the arrow file : ./store_sales.arrow
>> File size : 3965838890 schema is Schema<ss_sold_date_sk: Int(32, true), ss_sold_time_sk: Int(32, true), ss_item_sk: Int(32, true),
>> ss_customer_sk: Int(32, true), ss_cdemo_sk: Int(32, true), ss_hdemo_sk: Int(32, true), ss_addr_sk: Int(32, true),
>> ss_store_sk: Int(32, true), ss_promo_sk: Int(32, true), ss_ticket_number: Int(64, true), ss_quantity: Int(32, true),
>> ss_wholesale_cost: FloatingPoint(DOUBLE), ss_list_price: FloatingPoint(DOUBLE), ss_sales_price: FloatingPoint(DOUBLE),
>> ss_ext_discount_amt: FloatingPoint(DOUBLE), ss_ext_sales_price: FloatingPoint(DOUBLE), ss_ext_wholesale_cost: FloatingPoint(DOUBLE),
>> ss_ext_list_price: FloatingPoint(DOUBLE), ss_ext_tax: FloatingPoint(DOUBLE), ss_coupon_amt: FloatingPoint(DOUBLE),
>> ss_net_paid: FloatingPoint(DOUBLE), ss_net_paid_inc_tax: FloatingPoint(DOUBLE), ss_net_profit: FloatingPoint(DOUBLE)>
>> Number of arrow blocks are 19
>> java.lang.NullPointerException
>>   at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:256)
>>   at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:242)
>>   at org.apache.arrow.vector.ipc.ArrowFileReader.readRecordBatch(ArrowFileReader.java:162)
>>   at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:113)
>>   at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:139)
>>   at com.github.animeshtrivedi.arrowexample.ArrowRead.makeRead(ArrowRead.java:82)
>>   at com.github.animeshtrivedi.arrowexample.ArrowRead.main(ArrowRead.java:217)
>>
>> Some context: the file size is 3965838890 bytes, and the schema read from the file is correct. The code where it fails is doing something like:
>>
>>     System.out.println("File size : " + arrowFile.length() + " schema is " + root.getSchema().toString());
>>     List<ArrowBlock> arrowBlocks = arrowFileReader.getRecordBlocks();
>>     System.out.println("Number of arrow blocks are " + arrowBlocks.size());
>>     for (int i = 0; i < arrowBlocks.size(); i++) {
>>         ArrowBlock rbBlock = arrowBlocks.get(i);
>>         if (!arrowFileReader.loadRecordBatch(rbBlock)) {
>>             throw new IOException("Expected to read record batch");
>>         }
>>
>> The stack trace comes from here:
>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowRead.java#L82
>>
>> Any idea what might be happening?
>>
>> Thanks,
>> --
>> Animesh
>>
>> On Tue, Dec 19, 2017 at 7:03 PM, Siddharth Teotia <siddha...@dremio.com> wrote:
>>
>>> From Arrow 0.8, the second step "Grab the corresponding mutator and accessor objects by calls to getMutator(), getAccessor()" is not needed. In fact, it is not even there.
>>>
>>> On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <siddha...@dremio.com> wrote:
>>>
>>> > Hi Animesh,
>>> >
>>> > Firstly, I would suggest switching over to the Arrow 0.8 release ASAP, since you are writing Java programs and the API usage has changed drastically. The new APIs are much simpler, with good javadocs and detailed internal comments.
>>> >
>>> > If you are writing a stop-gap implementation then it is probably fine to continue with the old version, but for the long term the new API usage is recommended.
>>> >
>>> > - Create an instance of the vector. Note that this doesn't allocate any memory for the elements in the vector.
>>> > - Grab the corresponding mutator and accessor objects by calls to getMutator(), getAccessor().
>>> > - Allocate memory:
>>> >   - *allocateNew()* - allocates memory for the default number of elements in the vector. This is applicable to both fixed-width and variable-width vectors.
>>> >   - *allocateNew(valueCount)* - for fixed-width vectors. Use this method if you already know the number of elements to store in the vector.
>>> >   - *allocateNew(bytes, valueCount)* - for variable-width vectors. Use this method if you already know the total size (in bytes) of all the variable-width elements you will be storing in the vector. For example, if you are going to store 1024 elements in the vector and the total size across all variable-width elements is under 1MB, you can call allocateNew(1024*1024, 1024).
>>> > - Populate the vector:
>>> >   - Use the *set() or setSafe()* APIs in the mutator interface. From Arrow 0.8 onwards, you can use these APIs directly on the vector instance; mutator/accessor are removed.
>>> >   - The difference between set() and the corresponding setSafe() API is that the latter internally takes care of expanding the vector's buffer(s) to store new data.
>>> >   - Each set() API has a corresponding setSafe() API.
>>> > - Do a setValueCount() based on the number of elements you populated in the vector.
>>> > - Retrieve elements from the vector:
>>> >   - Use the get() and getObject() APIs in the accessor interface. Again, from Arrow 0.8 onwards you can use these APIs directly on the vector.
>>> > - With respect to the usage of setInitialCapacity():
>>> >   - Say your application always calls allocateNew(). It is likely that this will over-allocate memory, because allocateNew() assumes a default value count to begin with.
>>> >   - In this case, if you do setInitialCapacity() followed by allocateNew(), then the latter doesn't do the default memory allocation. It allocates exactly the value capacity you specified in setInitialCapacity().
>>> >
>>> > I would highly recommend taking a look at
>>> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
>>> > It has lots of examples of populating a vector, retrieving from a vector, using setInitialCapacity(), using the set() and setSafe() methods, and combinations of them, to understand when things can go wrong.
>>> >
>>> > Hopefully this helps. Meanwhile, we will try to add an internal README on the usage of vectors.
>>> >
>>> > Thanks,
>>> > Siddharth
>>> >
>>> > On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
>>> >
>>> >> This has probably changed with the Java code refactor, but I've posted some answers inline, to the best of my understanding.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Emilio
>>> >>
>>> >> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>>> >>
>>> >>> Thanks Wes for your help.
>>> >>>
>>> >>> Based on some code reading, I managed to code up a basic working example. The code is here:
>>> >>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
>>> >>>
>>> >>> However, I do have some questions about the concepts in Arrow.
>>> >>>
>>> >>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock is essentially the amount of data one must hold in memory at a time. Is my understanding correct?
>>> >>
>>> >> yes
>>> >>
>>> >>> 2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor classes in the ValueVector interface - both are implemented by all supported data types. What is the relationship between these two? Or when is one supposed to be used over the other? I only use the Mutator/Accessor classes in my code.
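[Editor's note] Siddharth's recipe above, sketched as a runnable example. This assumes a post-0.8 Arrow Java release where set/get live directly on the vector; in 0.8 itself the class was named NullableIntVector rather than IntVector, so adjust names to your version.

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class VectorLifecycle {
    public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             // Step 1: create the vector; no element memory is allocated yet.
             IntVector vector = new IntVector("ints", allocator)) {

            // Step 2: allocate. setInitialCapacity() first, so allocateNew()
            // allocates exactly this capacity instead of the larger default.
            vector.setInitialCapacity(4);
            vector.allocateNew();

            // Step 3: populate. setSafe() grows the buffers if needed.
            for (int i = 0; i < 4; i++) {
                vector.setSafe(i, i * 10);
            }

            // Step 4: declare how many values were written.
            vector.setValueCount(4);

            // Step 5: retrieve.
            System.out.println(vector.getObject(3)); // 30
        }
    }
}
```

The try-with-resources blocks matter: vectors and allocators hold off-heap buffers that must be released explicitly.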
>>> >> The writer/reader interfaces are parallel implementations that make some things easier, but they don't encompass all available functionality (for example, fixed-size lists, nested lists, some dictionary operations, etc.). However, you should be able to accomplish everything using mutators/accessors.
>>> >>
>>> >>> 3. What are the "safe" variant functions in the Mutator's code? I could not understand what they are meant to achieve.
>>> >>
>>> >> The safe methods ensure that the vector is large enough to set the value. You can use the unsafe versions if you know that your vector has already allocated enough space for your data.
>>> >>
>>> >>> 4. What are MinorTypes?
>>> >>
>>> >> Minor types are a representation of the different vector types. I believe they are being de-emphasized in favor of FieldTypes, as minor types don't contain enough information to represent all vectors.
>>> >>
>>> >>> 5. For a writer, what is a dictionary provider? For example, in the Integration.java code the reader is given as the dictionary provider for the writer. But is it something more than just:
>>> >>>     DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
>>> >>>     ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider, fileOutputStream.getChannel());
>>> >>
>>> >> The dictionary provider is an interface for looking up dictionary values. When reading a file, the reader itself has already read the dictionaries and thus serves as the provider.
>>> >>
>>> >>> 6. I am not entirely sure about the sequence of calls one needs to make to write via mutators.
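[Editor's note] Emilio's answer to question 3 in a small sketch, again assuming a recent Arrow Java release (`IntVector` is the post-0.8 name): set() writes assuming capacity is already there, while setSafe() reallocates first when the index is beyond the current capacity.

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class SafeVsUnsafe {
    public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector v = new IntVector("v", allocator)) {
            v.allocateNew(4);     // room for 4 values

            v.set(0, 42);         // fine: index 0 is within the allocated capacity
            v.setSafe(1000, 7);   // also fine: setSafe() grows the buffers first
            // v.set(100000, 7);  // unsafe write past capacity: corrupts memory or
                                  // throws, depending on bounds-checking settings

            v.setValueCount(1001);
            System.out.println(v.get(1000)); // 7
        }
    }
}
```

So setSafe() everywhere is the forgiving default; set() is a micro-optimization for when the capacity is known up front.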
>>> >>> For example, if I code something like:
>>> >>>     NullableIntVector intVector = (NullableIntVector) fieldVector;
>>> >>>     NullableIntVector.Mutator mutator = intVector.getMutator();
>>> >>>     [...write num values...]
>>> >>>     mutator.setValueCount(num);
>>> >>> then this works for primitive types, but not for the VarBinary type. There I have to set the capacity first:
>>> >>>     NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) fieldVector;
>>> >>>     varBinaryVector.setInitialCapacity(items);
>>> >>>     varBinaryVector.allocateNew();
>>> >>>     NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
>>> >>
>>> >> The method calls are not very well documented - I would suggest looking at the reader/writer implementations to see what calls are required for which vector types. Generally, variable-length vectors (lists, var binary, etc.) behave differently than fixed-width vectors (ints, longs, etc.).
>>> >>
>>> >>> Examples of these are here:
>>> >>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
>>> >>> (the writeField[???] functions).
>>> >>>
>>> >>> Thank you very much,
>>> >>> --
>>> >>> Animesh
>>> >>>
>>> >>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> >>>
>>> >>>> hi Animesh,
>>> >>>>
>>> >>>> I suggest you try the ArrowStreamReader/Writer or ArrowFileReader/Writer classes. See
>>> >>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
>>> >>>> for example working code for this.
>>> >>>>
>>> >>>> - Wes
>>> >>>>
>>> >>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>>> >>>>
>>> >>>>> Hi all,
>>> >>>>>
>>> >>>>> It might be a trivial question, so please let me know if I am missing something.
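[Editor's note] The variable-width sequence Animesh describes, sketched against a recent Arrow Java release (where the mutator is gone and `VarBinaryVector` replaces `NullableVarBinaryVector`). For variable-width vectors, the two-argument `allocateNew(totalBytes, valueCount)` sizes both the data buffer and the offset buffer, which is why the fixed-width sequence alone is not enough.

```java
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarBinaryVector;

public class VarBinarySketch {
    public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VarBinaryVector vector = new VarBinaryVector("bytes", allocator)) {

            // Variable-width allocation: total data bytes AND value count,
            // so the offsets buffer gets sized too.
            vector.allocateNew(1024, 2);

            byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
            vector.setSafe(0, payload, 0, payload.length);
            vector.setNull(1);            // explicit null at index 1

            vector.setValueCount(2);      // still required, as for fixed-width
            System.out.println(vector.isNull(1)); // true
        }
    }
}
```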
>>> >>>>>
>>> >>>>> I am trying to write and read files in the Arrow format in Java. My data is a simple flat schema with primitive types, and I already have the data in Java. So my questions are:
>>> >>>>> 1. Is this possible, or am I fundamentally missing something about what Arrow can or cannot do (or is designed to do)? I assume that an efficient in-memory columnar data format should work with files too.
>>> >>>>> 2. Can you point me to a working example, or a starting example? Intuitively, I am looking for a way to define a schema and write/read column vectors to/from files, as one does with Parquet or ORC.
>>> >>>>>
>>> >>>>> I tried to locate some working examples with the ArrowFile[Reader/Writer] classes in the Maven tests, but so far I am not sure where to start.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> --
>>> >>>>> Animesh
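[Editor's note] For the archive: the write-then-read pattern the thread converges on can be sketched roughly as below. This assumes a recent Arrow Java release (constructor shapes vary slightly across versions), and the file name `example.arrow` and the single int column are made up for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Collections;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        File file = new File("example.arrow");

        // --- write ---
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("ints", allocator)) {
            vector.allocateNew(3);
            for (int i = 0; i < 3; i++) {
                vector.setSafe(i, i * 10);
            }
            vector.setValueCount(3);

            VectorSchemaRoot root = new VectorSchemaRoot(
                    Collections.singletonList(vector.getField()),
                    Collections.<FieldVector>singletonList(vector), 3);
            // An empty provider is fine when no column is dictionary-encoded.
            DictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
            try (FileOutputStream out = new FileOutputStream(file);
                 ArrowFileWriter writer =
                         new ArrowFileWriter(root, provider, out.getChannel())) {
                writer.start();
                writer.writeBatch();   // one record batch; call again for more
                writer.end();
            }
        }

        // --- read back ---
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileInputStream in = new FileInputStream(file);
             ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
            while (reader.loadNextBatch()) {
                System.out.println("batch with "
                        + reader.getVectorSchemaRoot().getRowCount() + " rows");
            }
        }
    }
}
```

Note the reader takes the `FileChannel` directly because it is seekable; for HDFS streams a seekable-channel adapter is needed, as discussed earlier in the thread.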