I'm working on getting all the docs updated for 0.8.0 -- there are some issues blocking a more automated update, so I may update them piecemeal until this is resolved:
https://github.com/apache/arrow/pull/1472

On Tue, Jan 2, 2018 at 10:41 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> I'll take a look at updating the site docs today. Thanks for pointing this out!
>
> On Wed, Dec 27, 2017 at 4:57 AM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> Hello everyone,
>>
>> I solved the issue with my writer. Now everything is working fine, including HDFS file reads and writes. I also wrote a Parquet-to-Arrow converter (on HDFS) that works fine.
>>
>> I noticed that the Arrow javadocs are still at the 0.7 release. Can someone please update them?
>>
>> FWIW: I wrote a blog post about how to read and write Arrow files in Java:
>> https://github.com/animeshtrivedi/blog/blob/master/post/2017-12-26-arrow.md
>>
>> The corresponding code is at https://github.com/animeshtrivedi/ArrowExample
>>
>> Thanks,
>> --
>> Animesh
>>
>> On Wed, Dec 20, 2017 at 4:35 PM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>>
>>> I think the null pointer exception happens due to some issue in my new writer (which used my implementation of the ByteBuffer writable interface)... let me narrow it down first.
>>>
>>> The basic code, which does not use my writer's implementation, seems to work. This is the code that is on GitHub. I have not pushed the new writer implementation yet.
>>>
>>> Thanks
>>> --
>>> Animesh
>>>
>>> On 20 Dec 2017 14:51, "Animesh Trivedi" <animesh.triv...@gmail.com> wrote:
>>>
>>> Wes, Emilio, Siddharth -- many thanks for the helpful replies and comments!
>>>
>>> I managed to upgrade the code to the 0.8 API. I have to say that the 0.8 API is much more intuitive ;) I will summarize my code example with some documentation in a blog post soon (and post it here too).
>>>
>>> - Is there first-class support to read/write files on HDFS? Because FSData[Output/Input]Stream from HDFS do not implement the [Read/Writ]ableByteChannel interfaces required to instantiate the Arrow file readers and writers.
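For the HDFS channel question above, one common workaround (a sketch, not an Arrow-provided utility at the time of this thread) is to adapt the Hadoop stream with java.nio.channels.Channels. FSDataOutputStream extends java.io.OutputStream, so Channels.newChannel() yields the WritableByteChannel that ArrowFileWriter expects; the read side is harder, because the Arrow file reader needs a seekable channel, which Channels.newChannel(InputStream) does not provide. The ByteArrayOutputStream below is a stand-in so the sketch is self-contained; substitute a real FSDataOutputStream from FileSystem.create().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ChannelWrapDemo {

    // Any java.io.OutputStream (including Hadoop's FSDataOutputStream, which
    // extends it) can be adapted to a WritableByteChannel this way.
    static WritableByteChannel toChannel(OutputStream out) {
        return Channels.newChannel(out);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an FSDataOutputStream obtained from FileSystem.create().
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel channel = toChannel(sink)) {
            channel.write(ByteBuffer.wrap(new byte[]{1, 2, 3, 4}));
        }
        System.out.println("bytes written: " + sink.size()); // bytes written: 4
    }
}
```

The same adapter object can then be handed to the ArrowFileWriter constructor in place of fileOutputStream.getChannel().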
>>> I already implemented something for me that works, but I am wondering whether it would make sense to have these facilities as utilities in the Arrow code?
>>>
>>> However, while my example code runs fine on a small example of 10 rows with multiple batches, it fails to read anything larger. I have not verified whether this worked on version 0.7, or at what row count it starts to fail. The writes are fine as far as I can tell. For example, I am writing and then reading TPC-DS data (the store_sales table with ints, longs, and doubles) and I get:
>>>
>>> [...]
>>> Reading the arrow file : ./store_sales.arrow
>>> File size : 3965838890 schema is Schema<ss_sold_date_sk: Int(32, true), ss_sold_time_sk: Int(32, true), ss_item_sk: Int(32, true), ss_customer_sk: Int(32, true), ss_cdemo_sk: Int(32, true), ss_hdemo_sk: Int(32, true), ss_addr_sk: Int(32, true), ss_store_sk: Int(32, true), ss_promo_sk: Int(32, true), ss_ticket_number: Int(64, true), ss_quantity: Int(32, true), ss_wholesale_cost: FloatingPoint(DOUBLE), ss_list_price: FloatingPoint(DOUBLE), ss_sales_price: FloatingPoint(DOUBLE), ss_ext_discount_amt: FloatingPoint(DOUBLE), ss_ext_sales_price: FloatingPoint(DOUBLE), ss_ext_wholesale_cost: FloatingPoint(DOUBLE), ss_ext_list_price: FloatingPoint(DOUBLE), ss_ext_tax: FloatingPoint(DOUBLE), ss_coupon_amt: FloatingPoint(DOUBLE), ss_net_paid: FloatingPoint(DOUBLE), ss_net_paid_inc_tax: FloatingPoint(DOUBLE), ss_net_profit: FloatingPoint(DOUBLE)>
>>> Number of arrow blocks are 19
>>> java.lang.NullPointerException
>>>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:256)
>>>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:242)
>>>     at org.apache.arrow.vector.ipc.ArrowFileReader.readRecordBatch(ArrowFileReader.java:162)
>>>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:113)
>>>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:139)
>>>     at com.github.animeshtrivedi.arrowexample.ArrowRead.makeRead(ArrowRead.java:82)
>>>     at com.github.animeshtrivedi.arrowexample.ArrowRead.main(ArrowRead.java:217)
>>>
>>> Some context: the file size is 3965838890 bytes, and the schema read from the file is correct. The code where it fails is doing something like:
>>>
>>>     System.out.println("File size : " + arrowFile.length() + " schema is " + root.getSchema().toString());
>>>     List<ArrowBlock> arrowBlocks = arrowFileReader.getRecordBlocks();
>>>     System.out.println("Number of arrow blocks are " + arrowBlocks.size());
>>>     for (int i = 0; i < arrowBlocks.size(); i++) {
>>>         ArrowBlock rbBlock = arrowBlocks.get(i);
>>>         if (!arrowFileReader.loadRecordBatch(rbBlock)) {
>>>             throw new IOException("Expected to read record batch");
>>>         }
>>>
>>> The stack trace comes from here:
>>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowRead.java#L82
>>>
>>> Any idea what might be happening?
>>>
>>> Thanks,
>>> --
>>> Animesh
>>>
>>> On Tue, Dec 19, 2017 at 7:03 PM, Siddharth Teotia <siddha...@dremio.com> wrote:
>>>
>>>> From Arrow 0.8, the second step "Grab the corresponding mutator and accessor objects by calls to getMutator(), getAccessor()" is not needed. In fact, it is not even there.
>>>>
>>>> On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <siddha...@dremio.com> wrote:
>>>>
>>>>> Hi Animesh,
>>>>>
>>>>> Firstly, I would suggest switching over to the Arrow 0.8 release ASAP, since you are writing Java programs and the API usage has changed drastically. The new APIs are much simpler, with good javadocs and detailed internal comments.
>>>>> If you are writing a stop-gap implementation then it is probably fine to continue with the old version, but for the long term the new API usage is recommended.
>>>>>
>>>>> - Create an instance of the vector. Note that this doesn't allocate any memory for the elements in the vector.
>>>>> - Grab the corresponding mutator and accessor objects by calls to getMutator(), getAccessor().
>>>>> - Allocate memory:
>>>>>   - allocateNew() - allocates memory for a default number of elements in the vector. This is applicable to both fixed-width and variable-width vectors.
>>>>>   - allocateNew(valueCount) - for fixed-width vectors. Use this method if you already know the number of elements to store in the vector.
>>>>>   - allocateNew(bytes, valueCount) - for variable-width vectors. Use this method if you already know the total size (in bytes) of all the variable-width elements you will be storing in the vector. For example, if you are going to store 1024 elements in the vector and the total size across all variable-width elements is under 1MB, you can call allocateNew(1024*1024, 1024).
>>>>> - Populate the vector:
>>>>>   - Use the set() or setSafe() APIs in the mutator interface. From Arrow 0.8 onwards, you can use these APIs directly on the vector instance; mutator/accessor are removed.
>>>>>   - The difference between set() and the corresponding setSafe() API is that the latter internally takes care of expanding the vector's buffer(s) to store the new data.
>>>>>   - Each set() API has a corresponding setSafe() API.
>>>>> - Do a setValueCount() based on the number of elements you populated in the vector.
>>>>> - Retrieve elements from the vector:
>>>>>   - Use the get(), getObject() APIs in the accessor interface. Again, from Arrow 0.8 onwards you can use these APIs directly.
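The allocate/populate/retrieve flow described above can be sketched for a fixed-width vector as follows. This is a hedged sketch against the 0.8-style API (set/setSafe and get called directly on the vector, no mutator/accessor); the class name NullableIntVector and the exact constructor signature are assumptions based on the 0.8-era code discussed in this thread, and arrow-vector/arrow-memory must be on the classpath.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableIntVector;

public class PopulateDemo {
    public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             NullableIntVector vector = new NullableIntVector("ints", allocator)) {
            vector.allocateNew(8);          // element count known up front (fixed-width variant)
            for (int i = 0; i < 8; i++) {
                vector.setSafe(i, i * 10);  // setSafe grows the buffers if needed
            }
            vector.setValueCount(8);        // always set the count after populating
            System.out.println(vector.get(3));
        }
    }
}
```

The same sequence with plain set() instead of setSafe() is only safe once you know the allocation already covers every index you will write.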
>>>>> - With respect to the usage of setInitialCapacity():
>>>>>   - Let's say your application always issues calls to allocateNew(). It is likely that this will end up over-allocating memory, because it assumes a default value count to begin with.
>>>>>   - In this case, if you do setInitialCapacity() followed by allocateNew(), then the latter doesn't do the default memory allocation. It allocates exactly the value capacity you specified in setInitialCapacity().
>>>>>
>>>>> I would highly recommend taking a look at https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
>>>>> It has lots of examples around populating the vector, retrieving from the vector, using setInitialCapacity(), using the set() and setSafe() methods, and combinations of them, to understand when things can go wrong.
>>>>>
>>>>> Hopefully this helps. Meanwhile, we will try to add an internal README on the usage of vectors.
>>>>>
>>>>> Thanks,
>>>>> Siddharth
>>>>>
>>>>> On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
>>>>>
>>>>>> This has probably changed with the Java code refactor, but I've posted some answers inline, to the best of my understanding.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Emilio
>>>>>>
>>>>>> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>>>>>>
>>>>>>> Thanks, Wes, for your help.
>>>>>>>
>>>>>>> Based upon some code reading, I managed to code up a basic working example. The code is here:
>>>>>>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
>>>>>>>
>>>>>>> However, I do have some questions about the concepts in Arrow.
>>>>>>>
>>>>>>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock is essentially the amount of data one must hold in memory at a time.
>>>>>>> Is my understanding correct?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> 2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor classes in the ValueVector interface -- both are implemented by all supported data types. What is the relationship between these two? Or when is one supposed to be used over the other? I only use the Mutator/Accessor classes in my code.
>>>>>>
>>>>>> The writer/reader interfaces are parallel implementations that make some things easier, but they don't encompass all available functionality (for example, fixed-size lists, nested lists, some dictionary operations, etc.). However, you should be able to accomplish everything using mutators/accessors.
>>>>>>
>>>>>>> 3. What are the "safe" variant functions in the Mutator's code? I could not understand what they are meant to achieve.
>>>>>>
>>>>>> The safe methods ensure that the vector is large enough to set the value. You can use the unsafe versions if you know that your vector has already allocated enough space for your data.
>>>>>>
>>>>>>> 4. What are MinorTypes?
>>>>>>
>>>>>> Minor types are a representation of the different vector types. I believe they are being de-emphasized in favor of FieldTypes, as minor types don't contain enough information to represent all vectors.
>>>>>>
>>>>>>> 5. For a writer, what is a dictionary provider? For example, in the Integration.java code, the reader is given as the dictionary provider for the writer.
>>>>>>> But is it something more than just:
>>>>>>>
>>>>>>>     DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
>>>>>>>     ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider, fileOutputStream.getChannel());
>>>>>>
>>>>>> The dictionary provider is an interface for looking up dictionary values. When reading a file, the reader itself has already read the dictionaries and thus serves as the provider.
>>>>>>
>>>>>>> 6. I am not entirely sure about the sequence of calls one needs to make to write via mutators. For example, if I code something like
>>>>>>>
>>>>>>>     NullableIntVector intVector = (NullableIntVector) fieldVector;
>>>>>>>     NullableIntVector.Mutator mutator = intVector.getMutator();
>>>>>>>     [.write num values]
>>>>>>>     mutator.setValueCount(num);
>>>>>>>
>>>>>>> then this works for primitive types, but not for the VarBinary type. There I have to set the capacity first:
>>>>>>>
>>>>>>>     NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) fieldVector;
>>>>>>>     varBinaryVector.setInitialCapacity(items);
>>>>>>>     varBinaryVector.allocateNew();
>>>>>>>     NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
>>>>>>
>>>>>> The method calls are not very well documented -- I would suggest looking at the reader/writer implementations to see which calls are required for which vector types. Generally, variable-length vectors (lists, var binary, etc.) behave differently than fixed-width vectors (ints, longs, etc.).
>>>>>>
>>>>>>> Examples of these are here:
>>>>>>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
>>>>>>> (the writeField[???] functions).
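The variable-width sequence discussed above (setInitialCapacity, then allocateNew, then setSafe, then setValueCount) can be sketched as follows against the 0.8-style API, where the mutator is gone and the methods live directly on the vector. The class name NullableVarBinaryVector and the setSafe(index, bytes, start, length) signature are assumptions based on the 0.8-era discussion in this thread, not a verified listing.

```java
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableVarBinaryVector;

public class VarBinaryDemo {
    public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             NullableVarBinaryVector vector = new NullableVarBinaryVector("blobs", allocator)) {
            int items = 4;
            vector.setInitialCapacity(items); // avoid the default over-allocation
            vector.allocateNew();
            for (int i = 0; i < items; i++) {
                byte[] value = ("value-" + i).getBytes(StandardCharsets.UTF_8);
                // setSafe grows the underlying data buffer if the bytes don't fit
                vector.setSafe(i, value, 0, value.length);
            }
            vector.setValueCount(items);
            System.out.println(new String(vector.getObject(2), StandardCharsets.UTF_8));
        }
    }
}
```

Unlike the fixed-width case, there is no allocateNew(valueCount) overload that fully sizes a variable-width vector, since the per-element byte length is not known; allocateNew(bytes, valueCount) covers the case where the total payload size is known up front.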
>>>>>>>
>>>>>>> Thank you very much,
>>>>>>> --
>>>>>>> Animesh
>>>>>>>
>>>>>>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>
>>>>>>>> hi Animesh,
>>>>>>>>
>>>>>>>> I suggest you try the ArrowStreamReader/Writer or ArrowFileReader/Writer classes. See
>>>>>>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
>>>>>>>> for example working code for this.
>>>>>>>>
>>>>>>>> - Wes
>>>>>>>>
>>>>>>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> It might be a trivial question, so please let me know if I am missing something.
>>>>>>>>>
>>>>>>>>> I am trying to write and read files in the Arrow format in Java. My data is a simple flat schema with primitive types. I already have the data in Java. So my questions are:
>>>>>>>>> 1. Is this possible, or am I fundamentally missing something about what Arrow can or cannot do (or is designed to do)? I assume that an efficient in-memory columnar data format should work with files too.
>>>>>>>>> 2. Can you point me to a working example, or a starting example? Intuitively, I am looking for a way to define a schema and write/read column vectors to/from files, as one does with Parquet or ORC.
>>>>>>>>>
>>>>>>>>> I tried to locate some working examples with the ArrowFile[Reader/Writer] classes in the Maven tests, but so far I am not sure where to start.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> --
>>>>>>>>> Animesh
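For readers landing on this thread with the same starting question, the ArrowFileReader/Writer usage that the replies converge on can be sketched as a minimal round trip. This is a hedged sketch of the 0.8-era API, not verified code: the Field constructor, VectorSchemaRoot.create, the writer start/writeBatch/end sequence, and the ArrowFileReader channel constructor are all assumptions drawn from the classes named in this thread, and "example.arrow" is a hypothetical local path.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Collections;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class RoundTripDemo {
    public static void main(String[] args) throws IOException {
        File file = new File("example.arrow"); // hypothetical path
        // One nullable 32-bit signed int column named "x".
        Schema schema = new Schema(Collections.singletonList(
                new Field("x", true, new ArrowType.Int(32, true), null)));

        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {

            // Populate one record batch of 5 values.
            NullableIntVector vector = (NullableIntVector) root.getVector("x");
            vector.allocateNew(5);
            for (int i = 0; i < 5; i++) vector.setSafe(i, i);
            vector.setValueCount(5);
            root.setRowCount(5);

            DictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
            try (FileOutputStream out = new FileOutputStream(file);
                 ArrowFileWriter writer = new ArrowFileWriter(root, provider, out.getChannel())) {
                writer.start();
                writer.writeBatch(); // one batch per writeBatch() call
                writer.end();
            }

            // Read the batch back; FileChannel is seekable, as the reader requires.
            try (FileInputStream in = new FileInputStream(file);
                 ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
                while (reader.loadNextBatch()) {
                    System.out.println("rows: " + reader.getVectorSchemaRoot().getRowCount());
                }
            }
        }
    }
}
```

Integration.java in the Arrow tools module, referenced earlier in the thread, remains the authoritative working example of this sequence.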