I'm working on getting all the docs updated for 0.8.0 -- there are some issues blocking a more automated update, so I may update them piecemeal until this is resolved:
https://github.com/apache/arrow/pull/1472

On Tue, Jan 2, 2018 at 10:41 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> I'll take a look at updating the site docs today. Thanks for pointing this out!
>
> On Wed, Dec 27, 2017 at 4:57 AM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>> Hello everyone,
>>
>> I solved the issue with my writer. Now everything is working fine, including HDFS file reads and writes. I also wrote a Parquet-to-Arrow converter (on HDFS) that works fine.
>>
>> I noticed that the Arrow javadocs are still at the 0.7 release. Can someone please update them?
>>
>> FWIW: I wrote a blog post about how to read and write Arrow files in Java:
>> https://github.com/animeshtrivedi/blog/blob/master/post/2017-12-26-arrow.md
>>
>> The corresponding code is at https://github.com/animeshtrivedi/ArrowExample
>>
>> Thanks,
>> --
>> Animesh
>>
>> On Wed, Dec 20, 2017 at 4:35 PM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>>
>>> I think the null pointer exception happens due to some issue in my new writer (which used my implementation of the ByteBuffer writable interface)... let me narrow it down first.
>>>
>>> The basic code, which does not use my writer's implementation, seems to work. This is the code that is on GitHub. I have not pushed the new writer implementation yet.
>>>
>>> Thanks
>>> --
>>> Animesh
>>>
>>> On 20 Dec 2017 14:51, "Animesh Trivedi" <animesh.triv...@gmail.com> wrote:
>>>
>>> Wes, Emilio, Siddharth -- many thanks for the helpful replies and comments!
>>>
>>> I managed to upgrade the code to the 0.8 API. I have to say that the 0.8 API is much more intuitive ;) I will summarize my code example with some documentation in a blog post soon (and post it here too).
>>>
>>> - Is there first-class support to read/write files on HDFS? Because FSData[Output/Input]Stream from HDFS do not implement the [Read/Writ]ableByteChannel interfaces required to instantiate the Arrow file readers and writers.
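For the HDFS channel question above, one common workaround (a sketch, not an Arrow-provided utility at the time of this thread) is to adapt the Hadoop stream with java.nio.channels.Channels. FSDataOutputStream extends java.io.OutputStream, so Channels.newChannel() yields the WritableByteChannel that ArrowFileWriter expects; the read side is harder, because the Arrow file reader needs a seekable channel, which Channels.newChannel(InputStream) does not provide. The ByteArrayOutputStream below is a stand-in so the sketch is self-contained; substitute a real FSDataOutputStream from FileSystem.create().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ChannelWrapDemo {

    // Any java.io.OutputStream (including Hadoop's FSDataOutputStream, which
    // extends it) can be adapted to a WritableByteChannel this way.
    static WritableByteChannel toChannel(OutputStream out) {
        return Channels.newChannel(out);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an FSDataOutputStream obtained from FileSystem.create().
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel channel = toChannel(sink)) {
            channel.write(ByteBuffer.wrap(new byte[]{1, 2, 3, 4}));
        }
        System.out.println("bytes written: " + sink.size()); // bytes written: 4
    }
}
```

The same adapter object can then be handed to the ArrowFileWriter constructor in place of fileOutputStream.getChannel().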
>>> I already implemented something for me that works, but I am wondering whether it would make sense to have these facilities as utilities in the Arrow code?
>>>
>>> However, while my example code runs fine on a small example of 10 rows with multiple batches, it fails to read anything larger. I have not verified whether this worked on version 0.7, or at what row count it starts to fail. The writes are fine as far as I can tell. For example, I am writing and then reading TPC-DS data (the store_sales table with ints, longs, and doubles) and I get:
>>>
>>> [...]
>>> Reading the arrow file : ./store_sales.arrow
>>> File size : 3965838890 schema is Schema<ss_sold_date_sk: Int(32, true), ss_sold_time_sk: Int(32, true), ss_item_sk: Int(32, true), ss_customer_sk: Int(32, true), ss_cdemo_sk: Int(32, true), ss_hdemo_sk: Int(32, true), ss_addr_sk: Int(32, true), ss_store_sk: Int(32, true), ss_promo_sk: Int(32, true), ss_ticket_number: Int(64, true), ss_quantity: Int(32, true), ss_wholesale_cost: FloatingPoint(DOUBLE), ss_list_price: FloatingPoint(DOUBLE), ss_sales_price: FloatingPoint(DOUBLE), ss_ext_discount_amt: FloatingPoint(DOUBLE), ss_ext_sales_price: FloatingPoint(DOUBLE), ss_ext_wholesale_cost: FloatingPoint(DOUBLE), ss_ext_list_price: FloatingPoint(DOUBLE), ss_ext_tax: FloatingPoint(DOUBLE), ss_coupon_amt: FloatingPoint(DOUBLE), ss_net_paid: FloatingPoint(DOUBLE), ss_net_paid_inc_tax: FloatingPoint(DOUBLE), ss_net_profit: FloatingPoint(DOUBLE)>
>>> Number of arrow blocks are 19
>>> java.lang.NullPointerException
>>>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:256)
>>>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:242)
>>>     at org.apache.arrow.vector.ipc.ArrowFileReader.readRecordBatch(ArrowFileReader.java:162)
>>>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:113)
>>>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:139)
>>>     at com.github.animeshtrivedi.arrowexample.ArrowRead.makeRead(ArrowRead.java:82)
>>>     at com.github.animeshtrivedi.arrowexample.ArrowRead.main(ArrowRead.java:217)
>>>
>>> Some context: the file size is 3965838890 bytes, and the schema read from the file is correct. The code where it fails is doing something like:
>>>
>>>     System.out.println("File size : " + arrowFile.length() + " schema is " + root.getSchema().toString());
>>>     List<ArrowBlock> arrowBlocks = arrowFileReader.getRecordBlocks();
>>>     System.out.println("Number of arrow blocks are " + arrowBlocks.size());
>>>     for (int i = 0; i < arrowBlocks.size(); i++) {
>>>         ArrowBlock rbBlock = arrowBlocks.get(i);
>>>         if (!arrowFileReader.loadRecordBatch(rbBlock)) {
>>>             throw new IOException("Expected to read record batch");
>>>         }
>>>
>>> The stack trace comes from here:
>>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowRead.java#L82
>>>
>>> Any idea what might be happening?
>>>
>>> Thanks,
>>> --
>>> Animesh
>>>
>>> On Tue, Dec 19, 2017 at 7:03 PM, Siddharth Teotia <siddha...@dremio.com> wrote:
>>>
>>>> From Arrow 0.8, the second step "Grab the corresponding mutator and accessor objects by calls to getMutator(), getAccessor()" is not needed. In fact, it is not even there.
>>>>
>>>> On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <siddha...@dremio.com> wrote:
>>>>
>>>>> Hi Animesh,
>>>>>
>>>>> Firstly, I would suggest switching over to the Arrow 0.8 release ASAP, since you are writing Java programs and the API usage has changed drastically. The new APIs are much simpler, with good javadocs and detailed internal comments.
>>>>> If you are writing a stop-gap implementation then it is probably fine to continue with the old version, but for the long term the new API usage is recommended.
>>>>>
>>>>> - Create an instance of the vector. Note that this doesn't allocate any memory for the elements in the vector.
>>>>> - Grab the corresponding mutator and accessor objects by calls to getMutator(), getAccessor().
>>>>> - Allocate memory:
>>>>>   - allocateNew() - allocates memory for a default number of elements in the vector. This is applicable to both fixed-width and variable-width vectors.
>>>>>   - allocateNew(valueCount) - for fixed-width vectors. Use this method if you already know the number of elements to store in the vector.
>>>>>   - allocateNew(bytes, valueCount) - for variable-width vectors. Use this method if you already know the total size (in bytes) of all the variable-width elements you will be storing in the vector. For example, if you are going to store 1024 elements in the vector and the total size across all variable-width elements is under 1MB, you can call allocateNew(1024*1024, 1024).
>>>>> - Populate the vector:
>>>>>   - Use the set() or setSafe() APIs in the mutator interface. From Arrow 0.8 onwards, you can use these APIs directly on the vector instance; mutator/accessor are removed.
>>>>>   - The difference between set() and the corresponding setSafe() API is that the latter internally takes care of expanding the vector's buffer(s) to store the new data.
>>>>>   - Each set() API has a corresponding setSafe() API.
>>>>> - Do a setValueCount() based on the number of elements you populated in the vector.
>>>>> - Retrieve elements from the vector:
>>>>>   - Use the get(), getObject() APIs in the accessor interface. Again, from Arrow 0.8 onwards you can use these APIs directly.
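The allocate/populate/retrieve flow described above can be sketched for a fixed-width vector as follows. This is a hedged sketch against the 0.8-style API (set/setSafe and get called directly on the vector, no mutator/accessor); the class name NullableIntVector and the exact constructor signature are assumptions based on the 0.8-era code discussed in this thread, and arrow-vector/arrow-memory must be on the classpath.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableIntVector;

public class PopulateDemo {
    public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             NullableIntVector vector = new NullableIntVector("ints", allocator)) {
            vector.allocateNew(8);          // element count known up front (fixed-width variant)
            for (int i = 0; i < 8; i++) {
                vector.setSafe(i, i * 10);  // setSafe grows the buffers if needed
            }
            vector.setValueCount(8);        // always set the count after populating
            System.out.println(vector.get(3));
        }
    }
}
```

The same sequence with plain set() instead of setSafe() is only safe once you know the allocation already covers every index you will write.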
>>>>> - With respect to the usage of setInitialCapacity():
>>>>>   - Let's say your application always issues calls to allocateNew(). It is likely that this will end up over-allocating memory, because it assumes a default value count to begin with.
>>>>>   - In this case, if you do setInitialCapacity() followed by allocateNew(), then the latter doesn't do the default memory allocation. It allocates exactly the value capacity you specified in setInitialCapacity().
>>>>>
>>>>> I would highly recommend taking a look at https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
>>>>> It has lots of examples around populating the vector, retrieving from the vector, using setInitialCapacity(), using the set() and setSafe() methods, and combinations of them, to understand when things can go wrong.
>>>>>
>>>>> Hopefully this helps. Meanwhile, we will try to add an internal README on the usage of vectors.
>>>>>
>>>>> Thanks,
>>>>> Siddharth
>>>>>
>>>>> On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
>>>>>
>>>>>> This has probably changed with the Java code refactor, but I've posted some answers inline, to the best of my understanding.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Emilio
>>>>>>
>>>>>> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>>>>>>
>>>>>>> Thanks, Wes, for your help.
>>>>>>>
>>>>>>> Based upon some code reading, I managed to code up a basic working example. The code is here:
>>>>>>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
>>>>>>>
>>>>>>> However, I do have some questions about the concepts in Arrow.
>>>>>>>
>>>>>>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock is essentially the amount of data one must hold in memory at a time.
>>>>>>> Is my understanding correct?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> 2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor classes in the ValueVector interface -- both are implemented by all supported data types. What is the relationship between these two? Or when is one supposed to be used over the other? I only use the Mutator/Accessor classes in my code.
>>>>>>
>>>>>> The writer/reader interfaces are parallel implementations that make some things easier, but they don't encompass all available functionality (for example, fixed-size lists, nested lists, some dictionary operations, etc.). However, you should be able to accomplish everything using mutators/accessors.
>>>>>>
>>>>>>> 3. What are the "safe" variant functions in the Mutator's code? I could not understand what they are meant to achieve.
>>>>>>
>>>>>> The safe methods ensure that the vector is large enough to set the value. You can use the unsafe versions if you know that your vector has already allocated enough space for your data.
>>>>>>
>>>>>>> 4. What are MinorTypes?
>>>>>>
>>>>>> Minor types are a representation of the different vector types. I believe they are being de-emphasized in favor of FieldTypes, as minor types don't contain enough information to represent all vectors.
>>>>>>
>>>>>>> 5. For a writer, what is a dictionary provider? For example, in the Integration.java code, the reader is given as the dictionary provider for the writer.
>>>>>>> But is it something more than just:
>>>>>>>
>>>>>>>     DictionaryProvider.MapDictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
>>>>>>>     ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider, fileOutputStream.getChannel());
>>>>>>
>>>>>> The dictionary provider is an interface for looking up dictionary values. When reading a file, the reader itself has already read the dictionaries and thus serves as the provider.
>>>>>>
>>>>>>> 6. I am not entirely sure about the sequence of calls one needs to make to write via mutators. For example, if I code something like
>>>>>>>
>>>>>>>     NullableIntVector intVector = (NullableIntVector) fieldVector;
>>>>>>>     NullableIntVector.Mutator mutator = intVector.getMutator();
>>>>>>>     [.write num values]
>>>>>>>     mutator.setValueCount(num);
>>>>>>>
>>>>>>> then this works for primitive types, but not for the VarBinary type. There I have to set the capacity first:
>>>>>>>
>>>>>>>     NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) fieldVector;
>>>>>>>     varBinaryVector.setInitialCapacity(items);
>>>>>>>     varBinaryVector.allocateNew();
>>>>>>>     NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
>>>>>>
>>>>>> The method calls are not very well documented -- I would suggest looking at the reader/writer implementations to see which calls are required for which vector types. Generally, variable-length vectors (lists, var binary, etc.) behave differently than fixed-width vectors (ints, longs, etc.).
>>>>>>
>>>>>>> Examples of these are here:
>>>>>>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
>>>>>>> (the writeField[???] functions).
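The variable-width sequence discussed above (setInitialCapacity, then allocateNew, then setSafe, then setValueCount) can be sketched as follows against the 0.8-style API, where the mutator is gone and the methods live directly on the vector. The class name NullableVarBinaryVector and the setSafe(index, bytes, start, length) signature are assumptions based on the 0.8-era discussion in this thread, not a verified listing.

```java
import java.nio.charset.StandardCharsets;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableVarBinaryVector;

public class VarBinaryDemo {
    public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             NullableVarBinaryVector vector = new NullableVarBinaryVector("blobs", allocator)) {
            int items = 4;
            vector.setInitialCapacity(items); // avoid the default over-allocation
            vector.allocateNew();
            for (int i = 0; i < items; i++) {
                byte[] value = ("value-" + i).getBytes(StandardCharsets.UTF_8);
                // setSafe grows the underlying data buffer if the bytes don't fit
                vector.setSafe(i, value, 0, value.length);
            }
            vector.setValueCount(items);
            System.out.println(new String(vector.getObject(2), StandardCharsets.UTF_8));
        }
    }
}
```

Unlike the fixed-width case, there is no allocateNew(valueCount) overload that fully sizes a variable-width vector, since the per-element byte length is not known; allocateNew(bytes, valueCount) covers the case where the total payload size is known up front.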
>>>>>>>
>>>>>>> Thank you very much,
>>>>>>> --
>>>>>>> Animesh
>>>>>>>
>>>>>>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>
>>>>>>>> hi Animesh,
>>>>>>>>
>>>>>>>> I suggest you try the ArrowStreamReader/Writer or ArrowFileReader/Writer classes. See
>>>>>>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
>>>>>>>> for example working code for this.
>>>>>>>>
>>>>>>>> - Wes
>>>>>>>>
>>>>>>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> It might be a trivial question, so please let me know if I am missing something.
>>>>>>>>>
>>>>>>>>> I am trying to write and read files in the Arrow format in Java. My data is a simple flat schema with primitive types. I already have the data in Java. So my questions are:
>>>>>>>>> 1. Is this possible, or am I fundamentally missing something about what Arrow can or cannot do (or is designed to do)? I assume that an efficient in-memory columnar data format should work with files too.
>>>>>>>>> 2. Can you point me to a working example, or a starting example? Intuitively, I am looking for a way to define a schema and write/read column vectors to/from files, as one does with Parquet or ORC.
>>>>>>>>>
>>>>>>>>> I tried to locate some working examples with the ArrowFile[Reader/Writer] classes in the Maven tests, but so far I am not sure where to start.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> --
>>>>>>>>> Animesh
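For readers landing on this thread with the same starting question, the ArrowFileReader/Writer usage that the replies converge on can be sketched as a minimal round trip. This is a hedged sketch of the 0.8-era API, not verified code: the Field constructor, VectorSchemaRoot.create, the writer start/writeBatch/end sequence, and the ArrowFileReader channel constructor are all assumptions drawn from the classes named in this thread, and "example.arrow" is a hypothetical local path.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Collections;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.dictionary.DictionaryProvider;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class RoundTripDemo {
    public static void main(String[] args) throws IOException {
        File file = new File("example.arrow"); // hypothetical path
        // One nullable 32-bit signed int column named "x".
        Schema schema = new Schema(Collections.singletonList(
                new Field("x", true, new ArrowType.Int(32, true), null)));

        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {

            // Populate one record batch of 5 values.
            NullableIntVector vector = (NullableIntVector) root.getVector("x");
            vector.allocateNew(5);
            for (int i = 0; i < 5; i++) vector.setSafe(i, i);
            vector.setValueCount(5);
            root.setRowCount(5);

            DictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
            try (FileOutputStream out = new FileOutputStream(file);
                 ArrowFileWriter writer = new ArrowFileWriter(root, provider, out.getChannel())) {
                writer.start();
                writer.writeBatch(); // one batch per writeBatch() call
                writer.end();
            }

            // Read the batch back; FileChannel is seekable, as the reader requires.
            try (FileInputStream in = new FileInputStream(file);
                 ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
                while (reader.loadNextBatch()) {
                    System.out.println("rows: " + reader.getVectorSchemaRoot().getRowCount());
                }
            }
        }
    }
}
```

Integration.java in the Arrow tools module, referenced earlier in the thread, remains the authoritative working example of this sequence.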