I'll take a look at updating the site docs today. Thanks for pointing this out!

On Wed, Dec 27, 2017 at 4:57 AM, Animesh Trivedi
<animesh.triv...@gmail.com> wrote:
> Hello everyone,
>
> I solved the issue with my writer. Now everything is working fine,
> including HDFS file reads and writes. I also wrote a Parquet-to-Arrow
> converter (on HDFS) that works fine.
>
> I noticed that the Arrow javadocs are still at the 0.7 release. Can
> someone please update them?
>
> FWIW: I wrote a blog post about how to read and write Arrow files in
> Java:
> https://github.com/animeshtrivedi/blog/blob/master/post/2017-12-26-arrow.md
>
> The corresponding code is at https://github.com/animeshtrivedi/ArrowExample
>
>
> Thanks,
> --
> Animesh
>
>
> On Wed, Dec 20, 2017 at 4:35 PM, Animesh Trivedi <animesh.triv...@gmail.com>
> wrote:
>
>> I think the null pointer exception happens due to some issue in my new
>> writer (which used my implementation of the ByteBuffer writable
>> interface)...let me narrow it down first.
>>
>> The basic code, which does not use my writer's implementation, seems to
>> work. This is the code that is on GitHub. I have not pushed the new
>> writer implementation yet.
>>
>> Thanks
>> --
>> Animesh
>>
>>
>> On 20 Dec 2017 14:51, "Animesh Trivedi" <animesh.triv...@gmail.com> wrote:
>>
>> Wes, Emilio, Siddharth - many thanks for the helpful replies and comments!
>>
>> I managed to upgrade the code to the 0.8 API. I have to say that the
>> 0.8 API is much more intuitive ;) I will summarize my code example with
>> some documentation in a blog post soon (and post it here too).
>>
>> - Is there first-class support for reading/writing files on HDFS?
>> Because FSData[Output/Input]Stream from HDFS does not implement the
>> [Read/Writ]ableByteChannel interfaces required to instantiate ArrowFile
>> readers and writers. I have already implemented something that works
>> for me, but I am wondering whether it would make sense to have these
>> facilities as utilities in the Arrow code.
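For what it's worth, the JDK can already adapt any OutputStream into a WritableByteChannel via java.nio.channels.Channels, and FSDataOutputStream extends OutputStream. A minimal sketch (ByteArrayOutputStream stands in for the HDFS stream here, purely as an illustrative assumption):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ChannelAdapter {
    // Wrap any OutputStream (e.g. HDFS's FSDataOutputStream, which extends
    // OutputStream) as the WritableByteChannel that ArrowFileWriter expects.
    public static WritableByteChannel wrap(OutputStream out) {
        return Channels.newChannel(out);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an HDFS stream; any OutputStream works the same way.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel channel = wrap(sink)) {
            channel.write(ByteBuffer.wrap(new byte[]{1, 2, 3}));
        }
        System.out.println(sink.size()); // 3 bytes written through the channel
    }
}
```

The same trick works on the read side with Channels.newChannel(InputStream), though a file reader additionally needs seekability, which a plain InputStream adapter cannot provide.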
>>
>> However, my example code runs fine on a small example of 10 rows with
>> multiple batches, but it fails to read anything larger. I have not
>> verified whether it was working with the 0.7 version, or at what row
>> count it starts to fail. The writes are fine as far as I can tell. For
>> example, I am writing and then reading TPC-DS data (the store_sales
>> table with ints, longs, and doubles) and I get:
>>
>> [...]
>> Reading the arrow file : ./store_sales.arrow
>> File size : 3965838890 schema is Schema<ss_sold_date_sk: Int(32, true),
>> ss_sold_time_sk: Int(32, true), ss_item_sk: Int(32, true), ss_customer_sk:
>> Int(32, true), ss_cdemo_sk: Int(32, true), ss_hdemo_sk: Int(32, true),
>> ss_addr_sk: Int(32, true), ss_store_sk: Int(32, true), ss_promo_sk: Int(32,
>> true), ss_ticket_number: Int(64, true), ss_quantity: Int(32, true),
>> ss_wholesale_cost: FloatingPoint(DOUBLE), ss_list_price:
>> FloatingPoint(DOUBLE), ss_sales_price: FloatingPoint(DOUBLE),
>> ss_ext_discount_amt: FloatingPoint(DOUBLE), ss_ext_sales_price:
>> FloatingPoint(DOUBLE), ss_ext_wholesale_cost: FloatingPoint(DOUBLE),
>> ss_ext_list_price: FloatingPoint(DOUBLE), ss_ext_tax:
>> FloatingPoint(DOUBLE), ss_coupon_amt: FloatingPoint(DOUBLE), ss_net_paid:
>> FloatingPoint(DOUBLE), ss_net_paid_inc_tax: FloatingPoint(DOUBLE),
>> ss_net_profit: FloatingPoint(DOUBLE)>
>> Number of arrow blocks are 19
>> java.lang.NullPointerException
>>         at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:256)
>>         at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:242)
>>         at org.apache.arrow.vector.ipc.ArrowFileReader.readRecordBatch(ArrowFileReader.java:162)
>>         at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:113)
>>         at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:139)
>>         at com.github.animeshtrivedi.arrowexample.ArrowRead.makeRead(ArrowRead.java:82)
>>         at com.github.animeshtrivedi.arrowexample.ArrowRead.main(ArrowRead.java:217)
>>
>>
>> Some context: the file size is 3965838890 bytes, and the schema read
>> from the file is correct. The code where it fails is doing something
>> like:
>>
>>         System.out.println("File size : " + arrowFile.length() + " schema is " + root.getSchema().toString());
>>         List<ArrowBlock> arrowBlocks = arrowFileReader.getRecordBlocks();
>>         System.out.println("Number of arrow blocks are " + arrowBlocks.size());
>>         for (int i = 0; i < arrowBlocks.size(); i++) {
>>             ArrowBlock rbBlock = arrowBlocks.get(i);
>>             if (!arrowFileReader.loadRecordBatch(rbBlock)) {
>>                 throw new IOException("Expected to read record batch");
>>             }
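For reference, a self-contained sketch of such a read loop, assuming the 0.8-era ArrowFileReader API (a FileChannel satisfies the seekable channel the reader needs; exact class and package names may differ slightly between releases):

```java
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

public class ReadLoop {
    public static void main(String[] args) throws IOException {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileInputStream in = new FileInputStream(args[0]);
             ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
            // The root is populated in place each time a batch is loaded.
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            for (ArrowBlock block : reader.getRecordBlocks()) {
                if (!reader.loadRecordBatch(block)) {
                    throw new IOException("Expected to read record batch");
                }
                System.out.println("rows in batch: " + root.getRowCount());
            }
        }
    }
}
```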
>>
>> the stack comes from here:
>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowRead.java#L82
>>
>> Any idea what might be happening?
>>
>> Thanks,
>> --
>> Animesh
>>
>> On Tue, Dec 19, 2017 at 7:03 PM, Siddharth Teotia <siddha...@dremio.com>
>> wrote:
>>
>>> From Arrow 0.8, the second step "Grab the corresponding mutator and
>>> accessor objects by calls to getMutator(), getAccessor()" is not
>>> needed. In fact, it is not even there.
>>>
>>> On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <siddha...@dremio.com>
>>> wrote:
>>>
>>> > Hi Animesh,
>>> >
>>> > Firstly, I would like to suggest switching over to the Arrow 0.8
>>> > release asap, since you are writing Java programs and the API usage
>>> > has changed drastically. The new APIs are much simpler, with good
>>> > javadocs and detailed internal comments.
>>> >
>>> > If you are writing a stop-gap implementation then it is probably fine
>>> > to continue with the old version, but for the long term the new API
>>> > usage is recommended.
>>> >
>>> >
>>> >    - Create an instance of the vector. Note that this doesn't
>>> >    allocate any memory for the elements in the vector.
>>> >    - Grab the corresponding mutator and accessor objects by calls to
>>> >    getMutator(), getAccessor().
>>> >    - Allocate memory:
>>> >       - *allocateNew()* - allocates memory for a default number of
>>> >       elements in the vector. This is applicable to both fixed width
>>> >       and variable width vectors.
>>> >       - *allocateNew(valueCount)* - for fixed width vectors. Use this
>>> >       method if you already know the number of elements to store in
>>> >       the vector.
>>> >       - *allocateNew(bytes, valueCount)* - for variable width
>>> >       vectors. Use this method if you already know the total size (in
>>> >       bytes) of all the variable width elements you will be storing
>>> >       in the vector. For example, if you are going to store 1024
>>> >       elements in the vector and the total size across all variable
>>> >       width elements is under 1MB, you can call
>>> >       allocateNew(1024*1024, 1024).
>>> >    - Populate the vector:
>>> >       - Use the *set() or setSafe()* APIs in the mutator interface.
>>> >       From Arrow 0.8 onwards, you can use these APIs directly on the
>>> >       vector instance, and mutator/accessor are removed.
>>> >       - The difference between set() and the corresponding setSafe()
>>> >       API is that the latter internally takes care of expanding the
>>> >       vector's buffer(s) to store new data.
>>> >       - Each set() API has a corresponding setSafe() API.
>>> >    - Do a setValueCount() based on the number of elements you
>>> >    populated in the vector.
>>> >    - Retrieve elements from the vector:
>>> >       - Use the get(), getObject() APIs in the accessor interface.
>>> >       Again, from Arrow 0.8 onwards you can use these APIs directly
>>> >       on the vector.
>>> >    - With respect to usage of setInitialCapacity():
>>> >       - Let's say your application always issues calls to
>>> >       allocateNew(). It is likely that this will end up
>>> >       over-allocating memory, because it assumes a default value
>>> >       count to begin with.
>>> >       - In this case, if you do setInitialCapacity() followed by
>>> >       allocateNew(), then the latter doesn't do the default memory
>>> >       allocation. It allocates exactly the value capacity you
>>> >       specified in setInitialCapacity().
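Put together, the steps above look roughly like the following sketch. One assumption to flag: the class is named IntVector as in post-0.8 releases where the Nullable prefix was dropped; on 0.8 itself it is NullableIntVector, so adjust to your version.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class VectorLifecycle {
    // Create, allocate, populate, and read back a fixed-width vector.
    public static int[] roundTrip(int[] values) {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("ints", allocator)) {
            vector.setInitialCapacity(values.length); // avoid default over-allocation
            vector.allocateNew();
            for (int i = 0; i < values.length; i++) {
                vector.setSafe(i, values[i]); // setSafe grows buffers if needed
            }
            vector.setValueCount(values.length);
            int[] out = new int[values.length];
            for (int i = 0; i < values.length; i++) {
                out[i] = vector.get(i);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        for (int v : roundTrip(new int[]{1, 2, 3})) {
            System.out.println(v);
        }
    }
}
```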
>>> >
>>> > I would highly recommend taking a look at
>>> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
>>> > This has lots of examples around populating a vector, retrieving from
>>> > a vector, using setInitialCapacity(), and using the set() and
>>> > setSafe() methods, with combinations of them that show when things
>>> > can go wrong.
>>> >
>>> > Hopefully this helps. Meanwhile we will try to add some internal README
>>> > for the usage of vectors.
>>> >
>>> > Thanks,
>>> > Siddharth
>>> >
>>> > On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com
>>> >
>>> > wrote:
>>> >
>>> >> This has probably changed with the Java code refactor, but I've posted
>>> >> some answers inline, to the best of my understanding.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Emilio
>>> >>
>>> >> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>>> >>
>>> >>> Thanks Wes for your help.
>>> >>>
>>> >>> Based upon some code reading, I managed to code up a basic working
>>> >>> example.
>>> >>> The code is here:
>>> >>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
>>> >>>
>>> >>> However, I do have some questions about the concepts in Arrow
>>> >>>
>>> >>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock is
>>> >>> essentially the amount of data one must hold in memory at a time.
>>> >>> Is my understanding correct?
>>> >>>
>>> >> yes
>>> >>
>>> >>>
>>> >>> 2. There are Base[Reader/Writer] interfaces as well as
>>> >>> Mutator/Accessor classes in the ValueVector interface - both are
>>> >>> implemented by all supported data types. What is the relationship
>>> >>> between these two, and when is one supposed to use one over the
>>> >>> other? I only use Mutator/Accessor classes in my code.
>>> >>>
>>> >> The writer/reader interfaces are parallel implementations that make
>>> >> some things easier, but don't encompass all available functionality
>>> >> (for example, fixed size lists, nested lists, some dictionary
>>> >> operations, etc.). However, you should be able to accomplish
>>> >> everything using mutators/accessors.
>>> >>
>>> >>>
>>> >>> 3. What are the "safe" variant functions in the Mutator's code? I
>>> >>> could not understand what they are meant to achieve.
>>> >>>
>>> >> The safe methods ensure that the vector is large enough to set the
>>> >> value. You can use the unsafe versions if you know that your vector
>>> >> has already allocated enough space for your data.
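A small sketch of that difference (class names follow the post-refactor API, which is an assumption; on 0.7 these calls go through the mutator instead):

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class SafeVsUnsafe {
    // set() assumes the slot is already allocated; setSafe() grows the
    // buffers first when the index is beyond the current capacity.
    public static int writeBeyondCapacity() {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector v = new IntVector("v", allocator)) {
            v.allocateNew(4);   // room for 4 values
            v.set(0, 42);       // fine: within the allocated capacity
            v.setSafe(1000, 7); // reallocates the buffers, then sets
            v.setValueCount(1001);
            return v.get(1000);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeBeyondCapacity());
    }
}
```

Writing past the allocated capacity with plain set() risks an index-out-of-bounds style failure, while setSafe() reallocates first.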
>>> >>
>>> >>> 4. What are MinorTypes?
>>> >>>
>>> >> Minor types are a representation of the different vector types. I
>>> >> believe they are being de-emphasized in favor of FieldTypes, as
>>> >> minor types don't contain enough information to represent all
>>> >> vectors.
>>> >>
>>> >>>
>>> >>> 5. For a writer, what is a dictionary provider? For example, in
>>> >>> the Integration.java code, the reader is given as the dictionary
>>> >>> provider for the writer. But is it something more than just:
>>> >>> DictionaryProvider.MapDictionaryProvider provider = new
>>> >>> DictionaryProvider.MapDictionaryProvider();
>>> >>> ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider,
>>> >>> fileOutputStream.getChannel());
>>> >>>
>>> >> The dictionary provider is an interface for looking up dictionary
>>> >> values. When reading a file, the reader itself has already read the
>>> >> dictionaries and thus serves as the provider.
>>> >>
>>> >>> 6. I am not entirely sure about the sequence of calls that one
>>> >>> needs to make to write via mutators. For example, if I code
>>> >>> something like:
>>> >>> NullableIntVector intVector = (NullableIntVector) fieldVector;
>>> >>> NullableIntVector.Mutator mutator = intVector.getMutator();
>>> >>> [.write num values]
>>> >>> mutator.setValueCount(num);
>>> >>> then this works for primitive types, but not for the VarBinary
>>> >>> type. There I have to set the capacity first:
>>> >>>
>>> >>> NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector)
>>> >>> fieldVector;
>>> >>> varBinaryVector.setInitialCapacity(items);
>>> >>> varBinaryVector.allocateNew();
>>> >>> NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
>>> >>>
>>> >> The method calls are not very well documented - I would suggest
>>> >> looking at the reader/writer implementations to see what calls are
>>> >> required for which vector types. Generally, variable length vectors
>>> >> (lists, var binary, etc.) behave differently than fixed width
>>> >> vectors (ints, longs, etc.).
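To illustrate the variable-width sequence, a sketch using the post-refactor names (VarBinaryVector here; on 0.7/0.8 it is NullableVarBinaryVector driven through a Mutator, so treat the naming as an assumption):

```java
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarBinaryVector;

public class VarWidthWrite {
    // Variable-width vectors carry an offset buffer in addition to the
    // data buffer, so capacity is sized before writing.
    public static byte[] roundTripFirst(byte[][] items) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VarBinaryVector vector = new VarBinaryVector("bytes", allocator)) {
            vector.setInitialCapacity(items.length);
            vector.allocateNew();
            for (int i = 0; i < items.length; i++) {
                vector.setSafe(i, items[i]); // grows the data buffer as needed
            }
            vector.setValueCount(items.length);
            return vector.get(0); // copy of the first element's bytes
        }
    }

    public static void main(String[] args) {
        byte[][] items = {
            "foo".getBytes(StandardCharsets.UTF_8),
            "a-much-longer-value".getBytes(StandardCharsets.UTF_8),
        };
        System.out.println(new String(roundTripFirst(items), StandardCharsets.UTF_8));
    }
}
```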
>>> >>
>>> >>> Examples of these are here:
>>> >>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
>>> >>> (the writeField[???] functions).
>>> >>>
>>> >>> Thank you very much,
>>> >>> --
>>> >>> Animesh
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com>
>>> >>> wrote:
>>> >>>
>>> >>> hi Animesh,
>>> >>>>
>>> >>>> I suggest you try the ArrowStreamReader/Writer or
>>> >>>> ArrowFileReader/Writer classes. See
>>> >>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
>>> >>>> for working example code for this.
>>> >>>>
>>> >>>> - Wes
>>> >>>>
>>> >>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi
>>> >>>> <animesh.triv...@gmail.com> wrote:
>>> >>>>
>>> >>>>> Hi all,
>>> >>>>>
>>> >>>>> It might be a trivial question, so please let me know if I am
>>> >>>>> missing something.
>>> >>>>>
>>> >>>>> I am trying to write and read files in the Arrow format in Java.
>>> >>>>> My data is a simple flat schema with primitive types. I already
>>> >>>>> have the data in Java. So my questions are:
>>> >>>>> 1. Is this possible, or am I fundamentally missing something
>>> >>>>> about what Arrow can or cannot do (or is designed to do)? I
>>> >>>>> assume that an efficient in-memory columnar data format should
>>> >>>>> work with files too.
>>> >>>>> 2. Can you point me to a working example, or a starting example?
>>> >>>>> Intuitively, I am looking for a way to define a schema and
>>> >>>>> write/read column vectors to/from files, as one does with Parquet
>>> >>>>> or ORC.
>>> >>>>>
>>> >>>>> I tried to locate some working examples with the
>>> >>>>> ArrowFile[Reader/Writer] classes in the Maven tests, but so far I
>>> >>>>> am not sure where to start.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> --
>>> >>>>> Animesh
>>> >>>>>
>>> >>>>
>>> >>
>>> >
>>>
>>
>>
>>
