Hello everyone, I solved the issue with my writer. Now everything works, including HDFS file reads and writes. I also wrote a Parquet-to-Arrow converter (on HDFS) that works fine.
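For anyone hitting the same HDFS-channel gap discussed later in this thread: any OutputStream (including Hadoop's FSDataOutputStream) can be adapted to the WritableByteChannel that ArrowFileWriter expects via java.nio.channels.Channels. A minimal stdlib-only sketch (the class name and the ByteArrayOutputStream stand-in are illustrative, not from the poster's code):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

public class ChannelAdapter {
    // Wrap any OutputStream (e.g. HDFS's FSDataOutputStream) as a
    // WritableByteChannel, the type ArrowFileWriter's constructor expects.
    static WritableByteChannel toChannel(OutputStream out) {
        return Channels.newChannel(out);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for an HDFS stream in this self-contained demo.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel channel = toChannel(sink)) {
            channel.write(ByteBuffer.wrap("arrow".getBytes("UTF-8")));
        }
        System.out.println(sink.toString("UTF-8")); // prints arrow
    }
}
```

The same trick works for reads with Channels.newChannel(InputStream), though ArrowFileReader additionally needs a seekable channel, so a small custom SeekableByteChannel wrapper over FSDataInputStream is still required there.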
I noticed that the Arrow javadocs are still at the 0.7 release. Can someone please update them?

FWIW: I wrote a blog post about how to read and write Arrow files in Java -
https://github.com/animeshtrivedi/blog/blob/master/post/2017-12-26-arrow.md
The corresponding code is at https://github.com/animeshtrivedi/ArrowExample

Thanks,
--
Animesh

On Wed, Dec 20, 2017 at 4:35 PM, Animesh Trivedi <animesh.triv...@gmail.com> wrote:

> I think the null pointer exception happens due to some issue in my new
> writer (which uses my implementation of the WritableByteChannel
> interface)... let me narrow it down first.
>
> The basic code, which does not use my writer's implementation, seems to
> work. That is the code on GitHub; I have not pushed the new writer
> implementation yet.
>
> Thanks
> --
> Animesh
>
> On 20 Dec 2017 14:51, "Animesh Trivedi" <animesh.triv...@gmail.com> wrote:
>
> Wes, Emilio, Siddharth - many thanks for the helpful replies and comments!
>
> I managed to upgrade the code to the 0.8 API. I have to say that the 0.8
> API is much more intuitive ;) I will summarize my code example with some
> documentation in a blog post soon (and post it here too).
>
> - Is there first-class support for reading/writing files on HDFS? HDFS's
> FSData[Output/Input]Stream classes do not implement the
> [Read/Writ]ableByteChannel interfaces required to instantiate the Arrow
> file readers and writers. I have already implemented something that works
> for me, but would it not make sense to have these facilities as utilities
> in the Arrow code?
>
> My example code runs fine on a small example of 10 rows spread over
> multiple batches, but it fails to read anything larger. I have not
> verified whether it worked on 0.7, or at what row count it starts to
> fail. The writes look fine as far as I can tell. For example, when I
> write and then read TPC-DS data (the store_sales table with ints, longs,
> and doubles), I get
>
> [...]
> Reading the arrow file : ./store_sales.arrow
> File size : 3965838890 schema is Schema<ss_sold_date_sk: Int(32, true),
> ss_sold_time_sk: Int(32, true), ss_item_sk: Int(32, true),
> ss_customer_sk: Int(32, true), ss_cdemo_sk: Int(32, true),
> ss_hdemo_sk: Int(32, true), ss_addr_sk: Int(32, true),
> ss_store_sk: Int(32, true), ss_promo_sk: Int(32, true),
> ss_ticket_number: Int(64, true), ss_quantity: Int(32, true),
> ss_wholesale_cost: FloatingPoint(DOUBLE), ss_list_price: FloatingPoint(DOUBLE),
> ss_sales_price: FloatingPoint(DOUBLE), ss_ext_discount_amt: FloatingPoint(DOUBLE),
> ss_ext_sales_price: FloatingPoint(DOUBLE), ss_ext_wholesale_cost: FloatingPoint(DOUBLE),
> ss_ext_list_price: FloatingPoint(DOUBLE), ss_ext_tax: FloatingPoint(DOUBLE),
> ss_coupon_amt: FloatingPoint(DOUBLE), ss_net_paid: FloatingPoint(DOUBLE),
> ss_net_paid_inc_tax: FloatingPoint(DOUBLE), ss_net_profit: FloatingPoint(DOUBLE)>
> Number of arrow blocks are 19
> java.lang.NullPointerException
>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:256)
>     at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeRecordBatch(MessageSerializer.java:242)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.readRecordBatch(ArrowFileReader.java:162)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:113)
>     at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:139)
>     at com.github.animeshtrivedi.arrowexample.ArrowRead.makeRead(ArrowRead.java:82)
>     at com.github.animeshtrivedi.arrowexample.ArrowRead.main(ArrowRead.java:217)
>
> Some context: the file size is 3965838890 bytes, and the schema read from
> the file is correct.
> The code where it fails is doing something like:
>
>     System.out.println("File size : " + arrowFile.length() +
>         " schema is " + root.getSchema().toString());
>     List<ArrowBlock> arrowBlocks = arrowFileReader.getRecordBlocks();
>     System.out.println("Number of arrow blocks are " + arrowBlocks.size());
>     for (int i = 0; i < arrowBlocks.size(); i++) {
>         ArrowBlock rbBlock = arrowBlocks.get(i);
>         if (!arrowFileReader.loadRecordBatch(rbBlock)) {
>             throw new IOException("Expected to read record batch");
>         }
>
> The stack trace comes from here:
> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowRead.java#L82
>
> Any idea what might be happening?
>
> Thanks,
> --
> Animesh
>
> On Tue, Dec 19, 2017 at 7:03 PM, Siddharth Teotia <siddha...@dremio.com> wrote:
>
>> From Arrow 0.8 onwards, the second step ("grab the corresponding mutator
>> and accessor objects by calls to getMutator(), getAccessor()") is not
>> needed. In fact, it is not even there.
>>
>> On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <siddha...@dremio.com> wrote:
>>
>> > Hi Animesh,
>> >
>> > Firstly, I would suggest switching over to the Arrow 0.8 release asap,
>> > since you are writing Java programs and the API usage has changed
>> > drastically. The new APIs are much simpler, with good javadocs and
>> > detailed internal comments.
>> >
>> > If you are writing a stop-gap implementation then it is probably fine
>> > to continue with the old version, but for the long term the new API
>> > usage is recommended.
>> >
>> > - Create an instance of the vector. Note that this doesn't allocate
>> >   any memory for the elements in the vector.
>> > - Grab the corresponding mutator and accessor objects via getMutator()
>> >   and getAccessor(). (Not needed from Arrow 0.8 onwards.)
>> > - Allocate memory:
>> >   - *allocateNew()* - allocates memory for a default number of
>> >     elements in the vector. This is applicable to both fixed-width
>> >     and variable-width vectors.
>> >   - *allocateNew(valueCount)* - for fixed-width vectors. Use this
>> >     method if you already know the number of elements to store in the
>> >     vector.
>> >   - *allocateNew(bytes, valueCount)* - for variable-width vectors.
>> >     Use this method if you already know the total size (in bytes) of
>> >     all the variable-width elements you will be storing in the vector.
>> >     For example, if you are going to store 1024 elements in the vector
>> >     and the total size across all variable-width elements is under 1MB,
>> >     you can call allocateNew(1024*1024, 1024).
>> > - Populate the vector:
>> >   - Use the *set() or setSafe()* APIs in the mutator interface. From
>> >     Arrow 0.8 onwards, you can use these APIs directly on the vector
>> >     instance; the mutator/accessor are removed.
>> >   - The difference between set() and the corresponding setSafe() API
>> >     is that the latter internally takes care of expanding the vector's
>> >     buffer(s) to store the new data.
>> >   - Each set() API has a corresponding setSafe() API.
>> > - Call setValueCount() with the number of elements you populated in
>> >   the vector.
>> > - Retrieve elements from the vector:
>> >   - Use the get() and getObject() APIs in the accessor interface.
>> >     Again, from Arrow 0.8 onwards you can use these APIs directly on
>> >     the vector.
>> > - With respect to the usage of setInitialCapacity():
>> >   - Say your application always calls allocateNew(). This is likely to
>> >     over-allocate memory, because allocateNew() assumes a default
>> >     value count to begin with.
>> >   - In this case, if you call setInitialCapacity() followed by
>> >     allocateNew(), the latter skips the default allocation and
>> >     allocates exactly the value capacity you specified in
>> >     setInitialCapacity().
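The steps above can be sketched end-to-end. This is a minimal sketch against the post-refactor API where set/get live on the vector itself; it uses the class name IntVector from later releases (in 0.8 the equivalent class was NullableIntVector), and the vector name "ss_quantity" is just an illustrative label:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class VectorLifecycle {
    public static void main(String[] args) {
        // The allocator owns all off-heap memory; vectors are closed before it.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector vector = new IntVector("ss_quantity", allocator)) {
            vector.setInitialCapacity(1024); // hint, so allocateNew() skips the default sizing
            vector.allocateNew();            // allocate buffers for the hinted capacity
            for (int i = 0; i < 1024; i++) {
                vector.setSafe(i, i * 2);    // setSafe() grows buffers if needed; set() does not
            }
            vector.setValueCount(1024);      // must be called before reading values back
            System.out.println(vector.get(10)); // prints 20
        }
    }
}
```

The try-with-resources ordering matters: the vector's buffers must be released before the allocator is closed, or the allocator will report a leak.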
>> >
>> > I would highly recommend taking a look at
>> > https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
>> > It has lots of examples of populating a vector, retrieving from a
>> > vector, using setInitialCapacity(), using the set() and setSafe()
>> > methods, and combinations of them, to help understand when things can
>> > go wrong.
>> >
>> > Hopefully this helps. Meanwhile, we will try to add an internal README
>> > on the usage of vectors.
>> >
>> > Thanks,
>> > Siddharth
>> >
>> > On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:
>> >
>> >> This has probably changed with the Java code refactor, but I've posted
>> >> some answers inline, to the best of my understanding.
>> >>
>> >> Thanks,
>> >>
>> >> Emilio
>> >>
>> >> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>> >>
>> >>> Thanks Wes for your help.
>> >>>
>> >>> Based on some code reading, I managed to code up a basic working
>> >>> example. The code is here:
>> >>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
>> >>>
>> >>> However, I do have some questions about the concepts in Arrow.
>> >>>
>> >>> 1. ArrowBlock is the unit of reading/writing; one ArrowBlock is
>> >>> essentially the amount of data one must hold in memory at a time. Is
>> >>> my understanding correct?
>> >>>
>> >> yes
>> >>
>> >>> 2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor
>> >>> classes in the ValueVector interface - both are implemented by all
>> >>> supported data types. What is the relationship between the two, and
>> >>> when is one supposed to be used over the other? I only use the
>> >>> Mutator/Accessor classes in my code.
>> >>>
>> >> The writer/reader interfaces are parallel implementations that make
>> >> some things easier, but they don't cover all available functionality
>> >> (for example, fixed-size lists, nested lists, some dictionary
>> >> operations, etc.). However, you should be able to accomplish everything
>> >> using mutators/accessors.
>> >>
>> >>> 3. What are the "safe" variant functions in the Mutator's code? I
>> >>> could not understand what they are meant to achieve.
>> >>>
>> >> The safe methods ensure that the vector is large enough to set the
>> >> value. You can use the unsafe versions if you know that your vector has
>> >> already allocated enough space for your data.
>> >>
>> >>> 4. What are MinorTypes?
>> >>>
>> >> Minor types are a representation of the different vector types. I
>> >> believe they are being de-emphasized in favor of FieldTypes, as minor
>> >> types don't contain enough information to represent all vectors.
>> >>
>> >>> 5. For a writer, what is a dictionary provider? For example, in the
>> >>> Integration.java code, the reader is given as the dictionary provider
>> >>> for the writer. But is it something more than just:
>> >>>
>> >>>     DictionaryProvider.MapDictionaryProvider provider =
>> >>>         new DictionaryProvider.MapDictionaryProvider();
>> >>>     ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider,
>> >>>         fileOutputStream.getChannel());
>> >>>
>> >> The dictionary provider is an interface for looking up dictionary
>> >> values. When reading a file, the reader itself has already read the
>> >> dictionaries and thus serves as the provider.
>> >>
>> >>> 6. I am not entirely sure about the sequence of calls one needs to
>> >>> make to write via mutators.
>> >>> For example, if I code something like
>> >>>
>> >>>     NullableIntVector intVector = (NullableIntVector) fieldVector;
>> >>>     NullableIntVector.Mutator mutator = intVector.getMutator();
>> >>>     [.write num values]
>> >>>     mutator.setValueCount(num);
>> >>>
>> >>> then this works for primitive types, but not for the VarBinary type.
>> >>> There I have to set the capacity first:
>> >>>
>> >>>     NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector) fieldVector;
>> >>>     varBinaryVector.setInitialCapacity(items);
>> >>>     varBinaryVector.allocateNew();
>> >>>     NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
>> >>>
>> >> The method calls are not very well documented - I would suggest looking
>> >> at the reader/writer implementations to see which calls are required
>> >> for which vector types. Generally, variable-length vectors (lists, var
>> >> binary, etc.) behave differently than fixed-width vectors (ints, longs,
>> >> etc.).
>> >>
>> >>> Examples of these are here:
>> >>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
>> >>> (the writeField[???] functions).
>> >>>
>> >>> Thank you very much,
>> >>> --
>> >>> Animesh
>> >>>
>> >>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>>
>> >>>> hi Animesh,
>> >>>>
>> >>>> I suggest you try the ArrowStreamReader/Writer or
>> >>>> ArrowFileReader/Writer classes. See
>> >>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
>> >>>> for example working code for this.
>> >>>>
>> >>>> - Wes
>> >>>>
>> >>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi
>> >>>> <animesh.triv...@gmail.com> wrote:
>> >>>>
>> >>>>> Hi all,
>> >>>>>
>> >>>>> It might be a trivial question, so please let me know if I am
>> >>>>> missing something.
>> >>>>>
>> >>>>> I am trying to write and read files in the Arrow format in Java.
>> >>>>> My data is a simple flat schema with primitive types, and I already
>> >>>>> have the data in Java. So my questions are:
>> >>>>>
>> >>>>> 1. Is this possible, or am I fundamentally missing something about
>> >>>>> what Arrow can or cannot do (or is designed to do)? I assume that an
>> >>>>> efficient in-memory columnar data format should work with files too.
>> >>>>> 2. Can you point me to a working example, or a starting example?
>> >>>>> Intuitively, I am looking for a way to define a schema and
>> >>>>> write/read column vectors to/from files, as one does with Parquet
>> >>>>> or ORC.
>> >>>>>
>> >>>>> I tried to locate some working examples using the
>> >>>>> ArrowFile[Reader/Writer] classes in the Maven tests, but so far I am
>> >>>>> not sure where to start.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> --
>> >>>>> Animesh
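For readers landing on this thread looking for the working example the original question asks for, here is a minimal file write/read round trip. It is a sketch against a recent Arrow Java API: class names such as IntVector postdate the 0.8 release discussed above (where it was NullableIntVector), and details may differ across versions:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Collections;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class ArrowRoundTrip {
    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("example", ".arrow");
        // Define a one-column schema: a nullable 32-bit signed int.
        Schema schema = new Schema(Collections.singletonList(
            Field.nullable("ints", new ArrowType.Int(32, true))));

        // Write one record batch of 10 rows.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            IntVector vector = (IntVector) root.getVector("ints");
            vector.allocateNew(10);
            for (int i = 0; i < 10; i++) vector.setSafe(i, i);
            root.setRowCount(10);
            try (FileOutputStream out = new FileOutputStream(file);
                 ArrowFileWriter writer = new ArrowFileWriter(root, null, out.getChannel())) {
                writer.start();
                writer.writeBatch();
                writer.end();
            }
        }

        // Read the batches back; the file reader needs a seekable channel.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileInputStream in = new FileInputStream(file);
             ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {
                System.out.println(root.getRowCount()); // prints 10
            }
        }
    }
}
```

The null second argument to ArrowFileWriter is the dictionary provider; as Emilio notes above, it is only needed when dictionary-encoded vectors are written.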