Hi Animesh,

First, I would suggest switching to the Arrow 0.8 release as soon as
possible, since you are writing Java programs and the API usage has changed
drastically. The new APIs are much simpler, with good Javadocs and detailed
internal comments.

If you are writing a stop-gap implementation, it is probably fine to
continue with the old version, but for the long term the new API is
recommended.

The general sequence for working with a vector is:

   - Create an instance of the vector. Note that this doesn't allocate any
   memory for the elements in the vector.
   - Grab the corresponding mutator and accessor objects by calling
   getMutator() and getAccessor().
   - Allocate memory
      - *allocateNew()* - allocates memory for a default number of
      elements in the vector. This is applicable to both fixed width and
      variable width vectors.
      - *allocateNew(valueCount)* - for fixed width vectors. Use this
      method if you already know the number of elements to store in the
      vector.
      - *allocateNew(bytes, valueCount)* - for variable width vectors. Use
      this method if you already know the total size (in bytes) of all the
      variable width elements you will be storing in the vector. For
      example, if you are going to store 1024 elements in the vector and the
      total size across all variable width elements is under 1MB, you can
      call allocateNew(1024*1024, 1024).
   - Populate the vector:
      - Use the *set()* or *setSafe()* APIs in the mutator interface. From
      Arrow 0.8 onwards, you can use these APIs directly on the vector
      instance, and the mutator and accessor are removed.
      - The difference between set() and the corresponding setSafe() API is
      that the latter internally takes care of expanding the vector's
      buffer(s) to store new data.
      - Each set() API has a corresponding setSafe() API.
   - Call setValueCount() with the number of elements you populated in
   the vector.
   - Retrieve elements from the vector:
      - Use the get(), getObject() APIs in the accessor interface. Again,
      from Arrow 0.8 onwards you can use these APIs directly.
   - With respect to usage of setInitialCapacity:
      - Let's say your application always issues calls to allocateNew(). It
      is likely that this will end up over-allocating memory because it
      assumes a default value count to begin with.
      - In this case, if you call setInitialCapacity() followed by
      allocateNew(), then the latter doesn't do the default memory
      allocation. It allocates exactly for the value capacity you specified
      in setInitialCapacity().
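
To make the sequence above concrete without needing Arrow on the classpath,
here is a minimal sketch in plain Java of a toy fixed width int vector. It is
not Arrow's implementation: the class ToyIntVector, its fields, the doubling
growth policy and the default capacity of 4096 are all assumptions made for
illustration, and only the method names mirror the ones discussed above. The
real library may also round capacities differently.

```java
import java.util.Arrays;

/** Toy stand-in for a fixed width int vector. Hypothetical, NOT Arrow code. */
final class ToyIntVector {
    // Illustrative default only; the real library's default may differ.
    static final int DEFAULT_CAPACITY = 4096;

    private int initialCapacity = DEFAULT_CAPACITY;
    private int[] data = new int[0];   // creating the vector allocates nothing
    private int valueCount = 0;

    /** Hint consumed by the next no-argument allocateNew(). */
    void setInitialCapacity(int capacity) { initialCapacity = capacity; }

    /** Allocates for the default (or hinted) number of elements. */
    void allocateNew() { allocateNew(initialCapacity); }

    /** Allocates for exactly the requested number of elements. */
    void allocateNew(int capacity) {
        data = new int[capacity];
        valueCount = 0;
    }

    int getValueCapacity() { return data.length; }

    /** No growth: throws ArrayIndexOutOfBoundsException past the capacity. */
    void set(int index, int value) { data[index] = value; }

    /** Grows the buffer by doubling until index fits, then writes. */
    void setSafe(int index, int value) {
        if (index >= data.length) {
            int newCapacity = Math.max(1, data.length);
            while (newCapacity <= index) {
                newCapacity *= 2;
            }
            data = Arrays.copyOf(data, newCapacity);
        }
        data[index] = value;
    }

    /** Declares how many of the written slots are valid for readers. */
    void setValueCount(int count) { valueCount = count; }

    int getValueCount() { return valueCount; }

    /** Reads a value; only indices below the value count are readable. */
    int get(int index) {
        if (index >= valueCount) {
            throw new IndexOutOfBoundsException(
                "index " + index + " >= valueCount " + valueCount);
        }
        return data[index];
    }
}

class ToyVectorDemo {
    public static void main(String[] args) {
        ToyIntVector v = new ToyIntVector();   // no element memory yet
        v.setInitialCapacity(4);
        v.allocateNew();                       // capacity 4, not the default
        for (int i = 0; i < 4; i++) {
            v.set(i, i * 10);                  // safe: within capacity
        }
        v.setSafe(4, 40);                      // past capacity: grows to 8
        v.setValueCount(5);
        System.out.println(v.get(4) + " capacity=" + v.getValueCapacity());
        // prints: 40 capacity=8
    }
}
```

In this sketch, calling v.set(4, 40) instead of v.setSafe(4, 40) would throw,
which is the kind of failure the plain set() APIs leave to the caller to
avoid.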

I would highly recommend taking a look at
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
This has lots of examples of populating the vector, retrieving from the
vector, using setInitialCapacity(), and using the set() and setSafe() methods
and combinations of them, which help in understanding when things can go wrong.

Hopefully this helps. Meanwhile we will try to add some internal README for
the usage of vectors.

Thanks,
Siddharth

On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com>
wrote:

> This has probably changed with the Java code refactor, but I've posted
> some answers inline, to the best of my understanding.
>
> Thanks,
>
> Emilio
>
> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>
>> Thanks Wes for your help.
>>
>> Based upon some code reading, I managed to code-up a basic working
>> example.
>> The code is here:
>> https://github.com/animeshtrivedi/ArrowExample/tree/master/src/main/java/com/github/animeshtrivedi/arrowexample
>>
>> However, I do have some questions about the concepts in Arrow
>>
>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock essentially
>> is the amount of data one must hold in memory at a time. Is my
>> understanding correct?
>>
> yes
>
>>
>> 2. There are Base[Reader/Writer] interfaces as well as Mutator/Accessor
>> classes in the ValueVector interface - both are implemented by all
>> supported data types. What is the relationship between these two, and when
>> is one supposed to use one over the other? I only use Mutator/Accessor
>> classes in my code.
>>
> The writer/reader interfaces are parallel implementations that make some
> things easier, but don't encompass all available functionality (for
> example, fixed size lists, nested lists, some dictionary operations, etc).
> However, you should be able to accomplish everything using
> mutators/accessors.
>
>>
>> 3. What are the "safe" variant functions in the Mutator's code? I could
>> not understand what they are meant to achieve.
>>
> The safe methods ensure that the vector is large enough to set the value.
> You can use the unsafe versions if you know that your vector has already
> allocated enough space for your data.
>
>> 4. What are MinorTypes?
>>
> Minor types are a representation of the different vector types. I believe
> they are being de-emphasized in favor of FieldTypes, as minor types don't
> contain enough information to represent all vectors.
>
>>
>> 5. For a writer, what is a dictionary provider? For example in the
>> Integration.java code, the reader is given as the dictionary provider for
>> the writer. But, is it something more than just:
>> DictionaryProvider.MapDictionaryProvider provider = new
>> DictionaryProvider.MapDictionaryProvider();
>> ArrowFileWriter arrowWriter = new ArrowFileWriter(root, provider,
>> fileOutputStream.getChannel());
>>
> The dictionary provider is an interface for looking up dictionary values.
> When reading a file, the reader itself has already read the dictionaries
> and thus serves as the provider.
>
>> 6. I am not entirely sure about the sequence of calls that one needs to
>> make to write with mutators. For example, if I code something like
>> NullableIntVector intVector = (NullableIntVector) fieldVector;
>> NullableIntVector.Mutator mutator = intVector.getMutator();
>> [.write num values]
>> mutator.setValueCount(num)
>> then this works for primitive types, but not for VarBinary type. There I
>> have to set the capacity first,
>>
>> NullableVarBinaryVector varBinaryVector = (NullableVarBinaryVector)
>> fieldVector;
>> varBinaryVector.setInitialCapacity(items);
>> varBinaryVector.allocateNew();
>> NullableVarBinaryVector.Mutator mutator = varBinaryVector.getMutator();
>>
> The method calls are not very well documented - I would suggest looking at
> the reader/writer implementations to see what calls are required for which
> vector types. Generally variable length vectors (lists, var binary, etc)
> behave differently than fixed width vectors (ints, longs, etc).
>
>> Examples of these are here:
>> https://github.com/animeshtrivedi/ArrowExample/blob/master/src/main/java/com/github/animeshtrivedi/arrowexample/ArrowWrite.java
>> (writeField[???] functions).
>>
>> Thank you very much,
>> --
>> Animesh
>>
>>
>>
>> On Thu, Dec 14, 2017 at 6:15 PM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>>
>>> Hi Animesh,
>>>
>>> I suggest you try the ArrowStreamReader/Writer or
>>> ArrowFileReader/Writer classes. See
>>> https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java
>>> for example working code for this.
>>>
>>> - Wes
>>>
>>> On Thu, Dec 14, 2017 at 8:30 AM, Animesh Trivedi
>>> <animesh.triv...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> It might be a trivial question, so please let me know if I am missing
>>>> something.
>>>>
>>>> I am trying to write and read files in the Arrow format in Java. My data
>>>> is a simple flat schema with primitive types. I already have the data in
>>>> Java. So my questions are:
>>>> 1. Is this possible, or am I fundamentally missing something about what
>>>> Arrow can or cannot do (or is designed to do)? I assume that an efficient
>>>> in-memory columnar data format should work with files too.
>>>> 2. Can you point me to a working example, or a starting example?
>>>> Intuitively I am looking for a way to define schema, write/read column
>>>> vectors to/from files as one does with Parquet or ORC.
>>>>
>>>> I tried to locate some working examples with ArrowFile[Reader/Writer]
>>>> classes in the maven tests but so far am not sure where to start.
>>>>
>>>> Thanks,
>>>> --
>>>> Animesh
>>>>
>>>
>
