Hi Michael,

Please see my comments inline.

Keep in mind these are all fairly internal APIs, so they might change in the
future.

On Mon, Feb 5, 2018 at 11:30 AM, Michael Shtelma <mshte...@gmail.com> wrote:

> Hi all,
>
> I would like to make some changes (updates) to the data stored in
> Spark data frames, which I get as a result of different queries.
> Afterwards, I would like to use these changed data frames as normal
> data frames in Spark, e.g. for further transformations.
>
> I would like to use Apache Arrow as an intermediate representation of
> the data I am going to update. My idea was to call
> ds.toArrowPayload() and then operate on the RDD<ArrowPayload>: get
> the batch for each payload and perform the update operation on the
> batch. Question: Can I update individual values for some column
> vector? Or is it better to rewrite the whole column?
>

Yes, you can update individual values in the vectors.


> And the final question is how to get all the batches back to Spark, I
> mean, create a data frame?
> Can I use the method ArrowConverters.toDataFrame(arrowRDD, ds.schema(),
> ...) for that?
>

You probably want to use the columnar vector API to do that:
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java#L134
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java#L31


> Is it going to work? Does anybody have any better ideas?
> Any assistance would be greatly appreciated!
>
> Best,
> Michael
>
