Hi Michael,

Please see my comments inline. Keep in mind these are all fairly internal APIs, so they might change in the future.

On Mon, Feb 5, 2018 at 11:30 AM, Michael Shtelma <mshte...@gmail.com> wrote:

> Hi all,
>
> I would like to make some changes (updates) to the data stored in
> Spark data frames, which I get as a result of different queries.
> Afterwards, I would like to operate on these changed data frames as
> on normal data frames in Spark, e.g. use them for further
> transformations.
>
> I would like to use Apache Arrow as an intermediate representation of
> the data I am going to update. My idea was to call
> ds.toArrowPayload() and then operate on the resulting
> RDD<ArrowPayload>, i.e. get the batch for each payload and perform
> the update operation on the batch. Question: can I update individual
> values in a column vector, or is it better to rewrite the whole
> column?

Yes, you can update individual values in a vector; see the first sketch at the end of this mail.

> And the final question: how do I get all the batches back to Spark,
> i.e. create a data frame from them? Can I use the method
> ArrowConverters.toDataFrame(arrowRDD, ds.schema(), ...) for that?

You probably want to use the columnar vector API for that (see the second sketch below):

https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java#L134
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java#L31

> Is it going to work? Does anybody have any better ideas?
> Any assistance would be greatly appreciated!
>
> Best,
> Michael
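Sketch 1: point updates on an Arrow vector, using the plain Arrow Java API from Scala. This is only a minimal sketch, assuming Arrow 0.8.x (the version Spark currently bundles); the column name "col" and the values are made up for illustration:

    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.IntVector

    val allocator = new RootAllocator(Long.MaxValue)
    val vector = new IntVector("col", allocator)  // hypothetical column name

    // Fill five slots, then mark the vector as holding five values.
    vector.allocateNew(5)
    (0 until 5).foreach(i => vector.setSafe(i, i * 10))
    vector.setValueCount(5)

    // Point updates: overwrite a single slot, or null one out. Validity is
    // tracked in a separate bitmap, so nulling does not touch the data buffer.
    vector.setSafe(2, 999)
    vector.setNull(3)

    println(vector.get(2))     // 999
    println(vector.isNull(3))  // true

So you do not need to rewrite the whole column for a handful of updates, although for bulk changes it can be simpler to build a fresh vector.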
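Sketch 2: wrapping the updated Arrow vector for Spark's columnar readers via ArrowColumnVector and ColumnarBatch (the two classes linked above). Again a rough sketch against today's master, reusing `vector` from the first sketch; note that going from a batch all the way back to a DataFrame still means touching private[sql] code such as ArrowConverters, so treat the exact signatures as unstable:

    import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}

    // Wrap the (already updated) Arrow vector so Spark sees it as a
    // ColumnVector, then assemble a single-column batch around it.
    val columns: Array[ColumnVector] = Array(new ArrowColumnVector(vector))
    val batch = new ColumnarBatch(columns)
    batch.setNumRows(vector.getValueCount)

    // Read the rows back out through the batch's row view.
    val it = batch.rowIterator()
    while (it.hasNext) {
      val row = it.next()
      if (!row.isNullAt(0)) println(row.getInt(0))
    }

ArrowColumnVector is read-only from Spark's side, which is why the updates in the first sketch go through the Arrow API directly, before the vector is wrapped.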