Hi all,

I would like to make some changes (updates) to the data stored in
Spark DataFrames, which I get as results of different queries.
Afterwards, I would like to work with these changed DataFrames like
ordinary DataFrames in Spark, e.g. use them for further
transformations.

I would like to use Apache Arrow as an intermediate representation of
the data I am going to update. My idea was to call
ds.toArrowPayload() and then work with the resulting RDD<ArrowPayload>:
get the record batch for each payload and perform the update on that
batch. Question: can I update individual values in a column vector,
or is it better to rewrite the whole column?
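To make the column-vector question more concrete, here is the kind of
update I have in mind, written against the plain Arrow Java API
(nothing Spark-specific; the column names "amount"/"name" are made up
for the example, and I am assuming an Arrow release with the post-0.8
vector classes such as IntVector/VarCharVector):

    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.{IntVector, VarCharVector}

    object ArrowUpdateSketch {
      def main(args: Array[String]): Unit = {
        val allocator = new RootAllocator(Long.MaxValue)

        // Fixed-width vectors (e.g. ints) seem to allow in-place overwrites.
        val amounts = new IntVector("amount", allocator)
        amounts.allocateNew(3)
        amounts.set(0, 10); amounts.set(1, 20); amounts.set(2, 30)
        amounts.setValueCount(3)
        amounts.set(1, 99)  // update a single value in place
        println((0 until 3).map(i => amounts.get(i)))  // Vector(10, 99, 30)

        // Variable-width vectors (e.g. strings) are append-oriented,
        // so rebuilding the whole column looks simpler to me.
        val oldNames = new VarCharVector("name", allocator)
        oldNames.allocateNew()
        oldNames.setSafe(0, "alice".getBytes); oldNames.setSafe(1, "bob".getBytes)
        oldNames.setValueCount(2)

        val newNames = new VarCharVector("name", allocator)
        newNames.allocateNew()
        (0 until 2).foreach { i =>
          val current = new String(oldNames.get(i))
          val updated = if (i == 1) "robert" else current  // the "update"
          newNames.setSafe(i, updated.getBytes)
        }
        newNames.setValueCount(2)

        Seq(amounts, oldNames, newNames).foreach(_.close())
        allocator.close()
      }
    }

My impression is that fixed-width vectors can be overwritten in place,
while variable-width ones are easier to rebuild, but please correct me
if that is wrong.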

The final question is how to get all the batches back into Spark,
i.e. create a DataFrame from them. Can I use the method
ArrowConverters.toDataFrame(arrowRDD, ds.schema(), ...) for that?
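This is the rough round trip I am picturing (just an outline; as far
as I can tell toArrowPayload and ArrowConverters are private[sql], so
the code would have to live in the org.apache.spark.sql package, and
the toDataFrame argument list below is only my guess from the method
name, not something I have verified):

    package org.apache.spark.sql  // to reach the private[sql] Arrow helpers

    import org.apache.spark.sql.execution.arrow.ArrowConverters

    object ArrowUpdateRoundTrip {
      def updateViaArrow(ds: DataFrame): DataFrame = {
        // Dataset -> RDD[ArrowPayload] (internal API)
        val arrowRDD = ds.toArrowPayload

        // Decode each payload to a record batch, apply the update, re-encode it
        val updatedRDD = arrowRDD.map { payload =>
          // ... get the batch, change the column vectors, build a new payload ...
          payload
        }

        // RDD[ArrowPayload] -> DataFrame; the exact parameters are my assumption
        ArrowConverters.toDataFrame(updatedRDD, ds.schema, ds.sparkSession.sqlContext)
      }
    }

If there is a more supported way to get from record batches back to a
DataFrame, I would be happy to use that instead.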

Is it going to work? Does anybody have any better ideas?
Any assistance would be greatly appreciated!

Best,
Michael
