Hi all, I would like to make some changes (updates) to the data stored in Spark data frames that I get as the result of different queries. Afterwards, I would like to work with these changed data frames just like regular data frames in Spark, e.g. use them for further transformations.
I would like to use Apache Arrow as an intermediate representation of the data I am going to update. My idea was to call ds.toArrowPayload() and then operate on the resulting RDD<ArrowPayload>, i.e. get the record batch for each payload and perform the update operation on that batch.

My questions: Can I update individual values in a column vector, or is it better to rewrite the whole column? And finally, how do I get all the batches back into Spark, i.e. create a data frame from them? Can I use the method ArrowConverters.toDataFrame(arrowRDD, ds.schema(), ...) for that, and would it work?

Does anybody have any better ideas? Any assistance would be greatly appreciated!

Best,
Michael
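P.S. To make the per-value update question more concrete, here is a minimal, self-contained sketch of what I mean by updating a column vector, using only the public Arrow Java vector API (IntVector and the values are just hypothetical examples; class names follow recent Arrow releases, and none of Spark's internal ArrowPayload classes are involved here):

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector

object ArrowUpdateSketch {
  def main(args: Array[String]): Unit = {
    val allocator = new RootAllocator(Long.MaxValue)

    // Hypothetical integer column, standing in for one column vector
    // of a record batch extracted from an Arrow payload.
    val vector = new IntVector("values", allocator)
    vector.allocateNew(5)
    (0 until 5).foreach(i => vector.setSafe(i, i * 10))
    vector.setValueCount(5)

    // Overwrite a single slot in place. This is possible while the vector
    // is still owned by this code; once a record batch has been handed off,
    // Arrow expects it to be treated as immutable, so rewriting the column
    // (or the whole batch) before re-serializing may be the safer pattern.
    vector.setSafe(2, 999)

    // Prints: 0, 10, 999, 30, 40
    println((0 until vector.getValueCount).map(vector.get).mkString(", "))

    vector.close()
    allocator.close()
  }
}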