Hi Michael,

Please see my comments inline. Keep in mind these are all fairly internal APIs, so they might change in the future.

On Mon, Feb 5, 2018 at 11:30 AM, Michael Shtelma <mshte...@gmail.com> wrote:

> Hi all,
>
> I would like to make some changes (updates) to the data stored in
> Spark data frames, which I get as a result of different queries.
> Afterwards, I would like to operate on these changed data frames as
> on normal data frames in Spark, e.g. use them for further
> transformations.
>
> I would like to use Apache Arrow as an intermediate representation of
> the data I am going to update. My idea was to call
> ds.toArrowPayload() and then operate on the resulting
> RDD<ArrowPayload>, i.e. get the batch for each payload and perform
> the update operation on the batch. Question: can I update individual
> values in a column vector, or is it better to rewrite the whole
> column?

Yes, you can update individual values in a vector; see the first sketch at the end of this mail.

> And the final question: how do I get all the batches back to Spark,
> i.e. create a data frame from them? Can I use the method
> ArrowConverters.toDataFrame(arrowRDD, ds.schema(), ...) for that?

You probably want to use the columnar vector API for that (see the second sketch below):

https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java#L134
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java#L31

> Is it going to work? Does anybody have any better ideas?
> Any assistance would be greatly appreciated!
>
> Best,
> Michael
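Sketch 1: point updates on an Arrow vector, using the plain Arrow Java API from Scala. This is only a minimal sketch, assuming Arrow 0.8.x (the version Spark currently bundles); the column name "col" and the values are made up for illustration:

    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.IntVector

    val allocator = new RootAllocator(Long.MaxValue)
    val vector = new IntVector("col", allocator)  // hypothetical column name

    // Fill five slots, then mark the vector as holding five values.
    vector.allocateNew(5)
    (0 until 5).foreach(i => vector.setSafe(i, i * 10))
    vector.setValueCount(5)

    // Point updates: overwrite a single slot, or null one out. Validity is
    // tracked in a separate bitmap, so nulling does not touch the data buffer.
    vector.setSafe(2, 999)
    vector.setNull(3)

    println(vector.get(2))     // 999
    println(vector.isNull(3))  // true

So you do not need to rewrite the whole column for a handful of updates, although for bulk changes it can be simpler to build a fresh vector.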
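Sketch 2: wrapping the updated Arrow vector for Spark's columnar readers via ArrowColumnVector and ColumnarBatch (the two classes linked above). Again a rough sketch against today's master, reusing `vector` from the first sketch; note that going from a batch all the way back to a DataFrame still means touching private[sql] code such as ArrowConverters, so treat the exact signatures as unstable:

    import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}

    // Wrap the (already updated) Arrow vector so Spark sees it as a
    // ColumnVector, then assemble a single-column batch around it.
    val columns: Array[ColumnVector] = Array(new ArrowColumnVector(vector))
    val batch = new ColumnarBatch(columns)
    batch.setNumRows(vector.getValueCount)

    // Read the rows back out through the batch's row view.
    val it = batch.rowIterator()
    while (it.hasNext) {
      val row = it.next()
      if (!row.isNullAt(0)) println(row.getInt(0))
    }

ArrowColumnVector is read-only from Spark's side, which is why the updates in the first sketch go through the Arrow API directly, before the vector is wrapped.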