HyukjinKwon opened a new pull request, #50658:
URL: https://github.com/apache/spark/pull/50658

   ### What changes were proposed in this pull request?
   
   This PR is a follow-up of https://github.com/apache/spark/pull/50600 that 
proposes to use `RecordBatch.schema.names` instead of `RecordBatch.column_names` 
for compatibility with older PyArrow versions. `RecordBatch.column_names` is only 
available from PyArrow 13.0 
(https://arrow.apache.org/docs/13.0/python/generated/pyarrow.RecordBatch.html).
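   A minimal sketch of the compatible way to get column names (assuming PyArrow is installed; the sample column names here are illustrative, not from Spark):
   
   ```python
   import pyarrow as pa
   
   # Build a small RecordBatch for demonstration.
   batch = pa.RecordBatch.from_pydict({"id": [1, 2], "value": ["a", "b"]})
   
   # `batch.schema.names` works on all supported PyArrow versions,
   # whereas `batch.column_names` was only added in PyArrow 13.0.
   names = batch.schema.names
   print(names)  # ['id', 'value']
   ```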
   
   ### Why are the changes needed?
   
   To keep compatibility with old PyArrow versions. The build is currently broken 
(https://github.com/apache/spark/actions/runs/14570805420/job/40867709859):
   
   ```
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 2178, in main
       process()
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 2170, in process
       serializer.dump_stream(out_iter, outfile)
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 1470, in dump_stream
       return ArrowStreamUDFSerializer.dump_stream(self, flatten_iterator(), stream)
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 181, in dump_stream
       return super(ArrowStreamUDFSerializer, self).dump_stream(wrap_and_init_stream(), stream)
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 120, in dump_stream
       for batch in iterator:
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 166, in wrap_and_init_stream
       for batch, _ in iterator:
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 1455, in flatten_iterator
       for packed in iterator:
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 1437, in load_stream
       for k, g in groupby(data_batches, key=lambda x: x[0]):
     File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 1424, in generate_data_batches
       DataRow = Row(*(batch.column_names))
   AttributeError: 'pyarrow.lib.RecordBatch' object has no attribute 'column_names'
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. The main change has not been released yet.
   
   ### How was this patch tested?
   
   Unit tests in this PR, and the scheduled build.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

