Thanks for investigating this and for the quick fix, Joris and Wes! I just have a couple of questions about the behavior observed here. The pyspark code assigns either the same series back to the pandas.DataFrame, or a modified series if it is a timestamp column. In the case where there are no timestamps, is this potentially making extra copies, or will it be unable to take advantage of the new zero-copy features in pyarrow? For the case of timestamp columns that do need to be modified, is there a more efficient way to create a new dataframe with copies of only the modified series? A sketch of what I have in mind is below. Thanks!
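For example, something along these lines (just a sketch of the idea, not the actual pyspark code: I've inlined a stand-in for the `_check_series_localize_timestamps` helper named in the traceback below, and the tz-dtype check is my assumption):

    from pandas.api.types import is_datetime64tz_dtype

    def localize_timestamps(pdf, timezone):
        # Sketch: only assign columns that actually change. Writing an
        # *unmodified* series back into the frame is what triggers the
        # in-place write into a read-only, zero-copy block, so skipping
        # those columns leaves their backing arrays untouched.
        for name in pdf.columns:
            if is_datetime64tz_dtype(pdf[name].dtype):
                # Stand-in for pyspark's _check_series_localize_timestamps:
                # convert to the session timezone and drop the tz info.
                pdf[name] = (pdf[name]
                             .dt.tz_convert(timezone)
                             .dt.tz_localize(None))
        return pdf

The assignment here replaces the column with a freshly allocated array of a different dtype (tz-aware -> naive), so as far as I can tell it shouldn't hit the read-only in-place path.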
Bryan

On Thu, Jan 16, 2020 at 11:48 PM Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> That sounds like a good solution. Having the zero-copy behavior depend on
> whether you have only one column of a certain type might lead to
> surprising results. To avoid yet another keyword, only doing it when
> split_blocks=True sounds good to me (in practice, that's also when it will
> mostly happen, except for very narrow dataframes with only a few columns).
>
> Joris
>
> On Thu, 16 Jan 2020 at 22:44, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Joris,
> >
> > Thanks for investigating this. It seems there were some unintended
> > consequences of the zero-copy optimizations from ARROW-3789. Another
> > way forward might be to "opt in" to this behavior, or to only do the
> > zero-copy optimizations when split_blocks=True. What do you think?
> >
> > - Wes
> >
> > On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche
> > <jorisvandenboss...@gmail.com> wrote:
> > >
> > > So the Spark integration build started to fail, with the following
> > > test error:
> > >
> > > ======================================================================
> > > ERROR: test_toPandas_batch_order
> > > (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > > ----------------------------------------------------------------------
> > > Traceback (most recent call last):
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in test_toPandas_batch_order
> > >     run_test(*case)
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in run_test
> > >     pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in _toPandas_arrow_toggle
> > >     pdf_arrow = df.toPandas()
> > >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in toPandas
> > >     return _check_dataframe_localize_timestamps(pdf, timezone)
> > >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in _check_dataframe_localize_timestamps
> > >     pdf[column] = _check_series_localize_timestamps(series, timezone)
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3487, in __setitem__
> > >     self._set_item(key, value)
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3565, in _set_item
> > >     NDFrame._set_item(self, key, value)
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py", line 3381, in _set_item
> > >     self._data.set(key, value)
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1090, in set
> > >     blk.set(blk_locs, value_getitem(val_locs))
> > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 380, in set
> > >     self.values[locs] = values
> > > ValueError: assignment destination is read-only
> > >
> > > It's from a test that does conversions from Spark to Arrow to pandas
> > > (calling pyarrow.Table.to_pandas here
> > > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115>),
> > > and on the resulting DataFrame it iterates through all columns,
> > > potentially fixing timezones, and writes each column back into the
> > > DataFrame (here
> > > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181>).
> > >
> > > Since it is giving an error about read-only data, it might be related
> > > to the zero-copy behaviour of to_pandas, and thus to the refactor of
> > > the arrow->pandas conversion that landed yesterday
> > > (https://github.com/apache/arrow/pull/6067, which changed it to do
> > > zero-copy for 1-column blocks where possible).
> > > I am not sure whether something should be fixed in pyarrow for this,
> > > but the obvious thing pyspark can do is specify that it doesn't want
> > > zero-copy.
> > >
> > > Joris
> > >
> > > On Wed, 15 Jan 2020 at 14:32, Crossbow <cross...@ursalabs.org> wrote:
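
For anyone skimming the thread, here is a small sketch of the interaction being discussed (this assumes the split_blocks keyword mentioned above; whether a given column actually ends up zero-copy also depends on its type and on the presence of nulls):

    import pyarrow as pa

    table = pa.table({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

    # Default: same-dtype columns are consolidated into one 2-D block,
    # which forces a copy, so the resulting arrays are writable.
    pdf = table.to_pandas()
    print(pdf["x"].values.flags.writeable)  # True

    # With split_blocks=True, each column gets its own 1-column block and
    # pyarrow can hand out zero-copy views of the Arrow buffers. Those
    # views are read-only, so an in-place assignment like pyspark's
    # `pdf[column] = series` can fail with
    # "ValueError: assignment destination is read-only".
    pdf_split = table.to_pandas(split_blocks=True)
    print(pdf_split["x"].values.flags.writeable)  # may be False (zero-copy)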