Hi Bryan,

For the case where the column is not a timestamp and was not modified: I don't
think assigning the columns back in a loop like that will take copies of the
full dataframe. But it is still doing work (it will copy the data for that
column into the array backing the consolidated 2D block), which I think can
easily be avoided by only assigning back when the column was actually modified
(e.g. by moving the is_datetime64tz_dtype check inline into the loop that
iterates through all columns, so you only write back if the data are actually
tz-aware).
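
To make that concrete, a rough sketch of the loop I have in mind (only
illustrative, assuming the existing _check_series_localize_timestamps helper
and written against the pandas API of that time, e.g. DataFrame.iteritems):

    from pandas.api.types import is_datetime64tz_dtype

    def _check_dataframe_localize_timestamps(pdf, timezone):
        # Only assign a column back into the DataFrame when it actually holds
        # tz-aware timestamps; untouched columns are never written back, so no
        # data gets copied into the consolidated 2D blocks for them.
        for column, series in pdf.iteritems():
            if is_datetime64tz_dtype(series.dtype):
                pdf[column] = _check_series_localize_timestamps(series, timezone)
        return pdf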

Further, even if you do the above to avoid writing back to the dataframe when
not needed, I am not sure you should directly try to use the new zero-copy
feature of the Table.to_pandas conversion (with split_blocks=True). It depends
very much on what happens next with the converted dataframe. Once you do some
operations in pandas, those split blocks will get consolidated again (resulting
in a memory copy at that point), and it also means you can't modify the
dataframe in place (if this dataframe is used in Python UDFs, that might limit
what can be done in those UDFs; just guessing here, I don't know the pyspark
code well enough).
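
For reference, roughly the trade-off I mean (just a sketch; the read-only part
is exactly what the quoted traceback below shows):

    import pyarrow as pa

    table = pa.table({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

    # Default conversion: consolidated blocks backed by writable copies.
    pdf = table.to_pandas()
    pdf["a"] = pdf["a"] + 1  # fine, the data are our own copy

    # With split_blocks=True, columns can end up as 1-column blocks that
    # reference the Arrow memory directly (zero-copy where possible). Those
    # arrays may be read-only, so writing a column back can raise, and any
    # pandas operation that consolidates blocks will copy the data anyway.
    pdf_split = table.to_pandas(split_blocks=True)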

Joris


On Thu, 23 Jan 2020 at 21:03, Bryan Cutler <cutl...@gmail.com> wrote:

> Thanks for investigating this and the quick fix Joris and Wes!  I just have
> a couple questions about the behavior observed here.  The pyspark code
> assigns either the same series back to the pandas.DataFrame or makes some
> modifications if it is a timestamp. In the case there are no timestamps, is
> this potentially making extra copies or will it be unable to take advantage
> of new zero-copy features in pyarrow? For the case of having timestamp
> columns that need to be modified, is there a more efficient way to create a
> new dataframe with only copies of the modified series?  Thanks!
>
> Bryan
>
> On Thu, Jan 16, 2020 at 11:48 PM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > That sounds like a good solution. Having the zero-copy behavior depend on
> > whether you have only 1 column of a certain type or not might lead to
> > surprising results. To avoid yet another keyword, only doing it when
> > split_blocks=True sounds good to me (in practice, that's also when it will
> > happen mostly, except for very narrow dataframes with only a few columns).
> >
> > Joris
> >
> > On Thu, 16 Jan 2020 at 22:44, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi Joris,
> > >
> > > Thanks for investigating this. It seems there were some unintended
> > > consequences of the zero-copy optimizations from ARROW-3789. Another
> > > way forward might be to "opt in" to this behavior, or to only do the
> > > zero copy optimizations when split_blocks=True. What do you think?
> > >
> > > - Wes
> > >
> > > On Thu, Jan 16, 2020 at 3:42 AM Joris Van den Bossche
> > > <jorisvandenboss...@gmail.com> wrote:
> > > >
> > > > So the spark integration build started to fail, with the following
> > > > test error:
> > > >
> > > >
> > > > ======================================================================
> > > > ERROR: test_toPandas_batch_order (pyspark.sql.tests.test_arrow.EncryptionArrowTests)
> > > > ----------------------------------------------------------------------
> > > > Traceback (most recent call last):
> > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 422, in test_toPandas_batch_order
> > > >     run_test(*case)
> > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 409, in run_test
> > > >     pdf, pdf_arrow = self._toPandas_arrow_toggle(df)
> > > >   File "/spark/python/pyspark/sql/tests/test_arrow.py", line 152, in _toPandas_arrow_toggle
> > > >     pdf_arrow = df.toPandas()
> > > >   File "/spark/python/pyspark/sql/pandas/conversion.py", line 115, in toPandas
> > > >     return _check_dataframe_localize_timestamps(pdf, timezone)
> > > >   File "/spark/python/pyspark/sql/pandas/types.py", line 180, in _check_dataframe_localize_timestamps
> > > >     pdf[column] = _check_series_localize_timestamps(series, timezone)
> > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3487, in __setitem__
> > > >     self._set_item(key, value)
> > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/frame.py", line 3565, in _set_item
> > > >     NDFrame._set_item(self, key, value)
> > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/generic.py", line 3381, in _set_item
> > > >     self._data.set(key, value)
> > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1090, in set
> > > >     blk.set(blk_locs, value_getitem(val_locs))
> > > >   File "/opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 380, in set
> > > >     self.values[locs] = values
> > > > ValueError: assignment destination is read-only
> > > >
> > > >
> > > > It's from a test that is doing conversions from spark to arrow to
> > > > pandas (so calling pyarrow.Table.to_pandas here
> > > > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/conversion.py#L111-L115>),
> > > > and on the resulting DataFrame, it is iterating through all columns,
> > > > potentially fixing timezones, and writing each column back into the
> > > > DataFrame (here
> > > > <https://github.com/apache/spark/blob/018bdcc53c925072b07956de0600452ad255b9c7/python/pyspark/sql/pandas/types.py#L179-L181>).
> > > >
> > > > Since it is giving an error about read-only, it might be related to
> > > > zero-copy behaviour of to_pandas, and thus might be related to the
> > > > refactor of the arrow->pandas conversion that landed yesterday
> > > > (https://github.com/apache/arrow/pull/6067, it says it changed to do
> > > > zero-copy for 1-column blocks if possible).
> > > > I am not sure if something should be fixed in pyarrow for this, but the
> > > > obvious thing that pyspark can do is specify they don't want zero-copy.
> > > >
> > > > Joris
> > > >
> > > > On Wed, 15 Jan 2020 at 14:32, Crossbow <cross...@ursalabs.org> wrote:
> > > >
> > >
> >
>
