I put up a PR to revert ARROW-9223. If the problem cannot be resolved another way, then I recommend applying the reversion and cutting RC2:
https://github.com/apache/arrow/pull/7802

To state the obvious, we must verify that this resolves the Spark problem also.

On Sun, Jul 19, 2020 at 6:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> I think I see the problem now:
>
> In [40]: parr
> Out[40]:
> 0    {'f0': 1969-12-31 16:00:00-08:00}
> 1    {'f0': 1969-12-31 16:00:00.000001-08:00}
> 2    {'f0': 1969-12-31 16:00:00.000002-08:00}
> dtype: object
>
> In [41]: parr[0]['f0']
> Out[41]: datetime.datetime(1969, 12, 31, 16, 0, tzinfo=<DstTzInfo
> 'America/Los_Angeles' PST-1 day, 16:00:00 STD>)
>
> In [42]: pa.array(parr)
> Out[42]:
> <pyarrow.lib.StructArray object at 0x7f0893706a60>
> -- is_valid: all not null
> -- child 0 type: timestamp[us]
>   [
>     1969-12-31 16:00:00.000000,
>     1969-12-31 16:00:00.000001,
>     1969-12-31 16:00:00.000002
>   ]
>
> In [43]: pa.array(parr).field(0).type
> Out[43]: TimestampType(timestamp[us])
>
> On 0.17.1:
>
> In [8]: arr = pa.array([0, 1, 2], type=pa.timestamp('us',
> 'America/Los_Angeles'))
>
> In [9]: arr
> Out[9]:
> <pyarrow.lib.TimestampArray object at 0x7f9dede69d00>
> [
>   1970-01-01 00:00:00.000000,
>   1970-01-01 00:00:00.000001,
>   1970-01-01 00:00:00.000002
> ]
>
> In [10]: struct_arr = pa.StructArray.from_arrays([arr], names=['f0'])
>
> In [11]: struct_arr
> Out[11]:
> <pyarrow.lib.StructArray object at 0x7f9ded0016e0>
> -- is_valid: all not null
> -- child 0 type: timestamp[us, tz=America/Los_Angeles]
>   [
>     1970-01-01 00:00:00.000000,
>     1970-01-01 00:00:00.000001,
>     1970-01-01 00:00:00.000002
>   ]
>
> In [12]: struct_arr.to_pandas()
> Out[12]:
> 0    {'f0': 1970-01-01 00:00:00}
> 1    {'f0': 1970-01-01 00:00:00.000001}
> 2    {'f0': 1970-01-01 00:00:00.000002}
> dtype: object
>
> In [13]: pa.array(struct_arr.to_pandas())
> Out[13]:
> <pyarrow.lib.StructArray object at 0x7f9ded003210>
> -- is_valid: all not null
> -- child 0 type: timestamp[us]
>   [
>     1970-01-01 00:00:00.000000,
>     1970-01-01 00:00:00.000001,
>     1970-01-01 00:00:00.000002
>   ]
>
> In [14]: pa.array(struct_arr.to_pandas()).type
> Out[14]: StructType(struct<f0: timestamp[us]>)
>
> So while the time zone is getting stripped in both cases, the failure
> to round trip is a problem. If we are going to attach the time zone in
> to_pandas(), then we need to respect it when going the other way.
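Distilled into a standalone script, the failing round trip from the session above looks roughly like this (a sketch, not from the original thread; it assumes pyarrow exhibiting the RC1 behavior shown above):

```
import pyarrow as pa

# Build a struct array whose child field is a tz-aware timestamp.
ts_type = pa.timestamp('us', 'America/Los_Angeles')
arr = pa.array([0, 1, 2], type=ts_type)
struct_arr = pa.StructArray.from_arrays([arr], names=['f0'])

# Round trip through pandas and back to Arrow.
roundtripped = pa.array(struct_arr.to_pandas())

print(struct_arr.type)    # struct<f0: timestamp[us, tz=America/Los_Angeles]>
print(roundtripped.type)  # on RC1: struct<f0: timestamp[us]>, tz stripped

# If the time zone survived the round trip, these types would match;
# on RC1 this assertion fails, demonstrating the regression.
assert roundtripped.type == struct_arr.type
```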
> This looks like a regression to me, and so I'm inclined to revise my
> vote on the release to -0/-1.
>
> On Sun, Jul 19, 2020 at 6:46 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Ah, I forgot that this is a "feature" of nanosecond timestamps:
> >
> > In [21]: arr = pa.array([0, 1, 2], type=pa.timestamp('us',
> > 'America/Los_Angeles'))
> >
> > In [22]: struct_arr = pa.StructArray.from_arrays([arr], names=['f0'])
> >
> > In [23]: struct_arr.to_pandas()
> > Out[23]:
> > 0    {'f0': 1969-12-31 16:00:00-08:00}
> > 1    {'f0': 1969-12-31 16:00:00.000001-08:00}
> > 2    {'f0': 1969-12-31 16:00:00.000002-08:00}
> > dtype: object
> >
> > So this is working as intended, such as it is.
> >
> > On Sun, Jul 19, 2020 at 6:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > There seems to be other broken StructArray stuff:
> > >
> > > In [14]: arr = pa.array([0, 1, 2], type=pa.timestamp('ns',
> > > 'America/Los_Angeles'))
> > >
> > > In [15]: struct_arr = pa.StructArray.from_arrays([arr], names=['f0'])
> > >
> > > In [16]: struct_arr
> > > Out[16]:
> > > <pyarrow.lib.StructArray object at 0x7f089370f590>
> > > -- is_valid: all not null
> > > -- child 0 type: timestamp[ns, tz=America/Los_Angeles]
> > >   [
> > >     1970-01-01 00:00:00.000000000,
> > >     1970-01-01 00:00:00.000000001,
> > >     1970-01-01 00:00:00.000000002
> > >   ]
> > >
> > > In [17]: struct_arr.to_pandas()
> > > Out[17]:
> > > 0    {'f0': 0}
> > > 1    {'f0': 1}
> > > 2    {'f0': 2}
> > > dtype: object
> > >
> > > All in all, it appears that this part of the project needs some TLC.
> > >
> > > On Sun, Jul 19, 2020 at 6:16 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > Well, the problem is that time zones are really finicky when comparing
> > > > Spark (which uses a localtime interpretation of timestamps without
> > > > time zones) and Arrow (which has naive timestamps -- a concept similar
> > > > to but different from the SQL concept TIMESTAMP WITHOUT TIME ZONE --
> > > > and tz-aware timestamps). So somewhere a time zone is being stripped
> > > > or applied/localized, which may result in the data transferred to/from
> > > > Spark being shifted by the time zone offset. I think it's important
> > > > that we determine what the problem is -- if it's a problem that has to
> > > > be fixed in Arrow (and it's not clear to me that it is), it's worth
> > > > spending some time to understand what's going on to avoid the
> > > > possibility of a patch release on account of this.
> > > >
> > > > On Sun, Jul 19, 2020 at 6:12 PM Neal Richardson
> > > > <neal.p.richard...@gmail.com> wrote:
> > > > >
> > > > > If it’s a display problem, should it block the release?
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On Jul 19, 2020, at 3:57 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > >
> > > > > > I opened https://issues.apache.org/jira/browse/ARROW-9525 about the
> > > > > > display problem. My guess is that there are other problems lurking
> > > > > > here.
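The offset shift Wes describes (Spark's localtime reading of naive timestamps versus Arrow's UTC-based storage of tz-aware ones) can be illustrated with plain datetime code. This is a sketch, not from the original thread, and assumes pytz, the library behind the DstTzInfo objects shown earlier:

```
import datetime
import pytz

# The epoch instant, stored as UTC; in Arrow, a tz-aware timestamp is a
# UTC offset from the epoch, and the zone only affects display.
instant = datetime.datetime(1970, 1, 1, tzinfo=pytz.utc)

# The localized wall-clock view, as pandas renders it.
local = instant.astimezone(pytz.timezone('America/Los_Angeles'))
print(local)  # 1969-12-31 16:00:00-08:00

# Dropping tzinfo yields a naive local time, the Spark-style reading.
# Re-interpreting that naive value as UTC shifts the data by 8 hours,
# which is the kind of offset error under discussion.
naive = local.replace(tzinfo=None)
print(naive)  # 1969-12-31 16:00:00
```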
> > > > > >
> > > > > >> On Sun, Jul 19, 2020 at 5:54 PM Wes McKinney <wesmck...@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> hi Bryan,
> > > > > >>
> > > > > >> This is a display bug:
> > > > > >>
> > > > > >> In [6]: arr = pa.array([0, 1, 2], type=pa.timestamp('ns',
> > > > > >> 'America/Los_Angeles'))
> > > > > >>
> > > > > >> In [7]: arr.view('int64')
> > > > > >> Out[7]:
> > > > > >> <pyarrow.lib.Int64Array object at 0x7fd1b8aaef30>
> > > > > >> [
> > > > > >>   0,
> > > > > >>   1,
> > > > > >>   2
> > > > > >> ]
> > > > > >>
> > > > > >> In [8]: arr
> > > > > >> Out[8]:
> > > > > >> <pyarrow.lib.TimestampArray object at 0x7fd1b8aae6e0>
> > > > > >> [
> > > > > >>   1970-01-01 00:00:00.000000000,
> > > > > >>   1970-01-01 00:00:00.000000001,
> > > > > >>   1970-01-01 00:00:00.000000002
> > > > > >> ]
> > > > > >>
> > > > > >> In [9]: arr.to_pandas()
> > > > > >> Out[9]:
> > > > > >> 0   1969-12-31 16:00:00-08:00
> > > > > >> 1   1969-12-31 16:00:00.000000001-08:00
> > > > > >> 2   1969-12-31 16:00:00.000000002-08:00
> > > > > >> dtype: datetime64[ns, America/Los_Angeles]
> > > > > >>
> > > > > >> The repr of TimestampArray doesn't take the timezone into account:
> > > > > >>
> > > > > >> In [10]: arr[0]
> > > > > >> Out[10]: <pyarrow.TimestampScalar: Timestamp('1969-12-31
> > > > > >> 16:00:00-0800', tz='America/Los_Angeles')>
> > > > > >>
> > > > > >> So if it's incorrect, the problem is happening somewhere before or
> > > > > >> while the StructArray is being created. If I had to guess, it's
> > > > > >> caused by the tzinfo of the datetime.datetime values not being
> > > > > >> handled in the way that they were before.
> > > > > >>
> > > > > >>> On Sun, Jul 19, 2020 at 5:19 PM Wes McKinney
> > > > > >>> <wesmck...@gmail.com> wrote:
> > > > > >>>
> > > > > >>> Well, this is not good and pretty disappointing given that we had
> > > > > >>> nearly a month to sort through the implications of Micah's patch.
> > > > > >>> We should try to resolve this ASAP.
> > > > > >>>
> > > > > >>> On Sun, Jul 19, 2020 at 5:10 PM Bryan Cutler <cutl...@gmail.com>
> > > > > >>> wrote:
> > > > > >>>>
> > > > > >>>> +0 (non-binding)
> > > > > >>>>
> > > > > >>>> I ran the verification script for binaries and then source, as
> > > > > >>>> below, and both look good:
> > > > > >>>>
> > > > > >>>> ARROW_TMPDIR=/tmp/arrow-test TEST_DEFAULT=0 TEST_SOURCE=1
> > > > > >>>> TEST_CPP=1 TEST_PYTHON=1 TEST_JAVA=1 TEST_INTEGRATION_CPP=1
> > > > > >>>> TEST_INTEGRATION_JAVA=1
> > > > > >>>> dev/release/verify-release-candidate.sh source 1.0.0 1
> > > > > >>>>
> > > > > >>>> I tried to patch Spark locally to verify the recent change in
> > > > > >>>> nested timestamps and was not able to get things working quite
> > > > > >>>> right, but I'm not sure if the problem is in Spark, Arrow or my
> > > > > >>>> patch -- hence my vote of +0.
> > > > > >>>>
> > > > > >>>> Here is what I'm seeing:
> > > > > >>>>
> > > > > >>>> ```
> > > > > >>>> (Input as datetime)
> > > > > >>>> datetime.datetime(2018, 3, 10, 0, 0)
> > > > > >>>> datetime.datetime(2018, 3, 15, 0, 0)
> > > > > >>>>
> > > > > >>>> (Struct Array)
> > > > > >>>> -- is_valid: all not null
> > > > > >>>> -- child 0 type: timestamp[us, tz=America/Los_Angeles]
> > > > > >>>>   [
> > > > > >>>>     2018-03-10 00:00:00.000000,
> > > > > >>>>     2018-03-10 00:00:00.000000
> > > > > >>>>   ]
> > > > > >>>> -- child 1 type: timestamp[us, tz=America/Los_Angeles]
> > > > > >>>>   [
> > > > > >>>>     2018-03-15 00:00:00.000000,
> > > > > >>>>     2018-03-15 00:00:00.000000
> > > > > >>>>   ]
> > > > > >>>>
> > > > > >>>> (Flattened Arrays)
> > > > > >>>> types [TimestampType(timestamp[us, tz=America/Los_Angeles]),
> > > > > >>>> TimestampType(timestamp[us, tz=America/Los_Angeles])]
> > > > > >>>> [<pyarrow.lib.TimestampArray object at 0x7ffbbd88f520>
> > > > > >>>> [
> > > > > >>>>   2018-03-10 00:00:00.000000,
> > > > > >>>>   2018-03-10 00:00:00.000000
> > > > > >>>> ], <pyarrow.lib.TimestampArray object at 0x7ffba958be50>
> > > > > >>>> [
> > > > > >>>>   2018-03-15 00:00:00.000000,
> > > > > >>>>   2018-03-15 00:00:00.000000
> > > > > >>>> ]]
> > > > > >>>>
> > > > > >>>> (Pandas Conversion)
> > > > > >>>> [
> > > > > >>>> 0   2018-03-09 16:00:00-08:00
> > > > > >>>> 1   2018-03-09 16:00:00-08:00
> > > > > >>>> dtype: datetime64[ns, America/Los_Angeles],
> > > > > >>>>
> > > > > >>>> 0   2018-03-14 17:00:00-07:00
> > > > > >>>> 1   2018-03-14 17:00:00-07:00
> > > > > >>>> dtype: datetime64[ns, America/Los_Angeles]]
> > > > > >>>> ```
> > > > > >>>>
> > > > > >>>> Based on the output of an existing, correct timestamp udf, it
> > > > > >>>> looks like the pyarrow Struct Array values are wrong, and that's
> > > > > >>>> carried through the flattened arrays, causing the Pandas values
> > > > > >>>> to have a negative offset.
> > > > > >>>>
> > > > > >>>> Here is output from a working udf with a timestamp; the pyarrow
> > > > > >>>> Array displays in UTC time, I believe.
> > > > > >>>>
> > > > > >>>> ```
> > > > > >>>> (Timestamp Array)
> > > > > >>>> type timestamp[us, tz=America/Los_Angeles]
> > > > > >>>> [
> > > > > >>>>   [
> > > > > >>>>     1969-01-01 09:01:01.000000
> > > > > >>>>   ]
> > > > > >>>> ]
> > > > > >>>>
> > > > > >>>> (Pandas Conversion)
> > > > > >>>> 0   1969-01-01 01:01:01-08:00
> > > > > >>>> Name: _0, dtype: datetime64[ns, America/Los_Angeles]
> > > > > >>>>
> > > > > >>>> (Timezone Localized)
> > > > > >>>> 0   1969-01-01 01:01:01
> > > > > >>>> Name: _0, dtype: datetime64[ns]
> > > > > >>>> ```
> > > > > >>>>
> > > > > >>>> I'll have to dig in further at another time and debug where the
> > > > > >>>> values go wrong.
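The "(Timezone Localized)" step in Bryan's working example corresponds to converting to the zone's wall-clock time and then dropping the zone. Here is a minimal sketch of that step, assuming pandas and mirroring the values in his output (not from the original thread):

```
import pandas as pd

# A tz-aware series like the working udf result above.
s = pd.Series(pd.to_datetime(['1969-01-01 09:01:01'], utc=True))
s = s.dt.tz_convert('America/Los_Angeles')
print(s)  # 0   1969-01-01 01:01:01-08:00

# tz_localize(None) keeps the local wall-clock time and drops the zone,
# matching the "(Timezone Localized)" output.
print(s.dt.tz_localize(None))  # 0   1969-01-01 01:01:01
```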
> > > > > >>>>
> > > > > >>>> On Sat, Jul 18, 2020 at 9:51 PM Micah Kornfield
> > > > > >>>> <emkornfi...@gmail.com> wrote:
> > > > > >>>>
> > > > > >>>>> +1 (binding)
> > > > > >>>>>
> > > > > >>>>> Ran wheel and binary tests on ubuntu 19.04.
> > > > > >>>>>
> > > > > >>>>> On Fri, Jul 17, 2020 at 2:25 PM Neal Richardson <
> > > > > >>>>> neal.p.richard...@gmail.com> wrote:
> > > > > >>>>>
> > > > > >>>>>> +1 (binding)
> > > > > >>>>>>
> > > > > >>>>>> In addition to the usual verification on
> > > > > >>>>>> https://github.com/apache/arrow/pull/7787, I've successfully
> > > > > >>>>>> staged the R binary artifacts on Windows
> > > > > >>>>>> (https://github.com/r-windows/rtools-packages/pull/126), macOS
> > > > > >>>>>> (https://github.com/autobrew/homebrew-core/pull/12), and Linux
> > > > > >>>>>> (https://github.com/ursa-labs/arrow-r-nightly/actions/runs/172977277)
> > > > > >>>>>> using the release candidate.
> > > > > >>>>>>
> > > > > >>>>>> And I agree with the judgment about skipping a JS release
> > > > > >>>>>> artifact. It looks like there hasn't been a code change since
> > > > > >>>>>> October, so there's no point.
> > > > > >>>>>>
> > > > > >>>>>> Neal
> > > > > >>>>>>
> > > > > >>>>>> On Fri, Jul 17, 2020 at 10:37 AM Wes McKinney
> > > > > >>>>>> <wesmck...@gmail.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> I see the JS failures as well. I think it is a failure
> > > > > >>>>>>> localized to newer Node versions, since our JavaScript CI
> > > > > >>>>>>> works fine. I don't think it should block the release given
> > > > > >>>>>>> the lack of development activity in JavaScript [1] -- if any
> > > > > >>>>>>> JS devs are concerned about publishing an artifact, then we
> > > > > >>>>>>> can skip pushing it to NPM.
> > > > > >>>>>>>
> > > > > >>>>>>> @Ryan it seems it may be something environment-related on your
> > > > > >>>>>>> machine; I'm on Ubuntu 18.04 and have not seen this.
> > > > > >>>>>>>
> > > > > >>>>>>> On
> > > > > >>>>>>>
> > > > > >>>>>>>> * Python 3.8 wheel's tests failed; 3.5, 3.6 and 3.7 passed.
> > > > > >>>>>>>> It seems that -larrow and -larrow_python for Cython failed.
> > > > > >>>>>>>
> > > > > >>>>>>> I suspect this is related to
> > > > > >>>>>>> https://github.com/apache/arrow/commit/120c21f4bf66d2901b3a353a1f67bac3c3355924#diff-0f69784b44040448d17d0e4e8a641fe8,
> > > > > >>>>>>> but I don't think it's a blocking issue.
> > > > > >>>>>>>
> > > > > >>>>>>> [1]: https://github.com/apache/arrow/commits/master/js
> > > > > >>>>>>>
> > > > > >>>>>>> On Fri, Jul 17, 2020 at 9:42 AM Ryan Murray
> > > > > >>>>>>> <rym...@dremio.com> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> I've tested Java and it looks good. However, the verify
> > > > > >>>>>>>> script keeps bailing with protobuf-related errors:
> > > > > >>>>>>>> 'cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc'
> > > > > >>>>>>>> and friends can't find protobuf definitions. A bit odd, as
> > > > > >>>>>>>> cmake can see protobuf headers and builds directly off master
> > > > > >>>>>>>> work just fine. Has anyone else experienced this?
> > > > > >>>>>>>> I am on ubuntu 18.04.
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Fri, Jul 17, 2020 at 10:49 AM Antoine Pitrou
> > > > > >>>>>>>> <anto...@python.org> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> +1 (binding). I tested on Ubuntu 18.04.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> * Wheels verification went fine.
> > > > > >>>>>>>>> * Source verification went fine with CUDA enabled and
> > > > > >>>>>>>>>   TEST_INTEGRATION_JS=0 TEST_JS=0.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I didn't test the binaries.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Regards
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Antoine.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On 17/07/2020 at 03:41, Krisztián Szűcs wrote:
> > > > > >>>>>>>>>> Hi,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I would like to propose the second release candidate (RC1)
> > > > > >>>>>>>>>> of Apache Arrow version 1.0.0.
> > > > > >>>>>>>>>> This is a major release consisting of 826 resolved JIRA
> > > > > >>>>>>>>>> issues [1].
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> The verification of the first release candidate (RC0)
> > > > > >>>>>>>>>> failed [0], and the packaging scripts were unable to
> > > > > >>>>>>>>>> produce two wheels. Compared to RC0, this release candidate
> > > > > >>>>>>>>>> includes additional patches for the following bugs:
> > > > > >>>>>>>>>> ARROW-9506, ARROW-9504, ARROW-9497, ARROW-9500, ARROW-9499.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> This release candidate is based on commit:
> > > > > >>>>>>>>>> bc0649541859095ee77d03a7b891ea8d6e2fd641 [2]
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> The source release rc1 is hosted at [3].
> > > > > >>>>>>>>>> The binary artifacts are hosted at [4][5][6][7].
> > > > > >>>>>>>>>> The changelog is located at [8].
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Please download, verify checksums and signatures, run the
> > > > > >>>>>>>>>> unit tests, and vote on the release. See [9] for how to
> > > > > >>>>>>>>>> validate a release candidate.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> The vote will be open for at least 72 hours.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> [ ] +1 Release this as Apache Arrow 1.0.0
> > > > > >>>>>>>>>> [ ] +0
> > > > > >>>>>>>>>> [ ] -1 Do not release this as Apache Arrow 1.0.0 because...
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> [0]: https://github.com/apache/arrow/pull/7778#issuecomment-659065370
> > > > > >>>>>>>>>> [1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20%28Resolved%2C%20Closed%29%20AND%20fixVersion%20%3D%201.0.0
> > > > > >>>>>>>>>> [2]: https://github.com/apache/arrow/tree/bc0649541859095ee77d03a7b891ea8d6e2fd641
> > > > > >>>>>>>>>> [3]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-1.0.0-rc1
> > > > > >>>>>>>>>> [4]: https://bintray.com/apache/arrow/centos-rc/1.0.0-rc1
> > > > > >>>>>>>>>> [5]: https://bintray.com/apache/arrow/debian-rc/1.0.0-rc1
> > > > > >>>>>>>>>> [6]: https://bintray.com/apache/arrow/python-rc/1.0.0-rc1
> > > > > >>>>>>>>>> [7]: https://bintray.com/apache/arrow/ubuntu-rc/1.0.0-rc1
> > > > > >>>>>>>>>> [8]: https://github.com/apache/arrow/blob/bc0649541859095ee77d03a7b891ea8d6e2fd641/CHANGELOG.md
> > > > > >>>>>>>>>> [9]: https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates