Hi Antoine,

When there is no time zone specified, I do not think it is appropriate to consider the data to refer to a specific moment in time without applying an explicit time zone localization. So absent an explicit UTC time zone, we can’t say that the data refers to instants in time from the UTC perspective.
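
As a rough illustration of the point above, here is a minimal sketch using Python's standard datetime module (not Arrow or pandas). The date and the fixed UTC+2 offset are arbitrary choices made for the example; the same zone-less reading maps to two different instants depending on which zone is assumed.

```python
from datetime import datetime, timedelta, timezone

# A wall-clock reading with no time zone attached (tzinfo is None).
naive = datetime(2021, 6, 14, 11, 1)

# Two candidate interpretations of the same reading. The fixed UTC+2
# offset is an arbitrary stand-in for "some local zone".
utc_plus_2 = timezone(timedelta(hours=2))
as_utc = naive.replace(tzinfo=timezone.utc)
as_local = naive.replace(tzinfo=utc_plus_2)

# The displayed fields are identical under both interpretations...
same_fields = (as_utc.hour, as_utc.minute) == (as_local.hour, as_local.minute)

# ...but the instants differ by exactly the assumed UTC offset, in seconds.
gap_seconds = as_utc.timestamp() - as_local.timestamp()
```

Until one of those interpretations is chosen explicitly, the stored value alone does not say which instant is meant.
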
That said, absent a set time zone, there are two possible behaviors of timestamp functions (like “extract hour”): local time using the system locale, or UTC. I think we already decided years ago that we would use the latter interpretation when it comes to stringification or extracting fields.

When localizing data (adding a time zone when there was none previously), I do not think we can assume that the data is already localized to UTC. I provided a gist showing the behavior of the pandas tz_localize function: the int64 values must each be shifted by the UTC offset at that moment. That’s what I think we have to do in this project. If you know that the data is UTC, then the correct action is to call tz_localize('UTC') and then tz_convert(tz), where tz is the intended time zone (which is only a modification to the type metadata).

My interpretation is certainly colored by the experience of designing this functionality in pandas, but after 10 years of observing real-world use, this model seems to work well and not trip people up too much.

Wes

On Mon, Jun 14, 2021 at 11:01 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Also, as a secondary (but IMHO important) concern, if we choose the
> "always UTC" interpretation, we should stop using the "time zone naive"
> wording in the spec, because there is a high risk of confusion with
> Python's different "naive timestamp" concept:
>
> https://docs.python.org/3/library/datetime.html
>
> """A naive object does not contain enough information to unambiguously
> locate itself relative to other date/time objects. Whether a naive
> object represents Coordinated Universal Time (UTC), local time, or time
> in some other timezone is purely up to the program, just like it is up
> to the program whether a particular number represents metres, miles, or
> mass. Naive objects are easy to understand and to work with, at the cost
> of ignoring some aspects of reality."""
>
>
> On 14/06/2021 at 17:57, Antoine Pitrou wrote:
> >
> > Hello,
> >
> > In ARROW-13033, there was a disagreement as to how the specification
> > about timezone-less timestamps should be interpreted.
> >
> > Here is the wording in the Schema specification:
> >
> >> /// * If the time zone is null or equal to an empty string, the data is "time
> >> ///   zone naive" and shall be displayed *as is* to the user, not localized
> >> ///   to the locale of the user. This data can be though of as UTC but
> >> ///   without having "UTC" as the time zone, it is not considered to be
> >> ///   localized to any time zone
> >
> > My interpretation is that timestamp *values* are always expressed in
> > UTC. The timezone is an optional piece of metadata that describes the
> > context in which they were obtained, but do not impact how the *values*
> > should be interpreted.
> >
> > Joris' interpretation is that timestamp *values* are expressed in an
> > arbitrary "local time" that is unknown and unspecified. It is therefore
> > difficult to exactly interpret them, since the timezone information is
> > unavailable.
> >
> > (I'll let Joris express his thoughts more accurately, but the gist of
> > his opinion is that "can be thought of as UTC" is only an indication,
> > not a prescription)
> >
> >
> > To me, the problem with the "unknown local timezone" interpretation is
> > that it renders the data essentially ambiguous and useless. The problem
> > is very similar to the problem of having string data without a
> > well-known encoding. This is well-known to Python users as the Python 2
> > encoding hell (to the point that it motivated the heavy and disruptive
> > Python 3 transition).
> >
> > (note the problem is even worse for timestamps. At least, you can with a
> > high degree of probability detect that an arbitrary binary string is
> > *not* UTF8-encoded. You cannot do so with timestamp values: any 64-bit
> > timestamp may or may not be a UTC timestamp. Once you have lost that
> > information, you cannot regain it anymore.)
> >
> > In any case, I think this must be clarified, first on this mailing-list,
> > then by making the spec wording stronger and more prescriptive.
> >
> > Regards
> >
> > Antoine.
> >
>