On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <[email protected]> wrote:
> Arrow's "Timestamp with Timezone" can have fields extracted
> from it.
>
Sure, one *can* extract fields from timestamp+tz. But I don't feel
timestamp+tz is *designed* for extracting fields:
- Extracting fields from int64+tz is inefficient, because it bundles two
steps: 1) convert to datetime struct; and 2) return one field from the
datetime struct. (If I want to extract Year, Month, Day, is that three
function calls that *each* convert to datetime struct?)
- Extracting fields from int64+tz is awkward, because it's not obvious
which timezone is being used. (To extract fields in a custom timezone, must
I 1) clone the column with a new timezone; and 2) call the function?)
My understanding of "best practice" for extracting multiple fields using
Arrow's timestamp columns is:
1. Convert from timestamp column to date32 and/or time32/time64 columns in
one pass (one of three operations, perhaps: timestamp=>date32,
timestamp=>time64, or timestamp=>struct{date32,time64})
2. Extract fields from those date32 and time64 columns.
Only step 1 needs a timezone. In C, the analogue is localtime().
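To make the two steps concrete, here is a minimal sketch in plain Python (not Arrow). The helper names split_timestamp, date_fields and time_fields are hypothetical, and I assume a fixed UTC-4 offset rather than a real tz database zone so the example stays self-contained. The date32 value is modelled as days since 1970-01-01 and the time64 value as microseconds since local midnight.

```python
import datetime

# Ordinal of the Unix epoch date, so date32 can be "days since 1970-01-01".
EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()


def split_timestamp(epoch_seconds, tz):
    """Step 1: one timezone-aware conversion, producing (date32, time64).

    Hypothetical helper: returns days since the epoch and microseconds
    since local midnight, both in the timezone `tz`.
    """
    local = datetime.datetime.fromtimestamp(epoch_seconds, tz)
    days = local.toordinal() - EPOCH_ORDINAL
    micros = (
        ((local.hour * 60 + local.minute) * 60 + local.second) * 1_000_000
        + local.microsecond
    )
    return days, micros


def date_fields(days):
    """Step 2a: extract (year, month, day) from a date32. No timezone needed."""
    d = datetime.date.fromordinal(days + EPOCH_ORDINAL)
    return d.year, d.month, d.day


def time_fields(micros):
    """Step 2b: extract (hour, minute, second, microsecond) from a time64."""
    seconds, microsecond = divmod(micros, 1_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return hour, minute, second, microsecond


# Assumption: a fixed UTC-4 offset stands in for a real timezone here.
tz = datetime.timezone(datetime.timedelta(hours=-4))
days, micros = split_timestamp(1623777540, tz)  # 2021-06-15 17:19:00 UTC
print(date_fields(days))    # (2021, 6, 15)
print(time_fields(micros))  # (13, 19, 0, 0)
```

The timezone appears only in split_timestamp; extracting Year, Month and Day afterwards costs one cheap call on the date32 value instead of three timezone conversions.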
We do step 1 at Workbench -- see converttimestamptodate
<https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py>
for our implementation. We haven't had much demand for step 2, so we'll
get to it later.
I think of this "best practice" as a compromise:
- date32+time64 aren't as time-efficient as C's struct tm, but together
they use 12 bytes whereas the C struct costs 50-100 bytes.
- date32+time64 take 50% more space than int64 (12 bytes vs. 8), but they're
intuitive and they save time when extracting several fields.
A small benchmark to prove that "save time" assertion in Python:
>>> import datetime, os, time, timeit
>>> os.environ['TZ'] = 'America/Montreal'
>>> time.tzset()
>>> timestamp = time.time()
>>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
0.2955563920113491
>>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline: timeit overhead + tuple construction
0.2509278700017603
Most of the measured time is overhead (about 0.25 s of the 0.30 s per
million calls), but the timestamp=>date conversion still adds roughly
0.045 s per million calls, and it's sane to try to minimize that cost.
Enjoy life,
Adam
--
Adam Hooper
+1-514-882-9694
http://adamhooper.com