Another brute force approach that I expect is not really that painful and allows optimal compactness in all cases: support 3 precisions, with appropriate standard coders:

- millis for unix timestamps
- micros for compactness in int64
- nanos for Java/Spanner/Proto/Pubsub*, aka the max precision anyone has appetite for

I don't expect other precisions or encodings to be relevant during Beam's lifetime, but going this path it will be easy to add a new one if it comes along.
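A quick sketch of the compactness trade-off among those three precisions, assuming each is stored as a signed 64-bit integer (illustrative arithmetic only, not a Beam coder):

```python
# Span of a signed int64 timestamp at each proposed precision,
# measured in years on either side of the Unix epoch.
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.25 * 24 * 3600

UNITS_PER_SECOND = {
    "millis": 10**3,
    "micros": 10**6,
    "nanos": 10**9,
}

def span_in_years(precision: str) -> float:
    """Years representable before int64 overflow at the given precision."""
    return INT64_MAX / UNITS_PER_SECOND[precision] / SECONDS_PER_YEAR

for p in ("millis", "micros", "nanos"):
    print(f"{p:>6}: ~{span_in_years(p):,.0f} years")
```

Millis and micros comfortably cover any realistic event time, while nanos overflows a couple hundred years from the epoch — the overflow concern raised later in the thread.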
Another couple of points on the prior approach of splitting the representations, though:

- Dropping of data is by window expiry, so element timestamps do not need runner understanding.
- Often element timestamps are part of the data as well, so we already duplicate this a lot of the time, unless the user builds a data structure that is "datum minus timestamp" just to save the bytes.
- On the other hand, if timestamps were explicitly views into the datum, we could save the bytes automatically rather than the user having to do it.

I really like the idea of element timestamps being just views into data that is already part of the element.

Kenn

*prior mail I said Pubsub was micros, but it is nanos

On Tue, Apr 23, 2019 at 7:20 AM Kenneth Knowles <k...@apache.org> wrote:

> On Tue, Apr 23, 2019 at 5:48 AM Robert Bradshaw <rober...@google.com> wrote:
>
>> On Thu, Apr 18, 2019 at 12:23 AM Kenneth Knowles <k...@apache.org> wrote:
>> >
>> > For Robert's benefit, I want to point out that my proposal is to support femtosecond data, with femtosecond-scale windows, even if watermarks/event timestamps/holds are only millisecond precision.
>> >
>> > So the workaround once I have time, for SQL and schema-based transforms, will be to have a logical type that matches the Java and protobuf definition of nanos (seconds-since-epoch + nanos-in-second) to preserve the user's data, and then, when doing windowing, inserting the necessary rounding somewhere in the SQL or schema layers.
>>
>> It seems to me that the underlying granularity of element timestamps and window boundaries, as seen and operated on by the runner (and transmitted over the FnAPI boundary), is not something we can make invisible to the user (and consequently we cannot just insert rounding on higher precision data and get the right results). However, I would be very interested in seeing proposals that could get around this.
>> Watermarks, of course, can be as approximate (in one direction) as one likes.
>
> I outlined a way... or perhaps I retracted it to ponder and sent the rest of my email. Sorry! Something like this, TL;DR: store the original data but do runner ops on rounded data.
>
> - WindowFn must receive exactly the data that came from the user's data source. So that cannot be rounded.
> - The user's WindowFn assigns to a window, so it can contain arbitrary precision as it should be grouped as bytes.
> - End of window, timers, watermark holds, etc., are all treated only as bounds, so can all be rounded based on their use as an upper or lower bound.
>
> We already do a lot of this — Pubsub publish timestamps are microsecond precision (you could say our current connector constitutes data loss), as are Windmill timestamps (since these are only combines of Beam timestamps, there is no data loss here). There are undoubtedly some corner cases I've missed, and naively this might look like duplicating timestamps, so that could be an unacceptable performance concern.
>
>> As for choice of granularity, it would be ideal if any time-like field could be used as the timestamp (for subsequent windowing). On the other hand, nanoseconds (or smaller) complicates the arithmetic and encoding, as a 64-bit int has a time range of only a couple hundred years without overflow (which is an argument for microseconds, as they are a nice balance between sub-second granularity and multi-millennia span). Standardizing on milliseconds is more restrictive but has the advantage that it's what Java and Joda Time use now (though it's always easier to pad precision than round it away).
>
> A correction: Java *now* uses nanoseconds [1]. It uses the same breakdown as proto (int64 seconds since epoch + int32 nanos within second). It has legacy classes that use milliseconds, and Joda itself now encourages moving back to Java's new Instant type.
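The "treated only as bounds" rule amounts to rounding in the safe direction only. A minimal sketch (nanos to millis; the function names are hypothetical, not Beam APIs):

```python
NANOS_PER_MILLI = 1_000_000

def round_down_to_millis(nanos: int) -> int:
    """Safe for lower bounds such as watermarks and holds: reporting a
    slightly earlier bound is conservative, never incorrect."""
    return nanos // NANOS_PER_MILLI

def round_up_to_millis(nanos: int) -> int:
    """Safe for upper bounds such as end-of-window: the rounded bound must
    not fall before the true one, or a timer could fire too early."""
    return -(-nanos // NANOS_PER_MILLI)  # ceiling division
```

The element's own timestamp bytes stay untouched; only the runner's bookkeeping quantities get rounded.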
> Nanoseconds should complicate the arithmetic only for the one person authoring the date library, which they have already done.
>
>> It would also be really nice to clean up the infinite-future being the somewhat arbitrary max micros rounded to millis, and end-of-global-window being infinite-future minus 1 hour (IIRC), etc., as well as the ugly logic in Python to cope with millis-micros conversion.
>
> I actually don't have a problem with this. If you are trying to keep the representation compact, not add bytes on top of instants, then you just have to choose magic numbers, right?
>
> Kenn
>
> [1] https://docs.oracle.com/javase/8/docs/api/java/time/Instant.html
>
>> > On Wed, Apr 17, 2019 at 3:13 PM Robert Burke <rob...@frantil.com> wrote:
>> >>
>> >> +1 for plan B. Nanosecond precision on windowing seems... a little much for a system that's aggregating data over time. Even for processing, say, particle super collider data, they'd get away with artificially increasing the granularity in batch settings.
>> >>
>> >> Now if they were streaming... they'd probably want femtoseconds anyway. The point is, we should see if users demand it before adding in the necessary work.
>> >>
>> >> On Wed, 17 Apr 2019 at 14:26, Chamikara Jayalath <chamik...@google.com> wrote:
>> >>>
>> >>> +1 for plan B as well. I think it's important to make timestamp precision consistent now without introducing surprising behaviors for existing users. But we should move towards a higher granularity timestamp precision in the long run to support use-cases that Beam users otherwise might miss out on (on a runner that supports such precision).
>> >>> - Cham
>> >>>
>> >>> On Wed, Apr 17, 2019 at 1:35 PM Lukasz Cwik <lc...@google.com> wrote:
>> >>>>
>> >>>> I also like Plan B because in the cross-language case, the pipeline would not work since every party (Runners & SDKs) would have to be aware of the new beam:coder:windowed_value:v2 coder. Plan A has the property where, if the SDK/Runner wasn't updated, then it may start truncating the timestamps unexpectedly.
>> >>>>
>> >>>> On Wed, Apr 17, 2019 at 1:24 PM Lukasz Cwik <lc...@google.com> wrote:
>> >>>>>
>> >>>>> Kenn, this discussion is about the precision of the timestamp in the user data. As you had mentioned, Runners need not have the same granularity of user data as long as they correctly round the timestamp to guarantee that triggers are executed correctly, but the user data should have the same precision across SDKs, otherwise user data timestamps will be truncated in cross-language scenarios.
>> >>>>>
>> >>>>> Based on the systems that were listed, either microsecond or nanosecond would make sense. The issue with changing the precision is that all Beam runners except possibly Beam Python on Dataflow are using millisecond precision, since they are all using the same Java runner windowing/trigger logic.
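The truncation risk in the cross-language case can be shown with a toy round-trip through a millisecond-precision wire format (illustrative only; this is not the actual windowed_value coder):

```python
MICROS_PER_MILLI = 1000

def encode_millis(timestamp_micros: int) -> int:
    """A millisecond-precision wire format silently floors away
    any sub-millisecond digits."""
    return timestamp_micros // MICROS_PER_MILLI

def decode_micros(timestamp_millis: int) -> int:
    """Widening back to micros cannot recover the dropped digits."""
    return timestamp_millis * MICROS_PER_MILLI

original = 1_555_555_555_123_456  # micros, with sub-millisecond detail
round_tripped = decode_micros(encode_millis(original))
assert round_tripped != original  # the trailing 456 micros were lost
```

This is exactly the silent data change a non-updated SDK or runner would inflict under Plan A.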
>> >>>>> Plan A: Swap precision to nanosecond
>> >>>>> 1) Change the Python SDK to only expose millisecond precision timestamps (do now)
>> >>>>> 2) Change the user data encoding to support nanosecond precision (do now)
>> >>>>> 3) Swap runner libraries to be nanosecond precision aware, updating all window/triggering logic (do later)
>> >>>>> 4) Swap SDKs to expose nanosecond precision (do later)
>> >>>>>
>> >>>>> Plan B:
>> >>>>> 1) Change the Python SDK to only expose millisecond precision timestamps and keep the data encoding as is (do now)
>> >>>>> (We could add greater precision later to Plan B by creating a new version, beam:coder:windowed_value:v2, which would be nanosecond and would require runners to correctly perform an internal conversion for windowing/triggering.)
>> >>>>>
>> >>>>> I think we should go with Plan B, and when users request greater precision we can make that an explicit effort. What do people think?
>> >>>>>
>> >>>>> On Wed, Apr 17, 2019 at 5:43 AM Maximilian Michels <m...@apache.org> wrote:
>> >>>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Thanks for taking care of this issue in the Python SDK, Thomas!
>> >>>>>>
>> >>>>>> It would be nice to have a uniform precision for timestamps but, as Kenn pointed out, timestamps are extracted from systems that have different precision.
>> >>>>>>
>> >>>>>> To add to the list: Flink - milliseconds
>> >>>>>>
>> >>>>>> After all, it doesn't matter as long as there is sufficient precision and conversions are done correctly.
>> >>>>>>
>> >>>>>> I think we could improve the situation by at least adding a "milliseconds" constructor to the Python SDK's Timestamp.
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>> Max
>> >>>>>>
>> >>>>>> On 17.04.19 04:13, Kenneth Knowles wrote:
>> >>>>>> > I am not so sure this is a good idea.
>> >>>>>> > Here are some systems and their precision:
>> >>>>>> >
>> >>>>>> > Arrow - microseconds
>> >>>>>> > BigQuery - microseconds
>> >>>>>> > New Java Instant - nanoseconds
>> >>>>>> > Firestore - microseconds
>> >>>>>> > Protobuf - nanoseconds
>> >>>>>> > Dataflow backend - microseconds
>> >>>>>> > Postgresql - microseconds
>> >>>>>> > Pubsub publish time - nanoseconds
>> >>>>>> > MSSQL datetime2 - 100 nanoseconds (original datetime about 3 millis)
>> >>>>>> > Cassandra - milliseconds
>> >>>>>> >
>> >>>>>> > IMO it is important to be able to treat any of these as a Beam timestamp, even though they aren't all streaming. Who knows when we might be ingesting a streamed changelog, or using them for reprocessing an archived stream. I think for this purpose we should either standardize on nanoseconds or make the runner's resolution independent of the data representation.
>> >>>>>> >
>> >>>>>> > I've had some offline conversations about this. I think we can have higher-than-runner precision in the user data, and allow WindowFns and DoFns to operate on this higher-than-runner precision data, and still have consistent watermark treatment. Watermarks are just bounds, after all.
>> >>>>>> >
>> >>>>>> > Kenn
>> >>>>>> >
>> >>>>>> > On Tue, Apr 16, 2019 at 6:48 PM Thomas Weise <t...@apache.org <mailto:t...@apache.org>> wrote:
>> >>>>>> >
>> >>>>>> > The Python SDK currently uses timestamps in microsecond resolution while the Java SDK, as most would probably expect, uses milliseconds.
>> >>>>>> > This causes a few difficulties with portability (Python coders need to convert to millis for WindowedValue and Timers), which is related to a bug I'm looking into:
>> >>>>>> >
>> >>>>>> > https://issues.apache.org/jira/browse/BEAM-7035
>> >>>>>> >
>> >>>>>> > As Luke pointed out, the issue was previously discussed:
>> >>>>>> >
>> >>>>>> > https://issues.apache.org/jira/browse/BEAM-1524
>> >>>>>> >
>> >>>>>> > I'm not privy to the reasons why we decided to go with micros in the first place, but would it be too big of a change, or impractical for other reasons, to switch the Python SDK to millis before it gets more users?
>> >>>>>> >
>> >>>>>> > Thanks,
>> >>>>>> > Thomas
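A minimal sketch of the kind of conversion the Python coders must do today — a micros-backed timestamp narrowed to millis at the WindowedValue boundary — which also illustrates the "milliseconds" constructor Max suggests above. Class and method names are hypothetical, not the actual SDK API:

```python
MICROS_PER_MILLI = 1000

class Timestamp:
    """Toy micros-backed timestamp (hypothetical; not the Beam SDK class)."""

    def __init__(self, micros: int = 0):
        self.micros = micros

    @classmethod
    def of_millis(cls, millis: int) -> "Timestamp":
        """The 'milliseconds constructor' suggested in the thread:
        widening from millis to micros is always lossless."""
        return cls(micros=millis * MICROS_PER_MILLI)

    def to_millis(self) -> int:
        """Narrowing for a millis wire format; floors, so sub-millisecond
        detail is silently dropped whenever micros % 1000 != 0."""
        return self.micros // MICROS_PER_MILLI

# Round-tripping through millis is lossless only for milli-aligned values:
assert Timestamp.of_millis(1234).to_millis() == 1234
assert Timestamp(micros=1_234_567).to_millis() == 1234  # 567 micros lost
```

The asymmetry is the whole debate in miniature: padding precision upward is free, rounding it away is not.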