Ooh, lovely. Yes, I imagine this can be fastest; but it's not ideal for
streaming because it's high-RAM and high time-to-first-byte.
Thank you again for your advice. You've been more than helpful.
Enjoy life,
Adam
--
Adam Hooper
+1-514-882-9694
http://adamhooper.com
use I don't use
nested values.
Does the C++ parquet reader support reading a batch of values and their
validity bitmap?
Enjoy life,
Adam
--
s about how a
> > > > > time display is correct. Applications can choose what they feel
> > > > > makes sense to them (as long as they don't start automatically
> > > > > tacking on timezones to naive timestamps).
.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit#>
you set up!
But to answer your question here: my understanding is we're debating how to
store an Instant in Arrow. Or conversely, how to interpret a timestamp that
has no timezone field.
--
be messier, since every system/language uses a
different byte structure.)
> Perhaps we can make a spreadsheet and look comprehensively at how many
> use cases would be disenfranchised by requiring UTC normalization
> always.
Hear, hear!
Can we also poll people to find out how they're storing Instants today?
Enjoy life,
Adam
--
: datetime.date.fromtimestamp(timestamp).year)
0.2955563920113491
>>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline: timeit overhead + tuple construction
0.2509278700017603
Most of the test is overhead; but certainly the timestamp=>date conversion
takes time, and it's sane to try and minimize that overhead.
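
For anyone who wants to repeat the comparison, here is a minimal sketch of
the same kind of measurement. The sample timestamp, the helper names and the
iteration count are my own assumptions, not values from the thread, and the
timings will differ by machine.

```python
import datetime
import timeit

# Hypothetical sample value (an assumption): seconds since the Unix epoch,
# equal to 2021-06-15T12:00:00Z.
timestamp = 1623758400


def year_from_timestamp(ts):
    """Extract the calendar year via a timestamp-to-date conversion."""
    return datetime.date.fromtimestamp(ts).year


def year_baseline():
    """Build a date directly: timeit overhead plus object construction."""
    return datetime.date(2021, 6, 15).year


n = 100_000  # arbitrary iteration count, chosen to keep the run short
converted = timeit.timeit(lambda: year_from_timestamp(timestamp), number=n)
baseline = timeit.timeit(year_baseline, number=n)
print(f"fromtimestamp: {converted:.4f}s  baseline: {baseline:.4f}s")
```

The difference between the two numbers is a rough estimate of what the
conversion itself costs, once the shared overhead is subtracted.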
Enjoy life,
Adam
--
> >> * Instant - an instantaneous point on the time-line
> >> * DateTime - full date and time with time-zone
> >> * LocalDateTime - date-time without a time-zone
> >>
> >> ...
> >>
> >> I recommend that Arrow supports all three. Choose clear, distinct
> >> names for all three, consistent with names used elsewhere in the
> >> industry.
> >
> > It seems to me that we are discussing whether our "timestamp without
> > timezone" should be interpreted as a LocalDateTime or as an Instant
> > (since interpreting it as UTC makes it an Instant, I think). Is that a
> > correct / helpful framing?
>
> That is correct, IMHO.
>
>
--
eans
data is stored as UTC) ... well ... what *is* the meaning of the timezone
field?
(In my opinion, there shouldn't be a field at all.)
Enjoy life,
Adam
--
to store UTC
timestamps is TIMESTAMP WITH TIME ZONE, which doesn't store a time zone.)
I'm a smart person. I keep making these embarrassing -- and costly --
mistakes.
I've never been tripped up by java.time.Instant. It's no wonder Java
embraced it.
I hope Arrow empowers its community to make tools that make me feel
not-stupid.
Enjoy life,
Adam
--
On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper wrote:
> I understand isAdjustedToUTC=true to mean "timestamp", and
> isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
> docs because
> https://github.com/apache/parquet-format/blob/master/Logi
TC=true to mean "timestamp", and
isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some
docs because
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc
lists a whole slew of potential meanings and without extra metadata I'll
never be able to figure out what this column means."
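
A small sketch of why the missing metadata matters, using only the Python
standard library. The wall-clock value and the two candidate offsets are
assumptions picked for illustration; nothing here calls Parquet or Arrow.

```python
from datetime import datetime, timedelta, timezone

# A wall-clock reading with no zone attached (hypothetical value).
naive = datetime(2021, 6, 3, 14, 2, 0)

# Two readers guess differently about what the column means.
as_utc = naive.replace(tzinfo=timezone.utc)
as_minus4 = naive.replace(tzinfo=timezone(timedelta(hours=-4)))

# The same stored digits now name two instants four hours apart.
gap = as_minus4.timestamp() - as_utc.timestamp()
print(gap)  # 14400.0
```

Both readers hold the same stored value, yet they disagree about the instant
by four hours, and nothing in the value itself says who is right.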
Enjoy life,
Adam
--
nd
they *cannot store future times* (future timezones are yet to be decreed by
politicians).
Don't follow in C or SQL's footsteps. Store timestamps as integer UTC
timestamps. Store timezone somewhere else; use it to convert to local
timezone when formatting and to convert to calendar for calendar math.
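
As a rough sketch of that split, in Python: the stored value is a plain
integer, and the zone is applied only at display time. The function name
`to_local`, the sample instant and the fixed -04:00 offset are assumptions
for a self-contained demo; a real application would more likely store a zone
name and resolve it with a tz database (for example via `zoneinfo`).

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_local(epoch_seconds: int, zone: tzinfo) -> datetime:
    """Convert a stored UTC instant to a zone-aware local datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=zone)


# Stored data: an integer count of seconds since the Unix epoch, in UTC.
stored_instant = 1623758400  # hypothetical value, 2021-06-15T12:00:00Z

# Stored elsewhere: the zone to use for display. A fixed offset stands in
# for a named zone here so the sketch needs no tz database.
display_zone = timezone(timedelta(hours=-4), "EDT")

local = to_local(stored_instant, display_zone)
print(local.isoformat())  # 2021-06-15T08:00:00-04:00
print(local.date())       # calendar math starts from the local date
```

The integer never changes when the display zone does, so the stored data
stays comparable across readers.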
--
processing
> > > > > > > > >>>>> github.com/apache/beam - Apache Beam is a unified
> > > > > > > > >>>>> programming model for Batch an
resql.org/wiki/Collations> to version
collations in v13/v14. I'm a Postgres user who experienced index corruption
between collation versions. To me, Postgres' effort seems both cutting-edge
and essential.
Enjoy life,
Adam
--
dly, then Alice can memorize the pattern. With that information she
can detect, in a given amount of time, how many times Bob ran the same
query.
Adam
--
hose tools
help a team discover which of the RDBMS's multitude of weaknesses are most
urgent. The RDBMS provides few knobs and much documentation. The team
selects compromises.
I think a good bullet for your list of requirements is: "simple enough to
explain to a non-programmer."
m/CJWorkbench/arrow-tools/blob/ddc1a664ac3d0b78f4537e3e8e82ecc10c471ef8/src/arrow-validate.cc#L43
[2] https://github.com/cyb70289/utf8
--
e.
>
Thank you for clarifying. This is all music to my ears. I feel Arrow's
careful design gives me all the tools I need to confidently repel malicious
input.
<https://github.com/cyb70289/utf8>
Enjoy life,
Adam
--
lementations
> (CSV, Parquet...).
>
> Regards
>
> Antoine.
>
>
> Le 18/12/2019 à 17:42, Adam Hooper a écrit :
> > My project parses Arrow files produced by untrusted code.
> >
> > It looks to me like the "validate" function should help me avoid
>
Adam Hooper created ARROW-7435:
--
Summary: Security issue: ValidateOffsets() does not prevent buffer
over-read
Key: ARROW-7435
URL: https://issues.apache.org/jira/browse/ARROW-7435
Project: Apache Arrow
security a goal of the Arrow project/format? If so, how shall I report
this bug without endangering other users in my situation?
Enjoy life,
Adam
--
Adam Hooper created ARROW-7281:
--
Summary: AdaptiveIntBuilder::length() does not consider
pending_pos_.
Key: ARROW-7281
URL: https://issues.apache.org/jira/browse/ARROW-7281
Project: Apache Arrow
Adam Hooper created ARROW-7266:
--
Summary: dictionary_encode() of a slice gives wrong result
Key: ARROW-7266
URL: https://issues.apache.org/jira/browse/ARROW-7266
Project: Apache Arrow
Issue
Adam Hooper created ARROW-6895:
--
Summary: parquet::arrow::ColumnReader:
ByteArrayDictionaryRecordReader repeats returned values when calling
`NextBatch()`
Key: ARROW-6895
URL: https://issues.apache.org/jira/browse
Adam Hooper created ARROW-6861:
--
Summary: With arrow-0.14.1-output Parquet dictionary column:
Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
Key: ARROW-6861
URL: https
Adam Hooper created ARROW-6568:
--
Summary: pyarrow.parquet crash writing zero-chunk dictionary-type
column
Key: ARROW-6568
URL: https://issues.apache.org/jira/browse/ARROW-6568
Project: Apache Arrow