Hello all,
I think that guarantees on masked values are worthwhile to define for more
than a single type in isolation. In particular, requiring this exclusively
for Utf8View will leave Utf8 and LargeUtf8 as arrays which *may* legally
have non-utf8 masked values but cannot be consumed by arrow-rs.
> From: Raphael Taylor-Davies
> Sent: Monday, July 31, 2023 12:50 AM
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS][Format] Draft implementation of string view array
> format
>
Hi All,
Hav
le SparkSQL's Performance. Contribute to
oap-project/gluten development by creating an account on GitHub.
github.com
--
Pedro Pedreira
From: Weston Pace
Sent: Tuesday, July 11, 2023 8:42 AM
To: dev@arrow
Performance<https://github.com/oap-project/gluten>
> Gluten: Plugin to Double SparkSQL's Performance. Contribute to
> oap-project/gluten development by creating an account on GitHub.
> github.com
>
> --
> Pedro Pedreira
> _
> I definitely hope that with time Arrow will penetrate deeper into these
> engines, perhaps in a similar manner to DataFusion, as opposed to
> primarily existing at the surface-level.
I'm not sure the problem here is a lack of understanding or maturity. In
fact, it would be much easier if this w
For example, if someone (datafusion, velox, etc.) were to come up with a
framework for UDFs then would batches be passed in and out of those UDFs in
the Arrow format?
Yes, I think the arrow format is a perfect fit for this
Is Arrow meant to only be used in between systems (in this case query
eng
> The point I was trying to make, albeit very badly, was that these
> operations are typically implemented using some sort of row format [1]
> [2], and therefore their performance is not impacted by the array
> representations. I think it is both inevitable, and in fact something to
> be encouraged
Thus the approach you
describe for validating an entire character buffer as UTF-8 then checking
offsets will be just as valid for Utf8View arrays as for Utf8 arrays.
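The two-step check described above (validate a whole character buffer as UTF-8, then check that offsets fall on code point boundaries) can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the function names are hypothetical, a view is modelled as a plain (buffer_index, offset, length) tuple rather than the proposed binary layout, and inline short strings are ignored. Note that it validates every byte of each buffer, including bytes that no view references.

```python
def is_char_boundary(data: bytes, index: int) -> bool:
    """True if index is the start of a code point or the end of the buffer."""
    if index == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 0b10xxxxxx.
    return (data[index] & 0xC0) != 0x80


def is_valid_utf8(data: bytes) -> bool:
    """Validate a whole buffer in one pass using the strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def validate_offsets_array(data: bytes, offsets: list[int]) -> bool:
    """Utf8-style check: validate the buffer once, then test each offset."""
    if not is_valid_utf8(data):
        return False
    in_bounds = all(0 <= o <= len(data) for o in offsets)
    ordered = all(a <= b for a, b in zip(offsets, offsets[1:]))
    return in_bounds and ordered and all(is_char_boundary(data, o) for o in offsets)


def validate_views(buffers: list[bytes], views: list[tuple[int, int, int]]) -> bool:
    """View-style check; each view is a hypothetical (buffer_index, offset, length)."""
    if not all(is_valid_utf8(buf) for buf in buffers):
        return False
    for buffer_index, offset, length in views:
        if not 0 <= buffer_index < len(buffers):
            return False
        buf = buffers[buffer_index]
        end = offset + length
        if not 0 <= offset <= end <= len(buf):
            return False
        # Both ends of the view must sit on code point boundaries.
        if not (is_char_boundary(buf, offset) and is_char_boundary(buf, end)):
            return False
    return True
```

The boundary test is the same in both cases; the only difference is that a view carries its own start and end instead of borrowing them from neighbouring offsets.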
The difference here is that it is perhaps expected for Utf8View to have
gaps in the underlying data that are not referenced as part
@Andrew:
Restricting these arrays to a single buffer will severely decrease their
utility. Since the character data is stored in multiple character buffers,
writing a Utf8View array can proceed without resizing allocations,
which is a major overhead when writing Utf8 arrays. Furthermore since the
cha
> I would be interested in hearing some input from the Rust community.
A couple of thoughts:
The variable number of buffers would definitely pose some challenges for the
Rust implementation; the closest thing we currently have is possibly
UnionArray, but even then the number of buffers is stil
> * This is the first layout where the number of buffers depends on the data
> and not the schema. I think this is the most architecturally significant
> fact. I
I have spent some time reading the initial proposal -- thank you for that.
I now understand what Weston was saying about the "variabl
I hope implementations don't start exposing non-standard datatypes over
the C Data Interface (apart from extension types, of course). I would
also be wary of exposing non-standard datatypes in the official Arrow
C++ implementation.
Regards
Antoine.
Le 21/06/2023 à 14:27, Benjamin Kietzma
> Ben, at one point there was some discussion that this might be a c-data
> only type. However, I believe that was based on the raw pointers
> representation. What you've proposed here, if I understand correctly, is
> an index + offsets representation and it is suitable for IPC correct?
> (e.g.
Before I say anything else I'll say that I am in favor of this new layout.
There is some existing literature on the idea (e.g. umbra) and your
benchmarks show some nice improvements.
Compared to some of the other layouts we've discussed recently (REE, list
view) I do think this layout is more uniq
Hi Gang,
I'm not sure what you mean, sorry if my answers are off base:
Parquet's ByteArray will be unaffected by the addition of the string view type;
all arrow strings (arrow::Type::STRING, arrow::Type::LARGE_STRING, and
with this patch arrow::Type::STRING_VIEW) are converted to ByteArrays
durin
Hi Ben,
The posted benchmark [1] looks pretty good to me. However, I want to
raise a possible issue from the perspective of parquet-cpp. Parquet-cpp
uses a customized parquet::ByteArray type [2] for string/binary, I would
expect some regression of conversions between parquet reader/writer
and the
Cool. Thanks for doing that!
On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman wrote:
> I've added https://github.com/apache/arrow/issues/36112 to track
> deduplication of buffers on write.
> I don't think it would require modification of the IPC format.
>
> Ben
>
> On Thu, Jun 15, 2023 at 1:30 PM
I've added https://github.com/apache/arrow/issues/36112 to track
deduplication of buffers on write.
I don't think it would require modification of the IPC format.
Ben
On Thu, Jun 15, 2023 at 1:30 PM Matt Topol wrote:
> Based on my understanding, in theory a buffer *could* be shared within a
> b
Based on my understanding, in theory a buffer *could* be shared within a
batch since the flatbuffers message just uses an offset and length to
identify the buffers.
That said, I don't believe any current implementation actually does this or
takes advantage of this in any meaningful way.
--Matt
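To make the point about offset and length metadata concrete, here is a minimal sketch of writing a shared buffer only once by checking object identity. It is not the Arrow IPC writer: `write_body` and its return shape are hypothetical, alignment padding is ignored, and Python's `id()` stands in for pointer identity.

```python
def write_body(buffers: list[bytes]) -> tuple[bytes, list[tuple[int, int]]]:
    """Concatenate buffers into one body, writing each distinct object once.

    Returns the body plus one (offset, length) entry per input buffer.
    Buffers that are the same object share a single entry.
    """
    body = bytearray()
    seen: dict[int, tuple[int, int]] = {}  # id(buffer) -> (offset, length)
    entries: list[tuple[int, int]] = []
    for buf in buffers:
        key = id(buf)  # stand-in for comparing data pointers
        if key not in seen:
            seen[key] = (len(body), len(buf))
            body += buf
        entries.append(seen[key])
    return bytes(body), entries


shared = b"shared characters"
other = b"other"
body, entries = write_body([shared, other, shared])
# The first and third entries point at the same region of the body.
```

Because each entry is only an offset and a length, a reader needs no changes to consume a body in which two entries overlap; only the writer has to notice the duplicate.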
O
Hi Ben,
It's exciting to see this move along.
> The buffers will be duplicated. If buffer duplication becomes a concern,
> I'd prefer to handle
> that in the ipc writer. Then buffers which are duplicated could be detected
> by checking
> pointer identity and written only once.
Question: to be
Hello again all,
The PR [1] to add string view to the format and the C++ implementation is
hovering around passing CI and has been undrafted. Furthermore, there is
now also a PR [2] to add string view to the Go implementation. Code review
is underway for each PR and I'd like to move toward a vote
@Jacob
> You mention benchmarks multiple times, are these results published
> somewhere?
I benchmarked the performance of raw pointer vs index offset views in my PR
to velox, I do intend to port them to my arrow PR but I haven't gotten there
yet. Furthermore, it seemed less urgent to me since coexis
Hello Ben,
Thanks for your work on this. I think this will be an excellent addition to
the format.
If I understand correctly, multiple arrays can reference the same buffers
in memory, but once they are written to IPC their data buffers will be
duplicated. Is that right?
Dictionary types have a s
Very cool!
In addition to performance mentioned above, I could see this being
useful for the R bindings - we already have a global string pool and a
mechanism for keeping a vector of them alive.
I don't see the C Data interface in the PR although I may have missed
it - is that a part of the propo
Hello Everyone,
I think keeping interoperability with the large ecosystem is a very
important goal for arrow so I am overall in favor of this proposal!
You mention benchmarks multiple times, are these results published
somewhere?
Thanks
On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman
wrote:
Hello all,
As previously discussed on this list [1], an UmbraDB/DuckDB/Velox compatible
"string view" type could bring several performance benefits to access and
authoring of string data in the arrow format [2]. Additionally better
interoperability with engines already using this format could be
e