> Looks like projecting columns isn't available by default.
> One of the benefits of Parquet file format is column projection, where
> the IO is limited to just the columns projected.
> Unfortunately, you are correct that it doesn't allow for easy column
> projecting (you're going to read all the col
IMO, Facebook has mentioned a format for ML in its paper, Section 3.3.2 [1].
It mentions that
> ML tables are also typically much wider, and tend to have tens of
> thousands
> of features usually stored as large maps.
>
> The most pressing issue with the DWRF format was metadata overhead;
> our ML us
> Of course, what I'm really asking for is to see how Lance would compare ;-)
> P.S. The second paper [2] also talks about ML workloads (in Section 5.8)
> and GPU performance (in Section 5.9). It also cites Lance as one of the
> future formats (in Section 5.6.2).
Disclaimer: I work for LanceDb an
And the first paper's reference to Arrow (in the references section) lists 2022 as the date of last access.
On Thu, Oct 19, 2023 at 18:51, Aldrin wrote:
> For context, that second referenced paper has Wes McKinney as a co-author, so th
For context, that second referenced paper has Wes McKinney as a co-author, so they were much better positioned to say "the right things."
On Thu, Oct 19, 2023 at 18:38, Jin Shang wrote:
> Honestly I don't understand why this VLDB paper [1] ch
Honestly I don't understand why this VLDB paper [1] chooses to include
Feather in their evaluations. This paper studies OLAP DBMS file formats.
Feather is clearly not optimized for the workload and performs badly in
most of their benchmarks. This paper also has several inaccurate or
outdated claims
On Wed, Oct 18, 2023 at 11:20 PM Andrew Lamb wrote:
>
> If you are looking for a more formal discussion and empirical analysis of
> the differences, I suggest reading "A Deep Dive into Common Open Formats
> for Analytical DBMSs" [1], a VLDB 2023 (runner up best paper!) that
> compares and contrast
There is a note there explaining what they understand by it but
further down the line they do not make such distinction.
The fact that parquet can be better in-memory format than arrow for
certain common uses is something I haven't thought of
and is eye-opening for me, admittedly so because I am n
The fact that they describe Arrow and Feather as distinct formats
(they're not!) with different characteristics is a bit of a bummer.
On 18/10/2023 at 22:20, Andrew Lamb wrote:
> If you are looking for a more formal discussion and empirical analysis of
> the differences, I suggest reading "A
If you are looking for a more formal discussion and empirical analysis of
the differences, I suggest reading "A Deep Dive into Common Open Formats
for Analytical DBMSs" [1], a VLDB 2023 (runner up best paper!) that
compares and contrasts Arrow, Parquet, ORC and Feather file formats.
[1] https://ww
To further what others have already mentioned, the IPC file format is
primarily optimised for IPC use-cases, that is exchanging the entire
contents between processes. It is relatively inexpensive to encode and
decode, and supports all arrow datatypes, making it ideal for things
like spill-to-di
Plenty of opinions here already, but I happen to think that IPC
streams and/or Arrow File/Feather are wildly underutilized. For the
use-case where you're mostly just going to read an entire file into R
or Python it's a bit faster (and far superior to a CSV or pickling or
.rds files in R).
> you're
Arrow IPC file is great; it focuses on in-memory representation and direct
computation.
It supports compression and dictionary encoding, and the file can be
deserialized zero-copy into the in-memory Arrow format.
Parquet provides some strong functionality, like Statistics, which could
help prunin
Also there is
https://github.com/lancedb/lance between the two formats. Depending on the
use case it can be a great choice.
Best regards
Adam Lippai
On Tue, Oct 17, 2023 at 22:44 Matt Topol wrote:
> One benefit of the feather format (i.e. Arrow IPC file format) is the
> ability to mmap the file
One benefit of the feather format (i.e. Arrow IPC file format) is the
ability to mmap the file to easily handle reading sections of a larger than
memory file of data. Since, as Felipe mentioned, the format is focused on
in-memory representation, you can easily and simply mmap the file and use
the r
It’s not the best since the format is really focused on in-memory
representation and direct computation, but you can do it:
https://arrow.apache.org/docs/python/feather.html
—
Felipe
On Tue, 17 Oct 2023 at 23:26 Nara wrote:
> Hi,
>
> Is it a good idea to use Apache Arrow as a file format? Loo
Hi,
Is it a good idea to use Apache Arrow as a file format? Looks like
projecting columns isn't available by default.
One of the benefits of Parquet file format is column projection, where the
IO is limited to just the columns projected.
Regards,
Nara