> is there a way to hint to the downstream users on the number of rows expected to write?
It will be very hard to do. Spark pipelines the execution (within shuffle boundaries), so we can't predict the number of final output rows.

On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran <ste...@cloudera.com.invalid> wrote:
>
> On Thu, 1 Jun 2023 at 00:58, Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> Hi all
>>
>> I've been developing for some time a Spark DSv2 plugin, "Laurelin"
>> (https://github.com/spark-root/laurelin), to read the ROOT
>> (https://root.cern) file format (which is used in high energy physics).
>> I recently presented this work at a conference
>> (https://indico.jlab.org/event/459/contributions/11603/).
>>
>
> Nice paper, given the esoteric nature of HEP file formats.
>
>> All of that to say:
>>
>> A) Is there any reason the built-in data sources (e.g. Parquet) can't
>> consume the external APIs? It's hard to write a plugin that has to use a
>> restricted API when you're competing with another source that gets
>> access to the internals directly.
>>
>> B) What is the Spark-approved API to code against for writes? There is
>> a mess of *ColumnWriter classes in the Java namespace, and since there
>> is no documentation, it's unclear which one the core prefers (maybe
>> ArrowWriterColumnVector?). We can provide a zero-copy write if the API
>> supports it.
>>
>
> There's a dangerous tendency for things that libraries need to be tagged
> private[spark], normally worked around by people putting their code into
> org.apache.spark packages. Really, everyone who does that should try to
> get a longer-term fix in, as well as that quick-and-effective workaround.
> Knowing where the problems lie would be a good first step. Spark
> sub-modules are probably a place to get insight into which low-level
> internal operations are considered important, although many uses may be
> there for historic "we wrote it that way a long time ago" reasons.
>
>> C) Putting aside everything above, is there a way to hint to the
>> downstream users on the number of rows expected to write? Any smart
>> writer will use off-heap memory to write to disk/memory, so the current
>> API that pushes rows in one at a time doesn't do the trick: you don't
>> want to keep reallocating buffers constantly.
>>
>> D) What is Spark's plan for Arrow-based columnar data representations?
>> I see a lot of external efforts whose only option is to inject
>> themselves into the CLASSPATH. The regular DSv2 API is already crippled
>> for reads, and for writes it's even worse. Is there a commitment from
>> the Spark core to bring the API to parity, or is it instead just a YMMV
>> commitment?
>>
>
> No idea, I'm afraid. I do think Arrow makes a good format for processing,
> and it'd be interesting to see how well it actually works as a wire format
> to replace other things (e.g. Hive's protocol), especially on RDMA
> networks and the like. I'm not up to date with ongoing work there; if
> anyone has pointers, that'd be interesting.
>
>> Thanks!
>> Andrew
>>
>> --
>> It's dark in this basement.
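
To illustrate what a writer is left to do without such a hint: the DSv2
org.apache.spark.sql.connector.write.DataWriter interface only receives rows
one at a time via write(), so a writer that stages output off-heap has to grow
its own buffer as rows arrive. Below is a minimal sketch of that pattern; the
encode() and flushToStorage() helpers and the SimpleCommitMessage class are
hypothetical placeholders for format-specific code, not part of the Spark API.

    import java.io.IOException;
    import java.nio.ByteBuffer;

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.connector.write.DataWriter;
    import org.apache.spark.sql.connector.write.WriterCommitMessage;

    /**
     * Sketch of a DSv2 DataWriter that copes with not knowing the row count
     * up front: it grows a direct (off-heap) buffer geometrically as rows
     * arrive instead of relying on a size hint from Spark.
     */
    public class GrowingBufferWriter implements DataWriter<InternalRow> {
        private ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // start at 1 MiB

        @Override
        public void write(InternalRow row) throws IOException {
            byte[] encoded = encode(row); // format-specific encoding (placeholder)
            if (buffer.remaining() < encoded.length) {
                // No row-count hint exists, so grow geometrically and copy the old contents.
                int newCapacity = Math.max(buffer.capacity() * 2,
                                           buffer.position() + encoded.length);
                ByteBuffer bigger = ByteBuffer.allocateDirect(newCapacity);
                buffer.flip();
                bigger.put(buffer);
                buffer = bigger;
            }
            buffer.put(encoded);
        }

        @Override
        public WriterCommitMessage commit() throws IOException {
            buffer.flip();
            flushToStorage(buffer);           // hypothetical: hand the bytes to the file format
            return new SimpleCommitMessage(); // hypothetical commit-message class
        }

        @Override
        public void abort() throws IOException { /* discard partial output */ }

        @Override
        public void close() throws IOException { /* release resources if needed */ }

        // Placeholders for the format-specific parts of the writer.
        private byte[] encode(InternalRow row) { return new byte[0]; }
        private void flushToStorage(ByteBuffer buf) { }
        private static class SimpleCommitMessage implements WriterCommitMessage { }
    }

Geometric growth at least amortizes the reallocation cost, but it still means
copying data that a row-count (or byte-size) hint would have let the writer
allocate for up front.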