I'm not sure this hint would be useful. People usually keep a buffer and
flush it when it's full, so that they control the batch size of each write
no matter how much input they get. E.g. if Spark hints to you that there
will be 1 GB of data, are you going to allocate a 1 GB buffer for it?
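
For what it's worth, a rough sketch of the buffering pattern I mean, written
against the DSv2 DataWriter interface (the batch size, flush() helper, and
commit message are placeholders, not a real implementation):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.connector.write.DataWriter;
    import org.apache.spark.sql.connector.write.WriterCommitMessage;

    // Buffer a fixed number of rows and flush, so memory stays bounded no
    // matter how many rows Spark ends up pushing into the writer.
    class BufferedDataWriter implements DataWriter<InternalRow> {
      private static final int BATCH_SIZE = 4096;   // fixed batch size
      private final List<InternalRow> buffer = new ArrayList<>(BATCH_SIZE);

      @Override
      public void write(InternalRow row) throws IOException {
        buffer.add(row.copy());      // copy: Spark may reuse the row object
        if (buffer.size() >= BATCH_SIZE) {
          flush();                   // placeholder: encode and write one batch
        }
      }

      @Override
      public WriterCommitMessage commit() throws IOException {
        if (!buffer.isEmpty()) {
          flush();                   // write the trailing partial batch
        }
        return null;                 // a real writer returns a commit message
      }

      @Override
      public void abort() {
        buffer.clear();
      }

      @Override
      public void close() {
      }

      private void flush() throws IOException {
        // encode `buffer` into the output format and append it here
        buffer.clear();
      }
    }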

Also note that there is not always a shuffle right before the write. If the
plan is shuffle read -> filter -> data source write, we don't know the final
output size unless we know the selectivity of the filter.
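
As a concrete example of that shape (paths and column names are made up):

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class FilterBeforeWrite {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("filter-before-write")
            .getOrCreate();

        // Shuffle read -> filter -> write: the exchange knows how many rows it
        // shuffled, but the filter's selectivity decides how many of them ever
        // reach the data source writer.
        Dataset<Row> df = spark.read().parquet("/data/in")
            .repartition(200, col("key"))    // shuffle; row counts known here
            .filter(col("value").gt(0));     // surviving rows unknown up front

        df.write().parquet("/data/out");     // writer sees an unpredictable count
        spark.stop();
      }
    }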

On Thu, Aug 17, 2023 at 1:02 PM Andrew Melo <andrew.m...@gmail.com> wrote:

> Hello Wenchen,
>
> On Wed, Aug 16, 2023 at 23:33 Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> > is there a way to hint to the downstream users how many rows they should
>> > expect to write?
>>
>> It will be very hard to do. Spark pipelines the execution (within shuffle
>> boundaries) and we can't predict the number of final output rows.
>>
>
> Perhaps I don't understand -- even in the case of multiple shuffles, you
> can assume that there is exactly one shuffle boundary before the write
> operation, and that shuffle boundary knows the number of input rows for
> that shuffle. That number of rows has to be, by construction, the upper
> bound on the number of rows that will be passed to the writer.
>
> If the writer can be hinted that bound, then it can do something smart with
> allocating (memory or disk). By comparison, the current API just gives it
> rows/batches one at a time, and in the case of off-heap allocation (like
> with Arrow's off-heap storage), it's crazy inefficient to try to do the
> equivalent of realloc() to grow the buffer size.
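>
> For illustration only, a hypothetical sketch (Arrow's Java vectors, not
> Laurelin code) of growing a buffer versus sizing it once from a hint:
>
>     import org.apache.arrow.memory.BufferAllocator;
>     import org.apache.arrow.memory.RootAllocator;
>     import org.apache.arrow.vector.IntVector;
>
>     public class GrowVsPreallocate {
>       public static void main(String[] args) {
>         final int rows = 1_000_000;
>         try (BufferAllocator alloc = new RootAllocator();
>              IntVector grown = new IntVector("grown", alloc);
>              IntVector sized = new IntVector("sized", alloc)) {
>           // No row-count hint: guess a capacity and let setSafe() grow it.
>           // Each growth reallocates the off-heap buffer and copies the data.
>           grown.allocateNew(1024);
>           for (int i = 0; i < rows; i++) {
>             grown.setSafe(i, i);          // reallocates + copies on overflow
>           }
>           grown.setValueCount(rows);
>
>           // With a hinted upper bound: allocate the off-heap buffer once.
>           sized.allocateNew(rows);
>           for (int i = 0; i < rows; i++) {
>             sized.set(i, i);              // no growth, no copies
>           }
>           sized.setValueCount(rows);
>         }
>       }
>     }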
>
> Thanks
> Andrew
>
>
>
>> On Mon, Aug 7, 2023 at 8:27 PM Steve Loughran <ste...@cloudera.com.invalid>
>> wrote:
>>
>>>
>>>
>>> On Thu, 1 Jun 2023 at 00:58, Andrew Melo <andrew.m...@gmail.com> wrote:
>>>
>>>> Hi all
>>>>
>>>> I've been developing for some time a Spark DSv2 plugin "Laurelin" (
>>>> https://github.com/spark-root/laurelin
>>>> ) to read the ROOT (https://root.cern) file format (which is used in
>>>> high energy physics). I've recently presented my work at a conference (
>>>> https://indico.jlab.org/event/459/contributions/11603/).
>>>>
>>>>
>>> nice paper given the esoteric nature of HEP file formats.
>>>
>>> All of that to say,
>>>>
>>>> A) Is there a reason that the builtin (e.g. Parquet) data sources can't
>>>> consume the external APIs? It's hard to write a plugin that has to use a
>>>> specific API when you're competing with another source that gets access to
>>>> the internals directly.
>>>>
>>>> B) What is the Spark-approved API to code against for writes? There is
>>>> a mess of *ColumnWriter classes in the Java namespace, there is no
>>>> documentation, and it's unclear which is preferred by the core (maybe
>>>> ArrowWriterColumnVector?). We can give a zero-copy write if the API
>>>> describes it.
>>>>
>>>
>>> There's a dangerous tendency for things that libraries need to be tagged
>>> private[spark], normally worked around by people putting their code into
>>> org.apache.spark packages. Really, everyone who does that should try to get
>>> a longer-term fix in, as well as that quick-and-effective workaround.
>>> Knowing where the problems lie would be a good first step. Spark sub-modules
>>> are probably a good place to get insight into which low-level internal
>>> operations are considered important, although many uses may be for historic
>>> "we wrote it that way a long time ago" reasons.
>>>
>>>
>>>>
>>>> C) Putting aside everything above, is there a way to hint to the
>>>> downstream users how many rows they should expect to write? Any smart
>>>> writer will use off-heap memory to write to disk/memory, so the current
>>>> API that shoves rows in one at a time doesn't do the trick. You don't want
>>>> to keep reallocating buffers constantly.
>>>>
>>>> D) What is Spark's plan for using Arrow-based columnar data
>>>> representations? I see that there are a lot of external efforts whose only
>>>> option is to inject themselves into the CLASSPATH. The regular DSv2 API is
>>>> already crippled for reads, and for writes it's even worse. Is there a
>>>> commitment from Spark core to bring the API to parity? Or is it instead
>>>> just a YMMV commitment?
>>>>
>>>
>>> No idea, I'm afraid. I do think Arrow makes a good format for
>>> processing, and it'd be interesting to see how well it actually works as a
>>> wire format to replace other things (e.g. Hive's protocol), especially on
>>> RDMA networks and the like. I'm not up to date with ongoing work there; if
>>> anyone has pointers, that'd be interesting.
>>>
>>>>
>>>> Thanks!
>>>> Andrew
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> It's dark in this basement.
>>>>
>>> --
> It's dark in this basement.
>
