On Thu, Jul 12, 2018 at 10:23 AM Arun Mahadevan <ar...@apache.org> wrote:

> Yes, ForeachWriter [1] could be an option if you want to write to different
> sinks. You can put in your custom logic to split the data across the
> different sinks.
>
> The drawback here is that you cannot plug in existing sinks like Kafka; you
> need to write the custom logic yourself, and you cannot scale the partitions
> for the sinks independently.
>
> [1]
> https://spark.apache.org/docs/2.1.2/api/java/org/apache/spark/sql/ForeachWriter.html
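>
> A minimal sketch of that approach in Scala (the "sink" marker field and the
> file-based connections below are just placeholders for real sink
> connections):
>
>   import org.apache.spark.sql.{ForeachWriter, Row}
>
>   // Routes each record to one of two sinks based on a marker field that
>   // was added to the row during transformation.
>   class RoutingWriter extends ForeachWriter[Row] {
>     var connA: java.io.PrintWriter = _  // stand-in for a real sink connection
>     var connB: java.io.PrintWriter = _
>
>     override def open(partitionId: Long, version: Long): Boolean = {
>       connA = new java.io.PrintWriter(s"sinkA-$partitionId-$version.out")
>       connB = new java.io.PrintWriter(s"sinkB-$partitionId-$version.out")
>       true
>     }
>
>     override def process(row: Row): Unit = {
>       // "sink" is the marker column annotated during transformation
>       if (row.getAs[String]("sink") == "A") connA.println(row.mkString(","))
>       else connB.println(row.mkString(","))
>     }
>
>     override def close(errorOrNull: Throwable): Unit = {
>       connA.close(); connB.close()
>     }
>   }
>
>   // usage: df.writeStream.foreach(new RoutingWriter).start()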
>
> From: chandan prakash <chandanbaran...@gmail.com>
> Date: Thursday, July 12, 2018 at 2:38 AM
> To: Tathagata Das <tathagata.das1...@gmail.com>, "ymaha...@snappydata.io"
> <ymaha...@snappydata.io>, "priy...@asperasoft.com" <priy...@asperasoft.com>,
> "user @spark" <user@spark.apache.org>
> Subject: Re: [Structured Streaming] Avoiding multiple streaming queries
>
> Hi,
> Have any of you thought about writing a custom foreach sink writer that can
> decide which record should go to which sink (based on some marker in the
> record, which we could annotate during transformation; a rough sketch of
> such tagging is below) and then write to the specific sink accordingly?
> This will mean that:
> 1. every custom sink writer will have connections to as many sinks as there
> are types of sinks that records can go to.
> 2. every record will be read only once in the single query but can be
> written to multiple sinks
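>
> For illustration, the tagging during transformation could look roughly like
> this (transformedDf, the "sink" column, the condition and the tag values
> are all made up):
>
>   import org.apache.spark.sql.functions.{col, when}
>
>   // transformedDf is a hypothetical streaming DataFrame after our
>   // transformations; tag each row with the sink it should go to
>   val tagged = transformedDf.withColumn("sink",
>     when(col("eventType") === "metrics", "A").otherwise("B"))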
>
> Do you guys see any drawbacks in this approach?
> One drawback, of course, is that a sink is supposed to write the records as
> they are, but here we are introducing some intelligence into the sink.
> Apart from that, do you see any other issues with this approach?
>
> Regards,
> Chandan
>
>
> On Thu, Feb 15, 2018 at 7:41 AM Tathagata Das <tathagata.das1...@gmail.com>
> wrote:
>
>> Of course, you can write to multiple Kafka topics from a single query. If
>> the dataframe you want to write has a column named "topic" (along with
>> "key" and "value" columns), each row will be written to the topic named in
>> that row. This works automatically, so the only thing you need to figure
>> out is how to generate the value of that column.
>>
>> This is documented -
>> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
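>>
>> A minimal sketch (aggDf, the topic names and the routing condition are
>> just illustrative):
>>
>>   import org.apache.spark.sql.functions.{col, when}
>>
>>   // derive the per-row "topic" column, then shape key/value for Kafka
>>   val out = aggDf
>>     .withColumn("topic",
>>       when(col("category") === "a", "topic_a").otherwise("topic_b"))
>>     .selectExpr("topic", "CAST(id AS STRING) AS key",
>>       "to_json(struct(*)) AS value")
>>
>>   // no "topic" option set, so each row goes to the topic in its own row
>>   out.writeStream
>>     .format("kafka")
>>     .option("kafka.bootstrap.servers", "host:9092")
>>     .option("checkpointLocation", "/tmp/ckpt")
>>     .start()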
>>
>> Or am I misunderstanding the problem?
>>
>> TD
>>
>>
>>
>>
>> On Tue, Feb 13, 2018 at 10:45 AM, Yogesh Mahajan <ymaha...@snappydata.io>
>> wrote:
>>
>>> I had a similar issue, and I think that's where the Structured Streaming
>>> design falls short.
>>> Seems like Question #2 in your email is a viable workaround for you.
>>>
>>> In my case, I have a custom Sink backed by an efficient in-memory column
>>> store suited for fast ingestion.
>>>
>>> I have a Kafka stream coming from one topic, and I need to classify the
>>> stream based on schema.
>>> For example, a Kafka topic can carry messages with three different types
>>> of schema, and I would like to ingest them into three different column
>>> tables (each having a different schema) using my custom Sink
>>> implementation.
>>>
>>> Right now the only(?) option I have is to create three streaming queries
>>> reading the same topic and ingesting into the respective column tables
>>> using their Sink implementations.
>>> These three streaming queries create three underlying
>>> IncrementalExecutions and three KafkaSources, i.e. three queries all
>>> reading the same data from the same Kafka topic.
>>> Even with CachedKafkaConsumers at the partition level, this is not an
>>> efficient way to handle a simple streaming use case.
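>>>
>>> Roughly, the current workaround looks like this (the sink format name,
>>> the schema detection and the table names are placeholders for our custom
>>> Sink):
>>>
>>>   import org.apache.spark.sql.functions.col
>>>
>>>   val source = spark.readStream
>>>     .format("kafka")
>>>     .option("kafka.bootstrap.servers", "host:9092")
>>>     .option("subscribe", "events")
>>>     .load()
>>>     .selectExpr("CAST(value AS STRING) AS json")
>>>
>>>   // three separate queries, each re-reading the same topic
>>>   Seq("schemaA", "schemaB", "schemaC").foreach { s =>
>>>     source.filter(col("json").contains(s))  // stand-in for schema detection
>>>       .writeStream
>>>       .format("my-column-store")            // our custom Sink provider
>>>       .option("table", s"table_$s")
>>>       .option("checkpointLocation", s"/tmp/ckpt_$s")
>>>       .start()
>>>   }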
>>>
>>> One workaround to overcome this limitation is to have the same schema for
>>> all the messages in a Kafka partition; unfortunately this is not in our
>>> control, and customers cannot change it due to their dependencies on
>>> other subsystems.
>>>
>>> Thanks,
>>> http://www.snappydata.io/blog <http://snappydata.io>
>>>
>>> On Mon, Feb 12, 2018 at 5:54 PM, Priyank Shrivastava <
>>> priy...@asperasoft.com> wrote:
>>>
>>>> I have a structured streaming query which sinks to Kafka.  This query
>>>> has complex aggregation logic.
>>>>
>>>>
>>>> I would like to sink the output DF of this query to multiple Kafka
>>>> topics, each partitioned on a different ‘key’ column.  I don’t want to
>>>> have a separate Kafka sink for each of the different Kafka topics,
>>>> because that would mean running multiple streaming queries, one for each
>>>> Kafka topic, which is especially costly since my aggregation logic is
>>>> complex.
>>>>
>>>>
>>>> Questions:
>>>>
>>>> 1.  Is there a way to output the results of a structured streaming
>>>> query to multiple Kafka topics, each with a different key column, but
>>>> without having to execute multiple streaming queries?
>>>>
>>>>
>>>> 2.  If not, would it be efficient to cascade the multiple queries, such
>>>> that the first query does the complex aggregation and writes its output
>>>> to Kafka, and the other queries just read the output of the first query
>>>> and write to their respective topics in Kafka, thus avoiding doing the
>>>> complex aggregation again?
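>>>>
>>>> (By cascading I mean roughly the following; aggDf, the topic names and
>>>> the columns are made up: the first query does the expensive aggregation
>>>> once and writes to an intermediate topic, and cheap follow-up queries
>>>> re-key that output into the per-key topics.)
>>>>
>>>>   // query 1: expensive aggregation, written once to an intermediate topic
>>>>   aggDf
>>>>     .selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
>>>>     .writeStream.outputMode("update").format("kafka")
>>>>     .option("kafka.bootstrap.servers", "host:9092")
>>>>     .option("topic", "agg_intermediate")
>>>>     .option("checkpointLocation", "/tmp/ckpt_agg")
>>>>     .start()
>>>>
>>>>   // query 2..n: lightweight pass-through, one per target topic
>>>>   spark.readStream.format("kafka")
>>>>     .option("kafka.bootstrap.servers", "host:9092")
>>>>     .option("subscribe", "agg_intermediate")
>>>>     .load()
>>>>     .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
>>>>     .writeStream.format("kafka")
>>>>     .option("kafka.bootstrap.servers", "host:9092")
>>>>     .option("topic", "topic_for_key_a")
>>>>     .option("checkpointLocation", "/tmp/ckpt_a")
>>>>     .start()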
>>>>
>>>>
>>>> Thanks in advance for any help.
>>>>
>>>>
>>>> Priyank
>>>>
>>>>
>>>>
>>>
>>
>
> --
> Chandan Prakash
>
>
