No, the direct stream in and of itself won't give you an end-to-end guarantee, because it doesn't know anything about your output actions.

You still need to do some work. The point is that having easy access to the offsets for each batch, on a per-partition basis, makes that work easier, especially in conjunction with aggregation.
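To make that concrete, here is the rough shape of it. This is an untested sketch against the 0.8 direct stream API (spark-streaming-kafka); the broker address, topic name, batch interval, and the upsert / offsets-table details described in the comments are placeholders, not anything settled in this thread:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val conf = new SparkConf().setAppName("kafka-etl")       // app name is a placeholder
    val ssc  = new StreamingContext(conf, Seconds(30))       // batch interval is arbitrary here

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
    val topics      = Set("raw-events")                              // placeholder topic

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // One OffsetRange per Kafka partition in this batch:
      // topic, partition, fromOffset, untilOffset.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        val range = offsetRanges(TaskContext.get.partitionId)
        // Within a single Postgres transaction:
        //   1. upsert the transformed / aggregated rows (so a replay overwrites
        //      rather than duplicates)
        //   2. store (range.topic, range.partition, range.untilOffset) in an
        //      offsets table
        // On restart, compare stored offsets against the batch's ranges to skip
        // work that has already been committed.
      }
    }

    ssc.start()
    ssc.awaitTermination()

Whether you keep the offsets in the same Postgres transaction as your results, or lean purely on idempotent upserts keyed by something in the data, is up to you; the point is that the direct stream hands you the per-partition offset ranges needed to build either approach.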
On Thu, Sep 29, 2016 at 10:40 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
> If you use Spark direct streams, it ensures an end-to-end guarantee for messages.
>
> On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>> My concern with Postgres / Cassandra is only scalability. I will look further into Postgres horizontal scaling, thanks.
>>
>> Writes could be idempotent if done as upserts; otherwise updates will be idempotent but not inserts.
>>
>> Data should not be lost. The system should be as fault tolerant as possible.
>>
>> What's the advantage of using Spark for reading Kafka instead of direct Kafka consumers?
>>
>> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> I wouldn't give up the flexibility and maturity of a relational database unless you have a very specific use case. I'm not trashing Cassandra; I've used Cassandra. But if all I know is that you're doing analytics, I wouldn't want to give up the ability to easily do ad-hoc aggregations without a lot of forethought. If you're worried about scaling, there are several options for horizontally scaling Postgres in particular. One of the current best, from what I've worked with, is Citus.
>>>
>>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>> > Hi Cody
>>> > Spark direct stream is just fine for this use case.
>>> > But why Postgres and not Cassandra?
>>> > Is there anything specific here that I may not be aware of?
>>> >
>>> > Thanks
>>> > Deepak
>>> >
>>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>> >> How are you going to handle ETL failures? Do you care about lost / duplicated data? Are your writes idempotent?
>>> >>
>>> >> Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a Spark direct stream feeding Postgres.
>>> >>
>>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> >> > Is there an advantage to that vs directly consuming from Kafka? Nothing is being done to the data except some light ETL and then storing it in Cassandra.
>>> >> >
>>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>> >> >> It's better you use Spark's direct stream to ingest from Kafka.
>>> >> >>
>>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> >> >>> I don't think I need different speed storage and batch storage. Just taking in raw data from Kafka, standardizing it, and storing it somewhere the web UI can query seems like it will be enough.
>>> >> >>>
>>> >> >>> I'm thinking about:
>>> >> >>>
>>> >> >>> - Reading data from Kafka via Spark Streaming
>>> >> >>> - Standardizing, then storing it in Cassandra
>>> >> >>> - Querying Cassandra from the web UI
>>> >> >>>
>>> >> >>> That seems like it will work. My question now is whether to use Spark Streaming to read Kafka, or use Kafka consumers directly.
>>> >> >>>
>>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> >> >>>> - Spark Streaming to read data from Kafka
>>> >> >>>> - Storing the data on HDFS using Flume
>>> >> >>>>
>>> >> >>>> You don't need Spark Streaming to read data from Kafka and store it on HDFS. That is a waste of resources.
>>> >> >>>>
>>> >> >>>> Couple Flume to Kafka as the source and HDFS as the sink directly:
>>> >> >>>>
>>> >> >>>> KafkaAgent.sources = kafka-sources
>>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>>> >> >>>>
>>> >> >>>> That will be your batch layer. To analyse it, you can read the HDFS files directly with Spark, or simply store the data in a database of your choice via cron or something. Do not mix your batch layer with your speed layer.
>>> >> >>>>
>>> >> >>>> Your speed layer will ingest the same data directly from Kafka into Spark Streaming, and that will be online or near real time (defined by your window).
>>> >> >>>>
>>> >> >>>> Then you have a serving layer to present data from both the speed layer (the one from Spark Streaming) and the batch layer.
>>> >> >>>>
>>> >> >>>> HTH
>>> >> >>>>
>>> >> >>>> Dr Mich Talebzadeh
>>> >> >>>>
>>> >> >>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >> >>>>
>>> >> >>>> http://talebzadehmich.wordpress.com
>>> >> >>>>
>>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>> >> >>>>
>>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> >> >>>>> The web UI is actually the speed layer; it needs to be able to query the data online and show the results in real time.
>>> >> >>>>>
>>> >> >>>>> It also needs a custom front-end, so a system like Tableau can't be used; it must have a custom backend + front-end.
>>> >> >>>>>
>>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will work:
>>> >> >>>>>
>>> >> >>>>> - Spark Streaming to read data from Kafka
>>> >> >>>>> - Storing the data on HDFS using Flume
>>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>>> >> >>>>>
>>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can be stored on HDFS using Flume.
>>> >> >>>>>>
>>> >> >>>>>> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
>>> >> >>>>>>
>>> >> >>>>>> This is basically the batch layer, and you need something like Tableau or Zeppelin to query the data.
>>> >> >>>>>>
>>> >> >>>>>> You will also need Spark Streaming to query data online for the speed layer. That data could be stored in some transient fabric like Ignite or even Druid.
>>> >> >>>>>>
>>> >> >>>>>> HTH
>>> >> >>>>>>
>>> >> >>>>>> Dr Mich Talebzadeh
>>> >> >>>>>>
>>> >> >>>>>> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> >> >>>>>>
>>> >> >>>>>> http://talebzadehmich.wordpress.com
>>> >> >>>>>>
>>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>> >> >>>>>>> It needs to be able to scale to a very large amount of data, yes.
>>> >> >>>>>>>
>>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>> >> >>>>>>>> What is the message inflow? If it's really high, definitely Spark will be of great use.
>>> >> >>>>>>>>
>>> >> >>>>>>>> Thanks
>>> >> >>>>>>>> Deepak
>>> >> >>>>>>>>
>>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for ideas.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I need to:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra / raw HDFS / Elasticsearch / Postgres)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Java is being used as the backend language for everything (the backend of the web UI, as well as the ETL layer).
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'm considering:
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL layer (receive raw data from Kafka, standardize & store it)
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS for storing the standardized data, and to allow queries
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to run queries across the data (mostly filters), or directly run queries against Cassandra / HBase
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these alternatives I should go with (e.g., using raw Kafka consumers vs Spark for ETL, which persistent data store to use, and how to query that data store in the backend of the web UI, for displaying the reports).
>>> >> >>>>>>>>>
>>> >> >>>>>>>>> Thanks.
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net