If they're files in a file system, and you don't actually need multiple kinds of consumers, have you considered streamingContext.fileStream instead of Kafka?
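Something along these lines (untested sketch; the app name, batch interval, and directory path are placeholders, and it assumes the files land somewhere the cluster can read, e.g. HDFS):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Untested sketch: app name, batch interval, and directory are placeholders.
val conf = new SparkConf().setAppName("XmlDirStream")
val ssc  = new StreamingContext(conf, Seconds(60))

// Picks up files newly created in the monitored directory each batch interval.
val xmlLines = ssc.textFileStream("hdfs:///data/incoming-xml")

xmlLines.foreachRDD { rdd =>
  // parse the XML records here
}

ssc.start()
ssc.awaitTermination()

If the XML needs to be read whole-file rather than line by line, the generic ssc.fileStream[K, V, F] variant with a custom InputFormat can be used instead of textFileStream.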
On Wed, Jul 20, 2016 at 5:40 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
> Hi Cody,
>
> Thanks for your reply.
>
> Let me elaborate a bit. We have a directory where small XML files (90 KB each) arrive continuously, pushed from another node. Each file has an ID and timestamp in its name and inside the record. Data arriving in the directory has to be pushed to Kafka to finally get into Spark Streaming. It is time-series data: one 90 KB file per device every 15 minutes, 10,000 devices in total, so 50,000 files per 15 minutes. No utility can be installed on the source where the data is generated, so the data will always be FTP-ed to a directory. In Spark Streaming we are only interested in the latest 60 minutes (window) of data, i.e. the latest 4 files per device. What do you suggest for getting them into Spark Streaming reliably (probably with Kafka)?
>
> I am also considering an alternative: instead of using Spark windowing, custom Java code will push the ID of each file to Kafka and push the parsed XML data to HBase, using the file timestamp as the HBase insert timestamp. The HBase row key will be only the ID, and the column family will keep 4 versions (time-series versions) per device ID, i.e. the 4 latest records. Since HBase keeps the data per key sorted by timestamp, I will always get the latest 4 timestamped values on get(key). Spark Streaming will get the ID from Kafka and then read the data from HBase using get(ID). This would eliminate the use of windowing in Spark Streaming. Is this a good approach?
>
> Regards,
> Rabin Banerjee
>
> On Tue, Jul 19, 2016 at 8:44 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Unless you're using only 1 partition per topic, there's no reasonable way of doing this. Offsets for one topicpartition do not necessarily have anything to do with offsets for another topicpartition. You could do the last (200 / number of partitions) messages per topicpartition, but you have no guarantee as to the time those events represent, especially if your producers are misbehaving. To be perfectly clear, this is a consequence of the Kafka data model, and has nothing to do with Spark.
>>
>> So, given that it's a bad idea and doesn't really do what you're asking... you can do this using KafkaUtils.createRDD.
>>
>> On Sat, Jul 16, 2016 at 10:43 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
>> > Just to add,
>> >
>> > I want to read the MAX_OFFSET of a topic, then read from MAX_OFFSET-200, every time.
>> >
>> > I also want to know: if I want to fetch a specific offset range for batch processing, is there any option for doing that?
>> >
>> > On Sat, Jul 16, 2016 at 9:08 PM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I have 1000 Kafka topics, each storing messages for a different device. I want to use the direct approach for connecting to Kafka from Spark, and I am only interested in the latest 200 messages in Kafka.
>> >>
>> >> How do I do that?
>> >>
>> >> Thanks.
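To make the createRDD suggestion above concrete, here's a rough, untested sketch against the spark-streaming-kafka 0.8 direct API; the broker list, topic name, and offsets are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
import kafka.serializer.StringDecoder

// Untested sketch: broker list, topic name, and offsets are placeholders.
val sc = new SparkContext(new SparkConf().setAppName("OffsetRangeBatch"))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// One explicit range per topicpartition; untilOffset is exclusive.
val offsetRanges = Array(
  OffsetRange("some-topic", 0, 1000L, 1200L),
  OffsetRange("some-topic", 1,  950L, 1150L)
)

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

// rdd is an RDD of (key, value) pairs covering exactly the requested offsets.
val values = rdd.map { case (_, value) => value }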
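On the HBase idea above, a rough, untested illustration of reading the latest 4 versions for a device ID; the table, column family, and qualifier names are made up, and it assumes the family was created with VERSIONS => 4 and cells are written with the file timestamp:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// Untested sketch with made-up names; assumes the column family was created
// with VERSIONS => 4 and cells are written using the file timestamp.
val conf       = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table      = connection.getTable(TableName.valueOf("device_readings"))

val get = new Get(Bytes.toBytes("device-42")) // row key is just the device ID
get.setMaxVersions(4)                         // return up to 4 timestamped versions

val result = table.get(get)
// Cells come back newest-first, i.e. the latest 4 readings for this device.
val cells = result.getColumnCells(Bytes.toBytes("d"), Bytes.toBytes("xml"))
cells.asScala.foreach { cell =>
  val ts  = cell.getTimestamp
  val xml = Bytes.toString(CellUtil.cloneValue(cell))
  // parse/process the XML payload here
}

connection.close()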