Re: Watermarking without aggregation with Structured Streaming

2018-10-24 Thread Sanjay Awatramani
Try if this works... println(query.lastProgress.eventTime.get("watermark")) Regards, Sanjay. On 2018/09/30 09:05:40, peay wrote: > Thanks for the pointers. I guess right now the only workaround would be to apply a "dummy" aggregation (e.g., group by the timestamp itself) only to have the sta…
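A minimal sketch of the suggestion above, assuming a running StreamingQuery named query. Note that lastProgress is null before the first trigger completes, and eventTime is a java.util.Map whose "watermark" key is only present once a watermark is defined, so both are wrapped defensively here:

```scala
import org.apache.spark.sql.streaming.StreamingQuery

// Read the current watermark from the most recent progress report.
// `query` is assumed to be an already-started StreamingQuery.
def currentWatermark(query: StreamingQuery): Option[String] = {
  // lastProgress is null until the first trigger has completed.
  Option(query.lastProgress)
    // eventTime is a java.util.Map[String, String]; "watermark" is
    // only present when the query defines a watermark.
    .flatMap(p => Option(p.eventTime.get("watermark")))
}
```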

Require some clarity on partitioning

2014-04-07 Thread Sanjay Awatramani
Hi, I was going through Matei's Advanced Spark presentation at https://www.youtube.com/watch?v=w0Tisli7zn4 and had a few questions. The slides for this video are at http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf . The PageRank example int…
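The snippet is cut off, but the partitioning idea in those slides is co-partitioning: hash-partition the links RDD once and cache it, so the per-iteration join with ranks never reshuffles the links. A sketch under that assumption (the edges input, partition count, and iteration count are all illustrative):

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

// Sketch of the PageRank partitioning trick from the slides:
// fix the partitioning of `links` up front and cache it, so each
// iteration's join sends only the small `ranks` RDD over the wire.
def pageRankSketch(sc: SparkContext, edges: Seq[(String, Seq[String])]): Unit = {
  val links = sc.parallelize(edges)
    .partitionBy(new HashPartitioner(8)) // fixed partitioning
    .cache()

  var ranks = links.mapValues(_ => 1.0)  // inherits links' partitioner

  for (_ <- 1 to 10) {
    val contribs = links.join(ranks).values.flatMap {
      case (dests, rank) => dests.map(d => (d, rank / dests.size))
    }
    ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
  }
  ranks.collect().foreach(println)
}
```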

Implementation problem with Streaming

2014-03-25 Thread Sanjay Awatramani
Hi, I had initially thought of a streaming approach to solve my problem, but I am stuck at a few places and want an opinion on whether this problem is suitable for streaming, or whether it is better to stick to basic Spark. Problem: I get chunks of log files in a folder and need to do some analysis on them on an h…
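The problem description is truncated, but given the hourly log-chunk setup it describes, the streaming variant would look roughly like this sketch; the directory path and the hourly batch interval are assumptions, not values from the original message:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: watch a directory for newly arriving log chunks
// with an hourly batch interval.
object HourlyLogs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HourlyLogs")
    val ssc  = new StreamingContext(conf, Seconds(60 * 60))

    // textFileStream only picks up files newly moved into the directory.
    val lines = ssc.textFileStream("/data/incoming-logs")
    lines.count().print() // placeholder for the real analysis

    ssc.start()
    ssc.awaitTermination()
  }
}
```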

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Sanjay Awatramani
…ening because of the lazy transformations. Wherever a sliding window is used, in most cases no actions are taken on the pre-window batches; hence my gut feeling was that Streaming would read every batch as long as actions are taken on the windowed stream. Regards, Sanjay On Fr…

Sliding Window operations do not work as documented

2014-03-21 Thread Sanjay Awatramani
Hi, I want to run a map/reduce process over the last 5 seconds of data, every 4 seconds. This is quite similar to the sliding window pictorial example under the Window Operations section of http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html . The RDDs returned by window t…
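A minimal sketch of the windowing described (5-second window, sliding every 4 seconds), assuming a source DStream of text lines named lines and a word-count job standing in for the actual map/reduce:

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Sketch: word counts over the last 5 seconds, recomputed every
// 4 seconds. `lines` is an assumed source DStream[String].
def windowedCounts(lines: DStream[String]): DStream[(String, Int)] = {
  lines
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    // Window length 5s, slide interval 4s. Per the docs, both must
    // be multiples of the batch interval (e.g., 1s here) for window
    // operations to behave as documented.
    .reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(4))
}
```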

Re: N-Fold validation and RDD partitions

2014-03-21 Thread Sanjay Awatramani
Hi Jaonary, I believe the n folds should be mapped to n keys in Spark using a map function. You can then reduce the returned PairRDD and you should get your metric. I don't understand partitions fully, but from what I understand of them, they aren't required in your scenario.
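A sketch of that suggestion: assign each record a fold index as its key, then reduce the PairRDD to get one metric per fold. The fold count, the hash-based fold assignment, and the squared-value metric are all illustrative assumptions:

```scala
import org.apache.spark.rdd.RDD

// Sketch: tag each record with a fold index (the key), then
// aggregate a per-fold metric with a reduce over the PairRDD.
def perFoldMetric(data: RDD[Double], nFolds: Int = 5): RDD[(Int, Double)] = {
  data
    .map(x => (math.abs(x.hashCode) % nFolds, x)) // fold index as key
    .mapValues(x => x * x)                        // stand-in per-record metric
    .reduceByKey(_ + _)                           // one summed metric per fold
}
```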

Re: Relation between DStream and RDDs

2014-03-21 Thread Sanjay Awatramani
> …e element being the count result for the corresponding original RDD. For reduce, it's the same using the reduce operation... The only operations that are a bit more complex are reduceByWindow & …

Re: Relation between DStream and RDDs

2014-03-20 Thread Sanjay Awatramani
…tiple RDDs from a DStream in *every batch*. Can you elaborate on why you need multiple RDDs every batch? TD On Wed, Mar 19, 2014 at 10:20 PM, Sanjay Awatramani wrote: > Hi, as I understand, a DStream consists of 1 or more…

Relation between DStream and RDDs

2014-03-19 Thread Sanjay Awatramani
Hi, as I understand, a DStream consists of 1 or more RDDs, and foreachRDD will run a given function on each and every RDD inside a DStream. I created a simple program that reads log files from a folder every hour: JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(60 * 60…
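The code above is cut off mid-line, but the pattern it describes is a file-based stream processed with foreachRDD. The original uses the Java API; the sketch below is the Scala equivalent, with the directory path as an assumption. The key point from the thread: a DStream yields exactly one RDD per batch interval, so foreachRDD fires once per batch:

```scala
import org.apache.spark.streaming.StreamingContext

// Sketch in the Scala API: a file-based DStream produces one RDD per
// batch interval, and foreachRDD runs the given function on each of
// those RDDs.
def describeBatches(ssc: StreamingContext): Unit = {
  ssc.textFileStream("/data/logs").foreachRDD { rdd =>
    // Called once per batch, even when that batch's RDD is empty.
    println(s"Batch RDD with ${rdd.count()} records")
  }
}
```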