Apache Spark - How to convert DataFrame json string to structured element using schema_of_json

2022-09-05 Thread M Singh
Hi: In apache spark we can read json using the following: spark.read.json("path"). There is support to convert a json string in a dataframe into a structured element using (https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#from_json-org.apache.spark.sql.Column-org.
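
A minimal sketch of the pattern being asked about, assuming a DataFrame df with a string column named payload (the column name and sample document are hypothetical):

    import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

    // Infer a schema from one representative JSON document, then parse the column.
    val sample = """{"id": 1, "name": "a"}"""
    val parsed = df.withColumn("data", from_json(col("payload"), schema_of_json(lit(sample))))
    parsed.select(col("data.id"), col("data.name")).show()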

Re: Apache Kafka / Spark Integration - Exception - The server disconnected before a response was received.

2018-04-10 Thread M Singh
10, 2018, 7:49:42 AM PDT, Daniel Hinojosa wrote: This looks more like a Spark issue than a Kafka one, judging by the stack trace; are you using Spark structured streaming with Kafka integration by chance? On Mon, Apr 9, 2018 at 8:47 AM, M Singh wrote: > Hi Folks: > Just wanted

Apache Spark - Structured Streaming State Management With Watermark

2018-03-28 Thread M Singh
Hi: I am using Apache Spark Structured Streaming (2.2.1) to implement custom sessionization for events. The processing is in two steps: 1. flatMapGroupsWithState (based on user id) - which stores the state of the user and emits events every minute until an expire event is received 2. The next step i
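
A rough sketch of step 1, assuming hypothetical Event/SessionState/SessionUpdate case classes and events: Dataset[Event]; the real session and timeout logic would differ:

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(userId: String, kind: String)
    case class SessionState(count: Long)
    case class SessionUpdate(userId: String, count: Long, expired: Boolean)

    def updateSession(userId: String, events: Iterator[Event],
                      state: GroupState[SessionState]): Iterator[SessionUpdate] = {
      val evs = events.toSeq
      val count = state.getOption.map(_.count).getOrElse(0L) + evs.size
      if (evs.exists(_.kind == "expire")) {
        state.remove()                      // expire event received: drop the state
        Iterator(SessionUpdate(userId, count, expired = true))
      } else {
        state.update(SessionState(count))   // keep accumulating
        Iterator(SessionUpdate(userId, count, expired = false))
      }
    }

    // import spark.implicits._ is assumed for the encoders below
    val sessions = events.groupByKey(_.userId)
      .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(updateSession)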

Apache Spark - Structured Streaming StreamExecution Stats Description

2018-03-28 Thread M Singh
Hi: I am using spark structured streaming 2.2.1 and am using flatMapGroupsWithState and a groupBy count operator. In the StreamExecution logs I see two entries for stateOperators "stateOperators" : [ { "numRowsTotal" : 1617339, "numRowsUpdated" : 9647 }, { "numRowsTotal" : 1326355,

Apache Spark Structured Streaming - How to keep executors alive.

2018-03-23 Thread M Singh
Hi: I am working on spark structured streaming (2.2.1) with kafka and want 100 executors to stay alive. I set spark.executor.instances to be 100. The process starts running with 100 executors but after some time only a few remain, which causes a backlog of events from kafka. I thought I saw a sett
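
One guess at the culprit is dynamic allocation, which reclaims idle executors and is on by default on some platforms (e.g. EMR). A hedged sketch of pinning the count; these configs must be set before the application starts:

    import org.apache.spark.sql.SparkSession

    // Disable dynamic allocation so the requested executor count is kept.
    val spark = SparkSession.builder()
      .appName("kafka-structured-streaming")
      .config("spark.dynamicAllocation.enabled", "false")
      .config("spark.executor.instances", "100")
      .getOrCreate()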

Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-03-22 Thread M Singh
Hi: I am working on a real-time application using spark structured streaming (v 2.2.1). The application reads data from kafka and if there is a failure, I would like to ignore the checkpoint. Is there any configuration to just read from the last kafka offset after a failure and ignore any offset che
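
There is no option that keeps a checkpoint while skipping its offsets; the usual workaround is to start with a fresh checkpoint directory and ask the source for the latest offsets. A sketch (topic and servers are placeholders):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")   // only honored when no checkpoint exists yet
      .load()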

Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread M Singh
Hi: I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR and for the last few days, after running the application for 30-60 minutes, I get the exception from the Kafka Consumer included below. The structured streaming application is processing 1 minute worth of data from the kafka topic. So I've tried
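
Two Kafka source options commonly tuned for this symptom, offered as a hedged starting point rather than a fix (values are illustrative):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("failOnDataLoss", "false")               // don't kill the query when offsets age out
      .option("kafkaConsumer.pollTimeoutMs", "120000") // give slow brokers more time per poll
      .load()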

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-24 Thread M Singh
Hi Vijay: I am using spark-shell because I am still prototyping the steps involved. Regarding executors - I have 280 executors and the UI only shows a few straggler tasks on each trigger. The UI does not show too much time spent on GC. I suspect the delay is because of getting data from kafka. The num

Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread M Singh
Hi: I am working with spark structured streaming (2.2.1) reading data from Kafka (0.11). I need to aggregate data ingested every minute and I am using spark-shell at the moment. The message ingestion rate is approx 500k/second. During some trigger intervals (1 minute) especially when t
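
One knob that often evens out triggers at this rate is maxOffsetsPerTrigger, which caps how much each batch pulls from Kafka; a sketch with an illustrative value:

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", 30000000L)  // ~1 minute at ~500k msgs/sec, illustrative
      .load()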

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread M Singh
helpful to answer some of them. For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSec. This explains the difference between the two rowsPerSec values. On Feb 10, 2018, at 8:42 PM, M Singh wrote: Hi: I am working with spark 2.2.0 and am looking at

Re: Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-10 Thread M Singh
Just checking if anyone has any pointers for dynamically updating query state in structured streaming. Thanks On Thursday, February 8, 2018 2:58 PM, M Singh wrote: Hi Spark Experts: I am trying to use a stateful udf with spark structured streaming that needs to update the state

Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-10 Thread M Singh
Hi: I am working with spark 2.2.0 and am looking at the query status console output. My application reads from kafka - performs flatMapGroupsWithState and then aggregates the elements for two group counts. The output is sent to the console sink. I see the following output (with my questions in

Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-08 Thread M Singh
Hi Spark Experts: I am trying to use a stateful udf with spark structured streaming that needs to update the state periodically. Here is the scenario: 1. I have a udf with a variable with a default value (eg: 1). This value is applied to a column (eg: subtract the variable from the column value). 2.

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
s://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Mon, Feb 5, 2018 at 8:11 PM, M Singh wrote: Just checking if anyone has more details on how the watermark works in cases where the event time is earlier than the processing timestamp. On Friday, February 2, 2018

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-05 Thread M Singh
Just checking if anyone has more details on how the watermark works in cases where the event time is earlier than the processing timestamp. On Friday, February 2, 2018 8:47 AM, M Singh wrote: Hi Vishu/Jacek: Thanks for your responses. Jacek - At the moment, the current time for my use case is

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-05 Thread M Singh
Hi TD: Just wondering if you have any insight for me or need more info. Thanks On Thursday, February 1, 2018 7:43 AM, M Singh wrote: Hi TD: Here is the updated code with explain and the full stack trace. Please let me know what could be the issue and what to look for in the explain output

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
don't want to process it, you could do a filter based on its EventTime field, but I guess you will have to compare it with the processing time since there is no API to access Watermark by the user.  -Vishnu On Fri, Jan 26, 2018 at 1:14 PM, M Singh wrote: Hi: I am trying to filter out re
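
A sketch of the filter Vishnu describes, comparing the event-time column (name assumed) against the current processing time, since there is no user API to read the watermark:

    import org.apache.spark.sql.functions.expr

    // Keep only records whose event time is within 10 minutes of processing time.
    val fresh = events.filter(expr("eventTime > current_timestamp() - INTERVAL 10 MINUTES"))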

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-01 Thread M Singh
tion.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) ... 1 more On Wednesday, January 31, 2018 3:46 PM, Tathagata Das wrote: Could you give the full stack trace of the exception? Also, can you do `dataframe2.explain(true)` and show us the

Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread M Singh
Hi Folks: I have to add a column to a structured streaming dataframe but when I do that (using select or withColumn) I get an exception. I can add a column to a non-streaming structured dataframe. I could not find any documentation on how to do this in the following doc [https://spar
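
For reference, adding a literal column with withColumn is supported on streaming DataFrames, so the exception in this thread likely comes from something else; a minimal sketch (column name and value are illustrative):

    import org.apache.spark.sql.functions.lit

    // Adding a constant column works on streaming DataFrames as well.
    val df2 = streamingDf.withColumn("source", lit("kafka"))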

Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread M Singh
Hi: I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time. Is the watermark api applicable to this scenario (ie, filtering lagging records) or is it only applicable with aggregation? I could not get a clear understanding from the documen

Re: Apache Spark - Custom structured streaming data source

2018-01-26 Thread M Singh
8:36 PM, "M Singh" wrote: Hi: I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved. I've looked at some of the methods available in the SparkSession but these are internal to the sql package

Apache Spark - Custom structured streaming data source

2018-01-25 Thread M Singh
Hi: I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved. I've looked at some of the methods available in the SparkSession but these are internal to the sql package: private[sql] def internalCreateDataFrame
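
In the 2.x line, custom sources plug in through the StreamSourceProvider and Source traits rather than SparkSession's private helpers. A bare skeleton, hedged since these are internal APIs (later superseded by DataSource V2) and the getBatch body is a stub:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.{Offset, Source}
    import org.apache.spark.sql.sources.StreamSourceProvider
    import org.apache.spark.sql.types.StructType

    class MySourceProvider extends StreamSourceProvider {
      private val mySchema = new StructType().add("value", "string")

      override def sourceSchema(sqlContext: SQLContext, schema: Option[StructType],
          providerName: String, parameters: Map[String, String]): (String, StructType) =
        ("my-source", schema.getOrElse(mySchema))

      override def createSource(sqlContext: SQLContext, metadataPath: String,
          schema: Option[StructType], providerName: String,
          parameters: Map[String, String]): Source = new Source {
        override def schema: StructType = mySchema
        override def getOffset: Option[Offset] = None  // report the newest available offset
        override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???  // produce rows
        override def stop(): Unit = ()
      }
    }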

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-05 Thread M Singh
ring-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Thu, Jan 4, 2018 at 10:49 PM, M Singh wrote: Thanks Tathagata for your answer. The reason I was asking

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread M Singh
lated note, these APIs are subject to change. In fact in the upcoming release 2.3, we are adding a DataSource V2 API for batch/microbatch-streaming/continuous-streaming sources and sinks. On Wed, Jan 3, 2018 at 11:23 PM, M Singh wrote: Hi: The documentation for Sink.addBatch is as follows:   /

Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread M Singh
Hi: The documentation for Sink.addBatch is as follows:   /**   * Adds a batch of data to this sink. The data for a given `batchId` is deterministic and if   * this method is called more than once with the same batchId (which will happen in the case of   * failures), then `data` should only be ad
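
A skeletal Sink illustrating the contract quoted above: addBatch receives only that trigger's new rows, and must tolerate replays of the same batchId after a failure. A hedged sketch against the 2.x internal API:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class LoggingSink extends Sink {
      @volatile private var lastBatchId = -1L
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId <= lastBatchId) return         // replayed batch: already handled
        val rows = data.collect()                  // incremental: only this trigger's rows
        println(s"batch $batchId rows=${rows.length}")
        lastBatchId = batchId
      }
    }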

Re: Spark on EMR suddenly stalling

2018-01-01 Thread M Singh
Hi Jeroen: I am not sure if I missed it - but can you let us know what your input source and output sink are? In some cases, I found that saving to S3 was a problem. In that case I started saving the output to EMR HDFS and later copied it to S3 using s3-dist-cp, which solved our issue. Mans

Apache Spark - Using withWatermark for DataSets

2017-12-30 Thread M Singh
Hi: I am working with DataSets so that I can use mapGroupsWithState for business logic and then use dropDuplicates over a set of fields. I would like to use withWatermark so that I can restrict how much state is stored. From the API it looks like withWatermark takes a string - timesta
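
withWatermark indeed takes the event-time column name and a delay string, and it returns Dataset[T], so it composes with Dataset operations; a sketch with an assumed case class:

    import java.sql.Timestamp

    case class E(userId: String, eventTime: Timestamp)

    val deduped = ds                              // ds: Dataset[E]
      .withWatermark("eventTime", "10 minutes")   // bounds the state kept for dedup
      .dropDuplicates("userId", "eventTime")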

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-30 Thread M Singh
n external system (like kafka) Eyal On Tue, Dec 26, 2017 at 10:37 PM, M Singh wrote: Thanks Diogo.  My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira wrote: Hi M Singh

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-26 Thread M Singh
Thanks Diogo. My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira wrote: Hi M Singh! Here I'm using query.stop() On Dec 25, 2017 19:19, "M Singh&
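
One common driver-side pattern, sketched under the assumption of some external stop signal (a marker file, a DB row, a control topic) that the cluster job can poll; shouldStop() is a hypothetical flag check:

    val query = df.writeStream.format("console").start()

    // Poll an external stop signal from the driver; stop() lets the
    // current batch finish and leaves the checkpoint in a clean state.
    while (query.isActive) {
      if (shouldStop()) query.stop()         // shouldStop(): hypothetical flag check
      else query.awaitTermination(10000)     // wait up to 10s, then re-check
    }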

Apache Spark - (2.2.0) - window function for DataSet

2017-12-25 Thread M Singh
Hi: I would like to use a window function on a DataSet stream (Spark 2.2.0). The window function requires a Column as argument and can be used with DataFrames by passing the column. Is there an analogous window function, or pointers to how a window function can be used with DataSets? Thanks
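
The same window function applies to a Dataset, since groupBy accepts Columns on Datasets too (the grouped result is a DataFrame); a sketch assuming an eventTime column:

    import org.apache.spark.sql.functions.{col, window}

    // Tumbling 1-minute windows keyed on the event-time column.
    val counts = ds
      .groupBy(window(col("eventTime"), "1 minute"))
      .count()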

Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread M Singh
Hi: I am using spark structured streaming (v 2.2.0) to read data from files. I have configured a checkpoint location. On stopping and restarting the application, it looks like it is reading the previously ingested files. Is that expected behavior? Is there any way to prevent reading files that

Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread M Singh
Hi: Are there any patterns/recommendations for gracefully stopping a structured streaming application? Thanks

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-12 Thread M Singh
directory.  TD On Fri, Jul 11, 2014 at 11:46 AM, M Singh wrote: So, is it expected for the process to generate stages/tasks even after processing a file ? > > >Also, is there a way to figure out the file that is getting processed and when >that process is complete ? > > &

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread M Singh
shuffle-based operation like reduceByKey, groupByKey, join, etc., the system is essentially redistributing the data across the cluster and it needs to know how many parts it should divide the data into. That's where the default parallelism is used. TD On Fri, Jul 11, 2014 at 3:16 AM, M Singh

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread M Singh
24 PM, Tathagata Das wrote: How are you supplying the text file?  On Wed, Jul 9, 2014 at 11:51 AM, M Singh wrote: Hi Folks: > > > >I am working on an application which uses spark streaming (version 1.1.0 >snapshot on a standalone cluster) to process text file and save counters

Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-09 Thread M Singh
Hi Folks: I am working on an application which uses spark streaming (version 1.1.0 snapshot on a standalone cluster) to process text files and save counters in cassandra based on fields in each row. I am testing the application in two modes: * Process each row and save the counter i
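
For context, the shape of the job in question, sketched with the classic DStream API (path, batch duration, and key extraction are placeholders for the thread's actual logic):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("file-counters")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Each batch picks up files newly dropped into the directory.
    val lines = ssc.textFileStream("hdfs:///incoming")
    val counters = lines.map(row => (row.split(",")(0), 1L)).reduceByKey(_ + _)
    counters.print()

    ssc.start()
    ssc.awaitTermination()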

Re: Java sample for using cassandra-driver-spark

2014-07-08 Thread M Singh
re:  https://github.com/datastax/cassandra-driver-spark/issues/11 We're open to any ideas. Just let us know what you need the API to have in the comments. Regards, Piotr Kołaczkowski 2014-07-05 0:48 GMT+02:00 M Singh : Hi: > > >Is there a Java sample fragment for using cassandra-dr

Re: Reading text file vs streaming text files

2014-07-08 Thread M Singh
each batch, you can simply move those processed files to another directory or so. Thanks Best Regards On Thu, Jul 3, 2014 at 6:34 PM, M Singh wrote: Hi: > > >I am working on a project where a few thousand text files (~20M in size) will >be dropped in an hdfs directory every

Java sample for using cassandra-driver-spark

2014-07-04 Thread M Singh
Hi: Is there a Java sample fragment for using cassandra-driver-spark? Thanks

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread M Singh
The windowing capabilities of spark streaming determine the events in the RDD created for that time window.  If the duration is 1s then all the events received in a particular 1s window will be a part of the RDD created for that window for that stream. On Friday, July 4, 2014 1:28 PM, alessan
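
A tiny DStream sketch of that batching behavior (path and durations are placeholders): with a 1s batch duration each base RDD holds one second's events, and window() unions the batches that fall inside the window.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("windowed"), Seconds(1))
    val lines = ssc.textFileStream("hdfs:///incoming")

    // A 30s window sliding every 10s unions the last 30 one-second RDDs.
    val windowed = lines.window(Seconds(30), Seconds(10))
    windowed.count().print()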

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread M Singh
Another alternative could be to use Spark Streaming's textFileStream with windowing capabilities. On Friday, July 4, 2014 9:52 AM, Gianluca Privitera wrote: You should think about a custom receiver, in order to solve the problem of the “already collected” data. http://spark.apache.org/docs

spark text processing

2014-07-03 Thread M Singh
Hi: Is there a way to find out when spark has finished processing a text file (both for streaming and non-streaming cases)? Also, after processing, can spark copy the file to another directory? Thanks

Reading text file vs streaming text files

2014-07-03 Thread M Singh
Hi: I am working on a project where a few thousand text files (~20M in size) will be dropped in an hdfs directory every 15 minutes. Data from the files will be used to update counters in cassandra (a non-idempotent operation). I was wondering what is the best way to deal with this: * Use text s

Configuration properties for Spark

2014-06-30 Thread M Singh
Hi: Is there a comprehensive properties list (with permissible/default values) for spark? Thanks Mans

Re: jackson-core-asl jar (1.8.8 vs 1.9.x) conflict with the spark-sql (version 1.x)

2014-06-28 Thread M Singh
, Woodstox, etc., folks) to see if we can get people to upgrade to more recent versions of Jackson. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Jun 27, 2014 at 12:58 PM, M Singh wrote: Hi: > > >I am using spark to stream data to cassandra and it work

jackson-core-asl jar (1.8.8 vs 1.9.x) conflict with the spark-sql (version 1.x)

2014-06-27 Thread M Singh
Hi: I am using spark to stream data to cassandra and it works fine in local mode. But when I execute the application in a standalone clustered env I got the exception included below (java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass). I think this is due to the jackson-core-a
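
This class of NoClassDefFoundError usually means two jackson-core-asl versions are on the classpath and the older one wins. One hedged build-side sketch in sbt (coordinates and versions are illustrative, not the thread's actual build):

    // build.sbt: drop the transitive old Jackson ASL and pin a single version
    libraryDependencies ++= Seq(
      ("com.datastax.spark" %% "cassandra-driver-spark" % "1.0.0")
        .exclude("org.codehaus.jackson", "jackson-core-asl"),
      "org.codehaus.jackson" % "jackson-core-asl" % "1.9.13"
    )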