Hi:
In Apache Spark we can read JSON using the following:
spark.read.json("path")
There is support for converting a JSON string column in a DataFrame into a
structured element using
(https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#from_json-org.apache.spark.sql.Column-org.
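A minimal sketch of from_json on a string column, assuming a hypothetical
DataFrame df with a string column named jsonCol and an illustrative two-field
schema:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical schema for the JSON payload carried in the string column "jsonCol".
val payloadSchema = new StructType()
  .add("id", StringType)
  .add("ts", LongType)

// Parse the JSON string column into a nested struct column named "event".
val parsed = df.withColumn("event", from_json(col("jsonCol"), payloadSchema))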
10, 2018, 7:49:42 AM PDT, Daniel Hinojosa
wrote:
This looks more like a Spark issue than a Kafka one, judging by the stack
trace. Are you using Spark Structured Streaming with Kafka integration, by
chance?
On Mon, Apr 9, 2018 at 8:47 AM, M Singh
wrote:
> Hi Folks:
> Just wanted
Hi:
I am using Apache Spark Structured Streaming (2.2.1) to implement custom
sessionization for events. The processing is in two steps:
1. flatMapGroupsWithState (keyed by user id) - which stores the state of the
user and emits events every minute until an expire event is received
2. The next step i
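A rough sketch of what step 1 might look like; the Event and SessionState case
classes, the field names, and the timeout handling are assumptions for
illustration, not the poster's code (assumes spark.implicits._ is in scope, as
in spark-shell):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical event and state types for the sessionization step.
case class Event(userId: String, eventType: String, ts: java.sql.Timestamp)
case class SessionState(count: Long)

val sessionized = events                      // events: Dataset[Event], assumed
  .groupByKey(_.userId)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.ProcessingTimeTimeout) {
    (userId: String, batch: Iterator[Event], state: GroupState[SessionState]) =>
      val batchEvents = batch.toSeq
      val current = state.getOption.getOrElse(SessionState(0L))
      if (batchEvents.exists(_.eventType == "expire")) {
        state.remove()                        // expire event received: drop the user's state
        Iterator((userId, current.count))
      } else {
        val updated = SessionState(current.count + batchEvents.size)
        state.update(updated)
        state.setTimeoutDuration("1 minute")  // fire roughly every minute
        Iterator((userId, updated.count))
      }
  }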
Hi:
I am using Spark Structured Streaming 2.2.1 with flatMapGroupsWithState
and a groupBy count operator.
In the StreamExecution logs I see two entries for stateOperators:
"stateOperators" : [ {
"numRowsTotal" : 1617339,
"numRowsUpdated" : 9647
}, {
"numRowsTotal" : 1326355,
Hi:
I am working on Spark Structured Streaming (2.2.1) with Kafka and want 100
executors to stay alive. I set spark.executor.instances to 100. The process
starts running with 100 executors, but after some time only a few remain, which
causes a backlog of events from Kafka.
I thought I saw a sett
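For reference, a sketch of how those settings are typically passed when
building the session; whether dynamic allocation is involved here is only a
guess, not something stated in the message:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-streaming-app")                 // illustrative name
  .config("spark.executor.instances", "100")           // the setting mentioned above
  .config("spark.dynamicAllocation.enabled", "false")  // guess: if enabled, idle executors can be released
  .getOrCreate()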
Hi:
I am working on a real-time application using Spark Structured Streaming (v
2.2.1). The application reads data from Kafka and, if there is a failure, I
would like to ignore the checkpoint. Is there any configuration to just read
from the last Kafka offset after a failure and ignore any offset che
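A sketch of the Kafka source options involved; note that startingOffsets is
only consulted when the query starts without checkpointed offsets, so whether
it helps after a failure depends on the checkpoint being discarded (broker
address and topic below are placeholders):

// Kafka source for Structured Streaming. "latest" only applies on a fresh start;
// once a checkpoint exists, the checkpointed offsets take precedence.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
  .option("subscribe", "events")                      // placeholder topic
  .option("startingOffsets", "latest")
  .load()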
Hi:
I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR and for the last
few days, after running the application for 30-60 minutes, I get the exception
from the Kafka consumer included below.
The structured streaming application is processing 1 minute's worth of data
from the Kafka topic. So I've tried
Hi Vijay:
I am using spark-shell because I am still prototyping the steps involved.
Regarding executors - I have 280 executors and the UI only shows a few straggler
tasks on each trigger. The UI does not show too much time spent on GC. I
suspect the delay is because of getting data from Kafka. The num
Hi:
I am working with Spark Structured Streaming (2.2.1), reading data from Kafka
(0.11).
I need to aggregate data ingested every minute and I am using spark-shell at
the moment. The message ingestion rate is approx 500k/second. During
some trigger intervals (1 minute), especially when t
helpful to
answer some of them.
For example: inputRowsPerSecond = numRecords / inputTimeSec, and
processedRowsPerSecond = numRecords / processingTimeSec. This explains why
the two rows-per-second values differ.
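As an illustration with made-up numbers: if a trigger reads numRecords =
600,000 over inputTimeSec = 60 but takes processingTimeSec = 75 to process,
then inputRowsPerSecond = 600,000 / 60 = 10,000 while processedRowsPerSecond =
600,000 / 75 = 8,000, so the two rates differ even though they count the same
records.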
On Feb 10, 2018, at 8:42 PM, M Singh wrote:
Hi:
I am working with spark 2.2.0 and am looking at
Just checking if anyone has any pointers for dynamically updating query state
in structured streaming.
Thanks
On Thursday, February 8, 2018 2:58 PM, M Singh
wrote:
Hi Spark Experts:
I am trying to use a stateful udf with spark structured streaming that needs to
update the state
Hi:
I am working with Spark 2.2.0 and am looking at the query status console
output.
My application reads from Kafka, performs flatMapGroupsWithState and then
aggregates the elements for two group counts. The output is sent to the console
sink. I see the following output (with my questions in
Hi Spark Experts:
I am trying to use a stateful UDF with Spark Structured Streaming that needs to
update its state periodically.
Here is the scenario:
1. I have a UDF with a variable with a default value (e.g. 1). This value is
applied to a column (e.g. subtract the variable from the column value). 2.
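A minimal sketch of the step-1 UDF, assuming a hypothetical numeric column
named value; note that the variable is captured in the UDF's closure, so the
value the executors see is fixed when the UDF is defined:

import org.apache.spark.sql.functions.{col, udf}

// Variable with a default value of 1 (step 1 above).
var offset: Int = 1

// UDF that subtracts the captured variable from the column value.
val subtractOffset = udf((x: Double) => x - offset)

// Applied to a hypothetical column named "value".
val adjusted = df.withColumn("adjusted", subtractOffset(col("value")))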
s://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
On Mon, Feb 5, 2018 at 8:11 PM, M Singh wrote:
Just checking if anyone has more details on how the watermark works in cases
where the event time is earlier than the processing timestamp.
On Friday, February 2, 2018
Just checking if anyone has more details on how the watermark works in cases
where the event time is earlier than the processing timestamp.
On Friday, February 2, 2018 8:47 AM, M Singh wrote:
Hi Vishnu/Jacek:
Thanks for your responses.
Jacek - At the moment, the current time for my use case is
Hi TD:
Just wondering if you have any insight for me or need more info.
Thanks
On Thursday, February 1, 2018 7:43 AM, M Singh
wrote:
Hi TD:
Here is the updated code with explain and full stack trace.
Please let me know what could be the issue and what to look for in the explain
output
don't want to process it, you could do a filter based on its
EventTime field, but I guess you will have to compare it with the processing
time, since there is no API for the user to access the watermark.
-Vishnu
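A minimal sketch of the filter being suggested, assuming a timestamp column
named eventTime and an allowed lag of 10 minutes (both hypothetical):

import org.apache.spark.sql.functions.expr

// Keep only rows whose event time is within 10 minutes of the current processing time.
val onTime = df.filter(expr("eventTime >= current_timestamp() - interval 10 minutes"))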
On Fri, Jan 26, 2018 at 1:14 PM, M Singh wrote:
Hi:
I am trying to filter out re
tion.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
On Wednesday, January 31, 2018 3:46 PM, Tathagata Das
wrote:
Could you give the full stack trace of the exception?
Also, can you do `dataframe2.explain(true)` and show us the
Hi Folks:
I have to add a column to a structured streaming DataFrame, but when I do that
(using select or withColumn) I get an exception. I can add a column to a
non-streaming DataFrame. I could not find any documentation on how to do this
in the following doc
[https://spar
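A minimal sketch of the operation in question; streamingDf and the added
literal column are placeholders, not the poster's code:

import org.apache.spark.sql.functions.{col, lit}

// Adding a column to a streaming DataFrame with withColumn ...
val withExtra = streamingDf.withColumn("source", lit("kafka"))

// ... and the equivalent with select.
val selected = streamingDf.select(col("*"), lit("kafka").as("source"))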
Hi:
I am trying to filter out records which are lagging behind (based on event
time) by a certain amount of time.
Is the watermark API applicable to this scenario (i.e. filtering lagging
records), or is it only applicable with aggregation? I could not get a clear
understanding from the documen
8:36 PM, "M Singh" wrote:
Hi:
I am trying to create a custom structured streaming source and would like to
know if there is any example or documentation on the steps involved.
I've looked at some of the methods available in the SparkSession, but these are
internal to the sql package
Hi:
I am trying to create a custom structured streaming source and would like to
know if there is any example or documentation on the steps involved.
I've looked at some of the methods available in the SparkSession, but these are
internal to the sql package:
private[sql] def internalCreateDataFrame
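A rough skeleton of the (internal, Spark 2.2-era) Source trait, only to show
the pieces involved; these types are not a stable public API and were
superseded by the DataSource V2 API mentioned further down in the thread:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.StructType

// Skeleton only. A companion provider (StreamSourceProvider) is also needed so the
// source can be referenced from spark.readStream.format(...).
class MyCustomSource(sqlContext: SQLContext) extends Source {
  override def schema: StructType = ???                                       // schema of emitted rows
  override def getOffset: Option[Offset] = ???                                // latest available offset, if any
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???  // rows in (start, end]
  override def stop(): Unit = ()                                              // release resources
}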
ring-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
On Thu, Jan 4, 2018 at 10:49 PM, M Singh wrote:
Thanks Tathagata for your answer.
The reason I was asking
lated note, these APIs are subject to change. In fact in the upcoming release
2.3, we are adding a DataSource V2 API for
batch/microbatch-streaming/continuous-streaming sources and sinks.
On Wed, Jan 3, 2018 at 11:23 PM, M Singh wrote:
Hi:
The documentation for Sink.addBatch is as follows:
/
Hi:
The documentation for Sink.addBatch is as follows:
/**
 * Adds a batch of data to this sink. The data for a given `batchId` is
 * deterministic and if this method is called more than once with the same
 * batchId (which will happen in the case of failures), then `data` should
 * only be ad
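A minimal sketch of a sink honoring that contract (applying each batchId only
once); the writeToExternalStore helper is hypothetical:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

// Skips batches that were already committed, so a replayed batchId after a
// failure is not written twice.
class IdempotentSink extends Sink {
  @volatile private var lastCommittedBatchId: Long = -1L

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId > lastCommittedBatchId) {
      writeToExternalStore(data)          // hypothetical helper, not a Spark API
      lastCommittedBatchId = batchId
    }
  }

  private def writeToExternalStore(data: DataFrame): Unit = ???
}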
Hi Jeroen:
I am not sure if I missed it - but can you let us know what your input
source and output sink are?
In some cases, I found that saving to S3 was a problem. In that case I started
saving the output to the EMR HDFS and later copied it to S3 using s3-dist-cp,
which solved our issue.
Mans
Hi:
I am working with Datasets so that I can use mapGroupsWithState for business
logic and then use dropDuplicates over a set of fields. I would like to use
withWatermark so that I can restrict how much state is stored.
From the API it looks like withWatermark takes a string - timesta
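A minimal sketch of combining the two calls, assuming an event-time column
named eventTime and a key column named id (hypothetical names):

// Bound the deduplication state with a watermark on the event-time column,
// then deduplicate on the key plus the event-time column.
val deduped = ds
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("id", "eventTime")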
n external system (like kafka)
Eyal
On Tue, Dec 26, 2017 at 10:37 PM, M Singh wrote:
Thanks Diogo. My question is how to gracefully call the stop method while the
streaming application is running in a cluster.
On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira
wrote:
Hi M Singh
Thanks Diogo. My question is how to gracefully call the stop method while the
streaming application is running in a cluster.
On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira
wrote:
Hi M Singh! Here I'm using query.stop()
On 25 Dec 2017 19:19, "M Singh&
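A minimal sketch of the query.stop() approach; the shutdown-hook wiring is an
assumption, not something from the thread:

// Keep a handle to the query so it can be stopped gracefully later.
val query = df.writeStream
  .format("console")
  .start()

// One possible trigger for a graceful stop: a JVM shutdown hook.
sys.addShutdownHook {
  query.stop()
}

query.awaitTermination()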
Hi:
I would like to use the window function on a Dataset stream (Spark 2.2.0). The
window function requires a Column as an argument and can be used with DataFrames
by passing the column. Is there an analogous window function, or pointers to how
the window function can be used with Datasets?
Thanks
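A minimal sketch of using the window function on a typed stream via groupBy,
assuming a timestamp column named ts (hypothetical); since window() takes a
Column, it is applied to the Dataset's columns the same way as on a DataFrame:

import org.apache.spark.sql.functions.{col, window}

// Group the stream into 1-minute event-time windows and count rows per window.
val counts = events
  .groupBy(window(col("ts"), "1 minute"))
  .count()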
Hi:
I am using Spark Structured Streaming (v 2.2.0) to read data from files. I have
configured a checkpoint location. On stopping and restarting the application, it
looks like it is re-reading the previously ingested files. Is that expected
behavior?
Is there any way to prevent reading files that
Hi:
Are there any patterns/recommendations for gracefully stopping a structured
streaming application?
Thanks
directory.
TD
On Fri, Jul 11, 2014 at 11:46 AM, M Singh wrote:
So, is it expected for the process to generate stages/tasks even after
processing a file ?
>
>
>Also, is there a way to figure out the file that is getting processed and when
>that process is complete ?
>
>
shuffle-based operation like reduceByKey, groupByKey,
join, etc., the system is essentially redistributing the data across the
cluster and it needs to know how many parts it should divide the data into.
That's where the default parallelism is used.
TD
On Fri, Jul 11, 2014 at 3:16 AM, M Singh
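A small sketch of the two ways the partition count gets chosen for a
shuffle-based operation (the RDD name and the number 8 are illustrative):

// Falls back to the configured default parallelism when no partition count is given.
val countsDefault = pairs.reduceByKey(_ + _)

// Overrides it with an explicit number of output partitions.
val countsExplicit = pairs.reduceByKey(_ + _, 8)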
24 PM, Tathagata Das
wrote:
How are you supplying the text file?
On Wed, Jul 9, 2014 at 11:51 AM, M Singh wrote:
Hi Folks:
>
>
>
>I am working on an application which uses spark streaming (version 1.1.0
>snapshot on a standalone cluster) to process text file and save counters
Hi Folks:
I am working on an application which uses spark streaming (version 1.1.0
snapshot on a standalone cluster) to process text files and save counters in
cassandra based on fields in each row. I am testing the application in two
modes:
* Process each row and save the counter i
re:
https://github.com/datastax/cassandra-driver-spark/issues/11
We're open to any ideas. Just let us know what you need the API to have in the
comments.
Regards,
Piotr Kołaczkowski
2014-07-05 0:48 GMT+02:00 M Singh :
Hi:
>
>
>Is there a Java sample fragment for using cassandra-dr
each batch, you can simply move those processed files to another
directory or so.
Thanks
Best Regards
On Thu, Jul 3, 2014 at 6:34 PM, M Singh wrote:
Hi:
>
>
>I am working on a project where a few thousand text files (~20M in size) will
>be dropped in an hdfs directory every
Hi:
Is there a Java sample fragment for using cassandra-driver-spark ?
Thanks
The windowing capabilities of Spark Streaming determine the events in the RDD
created for that time window. If the duration is 1s, then all the events
received in a particular 1s window will be part of the RDD created for that
window for that stream.
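A minimal sketch of the idea, assuming a 1-second batch interval and a
directory-based text stream (path and durations are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("windowing-example")
val ssc = new StreamingContext(conf, Seconds(1))          // each batch RDD covers 1s of events

// Every second, the stream yields an RDD with the events received in that 1s window.
val lines = ssc.textFileStream("hdfs:///tmp/incoming")    // illustrative path

// A sliding window over the last 10 seconds, recomputed every second.
val windowed = lines.window(Seconds(10), Seconds(1))
windowed.count().print()

ssc.start()
ssc.awaitTermination()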
On Friday, July 4, 2014 1:28 PM, alessan
Another alternative could be to use Spark Streaming's textFileStream with
windowing capabilities.
On Friday, July 4, 2014 9:52 AM, Gianluca Privitera
wrote:
You should think about a custom receiver, in order to solve the problem of the
“already collected” data.
http://spark.apache.org/docs
Hi:
Is there a way to find out when spark has finished processing a text file (both
for streaming and non-streaming cases) ?
Also, after processing, can spark copy the file to another directory ?
Thanks
Hi:
I am working on a project where a few thousand text files (~20M in size) will
be dropped in an hdfs directory every 15 minutes. Data from the files will be
used to update counters in cassandra (a non-idempotent operation). I was
wondering what the best way to deal with this is:
* Use text s
Hi:
Is there a comprehensive properties list (with permissible/default values) for
spark ?
Thanks
Mans
,
Woodstox, etc., folks) to see if we can get people to upgrade to more recent
versions of Jackson.
-- Paul
—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
On Fri, Jun 27, 2014 at 12:58 PM, M Singh wrote:
Hi:
>
>
>I am using spark to stream data to cassandra and it work
Hi:
I am using Spark to stream data to cassandra and it works fine in local mode.
But when I execute the application in a standalone cluster environment, I get
the exception included below (java.lang.NoClassDefFoundError:
org/codehaus/jackson/annotate/JsonClass).
I think this is due to the jackson-core-a