Yes, you can certainly use spark streaming, but reading from the original
source table may still be time consuming and resource intensive.
Having some context on the RDBMS platform, the data sizes/volumes involved, and the
tolerable lag (between changes being created and them being processed by Spark) would help.
Just curious - is this HttpSink your own custom sink or Dropwizard
configuration?
If it is your own custom code, I would suggest looking at/trying out the Dropwizard-based metrics support.
See
http://spark.apache.org/docs/latest/monitoring.html#metrics
https://metrics.dropwizard.io/4.0.0/
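For reference, here is a minimal metrics.properties sketch (based on the template shipped with Spark; the Graphite host/port below are placeholders):

# conf/metrics.properties - report all metrics to the console every 10 seconds
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
# or push them to Graphite instead
#*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
#*.sink.graphite.host=graphite.example.com
#*.sink.graphite.port=2003

You then point Spark at it with --conf spark.metrics.conf=/path/to/metrics.properties.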
Also, from what I know, the metrics
See if https://spark.apache.org/docs/latest/monitoring.html helps.
Essentially, whether you are running an app via spark-shell or spark-submit
(local, Spark standalone cluster, YARN, Kubernetes, Mesos), the driver will provide a UI
on port 4040.
You can monitor via the UI and via a REST API.
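For example, from Scala/spark-shell you can pull the same information out of the REST endpoint (a rough sketch; the port is the default 4040 and the application id is whatever the first call returns):

import scala.io.Source

// list the applications known to this driver's UI
val apps = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(apps)

// then, e.g., per-stage metrics for an application id taken from the JSON above
// val stages = Source.fromURL("http://localhost:4040/api/v1/applications/<app-id>/stages").mkString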
E.g. running
still excessive.
From: Vitaliy Pisarev
Date: Thursday, November 15, 2018 at 1:58 PM
To: "Thakrar, Jayesh"
Cc: Shahbaz , user , David
Markovitz
Subject: Re: How to address seemingly low core utilization on a spark workload?
Small update, my initial estimate was incorrect. I have on
save.
From: Vitaliy Pisarev
Date: Thursday, November 15, 2018 at 1:03 PM
To: Shahbaz
Cc: "Thakrar, Jayesh" , user
, "dudu.markov...@microsoft.com"
Subject: Re: How to address seemingly low core utilization on a spark workload?
Agree, and I will try it. One clarification t
ittle work.
Question is what can I do about it.
On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
Can you shed more light on what kind of processing you are doing?
One common pattern that I have seen for active core/executor utilization
dropping to zero is while reading ORC data and the driver seems (I think) to be
doing schema validation.
In my case I would have hundreds of thousands of ORC files.
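One way to reduce that overhead (just an illustration, not necessarily what was done in the case above) is to supply the schema explicitly so the reader does not have to infer/validate it across all those files:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("orc-read").getOrCreate()

// hypothetical schema and path - replace with your own
val orcSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("event_time", TimestampType),
  StructField("payload", StringType)))

// with an explicit schema, Spark skips schema inference over the ORC footers
val df = spark.read.schema(orcSchema).orc("/data/events/*.orc")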
Not sure I get what you mean….
I ran the query that you had – and don’t get the same hash as you.
From: Gokula Krishnan D
Date: Friday, September 28, 2018 at 10:40 AM
To: "Thakrar, Jayesh"
Cc: user
Subject: Re: [Spark SQL] why spark sql hash() are returns the same hash value
thoug
Cannot reproduce your situation.
Can you share Spark version?
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
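If it helps to compare notes, here is a quick way to check in spark-shell (illustrative values only):

import org.apache.spark.sql.functions.hash

// hash() is Murmur3-based with a fixed seed, so a given literal should
// produce the same value on any machine running the same Spark version
spark.sql("SELECT hash('ABC') AS h1, hash('abc') AS h2").show()

// the same function is available on DataFrames/Datasets
spark.range(3).select($"id", hash($"id")).show()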
Disclaimer - I use Spark with Scala and not Python.
But I am guessing that Jorn's reference to modularization is to ensure that you
do the processing inside methods/functions and call those methods sequentially.
I believe that as long as an RDD/dataset variable is in scope, its memory may
not be released.
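A rough sketch of what that modularization could look like (names and paths are made up):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// keeping each step inside a method means the intermediate DataFrame
// references go out of scope when the method returns and become eligible
// for garbage collection (and you can unpersist explicitly if you cached)
def loadAndClean(spark: SparkSession, path: String): DataFrame = {
  val raw = spark.read.parquet(path)
  raw.filter(col("status") === "ok")
}

def aggregate(df: DataFrame): DataFrame =
  df.groupBy(col("customer_id")).count()

val spark = SparkSession.builder.appName("modular-example").getOrCreate()
val result = aggregate(loadAndClean(spark, "/data/input"))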
21, 2018 at 10:20 PM
To: ayan guha
Cc: "Thakrar, Jayesh" , user
Subject: Re: How to skip nonexistent file when read files with spark?
Thanks ayan,
Also I have tried this method, the most tricky thing is that dataframe union
method must be based on same structure schema, while on my
Probably you can do some preprocessing/checking of the paths before you attempt
to read them via Spark.
Whether it is local or hdfs filesystem, you can try to check for existence and
other details by using the "FileSystem.globStatus" method from the Hadoop API.
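Something along these lines (a sketch; the glob pattern and format are just examples):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("glob-check").getOrCreate()

// expand the glob against the underlying filesystem (local or HDFS) first,
// keeping only the paths that actually exist
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val pattern = new Path("/data/input/2018-05-*/part-*")
val existing = Option(fs.globStatus(pattern))
  .map(_.map(_.getPath.toString).toSeq)
  .getOrElse(Seq.empty)

// read only what is there, so a missing path does not fail the whole job
val df = if (existing.nonEmpty) spark.read.parquet(existing: _*)
         else spark.emptyDataFrame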
From: JF Chen
Date: Sunday, May 20,
And here's some more info on Spark Metrics
https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling
From: Maximiliano Felice
Date: Monday, January 8, 2018 at 8:14 AM
To: Irtiza Ali
Cc:
Subject: Re: Spark Monitoring using Jolokia
Hi!
I don't know very much about them, but I'
You can also get the metrics from the Spark application events log file.
See https://www.slideshare.net/JayeshThakrar/apache-bigdata2017sparkprofiling
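A small sketch of that approach (assumes event logging is enabled and the log file is uncompressed; the directory and application id below are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("eventlog-read").getOrCreate()

// the app being analyzed should have been submitted with, e.g.:
//   --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/spark-events
// the event log is one JSON event per line, so Spark itself can read it back
val events = spark.read.json("/tmp/spark-events/app-12345")
events.groupBy(col("Event")).count().show()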
From: "Qiao, Richard"
Date: Monday, December 4, 2017 at 6:09 PM
To: Nick Dimiduk , "user@spark.apache.org"
Subject: Re: Access to Applications
What you have is sequential code and hence sequential processing.
Also Spark/Scala are not parallel programming languages.
But even if they were, statements are executed sequentially unless you exploit
the parallel/concurrent execution features.
Anyway, see if this works:
val (RDD1, RDD2) = (JavaFunc
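If the two pieces really are independent, something along these lines (a sketch, not a drop-in replacement for your code) lets the two actions run concurrently:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("concurrent-jobs").getOrCreate()

// two independent computations (placeholders for your real logic)
val df1 = spark.range(0, 1000000)
val df2 = spark.range(0, 2000000)

// each Future submits its Spark action from a separate thread,
// so the scheduler can run the two jobs concurrently
val f1 = Future { df1.count() }
val f2 = Future { df2.count() }

val (c1, c2) = (Await.result(f1, Duration.Inf), Await.result(f2, Duration.Inf))
println(s"counts: $c1, $c2")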
Could this be due to https://issues.apache.org/jira/browse/HIVE-6 ?
From: Patrik Medvedev
Date: Monday, June 12, 2017 at 2:31 AM
To: Jörn Franke , vaquar khan
Cc: Jean Georges Perrin , User
Subject: Re: [Spark JDBC] Does spark support read from remote Hive server via
JDBC
Hello,
All secu
Roy - can you check if you have HADOOP_CONF_DIR and YARN_CONF_DIR set to the
directory containing the HDFS and YARN configuration files?
From: Sandeep Nemuri
Date: Monday, March 27, 2017 at 9:44 AM
To: Saisai Shao
Cc: Yong Zhang , ", Roy" , user
Subject: Re: spark-submit config via file
You
To use the old memory management,
you may explicitly enable `spark.memory.useLegacyMode` (not recommended).
On Mon, Feb 13, 2017 at 11:23 PM, Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
Nancy,
As your log output indicated, your executor exceeded the 11 GB memory limit.
While you might want to address the root cause/data volume as suggested by Jon,
you can do an immediate test by changing your command as follows:
spark-shell --master yarn --deploy-mode client --driver-memory 16G
--num-execut
Ben,
Also look at Phoenix (Apache project) which provides a better (one of the best)
SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/
Cheers,
Jayesh
From: vincent gromakowski
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim
Cc: Michael Segel , Jörn Franke
, Mich Talebzad
Yes, iterating over a dataframe and making changes is not uncommon.
Of course, RDDs, dataframes and datasets are immutable, but there is some
optimization in the optimizer that can potentially help dampen the
effect/impact of creating a new RDD, DF or DS.
Also, the use-case you cited is similar
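Purely as an illustration (the column names are made up), an iterative set of changes is usually expressed as a chain of new DataFrames, and the optimizer collapses the projections into a single plan:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

val spark = SparkSession.builder.appName("iterative-df").getOrCreate()
import spark.implicits._

val base = Seq(("a", 1), ("b", 2)).toDF("name", "value")

// each step returns a new (immutable) DataFrame; the chained projections
// are collapsed by the optimizer, so the per-iteration cost is mostly planning
val columnsToUpper = Seq("name")
val result = columnsToUpper.foldLeft(base) { (df, c) =>
  df.withColumn(c, upper(col(c)))
}
result.show()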