Have you looked through and seen the metrics for state operators?
These have been providing the "total rows" of state, and starting from Spark 2.4
they also provide additional metrics specific to HDFSBackedStateStoreProvider,
including the overall estimated memory usage.
https://github.com/apache/spark/blob/24f
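For example, from PySpark these show up in the query progress; a minimal sketch, assuming `query` is the StreamingQuery handle returned by writeStream.start():

    # query.lastProgress is the most recent progress report, as a dict
    progress = query.lastProgress
    if progress:
        for op in progress["stateOperators"]:
            print("state rows total :", op["numRowsTotal"])       # "total rows" of state
            print("state memory used:", op["memoryUsedBytes"])    # estimated memory usage
            print("custom metrics   :", op.get("customMetrics"))  # HDFSBackedStateStoreProvider extras, Spark 2.4+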
No. We are already capturing these metrics (e.g. numInputRows,
inputRowsPerSecond).
I am talking about the "No. of States" in memory at any given time.
On Thu, May 7, 2020 at 4:31 PM Jungtaek Lim
wrote:
> If you're referring total "entries" in all states in SS job, it's being
> provided via Str
It's not either 1 or 2; both items apply. I haven't played with
DStream + pyspark, but given that the error message is clear, you'll probably want
to change the client.id "Python Kafka streamer" to follow the naming
convention given in the error message.
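For illustration only, a sketch against the old Kafka 0.8 DStream integration (the one exposed to Python, removed in Spark 3.0); the broker, topic, and `ssc` are placeholders, and the only point is a client.id restricted to alphanumerics, '.', '_' and '-':

    from pyspark.streaming.kafka import KafkaUtils  # Kafka 0.8 DStream integration

    kafka_params = {
        "metadata.broker.list": "broker1:9092",   # placeholder broker list
        "group.id": "python_kafka_streamer",
        "client.id": "python_kafka_streamer",     # no spaces: alphanumerics, '.', '_', '-' only
    }
    stream = KafkaUtils.createDirectStream(ssc, ["my_topic"], kafka_params)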
On Thu, May 7, 2020 at 3:55 PM Vijayan
If you're referring to the total "entries" across all states in an SS job, it's
provided via StreamingQueryListener.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries
Hope this helps.
On Fri, May 8, 2020 at 3:26 AM Something Something
wrote:
>
Hi,
I've updated the SO question with masked data, and added the year column and the
other requirements. Please take a look.
Hope this helps in solving the problem.
Thanks and regards,
AB
On Thu 7 May, 2020, 10:59 AM Sonal Goyal, wrote:
> As mentioned in the comments on SO, can you provide a (masked) samp
Is there a way to dynamically modify the value of 'maxOffsetsPerTrigger' while
a Stateful Structured Streaming job is running?
We are thinking of auto-scaling our Spark cluster, but if we don't modify
the value of 'maxOffsetsPerTrigger' dynamically, would adding more VMs to
the cluster help? I don't thi
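For context, a sketch of where that option is normally set (Kafka source, SparkSession named `spark`, placeholder broker/topic); as far as I know it is read when the query starts, so changing it usually means restarting the query from the same checkpoint with a new value:

    df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
            .option("subscribe", "events")                      # placeholder topic
            .option("maxOffsetsPerTrigger", 100000)             # cap per micro-batch, fixed for the life of the query
            .load())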
Is there a way to get the total no. of active states in memory at any given
point in a Stateful Spark Structured Streaming job? We are thinking of
using this metric for 'Auto Scaling' our Spark cluster.
It's only happening for the Hadoop config. The exception traces are different
each time it dies, and jobs run for a couple of hours before the worker dies.
Another Reason:
*20/05/02 02:26:34 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[ExecutorRunner for app-20200501213234-9
Thanks for the quick reply, Zhang.
I don't think we have too much data skew, and if we do, there isn't much
of a way around it - we need to groupBy this specific column in order to run
aggregates.
I'm running this with PySpark; it doesn't look like the groupBy() function
takes a numParti
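For what it's worth, the DataFrame groupBy() has no numPartitions argument; the aggregation parallelism comes from spark.sql.shuffle.partitions instead. A rough sketch, with placeholder column names:

    # groupBy/agg parallelism is controlled by the shuffle partition count (default 200)
    spark.conf.set("spark.sql.shuffle.partitions", "800")

    result = df.groupBy("group_col").agg({"value_col": "sum"})

    # caveat: if a single key dominates, its rows still end up in one partition,
    # so more partitions only help when the heavy load is spread over many keys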
You might want to double check your Hadoop config files. From the stack
trace it looks like this is happening when simply trying to load
configuration (XML files). Make sure they're well formed.
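For example, a quick well-formedness check (the conf directory is a placeholder; point it at your HADOOP_CONF_DIR):

    import glob
    import xml.etree.ElementTree as ET

    for path in sorted(glob.glob("/etc/hadoop/conf/*.xml")):
        try:
            ET.parse(path)                      # raises ParseError on malformed XML
            print("ok       ", path)
        except ET.ParseError as err:
            print("malformed", path, "->", err)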
On Thu, May 7, 2020 at 6:12 AM Hrishikesh Mishra
wrote:
> Hi
>
> I am getting out of memory error i
You appear to be hitting the broadcast timeout. See:
https://stackoverflow.com/a/41126034/375670
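The settings usually involved look like this (a sketch, assuming a SparkSession named `spark`):

    # give broadcast construction more time (default is 300 seconds)
    spark.conf.set("spark.sql.broadcastTimeout", "1200")

    # or disable automatic broadcast joins and fall back to shuffle joins
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")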
On Thu, May 7, 2020 at 8:56 AM Deepak Garg wrote:
> Hi,
>
> I am getting following error while running a spark job. Error
> occurred when Spark is trying to write the dataframe in ORC format . I am
Hi,
I am getting the following error while running a Spark job. The error occurred
when Spark was trying to write the dataframe in ORC format. I am pasting the
error trace.
Any help in resolving this would be appreciated.
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
execute
Hi
I am getting an out of memory error in the worker log of streaming jobs every
couple of hours, after which the worker dies. There is no shuffle, no
aggregation, and no caching in the job; it's just a transformation.
I'm not able to identify where the problem is, driver or executor, and why the
worker is getting dead a
AFAICT, there might be data skew: some partitions get too many rows,
which runs into the out of memory limit. Trying .groupBy().count()
or .aggregateByKey().count() may help check each partition's data size.
If there is no data skew, increasing the .groupBy() parameter `numPartitions` is
worth a try.
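A sketch of both checks in PySpark (`df` and `group_col` are placeholders):

    from pyspark.sql import functions as F

    # rows per grouping key; one dominant key means the skew is in the data itself
    df.groupBy("group_col").count().orderBy(F.desc("count")).show(20, truncate=False)

    # rows per physical partition, to see how evenly the data is currently spread
    df.withColumn("pid", F.spark_partition_id()) \
      .groupBy("pid").count().orderBy(F.desc("count")).show(20)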
--
Cheers,