ExecutorMonitor timeout

2021-08-10 Thread Zhenyu Hu
In the private class Tracker of org.apache.spark.scheduler.dynalloc.ExecutorMonitor, the method `updateTimeout` will take the min of `_cach

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-10 Thread Khalid Mammadov
Hi Mich, I think you need to check your code. If the code does not use the PySpark API effectively, you may get this, i.e. if you use the pure Python/pandas API rather than PySpark's transform->transform->action pattern, e.g. df.select(..).withColumn(...)...count(). Hope this helps to put you in the right direction. Che

How can I write data to hive with jdbc

2021-08-10 Thread igyu
var cfg:Map[String,String] = Map() cfg += ("url"->"jdbc:hive2://tidb4ser:11000/joinwarehouse;user=jztwk;password=123456;hive.server2.proxy.user=jztwk") cfg += ("dbtable"->"ods_job_log") cfg += ("user"->"jztwk") cfg += ("passwrod"-> "123456") cfg += ("driver"-> "org.apache.h

Facing weird problem while reading Parquet

2021-08-10 Thread Prateek Rajput
Hi everyone, I am using spark-core-2.4 and spark-sql-2.4 (Java Spark). While reading 40K Parquet part files from a single HDFS directory, Spark is somehow spawning only 20037 parallel tasks, which is weird. My initial experience with Spark is that while reading, the total number of tasks is equal to nu