Hi all,

There are three jobs, and the first RDD is the same in each. Can that first RDD be computed once, with the subsequent operations then running in parallel?
Screenshot of the three jobs: <http://apache-spark-user-list.1001560.n3.nabble.com/file/t11001/1600954813075-image.png>

My code is as follows:

    import threading

    from pyspark import StorageLevel

    sqls = ["""
        INSERT OVERWRITE TABLE `spark_input_test3` PARTITION (dt='20200917')
        SELECT id, plan_id, status, retry_times, start_time, schedule_datekey
        FROM temp_table WHERE status = 3
    """, """
        INSERT OVERWRITE TABLE `spark_input_test4` PARTITION (dt='20200917')
        SELECT id, cur_inst_id, status, update_time, schedule_time, task_name
        FROM temp_table WHERE schedule_time > '2020-09-01 00:00:00'
    """]

    def multi_thread():
        sql = """SELECT id, plan_id, status, retry_times, start_time,
                        schedule_datekey, task_name, update_time, schedule_time,
                        cur_inst_id, scheduler_id
                 FROM table WHERE dt < '20200801'"""
        data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
        threads = []
        for i in range(2):
            try:
                t = threading.Thread(target=insert_overwrite_thread,
                                     args=(sqls[i], data))
                t.start()
                threads.append(t)
            except Exception as x:
                print(x)
        for t in threads:
            t.join()

    def insert_overwrite_thread(sql, data):
        # each thread registers the same temp view, then runs its INSERT
        data.createOrReplaceTempView('temp_table')
        spark.sql(sql)

Since Spark is lazy, the persisted first RDD is still computed multiple times when the two INSERT jobs are submitted in parallel. I would like to ask whether there are other ways to do this. Thanks!

Cheers,
Gang Li
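P.S. One idea I have been considering, though I have not confirmed it works: force the persisted DataFrame to materialize with an action such as count() before starting the threads, and register the temp view once up front, so that both INSERT jobs read from the cache instead of rescanning the source table. A rough sketch, reusing sqls from above; the function name and the extra count() pass are my own assumptions, not part of my current code:

    def multi_thread_v2():
        sql = """SELECT id, plan_id, status, retry_times, start_time,
                        schedule_datekey, task_name, update_time, schedule_time,
                        cur_inst_id, scheduler_id
                 FROM table WHERE dt < '20200801'"""
        data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
        data.count()  # action: runs the scan once and fills the cache
        data.createOrReplaceTempView('temp_table')  # register once, not per thread
        threads = []
        for s in sqls:
            # spark.sql() executes each INSERT OVERWRITE eagerly in its own thread
            t = threading.Thread(target=spark.sql, args=(s,))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()

Would this avoid the duplicate computation, or is there a better pattern?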