Hi,

The code we're executing constructs pyspark.ml.Pipeline objects concurrently in separate Python threads.
We observe that the stages fed to the Pipeline objects get corrupted: the stages supplied to a Pipeline in one thread appear inside a different Pipeline constructed in a different thread with different stages. Everything works fine with a single thread, so this looks like a thread-safety problem in the pyspark.ml.Pipeline class. A code sketch is below; I'm looking for confirmation that this looks like a bug. If so, I will file one with a complete Python script that reproduces the issue.

    import threading
    import pyspark.ml

    class MLThread(threading.Thread):
        def __init__(self, name, pipeline_stages):
            super(MLThread, self).__init__()
            self.name = name
            self.pipeline_stages = pipeline_stages

        def run(self):
            # Each thread constructs its own Pipeline from its own stages.
            self.pipeline = pyspark.ml.Pipeline(stages=self.pipeline_stages)
            # ... do some work on the pipeline

    thread_1 = MLThread("t1", pipeline_stages_1)
    thread_2 = MLThread("t2", pipeline_stages_2)
    ...
    thread_n = MLThread("tn", pipeline_stages_n)
    # ... launch all threads

Once all threads are done, verify that self.pipeline.getStages() on each object matches the stages supplied when that Pipeline was constructed. Observe that on some runs this does not hold. This occurs with both Spark 1.6 and 2.x.

Regards,
Vinayak Joshi
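
P.S. In case it helps, below is a minimal self-contained sketch of the kind of script I would attach. It assumes a local Spark 2.x install (on 1.6 a SparkContext would be created instead of a SparkSession); the Tokenizer/HashingTF stages, the column names, and the thread count of 20 are placeholders chosen for illustration, not part of our actual workload.

    import threading

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF

    # A SparkSession is needed so the feature stages can be constructed.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("pipeline-repro")
             .getOrCreate())

    class MLThread(threading.Thread):
        def __init__(self, name, pipeline_stages):
            super(MLThread, self).__init__()
            self.name = name
            self.pipeline_stages = pipeline_stages
            self.pipeline = None

        def run(self):
            # The Pipeline constructors race against each other here.
            self.pipeline = Pipeline(stages=self.pipeline_stages)

    # Build distinct stage lists up front (sequentially, in the main
    # thread), then hand each list to its own thread.
    threads = []
    for i in range(20):
        tok = Tokenizer(inputCol="text", outputCol="words_%d" % i)
        tf = HashingTF(inputCol="words_%d" % i, outputCol="features_%d" % i)
        threads.append(MLThread("t%d" % i, [tok, tf]))

    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Each pipeline should hold exactly the stages it was constructed
    # with; on affected versions this assertion fails intermittently.
    for t in threads:
        assert t.pipeline.getStages() == t.pipeline_stages, \
            "stages corrupted in thread %s" % t.name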