Hi,
I have a DataFrame with 1000 columns that I want to index (convert to dummies) with StringIndexer.
When I apply the pipeline, it takes a long time to get the result I want, which is the original
DataFrame plus the indexed columns.
I mean:
original DataFrame + columns indexed by the StringIndexers
The problem is the save stage: it is very slow. Why?
Code:
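from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# l is the list of column names to index; one StringIndexer is fitted per column
indexers = [StringIndexer(inputCol=i, outputCol=i + "_index").fit(df)
            for i in l]
li = [i + "_index" for i in l]  # names of the indexed output columns
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)  # original columns + indexed columns
df_r = df_r.repartition(500)
df_r.persist()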
df_r.write.parquet(paths)  # save as Parquet; this is the slow stage