Hello Spark Community,

I have a 20 GB dataset with 20 columns. Every column is categorical, so I
applied a StringIndexer followed by a OneHotEncoder to each one. Afterwards, I
applied a VectorAssembler on all the newly derived columns to form a feature
vector for each record, and then fed the feature vectors to an ML algorithm.
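
For reference, this is roughly what my feature-engineering code looks like (a minimal sketch assuming the Spark 3.x ML APIs; the DataFrame variable df and the column names c1 ... c20 are placeholders for the real ones):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// df is the 20-column DataFrame loaded from the 20 GB dataset;
// c1 ... c20 stand in for the actual categorical column names.
val catCols = (1 to 20).map(i => s"c$i")

// Index each categorical column, then one-hot-encode the resulting index.
val indexers = catCols.map { c =>
  new StringIndexer()
    .setInputCol(c)
    .setOutputCol(s"${c}_idx")
    .setHandleInvalid("keep")
}
val encoders = catCols.map { c =>
  new OneHotEncoder()
    .setInputCol(s"${c}_idx")
    .setOutputCol(s"${c}_vec")
}

// Assemble all encoded columns into a single feature vector per record.
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(c => s"${c}_vec").toArray)
  .setOutputCol("features")

val stages: Array[PipelineStage] = (indexers ++ encoders).toArray[PipelineStage] :+ assembler
val featurized = new Pipeline().setStages(stages).fit(df).transform(df)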

However, during these feature-engineering steps, I observed in the Spark UI
(on the Executors tab) that the input size increased dramatically to over
600 GB. The cluster I am using does not have that much capacity. Are there
any ways to optimize the memory usage of the intermediate results?

Thanks.
