Hi all, I am reading data from HDFS in the form of Parquet files (around 3 GB) and running an algorithm from the Spark ML library.
If I create the same Spark DataFrame by reading the data from S3 instead, the same algorithm takes considerably longer, and I don't understand why. Is this a chance occurrence, or are the resulting DataFrames actually different? I don't see how the underlying data store would affect the algorithm's performance once the data is loaded. Any help would be appreciated. Thanks a lot.
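For reference, here is a minimal sketch of the kind of job I'm running. The paths, bucket name, and the choice of KMeans are placeholders; my real pipeline only differs in which Spark ML algorithm it fits:

```python
import time
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("hdfs-vs-s3").getOrCreate()

# Same Parquet data, read from two different stores (paths are placeholders)
df_hdfs = spark.read.parquet("hdfs:///data/features.parquet")
df_s3 = spark.read.parquet("s3a://my-bucket/features.parquet")

kmeans = KMeans(k=10, featuresCol="features")

for name, df in [("hdfs", df_hdfs), ("s3", df_s3)]:
    start = time.time()
    # fit() triggers the actual read plus the iterative training passes
    model = kmeans.fit(df)
    print(name, time.time() - start, "seconds")
```

The timing gap shows up consistently in the S3 case even though both DataFrames come from the same Parquet files.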