Hi all,

I am reading data from HDFS in the form of Parquet files (around 3 GB) and
running an algorithm from the Spark ML library.

If I create the same Spark DataFrame by reading the data from S3 instead, the
same algorithm takes considerably longer.

I don't understand why this is happening. Is this a chance occurrence, or are
the resulting DataFrames actually different?

I don't understand how the underlying data store would affect the algorithm's
performance, given that the DataFrame is the same in both cases.

Any help would be appreciated. Thanks a lot.
