Hi all, I am reading data from HDFS in the form of Parquet files (around 3 GB) and running an algorithm from the Spark ML library.
If I create the same Spark DataFrame by reading the data from S3 instead, the same algorithm takes considerably longer, and I don't understand why. Is this a chance occurrence, or are the resulting DataFrames actually different? I don't see how the underlying data store would affect the algorithm's performance once the data is loaded. Any help would be appreciated. Thanks a lot.
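For reference, here is a minimal sketch of the kind of job I'm running. The paths, bucket name, and the choice of KMeans are placeholders; my real pipeline only differs in which Spark ML algorithm it fits:

```python
import time
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("hdfs-vs-s3").getOrCreate()

# Same Parquet data, read from two different stores (paths are placeholders)
df_hdfs = spark.read.parquet("hdfs:///data/features.parquet")
df_s3 = spark.read.parquet("s3a://my-bucket/features.parquet")

kmeans = KMeans(k=10, featuresCol="features")

for name, df in [("hdfs", df_hdfs), ("s3", df_s3)]:
    start = time.time()
    # fit() triggers the actual read plus the iterative training passes
    model = kmeans.fit(df)
    print(name, time.time() - start, "seconds")
```

The timing gap shows up consistently in the S3 case even though both DataFrames come from the same Parquet files.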