Hi Randy,

Yes, I'm using Parquet on both S3 and HDFS.
On Thu, 28 May, 2020, 2:38 am randy clinton, <randyclin...@gmail.com> wrote:

> Is the file Parquet on S3 or is it some other file format?
>
> In general I would assume that HDFS read/writes are more performant for
> spark jobs.
>
> For instance, consider how well partitioned your HDFS file is vs the S3
> file.
>
> On Wed, May 27, 2020 at 1:51 PM Dark Crusader <
> relinquisheddra...@gmail.com> wrote:
>
>> Hi Jörn,
>>
>> Thanks for the reply. I will try to create an easier example to
>> reproduce the issue.
>>
>> I will also try your suggestion to look into the UI. Can you guide me
>> on what I should be looking for?
>>
>> I was already using the s3a protocol to compare the times.
>>
>> My hunch is that multiple reads from S3 are required because of
>> improper caching of intermediate data, and maybe HDFS is doing a better
>> job at this. Does this make sense?
>>
>> I would also like to add that we built an extra layer on S3, which
>> might be adding to even slower times.
>>
>> Thanks for your help.
>>
>> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, <jornfra...@gmail.com> wrote:
>>
>>> Have you looked in the Spark UI to see why this is the case?
>>> Reading from S3 can take more time - it also depends on which S3 URL
>>> you are using: s3a vs s3n vs s3.
>>>
>>> It could help to persist in memory or on HDFS after some calculation.
>>> You can also load from S3 initially, store the data on HDFS, and work
>>> from there.
>>>
>>> HDFS offers data locality for the tasks, i.e. the tasks start on the
>>> nodes where the data is. Depending on which S3 „protocol" you are
>>> using, you might also take a bigger performance hit.
>>>
>>> Try s3a as the protocol (replace all s3n with s3a).
>>>
>>> You can also use the s3 URL, but this requires a special bucket
>>> configuration and a dedicated empty bucket, and it lacks some
>>> interoperability with other AWS services.
>>>
>>> Nevertheless, it could also be something else in the code. Can you
>>> post an example reproducing the issue?
>>>
>>> > Am 27.05.2020 um 18:18 schrieb Dark Crusader <
>>> > relinquisheddra...@gmail.com>:
>>> >
>>> > Hi all,
>>> >
>>> > I am reading data from HDFS in the form of Parquet files (around 3 GB)
>>> > and running an algorithm from the Spark ML library.
>>> >
>>> > If I create the same Spark dataframe by reading the data from S3, the
>>> > same algorithm takes considerably more time.
>>> >
>>> > I don't understand why this is happening. Is this a chance occurrence,
>>> > or are the Spark dataframes created differently?
>>> >
>>> > I don't understand how the data store would affect the algorithm's
>>> > performance.
>>> >
>>> > Any help would be appreciated. Thanks a lot.
>>>
>>
>
> --
> I appreciate your time,
>
> ~Randy
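For anyone following along, here is a minimal PySpark sketch of the
suggestions in this thread (switch to the s3a connector, cache the
dataframe, or copy the data to HDFS once and work from there). The bucket
and HDFS paths below are placeholders, not the actual paths from the
original job:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-vs-hdfs-test").getOrCreate()

    # Read with the s3a connector (replace any s3n:// or s3:// URLs with s3a://).
    df = spark.read.parquet("s3a://my-bucket/path/to/data/")

    # Option 1: cache the dataframe so an iterative ML algorithm does not
    # re-read the Parquet files from S3 on every pass over the data.
    df.cache()
    df.count()  # materialize the cache

    # Option 2: copy the data to HDFS once and work from there, gaining
    # data locality for subsequent jobs.
    df.write.mode("overwrite").parquet("hdfs:///user/me/data/")
    hdfs_df = spark.read.parquet("hdfs:///user/me/data/")

Either way, comparing the stage timings in the Spark UI for the S3 read vs
the HDFS read should show where the extra time is going.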