Hi Randy,

Yes, I'm using Parquet on both S3 and HDFS.

On Thu, 28 May, 2020, 2:38 am randy clinton, <randyclin...@gmail.com> wrote:

> Is the file Parquet on S3 or is it some other file format?
>
> In general I would assume that HDFS reads/writes are more performant for
> Spark jobs.
>
> For instance, consider how well partitioned your HDFS file is vs the S3
> file.
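>
> For example, a quick way to compare how the two sources get split into tasks
> (just a sketch, assuming a SparkSession named spark; the paths are
> placeholders):
>
>   val hdfsDf = spark.read.parquet("hdfs:///data/my_table")    // placeholder path
>   val s3Df   = spark.read.parquet("s3a://my-bucket/my_table") // placeholder bucket
>
>   println(hdfsDf.rdd.getNumPartitions) // partitions produced by the HDFS read
>   println(s3Df.rdd.getNumPartitions)   // partitions produced by the S3 read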
>
> On Wed, May 27, 2020 at 1:51 PM Dark Crusader <
> relinquisheddra...@gmail.com> wrote:
>
>> Hi Jörn,
>>
>> Thanks for the reply. I will try to create an easier example to reproduce
>> the issue.
>>
>> I will also try your suggestion to look into the UI. Can you guide me on
>> what I should be looking for?
>>
>> I was already using the s3a protocol to compare the times.
>>
>> My hunch is that multiple reads from S3 are required because of improper
>> caching of intermediate data, and maybe HDFS is doing a better job at this.
>> Does this make sense?
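>>
>> To make that concrete, this is roughly what I plan to try in order to rule
>> it out (just a sketch, assuming a SparkSession named spark; the bucket name
>> is a placeholder):
>>
>>   import org.apache.spark.storage.StorageLevel
>>
>>   val df = spark.read.parquet("s3a://my-bucket/my_table") // placeholder bucket
>>   df.persist(StorageLevel.MEMORY_AND_DISK) // keep the intermediate data around
>>   df.count() // materialize once so later stages reuse the persisted data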
>>
>> I would also like to add that we built an extra layer on top of S3, which
>> might be contributing to the slower times.
>>
>> Thanks for your help.
>>
>> On Wed, 27 May, 2020, 11:03 pm Jörn Franke, <jornfra...@gmail.com> wrote:
>>
>>> Have you looked in the Spark UI to see why this is the case?
>>> Reading from S3 can take more time; it also depends on which S3 URL scheme
>>> you are using: s3a vs s3n vs s3.
>>>
>>> It could help to persist intermediate results in memory or on HDFS after
>>> some calculations. You can also load from S3 once, store the data on HDFS,
>>> and work from there.
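>>>
>>> For example (just a sketch, assuming a SparkSession named spark; the paths
>>> are placeholders):
>>>
>>>   // one-time copy from S3 to HDFS, then run the job against the HDFS copy
>>>   val fromS3 = spark.read.parquet("s3a://my-bucket/my_table")
>>>   fromS3.write.mode("overwrite").parquet("hdfs:///tmp/my_table")
>>>   val df = spark.read.parquet("hdfs:///tmp/my_table")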
>>>
>>> HDFS offers data locality for the tasks, i.e. the tasks start on the nodes
>>> where the data is. Depending on which S3 "protocol" you are using, you might
>>> also pay a bigger performance penalty.
>>>
>>> Try s3a as a protocol (replace all s3n with s3a).
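>>>
>>> For example (a sketch; the bucket name and credentials are placeholders):
>>>
>>>   import org.apache.spark.sql.SparkSession
>>>
>>>   val spark = SparkSession.builder()
>>>     .appName("s3a-read")
>>>     .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>") // placeholder
>>>     .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>") // placeholder
>>>     .getOrCreate()
>>>
>>>   // was: spark.read.parquet("s3n://my-bucket/my_table")
>>>   val df = spark.read.parquet("s3a://my-bucket/my_table")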
>>>
>>> You can also use the s3 URL scheme, but this requires a special bucket
>>> configuration (a dedicated empty bucket), and it lacks some interoperability
>>> with other AWS services.
>>>
>>> Nevertheless, it could also be something else in the code. Can you
>>> post an example reproducing the issue?
>>>
>>> > On 27.05.2020 at 18:18, Dark Crusader <
>>> relinquisheddra...@gmail.com> wrote:
>>> >
>>> > 
>>> > Hi all,
>>> >
>>> > I am reading data from HDFS in the form of Parquet files (around 3 GB)
>>> and running an algorithm from the Spark ML library.
>>> >
>>> > If I create the same Spark dataframe by reading the data from S3, the same
>>> algorithm takes considerably more time.
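>>> >
>>> > Roughly what I am doing (simplified sketch; the algorithm, column name,
>>> > and paths here are placeholders, not my actual job):
>>> >
>>> >   import org.apache.spark.ml.clustering.KMeans
>>> >
>>> >   val df = spark.read.parquet("hdfs:///data/features")       // fast
>>> >   // val df = spark.read.parquet("s3a://my-bucket/features") // much slower
>>> >   val model = new KMeans().setK(10).setFeaturesCol("features").fit(df)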
>>> >
>>> > I don't understand why this is happening. Is this a chance occurrence,
>>> or are the Spark dataframes created differently?
>>> >
>>> > I don't understand how the data store would affect the algorithm's
>>> performance.
>>> >
>>> > Any help would be appreciated. Thanks a lot.
>>>
>>
>
> --
> I appreciate your time,
>
> ~Randy
>
