Thanks Sean. I just realized it. Let me try that.
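
For the archive, here is a minimal sketch of the change Sean describes, using the variable names from the quoted code below. repartition() returns a new DataFrame rather than modifying textDF in place, so the result has to be assigned back:

# repartition() returns a new DataFrame; assign the result back.
textDF = textDF.repartition(100)
logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))
# --> should now print 100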

On Mon, Mar 22, 2021 at 12:31 PM Sean Owen <sro...@gmail.com> wrote:

> You need to do something with the result of repartition(); it returns a
> new DataFrame, so you haven't changed textDF.
>
> On Mon, Mar 22, 2021, 12:15 PM KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a use case where there are large files in HDFS.
>>
>> The file size is 3 GB.
>>
>> This is existing code in production, and I am trying to improve the
>> performance of the job.
>>
>> Sample code:
>> textDF = dataframe  # the DataFrame created from the HDFS path
>> logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))
>> --> Prints 1
>> textDF.repartition(100)
>> logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))
>> --> Prints 1
>>
>> Any suggestions on why this is happening?
>>
>> The next block of code is the one that takes time:
>> rdd.filter(lambda line: len(line) != collistlenth)
>>
>> Is there any way to parallelize and speed up this step?
>>
>> Thanks,
>> Asmath
>>
>
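
An aside not covered in the replies: the filter step quoted above runs a Python lambda for every row, which forces each record through Python serialization. A hedged sketch of a DataFrame-native alternative, assuming the text sits in a single string column named "value" (the default when reading with spark.read.text) and reusing collistlenth from the quoted code:

from pyspark.sql import functions as F

# Keep the length check on the JVM side instead of in a per-row Python lambda.
# Assumes a single string column "value" and the collistlenth value from the
# original job.
filtered = textDF.filter(F.length(F.col("value")) != collistlenth)

This keeps the comparison inside Spark SQL, so it can run in parallel across the repartitioned data without the per-row Python overhead.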
