Thanks Sean. I just realized it. Let me try that.

On Mon, Mar 22, 2021 at 12:31 PM Sean Owen <sro...@gmail.com> wrote:
> You need to do something with the result of repartition. You haven't
> changed textDF.
>
> On Mon, Mar 22, 2021, 12:15 PM KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a use case where there are large files in HDFS; the file in
>> question is 3 GB.
>>
>> This is existing code in production, and I am trying to improve the
>> performance of the job.
>>
>> Sample code:
>>
>> textDF = dataframe  # the DataFrame created from the HDFS path
>> logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))  # prints 1
>> textDF.repartition(100)
>> logging.info("Number of partitions: " + str(textDF.rdd.getNumPartitions()))  # prints 1
>>
>> Any suggestions on why this is happening?
>>
>> The next block of code, which takes time:
>>
>> rdd.filter(lambda line: len(line) != collistlenth)
>>
>> Is there any way to parallelize and speed up this process?
>>
>> Thanks,
>> Asmath
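
For reference, a minimal sketch of the fix Sean describes, assuming
PySpark and a hypothetical HDFS path: repartition() is a transformation
that returns a new DataFrame rather than modifying the existing one in
place, which is why the partition count kept printing 1. The result has
to be reassigned:

    import logging
    from pyspark.sql import SparkSession

    logging.basicConfig(level=logging.INFO)
    spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

    # hypothetical path standing in for the 3 GB file in the thread
    textDF = spark.read.text("hdfs:///data/large_file.txt")
    logging.info("Before: " + str(textDF.rdd.getNumPartitions()))

    # repartition() does not change textDF in place; reassign the result
    textDF = textDF.repartition(100)
    logging.info("After: " + str(textDF.rdd.getNumPartitions()))  # 100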
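
On the slow filter: once the DataFrame is repartitioned, the downstream
filter runs as 100 parallel tasks instead of 1. If each row is a raw text
line (an assumption; the thread does not show how rdd was derived), the
same length check can also stay in the DataFrame API, which avoids
serializing every row through a Python lambda:

    from pyspark.sql.functions import length, col

    collistlenth = 42  # hypothetical value; the actual one isn't shown in the thread
    # keep rows whose length differs from the expected record length,
    # matching the rdd.filter(...) in the original message
    filtered = textDF.filter(length(col("value")) != collistlenth)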