Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Lalwani,
But I need to augment every record with data specific to the directory it was read from. Once I have read the data, there is no link back to the directory in the data that I could use to attach the additional data.

On Wed, May 5, 2021 at 10:41 PM Lalwani, Jayesh wrote:
> You don’t have to uni
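
One possible way to keep the link back to the directory, sketched here under assumptions rather than taken from the thread: read through the DataFrame API and record each row's source file with `input_file_name()`, then derive the directory from that path. The names `read_with_source_dir`, `parent_dir`, `source_file` and `source_dir` are hypothetical, and the input is assumed to be plain text files.

```python
import posixpath


def parent_dir(file_path):
    """Return the directory part of a file path or URI string."""
    return posixpath.dirname(file_path)


def read_with_source_dir(spark, dirs):
    """Read text files from all `dirs` in one call and tag each row
    with the directory it came from.

    `spark` is an existing SparkSession and `dirs` a list of directory
    paths; both are assumptions made for this sketch.
    """
    # Imported here so parent_dir can be used without PySpark installed.
    from pyspark.sql.functions import input_file_name, udf
    from pyspark.sql.types import StringType

    to_dir = udf(parent_dir, StringType())
    return (
        spark.read.text(dirs)
        .withColumn("source_file", input_file_name())
        .withColumn("source_dir", to_dir("source_file"))
    )
```

The `source_dir` column could then be joined against a small table of per-directory attributes. Note that `input_file_name()` returns a fully qualified URI, so the join keys may need to be normalized to the same form.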

Re: How to read multiple HDFS directories

2021-05-05 Thread Lalwani, Jayesh
You don’t have to union multiple RDDs. You can read files from multiple directories in a single read call. Spark will manage partitioning of the data across directories.

From: Kapil Garg
Date: Wednesday, May 5, 2021 at 10:45 AM
To: spark users
Subject: [EXTERNAL] How to read multiple HDFS dir
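
A minimal sketch of what a single read call over several directories might look like with the RDD API, assuming plain text input. `SparkContext.textFile` accepts a comma-separated list of paths; the helper names `join_paths` and `read_all` are hypothetical.

```python
def join_paths(dirs):
    """Build the comma-separated path string that textFile accepts."""
    return ",".join(dirs)


def read_all(sc, dirs):
    """Read every file under all `dirs` into one RDD of lines.

    `sc` is an existing SparkContext (an assumption of this sketch).
    """
    return sc.textFile(join_paths(dirs))
```

For example, `read_all(sc, ["/data/a", "/data/b"])` would produce one RDD covering both directories, with no per-directory union.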

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Mich,
The number of directories can be 1000+, so doing 1000+ reduceByKey and union operations might be costly.

On Wed, May 5, 2021 at 10:22 PM Mich Talebzadeh wrote:
> This is my take
>
> 1. read the current snapshot (provide empty if it doesn't exist yet)
> 2. Loop over N directori

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
This is my take

1. read the current snapshot (provide empty if it doesn't exist yet)
2. Loop over N directories
   1. read unprocessed new data from HDFS
   2. union them and do a `reduceByKey` operation
3. output a new version of the snapshot

HTH
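
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the job discussed in the thread: the record layout (tab-separated key, integer timestamp, payload), the "keep the newest record" merge rule, and the names `parse_line`, `merge_records` and `build_snapshot` are all hypothetical.

```python
def parse_line(line):
    """Parse a hypothetical 'key<TAB>timestamp<TAB>payload' line
    into (key, (timestamp, payload))."""
    key, ts, payload = line.split("\t", 2)
    return key, (int(ts), payload)


def merge_records(left, right):
    """Hypothetical merge rule: keep the record with the larger timestamp."""
    return left if left[0] >= right[0] else right


def build_snapshot(sc, snapshot_rdd, new_dirs):
    """Merge unprocessed data from `new_dirs` into the current snapshot.

    `sc` is an existing SparkContext and `snapshot_rdd` an RDD of
    (key, (timestamp, payload)) pairs; pass sc.emptyRDD() when no
    snapshot exists yet.
    """
    # One read over all directories, then a single union and reduceByKey.
    new_data = sc.textFile(",".join(new_dirs)).map(parse_line)
    return snapshot_rdd.union(new_data).reduceByKey(merge_records)
```

Written this way, the number of `union` and `reduceByKey` calls does not grow with the number of directories, since all new data arrives through one read.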

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Sorry, but I didn't get the question. It is possible that one record is present in multiple directories. That's why we do a reduceByKey after the union step.

On Wed, May 5, 2021 at 9:20 PM Mich Talebzadeh wrote:
> When you are doing union on these RDDs, (each RDD has one to one
> correspondence wit

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
When you are doing the union on these RDDs (each RDD corresponds one-to-one to an HDFS directory), do you have a common key across all of them?

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi Mich,
I went through the thread and it doesn't relate to the problem statement I shared above. In my problem statement, there is a simple ETL job which doesn't use any external library (such as pandas).

This is the flow:

hdfsDirs := List(); //contains N directories
rddList := List();
for e

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
Hi,
Have a look at the thread called "Tasks are skewed to one executor" and see if it helps; we can take it from there.

HTH