Yeah, it works for me. Thanks
On Fri, Nov 18, 2016 at 3:08 AM, ayan guha <guha.a...@gmail.com> wrote:
> Hi
>
> I think you can use the map-reduce paradigm here. Create a key from the
> user ID and date, with the record as the value. Then you can express your
> "do something" part as a function. If the function meets certain criteria,
> such as being associative and commutative (say, addition or
> multiplication), you can use reduceByKey; otherwise you may use
> groupByKey.
>
> HTH
> On 18 Nov 2016 06:45, "titli batali" <titlibat...@gmail.com> wrote:
>
>> That would help, but again, within a particular partition I would need
>> to iterate over the customers whose user IDs share the first n letters
>> in that partition. I want to get rid of the nested iterations.
>>
>> Thanks
>>
>> On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan <shawn...@gmail.com>
>> wrote:
>>
>>> You could partition on the first n letters of the user ID.
>>>
>>> On 17 November 2016 at 08:25, titli batali <titlibat...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a use case where we have 1000 CSV files with a user_Id column,
>>>> covering 8 million unique users. The data contains
>>>> userid,date,transaction, over which we run some queries.
>>>>
>>>> We have a case where we need to iterate over each transaction on a
>>>> particular date for each user. There are three nested loops:
>>>>
>>>> for(user){
>>>>   for(date){
>>>>     for(transactions){
>>>>       //Do Something
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> i.e. we do a similar thing for every (date, transaction) tuple for a
>>>> particular user. To get away from the loop structure and decrease the
>>>> processing time, we are converting the CSV files to Parquet and
>>>> partitioning by user ID:
>>>> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path").
>>>>
>>>> So while reading the Parquet files, we read a particular user from a
>>>> particular partition, create a Cartesian product of (date X transaction),
>>>> and work on the tuples in each partition to achieve the above level of
>>>> nesting. Is partitioning on 8 million users a bad option? What could be
>>>> a better way to achieve this?
>>>>
>>>> Thanks
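
For anyone finding this thread later, here is a minimal Scala sketch of the reduceByKey / groupByKey idea suggested above. It assumes the CSV layout userid,date,transaction described in the original post; the HDFS path and the "sum the transactions" aggregation are just placeholders for the real "do something".

import org.apache.spark.sql.SparkSession

object PerUserDateAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("per-user-date-agg").getOrCreate()
    val sc = spark.sparkContext

    // Each CSV line is assumed to look like: userid,date,transaction
    val lines = sc.textFile("hdfs://path/to/csv/*.csv")

    val keyed = lines.map { line =>
      val Array(userId, date, transaction) = line.split(",", 3)
      // Composite key (userId, date); the transaction value is the value.
      ((userId, date), transaction.toDouble)
    }

    // If "do something" is associative and commutative (e.g. a sum),
    // reduceByKey combines values map-side and avoids shuffling every
    // individual record for a key to one executor.
    val totals = keyed.reduceByKey(_ + _)

    // Otherwise, groupByKey hands you the full collection of transactions
    // per (userId, date) so arbitrary per-group logic can run on it.
    val grouped = keyed.groupByKey().mapValues(txns => txns.toSeq.sorted)

    totals.take(10).foreach(println)
    grouped.take(10).foreach(println)
    spark.stop()
  }
}

Either way, the per-user / per-date nesting is expressed as a key rather than as loops, and no partitioning on 8 million distinct user IDs is needed.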