Yeah, it works for me.

Thanks

On Fri, Nov 18, 2016 at 3:08 AM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> I think you can use the map reduce paradigm here. Create a key using user
> ID and date, with the record as the value. Then you can express your
> operation (the "do something" part) as a function. If the function meets
> certain criteria, i.e. it is associative and commutative (like, say,
> addition or multiplication), you can use reduceByKey; else you may use
> groupByKey.
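>
> A minimal sketch of this in Scala (the input path and the line parsing are
> assumptions; adjust to your actual schema):
>
>   // Each input line is "userid,date,transaction"
>   val records = sc.textFile("hdfs://path/to/csv")
>
>   // Key by (userId, date); the value is the transaction field
>   val keyed = records.map { line =>
>     val Array(userId, date, txn) = line.split(",")
>     ((userId, date), txn)
>   }
>
>   // If "do something" is associative and commutative (e.g. a sum),
>   // reduceByKey combines map-side and scales better:
>   val sums = keyed.mapValues(_.toDouble).reduceByKey(_ + _)
>
>   // Otherwise groupByKey collects all transactions per (user, date):
>   val grouped = keyed.groupByKey()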
>
> HTH
> On 18 Nov 2016 06:45, "titli batali" <titlibat...@gmail.com> wrote:
>
>>
>> That would help, but then within a particular partition I would still
>> need to iterate over the customers whose user ids share the first n
>> letters in that partition. I want to get rid of nested iterations.
>>
>> Thanks
>>
>> On Thu, Nov 17, 2016 at 10:28 PM, Xiaomeng Wan <shawn...@gmail.com>
>> wrote:
>>
>>> You can partition on the first n letters of userid
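>>>
>>> Rough sketch (the prefix length 3 and the column name "uid_prefix" are
>>> arbitrary; assumes a DataFrame df with a userid column):
>>>
>>>   import org.apache.spark.sql.functions.{col, substring}
>>>
>>>   // Derive a prefix key so each partition holds a bounded group of users
>>>   val byPrefix = df
>>>     .withColumn("uid_prefix", substring(col("userid"), 1, 3))
>>>     .repartition(col("uid_prefix"))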
>>>
>>> On 17 November 2016 at 08:25, titli batali <titlibat...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a use case where we have 1000 csv files with a column user_Id,
>>>> covering 8 million unique users. Each row contains: userid, date,
>>>> transaction, and we run some queries over this data.
>>>>
>>>> We have a case where we need to iterate over each transaction on a
>>>> particular date for each user. There are three nested loops:
>>>>
>>>> for (user) {
>>>>   for (date) {
>>>>     for (transactions) {
>>>>       // Do something
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> i.e. we do a similar thing for every (date, transaction) tuple for a
>>>> particular user. In order to get away from the loop structure and
>>>> decrease the processing time, we are converting the csv files to parquet
>>>> and partitioning them by userid:
>>>> df.write.format("parquet").partitionBy("useridcol").save("hdfs://path")
>>>>
>>>> This way, while reading the parquet files, we read a particular user
>>>> from a particular partition, create a Cartesian product of (date X
>>>> transaction), and work on the tuples in each partition to achieve the
>>>> above level of nesting. Is partitioning on 8 million users a bad option?
>>>> What could be a better way to achieve this?
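>>>>
>>>> For reference, a sketch of the read side under that layout (the user id
>>>> "u123" and the collect_list aggregation are illustrative, not our actual
>>>> query):
>>>>
>>>>   import org.apache.spark.sql.functions.{col, collect_list}
>>>>
>>>>   // Partition pruning: touches only the one user's directory
>>>>   val oneUser = spark.read.parquet("hdfs://path")
>>>>     .filter(col("useridcol") === "u123")
>>>>
>>>>   // One pass instead of nested loops: transactions grouped per date
>>>>   val byDate = oneUser.groupBy("date")
>>>>     .agg(collect_list("transaction").as("txns"))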
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>
>>>
>>
