looks like you need this:

from pyspark.sql import Row

lst = [[10001, 132, 2002, 1, "2012-11-23"],
       [10001, 132, 2002, 1, "2012-11-24"],
       [10031, 102, 223, 2, "2012-11-24"],
       [10001, 132, 2002, 2, "2012-11-25"],
       [10001, 132, 2002, 3, "2012-11-26"]]

# build Rows, turning the date string into a sortable int like 20121123
base = sc.parallelize(lst, 1).map(lambda x: Row(
    idx=x[0], num=x[1], yr=x[2], ev=x[3], dt=int(x[4].replace("-", ""))))
baseDF = ssc.createDataFrame(base)
baseDF.printSchema()

baseDF.registerTempTable("base")
# self-join: for each row, pick up every event on or before its date
trm = ssc.sql("select a.idx, a.num, a.yr, b.ev, a.dt "
              "from base a inner join base b "
              "on a.idx = b.idx and a.num = b.num and a.yr = b.yr "
              "where a.dt >= b.dt "
              "order by a.idx, a.num, a.yr, a.dt")
trmRDD = trm.map(rowtoarr).reduceByKey(lambda x, y: str(x) + "," + str(y))
for i in trmRDD.collect():
    print i

where rowtoarr keys each row on (idx, num, yr, dt) and keeps the event as the value:

def rowtoarr(r):
    # key: (customer, supplier, product, date); value: the event code
    return (r.idx, r.num, r.yr, r.dt), r.ev

Output:

((10031, 102, 223, 20121124), 2)
((10001, 132, 2002, 20121123), 1)
((10001, 132, 2002, 20121125), '1,1,2')
((10001, 132, 2002, 20121124), '1,1')
((10001, 132, 2002, 20121126), '1,1,2,3')
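
Note this gives every running prefix per date. If you only want the
closest-previous-trigger chains (one row per chain, stamped with its start
date), a rough sketch reusing the base RDD above — to_chains is a helper I'm
making up, which starts a new chain whenever the event number does not
increase:

def to_chains(rows):
    # fold date-sorted (dt, ev) pairs into chains: 1 -> 2 -> 3 extends
    # the open chain, anything else starts a new one
    chains = []
    for dt, ev in sorted(rows):
        if chains and ev > chains[-1][1][-1]:
            chains[-1][1].append(ev)   # extend the open chain
        else:
            chains.append((dt, [ev]))  # start a new chain
    return [(dt, ",".join(str(e) for e in evs)) for dt, evs in chains]

grouped = (base.map(lambda r: ((r.idx, r.num, r.yr), (r.dt, r.ev)))
               .groupByKey()
               .flatMapValues(to_chains))
for i in grouped.collect():
    print i

which should print one line per chain, e.g. ((10001, 132, 2002), (20121123,
'1')) and ((10001, 132, 2002), (20121124, '1,2,3')).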

On Wed, Apr 29, 2015 at 10:34 PM, Manoj Awasthi <awasthi.ma...@gmail.com>
wrote:

> Sorry but I didn't fully understand the grouping. This line:
>
> >> The group must only take the closest previous trigger. The first one
> hence shows alone.
>
> Can you please explain further?
>
>
> On Wed, Apr 29, 2015 at 4:42 PM, bipin <bipin....@gmail.com> wrote:
>
>> Hi, I have a ddf with schema (CustomerID, SupplierID, ProductID, Event,
>> CreatedOn); the first 3 are Long ints, Event can only be 1, 2 or 3, and
>> CreatedOn is a timestamp. How can I make a triplet/doublet/singlet group
>> out of them such that I can infer that a customer registered events from
>> 1 to 2 and, if present, to 3 in time order, while preserving the number
>> of entries? For e.g.
>>
>> Before processing:
>> 10001, 132, 2002, 1, 2012-11-23
>> 10001, 132, 2002, 1, 2012-11-24
>> 10031, 102, 223, 2, 2012-11-24
>> 10001, 132, 2002, 2, 2012-11-25
>> 10001, 132, 2002, 3, 2012-11-26
>> (total 5 rows)
>>
>> After processing:
>> 10001, 132, 2002, 2012-11-23, "1"
>> 10031, 102, 223, 2012-11-24, "2"
>> 10001, 132, 2002, 2012-11-24, "1,2,3"
>> (5 values in total across the last, comma-separated field!)
>>
>> The group must only take the closest previous trigger. The first one hence
>> shows alone. Can this be done using Spark SQL? If it needs to be processed
>> functionally in Scala, how would I do that? I can't wrap my head around
>> this. Can anyone help?
>>
>>
>>
>


-- 
Best Regards,
Ayan Guha
