Hi Suraj,

It seems your requirement is Record Linkage/Entity Resolution.
https://en.wikipedia.org/wiki/Record_linkage
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

A presentation from Spark Summit using GraphX:
https://spark-summit.org/east-2015/talk/distributed-graph-based-entity-resolution-using-spark

 Kind Regards
Salih Oztop
07856128843
http://www.linkedin.com/in/salihoztop
      From: Suraj Shetiya <[email protected]>
 To: Michael Armbrust <[email protected]> 
Cc: Salih Oztop <[email protected]>; "[email protected]" 
<[email protected]>; [email protected] 
 Sent: Thursday, July 2, 2015 10:47 AM
 Subject: Re: Spark Dataframe 1.4 (GroupBy partial match)
   
Hi Michael,

Thanks for the quick response. This sounds like something that would work.
However, rethinking the problem statement and the growing number of other use
cases, there are more such scenarios where columns hold structured and
unstructured data embedded together (JSON, XML, or other kinds of
collections). It may make sense to allow probabilistic groupBy operations, so
that the user gets the same functionality in one step instead of two.

What are your thoughts on whether that makes sense?

-Suraj




---------- Forwarded message ----------
From: "Michael Armbrust" <[email protected]>
Date: Jul 2, 2015 12:49 AM
Subject: Re: Spark Dataframe 1.4 (GroupBy partial match)
To: "Suraj Shetiya" <[email protected]>
Cc: "Salih Oztop" <[email protected]>, "[email protected]" 
<[email protected]>

You should probably write a UDF that uses a regular expression or other string 
munging to canonicalize the subject and then group on that derived column.
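
As a rough sketch of that suggestion (not code from this thread): the function
below strips leading reply/forward prefixes with a regular expression and then
counts messages per canonical subject. It is plain Python so it runs without a
cluster. The name canonicalize_subject and the prefix list (Re, Fw, Fwd) are
assumptions made for illustration; in PySpark the same function could be
wrapped with pyspark.sql.functions.udf and the derived column passed to
groupBy.

```python
import re
from collections import Counter

# Assumed prefix set: "Re:", "Fw:", "Fwd:" (case-insensitive), possibly repeated.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def canonicalize_subject(subject):
    """Strip leading reply/forward markers and surrounding whitespace."""
    return _PREFIX.sub("", subject).strip()


# Sample rows taken from the original question in this thread.
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

# Group on the derived (canonical) subject and count messages per group.
counts = Counter(canonicalize_subject(subject) for _, subject in rows)
```

Here counts maps each canonical subject to its message count, which answers
the "count of messages with a particular subject" question. Real mail would
likely need a longer prefix list (localized markers, "[list-name]" tags).
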
On Tue, Jun 30, 2015 at 10:30 PM, Suraj Shetiya <[email protected]> wrote:

Thanks Salih. :)


The output of the groupby is as below.

2015-01-14      "SEC Inquiry"
2015-01-16      "Re: SEC Inquiry"
2015-01-18      "Fwd: Re: SEC Inquiry"


Subsequently, we would like to aggregate all messages that share a particular
reference subject.
For instance, the question we are trying to answer could be: get the count of
messages with a particular subject.

Looking forward to any suggestions from you.

On Tue, Jun 30, 2015 at 8:42 PM, Salih Oztop <[email protected]> wrote:

Hi Suraj,

What will your output be after the group by? GroupBy is for
aggregations like sum, count, etc.
If you want to count the 2015 records, then that is possible.

Kind Regards
Salih Oztop


      From: Suraj Shetiya <[email protected]>
 To: [email protected] 
 Sent: Tuesday, June 30, 2015 3:05 PM
 Subject: Spark Dataframe 1.4 (GroupBy partial match)
   
I have a dataset (trimmed and simplified) with 2 columns as below.

Date            Subject
2015-01-14      "SEC Inquiry"
2014-02-12      "Happy birthday"
2014-02-13      "Re: Happy birthday"
2015-01-16      "Re: SEC Inquiry"
2015-01-18      "Fwd: Re: SEC Inquiry"

I have imported this into a Spark DataFrame. What I am looking for is a groupBy
on the subject field (however, I need a partial match to identify the
discussion topic).

For example, in the above case I would like to group all messages whose
subject contains "SEC Inquiry", which returns the following grouped frame:

2015-01-14      "SEC Inquiry"
2015-01-16      "Re: SEC Inquiry"
2015-01-18      "Fwd: Re: SEC Inquiry"

Another use case for a similar problem could be grouping by year (in the above
example). That would mean a partial match on the date field, i.e. a
groupBy on Date that matches the year as "2014" or "2015".
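
To make the year case concrete, here is a minimal plain-Python sketch (an
illustration only, not a Spark API): it derives a year key from the first four
characters of each date and groups the subjects under it. It assumes the dates
are "YYYY-MM-DD" strings as in the sample above; the name by_year is
hypothetical. In a DataFrame the same idea would be a derived column holding
that key, followed by a groupBy on it.

```python
from collections import defaultdict

# Sample rows from the dataset shown above.
rows = [
    ("2015-01-14", "SEC Inquiry"),
    ("2014-02-12", "Happy birthday"),
    ("2014-02-13", "Re: Happy birthday"),
    ("2015-01-16", "Re: SEC Inquiry"),
    ("2015-01-18", "Fwd: Re: SEC Inquiry"),
]

by_year = defaultdict(list)
for date, subject in rows:
    year = date[:4]  # assumes ISO-style "YYYY-MM-DD" date strings
    by_year[year].append(subject)

# Per-year message counts derived from the groups.
year_counts = {year: len(subjects) for year, subjects in by_year.items()}
```
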

Looking forward to a reply/solution to the above.

- Suraj










  
