If you need single inserts, updates, deletes and selects, why not use HBase with Phoenix? I see it as complementary to the Hive/warehouse offering.
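A minimal sketch of what that could look like from the Spark side, assuming the phoenix-spark connector is on the classpath; the table name RETAILER_CATALOG, its columns and the ZooKeeper quorum below are made up for illustration:

// Read a Phoenix table into a DataFrame via the phoenix-spark data source
val catalogDF = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map("table" -> "RETAILER_CATALOG",       // hypothetical Phoenix table
               "zkUrl" -> "zkhost1,zkhost2:2181"))  // hypothetical ZooKeeper quorum
  .load()

// Single-row maintenance would go through Phoenix SQL (sqlline or JDBC), e.g.
//   UPSERT INTO RETAILER_CATALOG (HASHTAG, TRANSACTIONDESCRIPTION)
//   VALUES ('SAINSBURYS', 'SAINSBURYS SMKT');

That keeps the frequently updated reference data in a store built for point reads and writes, while Spark still sees it as a DataFrame for the join with the Hive/ORC account data.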
> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi,
>
> I decided to create a catalog table in Hive, ORC and transactional. That table
> has two columns:
>
>   transactiondescription === account_table.transactiondescription
>   hashtag, a String column created through a semi-automated process of
>   deriving it from account_table.transactiondescription
>
> Once the process of populating the catalog table is complete, we just need to
> create a new DF based on a join between the catalog table and the
> account_table. The join will use the hashtag in the catalog table to loop over
> the debit column in account_table for a given hashtag. That is pretty fast,
> whereas pattern matching in real time is intensive in any application or
> database.
>
> So one can build up the catalog table over time as a reference table. I am
> sure such tables exist in the commercial world.
>
> Anyway, after getting the results out I know how I am wasting my money on
> different things, especially on clothing etc. :)
>
> HTH
>
> P.S. There is also an issue with Spark not being able to read data from Hive
> transactional tables that have not been compacted yet; Spark just crashes. If
> such tables need to be updated regularly, say the catalog table, and they are
> pretty small, one might maintain them in an RDBMS and read them once through
> JDBC into a DataFrame in Spark before doing the analytics.
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>> Hi Mich,
>>
>> It seems like an entity resolution problem - looking at different
>> representations of an entity - SAINSBURY in this case - and matching them
>> all together. How dirty is the data in your description column - are there
>> stop words like SACAT/SMKT etc. you can strip off to get the base retailer
>> entity?
>>
>> Best Regards,
>> Sonal
>> Founder, Nube Technologies
>> Reifier at Strata Hadoop World
>> Reifier at Spark Summit 2015
>>
>>
>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh
>>> <mich.talebza...@gmail.com> wrote:
>>> Thanks.
>>>
>>> I believe there is some catalog of companies that I can get, store in a
>>> table and match the company name to the transactiondescription column.
>>>
>>> That catalog should have sectors in it. For example, company XYZ is under
>>> Grocers etc., which will make search and grouping much easier.
>>>
>>> I believe Spark can do it, though I am generally interested in alternative
>>> ideas.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
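A rough sketch of the RDBMS-backed catalog idea from the P.S. above: read the small catalog table once over JDBC, then broadcast it for the join against the account table. The connection URL, credentials and table/column names below are hypothetical, and the syntax assumes a Spark 2.x SparkSession (use sqlContext.read instead of spark.read on Spark 1.x):

import org.apache.spark.sql.functions._
import spark.implicits._   // for the $-column syntax

// Read the small reference/catalog table once through JDBC
val catalog = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // hypothetical RDBMS
  .option("dbtable", "scratchpad.retailer_catalog")        // columns: hashtag, transactiondescription
  .option("user", "scratchpad")
  .option("password", "xxx")
  .load()

// Join it to the debit transactions and total the spend per hashtag.
// broadcast() hints Spark to ship the small table to every executor
// rather than shuffling the large account table.
val spentByTag = ll_18740868
  .where($"transactiontype" === "DEB")
  .join(broadcast(catalog),
        ll_18740868("transactiondescription") === catalog("transactiondescription"))
  .groupBy(catalog("hashtag"))
  .agg(sum($"debitamount").as("Spent"))
  .orderBy(desc("Spent"))

spentByTag.show(20, false)

If the catalog carries patterns rather than exact descriptions, the equality condition could be replaced with something like instr($"transactiondescription", catalog("hashtag")) > 0, at the cost of a much more expensive join.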
>>>
>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>> Well, if you still want to use window functions for your logic, then you
>>>> need to derive a new column, like "catalog", and use it as part of the
>>>> grouping logic.
>>>>
>>>> Maybe you can use a regex to derive this new column. The implementation
>>>> depends on your data in "transactiondescription", and a regex gives you
>>>> the most powerful way to handle your data.
>>>>
>>>> This is really not a Spark question, but a question of how you process
>>>> your logic based on the data given.
>>>>
>>>> Yong
>>>>
>>>>
>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>> To: user @spark
>>>> Subject: Extracting key word from a textual column
>>>>
>>>> Hi,
>>>>
>>>> Need some ideas.
>>>>
>>>> Summary:
>>>>
>>>> I am working on a tool to slice and dice the amount of money I have spent
>>>> so far (meaning the whole data sample) on a given retailer, so I have a
>>>> better idea of where I am wasting the money.
>>>>
>>>> Approach:
>>>>
>>>> I downloaded my bank statements for a given account in CSV format, from
>>>> inception till the end of July, read the data and stored it in an ORC
>>>> table.
>>>>
>>>> I am interested in all bills that I paid using a debit card
>>>> (transactiontype = "DEB") that come out of the account directly.
>>>> Transactiontype is the three-character code lookup that I download as
>>>> well.
>>>>
>>>> scala> ll_18740868.printSchema
>>>> root
>>>>  |-- transactiondate: date (nullable = true)
>>>>  |-- transactiontype: string (nullable = true)
>>>>  |-- sortcode: string (nullable = true)
>>>>  |-- accountnumber: string (nullable = true)
>>>>  |-- transactiondescription: string (nullable = true)
>>>>  |-- debitamount: double (nullable = true)
>>>>  |-- creditamount: double (nullable = true)
>>>>  |-- balance: double (nullable = true)
>>>>
>>>> The important fields are transactiondate, transactiontype,
>>>> transactiondescription and debitamount.
>>>>
>>>> So using analytics/windowing I can do all sorts of things. For example,
>>>> this one gives me the last time I spent money on retailer XYZ and the
>>>> amount:
>>>>
>>>> SELECT *
>>>> FROM (
>>>>   SELECT transactiondate, transactiondescription, debitamount
>>>>        , RANK() OVER (ORDER BY transactiondate DESC) AS rank
>>>>   FROM accounts.ll_18740868 WHERE transactiondescription LIKE '%XYZ%'
>>>> ) tmp
>>>> WHERE rank <= 1
>>>>
>>>> And its equivalent using windowing in FP:
>>>>
>>>> import org.apache.spark.sql.expressions.Window
>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).
>>>>   select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank")).
>>>>   filter($"rank" === 1).show
>>>>
>>>> +---------------+----------------------+----+
>>>> |transactiondate|transactiondescription|rank|
>>>> +---------------+----------------------+----+
>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>> +---------------+----------------------+----+
>>>>
>>>> So far so good.
>>>> But if I want to find all I spent on each retailer, then it gets
>>>> trickier, as a retailer appears like below in the column
>>>> transactiondescription:
>>>>
>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>> +----------------------+
>>>> |transactiondescription|
>>>> +----------------------+
>>>> |  SAINSBURYS SMKT C...|
>>>> |  SACAT SAINSBURYS ...|
>>>> |  SAINSBURY'S SMKT ...|
>>>> |  SAINSBURYS S/MKT ...|
>>>> |  SACAT SAINSBURYS ...|
>>>> +----------------------+
>>>>
>>>> If I look at them I know they all belong to SAINSBURYS (the food
>>>> retailer). I have done some crude grouping and it works, somehow:
>>>>
>>>> // define UDF here to handle substring
>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>> var cutoff = "CD" // currently used in the statement
>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription", lit(0),
>>>>   instr($"transactiondescription", cutoff) - 1))
>>>> ll_18740868.where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull).
>>>>   select(SubstrUDF($"transactiondescription", lit(0),
>>>>     instr($"transactiondescription", cutoff) - 1).as("Retailer"),
>>>>     sum($"debitamount").over(wSpec2).as("Spent")).
>>>>   distinct.orderBy($"Spent").collect.foreach(println)
>>>>
>>>> However, I really need to extract the "keyword" retailer name from the
>>>> transactiondescription column, and I need some ideas about the best way
>>>> of doing it. Is this possible in Spark?
>>>>
>>>> Thanks
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
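Following up on Yong's regex suggestion and Sonal's point about stop words, one rough way to derive a Retailer keyword column and aggregate on it is sketched below. The stop-word list (SACAT, SMKT, S/MKT) and the trailing "CD nnnn" card-detail pattern are guesses based on the sample rows above, not a complete rule set:

import org.apache.spark.sql.functions._
// assumes the spark-shell (or import spark.implicits._) for the $-column syntax

// Normalise the description into a crude retailer keyword:
//  1. drop apostrophes so SAINSBURY'S and SAINSBURYS collapse to one form
//  2. strip store-format stop words (guessed from the sample)
//  3. cut off the trailing card details such as "CD 4636"
val retailer = trim(
  regexp_replace(
    regexp_replace(
      regexp_replace(upper($"transactiondescription"), "'", ""),
      "\\b(SACAT|SMKT|S/MKT)\\b", " "),
    "\\bCD\\s*\\d+.*$", ""))

ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .withColumn("Retailer", retailer)
  .groupBy($"Retailer")
  .agg(sum($"debitamount").as("Spent"))
  .orderBy(desc("Spent"))
  .show(20, false)

The regex rules would need refining as new retailers show up; once they stabilise, they are exactly the sort of mapping that could feed the hashtag catalog table discussed earlier in the thread.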