If you need single inserts, updates, deletes and selects, why not use HBase with Phoenix? I see it as complementary to the Hive/warehouse offering.
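A minimal sketch of what that could look like from the Spark side, assuming the phoenix-spark connector is on the classpath; the table name RETAILER_CATALOG, its columns and the ZooKeeper quorum below are made up for illustration:

// Read a Phoenix table into a DataFrame via the phoenix-spark data source
val catalogDF = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map("table" -> "RETAILER_CATALOG",       // hypothetical Phoenix table
               "zkUrl" -> "zkhost1,zkhost2:2181"))  // hypothetical ZooKeeper quorum
  .load()

// Single-row maintenance would go through Phoenix SQL (sqlline or JDBC), e.g.
//   UPSERT INTO RETAILER_CATALOG (HASHTAG, TRANSACTIONDESCRIPTION)
//   VALUES ('SAINSBURYS', 'SAINSBURYS SMKT');

That keeps the frequently updated reference data in a store built for point reads and writes, while Spark still sees it as a DataFrame for the join with the Hive/ORC account data.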
> On 02 Aug 2016, at 22:34, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Hi,
>
> I decided to create a catalog table in Hive, ORC and transactional. That table
> has two columns:
>
>   transactiondescription === account_table.transactiondescription
>   hashtag, a String column created through a semi-automated process of
>   deriving it from account_table.transactiondescription
>
> Once the process of populating the catalog table is complete, we just need to
> create a new DF based on a join between the catalog table and the
> account_table. The join will use the hashtag in the catalog table to loop over
> the debit column in account_table for a given hashtag. That is pretty fast,
> whereas pattern matching in real time is intensive in any application or
> database.
>
> So one can build up the catalog table over time as a reference table. I am
> sure such tables exist in the commercial world.
>
> Anyway, after getting the results out I know how I am wasting my money on
> different things, especially on clothing etc. :)
>
> HTH
>
> P.S. There is also an issue with Spark not being able to read data from Hive
> transactional tables that have not been compacted yet; Spark just crashes. If
> such tables need to be updated regularly, say the catalog table, and they are
> pretty small, one might maintain them in an RDBMS and read them once through
> JDBC into a DataFrame in Spark before doing the analytics.
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
> damage or destruction of data or any other property which may arise from
> relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>> On 2 August 2016 at 17:56, Sonal Goyal <sonalgoy...@gmail.com> wrote:
>> Hi Mich,
>>
>> It seems like an entity resolution problem - looking at different
>> representations of an entity - SAINSBURY in this case - and matching them
>> all together. How dirty is the data in your description column - are there
>> stop words like SACAT/SMKT etc. you can strip off to get the base retailer
>> entity?
>>
>> Best Regards,
>> Sonal
>> Founder, Nube Technologies
>> Reifier at Strata Hadoop World
>> Reifier at Spark Summit 2015
>>
>>
>>> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh
>>> <mich.talebza...@gmail.com> wrote:
>>> Thanks.
>>>
>>> I believe there is some catalog of companies that I can get, store in a
>>> table and match the company name to the transactiondescription column.
>>>
>>> That catalog should have sectors in it. For example, company XYZ is under
>>> Grocers etc., which will make search and grouping much easier.
>>>
>>> I believe Spark can do it, though I am generally interested in alternative
>>> ideas.
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>> http://talebzadehmich.wordpress.com
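A rough sketch of the RDBMS-backed catalog idea from the P.S. above: read the small catalog table once over JDBC, then broadcast it for the join against the account table. The connection URL, credentials and table/column names below are hypothetical, and the syntax assumes a Spark 2.x SparkSession (use sqlContext.read instead of spark.read on Spark 1.x):

import org.apache.spark.sql.functions._
import spark.implicits._   // for the $-column syntax

// Read the small reference/catalog table once through JDBC
val catalog = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // hypothetical RDBMS
  .option("dbtable", "scratchpad.retailer_catalog")        // columns: hashtag, transactiondescription
  .option("user", "scratchpad")
  .option("password", "xxx")
  .load()

// Join it to the debit transactions and total the spend per hashtag.
// broadcast() hints Spark to ship the small table to every executor
// rather than shuffling the large account table.
val spentByTag = ll_18740868
  .where($"transactiontype" === "DEB")
  .join(broadcast(catalog),
        ll_18740868("transactiondescription") === catalog("transactiondescription"))
  .groupBy(catalog("hashtag"))
  .agg(sum($"debitamount").as("Spent"))
  .orderBy(desc("Spent"))

spentByTag.show(20, false)

If the catalog carries patterns rather than exact descriptions, the equality condition could be replaced with something like instr($"transactiondescription", catalog("hashtag")) > 0, at the cost of a much more expensive join.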
>>>
>>>> On 2 August 2016 at 16:26, Yong Zhang <java8...@hotmail.com> wrote:
>>>> Well, if you still want to use window functions for your logic, then you
>>>> need to derive a new column, like "catalog", and use it as part of the
>>>> grouping logic.
>>>>
>>>> Maybe you can use a regex to derive this new column. The implementation
>>>> depends on your data in "transactiondescription", and a regex gives you
>>>> the most powerful way to handle your data.
>>>>
>>>> This is really not a Spark question, but a question of how you process
>>>> your logic based on the data given.
>>>>
>>>> Yong
>>>>
>>>>
>>>> From: Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> Sent: Tuesday, August 2, 2016 10:00 AM
>>>> To: user @spark
>>>> Subject: Extracting key word from a textual column
>>>>
>>>> Hi,
>>>>
>>>> Need some ideas.
>>>>
>>>> Summary:
>>>>
>>>> I am working on a tool to slice and dice the amount of money I have spent
>>>> so far (meaning the whole data sample) on a given retailer, so I have a
>>>> better idea of where I am wasting the money.
>>>>
>>>> Approach:
>>>>
>>>> I downloaded my bank statements for a given account in CSV format, from
>>>> inception till the end of July, read the data and stored it in an ORC
>>>> table.
>>>>
>>>> I am interested in all bills that I paid using a debit card
>>>> (transactiontype = "DEB") that come out of the account directly.
>>>> Transactiontype is the three-character code lookup that I download as
>>>> well.
>>>>
>>>> scala> ll_18740868.printSchema
>>>> root
>>>>  |-- transactiondate: date (nullable = true)
>>>>  |-- transactiontype: string (nullable = true)
>>>>  |-- sortcode: string (nullable = true)
>>>>  |-- accountnumber: string (nullable = true)
>>>>  |-- transactiondescription: string (nullable = true)
>>>>  |-- debitamount: double (nullable = true)
>>>>  |-- creditamount: double (nullable = true)
>>>>  |-- balance: double (nullable = true)
>>>>
>>>> The important fields are transactiondate, transactiontype,
>>>> transactiondescription and debitamount.
>>>>
>>>> So using analytics/windowing I can do all sorts of things. For example,
>>>> this one gives me the last time I spent money on retailer XYZ and the
>>>> amount:
>>>>
>>>> SELECT *
>>>> FROM (
>>>>   SELECT transactiondate, transactiondescription, debitamount
>>>>        , RANK() OVER (ORDER BY transactiondate DESC) AS rank
>>>>   FROM accounts.ll_18740868 WHERE transactiondescription LIKE '%XYZ%'
>>>> ) tmp
>>>> WHERE rank <= 1
>>>>
>>>> And its equivalent using windowing in FP:
>>>>
>>>> import org.apache.spark.sql.expressions.Window
>>>> val wSpec = Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>>>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).
>>>>   select($"transactiondate", $"transactiondescription", rank().over(wSpec).as("rank")).
>>>>   filter($"rank" === 1).show
>>>>
>>>> +---------------+----------------------+----+
>>>> |transactiondate|transactiondescription|rank|
>>>> +---------------+----------------------+----+
>>>> |     2015-12-15|      XYZ LTD CD 4636 |   1|
>>>> +---------------+----------------------+----+
>>>>
>>>> So far so good.
>>>> But if I want to find all I spent on each retailer, then it gets
>>>> trickier, as a retailer appears like below in the column
>>>> transactiondescription:
>>>>
>>>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>>>> +----------------------+
>>>> |transactiondescription|
>>>> +----------------------+
>>>> |  SAINSBURYS SMKT C...|
>>>> |  SACAT SAINSBURYS ...|
>>>> |  SAINSBURY'S SMKT ...|
>>>> |  SAINSBURYS S/MKT ...|
>>>> |  SACAT SAINSBURYS ...|
>>>> +----------------------+
>>>>
>>>> If I look at them I know they all belong to SAINSBURYS (the food
>>>> retailer). I have done some crude grouping and it works, somehow:
>>>>
>>>> // define UDF here to handle substring
>>>> val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, end) }
>>>> var cutoff = "CD" // currently used in the statement
>>>> val wSpec2 = Window.partitionBy(SubstrUDF($"transactiondescription", lit(0),
>>>>   instr($"transactiondescription", cutoff) - 1))
>>>> ll_18740868.where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull).
>>>>   select(SubstrUDF($"transactiondescription", lit(0),
>>>>     instr($"transactiondescription", cutoff) - 1).as("Retailer"),
>>>>     sum($"debitamount").over(wSpec2).as("Spent")).
>>>>   distinct.orderBy($"Spent").collect.foreach(println)
>>>>
>>>> However, I really need to extract the "keyword" retailer name from the
>>>> transactiondescription column, and I need some ideas about the best way
>>>> of doing it. Is this possible in Spark?
>>>>
>>>> Thanks
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
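Following up on Yong's regex suggestion and Sonal's point about stop words, one rough way to derive a Retailer keyword column and aggregate on it is sketched below. The stop-word list (SACAT, SMKT, S/MKT) and the trailing "CD nnnn" card-detail pattern are guesses based on the sample rows above, not a complete rule set:

import org.apache.spark.sql.functions._
// assumes the spark-shell (or import spark.implicits._) for the $-column syntax

// Normalise the description into a crude retailer keyword:
//  1. drop apostrophes so SAINSBURY'S and SAINSBURYS collapse to one form
//  2. strip store-format stop words (guessed from the sample)
//  3. cut off the trailing card details such as "CD 4636"
val retailer = trim(
  regexp_replace(
    regexp_replace(
      regexp_replace(upper($"transactiondescription"), "'", ""),
      "\\b(SACAT|SMKT|S/MKT)\\b", " "),
    "\\bCD\\s*\\d+.*$", ""))

ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .withColumn("Retailer", retailer)
  .groupBy($"Retailer")
  .agg(sum($"debitamount").as("Spent"))
  .orderBy(desc("Spent"))
  .show(20, false)

The regex rules would need refining as new retailers show up; once they stabilise, they are exactly the sort of mapping that could feed the hashtag catalog table discussed earlier in the thread.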