Thank you Gourav. Today I saw this article: https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important It also seems interesting. I was in a meeting; I will watch it as well.
From: Gourav Sengupta <gourav.sengu...@gmail.com>
Date: 24 October 2018 Wednesday 13:39
To: "Ozsakarya, Omer" <omer.ozsaka...@sony.com>
Cc: Spark Forum <user@spark.apache.org>
Subject: Re: Triggering SQL on AWS S3 via Apache Spark

Also try to read about SCD; Hive may be a very good alternative as well for running updates on data.

Regards,
Gourav

On Wed, 24 Oct 2018, 14:53, <omer.ozsaka...@sony.com> wrote:

Thank you very much 😊

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Date: 24 October 2018 Wednesday 11:20
To: "Ozsakarya, Omer" <omer.ozsaka...@sony.com>
Cc: Spark Forum <user@spark.apache.org>
Subject: Re: Triggering SQL on AWS S3 via Apache Spark

This is interesting: you asked and then (almost) answered the questions as well.

Regards,
Gourav

On Tue, 23 Oct 2018, 13:23, <omer.ozsaka...@sony.com> wrote:

Hi guys,

We are using Apache Spark on a local machine. I need to implement the scenario below.

In the initial load:
1. The CRM application will send a file to a folder on the local server. This file contains customer information for all customers. The file name is customer.tsv.
   * customer.tsv contains customerid, country, birth_month, activation_date, etc.
2. I need to read the contents of customer.tsv.
3. I will add current timestamp info to the file.
4. I will transfer customer.tsv to the S3 bucket: customer.history.data

In the daily loads:
1. The CRM application will send a new file which contains the updated/deleted/inserted customer information. The file name is daily_customer.tsv.
   * daily_customer.tsv contains customerid, cdc_field, country, birth_month, activation_date, etc. The cdc field can be New-Customer, Customer-is-Updated, or Customer-is-Deleted.
2. I need to read the contents of daily_customer.tsv.
3. I will add current timestamp info to the file.
4. I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
5. I need to merge the two buckets customer.history.data and customer.daily.data.
   * Both buckets have timestamp fields, so I need to query all records whose timestamp is the latest one.
   * I can use row_number() over (partition by customer_id order by timestamp_field desc) as version_number.
   * Then I can put the records whose version is one into the final bucket: customer.dimension.data

I am running Spark on premises.
* Can I query AWS S3 buckets by using Spark SQL / DataFrames or RDDs on a local Spark cluster?
* Is this approach efficient? Will the queries transfer all the historical data from AWS S3 to the local cluster?
* How can I implement this scenario in a more effective way, e.g. just transferring the daily data to AWS S3 and then running the queries on AWS?
* For instance, Athena can query data on AWS, but it is just a query engine. As far as I know, I cannot call it by using an SDK and I cannot write the results to a bucket/folder.

Thanks in advance,
Ömer
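
For what it is worth, here is a minimal sketch of the flow described above in Spark/Scala. It assumes Spark 2.3+ with the hadoop-aws and AWS SDK jars on the classpath and S3 credentials configured for the s3a:// filesystem; the local file paths, the CustomerMerge object name, and the exact column handling (e.g. treating history rows as "New-Customer") are my own assumptions for illustration, not tested code.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

object CustomerMerge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("customer-cdc-merge")
      .getOrCreate()

    // Initial load: read the full customer extract, stamp it, push to S3.
    val history = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .csv("file:///data/crm/customer.tsv")          // assumed local path
      .withColumn("load_ts", current_timestamp())

    history.write.mode("append")
      .parquet("s3a://customer.history.data/customers/")

    // Daily load: read the CDC extract, stamp it, push to S3.
    val daily = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .csv("file:///data/crm/daily_customer.tsv")    // assumed local path
      .withColumn("load_ts", current_timestamp())

    daily.write.mode("append")
      .parquet("s3a://customer.daily.data/customers/")

    // Merge: union both data sets and keep only the latest record per customer.
    val unioned = spark.read.parquet("s3a://customer.history.data/customers/")
      .withColumn("cdc_field", lit("New-Customer"))  // assumption: history rows carry no CDC flag
      .unionByName(spark.read.parquet("s3a://customer.daily.data/customers/"))

    val w = Window.partitionBy("customerid").orderBy(col("load_ts").desc)

    val dimension = unioned
      .withColumn("version_number", row_number().over(w))
      .filter(col("version_number") === 1)
      // Optional (not stated above): drop customers whose latest change is a delete.
      .filter(col("cdc_field") =!= "Customer-is-Deleted")
      .drop("version_number")

    dimension.write.mode("overwrite")
      .parquet("s3a://customer.dimension.data/customers/")

    spark.stop()
  }
}

Storing the S3 copies as Parquet rather than TSV lets Spark prune columns and skip files, which reduces the amount of historical data pulled over the network to the on-premises cluster when the merge query runs.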