Re: Record count query parallel processing in databricks spark delta lake

anbutech Mon, 20 Jan 2020 01:25:14 -0800

Thank you so much for the help, Farhan.

Please help me with the design approach for this problem. What is the best
way to write this code to get better results?


I need some clarification on the code.

I want to take a daily record count of the ingestion source vs. the Databricks
Delta Lake table vs. the Snowflake table. How can I combine the 3 different
queries, run each count query in parallel, and take the daily count of each?
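One possible shape for the parallel part, as a minimal sketch: a thread pool on the driver submits one count task per topic per source and groups the results by topic. Spark accepts jobs submitted from separate driver threads, so the same pattern could wrap real queries. The function name collect_counts, the column keys, and the lambda counters are hypothetical stand-ins; in practice each counter would be replaced by something like a spark.read.json(path).count() call, a count on the Delta table, and a Snowflake count query.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_counts(topics, counters, max_workers=8):
    """Run every (topic, source) count concurrently; group the results by topic.

    `counters` maps an output column name to a callable that takes a topic
    name and returns an integer count.
    """
    results = {topic: {} for topic in topics}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per topic per source so all counts can overlap.
        futures = {
            pool.submit(count_fn, topic): (topic, column)
            for topic in topics
            for column, count_fn in counters.items()
        }
        for future, (topic, column) in futures.items():
            # result() re-raises any exception raised by the count function.
            results[topic][column] = future.result()
    return results


# Hypothetical stand-ins; real versions would query S3, Delta Lake, and Snowflake.
demo_counters = {
    "topic_count": lambda topic: 100,
    "databricks_table_count": lambda topic: 100,
    "snowflake_count": lambda topic: 100,
}

demo = collect_counts(["topic1", "topic2"], demo_counters)
print(demo["topic1"])
```

Threads are used here rather than processes because each task mostly waits on a remote query; max_workers would need tuning against what the cluster and Snowflake warehouse can handle.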

I have an s3a bucket with a lot of ingestion topic folders containing JSON
files, partitioned by year, month, day, and hour.

s3a://bucket_name/topic1/year=2019/Month=12/Day=25/hour=01
s3a://bucket_name/topic1/year=2019/Month=12/Day=25/hour=02
s3a://bucket_name/topic1/year=2019/Month=12/Day=25/hour=03
...

s3a://bucket_name/topic1/year=2019/Month=12/Day=25/hour=23

s3a://bucket_name/topic2/year=2019/Month=12/Day=25/hour=01
s3a://bucket_name/topic2/year=2019/Month=12/Day=25/hour=02
s3a://bucket_name/topic2/year=2019/Month=12/Day=25/hour=03
...
s3a://bucket_name/topic2/year=2019/Month=12/Day=25/hour=23

s3a://bucket_name/topic3/year=2019/Month=12/Day=25/hour=01
s3a://bucket_name/topic3/year=2019/Month=12/Day=25/hour=02
s3a://bucket_name/topic3/year=2019/Month=12/Day=25/hour=03
...
s3a://bucket_name/topic3/year=2019/Month=12/Day=25/hour=23
 
 
Similarly for the other 100 topics in the same S3 bucket location, each with
its own topic name.
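Given the layout above, the path for one topic and one day could be built with a small helper, sketched here. The name day_glob is hypothetical, and two assumptions are made: that Month and Day are zero-padded to two digits (the examples only show 12 and 25), and that the reader accepts a Hadoop-style glob such as hour=* to cover every hour folder of the day.

```python
from datetime import date


def day_glob(bucket, topic, day):
    """Return a glob matching all hour folders of one topic for one day."""
    # Assumption: Month and Day are zero-padded, like the hour folders.
    return (
        f"s3a://{bucket}/{topic}/year={day.year}"
        f"/Month={day.month:02d}/Day={day.day:02d}/hour=*"
    )


target_day = date(2019, 12, 25)
paths = {
    topic: day_glob("bucket_name", topic, target_day)
    for topic in ["topic1", "topic2", "topic3"]
}
for topic, path in paths.items():
    print(topic, path)
```

Each resulting path could then be handed to the JSON reader to count one topic's records for that day.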


Output:

Daily count table for all 100 topics.

topics,databricks_table_name,topic_count,databricks_table_count,snowflake_count,Y,M,D
topic1,logtable1,100,100,100,2019,12,25
topic2,logtable2,300,300,300,2019,12,25
topic3,logtable3,500,500,500,2019,12,25
topic4,logtable4,600,100,100,2019,12,25
topic5,logtable5,1000,1000,1000,2019,12,25
topic6,logtable6,200,200,200,2019,12,25
...
topic100,logtable100,2000,2000,2000,2019,12,25
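A minimal sketch of how such rows could be assembled and checked once the three counts per topic are known. The helpers build_row, mismatched, and to_csv are hypothetical names, and the two sample rows reuse the topic3 and topic4 values from the table above. The mismatch check flags any row whose three counts differ, as topic4 does (600 vs. 100 vs. 100).

```python
import csv
import io

HEADER = [
    "topics", "databricks_table_name", "topic_count",
    "databricks_table_count", "snowflake_count", "Y", "M", "D",
]


def build_row(topic, table, counts, year, month, day):
    """Assemble one output row in the column order of HEADER."""
    return [
        topic, table,
        counts["topic_count"],
        counts["databricks_table_count"],
        counts["snowflake_count"],
        year, month, day,
    ]


def mismatched(rows):
    """Return the rows whose three counts are not all equal."""
    return [row for row in rows if len(set(row[2:5])) > 1]


def to_csv(rows):
    """Render the header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    build_row("topic3", "logtable3",
              {"topic_count": 500, "databricks_table_count": 500,
               "snowflake_count": 500}, 2019, 12, 25),
    build_row("topic4", "logtable4",
              {"topic_count": 600, "databricks_table_count": 100,
               "snowflake_count": 100}, 2019, 12, 25),
]
print(to_csv(rows))
print([row[0] for row in mismatched(rows)])
```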

Kindly help me with this problem.




