Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread chen kevin
Hi, you can use Debezium to capture row-level changes in PostgreSQL in real time, stream them to Kafka, and finally ETL and write the data to HBase with Flink/Spark Streaming. You can then join against the data in HBase directly. For a particularly big table, the scan performance in …
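A minimal sketch of that CDC path, assuming a Debezium JSON envelope (no schema wrapper) on a hypothetical topic pg.public.customers with an assumed broker address; the HBase write is left as a placeholder because it is connector- and deployment-specific:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("cdc-to-hbase").getOrCreate()
    import spark.implicits._

    // Debezium change event: "after" holds the row state after the change,
    // "op" is c (create), u (update), d (delete). Columns are assumptions.
    val event = new StructType()
      .add("op", StringType)
      .add("after", new StructType()
        .add("id", LongType)
        .add("name", StringType))

    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed broker
      .option("subscribe", "pg.public.customers")       // assumed topic
      .load()
      .select(from_json($"value".cast("string"), event).as("e"))
      .select($"e.op", $"e.after.*")

    changes.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // Write the micro-batch to HBase here, e.g. via the hbase-spark
        // connector or bulk Puts; omitted because it is deployment-specific.
        batch.show(5, truncate = false) // stand-in for the HBase write
      }
      .start()
      .awaitTermination()

Delete events (op = d) would need a corresponding HBase Delete rather than a Put; that handling is omitted here.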

Separating storage from compute layer with Spark and data warehouses offering ML capabilities

2020-11-29 Thread Mich Talebzadeh
This is a generic question with regard to an optimal design. Many cloud data warehouses like Google BigQuery (BQ) or Oracle Autonomous Data Warehouse (ADW) nowadays offer …

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread chen kevin
1. I think it should not cause memory issues if you configure Kafka, Spark/Flink and HBase properly. * We use this method in our scenario, where the data reaches about 80~150 TB a day. Does your scenario generate more data than that? I think it's the best method to deal with the particularly big …
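A sketch of the kind of configuration this refers to; the values are illustrative assumptions, not recommendations from the thread. The Kafka source option maxOffsetsPerTrigger caps how much data a single micro-batch reads, which bounds per-trigger memory:

    import org.apache.spark.sql.SparkSession

    // Memory and parallelism knobs; tune per cluster, values are made up.
    val spark = SparkSession.builder
      .appName("cdc-to-hbase")
      .config("spark.executor.memory", "8g")
      .config("spark.sql.shuffle.partitions", "200")
      .getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed broker
      .option("subscribe", "pg.public.customers")       // assumed topic
      .option("maxOffsetsPerTrigger", "500000")         // bound each micro-batch
      .load()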

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread Geervan Hayatnagarkar
The real question is twofold: 1) We had to call collect() on each micro-batch. In high-velocity streams this could pull millions of records onto the driver, causing memory issues. Also, it appears that we are manually doing the real join by selecting only the matching rows from the static source. Is there a better way to …
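One common alternative, sketched below under assumed names (the rate source stands in for the real stream; dim_customers, customer_id, and the JDBC URL are hypothetical): do the stream-static join inside foreachBatch so Spark executes it on the executors, with no collect() on the driver. Re-reading the JDBC table each batch also refreshes the static side, which is where a connection pool on the database end helps.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("stream-static-join").getOrCreate()
    import spark.implicits._

    // Hypothetical stream; the built-in "rate" source stands in for Kafka.
    val streamDf = spark.readStream
      .format("rate").option("rowsPerSecond", "100").load()
      .withColumn("customer_id", $"value" % 1000)

    streamDf.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        // Fresh snapshot of the static side for this micro-batch; a
        // connection pool on the database side keeps the repeated reads
        // cheap. Credentials omitted.
        val static = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://host:5432/db") // assumed source
          .option("dbtable", "dim_customers")              // assumed table
          .load()

        // Broadcast keeps the join on the executors; nothing is pulled
        // back to the driver, however large the micro-batch is.
        batch.join(broadcast(static), Seq("customer_id"), "left_outer")
          .show(5, truncate = false) // stand-in for the real sink
      }
      .start()
      .awaitTermination()

This assumes the static side is small enough to broadcast; for a very large static table, the HBase lookup approach suggested earlier in the thread avoids that limit.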