Re: How to delete the record

2022-01-30 Thread Gourav Sengupta
Hi, I think it will be useful to understand the problem before solving it. Can I please ask what this table is? Is it a fact (event store) kind of table, or a dimension (master data) kind of table? And what are the downstream consumers of this table? Besides that, what is the unique

Re: How to delete the record

2022-01-27 Thread ayan guha
Btw, the 2 options Mich explained are not mutually exclusive. Option 2 can and should be implemented over a Delta Lake table anyway, especially if you need to do hard deletes eventually (e.g. for regulatory needs).
On Fri, 28 Jan 2022 at 6:50 am, Sid Kal wrote:
> Thanks Mich and Sean for your time
>

Re: How to delete the record

2022-01-27 Thread Sid Kal
Thanks Mich and Sean for your time
On Fri, 28 Jan 2022, 00:53 Mich Talebzadeh, wrote:
> Yes I believe so.
>
> Check this article of mine dated early 2019 but will have some relevance
> to what I am implying.
>
> https://www.linkedin.com/pulse/real-time-data-streaming-big-typical-use-cases-tale

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Yes, I believe so. Check this article of mine, dated early 2019, which still has some relevance to what I am implying. https://www.linkedin.com/pulse/real-time-data-streaming-big-typical-use-cases-talebzadeh-ph-d-/ HTH

Re: How to delete the record

2022-01-27 Thread Sid Kal
Okay, sounds good. So the two options below would help me capture CDC changes:
1) Delta Lake
2) Maintaining a snapshot of records with some indicators and a timestamp.
Correct me if I'm wrong.
Thanks, Sid
On Thu, 27 Jan 2022, 23:59 Mich Talebzadeh, wrote:
> There are two ways of doing it.
>

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
There are two ways of doing it.
1. Through a snapshot, meaning an immutable snapshot of the state of the table at a given version. For example, the state of a Delta table

Re: How to delete the record

2022-01-27 Thread Sean Owen
Delta, for example, manages merge/append/delete and also keeps previous states of the table's data, so you can query what it looked like before. See delta.io
On Thu, Jan 27, 2022, 11:54 AM Sid Kal wrote:
> Hi Sean,
>
> So you mean if I use those file formats it will do the work of CDC
> automati

Re: How to delete the record

2022-01-27 Thread Sid Kal
Hi Sean, So you mean if I use those file formats it will do the work of CDC automatically, or would I have to handle it via code?
Hi Mich, Not sure if I understood you. Let me try to explain my scenario. Suppose there is an Id "1" which is inserted today, so I transformed and ingested it. Now sup

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Sid, How do you cater for updates? Do you add each update as a new record without touching the original record? This approach allows you to see the history of the records, i.e. inserted once, deleted once and updated *n* times throughout the Entity Life History of the record. So your mileage var
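
The append-only approach described above can be sketched in plain Python. This is only an illustration of the idea, not Spark code and not how any particular storage engine implements it. The field names `id`, `op_type` and `op_time`, the one-letter codes `I`/`U`/`D`, and the helpers `record_history` and `current_state` are all assumptions made for the example.

```python
def record_history(history, record_id):
    """Return every version of one record, oldest first (its life history)."""
    versions = [r for r in history if r["id"] == record_id]
    return sorted(versions, key=lambda r: r["op_time"])


def current_state(history):
    """Derive the current table from an append-only log of changes.

    The newest entry per id wins; ids whose newest entry is a delete
    ("D") are left out. Nothing in `history` is modified.
    """
    latest = {}
    # ISO-8601 timestamps in the same format sort correctly as strings.
    for rec in sorted(history, key=lambda r: r["op_time"]):
        latest[rec["id"]] = rec
    return {k: r for k, r in latest.items() if r["op_type"] != "D"}


# Hypothetical change log: id 1 is inserted, updated, then deleted.
history = [
    {"id": 1, "amount": 10, "op_type": "I", "op_time": "2022-01-25T08:00:00"},
    {"id": 2, "amount": 50, "op_type": "I", "op_time": "2022-01-25T08:05:00"},
    {"id": 1, "amount": 15, "op_type": "U", "op_time": "2022-01-26T09:00:00"},
    {"id": 1, "amount": 15, "op_type": "D", "op_time": "2022-01-27T10:00:00"},
]

live = current_state(history)
```

Because old rows are never touched, the full history of id 1 stays queryable through `record_history(history, 1)` even after it no longer appears in the current state.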

Re: How to delete the record

2022-01-27 Thread Sean Owen
This is what storage engines like Delta, Hudi, Iceberg are for. No need to manage it manually or use a DBMS. These formats allow deletes, upserts, etc. of data, using Spark, on cloud storage.
On Thu, Jan 27, 2022 at 10:56 AM Mich Talebzadeh wrote:
> Where ETL data is stored?
>
> *But now the

Re: How to delete the record

2022-01-27 Thread Sid Kal
Hi Mich, Thanks for your time. Data is stored in S3 via DMS and is read by the Spark jobs. How can I mark a record as a soft delete? Any small snippet / link / example would help.
Thanks, Sid
On Thu, 27 Jan 2022, 22:26 Mich Talebzadeh, wrote:
> Where ETL data is stored?
>
> *But now

Re: How to delete the record

2022-01-27 Thread Mich Talebzadeh
Where is the ETL data stored? *But now the main problem is when the record at the source is deleted, it should be deleted in my final transformed record too.* If your final sink (storage) is a data warehouse, it should be soft flagged with op_type (Insert/Update/Delete) and op_time (timestamp). H
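
A minimal sketch of the soft-flag idea, in plain Python rather than Spark so that it runs anywhere. The column names `op_type` and `op_time` come from the message above. The one-letter codes `I`/`U`/`D`, the event layout, and the helper names `apply_cdc_event` and `active_rows` are assumptions made for illustration only.

```python
def apply_cdc_event(table, event):
    """Apply one change event to `table`, a dict keyed by record id.

    A delete does not remove the row. It keeps the last known values
    and only sets op_type to "D" and op_time to the delete time.
    """
    key = event["id"]
    existing = table.get(key, {"id": key})
    if event["op_type"] == "D":
        # Soft delete: flag the row, keep its data.
        table[key] = {**existing, "op_type": "D", "op_time": event["op_time"]}
    else:
        # Insert or update: new values overwrite old ones.
        table[key] = {**existing, **event}
    return table


def active_rows(table):
    """Rows that downstream consumers should treat as live."""
    return [row for row in table.values() if row["op_type"] != "D"]


# Hypothetical events: id 1 is inserted and later deleted at the source.
events = [
    {"id": 1, "name": "alice", "op_type": "I", "op_time": "2022-01-27T10:00:00"},
    {"id": 2, "name": "bob", "op_type": "I", "op_time": "2022-01-27T10:05:00"},
    {"id": 1, "op_type": "D", "op_time": "2022-01-28T09:00:00"},
]

table = {}
for event in events:
    apply_cdc_event(table, event)
```

After the loop, the row for id 1 is still in `table` with `op_type` set to `D`, while `active_rows(table)` returns only the row for id 2. A hard delete, if needed later, can then be a separate step that removes the flagged rows.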