[ 
https://issues.apache.org/jira/browse/KUDU-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886660#comment-16886660
 ] 

Xu Yao commented on KUDU-2673:
------------------------------

[~adar] Thank you very much, this is really useful for us.
However, I think this will increase the network overhead between the 
KuduClient (such as Spark) and Kudu. Because the primary key of the table is 
(`customer_id`, `as_of_start_date`, `system_start_ts`), rows with the same 
`customer_id` will not be merged. We would prefer the primary key of the table 
to be (`customer_id`). Maybe we can extend Kudu's scanner to fix this 
issue.
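
To illustrate the overhead concern, here is a minimal sketch in plain Python
(no Kudu client involved) of the merge a client would have to do itself when
the primary key is (`customer_id`, `as_of_start_date`, `system_start_ts`).
The `Row` type, the `value` column, and `latest_per_customer` are hypothetical
names used only for illustration; they are not part of any Kudu API.

```python
from typing import Dict, Iterable, NamedTuple


class Row(NamedTuple):
    customer_id: int
    as_of_start_date: str  # ISO date, so string order matches date order
    system_start_ts: int
    value: str             # hypothetical payload column


def latest_per_customer(rows: Iterable[Row]) -> Dict[int, Row]:
    """Keep only the newest version of each customer_id (client-side merge)."""
    latest: Dict[int, Row] = {}
    for row in rows:
        seen = latest.get(row.customer_id)
        version = (row.as_of_start_date, row.system_start_ts)
        if seen is None or version > (seen.as_of_start_date, seen.system_start_ts):
            latest[row.customer_id] = row
    return latest


# Every version crosses the network; only one per customer_id is kept.
scanned = [
    Row(1, "2019-07-01", 100, "a"),
    Row(1, "2019-07-02", 200, "b"),
    Row(2, "2019-07-01", 150, "c"),
]
merged = latest_per_customer(scanned)
print(len(scanned), "rows transferred,", len(merged), "rows kept")
```

With (`customer_id`) as the primary key, or with a scanner that merged
versions on the server side, only the kept rows would need to be transferred.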


> Event timestamp support with kudu.
> ----------------------------------
>
>                 Key: KUDU-2673
>                 URL: https://issues.apache.org/jira/browse/KUDU-2673
>             Project: Kudu
>          Issue Type: New Feature
>          Components: java, spark, tserver
>            Reporter: yangz
>            Priority: Major
>              Labels: features, roadmap-candidate
>
> Kudu has the ability to read historical data, but it is based on the 
> timestamp produced by Kudu's transaction and MVCC system. Relying on that 
> timestamp greatly weakens usability.
> In our use case, we write data to Kudu from a data stream, using range 
> partitioning by day.
> We want to get hourly versions from Kudu, so we need to read historical data 
> from Kudu.
> That data is produced from the undo files. But when a user gives a timestamp, 
> they mean the time at which the event happened, which is associated with the 
> data, not the timestamp Kudu produced. 
> So we need a way to set an event timestamp in Kudu.
> Finally, we found a way to solve this problem.
> But our solution has two limitations:
>  # We only update the table one row at a time, and each row carries its own 
> timestamp.
>  # To get the right historical version of the data, the data stream must 
> send data in event-time order.
> Despite these limitations, it satisfies our current business needs.
>  
> Our implementation also partly solves the out-of-order event time problem 
> if you only need the newest data, which does not require reading the undo 
> files.
> For data sent into Kudu with t1 < t2:
> t1 upsert -> t2 upsert      ->    newest will be the t2 value
> t2 upsert -> t1 upsert      ->    current Kudu implementation: t1; our 
> implementation: t2.
>  
> Maybe our solution is not the best one for the problem, but I think Kudu 
> snapshot reads should support event time.
> Our solution does not cover all use cases, but I hope it will be useful to 
> the community in some cases.
>  
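
The upsert ordering described in the quoted issue can be sketched as an
in-memory model in which an upsert is applied only if its event timestamp is
not older than the one already stored. This is an illustration of the stated
behaviour, not the reporter's actual patch and not Kudu code; the
`EventTimeStore` class and its methods are hypothetical, and letting the later
arrival win on equal timestamps is an assumption.

```python
from typing import Dict, Tuple


class EventTimeStore:
    """Toy key-value store where the newest event time wins, not arrival order."""

    def __init__(self) -> None:
        # key -> (event timestamp, value)
        self._rows: Dict[str, Tuple[int, str]] = {}

    def upsert(self, key: str, event_ts: int, value: str) -> bool:
        """Apply the write unless a strictly newer event is already stored."""
        current = self._rows.get(key)
        if current is not None and current[0] > event_ts:
            return False  # late-arriving older event is ignored
        self._rows[key] = (event_ts, value)
        return True

    def read_latest(self, key: str) -> str:
        return self._rows[key][1]


t1, t2 = 1, 2  # t1 < t2, as in the issue description

in_order = EventTimeStore()
in_order.upsert("row", t1, "t1 value")
in_order.upsert("row", t2, "t2 value")

out_of_order = EventTimeStore()
out_of_order.upsert("row", t2, "t2 value")
out_of_order.upsert("row", t1, "t1 value")

# Both arrival orders end with the t2 value as the newest data.
print(in_order.read_latest("row"), "|", out_of_order.read_latest("row"))
```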



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
