[
https://issues.apache.org/jira/browse/IMPALA-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063331#comment-18063331
]
ASF subversion and git services commented on IMPALA-14768:
----------------------------------------------------------
Commit 52f3943049884412b1f1fb0376dd29f5a861400b in impala's branch
refs/heads/master from Fang-Yu Rao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=52f394304 ]
IMPALA-14768: Add the operation type to the lineage graph
This patch makes Impala record the operation type of the completed
query in the corresponding lineage event, so that data lineage tools
like Apache Atlas can read the operation type of a given query
directly. Currently, Apache Atlas determines the operation type of an
Impala query by matching the 'queryText' field of the lineage event
against predefined regular expressions. Refer to
https://github.com/apache/atlas/blob/2957ff2/addons/impala-bridge/src/main/java/org/apache/atlas/impala/hook/ImpalaOperationParser.java#L49-L77
for more details.
However, this approach is not robust. Recall that the string in
'queryText' is produced by the Impala server, which replaces each
newline in the original query string with a space and then redacts the
result. Thus, 'queryText' may no longer be a valid SQL statement.
    string stmt =
        replace_all_copy(query_ctx->client_request.stmt, "\n", " ");
    Redact(&stmt);
    // 'redacted_stmt' will be the string Impala uses to populate
    // 'queryText' of the lineage event.
    query_ctx->client_request.__set_redacted_stmt((const string) stmt);
For instance, when the original query string contains a one-line SQL
comment, it is difficult to decide where that comment ends once every
newline in the original query string has been replaced with a space.
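The ambiguity can be illustrated with a minimal, self-contained sketch.
FlattenNewlines and StripLineComments are hypothetical helpers written
only for this example; they are not Impala or Atlas code, the comment
stripper is deliberately naive, and redaction is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: replaces every newline with a space, similar in
// spirit to the replace_all_copy() call quoted above.
std::string FlattenNewlines(std::string stmt) {
  std::replace(stmt.begin(), stmt.end(), '\n', ' ');
  return stmt;
}

// Hypothetical helper: removes everything from "--" to the end of each
// line. Naive on purpose: it ignores "--" inside string literals and
// does not handle /* ... */ comments.
std::string StripLineComments(const std::string& stmt) {
  std::string out;
  std::size_t pos = 0;
  while (pos < stmt.size()) {
    std::size_t eol = stmt.find('\n', pos);
    if (eol == std::string::npos) eol = stmt.size();
    std::string line = stmt.substr(pos, eol - pos);
    const std::size_t comment = line.find("--");
    if (comment != std::string::npos) line.erase(comment);
    out += line;
    if (eol < stmt.size()) out += '\n';
    pos = eol + 1;
  }
  return out;
}

// Shows that the statement survives comment stripping only while the
// newline that terminates the comment is still present.
void CheckCommentBoundaryLost() {
  const std::string original =
      "-- load staging data\ninsert into t select * from s";
  // With the newline intact, only the comment text is removed.
  assert(StripLineComments(original) == "\ninsert into t select * from s");
  // After flattening, the comment runs to the end of the string and the
  // INSERT statement is swallowed along with it.
  assert(StripLineComments(FlattenNewlines(original)).empty());
}
```

Once the newline is gone, nothing in the flattened text marks the end
of the comment, so a consumer cannot tell the comment from the
statement by looking at the text alone.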
Therefore, with this patch, data lineage tools can determine the
operation type much more easily, since it is provided directly in the
lineage log.
In addition to checking 'operationType_', this patch also makes
PlannerTest#testLineage() check the 'queryStr_' field of
ColumnLineageGraph when testLineage() compares the actual lineage graph
with the expected one in the lineage.test file run in the frontend
test.
Testing:
- Added a new test case to the lineage.test run in the end-to-end test
to show that Impala can produce a lineage event for INSERT OVERWRITE.
- Updated the lineage.test files run in the end-to-end and frontend
tests to make sure each lineage event comes with its respective
operation type.
Change-Id: Icb94120a9bb1b994d4e681ea98521035bcc6510e
Reviewed-on: http://gerrit.cloudera.org:8080/24018
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Add operation type to the lineage graph
> ---------------------------------------
>
> Key: IMPALA-14768
> URL: https://issues.apache.org/jira/browse/IMPALA-14768
> Project: IMPALA
> Issue Type: Task
> Components: Frontend
> Reporter: Fang-Yu Rao
> Assignee: Fang-Yu Rao
> Priority: Major
>
> Currently, a lineage event log produced by Impala does not include the
> information about the operation type.
> {code}
> {"queryText":"create table test_db_01.test_tbl_01 (id int)","queryId":"b44da06a10682ce9:286bd74300000000","hash":"7debad31b299d7cccdf78a67968eb39d","user":"[email protected]","timestamp":1771622004,"endTime":1771622005,"edges":[],"vertices":[]}
> {code}
> However, some lineage event processing tools, e.g., Atlas, require this piece
> of information. To derive the operation type, tools like
> https://github.com/apache/atlas/blob/master/addons/impala-bridge/src/main/java/org/apache/atlas/impala/hook/ImpalaLineageHook.java
> rely on regular expressions in
> https://github.com/apache/atlas/blob/14246fe/addons/impala-bridge/src/main/java/org/apache/atlas/impala/hook/ImpalaOperationParser.java#L30-L65
> to determine the operation type of the logged lineage event. But such
> regular expressions are not able to determine the operation type in all
> cases. One such example is when the SQL statement contains a one-line comment.
> One solution to the aforementioned issue is to make sure the query text of a
> lineage event is a valid SQL statement (IMPALA-14741).
> An alternative is for Impala to add an additional field in its lineage graph
> to indicate the operation type. Once Impala is able to log the operation type
> in a lineage event, we could change the logic in the Atlas hook that derives the
> operation type.