[jira] [Commented] (FLINK-12427) Translate the "Flink DataStream API Programming Guide" page into Chinese
[ https://issues.apache.org/jira/browse/FLINK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858336#comment-16858336 ] xiaogang zhou commented on FLINK-12427:
---
May I translate this?
> Translate the "Flink DataStream API Programming Guide" page into Chinese
> -------------------------------------------------------------------------
>
>                 Key: FLINK-12427
>                 URL: https://issues.apache.org/jira/browse/FLINK-12427
>             Project: Flink
>          Issue Type: Sub-task
>          Components: chinese-translation, Documentation
>    Affects Versions: 1.9.0
>            Reporter: YangFei
>            Assignee: YangFei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
> [https://ci.apache.org/projects/flink/flink-docs-master/dev/datastream_api.html]
> The file is located at /flink/docs/dev/datastream_api.zh.md
[jira] [Commented] (FLINK-12427) Translate the "Flink DataStream API Programming Guide" page into Chinese
[ https://issues.apache.org/jira/browse/FLINK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858337#comment-16858337 ] xiaogang zhou commented on FLINK-12427:
---
[~yangfei] It would be great if I could handle this page. If it is not available, would you please guide me to another translation issue? I just want to take the chance to get familiar with the pull request process.
[jira] [Commented] (FLINK-12427) Translate the "Flink DataStream API Programming Guide" page into Chinese
[ https://issues.apache.org/jira/browse/FLINK-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16864928#comment-16864928 ] xiaogang zhou commented on FLINK-12427:
---
[~yangfei] Any feedback, buddy?
[jira] [Commented] (FLINK-12894) Document how to configure and use catalogs in SQL CLI in Chinese
[ https://issues.apache.org/jira/browse/FLINK-12894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867154#comment-16867154 ] xiaogang zhou commented on FLINK-12894:
---
[~phoenixjiangnan] Can I translate this?
> Document how to configure and use catalogs in SQL CLI in Chinese
> -----------------------------------------------------------------
>
>                 Key: FLINK-12894
>                 URL: https://issues.apache.org/jira/browse/FLINK-12894
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / Hive, Documentation, Table SQL / Client
>            Reporter: Bowen Li
>            Priority: Major
>             Fix For: 1.9.0
[jira] [Issue Comment Deleted] (FLINK-12894) Document how to configure and use catalogs in SQL CLI in Chinese
[ https://issues.apache.org/jira/browse/FLINK-12894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-12894:
--
Comment: was deleted (was: [~phoenixjiangnan] can i translate this?)
[jira] [Commented] (FLINK-33249) comment should be parsed by StringLiteral() instead of SqlCharStringLiteral to avoid parsing failure
[ https://issues.apache.org/jira/browse/FLINK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1035#comment-1035 ] xiaogang zhou commented on FLINK-33249:
---
Hi [~martijnvisser], https://issues.apache.org/jira/browse/CALCITE-6001 will improve Calcite to omit the charset from the generated literal when it is the default charset of the dialect. Maybe we should wait for a future Calcite version and set the Flink dialect to use UTF-8.
> comment should be parsed by StringLiteral() instead of SqlCharStringLiteral
> to avoid parsing failure
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-33249
>                 URL: https://issues.apache.org/jira/browse/FLINK-33249
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Planner
>    Affects Versions: 1.17.1
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> This problem is also recorded in Calcite:
> https://issues.apache.org/jira/browse/CALCITE-6046
>
> Hi, I found this problem when I used the code below to split SQL statements. The
> process is SQL string -> SqlNode -> SQL string:
> {code:java}
> // code placeholder
> SqlParser.Config parserConfig = getCurrentSqlParserConfig(sqlDialect);
> SqlParser sqlParser = SqlParser.create(sqlContent, parserConfig);
> SqlNodeList sqlNodeList = sqlParser.parseStmtList();
> sqlParser.parse(sqlNodeList.get(0));{code}
> The dialect / SqlConformance is a customized one:
> [https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/validate/FlinkSqlConformance.java]
>
> Then I found that the SQL below
> {code:java}
> // code placeholder
> CREATE TABLE source (
>     a BIGINT
> ) comment '测试test'
> WITH (
>     'connector' = 'test'
> ); {code}
> is transformed to
> {code:java}
> // code placeholder
> CREATE TABLE `source` (
>     `a` BIGINT
> )
> COMMENT u&'\5218\51eftest' WITH (
>     'connector' = 'test'
> ) {code}
> and the SQL parser template is like
> {code:java}
> // code placeholder
> SqlCreate SqlCreateTable(Span s, boolean replace, boolean isTemporary) :
> {
>     final SqlParserPos startPos = s.pos();
>     boolean ifNotExists = false;
>     SqlIdentifier tableName;
>     List<SqlTableConstraint> constraints = new ArrayList<SqlTableConstraint>();
>     SqlWatermark watermark = null;
>     SqlNodeList columnList = SqlNodeList.EMPTY;
>     SqlCharStringLiteral comment = null;
>     SqlTableLike tableLike = null;
>     SqlNode asQuery = null;
>     SqlNodeList propertyList = SqlNodeList.EMPTY;
>     SqlNodeList partitionColumns = SqlNodeList.EMPTY;
>     SqlParserPos pos = startPos;
> }
> {
>     <TABLE>
>     ifNotExists = IfNotExistsOpt()
>     tableName = CompoundIdentifier()
>     [
>         <LPAREN> { pos = getPos(); TableCreationContext ctx = new TableCreationContext();}
>         TableColumn(ctx)
>         (
>             <COMMA> TableColumn(ctx)
>         )*
>         {
>             pos = pos.plus(getPos());
>             columnList = new SqlNodeList(ctx.columnList, pos);
>             constraints = ctx.constraints;
>             watermark = ctx.watermark;
>         }
>         <RPAREN>
>     ]
>     [ <COMMENT> <QUOTED_STRING> {
>         String p = SqlParserUtil.parseString(token.image);
>         comment = SqlLiteral.createCharString(p, getPos());
>     }]
>     [
>         <PARTITIONED> <BY>
>         partitionColumns = ParenthesizedSimpleIdentifierList()
>     ]
>     [
>         <WITH>
>         propertyList = TableProperties()
>     ]
>     [
>         <LIKE>
>         tableLike = SqlTableLike(getPos())
>         {
>             return new SqlCreateTableLike(startPos.plus(getPos()),
>                 tableName,
>                 columnList,
>                 constraints,
>                 propertyList,
>                 partitionColumns,
>                 watermark,
>                 comment,
>                 tableLike,
>                 isTemporary,
>                 ifNotExists);
>         }
>     |
>         <AS>
>         asQuery = OrderedQueryOrExpr(ExprContext.ACCEPT_QUERY)
>         {
>             return new SqlCreateTableAs(startPos.plus(getPos()),
>                 tableName,
>                 columnList,
>                 constraints,
>                 propertyList,
>                 partitionColumns,
>                 watermark,
>                 comment,
>                 asQuery,
>                 isTemporary,
>                 ifNotExists);
>         }
>     ]
>     {
>         return new SqlCreateTable(startPos.plus(getPos()),
>             tableName,
>             columnList,
>             constraints,
>             propertyList,
>             partitionColumns,
>             watermark,
>             comment,
>             isTemporary,
>             ifNotExists);
>     }
> } {code}
> which gives an exception:
> Caused by: org.apache.calcite.sql.parser.SqlParseException: Encounter
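For illustration, a minimal sketch of the round trip the ticket describes, assuming the flink-sql-parser module is on the classpath; the wrapper class and the exact config wiring here are illustrative, not the reporter's original code:
{code:java}
import org.apache.calcite.sql.SqlNodeList;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.flink.sql.parser.impl.FlinkSqlParserImpl;
import org.apache.flink.sql.parser.validate.FlinkSqlConformance;

public class CommentRoundTrip {
    public static void main(String[] args) throws Exception {
        // Assumed wiring: Flink's DDL grammar and conformance plugged into a
        // Calcite parser config, similar to the setup quoted in the ticket.
        SqlParser.Config config = SqlParser.config()
                .withParserFactory(FlinkSqlParserImpl.FACTORY)
                .withConformance(FlinkSqlConformance.DEFAULT);
        SqlParser parser = SqlParser.create(
                "CREATE TABLE source (a BIGINT) COMMENT '测试test' WITH ('connector' = 'test')",
                config);
        SqlNodeList stmts = parser.parseStmtList();
        // Unparsing re-renders the comment literal; characters outside the
        // default charset come back as COMMENT u&'...' escapes, which a second
        // parse pass may then reject.
        System.out.println(stmts.get(0).toString());
    }
}
{code}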
[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806640#comment-17806640 ] xiaogang zhou commented on FLINK-33728:
---
[~xtsong] [~wangyang0918] OK, glad to hear that. Would you please help assign the ticket to me?
> do not rewatch when KubernetesResourceManagerDriver watch fail
> ---------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I met a massive production problem when the Kubernetes etcd responded slowly.
> After Kubernetes recovered an hour later, thousands of Flink jobs using
> KubernetesResourceManagerDriver re-watched on receiving ResourceVersionTooOld,
> which put great pressure on the API server and made it fail again...
>
> I am not sure it is necessary to call
> getResourceEventHandler().onError(throwable)
> in the PodCallbackHandlerImpl#handleError method.
>
> We can just neglect the disconnection of the watching process and try to
> re-watch once a new requestResource is called. And we can leverage the Akka
> heartbeat timeout to discover TM failures, just like YARN mode does.
[jira] [Updated] (FLINK-33728) Do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33728:
--
Summary: Do not rewatch when KubernetesResourceManagerDriver watch fail (was: do not rewatch when KubernetesResourceManagerDriver watch fail)
[jira] [Commented] (FLINK-17593) Support arbitrary recovery mechanism for PartFileWriter
[ https://issues.apache.org/jira/browse/FLINK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814195#comment-17814195 ] xiaogang zhou commented on FLINK-17593:
---
[~maguowei] Hi, for a filesystem that does not support a recoverable writer (like the OSS filesystem), it looks like we cannot use the filesystem connector?
> Support arbitrary recovery mechanism for PartFileWriter
> ---------------------------------------------------------
>
>                 Key: FLINK-17593
>                 URL: https://issues.apache.org/jira/browse/FLINK-17593
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / FileSystem
>            Reporter: Yun Gao
>            Assignee: Guowei Ma
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
> Currently Bucket relies directly on the _RecoverableOutputStream_ provided by
> FileSystem to achieve snapshotting and recovery of the in-progress part file for
> all the PartFileWriter implementations. This would require that the
> PartFileWriter must be based on the OutputStream.
> To support the path-based PartFileWriter required by the Hive sink, we will
> first need to abstract the snapshotting mechanism of the PartFileWriter and
> make RecoverableOutputStream one type of implementation, so that we can
> decouple PartFileWriter from the output streams.
[jira] [Created] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
xiaogang zhou created FLINK-33728:
--
             Summary: do not rewatch when KubernetesResourceManagerDriver watch fail
                 Key: FLINK-33728
                 URL: https://issues.apache.org/jira/browse/FLINK-33728
             Project: Flink
          Issue Type: New Feature
          Components: Deployment / Kubernetes
            Reporter: xiaogang zhou
[jira] [Updated] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33728:
--
Description:
Is it necessary to call getResourceEventHandler().onError(throwable) in the PodCallbackHandlerImpl#handleError method? We can just neglect the disconnection of the watching process and try to re-watch once a new requestResource is called.
[jira] [Updated] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33728:
--
Description:
I met a massive production problem when the Kubernetes etcd responded slowly. After Kubernetes recovered an hour later, thousands of Flink jobs using KubernetesResourceManagerDriver re-watched on receiving ResourceVersionTooOld, which put great pressure on the API server and made it fail again...

I am not sure it is necessary to call getResourceEventHandler().onError(throwable) in the PodCallbackHandlerImpl#handleError method.

We can just neglect the disconnection of the watching process and try to re-watch once a new requestResource is called. And we can leverage the Akka heartbeat timeout to discover TM failures, just like YARN mode does.

was:
Is it necessary to call getResourceEventHandler().onError(throwable) in the PodCallbackHandlerImpl#handleError method? We can just neglect the disconnection of the watching process and try to re-watch once a new requestResource is called.
[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793092#comment-17793092 ] xiaogang zhou commented on FLINK-33728:
---
[~wangyang0918] [~mapohl] [~gyfora] Would you please let me know your thoughts?
[jira] [Created] (FLINK-33741) introduce stat dump period and statsLevel configuration
xiaogang zhou created FLINK-33741:
--
             Summary: introduce stat dump period and statsLevel configuration
                 Key: FLINK-33741
                 URL: https://issues.apache.org/jira/browse/FLINK-33741
             Project: Flink
          Issue Type: New Feature
            Reporter: xiaogang zhou

I'd like to introduce two RocksDB statistics-related configuration options. Then we can customize the stats:
{code:java}
// code placeholder
Statistics s = new Statistics();
s.setStatsLevel(EXCEPT_TIME_FOR_MUTEX);
currentOptions.setStatsDumpPeriodSec(internalGetOption(RocksDBConfigurableOptions.STATISTIC_DUMP_PERIOD))
    .setStatistics(s);
{code}
[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794900#comment-17794900 ] xiaogang zhou commented on FLINK-33728:
---
[~gyfora] My proposal is to keep the JobManager running even if the re-watch fails.
A healthy watch listener can get two kinds of notifications from Kubernetes: pod added and pod deleted.
1. Pod added is necessary when requesting resources; when we are not requesting resources, this notification is allowed to be lost.
2. Pod deleted lets us detect a pod failure more quickly, but we can also discover it through the Akka heartbeat timeout.
According to the statements above, we can tolerate the loss of the watch connection when we are not requesting resources.
[jira] [Comment Edited] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794900#comment-17794900 ] xiaogang zhou edited comment on FLINK-33728 at 12/9/23 9:52 AM:
---
[~gyfora] My proposal is to keep the JobManager running after the watch fails, and not to re-watch before the next resource request is called.
A healthy watch listener can get two kinds of notifications from Kubernetes: pod added and pod deleted.
1. Pod added is necessary when requesting resources; when we are not requesting resources, this notification is allowed to be lost.
2. Pod deleted lets us detect a pod failure more quickly, but we can also discover it through the Akka heartbeat timeout.
According to the statements above, we can tolerate the loss of the watch connection when we are not requesting resources.

was (Author: zhoujira86):
[~gyfora] My proposal is to keep the JobManager running even if the re-watch fails.
A healthy watch listener can get two kinds of notifications from Kubernetes: pod added and pod deleted.
1. Pod added is necessary when requesting resources; when we are not requesting resources, this notification is allowed to be lost.
2. Pod deleted lets us detect a pod failure more quickly, but we can also discover it through the Akka heartbeat timeout.
According to the statements above, we can tolerate the loss of the watch connection when we are not requesting resources.
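As a rough illustration of this lazy re-watch proposal, a minimal sketch; the names watchBroken, watchPods and createTaskManagerPod are illustrative, not the actual KubernetesResourceManagerDriver API:
{code:java}
/** Illustrative sketch only: ignore watch errors, re-watch lazily. */
class LazyRewatchDriver {

    private volatile boolean watchBroken = false;

    /** Pod watch error callback, e.g. on ResourceVersionTooOld. */
    void handleWatchError(Throwable t) {
        // Do not call getResourceEventHandler().onError(t), which would
        // eventually kill the JobManager; just remember the watch is gone.
        watchBroken = true;
    }

    /** Called whenever the ResourceManager requests a new worker. */
    void requestResource() {
        if (watchBroken) {
            watchPods();           // re-establish the watch only now
            watchBroken = false;
        }
        createTaskManagerPod();
        // A TM lost while the watch was down is still detected through the
        // Akka heartbeat timeout, as in YARN mode.
    }

    private void watchPods() { /* re-create the Kubernetes pod watch */ }

    private void createTaskManagerPod() { /* submit the pod request */ }
}
{code}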
[jira] [Commented] (FLINK-33741) introduce stat dump period and statsLevel configuration
[ https://issues.apache.org/jira/browse/FLINK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794941#comment-17794941 ] xiaogang zhou commented on FLINK-33741:
---
And I think we can also parse the multi-line string of the RocksDB statistics. Then we can directly export these latency numbers as metrics.
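A minimal sketch of the idea using the plain org.rocksdb JNI API; the Flink state-backend wiring and the STATISTIC_DUMP_PERIOD option from the ticket are elided, and the database path is arbitrary. It sets the two proposed knobs and reads a histogram directly instead of parsing the dump string:
{code:java}
import org.rocksdb.HistogramData;
import org.rocksdb.HistogramType;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.Statistics;
import org.rocksdb.StatsLevel;

public class RocksStatsSketch {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        try (Statistics stats = new Statistics();
             Options opts = new Options().setCreateIfMissing(true)) {
            // The two knobs proposed in the ticket: stats level and dump period.
            stats.setStatsLevel(StatsLevel.EXCEPT_TIME_FOR_MUTEX);
            opts.setStatistics(stats).setStatsDumpPeriodSec(300);
            try (RocksDB db = RocksDB.open(opts, "/tmp/rocks-stats-demo")) {
                db.put("k".getBytes(), "v".getBytes());
                db.get("k".getBytes());
                // Time histograms such as DB_GET / DB_WRITE can be exported as
                // Flink metrics directly (values are in microseconds), without
                // parsing the multi-line "rocksdb.stats" dump string.
                HistogramData get = stats.getHistogramData(HistogramType.DB_GET);
                System.out.printf("db_get avg=%.2fus p95=%.2fus%n",
                        get.getAverage(), get.getPercentile95());
            }
        }
    }
}
{code}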
[jira] [Updated] (FLINK-33741) Introduce stat dump period and statsLevel configuration
[ https://issues.apache.org/jira/browse/FLINK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33741:
--
Summary: Introduce stat dump period and statsLevel configuration (was: introduce stat dump period and statsLevel configuration)
[jira] [Updated] (FLINK-33741) Exposed Rocksdb statistics in Flink metrics and introduce 2 Rocksdb statistic related configuration
[ https://issues.apache.org/jira/browse/FLINK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33741:
--
Description:
I think we can also parse the multi-line string of the RocksDB statistics.
{code:java}
// code placeholder
/**
 * DB implements can export properties about their state
 * via this method on a per column family level.
 *
 * If {@code property} is a valid property understood by this DB
 * implementation, fills {@code value} with its current value and
 * returns true. Otherwise returns false.
 *
 * Valid property names include:
 *
 * "rocksdb.num-files-at-level<N>" - return the number of files at
 *     level <N>, where <N> is an ASCII representation of a level
 *     number (e.g. "0").
 * "rocksdb.stats" - returns a multi-line string that describes statistics
 *     about the internal operation of the DB.
 * "rocksdb.sstables" - returns a multi-line string that describes all
 *     of the sstables that make up the db contents.
 *
 * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle}
 *     instance, or null for the default column family.
 * @param property to be fetched. See above for examples
 * @return property value
 *
 * @throws RocksDBException thrown if error happens in underlying
 *     native library.
 */
public String getProperty(
    /* @Nullable */ final ColumnFamilyHandle columnFamilyHandle,
    final String property) throws RocksDBException {
{code}
Then we can directly export these latency numbers as metrics.

I'd like to introduce two RocksDB statistics-related configuration options. Then we can customize the stats:
{code:java}
// code placeholder
Statistics s = new Statistics();
s.setStatsLevel(EXCEPT_TIME_FOR_MUTEX);
currentOptions.setStatsDumpPeriodSec(internalGetOption(RocksDBConfigurableOptions.STATISTIC_DUMP_PERIOD))
    .setStatistics(s);
{code}

was:
I'd like to introduce two RocksDB statistics-related configuration options. Then we can customize the stats:
{code:java}
// code placeholder
Statistics s = new Statistics();
s.setStatsLevel(EXCEPT_TIME_FOR_MUTEX);
currentOptions.setStatsDumpPeriodSec(internalGetOption(RocksDBConfigurableOptions.STATISTIC_DUMP_PERIOD))
    .setStatistics(s);
{code}
[jira] [Updated] (FLINK-33741) Exposed Rocksdb statistics in Flink metrics and introduce 2 Rocksdb statistic related configuration
[ https://issues.apache.org/jira/browse/FLINK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33741:
--
Summary: Exposed Rocksdb statistics in Flink metrics and introduce 2 Rocksdb statistic related configuration (was: Introduce stat dump period and statsLevel configuration)
[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795571#comment-17795571 ] xiaogang zhou commented on FLINK-33728:
---
Hi [~mapohl], thanks for the comment above. Sorry for my poor written English :P, but I think your re-clarification is exactly what I am proposing. I'd like to introduce a lazy re-initialization of the watch mechanism, which tolerates a disconnection of the watch until a new pod is requested.
I think your concern is how we detect a TM loss without an active watcher. I have tested my change in a real K8s environment: with a disconnected watcher, I killed a TM pod, and after no more than 50s the task restarted with an exception:
{code:java}
// code placeholder
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id flink-6168d34cf9d3a5d31ad8bb02bce6a370-taskmanager-1-8 timed out.
    at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1306)
    at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitC
{code}
Moreover, I think YARN also does not have a watcher mechanism, so Flink scheduled on YARN also relies on the heartbeat timeout mechanism? And an active re-watching strategy can really put great pressure on the API server, especially in early versions that do not set resource version zero in the watch-list request.
[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798797#comment-17798797 ] xiaogang zhou commented on FLINK-33728:
---
[~mapohl] Hi Matthias, would you please let me know what additional tests are needed so my proposal can move forward?
[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804214#comment-17804214 ] xiaogang zhou commented on FLINK-33728:
---
Hi Matthias, I hope you have recovered and enjoyed a wonderful holiday :). Can we have a discussion on my proposal, [~mapohl]?
[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804553#comment-17804553 ] xiaogang zhou commented on FLINK-33728:
---
[~mapohl] I think your concern is really important, and my statement was not good enough. After your reminder, I'd like to change it to: we can just neglect the disconnection of the watching process {color:#FF0000}if there is no pending request{color}, and try to re-watch once a new requestResource is called. We can also choose to fail all pending CompletableFutures; then [requestWorkerIfRequired|https://github.com/apache/flink/blob/2b9b9859253698c3c90ca420f10975e27e6c52d4/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L332] will request the resource again, which will trigger the re-watch.
WDYT [~mapohl] [~xtsong]?
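A hedged sketch of this refined variant, building on the earlier one; pendingPodFutures and watchPods are again illustrative names, not the real ActiveResourceManager/driver API:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Illustrative sketch: fail pending requests on watch error, re-watch lazily. */
class PendingAwareDriver {

    // Stand-in for the driver's outstanding pod-creation requests.
    private final List<CompletableFuture<String>> pendingPodFutures = new ArrayList<>();
    private volatile boolean watchBroken = false;

    /** Watch error callback (e.g. ResourceVersionTooOld). */
    synchronized void handleWatchError(Throwable t) {
        watchBroken = true;
        // Fail pending requests, so that requestWorkerIfRequired in the
        // ResourceManager re-requests the workers, which re-creates the watch
        // in requestResource() below.
        pendingPodFutures.forEach(f -> f.completeExceptionally(t));
        pendingPodFutures.clear();
    }

    synchronized CompletableFuture<String> requestResource() {
        if (watchBroken) {
            watchPods();
            watchBroken = false;
        }
        CompletableFuture<String> podFuture = new CompletableFuture<>();
        pendingPodFutures.add(podFuture);
        return podFuture;
    }

    private void watchPods() { /* re-create the Kubernetes pod watch */ }
}
{code}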
[jira] [Commented] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804638#comment-17804638 ] xiaogang zhou commented on FLINK-33728:
---
[~xtsong] In a default Flink setting, when the KubernetesClient disconnects from the kube API server, it will try to reconnect infinitely, as kubernetes.watch.reconnectLimit is -1.
But the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes from the normal reconnects. This then causes Flink's FlinkKubeClient to retry the connection kubernetes.transactional-operation.max-retries times. If the watcher does not recover, the JM will kill itself.
So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs trying to re-create watches at the same time, but also to allow Flink to continue running even when the kube API is in a disordered state. For most of the time, Flink TMs do not need to be bothered by a bad API server.
If you think it is not acceptable to recover the watcher only when requesting resources, another possible way is to retry re-watching the pods periodically. WDYT? :)
[jira] [Comment Edited] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804638#comment-17804638 ] xiaogang zhou edited comment on FLINK-33728 at 1/9/24 8:52 AM:
---
[~xtsong] In a default Flink setting, when the KubernetesClient disconnects from the kube API server, it will try to reconnect infinitely, as kubernetes.watch.reconnectLimit is -1.
But the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes from the normal reconnects. This then causes Flink's FlinkKubeClient to retry the connection kubernetes.transactional-operation.max-retries times, and these retries have no interval between them. If the watcher does not recover, the JM will kill itself.
So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs trying to re-create watches at the same time, but also to allow Flink to continue running even when the kube API is in a disordered state. For most of the time, Flink TMs do not need to be bothered by a bad API server.
If you think it is not acceptable to recover the watcher only when requesting resources, another possible way is to retry re-watching the pods periodically. WDYT? :)

was (Author: zhoujira86):
[~xtsong] In a default Flink setting, when the KubernetesClient disconnects from the kube API server, it will try to reconnect infinitely, as kubernetes.watch.reconnectLimit is -1.
But the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes from the normal reconnects. This then causes Flink's FlinkKubeClient to retry the connection kubernetes.transactional-operation.max-retries times. If the watcher does not recover, the JM will kill itself.
So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs trying to re-create watches at the same time, but also to allow Flink to continue running even when the kube API is in a disordered state. For most of the time, Flink TMs do not need to be bothered by a bad API server.
If you think it is not acceptable to recover the watcher only when requesting resources, another possible way is to retry re-watching the pods periodically. WDYT? :)
[jira] [Comment Edited] (FLINK-33728) do not rewatch when KubernetesResourceManagerDriver watch fail
[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804638#comment-17804638 ] xiaogang zhou edited comment on FLINK-33728 at 1/9/24 8:53 AM:
---
[~xtsong] In a default Flink setting, when the KubernetesClient disconnects from the kube API server, it will try to reconnect infinitely, as kubernetes.watch.reconnectLimit is -1.
But the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes from the normal reconnects. This then causes Flink's FlinkKubeClient to retry the connection kubernetes.transactional-operation.max-retries times, and these retries have no interval between them. If the watcher does not recover, the JM will kill itself.
So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs trying to re-create watches at the same time, but also to allow Flink to continue running even when the kube API server is in a disordered state. For most of the time, Flink TMs have no dependency on the API server.
If you think it is not acceptable to recover the watcher only when requesting resources, another possible way is to retry re-watching the pods periodically. WDYT? :)

was (Author: zhoujira86):
[~xtsong] In a default Flink setting, when the KubernetesClient disconnects from the kube API server, it will try to reconnect infinitely, as kubernetes.watch.reconnectLimit is -1.
But the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes from the normal reconnects. This then causes Flink's FlinkKubeClient to retry the connection kubernetes.transactional-operation.max-retries times, and these retries have no interval between them. If the watcher does not recover, the JM will kill itself.
So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs trying to re-create watches at the same time, but also to allow Flink to continue running even when the kube API is in a disordered state. For most of the time, Flink TMs do not need to be bothered by a bad API server.
If you think it is not acceptable to recover the watcher only when requesting resources, another possible way is to retry re-watching the pods periodically. WDYT? :)
[jira] [Updated] (FLINK-33741) Exposed Rocksdb Histogram statistics in Flink metrics
[ https://issues.apache.org/jira/browse/FLINK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33741:
--
Summary: Exposed Rocksdb Histogram statistics in Flink metrics (was: Exposed Rocksdb statistics in Flink metrics and introduce 2 Rocksdb statistic related configuration)
[jira] [Updated] (FLINK-33741) Exposed Rocksdb Histogram statistics in Flink metrics
[ https://issues.apache.org/jira/browse/FLINK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33741:
--
Description: I'd like to expose RocksDB histogram metrics like db_get and db_write to enable troubleshooting.

was:
I think we can also parse the multi-line string of the RocksDB statistics.
{code:java}
// code placeholder
/**
 * DB implements can export properties about their state
 * via this method on a per column family level.
 *
 * If {@code property} is a valid property understood by this DB
 * implementation, fills {@code value} with its current value and
 * returns true. Otherwise returns false.
 *
 * Valid property names include:
 *
 * "rocksdb.num-files-at-level<N>" - return the number of files at
 *     level <N>, where <N> is an ASCII representation of a level
 *     number (e.g. "0").
 * "rocksdb.stats" - returns a multi-line string that describes statistics
 *     about the internal operation of the DB.
 * "rocksdb.sstables" - returns a multi-line string that describes all
 *     of the sstables that make up the db contents.
 *
 * @param columnFamilyHandle {@link org.rocksdb.ColumnFamilyHandle}
 *     instance, or null for the default column family.
 * @param property to be fetched. See above for examples
 * @return property value
 *
 * @throws RocksDBException thrown if error happens in underlying
 *     native library.
 */
public String getProperty(
    /* @Nullable */ final ColumnFamilyHandle columnFamilyHandle,
    final String property) throws RocksDBException {
{code}
Then we can directly export these latency numbers as metrics.

I'd like to introduce two RocksDB statistics-related configuration options. Then we can customize the stats:
{code:java}
// code placeholder
Statistics s = new Statistics();
s.setStatsLevel(EXCEPT_TIME_FOR_MUTEX);
currentOptions.setStatsDumpPeriodSec(internalGetOption(RocksDBConfigurableOptions.STATISTIC_DUMP_PERIOD))
    .setStatistics(s);
{code}
[jira] [Updated] (FLINK-33741) Expose Rocksdb Histogram statistics in Flink metrics
[ https://issues.apache.org/jira/browse/FLINK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33741:
--
Summary: Expose Rocksdb Histogram statistics in Flink metrics (was: Exposed Rocksdb Histogram statistics in Flink metrics)
[jira] [Created] (FLINK-34976) LD_PRELOAD environment may not be effective after su to flink user
xiaogang zhou created FLINK-34976:
--
             Summary: LD_PRELOAD environment may not be effective after su to flink user
                 Key: FLINK-34976
                 URL: https://issues.apache.org/jira/browse/FLINK-34976
             Project: Flink
          Issue Type: New Feature
          Components: flink-docker
    Affects Versions: 1.19.0
            Reporter: xiaogang zhou
[jira] [Updated] (FLINK-34976) LD_PRELOAD environment may not be effective after su to flink user
[ https://issues.apache.org/jira/browse/FLINK-34976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-34976:
--
Description:
I am not sure if LD_PRELOAD still takes effect after drop_privs_cmd. Do we need to create a .bashrc file in the home directory of flink?
[https://github.com/apache/flink-docker/blob/627987997ca7ec86bcc3d80b26df58aa595b91af/1.17/scala_2.12-java11-ubuntu/docker-entrypoint.sh#L92]
[jira] [Updated] (FLINK-34976) LD_PRELOAD environment may not be effective after su to flink user
[ https://issues.apache.org/jira/browse/FLINK-34976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-34976:
--
Description:
I am not sure if LD_PRELOAD still takes effect after drop_privs_cmd. Should we create a .bashrc file in the home directory of flink, and export LD_PRELOAD for the flink user?
[https://github.com/apache/flink-docker/blob/627987997ca7ec86bcc3d80b26df58aa595b91af/1.17/scala_2.12-java11-ubuntu/docker-entrypoint.sh#L92]

was:
I am not sure if LD_PRELOAD still takes effect after drop_privs_cmd. Do we need to create a .bashrc file in the home directory of flink, and export LD_PRELOAD for the flink user?
[https://github.com/apache/flink-docker/blob/627987997ca7ec86bcc3d80b26df58aa595b91af/1.17/scala_2.12-java11-ubuntu/docker-entrypoint.sh#L92]
[jira] [Updated] (FLINK-34976) LD_PRELOAD environment may not be effective after su to flink user
[ https://issues.apache.org/jira/browse/FLINK-34976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-34976:
--
Description:
I am not sure if LD_PRELOAD still takes effect after drop_privs_cmd. Do we need to create a .bashrc file in the home directory of flink, and export LD_PRELOAD for the flink user?
[https://github.com/apache/flink-docker/blob/627987997ca7ec86bcc3d80b26df58aa595b91af/1.17/scala_2.12-java11-ubuntu/docker-entrypoint.sh#L92]

was:
I am not sure if LD_PRELOAD still takes effect after drop_privs_cmd. Do we need to create a .bashrc file in the home directory of flink?
[https://github.com/apache/flink-docker/blob/627987997ca7ec86bcc3d80b26df58aa595b91af/1.17/scala_2.12-java11-ubuntu/docker-entrypoint.sh#L92]
[jira] [Closed] (FLINK-34976) LD_PRELOAD environment may not be effective after su to flink user
[ https://issues.apache.org/jira/browse/FLINK-34976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou closed FLINK-34976. - Resolution: Invalid > LD_PRELOAD environment may not be effective after su to flink user > -- > > Key: FLINK-34976 > URL: https://issues.apache.org/jira/browse/FLINK-34976 > Project: Flink > Issue Type: New Feature > Components: flink-docker >Affects Versions: 1.19.0 >Reporter: xiaogang zhou >Priority: Major > > I am not sure if LD_PRELOAD still takes effect after drop_privs_cmd. Should > we create a .bashrc file in the home directory of the flink user, and export > LD_PRELOAD for the flink user? > > [https://github.com/apache/flink-docker/blob/627987997ca7ec86bcc3d80b26df58aa595b91af/1.17/scala_2.12-java11-ubuntu/docker-entrypoint.sh#L92] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-34976) LD_PRELOAD environment may not be effective after su to flink user
[ https://issues.apache.org/jira/browse/FLINK-34976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17832762#comment-17832762 ] xiaogang zhou commented on FLINK-34976: --- [~yunta] OK, thanks for the clarification > LD_PRELOAD environment may not be effective after su to flink user > -- > > Key: FLINK-34976 > URL: https://issues.apache.org/jira/browse/FLINK-34976 > Project: Flink > Issue Type: New Feature > Components: flink-docker >Affects Versions: 1.19.0 >Reporter: xiaogang zhou >Priority: Major > > I am not sure if LD_PRELOAD still takes effect after drop_privs_cmd. Should > we create a .bashrc file in the home directory of the flink user, and export > LD_PRELOAD for the flink user? > > [https://github.com/apache/flink-docker/blob/627987997ca7ec86bcc3d80b26df58aa595b91af/1.17/scala_2.12-java11-ubuntu/docker-entrypoint.sh#L92] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
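For completeness, one way to check whether the variable actually survives the privilege drop is to inspect it from inside the JVM. The sketch below is purely illustrative; the class name is hypothetical and it is not part of Flink or flink-docker.

{code:java}
// Hypothetical diagnostic class -- not part of Flink or flink-docker.
// Prints whether LD_PRELOAD is still visible after the entrypoint
// switches from root to the flink user.
public final class LdPreloadCheck {
    public static void main(String[] args) {
        String ldPreload = System.getenv("LD_PRELOAD");
        String user = System.getProperty("user.name");
        if (ldPreload == null || ldPreload.isEmpty()) {
            System.out.println("LD_PRELOAD is not set for user " + user);
        } else {
            System.out.println("LD_PRELOAD=" + ldPreload + " for user " + user);
        }
    }
}
{code}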
[jira] [Commented] (FLINK-32070) FLIP-306 Unified File Merging Mechanism for Checkpoints
[ https://issues.apache.org/jira/browse/FLINK-32070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834580#comment-17834580 ] xiaogang zhou commented on FLINK-32070: --- [~Zakelly] Hi, we hit a problem where Flink checkpoints produce too many SST files, causing very high IOPS on HDFS. Can this issue help in that scenario? > FLIP-306 Unified File Merging Mechanism for Checkpoints > --- > > Key: FLINK-32070 > URL: https://issues.apache.org/jira/browse/FLINK-32070 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / State Backends >Reporter: Zakelly Lan >Assignee: Zakelly Lan >Priority: Major > Fix For: 1.20.0 > > > The FLIP: > [https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints] > > The creation of multiple checkpoint files can lead to a 'file flood' problem, > in which a large number of files are written to the checkpoint storage in a > short amount of time. This can cause issues in large clusters with high > workloads, such as the creation and deletion of many files increasing the > amount of file meta modification on DFS, leading to single-machine hotspot > issues for meta maintainers (e.g. NameNode in HDFS). Additionally, the > performance of object storage (e.g. Amazon S3 and Alibaba OSS) can > significantly decrease when listing objects, which is necessary for object > name de-duplication before creating an object, further affecting the > performance of directory manipulation in the file system's perspective of > view (See [hadoop-aws module > documentation|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#:~:text=an%20intermediate%20state.-,Warning%20%232%3A%20Directories%20are%20mimicked,-The%20S3A%20clients], > section 'Warning #2: Directories are mimicked'). > While many solutions have been proposed for individual types of state files > (e.g. FLINK-11937 for keyed state (RocksDB) and FLINK-26803 for channel > state), the file flood problems from each type of checkpoint file are similar > and lack systematic view and solution. Therefore, the goal of this FLIP is to > establish a unified file merging mechanism to address the file flood problem > during checkpoint creation for all types of state files, including keyed, > non-keyed, channel, and changelog state. This will significantly improve the > system stability and availability of fault tolerance in Flink. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32070) FLIP-306 Unified File Merging Mechanism for Checkpoints
[ https://issues.apache.org/jira/browse/FLINK-32070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834609#comment-17834609 ] xiaogang zhou commented on FLINK-32070: --- Is there any branch I can view to do a POC? And if you are busy with the Flink 2.0 state work, I can also help with some work on this issue. [~Zakelly] > FLIP-306 Unified File Merging Mechanism for Checkpoints > --- > > Key: FLINK-32070 > URL: https://issues.apache.org/jira/browse/FLINK-32070 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / State Backends >Reporter: Zakelly Lan >Assignee: Zakelly Lan >Priority: Major > Fix For: 1.20.0 > > > The FLIP: > [https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints] > > The creation of multiple checkpoint files can lead to a 'file flood' problem, > in which a large number of files are written to the checkpoint storage in a > short amount of time. This can cause issues in large clusters with high > workloads, such as the creation and deletion of many files increasing the > amount of file meta modification on DFS, leading to single-machine hotspot > issues for meta maintainers (e.g. NameNode in HDFS). Additionally, the > performance of object storage (e.g. Amazon S3 and Alibaba OSS) can > significantly decrease when listing objects, which is necessary for object > name de-duplication before creating an object, further affecting the > performance of directory manipulation in the file system's perspective of > view (See [hadoop-aws module > documentation|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#:~:text=an%20intermediate%20state.-,Warning%20%232%3A%20Directories%20are%20mimicked,-The%20S3A%20clients], > section 'Warning #2: Directories are mimicked'). > While many solutions have been proposed for individual types of state files > (e.g. FLINK-11937 for keyed state (RocksDB) and FLINK-26803 for channel > state), the file flood problems from each type of checkpoint file are similar > and lack systematic view and solution. Therefore, the goal of this FLIP is to > establish a unified file merging mechanism to address the file flood problem > during checkpoint creation for all types of state files, including keyed, > non-keyed, channel, and changelog state. This will significantly improve the > system stability and availability of fault tolerance in Flink. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-32070) FLIP-306 Unified File Merging Mechanism for Checkpoints
[ https://issues.apache.org/jira/browse/FLINK-32070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834609#comment-17834609 ] xiaogang zhou edited comment on FLINK-32070 at 4/7/24 6:36 AM: --- Is there any branch I can compile to do a POC? And if you are busy with the Flink 2.0 state work, I can also help with some work on FLIP-306. [~Zakelly] was (Author: zhoujira86): Is there any branch I can view to do a POC? And if you are busy with the Flink 2.0 state work, I can also help with some work on this issue. [~Zakelly] > FLIP-306 Unified File Merging Mechanism for Checkpoints > --- > > Key: FLINK-32070 > URL: https://issues.apache.org/jira/browse/FLINK-32070 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / State Backends >Reporter: Zakelly Lan >Assignee: Zakelly Lan >Priority: Major > Fix For: 1.20.0 > > > The FLIP: > [https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints] > > The creation of multiple checkpoint files can lead to a 'file flood' problem, > in which a large number of files are written to the checkpoint storage in a > short amount of time. This can cause issues in large clusters with high > workloads, such as the creation and deletion of many files increasing the > amount of file meta modification on DFS, leading to single-machine hotspot > issues for meta maintainers (e.g. NameNode in HDFS). Additionally, the > performance of object storage (e.g. Amazon S3 and Alibaba OSS) can > significantly decrease when listing objects, which is necessary for object > name de-duplication before creating an object, further affecting the > performance of directory manipulation in the file system's perspective of > view (See [hadoop-aws module > documentation|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#:~:text=an%20intermediate%20state.-,Warning%20%232%3A%20Directories%20are%20mimicked,-The%20S3A%20clients], > section 'Warning #2: Directories are mimicked'). > While many solutions have been proposed for individual types of state files > (e.g. FLINK-11937 for keyed state (RocksDB) and FLINK-26803 for channel > state), the file flood problems from each type of checkpoint file are similar > and lack systematic view and solution. Therefore, the goal of this FLIP is to > establish a unified file merging mechanism to address the file flood problem > during checkpoint creation for all types of state files, including keyed, > non-keyed, channel, and changelog state. This will significantly improve the > system stability and availability of fault tolerance in Flink. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-32070) FLIP-306 Unified File Merging Mechanism for Checkpoints
[ https://issues.apache.org/jira/browse/FLINK-32070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834636#comment-17834636 ] xiaogang zhou commented on FLINK-32070: --- [~Zakelly] yes, sounds good, Let me take a look > FLIP-306 Unified File Merging Mechanism for Checkpoints > --- > > Key: FLINK-32070 > URL: https://issues.apache.org/jira/browse/FLINK-32070 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / State Backends >Reporter: Zakelly Lan >Assignee: Zakelly Lan >Priority: Major > Fix For: 1.20.0 > > > The FLIP: > [https://cwiki.apache.org/confluence/display/FLINK/FLIP-306%3A+Unified+File+Merging+Mechanism+for+Checkpoints] > > The creation of multiple checkpoint files can lead to a 'file flood' problem, > in which a large number of files are written to the checkpoint storage in a > short amount of time. This can cause issues in large clusters with high > workloads, such as the creation and deletion of many files increasing the > amount of file meta modification on DFS, leading to single-machine hotspot > issues for meta maintainers (e.g. NameNode in HDFS). Additionally, the > performance of object storage (e.g. Amazon S3 and Alibaba OSS) can > significantly decrease when listing objects, which is necessary for object > name de-duplication before creating an object, further affecting the > performance of directory manipulation in the file system's perspective of > view (See [hadoop-aws module > documentation|https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#:~:text=an%20intermediate%20state.-,Warning%20%232%3A%20Directories%20are%20mimicked,-The%20S3A%20clients], > section 'Warning #2: Directories are mimicked'). > While many solutions have been proposed for individual types of state files > (e.g. FLINK-11937 for keyed state (RocksDB) and FLINK-26803 for channel > state), the file flood problems from each type of checkpoint file are similar > and lack systematic view and solution. Therefore, the goal of this FLIP is to > establish a unified file merging mechanism to address the file flood problem > during checkpoint creation for all types of state files, including keyed, > non-keyed, channel, and changelog state. This will significantly improve the > system stability and availability of fault tolerance in Flink. -- This message was sent by Atlassian Jira (v8.20.10#820010)
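To make the file-flood discussion concrete, the sketch below illustrates the core idea under simplifying assumptions (a local RandomAccessFile stands in for checkpoint storage). It is a conceptual illustration of the merging trick, not Flink's actual FLIP-306 implementation: many logical state files become (offset, length) segments inside one physical file, so the DFS sees one file creation instead of many.

{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;

// Conceptual sketch only -- NOT Flink's implementation of FLIP-306.
// Many logical state files are appended to one shared physical file,
// and callers get back (offset, length) handles instead of new paths.
public class MergedSegmentWriter implements AutoCloseable {

    /** Handle to one logical "file" inside the merged physical file. */
    public static final class Segment {
        public final long offset;
        public final int length;

        Segment(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final RandomAccessFile file;

    public MergedSegmentWriter(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    /** Appends one logical state file; one physical file now holds many segments. */
    public synchronized Segment append(byte[] stateBytes) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.write(stateBytes);
        return new Segment(offset, stateBytes.length);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
{code}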
[jira] [Commented] (FLINK-32086) Cleanup non-reported managed directory on exit of TM
[ https://issues.apache.org/jira/browse/FLINK-32086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839917#comment-17839917 ] xiaogang zhou commented on FLINK-32086: --- [~Zakelly] yes no problem > Cleanup non-reported managed directory on exit of TM > > > Key: FLINK-32086 > URL: https://issues.apache.org/jira/browse/FLINK-32086 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Checkpointing, Runtime / State Backends >Affects Versions: 1.18.0 >Reporter: Zakelly Lan >Assignee: Zakelly Lan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-22059) add a new option in rocksdb statebackend to enable job threads setting
[ https://issues.apache.org/jira/browse/FLINK-22059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17761587#comment-17761587 ] xiaogang zhou commented on FLINK-22059: --- [~Zakelly] yes > add a new option in rocksdb statebackend to enable job threads setting > -- > > Key: FLINK-22059 > URL: https://issues.apache.org/jira/browse/FLINK-22059 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.12.2 >Reporter: xiaogang zhou >Assignee: xiaogang zhou >Priority: Minor > Labels: auto-deprioritized-major, auto-unassigned, stale-assigned > Fix For: 1.18.0 > > > As discussed in FLINK-21688, we are now using the setIncreaseParallelism > function to set the number of RocksDB's working threads. > > Can we enable another setting key to set RocksDB's max background jobs, > which would allow a larger flush thread number? -- This message was sent by Atlassian Jira (v8.20.10#820010)
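For readers finding this issue today: until a dedicated key exists, the max-background-jobs knob can usually be reached through a custom RocksDBOptionsFactory. A rough sketch follows; the class name and the value 8 are illustrative assumptions.

{code:java}
import java.util.Collection;
import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

// Illustrative sketch: raise RocksDB's background job count (flush and
// compaction threads) via an options factory. The value 8 is arbitrary.
public class BackgroundJobsOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(
            DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions.setMaxBackgroundJobs(8);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(
            ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions; // leave column family options unchanged
    }
}
{code}

Such a factory is typically wired in through the state.backend.rocksdb.options-factory configuration key.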
[jira] [Created] (FLINK-33038) remove getMinRetentionTime in StreamExecDeduplicate
xiaogang zhou created FLINK-33038: - Summary: remove getMinRetentionTime in StreamExecDeduplicate Key: FLINK-33038 URL: https://issues.apache.org/jira/browse/FLINK-33038 Project: Flink Issue Type: Improvement Components: Table SQL / Planner Affects Versions: 1.18.0 Reporter: xiaogang zhou Fix For: 1.19.0 I suggest to remove the StreamExecDeduplicate method in StreamExecDeduplicate as the ttl is controlled by the state meta data. Please let me take the issue if possible -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33038) remove getMinRetentionTime in StreamExecDeduplicate
[ https://issues.apache.org/jira/browse/FLINK-33038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33038: -- Description: I suggest removing the getMinRetentionTime method in StreamExecDeduplicate, as it is not called anywhere and the TTL is controlled by the state metadata. Please let me take the issue if possible was: I suggest to remove the StreamExecDeduplicate method in StreamExecDeduplicate as the ttl is controlled by the state meta data. Please let me take the issue if possible > remove getMinRetentionTime in StreamExecDeduplicate > --- > > Key: FLINK-33038 > URL: https://issues.apache.org/jira/browse/FLINK-33038 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.18.0 >Reporter: xiaogang zhou >Priority: Minor > Fix For: 1.19.0 > > > I suggest removing the getMinRetentionTime method in StreamExecDeduplicate, > as it is not called anywhere and the TTL is controlled by the state > metadata. > > Please let me take the issue if possible -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33038) remove getMinRetentionTime in StreamExecDeduplicate
[ https://issues.apache.org/jira/browse/FLINK-33038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762032#comment-17762032 ] xiaogang zhou commented on FLINK-33038: --- [~snuyanzin] would you please help assign this to me? Also cc'ing the last modifier [~qingyue] > remove getMinRetentionTime in StreamExecDeduplicate > --- > > Key: FLINK-33038 > URL: https://issues.apache.org/jira/browse/FLINK-33038 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.18.0 >Reporter: xiaogang zhou >Priority: Minor > Fix For: 1.19.0 > > > I suggest removing the getMinRetentionTime method in StreamExecDeduplicate, > as it is not called anywhere and the TTL is controlled by the state > metadata. > > Please let me take the issue if possible -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33038) remove getMinRetentionTime in StreamExecDeduplicate
[ https://issues.apache.org/jira/browse/FLINK-33038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762248#comment-17762248 ] xiaogang zhou commented on FLINK-33038: --- [~Sergey Nuyanzin] Fixed in [https://github.com/apache/flink/pull/23360], would you please have a quick review? > remove getMinRetentionTime in StreamExecDeduplicate > --- > > Key: FLINK-33038 > URL: https://issues.apache.org/jira/browse/FLINK-33038 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.18.0 >Reporter: xiaogang zhou >Assignee: xiaogang zhou >Priority: Minor > Fix For: 1.19.0 > > > I suggest removing the getMinRetentionTime method in StreamExecDeduplicate, > as it is not called anywhere and the TTL is controlled by the state > metadata. > > Please let me take the issue if possible -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-33038) remove getMinRetentionTime in StreamExecDeduplicate
[ https://issues.apache.org/jira/browse/FLINK-33038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17762248#comment-17762248 ] xiaogang zhou edited comment on FLINK-33038 at 9/6/23 2:26 AM: --- [~Sergey Nuyanzin] Fixed, would you please have a quick review? was (Author: zhoujira86): [~Sergey Nuyanzin] Fixed in [https://github.com/apache/flink/pull/23360], would you please have a quick review? > remove getMinRetentionTime in StreamExecDeduplicate > --- > > Key: FLINK-33038 > URL: https://issues.apache.org/jira/browse/FLINK-33038 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.18.0 >Reporter: xiaogang zhou >Assignee: xiaogang zhou >Priority: Minor > Fix For: 1.19.0 > > > I suggest removing the getMinRetentionTime method in StreamExecDeduplicate, > as it is not called anywhere and the TTL is controlled by the state > metadata. > > Please let me take the issue if possible -- This message was sent by Atlassian Jira (v8.20.10#820010)
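For context, the reason the method is redundant: retention is declared on the state descriptor itself, which is the state metadata mentioned above. A minimal sketch, with an arbitrary one-hour TTL and a hypothetical descriptor name:

{code:java}
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Minimal sketch: TTL travels with the state descriptor (state metadata),
// so the operator needs no separate min-retention bookkeeping.
public class DeduplicateTtlSketch {

    public static ValueStateDescriptor<String> descriptorWithTtl() {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.hours(1)) // arbitrary example retention
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .build();
        ValueStateDescriptor<String> descriptor =
                new ValueStateDescriptor<>("dedup-state", String.class);
        descriptor.enableTimeToLive(ttlConfig);
        return descriptor;
    }
}
{code}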
[jira] [Created] (FLINK-33162) separate the executor in DefaultDispatcherResourceManagerComponentFactory for MetricFetcher and webMonitorEndpoint
xiaogang zhou created FLINK-33162: - Summary: separate the executor in DefaultDispatcherResourceManagerComponentFactory for MetricFetcher and webMonitorEndpoint Key: FLINK-33162 URL: https://issues.apache.org/jira/browse/FLINK-33162 Project: Flink Issue Type: Improvement Components: Runtime / REST Affects Versions: 1.16.0 Reporter: xiaogang zhou Fix For: 1.19.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33162) separate the executor in DefaultDispatcherResourceManagerComponentFactory for MetricFetcher and webMonitorEndpoint
[ https://issues.apache.org/jira/browse/FLINK-33162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33162: -- Description: when starting a job with a large number of taskmanagers, the jobmanager of the job failed to respond to any REST request. When we looked into the jstack, we found that all 4 threads were serving the metrics fetcher. {code:java} // code placeholder "Flink-DispatcherRestEndpoint-thread-4" #91 daemon prio=5 os_prio=0 tid=0x7f17e7823000 nid=0x246 waiting for monitor entry [0x7f178e9fe000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:81) - waiting to lock <0x0003d5f62638> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl.lambda$queryMetrics$5(MetricFetcherImpl.java:244) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl$$Lambda$1590/569188012.accept(Unknown Source) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers:- <0x0003ce80d8f0> (a java.util.concurrent.ThreadPoolExecutor$Worker) "Flink-DispatcherRestEndpoint-thread-3" #88 daemon prio=5 os_prio=0 tid=0x7f17e88af000 nid=0x243 waiting for monitor entry [0x7f1790dfe000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:81) - waiting to lock <0x0003d5f62638> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl.lambda$queryMetrics$5(MetricFetcherImpl.java:244) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl$$Lambda$1590/569188012.accept(Unknown Source) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers:- <0x0003ce80df88> (a java.util.concurrent.ThreadPoolExecutor$Worker) "Flink-DispatcherRestEndpoint-thread-2" #79 daemon prio=5 os_prio=0 tid=0x7f1793473800 nid=0x23a runnable [0x7f17922fd000] java.lang.Thread.State: RUNNABLE at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.add(MetricStore.java:216) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:82) - locked <0x0003d5f62638> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl.lambda$queryMetrics$5(MetricFetcherImpl.java:244) at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl$$Lambda$1590/569188012.accept(Unknown Source) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Sche
[jira] [Updated] (FLINK-33162) separate the executor in DefaultDispatcherResourceManagerComponentFactory for MetricFetcher and webMonitorEndpoint
[ https://issues.apache.org/jira/browse/FLINK-33162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33162: -- Affects Version/s: 1.13.1 (was: 1.16.0) > separate the executor in DefaultDispatcherResourceManagerComponentFactory for > MetricFetcher and webMonitorEndpoint > -- > > Key: FLINK-33162 > URL: https://issues.apache.org/jira/browse/FLINK-33162 > Project: Flink > Issue Type: Improvement > Components: Runtime / REST >Affects Versions: 1.13.1 >Reporter: xiaogang zhou >Priority: Major > Fix For: 1.19.0 > > > when starting a job with a large number of taskmanagers, the jobmanager of > the job failed to respond to any REST request. When we looked into the jstack, > we found that all 4 threads were serving the metrics fetcher. > {code:java} > // code placeholder > "Flink-DispatcherRestEndpoint-thread-4" #91 daemon prio=5 os_prio=0 > tid=0x7f17e7823000 nid=0x246 waiting for monitor entry > [0x7f178e9fe000] java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:81) > - waiting to lock <0x0003d5f62638> (a > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore) at > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl.lambda$queryMetrics$5(MetricFetcherImpl.java:244) > at > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl$$Lambda$1590/569188012.accept(Unknown > Source) at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) >at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) >at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) >at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Locked ownable synchronizers: - <0x0003ce80d8f0> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > "Flink-DispatcherRestEndpoint-thread-3" #88 daemon prio=5 os_prio=0 > tid=0x7f17e88af000 nid=0x243 waiting for monitor entry > [0x7f1790dfe000] java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:81) > - waiting to lock <0x0003d5f62638> (a > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore) at > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl.lambda$queryMetrics$5(MetricFetcherImpl.java:244) > at > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcherImpl$$Lambda$1590/569188012.accept(Unknown > Source) at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) >at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) >at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) >at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Locked ownable synchronizers: - <0x0003ce80df88> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > "Flink-DispatcherRestEndpoint-thread-2" #79 daemon prio=5 os_prio=0 > tid=0x7f1793473800 nid=0x23a runnable [0x7f17922fd000] > java.lang.Thread.State: RUNNABLE at > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.add(MetricStore.java:216) >at > org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:82) > - locked <0x0003d5f62638> (a > org.apac
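The gist of the proposed change, sketched with plain JDK executors (names and pool sizes are hypothetical, not the actual factory code): once the MetricFetcher gets its own pool, threads blocked on the MetricStore monitor can no longer starve the threads that serve REST requests.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical sketch of the separation -- not the actual
// DefaultDispatcherResourceManagerComponentFactory code.
public class SeparatedEndpointExecutors {

    // Pool that only serves WebMonitorEndpoint REST handlers.
    final ScheduledExecutorService restExecutor =
            Executors.newScheduledThreadPool(4);

    // Dedicated pool for MetricFetcher updates, so threads blocked on the
    // MetricStore monitor cannot exhaust the REST handler pool.
    final ScheduledExecutorService metricFetcherExecutor =
            Executors.newScheduledThreadPool(2);
}
{code}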
[jira] [Created] (FLINK-33174) enabling tablesample bernoulli in flink
xiaogang zhou created FLINK-33174: - Summary: enabling tablesample bernoulli in flink Key: FLINK-33174 URL: https://issues.apache.org/jira/browse/FLINK-33174 Project: Flink Issue Type: Improvement Components: Table SQL / Planner Affects Versions: 1.17.1 Reporter: xiaogang zhou I'd like to introduce a table sample function to enable fast sampling on streaming jobs. This is inspired by https://issues.apache.org/jira/browse/CALCITE-5971 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33174) enabling tablesample bernoulli in flink
[ https://issues.apache.org/jira/browse/FLINK-33174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770822#comment-17770822 ] xiaogang zhou commented on FLINK-33174: --- [~lsy] [~libenchao] Hi Bros, would you please help review the suggestion? > enabling tablesample bernoulli in flink > --- > > Key: FLINK-33174 > URL: https://issues.apache.org/jira/browse/FLINK-33174 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.17.1 >Reporter: xiaogang zhou >Priority: Major > > I'd like to introduce a table sample function to enable fast sampling on > streaming jobs. > This is inspired by https://issues.apache.org/jira/browse/CALCITE-5971 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33174) enabling tablesample bernoulli in flink
[ https://issues.apache.org/jira/browse/FLINK-33174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772820#comment-17772820 ] xiaogang zhou commented on FLINK-33174: --- [~libenchao] Hi Benchao, would you please assign the issue to me? We can see whether to wait for the Calcite bump or find some other way > enabling tablesample bernoulli in flink > --- > > Key: FLINK-33174 > URL: https://issues.apache.org/jira/browse/FLINK-33174 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.17.1 >Reporter: xiaogang zhou >Priority: Major > > I'd like to introduce a table sample function to enable fast sampling on > streaming jobs. > This is inspired by https://issues.apache.org/jira/browse/CALCITE-5971 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33174) enabling tablesample bernoulli in flink
[ https://issues.apache.org/jira/browse/FLINK-33174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773306#comment-17773306 ] xiaogang zhou commented on FLINK-33174: --- [~libenchao] [~lsy] [~martijnvisser] Thanks all for your comments. Let me prepare a FLIP first and wait for the Calcite upgrade > enabling tablesample bernoulli in flink > --- > > Key: FLINK-33174 > URL: https://issues.apache.org/jira/browse/FLINK-33174 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.17.1 >Reporter: xiaogang zhou >Priority: Major > > I'd like to introduce a table sample function to enable fast sampling on > streaming jobs. > This is inspired by https://issues.apache.org/jira/browse/CALCITE-5971 -- This message was sent by Atlassian Jira (v8.20.10#820010)
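To illustrate the intended usage: TABLESAMPLE is not yet accepted by Flink's parser, so the query string below shows the proposed Calcite-style syntax only, and the table name orders is hypothetical.

{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Illustration only: this query does NOT parse in current Flink; it shows
// the proposed Calcite-style TABLESAMPLE BERNOULLI syntax.
public class TableSampleSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        // Keep roughly 10 percent of rows from the (hypothetical) orders table.
        tEnv.executeSql("SELECT * FROM orders TABLESAMPLE BERNOULLI(10)");
    }
}
{code}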
[jira] [Created] (FLINK-33249) comment should be parsed by StringLiteral() instead of SqlCharStringLiteral to avoid parsing failure
xiaogang zhou created FLINK-33249: - Summary: comment should be parsed by StringLiteral() instead of SqlCharStringLiteral to avoid parsing failure Key: FLINK-33249 URL: https://issues.apache.org/jira/browse/FLINK-33249 Project: Flink Issue Type: Improvement Components: Table SQL / Planner Affects Versions: 1.17.1 Reporter: xiaogang zhou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-33249) comment should be parsed by StringLiteral() instead of SqlCharStringLiteral to avoid parsing failure
[ https://issues.apache.org/jira/browse/FLINK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33249: -- Description: this problem is also recorded in calcite https://issues.apache.org/jira/browse/CALCITE-6046 Hi, I found this problem when I used the code below to split SQL statements. the process is SQL string -> SqlNode -> SQL String {code:java} // code placeholder SqlParser.Config parserConfig = getCurrentSqlParserConfig(sqlDialect); SqlParser sqlParser = SqlParser.create(sqlContent, parserConfig); SqlNodeList sqlNodeList = sqlParser.parseStmtList(); sqlParser.parse(sqlNodeList.get(0));{code} the Dialect/ SqlConformance is a customized one: [https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/validate/FlinkSqlConformance.java] then I found that the SQL below {code:java} // code placeholder CREATE TABLE source ( a BIGINT ) comment '测试test' WITH ( 'connector' = 'test' ); {code} is transformed to {code:java} // code placeholder CREATE TABLE `source` ( `a` BIGINT ) COMMENT u&'\5218\51eftest' WITH ( 'connector' = 'test' ) {code} and the SQL parser template is like {code:java} // code placeholder SqlCreate SqlCreateTable(Span s, boolean replace, boolean isTemporary) : { final SqlParserPos startPos = s.pos(); boolean ifNotExists = false; SqlIdentifier tableName; List constraints = new ArrayList(); SqlWatermark watermark = null; SqlNodeList columnList = SqlNodeList.EMPTY; SqlCharStringLiteral comment = null; SqlTableLike tableLike = null; SqlNode asQuery = null; SqlNodeList propertyList = SqlNodeList.EMPTY; SqlNodeList partitionColumns = SqlNodeList.EMPTY; SqlParserPos pos = startPos; } { ifNotExists = IfNotExistsOpt() tableName = CompoundIdentifier() [ { pos = getPos(); TableCreationContext ctx = new TableCreationContext();} TableColumn(ctx) ( TableColumn(ctx) )* { pos = pos.plus(getPos()); columnList = new SqlNodeList(ctx.columnList, pos); constraints = ctx.constraints; watermark = ctx.watermark; } ] [ { String p = SqlParserUtil.parseString(token.image); comment = SqlLiteral.createCharString(p, getPos()); }] [ partitionColumns = ParenthesizedSimpleIdentifierList() ] [ propertyList = TableProperties() ] [ tableLike = SqlTableLike(getPos()) { return new SqlCreateTableLike(startPos.plus(getPos()), tableName, columnList, constraints, propertyList, partitionColumns, watermark, comment, tableLike, isTemporary, ifNotExists); } | asQuery = OrderedQueryOrExpr(ExprContext.ACCEPT_QUERY) { return new SqlCreateTableAs(startPos.plus(getPos()), tableName, columnList, constraints, propertyList, partitionColumns, watermark, comment, asQuery, isTemporary, ifNotExists); } ] { return new SqlCreateTable(startPos.plus(getPos()), tableName, columnList, constraints, propertyList, partitionColumns, watermark, comment, isTemporary, ifNotExists); } } {code} will give an exception: Caused by: org.apache.calcite.sql.parser.SqlParseException: Encountered "u&\'\\5218\\51eftest\'" at line 4, column 9. Was expecting: ... so I think all the SqlCharStringLiteral should be replaced by StringLiteral() was: this problem is also recorded in calcite https://issues.apache.org/jira/browse/CALCITE-6046 Hi, I found this problem when I used the code below to split SQL statements. the process is SQL string -> SqlNode -> SQL String {code:java} // code placeholder SqlParser.Config parserConfig = getCurrentSqlParserConfig(sqlDialect); SqlParser sqlParser = SqlParser.create(sqlContent, parserConfig); SqlNodeList sqlNodeList = sqlParser.parseStmtList(); sqlParser.parse(sqlNodeList.get(0));{code} the Dialect/ SqlConformance is a customized one: [https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/validate/FlinkSqlConformance.java] then I found that the SQL below {code:java} // code placeholder CREATE TABLE source ( a BIGINT ) comment '测试test' WITH ( 'connector' = 'test' ); {code} is transformed to {code:java} // code placeholder CREATE TABLE `source` ( `a` BIGINT ) COMMENT u&'\5218\51eftest' WITH ( '
[jira] [Updated] (FLINK-33249) comment should be parsed by StringLiteral() instead of SqlCharStringLiteral to avoid parsing failure
[ https://issues.apache.org/jira/browse/FLINK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-33249: -- Description: this problem is also recorded in calcite https://issues.apache.org/jira/browse/CALCITE-6046 Hi, I found this problem when I used below code to split SQL statements. the process is SQL string -> SqlNode -> SQL String {code:java} // code placeholder SqlParser.Config parserConfig = getCurrentSqlParserConfig(sqlDialect); SqlParser sqlParser = SqlParser.create(sqlContent, parserConfig); SqlNodeList sqlNodeList = sqlParser.parseStmtList(); sqlParser.parse(sqlNodeList.get(0));{code} the Dialect/ SqlConformance is a costumed one: [https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/validate/FlinkSqlConformance.java] then I found below SQL {code:java} // code placeholder CREATE TABLE source ( a BIGINT ) comment '测试test' WITH ( 'connector' = 'test' ); {code} transformed to {code:java} // code placeholder CREATE TABLE `source` ( `a` BIGINT ) COMMENT u&'\5218\51eftest' WITH ( 'connector' = 'test' ) {code} and the SQL parser template is like {code:java} // code placeholder SqlCreate SqlCreateTable(Span s, boolean replace, boolean isTemporary) : { final SqlParserPos startPos = s.pos(); boolean ifNotExists = false; SqlIdentifier tableName; List constraints = new ArrayList(); SqlWatermark watermark = null; SqlNodeList columnList = SqlNodeList.EMPTY; SqlCharStringLiteral comment = null; SqlTableLike tableLike = null; SqlNode asQuery = null; SqlNodeList propertyList = SqlNodeList.EMPTY; SqlNodeList partitionColumns = SqlNodeList.EMPTY; SqlParserPos pos = startPos; } { ifNotExists = IfNotExistsOpt() tableName = CompoundIdentifier() [ { pos = getPos(); TableCreationContext ctx = new TableCreationContext();} TableColumn(ctx) ( TableColumn(ctx) )* { pos = pos.plus(getPos()); columnList = new SqlNodeList(ctx.columnList, pos); constraints = ctx.constraints; watermark = ctx.watermark; } ] [ { String p = SqlParserUtil.parseString(token.image); comment = SqlLiteral.createCharString(p, getPos()); }] [ partitionColumns = ParenthesizedSimpleIdentifierList() ] [ propertyList = TableProperties() ] [ tableLike = SqlTableLike(getPos()) { return new SqlCreateTableLike(startPos.plus(getPos()), tableName, columnList, constraints, propertyList, partitionColumns, watermark, comment, tableLike, isTemporary, ifNotExists); } | asQuery = OrderedQueryOrExpr(ExprContext.ACCEPT_QUERY) { return new SqlCreateTableAs(startPos.plus(getPos()), tableName, columnList, constraints, propertyList, partitionColumns, watermark, comment, asQuery, isTemporary, ifNotExists); } ] { return new SqlCreateTable(startPos.plus(getPos()), tableName, columnList, constraints, propertyList, partitionColumns, watermark, comment, isTemporary, ifNotExists); } } {code} will give a exception : Caused by: org.apache.calcite.sql.parser.SqlParseException: Encountered "u&\'\\5218\\51eftest\'" at line 4, column 9. Was expecting: ... 
so I think all the SqlCharStringLiteral should be replaced by > comment should be parsed by StringLiteral() instead of SqlCharStringLiteral > to avoid parsing failure > > > Key: FLINK-33249 > URL: https://issues.apache.org/jira/browse/FLINK-33249 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.17.1 >Reporter: xiaogang zhou >Priority: Major > > this problem is also recorded in calcite > > https://issues.apache.org/jira/browse/CALCITE-6046 > > Hi, I found this problem when I used below code to split SQL statements. the > process is SQL string -> SqlNode -> SQL String > {code:java} > // code placeholder > SqlParser.Config parserConfig = getCurrentSqlParserConfig(sqlDialect); > SqlParser sqlParser = SqlParser.create(sqlContent, parserConfig); > SqlNodeList sqlNodeList = sqlParser.parseStmtList(); > s
[jira] [Commented] (FLINK-33249) comment should be parsed by StringLiteral() instead of SqlCharStringLiteral to avoid parsing failure
[ https://issues.apache.org/jira/browse/FLINK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774339#comment-17774339 ] xiaogang zhou commented on FLINK-33249: --- I'd like to take this ticket > comment should be parsed by StringLiteral() instead of SqlCharStringLiteral > to avoid parsing failure > > > Key: FLINK-33249 > URL: https://issues.apache.org/jira/browse/FLINK-33249 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.17.1 >Reporter: xiaogang zhou >Priority: Major > > this problem is also recorded in calcite > > https://issues.apache.org/jira/browse/CALCITE-6046 > > Hi, I found this problem when I used below code to split SQL statements. the > process is SQL string -> SqlNode -> SQL String > {code:java} > // code placeholder > SqlParser.Config parserConfig = getCurrentSqlParserConfig(sqlDialect); > SqlParser sqlParser = SqlParser.create(sqlContent, parserConfig); > SqlNodeList sqlNodeList = sqlParser.parseStmtList(); > sqlParser.parse(sqlNodeList.get(0));{code} > the Dialect/ SqlConformance is a costumed one: > [https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/validate/FlinkSqlConformance.java] > > > then I found below SQL > {code:java} > // code placeholder > CREATE TABLE source ( > a BIGINT > ) comment '测试test' > WITH ( > 'connector' = 'test' > ); {code} > transformed to > {code:java} > // code placeholder > CREATE TABLE `source` ( > `a` BIGINT > ) > COMMENT u&'\5218\51eftest' WITH ( > 'connector' = 'test' > ) {code} > > and the SQL parser template is like > {code:java} > // code placeholder > SqlCreate SqlCreateTable(Span s, boolean replace, boolean isTemporary) : > { > final SqlParserPos startPos = s.pos(); > boolean ifNotExists = false; > SqlIdentifier tableName; > List constraints = new > ArrayList(); > SqlWatermark watermark = null; > SqlNodeList columnList = SqlNodeList.EMPTY; >SqlCharStringLiteral comment = null; >SqlTableLike tableLike = null; > SqlNode asQuery = null; > SqlNodeList propertyList = SqlNodeList.EMPTY; > SqlNodeList partitionColumns = SqlNodeList.EMPTY; > SqlParserPos pos = startPos; > } > { > > ifNotExists = IfNotExistsOpt() > tableName = CompoundIdentifier() > [ > { pos = getPos(); TableCreationContext ctx = new > TableCreationContext();} > TableColumn(ctx) > ( > TableColumn(ctx) > )* > { > pos = pos.plus(getPos()); > columnList = new SqlNodeList(ctx.columnList, pos); > constraints = ctx.constraints; > watermark = ctx.watermark; > } > > ] > [ { > String p = SqlParserUtil.parseString(token.image); > comment = SqlLiteral.createCharString(p, getPos()); > }] > [ > > partitionColumns = ParenthesizedSimpleIdentifierList() > ] > [ > > propertyList = TableProperties() > ] > [ > > tableLike = SqlTableLike(getPos()) > { > return new SqlCreateTableLike(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, > watermark, > comment, > tableLike, > isTemporary, > ifNotExists); > } > | > > asQuery = OrderedQueryOrExpr(ExprContext.ACCEPT_QUERY) > { > return new SqlCreateTableAs(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, > watermark, > comment, > asQuery, > isTemporary, > ifNotExists); > } > ] > { > return new SqlCreateTable(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, > watermark, > comment, > isTemporary, > ifNotExists); > } > } {code} > will give a 
exception : > Caused by: org.apache.calcite.sql.parser.SqlParseException: Encountered > "u&\'\\5218\\51eftest\'" at line 4, column 9. > Was expecting: > ... > > so I think all the SqlCharStringLiteral should be replaced by StringLiteral() -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-33249) comment should be parsed by StringLiteral() instead of SqlCharStringLiteral to avoid parsing failure
[ https://issues.apache.org/jira/browse/FLINK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774426#comment-17774426 ] xiaogang zhou commented on FLINK-33249: --- [~martijnvisser] Hi Martin, I am not sure whether this is something to be fixed in calcite. As SqlCreateTable template is in flink parser. I have attached a url, would you please have a glance at it? > comment should be parsed by StringLiteral() instead of SqlCharStringLiteral > to avoid parsing failure > > > Key: FLINK-33249 > URL: https://issues.apache.org/jira/browse/FLINK-33249 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.17.1 >Reporter: xiaogang zhou >Priority: Major > Labels: pull-request-available > > this problem is also recorded in calcite > > https://issues.apache.org/jira/browse/CALCITE-6046 > > Hi, I found this problem when I used below code to split SQL statements. the > process is SQL string -> SqlNode -> SQL String > {code:java} > // code placeholder > SqlParser.Config parserConfig = getCurrentSqlParserConfig(sqlDialect); > SqlParser sqlParser = SqlParser.create(sqlContent, parserConfig); > SqlNodeList sqlNodeList = sqlParser.parseStmtList(); > sqlParser.parse(sqlNodeList.get(0));{code} > the Dialect/ SqlConformance is a costumed one: > [https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/validate/FlinkSqlConformance.java] > > > then I found below SQL > {code:java} > // code placeholder > CREATE TABLE source ( > a BIGINT > ) comment '测试test' > WITH ( > 'connector' = 'test' > ); {code} > transformed to > {code:java} > // code placeholder > CREATE TABLE `source` ( > `a` BIGINT > ) > COMMENT u&'\5218\51eftest' WITH ( > 'connector' = 'test' > ) {code} > > and the SQL parser template is like > {code:java} > // code placeholder > SqlCreate SqlCreateTable(Span s, boolean replace, boolean isTemporary) : > { > final SqlParserPos startPos = s.pos(); > boolean ifNotExists = false; > SqlIdentifier tableName; > List constraints = new > ArrayList(); > SqlWatermark watermark = null; > SqlNodeList columnList = SqlNodeList.EMPTY; >SqlCharStringLiteral comment = null; >SqlTableLike tableLike = null; > SqlNode asQuery = null; > SqlNodeList propertyList = SqlNodeList.EMPTY; > SqlNodeList partitionColumns = SqlNodeList.EMPTY; > SqlParserPos pos = startPos; > } > { > > ifNotExists = IfNotExistsOpt() > tableName = CompoundIdentifier() > [ > { pos = getPos(); TableCreationContext ctx = new > TableCreationContext();} > TableColumn(ctx) > ( > TableColumn(ctx) > )* > { > pos = pos.plus(getPos()); > columnList = new SqlNodeList(ctx.columnList, pos); > constraints = ctx.constraints; > watermark = ctx.watermark; > } > > ] > [ { > String p = SqlParserUtil.parseString(token.image); > comment = SqlLiteral.createCharString(p, getPos()); > }] > [ > > partitionColumns = ParenthesizedSimpleIdentifierList() > ] > [ > > propertyList = TableProperties() > ] > [ > > tableLike = SqlTableLike(getPos()) > { > return new SqlCreateTableLike(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, > watermark, > comment, > tableLike, > isTemporary, > ifNotExists); > } > | > > asQuery = OrderedQueryOrExpr(ExprContext.ACCEPT_QUERY) > { > return new SqlCreateTableAs(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, > watermark, > comment, > asQuery, > isTemporary, > ifNotExists); > } > ] > { > return new 
SqlCreateTable(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, > watermark, > comment, > isTemporary, > ifNotExists); > } > } {code} > will give a exception : > Caused by: org.apache.calcite.sql.parser.SqlParseException: Encountered > "u&\'\\5218\\51eftest\'" at line 4, column 9. > Was expectin
[jira] [Commented] (FLINK-33249) comment should be parsed by StringLiteral() instead of SqlCharStringLiteral to avoid parsing failure
[ https://issues.apache.org/jira/browse/FLINK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17774468#comment-17774468 ] xiaogang zhou commented on FLINK-33249: --- [~martijnvisser] Sure thing. Let me provide a test to demonstrate the issue. > comment should be parsed by StringLiteral() instead of SqlCharStringLiteral > to avoid parsing failure > > > Key: FLINK-33249 > URL: https://issues.apache.org/jira/browse/FLINK-33249 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.17.1 >Reporter: xiaogang zhou >Priority: Major > Labels: pull-request-available > > this problem is also recorded in Calcite: > > https://issues.apache.org/jira/browse/CALCITE-6046 > > Hi, I found this problem when I used the code below to split SQL statements; the > process is SQL string -> SqlNode -> SQL string > {code:java} > // code placeholder > SqlParser.Config parserConfig = getCurrentSqlParserConfig(sqlDialect); > SqlParser sqlParser = SqlParser.create(sqlContent, parserConfig); > SqlNodeList sqlNodeList = sqlParser.parseStmtList(); > sqlParser.parse(sqlNodeList.get(0));{code} > the Dialect / SqlConformance is a customized one: > [https://github.com/apache/flink/blob/master/flink-table/flink-sql-parser/src/main/java/org/apache/flink/sql/parser/validate/FlinkSqlConformance.java] > > > then I found that the SQL below > {code:java} > // code placeholder > CREATE TABLE source ( > a BIGINT > ) comment '测试test' > WITH ( > 'connector' = 'test' > ); {code} > was transformed to > {code:java} > // code placeholder > CREATE TABLE `source` ( > `a` BIGINT > ) > COMMENT u&'\5218\51eftest' WITH ( > 'connector' = 'test' > ) {code} > > and the SQL parser template is like > {code:java} > // code placeholder > SqlCreate SqlCreateTable(Span s, boolean replace, boolean isTemporary) : > { > final SqlParserPos startPos = s.pos(); > boolean ifNotExists = false; > SqlIdentifier tableName; > List constraints = new > ArrayList(); > SqlWatermark watermark = null; > SqlNodeList columnList = SqlNodeList.EMPTY; >SqlCharStringLiteral comment = null; >SqlTableLike tableLike = null; > SqlNode asQuery = null; > SqlNodeList propertyList = SqlNodeList.EMPTY; > SqlNodeList partitionColumns = SqlNodeList.EMPTY; > SqlParserPos pos = startPos; > } > { > > ifNotExists = IfNotExistsOpt() > tableName = CompoundIdentifier() > [ > { pos = getPos(); TableCreationContext ctx = new > TableCreationContext();} > TableColumn(ctx) > ( > TableColumn(ctx) > )* > { > pos = pos.plus(getPos()); > columnList = new SqlNodeList(ctx.columnList, pos); > constraints = ctx.constraints; > watermark = ctx.watermark; > } > > ] > [ { > String p = SqlParserUtil.parseString(token.image); > comment = SqlLiteral.createCharString(p, getPos()); > }] > [ > > partitionColumns = ParenthesizedSimpleIdentifierList() > ] > [ > > propertyList = TableProperties() > ] > [ > > tableLike = SqlTableLike(getPos()) > { > return new SqlCreateTableLike(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, > watermark, > comment, > tableLike, > isTemporary, > ifNotExists); > } > | > > asQuery = OrderedQueryOrExpr(ExprContext.ACCEPT_QUERY) > { > return new SqlCreateTableAs(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, > watermark, > comment, > asQuery, > isTemporary, > ifNotExists); > } > ] > { > return new SqlCreateTable(startPos.plus(getPos()), > tableName, > columnList, > constraints, > propertyList, > partitionColumns, >
watermark, > comment, > isTemporary, > ifNotExists); > } > } {code} > will give an exception: > Caused by: org.apache.calcite.sql.parser.SqlParseException: Encountered > "u&\'\\5218\\51eftest\'" at line 4, column 9. > Was expecting: > ... > > so I think all SqlCharStringLiteral occurrences should be replaced by StringLiteral() -- This message was sent by
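For reference, a minimal, self-contained sketch of the round trip described above (parse, unparse, re-parse) using Calcite's public parser API. The default parser config and AnsiSqlDialect stand in for the custom Flink dialect/conformance from the report, so this illustrates the mechanism rather than reproducing the exact failure:
{code:java}
import org.apache.calcite.sql.SqlNodeList;
import org.apache.calcite.sql.dialect.AnsiSqlDialect;
import org.apache.calcite.sql.parser.SqlParser;

public class CommentRoundTrip {
    public static void main(String[] args) throws Exception {
        String sql = "CREATE TABLE source (a BIGINT) COMMENT '测试test' WITH ('connector' = 'test')";

        // Parse the original statement (assumption: the report plugs in
        // Flink's parser factory/conformance here instead of the default).
        SqlParser.Config config = SqlParser.config();
        SqlNodeList stmts = SqlParser.create(sql, config).parseStmtList();

        // Unparse back to SQL; a SqlCharStringLiteral holding non-ASCII text
        // can be rendered as a u&'...' Unicode literal, as shown in the report.
        String unparsed = stmts.get(0).toSqlString(AnsiSqlDialect.DEFAULT).getSql();
        System.out.println(unparsed);

        // Re-parsing the unparsed string is where the reported failure shows up
        // when the dialect/conformance does not accept u&'...' literals.
        SqlParser.create(unparsed, config).parseStmtList();
    }
}
{code}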
[jira] [Updated] (FLINK-15264) Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID
[ https://issues.apache.org/jira/browse/FLINK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-15264: -- Affects Version/s: 1.8.1 > Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID > -- > > Key: FLINK-15264 > URL: https://issues.apache.org/jira/browse/FLINK-15264 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.8.1 >Reporter: xiaogang zhou >Priority: Major > > I am running Flink in single-job mode on YARN, so each job has a JobManager > running somewhere on a YARN host. > > Sometimes different JobManagers can run on the same host, and the > TASK_SLOTS_TOTAL metric has no identifier to show which job the slots > belong to. Can we support a metric which can tell how many slots a job > occupies? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15264) Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID
[ https://issues.apache.org/jira/browse/FLINK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-15264: -- Description: I am running Flink in single-job mode on YARN, so each job has a JobManager running somewhere on a YARN host. Sometimes different JobManagers can run on the same host, and the TASK_SLOTS_TOTAL metric has no identifier to show which job the slots belong to. Can we support a metric which can tell how many slots each job occupies? For example, put a job-id tag in TASK_SLOTS_TOTAL, or add another separate metric. [~chesnay] it seems you have helped lots of people in the runtime/metrics area. Let me know your opinion, or any existing solution to this problem. Thanks a lot was: I am running Flink in single-job mode on YARN, so each job has a JobManager running somewhere on a YARN host. Sometimes different JobManagers can run on the same host, and the TASK_SLOTS_TOTAL metric has no identifier to show which job the slots belong to. Can we support a metric which can tell how many slots a job occupies? > Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID > -- > > Key: FLINK-15264 > URL: https://issues.apache.org/jira/browse/FLINK-15264 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.8.1 >Reporter: xiaogang zhou >Priority: Major > > I am running Flink in single-job mode on YARN, so each job has a JobManager > running somewhere on a YARN host. Sometimes different JobManagers can run on > the same host, and the TASK_SLOTS_TOTAL metric has no identifier to show > which job the slots belong to. > > Can we support a metric which can tell how many slots each job occupies? For > example, put a job-id tag in TASK_SLOTS_TOTAL, or add another separate > metric. > > [~chesnay] it seems you have helped lots of people in the runtime/metrics > area. Let me know your opinion, or any existing solution to this problem. > Thanks a lot -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15264) Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID
[ https://issues.apache.org/jira/browse/FLINK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998037#comment-16998037 ] xiaogang zhou commented on FLINK-15264: --- As I mentioned, it can be really confusing when multiple JobManagers are running on a single host. > Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID > -- > > Key: FLINK-15264 > URL: https://issues.apache.org/jira/browse/FLINK-15264 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.8.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2019-12-17-17-01-51-826.png > > > I am running Flink in single-job mode on YARN, so each job has a JobManager > running somewhere on a YARN host. Sometimes different JobManagers can run on > the same host, and the TASK_SLOTS_TOTAL metric has no identifier to show > which job the slots belong to. > > Can we support a metric which can tell how many slots each job occupies? For > example, put a job-id tag in TASK_SLOTS_TOTAL, or add another separate > metric. > > [~chesnay] it seems you have helped lots of people in the runtime/metrics > area. Let me know your opinion, or any existing solution to this problem. > Thanks a lot -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15264) Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID
[ https://issues.apache.org/jira/browse/FLINK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998927#comment-16998927 ] xiaogang zhou commented on FLINK-15264: --- As there is a JobID jobId attribute in the TaskManagerSlot, it should be possible to expose the slot occupation at the ResourceManager layer. As single-job mode on YARN is facing this issue, I believe this is a necessary improvement. Can we discuss whether it is acceptable to expose the per-job slot occupation in the ResourceManager layer? Please kindly let me know your thoughts > Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID > -- > > Key: FLINK-15264 > URL: https://issues.apache.org/jira/browse/FLINK-15264 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.8.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2019-12-17-17-01-51-826.png > > > I am running Flink in single-job mode on YARN, so each job has a JobManager > running somewhere on a YARN host. Sometimes different JobManagers can run on > the same host, and the TASK_SLOTS_TOTAL metric has no identifier to show > which job the slots belong to. > > Can we support a metric which can tell how many slots each job occupies? For > example, put a job-id tag in TASK_SLOTS_TOTAL, or add another separate > metric. > > [~chesnay] it seems you have helped lots of people in the runtime/metrics > area. Let me know your opinion, or any existing solution to this problem. > Thanks a lot -- This message was sent by Atlassian Jira (v8.3.4#803005)
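For illustration, a hypothetical sketch of the kind of metric proposed here: a per-job slot gauge registered under a job-id-tagged group. Only MetricGroup#addGroup and MetricGroup#gauge are existing Flink API; the registration point inside the ResourceManager and the slotsOf accessor are assumptions, not actual Flink code.
{code:java}
import java.util.function.IntSupplier;

import org.apache.flink.metrics.Gauge;
import org.apache.flink.metrics.MetricGroup;

public class PerJobSlotMetrics {

    /**
     * Registers a gauge reporting how many slots the given job currently holds.
     * slotsOf is a hypothetical accessor over the SlotManager's bookkeeping.
     */
    public static void register(MetricGroup resourceManagerGroup, String jobId, IntSupplier slotsOf) {
        resourceManagerGroup
                .addGroup("job_id", jobId) // tags the metric with the owning job
                .gauge("taskSlotsAllocated", (Gauge<Integer>) slotsOf::getAsInt);
    }
}
{code}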
[jira] [Commented] (FLINK-15264) Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID
[ https://issues.apache.org/jira/browse/FLINK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002084#comment-17002084 ] xiaogang zhou commented on FLINK-15264: --- [~x1q1j1] Thank you for the reply; would you please assign this to me? I would be happy to contribute to this issue! > Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID > -- > > Key: FLINK-15264 > URL: https://issues.apache.org/jira/browse/FLINK-15264 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.8.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2019-12-17-17-01-51-826.png > > > I am running Flink in single-job mode on YARN, so each job has a JobManager > running somewhere on a YARN host. Sometimes different JobManagers can run on > the same host, and the TASK_SLOTS_TOTAL metric has no identifier to show > which job the slots belong to. > > Can we support a metric which can tell how many slots each job occupies? For > example, put a job-id tag in TASK_SLOTS_TOTAL, or add another separate > metric. > > [~chesnay] it seems you have helped lots of people in the runtime/metrics > area. Let me know your opinion, or any existing solution to this problem. > Thanks a lot -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (FLINK-15264) Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID
[ https://issues.apache.org/jira/browse/FLINK-15264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-15264: -- Comment: was deleted (was: [~x1q1j1] Thank you for the reply; would you please assign this to me? I would be happy to contribute to this issue!) > Job Manager TASK_SLOTS_TOTAL metric does not show the Job ID > -- > > Key: FLINK-15264 > URL: https://issues.apache.org/jira/browse/FLINK-15264 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.8.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2019-12-17-17-01-51-826.png > > > I am running Flink in single-job mode on YARN, so each job has a JobManager > running somewhere on a YARN host. Sometimes different JobManagers can run on > the same host, and the TASK_SLOTS_TOTAL metric has no identifier to show > which job the slots belong to. > > Can we support a metric which can tell how many slots each job occupies? For > example, put a job-id tag in TASK_SLOTS_TOTAL, or add another separate > metric. > > [~chesnay] it seems you have helped lots of people in the runtime/metrics > area. Let me know your opinion, or any existing solution to this problem. > Thanks a lot -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21543) when using FIFO compaction, I found sst being deleted on the first checkpoint
xiaogang zhou created FLINK-21543: - Summary: when using FIFO compaction, I found sst being deleted on the first checkpoint Key: FLINK-21543 URL: https://issues.apache.org/jira/browse/FLINK-21543 Project: Flink Issue Type: Bug Components: Runtime / State Backends Reporter: xiaogang zhou 2021/03/01-18:51:01.202049 7f59042fc700 (Original Log Time 2021/03/01-18:51:01.200883) [/compaction/compaction_picker_fifo.cc:107] [_timer_state/processing_user-timers] FIFO compaction: picking file 1710 with creation time 0 for deletion. The configuration is like: currentOptions.setCompactionStyle(getCompactionStyle()); currentOptions.setLevel0FileNumCompactionTrigger(8); // currentOptions.setMaxTableFilesSizeFIFO(MemorySize.parse("2gb").getBytes()); CompactionOptionsFIFO compactionOptionsFIFO = new CompactionOptionsFIFO(); compactionOptionsFIFO.setMaxTableFilesSize(MemorySize.parse("8gb").getBytes()); compactionOptionsFIFO.setAllowCompaction(true); The RocksDB version is io.github.myasuka:frocksdbjni:6.10.2-ververica-3.0. I think the problem is caused by the manifest file not being uploaded by Flink. Can anyone suggest how I can avoid this problem? -- This message was sent by Atlassian Jira (v8.3.4#803005)
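For readability, here is the same FIFO-compaction setup reassembled as a self-contained options-factory sketch. The interface and RocksJava setters exist as named; the values mirror the snippet above, while the class name and how it is wired into the state backend are assumptions.
{code:java}
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.CompactionOptionsFIFO;
import org.rocksdb.CompactionStyle;
import org.rocksdb.DBOptions;

public class FifoCompactionOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions; // DB-level options unchanged
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        CompactionOptionsFIFO fifo = new CompactionOptionsFIFO();
        fifo.setMaxTableFilesSize(8L << 30); // 8gb, as in the report
        fifo.setAllowCompaction(true);
        handlesToClose.add(fifo); // native handle; let Flink close it

        return currentOptions
                .setCompactionStyle(CompactionStyle.FIFO)
                .setLevel0FileNumCompactionTrigger(8)
                .setCompactionOptionsFIFO(fifo);
    }
}
{code}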
[jira] [Created] (FLINK-17878) StreamingFileWriter watermark attribute is transient, this might be different from the original value
xiaogang zhou created FLINK-17878: - Summary: StreamingFileWriter watermark attribute is transient, this might be different from the original value Key: FLINK-17878 URL: https://issues.apache.org/jira/browse/FLINK-17878 Project: Flink Issue Type: Improvement Components: API / DataStream Affects Versions: 1.11.0 Reporter: xiaogang zhou StreamingFileWriter has a private transient long currentWatermark = Long.MIN_VALUE; in case a developer wants to create a custom bucket assigner, it will receive currentWatermark as 0, which conflicts with the original Flink approach of handling a min_long. Should we remove the transient keyword? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17878) StreamingFileWriter watermark attribute is transient, this might be different from the original value
[ https://issues.apache.org/jira/browse/FLINK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113823#comment-17113823 ] xiaogang zhou commented on FLINK-17878: --- [~lzljs3620320] please help review this comment > StreamingFileWriter watermark attribute is transient, this might be different > from the original value > --- > > Key: FLINK-17878 > URL: https://issues.apache.org/jira/browse/FLINK-17878 > Project: Flink > Issue Type: Improvement > Components: API / DataStream >Affects Versions: 1.11.0 >Reporter: xiaogang zhou >Priority: Major > > StreamingFileWriter has a > private transient long currentWatermark = Long.MIN_VALUE; > > in case a developer wants to create a custom bucket assigner, it will receive > currentWatermark as 0, which conflicts with the original Flink approach of > handling a min_long. > > should we remove the transient keyword? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17878) StreamingFileWriter watermark attribute is transient, this might be different from the original value
[ https://issues.apache.org/jira/browse/FLINK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113841#comment-17113841 ] xiaogang zhou commented on FLINK-17878: --- [~gaoyunhaii], No; if we use processing time, the watermark will be MIN_LONG, so currentWatermark should be initialized to MIN_LONG. But with the transient keyword, currentWatermark will be initialized to 0... As I am working on a similar feature, I found this issue. Not sure whether I made some mistake... > StreamingFileWriter watermark attribute is transient, this might be different > from the original value > --- > > Key: FLINK-17878 > URL: https://issues.apache.org/jira/browse/FLINK-17878 > Project: Flink > Issue Type: Improvement > Components: API / DataStream >Affects Versions: 1.11.0 >Reporter: xiaogang zhou >Priority: Major > > StreamingFileWriter has a > private transient long currentWatermark = Long.MIN_VALUE; > > in case a developer wants to create a custom bucket assigner, it will receive > currentWatermark as 0, which conflicts with the original Flink approach of > handling a min_long. > > should we remove the transient keyword? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17878) StreamingFileWriter watermark attribute is transient, this might be different from the original value
[ https://issues.apache.org/jira/browse/FLINK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113857#comment-17113857 ] xiaogang zhou commented on FLINK-17878: --- [~gaoyunhaii] to my understanding, the operator is serialized and then submitted to Flink. A transient value will not be serialized, and so the runtime value of currentWatermark will be initialized to zero... > StreamingFileWriter watermark attribute is transient, this might be different > from the original value > --- > > Key: FLINK-17878 > URL: https://issues.apache.org/jira/browse/FLINK-17878 > Project: Flink > Issue Type: Improvement > Components: API / DataStream >Affects Versions: 1.11.0 >Reporter: xiaogang zhou >Priority: Major > > StreamingFileWriter has a > private transient long currentWatermark = Long.MIN_VALUE; > > in case a developer wants to create a custom bucket assigner, it will receive > currentWatermark as 0, which conflicts with the original Flink approach of > handling a min_long. > > should we remove the transient keyword? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
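For reference, a tiny self-contained demonstration of the serialization behavior described in this comment, using plain java.io serialization as a stand-in for how the operator object is shipped (an assumption for illustration): the transient long is skipped on write, so the deserialized copy holds the JVM default 0L instead of Long.MIN_VALUE.
{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientWatermarkDemo {

    static class Op implements Serializable {
        private transient long currentWatermark = Long.MIN_VALUE;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(new Op()); // transient field is not written
        out.flush();

        Op copy = (Op) new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray())).readObject();
        System.out.println(copy.currentWatermark); // prints 0, not Long.MIN_VALUE
    }
}
{code}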
[jira] [Updated] (FLINK-17878) StreamingFileWriter watermark attribute is transient, this might be different from the original value
[ https://issues.apache.org/jira/browse/FLINK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-17878: -- Component/s: (was: API / DataStream) Table SQL / API > StreamingFileWriter watermark attribute is transient, this might be different > from the original value > --- > > Key: FLINK-17878 > URL: https://issues.apache.org/jira/browse/FLINK-17878 > Project: Flink > Issue Type: Improvement > Components: Table SQL / API >Affects Versions: 1.11.0 >Reporter: xiaogang zhou >Priority: Major > > StreamingFileWriter has a > private transient long currentWatermark = Long.MIN_VALUE; > > in case a developer wants to create a custom bucket assigner, it will receive > currentWatermark as 0, which conflicts with the original Flink approach of > handling a min_long. > > should we remove the transient keyword? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17878) StreamingFileWriter watermark attribute is transient, this might be different from the original value
[ https://issues.apache.org/jira/browse/FLINK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113883#comment-17113883 ] xiaogang zhou commented on FLINK-17878: --- [~gaoyunhaii] Corrected. May I submit a PR for this? > StreamingFileWriter watermark attribute is transient, this might be different > from the original value > --- > > Key: FLINK-17878 > URL: https://issues.apache.org/jira/browse/FLINK-17878 > Project: Flink > Issue Type: Improvement > Components: Table SQL / API >Affects Versions: 1.11.0 >Reporter: xiaogang zhou >Priority: Major > > StreamingFileWriter has a > private transient long currentWatermark = Long.MIN_VALUE; > > in case a developer wants to create a custom bucket assigner, it will receive > currentWatermark as 0, which conflicts with the original Flink approach of > handling a min_long. > > should we remove the transient keyword? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17878) StreamingFileWriter watermark attribute is transient, this might be different from the original value
[ https://issues.apache.org/jira/browse/FLINK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113946#comment-17113946 ] xiaogang zhou commented on FLINK-17878: --- [~gaoyunhaii] thx a lot. Can you please help assign the issue to me? :) I can get some opinions from [~lzljs3620320] on how to modify the code > StreamingFileWriter watermark attribute is transient, this might be different > from the original value > --- > > Key: FLINK-17878 > URL: https://issues.apache.org/jira/browse/FLINK-17878 > Project: Flink > Issue Type: Improvement > Components: Table SQL / API >Affects Versions: 1.11.0 >Reporter: xiaogang zhou >Priority: Major > Labels: pull-request-available > > StreamingFileWriter has a > private transient long currentWatermark = Long.MIN_VALUE; > > in case a developer wants to create a custom bucket assigner, it will receive > currentWatermark as 0, which conflicts with the original Flink approach of > handling a min_long. > > should we remove the transient keyword? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17878) Transient watermark attribute should be initialized at runtime in streaming file operators
[ https://issues.apache.org/jira/browse/FLINK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17114499#comment-17114499 ] xiaogang zhou commented on FLINK-17878: --- [~lzljs3620320] Hi Jingsong, thx for the advice; I have created another PR for release-1.11. [~jark] Hi Jark, thx for the advice; when we use processing-time mode, the watermark will not be overwritten. And when I worked on this issue, I didn't remove the transient keyword; I initialize the value in initializeState. > Transient watermark attribute should be initialized at runtime in streaming file > operators > -- > > Key: FLINK-17878 > URL: https://issues.apache.org/jira/browse/FLINK-17878 > Project: Flink > Issue Type: Improvement > Components: Connectors / FileSystem >Affects Versions: 1.11.0 >Reporter: xiaogang zhou >Assignee: xiaogang zhou >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > StreamingFileWriter has a > private transient long currentWatermark = Long.MIN_VALUE; > > in case a developer wants to create a custom bucket assigner, it will receive > currentWatermark as 0, which conflicts with the original Flink approach of > handling a min_long. > > should we remove the transient keyword? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
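A sketch of the fix direction mentioned in this comment; the operator class below is hypothetical, and only the idea of keeping the field transient while assigning Long.MIN_VALUE in initializeState comes from the discussion.
{code:java}
import org.apache.flink.runtime.state.StateInitializationContext;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;

public abstract class WatermarkAwareWriter<OUT> extends AbstractStreamOperator<OUT> {

    // Stays transient: the field is never shipped with the serialized operator.
    private transient long currentWatermark;

    @Override
    public void initializeState(StateInitializationContext context) throws Exception {
        super.initializeState(context);
        // Re-establish the intended default at runtime instead of relying on
        // the field initializer, which is lost after deserialization.
        currentWatermark = Long.MIN_VALUE;
    }
}
{code}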
[jira] [Created] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
xiaogang zhou created FLINK-31089: - Summary: pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit Key: FLINK-31089 URL: https://issues.apache.org/jira/browse/FLINK-31089 Project: Flink Issue Type: Improvement Components: Runtime / State Backends Affects Versions: 1.16.1 Reporter: xiaogang zhou -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Attachment: image-2023-02-15-20-26-58-604.png > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Description: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to false, we can see the pinned memory stay static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=1519,height=756! !image-2023-02-15-20-32-17-993.png! > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to > false, we can see the pinned memory stay static. In our environment, a lot of > tasks restart because they exceed the memory limit and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=1519,height=756! > > !image-2023-02-15-20-32-17-993.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Attachment: image-2023-02-15-20-32-17-993.png > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Description: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to false, we can see the pinned memory stay static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=899,height=447! !image-2023-02-15-20-32-17-993.png|width=853,height=464! was: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to false, we can see the pinned memory stay static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=1519,height=756! !image-2023-02-15-20-32-17-993.png! > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to > false, we can see the pinned memory stay static. In our environment, a lot of > tasks restart because they exceed the memory limit and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689097#comment-17689097 ] xiaogang zhou commented on FLINK-31089: --- [~yunta] master, please kindly review. I have also tested the performance: disabling PinTopLevelIndexAndFilter can significantly decrease performance, while disabling PinL0FilterAndIndexBlocksInCache does not harm performance much. [https://github.com/facebook/rocksdb/issues/4112#issuecomment-405859006] also mentioned that caching the top level is enough, and the RocksDB memory-growth issue has drawn a lot of complaints in the RocksDB issue tracker. > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to > false, we can see the pinned memory stay static. In our environment, a lot of > tasks restart because they exceed the memory limit and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
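For reference, a sketch of the table-format settings under discussion, using RocksJava's BlockBasedTableConfig setters (the helper class and its wiring are illustrative, not Flink's actual configuration code): keep the top-level index pinned, since disabling PinTopLevelIndexAndFilter significantly decreased performance in the test above, but stop pinning L0 filter/index blocks, which is the setting observed to make pinned memory grow.
{code:java}
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.IndexType;

public class PinningConfig {

    public static ColumnFamilyOptions apply(ColumnFamilyOptions options) {
        BlockBasedTableConfig table = new BlockBasedTableConfig()
                .setIndexType(IndexType.kTwoLevelIndexSearch) // partitioned (two-level) index
                .setPartitionFilters(true)
                .setCacheIndexAndFilterBlocks(true)
                .setPinTopLevelIndexAndFilter(true)           // keep: disabling this degraded performance
                .setPinL0FilterAndIndexBlocksInCache(false);  // stop: this kept pinned memory growing
        return options.setTableFormatConfig(table);
    }
}
{code}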
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Description: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to false, we can see the pinned memory stay static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=899,height=447! !image-2023-02-15-20-32-17-993.png|width=853,height=464! the two graphs were recorded yesterday and today, which means the number of data stream records per second should not differ a lot. was: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to false, we can see the pinned memory stay static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=899,height=447! !image-2023-02-15-20-32-17-993.png|width=853,height=464! > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to > false, we can see the pinned memory stay static. In our environment, a lot of > tasks restart because they exceed the memory limit and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Description: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if we switch it to false, we can see the pinned memory stay static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=899,height=447! !image-2023-02-15-20-32-17-993.png|width=853,height=464! the two graphs were recorded yesterday and today, which means the number of data stream records per second should not differ a lot. was: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G). But if we switch it to false, we can see the pinned memory stay static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=899,height=447! !image-2023-02-15-20-32-17-993.png|width=853,height=464! the two graphs were recorded yesterday and today, which means the number of data stream records per second should not differ a lot. > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay static. In our > environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Description: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if we switch it to false, we can see the pinned memory stay relatively static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=899,height=447! !image-2023-02-15-20-32-17-993.png|width=853,height=464! the two graphs were recorded yesterday and today, which means the number of data stream records per second should not differ a lot. was: with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if we switch it to false, we can see the pinned memory stay static. In our environment, a lot of tasks restart because they exceed the memory limit and are killed by k8s !image-2023-02-15-20-26-58-604.png|width=899,height=447! !image-2023-02-15-20-32-17-993.png|width=853,height=464! the two graphs were recorded yesterday and today, which means the number of data stream records per second should not differ a lot. > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689454#comment-17689454 ] xiaogang zhou commented on FLINK-31089: --- [~yunta] looks like we are already using jemalloc: $ /usr/bin/pmap -x 1 | grep malloc 7f434e9aa000 204 204 0 r-x-- libjemalloc.so.1 7f434e9aa000 0 0 0 r-x-- libjemalloc.so.1 7f434e9dd000 2044 0 0 - libjemalloc.so.1 7f434e9dd000 0 0 0 - libjemalloc.so.1 7f434ebdc000 8 8 8 r libjemalloc.so.1 7f434ebdc000 0 0 0 r libjemalloc.so.1 7f434ebde000 4 4 4 rw--- libjemalloc.so.1 7f434ebde000 0 0 0 rw--- libjemalloc.so.1 And regarding 'state.backend.rocksdb.memory.partitioned-index-filters': yes, we configured it as true; without the two-level index cache, the RocksDB performance is really low. Also, flink_taskmanager_job_task_operator_.*rocksdb_block_cache_pinned_usage can grow quickly if PinL0FilterAndIndexBlocksInCache is left true > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689482#comment-17689482 ] xiaogang zhou commented on FLINK-31089: --- [~yunta] Master, are you aware of what these stand for: : Invalid conf pair: prof:true : Invalid conf pair: lg_prof_interval:29 : Invalid conf pair: lg_prof_sample:17 : Invalid conf pair: prof_prefix:/opt/flink/jeprof.out BTW, can you please share your wechat with me > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689482#comment-17689482 ] xiaogang zhou edited comment on FLINK-31089 at 2/16/23 3:29 AM: [~yunta] Master, are you aware of what these stand for: : Invalid conf pair: prof:true : Invalid conf pair: lg_prof_interval:29 : Invalid conf pair: lg_prof_sample:17 : Invalid conf pair: prof_prefix:/opt/flink/jeprof.out BTW, can you please share your dingding with me was (Author: zhoujira86): [~yunta] Master, are you aware of what these stand for: : Invalid conf pair: prof:true : Invalid conf pair: lg_prof_interval:29 : Invalid conf pair: lg_prof_sample:17 : Invalid conf pair: prof_prefix:/opt/flink/jeprof.out BTW, can you please share your wechat with me > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689482#comment-17689482 ] xiaogang zhou edited comment on FLINK-31089 at 2/16/23 3:39 AM: [~yunta] Master, are you aware of what these stand for: : Invalid conf pair: prof:true : Invalid conf pair: lg_prof_interval:29 : Invalid conf pair: lg_prof_sample:17 : Invalid conf pair: prof_prefix:/opt/flink/jeprof.out Looks like I need to get a jemalloc built with --enable-prof. BTW, can you please share your dingding with me was (Author: zhoujira86): [~yunta] Master, are you aware of what these stand for: : Invalid conf pair: prof:true : Invalid conf pair: lg_prof_interval:29 : Invalid conf pair: lg_prof_sample:17 : Invalid conf pair: prof_prefix:/opt/flink/jeprof.out BTW, can you please share your dingding with me > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Attachment: image-2023-02-17-16-48-59-535.png > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690261#comment-17690261 ] xiaogang zhou commented on FLINK-31089: --- [~yunta] Master, I rebuilt jemalloc from source with the config below, and the Invalid conf pair warnings disappeared. But I can't find the prof files in the location I set. Can you please suggest how to collect the evidence? !image-2023-02-17-16-48-59-535.png|width=625,height=101! > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690329#comment-17690329 ] xiaogang zhou commented on FLINK-31089: --- [~yunta] thx. Some background info on the jemalloc version: I updated jemalloc from 3.6.0-11 to 5.0.1. The first set of data I collected is with setPinL0FilterAndIndexBlocksInCache false, and the Flink Kafka offset set to 2 days ago. I saw N4 [label="rocksdb\nUncompressBlockContentsForCompressionType\n1724255408 (63.3%)\r",shape=box,fontsize=47.8]; i.e. UncompressBlockContentsForCompressionType is the major memory consumer > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690329#comment-17690329 ] xiaogang zhou edited comment on FLINK-31089 at 2/17/23 10:34 AM: - [~yunta] thx. Some background info on the jemalloc version: I updated jemalloc from 3.6.0-11 to 5.0.1. The first set of data I collected is with setPinL0FilterAndIndexBlocksInCache false, and the Flink Kafka offset set to 2 days ago. I saw:
Total: 2632053083 B
1693036063 64.3% 64.3% 1693036063 64.3% rocksdb::UncompressBlockContentsForCompressionType
684855869 26.0% 90.3% 684855869 26.0% os::malloc@90ca90
122444085 4.7% 95.0% 122444085 4.7% os::malloc@90cc30
50331648 1.9% 96.9% 50331648 1.9% init
41957115 1.6% 98.5% 41957115 1.6% rocksdb::Arena::AllocateNewBlock
15729360 0.6% 99.1% 18924496 0.7% rocksdb::LRUCacheShard::Insert
8388928 0.3% 99.4% 1701424991 64.6% rocksdb::BlockBasedTable::ReadFilter
4194432 0.2% 99.6% 4194432 0.2% std::string::_Rep::_S_create
3704419 0.1% 99.7% 3704419 0.1% readCEN
3195135 0.1% 99.8% 3195135 0.1% rocksdb::LRUHandleTable::Resize
2098816 0.1% 99.9% 2098816 0.1% std::vector::vector
1065045 0.0% 100.0% 1065045 0.0% updatewindow
1052164 0.0% 100.0% 1052164 0.0% inflateInit2_
0 0.0% 100.0% 87035806 3.3% 0x7fa8f4f8b366
0 0.0% 100.0% 1053704 0.0% 0x7fa8f4f97f59
0 0.0% 100.0% 1053704 0.0% 0x7fa8f4f97f67
so rocksdb::UncompressBlockContentsForCompressionType is the major memory consumer was (Author: zhoujira86): [~yunta] thx. Some background info on the jemalloc version: I updated jemalloc from 3.6.0-11 to 5.0.1. The first set of data I collected is with setPinL0FilterAndIndexBlocksInCache false, and the Flink Kafka offset set to 2 days ago. I saw N4 [label="rocksdb\nUncompressBlockContentsForCompressionType\n1724255408 (63.3%)\r",shape=box,fontsize=47.8]; i.e. UncompressBlockContentsForCompressionType is the major memory consumer > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690346#comment-17690346 ] xiaogang zhou commented on FLINK-31089: --- yes, this is not large at startup, but it keeps growing, so no matter how large the TM memory is, it will finally OOM. Now I have started another task with setPinL0FilterAndIndexBlocksInCache true, which grows faster. I will collect another visual profile over the weekend and post it here. And I think it would be convenient to communicate via dingding; I am at an e-commerce company in Shanghai, in charge of Flink. You can send a dingding request to my mail zhou16...@163.com > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690346#comment-17690346 ] xiaogang zhou edited comment on FLINK-31089 at 2/17/23 10:53 AM: - yes, this is not large at startup, but it keeps growing, so no matter how large the TM memory is, it will finally OOM. Now I have started another task with setPinL0FilterAndIndexBlocksInCache true, which grows faster. I will collect another visual profile over the weekend and post it here. And I think it would be convenient to communicate via dingding; I am at an e-commerce company in Shanghai, in charge of Flink. You can send a dingding request to my mail zhou16...@163.com. Maybe I can get some answers from your side :) was (Author: zhoujira86): yes, this is not large at startup, but it keeps growing, so no matter how large the TM memory is, it will finally OOM. Now I have started another task with setPinL0FilterAndIndexBlocksInCache true, which grows faster. I will collect another visual profile over the weekend and post it here. And I think it would be convenient to communicate via dingding; I am at an e-commerce company in Shanghai, in charge of Flink. You can send a dingding request to my mail zhou16...@163.com > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Attachment: (was: l0pin_open-1.png) > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png, > l0pin_open.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pin L0 index in memory can lead to slow memory growth and finally exceed the memory limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Attachment: l0pin_open.png > pin L0 index in memory can lead to slow memory growth and finally exceed > the memory limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png, > l0pin_open.png > > > with setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the pic below, from 48G -> 50G in about 5 hours). But if > we switch it to false, we can see the pinned memory stay relatively static. In > our environment, a lot of tasks restart because they exceed the memory limit > and are killed by k8s > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > the two graphs were recorded yesterday and today, which means the number of > data stream records per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-31089) pinning the L0 index in memory can lead to slow memory growth that finally pushes memory beyond the limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xiaogang zhou updated FLINK-31089: -- Attachment: l0pin_open-1.png > pinning the L0 index in memory can lead to slow memory growth that finally pushes memory beyond the limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png, > l0pin_open.png > > > With setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the picture below, from 48 GB to 50 GB in about 5 hours). But if > we switch it to false, the pinned memory stays relatively static. In > our environment, a lot of tasks restart because they are killed by k8s for exceeding the memory limit. > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > The two graphs were recorded yesterday and today, which means the number of records > per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31089) pinning the L0 index in memory can lead to slow memory growth that finally pushes memory beyond the limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690999#comment-17690999 ] xiaogang zhou commented on FLINK-31089: --- [~yunta] got some updates with L0 pinning enabled; see the attached l0pin_open. I configured table.exec.state.ttl to 36 hrs. I suspect it does not change the RocksDB default TTL configuration? > pinning the L0 index in memory can lead to slow memory growth that finally pushes memory beyond the limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png, > l0pin_open.png > > > With setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the picture below, from 48 GB to 50 GB in about 5 hours). But if > we switch it to false, the pinned memory stays relatively static. In > our environment, a lot of tasks restart because they are killed by k8s for exceeding the memory limit. > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > The two graphs were recorded yesterday and today, which means the number of records > per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)
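On the TTL question: table.exec.state.ttl drives Flink's own state TTL (idle state retention) for stateful SQL operators; to my understanding the expired entries are cleaned up by Flink itself (e.g. through its RocksDB compaction filter), and the option does not touch RocksDB's native TTL settings, so RocksDB defaults staying unchanged would be expected. A minimal sketch of setting it with the standard Table API follows; the option key is from the Flink docs, the class name is just for illustration.
{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StateTtlSetting {
    public static void main(String[] args) {
        // Streaming Table environment; state TTL only matters in streaming mode.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // 36 hours, matching the value mentioned in the comment above.
        // This bounds how long Flink keeps idle keyed state for SQL operators;
        // it is independent of RocksDB's own TTL options.
        tEnv.getConfig().set("table.exec.state.ttl", "36 h");
    }
}
{code}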
[jira] [Comment Edited] (FLINK-31089) pinning the L0 index in memory can lead to slow memory growth that finally pushes memory beyond the limit
[ https://issues.apache.org/jira/browse/FLINK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690999#comment-17690999 ] xiaogang zhou edited comment on FLINK-31089 at 2/20/23 4:56 AM: [~yunta] got some updates with L0 pinning enabled; see the attached l0pin_open. was (Author: zhoujira86): [~yunta] got some updates with L0 pinning enabled; see the attached l0pin_open. I configured table.exec.state.ttl to 36 hrs. I suspect it does not change the RocksDB default TTL configuration? > pinning the L0 index in memory can lead to slow memory growth that finally pushes memory beyond the limit > --- > > Key: FLINK-31089 > URL: https://issues.apache.org/jira/browse/FLINK-31089 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.16.1 >Reporter: xiaogang zhou >Priority: Major > Attachments: image-2023-02-15-20-26-58-604.png, > image-2023-02-15-20-32-17-993.png, image-2023-02-17-16-48-59-535.png, > l0pin_open.png > > > With setPinL0FilterAndIndexBlocksInCache set to true, we can see the pinned > memory keep growing (in the picture below, from 48 GB to 50 GB in about 5 hours). But if > we switch it to false, the pinned memory stays relatively static. In > our environment, a lot of tasks restart because they are killed by k8s for exceeding the memory limit. > !image-2023-02-15-20-26-58-604.png|width=899,height=447! > > !image-2023-02-15-20-32-17-993.png|width=853,height=464! > The two graphs were recorded yesterday and today, which means the number of records > per second should not differ a lot. -- This message was sent by Atlassian Jira (v8.20.10#820010)