[jira] [Created] (FLINK-30288) Use visitor to convert predicate for orc
Shammon created FLINK-30288:
--------------------------------

Summary: Use visitor to convert predicate for orc
Key: FLINK-30288
URL: https://issues.apache.org/jira/browse/FLINK-30288
Project: Flink
Issue Type: Improvement
Components: Table Store
Affects Versions: table-store-0.3.0
Reporter: Shammon

Use `PredicateVisitor` to convert `Predicate` in table store for orc.
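For context, here is a minimal sketch of the visitor approach (the interfaces below are hypothetical illustrations, not the actual Table Store classes): each predicate node accepts a visitor, so an ORC-specific visitor implementation can produce the corresponding ORC-side representation without the predicate classes knowing about ORC.

    import java.util.List;

    // Hypothetical stand-ins for the Table Store predicate types.
    interface Predicate {
        <T> T visit(PredicateVisitor<T> visitor);
    }

    interface PredicateVisitor<T> {
        T visitEqual(String field, Object literal);
        T visitAnd(List<Predicate> children);
    }

    // Example node: an equality predicate dispatching to the visitor,
    // so each target format (e.g. ORC) only needs its own visitor.
    final class Equal implements Predicate {
        private final String field;
        private final Object literal;

        Equal(String field, Object literal) {
            this.field = field;
            this.literal = literal;
        }

        @Override
        public <T> T visit(PredicateVisitor<T> visitor) {
            return visitor.visitEqual(field, literal);
        }
    }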
[jira] [Created] (FLINK-30289) RateLimitedSourceReader uses wrong signal for checkpoint rate-limiting
Chesnay Schepler created FLINK-30289:
-------------------------------------

Summary: RateLimitedSourceReader uses wrong signal for checkpoint rate-limiting
Key: FLINK-30289
URL: https://issues.apache.org/jira/browse/FLINK-30289
Project: Flink
Issue Type: Bug
Components: API / Core
Affects Versions: 1.17.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
Fix For: 1.17.0

The checkpoint rate limiter is notified when the checkpoint is complete, but since this signal comes at some point in the future (or not at all), it can result in no records being emitted for a checkpoint, or in more records than expected being emitted.
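As a minimal sketch of the mismatch (illustrative names only, not Flink's actual classes): replenishing the budget from the asynchronous completion callback decouples the limiter from the record-emitting cycle, while replenishing when the snapshot is taken keeps the two in step.

    // Illustrative sketch; Flink's RateLimitedSourceReader differs in detail.
    class CheckpointGatedLimiter {
        private final int capacityPerCheckpoint;
        private int remaining;

        CheckpointGatedLimiter(int capacityPerCheckpoint) {
            this.capacityPerCheckpoint = capacityPerCheckpoint;
            this.remaining = capacityPerCheckpoint;
        }

        // Called before emitting each record.
        boolean tryAcquire() {
            if (remaining == 0) {
                return false;
            }
            remaining--;
            return true;
        }

        // Problematic signal: the completion callback arrives at some later
        // point (or never), so a checkpoint interval may see zero records
        // or more records than the configured capacity.
        void notifyCheckpointComplete(long checkpointId) {
            remaining = capacityPerCheckpoint;
        }

        // A signal synchronous with the data flow, e.g. taking the snapshot,
        // would bound each checkpoint interval to exactly the capacity.
        void snapshotState(long checkpointId) {
            remaining = capacityPerCheckpoint;
        }
    }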
[jira] [Created] (FLINK-30290) IteratorSourceReaderBase should report END_OF_INPUT sooner
Chesnay Schepler created FLINK-30290:
-------------------------------------

Summary: IteratorSourceReaderBase should report END_OF_INPUT sooner
Key: FLINK-30290
URL: https://issues.apache.org/jira/browse/FLINK-30290
Project: Flink
Issue Type: Technical Debt
Components: API / Core
Affects Versions: 1.17.0
Reporter: Chesnay Schepler
Assignee: Chesnay Schepler
Fix For: 1.17.0

The iterator source reader base does not report END_OF_INPUT when the last value is emitted, but instead requires an additional call to pollNext. This is fine functionality-wise, and allowed by the source reader API contracts, but it's not intuitive behavior and leaks into tests for the datagen source.
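Roughly, the difference looks like this (a sketch against the SourceReader contract, not the actual IteratorSourceReaderBase code):

    import org.apache.flink.api.connector.source.ReaderOutput;
    import org.apache.flink.core.io.InputStatus;

    import java.util.Iterator;

    // Sketch of the pollNext behavior; the real class is more involved.
    class IteratorReaderSketch<T> {
        private final Iterator<T> iterator;

        IteratorReaderSketch(Iterator<T> iterator) {
            this.iterator = iterator;
        }

        public InputStatus pollNext(ReaderOutput<T> output) {
            if (iterator.hasNext()) {
                output.collect(iterator.next());
                // Current behavior: return MORE_AVAILABLE even after the
                // last element, so callers must poll once more to see
                // END_OF_INPUT. Reporting it here, as soon as the iterator
                // is exhausted, saves that extra call.
                return iterator.hasNext()
                        ? InputStatus.MORE_AVAILABLE
                        : InputStatus.END_OF_INPUT;
            }
            return InputStatus.END_OF_INPUT;
        }
    }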
[jira] [Created] (FLINK-30291) Integrate flink-connector-aws into Flink docs
Danny Cranmer created FLINK-30291:
----------------------------------

Summary: Integrate flink-connector-aws into Flink docs
Key: FLINK-30291
URL: https://issues.apache.org/jira/browse/FLINK-30291
Project: Flink
Issue Type: Technical Debt
Components: Connectors / AWS, Documentation
Reporter: Danny Cranmer
Fix For: 1.17.0, 1.16.1

Update the docs rendering to integrate `flink-connector-aws`. Add a new shortcode to handle rendering the SQL connector correctly.
[jira] [Created] (FLINK-30292) Better support for conversion between DataType and TypeInformation
Yunfeng Zhou created FLINK-30292:
---------------------------------

Summary: Better support for conversion between DataType and TypeInformation
Key: FLINK-30292
URL: https://issues.apache.org/jira/browse/FLINK-30292
Project: Flink
Issue Type: Improvement
Components: Table SQL / API
Affects Versions: 1.15.3
Reporter: Yunfeng Zhou

In Flink 1.15, we have the following ways to convert a DataType to a TypeInformation. Each of them has some disadvantages.

* `TypeConversions.fromDataTypeToLegacyInfo`
  It might lead to precision loss for some data types, like timestamp, and it has been deprecated.

* `ExternalTypeInfo.of`
  It cannot be used to get detailed type information like `RowTypeInfo`, and it might bring some serialization overhead.

Given that neither of the ways mentioned above is perfect, Flink SQL should provide a better API to support DataType-TypeInformation conversions, and thus better support Table-DataStream conversions.
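To illustrate the trade-offs of the two existing paths described above (a sketch; the row type is just an example):

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.table.api.DataTypes;
    import org.apache.flink.table.runtime.typeutils.ExternalTypeInfo;
    import org.apache.flink.table.types.DataType;
    import org.apache.flink.table.types.utils.TypeConversions;

    public class DataTypeConversionExample {
        public static void main(String[] args) {
            DataType dataType =
                    DataTypes.ROW(DataTypes.FIELD("ts", DataTypes.TIMESTAMP(3)));

            // Option 1: deprecated; per the issue above, this path can lose
            // precision for types such as higher-precision timestamps.
            TypeInformation<?> legacy =
                    TypeConversions.fromDataTypeToLegacyInfo(dataType);

            // Option 2: preserves the type, but is opaque (no RowTypeInfo-style
            // field introspection) and may add serialization overhead.
            TypeInformation<?> external = ExternalTypeInfo.of(dataType);

            System.out.println(legacy);
            System.out.println(external);
        }
    }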
[jira] [Created] (FLINK-30293) Create an enumerator for static (batch)
Jingsong Lee created FLINK-30293:
---------------------------------

Summary: Create an enumerator for static (batch)
Key: FLINK-30293
URL: https://issues.apache.org/jira/browse/FLINK-30293
Project: Flink
Issue Type: Improvement
Components: Table Store
Reporter: Jingsong Lee
Fix For: table-store-0.3.0

In FLINK-30207, we created an enumerator for continuous reading. We should also have an enumerator for static (batch) reading. For example, beyond the current read-compacted behavior, time traveling may specify the commit time at which to read snapshots in the future. I think these capabilities need to be in the core, but should they be in scan? (It seems that they should not.)
Re: [DISCUSS] FLIP-277: Native GlueCatalog Support in Flink
Hi, Samrat. I have seen some users asking for GlueCatalog support[1]; it's really exciting that you're driving it. After a quick look at this FLIP, I have some comments:

1: I noticed there's a YAML part in the "Using the Catalog" section. What do you mean by that? Do you mean how to use the Glue catalog in the SQL client? If so, just for your information, using a YAML environment file in the SQL client is not supported[2].

2: There seems to be a typo in the "Design#views" part: it contains "listTables", which I think shouldn't be there. Also, I'm curious how to list views using the Glue API. Is there a ready-made API to list views directly, or do we need to list the tables and then filter the views using the table kind? (See the sketch after this message.)

3: In the "Flink Glue DataType Mapping" part, CharType is mapped to String. It seems the char's length will be lost; is it possible to have a better mapping which won't lose the length of the char type?

4: About the "Flink CatalogFunction mapping with Glue Function" part, how do we map the function language in Flink's CatalogFunction?

[1] https://lists.apache.org/thread/pdd780wl4f26p447fohvm9osky2r9fhh
[2] https://issues.apache.org/jira/browse/FLINK-22540

Best regards,
Yuxia

----- Original Message -----
From: "Samrat Deb"
To: "dev"
Cc: "prabhujose gates"
Sent: Saturday, December 3, 2022 12:29:16 PM
Subject: [DISCUSS] FLIP-277: Native GlueCatalog Support in Flink

Hi everyone,

I would like to open a discussion[1] on providing GlueCatalog support in Flink. Currently, Flink offers 3 major types of catalog[2], of which only HiveCatalog is a persistent catalog, backed by the Hive Metastore. We would like to introduce GlueCatalog in Flink, offering another option for users which will be persistent in nature. The AWS Glue Data Catalog is a centralized data catalog in the AWS cloud that provides integrations with many different connectors[3]. Flink GlueCatalog can use the features provided by Glue and create strong integration with other services in the cloud.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-277%3A+Native+GlueCatalog+Support+in+Flink
[2] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/catalogs/
[3] https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro
[4] https://issues.apache.org/jira/browse/FLINK-29549

Bests,
Samrat
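Regarding question 2, as far as I know Glue has no dedicated list-views call; one would list the tables and filter on the table type, along these lines (an untested AWS SDK v2 sketch; the database name is a placeholder):

    import software.amazon.awssdk.services.glue.GlueClient;
    import software.amazon.awssdk.services.glue.model.GetTablesRequest;
    import software.amazon.awssdk.services.glue.model.Table;

    import java.util.List;
    import java.util.stream.Collectors;

    public class ListGlueViews {
        public static void main(String[] args) {
            try (GlueClient glue = GlueClient.create()) {
                // Glue exposes only GetTables; views are tables whose
                // tableType is "VIRTUAL_VIEW".
                List<String> views =
                        glue.getTablesPaginator(
                                        GetTablesRequest.builder()
                                                .databaseName("my_database") // placeholder
                                                .build())
                                .stream()
                                .flatMap(resp -> resp.tableList().stream())
                                .filter(t -> "VIRTUAL_VIEW".equals(t.tableType()))
                                .map(Table::name)
                                .collect(Collectors.toList());

                System.out.println(views);
            }
        }
    }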
Re: [VOTE] FLIP-273: Improve Catalog API to Support ALTER TABLE syntax
+1 (binding)

Best,
Jark

On Fri, 2 Dec 2022 at 10:11, Paul Lam wrote:

> +1 (non-binding)
>
> Best,
> Paul Lam
>
>> On Dec 2, 2022, at 09:17, yuxia wrote:
>>
>> +1 (non-binding)
>>
>> Best regards,
>> Yuxia
>>
>> ----- Original Message -----
>> From: "Yaroslav Tkachenko"
>> To: "dev"
>> Sent: Friday, December 2, 2022 12:27:24 AM
>> Subject: Re: [VOTE] FLIP-273: Improve Catalog API to Support ALTER TABLE syntax
>>
>> +1 (non-binding).
>>
>> Looking forward to it!
>>
>> On Thu, Dec 1, 2022 at 5:06 AM Dong Lin wrote:
>>
>>> +1 (binding)
>>>
>>> Thanks for the FLIP!
>>>
>>> On Thu, Dec 1, 2022 at 12:20 PM Shengkai Fang wrote:
>>>
>>>> Hi All,
>>>>
>>>> Thanks for all the feedback so far. Based on the discussion[1] we seem to have a consensus, so I would like to start a vote on FLIP-273.
>>>>
>>>> The vote will last for at least 72 hours (Dec 5th at 13:00 GMT, excluding weekend days) unless there is an objection or insufficient votes.
>>>>
>>>> Best,
>>>> Shengkai
>>>>
>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-273%3A+Improve+the+Catalog+API+to+Support+ALTER+TABLE+syntax
>>>> [2] https://lists.apache.org/thread/2v4kh2bpzvk049zdxb687q7o1pcmnnnw
Re: [DISCUSS] FLIP-275: Support Remote SQL Client Based on SQL Gateway
Hi, Shammon

Thanks for your feedback. I think it's good to support a jdbc-sdk. However, it's not supported on the gateway side yet. In my opinion, this FLIP is more concerned with the SQL Client. How about putting "supporting jdbc-sdk" in 'Future Work'? We can discuss how to implement it in another thread.

Best,
Yu Zelin

> On Dec 2, 2022, at 18:12, Shammon FY wrote:
>
> Hi zelin
>
> Thanks for driving this discussion.
>
> I notice that the sql-client will interact with the sql-gateway via a `REST Client` in the `Executor` in the FLIP. How about introducing a jdbc-sdk for the sql-gateway?
>
> Then the sql-client can connect to the gateway with the jdbc-sdk; on the other hand, other applications and tools such as jmeter can use the jdbc-sdk to connect to the sql-gateway too.
>
> Best,
> Shammon
>
> On Fri, Dec 2, 2022 at 4:10 PM yu zelin wrote:
>
>> Hi Jim,
>>
>> Thanks for your feedback!
>>
>>> Should this configuration be mentioned in the FLIP?
>>
>> Sure.
>>
>>> some way for the server to be able to limit the number of requests it receives.
>>
>> I'm sorry, but this FLIP is dedicated to implementing the Remote mode, so we didn't consider this much. I think the option is enough currently. I will add the improvement suggestions to 'Future Work'.
>>
>>> I wonder if two other options are possible
>>
>> Forwarding the raw format to the gateway and then to the client is possible. The raw results from the sink are in 'CollectResultIterator#bufferedResult'. First, we can find a way to get this result without wrapping it. Second, we can construct an 'InternalTypeInfo' using the schema information (the data's logical type). After construction, we can get the 'TypeSerializer' to deserialize the raw result.
>>
>>> On Dec 1, 2022, at 04:54, Jim Hughes wrote:
>>>
>>> Hi Yu,
>>>
>>> Thanks for moving my comments to this thread! Also, thank you for answering my questions; it is helping me understand the SQL Gateway better.
>>>
>>>> 5. Our idea is to introduce a new session option (like 'sql-client.result.fetch-interval') to control the frequency of fetch requests. What do you think?
>>>
>>> Should this configuration be mentioned in the FLIP?
>>>
>>> One slight concern I have with having 'sql-client.result.fetch-interval' as a session configuration is that users could set it low and cause the client to send a large volume of requests to the SQL gateway.
>>>
>>> Generally, I'd like to see some way for the server to be able to limit the number of requests it receives. If that really needs to be done by a proxy in front of the SQL gateway, that is fine as well. (To be clear, I don't think my concern here should be blocking in any way.)
>>>
>>>> 7. What is the serialization lifecycle for results?
>>>
>>> I wonder if two other options are possible:
>>> 3) Could the Gateway just forward the result byte array? (Or does the Gateway need to deserialize the response in order to understand it for some reason?)
>>> 4) Could the JobManager prepare the results in JSON? (Or similarly, could the Client read the format which the JobManager sends?)
>>>
>>> Thanks again!
>>>
>>> Cheers,
>>>
>>> Jim
>>>
>>> On Wed, Nov 30, 2022 at 9:40 AM yu zelin wrote:
>>>
>>>> Hi, all
>>>>
>>>> Thanks for Jim's questions below. Here I'd like to reply to them.
>>>>
>>>>> 1. For the Client Parser, is it going to work with the extended syntax from the Flink Table Store?
>>>>>
>>>>> 2. Relatedly, what will happen if an older Client tries to handle syntax that a newer service supports? (Suppose I use a 1.17 client with a 1.18 Gateway/system which has a new keyword. Is there anything we should be designing for upfront?)
>>>>>
>>>>> 3. How will client and server version mismatches be handled? Will a single gateway be able to support multiple endpoint versions?
>>>>>
>>>>> 4. How are commands which change a session handled? Are those sent via an ExecuteStatementRequest?
>>>>>
>>>>> 5. The remote POC uses polling for getting back status and getting back results. Would it be possible to switch to web sockets or some other mechanism to avoid polling? If polling is used for both, the polling frequency should be different between local and remote configurations.
>>>>>
>>>>> 6. What does this sentence mean? "The reason why we didn't get the sql type in client side is because it's hard for the lightweight client-level parser to recognize some sql type sql, such as query with CTE."
>>>>>
>>>>> 7. What is the serialization lifecycle for results? It makes sense to have some control over whether the gateway returns results as SQL or JSON. I'd love to see a way to avoid needing to serialize and deserialize results on the SQL Gateway if possible. I'm still new enough to the project that I'm not sure if t
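For what it's worth, usage of the 'sql-client.result.fetch-interval' session option proposed in this thread might look like the following in a client session (the option name, value format, and default are not final):

    -- Hypothetical; this option is only proposed in the FLIP discussion above.
    SET 'sql-client.result.fetch-interval' = '500ms';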
Patch to support Parquet schema evolution
Hi there,

I found a null-value issue when using Flink to read Parquet files with multiple versions of a schema (V1->V2->V3->..->Vn). Assume there are two fields in the given Parquet schema, as below, and field F2 only exists in version 2.

Version1: F1
Version2: F1, F2

Currently the value of field F2 will be empty when reading data from a Parquet file using schema version 2. I explored the implementation and found that Flink uses a collection named `unknownFieldsIndices` to track the nonexistent fields, which is applied to all Parquet files under the given path. I drafted a patch with a unit test to fix this issue.

https://issues.apache.org/jira/browse/FLINK-29527
https://github.com/apache/flink/pull/21149

As this PR has been pending for a long time, I hope a committer can help review it and provide feedback if possible.

Thanks!
Shun
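To make the failure mode concrete, here is a small self-contained model of it (plain Java, not the actual Flink reader code): a single unknown-fields set shared across all files poisons later files that do contain the field, while per-file tracking behaves correctly.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class SchemaEvolutionModel {
        // The reader schema (version 2) requests both fields.
        static final List<String> READ_FIELDS = List.of("F1", "F2");

        // Returns the values read for READ_FIELDS from a file that
        // physically contains only fileFields; missing fields come back null.
        static List<String> readFile(List<String> fileFields, Set<String> unknown) {
            List<String> values = new ArrayList<>();
            for (String f : READ_FIELDS) {
                if (!fileFields.contains(f)) {
                    unknown.add(f);
                }
                values.add(unknown.contains(f) ? null : "value-of-" + f);
            }
            return values;
        }

        public static void main(String[] args) {
            List<String> v1File = List.of("F1");
            List<String> v2File = List.of("F1", "F2");

            // Shared across files, as described for `unknownFieldsIndices`:
            Set<String> shared = new HashSet<>();
            System.out.println(readFile(v1File, shared)); // [value-of-F1, null]
            System.out.println(readFile(v2File, shared)); // [value-of-F1, null]  <- bug

            // Tracked per file: the v2 file yields its F2 value.
            System.out.println(readFile(v2File, new HashSet<>())); // [value-of-F1, value-of-F2]
        }
    }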
Re: Need right
Hi, There is no need for additional permissions: you can start working on a Jira issue of your liking (feel free to ping me to get it assigned to you) and open up a PR. Thanks and looking forward to your contribution! Best regards, Martijn On Mon, Dec 5, 2022 at 6:59 AM Stan1005 <532338...@qq.com.invalid> wrote: > Hi, I want to contribute to Apache Flink. Would you please give me the > contributor permission? My JIRA ID is StarBoy1005.
Re: Patch to support Parquet schema evolution
Hi, Shun. Thanks for the contribution. I'll have a look first and then find some committers to help review & merge.

Best regards,
Yuxia

----- Original Message -----
From: "sunshun18"
To: "dev"
Sent: Monday, December 5, 2022 11:54:38 AM
Subject: Patch to support Parquet schema evolution
[jira] [Created] (FLINK-30294) Change table property key 'log.scan' to 'startup.mode' and add a default startup mode in Table Store
Caizhi Weng created FLINK-30294:
--------------------------------

Summary: Change table property key 'log.scan' to 'startup.mode' and add a default startup mode in Table Store
Key: FLINK-30294
URL: https://issues.apache.org/jira/browse/FLINK-30294
Project: Flink
Issue Type: Improvement
Components: Table Store
Affects Versions: table-store-0.3.0
Reporter: Caizhi Weng
Assignee: Caizhi Weng

We're introducing time-travel reading of Table Store for batch jobs. However, this reading mode is quite similar to the "from-timestamp" startup mode for streaming jobs, except that "from-timestamp" streaming jobs only consume incremental data, not history data.

We can support startup modes for both batch and streaming jobs. For batch jobs, the "from-timestamp" startup mode will produce all records from the last snapshot before the specified timestamp. For streaming jobs, the behavior doesn't change.

Previously, in order to use the "from-timestamp" startup mode, users had to specify both "log.scan" and "log.scan.timestamp-millis", which is a little inconvenient. We can introduce a "default" startup mode whose behavior is based on the execution environment and other configurations. In this way, to use the "from-timestamp" startup mode, it is enough for users to specify just "startup.timestamp-millis".
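As a sketch of the intended simplification (property names are taken from this issue; the DDL around them is illustrative):

    -- Before: both properties must be set for a from-timestamp read.
    CREATE TABLE t (id INT) WITH (
        'log.scan' = 'from-timestamp',
        'log.scan.timestamp-millis' = '1669852800000'
    );

    -- Proposed: with the "default" startup mode, specifying only the
    -- timestamp is enough; the mode is inferred from it.
    CREATE TABLE t2 (id INT) WITH (
        'startup.timestamp-millis' = '1669852800000'
    );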