[ https://issues.apache.org/jira/browse/FLINK-19595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-19595:
-----------------------------------
    Labels: stale-major  (was: )

> Flink SQL support S3 select
> ---------------------------
>
>                 Key: FLINK-19595
>                 URL: https://issues.apache.org/jira/browse/FLINK-19595
>             Project: Flink
>          Issue Type: Improvement
>          Components: FileSystems, Table SQL / Ecosystem
>            Reporter: liuxiaolong
>            Priority: Major
>              Labels: stale-major
>         Attachments: image-2020-11-02-18-08-11-461.png, image-2020-11-02-18-18-14-961.png
>
>
> h4. Summarize
> Flink reads data stored in Tencent COS through S3AInputStream.java, which calls the getObject function of AmazonS3Client.java.
> Tencent COS now supports pushdown for the CSV and Parquet file formats.
> In these cases, reading the data with getObject wastes a lot of bandwidth.
> So I think Flink SQL should support S3 Select to reduce the wasted bandwidth.
>
> h4. Design
> 1. In HiveMapredSplitReader.java, we use int[] selectedFields to construct the S3 SELECT SQL. We also created a new class named S3SelectCsvReader, which uses the AmazonS3Client.selectObjectContent function to read the CSV file line by line.
> !image-2020-11-02-18-08-11-461.png|width=535,height=967!
>
> !image-2020-11-02-18-18-14-961.png|width=629,height=284!
>
> 2. Flink Demo Table:
> 1) Table schema
> Flink SQL> desc cos.test_s3a;
> root
>  |-- name: STRING (col1)
>  |-- age: INT (col2)
>  |-- dt: STRING (col3, a partition column)
>
> 2) Conversion relationship (Flink SQL converted to S3 SELECT SQL)
> Flink SQL                                            => S3 SELECT SQL
> select name from cos.test_s3a;                       => SELECT s._1, null FROM S3Object s
> select age from cos.test_s3a;                        => SELECT null, s._2 FROM S3Object s
> select dt, name, age from cos.test_s3a;              => SELECT s._1, s._2 FROM S3Object s
> select dt from cos.test_s3a;                         => SELECT null, null FROM S3Object s
> select * from cos.test_s3a;                          => SELECT s._1, s._2 FROM S3Object s
> select name from cos.test_s3a where dt='2020-07-15'; => SELECT s._1, null FROM S3Object s
>
> 3) Patch Commit
> https://github.com/Coderlxl/flink/commit/b211f4830a7301bf9283a6d37209000b176913ad



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
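
The conversion table in the issue description can be reproduced with a small helper. The following is only a sketch inferred from that table, not the code in the linked patch: the class name `S3SelectSqlSketch`, the method `buildSelectSql`, and the `dataColumnCount` parameter are hypothetical. It assumes the CSV object has no header row (so columns are addressed positionally as `s._1`, `s._2`, ...), that partition columns come after the data columns in the table schema, and that partition columns are not stored in the file, so they never appear in the generated projection.

```java
import java.util.StringJoiner;

public class S3SelectSqlSketch {

    /**
     * Builds an S3 Select statement for a headerless CSV object.
     *
     * @param selectedFields  0-based indexes of the projected table columns
     *                        (may include partition columns, which are ignored here)
     * @param dataColumnCount number of columns physically stored in the file;
     *                        assumed to precede the partition columns in the schema
     */
    public static String buildSelectSql(int[] selectedFields, int dataColumnCount) {
        if (dataColumnCount < 1) {
            throw new IllegalArgumentException("dataColumnCount must be at least 1");
        }
        // Mark which physical columns the query actually needs.
        boolean[] wanted = new boolean[dataColumnCount];
        for (int field : selectedFields) {
            if (field >= 0 && field < dataColumnCount) {
                wanted[field] = true;
            }
        }
        // Keep one output slot per physical column so positions stay stable;
        // unneeded columns are replaced by null and are not transferred.
        StringJoiner projection = new StringJoiner(", ");
        for (int i = 0; i < dataColumnCount; i++) {
            projection.add(wanted[i] ? "s._" + (i + 1) : "null");
        }
        return "SELECT " + projection + " FROM S3Object s";
    }

    public static void main(String[] args) {
        // Schema from the issue: name (0), age (1), dt (2, partition column).
        System.out.println(buildSelectSql(new int[] {0}, 2));       // select name
        System.out.println(buildSelectSql(new int[] {2, 0, 1}, 2)); // select dt, name, age
        System.out.println(buildSelectSql(new int[] {2}, 2));       // select dt
    }
}
```

Keeping a fixed-width projection with `null` placeholders matches the table above and lets the reader keep using the original column positions. The `where dt='2020-07-15'` predicate does not show up in the generated statement because, in the table, a filter on the partition column leaves the S3 SELECT SQL unchanged.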