morsapaes commented on a change in pull request #14003: URL: https://github.com/apache/flink/pull/14003#discussion_r532193631
##########
File path: docs/dev/table/sql/gettingStarted.md
##########
@@ -0,0 +1,226 @@
---
title: "Getting Started - Flink SQL"
nav-parent_id: sql
nav-pos: 0
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

* This will be replaced by the TOC
{:toc}

Flink SQL enables SQL developers to design and develop batch or streaming applications without writing code in Java, Scala, or any other programming language. It provides a unified API for both stream and batch processing and lets you express powerful transformations in plain SQL. Flink's SQL support is based on [Apache Calcite](https://calcite.apache.org/), which implements the SQL standard.

In addition to the SQL API, Flink also has a Table API with semantics similar to SQL. The Table API is a language-integrated API: users write queries by calling API methods in their programming language. A typical job creates a table environment, reads a table, applies transformations and aggregations, and writes the results back to another table. The Table API is available in Java, Scala, and Python.

Flink SQL and the Table API are just two different ways of writing queries that run on the same Flink runtime, and all queries are optimized for efficient execution. The SQL API is the more declarative of the two, using the well-known SQL standard, e.g. `SELECT * FROM Table`. Table API queries, in contrast, start from a table and chain operations such as filters, joins, and a final projection, e.g. `Table.filter(...).select(...)`. Standard SQL is quick to learn, even for users with no programming background. This guide focuses on the Flink SQL API; details on the Table API can be found [here]({{ site.baseurl }}/dev/table/).

### Prerequisites
You only need basic knowledge of SQL to follow along. You will not need to write Java or Scala code or use an IDE.

### Installation
There are various ways to [install]({{ site.baseurl }}/ops/deployment/) Flink. For experimentation, the easiest option is to download the binaries and run them locally. We assume a [local installation]({{ site.baseurl }}/try-flink/local_installation.html) for the rest of this tutorial. You can start a local cluster with the following command from the installation folder:

{% highlight bash %}
./bin/start-cluster.sh
{% endhighlight %}

Once the cluster is started, it also serves a web UI on [localhost:8081](http://localhost:8081) for managing settings and monitoring the running jobs.

### SQL Client
The SQL Client is an interactive client for submitting SQL queries to Flink and visualizing the results. It is similar to the query editor of any other database management system: you write queries using standard SQL. You can start the SQL Client from the installation folder as follows:

{% highlight bash %}
./bin/sql-client.sh embedded
{% endhighlight %}

### Hello World query

Once the SQL Client, our query editor, is up and running, it's time to start writing queries. These queries are submitted to the Flink cluster for computation, and the results are returned to the SQL Client UI. Let's start by printing 'Hello World' with the following simple query:

{% highlight sql %}
SELECT 'Hello World';
{% endhighlight %}

The `HELP;` command lists the supported DDL (Data Definition Language) and other commands. Flink SQL also ships with a large set of built-in functions. The following query shows all built-in and user-defined functions:

{% highlight sql %}
SHOW FUNCTIONS;
{% endhighlight %}

Flink SQL provides a set of [built-in functions]({{ site.baseurl }}/dev/table/functions/systemFunctions.html) for data transformations. For example, the following query prints the current timestamp using the `CURRENT_TIMESTAMP` function:

{% highlight sql %}
SELECT CURRENT_TIMESTAMP;
{% endhighlight %}
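Built-in functions can also be combined into larger expressions. As a simple illustration (the literal values here are arbitrary), the following query concatenates two strings with the `||` operator and upper-cases the result with `UPPER`:

{% highlight sql %}
SELECT UPPER('hello' || ' world');
{% endhighlight %}

As before, the expression is evaluated by the cluster and the SQL Client displays the single value `HELLO WORLD`.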
---------------

{% top %}

## Setting up tables
Real-world queries run against SQL tables. Although Flink is a stream processing engine, users can define tables on top of streaming data. Generally, a Flink data processing pipeline has three components: source, compute, and sink.

The source is the input that data is read from, e.g. a text file or a Kafka topic. The compute step defines the computations to perform on the input data. Finally, the sink defines what to do with the output, or where to store the results: a console log, another output file, or a Kafka topic. This is similar to a database query that reads data from a table, performs some computation on it, and then displays the results.

In Flink SQL, sources and sinks are tables, but Flink is not a storage engine and cannot store the data itself. Flink tables therefore need to be backed by a [storage connector]({{ site.baseurl }}/dev/table/connect.html) such as the [file system]({{ site.baseurl }}/dev/table/connect.html#file-system-connector), [Kafka]({{ site.baseurl }}/dev/table/connect.html#kafka-connector), or [MySQL]({{ site.baseurl }}/dev/table/connect.html#jdbc-connector) connector. When creating these tables, the storage connector type, the [format]({{ site.baseurl }}/dev/table/connect.html#table-formats), and the schema of each table need to be defined.

### Input Source Tables
The SQL Client environment is configured via [YAML](https://yaml.org) configuration files. When the SQL Client starts, it reads the default configuration from `/conf/sql-client-defaults.yaml`, which can be overridden by a user-defined configuration file. These files define different parts of the environment, including table sources and sinks, [catalogs]({{ site.baseurl }}/dev/table/catalogs.html), and [user-defined functions]({{ site.baseurl }}/dev/table/functions/udfs.html).

Tables can be defined either in the environment configuration file or through the SQL Client, which supports [SQL DDL commands]({{ site.baseurl }}/dev/table/sql) similar to traditional SQL databases. Standard SQL DDL is used to [create]({{ site.baseurl }}/dev/table/sql/create.html), [alter]({{ site.baseurl }}/dev/table/sql/alter.html), and [drop]({{ site.baseurl }}/dev/table/sql/drop.html) tables.
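As a minimal sketch of what such a DDL statement looks like, the following creates a table backed by Flink's built-in `datagen` connector, which produces randomly generated rows and therefore needs no external storage (the table and column names are made up for illustration):

{% highlight sql %}
CREATE TABLE RandomNumbers (
  id      BIGINT,   -- columns are filled with randomly generated values
  payload DOUBLE
) WITH (
  'connector' = 'datagen'
);
{% endhighlight %}

Running `SELECT * FROM RandomNumbers;` in the SQL Client then continuously returns generated rows, which is a convenient way to experiment before wiring up a real connector such as a file system or Kafka table.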
Flink has a support for few [formats]({{ site.baseurl }}/dev/table/connectors/formats/) that can be used with tables.
Following is an example to define a source table on [csv file]({{ site.baseurl }}/dev/table/connectors/formats/csv.html) using DDL with `EmpId, EmpName, DeptId` as header.

Review comment:
```suggestion
Flink has support for a few [formats]({% link dev/table/connectors/formats.md %}) that can be used with tables. 
Following is an example to define a source table on [csv file]({% link dev/table/connectors/formats/csv.md %}) using DDL with `EmpId`, `EmpName`, `DeptId` as header.
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org