This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris-spark-connector.git
commit d1299981bcae2258dbef12fa71d038842f0afb70
Author: Mingyu Chen <morningman....@gmail.com>
AuthorDate: Tue May 19 14:20:21 2020 +0800

    [Spark on Doris] Shade and provide the thrift lib in spark-doris-connector (#3631)

    Main changes:
    1. Shade and provide the thrift lib in spark-doris-connector
    2. Add a `build.sh` for spark-doris-connector
    3. Move the README.md of spark-doris-connector to `docs/`
    4. Change the line delimiter of `fe/src/test/java/org/apache/doris/analysis/AggregateTest.java`
---
 README.md | 150 --------------------------------------------------------------
 build.sh  |  59 ++++++++++++++++++++++++
 pom.xml   |  59 +++++++++++++++++++++---
 3 files changed, 112 insertions(+), 156 deletions(-)

diff --git a/README.md b/README.md
deleted file mode 100644
index 3c41b93..0000000
--- a/README.md
+++ /dev/null
@@ -1,150 +0,0 @@
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# Spark-Doris-Connector
-
-## Features
-
-- The current version only supports reading data from `Doris`.
-- A `Doris` table can be mapped to a `DataFrame` or an `RDD`; `DataFrame` is recommended.
-- Data filtering can be completed on the `Doris` side, reducing the amount of data transferred.
-
-## Version Compatibility
-
-| Connector | Spark | Doris  | Java | Scala |
-| --------- | ----- | ------ | ---- | ----- |
-| 1.0.0     | 2.x   | master | 8    | 2.11  |
-
-## Building
-
-```bash
-mvn clean package
-```
-
-After a successful build, the file `doris-spark-1.0.0-SNAPSHOT.jar` is generated in the `target` directory. Copy this file into `Spark`'s `ClassPath` to use `Spark-Doris-Connector`. For example, for `Spark` running in `Local` mode, put the file into the `jars` folder; for `Spark` running in `Yarn` cluster mode, put the file into the pre-deployment package.
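For a quick local test, the jar can also be handed to Spark per session instead of being copied into the classpath; a minimal sketch, assuming the jar was just built into `target/` and `spark-shell` is on the `PATH`:

```bash
# Attach the freshly built connector to a single local spark-shell session,
# without copying it into Spark's jars/ directory.
spark-shell --jars target/doris-spark-1.0.0-SNAPSHOT.jar
```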
-
-## QuickStart
-
-### SQL
-
-```sql
-CREATE TEMPORARY VIEW spark_doris
-USING doris
-OPTIONS(
-  "table.identifier"="$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME",
-  "fenodes"="$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
-  "user"="$YOUR_DORIS_USERNAME",
-  "password"="$YOUR_DORIS_PASSWORD"
-);
-
-SELECT * FROM spark_doris;
-```
-
-### DataFrame
-
-```scala
-val dorisSparkDF = spark.read.format("doris")
-  .option("doris.table.identifier", "$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME")
-  .option("doris.fenodes", "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT")
-  .option("user", "$YOUR_DORIS_USERNAME")
-  .option("password", "$YOUR_DORIS_PASSWORD")
-  .load()
-
-dorisSparkDF.show(5)
-```
-
-### RDD
-
-```scala
-import org.apache.doris.spark._
-val dorisSparkRDD = sc.dorisRDD(
-  tableIdentifier = Some("$YOUR_DORIS_DATABASE_NAME.$YOUR_DORIS_TABLE_NAME"),
-  cfg = Some(Map(
-    "doris.fenodes" -> "$YOUR_DORIS_FE_HOSTNAME:$YOUR_DORIS_FE_RESFUL_PORT",
-    "doris.request.auth.user" -> "$YOUR_DORIS_USERNAME",
-    "doris.request.auth.password" -> "$YOUR_DORIS_PASSWORD"
-  ))
-)
-
-dorisSparkRDD.collect()
-```
-
-## Configuration
-
-### General
-
-| Key                              | Default Value     | Comment                                                      |
-| -------------------------------- | ----------------- | ------------------------------------------------------------ |
-| doris.fenodes                    | --                | Doris RESTful interface address; multiple addresses are supported, separated by commas |
-| doris.table.identifier           | --                | Name of the Doris table mapped to the DataFrame/RDD |
-| doris.request.retries            | 3                 | Number of retries for requests sent to Doris |
-| doris.request.connect.timeout.ms | 30000             | Connection timeout, in milliseconds, for requests sent to Doris |
-| doris.request.read.timeout.ms    | 30000             | Read timeout, in milliseconds, for requests sent to Doris |
-| doris.request.query.timeout.s    | 3600              | Timeout, in seconds, for queries against Doris; the default is 1 hour, and -1 means no timeout limit |
-| doris.request.tablet.size        | Integer.MAX_VALUE | Number of Doris Tablets per RDD partition.<br />The smaller this value, the more partitions are generated,<br />which increases parallelism on the Spark side but also puts more pressure on Doris. |
-| doris.batch.size                 | 1024              | Maximum number of rows read from BE in one batch.<br />Increasing this value reduces the number of connections established between Spark and Doris,<br />and thus the extra overhead caused by network latency. |
-| doris.exec.mem.limit             | 2147483648        | Memory limit for a single query, in bytes; the default is 2GB |
-| doris.deserialize.arrow.async    | false             | Whether to asynchronously convert the Arrow format into the RowBatches needed for spark-doris-connector iteration |
-| doris.deserialize.queue.size     | 64                | Size of the internal queue used for asynchronous Arrow conversion; effective when doris.deserialize.arrow.async is true |
-
-### SQL and Dataframe Only
-
-| Key                             | Default Value | Comment                                                      |
-| ------------------------------- | ------------- | ------------------------------------------------------------ |
-| user                            | --            | Username for accessing Doris |
-| password                        | --            | Password for accessing Doris |
-| doris.filter.query.in.max.count | 100           | Maximum number of elements in the value list of an `in` predicate that is pushed down.<br />Beyond this limit, the `in` filter is evaluated on the Spark side instead. |
-
-### RDD Only
-
-| Key                         | Default Value | Comment                                                      |
-| --------------------------- | ------------- | ------------------------------------------------------------ |
-| doris.request.auth.user     | --            | Username for accessing Doris |
-| doris.request.auth.password | --            | Password for accessing Doris |
-| doris.read.field            | --            | List of column names to read from the Doris table, separated by commas |
-| doris.filter.query          | --            | Expression used to filter the data being read; it is passed through to Doris,<br />which uses it to filter data at the source. |
-
-## Doris Data Type - Spark Data Type Mapping
-
-| Doris Type | Spark Type                       |
-| ---------- | -------------------------------- |
-| NULL_TYPE  | DataTypes.NullType               |
-| BOOLEAN    | DataTypes.BooleanType            |
-| TINYINT    | DataTypes.ByteType               |
-| SMALLINT   | DataTypes.ShortType              |
-| INT        | DataTypes.IntegerType            |
-| BIGINT     | DataTypes.LongType               |
-| FLOAT      | DataTypes.FloatType              |
-| DOUBLE     | DataTypes.DoubleType             |
-| DATE       | DataTypes.StringType<sup>1</sup> |
-| DATETIME   | DataTypes.StringType<sup>1</sup> |
-| BINARY     | DataTypes.BinaryType             |
-| DECIMAL    | DecimalType                      |
-| CHAR       | DataTypes.StringType             |
-| LARGEINT   | DataTypes.StringType             |
-| VARCHAR    | DataTypes.StringType             |
-| DECIMALV2  | DecimalType                      |
-| TIME       | DataTypes.DoubleType             |
-| HLL        | Unsupported datatype             |
-
-<sup>1</sup>: In the Connector, `DATE` and `DATETIME` are mapped to `String`. Due to the processing logic of the underlying `Doris` storage engine, the time range covered when using the time types directly cannot meet the requirements, so the `String` type is used to return the corresponding human-readable time text.
\ No newline at end of file
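The two hunks that follow add a `build.sh` wrapper and teach `pom.xml` to resolve artifacts from an internal mirror when `CUSTOM_MAVEN_REPO` is set. A minimal sketch of how the script might be driven from the environment (the Maven path and repository URL below are placeholders):

```bash
# Optional overrides read by the new build.sh and pom.xml profiles.
# CUSTOM_MVN        -> alternative Maven binary used instead of `mvn`
# CUSTOM_MAVEN_REPO -> activates the custom-env profile so dependencies
#                      resolve from an internal mirror (placeholder URL)
export CUSTOM_MVN=/opt/apache-maven/bin/mvn
export CUSTOM_MAVEN_REPO=https://nexus.example.com/repository/maven-public

sh build.sh
# On success the shaded connector jar is copied to ./output/doris-spark-1.0.0-SNAPSHOT.jar
```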
diff --git a/build.sh b/build.sh
new file mode 100755
index 0000000..9119841
--- /dev/null
+++ b/build.sh
@@ -0,0 +1,59 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+##############################################################
+# This script is used to compile Spark-Doris-Connector
+# Usage:
+#    sh build.sh
+#
+##############################################################
+
+set -eo pipefail
+
+ROOT=`dirname "$0"`
+ROOT=`cd "$ROOT"; pwd`
+
+export DORIS_HOME=${ROOT}/../../
+
+# include custom environment variables
+if [[ -f ${DORIS_HOME}/custom_env.sh ]]; then
+    . ${DORIS_HOME}/custom_env.sh
+fi
+
+# check maven
+MVN_CMD=mvn
+if [[ ! -z ${CUSTOM_MVN} ]]; then
+    MVN_CMD=${CUSTOM_MVN}
+fi
+if ! ${MVN_CMD} --version; then
+    echo "Error: mvn is not found"
+    exit 1
+fi
+export MVN_CMD
+
+${MVN_CMD} clean package
+
+
+mkdir -p output/
+cp target/doris-spark-1.0.0-SNAPSHOT.jar ./output/
+
+echo "*****************************************"
+echo "Successfully built Spark-Doris-Connector"
+echo "*****************************************"
+
+exit 0
diff --git a/pom.xml b/pom.xml
index 35986ad..cdf1055 100644
--- a/pom.xml
+++ b/pom.xml
@@ -36,6 +36,50 @@
         <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
     </properties>
 
+    <profiles>
+        <!-- for custom internal repository -->
+        <profile>
+            <id>custom-env</id>
+            <activation>
+                <property>
+                    <name>env.CUSTOM_MAVEN_REPO</name>
+                </property>
+            </activation>
+
+            <repositories>
+                <repository>
+                    <id>custom-nexus</id>
+                    <url>${env.CUSTOM_MAVEN_REPO}</url>
+                </repository>
+            </repositories>
+
+            <pluginRepositories>
+                <pluginRepository>
+                    <id>custom-nexus</id>
+                    <url>${env.CUSTOM_MAVEN_REPO}</url>
+                </pluginRepository>
+            </pluginRepositories>
+        </profile>
+
+        <!-- for general repository -->
+        <profile>
+            <id>general-env</id>
+            <activation>
+                <property>
+                    <name>!env.CUSTOM_MAVEN_REPO</name>
+                </property>
+            </activation>
+
+            <repositories>
+                <repository>
+                    <id>central</id>
+                    <name>central maven repo https</name>
+                    <url>https://repo.maven.apache.org/maven2</url>
+                </repository>
+            </repositories>
+        </profile>
+    </profiles>
+
     <dependencies>
         <dependency>
             <groupId>org.apache.spark</groupId>
@@ -53,7 +97,6 @@
             <groupId>org.apache.thrift</groupId>
             <artifactId>libthrift</artifactId>
             <version>${libthrift.version}</version>
-            <scope>provided</scope>
         </dependency>
         <dependency>
             <groupId>org.apache.arrow</groupId>
@@ -150,23 +193,27 @@
                        <relocations>
                            <relocation>
                                <pattern>org.apache.arrow</pattern>
-                               <shadedPattern>org.apache.doris.arrow</shadedPattern>
+                               <shadedPattern>org.apache.doris.shaded.org.apache.arrow</shadedPattern>
                            </relocation>
                            <relocation>
                                <pattern>io.netty</pattern>
-                               <shadedPattern>org.apache.doris.netty</shadedPattern>
+                               <shadedPattern>org.apache.doris.shaded.io.netty</shadedPattern>
                            </relocation>
                            <relocation>
                                <pattern>com.fasterxml.jackson</pattern>
-                               <shadedPattern>org.apache.doris.jackson</shadedPattern>
+                               <shadedPattern>org.apache.doris.shaded.com.fasterxml.jackson</shadedPattern>
                            </relocation>
                            <relocation>
                                <pattern>org.apache.commons.codec</pattern>
-                               <shadedPattern>org.apache.doris.commons.codec</shadedPattern>
+                               <shadedPattern>org.apache.doris.shaded.org.apache.commons.codec</shadedPattern>
                            </relocation>
                            <relocation>
                                <pattern>com.google.flatbuffers</pattern>
-                               <shadedPattern>org.apache.doris.flatbuffers</shadedPattern>
+                               <shadedPattern>org.apache.doris.shaded.com.google.flatbuffers</shadedPattern>
+                           </relocation>
+                           <relocation>
+                               <pattern>org.apache.thrift</pattern>
+                               <shadedPattern>org.apache.doris.shaded.org.apache.thrift</shadedPattern>
                            </relocation>
                        </relocations>
                    </configuration>
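With libthrift no longer `provided` and a thrift relocation added, one way to sanity-check the shaded jar is to list its contents: relocated classes should appear under `org/apache/doris/shaded/`, and the original `org/apache/thrift/` prefix should be gone. A rough sketch, assuming a JDK `jar` tool on the PATH and the jar produced by `build.sh`:

```bash
# Relocated thrift classes should be present under the shaded prefix...
jar tf output/doris-spark-1.0.0-SNAPSHOT.jar | grep '^org/apache/doris/shaded/org/apache/thrift/' | head

# ...while the original org.apache.thrift package should no longer appear
# (this grep is expected to print nothing).
jar tf output/doris-spark-1.0.0-SNAPSHOT.jar | grep '^org/apache/thrift/'
```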