stczwd commented on a change in pull request #1332: Doris sink for Spark Structured Streaming
URL: https://github.com/apache/incubator-doris/pull/1332#discussion_r296198850
 
 

 ##########
 File path: docs/documentation/cn/extending-doris/Spark-sink-for-Doris.md
 ##########
 @@ -0,0 +1,129 @@
+# Spark Sink for Doris
+
+## Introduction
+Doris Sink is a type of Structured Streaming sink with exactly-once fault-tolerance guarantees. It currently supports two modes: Broker LOAD and Mini LOAD.
+
+- **Broker LOAD**: the data is written to files on HDFS, and Doris uses an HDFS broker to load those files. Suitable for scenarios where each batch of data is larger than 1GB.
+- **Mini LOAD**: the data is sent to Doris via an HTTP PUT request. Suitable for scenarios where each batch of data is less than 1GB (**recommended mode**; see the configuration sketch below).
+
+We plan to support a third mode, **Streaming LOAD**, in the near future, which offers higher performance than the two modes described above.
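+
+As a rough illustration of wiring a query to the sink, here is a minimal sketch. It is an assumption-laden example, not the module's confirmed API: the format name `"palo"` and the `"loadMode"` option key are invented for illustration (check `PaloConfig` in `extension/spark-sink` for the real constants), while `checkpointLocation` is standard Structured Streaming and is what backs the exactly-once guarantee.
+
+```scala
+// Hypothetical sketch: "palo" and "loadMode" are assumed names, not the
+// confirmed API of spark-sql-palo; checkpointLocation is standard Spark.
+val query = df.writeStream                   // df: any streaming DataFrame
+  .format("palo")                            // assumed sink provider name
+  .option("loadMode", "MINI_LOAD")           // assumed key; "BROKER_LOAD" for >1GB batches
+  .option("checkpointLocation", "/tmp/ckpt") // enables exactly-once recovery
+  .start()
+query.awaitTermination()
+```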
+
+Note: Doris is also known as ***Palo*** inside Baidu.
+
+## How to build Doris jar
+
+In `incubator-doris/extension/spark-sink`, run `mvn clean install -DskipTests`. This produces a jar archive named like `spark-sql-palo_2.11-2.4.2.jar` and installs it into your local Maven repository; then copy the jar into your spark-client/jars directory.
+
+## How to write your first app with Doris Sink
+### Dependency
+Add the following dependency to the pom.xml of your project.
+
+```xml
+<dependency>
+  <groupId>org.apache.spark</groupId>
+  <artifactId>spark-sql-palo_2.11</artifactId>
+  <version>2.4.2</version>
+</dependency>
+```
+
+
+### Demo: Mini LOAD (bulk load)
+
+    import org.apache.spark.sql.SparkSession
+    import org.apache.spark.sql.palo.PaloConfig
+
+    object SSSinkToPaloNewWay {
+      val className = this.getClass.getName.stripSuffix("$")
+    
+      def main(args: Array[String]): Unit = {
+        if (args.length < 7) {
+          System.err.println(s"Usage: $className <computeUrl> <mysqlUrl> " +
+            s"<user> <password> <database> <table> <checkpoint>")
+          sys.exit(1)
+        }
+        val Array(computeUrl, mysqlUrl, user, password, database, table, checkpoint) = args
+        val spark = SparkSession
+          .builder
+          .getOrCreate()
+        val lines = spark.readStream
+          .format("socket")
+          .option("host", "localhost")
+          .option("port", 9999)
+          .load()
+        import spark.implicits._
+        var id = 0
+        val messages = lines.map {word =>
 
 Review comment:
   `map` is defined on `Dataset`, so convert with `lines.as[String]` before calling `map`.
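
   A minimal sketch of the suggested fix (the lambda body is elided here, matching the truncated diff above):

   ```scala
   // lines is a DataFrame; map with a typed lambda needs a Dataset,
   // so convert with .as[String] first.
   val messages = lines.as[String].map { word =>
     // ... build the record for the Doris table, as in the demo
   }
   ```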
