wyb opened a new issue #3433:
URL: https://github.com/apache/incubator-doris/issues/3433


   Many users who load data into Doris for the first time have a large amount 
of data, about 10 GB or more, and it is hard to load that much data into Doris 
in one go using Broker load or Stream load. To resolve this problem, we propose 
a new solution that loads data by using a Spark cluster. 
   
   In Spark load, the Spark cluster preprocesses the data (building the bitmap 
global dictionary, partitioning, sorting, and aggregation). This improves Doris 
load performance for large data volumes and saves Doris computing resources. 
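
The preprocessing steps listed above can be illustrated with a minimal, 
single-machine sketch in plain Python. It is not the proposed Spark 
implementation: the function names, the row layout `(key, user, metric)`, the 
sum aggregation, and the modulo partitioning are all assumptions made for 
illustration only.

```python
from collections import defaultdict


def build_global_dict(values):
    """Map each distinct value to a dense integer id, as a bitmap column needs."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}


def preprocess(rows, num_partitions):
    """Sketch of the ETL stages: global dict build, aggregation, partition, sort.

    rows: iterable of (key, user, metric) tuples (hypothetical layout).
    Returns {partition_id: [(key, set_of_user_ids, metric_sum), ...]}.
    """
    rows = list(rows)

    # 1. Global dictionary: encode the bitmap column (user) as integer ids.
    user_ids = build_global_dict(user for _, user, _ in rows)

    # 2. Aggregation: merge rows sharing a key (union of ids, sum of metrics).
    aggregated = {}
    for key, user, metric in rows:
        ids, total = aggregated.get(key, (set(), 0))
        ids.add(user_ids[user])
        aggregated[key] = (ids, total + metric)

    # 3. Partition: modulo on the key stands in for the table's real
    #    partitioning rule, which this sketch does not model.
    partitions = defaultdict(list)
    for key, (ids, total) in aggregated.items():
        partitions[key % num_partitions].append((key, ids, total))

    # 4. Sort each partition by key so the output is already ordered.
    for part in partitions.values():
        part.sort(key=lambda row: row[0])
    return dict(partitions)


rows = [(3, "bob", 10), (1, "alice", 5), (3, "alice", 7), (2, "carol", 1)]
result = preprocess(rows, num_partitions=2)
```

Because the output of each partition is already encoded, aggregated, and 
sorted, the system receiving it only has to ingest the files, which is the 
resource saving the paragraph above describes.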
   
   Spark load is mainly used for the initial migration from other systems or 
loading large amounts of data into Doris.
   
   ```
                    +
                    | 0. User create spark load job
               +----v----+
               |   FE    |---------------------------------+
               +----+----+                                 |
                    | 3. FE send push tasks                |
                    | 5. FE publish version                |
       +------------+------------+                         |
       |            |            |                         |
   +---v---+    +---v---+    +---v---+                     |
   |  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark 
ETL job
   +---^---+    +---^---+    +---^---+                     |
       |4. BE push with broker   |                         |
   +---+---+    +---+---+    +---+---+                     |
   |Broker |    |Broker |    |Broker |                     |
   +---^---+    +---^---+    +---^---+                     |
       |            |            |                         |
   +---+------------+------------+---+ 2.ETL +-------------v---------------+
   |               HDFS              +------->       Spark cluster         |
   |                                 <-------+                             |
   +---------------------------------+       +-----------------------------+
   
   ```
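
The numbered steps in the diagram can be read as an ordered sequence. The 
sketch below only models that ordering; `STEPS`, `run_job`, and the `handlers` 
mapping are hypothetical names, and the wording of step 2 and step 4 is an 
interpretation of the diagram arrows rather than a confirmed design.

```python
# Step numbers and actors follow the diagram labels.
STEPS = [
    (0, "User", "create spark load job"),
    (1, "FE", "submit Spark ETL job"),
    (2, "Spark cluster", "run ETL against HDFS"),
    (3, "FE", "send push tasks"),
    (4, "BE", "push with broker"),
    (5, "FE", "publish version"),
]


def run_job(handlers):
    """Run the steps in order and stop at the first one that fails.

    handlers: dict mapping a step number to a zero-argument callable that
    returns True on success. Steps without a handler are treated as successful.
    Returns (completed_step_numbers, failed_step_number_or_None).
    """
    completed = []
    for number, _actor, _action in STEPS:
        handler = handlers.get(number, lambda: True)
        if not handler():
            return completed, number
        completed.append(number)
    return completed, None


ok = run_job({})
failed = run_job({2: lambda: False})
```

In this sketch a failure in the ETL step means no push task is ever sent, 
which matches the order of the arrows in the diagram.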

