wyb opened a new issue #3433:
URL: https://github.com/apache/incubator-doris/issues/3433
Many users loading data into Doris for the first time have a large amount of data (10 GB or more), and it is hard to load that much data into Doris in one pass using Broker load or Stream load. To solve this problem, we propose a new way to load data using a Spark cluster.
In Spark load, the Spark cluster preprocesses the data (global dictionary build for bitmap columns, partitioning, sorting, aggregation), which improves Doris load performance for large data volumes and saves Doris computing resources.
Spark load is mainly intended for the initial migration from other systems and for loading large amounts of data into Doris.
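As a rough illustration of the preprocessing the Spark ETL job would perform, here is a minimal sketch in plain Python (standing in for Spark; the row layout and column names are made up for the example, not actual Doris or Spark code):

```python
# Hypothetical sketch of the four preprocessing steps a Spark load ETL job
# performs: global dictionary build, partitioning, sorting, pre-aggregation.
from collections import defaultdict

rows = [
    # (partition_key, sort_key, user_id, metric) -- illustrative schema
    ("p1", 3, "u2", 10),
    ("p0", 1, "u1", 5),
    ("p1", 3, "u2", 7),
    ("p0", 2, "u3", 1),
]

# 1. Global dictionary: map every distinct string value (e.g. user ids that
#    will back a bitmap column) to a dense integer id across the whole dataset.
global_dict = {v: i for i, v in enumerate(sorted({r[2] for r in rows}))}

# 2. Partition rows by the table's partition key.
partitions = defaultdict(list)
for pk, sk, uid, m in rows:
    partitions[pk].append((sk, global_dict[uid], m))

# 3. Pre-aggregate rows sharing the same key columns, then
# 4. sort within each partition, so BEs receive ready-to-load data.
etl_output = {}
for pk, part in partitions.items():
    agg = defaultdict(int)
    for sk, uid, m in part:
        agg[(sk, uid)] += m
    etl_output[pk] = sorted((sk, uid, m) for (sk, uid), m in agg.items())
```

In a real Spark job these steps would be distributed `repartition`/`sortWithinPartitions`/aggregation stages; the point here is only the shape of the transformation.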
```
                 +
                 | 0. User creates spark load job
            +----v----+
            |   FE    |---------------------------------+
            +----+----+                                 |
                 | 3. FE sends push tasks               |
                 | 5. FE publishes version              |
    +------------+------------+                         |
    |            |            |                         |
+---v---+    +---v---+    +---v---+                     | 1. FE submits Spark ETL job
|  BE   |    |  BE   |    |  BE   |                     |
+---^---+    +---^---+    +---^---+                     |
    | 4. BE pushes with broker                          |
+---+---+    +---+---+    +---+---+                     |
|Broker |    |Broker |    |Broker |                     |
+---^---+    +---^---+    +---^---+                     |
    |            |            |                         |
+---+------------+------------+---+  2.ETL +-------------v---------------+
|                                 +------->                              |
|              HDFS               |        |        Spark cluster        |
|                                 <-------+                              |
+---------------------------------+       +-----------------------------+
```
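The numbered flow in the diagram can be sketched as a simple driver loop. The state names and callback parameters below are illustrative assumptions for this proposal, not actual Doris FE code:

```python
# Hypothetical sketch of the spark load job lifecycle implied by the diagram.
# State names (PENDING/ETL/LOADING/FINISHED) and the callbacks are assumptions.
def run_spark_load(submit_etl, etl_done, send_push_tasks, publish_version):
    states = ["PENDING"]          # 0. user has created the spark load job
    submit_etl()                  # 1. FE submits the Spark ETL job
    states.append("ETL")
    while not etl_done():         # 2. Spark preprocesses the data on HDFS
        pass
    send_push_tasks()             # 3./4. BEs pull ETL output through Broker
    states.append("LOADING")
    publish_version()             # 5. FE makes the new version visible
    states.append("FINISHED")
    return states
```

A real FE would drive these transitions asynchronously with failure handling and retries; the loop above only shows the happy path in order.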
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]