(doris-website) branch master updated: [update] Update ZhongAn Insurance blog (#411)

luzhijing Mon, 04 Mar 2024 20:44:09 -0800

This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git



The following commit(s) were added to refs/heads/master by this push:
     new 1e746940141 [update] Update ZhongAn Insurance blog (#411)
1e746940141 is described below

commit 1e746940141c9d135d8e5f1b2c0bf1e61f2183c4
Author: KassieZ <[email protected]>
AuthorDate: Tue Mar 5 12:43:58 2024 +0800

    [update] Update ZhongAn Insurance blog (#411)
---
 ...al-time-data-warehouse-based-on-apache-doris.md |   2 -
 ...ion-based-on-the-apache-doris-data-warehouse.md |   2 +-
 ...wn-data-silos-with-an-apache-doris-based-cdp.md | 134 +++++++++++++++++++++
 ...erates-text-searches-by-40-time-apache-doris.md |   2 +-
 src/components/recent-blogs/recent-blogs.data.ts   |  16 +--
 src/constant/newsletter.data.ts                    |  21 ++--
 src/constant/users.data.json                       |   2 +-
 static/images/apache-doris-OneID.png               | Bin 0 -> 189961 bytes
 static/images/apache-doris-based-CDP.png           | Bin 0 -> 281815 bytes
 static/images/apache-doris-bitmap.png              | Bin 0 -> 165078 bytes
 static/images/apache-doris-customer-grouping.png   | Bin 0 -> 124604 bytes
 static/images/apache-doris-data-silos-in-CDP.png   | Bin 0 -> 292329 bytes
 ...n-data-silos-with-an-apache-doris-based-cdp.png | Bin 0 -> 679427 bytes
 13 files changed, 159 insertions(+), 20 deletions(-)

diff --git a/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris.md b/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris.md
index ccbacf6355a..525eb7808cf 100644
--- a/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris.md
+++ b/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris.md
@@ -5,8 +5,6 @@
     'date': '2024-01-08',
     'author': 'Apache Doris',
     'tags': ['Best Practice'],
-    'picked': "true",
-    'order': "4",
     "image": '/images/apache-doris-a-fast-secure-and-highly-available-real-time-data-warehouse.png'
 }
 
diff --git a/blog/a-financial-anti-fraud-solution-based-on-the-apache-doris-data-warehouse.md b/blog/a-financial-anti-fraud-solution-based-on-the-apache-doris-data-warehouse.md
index ae6f7c6693a..ca045926076 100644
--- a/blog/a-financial-anti-fraud-solution-based-on-the-apache-doris-data-warehouse.md
+++ b/blog/a-financial-anti-fraud-solution-based-on-the-apache-doris-data-warehouse.md
@@ -6,7 +6,7 @@
     'author': 'Apache Doris',
     'tags': ['Best Practice'],
     'picked': "true",
-    'order': "2",
+    'order': "3",
     "image": '/images/a-financial-anti-fraud-solution-based-on-the-apache-doris-data-warehouse.png'
 }
 
diff --git a/blog/breaking-down-data-silos-with-an-apache-doris-based-cdp.md b/blog/breaking-down-data-silos-with-an-apache-doris-based-cdp.md
new file mode 100644
index 00000000000..7eb927403b8
--- /dev/null
+++ b/blog/breaking-down-data-silos-with-an-apache-doris-based-cdp.md
@@ -0,0 +1,134 @@
+---
+{
+    'title': "Breaking down data silos with a unified data warehouse: an Apache Doris-based CDP",
+    'summary': "The insurance company uses Apache Doris, a unified data warehouse, in replacement of Spark + Impala + HBase + NebulaGraph, in their Customer Data Platform for 4 times faster customer grouping.",
+    'date': '2024-03-05',
+    'author': 'Apache Doris',
+    'tags': ['Best Practice'],
+    'picked': "true",
+    'order': "1",
+    "image": '/images/breaking-down-data-silos-with-an-apache-doris-based-cdp.png'
+}
+
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+The data silos problem is like arthritis for online business, because almost 
everyone gets it as they grow old. Businesses interact with customers via 
websites, mobile apps, H5 pages, and end devices. For one reason or another, it 
is tricky to integrate the data from all these sources. Data stays where it is 
and cannot be interrelated for further analysis. That's how data silos come to 
form. The bigger your business grows, the more diversified customer data 
sources you will have, and the [...]
+
+This is exactly what happens to the insurance company I'm going to talk about 
in this post. By 2023, they had already served over 500 million customers and 
signed 57 billion insurance contracts. When they started to build a customer 
data platform (CDP) to accommodate such a data size, they used multiple 
components. 
+
+## Data silos in CDP
+
+Like most data platforms, their CDP 1.0 had a batch processing pipeline and a 
real-time streaming pipeline. Offline data was loaded, via Spark jobs, to 
Impala, where it was tagged and divided into groups. Meanwhile, Spark also sent 
it to NebulaGraph for OneID computation (elaborated later in this post). On the 
other hand, real-time data was tagged by Flink and then stored in HBase, ready 
to be queried.
+
+That led to a component-heavy computation layer in the CDP: Impala, Spark, 
NebulaGraph, and HBase.
+
+![apache doris data silos in CDP](../static/images/apache-doris-data-silos-in-CDP.png)
+
+As a result, offline tags, real-time tags, and graph data were scattered 
across multiple components. Integrating them for further data services was 
costly due to redundant storage and bulky data transfer. What's more, due to 
discrepancies in storage, they had to expand the size of the CDH cluster and 
NebulaGraph cluster, adding to the resource and maintenance costs.
+
+## Apache Doris-based CDP
+
+For CDP 2.0, they decide to introduce a unified solution to clean up the mess. 
At the computation layer of CDP 2.0, [Apache Doris](https://doris.apache.org) 
undertakes both real-time and offline data storage and computation. 
+
+To ingest **offline data**, they utilize the [Stream 
Load](https://doris.apache.org/docs/data-operate/import/import-way/stream-load-manual)
 method. Their 30-thread ingestion test shows that it can perform over 300,000 
upserts per second. To load **real-time data**, they use a combination of 
[Flink-Doris-Connector](https://doris.apache.org/docs/ecosystem/flink-doris-connector)
 and Stream Load. In addition, in real-time reporting where they need to 
extract data from multiple external data  [...]
+
+![apache doris based-CDP](../static/images/apache-doris-based-CDP.png)
+
+The customer analytic workflows on this CDP go like this. First, they sort out 
customer information, then they attach tags to each customer. Based on the 
tags, they divide customers into groups for more targeted analysis and 
operation. 
+
+Next, I'll delve into these workloads and show you how Apache Doris 
accelerates them. 
+
+## OneID
+
+Has this ever happened to you when you have different user registration 
systems for your products and services? You might collect the email of UserID A 
from one product webpage, and later the social security number of UserID B from 
another. Then you find out that UserID A and UserID B actually belong to the 
same person because they go by the same phone number.
+
+That's why OneID arises as an idea. It is to pool the user registration 
information of all business lines into one large table in Apache Doris, sort it 
out, and make sure that one user has a unique OneID. 
+
+This is how they figure out which registration information belongs to the same 
user leveraging the functions in Apache Doris.
+
+![apache doris OneID](../static/images/apache-doris-OneID.png)
+
+## Tagging services
+
+This CDP accommodates information of **500 million customers**, which come 
from over **500 source tables** and are attached to over **2000 tags** in total.
+
+By timeliness, the tags can be divided into real-time tags and offline tags. 
The real-time tags are computed by Apache Flink and written into the flat table 
in Apache Doris, while the offline tags are computed by Apache Doris as they 
are derived from the user attribute table, business table, and user behavior 
table in Doris. Here is the company's best practice in data tagging:  
+
+**1. Offline tags:**
+
+During the peaks of data writing, a full update might easily cause an OOM 
error given their huge data scale. To avoid that, they utilize the [INSERT INTO 
SELECT](https://doris.apache.org/docs/data-operate/import/import-way/insert-into-manual)
 function of Apache Doris and enable **partial column update**. This will cut 
down memory consumption by a lot and maintain system stability during data 
loading.
+
+```SQL
+set enable_unique_key_partial_update=true;
+insert into tb_label_result(one_id, labelxx) 
+select one_id, label_value as labelxx
+from .....
+```
+
+**2. Real-time tags:**
+
+Partial column update is also available for real-time tags, since even 
real-time tags are updated at different paces. All that is needed is to set 
`partial_columns` to `true`.
+
+```shell
+curl --location-trusted -u root: -H "partial_columns:true" -H "column_separator:," -H "columns:id,balance,last_access_time" -T /tmp/test.csv http://127.0.0.1:48037/api/db1/user_profile/_stream_load
+```
+
+**3. High-concurrency point queries:**
+
+With its current business size, the company is receiving query requests for 
tags at a concurrency level of over 5000 QPS. They use a combination of 
strategies to guarantee high performance. Firstly, they adopt [Prepared 
Statement](https://doris.apache.org/docs/query-acceleration/hight-concurrent-point-query#using-preparedstatement)
 for pre-compilation and pre-execution of SQL. Secondly, they fine-tune the 
parameters for Doris Backend and the tables to optimize storage and execution. 
Last [...]
+
+- Fine-tune Doris Backend parameters in `be.conf`:
+
+```SQL
+disable_storage_row_cache = false                      
+storage_page_cache_limit=40%
+```
+
+- Fine-tune table parameters upon table creation:
+
+```SQL
+enable_unique_key_merge_on_write = true
+store_row_column = true
+light_schema_change = true
+```
+
+**4. Tag computation (join):**
+
+In practice, many tagging services are implemented by multi-table joins in the 
database. That often involves more than 10 tables. For optimal computation 
performance, they adopt the [colocation 
group](https://doris.apache.org/docs/query-acceleration/join-optimization/colocation-join)
 strategy in Doris.  
+
+
+## Customer Grouping
+
+The customer grouping pipeline in CDP 2.0 goes like this: Apache Doris 
receives SQL from customer service, executes the computation, and sends the 
result set to S3 object storage via SELECT INTO OUTFILE. The company has 
divided their customers into 1 million groups. The customer grouping task that 
used to take **50 seconds in Impala** to finish now only needs **10 seconds in 
Doris**. 
+
+![apache doris customer grouping](../static/images/apache-doris-customer-grouping.png)
+
+Apart from grouping the customers for more fine-grained analysis, sometimes 
they do analysis in a reverse direction. That is, to target a certain customer 
and find out to which groups he/she belongs. This helps analysts understand the 
characteristics of customers as well as how different customer groups overlap.
+
+In Apache Doris, this is implemented by the BITMAP functions: 
`BITMAP_CONTAINS` is a fast way to check if a customer is part of a certain 
group, and `BITMAP_OR`, `BITMAP_INTERSECT`, and `BITMAP_XOR` are the choices 
for cross analysis.  
+
+![apache doris bitmap](../static/images/apache-doris-bitmap.png)
+
+
+## Conclusion
+
+From CDP 1.0 to CDP 2.0, the insurance company adopts Apache Doris, a unified 
data warehouse, to replace Spark+Impala+HBase+NebulaGraph. That increases their 
data processing efficiency by breaking down the data silos and streamlining 
data processing pipelines. In CDP 3.0 to come, they want to group their 
customers by combining real-time tags and offline tags for more diversified and 
flexible analysis. The [Apache Doris 
community](https://join.slack.com/t/apachedoriscommunity/shared_invite [...]
\ No newline at end of file
diff --git a/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris.md b/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris.md
index f95d3fbeabc..0de70f3d2b3 100644
--- a/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris.md
+++ b/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris.md
@@ -6,7 +6,7 @@
     'author': 'Apache Doris',
     'tags': ['Tech Sharing'],
     'picked': "true",
-    'order': "3",
+    'order': "4",
     "image": '/images/how-inverted-index-accelerates-text-searches-by-40-times.png'
 }
 ---
diff --git a/src/components/recent-blogs/recent-blogs.data.ts b/src/components/recent-blogs/recent-blogs.data.ts
index 216f2e392bd..70dac8c4d78 100644
--- a/src/components/recent-blogs/recent-blogs.data.ts
+++ b/src/components/recent-blogs/recent-blogs.data.ts
@@ -1,19 +1,19 @@
 export const RECENT_BLOGS_POSTS = [
     {
-        label: `Financial data warehousing: fast, secure, and highly available with Apache Doris`,
-        link: 'https://doris.apache.org/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris',
+        label: `Apache Doris 2.0.5 just released`,
+        link: 'https://doris.apache.org/blog/release-note-2.0.5',
     },
     {
-        label: 'Apache Doris speeds up data reporting, tagging, and data lake analytics',
-        link: 'https://doris.apache.org/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics',
+        label: 'A financial anti-fraud solution based on the Apache Doris data warehouse',
+        link: 'https://doris.apache.org/blog/a-financial-anti-fraud-solution-based-on-the-apache-doris-data-warehouse',
     },
     {
-        label: 'From Elasticsearch to Apache Doris: upgrading an observability platform',
-        link: 'https://doris.apache.org/blog/from-elasticsearch-to-apache-doris-upgrading-an-observability-platform',
+        label: 'A deep dive into inverted index: how it speeds up text searches by 40 times',
+        link: 'https://doris.apache.org/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris',
     },
     {
-        label: `Empowering cyber security by enabling 7 times faster log analysis`,
-        link: 'https://doris.apache.org/blog/empowering-cyber-security-by-enabling-seven-times-faster-log-analysis',
+        label: `Financial data warehousing: fast, secure, and highly available with Apache Doris`,
+        link: 'https://doris.apache.org/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris',
     },
     
 ];
diff --git a/src/constant/newsletter.data.ts b/src/constant/newsletter.data.ts
index 11f7645777b..ae38a1c1fed 100644
--- a/src/constant/newsletter.data.ts
+++ b/src/constant/newsletter.data.ts
@@ -1,4 +1,11 @@
 export const NEWSLETTER_DATA = [
+    {
+        tags: ['Best Practice'],
+        title: "Breaking down data silos with a unified data warehouse: an Apache Doris-based CDP",
+        content: `The insurance company uses Apache Doris, a unified data warehouse, in replacement of Spark, Impala, HBase and NebulaGraph, in their Customer Data Platform for 4 times faster customer grouping.`,
+        to: '/blog/breaking-down-data-silos-with-an-apache-doris-based-cdp',
+        image: 'breaking-down-data-silos-with-an-apache-doris-based-cdp.png',
+    },
     {
         tags: ['Release Notes'],
         title: "Apache Doris 2.0.5 just released",
@@ -20,12 +27,12 @@ export const NEWSLETTER_DATA = [
         to: '/blog/inverted-index-accelerates-text-searches-by-40-time-apache-doris',
         image: 'how-inverted-index-accelerates-text-searches-by-40-times.png',
     },
-    {
-        tags: ['Best Practice'],
-        title: "The financial sector's choice: fast, secure, and highly available real-time data warehousing based on Apache Doris",
-        content: `A whole-journey guide for financial users looking for fast data processing performance, data security, and high service availability.`,
-        to: '/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris',
-        image: 'apache-doris-a-fast-secure-and-highly-available-real-time-data-warehouse.png',
-    },
+    // {
+    //     tags: ['Best Practice'],
+    //     title: "The financial sector's choice: fast, secure, and highly available real-time data warehousing based on Apache Doris",
+    //     content: `A whole-journey guide for financial users looking for fast data processing performance, data security, and high service availability.`,
+    //     to: '/blog/a-fast-secure-high-available-real-time-data-warehouse-based-on-apache-doris',
+    //     image: 'apache-doris-a-fast-secure-and-highly-available-real-time-data-warehouse.png',
+    // },
  
 ];
diff --git a/src/constant/users.data.json b/src/constant/users.data.json
index 5d27d45f84a..61e22163be9 100644
--- a/src/constant/users.data.json
+++ b/src/constant/users.data.json
@@ -1022,7 +1022,7 @@
         "category": "Finance",
         "logo": "ZhongAn Insurance",
         "order": 113,
-        "to": null,
+        "to": "https://doris.apache.org/blog/breaking-down-data-silos-with-an-apache-doris-based-cdp",
         "image": "/images/user-logo/Finance/ZhongAn Insurance.jpg"
     },
     {
diff --git a/static/images/apache-doris-OneID.png b/static/images/apache-doris-OneID.png
new file mode 100644
index 00000000000..00f29e94c52
Binary files /dev/null and b/static/images/apache-doris-OneID.png differ
diff --git a/static/images/apache-doris-based-CDP.png b/static/images/apache-doris-based-CDP.png
new file mode 100644
index 00000000000..089444041b7
Binary files /dev/null and b/static/images/apache-doris-based-CDP.png differ
diff --git a/static/images/apache-doris-bitmap.png b/static/images/apache-doris-bitmap.png
new file mode 100644
index 00000000000..386e1f3914b
Binary files /dev/null and b/static/images/apache-doris-bitmap.png differ
diff --git a/static/images/apache-doris-customer-grouping.png b/static/images/apache-doris-customer-grouping.png
new file mode 100644
index 00000000000..4ce2c5f3c0a
Binary files /dev/null and b/static/images/apache-doris-customer-grouping.png differ
diff --git a/static/images/apache-doris-data-silos-in-CDP.png b/static/images/apache-doris-data-silos-in-CDP.png
new file mode 100644
index 00000000000..717719ae042
Binary files /dev/null and b/static/images/apache-doris-data-silos-in-CDP.png differ
diff --git a/static/images/breaking-down-data-silos-with-an-apache-doris-based-cdp.png b/static/images/breaking-down-data-silos-with-an-apache-doris-based-cdp.png
new file mode 100644
index 00000000000..dfac2b6fe7d
Binary files /dev/null and b/static/images/breaking-down-data-silos-with-an-apache-doris-based-cdp.png differ


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
