I think the key point here is how to store and access the string column.
I would prefer to do a POC that stores the string column in a separate file
and uses the rowid to access it.
If Doris stores the string column in the same file as the other columns, it
may break some existing assumptions about the segment file.
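
A minimal sketch of the separate-file idea, only to illustrate the access pattern. The class name `SideStringStore`, the in-memory offset list, and the use of a plain file-like object are all assumptions made for this example; none of them are existing Doris code or its segment format.

```python
import io


class SideStringStore:
    """Hypothetical side file: string values are appended and located by rowid."""

    def __init__(self, data_file):
        self._data = data_file   # binary file-like object holding the raw bytes
        self._offsets = [0]      # _offsets[i] is the byte offset where row i starts

    def append(self, value: bytes) -> int:
        """Append one value and return the rowid assigned to it."""
        self._data.seek(self._offsets[-1])
        self._data.write(value)
        self._offsets.append(self._offsets[-1] + len(value))
        return len(self._offsets) - 2

    def read(self, rowid: int) -> bytes:
        """Fetch the value for one rowid without touching any other row."""
        if not 0 <= rowid < len(self._offsets) - 1:
            raise IndexError("unknown rowid")
        start, end = self._offsets[rowid], self._offsets[rowid + 1]
        self._data.seek(start)
        return self._data.read(end - start)


store = SideStringStore(io.BytesIO())
small_id = store.append(b"short value")
large_id = store.append(b"x" * 100_000)   # longer than the 65533 VARCHAR limit
```

With this layout the main file only needs the rowid, so a large value is read only when a query actually asks for that row.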


------------------ Original Message ------------------
From: "dev" <zh...@apache.org>;
Sent: July 20, 2021 (Tuesday) 10:42
To: "dev" <dev@doris.apache.org>;

Subject: Re: [Proposal] Support large variable-length string type



We have done some POC work on this solution internally.

In the current code, this makes memory usage larger and more likely to
trigger OOM, and the size of a batch can exceed the maximum value of
int32 during RPC.

So to use this scheme, several places need to be designed carefully.
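
To make the int32 concern concrete, here is a rough calculation. The batch of 1024 rows and the 4 MiB average string size are assumed values chosen for illustration, and the helper names are hypothetical; the only fixed fact is the signed 32-bit limit of 2^31 - 1.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit size field can hold


def batch_payload_bytes(rows: int, avg_string_bytes: int) -> int:
    """String bytes in one batch, ignoring other columns and any framing overhead."""
    return rows * avg_string_bytes


def fits_in_int32(size: int) -> bool:
    """True if a size can be carried in a signed 32-bit field."""
    return size <= INT32_MAX


def max_rows_per_batch(avg_string_bytes: int) -> int:
    """Largest row count whose string payload still fits in a signed int32 size."""
    return INT32_MAX // avg_string_bytes


varchar_batch = batch_payload_bytes(1024, 65533)            # VARCHAR at its maximum
string_batch = batch_payload_bytes(1024, 4 * 1024 * 1024)   # assumed 4 MiB strings
```

Under these assumptions the VARCHAR batch stays far below the limit, while the 4 MiB batch does not, so either the row count per batch or the width of the size field would have to change.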

Thanks,
Zhao Chun


<yangz...@gmail.com> wrote on Tuesday, July 20, 2021 at 10:31:

> Hi All
> I want to submit a proposal to support larger string types.
> Background
>
> There are currently two string types: CHAR and VARCHAR. CHAR stores
> fixed-length strings and VARCHAR stores variable-length strings. The
> maximum length of VARCHAR is 65533. This length meets the needs of most
> scenarios, but it is not enough for scenarios that store larger strings
> in Doris, so we need to add a new data type, String. String corresponds
> to BLOB or TEXT storage in MySQL. The maximum length is 4GB, but we still
> do not recommend storing strings larger than 64K in Doris.
> Other system implementation
>
>    - MySQL: MySQL uses BLOB or TEXT as the storage type for very long
>      strings. MySQL can perform string operations on these types, but
>      performance is not guaranteed. In actual storage, the data is stored
>      in overflow pages, and depending on the version and storage engine,
>      the first n characters are kept in the data page for indexing.
>    - Parquet/ORC: In both formats, large strings are stored directly in
>      the data area, with no special processing other than dictionary
>      encoding.
>
> Design
>
>    - Add the String type, which represents a string of any length. To be
>      compatible with MySQL, the maximum length is set to 4G-4, and 4 bytes
>      are used to store the length of the string.
>    - The data storage is similar to the VARCHAR type, except that the
>      leading length identifier is changed to 4 bytes.
>    - Indexes are not currently supported; zonemap indexes will be enabled
>      after the zonemap length limit is ready.
>
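
The 4-byte length identifier from the Design list above can be sketched as follows. This is only an illustration of the encoding idea: the little-endian byte order and the function names are assumptions, and the proposal does not specify the actual on-disk layout.

```python
import struct

MAX_STRING_LEN = 4 * 1024**3 - 4   # "4G-4" from the proposal, i.e. 4294967292 bytes
LEN_PREFIX = struct.Struct("<I")   # 4-byte unsigned length; byte order is an assumption


def encode_string(value: bytes) -> bytes:
    """Prefix the value with its length stored in 4 bytes."""
    if len(value) > MAX_STRING_LEN:
        raise ValueError("value exceeds the proposed maximum length")
    return LEN_PREFIX.pack(len(value)) + value


def decode_string(buf: bytes, offset: int = 0):
    """Return (value, next_offset) for the value that starts at offset."""
    (length,) = LEN_PREFIX.unpack_from(buf, offset)
    start = offset + LEN_PREFIX.size
    end = start + length
    if end > len(buf):
        raise ValueError("truncated value")
    return buf[start:end], end


# Two values stored back to back; the second is longer than the VARCHAR limit.
page = encode_string(b"doris") + encode_string(b"y" * 70_000)
first, next_offset = decode_string(page)
second, end_offset = decode_string(page, next_offset)
```

Because each value carries its own length, a reader can walk a page value by value without a separate offset table, at a cost of 4 bytes per row.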
