TangSiyang2001 opened a new issue, #22383:
URL: https://github.com/apache/doris/issues/22383

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
   Currently, reading CSV text files works roughly as follows:
   
   1. CsvReader delegates line reading to NewPlainTextReader.
   2. NewPlainTextReader tries to read a whole line from the decompressed buffer. If the buffer does not yet contain a whole line, it extends the buffer, reads more data, and then searches for the line delimiter again from the beginning.
   3. Once a whole line is read, it is returned to CsvReader, which parses the column fields from that line.
   4. This read-a-line, parse-a-line cycle repeats until the end of the file.
   
   However, this approach has some shortcomings:
   1. Low extensibility. When more text attributes are required, such as enclose and escape, they are hard to add on top of the existing code.
   2. Possibly lower performance.
       (i) When the line reader has not yet read a whole line, it searches for the line delimiter again from the beginning of the buffer rather than from where the previous search stopped.
       (ii) Moreover, when a line is returned, the positions of the column separators are still unknown, so CsvReader parses the line again from the beginning, even though this work could be done while the line reader is reading the line.
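
To make point 2(i) concrete, here is a minimal sketch that counts how many bytes are examined when the delimiter search restarts from offset 0 after every buffer extension, versus when it resumes from the last scanned position. The names `find_newline` and `bytes_scanned_for_first_line` are hypothetical and are not taken from the Doris code base; the sketch also assumes a single-byte `'\n'` line delimiter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Scan buf[from, buf.size()) for '\n' and count every byte examined.
// Hypothetical helper, assuming a single-byte line delimiter.
std::size_t find_newline(const std::string& buf, std::size_t from,
                         std::size_t& scanned) {
    for (std::size_t i = from; i < buf.size(); ++i) {
        ++scanned;
        if (buf[i] == '\n') return i;
    }
    return kNotFound;
}

// Append chunks one by one until the buffer holds a full line, and
// return the total number of bytes examined along the way.
// resume == false: every retry restarts at offset 0.
// resume == true:  every retry continues where the last scan stopped.
std::size_t bytes_scanned_for_first_line(const std::vector<std::string>& chunks,
                                         bool resume) {
    std::string buf;
    std::size_t scanned = 0;
    std::size_t next = 0;  // first byte that has not been examined yet
    for (const std::string& chunk : chunks) {
        buf += chunk;
        std::size_t pos = find_newline(buf, resume ? next : 0, scanned);
        if (pos != kNotFound) break;
        next = buf.size();
    }
    return scanned;
}
```

With the chunks `"abcd"`, `"efgh"`, `"ij\nk"`, the restarting variant examines 23 bytes and the resuming variant examines 11. A multi-byte line delimiter would need the resume offset to back up by the delimiter length minus one, which this sketch does not handle.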
   
   ### Solution
   
   Refactor ideas:
   
   1. Use a state machine to parse a line in the line reader, rather than calling member APIs. This gives developers more control over the parsing process.
   2. Use a context to hold the parser state, so that when a whole line is not yet available, the read progress and state are kept and parsing can resume from where it stopped.
   3. Record the positions of the column separators in the context during line reading, so that CsvReader can use them to split the values without parsing the line again.
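
The three ideas above could look roughly like the following sketch. It is an illustration under stated assumptions, not the proposed implementation: `State`, `LineContext`, `feed`, and `split_fields` are hypothetical names, the separator, enclose character, and line delimiter are fixed single bytes, and escape handling is left out.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical fixed settings; a real reader would make these configurable.
constexpr char kSeparator = ',';
constexpr char kEnclose = '"';
constexpr char kLineDelim = '\n';

// Hypothetical parser states.
enum class State { kNormal, kEnclosed, kLineDone };

// Hypothetical context: it outlives a single call, so a partially read
// line is never rescanned after the buffer grows.
struct LineContext {
    State state = State::kNormal;
    std::size_t pos = 0;               // next byte to examine
    std::vector<std::size_t> sep_pos;  // offsets of column separators
    std::size_t line_end = 0;          // offset of the line delimiter
};

// Advance the state machine over buf starting at ctx.pos.
// Returns true once a complete line has been seen.
bool feed(LineContext& ctx, std::string_view buf) {
    while (ctx.state != State::kLineDone && ctx.pos < buf.size()) {
        const char c = buf[ctx.pos];
        switch (ctx.state) {
        case State::kNormal:
            if (c == kEnclose) {
                ctx.state = State::kEnclosed;
            } else if (c == kSeparator) {
                ctx.sep_pos.push_back(ctx.pos);
            } else if (c == kLineDelim) {
                ctx.line_end = ctx.pos;
                ctx.state = State::kLineDone;
            }
            break;
        case State::kEnclosed:
            // Separators and delimiters inside an enclosed field are data.
            if (c == kEnclose) ctx.state = State::kNormal;
            break;
        case State::kLineDone:
            break;
        }
        ++ctx.pos;
    }
    return ctx.state == State::kLineDone;
}

// Split the finished line using the recorded offsets, without re-parsing.
// Enclose characters are kept in the returned fields.
std::vector<std::string_view> split_fields(const LineContext& ctx,
                                           std::string_view buf) {
    std::vector<std::string_view> fields;
    std::size_t start = 0;
    for (std::size_t p : ctx.sep_pos) {
        fields.push_back(buf.substr(start, p - start));
        start = p + 1;
    }
    fields.push_back(buf.substr(start, ctx.line_end - start));
    return fields;
}
```

A caller would invoke `feed` each time the buffer is extended and call `split_fields` only after `feed` returns true. Adding a new text attribute such as escape would then mean adding a state and its transitions, rather than reworking the search logic.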
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

