TangSiyang2001 opened a new issue, #22383: URL: https://github.com/apache/doris/issues/22383
### Search before asking

- [X] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Description

Currently, reading CSV text files works roughly as follows:

1. CsvReader delegates line reading to NewPlainTextReader.
2. NewPlainTextReader tries to read a whole line from the decompressed buffer. If the buffer does not contain a whole line, it extends the buffer, reads more data, and then tries to parse a line from the beginning again.
3. Once a whole line has been read, it is returned to CsvReader, which then parses the column fields from that line.
4. This repeats until the end of the file: read a line, parse a line, again and again.

This approach has some shortcomings:

1. Low extensibility. When more text attributes are required, such as enclose and escape, it is hard to add them on top of the existing code.
2. Possibly lower performance.
   (i) When the line reader has not yet read a whole line, it searches for the line delimiter again from the beginning of the buffer, rather than resuming from where the previous attempt stopped.
   (ii) When a line is returned, the positions of the column separators are still unknown, so CsvReader parses the line again from the beginning. This work could instead be done while the line reader is reading the line.

### Solution

Refactor ideas:

1. Use a state machine to parse a line in the line reader, rather than calling member APIs. This makes the parsing process easier for developers to control.
2. Use a context to hold the parser state, so that when a whole line is not yet available, the read progress and state are kept rather than discarded, and parsing can resume from where it stopped.
3. Record the positions of the column separators in the context while reading the line, so that CsvReader can use them to split the values without parsing the line again.

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

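The three refactor ideas under Solution can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the Doris implementation. All names here (`State`, `LineReaderContext`, `find_line_end`, `split_columns`) are hypothetical, and it assumes single-character separator, delimiter, and enclose values, with no escape handling.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    NORMAL = auto()      # scanning outside any enclosed field
    IN_ENCLOSE = auto()  # scanning inside an enclosed (quoted) field


@dataclass
class LineReaderContext:
    """Hypothetical parser context that survives buffer refills."""
    column_sep: str = ","
    line_delim: str = "\n"
    enclose: str = '"'
    state: State = State.NORMAL
    scan_pos: int = 0  # next offset in the buffer that has not been scanned
    sep_positions: list = field(default_factory=list)


def find_line_end(buf: str, ctx: LineReaderContext) -> int:
    """Resume scanning at ctx.scan_pos. Return the index of the line
    delimiter, or -1 if the buffer ends before a whole line is seen.
    Progress and state stay in ctx, so nothing is scanned twice."""
    i = ctx.scan_pos
    while i < len(buf):
        ch = buf[i]
        if ctx.state is State.NORMAL:
            if ch == ctx.enclose:
                ctx.state = State.IN_ENCLOSE
            elif ch == ctx.column_sep:
                ctx.sep_positions.append(i)  # recorded for the CSV reader
            elif ch == ctx.line_delim:
                ctx.scan_pos = i
                return i
        elif ch == ctx.enclose:
            # Closing enclose character: separators count again.
            ctx.state = State.NORMAL
        i += 1
    ctx.scan_pos = i
    return -1


def split_columns(line: str, sep_positions: list) -> list:
    """Split a line using recorded separator offsets, without rescanning."""
    cols, start = [], 0
    for pos in sep_positions:
        cols.append(line[start:pos])
        start = pos + 1
    cols.append(line[start:])
    return cols


# Usage: the first buffer ends in the middle of an enclosed field.
ctx = LineReaderContext()
buf = 'a,"b,'
first_try = find_line_end(buf, ctx)  # no whole line yet
buf += 'c",d\nnext'                  # simulate extending the buffer
end = find_line_end(buf, ctx)        # resumes at offset 5, not 0
cols = split_columns(buf[:end], ctx.sep_positions)
```

In this sketch the separator inside the enclosed field is not recorded, so the line splits into three columns, and the enclose characters are left in the values for the caller to strip. Supporting escape would presumably mean one more state rather than a rewrite, which is the extensibility argument made above.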
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org