Hisoka-X opened a new issue, #9713: URL: https://github.com/apache/seatunnel/issues/9713
### Search before asking

- [x] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

As a multimodal data integration tool, SeaTunnel should support parsing complex file types, converting their contents into structured file streams, and ultimately writing them into a vector database through embedding. This issue tracks the related tasks.

1. Support parsing Markdown into structured data (Parser + Normalization).
2. Support parsing Word documents into structured data (Parser + Normalization).
3. Support parsing PDF into structured data (Parser + Normalization).
4. Support a text splitter transform (Chunking).
5. Write a document introducing the entire RAG-ready data processing workflow.

For chunking, please refer to https://docs.dify.ai/en/guides/knowledge-base/create-knowledge-and-upload-documents/chunking-and-cleaning-text and https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/

### Usage Scenario

_No response_

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
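
### Illustrative chunking sketch

To make the text splitter transform (task 4) more concrete, here is a minimal sketch of fixed-size chunking with overlap, one of the simpler strategies covered in the chunking references above. It is written in Python purely for illustration. The function name `split_text` and its parameters `chunk_size` and `overlap` are assumptions, not part of any existing SeaTunnel API, and an actual transform would likely be written in Java and might split on separators or semantics instead.

```python
from typing import List


def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that context crossing a
    chunk boundary is not lost. Hypothetical helper, not a SeaTunnel API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Each new chunk starts `step` characters after the previous one.
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text, so the
        # tail is not emitted again as a redundant shorter chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk produced this way could then be passed to an embedding step and written to a vector store. The overlap trades some storage for better retrieval of content that straddles a chunk boundary.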
