Hisoka-X opened a new issue, #9713:
URL: https://github.com/apache/seatunnel/issues/9713

   ### Search before asking
   
   - [x] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   As a multimodal data integration tool, SeaTunnel should support parsing 
complex file types, converting their contents into structured data streams, 
and ultimately writing them into a vector database through embedding. 
This issue tracks the related tasks.
   
   1. Support parsing Markdown into structured data (Parser + Normalization).
   2. Support parsing Word into structured data (Parser + Normalization).
   3. Support parsing PDF into structured data (Parser + Normalization).
   4. Support a text splitter transform (Chunking).
   5. Write a document introducing the entire RAG-ready data processing 
pipeline.
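
To make "Parser + Normalization" concrete, here is a minimal Python sketch of 
what parsing Markdown into structured records could look like. It is an 
illustration only, not SeaTunnel's design: `Section` and `parse_markdown` are 
hypothetical names, and only ATX headings (`# Title`) and backtick code fences 
are recognized.

```python
import re
from dataclasses import dataclass, field

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


@dataclass
class Section:
    """Hypothetical structured record: one heading plus the text under it."""
    level: int  # 0 means text that appears before the first heading
    title: str
    lines: list = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def parse_markdown(source: str) -> list:
    """Split Markdown into sections, one per heading."""
    sections = [Section(level=0, title="")]
    in_fence = False
    for line in source.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines inside a code fence are never treated as headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            sections.append(Section(len(match.group(1)), match.group(2)))
        else:
            sections[-1].lines.append(line)
    # Normalization step: drop the preamble section if it is empty.
    return [s for s in sections if s.title or s.text]
```

Each `Section` could then be mapped to one row (level, title, text) before 
chunking and embedding.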
   
   For chunking, please refer to 
https://docs.dify.ai/en/guides/knowledge-base/create-knowledge-and-upload-documents/chunking-and-cleaning-text
   and 
   https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/
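
As a baseline for the text splitter transform, the simplest strategy is a 
fixed-size window with overlap. The Python sketch below assumes 
character-based sizes; `split_text`, `chunk_size`, and `overlap` are 
hypothetical names, not an existing SeaTunnel API, and the linked pages 
describe richer strategies such as semantic chunking.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list:
    """Split text into fixed-size character chunks that overlap.

    Consecutive chunks share `overlap` characters so that content cut at a
    chunk boundary still appears intact in one of the two neighbors.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

For example, `split_text("abcdefghij", chunk_size=4, overlap=1)` yields three 
chunks, each sharing one character with the previous one.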
   
   
   ### Usage Scenario
   
   _No response_
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
