Re: [I] [Feature][Transform] add PDF Parser Operator and Multimodal Data Labeling Operator [seatunnel]

via GitHub Tue, 17 Mar 2026 01:58:48 -0700


DanielCarter-stack commented on issue #10608:
URL: https://github.com/apache/seatunnel/issues/10608#issuecomment-4073353198


   Thank you for the feature request. To better understand your requirements, I 
have a few clarifying questions:
   
   1. **PDF Parser**: Which behavior do you expect?
      - A) As a Source to directly read PDF files (similar to the existing 
Excel/JSON Source)
      - B) As a Transform to parse PDF binary content from a field
      - C) As a File Format extension (adding `PDF` to the `FileFormat` enum in 
`seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java`)
   
   2. **Multimodal Data Labeling**: How should this differ from the existing 
`LLMTransform` and `EmbeddingTransform` 
(`seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/EmbeddingTransform.java`)?
      - A) Auto-generate classification tags for images/videos/text
      - B) Call VLM (Vision Language Model) to generate descriptions
      - C) Integrate with specific labeling services/models
   
   3. **Use case**: Could you provide a complete pipeline example (e.g., 
`Source → PDFParser → Labeling → Sink`) with expected input/output fields and 
configuration?
   
   Currently, SeaTunnel supports multiple file formats 
(TEXT/JSON/EXCEL/XML/MARKDOWN/BINARY) via the file source connector and 
provides multimodal embedding transforms for text/images/videos (see 
`seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/multimodal/MultimodalModel.java`), 
but it has no PDF support and no dedicated labeling operators. Your answers to 
the questions above will help us evaluate feasibility and implementation 
priority.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
