DanielCarter-stack commented on issue #10608:
URL: https://github.com/apache/seatunnel/issues/10608#issuecomment-4073353198
Thank you for the feature request. To better understand your requirements, I
have a few clarifying questions:
1. **PDF Parser**: Which behavior do you expect?
   - A) A Source that reads PDF files directly (similar to how the existing
     file source reads Excel/JSON files)
   - B) A Transform that parses PDF binary content held in a field
   - C) A File Format extension (adding `PDF` to the `FileFormat` enum at
     `seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/config/FileFormat.java`)
2. **Multimodal Data Labeling**: How should this differ from the existing
   `EmbeddingTransform`
   (`seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/EmbeddingTransform.java`)
   and `LLMTransform`?
   - A) Auto-generate classification tags for images/videos/text
   - B) Call a VLM (Vision Language Model) to generate descriptions
   - C) Integrate with specific labeling services/models
3. **Use case**: Could you provide a complete pipeline example (e.g.,
`Source → PDFParser → Labeling → Sink`) with expected input/output fields and
configuration?
Currently, SeaTunnel supports multiple file formats
(including TEXT/JSON/EXCEL/XML/MARKDOWN/BINARY) via the file source connector and
provides multimodal embedding transforms for text/images/videos
(`MultimodalModel.java` at
`seatunnel-transforms-v2/src/main/java/org/apache/seatunnel/transform/nlpmodel/embedding/multimodal/MultimodalModel.java`),
but lacks PDF support and dedicated labeling operators. Clarifying the above
will help us evaluate feasibility and implementation priority.
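To make option 1C more concrete, here is a minimal sketch of what adding a `PDF` constant to a format enum could look like. It is an illustration only: `PdfFormatSketch`, `SketchFileFormat`, and `fromFileName` are hypothetical names, and the enum is a simplified stand-in, not the actual `FileFormat` class, which has more constants and per-format read/write strategies.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical, simplified stand-in for a file-format enum.
// It is NOT SeaTunnel's FileFormat; names and structure are assumptions.
class PdfFormatSketch {

    enum SketchFileFormat {
        TEXT("txt"),
        JSON("json"),
        EXCEL("xlsx"),
        PDF("pdf"); // the constant that option 1C would introduce

        private final String suffix;

        SketchFileFormat(String suffix) {
            this.suffix = suffix;
        }

        String suffix() {
            return suffix;
        }
    }

    // Resolve a format from a file name by its extension (case-insensitive).
    // Returns an empty Optional when there is no extension or no match.
    static Optional<SketchFileFormat> fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (SketchFileFormat format : SketchFileFormat.values()) {
            if (format.suffix().equals(ext)) {
                return Optional.of(format);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromFileName("report.PDF"));
    }
}
```

A real implementation would also need a reader that turns PDF pages into rows, which is where the answers to questions 1 and 3 (expected output fields) matter most.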
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.