onestardao opened a new issue, #54415: URL: https://github.com/apache/spark/issues/54415
Hi Spark community,

thank you for maintaining such an important platform for large-scale data processing. I see many production setups where Spark is used to generate embeddings, build retrieval corpora, and compute evaluation metrics around an LLM or RAG stack.

I maintain an MIT-licensed project called **WFGY Problem Map**, which is a 16-question checklist for debugging RAG / LLM pipelines. It focuses on where things go wrong in ingestion, chunking, indexing, and evaluation, especially at scale.

Why this could be useful for Spark users:

- Spark is often the place where raw data is transformed into the tables that become embeddings and indexes.
- Several of the 16 failure modes describe “looks fine as a Spark job, but fails as a retrieval system” type issues.
- The checklist is framework-agnostic and can be used regardless of whether people use Spark with Python, Scala, or another language.

WFGY Problem Map has been referenced in:

- Harvard MIMS Lab ToolUniverse
- QCRI LLM Lab Multimodal RAG Survey
- Rankify (University of Innsbruck)

Suggestion:

If you think it might help teams who are combining Spark with LLMs, one option could be a small link in the documentation or a “related resources” section for RAG / LLM use cases:

> “RAG / LLM debugging checklist: WFGY Problem Map (16 failure modes)”
> https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md

Project home: https://github.com/onestardao/WFGY

Thank you for considering and for all the work on Spark.

Best,
PSBigBig

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
