[I] Suggestion: reference WFGY Problem Map (RAG / LLM debugging checklist) for Spark + LLM workloads [spark]

via GitHub Sat, 21 Feb 2026 07:30:23 -0800


onestardao opened a new issue, #54415:
URL: https://github.com/apache/spark/issues/54415


   Hi Spark community,
   
   Thank you for maintaining such an important platform for large-scale data 
processing. I see many production setups where Spark is used to generate 
embeddings, build retrieval corpora, and compute evaluation metrics around an 
LLM or RAG stack.
   
   I maintain an MIT-licensed project called **WFGY Problem Map**, which is a 
16-question checklist for debugging RAG / LLM pipelines. It focuses on where 
things go wrong in ingestion, chunking, indexing, and evaluation, especially at 
scale.
   
   Why this could be useful for Spark users:
   - Spark is often the place where raw data is transformed into the tables 
that become embeddings and indexes.
   - Several of the 16 failure modes describe pipelines that look fine as a 
Spark job but fail as a retrieval system.
   - The checklist is framework-agnostic and can be used regardless of whether 
people use Spark with Python, Scala, or another language.
   
   WFGY Problem Map has been referenced in:
   - Harvard MIMS Lab ToolUniverse
   - QCRI LLM Lab Multimodal RAG Survey
   - Rankify (University of Innsbruck)
   
   Suggestion:
   
   If you think it might help teams who are combining Spark with LLMs, one 
option could be a small link in the documentation or a “related resources” 
section for RAG / LLM use cases:
   
   > “RAG / LLM debugging checklist: WFGY Problem Map (16 failure modes)”  
   > https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
   
   Project home: https://github.com/onestardao/WFGY
   
   Thank you for considering this, and for all the work on Spark.
   
   Best,  
   PSBigBig


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


