Thanks! The first link is old; here is a more recent one: 1) https://python.langchain.com/docs/integrations/providers/spark/#spark-sql-individual-tools
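For anyone landing on this thread later, the basic usage looks roughly like the sketch below. This is untested; the endpoint URL, model name, and data path are placeholders for your own deployment, and note the agent constructor lives in the langchain-experimental package:

    from pyspark.sql import SparkSession
    from langchain_openai import ChatOpenAI
    from langchain_experimental.agents import create_spark_dataframe_agent

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3://bucket/training-data/")  # placeholder path

    # Any LangChain chat model should work here; pointing ChatOpenAI at a
    # self-hosted, OpenAI-compatible endpoint is an assumption on my part.
    llm = ChatOpenAI(base_url="http://llm.internal:8000/v1", model="my-model")

    # The agent lets the LLM inspect and query the DataFrame on your behalf.
    agent = create_spark_dataframe_agent(llm=llm, df=df, verbose=True)
    agent.run("How many rows have a null label column?")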
Russell

On Fri, Jan 3, 2025 at 8:50 AM Gurunandan <gurunandan....@gmail.com> wrote:
> Hi Mayur,
> Please evaluate LangChain's Spark DataFrame Agent for your use case.
>
> Documentation:
> 1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/
> 2) https://python.langchain.com/docs/integrations/tools/spark_sql/
>
> regards,
> Guru
>
> On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
> >
> > Hi team,
> >
> > We are planning to use Spark for pre-processing the ML training data, given that the data is 500+ TB.
> >
> > One of the steps in the pre-processing requires us to use an LLM (our own deployment of a model). I wanted to understand the right way to architect this. These are the options I can think of:
> >
> > - Split this into multiple applications at the LLM step, and use a workflow manager to feed the output of application 1 to the LLM and the output of the LLM to application 2.
> > - Split this into multiple stages by writing orchestration code that feeds the output of the pre-LLM processing stages to the externally hosted LLM, and vice versa.
> >
> > I wanted to know whether there is an easier way to do this within Spark, or whether there are any plans to make such functionality a first-class citizen of Spark in the future. Please also suggest any better alternatives.
> >
> > Thanks,
> > Mayur
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
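On Mayur's original question of keeping the whole pipeline inside one Spark application: a common pattern is to call the hosted model's HTTP endpoint from a pandas UDF, so no workflow manager or application split is needed. A minimal sketch, assuming a simple JSON endpoint (the URL, request/response fields, and column names are all placeholders):

    import pandas as pd
    import requests
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    LLM_URL = "http://llm.internal:8000/generate"  # placeholder endpoint

    @pandas_udf("string")
    def annotate(texts: pd.Series) -> pd.Series:
        # Each executor sends its batch of rows to the hosted LLM.
        results = []
        for text in texts:
            resp = requests.post(LLM_URL, json={"prompt": text}, timeout=60)
            resp.raise_for_status()
            results.append(resp.json()["completion"])  # assumed response field
        return pd.Series(results)

    df = spark.read.parquet("s3://bucket/pre-llm/")  # placeholder path
    df = df.withColumn("llm_output", annotate(df["text"]))
    df.write.parquet("s3://bucket/post-llm/")

mapInPandas gives more control over batch sizes if the endpoint prefers larger requests, and some rate limiting on the executors is usually needed, but either way the LLM step stays a normal Spark transformation instead of a separate application.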