Re: LLM based data pre-processing

2025-01-04 Thread Mich Talebzadeh
Hi Russel, Spark's GPU scheduling capabilities have improved significantly with the advent of tools like the NVIDIA RAPIDS Accelerator for Spark. The NVIDIA RAPIDS Accelerator for Spark is directly relevant to

Re: LLM based data pre-processing

2025-01-04 Thread Mich Talebzadeh
Let us add some more detail to DFD diagram Data for the Entire Pipeline as attached Mich Talebzadeh, Architect | Data Science | Financial Crime | Forensic Analysis | GDPR PhD Imperial College London

Re: LLM based data pre-processing

2025-01-03 Thread Holden Karau
So it's improved a lot with resource profiles, but in the OSS it's far from automatic: you'll have to do a bunch of manual work setting up the resource profiles and tagging your stages with them. For hosted solutions like Databricks there might be some magic. On Fri, Jan 3, 2025 at 9:19 AM Russell Jurney …
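The manual work Holden mentions is Spark's stage-level scheduling API (Spark 3.1+). A rough sketch of tagging only the inference stage with a GPU profile; the resource amounts, preprocessed_rdd, and run_inference are placeholders, and in OSS this generally also requires dynamic allocation:

    from pyspark.resource import (ExecutorResourceRequests,
                                  TaskResourceRequests,
                                  ResourceProfileBuilder)

    # Ask for GPU-bearing executors for this stage only
    ereqs = ExecutorResourceRequests().cores(4).resource("gpu", 1)
    # One GPU per task so inference tasks don't oversubscribe the device
    treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

    # Earlier CPU stages run under the default profile; only this stage asks for GPUs
    inferred = preprocessed_rdd.withResources(gpu_profile).mapPartitions(run_inference)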

Re: LLM based data pre-processing

2025-01-03 Thread Mich Talebzadeh
Well, I can give you some advice using Google Cloud tools.
Level 0: High-Level Overview
1. Input: Raw data in Google Cloud Storage (GCS).
2. Processing:
   - Pre-processing with Dataproc (Spark on tin box)
   - Inference with LLM (Cloud Run/Vertex AI)
   - Post-processing with Dataproc …
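A sketch of what the inference hop in that pipeline could look like from the Dataproc side, calling a model served on Cloud Run. The endpoint URL, bucket paths, field names, batch size, and response shape are all placeholders:

    import requests
    from pyspark.sql import SparkSession

    ENDPOINT = "https://llm-infer-xyz.a.run.app/generate"   # placeholder URL

    def call_endpoint(session, texts):
        resp = session.post(ENDPOINT, json={"inputs": texts}, timeout=120)
        resp.raise_for_status()
        for text in resp.json()["outputs"]:   # assumes the service returns {"outputs": [...]}
            yield (text,)

    def infer_partition(rows):
        session = requests.Session()          # reuse one HTTP connection per partition
        batch = []
        for row in rows:
            batch.append(row.text)            # assumes a "text" field in the input
            if len(batch) == 32:              # micro-batch to amortize request overhead
                yield from call_endpoint(session, batch)
                batch = []
        if batch:
            yield from call_endpoint(session, batch)

    spark = SparkSession.builder.appName("llm-inference").getOrCreate()
    raw = spark.read.json("gs://my-bucket/raw/")              # placeholder bucket
    out = raw.rdd.mapPartitions(infer_partition).toDF(["generated"])
    out.write.parquet("gs://my-bucket/processed/")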

Re: LLM based data pre-processing

2025-01-03 Thread Russell Jurney
How well does Spark handle scheduling for GPUs these days? Three years ago my team used GPUs with Spark on Databricks (one of the first customers), and we couldn't saturate our GPUs beyond 35% utilization when doing inference, encoding string fields with sentence transformers for fuzzy string matching. This w…
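One common lever for that utilization problem is to feed the GPU larger batches from inside a pandas UDF, loading the model once per executor process. A sketch assuming sentence-transformers; the model name, batch size, and column names are illustrative:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import ArrayType, DoubleType

    @pandas_udf(ArrayType(DoubleType()))
    def encode_udf(texts: pd.Series) -> pd.Series:
        from sentence_transformers import SentenceTransformer
        global _model
        if "_model" not in globals():          # load once per Python worker, not per batch
            _model = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
        embs = _model.encode(texts.tolist(), batch_size=256, show_progress_bar=False)
        return pd.Series([e.tolist() for e in embs])

    # assuming a DataFrame df with a string column "text"
    df = df.withColumn("embedding", encode_udf("text"))

Raising spark.sql.execution.arrow.maxRecordsPerBatch also matters here, since it caps how many rows each UDF invocation sees and therefore how full the GPU batches can be.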

Re: LLM based data pre-processing

2025-01-03 Thread Holden Karau
So I've been working on similar LLM pre-processing of data, and I would say one of the questions worth answering is: do you want/need your models to be collocated? If you're running on-prem in a GPU-rich env there are a lot of benefits, but even with a custom model, if you're using 3rd-party inference or …
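To make the collocated option concrete: one shape it can take is an inference engine loaded in-process on each GPU executor. The engine choice here (vLLM), the model path, and the "prompt" field are purely illustrative:

    from vllm import LLM, SamplingParams

    def generate_partition(rows):
        global _llm
        if "_llm" not in globals():                   # one engine per executor process
            _llm = LLM(model="/mnt/models/my-model")  # hypothetical local model path
        params = SamplingParams(temperature=0.0, max_tokens=256)
        prompts = [r.prompt for r in rows]
        for out in _llm.generate(prompts, params):
            yield out.outputs[0].text

    cleaned = df.rdd.mapPartitions(generate_partition)

The third-party-inference alternative is the HTTP pattern sketched above, where GPU capacity is someone else's scaling problem.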

Re: LLM based data pre-processing

2025-01-03 Thread Russell Jurney
Thanks! The first link is old; here is a more recent one: 1) https://python.langchain.com/docs/integrations/providers/spark/#spark-sql-individual-tools Russell On Fri, Jan 3, 2025 at 8:50 AM Gurunandan wrote: > Hi Mayur, > Please evaluate LangChain's Spark DataFrame Agent for your use case. > …

Re: LLM based data pre-processing

2025-01-03 Thread Gurunandan
Hi Mayur, Please evaluate LangChain's Spark DataFrame Agent for your use case. Documentation: 1) https://python.langchain.com/v0.1/docs/integrations/toolkits/spark/ 2) https://python.langchain.com/docs/integrations/tools/spark_sql/ Regards, Guru On Fri, Jan 3, 2025 at 6:38 PM Mayur Dattatray Bhosale …
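From those docs, wiring up the agent is a few lines. The experimental package and an OpenAI chat model are used here for illustration (any chat model works), and the CSV file is just a sample:

    from pyspark.sql import SparkSession
    from langchain_openai import ChatOpenAI
    from langchain_experimental.agents import create_spark_dataframe_agent

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("titanic.csv", header=True, inferSchema=True)

    # The agent generates and executes PySpark code against df
    agent = create_spark_dataframe_agent(llm=ChatOpenAI(temperature=0), df=df, verbose=True)
    agent.run("How many rows are there?")

Since the agent writes its own PySpark on the fly, it is better suited to ad hoc analysis than to a fixed 500 TB batch pipeline.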

Re: LLM based data pre-processing

2025-01-03 Thread Russell Jurney
I don't have an answer, but I have the very same questions and am eagerly awaiting a solid response :) Russell On Fri, Jan 3, 2025 at 5:07 AM Mayur Dattatray Bhosale wrote: > Hi team, > > We are planning to use Spark for pre-processing the ML training data given > the data is 500+ TBs. > > One …

LLM based data pre-processing

2025-01-03 Thread Mayur Dattatray Bhosale
Hi team, We are planning to use Spark for pre-processing the ML training data, given the data is 500+ TBs. One of the steps in the pre-processing requires us to use an LLM (our own deployment of the model). I wanted to understand what is the right way to architect this. These are the options that I can …
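One architecture note that applies to whichever option is chosen: decouple the cheap CPU transforms from the expensive LLM step with an intermediate checkpoint, so the inference stage can be sized and retried independently. A minimal sketch; the paths and column names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("phase1-cpu-preprocess").getOrCreate()

    # Phase 1: plain CPU Spark; normalize and filter, then checkpoint to Parquet
    (spark.read.json("gs://my-bucket/raw/")            # placeholder input
          .selectExpr("id", "lower(trim(text)) AS text")
          .filter("length(text) > 0")
          .write.mode("overwrite")
          .parquet("gs://my-bucket/staged/"))          # phase 2 (the LLM step) reads here

At 500+ TB, the staged Parquet also gives a natural retry boundary when the inference phase hits stragglers or endpoint errors.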