Let me add some more detail to the Data Flow Diagram (DFD) for the entire pipeline, as attached, plus a couple of rough code sketches below for the orchestration and the LLM inference stages.
Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom

view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice: "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Fri, 3 Jan 2025 at 21:47, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Well, I can give you some advice using Google Cloud tools.
>
> Level 0: High-Level Overview
>
> 1. Input: Raw data in Google Cloud Storage (GCS).
> 2. Processing:
>    - Pre-processing with Dataproc (Spark)
>    - Inference with an LLM (Cloud Run/Vertex AI)
>    - Post-processing with Dataproc (Spark)
> 3. Output: Final processed dataset stored in GCS or the Google BigQuery DW.
>
> Level 1: Detailed Data Flow
>
> 1. *Step 1: Pre-Processing*
>    - Input: Raw data from GCS.
>    - Process: Transform raw data using Spark on *Dataproc*.
>    - Output: Pre-processed data stored back in *GCS*.
>
> 2. *Step 2: LLM Inference*
>    - Input: Pre-processed data from GCS.
>    - Process: Pre-processed data is sent in batches to the *LLM Inference Service* hosted on *Cloud Run/Vertex AI*; the LLM generates inferences for each batch.
>    - Output: LLM-inferred results stored in *GCS*.
>
> 3. *Step 3: Post-Processing*
>    - Input: LLM-inferred results from *GCS*.
>    - Process: Additional transformations, aggregations, or merging with other datasets using Spark on *Dataproc*.
>    - Output: Final dataset stored in *GCS* or loaded into the *Google BigQuery DW* for downstream ML training.
>
> *Orchestration*
>
> Use *Cloud Composer*, which sits on top of *Apache Airflow*, or just Airflow itself.
>
> *Monitoring*
>
> - Job performance -> Dataproc
> - LLM API throughput -> Cloud Run/Vertex AI
> - Storage and data transfer metrics -> GCS
> - Logs -> Google Cloud Logging
>
> *Notes*
>
> The LLM-inferred results are the predictions, insights, or transformations performed by the LLM on the input data. These results are the outputs of the model's reasoning, natural language understanding, or processing capabilities applied to the input.
>
> HTH
>
> Mich Talebzadeh
>
> On Fri, 3 Jan 2025 at 13:08, Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>> Hi team,
>>
>> We are planning to use Spark for pre-processing the ML training data, given the data is 500+ TB.
>>
>> One of the steps in the data pre-processing requires us to use an LLM (our own deployment of the model).
>> I wanted to understand what is the right way to architect this. These are the options that I can think of:
>>
>> - Split this into multiple applications at the LLM use-case step. Use a workflow manager to feed the output of application 1 to the LLM and feed the output of the LLM to application 2.
>> - Split this into multiple stages by writing orchestration code that feeds the output of the pre-LLM processing stages to the externally hosted LLM, and vice versa.
>>
>> I wanted to know if there is an easier way to do this within Spark, or whether there are any plans to make such functionality a first-class citizen of Spark in the future. Also, please suggest any other better alternatives.
>>
>> Thanks,
>> Mayur
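To make the orchestration suggestion in my earlier reply quoted above a little more concrete, here is a minimal Cloud Composer/Airflow DAG sketch that chains the three stages. The project, region, cluster, bucket and script names are placeholders, and the LLM inference task is shown as a plain Python callable rather than a specific Vertex AI or Cloud Run client call, so treat it as a starting point only, not a finished implementation.

    # Minimal Airflow DAG sketch: Dataproc pre-processing -> LLM inference -> Dataproc post-processing.
    # All identifiers below (project, region, cluster, bucket, script paths) are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    PROJECT_ID = "my-project"            # placeholder
    REGION = "europe-west2"              # placeholder
    CLUSTER = "preprocessing-cluster"    # placeholder

    def pyspark_job(main_uri: str) -> dict:
        # Build a Dataproc job spec for a PySpark script stored in GCS.
        return {
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": main_uri},
        }

    def run_llm_inference(**kwargs):
        # Placeholder: call the Cloud Run / Vertex AI endpoint here in batches,
        # or trigger a separate batch prediction job and poll for completion.
        pass

    with DAG(
        dag_id="llm_preprocessing_pipeline",
        start_date=datetime(2025, 1, 1),
        schedule_interval=None,   # trigger manually, or replace with a schedule/GCS sensor
        catchup=False,
    ) as dag:

        pre_process = DataprocSubmitJobOperator(
            task_id="pre_process",
            project_id=PROJECT_ID,
            region=REGION,
            job=pyspark_job("gs://my-bucket/jobs/pre_process.py"),
        )

        llm_inference = PythonOperator(
            task_id="llm_inference",
            python_callable=run_llm_inference,
        )

        post_process = DataprocSubmitJobOperator(
            task_id="post_process",
            project_id=PROJECT_ID,
            region=REGION,
            job=pyspark_job("gs://my-bucket/jobs/post_process.py"),
        )

        pre_process >> llm_inference >> post_process

Keeping the three stages as separate jobs matches the DFD below: each stage reads from and writes to GCS, so any stage can be re-run on its own if the LLM step fails part-way through.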
"DFD for entire pipeline" [Raw Data in GCS] --> [Pre-Processing (Dataproc)] --> [Pre-Processed Data in GCS] --> --> [LLM Inference Service] --> [LLM Results in GCS] --> [Post-Processing (Dataproc)] --> --> [Final Dataset in GCS/BigQuery] 1) High level DFD [Raw Data in GCS] | | V [Pre-Processing (Dataproc)] - Reads raw data from GCS - Filters, aggregates, and formats data | | V 2) Pre-Processing Stage DFD [Pre-Processed Data in GCS] - Data Sources: Raw data files stored in GCS. - Processes: Spark-based transformations. - Output: Pre-processed data in GCS. [Pre-Processed Data in GCS] | | V 3) [LLM Inference Service] - Batch data read from GCS - Sends data to LLM for inference - Receives inference results | | V [LLM Results in GCS] 4) Post-Processing Stage DFD [LLM Results in GCS] | | V [Post-Processing (Dataproc)] - Reads inference results - Merges with other datasets or performs additional transformations | | V [Final Dataset in GCS/BigQuery] - Data Sources: LLM results stored in GCS. - Processes: Spark-based processing for final dataset preparation. - Output: Final dataset in GCS or BigQuery. 5) Additional Details Interactions with GCS: - Data is read from and written to GCS at multiple stages to ensure scalability and persistence. - Each processing stage works on batches of data, leveraging partitioning and optimized file formats like Parquet. - Parallelization with Spark: - Spark parallelizes both pre-processing and post-processing to handle the 500+ TB dataset efficiently. LLM Service: - Hosted on Cloud Run/Vertex AI to scale horizontally. - Accepts batches of data for inference and processes asynchronously.