Let me add some more detail to the Data Flow Diagram (DFD) for the entire pipeline, as attached, plus a couple of rough code sketches below for the orchestration and the LLM inference stages.
Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom

view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice: "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Fri, 3 Jan 2025 at 21:47, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Well, I can give you some advice using Google Cloud tools.
>
> Level 0: High-Level Overview
>
> 1. Input: Raw data in Google Cloud Storage (GCS).
> 2. Processing:
>    - Pre-processing with Dataproc (Spark)
>    - Inference with an LLM (Cloud Run/Vertex AI)
>    - Post-processing with Dataproc (Spark)
> 3. Output: Final processed dataset stored in GCS or the Google BigQuery DW.
>
> Level 1: Detailed Data Flow
>
> 1. *Step 1: Pre-Processing*
>    - Input: Raw data from GCS.
>    - Process: Transform raw data using Spark on *Dataproc*.
>    - Output: Pre-processed data stored back in *GCS*.
>
> 2. *Step 2: LLM Inference*
>    - Input: Pre-processed data from GCS.
>    - Process: Pre-processed data is sent in batches to the *LLM Inference Service* hosted on *Cloud Run/Vertex AI*; the LLM generates inferences for each batch.
>    - Output: LLM-inferred results stored in *GCS*.
>
> 3. *Step 3: Post-Processing*
>    - Input: LLM-inferred results from *GCS*.
>    - Process: Additional transformations, aggregations, or merging with other datasets using Spark on *Dataproc*.
>    - Output: Final dataset stored in *GCS* or loaded into the *Google BigQuery DW* for downstream ML training.
>
> *Orchestration*
>
> Use *Cloud Composer*, which sits on top of *Apache Airflow*, or just Airflow itself.
>
> *Monitoring*
>
> - Job performance -> Dataproc
> - LLM API throughput -> Cloud Run/Vertex AI
> - Storage and data transfer metrics -> GCS
> - Logs -> Google Cloud Logging
>
> *Notes*
>
> The LLM-inferred results are the predictions, insights, or transformations performed by the LLM on the input data. These results are the outputs of the model's reasoning, natural language understanding, or processing capabilities applied to the input.
>
> HTH
>
> Mich Talebzadeh
>
> On Fri, 3 Jan 2025 at 13:08, Mayur Dattatray Bhosale <ma...@sarvam.ai> wrote:
>> Hi team,
>>
>> We are planning to use Spark for pre-processing the ML training data, given the data is 500+ TB.
>>
>> One of the steps in the data pre-processing requires us to use an LLM (our own deployment of the model).
>> I wanted to understand what is the right way to architect this. These are the options that I can think of:
>>
>> - Split this into multiple applications at the LLM use-case step. Use a workflow manager to feed the output of application 1 to the LLM and feed the output of the LLM to application 2.
>> - Split this into multiple stages by writing orchestration code that feeds the output of the pre-LLM processing stages to the externally hosted LLM, and vice versa.
>>
>> I wanted to know if there is an easier way to do this within Spark, or whether there are any plans to make such functionality a first-class citizen of Spark in the future. Also, please suggest any other better alternatives.
>>
>> Thanks,
>> Mayur
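To make the orchestration suggestion in my earlier reply quoted above a little more concrete, here is a minimal Cloud Composer/Airflow DAG sketch that chains the three stages. The project, region, cluster, bucket and script names are placeholders, and the LLM inference task is shown as a plain Python callable rather than a specific Vertex AI or Cloud Run client call, so treat it as a starting point only, not a finished implementation.

    # Minimal Airflow DAG sketch: Dataproc pre-processing -> LLM inference -> Dataproc post-processing.
    # All identifiers below (project, region, cluster, bucket, script paths) are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    PROJECT_ID = "my-project"            # placeholder
    REGION = "europe-west2"              # placeholder
    CLUSTER = "preprocessing-cluster"    # placeholder

    def pyspark_job(main_uri: str) -> dict:
        # Build a Dataproc job spec for a PySpark script stored in GCS.
        return {
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": main_uri},
        }

    def run_llm_inference(**kwargs):
        # Placeholder: call the Cloud Run / Vertex AI endpoint here in batches,
        # or trigger a separate batch prediction job and poll for completion.
        pass

    with DAG(
        dag_id="llm_preprocessing_pipeline",
        start_date=datetime(2025, 1, 1),
        schedule_interval=None,   # trigger manually, or replace with a schedule/GCS sensor
        catchup=False,
    ) as dag:

        pre_process = DataprocSubmitJobOperator(
            task_id="pre_process",
            project_id=PROJECT_ID,
            region=REGION,
            job=pyspark_job("gs://my-bucket/jobs/pre_process.py"),
        )

        llm_inference = PythonOperator(
            task_id="llm_inference",
            python_callable=run_llm_inference,
        )

        post_process = DataprocSubmitJobOperator(
            task_id="post_process",
            project_id=PROJECT_ID,
            region=REGION,
            job=pyspark_job("gs://my-bucket/jobs/post_process.py"),
        )

        pre_process >> llm_inference >> post_process

Keeping the three stages as separate jobs matches the DFD below: each stage reads from and writes to GCS, so any stage can be re-run on its own if the LLM step fails part-way through.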
"DFD for entire pipeline" [Raw Data in GCS] --> [Pre-Processing (Dataproc)] --> [Pre-Processed Data in GCS] --> --> [LLM Inference Service] --> [LLM Results in GCS] --> [Post-Processing (Dataproc)] --> --> [Final Dataset in GCS/BigQuery] 1) High level DFD [Raw Data in GCS] | | V [Pre-Processing (Dataproc)] - Reads raw data from GCS - Filters, aggregates, and formats data | | V 2) Pre-Processing Stage DFD [Pre-Processed Data in GCS] - Data Sources: Raw data files stored in GCS. - Processes: Spark-based transformations. - Output: Pre-processed data in GCS. [Pre-Processed Data in GCS] | | V 3) [LLM Inference Service] - Batch data read from GCS - Sends data to LLM for inference - Receives inference results | | V [LLM Results in GCS] 4) Post-Processing Stage DFD [LLM Results in GCS] | | V [Post-Processing (Dataproc)] - Reads inference results - Merges with other datasets or performs additional transformations | | V [Final Dataset in GCS/BigQuery] - Data Sources: LLM results stored in GCS. - Processes: Spark-based processing for final dataset preparation. - Output: Final dataset in GCS or BigQuery. 5) Additional Details Interactions with GCS: - Data is read from and written to GCS at multiple stages to ensure scalability and persistence. - Each processing stage works on batches of data, leveraging partitioning and optimized file formats like Parquet. - Parallelization with Spark: - Spark parallelizes both pre-processing and post-processing to handle the 500+ TB dataset efficiently. LLM Service: - Hosted on Cloud Run/Vertex AI to scale horizontally. - Accepts batches of data for inference and processes asynchronously.