Re: Structuring a PySpark Application

2021-07-02 Thread Mich Talebzadeh
Hi Kartik, If you run this shell script for multiple spark-submit jobs, you may end up deleting a virtual environment while another job is still using it. Virtual environments should not really change much except when packages are added or updated. So this script avoids deleting the virtual environment…
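
A minimal sketch of the idea Mich describes, written in Python rather than shell: create the virtual environment only if it does not already exist, and never delete it mid-run, so a concurrent spark-submit job does not lose its environment. The directory name pyspark_venv comes from the thread; the helper itself is hypothetical.

    # Hypothetical sketch: reuse an existing virtual environment instead of
    # recreating (and deleting) it on every spark-submit invocation.
    import os
    import venv

    PYSPARK_VENV = "pyspark_venv"  # name taken from the thread

    def ensure_venv(path: str = PYSPARK_VENV) -> str:
        """Create the virtual environment only if it is missing."""
        if not os.path.isdir(path):
            # with_pip=True so packages from requirements_spark.txt can be installed
            venv.EnvBuilder(with_pip=True).create(path)
        # Never delete the directory here; another job may be using it.
        return path

    if __name__ == "__main__":
        print("Using virtual environment at", ensure_venv())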

Re: Structuring a PySpark Application

2021-07-01 Thread Kartik Ohri
Hi Mich! The shell script indeed looks more robust now :D Yes, the current setup works fine. I am wondering whether it is the right way to set things up. That is, should I run the program which accepts requests from the queue independently and have it invoke the spark-submit CLI, or something else? …
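
One way to read the question: a long-running consumer pulls a request off the queue and then launches the matching Spark job via the spark-submit CLI. A rough sketch of that variant using Python's subprocess module; the queue object, entry point, and request fields are hypothetical, while the zip file name and YARN master come from the thread.

    # Hypothetical sketch: a long-running consumer that launches spark-submit
    # for each request pulled from a queue. Queue handling is stubbed out.
    import subprocess

    def handle_request(request: dict) -> None:
        """Launch a Spark job for one queued request via the spark-submit CLI."""
        cmd = [
            "spark-submit",
            "--master", "yarn",          # the thread mentions Spark on YARN
            "--py-files", "DSBQ.zip",    # zip of the project code, as in the thread
            "main.py",                   # hypothetical entry point
            request["job_name"],         # hypothetical request field
        ]
        subprocess.run(cmd, check=True)

    def consume_forever(queue) -> None:
        # `queue` is a hypothetical object with a blocking get(); in the thread
        # this role is played by the RequestConsumer reading from a message queue.
        while True:
            handle_request(queue.get())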

Re: Structuring a PySpark Application

2021-07-01 Thread Mich Talebzadeh
Hi Kartik, I parameterized your shell script and tested it on a stub Python file, and it looks OK. To make the shell script more robust:

    #!/bin/bash
    set -e
    #cd "$(dirname "${BASH_SOURCE[0]}")/../"
    pyspark_venv="pyspark_venv"
    source_zip_file="DSBQ.zip"
    [ -d ${pyspark_venv} ] && rm -r -d ${pyspark_venv}
    …

Re: Structuring a PySpark Application

2021-06-30 Thread Kartik Ohri
Hi Gourav, Thanks for the suggestion, I'll check it out. Regards, Kartik …

Re: Structuring a PySpark Application

2021-06-30 Thread Gourav Sengupta
Hi, I think that reading Matei Zaharia's book "Spark: The Definitive Guide" will be a good starting point. Regards, Gourav Sengupta …

Re: Structuring a PySpark Application

2021-06-30 Thread Kartik Ohri
Hi Mich! We use this in production, but indeed there is much scope for improvement, configuration being one area :). Yes, we have a private on-premise cluster. We run Spark on YARN (no Airflow etc.), which controls the scheduling, and we use HDFS as the datastore. Regards …

Re: Structuring a PySpark Application

2021-06-30 Thread Mich Talebzadeh
Thanks for the details, Kartik. Let me go through these. The code itself and the indentation look good. One minor thing I noticed is that you are not using a YAML file (config.yml) for your variables; you seem to embed them in your config.py code. That is what I used to do before :) a friend advised…
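
A small sketch of the config.yml approach Mich alludes to: keep the variables in a YAML file and have config.py load them, rather than hard-coding them in Python. The keys shown are purely illustrative; PyYAML's safe_load is one common way to read the file.

    # config.py -- hypothetical sketch: load variables from config.yml
    # instead of embedding them directly in Python code.
    import yaml  # PyYAML

    def load_config(path: str = "config.yml") -> dict:
        with open(path) as fh:
            return yaml.safe_load(fh)

    # Example config.yml (keys are illustrative only):
    #
    #   spark:
    #     app_name: my_pyspark_app
    #     shuffle_partitions: 200
    #   hdfs:
    #     base_path: /data/incoming

    config = load_config()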

Re: Structuring a PySpark Application

2021-06-30 Thread Kartik Ohri
Hi Mich! Thanks for the reply. The zip file contains all of the Spark-related code, particularly the contents of this folder. The requirements_spark.txt…

Re: Structuring a PySpark Application

2021-06-30 Thread Mich Talebzadeh
Hi Kartik, Can you explain how you create your zip file? Does it include everything in your top-level project directory as per PyCharm etc.? The rest looks OK, as you are creating a Python virtual environment:

    python3 -m venv pyspark_venv
    source pyspark_venv/bin/activate

How do you create that requirements_spark.txt?
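
For context on what Mich is asking about, one hedged way to produce both artifacts from the project root in Python: shutil.make_archive and pip freeze are standard, but the source directory name is only an assumption based on the DSBQ.zip name used in the thread.

    # Hypothetical sketch: package the project into DSBQ.zip and capture the
    # environment's installed packages into requirements_spark.txt.
    import shutil
    import subprocess
    import sys

    # Zip the project source tree (directory name is an assumption).
    shutil.make_archive("DSBQ", "zip", root_dir="dsbq_project")

    # Record installed packages so the virtual environment can be rebuilt later.
    frozen = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=True,
    ).stdout
    with open("requirements_spark.txt", "w") as fh:
        fh.write(frozen)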

Structuring a PySpark Application

2021-06-30 Thread Kartik Ohri
Hi all! I am working on a PySpark application and would like suggestions on how it should be structured. We have a number of possible jobs, organized in modules. There is also a "RequestConsumer"…
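
Since the original message is cut off here, a hedged sketch of the kind of structure being asked about: job modules registered by name, and a RequestConsumer that dispatches queued requests to them inside one long-running Spark session. Apart from the RequestConsumer name, everything below is invented for illustration.

    # Hypothetical sketch: job modules exposing a run(spark, **params) style
    # function, and a RequestConsumer dispatching queued requests to them.
    from pyspark.sql import SparkSession

    # In the real project these would live in separate modules, e.g. jobs/stats.py
    def run_stats_job(spark, **params):
        ...  # placeholder job body

    JOB_REGISTRY = {
        "stats": run_stats_job,  # job names are illustrative
    }

    class RequestConsumer:
        """Pulls requests from a queue and runs the matching Spark job."""

        def __init__(self, queue):
            self.queue = queue  # hypothetical queue with a blocking get()
            self.spark = SparkSession.builder.appName("request_consumer").getOrCreate()

        def run(self):
            while True:
                request = self.queue.get()
                job = JOB_REGISTRY[request["job_name"]]
                job(self.spark, **request.get("params", {}))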