Hi Mich! Thanks for the reply.
The zip file contains all of the Spark-related code, specifically the contents of this folder
<https://github.com/metabrainz/listenbrainz-server/tree/master/listenbrainz_spark>. The
requirements_spark.txt <https://github.com/metabrainz/listenbrainz-server/blob/master/requirements_spark.txt>
file lives in the project and lists the non-Spark dependencies of the Python code. The tar.gz file is
created according to the PySpark docs
<https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv>
for dependency management; the spark.yarn.dist.archives setting also comes from there. This is the
Python file <https://github.com/metabrainz/listenbrainz-server/blob/master/spark_manage.py> that
spark-submit invokes to start the "RequestConsumer".

Regards,
Kartik

On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Kartik,
>
> Can you explain how you create your zip file? Does it include everything in
> your top project directory, as per PyCharm etc.?
>
> The rest looks OK, as you are creating a Python virtual env:
>
> python3 -m venv pyspark_venv
> source pyspark_venv/bin/activate
>
> How do you create that requirements_spark.txt file?
>
> pip install -r requirements_spark.txt
> pip install venv-pack
>
> Where is this gz file used?
>
> venv-pack -o pyspark_venv.tar.gz
>
> Because I am not clear about the line below:
>
> --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
>
> It would help if you walked us through the shell script itself for clarification. HTH,
>
> Mich
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Wed, 30 Jun 2021 at 15:47, Kartik Ohri <kartikohr...@gmail.com> wrote:
>
>> Hi all!
>>
>> I am working on a PySpark application and would like suggestions on how
>> it should be structured.
>>
>> We have a number of possible jobs, organized in modules. There is also a
>> "RequestConsumer
>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>"
>> class which consumes from a messaging queue. Each message contains the name
>> of the job to invoke and the arguments to pass to it. Messages are put into
>> the queue by cron jobs, manually, etc.
>>
>> We submit a zip file containing all the Python files to a Spark cluster
>> running on YARN and ask it to run the RequestConsumer. This
>> <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
>> is the exact spark-submit command, for the interested. The results of the
>> jobs are collected
>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
>> by the request consumer and pushed into another queue.
>>
>> My question is whether this type of structure makes sense. Should the
>> RequestConsumer instead run independently of Spark and invoke spark-submit
>> scripts when it needs to trigger a job? Or is there another recommendation?
>>
>> Thank you all in advance for taking the time to read this email and
>> helping.
>>
>> Regards,
>> Kartik.
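For anyone following the thread without the repository open, here is roughly how the pieces
described at the top of this mail fit together, based on the PySpark packaging docs linked above.
This is only a sketch: the zip name, the --py-files usage and the PYSPARK_PYTHON setting are
illustrative assumptions on my part, not the exact flags from the linked
docker/start-spark-request-consumer.sh script.

  # Build the dependency archive and the code zip (names here are illustrative).
  python3 -m venv pyspark_venv
  source pyspark_venv/bin/activate
  pip install -r requirements_spark.txt
  pip install venv-pack
  venv-pack -o pyspark_venv.tar.gz                    # the tar.gz shipped to YARN
  zip -r listenbrainz_spark.zip listenbrainz_spark/   # the zip of Spark-related code

  # Submit: spark.yarn.dist.archives ships the packed venv to YARN, where it is
  # unpacked under the alias "environment" and used as the Python runtime.
  spark-submit \
    --master yarn \
    --py-files listenbrainz_spark.zip \
    --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
    --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON"=./environment/bin/python \
    spark_manage.py

The command actually used in the repo may differ in deploy mode and exact conf keys; the linked
start-spark-request-consumer.sh has the authoritative version.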