Hi Mich! We use this in production, but indeed there is much scope for improvement; configuration is one such area :).
Yes, we have a private on-premise cluster. We run Spark on YARN (no Airflow etc.), which controls the scheduling, and we use HDFS as the datastore.

Regards

On Wed, Jun 30, 2021 at 11:41 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks for the details, Kartik.
>
> Let me go through these. The code itself and the indentation look good.
>
> One minor thing I noticed is that you are not using a yaml file (config.yml) for your variables; you seem to embed them in your config.py code. That is what I used to do before :) until a friend advised me to put the variables in yaml and read them in the Python file. However, I guess that is a matter of personal style.
>
> Overall it looks neat. I believe you are running all of this on-premises and not using Airflow or Composer for your scheduling.
>
> Cheers
>
> Mich
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Wed, 30 Jun 2021 at 18:39, Kartik Ohri <kartikohr...@gmail.com> wrote:
>
>> Hi Mich!
>>
>> Thanks for the reply.
>>
>> The zip file contains all of the Spark-related code, in particular the contents of this folder <https://github.com/metabrainz/listenbrainz-server/tree/master/listenbrainz_spark>.
>> The requirements_spark.txt <https://github.com/metabrainz/listenbrainz-server/blob/master/requirements_spark.txt> is contained in the project and lists the non-Spark dependencies of the Python code.
>> The tar.gz file is created according to the PySpark docs <https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv> for dependency management. The spark.yarn.dist.archives setting also comes from there.
>>
>> This is the Python file <https://github.com/metabrainz/listenbrainz-server/blob/master/spark_manage.py> invoked by spark-submit to start the "RequestConsumer".
>>
>> Regards,
>> Kartik
>>
>> On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Kartik,
>>>
>>> Can you explain how you create your zip file? Does it include everything in your top project directory, as per PyCharm etc.?
>>>
>>> The rest looks OK, as you are creating a Python virtual environment:
>>>
>>> python3 -m venv pyspark_venv
>>> source pyspark_venv/bin/activate
>>>
>>> How do you create that requirements_spark.txt file?
>>>
>>> pip install -r requirements_spark.txt
>>> pip install venv-pack
>>>
>>> Where is this gz file used?
>>>
>>> venv-pack -o pyspark_venv.tar.gz
>>>
>>> Because I am not clear about the line below:
>>>
>>> --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
>>>
>>> It would help if you walked us through the shell script itself for clarification.
>>>
>>> HTH,
>>>
>>> Mich
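(Aside, on the config.yml point above: a minimal sketch of what reading the variables from a yaml file instead of embedding them in config.py might look like. The file name and keys below are purely illustrative, not actual project settings.)

    import yaml  # PyYAML

    # Load settings from a yaml file instead of hard-coding them in config.py.
    # "config.yml" and the keys below are made up for illustration.
    with open("config.yml") as f:
        config = yaml.safe_load(f)

    hdfs_uri = config["hdfs"]["uri"]
    request_queue = config["queues"]["requests"]

The rest of the code can then import these values from one place, which keeps environment-specific settings out of the version-controlled Python.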
>>>
>>> On Wed, 30 Jun 2021 at 15:47, Kartik Ohri <kartikohr...@gmail.com> wrote:
>>>
>>>> Hi all!
>>>>
>>>> I am working on a PySpark application and would like suggestions on how it should be structured.
>>>>
>>>> We have a number of possible jobs, organized in modules. There is also a "RequestConsumer" <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py> class which consumes from a messaging queue. Each message contains the name of the job to invoke and the arguments to pass to it. Messages are put into the queue by cron jobs, manual triggers, etc.
>>>>
>>>> We submit a zip file containing all of the Python files to a Spark cluster running on YARN and ask it to run the RequestConsumer. This <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34> is the exact spark-submit command, for the interested. The results of the jobs are collected <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122> by the RequestConsumer and pushed into another queue.
>>>>
>>>> My question is whether this type of structure makes sense. Should the RequestConsumer instead run independently of Spark and invoke spark-submit when it needs to trigger a job? Or is there another recommendation?
>>>>
>>>> Thank you all in advance for taking the time to read this email and for helping.
>>>>
>>>> Regards,
>>>> Kartik
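For anyone skimming the thread without opening the links: the pattern described in the original mail boils down to roughly the following. This is a heavily simplified sketch, not the actual ListenBrainz code; it assumes RabbitMQ via pika, and the queue names, message fields and example job are all made up.

    import json

    import pika  # assumption: RabbitMQ client; the real broker/client may differ
    from pyspark.sql import SparkSession

    # One long-lived driver started via spark-submit; the SparkSession stays
    # alive across all jobs.
    spark = SparkSession.builder.appName("request_consumer_sketch").getOrCreate()

    # Registry mapping a job name (as sent in the message) to a callable.
    JOBS = {
        "row_count": lambda path: {"rows": spark.read.parquet(path).count()},
    }

    def handle(channel, method, properties, body):
        # e.g. body = {"job": "row_count", "params": {"path": "/data/foo"}}
        message = json.loads(body)
        result = JOBS[message["job"]](**message.get("params", {}))
        channel.basic_publish(exchange="", routing_key="results", body=json.dumps(result))
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="requests")
    channel.queue_declare(queue="results")
    channel.basic_consume(queue="requests", on_message_callback=handle)
    channel.start_consuming()

The main property of this design is that one SparkSession is reused across all jobs, so there is no per-job startup cost, at the price of a single long-running YARN application holding its resources.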
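For comparison, the alternative raised in the question (a consumer running outside Spark that launches spark-submit per job) could look roughly like this; the entry-point script name, flags and message fields are again illustrative only.

    import json
    import subprocess

    def run_job(message_body: bytes) -> None:
        """Launch one spark-submit per incoming request."""
        message = json.loads(message_body)
        subprocess.run(
            [
                "spark-submit",
                "--master", "yarn",
                "--archives", "pyspark_venv.tar.gz#environment",
                "run_single_job.py",  # hypothetical per-job entry point
                "--job", message["job"],
                "--params", json.dumps(message.get("params", {})),
            ],
            check=True,
        )

This isolates failures per job and frees cluster resources between runs, but pays the Spark startup cost on every request.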