Hi Mich!

Thanks for the reply.

The zip file contains all of the Spark-related code, specifically the contents
of this folder
<https://github.com/metabrainz/listenbrainz-server/tree/master/listenbrainz_spark>.
The requirements_spark.txt file
<https://github.com/metabrainz/listenbrainz-server/blob/master/requirements_spark.txt>
lives in the project and lists the non-Spark dependencies of the Python code.
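
For illustration only, one way to produce such a zip of that package folder
from the project root (a sketch; the actual build step may differ and the
paths here are assumptions):

# Illustrative sketch: package the listenbrainz_spark folder into a zip
# that spark-submit can ship to the cluster. Names/paths are assumptions.
import shutil

shutil.make_archive(
    base_name="listenbrainz_spark",   # produces listenbrainz_spark.zip
    format="zip",
    root_dir=".",                     # run from the project root
    base_dir="listenbrainz_spark",    # only the package folder goes in
)
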
The tar.gz file is created for dependency management as described in the
PySpark docs
<https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv>;
the spark.yarn.dist.archives setting also comes from there.
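
In other words, YARN ships the packed virtualenv to every node, unpacks it as
./environment, and the Python processes are pointed at the interpreter inside
it. Roughly the same thing expressed as SparkSession config rather than
spark-submit flags (a sketch only, assuming YARN and the same archive name;
the actual submission script passes the archive setting as a --conf flag on
the spark-submit command line, as in the line you quoted):

# Sketch: programmatic equivalent of the --archives / --conf flags.
# Assumes pyspark_venv.tar.gz sits in the submission working directory.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("request-consumer")  # app name is illustrative
    # Ship the packed virtualenv; YARN unpacks it under ./environment
    .config("spark.yarn.dist.archives", "pyspark_venv.tar.gz#environment")
    # Point the executors and the YARN application master at that Python
    .config("spark.executorEnv.PYSPARK_PYTHON", "./environment/bin/python")
    .config("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "./environment/bin/python")
    .getOrCreate()
)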

This is the Python file
<https://github.com/metabrainz/listenbrainz-server/blob/master/spark_manage.py>
that spark-submit invokes to start the "RequestConsumer".
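
For context, the overall shape of that entry point is roughly the following.
This is a simplified, hypothetical sketch, not the actual code: the job name,
message format, and handler registry are stand-ins, and the real consumer
blocks on the message queue rather than a hard-coded message.

# Hypothetical sketch of the RequestConsumer shape described above.
# All names and the message format here are illustrative.
import json

from pyspark.sql import SparkSession


def import_full_dump(spark, dump_id):
    """Example job; real jobs live in the listenbrainz_spark modules."""
    return {"status": "imported", "dump_id": dump_id}


JOB_HANDLERS = {
    "import_full_dump": import_full_dump,   # job name -> callable
}


def main():
    spark = SparkSession.builder.appName("request-consumer").getOrCreate()
    # The real consumer loops on a message queue; one fake message here
    # just shows the dispatch and the result hand-off.
    for raw in ['{"name": "import_full_dump", "params": {"dump_id": 42}}']:
        message = json.loads(raw)
        result = JOB_HANDLERS[message["name"]](spark, **message["params"])
        print(json.dumps(result))  # real code pushes results to another queue


if __name__ == "__main__":
    main()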

Regards,
Kartik


On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Kartik,
>
> Can you explain how you create your zip file? Does it include everything in
> your top-level project directory, as laid out in PyCharm etc.?
>
> The rest looks OK, as you are creating a Python virtual env:
>
> python3 -m venv pyspark_venv
> source pyspark_venv/bin/activate
>
> How do you create that requirements_spark.txt file?
>
> pip install -r requirements_spark.txt
> pip install venv-pack
>
>
> Where is this gz file used?
> venv-pack -o pyspark_venv.tar.gz
>
> Because I am not clear about the line below:
>
> --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
>
> It would help if you walked us through the shell script itself for clarification. HTH,
>
> Mich
>
>
> On Wed, 30 Jun 2021 at 15:47, Kartik Ohri <kartikohr...@gmail.com> wrote:
>
>> Hi all!
>>
>> I am working on a Pyspark application and would like suggestions on how
>> it should be structured.
>>
>> We have a number of possible jobs, organized in modules. There is also a "
>> RequestConsumer
>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>"
>> class that consumes from a message queue. Each message contains the name
>> of the job to invoke and the arguments to be passed to it. Messages are put
>> into the queue by cronjobs, manually, etc.
>>
>> We submit a zip file containing all Python files to a Spark cluster
>> running on YARN and ask it to run the RequestConsumer. This
>> <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
>> is the exact spark-submit command, for those interested. The results of the
>> jobs are collected
>> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
>> by the RequestConsumer and pushed into another queue.
>>
>> My question is whether this type of structure makes sense. Should the
>> RequestConsumer instead run independently of Spark and invoke spark-submit
>> scripts when it needs to trigger a job? Or is there another recommendation?
>>
>> Thank you all in advance for taking the time to read this email and
>> helping.
>>
>> Regards,
>> Kartik.
>>
>>
>>
