Hi Mich! We use this in production, but indeed there is much scope for improvement; configuration is one such area :).
Yes, we have a private on-premise cluster. We run Spark on YARN (no Airflow etc.), which controls the scheduling, and we use HDFS as the datastore.

Regards

On Wed, Jun 30, 2021 at 11:41 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks for the details, Kartik.
>
> Let me go through these. The code itself and the indentation look good.
>
> One minor thing I noticed is that you are not using a yaml file (config.yml) for your variables; you seem to embed them in your config.py code. That is what I used to do before :) until a friend advised me to put the variables in yaml and read them in the Python file. However, I guess that is a matter of personal style.
>
> Overall it looks neat. I believe you are running all of this on-premises and not using Airflow or Composer for your scheduling.
>
> Cheers
>
> Mich
>
> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On Wed, 30 Jun 2021 at 18:39, Kartik Ohri <kartikohr...@gmail.com> wrote:
>
>> Hi Mich!
>>
>> Thanks for the reply.
>>
>> The zip file contains all of the Spark-related code, in particular the contents of this folder <https://github.com/metabrainz/listenbrainz-server/tree/master/listenbrainz_spark>.
>> The requirements_spark.txt <https://github.com/metabrainz/listenbrainz-server/blob/master/requirements_spark.txt> is contained in the project and lists the non-Spark dependencies of the Python code.
>> The tar.gz file is created according to the PySpark docs <https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv> for dependency management. The spark.yarn.dist.archives setting also comes from there.
>>
>> This is the Python file <https://github.com/metabrainz/listenbrainz-server/blob/master/spark_manage.py> invoked by spark-submit to start the "RequestConsumer".
>>
>> Regards,
>> Kartik
>>
>> On Wed, Jun 30, 2021 at 9:02 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Kartik,
>>>
>>> Can you explain how you create your zip file? Does it include everything in your top project directory, as per PyCharm etc.?
>>>
>>> The rest looks OK, as you are creating a Python virtual environment:
>>>
>>> python3 -m venv pyspark_venv
>>> source pyspark_venv/bin/activate
>>>
>>> How do you create that requirements_spark.txt file?
>>>
>>> pip install -r requirements_spark.txt
>>> pip install venv-pack
>>>
>>> Where is this gz file used?
>>>
>>> venv-pack -o pyspark_venv.tar.gz
>>>
>>> Because I am not clear about the line below:
>>>
>>> --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
>>>
>>> It would help if you walked us through the shell script itself for clarification.
>>>
>>> HTH,
>>>
>>> Mich
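(Aside, on the config.yml point above: a minimal sketch of what reading the variables from a yaml file instead of embedding them in config.py might look like. The file name and keys below are purely illustrative, not actual project settings.)

    import yaml  # PyYAML

    # Load settings from a yaml file instead of hard-coding them in config.py.
    # "config.yml" and the keys below are made up for illustration.
    with open("config.yml") as f:
        config = yaml.safe_load(f)

    hdfs_uri = config["hdfs"]["uri"]
    request_queue = config["queues"]["requests"]

The rest of the code can then import these values from one place, which keeps environment-specific settings out of the version-controlled Python.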
>>>
>>> On Wed, 30 Jun 2021 at 15:47, Kartik Ohri <kartikohr...@gmail.com> wrote:
>>>
>>>> Hi all!
>>>>
>>>> I am working on a PySpark application and would like suggestions on how it should be structured.
>>>>
>>>> We have a number of possible jobs, organized in modules. There is also a "RequestConsumer" <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py> class which consumes from a messaging queue. Each message contains the name of the job to invoke and the arguments to pass to it. Messages are put into the queue by cron jobs, manual triggers, etc.
>>>>
>>>> We submit a zip file containing all of the Python files to a Spark cluster running on YARN and ask it to run the RequestConsumer. This <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34> is the exact spark-submit command, for the interested. The results of the jobs are collected <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122> by the RequestConsumer and pushed into another queue.
>>>>
>>>> My question is whether this type of structure makes sense. Should the RequestConsumer instead run independently of Spark and invoke spark-submit when it needs to trigger a job? Or is there another recommendation?
>>>>
>>>> Thank you all in advance for taking the time to read this email and for helping.
>>>>
>>>> Regards,
>>>> Kartik
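For anyone skimming the thread without opening the links: the pattern described in the original mail boils down to roughly the following. This is a heavily simplified sketch, not the actual ListenBrainz code; it assumes RabbitMQ via pika, and the queue names, message fields and example job are all made up.

    import json

    import pika  # assumption: RabbitMQ client; the real broker/client may differ
    from pyspark.sql import SparkSession

    # One long-lived driver started via spark-submit; the SparkSession stays
    # alive across all jobs.
    spark = SparkSession.builder.appName("request_consumer_sketch").getOrCreate()

    # Registry mapping a job name (as sent in the message) to a callable.
    JOBS = {
        "row_count": lambda path: {"rows": spark.read.parquet(path).count()},
    }

    def handle(channel, method, properties, body):
        # e.g. body = {"job": "row_count", "params": {"path": "/data/foo"}}
        message = json.loads(body)
        result = JOBS[message["job"]](**message.get("params", {}))
        channel.basic_publish(exchange="", routing_key="results", body=json.dumps(result))
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="requests")
    channel.queue_declare(queue="results")
    channel.basic_consume(queue="requests", on_message_callback=handle)
    channel.start_consuming()

The main property of this design is that one SparkSession is reused across all jobs, so there is no per-job startup cost, at the price of a single long-running YARN application holding its resources.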
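For comparison, the alternative raised in the question (a consumer running outside Spark that launches spark-submit per job) could look roughly like this; the entry-point script name, flags and message fields are again illustrative only.

    import json
    import subprocess

    def run_job(message_body: bytes) -> None:
        """Launch one spark-submit per incoming request."""
        message = json.loads(message_body)
        subprocess.run(
            [
                "spark-submit",
                "--master", "yarn",
                "--archives", "pyspark_venv.tar.gz#environment",
                "run_single_job.py",  # hypothetical per-job entry point
                "--job", message["job"],
                "--params", json.dumps(message.get("params", {})),
            ],
            check=True,
        )

This isolates failures per job and frees cluster resources between runs, but pays the Spark startup cost on every request.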