Hi all!

I am working on a Pyspark application and would like suggestions on how it
should be structured.

We have a number of possible jobs, organized in modules. There is also a
"RequestConsumer
<https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>"
class that consumes from a messaging queue. Each message contains the name
of the job to invoke and the arguments to pass to it. Messages are put into
the queue by cron jobs, manual triggers, and so on.
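
Roughly, the dispatch inside the RequestConsumer works like this (a
simplified sketch, not the actual code; the job function below is just a
placeholder):

    import json

    # Placeholder for a real job function that lives in one of our job
    # modules and does its work on the shared SparkSession.
    def listening_activity(year=None):
        ...

    # Maps the job name in an incoming message to the function that runs it.
    QUERY_MAP = {
        "stats.user.listening_activity": listening_activity,
    }

    def handle_message(body):
        """Decode a queue message and run the requested job with its arguments."""
        request = json.loads(body)
        handler = QUERY_MAP[request["query"]]
        return handler(**request.get("params", {}))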

We submit a zip file containing all the Python files to a Spark cluster
running on YARN and ask it to run the RequestConsumer. This
<https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
is the exact spark-submit command, for those interested. The results of the
jobs are collected
<https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
by the request consumer and pushed into another queue.
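
To give a clearer picture of that part, the consumer loop has roughly this
shape (again a simplified sketch; pika is only a stand-in for whichever
queue client is actually in use):

    import json

    import pika  # stand-in for the queue client, not necessarily ours

    def consume_forever(handle_message):
        """Read requests, run each job inside the long-lived Spark driver,
        and publish the result to a separate result queue."""
        connection = pika.BlockingConnection(pika.ConnectionParameters("broker-host"))
        channel = connection.channel()
        channel.queue_declare(queue="spark_request")
        channel.queue_declare(queue="spark_result")

        for method, properties, body in channel.consume(queue="spark_request"):
            result = handle_message(body)
            channel.basic_publish(
                exchange="",
                routing_key="spark_result",
                body=json.dumps(result),
            )
            channel.basic_ack(method.delivery_tag)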

My question is whether this type of structure makes sense. Should the
RequestConsumer instead run independently of Spark and invoke spark-submit
scripts when it needs to trigger a job? Or is there another recommended
approach?
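
For clarity, by "invoke spark-submit scripts" I mean something along these
lines (purely hypothetical; the entry point and arguments are made up):

    import json
    import subprocess

    def run_job(query, params):
        """Hypothetical alternative: launch each job as its own
        spark-submit instead of running it inside one long-lived driver."""
        subprocess.run(
            [
                "spark-submit",
                "--master", "yarn",
                "--py-files", "listenbrainz_spark.zip",
                "run_single_job.py",  # made-up entry point for one job
                "--query", query,
                "--params", json.dumps(params),
            ],
            check=True,
        )

The trade-off I am weighing is the per-job startup cost of separate
spark-submit runs versus keeping a single SparkSession alive in the
consumer.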

Thank you all in advance for taking the time to read this email and for
your help.

Regards,
Kartik.
