Hi Gourav,

Thanks for the suggestion, I'll check it out.
Regards,
Kartik

On Thu, Jul 1, 2021 at 5:38 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> I think that reading Matei Zaharia's book "Spark: The Definitive Guide" will be a good starting point.
>
> Regards,
> Gourav Sengupta
>
> On Wed, Jun 30, 2021 at 3:47 PM Kartik Ohri <kartikohr...@gmail.com> wrote:
>
>> Hi all!
>>
>> I am working on a Pyspark application and would like suggestions on how it should be structured.
>>
>> We have a number of possible jobs, organized in modules. There is also a "RequestConsumer" class <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py> which consumes from a messaging queue. Each message contains the name of the job to invoke and the arguments to be passed to it. Messages are put into the message queue by cronjobs, manually, etc.
>>
>> We submit a zip file containing all Python files to a Spark cluster running on YARN and ask it to run the RequestConsumer. This <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34> is the exact spark-submit command for the interested. The results of the jobs are collected <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122> by the request consumer and pushed into another queue.
>>
>> My question is whether this type of structure makes sense. Should the RequestConsumer instead run independently of Spark and invoke spark-submit scripts when it needs to trigger a job? Or is there another recommendation?
>>
>> Thank you all in advance for taking the time to read this email and helping.
>>
>> Regards,
>> Kartik.
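For readers following along, the dispatch pattern the quoted question describes (a long-running consumer that maps a job name in each message to a function and forwards the result to a response queue) can be sketched roughly as below. All names here (QUERY_REGISTRY, register, handle_message, the "echo" job) are illustrative assumptions, not the actual listenbrainz-server code, and the real RequestConsumer would additionally hold a shared SparkSession and talk to a real message broker.

```python
import json

# Registry mapping job names to callables. In the real application each
# job would live in its own module; this sketch keeps everything inline.
QUERY_REGISTRY = {}

def register(name):
    """Decorator that adds a job function to the registry under `name`."""
    def wrapper(func):
        QUERY_REGISTRY[name] = func
        return func
    return wrapper

@register("echo")
def echo(message):
    # Stand-in for a real Spark job that would run DataFrame queries.
    return {"echo": message}

def handle_message(raw_message):
    """Decode one queue message, dispatch to the named job, and return
    the result ready to be published to the response queue."""
    request = json.loads(raw_message)
    job = QUERY_REGISTRY.get(request["query"])
    if job is None:
        return {"error": "unknown job: " + request["query"]}
    return job(**request.get("params", {}))

# A cronjob or manual trigger would enqueue a message like this one;
# the consumer loop would call handle_message for each delivery.
result = handle_message(json.dumps({"query": "echo",
                                    "params": {"message": "hi"}}))
```

One consequence of this single-process design, relevant to the question asked, is that all jobs share one Spark application and its resources; the alternative of invoking spark-submit per job trades that sharing for isolation between jobs.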