Hi,

My limited understanding of Spark tells me that a task is the smallest
unit of work, so Spark itself won't give you much finer-grained
progress than that. I wouldn't expect it to, since "account" is a
business entity, not a Spark one.

What about using mapPartitions* to get inside the partitions and
report progress yourself (log to stdout or wherever)? Just a thought.
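For example, something along these lines (a minimal sketch; accounts
and process(...) stand in for your own RDD and per-account
computation, and the 100-account logging interval is an arbitrary
choice):

  val results = accounts.mapPartitionsWithIndex { (partId, iter) =>
    var count = 0L
    iter.map { account =>
      val result = process(account) // your ~1s per-account computation
      count += 1
      // report every 100 accounts, i.e. roughly every 100 seconds
      if (count % 100 == 0)
        println(s"Partition $partId: $count accounts processed")
      result
    }
  }

Note that println on an executor goes to that executor's stdout log,
not to the driver console, so look in the executor logs (or the
stdout links in the web UI). An accumulator might also work if you
want the counts to show up in the UI instead.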

Regards,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Sun, Nov 29, 2015 at 3:12 PM, Yuhao Yang <hhb...@gmail.com> wrote:
> Hi all,
>
> I've got a simple processing job for 20000 accounts on 8 partitions. That's
> roughly 2500 accounts per partition. Each account takes about 1s to
> complete the computation, which means each partition will take about 2500
> seconds to finish the batch.
>
> My question is how I can get detailed progress on how many accounts have
> been processed in each partition during the computation. An ideal solution
> would let me know periodically (say, every minute) how many accounts have
> been processed, so I can monitor and take action to save some time.
> Right now the UI only tells me that the task is running.
>
> I know one solution is to split the data horizontally on the driver and
> submit it to Spark in mini-batches, yet I think that would waste some
> cluster resources and create extra complexity for result handling.
>
> Any experience or best practice is welcome. Thanks a lot.
>
> Regards,
> Yuhao
