I think I have found out what was causing me difficulties.  It seems I was
reading too much into the stage description shown in the "Stages" tab of
the Spark application UI.  While it said "repartition at
NativeMethodAccessorImpl.java:-2", I can infer from the network traffic and
from its response to changes I subsequently made that the actual code that
was running was the code doing the HBase lookups.  I suspect the actual
shuffle, once it occurred, required network IO on the same order as the
upload to Elasticsearch that followed.
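
As a rough illustration of how this kind of misattribution can arise, here
is a toy sketch in plain Python (not Spark, and not the actual pipeline).
The names `lookup`, `lazy_map`, and `repartition` are made-up stand-ins: a
lazy mapper does no work when it is defined, so all of its lookups run
inside whichever later step consumes it.  My understanding is that Spark
pipelines the mapper into the stage labelled "repartition" in much the same
way, but treat that as an assumption.

```python
calls = []  # records every simulated lookup, in the order it actually runs


def lookup(doc_id):
    # Hypothetical stand-in for resolving one id to a document via HBase.
    calls.append(doc_id)
    return {"id": doc_id, "body": "doc-%d" % doc_id}


def lazy_map(fn, items):
    # Generator: behaves like a lazy transformation. Nothing is computed
    # until some downstream step iterates over the result.
    for item in items:
        yield fn(item)


def repartition(items, num_partitions):
    # Toy round-robin redistribution. It has to consume its input, so the
    # upstream lookups execute here, under this step's name.
    partitions = [[] for _ in range(num_partitions)]
    for i, item in enumerate(items):
        partitions[i % num_partitions].append(item)
    return partitions


docs = lazy_map(lookup, range(6))
lookups_before = len(calls)   # defining the mapper ran no lookups

parts = repartition(docs, 3)
lookups_after = len(calls)    # every lookup ran while "repartition" executed
```

In this toy model, any time spent in `lookup` would be charged to the
`repartition` step, even though the redistribution itself is cheap.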

Eric



On Mon, Aug 31, 2015 at 6:09 PM, Eric Walker <eric.wal...@gmail.com> wrote:

> Hi,
>
> I am working on a pipeline that carries out a number of stages, the last
> of which is to build some large JSON objects from information in the
> preceding stages.  The JSON objects are then uploaded to Elasticsearch in
> bulk.
>
> If I carry out a shuffle via a `repartition` call after the JSON documents
> have been created, the upload to ES is fast.  But the shuffle itself takes
> many tens of minutes and is IO-bound.
>
> If I omit the repartition, the upload to ES takes a long time due to a
> complete lack of parallelism.
>
> Currently, the step that precedes the assembling of the JSON documents,
> which goes into the final repartition call, is the querying of pairs of
> object ids.  In a mapper the ids are resolved to documents by querying
> HBase.  The initial pairs of ids are obtained via a query against the SQL
> context, and the query result is repartitioned before going into the mapper
> that resolves the ids into documents.
>
> It's not clear to me why the final repartition preceding the upload to ES
> is required.  I would like to omit it, since it is so expensive and
> involves so much network IO, but have not found a way to do this yet.  If I
> omit the repartition, the job takes much longer.
>
> Does anyone know what might be going on here, and what I might be able to
> do to get rid of the last `repartition` call before the upload to ES?
>
> Eric
>
>
