I think I have found what was causing my difficulties. It seems I was reading too much into the stage description shown in the "Stages" tab of the Spark application UI. While it said "repartition at NativeMethodAccessorImpl.java:-2", I can infer from the network traffic, and from how the job responded to changes I subsequently made, that the code actually running was the code doing the HBase lookups. I suspect the shuffle itself, once it occurred, required network IO on the same order as the upload to Elasticsearch that followed.
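
To illustrate, here is a minimal PySpark sketch (all names are placeholders, not the actual pipeline code) of what I believe was happening: Spark pipelines narrow transformations such as `map` into the stage that ends at the next shuffle, so the HBase-lookup mapper executes inside the stage that the UI labels after the `repartition` call.

from pyspark import SparkContext

sc = SparkContext(appName="stage-naming-demo")

# Stand-in for the id pairs coming out of the earlier stages.
id_pairs = sc.parallelize([(i, i + 1) for i in range(1000)], 8)

def resolve_pair(pair):
    # Stand-in for the per-record HBase lookup. Because map is a narrow
    # transformation, this work runs in the same stage as the shuffle
    # write for the repartition below.
    left_id, right_id = pair
    return {"left": left_id, "right": right_id}

documents = id_pairs.map(resolve_pair)

# The shuffle boundary: the Stages tab names the whole preceding stage
# after this call, even though most of its wall-clock time can be spent
# in the map above.
documents.repartition(64).count()
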
Eric

On Mon, Aug 31, 2015 at 6:09 PM, Eric Walker <eric.wal...@gmail.com> wrote:

> Hi,
>
> I am working on a pipeline that carries out a number of stages, the last
> of which is to build some large JSON objects from information in the
> preceding stages. The JSON objects are then uploaded to Elasticsearch in
> bulk.
>
> If I carry out a shuffle via a `repartition` call after the JSON documents
> have been created, the upload to ES is fast. But the shuffle itself takes
> many tens of minutes and is IO-bound.
>
> If I omit the repartition, the upload to ES takes a long time due to a
> complete lack of parallelism.
>
> Currently, the step that precedes the assembling of the JSON documents,
> which goes into the final repartition call, is the querying of pairs of
> object ids. In a mapper the ids are resolved to documents by querying
> HBase. The initial pairs of ids are obtained via a query against the SQL
> context, and the query result is repartitioned before going into the mapper
> that resolves the ids into documents.
>
> It's not clear to me why the final repartition preceding the upload to ES
> is required. I would like to omit it, since it is so expensive and
> involves so much network IO, but have not found a way to do this yet. If I
> omit the repartition, the job takes much longer.
>
> Does anyone know what might be going on here, and what I might be able to
> do to get rid of the last `repartition` call before the upload to ES?
>
> Eric
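
For concreteness, a rough sketch of the tail of the pipeline described in the quoted message, assuming the upload goes through the Python elasticsearch client's bulk helper inside foreachPartition (index name, host, and document shape are placeholders; the real job may use elasticsearch-hadoop instead). The point of the sketch is that the number of partitions feeding foreachPartition sets the upload parallelism, which is why dropping the repartition serializes the upload.

from pyspark import SparkContext
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

sc = SparkContext(appName="es-upload-sketch")

# Stand-in for the RDD of assembled JSON documents from the earlier stages.
json_documents = sc.parallelize(
    [{"pair_id": i, "body": "..."} for i in range(1000)], 4)

def upload_partition(docs):
    # One client and one bulk stream per partition, so the degree of upload
    # parallelism is the number of partitions feeding this call.
    es = Elasticsearch(["http://es-host:9200"])
    actions = ({"_index": "my_index", "_type": "doc", "_source": doc}
               for doc in docs)
    bulk(es, actions)

# The expensive repartition: it shuffles the full documents across the
# network, but without it the upload runs in only as many tasks as there
# are partitions left over from the preceding stages.
json_documents.repartition(64).foreachPartition(upload_partition)
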