Sorry if this isn't helpful, but the other obvious thing is to store
intermediate data in Parquet whenever the same code/data is repeated and
can be shared between jobs, provided tests indicate it is faster. Before
Parquet this wasn't necessarily advantageous, since reading from disk is
slower than working in RAM, which is where the recomputation might happen.
Parquet opens up opportunities here by competing better with repeated
computation. You could time both approaches to figure out how to optimize
your scripts. Again, you're probably doing this :)
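
To make the "compare the two" idea concrete, here is a rough sketch of a
timing harness in Python. Everything in it is an assumption for
illustration: compute_intermediate is a made-up stand-in for the shared
part of a job, and a JSON file stands in for the Parquet output. In
practice you would time the actual Pig jobs, one that recomputes and one
that loads the stored intermediate.

```python
import json
import os
import tempfile
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time (seconds) over `repeats` calls of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compute_intermediate():
    # Hypothetical stand-in for the repeated, shareable part of a job.
    return [i * i for i in range(200_000)]


def choose_strategy(recompute, reload, repeats=3):
    """Time both callables and report which one was faster."""
    t_recompute = best_time(recompute, repeats)
    t_reload = best_time(reload, repeats)
    winner = "reload" if t_reload < t_recompute else "recompute"
    return winner, t_recompute, t_reload


# Materialize the intermediate result once.
# JSON is only a stand-in for Parquet here, to keep the sketch stdlib-only.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "intermediate.json")
with open(path, "w") as f:
    json.dump(compute_intermediate(), f)


def reload_intermediate():
    # Read the stored intermediate back instead of recomputing it.
    with open(path) as f:
        return json.load(f)


winner, t_recompute, t_reload = choose_strategy(compute_intermediate, reload_intermediate)
print(f"recompute: {t_recompute:.4f}s  reload: {t_reload:.4f}s  -> {winner}")
```

Which side wins depends entirely on the workload and the storage, so the
numbers from a toy like this say nothing about your jobs. The point is only
the shape of the test: same result both ways, measured several times, keep
whichever is faster.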

Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com


On Thu, Feb 7, 2019 at 3:29 PM Michael Doo <michael....@verve.com> wrote:

> Indeed. When loading Parquet using org.apache.parquet.pig.ParquetLoader(),
> we're specifying the schema for which columns we want to load.
>
> On 2/7/19, 5:14 PM, "Russell Jurney" <russell.jur...@gmail.com> wrote:
>
>     Well, the obvious thing is to load only those columns you need. Just in
>     case you’re not doing this.
>
>     On Thu, Feb 7, 2019 at 2:04 PM Michael Doo <michael....@verve.com> wrote:
>
>     > Hey all,
>     > I’ve been migrating some processes over from ingesting Avro to ingesting
>     > Parquet. In Spark, we’re seeing 2x-8x performance gains when using Parquet
>     > over Avro. In Pig, similar processes are about the same runtime between the
>     > two formats (and sometimes even higher using Parquet). We’ve enabled
>     > dictionary filtering as well as predicate filter/pushdown. Wondering if
>     > there are other settings / strategies we might be missing to take advantage
>     > of Parquet.
>     >
>     > Thanks,
>     > Michael
>     >
>     --
>     Russell Jurney @rjurney <http://twitter.com/rjurney>
>     russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>     <http://facebook.com/jurney> datasyndrome.com
>
>
>
