And if you come across comprehensive documentation of Parquet
configuration, please share it!

The Parquet documentation says that it can be configured but doesn't
explain how: http://parquet.apache.org/documentation/latest/
Apparently, both Tajo (http://tajo.apache.org/docs/0.8.0/table_management/parquet.html)
and Drill (https://drill.apache.org/docs/parquet-format/) expose some
configuration parameters for Parquet.
If Hive has configuration parameters for Parquet too, I couldn't find them
documented anywhere.
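
For what it's worth, here is a hedged sketch of the few Parquet writer
settings I have seen set from Hive; the property names come from parquet-mr's
ParquetOutputFormat rather than from any Hive documentation, so verify them
against your versions:

SET parquet.compression=SNAPPY;       -- output compression codec
SET parquet.block.size=134217728;     -- target row-group size in bytes (128 MB)
SET parquet.page.size=1048576;        -- page size within a row group (1 MB)
SET parquet.enable.dictionary=true;   -- toggle dictionary encoding

CREATE TABLE my_parquet_table (myid BIGINT, payload STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');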



On 23 February 2018 at 16:48, Sun, Keith <ai...@ebay.com> wrote:

> I got your point and thanks for the nice slides info.
>
>
> So the Parquet filter is not an easy thing; I will try it according to
> the deck.
>
>
> Thanks !
> ------------------------------
> *From:* Furcy Pin <pin.fu...@gmail.com>
> *Sent:* Friday, February 23, 2018 3:37:52 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Why the filter push down does not reduce the read data
> record count
>
> Hi,
>
> Unless your table is partitioned or bucketed by myid, Hive generally
> requires reading through all the records to find the ones that match
> your predicate.
>
> In other words, Hive tables are generally not indexed for single-record
> retrieval the way you would expect RDBMS or Vertica tables to be.
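>
> To illustrate, a hedged sketch of what bucketing on myid could look like
> (the table and extra column are made up; a high-cardinality id suits
> bucketing better than partitioning):
>
> CREATE TABLE events_bucketed (myid BIGINT, payload STRING)
> CLUSTERED BY (myid) INTO 256 BUCKETS  -- co-locates each myid in one bucket
> STORED AS ORC;
>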
> Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding
> bloom filters on specific columns of a table
> <https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/>,
> which can work as a kind of index.
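>
> A minimal sketch for ORC, using the two table properties described in the
> linked post (the table itself is made up):
>
> CREATE TABLE events_orc (myid BIGINT, payload STRING)
> STORED AS ORC
> TBLPROPERTIES (
>   'orc.bloom.filter.columns'='myid',  -- columns that get bloom filters
>   'orc.bloom.filter.fpp'='0.05'       -- target false-positive probability
> );
>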
> Also, depending on the query engine you are using (Hive, Spark-SQL,
> Impala, Presto...) and its version, it may or may not be able to leverage
> certain storage optimizations.
> For example, Spark still does not support the Hive bucketed table
> optimization, but it might come in the upcoming Spark 2.3.
>
>
> I'm much less familiar with Parquet, so if anyone has links to good
> documentation for Parquet fine-tuning (or, even better, a comparison with
> ORC features), that would be really helpful.
> By googling, I found these slides where someone at Netflix
> <https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide>
> seems to have tried the same kind of optimization as you in Parquet.
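>
> If I read those slides right, the main trick is to cluster and sort the
> data on the filtered column when writing, so that each row group's min/max
> statistics become selective enough for the pushdown to skip whole row
> groups. A hedged HiveQL sketch (table names made up):
>
> INSERT OVERWRITE TABLE events_parquet
> SELECT * FROM events_raw
> DISTRIBUTE BY myid SORT BY myid;  -- each myid lands in few, sorted row groups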
>
>
>
>
>
> On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote:
>
> Hi,
>
>
> Why does Hive still read so many records even with filter pushdown enabled,
> when the returned dataset is a very small amount (4k out of 30 billion
> records)?
>
>
> The "RECORDS_IN" counter of Hive which still showed the 30billion count
> and also the output in the map reduce log like this :
>
> org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 100000
>
>
> BTW, I am using Parquet as the storage format, and the filter pushdown did
> work, as I see this in the log:
>
>
> AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
> eq(myid, 223)
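>
> For reference, I assume these are the session settings that govern the
> pushdown (names from HiveConf; please correct me if some are irrelevant):
>
> SET hive.optimize.ppd=true;           -- logical predicate pushdown
> SET hive.optimize.ppd.storage=true;   -- push predicates to the storage handler
> SET hive.optimize.index.filter=true;  -- enable file-format-level filtering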
>
>
> Thanks,
>
> Keith
>
>
>
>
