Hi Nishadi,

I have not seen bloom filters in Spark. They are mentioned as part of the ORC file format, but I don't know whether Spark uses them: https://orc.apache.org/docs/spec-index.html

Parquet stores block-level min/max values, null counts, etc. for leaf columns in its metadata. I don't believe Spark uses those directly either, though the underlying column reader may. See https://github.com/apache/parquet-mr/tree/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata and https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/column/statistics
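
To make the min/max idea concrete, here is a minimal Python sketch of how per-block statistics can let a reader skip blocks for an equality filter. It is not Spark or parquet-mr code: `BlockStats`, `may_contain`, and `blocks_to_scan` are hypothetical names, and the statistics are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockStats:
    """Hypothetical per-block statistics for one column (not a parquet-mr class)."""
    min_value: int
    max_value: int
    null_count: int


def may_contain(stats: BlockStats, value: int) -> bool:
    """Return False only when the block provably cannot hold `value`."""
    return stats.min_value <= value <= stats.max_value


def blocks_to_scan(all_stats: List[BlockStats], value: int) -> List[int]:
    """Indexes of blocks that an equality filter `col == value` must still read."""
    return [i for i, s in enumerate(all_stats) if may_contain(s, value)]


# Made-up statistics for three blocks of a single integer column.
stats = [
    BlockStats(min_value=1, max_value=100, null_count=0),
    BlockStats(min_value=101, max_value=200, null_count=3),
    BlockStats(min_value=150, max_value=300, null_count=0),
]

# Block 0 cannot contain 175, so only blocks 1 and 2 need to be read.
print(blocks_to_scan(stats, 175))
```

Like a bloom filter, this check can only rule blocks out. A block whose range covers the value may still not contain it, so the surviving blocks have to be scanned.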

Michael

> On Jun 29, 2016, at 11:27 PM, Nishadi Kirielle <ndime...@gmail.com> wrote:
>
> Thank you for the response.
> Can I please know the reason why bitmap indexes are not appropriate for big data?
> Rather than using the traditional bitmap indexing techniques, we are planning
> to implement a combination of novel bitmap indexing techniques like
> bit-sliced indexes and projection indexes.
> Furthermore, can I please know whether bloom filters have already been
> implemented in Spark?
>
> Thank you
>
> On Thu, Jun 30, 2016 at 12:51 AM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Is it the traditional bitmap indexing? I would not recommend it for big data.
> You could use bloom filters and min/max indexes in-memory, which look to be
> more appropriate. However, if you want to use bitmap indexes, then you would
> have to do it as you say. Bitmap indexes may consume a lot of memory, though,
> so I am not sure that simply caching them in-memory is desirable.
>
> > On 29 Jun 2016, at 19:49, Nishadi Kirielle <ndime...@gmail.com> wrote:
> >
> > Hi All,
> >
> > I am a CSE undergraduate, and for our final-year project we are
> > expecting to construct a cluster-based, bit-oriented analytic platform
> > (storage engine) to provide fast query performance when used for OLAP,
> > using novel bitmap indexing techniques when and where appropriate.
> >
> > For that we are expecting to use Spark SQL. We will need to implement a way
> > to cache the bitmap indexes and incorporate the use of bitmap indexing at
> > the Catalyst optimizer level when it is possible.
> >
> > I would highly appreciate your feedback regarding the proposed approach.
> >
> > Thank you & Regards
> >
> > Nishadi Kirielle
> > Department of Computer Science and Engineering
> > University of Moratuwa
> > Sri Lanka