Hi Nishadi,

I have not seen bloom filters in Spark. They are mentioned as part of the ORC
file format (https://orc.apache.org/docs/spec-index.html), but I don't know
whether Spark uses them. Parquet stores block-level min/max values, null
counts, etc. for leaf columns in its metadata. I don't believe Spark uses
those directly either, though the underlying column reader may. See
https://github.com/apache/parquet-mr/tree/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata
and
https://github.com/apache/parquet-mr/tree/master/parquet-column/src/main/java/org/apache/parquet/column/statistics
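
To make the idea of block-level min/max statistics concrete, here is a rough
Python sketch of how a reader could skip blocks using them. This is not Spark
or parquet-mr code: `BlockStats` and `blocks_to_scan` are hypothetical names
made up for illustration, and the sketch assumes an integer column and an
inclusive range predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockStats:
    """Hypothetical per-block column statistics (illustrative, not a real API)."""
    block_id: int
    min_value: int
    max_value: int
    null_count: int


def blocks_to_scan(stats: List[BlockStats], lo: int, hi: int) -> List[int]:
    """Return ids of blocks whose [min, max] range overlaps lo <= x <= hi.

    Blocks whose range cannot contain a matching value are skipped
    without reading their data.
    """
    return [s.block_id for s in stats if s.max_value >= lo and s.min_value <= hi]


# Made-up statistics for three blocks of a single integer column.
stats = [
    BlockStats(block_id=0, min_value=1, max_value=100, null_count=0),
    BlockStats(block_id=1, min_value=101, max_value=200, null_count=3),
    BlockStats(block_id=2, min_value=150, max_value=300, null_count=0),
]

# Predicate: 120 <= x <= 160. Block 0 is skipped because its max is below 120.
print(blocks_to_scan(stats, 120, 160))
```

The statistics only rule blocks out; a block that survives the check still
has to be scanned, since overlapping ranges do not guarantee a match.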

Michael


> On Jun 29, 2016, at 11:27 PM, Nishadi Kirielle <ndime...@gmail.com> wrote:
> 
> Thank you for the response. 
> Could you please explain why bitmap indexes are not appropriate for big 
> data? 
> Rather than using traditional bitmap indexing techniques, we are planning 
> to implement a combination of novel bitmap indexing techniques such as 
> bit-sliced indexes and projection indexes. 
> Furthermore, could you please tell me whether bloom filters have already 
> been implemented in Spark?
> 
> Thank you
> 
> On Thu, Jun 30, 2016 at 12:51 AM, Jörn Franke <jornfra...@gmail.com 
> <mailto:jornfra...@gmail.com>> wrote:
> 
> Is it traditional bitmap indexing? I would not recommend it for big data. 
> You could use in-memory bloom filters and min/max indexes, which look to be 
> more appropriate. If you do want to use bitmap indexes, you would have to 
> do it as you say. However, bitmap indexes may consume a lot of memory, so 
> I am not sure that simply caching them in memory is desirable.
> 
> > On 29 Jun 2016, at 19:49, Nishadi Kirielle <ndime...@gmail.com 
> > <mailto:ndime...@gmail.com>> wrote:
> >
> > Hi All,
> >
> > I am a CSE undergraduate, and for our final-year project we are 
> > planning to build a cluster-based, bit-oriented analytic platform 
> > (storage engine) that provides fast query performance for OLAP by using 
> > novel bitmap indexing techniques when and where appropriate.
> >
> > For that we plan to use Spark SQL. We will need to implement a way 
> > to cache the bitmap indexes and incorporate bitmap indexing at 
> > the Catalyst optimizer level where possible.
> >
> > I would highly appreciate your feedback regarding the proposed approach.
> >
> > Thank you & Regards
> >
> > Nishadi Kirielle
> > Department of Computer Science and Engineering
> > University of Moratuwa
> > Sri Lanka
> 
