Re: Is there such thing as cache fusion with the underlying tables/files on HDFS

Jörn Franke Sun, 18 Sep 2016 03:08:59 -0700

Ignite has a special cache for HDFS data (which is not a Java cache), for RDDs 
etc. So you are right: in this sense it is very different.


Besides caching, what I see from data scientists is that for interactive 
queries and model evaluation they do not browse the complete data anyway. Even 
with in-memory solutions this is painfully slow if you receive several TB of 
data per hour. 

What they do is sampling, e.g. select a relevant small subset of the data, 
evaluate several different models on the sampled data in "real time", and then 
calculate the winning model as a batch job later. 
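
To make the sampling idea concrete, here is a minimal sketch in plain Python. 
It is only an illustration under assumptions of mine: the data generator, the 
two candidate "models" and the error metric are hypothetical stand-ins, and 
reservoir sampling is just one way to pick the subset.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

def mean_abs_error(predict, rows):
    """Average absolute difference between prediction and observed value."""
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)

def make_rows(n, seed=1):
    """Hypothetical data source: (x, y) pairs with y roughly 3x + 1 plus noise."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(0, 10)
        yield x, 3 * x + 1 + rng.gauss(0, 0.5)

# Hypothetical candidate models to compare interactively on the sample.
candidates = {
    "linear": lambda x: 3 * x + 1,
    "constant": lambda x: 16.0,
}

sample = reservoir_sample(make_rows(100_000), k=1_000)
scores = {name: mean_abs_error(f, sample) for name, f in candidates.items()}
winner = min(scores, key=scores.get)
print(winner, scores)
```

The winner picked on the sample would then be computed on the full data as a 
batch job.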

Additionally, probabilistic data structures are employed in some cases. For 
example, if you want to count the number of unique viewers of a web site, it 
does not make sense to browse through the logs for user IDs all the time. 
Instead you can employ a HyperLogLog structure, which needs little memory and 
can be accessed in real time.
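
As a rough illustration of the HyperLogLog idea, here is a simplified sketch 
in Python. It is not a production implementation: the class name, the choice 
of SHA-1 as hash and the parameter p=12 are my assumptions, and it only 
includes the small-range (linear counting) correction.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog sketch with 2**p one-byte registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = bytearray(self.m)   # 4096 bytes for p=12
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # constant for m >= 128

    def add(self, item):
        # Take 64 bits of a hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))

# Hypothetical usage: 150,000 page views from 50,000 distinct user IDs.
hll = HyperLogLog()
for i in range(150_000):
    hll.add(f"user{i % 50_000}")
print(hll.count())  # close to 50000, within a few percent
```

With p=12 the registers take about 4 KB regardless of how many views are 
added, at the price of a typical error of around 1.6%.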

For visualizations, I think in the area of big data it also makes a lot of 
sense to visualize aggregations based on sampling. If you really need the 
last 0.0001% of precision, you can click on the visualization and the 
system takes some time to calculate it.
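
A small sketch of what such a sampled aggregation could look like, again in 
Python. The data (response times), the function names and the sample size are 
hypothetical; the point is only that the chart shows a fast estimate with an 
error margin, and the exact value is computed on demand.

```python
import math
import random
import statistics

def approx_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample; return (mean, 95% margin)."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, 1.96 * stderr

def exact_mean(values):
    """Full scan, done only when the user asks for the precise number."""
    return statistics.fmean(values)

# Hypothetical data: 200,000 response times in milliseconds.
data_rng = random.Random(42)
values = [data_rng.expovariate(1 / 200) for _ in range(200_000)]

estimate, margin = approx_mean(values, sample_size=5_000)
print(f"shown in chart: {estimate:.1f} ms +/- {margin:.1f} ms")
print(f"on click:       {exact_mean(values):.1f} ms")
```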

> On 18 Sep 2016, at 10:54, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Thanks everyone for ideas.
> 
> Sounds like Ignite has been taken by GridGain, so it becomes similar to 
> HazelCast: open source by name only. However, an in-memory Java cache may or 
> may not help.
> 
> The other options like faster databases are on the table depending on who 
> wants what (these are normally decisions that include more than technical 
> criteria). For example, if the customer already has Tableau, persuading them 
> to go for QlikView may not work.
> 
> So my view is to build the batch layer foundation and leave these finer 
> choices to the customer. We will offer Zeppelin with Parquet and ORC with a 
> certain refresh of these tables and let the customer decide. I stand 
> corrected otherwise.
> 
> BTW I did these simple tests using Zeppelin (running on Spark in standalone 
> mode)
> 
> 1) Read data using Spark SQL from Flume text files on HDFS (real time)
> 2) Read data using Spark SQL from ORC table in Hive (lagging by 15 min)
> 3) Read data using Spark SQL from Parquet table in Hive (lagging by 15 min)
> 
> Timings
> 
> 1)            2 min, 16 sec
> 2)            1 min, 1 sec 
> 3)            1 min, 6 sec
> 
> So unless one splits the atom, ORC and Parquet on Hive show similar 
> performance.
> 
> In all probability the customer has a data warehouse that uses Tableau, 
> QlikView or similar. Their BAs will carry on using these tools. If they have 
> data scientists, then they will either use R, which has a built-in UI, or 
> Spark SQL with Zeppelin. Also one can fire up Zeppelin on each node of Spark, 
> or even on the same node with a different port. Then of course one has to 
> think about adequate response times in a concurrent environment.
> 
> Cheers
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 18 September 2016 at 08:52, Sean Owen <so...@cloudera.com> wrote:
>> Alluxio isn't a database though; it's storage. I may be still harping
>> on the wrong solution for you, but as we discussed offline, that's
>> also what Impala, Drill et al are for.
>> 
>> Sorry if this was mentioned before but Ignite is what GridGain became,
>> if that helps.
>> 
>> On Sat, Sep 17, 2016 at 11:00 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>> > Thanks Todd
>> >
>> > As I thought, Apache Ignite is a data fabric much like Oracle Coherence
>> > cache or HazelCast.
>> >
>> > The use case is different between an in-memory database (IMDB) and a data
>> > fabric. The build that I am dealing with has a 'database centric' view of
>> > its data (i.e. it accesses its data using Spark SQL and JDBC), so an
>> > in-memory database will be a better fit. On the other hand, if the
>> > application deals solely with Java objects, does not have any notion of a
>> > 'database', does not need SQL-style queries and really just wants a
>> > distributed, high-performance object storage grid, then I think Ignite
>> > would likely be the preferred choice.
>> >
>> > So I will likely go, if needed, for an in-memory database like Alluxio. I
>> > have seen a rather debatable comparison between Spark and Ignite that
>> > looks like a one-sided rant.
>> >
>> > HTH
>
