Re: Persistence and OQL over cold data

Michael Stolz Fri, 19 Aug 2016 14:04:19 -0700

JB: But, by combining the Function Execution service with querying (on
PARTITIONED data) [2], you could target the nodes that would supposedly
hold the data of interest, and execute the queries there.
MS: In order to target the nodes that would supposedly hold the data of
interest you need to know the keys you are looking for. If you know the
keys why are you querying in the first place? Just do getAll(keys).
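
A minimal sketch of the routing argument above, as a toy model rather than Geode code. It assumes a hypothetical cluster of four nodes and a simple hash-based key-to-node mapping (Geode's real bucket assignment is more involved); the class and method names are made up for illustration. With keys in hand only the owning nodes are contacted; a predicate with no keys could match anywhere, so every node is a target.

```java
import java.util.*;

/** Toy model of key-based routing in a partitioned region (not Geode's implementation). */
public class PartitionRoutingSketch {
    // Hypothetical: number of data nodes hosting the partitioned region.
    static final int NODE_COUNT = 4;

    // Assumption: a key maps to a node via a plain hash.
    static int nodeFor(String key) {
        return Math.floorMod(key.hashCode(), NODE_COUNT);
    }

    // With known keys, only the nodes that own those keys need to be contacted.
    static Set<Integer> nodesForKeys(Collection<String> keys) {
        Set<Integer> nodes = new TreeSet<>();
        for (String key : keys) {
            nodes.add(nodeFor(key));
        }
        return nodes;
    }

    // Without keys, a predicate could match entries on any node, so all are targets.
    static Set<Integer> nodesForPredicateQuery() {
        Set<Integer> nodes = new TreeSet<>();
        for (int n = 0; n < NODE_COUNT; n++) {
            nodes.add(n);
        }
        return nodes;
    }

    public static void main(String[] args) {
        System.out.println("keyed:     " + nodesForKeys(Arrays.asList("order-17")));
        System.out.println("predicate: " + nodesForPredicateQuery());
    }
}
```

In Geode itself, this kind of key-based targeting is what a filter on a region function execution (FunctionService.onRegion(region).withFilter(keys)) relies on, which is why it presupposes that the keys are already known.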


JB: Additionally, assuming the Indexes were defined properly based on the
predicates in the queries (most often) used, the query would target the data
on disk matching the predicate and load only the data required
MS: Correct, and that's exactly how Geode works.

JB: (no data store, RDBMS or otherwise, especially disk-bound stores, should
have to load the entire table/Region/Map/whatever to access the data
matching the predicate; that's absurd, OOMEs galore).
MS: The trouble happens when you are NOT hitting your indices. If you do a
query that requires a full table scan, then every row in the database table
needs to be examined, and to examine it, it has to be in memory at least
briefly.
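
To make the index-versus-scan point concrete, here is a toy model (not Geode code) of a region whose values have all overflowed to disk while an index stays in memory. The Trade type, the field names, and the read counter are hypothetical, and a HashMap stands in for the overflow files. An indexed predicate reads only the matching values; a non-indexed predicate has to read every value in order to examine it.

```java
import java.util.*;
import java.util.function.Predicate;

/** Toy model of a region whose values live on disk (not Geode code). */
public class OverflowScanSketch {
    // Hypothetical record type for illustration.
    static final class Trade {
        final String symbol;
        final int quantity;
        Trade(String symbol, int quantity) { this.symbol = symbol; this.quantity = quantity; }
    }

    private final Map<Integer, Trade> disk = new HashMap<>();               // stands in for overflow files
    private final Map<String, List<Integer>> symbolIndex = new HashMap<>(); // index kept in memory
    int diskReads = 0;

    void put(int key, Trade trade) {
        disk.put(key, trade);
        symbolIndex.computeIfAbsent(trade.symbol, s -> new ArrayList<>()).add(key);
    }

    // Every value fetched from "disk" counts as one read.
    private Trade readFromDisk(int key) {
        diskReads++;
        return disk.get(key);
    }

    // Indexed predicate: the index yields the matching keys, so only those values are read.
    List<Trade> findBySymbol(String symbol) {
        List<Trade> out = new ArrayList<>();
        for (int key : symbolIndex.getOrDefault(symbol, Collections.emptyList())) {
            out.add(readFromDisk(key));
        }
        return out;
    }

    // Non-indexed predicate: every value must be read before it can be examined.
    List<Trade> scan(Predicate<Trade> predicate) {
        List<Trade> out = new ArrayList<>();
        for (int key : disk.keySet()) {
            Trade t = readFromDisk(key);
            if (predicate.test(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Builds a region of `size` trades alternating between two symbols.
    static OverflowScanSketch sample(int size) {
        OverflowScanSketch region = new OverflowScanSketch();
        for (int i = 0; i < size; i++) {
            region.put(i, new Trade(i % 2 == 0 ? "AAA" : "BBB", i));
        }
        return region;
    }

    public static void main(String[] args) {
        OverflowScanSketch region = sample(1000);
        region.findBySymbol("AAA");
        System.out.println("disk reads via index: " + region.diskReads);
        region.diskReads = 0;
        region.scan(t -> t.quantity > 990);
        System.out.println("disk reads via scan:  " + region.diskReads);
    }
}
```

The scan's read count grows with the size of the region regardless of how few rows match, which is the cost described above.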


--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Aug 19, 2016 at 4:39 PM, John Blum <jb...@pivotal.io> wrote:

> Hi All-
>
> DISCLAIMER: I am no expert in querying and index
> architecture/implementation; mostly a consumer.
>
> Perhaps *Anil* or *Jason* can shed more light on the subject, but for my
> own understanding/sanity, it would seem we could do better than this,
> meaning...
>
> I would think any UC partially depends on the organization of your data in
> the grid as well.  If you used a PARTITION data management policy [1],
> for instance, then, of course, your data would be distributed and
> partitioned across all the data nodes in the grid (cluster) holding the
> data (i.e. data nodes that have declared the same PARTITION Region).  It
> should then be possible to make this more optimal by having a redundancy
> level of 1 or more (depending on the frequency of transactions and data
> changes) to parallelize the data access.
>
> Having more nodes means not only better (or more optimal) organization,
> but also more memory.  Still, given a very large data set, clearly
> some of the data will need to OVERFLOW (to disk).
>
> But, by combining the Function Execution service with querying (on
> PARTITIONED data) [2], you could target the nodes that would supposedly
> hold the data of interest, and execute the queries there.
>
> Additionally, assuming the Indexes were defined properly based on the
> predicates in the queries (most often) used, the query would target the data
> on disk matching the predicate and load only the data required (no data
> store, RDBMS or otherwise, especially disk-bound stores, should have to
> load the entire table/Region/Map/whatever to access the data matching the
> predicate; that's absurd, OOMEs galore).
>
> TMK, Geode keeps Indexes in memory (even loads them on startup) and
> updates them (either sync/async depending on your configuration) as the
> data changes.  You would assume the data would not be changing in the
> OVERFLOW, disk-based data set.  If the data did change, then wouldn't you
> also assume that that data would then have to be in-memory (I think so).
>
> Please let me know if I am way off base here, but I would think Geode
> gives you enough options that particular UCs could be made, with nominal
> effort, more optimal.
>
> Additional references...
>
> * Query Partitioned Regions [3]
> * Working with Indexes [4], and then...
> * Tips and Guidelines on Using Indexes [5], but also important...
> * Using Indexes with Overflow Regions [6]
>
> Hope this helps.
>
> Cheers!
> -John
>
>
> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>
>
> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <magda7...@gmail.com> wrote:
>
>> Thanks, now I see.
>>
>> This works the same way as in Ignite then. If you set up an eviction
>> policy in Ignite, the data may be evicted to swap at some point, and
>> if a query is executed right after that, it may swap the data back into
>> memory. However, the indexes must always be in memory.
>>
>> --
>> Denis
>>
>>
>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <mst...@pivotal.io>
>> wrote:
>>
>>> There is a notion of data aging out in Geode. We call it overflow to
>>> disk.
>>>
>>> The idea is that as data gets old you can have the records in memory
>>> expire, and that expiry can be to disk. That's the cold data.
>>>
>>> You may have built an index while you were initially loading the data,
>>> and if your predicates only hit the indexes you will still get really fast
>>> queries if the result sets aren't large.
>>>
>>> If, however, you ever resort to hitting the disk-based data for a query
>>> it is going to have to read every record that isn't in memory from disk
>>> which is going to be extremely slow. I personally would never use Geode
>>> that way.
>>>
>>>
>>> --
>>> Mike Stolz
>>> Principal Engineer, GemFire Product Manager
>>> Mobile: 631-835-4771
>>>
>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <magda7...@gmail.com>
>>> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>
>>>> I just thought that you were able to do something with indexes in
>>>> such a way that there is no need to preload everything from disk into memory
>>>> when a query is executed over cold data.
>>>>
>>>> Then what does "execution over cold data" mean? I'm referring to the
>>>> following sentence from the main page:
>>>>
>>>> *Object Query Language allows distributed query execution on hot and
>>>> cold data, with SQL-like capabilities, including joins.*
>>>>
>>>> --
>>>> Denis
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <mst...@pivotal.io>
>>>> wrote:
>>>>
>>>>> Here's the thing...
>>>>>
>>>>> On any In-memory data grid, if you run a query before the data has
>>>>> been loaded into memory, it is going to cause the exact same amount of
>>>>> disk i/o to do the query as it will take to load everything into memory.
>>>>>
>>>>> And the system will still have to go ahead and load everything into
>>>>> memory anyway so you're going to end up doing all that disk i/o TWICE.
>>>>>
>>>>> Geode DOES have a nice feature for key based access though. We
>>>>> actually store the keys in a separate file from the data and we can load
>>>>> that file very quickly. Then if you go after the data for one of those
>>>>> keys we can lazily load it from disk on demand if it hasn't yet been
>>>>> loaded into memory.
>>>>>
>>>>> The Lucene integration work that is going on in Geode might also make
>>>>> it possible to load the indexes first and lazily load the data based on
>>>>> queries against the indexes.
>>>>>
>>>>>
>>>>> --
>>>>> Mike Stolz
>>>>> Principal Engineer, GemFire Product Manager
>>>>> Mobile: 631-835-4771
>>>>>
>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <magda7...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Geode community,
>>>>>>
>>>>>> I've been investigating the possibilities of Geode Persistence for a
>>>>>> while and still can't tell whether I need to have all my data in
>>>>>> memory if I want to execute OQL queries, or whether the OQL engine works
>>>>>> over persisted data as well.
>>>>>>
>>>>>> My use case is the following. During cluster startup I don't want
>>>>>> to wait until all the data has been pre-loaded from persistence into
>>>>>> RAM; I want to execute OQL queries right away. Is this feasible
>>>>>> with Geode? Please point me to links where I can read more about
>>>>>> this.
>>>>>>
>>>>>> Regards,
>>>>>> Denis
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Good luck,
>>>> Denis Magda
>>>>
>>>
>>>
>>
>>
>> --
>> Good luck,
>> Denis Magda
>>
>
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>
