Re: Persistence and OQL over cold data

Michael Stolz Fri, 19 Aug 2016 16:17:24 -0700

Unfortunately the indexes are not stored. They need to be rebuilt on
restart. For that reason, on start up, the whole diskstore needs to be read.
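
To make the point above concrete, here is a toy model in plain Java, not Geode code. It assumes the indexed field lives inside the stored value, so rebuilding the index means reading every record. The names `IndexRebuildSketch`, `DiskStore`, `Order` and `status` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model: an in-memory index that has to be rebuilt from a disk store on restart. */
final class IndexRebuildSketch {

    /** Hypothetical persisted value; "status" is the field we want indexed. */
    static final class Order {
        final String status;
        final double amount;
        Order(String status, double amount) {
            this.status = status;
            this.amount = amount;
        }
    }

    /** Stand-in for a disk store that counts how many records were read. */
    static final class DiskStore {
        private final Map<Integer, Order> records = new LinkedHashMap<>();
        int reads = 0;

        void put(int key, Order order) { records.put(key, order); }
        int size() { return records.size(); }
        Iterable<Integer> keys() { return records.keySet(); }

        Order read(int key) {
            reads++; // each call models one record read from disk
            return records.get(key);
        }
    }

    /** Rebuilds a status -> keys index; every record is read because the field is inside the value. */
    static Map<String, List<Integer>> rebuildIndex(DiskStore store) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int key : store.keys()) {
            Order order = store.read(key);
            index.computeIfAbsent(order.status, s -> new ArrayList<>()).add(key);
        }
        return index;
    }

    public static void main(String[] args) {
        DiskStore store = new DiskStore();
        store.put(1, new Order("OPEN", 10.0));
        store.put(2, new Order("CLOSED", 20.0));
        store.put(3, new Order("OPEN", 30.0));

        Map<String, List<Integer>> index = rebuildIndex(store);
        System.out.println("records read during rebuild: " + store.reads + " of " + store.size());
        System.out.println("keys with status OPEN: " + index.get("OPEN"));
    }
}
```

In this model the number of reads always equals the number of stored records, which is the cost described above.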


--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Aug 19, 2016 at 5:30 PM, John Blum <jb...@pivotal.io> wrote:

> *Jason, Mike*: first, thank you.
>
> > *In order to target the nodes that would supposedly hold the data of
> interest you need to know the keys you are looking for. If you know the
> keys why are you querying in the first place? Just do getAll(keys).*
>
> Two reasons...
>
> 1. I want to apply some "additional filtering" that can only be handled
> elegantly by an OQL query predicate after a subset of the data has been
> identified/targeted (using keys).  I have an example of this somewhere (doh)
> after working with a customer on this exact UC.
>
> 2. I don't want the entire object (i.e. row); I only need a specific
> "projection" of the (object) data.  This is particularly important if I
> have a very large and complex object graph and I am streaming data across
> the wire (client/server).
>
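The two reasons above can be sketched in plain Java, with a `Map` standing in for the Region. This is an illustration, not Geode API usage: the `Customer` type, its fields and the helper names are hypothetical. The first helper models a key lookup that returns whole objects; the second applies the extra predicate and returns only one field, which is roughly what an OQL query such as `SELECT c.name FROM /Customers c WHERE c.age >= 30` (hypothetical region name) would express.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: key targeting first, then a predicate and a projection. */
final class FilterAndProjectSketch {

    /** Hypothetical value type with a large payload the caller does not need. */
    static final class Customer {
        final String name;
        final int age;
        final byte[] largeGraph;
        Customer(String name, int age, byte[] largeGraph) {
            this.name = name;
            this.age = age;
            this.largeGraph = largeGraph;
        }
    }

    /** Narrows by key and returns whole objects, like a getAll(keys) call would. */
    static Map<Integer, Customer> getAll(Map<Integer, Customer> region, Collection<Integer> keys) {
        Map<Integer, Customer> result = new HashMap<>();
        for (Integer key : keys) {
            Customer customer = region.get(key);
            if (customer != null) {
                result.put(key, customer);
            }
        }
        return result;
    }

    /** Applies the additional predicate and projects only the name, sorted for stable output. */
    static List<String> namesOfAtLeast(Map<Integer, Customer> subset, int minAge) {
        List<String> names = new ArrayList<>();
        for (Customer customer : subset.values()) {
            if (customer.age >= minAge) {
                names.add(customer.name);
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        Map<Integer, Customer> region = new HashMap<>();
        region.put(1, new Customer("Ann", 41, new byte[1024]));
        region.put(2, new Customer("Bob", 25, new byte[1024]));
        region.put(3, new Customer("Cy", 37, new byte[1024]));

        Map<Integer, Customer> subset = getAll(region, Arrays.asList(1, 2, 3));
        System.out.println(namesOfAtLeast(subset, 30));
    }
}
```

When the filter and projection run where the data lives, only the names cross the wire; with a plain key lookup the caller receives every `largeGraph` payload and discards most of it.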
>
> > *The trouble happens when you are NOT hitting your indices. *
>
> Yes, good point.
>
> > *If you do a query that requires a full table scan, then every row in
> the database table needs to be examined, and to examine it, it has to be in
> memory at least briefly.*
>
> Of course.
>
> *Denis*-
>
> > *The disk entries that are mentioned by John were located in memory
> before and were overflowed to disk at some point in time. It means that if
> you start your cluster from scratch and want to run OQL queries over the
> indexed data then you have to preload all the data from persistence.*
>
> I don't specifically recall how much persistent data Geode reloads on
> restart (Geode is a shared-nothing architecture though, so each data node
> has its own persistence; additionally, primaries must come online before
> secondaries are accessible).  The question is how much data gets reloaded
> on restart.  It would seem silly to reload everything if the disk store
> contained more data than would fit in memory, knowing some of the data
> would OVERFLOW again during preload.  Geode will reload the Index
> though, which is stored as well.
>
> I'll let the experts answer this one.
>
>
> On Fri, Aug 19, 2016 at 2:04 PM, Denis Magda <magda7...@gmail.com> wrote:
>
>> Hi John, Jason,
>>
>> To expand on this:
>>
>>
>> *If an index can be used, the index look up is executed and entries added
>> to the result set.  If any of the entries that match the predicates is
>> actually on disk, those values will need to be loaded to memory before
>> being returned as a result.*
>>
>> The disk entries that are mentioned by John were located in memory before
>> and were overflowed to disk at some point in time. It means that if you
>> start your cluster from scratch and want to run OQL queries over the
>> indexed data then you have to preload all the data from persistence.
>> Yes, some of the data may be overflowed back to disk during the preloading,
>> but you'll have your indexes in a valid state.
>>
>> Correct me if I'm still missing something.
>>
>> --
>> Denis
>>
>>
>> On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <jhu...@pivotal.io> wrote:
>>
>>> Hi John,
>>>
>>> I think you were referring to Mike's explanation of:
>>> "If, however, you ever resort to hitting the disk-based data for a query
>>> it is going to have to read every record that isn't in memory from disk
>>> which is going to be extremely slow. I personally would never use Geode
>>> that way."
>>>
>>> When stating:
>>> "Additionally, assuming the Indexes were defined properly based on the
>>> predicates in the queries (most often) used, that it would target the data
>>> on disk matching the predicate and load only the data required (no data
>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>> load the entire table/Region/Map/whatever to access the data matching the
>>> predicate; that's absurd, OOMEs galore)."
>>>
>>> Let me try to clear things up slightly...hopefully not causing more
>>> confusion...
>>> If an index can be used, the index look up is executed and entries added
>>> to the result set.  If any of the entries that match the predicates is
>>> actually on disk, those values will need to be loaded to memory before
>>> being returned as a result.
>>> I think what Mike was saying was that if an index is not used, then the
>>> query itself would execute across the entire region, which means loading
>>> every entry into memory.  We would need to inspect each entry to see if
>>> it fulfills the criteria.
>>>
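The difference described above can be illustrated with a toy overflow region in plain Java. It is a simplified model, not Geode's implementation: values are plain integers (an `age`), the index is an in-memory map from age to keys, and a counter records how many values had to be fetched from the "disk" map. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model: disk loads for an indexed query versus a full scan over an overflow region. */
final class OverflowQuerySketch {

    static final class OverflowRegion {
        private final Map<Integer, Integer> inMemory = new TreeMap<>(); // key -> age
        private final Map<Integer, Integer> onDisk = new TreeMap<>();   // overflowed key -> age
        private final Map<Integer, List<Integer>> ageIndex = new HashMap<>(); // age -> keys
        int diskLoads = 0;

        /** Stores an entry in memory or on disk; the index covers both. */
        void put(int key, int age, boolean overflowed) {
            if (overflowed) {
                onDisk.put(key, age);
            } else {
                inMemory.put(key, age);
            }
            ageIndex.computeIfAbsent(age, a -> new ArrayList<>()).add(key);
        }

        /** Returns the value, counting a disk load when it is not in memory. */
        private int load(int key) {
            Integer age = inMemory.get(key);
            if (age == null) {
                diskLoads++;
                age = onDisk.get(key);
            }
            return age;
        }

        /** Index lookup: only values of matching entries are loaded. */
        List<Integer> queryWithIndex(int age) {
            List<Integer> matches = new ArrayList<>();
            List<Integer> keys = ageIndex.get(age);
            if (keys != null) {
                for (int key : keys) {
                    load(key); // a matching value must be in memory before it is returned
                    matches.add(key);
                }
            }
            return matches;
        }

        /** No index: every entry is loaded and inspected. */
        List<Integer> queryWithoutIndex(int age) {
            List<Integer> allKeys = new ArrayList<>(inMemory.keySet());
            allKeys.addAll(onDisk.keySet());
            List<Integer> matches = new ArrayList<>();
            for (int key : allKeys) {
                if (load(key) == age) {
                    matches.add(key);
                }
            }
            return matches;
        }
    }

    public static void main(String[] args) {
        OverflowRegion region = new OverflowRegion();
        region.put(1, 30, false);
        region.put(2, 40, false);
        region.put(3, 30, true);
        region.put(4, 50, true);
        region.put(5, 60, true);

        System.out.println(region.queryWithIndex(30) + " disk loads: " + region.diskLoads);
        region.diskLoads = 0;
        System.out.println(region.queryWithoutIndex(30) + " disk loads: " + region.diskLoads);
    }
}
```

In this model the indexed query touches disk once per matching overflowed entry, while the scan touches disk once per overflowed entry regardless of the result size.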
>>> -Jason
>>>
>>>
>>>
>>> On Fri, Aug 19, 2016 at 1:39 PM John Blum <jb...@pivotal.io> wrote:
>>>
>>>> Hi All-
>>>>
>>>> DISCLAIMER: I am no expert in querying and index
>>>> architecture/implementation; mostly a consumer.
>>>>
>>>> Perhaps *Anil* or *Jason* can shed more light on the subject, but for
>>>> my own understanding/sanity, it would seem we could do better than this,
>>>> meaning...
>>>>
>>>> I would think any UC partially depends on the organization of your data
>>>> in the grid as well.  If you used a PARTITION data management policy
>>>> [1], for instance, then, of course, your data would be distributed and
>>>> partitioned across all the data nodes in the grid (cluster) holding the
>>>> data (i.e. data nodes that have declared the same PARTITION Region).
>>>> It should then be possible to make this more optimal by having a
>>>> redundancy level of 1 or more (depending on the frequency of
>>>> transactions and data changes) to parallelize the data access.
>>>>
>>>> Not only does having more nodes mean better (or more optimal)
>>>> organization, but more memory.  Still, given a very large data set, clearly
>>>> some of the data will need to OVERFLOW (to disk).
>>>>
>>>> But, by combining the Function Execution service with querying (on
>>>> PARTITIONED data) [2], you could target the nodes that would
>>>> supposedly hold the data of interest, and execute the queries there.
>>>>
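As a rough sketch of the targeting idea above: given a set of keys, work out which nodes hold them and run the query only there. The plain-Java model below assumes simple hash-modulo placement, which is a stand-in chosen for illustration and not Geode's actual bucket assignment; `KeyRoutingSketch`, `nodeFor` and `targetNodes` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Toy model: which nodes need to run a query for a given key filter. */
final class KeyRoutingSketch {

    /** Assumed placement rule: hash of the key modulo the number of nodes. */
    static int nodeFor(Object key, int nodeCount) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    /** Returns the distinct nodes that hold at least one of the filter keys. */
    static Set<Integer> targetNodes(Collection<?> keyFilter, int nodeCount) {
        Set<Integer> nodes = new TreeSet<>();
        for (Object key : keyFilter) {
            nodes.add(nodeFor(key, nodeCount));
        }
        return nodes;
    }

    public static void main(String[] args) {
        int nodeCount = 4;
        // Keys that land on the same node need only one remote execution.
        System.out.println(targetNodes(Arrays.asList(1, 5, 9), nodeCount));
        // Keys spread over two nodes need two executions; the other nodes stay idle.
        System.out.println(targetNodes(Arrays.asList(1, 2), nodeCount));
    }
}
```

The smaller the target set, the fewer nodes have to touch their memory or disk for the query.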
>>>> Additionally, assuming the Indexes were defined properly based on the
>>>> predicates in the queries (most often) used, that it would target the data
>>>> on disk matching the predicate and load only the data required (no data
>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>> predicate; that's absurd, OOMEs galore).
>>>>
>>>> TMK, Geode keeps Indexes in memory (even loads them on startup) and
>>>> updates them (either sync/async depending on your configuration) as the
>>>> data changes.  You would assume the data would not be changing in the
>>>> OVERFLOW, disk-based data set.  If the data did change, then wouldn't you
>>>> also assume that the data would then have to be in memory?  (I think so.)
>>>>
>>>> Please let me know if I am way off base here, but I would think Geode
>>>> gives you enough options that particular UCs could be made, with nominal
>>>> effort, more optimal.
>>>>
>>>> Additional references...
>>>>
>>>> * Query Partitioned Regions [3]
>>>> * Working with Indexes [4], and then...
>>>> * Tips and Guidelines on Using Indexes [5], but also important...
>>>> * Using Indexes with Overflow Regions [6]
>>>>
>>>> Hope this helps.
>>>>
>>>> Cheers!
>>>> -John
>>>>
>>>>
>>>> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
>>>> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
>>>> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
>>>> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
>>>> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
>>>> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <magda7...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks, now I see.
>>>>>
>>>>> This works the same way as in Ignite then. If you set up an eviction
>>>>> policy in Ignite the data may be evicted to swap at some point in time,
>>>>> and if a query is executed right after that, it may swap the data back
>>>>> into memory. However, the indexes must always be in memory.
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <mst...@pivotal.io>
>>>>> wrote:
>>>>>
>>>>>> There is a notion of data aging out in Geode. We call it overflow to
>>>>>> disk.
>>>>>>
>>>>>> The idea is that as data gets old you can have the records in memory
>>>>>> expire, and that expiry can be to disk. That's the cold data.
>>>>>>
>>>>>> You may have built an index while you were initially loading the
>>>>>> data, and if your predicates only hit the indexes you will still get
>>>>>> really fast queries if the result sets aren't large.
>>>>>>
>>>>>> If, however, you ever resort to hitting the disk-based data for a
>>>>>> query it is going to have to read every record that isn't in memory from
>>>>>> disk which is going to be extremely slow. I personally would never use
>>>>>> Geode that way.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Mike Stolz
>>>>>> Principal Engineer, GemFire Product Manager
>>>>>> Mobile: 631-835-4771
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <magda7...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>>>>
>>>>>>> I just thought that you were able to do something with indexes in
>>>>>>> such a way that there is no need to preload everything from disk into
>>>>>>> memory when a query is executed over cold data.
>>>>>>>
>>>>>>> Then what does "execution over cold data" mean? I'm referring to the
>>>>>>> following sentence from the main page:
>>>>>>>
>>>>>>> *Object Query Language allows distributed query execution on hot and
>>>>>>> cold data, with SQL-like capabilities, including joins.*
>>>>>>>
>>>>>>> --
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <mst...@pivotal.io>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here's the thing...
>>>>>>>>
>>>>>>>> On any In-memory data grid, if you run a query before the data has
>>>>>>>> been loaded into memory, it is going to cause the exact same amount of
>>>>>>>> disk i/o to do the query as it will take to load everything into memory.
>>>>>>>>
>>>>>>>> And the system will still have to go ahead and load everything into
>>>>>>>> memory anyway so you're going to end up doing all that disk i/o TWICE.
>>>>>>>>
>>>>>>>> Geode DOES have a nice feature for key based access though. We
>>>>>>>> actually store the keys in a separate file from the data and we can
>>>>>>>> load that file very quickly. Then if you go after the data for one of
>>>>>>>> those keys we can lazily load it from disk on demand if it hasn't yet
>>>>>>>> been loaded into memory.
>>>>>>>>
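The key-based lazy loading described above might be modeled like this in plain Java. It is a sketch under stated assumptions, not Geode's disk store format: a `Map` stands in for the value file on disk, the key set is copied eagerly at construction to model the quickly loaded key file, and `LazyValueLoadSketch` / `LazyRegion` are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model: keys are available right after startup, values are fetched on first access. */
final class LazyValueLoadSketch {

    static final class LazyRegion {
        private final Map<String, String> valueFile;              // stand-in for values on disk
        private final Set<String> keys;                           // loaded eagerly at startup
        private final Map<String, String> memory = new HashMap<>(); // values loaded so far
        int diskLoads = 0;

        LazyRegion(Map<String, String> valueFile) {
            this.valueFile = valueFile;
            this.keys = new HashSet<>(valueFile.keySet());
        }

        /** Answers from the key set alone; no value is read from disk. */
        boolean containsKey(String key) {
            return keys.contains(key);
        }

        /** Loads the value from disk on first access and serves it from memory afterwards. */
        String get(String key) {
            if (!keys.contains(key)) {
                return null;
            }
            String value = memory.get(key);
            if (value == null) {
                diskLoads++;
                value = valueFile.get(key);
                memory.put(key, value);
            }
            return value;
        }
    }

    public static void main(String[] args) {
        Map<String, String> disk = new HashMap<>();
        disk.put("a", "alpha");
        disk.put("b", "beta");

        LazyRegion region = new LazyRegion(disk);
        System.out.println(region.containsKey("a") + ", disk loads: " + region.diskLoads);
        System.out.println(region.get("a") + ", disk loads: " + region.diskLoads);
        System.out.println(region.get("a") + ", disk loads: " + region.diskLoads);
    }
}
```

Key lookups never touch the value file in this model, and each value costs one disk load the first time it is requested.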
>>>>>>>> The Lucene integration work that is going on in Geode might also
>>>>>>>> make it possible to load the indexes first and lazily load the data
>>>>>>>> based on queries against the indexes.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Stolz
>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>> Mobile: 631-835-4771
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <magda7...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello Geode community,
>>>>>>>>>
>>>>>>>>> I've been investigating the possibilities of Geode Persistence for a
>>>>>>>>> while and still can't get it clear whether I need to have all my data
>>>>>>>>> in memory if I want to execute OQL queries, or whether the OQL engine
>>>>>>>>> works over persisted data as well.
>>>>>>>>>
>>>>>>>>> My use case is the following. During cluster startup I don't want to
>>>>>>>>> wait until all the data has been preloaded from persistence into RAM;
>>>>>>>>> I want to execute OQL queries right away. Is this feasible with
>>>>>>>>> Geode? Please point me to links where I can read more about this.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Good luck,
>>>>>>> Denis Magda
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Good luck,
>>>>> Denis Magda
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -John
>>>> 503-504-8657
>>>> john.blum10101 (skype)
>>>>
>>>
>>
>>
>> --
>> Good luck,
>> Denis Magda
>>
>
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>
