Re: Persistence and OQL over cold data

Alan Kash Wed, 24 Aug 2016 07:23:33 -0700

Will creating another region for the indexes make them persistent?


We should capture this information in the documentation.

1. Local / Distributed Read
2. Local / Distributed Write
3. Local / Distributed Indexing
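
As a rough illustration of the read and indexing behaviour described in the
thread below, here is a toy Python model. It is not Geode code: the class
ToyOverflowRegion and all of its methods are hypothetical, and it only counts
simulated disk reads. It assumes, as stated in the thread, that index entries
stay in memory while values may overflow to disk, and that indexes are not
persisted and so must be rebuilt on restart by reading every stored value.

```python
from collections import defaultdict


class ToyOverflowRegion:
    """Toy model of a region with overflow. Hypothetical, not the Geode API."""

    def __init__(self, index_field):
        self.index_field = index_field
        self.memory = {}               # key -> value currently held in memory
        self.disk = {}                 # key -> value overflowed to "disk"
        self.index = defaultdict(set)  # indexed field value -> keys (in memory)
        self.disk_reads = 0            # number of simulated disk reads

    def put(self, key, value):
        self.memory[key] = value
        self.index[value[self.index_field]].add(key)

    def overflow(self, key):
        # The value moves to disk; its index entry stays in memory.
        self.disk[key] = self.memory.pop(key)

    def _value(self, key):
        if key in self.memory:
            return self.memory[key]
        self.disk_reads += 1           # value has to be read from disk
        return self.disk[key]

    def query_with_index(self, wanted):
        # Only the entries the index points at are read.
        return [self._value(k) for k in sorted(self.index.get(wanted, ()))]

    def query_full_scan(self, predicate):
        # Without an index, every entry has to be inspected.
        keys = sorted(set(self.memory) | set(self.disk))
        return [v for v in map(self._value, keys) if predicate(v)]

    def rebuild_index_after_restart(self):
        # Assumption from the thread: the index is not stored, so a restart
        # has to read every persisted value to rebuild it.
        self.disk.update(self.memory)
        self.memory.clear()
        self.index.clear()
        for key in list(self.disk):
            value = self._value(key)
            self.index[value[self.index_field]].add(key)


region = ToyOverflowRegion(index_field="status")
for i in range(10):
    region.put(i, {"id": i, "status": "open" if i < 2 else "closed"})
for i in range(10):
    region.overflow(i)

hits = region.query_with_index("open")
indexed_reads = region.disk_reads

region.disk_reads = 0
scan_hits = region.query_full_scan(lambda v: v["status"] == "open")
scan_reads = region.disk_reads

region.disk_reads = 0
region.rebuild_index_after_restart()
rebuild_reads = region.disk_reads
```

With ten overflowed entries of which two match, the indexed query reads two
values from the toy disk, while the full scan and the index rebuild each read
all ten.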

Thanks


On Fri, Aug 19, 2016 at 7:47 PM, John Blum <jb...@pivotal.io> wrote:

> My apologies for confusing Index storage with Geode; thought I heard this
> somewhere in the context of GemFire/Geode before.  No doubt confused this
> with other data stores I work with.  (So) much to learn yet.
>
> On Fri, Aug 19, 2016 at 4:16 PM, Michael Stolz <mst...@pivotal.io> wrote:
>
>> Unfortunately the indexes are not stored. They need to be rebuilt on
>> restart. For that reason, on startup, the whole disk store needs to be read.
>>
>> --
>> Mike Stolz
>> Principal Engineer, GemFire Product Manager
>> Mobile: 631-835-4771
>>
>> On Fri, Aug 19, 2016 at 5:30 PM, John Blum <jb...@pivotal.io> wrote:
>>
>>> *Jason, Mike*: first, thank you.
>>>
>>> > *In order to target the nodes that would supposedly hold the data of
>>> interest you need to know the keys you are looking for. If you know the
>>> keys why are you querying in the first place? Just do getAll(keys).*
>>>
>>> Two reasons...
>>>
>>> 1. I want to apply some "additional filtering" that can only be handled
>>> elegantly by an OQL query predicate after a subset of the data has been
>>> identified/targeted (using keys).  I have an example of this somewhere (doh)
>>> after working with a customer on this exact use case.
>>>
>>> 2. I don't want the entire object (i.e. row); I only need a specific
>>> "projection" of the (object) data.  This is particularly important if I
>>> have a very large and complex object graph and I am streaming data across
>>> the wire (client/server).
>>>
>>>
>>> > *The trouble happens when you are NOT hitting your indices. *
>>>
>>> Yes, good point.
>>>
>>> > *If you do a query that requires a full table scan, then every row in
>>> the database table needs to be examined, and to examine it, it has to be in
>>> memory at least briefly.*
>>>
>>> Of course.
>>>
>>> *Denis*-
>>>
>>> > *The disk entries that are mentioned by John were located in memory
>>> before and were overflowed to disk at some point in time. It means that if
>>> you start your cluster from scratch and want to run OQL queries over the
>>> indexed data then you have to preload all the data from persistence.*
>>>
>>> I don't specifically recall how much persistent data Geode reloads on
>>> restart (Geode is a shared-nothing architecture though, so each data node
>>> has its own persistence; additionally, primaries must come online before
>>> secondaries are accessible).  The question is how much data gets reloaded
>>> on restart.  It would seem silly if the disk store contained more data than
>>> would fit in memory and Geode reloaded everything, knowing some of the data
>>> would be overflowed again during preload because it would not all fit.
>>> Geode will reload the Index though, which is stored as well.
>>>
>>> I'll let the experts answer this one.
>>>
>>>
>>> On Fri, Aug 19, 2016 at 2:04 PM, Denis Magda <magda7...@gmail.com>
>>> wrote:
>>>
>>>> Hi John, Jason,
>>>>
>>>> To expand on this:
>>>>
>>>>
>>>> *If an index can be used, the index look up is executed and entries
>>>> added to the result set.  If any of the entries that match the predicates
>>>> is actually on disk, those values will need to be loaded to memory before
>>>> being returned as a result.*
>>>>
>>>> The disk entries that are mentioned by John were located in memory
>>>> before and were overflowed to disk at some point in time. It means that if
>>>> you start your cluster from scratch and want to run OQL queries over the
>>>> indexed data then you have to preload all the data from persistence.
>>>> Yes, some of the data may be overflowed back to disk during the preloading
>>>> but you'll have your indexes in a valid state.
>>>>
>>>> Correct me if I'm still missing something.
>>>>
>>>> --
>>>> Denis
>>>>
>>>>
>>>> On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <jhu...@pivotal.io> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> I think you were referring to Mike's explanation of:
>>>>> "If, however, you ever resort to hitting the disk-based data for a
>>>>> query it is going to have to read every record that isn't in memory from
>>>>> disk which is going to be extremely slow. I personally would never use
>>>>> Geode that way."
>>>>>
>>>>> When stating:
>>>>> "Additionally, assuming the Indexes were defined properly based on the
>>>>> predicates in the queries (most often) used, that it would target the data
>>>>> on disk matching the predicate and load only the data required (no data
>>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>>> predicate; that's absurd, OOMEs galore)."
>>>>>
>>>>> Let me try to clear things up slightly...hopefully not causing more
>>>>> confusion...
>>>>> If an index can be used, the index look up is executed and entries
>>>>> added to the result set.  If any of the entries that match the predicates
>>>>> is actually on disk, those values will need to be loaded to memory before
>>>>> being returned as a result.
>>>>> I think what Mike was saying was that if an index is not used, then
>>>>> the query itself would execute across the entire region, which means
>>>>> loading every entry into memory.  We would need to inspect each entry to
>>>>> see whether it fulfills the criteria.
>>>>>
>>>>> -Jason
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 19, 2016 at 1:39 PM John Blum <jb...@pivotal.io> wrote:
>>>>>
>>>>>> Hi All-
>>>>>>
>>>>>> DISCLAIMER: I am no expert in querying and index
>>>>>> architecture/implementation; mostly a consumer.
>>>>>>
>>>>>> Perhaps *Anil* or *Jason* can shed more light on the subject, but
>>>>>> for my own understanding/sanity, it would seem we could do better than
>>>>>> this, meaning...
>>>>>>
>>>>>> I would think any use case partially depends on the organization of
>>>>>> your data in the grid as well.  If you used a PARTITION data management
>>>>>> policy [1], for instance, then, of course, your data would be distributed
>>>>>> and partitioned across all the data nodes in the grid (cluster) holding
>>>>>> the data (i.e. data nodes that have declared the same PARTITION
>>>>>> Region).  It should then be possible to make this more optimal by having
>>>>>> a redundancy level of 1 or more (depending on the frequency of
>>>>>> transactions and data changes) to parallelize the data access.
>>>>>>
>>>>>> Not only does having more nodes mean better (or more optimal)
>>>>>> organization, but more memory.  Still, given a very large data set,
>>>>>> clearly some of the data will need to OVERFLOW (to disk).
>>>>>>
>>>>>> But, by combining the Function Execution service with querying (on
>>>>>> PARTITIONED data) [2], you could target the nodes that would
>>>>>> supposedly hold the data of interest, and execute the queries there.
>>>>>>
>>>>>> Additionally, assuming the Indexes were defined properly based on the
>>>>>> predicates in the queries (most often) used, that it would target the
>>>>>> data on disk matching the predicate and load only the data required (no
>>>>>> data store, RDBMS or otherwise, especially disk-bound stores, should have
>>>>>> to load the entire table/Region/Map/whatever to access the data matching
>>>>>> the predicate; that's absurd, OOMEs galore).
>>>>>>
>>>>>> To my knowledge, Geode keeps Indexes in memory (even loads them on
>>>>>> startup) and updates them (either sync/async depending on your
>>>>>> configuration) as the data changes.  You would assume the data would not
>>>>>> be changing in the OVERFLOW, disk-based data set.  If the data did
>>>>>> change, then wouldn't you also assume that the data would then have to be
>>>>>> in memory?  (I think so.)
>>>>>>
>>>>>> Please let me know if I am way off base here, but I would think Geode
>>>>>> gives you enough options that particular use cases could be made, with
>>>>>> nominal effort, more optimal.
>>>>>>
>>>>>> Additional references...
>>>>>>
>>>>>> * Query Partitioned Regions [3]
>>>>>> * Working with Indexes [4], and then...
>>>>>> * Tips and Guidelines on Using Indexes [5], but also important...
>>>>>> * Using Indexes with Overflow Regions [6]
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Cheers!
>>>>>> -John
>>>>>>
>>>>>>
>>>>>> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
>>>>>> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
>>>>>> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
>>>>>> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
>>>>>> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
>>>>>> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <magda7...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, now I see.
>>>>>>>
>>>>>>> This works the same way as in Ignite then. If you set up an eviction
>>>>>>> policy in Ignite, the data may be evicted to swap at some point in time,
>>>>>>> and if a query is executed right after that, it may swap the data back
>>>>>>> into memory. However, the indexes must always be in memory.
>>>>>>>
>>>>>>> --
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <mst...@pivotal.io>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> There is a notion of data aging out in Geode. We call it overflow
>>>>>>>> to disk.
>>>>>>>>
>>>>>>>> The idea is that as data gets old you can have the records in
>>>>>>>> memory expire, and that expiry can be to disk. That's the cold data.
>>>>>>>>
>>>>>>>> You may have built an index while you were initially loading the
>>>>>>>> data, and if your predicates only hit the indexes you will still get
>>>>>>>> really fast queries if the result sets aren't large.
>>>>>>>>
>>>>>>>> If, however, you ever resort to hitting the disk-based data for a
>>>>>>>> query it is going to have to read every record that isn't in memory
>>>>>>>> from disk which is going to be extremely slow. I personally would never
>>>>>>>> use Geode that way.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Stolz
>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>> Mobile: 631-835-4771
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <magda7...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Mike,
>>>>>>>>>
>>>>>>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>>>>>>
>>>>>>>>> I just thought that you were able to do something with indexes in
>>>>>>>>> such a way that there is no need to preload everything from disk into
>>>>>>>>> memory when a query is executed over cold data.
>>>>>>>>>
>>>>>>>>> Then what does "execution over cold data" mean? I'm referring to
>>>>>>>>> the following sentence from the main page:
>>>>>>>>>
>>>>>>>>> *Object Query Language allows distributed query execution on hot
>>>>>>>>> and cold data, with SQL-like capabilities, including joins.*
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <mst...@pivotal.io> wrote:
>>>>>>>>>
>>>>>>>>>> Here's the thing...
>>>>>>>>>>
>>>>>>>>>> On any in-memory data grid, if you run a query before the data
>>>>>>>>>> has been loaded into memory, it is going to cause the exact same
>>>>>>>>>> amount of disk I/O to do the query as it will take to load everything
>>>>>>>>>> into memory.
>>>>>>>>>>
>>>>>>>>>> And the system will still have to go ahead and load everything
>>>>>>>>>> into memory anyway so you're going to end up doing all that disk I/O
>>>>>>>>>> TWICE.
>>>>>>>>>>
>>>>>>>>>> Geode DOES have a nice feature for key based access though. We
>>>>>>>>>> actually store the keys in a separate file from the data and we can
>>>>>>>>>> load that file very quickly. Then if you go after the data for one of
>>>>>>>>>> those keys we can lazily load it from disk on demand if it hasn't yet
>>>>>>>>>> been loaded into memory.
>>>>>>>>>>
>>>>>>>>>> The Lucene integration work that is going on in Geode might also
>>>>>>>>>> make it possible to load the indexes first and lazily load the data
>>>>>>>>>> based on queries against the indexes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Mike Stolz
>>>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>>>> Mobile: 631-835-4771
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <magda7...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello Geode community,
>>>>>>>>>>>
>>>>>>>>>>> I've been investigating the possibilities of Geode Persistence for
>>>>>>>>>>> a while and still can't work out whether I need to have all my data
>>>>>>>>>>> in memory if I want to execute OQL queries, or whether the OQL
>>>>>>>>>>> engine works over persisted data as well.
>>>>>>>>>>>
>>>>>>>>>>> My use case is the following. During cluster startup I don't
>>>>>>>>>>> want to wait until all the data has been preloaded from persistence
>>>>>>>>>>> into RAM; I want to execute OQL queries right away. Is this feasible
>>>>>>>>>>> with Geode? Please point me to links where I can read more about
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Good luck,
>>>>>>>>> Denis Magda
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Good luck,
>>>>>>> Denis Magda
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> -John
>>>>>> 503-504-8657
>>>>>> john.blum10101 (skype)
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Good luck,
>>>> Denis Magda
>>>>
>>>
>>>
>>>
>>> --
>>> -John
>>> 503-504-8657
>>> john.blum10101 (skype)
>>>
>>
>>
>
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)
>
