Unfortunately, the indexes are not persisted to disk; they have to be rebuilt on restart. For that reason, the whole disk store needs to be read on startup.
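To make the difference concrete, here is a toy sketch in plain Java. It is not Geode code: the class `OverflowRegionSketch`, its methods, and the `diskReads` counter are all made up for illustration, and a `HashMap` stands in for the disk store. It only models the behaviour described in this thread, as I understand it: the index lives in memory, an indexed query faults in just the matching overflowed values, a query without an index has to touch every entry, and a restart rebuilds the index by reading every record.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.IntPredicate;

// Toy model only: none of these names are Geode APIs.
class OverflowRegionSketch {
    private final Map<String, Integer> disk = new HashMap<>();       // stand-in for the disk store
    private final Map<String, Integer> memory = new HashMap<>();     // values currently in memory
    private final Map<Integer, Set<String>> index = new HashMap<>(); // value -> keys, memory only
    int diskReads = 0;                                                // simulated disk reads

    void put(String key, int value) {
        Integer old = disk.put(key, value);
        if (old != null) {                       // drop the stale index entry on update
            Set<String> keys = index.get(old);
            if (keys != null) keys.remove(key);
        }
        memory.put(key, value);
        index.computeIfAbsent(value, v -> new HashSet<>()).add(key);
    }

    // Evict the value from memory; it stays on disk and the index entry stays in memory.
    void overflow(String key) {
        memory.remove(key);
    }

    // Return the value, faulting it in from "disk" if it was overflowed.
    private int load(String key) {
        Integer v = memory.get(key);
        if (v == null) {
            diskReads++;
            v = disk.get(key);
            memory.put(key, v);
        }
        return v;
    }

    // Index hit: only entries whose indexed value matches are touched.
    List<String> queryWithIndex(int value) {
        List<String> result = new ArrayList<>();
        for (String key : index.getOrDefault(value, Collections.<String>emptySet())) {
            load(key);
            result.add(key);
        }
        return result;
    }

    // No usable index: every entry is examined, so every overflowed value is read.
    List<String> queryFullScan(IntPredicate predicate) {
        List<String> result = new ArrayList<>();
        for (String key : disk.keySet()) {
            if (predicate.test(load(key))) result.add(key);
        }
        return result;
    }

    // Restart: memory and index are lost; rebuilding the index reads the whole disk store.
    void restart() {
        memory.clear();
        index.clear();
        for (Map.Entry<String, Integer> e : disk.entrySet()) {
            diskReads++;
            index.computeIfAbsent(e.getValue(), v -> new HashSet<>()).add(e.getKey());
        }
    }

    public static void main(String[] args) {
        OverflowRegionSketch region = new OverflowRegionSketch();
        region.put("a", 1); region.put("b", 2); region.put("c", 2);
        region.put("d", 3); region.put("e", 4);
        region.overflow("b"); region.overflow("d"); region.overflow("e");

        region.queryWithIndex(2);
        System.out.println("after indexed query: " + region.diskReads + " disk read(s)");
        region.queryFullScan(v -> v == 2);
        System.out.println("after full scan:     " + region.diskReads + " disk read(s)");
        region.restart();
        System.out.println("after restart:       " + region.diskReads + " disk read(s)");
    }
}
```

In this sketch the indexed query reads only the one matching overflowed value, the full scan then reads each of the remaining overflowed values, and the restart reads all five records again to rebuild the index. Real recovery in Geode is more involved; the point is only the relative cost.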
--
Mike Stolz
Principal Engineer, GemFire Product Manager
Mobile: 631-835-4771

On Fri, Aug 19, 2016 at 5:30 PM, John Blum <jb...@pivotal.io> wrote:

> *Jason, Mike*: first, thank you.
>
> *In order to target the nodes that would supposedly hold the data of
> interest you need to know the keys you are looking for. If you know the
> keys why are you querying in the first place? Just do getAll(keys).*
>
> Two reasons...
>
> 1. I want to apply some "additional filtering" that can only be handled
> elegantly by an OQL query predicate after a subset of the data has been
> identified/targeted (using keys). I have an example of this somewhere (doh)
> after working with a customer on this exact UC.
>
> 2. I don't want the entire object (i.e. row); I only need a specific
> "projection" of the (object) data. This is particularly important if I
> have a very large and complex object graph and I am streaming data across
> the wire (client/server).
>
> *The trouble happens when you are NOT hitting your indices.*
>
> Yes, good point.
>
> *If you do a query that requires a full table scan, then every row in
> the database table needs to be examined, and to examine it, it has to be in
> memory at least briefly.*
>
> Of course.
>
> *Denis*-
>
> *The disk entries that are mentioned by John were located in memory
> before and were overflowed on disk at some point of time. It means that if
> you start your cluster from scratch and want to run OQL queries over the
> indexed data then you have to preload all the data from the persistence.*
>
> I don't specifically recall how much persistent data Geode reloads on
> restart (Geode is a shared-nothing architecture though, so each data node
> has its own persistence; additionally, primaries must come online before
> secondaries are accessible). The question is how much data gets reloaded
> on restart.
> It would seem silly if the disk store contained more data than
> would fit in memory and Geode reloaded everything, knowing some of the data
> would OVERFLOW on preload when it would not all fit. Geode will reload the
> Index though, which is stored as well.
>
> I'll let the experts answer this one.
>
> On Fri, Aug 19, 2016 at 2:04 PM, Denis Magda <magda7...@gmail.com> wrote:
>
>> Hi John, Jason,
>>
>> To expand more on this:
>>
>> *If an index can be used, the index look up is executed and entries added
>> to the result set. If any of the entries that match the predicates is
>> actually on disk, those values will need to be loaded to memory before
>> being returned as a result.*
>>
>> The disk entries that are mentioned by John were located in memory before
>> and were overflowed on disk at some point of time. It means that if you
>> start your cluster from scratch and want to run OQL queries over the
>> indexed data then you have to preload all the data from the persistence.
>> Yes, some of the data may be overflowed back to disk during the preloading
>> but you'll have your indexes in a valid state.
>>
>> Correct me if I'm still missing something.
>>
>> --
>> Denis
>>
>> On Fri, Aug 19, 2016 at 1:52 PM, Jason Huynh <jhu...@pivotal.io> wrote:
>>
>>> Hi John,
>>>
>>> I think you were referring to Mike's explanation of:
>>> "If, however, you ever resort to hitting the disk-based data for a query
>>> it is going to have to read every record that isn't in memory from disk
>>> which is going to be extremely slow. I personally would never use Geode
>>> that way."
>>>
>>> When stating:
>>> "Additionally, assuming the Indexes were defined properly based on the
>>> predicates in the queries (most often) used, that it would target the data
>>> on disk matching the predicate and load only the data required (no data
>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>> load the entire table/Region/Map/whatever to access the data matching the
>>> predicate; that's absurd, OOMEs galore)."
>>>
>>> Let me try to clear things up slightly...hopefully not causing more
>>> confusion...
>>>
>>> If an index can be used, the index look up is executed and entries added
>>> to the result set. If any of the entries that match the predicates is
>>> actually on disk, those values will need to be loaded to memory before
>>> being returned as a result.
>>>
>>> I think what Mike was saying was that if an index is not used, then the
>>> query itself would execute across the entire region, which means loading
>>> every entry into memory. We would need to inspect each entry to see if
>>> it fulfills the criteria.
>>>
>>> -Jason
>>>
>>> On Fri, Aug 19, 2016 at 1:39 PM John Blum <jb...@pivotal.io> wrote:
>>>
>>>> Hi All-
>>>>
>>>> DISCLAIMER: I am no expert in querying and index
>>>> architecture/implementation; mostly a consumer.
>>>>
>>>> Perhaps *Anil* or *Jason* can shed more light on the subject, but for
>>>> my own understanding/sanity, it would seem we could do better than this,
>>>> meaning...
>>>>
>>>> I would think any UC partially depends on the organization of your data
>>>> in the grid as well. If you used a PARTITION data management policy
>>>> [1], for instance, then, of course, your data would be distributed and
>>>> partitioned across all the data nodes in the grid (cluster) holding the
>>>> data (i.e. data nodes that have declared the same PARTITION Region).
>>>> It should then be possible to make this more optimal by having a
>>>> redundancy level of 1 or more (depending on the frequency of transactions
>>>> and data changes) to parallelize the data access.
>>>>
>>>> Not only does having more nodes mean better (or more optimal)
>>>> organization, but more memory. Still, given a very large data set, clearly
>>>> some of the data will need to OVERFLOW (to disk).
>>>>
>>>> But, by combining the Function Execution service with querying (on
>>>> PARTITIONED data) [2], you could target the nodes that would
>>>> supposedly hold the data of interest, and execute the queries there.
>>>>
>>>> Additionally, assuming the Indexes were defined properly based on the
>>>> predicates in the queries (most often) used, that it would target the data
>>>> on disk matching the predicate and load only the data required (no data
>>>> store, RDBMS or otherwise, especially disk-bound stores, should have to
>>>> load the entire table/Region/Map/whatever to access the data matching the
>>>> predicate; that's absurd, OOMEs galore).
>>>>
>>>> TMK, Geode keeps Indexes in memory (even loads them on startup) and
>>>> updates them (either sync/async depending on your configuration) as the
>>>> data changes. You would assume the data would not be changing in the
>>>> OVERFLOW, disk-based data set. If the data did change, then wouldn't you
>>>> also assume that that data would then have to be in memory? (I think so.)
>>>>
>>>> Please let me know if I am way off base here, but I would think Geode
>>>> gives you enough options that particular UCs could be made, with nominal
>>>> effort, more optimal.
>>>>
>>>> Additional references...
>>>>
>>>> * Query Partitioned Regions [3]
>>>> * Working with Indexes [4], and then...
>>>> * Tips and Guidelines on Using Indexes [5], but also important...
>>>> * Using Indexes with Overflow Regions [6]
>>>>
>>>> Hope this helps.
>>>>
>>>> Cheers!
>>>> -John
>>>>
>>>> [1] http://geode.docs.pivotal.io/docs/developing/region_options/region_types.html
>>>> [2] http://geode.docs.pivotal.io/docs/developing/querying_basics/performance_considerations.html
>>>> [3] http://geode.docs.pivotal.io/docs/developing/querying_basics/querying_partitioned_regions.html
>>>> [4] http://geode.docs.pivotal.io/docs/developing/query_index/query_index.html
>>>> [5] http://geode.docs.pivotal.io/docs/developing/query_index/indexing_guidelines.html
>>>> [6] http://geode.docs.pivotal.io/docs/developing/query_index/indexes_with_overflow_regions.html
>>>>
>>>> On Fri, Aug 19, 2016 at 12:55 PM, Denis Magda <magda7...@gmail.com> wrote:
>>>>
>>>>> Thanks, now I see.
>>>>>
>>>>> This works the same way as in Ignite then. If you set up an eviction
>>>>> policy in Ignite, the data may be evicted to swap at some point in time,
>>>>> and if a query is executed right after that, it may swap the data back
>>>>> into memory. However, the indexes must always be in memory.
>>>>>
>>>>> --
>>>>> Denis
>>>>>
>>>>> On Fri, Aug 19, 2016 at 12:43 PM, Michael Stolz <mst...@pivotal.io> wrote:
>>>>>
>>>>>> There is a notion of data aging out in Geode. We call it overflow to
>>>>>> disk.
>>>>>>
>>>>>> The idea is that as data gets old you can have the records in memory
>>>>>> expire, and that expiry can be to disk. That's the cold data.
>>>>>>
>>>>>> You may have built an index while you were initially loading the
>>>>>> data, and if your predicates only hit the indexes you will still get
>>>>>> really fast queries if the result sets aren't large.
>>>>>>
>>>>>> If, however, you ever resort to hitting the disk-based data for a
>>>>>> query it is going to have to read every record that isn't in memory from
>>>>>> disk which is going to be extremely slow. I personally would never use
>>>>>> Geode that way.
>>>>>>
>>>>>> --
>>>>>> Mike Stolz
>>>>>> Principal Engineer, GemFire Product Manager
>>>>>> Mobile: 631-835-4771
>>>>>>
>>>>>> On Fri, Aug 19, 2016 at 3:35 PM, Denis Magda <magda7...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> Thanks a lot for the explanation! It makes perfect sense to me.
>>>>>>>
>>>>>>> I just thought that you were able to do something with indexes in
>>>>>>> such a way that there is no need to preload everything from disk into
>>>>>>> memory when a query is executed over cold data.
>>>>>>>
>>>>>>> Then what does "execution over cold data" mean? I'm referring to the
>>>>>>> following sentence from the main page:
>>>>>>>
>>>>>>> *Object Query Language allows distributed query execution on hot and
>>>>>>> cold data, with SQL-like capabilities, including joins.*
>>>>>>>
>>>>>>> --
>>>>>>> Denis
>>>>>>>
>>>>>>> On Fri, Aug 19, 2016 at 12:27 PM, Michael Stolz <mst...@pivotal.io> wrote:
>>>>>>>
>>>>>>>> Here's the thing...
>>>>>>>>
>>>>>>>> On any in-memory data grid, if you run a query before the data has
>>>>>>>> been loaded into memory, it is going to cause the exact same amount of
>>>>>>>> disk i/o to do the query as it will take to load everything into memory.
>>>>>>>>
>>>>>>>> And the system will still have to go ahead and load everything into
>>>>>>>> memory anyway so you're going to end up doing all that disk i/o TWICE.
>>>>>>>>
>>>>>>>> Geode DOES have a nice feature for key based access though. We
>>>>>>>> actually store the keys in a separate file from the data and we can
>>>>>>>> load that file very quickly. Then if you go after the data for one of
>>>>>>>> those keys we can lazily load it from disk on demand if it hasn't yet
>>>>>>>> been loaded into memory.
>>>>>>>>
>>>>>>>> The Lucene integration work that is going on in Geode might also
>>>>>>>> make it possible to load the indexes first and lazily load the data
>>>>>>>> based on queries against the indexes.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Stolz
>>>>>>>> Principal Engineer, GemFire Product Manager
>>>>>>>> Mobile: 631-835-4771
>>>>>>>>
>>>>>>>> On Fri, Aug 19, 2016 at 2:59 PM, Denis Magda <magda7...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello Geode community,
>>>>>>>>>
>>>>>>>>> I've been investigating the possibilities of Geode Persistence for a
>>>>>>>>> while and still can't get it clear whether I need to have all my data
>>>>>>>>> in memory if I want to execute OQL queries, or whether the OQL engine
>>>>>>>>> works over the persistence as well.
>>>>>>>>>
>>>>>>>>> My use case is the following. During the cluster startup I don't
>>>>>>>>> want to wait until all the data has been preloaded from the
>>>>>>>>> persistence to RAM, and want to execute OQL queries right away. Is it
>>>>>>>>> feasible to implement with Geode? Please provide me with the links
>>>>>>>>> where I can read more about this.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Denis
>>>>>>>
>>>>>>> --
>>>>>>> Good luck,
>>>>>>> Denis Magda
>>>>>
>>>>> --
>>>>> Good luck,
>>>>> Denis Magda
>>>>
>>>> --
>>>> -John
>>>> 503-504-8657
>>>> john.blum10101 (skype)
>>
>> --
>> Good luck,
>> Denis Magda
>
> --
> -John
> 503-504-8657
> john.blum10101 (skype)