Hi Atri,

Sorry for the late answer.

> I didn't quite understand. Are you proposing that Ignite should not have FTS
> capabilities?

That seems like an option to me. IMHO it is better to have no FTS than
something like the current Ignite TextQueries.

2021-08-03 12:45 GMT+03:00, Atri Sharma <a...@apache.org>:
> Hi Ivan,
>
> I didn't quite understand. Are you proposing that Ignite should not
> have FTS capabilities?
>
> Atri
>
> On Tue, Aug 3, 2021 at 2:57 PM Ivan Pavlukhin <vololo...@gmail.com> wrote:
>>
>> Hi Atri,
>>
>> My main concern is non-maleficence. Every task has several solutions,
>> e.g. straightforward ones:
>> 1. Do not implement FTS.
>> 2. Create own implementation.
>>
>> Some of the strongest ones live without FTS [1].
>>
>> [1] https://github.com/cockroachdb/cockroach/issues/7821
>>
>> 2021-08-02 11:33 GMT+03:00, Atri Sharma <a...@apache.org>:
>> > Hi Ivan,
>> >
>> > Would you like to propose an alternative to Lucene?
>> >
>> > Atri
>> >
>> > On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vololo...@gmail.com> wrote:
>> >
>> >> Folks,
>> >>
>> >> Sorry if read the thread not thoroughly enough, but do we consider
>> >> Lucene as obviously right choice? In my understanding Ignite history
>> >> has shown clearly that "fastest feature implementation" is not usually
>> >> the best. And one example of this are text queries. Are not we trying
>> >> to do a same mistake again? FTS is a huge feature, I do not believe
>> >> there is an easy win for it.
>> >>
>> >> 2021-07-27 19:18 GMT+03:00, Atri Sharma <a...@apache.org>:
>> >> > Andrey,
>> >> >
>> >> >> Per-partition Lucene index looks simple to implement, but it may
>> >> >> require per-partition SQL to make full-text search expressions work
>> >> >> correctly within the SQL query.
>> >> > I think that as long as we follow the map - reduce process that we
>> >> > already do for other queries, we should be fine.
>> >> >
>> >> >> Per-partition SQL index may kill the performance.
>> >> >> We already tried to do that in Ignite 2. However, QueryParallelism
>> >> >> feature helps to speed up some data-intensive queries, but hits the
>> >> >> performance in simple cases, and at some point (e.g. segments > number
>> >> >> of CPU) the performance rapidly degrades with the increasing number of
>> >> >> segments.
>> >> >
>> >> > Yeah, that is always the case, but a global index will be a nightmare
>> >> > in terms of concurrency and pessimistic concurrency control will
>> >> > anyways kill the benefits, coupled with the metadata requirements.
>> >> > What were the specific issues with per partition index?
>> >> >
>> >> >> AFAIK, Lucene widely used bitmap indices that are easy to merge.
>> >> >> Maybe, the map-reduce technique underneath FTS expressions and some
>> >> >> hacks will add a minimal overhead.
>> >> >
>> >> > Lucene uses many types of indices but the aspect here is that per
>> >> > partition Lucene indices can return docIDs and we can merge them in
>> >> > reduce phase. So we are abstracted out from specifics of the internal
>> >> > index being used to serve the query.
>> >> >
>> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
>> >> >> > Lucene indices. The important thing here is to not treat Lucene
>> >> >> > indices as source of truth.
>> >> >> To use WAL we either should relay Lucene files to our Page memory or
>> >> >> be aware of Lucene files structure.
>> >> >> The first looks tricky, as we should guarantee a contiguous address
>> >> >> space in Page memory for reflecting Lucene file. Maybe separate
>> >> >> managed memory segment with its own rules?
>> >> >
>> >> > Why not use Lucene's MMappedDirectory and map it to our storage
>> >> > classes?
>> >> >
>> >> >> >> Transactions.
>> >> >> >> * Will we support transactions?
>> >> >> > Lucene has no concept of transactions.
>> >> >> Yes, but we have.
>> >> >> Lucene index may be non-transactional, but users never expect to see
>> >> >> uncommitted data.
>> >> >> How does this connect with transactional SQL?
>> >> > We could have the Lucene writes done as a part of transactions and ack
>> >> > back only when it succeeds/fails. WDYT?
>> >> >>
>> >> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <a...@apache.org> wrote:
>> >> >>
>> >> >> > Sorry, I planned on creating a Wiki page for this, but it makes more
>> >> >> > sense to be replying here.
>> >> >> >
>> >> >> > > * How Lucene index can be split among the nodes?
>> >> >> >
>> >> >> > We can have partition level indices on each node.
>> >> >> >
>> >> >> > > * If we'll have a single index for all partitions on the particular
>> >> >> > > node, then how index records will be aware of partitioning?
>> >> >> >
>> >> >> > Index records don't need to be aware of partitioning -- each Lucene
>> >> >> > index is independent.
>> >> >> >
>> >> >> > > This is important to filter out backup records from the results to
>> >> >> > > avoid duplicates.
>> >> >> >
>> >> >> > We can merge documents from different nodes and remove duplicates as
>> >> >> > long as docIDs are globally unique.
>> >> >> >
>> >> >> > > * How results from several nodes can be merged on the Reduce stage?
>> >> >> >
>> >> >> > As long as documents have a globally unique docID, Lucene has merge
>> >> >> > functions that can merge results from multiple partial results.
>> >> >> >
>> >> >> > > * Does Lucene supports smth like JOIN operation or others that may
>> >> >> > > require data from another partition or index?
>> >> >> >
>> >> >> > As illustrated by Ilya, Block-Join works for us.
>> >> >> >
>> >> >> > > If so, then it likes to multistep query with merging results on
>> >> >> > > intermediate stages and requires detailed investigation and design.
>> >> >> > > It is ok if Ignite will have some limitations here, but we would like
>> >> >> > > to know about them at the early stage.
>> >> >> >
>> >> >> > > * How effectively map Lucene files to the page memory? Is it even
>> >> >> > > possible?
>> >> >> >
>> >> >> > Lucene has PageDirectory implementations which allow storing Lucene
>> >> >> > indices on different kind of file structures. It has a
>> >> >> > MMappedFileDirectory that we could use?
>> >> >> >
>> >> >> > > Otherwise, how to deal with potential OOM on large queries and memory
>> >> >> > > capacity planning?
>> >> >> >
>> >> >> > We can use Lucene's MMapped directory.
>> >> >> >
>> >> >> > >
>> >> >> > > Persistence.
>> >> >> > > * How and what consistency guarantees could we have/expect?
>> >> >> >
>> >> >> > Lucene does not have WAL logs but is append only
>> >> >> >
>> >> >> > > Seems, we may not be able to write physical records for Lucene index
>> >> >> > > to our WAL. What can we do with this?
>> >> >> >
>> >> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
>> >> >> > Lucene indices. The important thing here is to not treat Lucene
>> >> >> > indices as source of truth.
>> >> >> > >
>> >> >> > > Transactions.
>> >> >> > > * Will we support transactions?
>> >> >> > Lucene has no concept of transactions.
>> >> >> >
>> >> >> > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
>> >> >> > > versions for the records?
>> >> >> > No
>> >> >> > > * What will be consistency guarantees?
>> >> >> > We can acknowledge writes back only after Lucene index is updated.
>> >> >> > >
>> >> >> > > UX
>> >> >> > > * How to add FullText search queries syntax into Calcite?
>> >> >> > Postgres's FTS functions are a good reference.
>> >> >> > > * AFAIK, the Lucene index has many properties for tuning. How will
>> >> >> > > the user configure the index?
>> >> >> > Most of those properties can be cluster level and exposed as a new
>> >> >> > sub config for ignite.
>> >> >> > > * How and where to store the settings? What are cluster-wide and what
>> >> >> > > a local to the particular node?
>> >> >> > All can be cluster level.
>> >> >> > > * Will be all the settings immutable? Can be they changed on-fly?
>> >> >> > > after node/grid restart?
>> >> >> > They should be applied post restart.
>> >> >> >
>> >> >> > > * Any limitations on query syntax?
>> >> >> > It depends on how we model our queries for text search.
>> >> >> >
>> >> >> > >
>> >> >> > > SQL
>> >> >> > > * Will we support FullText search in SQL?
>> >> >> > We need custom functions for it. See Postgres's FTS functions.
>> >> >> > > * How to integrate Lucene index into Calcite? What is the cost model?
>> >> >> > There cannot be any cost model since there are no paths for a text
>> >> >> > query. If we see a text query, we have to use Lucene index or return
>> >> >> > an error. In this way, we need to model text search as a set of UDFs
>> >> >> >
>> >> >> > > Splitting rules? Traits?
>> >> >> > Please see my reply above.
>> >> >> > >
>> >> >> > >
>> >> >> > > With all of this, you can go with the IEP (or even some short
>> >> >> > > summary) and further POC and implementation.
>> >> >> > > That's a big deal, so let's discuss what could be done here.
>> >> >> > >
>> >> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <a...@apache.org> wrote:
>> >> >> > >
>> >> >> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
>> >> >> > > > important for me and I think Ignite users will benefit from it
>> >> >> > > > greatly.
>> >> >> > > >
>> >> >> > > > If it makes sense to be focusing on Ignite 3 for this capability, I
>> >> >> > > > am eager to contribute there and lead the development.
>> >> >> > > >
>> >> >> > > > Please share your thoughts.
>> >> >> > > >
>> >> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
>> >> >> > > > <andrey.mashen...@gmail.com> wrote:
>> >> >> > > > >
>> >> >> > > > > Hi Atri,
>> >> >> > > > >
>> >> >> > > > > All the Jira tickets we have on the Full-text search (FTS) thing
>> >> >> > > > > are targeted to Ignite 2.
>> >> >> > > > >
>> >> >> > > > > AFAIK, we want, but we have NOT committed to FTS support in Ignite
>> >> >> > > > > 3, yet.
>> >> >> > > > > By the way, we are getting requests for this thing from the user
>> >> >> > > > > side, and definitely,
>> >> >> > > > > FTS would be a valuable feature for Ignite.
>> >> >> > > > >
>> >> >> > > > > It will be great if the one wants to drive it, any help will be
>> >> >> > > > > appreciated.
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <a...@apache.org> wrote:
>> >> >> > > > >
>> >> >> > > > > > Hello,
>> >> >> > > > > >
>> >> >> > > > > > An update, please. I am working through persistence of Lucene
>> >> >> > > > > > index using Ignite Dictionary, and will be asking some questions
>> >> >> > > > > > soon.
>> >> >> > > > > >
>> >> >> > > > > > I had one doubt -- where does this change go? Ignite 3?
>> >> >> > > > > >
>> >> >> > > > > > Also, I know we want to build native support for text searches
>> >> >> > > > > > in Ignite 3.
>> >> >> > > > > > Is the work I am proposing here part of that, or will that be a
>> >> >> > > > > > separate effort?
>> >> >> > > > > >
>> >> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <ilya.kasnach...@gmail.com>
>> >> >> > > > > > wrote:
>> >> >> > > > > >
>> >> >> > > > > > > Hello!
>> >> >> > > > > > >
>> >> >> > > > > > > I think that number one is the most important one, then maybe
>> >> >> > > > > > > it will see more use and other deficiencies become more apparent,
>> >> >> > > > > > > leading to more tickets and visibility.
>> >> >> > > > > > >
>> >> >> > > > > > > Maybe 2. and 3. will even use a different approach when
>> >> >> > > > > > > persistence is implemented.
>> >> >> > > > > > >
>> >> >> > > > > > > Regards,
>> >> >> > > > > > > --
>> >> >> > > > > > > Ilya Kasnacheev
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > > Mon, 28 Jun 2021 at 14:34, Atri Sharma <a...@apache.org>:
>> >> >> > > > > > >
>> >> >> > > > > > > > Hello Again!
>> >> >> > > > > > > >
>> >> >> > > > > > > > I have been looking into the aforementioned and here are my
>> >> >> > > > > > > > follow up thoughts:
>> >> >> > > > > > > >
>> >> >> > > > > > > > 1. Support persistence of Lucene indexes.
>> >> >> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
>> >> >> > > > > > > > (Needs fixing of moving partitions first)
>> >> >> > > > > > > > 3. Figure out how to return scores from nodes and use them as
>> >> >> > > > > > > > sort parameters on the coordinator node
>> >> >> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
>> >> >> > > > > > > >
>> >> >> > > > > > > > Please let me know if this looks ok to make text queries
>> >> >> > > > > > > > functional?
>> >> >> > > > > > > >
>> >> >> > > > > > > > Atri
>> >> >> > > > > > > >
>> >> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
>> >> >> > > > > > > > <alexey.scherbak...@gmail.com> wrote:
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Hi.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > One of the biggest issues with text queries is a lack of
>> >> >> > > > > > > > > support for lucene indices persistence, which makes this
>> >> >> > > > > > > > > functionality useless if a persistence is enabled.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > I would first take care of it.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Mon, 21 Jun 2021 at 12:16, Maksim Timonin <timonin.ma...@gmail.com>:
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > > Hi, Atri!
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > You're right, Actually there is a lack of support for
>> >> >> > > > > > > > > > TextQueries. For the last ticket I'm doing I see some obvious
>> >> >> > > > > > > > > > issues with them (no page size support, for example).
>> >> >> > > > > > > > > > I'm glad that somebody wants to maintain this
>> >> >> > > > > > > > > > functionality. Thanks a lot!
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > For the MergeSort algorithm there is already a patch for that
>> >> >> > > > > > > > > > [1]. It's currently on review. This patch introduces an
>> >> >> > > > > > > > > > abstract reducer for CacheQueries with 2 implementations
>> >> >> > > > > > > > > > (unordered, merge-sort). Then TextQuery leverages on MergeSort
>> >> >> > > > > > > > > > to order results from multiple nodes by score. This patch also
>> >> >> > > > > > > > > > fixes the pageSize issue, I've mentioned before. Could you
>> >> >> > > > > > > > > > please check if it fully matches your idea? Any issues or
>> >> >> > > > > > > > > > comments are welcome.
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > I've prepared this ticket, because I need the MergeSort
>> >> >> > > > > > > > > > algorithm for the new type of queries I'm implementing
>> >> >> > > > > > > > > > (IndexQuery, it should also provide ordered results over
>> >> >> > > > > > > > > > multiple nodes).
>> >> >> > > > > > > > > > Currently I'm not planning to go further with TextQuery, so
>> >> >> > > > > > > > > > if you're going to support this it'll be a great contribution,
>> >> >> > > > > > > > > > I think.
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
>> >> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <a...@apache.org>
>> >> >> > > > > > > > > > wrote:
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > > Hi All,
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > I have been looking into our text queries support and see
>> >> >> > > > > > > > > > > that it has limited community support.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the module
>> >> >> > > > > > > > > > > and work on enhancing it further.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work on
>> >> >> > > > > > > > > > > sorted reduce - merge across nodes. Fundamentally, this is
>> >> >> > > > > > > > > > > doable since Lucene ranks documents according to their
>> >> >> > > > > > > > > > > score, and documents are returned in the order of their
>> >> >> > > > > > > > > > > score.
>> >> >> > > > > > > > > > > Since the scoring function is homogeneous, this means that
>> >> >> > > > > > > > > > > across nodes, we can compare scores and merge sort.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Please let me know if I can take this up.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Atri
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > --
>> >> >> > > > > > > > > > > Regards,
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Atri
>> >> >> > > > > > > > > > > Apache Concerted
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > --
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Best regards,
>> >> >> > > > > > > > > Alexei Scherbakov
>> >> >> > > > > > > >
>> >> >> > > > > > > > --
>> >> >> > > > > > > > Regards,
>> >> >> > > > > > > >
>> >> >> > > > > > > > Atri
>> >> >> > > > > > > > Apache Concerted
>> >> >> > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > --
>> >> >> > > > > Best regards,
>> >> >> > > > > Andrey V. Mashenkov
>> >> >> > > >
>> >> >> > > > --
>> >> >> > > > Regards,
>> >> >> > > >
>> >> >> > > > Atri
>> >> >> > > > Apache Concerted
>> >> >> > > >
>> >> >> > >
>> >> >> > >
>> >> >> > > --
>> >> >> > > Best regards,
>> >> >> > > Andrey V. Mashenkov
>> >> >> >
>> >> >> > --
>> >> >> > Regards,
>> >> >> >
>> >> >> > Atri
>> >> >> > Apache Concerted
>> >> >> >
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Best regards,
>> >> >> Andrey V. Mashenkov
>> >> >
>> >> > --
>> >> > Regards,
>> >> >
>> >> > Atri
>> >> > Apache Concerted
>> >> >
>> >>
>> >>
>> >> --
>> >>
>> >> Best regards,
>> >> Ivan Pavlukhin
>> >>
>> >
>>
>>
>> --
>>
>> Best regards,
>> Ivan Pavlukhin
>
> --
> Regards,
>
> Atri
> Apache Concerted
>

--
Best regards,
Ivan Pavlukhin
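
P.S. For anyone skimming the quoted history: the reduce step discussed there (each node returns hits already ordered by score, the coordinator merge-sorts them and drops duplicates by a globally unique key) could look roughly like the sketch below. It is only an illustration under stated assumptions: the names ScoreMergeSketch, Hit, Cursor and merge are made up for this mail and are not Ignite or Lucene APIs, and it assumes scores from different nodes are directly comparable, which the thread asserts but does not prove.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Illustrative sketch only; none of these types exist in Ignite or Lucene. */
final class ScoreMergeSketch {
    /** One search hit: a globally unique document key plus its relevance score. */
    static final class Hit {
        final String key;
        final float score;

        Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Cursor over one node's hits, which must already be sorted by descending score. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit head() {
            return hits.get(pos);
        }
    }

    /**
     * K-way merge of per-node results. Only the first occurrence of each key is kept,
     * so copies of the same document returned by several nodes are dropped.
     */
    static List<Hit> merge(List<List<Hit>> perNode, int limit) {
        // Max-heap ordered by the score of each cursor's current head.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            (a, b) -> Float.compare(b.head().score, a.head().score));

        for (List<Hit> hits : perNode) {
            if (!hits.isEmpty())
                heap.add(new Cursor(hits));
        }

        Set<String> seen = new HashSet<>();
        List<Hit> out = new ArrayList<>();

        while (!heap.isEmpty() && out.size() < limit) {
            Cursor c = heap.poll();
            Hit h = c.head();

            if (seen.add(h.key))
                out.add(h);

            // Advance the cursor and put it back if that node has more hits.
            if (++c.pos < c.hits.size())
                heap.add(c);
        }

        return out;
    }

    public static void main(String[] args) {
        List<Hit> node1 = Arrays.asList(new Hit("a", 0.9f), new Hit("c", 0.4f));
        List<Hit> node2 = Arrays.asList(new Hit("a", 0.9f), new Hit("b", 0.7f), new Hit("d", 0.1f));

        for (Hit h : merge(Arrays.asList(node1, node2), 3))
            System.out.println(h.key + " " + h.score);
    }
}
```

The heap holds one cursor per node, so the coordinator never needs more than one pending hit per node in memory at a time; paging, transactions and persistence, which are the hard parts raised in the thread, are deliberately left out.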