Hi Scott,

Yes, we think that our usage scenario falls into the Index-Heavy/Query-Heavy category too. We have tested several softCommit/hardCommit intervals (from a few seconds to minutes) with no appreciable improvement :(

Thanks for your reply!

- Daniel

2017-08-25 6:45 GMT+02:00 Scott Stults <sstu...@opensourceconnections.com>:

> Hi Dani,
>
> It seems like your use case falls into the Index-Heavy / Query-Heavy category, so you might try increasing your hard commit frequency to 15 seconds rather than 15 minutes:
>
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> -Scott
>
> On Thu, Aug 24, 2017 at 10:03 AM, Daniel Ortega <danielortegauf...@gmail.com> wrote:
>
> > Hi Scott,
> >
> > In our indexing service we are using that client too (org.apache.solr.client.solrj.impl.CloudSolrClient) :)
> >
> > This is our Update Request Processor chain configuration:
> >
> > <updateProcessor class="solr.processor.SignatureUpdateProcessorFactory" name="signature">
> >   <bool name="enabled">true</bool>
> >   <str name="signatureField">hash</str>
> >   <bool name="overwriteDupes">false</bool>
> >   <str name="signatureClass">solr.processor.Lookup3Signature</str>
> > </updateProcessor>
> > <updateRequestProcessorChain processor="signature" name="dedupe">
> >   <processor class="solr.LogUpdateProcessorFactory" />
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
> > <!-- de-duplication process explained in: https://cwiki.apache.org/confluence/display/solr/De-Duplication -->
> > <requestHandler name="/update" class="solr.UpdateRequestHandler">
> >   <lst name="defaults">
> >     <str name="update.chain">dedupe</str>
> >   </lst>
> > </requestHandler>
> >
> > Thanks for your reply :)
> >
> > - Dani
> >
> > 2017-08-24 14:49 GMT+02:00 Scott Stults <sstults@opensourceconnections.com>:
> >
> > > Hi Daniel,
> > >
> > > SolrJ has a few client implementations to choose from: CloudSolrClient, ConcurrentUpdateSolrClient, HttpSolrClient, LBHttpSolrClient.
> > > You said your query service uses CloudSolrClient, but it would be good to verify which implementation your indexing service uses.
> > >
> > > One of the problems you might be having is with your deduplication step. Can you post your Update Request Processor Chain?
> > >
> > > -Scott
> > >
> > > On Wed, Aug 23, 2017 at 4:13 PM, Daniel Ortega <danielortegauf...@gmail.com> wrote:
> > >
> > > > Hi Scott,
> > > >
> > > > - *Can you describe the process that queries the DB and sends records to Solr?*
> > > >
> > > > We are enqueueing ids during every ORACLE transaction (on inserts/updates).
> > > >
> > > > An application dequeues every id and performs queries against dozens of tables in the relational model to retrieve the fields needed to build the document. As we know that we are modifying the same ORACLE row in different (but consecutive) transactions, we store only the last version of the modified documents in a map data structure.
> > > >
> > > > The application has a configurable interval to send the documents stored in the map to the update handler (we have tested different intervals, from a few milliseconds to several seconds) using the SolrJ client. Currently we are sending all the documents every 15 seconds.
> > > >
> > > > This application is developed using Java, Spring and Maven, and we have several instances.
> > > >
> > > > - *Is it a SolrJ-based application?*
> > > >
> > > > Yes, it is. We aren't using the latest version of the SolrJ client (we are currently using SolrJ v6.3.0).
> > > >
> > > > - *If it is, which client package are you using?*
> > > >
> > > > I don't know exactly what you mean by 'client package' :)
> > > >
> > > > - *How many documents do you send at once?*
> > > >
> > > > It depends on the interval described before and the number of transactions executed in our relational database. From dozens to a few hundred (and even thousands).
> > > >
> > > > - *Are you sending your indexing or query traffic through a load balancer?*
> > > >
> > > > We aren't using a load balancer for indexing, but all our REST query services sit behind an HAProxy (using the 'leastconn' algorithm). The REST query services perform queries using the CloudSolrClient.
> > > >
> > > > Thanks for your reply,
> > > > if you need any further information don't hesitate to ask
> > > >
> > > > Daniel
> > > >
> > > > 2017-08-23 14:57 GMT+02:00 Scott Stults <sstults@opensourceconnections.com>:
> > > >
> > > > > Hi Daniel,
> > > > >
> > > > > Great background information about your setup! I've got just a few more questions:
> > > > >
> > > > > - Can you describe the process that queries the DB and sends records to Solr?
> > > > > - Is it a SolrJ-based application?
> > > > > - If it is, which client package are you using?
> > > > > - How many documents do you send at once?
> > > > > - Are you sending your indexing or query traffic through a load balancer?
> > > > >
> > > > > If you're sending documents to each replica as fast as they can take them, you might be seeing a bottleneck at the shard leaders. The SolrJ CloudSolrClient finds out from Zookeeper which nodes are the shard leaders and sends docs directly to them.
> > > > >
> > > > > -Scott
> > > > >
> > > > > On Tue, Aug 22, 2017 at 2:16 PM, Daniel Ortega <danielortegauf...@gmail.com> wrote:
> > > > >
> > > > > > *Main Problems*
> > > > > >
> > > > > > We are involved in a migration from Solr Master/Slave infrastructure to SolrCloud infrastructure.
> > > > > >
> > > > > > The main problems that we have now are:
> > > > > >
> > > > > > - Excessive resource consumption: Currently we have 5 instances with 80 processors/768 GB RAM each, using SSD Hard Disk Drives, that don't support the load that we have in the other architecture. In our Master-Slave architecture we have only 7 Virtual Machines with lower specs (4 processors and 16 GB each, using SSD Hard Disk Drives too). So, at the moment our SolrCloud infrastructure is wasting several dozen times more resources than our Solr Master/Slave infrastructure.
> > > > > > - Despite spending more resources we have worse query times (compared to Solr in master/slave architecture).
> > > > > >
> > > > > > *Search infrastructure (SolrCloud infrastructure)*
> > > > > >
> > > > > > As we cannot use the DIH Handler (which is what we use in Solr Master/Slave), we have developed an application which reads every transaction from Oracle, builds a document collection by searching in the database and sends the result to the */update* handler every 200 milliseconds using the SolrJ client.
> > > > > > This application tries to delete the possible duplicates in each update window, but we are using Solr’s de-duplication techniques <https://cwiki.apache.org/confluence/display/solr/De-Duplication> too.
> > > > > >
> > > > > > We are indexing ~100 documents per second (with peaks of ~1000 documents per second).
> > > > > >
> > > > > > Every search query is centralized in another application which exposes a DSL behind a REST API and also uses the SolrJ client to perform queries. We have peaks of 2000 QPS.
> > > > > >
> > > > > > *Cluster structure (SolrCloud infrastructure)*
> > > > > >
> > > > > > At the moment, the cluster has 30 SolrCloud instances with the same specs (same physical hosts, same JVM settings, etc.).
> > > > > >
> > > > > > *Main collection*
> > > > > >
> > > > > > In our use case we are basically using this collection as a NoSQL database. Our document is composed of about 300 fields that represent an advert, and is a denormalization of its relational representation in Oracle.
> > > > > >
> > > > > > We are using all our nodes to store the collection in 3 shards. So, each shard has 10 replicas.
> > > > > >
> > > > > > At the moment, we are only indexing a subset of the adverts stored in Oracle, but our goal is to store all the ads that we have in the DB (a few tens of millions of documents). We have NRT requirements, so we need to index every document as soon as possible once it’s changed in Oracle.
> > > > > >
> > > > > > We have defined the properties of each field (whether it’s stored/indexed or not, whether it should be defined as DocValue, etc…) considering the use of that field.
> > > > > >
> > > > > > *Index size (SolrCloud infrastructure)*
> > > > > >
> > > > > > The index size is currently above 6 GB, storing 1.300.000 documents in each shard. So, we are storing 3.900.000 documents and the total index size is 18 GB.
> > > > > >
> > > > > > *Indexation (SolrCloud infrastructure)*
> > > > > >
> > > > > > The commits *aren’t* triggered by the application described before. The hardcommit/softcommit intervals are configured in Solr:
> > > > > >
> > > > > > - *HardCommit:* every 15 minutes (with opensearcher = false)
> > > > > > - *SoftCommit:* every 5 seconds
> > > > > >
> > > > > > *Apache Solr Version*
> > > > > >
> > > > > > We are currently using the latest version of Solr (6.6.0) under an Oracle VM (Java(TM) SE Runtime Environment (build 1.8.0_131-b11) Oracle (64 bits)) in both deployments.
> > > > > >
> > > > > > The question is... What is wrong here?!?!?!
> > > > >
> > > > > --
> > > > > Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC | 434.409.2780
> > > > > http://www.opensourceconnections.com
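
The buffering step described in the thread (keeping only the last version of each modified document in a map, then sending the map's contents at a fixed interval) could be sketched roughly as below. This is an illustration only, not the posters' code: the class name `LatestVersionBuffer` and its methods are hypothetical, documents are stood in for by plain values, and the actual SolrJ call is left out so the sketch runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical last-write-wins buffer: at most one pending document per id.
 * K is the document id type, V the document type (both assumptions).
 */
final class LatestVersionBuffer<K, V> {
    // LinkedHashMap keeps first-insertion order; re-putting an existing
    // key replaces its value without moving the entry.
    private final Map<K, V> pending = new LinkedHashMap<>();

    /** Stores the newest version of a document, replacing any older one with the same id. */
    public synchronized void put(K id, V doc) {
        pending.put(id, doc);
    }

    /** Returns everything buffered so far as one batch and empties the buffer. */
    public synchronized List<V> drain() {
        List<V> batch = new ArrayList<>(pending.values());
        pending.clear();
        return batch;
    }

    /** Number of distinct ids currently waiting to be sent. */
    public synchronized int size() {
        return pending.size();
    }
}
```

In a setup like the one described, a scheduled task would call `drain()` once per interval (15 seconds in the thread) and hand a non-empty batch to the indexing client. Several updates to the same id inside one interval then cost a single document in the update request.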