[jira] [Commented] (SOLR-4787) Join Contrib

Arul Kalaipandian (JIRA) Fri, 04 Apr 2014 09:46:28 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-4787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960120#comment-13960120
 ]


Arul Kalaipandian commented on SOLR-4787:
-----------------------------------------

Last week we tried the patch (SOLR-4787) in our test system, and hjoin 
performed considerably better than the standard join. 

However, we ran into the following issues:

1) With 'int' join fields, bjoin throws ArrayIndexOutOfBoundsException.

{code:title=bjoin throws ArrayIndexOutOfBoundsException}

Caused by: org.apache.solr.client.solrj.SolrServerException: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:155)
        ... 48 more
Caused by: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.solr.joins.BitSetJoinQParserPlugin$BitSetJoinQuery.createWeight(BitSetJoinQParserPlugin.java:282)
        at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:664)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297)
        at org.apache.solr.search.SolrIndexSearcher.getDocSetNC(SolrIndexSearcher.java:1122)
        at org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet(SolrIndexSearcher.java:825)
        at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:942)
        at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1399)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1366)
        at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457)
        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
        ... 48 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.util.OpenBitSet.get(OpenBitSet.java:174)
        at org.apache.solr.joins.BitSetJoinQParserPlugin$BitSetJoinQuery.createWeight(BitSetJoinQParserPlugin.java:273)
        ... 61 more
{code}
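
The innermost frame is a bitset lookup failing with index -1. As a rough illustration only, the sketch below shows how a word-array bitset can fail this way when a negative value is used as a bit address, and how a guard avoids it. SimpleBitSet is a hypothetical stand-in written for this example, not Lucene's OpenBitSet, and whether a negative key is the actual cause here is an assumption.

```java
public class BitSetIndexSketch {
    // Minimal fixed-size bitset backed by 64-bit words (illustration only).
    static final class SimpleBitSet {
        private final long[] bits;

        SimpleBitSet(int numBits) {
            bits = new long[(numBits + 63) >>> 6];
        }

        void set(int index) {
            bits[index >> 6] |= 1L << (index & 63);
        }

        // Unchecked read: a negative index maps to word -1 and fails.
        boolean getUnchecked(int index) {
            int word = index >> 6; // arithmetic shift: -1 >> 6 == -1
            if (word >= bits.length) return false;
            return (bits[word] & (1L << (index & 63))) != 0;
        }

        // Guarded read: treats negative keys as "not present".
        boolean getChecked(int index) {
            if (index < 0) return false;
            return getUnchecked(index);
        }
    }

    public static void main(String[] args) {
        SimpleBitSet keys = new SimpleBitSet(128);
        keys.set(42);
        System.out.println(keys.getChecked(42));
        System.out.println(keys.getChecked(-1));
        try {
            keys.getUnchecked(-1);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed: " + e.getMessage());
        }
    }
}
```

Returning false for a negative key is only one option; rejecting the query with a clear error message may be preferable.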

2) Test cases with both 'bjoin' and 'hjoin' fail with thread leaks.

{code:title=Both hjoin & bjoin (with or without localparam 'threads')}
Thread[id=29, name=commitScheduler-7-thread-1, state=TIMED_WAITING, group=TGRP-VolatileQueryTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
        at java.util.concurrent.DelayQueue.take(DelayQueue.java:164)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:609)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:602)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:662)
{code}
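
The dump shows a scheduled-executor worker waiting on its delay queue. Such a worker stays alive until the executor that owns it is shut down. The generic sketch below only illustrates that lifecycle with plain java.util.concurrent classes. It is not Solr code, and which component fails to release the scheduler in these tests is not established here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerShutdownSketch {
    // Starts a one-thread scheduler, queues a far-off task, then stops it.
    // Returns true if the worker thread terminated within the timeout.
    static boolean startAndStop() throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        scheduler.schedule(new Runnable() {
            @Override
            public void run() {
                // placeholder for periodic work
            }
        }, 1, TimeUnit.HOURS);

        // Until shutdown, the worker waits on the delay queue,
        // much like the leaked thread in the dump above.
        scheduler.shutdownNow(); // cancels pending tasks, interrupts the idle worker
        return scheduler.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("terminated: " + startAndStop());
    }
}
```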


3) 'bjoin' throws NumberFormatException for 'long' join fields.
   It would be nice to validate the field's type before executing the join 
query.

{code:title=Exception with 'long' join fields}
Caused by: java.lang.NumberFormatException: Invalid shift value in prefixCoded bytes (is encoded value really an INT?)
        at org.apache.lucene.util.NumericUtils.getPrefixCodedIntShift(NumericUtils.java:210)
        at org.apache.lucene.util.NumericUtils$2.accept(NumericUtils.java:493)
        at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:241)
        at org.apache.lucene.search.FieldCacheImpl$Uninvert.uninvert(FieldCacheImpl.java:308)
        at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:653)
        at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:212)
        at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:571)
        at org.apache.lucene.search.FieldCacheImpl$IntCache.createValue(FieldCacheImpl.java:619)
        at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:212)
        at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:571)
        at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:546)
        at org.apache.solr.joins.MaxInt.getMax(MaxInt.java:98)
        at org.apache.solr.joins.BitSetJoinQParserPlugin$BitSetJoinQuery.runJoin(BitSetJoinQParserPlugin.java:405)
        ... 31 more
{code}
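
A minimal sketch of the kind of up-front check suggested in item 3, so that an unsupported key type is rejected with a clear message before the field cache is read. The KeyType enum, the requireIntKey method and the field name id_l are all hypothetical and exist only for this example. A real check would have to look up the field's type in the Solr schema.

```java
public class JoinKeyTypeCheckSketch {
    // Hypothetical classification of a join field's type (not a Solr type).
    enum KeyType { INT, LONG, OTHER }

    // Rejects key types a bitset-addressed join cannot use, before any join work starts.
    static void requireIntKey(String fieldName, KeyType type) {
        if (type != KeyType.INT) {
            throw new IllegalArgumentException(
                "bjoin requires an int join field, but '" + fieldName + "' is " + type);
        }
    }

    public static void main(String[] args) {
        requireIntKey("id_i", KeyType.INT); // passes silently
        try {
            requireIntKey("id_l", KeyType.LONG);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```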


4) Make 'fromIndex' optional, as in the standard 'join'. Without it, hjoin currently throws a NullPointerException:
{code}
Caused by: java.lang.NullPointerException
        at org.apache.solr.joins.HashSetJoinQParserPlugin$HashSetJoinQuery.hashCode(HashSetJoinQParserPlugin.java:133)
        at org.apache.solr.search.QueryResultKey.<init>(QueryResultKey.java:50)
        at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1274)
        at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457)
        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
        ... 48 more
{code}
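
The trace above fails inside hashCode, which is called when the query result cache key is built. A minimal sketch of a null-tolerant hashCode/equals pair follows, assuming the failure comes from an unset fromIndex. JoinKey and its fields are hypothetical stand-ins, not the patch's HashSetJoinQuery.

```java
public class NullSafeJoinKeySketch {
    // Hypothetical stand-in for the fields a join query hashes on.
    static final class JoinKey {
        final String fromIndex; // null means "join within the same core" in this sketch
        final String fromField;
        final String toField;

        JoinKey(String fromIndex, String fromField, String toField) {
            this.fromIndex = fromIndex;
            this.fromField = fromField;
            this.toField = toField;
        }

        @Override
        public int hashCode() {
            // Treat a missing fromIndex as 0 instead of dereferencing it.
            int h = (fromIndex == null) ? 0 : fromIndex.hashCode();
            h = 31 * h + fromField.hashCode();
            h = 31 * h + toField.hashCode();
            return h;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof JoinKey)) return false;
            JoinKey other = (JoinKey) o;
            boolean sameIndex = (fromIndex == null)
                ? other.fromIndex == null
                : fromIndex.equals(other.fromIndex);
            return sameIndex
                && fromField.equals(other.fromField)
                && toField.equals(other.toField);
        }
    }

    public static void main(String[] args) {
        JoinKey sameCore = new JoinKey(null, "id_i", "id_i");
        JoinKey crossCore = new JoinKey("collection2", "id_i", "id_i");
        System.out.println(sameCore.equals(new JoinKey(null, "id_i", "id_i")));
        System.out.println(sameCore.equals(crossCore));
    }
}
```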

 Index details: 
   5 shards with 12 million documents each (11 million docs + 1 million ACLs).
   Both docs and ACLs are in the same core.
   Tested with Solr 4.2.1.

> Join Contrib
> ------------
>
>                 Key: SOLR-4787
>                 URL: https://issues.apache.org/jira/browse/SOLR-4787
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 4.2.1
>            Reporter: Joel Bernstein
>            Priority: Minor
>             Fix For: 4.8
>
>         Attachments: SOLR-4787-deadlock-fix.patch, 
> SOLR-4787-pjoin-long-keys.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, SOLR-4787.patch, 
> SOLR-4797-hjoin-multivaluekeys-nestedJoins.patch, 
> SOLR-4797-hjoin-multivaluekeys-trunk.patch
>
>
> This contrib provides a place where different join implementations can be 
> contributed to Solr. This contrib currently includes 3 join implementations. 
> The initial patch was generated from the Solr 4.3 tag. Because of changes in 
> the FieldCache API this patch will only build with Solr 4.2 or above.
> *HashSetJoinQParserPlugin aka hjoin*
> The hjoin provides a join implementation that filters results in one core 
> based on the results of a search in another core. This is similar in 
> functionality to the JoinQParserPlugin but the implementation differs in a 
> couple of important ways.
> The first way is that the hjoin is designed to work with int and long join 
> keys only. So, in order to use hjoin, int or long join keys must be included 
> in both the to and from core.
> The second difference is that the hjoin builds memory structures that are 
> used to quickly connect the join keys. So, the hjoin will need more memory 
> than the JoinQParserPlugin to perform the join.
> The main advantage of the hjoin is that it can scale to join millions of keys 
> between cores and provide sub-second response time. The hjoin should work 
> well with up to two million results from the fromIndex and tens of millions 
> of results from the main query.
> The hjoin supports the following features:
> 1) Both lucene query and PostFilter implementations. A *"cost"* > 99 will 
> turn on the PostFilter. The PostFilter will typically outperform the Lucene 
> query when the main query results have been narrowed down.
> 2) With the lucene query implementation there is an option to build the 
> filter with threads. This can greatly improve the performance of the query if 
> the main query index is very large. The "threads" parameter turns on 
> threading. For example, *threads=6* will use 6 threads to build the filter. 
> This will set up a fixed threadpool with six threads to handle all hjoin 
> requests. Once the threadpool is created the hjoin will always use it to 
> build the filter. Threading does not come into play with the PostFilter.
> 3) The *size* local parameter can be used to set the initial size of the 
> hashset used to perform the join. If this is set above the number of results 
> from the fromIndex then you can avoid hashset resizing, which improves 
> performance.
> 4) Nested filter queries. The local parameter "fq" can be used to nest a 
> filter query within the join. The nested fq will filter the results of the 
> join query. This can point to another join to support nested joins.
> 5) Full caching support for the lucene query implementation. The filterCache 
> and queryResultCache should work properly even with deep nesting of joins. 
> Only the queryResultCache comes into play with the PostFilter implementation 
> because PostFilters are not cacheable in the filterCache.
> The syntax of the hjoin is similar to the JoinQParserPlugin except that the 
> plugin is referenced by the string "hjoin" rather than "join".
> fq=\{!hjoin fromIndex=collection2 from=id_i to=id_i threads=6 
> fq=$qq\}user:customer1&qq=group:5
> The example filter query above will search the fromIndex (collection2) for 
> "user:customer1" applying the local fq parameter to filter the results. The 
> lucene filter query will be built using 6 threads. This query will generate a 
> list of values from the "from" field that will be used to filter the main 
> query. Only records from the main query, where the "to" field is present in 
> the "from" list will be included in the results.
> The solrconfig.xml in the main query core must contain the reference to the 
> hjoin.
> <queryParser name="hjoin" 
> class="org.apache.solr.joins.HashSetJoinQParserPlugin"/>
> And the join contrib lib jars must be registered in the solrconfig.xml.
>  <lib dir="../../../contrib/joins/lib" regex=".*\.jar" />
> After issuing the "ant dist" command from inside the solr directory the joins 
> contrib jar will appear in the solr/dist directory. Place the 
> solr-joins-4.*-.jar  in the WEB-INF/lib directory of the solr web application. 
> This will ensure that the top level Solr classloader loads these classes 
> rather than the core's classloader. 
> *BitSetJoinQParserPlugin aka bjoin*
> The bjoin behaves exactly like the hjoin but uses a BitSet instead of a 
> HashSet to perform the underlying join. Because of this the bjoin is much 
> faster and can provide sub-second response times on result sets of tens of 
> millions of records from the fromIndex and hundreds of millions of records 
> from the main query.
> But there are limitations to how the bjoin can be used. The bjoin treats the 
> join keys as addresses in a BitSet and uses the Lucene OpenBitSet 
> implementation which performs very well but is not sparse. So the BitSet 
> memory is dictated by the size of the join keys. For example a bitset with a 
> max join key of 200,000,000 will need 25 MB of memory. For this reason the 
> BitSet join does not support long join keys. In order to keep memory usage 
> down the join keys should also be packed at the low end, for example from 1 
> to 50,000,000. 
> Below is a sample bjoin:
> fq=\{!bjoin fromIndex=collection2 from=id_i to=id_i threads=6 
> fq=$qq\}user:customer1&qq=group:5
> To register the bjoin the solrconfig.xml in the main query core must contain 
> the reference to the bjoin.
> <queryParser name="bjoin" 
> class="org.apache.solr.joins.BitSetJoinQParserPlugin"/>
> *ValueSourceJoinParserPlugin aka vjoin*
> The third implementation is the ValueSourceJoinParserPlugin aka "vjoin". 
> This implements a ValueSource function query that can return a value from a 
> second core based on join keys and limiting query. The limiting query can be 
> used to select a specific subset of data from the join core. This allows 
> customer specific relevance data to be stored in a separate core and then 
> joined in the main query.
> The vjoin is called using the "vjoin" function query. For example:
> bf=vjoin(fromCore, fromKey, fromVal, toKey, query)
> This example shows "vjoin" being called by the edismax boost function 
> parameter. This example will return the "fromVal" from the "fromCore". The 
> "fromKey" and "toKey" are used to link the records from the main query to the 
> records in the "fromCore". The "query" is used to select a specific set of 
> records to join with in fromCore.
> Currently the fromKey and toKey must be longs but this will change in future 
> versions. Like the pjoin, the "join" SolrCache is used to hold the join 
> memory structures.
> To configure the vjoin you must register the ValueSource plugin in the 
> solrconfig.xml as follows:
> <valueSourceParser name="vjoin" 
> class="org.apache.solr.joins.ValueSourceJoinParserPlugin" />



--
This message was sent by Atlassian JIRA
(v6.2#6252)
