I'm attaching a patch to fix several issues with indexed search.

Issue 1: large text fields weren't getting indexed due to a low MAX_CONV_SIZE
Resolution: change MAX_CONV_SIZE to 1024 * 1024, and add call to writer to boost its maximum field size
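
One thing to watch with Issue 1: at 1024 * 1024 elements, a local wchar_t array can take several megabytes of stack (wchar_t is commonly 4 bytes on Linux), which may be more than some threads or platforms allow. Here is a minimal sketch of running the conversion through a heap buffer instead. It is not part of the patch. The names utf8ToWide and convertForIndex are hypothetical, and utf8ToWide is only a simplified stand-in for lucene_utf8towcs (no validation of malformed input) so that the example builds without CLucene.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Same limit the patch uses for the conversion buffers.
const std::size_t MAX_CONV_SIZE = 1024 * 1024;

// Hypothetical stand-in for lucene_utf8towcs(): decodes at most maxChars
// characters of UTF-8 into dst and NUL-terminates. Assumes well-formed input.
// Where wchar_t is 16 bits, code points above 0xFFFF will not fit.
std::size_t utf8ToWide(wchar_t *dst, const char *src, std::size_t maxChars) {
    const unsigned char *p = reinterpret_cast<const unsigned char *>(src);
    std::size_t n = 0;
    while (*p && n < maxChars) {
        unsigned long cp;
        int extra;  // number of continuation bytes expected
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        ++p;
        while (extra-- > 0 && (*p & 0xC0) == 0x80)
            cp = (cp << 6) | (*p++ & 0x3F);
        dst[n++] = static_cast<wchar_t>(cp);
    }
    dst[n] = L'\0';
    return n;
}

// Hypothetical helper: converts one field's text using a heap buffer.
// n bytes of UTF-8 never produce more than n wide characters, so the
// buffer is sized to the input, capped at MAX_CONV_SIZE.
std::wstring convertForIndex(const std::string &utf8) {
    const std::size_t maxChars = std::min(utf8.size(), MAX_CONV_SIZE);
    std::vector<wchar_t> buf(maxChars + 1);  // heap, not stack
    const std::size_t n = utf8ToWide(&buf[0], utf8.c_str(), maxChars);
    return std::wstring(&buf[0], n);
}
```

With this shape, an entry longer than the old 2047-character limit comes through whole, and text past MAX_CONV_SIZE characters is still cut off, as in the patch.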

Issue 2: search causes a segfault when searching for stop words
Resolution: set the analyzer's stop words to NULL for both index creation and search. Possibly this would only have to be done for search, with stop words left enabled during indexing to keep the index smaller.

Issue 3: indexing causes a segfault *after indexing* when the module location isn't writable.
Resolution: check the return value of FileMgr::createParent(target + "/dummy"); if the return value is -1, abort indexing

In addition, this patch adds fields for footnotes, morphology, and headers. I *really* would like to see this added to the default indexing. The reason is that with indexed search it is possible to combine fields in one search, something that SWORD attribute search doesn't allow (AFAIK). And indexed search is much faster, of course.

My patch only covers one of the three spots where this would apparently need to be added. I didn't understand why there was so much duplicated code, nor was I entirely comfortable with the code I had written, so I didn't expand it to cover all cases. It appears that the code for adding fields like strongs is the same in 3 different spots. Surely this could be condensed somehow?

I really would like to see the first 3 issues fixed immediately (i.e., before the next release). Issue 1 makes most genbook indexed search pointless, while Issues 2 and 3 have both been reported as issues against Xiphos. Of course, we can't control the segfault in either case.

As for the extra fields, that will need some extra work, but I feel it's really important as well. At some point, I am going to redo the search functionality in Xiphos, and my plan is to implement indexing myself if these fields aren't in SWORD by then.

I have been meaning to address these issues for some time, but hadn't gotten around to it yet. The bug report we had forced the issue.

While we're at it, I'd like to bring up two more issues.

1. If the module location isn't writable, there isn't a way for the user to create an index.
I would like to see indexes created somewhere else in this case, eg ~/.sword/indexes. I believe BT does something like this already.

2. We currently have no way of notifying the user if the indexes are no longer valid, or if they should be updated. I would like to see a versioning scheme for indexes. For example, with the changes here, and the changes for Hebrew search, all Hebrew indexes previously created are now useless. How do we tell the user that he needs to re-create the index? Along the same lines, all genbook indexes, and many commentary indexes are incorrect. With the next release of SWORD, hopefully with this issue resolved, it would be nice to be able to notify the user that the indexes are now out-of-date or incorrect and need to be rebuilt.

Finally, I would like to point out a great tool for examining lucene/clucene indexes. You can get it here:
http://www.getopt.org/luke/

Matthew

PS I'm going to send this without the attachment. I'll send the patch later, but here it is below:

 #ifdef USELUCENE
     if (searchType == -4) { // lucene
         //Buffers for the wchar<->utf8 char* conversion
-        const unsigned short int MAX_CONV_SIZE = 2047;
+        const unsigned int MAX_CONV_SIZE = 1024 * 1024;
         wchar_t wcharBuffer[MAX_CONV_SIZE + 1];
         char utfBuffer[MAX_CONV_SIZE + 1];
@@ -510,10 +510,11 @@
             ir = IndexReader::open(target);
             is = new IndexSearcher(ir);
             (*percent)(10, percentUserData);
-
-            standard::StandardAnalyzer analyzer;
+
+            const TCHAR* stop_words[] = { NULL };
+            standard::StandardAnalyzer *analyzer = new standard::StandardAnalyzer( (const TCHAR**)stop_words );
             lucene_utf8towcs(wcharBuffer, istr, MAX_CONV_SIZE); //TODO Is istr always utf8?
-            q = QueryParser::parse(wcharBuffer, _T("content"), &analyzer);
+            q = QueryParser::parse(wcharBuffer, _T("content"), analyzer);
             (*percent)(20, percentUserData);
             h = is->search(q);
             (*percent)(80, percentUserData);
@@ -1026,21 +1027,27 @@
     IndexWriter *coreWriter = NULL;
     IndexWriter *fsWriter = NULL;
     Directory *d = NULL;
-
-    standard::StandardAnalyzer *an = new standard::StandardAnalyzer();
+    const unsigned int MAX_CONV_SIZE = 1024 * 1024;
+
+    const TCHAR* stop_words[] = { NULL };
+    standard::StandardAnalyzer *an = new standard::StandardAnalyzer( (const TCHAR**)stop_words );
     SWBuf target = getConfigEntry("AbsoluteDataPath");
     bool includeKeyInSearch = getConfig().has("SearchOption", "IncludeKeyInSearch");
     char ch = target.c_str()[strlen(target.c_str())-1];
     if ((ch != '/') && (ch != '\\'))
         target.append('/');
     target.append("lucene");
-    FileMgr::createParent(target+"/dummy");
+    int iswritable = FileMgr::createParent(target+"/dummy");
+    if (iswritable == -1)
+        return -1;
     ramDir = new RAMDirectory();
     coreWriter = new IndexWriter(ramDir, an, true);
+    coreWriter->setMaxFieldLength(MAX_CONV_SIZE);
+
     char perc = 1;
     VerseKey *vkcheck = 0;
     vkcheck = SWDYNAMIC_CAST(VerseKey, key);
@@ -1066,8 +1073,11 @@
     SWBuf proxBuf;
     SWBuf proxLem;
     SWBuf strong;
+    SWBuf morph;
+    SWBuf footnote;
+    SWBuf heading;
-    const short int MAX_CONV_SIZE = 2047;
+
     wchar_t wcharBuffer[MAX_CONV_SIZE + 1];
     char err = Error();
@@ -1104,8 +1114,15 @@
         AttributeTypeList::iterator words;
         AttributeList::iterator word;
         AttributeValue::iterator strongVal;
+        AttributeValue::iterator morphVal;
+        AttributeValue::iterator headings;
+        AttributeTypeList::iterator footnotes;
+        AttributeList::iterator footList;
+        AttributeValue::iterator footVal;
+
         strong="";
+        morph="";
         words = getEntryAttributes().find("Word");
         if (words != getEntryAttributes().end()) {
             for (word = words->second.begin();word != words->second.end(); word++) {
@@ -1124,10 +1141,38 @@
                         strong.append(strongVal->second);
                         strong.append(' ');
                     }
+                    tmp = "Morph";
+                    morphVal = word->second.find(tmp);
+                    if (morphVal != word->second.end()){
+                        morph.append(morphVal->second);
+                        morph.append(' ');
+                    }
                 }
             }
         }
+        footnote="";
+        footnotes = getEntryAttributes().find("Footnote");
+        if (footnotes != getEntryAttributes().end()) {
+            for (footList = footnotes->second.begin(); footList != footnotes->second.end(); footList++) {
+                SWBuf tmp = "body";
+                footVal = footList->second.find(tmp);
+                if (footVal != footList->second.end()) {
+                    footnote.append(footVal->second);
+                    footnote.append(' ');
+                }
+            }
+        }
+
+        heading="";
+        for (headings = getEntryAttributes()["Heading"]["Preverse"].begin();
+             headings != getEntryAttributes()["Heading"]["Preverse"].end();
+             headings++) {
+            heading.append(headings->second);
+            heading.append(' ');
+        }
+
+
+
         lucene_utf8towcs(wcharBuffer, keyText, MAX_CONV_SIZE); //keyText must be utf8
 //      doc->add( *(new Field("key", wcharBuffer, Field::STORE_YES | Field::INDEX_TOKENIZED)));
         doc->add( *Field::Text(_T("key"), wcharBuffer ) );
@@ -1149,6 +1194,21 @@
             //printf("setting fields (%s).\ncontent: %s\nlemma: %s\n", (const char *)*key, content, strong.c_str());
         }
+        if (morph.length() > 0) {
+            lucene_utf8towcs(wcharBuffer, morph, MAX_CONV_SIZE);
+            doc->add( *Field::UnStored(_T("morph"), wcharBuffer) );
+        }
+
+        if (footnote.length() > 0) {
+            lucene_utf8towcs(wcharBuffer, footnote, MAX_CONV_SIZE);
+            doc->add( *Field::UnStored(_T("footnote"), wcharBuffer) );
+        }
+
+        if (heading.length() > 0) {
+            lucene_utf8towcs(wcharBuffer, heading, MAX_CONV_SIZE);
+            doc->add( *Field::UnStored(_T("heading"), wcharBuffer) );
+        }
+
         //printf("setting fields (%s).\n", (const char *)*key);
         //fflush(stdout);
     }

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page