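Just keep it simple. Index the entire source file, with one source file as one document. While indexing, preserve the dot (.), hyphen (-) and other special characters so that qualified names stay intact as single terms. You could use a whitespace analyzer (Lucene's WhitespaceAnalyzer, for example), which splits tokens only on whitespace.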
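To illustrate the idea, here is a rough sketch in plain Python rather than Lucene. It only mimics what whitespace-only tokenization does to qualified names; the file names, the sample contents and the helper names (whitespace_tokens, build_index) are made up for the example.

```python
from collections import defaultdict


def whitespace_tokens(text):
    """Split on whitespace only, so '.' and '-' stay inside each token."""
    return text.split()


def build_index(documents):
    """Map each token to the set of document names that contain it.

    One source file is treated as one document, as suggested above.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for token in whitespace_tokens(text):
            index[token].add(name)
    return index


# Hypothetical sample files, not taken from any real package.
docs = {
    "A.hs": "import Data.Map\nfoo = Data.Map.insert k v m",
    "B.hs": "bar = my-package-Data.Map.insert k v m",
}
index = build_index(docs)

# Look up a module-qualified name as a single term.
matches = sorted(index["Data.Map.insert"])
```

Note that with this scheme the bare name insert is not a term on its own, so unqualified look-ups would need the shorter forms to be indexed as well.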
I hope it helps.

Regards
Aditya
www.findbestopensource.com


On Wed, Jun 4, 2014 at 3:29 PM, Johan Tibell <johan.tib...@gmail.com> wrote:

> The majority of queries will be look-ups of functions/types by fully
> qualified name. For example, the query [Data.Map.insert] will find the
> definition and all uses of the `insert` function defined in the `Data.Map`
> module. The corpus is all Haskell open source code on hackage.haskell.org.
>
> Being able to support qualified name queries is the main benefit of
> indexing the output of the compiler (which has resolved unqualified names
> to qualified names) rather than using simple text-based indexing.
>
> There are three levels of name qualification I want to support in queries:
>
>  * Unqualified: myFunction
>  * Module qualified: MyModule.myFunction
>  * Package and module qualified: mypackage-MyModule.myFunction
>
> I expect the middle one to be used the most. The last form is sometimes
> needed for disambiguation and the first is nice to support as a shorthand
> when the function name is unlikely to be ambiguous.
>
> For scoring I'd like to have a couple of attributes available. The most
> important one is whether a term represents a use site or a definition site.
> This would allow the definition of a function to appear as the first search
> result.
>
> Is this precise enough? Naturally the scope will grow over time, but this
> is the core of what I'm trying to do.
>
> -- Johan
>
>
> On Wed, Jun 4, 2014 at 8:02 AM, Aditya <findbestopensou...@gmail.com>
> wrote:
>
> > Hi Johan,
> >
> > How do you want to search? What is your search requirement? You need
> > to index according to that. You could check DuckDuckGo or GitHub code
> > search.
> >
> > The easiest approach would be to have a parser which reads each source
> > file and indexes it as a single document. When you search, you will
> > have a single search field which searches the index and retrieves the
> > results. The search field accepts any text in the source file. It could
> > be a function name, class name, comment, variable, etc.
> >
> > Another approach is to have different search fields for functions,
> > classes, packages, etc. You need to parse the file, identify comments,
> > function names, class names, etc. and index each in a separate field.
> >
> >
> > Regards
> > Aditya
> > www.findbestopensource.com
> >
> >
> > On Wed, Jun 4, 2014 at 7:02 AM, Johan Tibell <johan.tib...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I'd like to index (Haskell) source code. I've run the source code
> > > through a compiler (GHC) to get rich information about each token
> > > (its type, fully qualified name, etc) that I want to index (and later
> > > use when ranking).
> > >
> > > I'm wondering how to approach indexing source code. I can see two
> > > possible approaches:
> > >
> > >  * Create a file containing all the metadata and write a custom
> > > tokenizer/analyzer that processes the file. The file could use a
> > > simple line-based format:
> > >
> > > myFunction,1:12-1:22,my-package,defined-here,more-metadata
> > > myFunction,5:11-5:21,my-package,used-here,more-metadata
> > > ...
> > >
> > > The tokenizer would use CharTermAttribute to write the function name,
> > > OffsetAttribute to write the source span, etc.
> > >
> > >  * Use an IndexWriter to create a Document directly, as done here:
> > >
> > > http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3
> > >
> > > I'm new to Lucene so I can't quite tell which approach is more likely
> > > to work well. Which way would you recommend?
> > >
> > > Other things I'd like to do that might influence the answer:
> > >
> > >  - Index several tokens at the same position, so I can index both the
> > > fully qualified name (e.g. module.myFunction) and unqualified name
> > > (e.g. myFunction) for a term.
> > >
> > > -- Johan
> > >
> >