Lucene.Net Phrase Suggestion

Posted on March 13, 2012

This post is all about Lucene and Lucene.Net. Lucene is an awesome project, but it can be very confusing. Bert Williams has an awesome set of tutorials on how to get Lucene up and running. I used his tutorials to get my basic Lucene setup working, and if you’ve never heard of Lucene before I recommend checking out his posts first!

Lucene.Net is a .NET port of the original Lucene project, which was written in Java. It’s a very quick, very awesome full-text search index. It comes jam-packed with features, but being a port it obviously lags a bit behind the original Java version. One of the very cool things in Lucene is the ability to store an item in the index and then break it down into “shingles”, i.e. different combinations of words and phrases. That way, if someone searches for the phrase “eye or londor” it might find nothing, but Lucene can quickly suggest the phrase “eye of london”, à la Google.
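To make the idea concrete, here’s a small sketch (plain Java, no Lucene involved, and the method name is mine) of what a shingle filter conceptually produces for a phrase:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {
    // Illustrative only: builds every run of 2..maxSize consecutive words,
    // which is roughly what Lucene's ShingleFilter emits as extra tokens.
    public static List<String> shingles(String text, int maxSize) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<String>();
        for (int size = 2; size <= maxSize; size++) {
            for (int start = 0; start + size <= words.length; start++) {
                result.add(String.join(" ",
                        Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "the eye of london" yields: the eye, eye of, of london,
        // the eye of, eye of london, the eye of london
        System.out.println(shingles("the eye of london", 4));
    }
}
```

Once all of those word runs are sitting in a spell index, a near-miss query like “eye or londor” has a close fuzzy match to suggest.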

Unfortunately, the shingle filter is one of the items the Lucene.Net engine doesn’t have (yet). This means Lucene.Net can’t break a sentence into its various phrases and store them in an index. You can create a spell index based on single words, but phrase suggestions won’t work.

However, any Lucene index created by the project can be read by either the Java version or the .NET port. This means you can create your indexes in Lucene.Net and then write a small Java app that shingles those terms into a temporary index. That temporary index can then be merged back into your main Lucene.Net index, and all you have to do is search it for suggestions. Which is exactly what I did.

This is the code I used to do that. To get it working, follow the steps laid out in Bert Williams’ tutorials above, until you have completed the steps laid out in “alternatives and did you mean”, which is the fourth or fifth article in the series. Once you’ve set up a spelling index with single words in it, stop and run my code. It will break the descriptions down into shingles and add those shingles to your existing spelling index. From there on, continue as normal.

OK, so first off, here is the Java code, which creates a separate, temporary shingle index.

/**
 * @(#)DirectorySearchShingle.java
 *
 * DirectorySearchShingle application
 *
 * @author
 * @version 1.00 2011/2/16
 */
import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class DirectorySearchShingle {

	 public static void main(String[] args)
	 {
	 	try
	 	{
	 	 Analyzer temp = new WhitespaceAnalyzer();
		 //wrap the base analyzer so it also emits shingles of up to 4 tokens
		 Analyzer analyzer = new ShingleAnalyzerWrapper(temp, 4);
		 shingleThings(analyzer);
	 	}
	 	catch (Exception ex)
	 	{
	 		//don't swallow errors silently
	 		ex.printStackTrace();
	 	}
	 }

	 public static void shingleThings(Analyzer analyzer) throws Exception
	 {
	 	shingleDirectory(analyzer);
	 }

	 public static void shingleDirectory(Analyzer analyzer) throws Exception
	 {

		//this is the Lucene index that I have already imported stuff into
	 	String dirIndexPathFrom = "C:\\LunceneIndex\\Directory";

	 	//this is where I'm storing my shingle results.
	 	//this is a temporary index; I will later take it and merge it with my already existing
	 	//single-word spelling index => http://www.devatwork.nl/articles/lucenenet/alternatives-did-you-mean-lucenenet/
	 	String dirIndexPathTo = "C:\\LunceneIndex\\ShingleDirectory";

	 	//this is a list of fields in the index I want to break down into shingles.
	 	//so if the original field contains "the cat jumped"
	 	//the shingles (up to the max size of 4 set above) are:
	 	//=> the cat
		//=> cat jumped
		//=> the cat jumped
		String[] fieldsToIndex = { "Company" };

	 	setUpSearcher(analyzer, dirIndexPathFrom, dirIndexPathTo, fieldsToIndex);
	 	//showTerm(dirIndexPathTo); //debug helper that dumps the terms; not shown in this listing
	 }

	 public static void setUpSearcher(Analyzer analyzer, String indexPathToReadFrom, String indexPathToWriteTo, String[] fieldsToImport) throws Exception
	 {
	 	try
	 	{
	 		IndexWriter writer = new IndexWriter(FSDirectory.open(new File(indexPathToWriteTo)), analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

	 		IndexReader r = IndexReader.open(FSDirectory.open(new File(indexPathToReadFrom)), true); // only searching, so read-only=true

	 		int num = r.numDocs();

	 		for (int i = 0; i < num; i++)
	 		{
	 			if (!r.isDeleted(i))
	 			{
	 				Document d = r.document(i);

	 				for (int jj = 0; jj < fieldsToImport.length; jj++)
	 				{
	 					//re-analyzing the stored field with the shingle analyzer breaks it into phrases
	 					Document doc = new Document();
	 					doc.add(new Field("word", d.get(fieldsToImport[jj]), Field.Store.YES, Field.Index.ANALYZED));
	 					writer.addDocument(doc);
	 				}
	 			}
	 		}

	 		r.close();
	 		writer.close();
	 		System.out.println(num);
	 	}
	 	catch (Exception ex)
	 	{
	 		ex.printStackTrace();
	 	}
	 }
}

Now that I have created my temporary shingle index, I run some C# code to add these shingles to my spelling index.

using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

static class importLuceneShingles
   {
       public const System.String F_WORD = "word";
       private static readonly Term F_WORD_TERM = new Term(F_WORD);
 
       public static void importShingles(String indexWritePath, String indexReadPath)
       {
           IndexWriter writer = new IndexWriter(indexWritePath, new WhitespaceAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
           IndexReader r = IndexReader.Open(indexReadPath, true); // only searching, so read-only=true
           IndexSearcher projectTenderTerms = new IndexSearcher(indexWritePath);
 
 
           var temper = r.Terms();
           int i = 0;
 
           while (temper.Next())
           {
               i++;
               if (i % 100000 == 0)
               {
                   //Console.Clear();
                   Console.WriteLine(i);
               }
               String word = temper.Term().Text().ToLower().Trim();
 
 
               if (projectTenderTerms.DocFreq(F_WORD_TERM.CreateTerm(word)) > 0)
               {
                   // if the word already exist in the gramindex
                   continue;
               }
 
               int len = word.Length;
               if (len < 3)
               {
                   continue; // too short we bail but "too long" is fine...
               }
 
               // ok index the word
               Document doc = CreateDocument(word, GetMin(len), GetMax(len));
               writer.AddDocument(doc);
           }
 
           r.Close();
           writer.Optimize();
           writer.Close();
           Console.WriteLine("all the imports are complete, hope your face isn't broken");
 
 
       }
 
       private static int GetMin(int l)
       {
           if (l > 5)
           {
               return 3;
           }
           if (l == 5)
           {
               return 2;
           }
           return 1;
       }
 
       private static int GetMax(int l)
       {
           if (l > 5)
           {
               return 4;
           }
           if (l == 5)
           {
               return 3;
           }
           return 2;
       }
 
       private static Document CreateDocument(System.String text, int ng1, int ng2)
       {
           Document doc = new Document();
           doc.Add(new Field(F_WORD, text, Field.Store.YES, Field.Index.NOT_ANALYZED)); // orig term
           AddGram(text, doc, ng1, ng2);
           return doc;
       }
 
       private static void AddGram(System.String text, Document doc, int ng1, int ng2)
       {
           int len = text.Length;
           for (int ng = ng1; ng <= ng2; ng++)
           {
               System.String key = "gram" + ng;
               System.String end = null;
               for (int i = 0; i < len - ng + 1; i++)
               {
                    System.String gram = text.Substring(i, ng); // ng characters starting at position i
                   doc.Add(new Field(key, gram, Field.Store.NO, Field.Index.NOT_ANALYZED));
                   if (i == 0)
                   {
                       doc.Add(new Field("start" + ng, gram, Field.Store.NO, Field.Index.NOT_ANALYZED));
                   }
                   end = gram;
               }
               if (end != null)
               {
                   // may not be present if len==ng1
                   doc.Add(new Field("end" + ng, end, Field.Store.NO, Field.Index.NOT_ANALYZED));
               }
           }
       }
   }

The importShingles method takes two arguments: indexWritePath, which is your existing spelling index, and indexReadPath, which is the path to the index we just created in the Java application. Once this method has run, that’s it, you’ve successfully stored all of your shingles in your spelling index. Your suggestions will now work with both single words and phrases. Just remember to re-run the shingle creation and importing whenever you add a new object to your main index.
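For reference, the gram fields that CreateDocument/AddGram build follow the standard spell-checker layout: for each n between GetMin and GetMax, the word is sliced into overlapping character n-grams, with the first gram duplicated into a “startN” field and the last into an “endN” field. Here’s a minimal plain-Java sketch of that slicing (the method name is mine, not from the Lucene source):

```java
import java.util.ArrayList;
import java.util.List;

public class GramSketch {
    // Illustrative only: the overlapping character n-grams that AddGram
    // indexes under the "gram" + n field for a single value of n.
    public static List<String> grams(String word, int n) {
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            result.add(word.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        // For "london" with n = 3 the grams are lon, ond, ndo, don;
        // "lon" would also be stored as start3 and "don" as end3.
        System.out.println(grams("london", 3));
    }
}
```

Searching those gram fields with a misspelled word finds entries that share most of their character n-grams, which is how the near-miss matching works for shingled phrases just as it does for single words.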