Using ASCIIFoldingFilter or any other additional TokenFilter with Lucene


3 Aug 2010

Here are instructions on how to incorporate different TokenFilters in your Lucene application.

I have a Lucene index that contains a lot of French words. Sometimes people search with an accent (Lucéne) and sometimes without one (Lucene), and I’d like to treat these searches exactly the same. Lucene already ships with a filter for this called ASCIIFoldingFilter. It converts é to e, and does the same for all the other accented characters. The conversion only affects the terms in the index, not the stored content of the document, so your documents and search results still contain the properly accented characters.
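
To make the folding idea concrete, here is a rough sketch in plain Java that does not use Lucene at all. It strips accents with java.text.Normalizer, which is only an approximation: ASCIIFoldingFilter uses its own mapping table and handles characters this trick misses (ß and æ, for example). The class and method names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: approximates accent folding.
class AccentFoldingSketch {

    // Split each accented character into base letter + combining mark (NFD),
    // then drop the combining marks so only the base letters remain.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute; escapes avoid source-encoding surprises.
        System.out.println(fold("Luc\u00e9ne")); // prints "Lucene"
        System.out.println(fold("Lucene"));      // prints "Lucene"
    }
}
```

Both spellings end up as the same term, which is exactly what we want the index to see.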

But I had one heck of a time trying to figure out how to actually USE the ASCIIFoldingFilter. So here’s how it’s done:

Step 1 : Create your own custom analyzer and add all the TokenFilters you’d like to use:

package com.example.lucene;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MyCustomAnalyzer extends Analyzer {

	// Holds the tokenizer and the filter chain so they can be reused per thread.
	private static final class SavedStreams {
		Tokenizer tokenizer;
		TokenStream stream;
	}

	// Reusable variant: builds the chain once, then only resets the reader.
	@Override
	public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
		SavedStreams streams = (SavedStreams) getPreviousTokenStream();

		if (streams == null) {
			streams = new SavedStreams();
			setPreviousTokenStream(streams);

			streams.tokenizer = new StandardTokenizer(Version.LUCENE_29, reader);
			// Fold accents first, then apply the standard and lowercase filters.
			streams.stream = new ASCIIFoldingFilter(streams.tokenizer);
			streams.stream = new StandardFilter(streams.stream);
			streams.stream = new LowerCaseFilter(streams.stream);
		} else {
			streams.tokenizer.reset(reader);
		}

		return streams.stream;
	}

	// Non-reusable variant: builds the same chain from scratch on every call.
	@Override
	public TokenStream tokenStream(String fieldName, Reader reader) {
		Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_29, reader);
		TokenStream stream = new ASCIIFoldingFilter(tokenizer);
		stream = new StandardFilter(stream);
		stream = new LowerCaseFilter(stream);

		return stream;
	}
}

I’m using the StandardTokenizer, the StandardFilter and the LowerCaseFilter along with the ASCIIFoldingFilter, but you can add any TokenFilter you’d like. Oh, and for the love of jebus, make sure you put additional filters BEFORE the StandardFilter, as in the code above.

Step 2 : Use your custom analyzer when creating the index:

IndexWriter writer = new IndexWriter(INDEX_DIR, new MyCustomAnalyzer(), true);
//insert standard index-building code here

Step 3 : Use your custom analyzer when doing the search:

QueryParser parser = new MultiFieldQueryParser(new String[] {"title","description","keywords"}, new MyCustomAnalyzer());
//insert standard Lucene search code here
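
Steps 2 and 3 only work as a pair: the terms written to the index and the terms in the query have to go through the same normalization, otherwise they will never match. Here is a toy illustration of that in plain Java, with no Lucene involved. ToyFoldedIndex and its normalize method are hypothetical stand-ins for the index and the analyzer, and the accent stripping is the java.text.Normalizer approximation rather than ASCIIFoldingFilter itself.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index, only meant to show index-time vs query-time analysis.
class ToyFoldedIndex {

    // Original documents, stored untouched (accents intact).
    private final List<String> documents = new ArrayList<>();
    // Normalized term -> ids of the documents that contain it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Stand-in for the analyzer: strip accents, then lowercase.
    static String normalize(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Index time: only the normalized terms go into the postings.
    void add(String document) {
        int id = documents.size();
        documents.add(document);
        for (String word : document.split("\\s+")) {
            List<Integer> ids =
                postings.computeIfAbsent(normalize(word), k -> new ArrayList<>());
            if (!ids.contains(id)) {
                ids.add(id);
            }
        }
    }

    // Search time: the query term gets the same normalization before lookup.
    List<String> search(String queryTerm) {
        List<String> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(normalize(queryTerm), new ArrayList<>())) {
            hits.add(documents.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyFoldedIndex index = new ToyFoldedIndex();
        index.add("Caf\u00e9 de Paris");                  // stored with its accent
        System.out.println(index.search("cafe"));        // one hit, accent preserved
        System.out.println(index.search("CAF\u00c9"));   // same hit
    }
}
```

Searching with or without the accent finds the same document, and what comes back is the original text with its accent still in place.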

And you are done!
