How to Use Lucene.NET for Searching Email Address Domains

Searching for email addresses with Lucene.NET can be tricky because of how its analyzers split text into tokens. If you have ever tried to search for a specific email domain (like @gmail.com) and ended up with errors or no results, you’re not alone. This post walks through a custom tokenizer and analyzer that let you query email addresses by domain.

The Problem

When attempting to search email addresses in Lucene.NET, you may face the following issues:

  • Leading wildcard error: A wildcard query like *@gmail.com fails to parse, because by default the query parser does not accept a wildcard as the first character of a term.
  • Whole-token matching: A query for @gmail.com will not return email addresses like foo@gmail.com, because the address is typically indexed as a single token and Lucene matches whole tokens, not substrings.

These limitations can be frustrating, especially when you want to search specifically by domain.

The Solution

To tackle this problem, we can customize the way Lucene analyzes and tokenizes email addresses. This involves creating two components:

  1. Tokenizer: Splits text into tokens. Ours breaks on whitespace and on the @ symbol.
  2. Analyzer: Wraps the tokenizer and supplies the resulting token stream to Lucene at both index time and query time.

Step 1: Create a Custom Tokenizer

We’ll start by creating a custom tokenizer called WhitespaceAndAtSymbolTokenizer. This tokenizer treats both whitespace and the @ symbol as token boundaries, so an address like foo@gmail.com is indexed as two tokens: foo and gmail.com.

Here’s the source code for your custom tokenizer:

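// Namespaces needed by the tokenizer and analyzer classes.
using System.IO;
using Lucene.Net.Analysis;

class WhitespaceAndAtSymbolTokenizer : CharTokenizer
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    // Returns true for characters that belong inside a token.
    protected override bool IsTokenChar(char c)
    {
        // Whitespace characters and the @ symbol end the current token.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}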
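
To see what this splitting rule does to an address, it can be imitated without Lucene.NET at all. The following is a minimal sketch, not the library's implementation: AtSymbolSplitSketch and Split are hypothetical names, and the sketch leaves out details the real CharTokenizer also handles, such as token offsets and its maximum token length.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of Lucene.NET. It imitates only the
// splitting rule of the custom tokenizer.
static class AtSymbolSplitSketch
{
    // Same predicate as IsTokenChar in the custom tokenizer.
    static bool IsTokenChar(char c)
    {
        return !(char.IsWhiteSpace(c) || c == '@');
    }

    // Groups runs of consecutive token characters into tokens.
    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A boundary character ends the token collected so far.
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

Calling AtSymbolSplitSketch.Split("Contact foo@gmail.com today") yields the tokens Contact, foo, gmail.com and today. The domain becomes a token of its own, which is what makes it searchable.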

Step 2: Create a Custom Analyzer

Next, we’ll implement the WhitespaceAndAtSymbolAnalyzer, which uses our custom tokenizer:

internal class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}

Step 3: Rebuild Your Index

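After setting up the custom tokenizer and analyzer, you’ll need to recreate your index with the new analyzer, because documents indexed earlier still contain the old, unsplit tokens. Below is how you can do this:

IndexWriter index = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer());
index.AddDocument(myDocument);
// Close the writer so the added documents are flushed and visible to searchers.
index.Close();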

Step 4: Perform Searches

When performing searches, use the same custom analyzer so that the query text is tokenized the same way as the indexed text. Here’s a simple code snippet that demonstrates how to query using the analyzer:

IndexSearcher searcher = new IndexSearcher(indexDirectory);
Query query = new QueryParser("TheFieldNameToSearch", new WhitespaceAndAtSymbolAnalyzer()).Parse("@gmail.com");
Hits hits = searcher.Search(query);
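
Why does a query for @gmail.com now match? A toy model may help. The sketch below is an illustration only: ToyDomainIndex is a hypothetical class, it is not how Lucene.NET stores terms, and its Analyze method approximates the custom tokenizer by splitting on a few common whitespace characters and @. The point it shows is that the query text goes through the same splitting as the indexed text, so @gmail.com becomes the term gmail.com and is looked up as a whole token.

```csharp
using System;
using System.Collections.Generic;

// Toy in-memory index for illustration only (hypothetical, not Lucene.NET).
class ToyDomainIndex
{
    // Approximation of the custom tokenizer's boundary characters.
    private static readonly char[] Boundaries = { ' ', '\t', '\r', '\n', '@' };

    // Maps each term to the ids of the documents that contain it.
    private readonly Dictionary<string, List<int>> postings =
        new Dictionary<string, List<int>>();

    private static string[] Analyze(string text)
    {
        return text.Split(Boundaries, StringSplitOptions.RemoveEmptyEntries);
    }

    // Index time: split the text and record each term for the document.
    public void Add(int docId, string text)
    {
        foreach (string term in Analyze(text))
        {
            List<int> docs;
            if (!postings.TryGetValue(term, out docs))
            {
                docs = new List<int>();
                postings[term] = docs;
            }
            if (!docs.Contains(docId))
            {
                docs.Add(docId);
            }
        }
    }

    // Query time: split the query the same way, then look up whole terms.
    public List<int> Search(string queryText)
    {
        var result = new List<int>();
        foreach (string term in Analyze(queryText))
        {
            List<int> docs;
            if (!postings.TryGetValue(term, out docs))
            {
                continue;
            }
            foreach (int id in docs)
            {
                if (!result.Contains(id))
                {
                    result.Add(id);
                }
            }
        }
        return result;
    }
}
```

After adding document 1 with foo@gmail.com and document 2 with bar@example.org, Search("@gmail.com") returns only document 1. Note that the lookup is case-sensitive: the analyzer in this post does not lowercase tokens, so a query for @Gmail.com would not match unless lowercasing is added at both index and query time.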

Conclusion

By using the WhitespaceAndAtSymbolTokenizer and WhitespaceAndAtSymbolAnalyzer, you can search for email address domains in Lucene.NET. Because the domain is indexed as its own token, a query for @gmail.com becomes an ordinary term lookup, with no leading wildcard required.

The key is to use the same analyzer when indexing and when searching. If you have any further questions or need assistance with your Lucene.NET implementation, feel free to reach out in the comments below!