2 February 2018

IR: From Lucene To Elasticsearch – Part II


In the previous part of the series, we discussed how indexing speeds up information retrieval. In this part, we will apply that indexing process as we get to know Apache Lucene, an IR library that powers many great technologies such as Elasticsearch and Apache Solr.

Apache Lucene is an open-source library for indexing and querying data. It is written entirely in Java; being open source and cross-platform has allowed the library to be adopted very quickly by open source communities and various enterprises. In fact, Lucene is even used in some database engines to enable faster full-text search.

Lucene has been in development since 1999; it was adopted by the Apache Software Foundation in 2001 and became one of the foundation’s top-level projects in 2005. It has grown a lot over the years, reaching very good performance (http://home.apache.org/~mikemccand/lucenebench/indexing.html) and support for various languages.

This tutorial presents only snippets of code; the full code can be found at TODO: [https://github.com/repo_here](). The code is provided as a Maven project that can be built and executed right away.

I. Getting Development Environment Ready

First, we need to set up our development environment. An empty Maven project with all the dependencies included is waiting for you here: TODO: GITHUB URL HERE.


My IDE of choice is Eclipse and the build system is Maven.

Let’s get the development project and build it!

# clone the repository
git clone TODO

# Build the project
mvn package

Next we run the project:

java -jar target/JAR_NAME_HERE //TODO

II. Analyzing the Project

Let’s have a look at our pom.xml configuration file to check the dependencies that we will be using:

Package name              Usage
lucene-core               Lucene indexing library
lucene-analyzers-common   Lucene analyzers
lucene-queryparser        Query parser
tika-core                 Reading PDF files
tika-parsers              PDF parsing

III. Lucene Architecture

Lucene is a very powerful library, and it gained popularity for good reason: it is scalable and highly performant, fast for both indexing and querying, with low memory usage. Achieving this level of performance required important design choices. Therefore, to leverage the power of Lucene, it is mandatory to understand at least its behavior, if not its internal architecture.

1. Lucene Documents

Lucene is about indexing textual information. The Lucene indexing process operates over what we call Lucene documents. A Lucene document is a collection of fields; you can think of it as a one-level JSON document:


{
    "sender": "sender@server.com",
    "recipient": "someuser@someserver.com",
    "date": 4646845,
    "message": "Call me ASAP, thank you"
}

When a document is created and indexed, Lucene assigns it a unique id number and each field is indexed in a separate posting list.
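In code, the e-mail document above could be built as follows. This is only a sketch: the field classes are introduced properly in section IV, and using StoredField for the numeric date is an assumption on my part (it stores the value without indexing it).

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
doc.add(new StringField("sender", "sender@server.com", Store.YES));
doc.add(new StringField("recipient", "someuser@someserver.com", Store.YES));
doc.add(new StoredField("date", 4646845L));                        // stored only, not indexed (assumption)
doc.add(new TextField("message", "Call me ASAP, thank you", Store.YES));
```

Each add() call appends one field; the document is then handed to an IndexWriter, which assigns the internal id.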

As we mentioned previously, portions of the data may get discarded during analysis: the inverted index won't contain the full original data. The index size is roughly 20 to 30% of the original document size.


2. Field Storage

Lucene stores field values separately from the index itself, in two structures:

  • Inverted Index
  • Field Data Table (.fdt)

The reason for this design choice is that it is easier to retrieve multiple fields from the same document than it is to retrieve the same field from different documents.

3. Field Properties

In Lucene, a field may be:

  • Indexed & stored
  • Indexed but not stored (e.g. the content is stored somewhere else, say a database)
  • Stored but not indexed (e.g. you don’t want to allow searching by that field, but you would like to display its content in search results)
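A sketch of the three combinations is shown below. The url field and its value are hypothetical; StoredField is Lucene's field type for values that are stored but never indexed.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
doc.add(new TextField("title", "Lucene in Action", Store.YES));            // indexed & stored
doc.add(new TextField("body", "full text lives in a database", Store.NO)); // indexed, not stored
doc.add(new StoredField("url", "https://example.com/doc.pdf"));            // stored, not indexed (hypothetical field)
```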

4. Lucene Index and Segments

Lucene indexes are stored as a collection of files in a directory.

Lucene index files are never modified once written, and updating individual fields of a document is not possible. If a document is updated, Lucene removes the old document and indexes a new one, which is expensive as it requires reindexing every field.

This implies one major advantage and one major drawback: it greatly reduces the possibility of index corruption, but it makes updating an index less efficient.

A Lucene index is a collection of segments. When new documents are added to existing index, they are stored in a new segment.

A segment is a logically separate set of index files. It stores all the information of a single, independent index.

Each segment stores one or more documents in its own inverted index and field storage. When querying an index, Lucene runs the same query on all segments and collects all the hits into one result set.

Separate segments are queried as one logical unit, but querying many segments is slower than querying a single one.

Lucene periodically attempts to merge segments together. Unless you have a very large index and are concerned about overly large files, the ideal number of segments is 1. The merging process is quite costly (especially for large segments).

When deleted, a document is only marked for deletion and gets physically removed when its segment is merged. Documents marked for deletion won't be displayed in query results.
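The segment lifecycle described above can be observed directly. The sketch below uses the in-memory RAMDirectory; it assumes each commit() flushes a new segment and that forceMerge(1) collapses them, physically dropping the deleted document.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

Document first = new Document();
first.add(new StringField("id", "1", Store.YES));
writer.addDocument(first);
writer.commit();                              // flushes segment #1

Document second = new Document();
second.add(new StringField("id", "2", Store.YES));
writer.addDocument(second);
writer.commit();                              // flushes segment #2

writer.deleteDocuments(new Term("id", "1"));  // only marked for deletion
writer.forceMerge(1);                         // merge: the deleted document is dropped
writer.close();

DirectoryReader reader = DirectoryReader.open(dir);
System.out.println(reader.leaves().size());   // segment count after the merge
System.out.println(reader.numDocs());         // live documents
reader.close();
```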

These design choices, which allowed Lucene to be extremely fast, have also made it inefficient for storing data that requires fast, frequent updates.

The Lucene architecture is very modular, allowing you to write your own analyzers and filters. There are so many powerful built-in analyzers that you will almost never need to write anything yourself, but for the sake of knowledge, we are going to do it.

5. Lucene Analysis Process

Let’s recall the indexing process we presented last time:

Lucene follows exactly this process. First, we create some documents and analyze them; then we apply some filters to generate proper terms, and finally we index the entire term stream.

  1. Tokenization
  2. Filtering
  3. Indexing

i. Tokenizer

The tokenizer separates keywords from a raw sequence of characters, discarding things such as whitespace. Stopwords, words that are not to be indexed such as the, a, etc., are usually removed by a dedicated filter later in the chain rather than by the tokenizer itself.

To implement a custom tokenizer, one would need to extend the Tokenizer class; see the Building a custom Lucene tokenizer section at http://www.citrine.io/blog/2015/2/14/building-a-custom-analyzer-in-lucene. We will be using WhitespaceTokenizer, which works perfectly fine for most applications. In fact, many Lucene tokenizers use code generated with JFlex, since writing a tokenizer by hand can get complicated.
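To see what a tokenizer actually produces, we can run a string through WhitespaceAnalyzer (which wraps WhitespaceTokenizer) and iterate over the token stream. Note that on its own it neither lowercases nor strips punctuation or stopwords; that is the job of filters.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

List<String> tokens = new ArrayList<>();
Analyzer analyzer = new WhitespaceAnalyzer();
try (TokenStream ts = analyzer.tokenStream("message", "Call me ASAP, thank you")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();                        // mandatory before consuming the stream
    while (ts.incrementToken()) {
        tokens.add(term.toString());   // one keyword per whitespace-separated chunk
    }
    ts.end();
}
System.out.println(tokens);            // [Call, me, ASAP,, thank, you]
```

Notice that "ASAP," keeps its comma and "Call" keeps its capital letter: whitespace is the only split criterion.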

ii. Filtering

Filtering applies transformations to the keywords: modification of tokens, complete removal, or insertion of new tokens (used for synonym search, for example).

Creating a custom Filter can be done by extending the TokenFilter class.

Note that filters can be chained, meaning that you can apply various filters one after the other. The order of the filter chain is very important, as it has a huge impact on the final data that will be indexed. For example, a stemming filter transforms every word to its root, such as stem('looking') = 'look'. Applying a synonym filter after the stemming step would probably give better results than applying it before.
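As a sketch of such a chain, here is a custom Analyzer that tokenizes on whitespace, lowercases, drops English stopwords, and then applies a Porter stemmer. The ordering follows the discussion above; borrowing the stopword set from StandardAnalyzer is my own choice, not something the text prescribes.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream chain = new LowerCaseFilter(source);                 // normalize case first
        chain = new StopFilter(chain, StandardAnalyzer.STOP_WORDS_SET);  // drop 'the', 'a', ...
        chain = new PorterStemFilter(chain);                             // 'looking' -> 'look'
        return new TokenStreamComponents(source, chain);
    }
};
```

Running "The Looking Glass" through this chain should yield the terms look and glass: the stopword is gone, and both remaining words are lowercased and stemmed.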

iii. Indexing

The indexing step is transparent to the programmer: Lucene indexes the term stream produced by the analysis chain on its own.

IV. Hello, Lucene

Now it’s time to move on to some actual coding!

Let’s write our first Lucene program, which will index some static data. We will then slightly improve our application to be able to index PDF documents and, of course, query them. You can later toy with the code yourself; creating a web application where you can upload documents and query them would be good practice!

1. Content Indexing

a. Index Directory

Before attempting to do anything, Lucene needs to know the directory in which it will create its indexes. So we first must create such a Directory (org.apache.lucene.store.Directory), and we have a couple of options:

  1. File system directory:

Log log = LogFactory.getLog(HelloLucene.class);
final String dirUrl = "/usr/home/praisethemoon/lucene_index_dir/";
Directory dir = null;

try {
    dir = FSDirectory.open(Paths.get(dirUrl));
} catch (IOException e) {
    log.error("Cannot open the index directory", e);
}
  2. RAM, for temporary storage:

Directory dir = new RAMDirectory();

b. Analyzer

The next step is to prepare the analyzer that will process our documents. We will use the StandardAnalyzer, as it requires less configuration and does a good job for most applications. Later on, we will implement and use our own analyzer.

Analyzer analyzer = new StandardAnalyzer(); 

c. IndexWriterConfig

Now we need to prepare some configuration for our index writer. The configuration class is called IndexWriterConfig; it simply takes the analyzer we would like to use:

IndexWriterConfig iwc = new IndexWriterConfig(analyzer); 

d. IndexWriter

Next we can create our index writer, which takes two arguments:

  1. The target directory
  2. An IndexWriterConfig instance

IndexWriter writer = null;
try {
    writer = new IndexWriter(dir, iwc);
} catch (IOException e) {
    log.error("Cannot create the index writer", e);
}

e. Document

We are ready to index our data, so let’s create some documents:

Document doc = new Document(); 
doc.add(new StringField("title", "Lucene Introduction", Store.YES)); 
doc.add(new TextField("content", "Java library.", Store.YES)); 

Here we have created two fields, a StringField and a TextField; both inherit from IndexableField. Both of these fields’ constructors take the same parameters: the first is the field name, the second is the field’s textual content, and the third is whether the field is stored or not.

For the sake of demonstration, the difference between StringField and TextField is that a StringField is indexed but not tokenized, while a TextField is both indexed and tokenized.

f. Write and Commit

Once our data is ready, we can index it as follows:

try {
    writer.addDocument(doc);
    writer.commit();
    writer.close();
} catch (IOException e) {
    log.error("Cannot index the document", e);
}

2. Index Querying

Now we are done with the indexing process, so let’s get into querying!

a. Source Directory

First of all, we need to specify the source index directory that we will be querying. This is done exactly as in the indexing process: we simply need to create a Directory instance:

Log log = LogFactory.getLog(HelloLucene.class);
final String dirUrl = "/usr/home/praisethemoon/lucene_index_dir/";
Directory dir = null;

try {
    dir = FSDirectory.open(Paths.get(dirUrl));
} catch (IOException e) {
    log.error("Cannot open the index directory", e);
}

b. Analyzer

Just like in the indexing process, we need to specify the analyzer that we will use to parse the query. To achieve better search results, it is best to use the same analyzer that was used for indexing. So in our case, we will be using the same StandardAnalyzer:

Analyzer analyzer = new StandardAnalyzer(); 

c. Index Reader Config
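Unlike the writer, the reader needs no separate config object. A minimal, self-contained sketch of the read side is shown below; it reuses the sample document from the indexing section so that it can run on its own, and the choice of "content" as the default search field is an assumption.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

Analyzer analyzer = new StandardAnalyzer();

// index a single document, as in the indexing section
Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
Document doc = new Document();
doc.add(new StringField("title", "Lucene Introduction", Store.YES));
doc.add(new TextField("content", "Java library.", Store.YES));
writer.addDocument(doc);
writer.close();

// open a reader over the same directory and query it
DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("content", analyzer); // default search field (assumption)
Query query = parser.parse("java");
TopDocs results = searcher.search(query, 10);              // top 10 hits
for (ScoreDoc sd : results.scoreDocs) {
    Document hit = searcher.doc(sd.doc);
    System.out.println(hit.get("title") + " (score " + sd.score + ")");
}
reader.close();
```

The StandardAnalyzer lowercases and strips punctuation on both sides, so the query "java" matches the indexed text "Java library."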

Maven dependencies

        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
        <!-- https://mvnrepository.com/artifact/commons-logging/commons-logging -->
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <!-- https://mvnrepository.com/artifact/commons-codec/commons-codec -->
