Integrating Zend Framework Lucene with your Cake Application

By Andre Medeiros (andremedeiros)
This is a short tutorial that teaches you how to integrate Zend Framework's Lucene implementation (100% PHP) into your application. It requires your server to have PHP5 installed, since ZF only runs on PHP5. The tutorial itself is likely to be deprecated very soon.

Introduction



These days, nearly every web application needs full-text search. MySQL has a nice native implementation, PostgreSQL has one too, and Lucene and Ferret (a Ruby port of Lucene) are further options, to name just a few.

However, while working on a personal project, I ran into a limitation of MySQL's InnoDB engine: it has no FULLTEXT support. There is no release date for this feature either, which left me no choice but to look for an alternative.

Lucene seemed to be the tool for the job. Fortunately, ZF has this covered with its search library, which is based on Lucene. It has its drawbacks too:

1. It can't update an existing index (as far as I know). To update the index, you have to rebuild it.

2. It is still in the preview phase. The code in this article is likely to change.

3. As of yet, it doesn't support UTF-8 natively. There is a "quick fix" (read: temporary solution) at http://framework.zend.com/manual/en/zend.search.charset.html
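As a rough illustration of the UTF-8 workaround in point 3, the text can be converted to a single-byte charset such as ISO-8859-1 before it is handed to the indexer. The sketch below uses PHP's iconv extension; toLatin1() is a hypothetical helper name of my own, not something that ships with ZF or CakePHP, and it assumes all of your content can be represented in ISO-8859-1.

```php
<?php
// Hypothetical helper (not part of ZF or CakePHP): converts UTF-8 text to
// ISO-8859-1 so that it can be passed to the indexer.
function toLatin1($text) {
    // @ silences the notice iconv raises when a character has no
    // ISO-8859-1 equivalent; iconv then returns false.
    $converted = @iconv('UTF-8', 'ISO-8859-1', $text);

    // Assumption: if the conversion fails, keep the original text
    // rather than dropping the record from the index.
    return ($converted === false) ? $text : $converted;
}

// "Café menu" encoded as UTF-8 is 10 bytes; the e-acute takes two of them
$title = toLatin1("Caf\xC3\xA9 menu");
echo strlen($title); // 9 bytes after conversion
```

In the indexer you would wrap each text value in this call before creating the field. Remember to convert the search terms the same way before querying, otherwise accented words will not match.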




Getting the framework



First, you need to download the framework. Head to http://framework.zend.com/ and download the latest preview.

You will need the following files:
library/Zend/Exception.php
library/Zend/Search/


Extract those files to your vendors directory so that the structure looks like the one below:
<base directory>/vendors/Zend/Exception.php
<base directory>/vendors/Zend/Search



Indexing your content



Ideally, you would have a bake task to do the indexing part. Since CakePHP 1.2 isn't out yet, we'll have an indexer.php that will do the trick. It could be called by a cron job once a day or more often (depending on your needs). This file should also reside outside your webroot folder (/app/webroot), so we'll put it in /app.

Here's the code for indexer.php:

<?php

// Add your vendors directory to the include path. ZF needs this.
// indexer.php lives in /app, so <base directory>/vendors is one level up.
ini_set('include_path', ini_get('include_path') . PATH_SEPARATOR . dirname(dirname(__FILE__)) . '/vendors');

// Require the Lucene class
require_once('Zend/Search/Lucene.php');

// Establish your connection to the database
mysql_connect('localhost', 'user', 'p4ssw0rd');
mysql_select_db('documents');

// Create a new index (the second argument, true, creates it from scratch).
// This folder has to be readable by the httpd user and writable by
// whoever runs this script.
// I will use the cache directory (app/tmp/cache) to store the index data
$indexPath = dirname(__FILE__) . '/tmp/cache/index';
$index = new Zend_Search_Lucene($indexPath, true);

// Let's get some records to add to the index
$documents_rs = mysql_query('SELECT * FROM documents');
while ($document = mysql_fetch_object($documents_rs)) {
    // Create a new searchable document instance
    $doc = new Zend_Search_Lucene_Document();

    // Add some information
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', $document->id));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('document_created', $document->created));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('document_updated', $document->updated));
    $doc->addField(Zend_Search_Lucene_Field::Text('document_title', $document->title));
    $doc->addField(Zend_Search_Lucene_Field::Text('document_description', $document->description));

    // Add the document to the index
    $index->addDocument($doc);
}

// Commit the index
$index->commit();
?>


You will, of course, need to adapt this code to your application's needs. To know which field type to use in each situation, read http://framework.zend.com/manual/en/zend.search.html#zend.search.index-creation.understanding-field-types.

For now, call the indexer.php file from the command line to create an initial index of your data. As I stated earlier, you can also run it as a cron job.

Querying your index



Now that the content is indexed, you need to query it. I have made a simple component that allows you to do just that. Save this file under app/controllers/components/lucene.php:

Component Class:

<?php
// I'm not sure this is a good idea inside Cake, but I had no problems so far
ini_set('include_path', ini_get('include_path') . PATH_SEPARATOR . VENDORS);
vendor('Zend' . DS . 'Search' . DS . 'Lucene');

class LuceneComponent extends Object {
    var $controller = true;
    var $index = null;

    function startup(&$controller) {
    }

    // Get the index object
    function &getIndex() {
        if (!$this->index) {
            // Open the index that indexer.php created in app/tmp/cache/index.
            // This path must match the one used by the indexer.
            $this->index = new Zend_Search_Lucene(TMP . 'cache' . DS . 'index');
        }
        return $this->index;
    }

    // Executes a query against the index and returns the results
    function query($query) {
        $index =& $this->getIndex();
        $results = $index->find($query);
        return $results;
    }
}
?>


Now, all you need is to call it from your controller. Here's an example:

Controller Class:

<?php
class SearchController extends AppController {
    var $name = 'Search';
    var $components = array('lucene');
    var $helpers = array('html');

    function documents() {
        if (!empty($this->data)) {
            $documents = $this->lucene->query($this->data['Search']['terms']);
            $this->set('results', $documents);
        }
    }
}
?>


And, the corresponding view:

<?php echo $html->formTag('/search/documents'); ?>
Search:
<?php echo $html->input('Search/terms'); ?>
<?php echo $html->submit(); ?>
</form>

<?php if (isset($results)): ?>
  <h1>Search results: found <?php echo count($results); ?> document(s):</h1>
  <?php foreach ($results as $result): ?>
    <h3><?php echo $result->document_title; ?> - <?php echo $result->score; ?></h3>
    <p>
      <?php echo $result->document_description; ?>
      <hr>
      <a href="/documents/view/<?php echo $result->document_id; ?>">View document</a>
    </p>
  <?php endforeach; ?>
<?php endif; ?>
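
The view above prints every hit at once, and a couple of readers ask in the comments how to limit or page the results. Since the view counts and loops over $results like a plain array, one simple approach is to cut a page out of it with array_slice(). This is only a sketch: it assumes find() hands back an ordinary PHP array, and paginateResults() with its parameters is a hypothetical helper of my own, not part of ZF or CakePHP.

```php
<?php
// Hypothetical helper: returns the hits that belong on a given page.
// $page is 1-based; $perPage is the number of hits shown per page.
function paginateResults($results, $page, $perPage) {
    // Treat anything below 1 (or non-numeric) as the first page
    $page = max(1, (int) $page);
    $offset = ($page - 1) * $perPage;

    // array_slice returns an empty array when the offset is past the end
    return array_slice($results, $offset, $perPage);
}

// Stand-in for the array returned by the component's query() method
$hits = range(1, 25);

$pageTwo = paginateResults($hits, 2, 10);
echo count($pageTwo); // 10 hits: items 11 through 20
```

Note that this slices the results after the whole search has run, so it tidies up the output but does not make the query itself any cheaper.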


I would advise you to read the Search component's manual section on this, since it has lots of details on querying the index. Go to http://framework.zend.com/manual/en/zend.search.html to read it.
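
In the comments below, I mention that queries which name a field explicitly (for example title:specs) worked better for me than bare terms. A small helper can build such a query string from whatever the user typed. This is a minimal sketch: buildFieldQuery() is a hypothetical name, and it does not escape characters that have a special meaning in the query syntax, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: prefixes every search term with a field name,
// e.g. "cars beer" becomes "document_title:cars document_title:beer".
function buildFieldQuery($field, $terms) {
    $parts = array();

    // Split on any run of whitespace and skip empty pieces
    $words = preg_split('/\s+/', trim($terms), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $parts[] = $field . ':' . $word;
    }

    return implode(' ', $parts);
}

echo buildFieldQuery('document_title', 'cars  beer');
```

In the controller you would pass the result to $this->lucene->query() instead of the raw terms. Only fields added as Text in the indexer (document_title and document_description here) can be searched this way; the UnIndexed ones are stored but not searchable.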

Good luck, and let me know how it worked out for you.

 

Comments 102

 

Comment

1 No UTF8

Big drawback!

Well for me..
Posted Oct 17, 2006 by stab
 

Comment

2 There is a way

There is a solution. It doesn't support UTF-8 natively, but you can convert your data to ISO-8859-1, for instance.

http://framework.zend.com/manual/en/zend.search.charset.html
Posted Oct 18, 2006 by Andre Medeiros
 

Question

3 No Results

Tried to implement your tutorial, but unfortunately I am having a few issues...

Firstly just a correction to your code - $indexPath is declared using an uppercase P, but then in the line below you use a lowercase.

Now, I have been able to successfully index my table and am able to view the documents using Luke. I am also able to use getDocument($id) which retrieves the document in an array. I can also successfully getFieldNames() which tells me how many fields are in the document.

Unfortunately though, when I try to query the index, no results are returned. If I use the same query string in Luke, I get the correct results.

I have tried the preview version 0.2.0 as well as the snapshot.

Any ideas?
Posted Nov 7, 2006 by Graham
 

Comment

4 No Results

Thanks for the catch!

Although Zend Search is byte-compatible with Apache's Lucene, the query engine still needs a lot of polishing.

What worked for me was querying using fields, ie:

tags:cars tags:beer
title:specs

And so on.

Let me know how it worked out
Posted Nov 8, 2006 by Andre Medeiros
 

Question

5 Size limitations

Is there a maximum size that the index can be??

Can this support terabytes of data?

Thanks,
Darius
Posted Nov 16, 2006 by Darius Grissom
 

Comment

6 Response to Size limitations

I wouldn't know. It would, of course, vary according to a number of factors, like php's configured memory limit, if they rebuild the index all at once, or by parts, etc.

To be honest, I wouldn't count on it supporting terabytes of data.
Posted Dec 4, 2006 by Andre Medeiros
 

Question

7 multiple search areas

hi,

would it be possible to have multiple search facilities on the one site using this? e.g. a full site search, a news search, a products search (the latter two in their respective sections on the site).

lukemack.
Posted Dec 31, 1969 by Luke Mackenzie
 

Question

8 version

also, has anyone tried this with the latest version of the Zend component?
Posted Dec 31, 1969 by Luke Mackenzie
 

Question

9 is there a way to specify limit and page

Hi,

I was looking at the Lucene API and didn't find any way to specify the limit or page number (as done in SQL). Am I missing something?

Regards,
ritesh
Posted Dec 31, 1969 by Ritesh Agrawal
 

Question

10 Zend Version

Make sure you put your indexer.php in /app directory and not in /webroot or somewhere else. I didn't read the directions very carefully, and ran into some problems. Also, don't run indexer from Zend Studio. Again, follow the directions by running it from the command prompt.

Cheers!

Posted Apr 18, 2007 by Jeff Stevenson
 

Comment

11 Size limits

2Gb is the 32bit storage limitation - not sure about 64bit though.
Posted May 4, 2007 by T Wyld
 

Comment

12 Problems

Hi,

I have this working to the stage where an index is being created in my tmp directory. however, i'm getting errors when running a search which i think are caused by the component code:

function &getIndex() {
if(!$this->index) {
$this->index = new Zend_Search_Lucene(TMP . DS . 'lucene');
}
return $this->index;
}

above this function, index is set to null and so the code attempts to use the 'lucene' directory, which is not mentioned anywhere in this tutorial.
i therefore get a 'no such file or directory' error when trying to run a search, giving the following line as the cause:

Zend/Search/Lucene/Storage/File/Filesystem.php on line 63

anyone got any ideas?

Posted May 17, 2007 by Luke Mackenzie
 

Comment

13 Indexing

I believe the path to the index file is wrong. Or that the index is not created at all.

Posted May 23, 2007 by Spenna
 

Comment

14 Paginate results

Thanks for the great article...

I have got all the search and indexing functionality working, but I just can't get my head around how to paginate the results.

Anyone got any pointers?

G.
Posted May 25, 2007 by Graham