Integrating Zend Framework Lucene with your Cake Application
This is a short tutorial that teaches you how to integrate Zend Framework's Lucene implementation (100% PHP) into your application. It requires your server to have PHP5 installed, since ZF only runs on PHP5. The tutorial itself is likely to become outdated very soon.
Introduction
These days, almost every web application needs full-text search. MySQL has a nice native implementation, PostgreSQL has one too, and Lucene and Ferret (a Ruby port of Lucene) are further options, just to name a few.
However, when working on a personal project, I ran into a limitation of MySQL's InnoDB engine: it doesn't have FULLTEXT support. There is also no release date for this feature, which left me no choice but to look for an alternative.
Lucene seemed to be the right tool for the job. Fortunately, ZF has this covered with its search library, which is based on Lucene. It has its drawbacks too:
1. It doesn't update documents in place (as far as I know). To update the index, you have to rebuild it.
2. It is still in the preview phase. The code in this article is likely to change.
3. As of yet, it doesn't support UTF-8 natively. There is a "quick fix" (read: temporary solution) at http://framework.zend.com/manual/en/zend.search.charset.html
Getting the framework
First, you need to download the framework. Head to http://framework.zend.com/ and download the latest preview.
You will need the following files:
library/Zend/Exception.php
library/Zend/Search/
Extract those files to your vendors directory so that the structure is like the one below:
<base directory>/vendors/Zend/Exception.php
<base directory>/vendors/Zend/Search
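Before moving on, it can help to confirm that the files landed where the rest of the tutorial expects them. The following is a minimal sketch, not part of ZF or CakePHP: missingZendFiles() is a hypothetical helper name, and it only checks the two paths listed above.

```php
<?php
// Hypothetical helper (not part of ZF or CakePHP): returns the entries from
// the layout above that are missing under the given vendors directory.
function missingZendFiles($vendorsDir) {
    $required = array(
        'Zend/Exception.php', // single file
        'Zend/Search',        // directory holding the Lucene classes
    );
    $missing = array();
    foreach ($required as $relativePath) {
        if (!file_exists($vendorsDir . '/' . $relativePath)) {
            $missing[] = $relativePath;
        }
    }
    return $missing;
}
```

Calling it with the path to your vendors directory should return an empty array once both entries are in place.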
Indexing your content
Ideally, you would have a bake task to do the indexing part. Since CakePHP 1.2 isn't out yet, we'll have an indexer.php that will do the trick. It could be called by a cron job once a day or more (depending on your needs). This file should also reside outside your webroot folder (/app/webroot), so we'll put it in the base directory, next to /app and /vendors, which is where the relative paths in the code below point.
Here's the code for indexer.php:
<?php
// Add your vendors directory to the include path. ZF needs this.
// PATH_SEPARATOR keeps this portable (':' on Unix, ';' on Windows).
ini_set('include_path', ini_get('include_path') . PATH_SEPARATOR . dirname(__FILE__) . '/vendors');

// Require the Lucene class
require_once('Zend/Search/Lucene.php');

// Establish your connection to the database
mysql_connect('localhost', 'user', 'p4ssw0rd');
mysql_select_db('documents');

// Create a new index (the second argument, true, creates it from scratch).
// This folder has to be writable by the user running this script
// and readable by the httpd user.
// I will use the cache directory to store the index data
$indexPath = dirname(__FILE__) . '/app/tmp/cache/index';
$index = new Zend_Search_Lucene($indexPath, true);

// Let's get some records to add to the index
$documents_rs = mysql_query('SELECT * FROM documents');
while ($document = mysql_fetch_object($documents_rs)) {
    // Create a new searchable document instance
    $doc = new Zend_Search_Lucene_Document();

    // Stored but not searchable: IDs and dates
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('document_id', $document->id));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('document_created', $document->created));
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('document_updated', $document->updated));

    // Stored and searchable: title and description
    $doc->addField(Zend_Search_Lucene_Field::Text('document_title', $document->title));
    $doc->addField(Zend_Search_Lucene_Field::Text('document_description', $document->description));

    // Add the document to the index
    $index->addDocument($doc);
}

// Commit the index
$index->commit();
?>
You will, of course, need to adapt this code to your application's needs. To know which field type to use in each situation, read http://framework.zend.com/manual/en/zend.search.html#zend.search.index-creation.understanding-field-types.
Call the indexer.php file from the command line for now, to create an initial index of your data. As I stated earlier, you can also run it as a cron job.
Querying your index
Now that the content is indexed, you need to query it. I have made a simple component that allows you to do just that. Save this file under app/controllers/components/lucene.php:
Component Class:
<?php
// I'm not sure this is a good idea inside Cake, but I have had no problems so far
ini_set('include_path', ini_get('include_path') . PATH_SEPARATOR . VENDORS);
vendor('Zend' . DS . 'Search' . DS . 'Lucene');

class LuceneComponent extends Object {
    var $controller = true;
    var $index = null;

    function startup(&$controller) {
    }

    // Get the index object, opening it on first use
    function &getIndex() {
        if (!$this->index) {
            // This must be the same directory indexer.php wrote to
            // (app/tmp/cache/index), otherwise opening the index fails
            $this->index = new Zend_Search_Lucene(TMP . DS . 'cache' . DS . 'index');
        }
        return $this->index;
    }

    // Executes a query against the index and returns the results
    function query($query) {
        $index =& $this->getIndex();
        $results = $index->find($query);
        return $results;
    }
}
?>
Now, all you need is to call it from your controller. Here's an example:
Controller Class:
<?php
class SearchController extends AppController {
    var $name = 'Search';
    var $components = array('lucene');
    var $helpers = array('html');

    function documents() {
        if (!empty($this->data)) {
            $documents = $this->lucene->query($this->data['Search']['terms']);
            $this->set('results', $documents);
        }
    }
}
?>
And, the corresponding view:
<?php echo $html->formTag('/search/documents'); ?>
    Search:
    <?php echo $html->input('Search/terms'); ?>
    <?php echo $html->submit(); ?>
</form>
<?php if(isset($results)): ?>
    <h1>Search results: found <?php echo count($results); ?> document(s):</h1>
    <?php foreach($results as $result): ?>
        <h3><?php echo $result->document_title; ?> - <?php echo $result->score; ?></h3>
        <p>
            <?php echo $result->document_description; ?>
            <hr>
            <a href="/documents/view/<?php echo $result->document_id; ?>">View document</a>
        </p>
    <?php endforeach; ?>
<?php endif; ?>
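The view above counts and loops over the results as a plain array, so one way to show them a page at a time is to slice that array in the controller before passing it to the view. This is only a sketch under that assumption: paginateHits() and its parameters are hypothetical names, not part of ZF or CakePHP, and the search itself still runs in full on every request.

```php
<?php
// Hypothetical helper: returns one page of hits plus paging metadata.
// $hits is the array returned by the component's query() method.
function paginateHits(array $hits, $page, $perPage = 10) {
    $perPage = max(1, (int) $perPage);
    $total = count($hits);
    $pages = max(1, (int) ceil($total / $perPage));
    // Clamp the requested page into the valid range 1..$pages.
    $page = min(max(1, (int) $page), $pages);
    return array(
        'items' => array_slice($hits, ($page - 1) * $perPage, $perPage),
        'page'  => $page,
        'pages' => $pages,
        'total' => $total,
    );
}
```

In the controller you could pass the 'items' entry to the view as the results and use 'page' and 'pages' to build previous/next links.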
I would advise you to read the Search component's manual section on this, since it has lots of details on querying the index. Go to http://framework.zend.com/manual/en/zend.search.html to read it.
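As noted in the comments below, queries that name a field explicitly (for example tags:cars tags:beer) gave better results with the preview release than bare terms. A minimal sketch of building such a query string from user input follows; buildFieldQuery() is a hypothetical helper, and it does not escape characters that have a special meaning in the query syntax.

```php
<?php
// Hypothetical helper: prefixes every search term with a field name,
// e.g. ('tags', 'cars beer') becomes 'tags:cars tags:beer'.
function buildFieldQuery($field, $terms) {
    // Split on any run of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($terms), -1, PREG_SPLIT_NO_EMPTY);
    $parts = array();
    foreach ($words as $word) {
        $parts[] = $field . ':' . $word;
    }
    return implode(' ', $parts);
}
```

With the index built earlier, the field would be document_title or document_description, and the resulting string can be handed to the component's query() method.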
Good luck, and let me know how it worked out for you.
Comments
Comment
1 No UTF8
Well for me..
Comment
2 There is a way
http://framework.zend.com/manual/en/zend.search.charset.html
Question
3 No Results
Firstly just a correction to your code - $indexPath is declared using an uppercase P, but then in the line below you use a lowercase.
Now, I have been able to successfully index my table and am able to view the documents using Luke. I am also able to use getDocument($id) which retrieves the document in an array. I can also successfully getFieldNames() which tells me how many fields are in the document.
Unfortunately though, when I try to query the index, no results are returned. If I use the same query string in Luke, I get the correct results.
I have tried the preview version 0.2.0 as well as the snapshot.
Any ideas?
Comment
4 No Results
Thanks for the catch!
Although Zend Search is byte-compatible with Apache's Lucene, the query engine still needs a lot of polishing.
What worked for me was querying using fields, ie:
tags:cars tags:beer
title:specs
And so on.
Let me know how it worked out
Question
5 Size limitations
Can this support terabytes of data?
Thanks,
Darius
Comment
6 Response to Size limitations
I wouldn't know. It would, of course, vary according to a number of factors, like PHP's configured memory limit, whether you rebuild the index all at once or in parts, etc.
To be honest, I wouldn't count on it supporting terabytes of data.
Question
7 multiple search areas
Would it be possible to have multiple search facilities on the one site using this? E.g. a full site search, a news search, a products search (the latter two in their respective sections on the site).
lukemack.
Question
8 version
Question
9 is there a way to specify limit and page
I was looking at the Lucene API and didn't find any way to specify the limit or page number (as done in SQL). Am I missing something?
Regards,
ritesh
Question
10 Zend Version
Cheers!
Comment
11 Size limits
2 GB is the 32-bit storage limitation. Not sure about 64-bit though.
Comment
12 Problems
I have this working to the stage where an index is being created in my tmp directory. However, I'm getting errors when running a search, which I think are caused by the component code:
function &getIndex() {
if(!$this->index) {
$this->index = new Zend_Search_Lucene(TMP . DS . 'lucene');
}
return $this->index;
}
Above this function, index is set to null, so the code attempts to use the 'lucene' directory, which is not mentioned anywhere in this tutorial.
I therefore get a 'no such file or directory' error when trying to run a search, giving the following line as the cause:
Zend/Search/Lucene/Storage/File/Filesystem.php on line 63
Anyone got any ideas?
Comment
13 Indexing
I believe the path to the index file is wrong. Or that the index is not created at all.
Comment
14 Paginate results
I have got all the search and indexing functionality working, but I just can't get my head around how to paginate the results.
Anyone got any pointers?
G.