Showing a text excerpt around multiple keywords

by grigri
On a typical search results page, you want to display an excerpt of each matching item, showing the keywords. TextHelper::excerpt() works, but only for a single word/phrase. This helper finds the chunk of text containing as many keywords as possible.
This is very simple to use : just pass it a chunk of text and an array of keywords. There are quite a few options for tweaking how a particular match is valued, but the defaults I use seem to work fine most of the time.

One thing to note is that the `$length` parameter does not include the prefix or suffix length.

The Algorithm

The algorithm is pretty simple; I'll try and explain the concept:

Any given excerpt has a score, which basically corresponds to the number of keywords present within it.

We don't need to test every possible excerpt, nor even every word boundary -- we only need to check every position where the score changes.

The score can only change at certain points (events) -- when a keyword appears (push event) and when a keyword disappears (pop event).

So, we create a list of the events in the order they occur (event queue), and process them, cumulatively building up the score (as the terminology suggests, it's a stack system).

Once this is done, look at each event position and find the one with the highest score. That's the excerpt to grab.


Notes

This algorithm is a starting point, not a complete solution. I've found that adding a 'bump' event to centre excerpts around keywords works quite well, as well as favourising different keywords rather than repeats of the same one.

I'm planning to extend this with more options, for getting the helper to 'prefer' excerpts containing whole sentences, and a few others.

The `excerpt()` method does not hilight the keywords in any way; use in conjunction with TextHelper::highlight() to do that (I might add this as an option at a later date)

There's a debug option which outputs a table of events and a table of matches - useful for tweaking settings on test data. The excerpt is always returned, the debug information is always output.

Code


Anyway, here's the code:

Helper Class:

<?php 
class SummaryHelper extends AppHelper {

  function 
excerpt($text$words$length=150$prefix="..."$suffix null$options = array()) {

    
// Set default score modifiers [tweak away...]
    
$options am(array(
      
'exact_case_bonus'  => 2,
      
'exact_word_bonus'  => 3,
      
'abs_length_weight' => 0.0,
      
'rel_length_weight' => 1.0,

      
'debug' => true
    
), $options);

    
// Null suffix defaults to same as prefix
    
if (is_null($suffix)) {
      
$suffix $prefix;
    }

    
// Not enough to work with?
    
if (strlen($text) <= $length) {
      return 
$text;
    }

    
// Just in case
    
if (!is_array($words)) {
      
$words = array($words);
    }

    
// Build the event list
    // [also calculate maximum word length for relative weight bonus]
    
$events = array();
    
$maxWordLength 0;

    foreach (
$words as $word) {

      if (
strlen($word) > $maxWordLength) {
        
$maxWordLength strlen($word);
      }

      
$i = -1;
      while ( (
$i stripos($text$word$i+1)) !== false ) {

        
// Basic score for a match is always 1
        
$score 1;

        
// Apply modifiers
        
if (substr($text$istrlen($word)) == $word) {
          
// Case matches exactly
          
$score += $options['exact_case_bonus'];
        }
        if (
$options['abs_length_weight'] != 0.0) {
          
// Absolute length weight (longer words count for more)
          
$score += strlen($word) * $options['abs_length_weight'];
        }
        if (
$options['rel_length_weight'] != 0.0) {
          
// Relative length weight (longer words count for more)
          
$score += strlen($word) / $maxWordLength $options['rel_length_weight'];
        }
        if (
preg_match('/\W/'substr($text$i-11))) {
          
// The start of the word matches exactly
          
$score += $options['exact_word_bonus'];
        }
        if (
preg_match('/\W/'substr($text$i+strlen($word), 1))) {
          
// The end of the word matches exactly
          
$score += $options['exact_word_bonus'];
        }

        
// Push event occurs when the word comes into range
        
$events[] = array(
          
'type'  => 'push',
          
'word'  => $word,
          
'pos'   => max(0$i strlen($word) - $length),
          
'score' => $score
        
);
        
// Pop event occurs when the word goes out of range
        
$events[] = array(
          
'type' => 'pop',
          
'word' => $word,
          
'pos'  => $i 1,
          
'score' => $score
        
);
        
// Bump event makes it more attractive for words to be in the
        // middle of the excerpt [@todo: this needs work]
        
$events[] = array(
          
'type' => 'bump',
          
'word' => $word,
          
'pos'  => max(0$i floor(strlen($word)/2) - floor($length/2)),
          
'score' => 0.5
        
);

      }
    }

    
// If nothing is found then just truncate from the beginning
    
if (empty($events)) {
      return 
substr($text0$length) . $suffix;
    }

    
// We want to handle each event in the order it occurs in
    // [i.e. we want an event queue]
    
$events sortByKey($events'pos');

    
$scores = array();
    
$score 0;
    
$current_words = array();

    
// Process each event in turn
    
foreach ($events as $idx => $event) {
      
$thisPos floor($event['pos']);

      
$word strtolower($event['word']);

      switch (
$event['type']) {
      case 
'push':
        if (empty(
$current_words[$word])) {
          
// First occurence of a word gets full value
          
$current_words[$word] = 1;
          
$score += $event['score'];
        }
        else {
          
// Subsequent occurrences mean less and less
          
$current_words[$word]++;
          
$score += $event['score'] / sizeof($current_words[$word]);
        }
        break;
      case 
'pop':
        if ((
$current_words[$word])==1) {
          unset(
$current_words[$word]);
          
$score -= ($event['score']);
        }
        else {
          
$current_words[$word]--;
          
$score -= $event['score'] / sizeof($current_words[$word]);
        }
        break;
      case 
'bump':
        if (!empty(
$event['score'])) {
          
$score += $event['score'];
        }
        break;
      default:
      }

      
// Close enough for government work...
      
$score round($score2);

      
// Store the position/score entry
      
$scores[$thisPos] = $score;

      
// For use with debugging
      
$debugWords[$thisPos] = $current_words;

      
// Remove score bump
      
if ($event['type'] == 'bump') {
          
$score -= $event['score'];
      }
    }

    
// Calculate the best score
    // Yeah, could have done this in the main event loop
    // but it's better here
    
$bestScore 0;
    foreach (
$scores as $pos => $score) {
        if (
$score $bestScore) {
          
$bestScore $score;
        }
    }


    if (
$options['debug']) {
      
// This is really quick, really tatty debug information
      // (but it works)
      
echo "<table border>";
      echo 
"<caption>Events</caption>";
      echo 
"<tr><th>Pos</th><th>Type</th><th>Word</th><th>Score</th>";
      foreach (
$events as $event) {
        echo 
"<tr>";
        echo 
"<td>{$event['pos']}</td><td>{$event['type']}</td><td>{$event['word']}</td><td>{$event['score']}</td>";
        echo 
"</tr>";
      }
      echo 
"</table>";

      echo 
"<table border>";
      echo 
"<caption>Positions and their scores</caption>";
      
$idx 0;
      foreach (
$scores as $pos => $score) {
        
$excerpt substr($text$pos$length);
        
$style = ($score == $bestScore) ? 'background: #ff7;' '';

        
//$score = floor($score + 0.5);

        
echo "<tr>";
        echo 
"<th style=\"$style\">" $idx "</th>";
        echo 
"<td style=\"$style\">" $pos "</td>";
        echo 
"<td style=\"$style\"><div style=\"float: left; width: 2em; margin-right: 1em; text-align right; background: #ddd\">" $score "</div><code>" str_repeat('*'$score) . "</code></td>";
        echo 
"<td style=\"$style\"><table border>";
        foreach (
$debugWords[$pos] as $word => $count) {
          echo 
"<tr><td>$word</td><td>$count</td></tr>";
        }
        echo 
"</table></td>";
        echo 
"<td style=\"$style\">" . (preg_replace('/(' implode('|'$words) . ')/i''<b style="border: 1px solid red;">\1</b>'htmlentities($excerpt))) . "</td>";
        echo 
"</tr>";
        
$idx++;
      }
      echo 
"</table>";
    }


    
// Find all positions that correspond to the best score
    
$positions = array();
    foreach (
$scores as $pos => $score) {
      if (
$score == $bestScore) {
        
$positions[] = $pos;
      }
    }

    if (
sizeof($positions) > 1) {
      
// Scores are tied => do something clever to choose one
      // @todo: Actually do something clever here
      
$pos $positions[0];
    }
    else {
      
$pos $positions[0];
    }

    
// Extract the excerpt from the position, (pre|ap)pend the (pre|suf)fix
    
$excerpt substr($text$pos$length);
    if (
$pos 0) {
      
$excerpt $prefix $excerpt;
    }
    if (
$pos $length strlen($text)) {
      
$excerpt .= $suffix;
    }

    return 
$excerpt;
  }
}
?>

And here's a sample usage:

View Template:


<?php echo $summary->excerpt($data['Article']['body'], array('some''keywords''here')); ?>

Report

More on Helpers

Advertising

Comments

  • manuweg posted on 02/11/10 02:39:57 AM
    This is great - exactly what I was looking for! Makes the search that much more relevant. For added effect add this function to your summary helper:


    function highlightWords($string, $words)
            {
              foreach ( $words as $word )
              {
                  $string = str_ireplace($word, '<span style="background:#f5cf60">'.$word.'</span>', $string);
              }
              /*** return the highlighted string ***/
              return $string;
            }


    Which you then call like this:


    <?php echo $summary->highlightWords(strip_tags($summary->excerpt($node['Node']['body'], $this->params['named'])),$this->params['named']); ?>

    And then you get highlighted search terms within the relevant passages. You don't have to strip_tags() obviously.
login to post a comment.