Case and accent insensitive searchs in MongoDB and PHP

The problem

Comming from MySQL, programmers gets a little lazy about searching strings. MySQL does all the dirty job for you: search joao and find joao, joão, João, JOÃO and any other variants.

But... when you start using MongoDB, even with his powerfull features, well, you get lost. Let's find a solution.

The Solution

A PHP function using Regular Expression (REGEX) to query MongoDB.

! MongoDB has evolved since this article was written (2013). Consider using MongoDB Text Indexes as a waaaay better solution.

The Code

First of all, generally you don't want just search a exact string, but part of it. For this you used to use LIKE clause in WHERE, as bellow:

SELECT * FROM people WHERE name LIKE "%joão%";

And you find joão, with any case, with or without accent, in any part of the field name.

In MongoDB you don't have LIKE clause, but, much more powerfull, you can use regular expressions. Then you can do:

db.my_collection.find({ name: /.*joão.*/}

and you will find joão in any part of the field name, but case SENSITIVE. For case-insensitive searchs, just adding /i do the job quite well,:

db.my_collection.find({ name: /.*joão.*/i}

Now we really got the real problem. We can do something like /.*jo[aãáâàAÃÁÂÁ]o.*/i and resolv part of the problem... because this matchs joão, joao, joÃo, JOÃO and so on, but not jóão (see the first "o" with accent).

This is the problem in a small size, think about your user using very stranges forms to write the names, or even worst, think about strange names in others languages with stranges accents (I know portuguese accents, but we don't use, for exemple, n or Ñ like spanish).

As I use MongoDB with PHP, and PHP already knows to handle regular expressions, let's use it to get all variants for joão with accents.

<?php
/**
 * Description of StringUtil
 *
 * @author  Rafael Goulart
 */

class StringUtil {

    const ACCENT_STRINGS = 'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËẼÌÍÎÏĨÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëẽìíîïĩðñòóôõöøùúûüýÿ';
    const NO_ACCENT_STRINGS = 'SOZsozYYuAAAAAAACEEEEEIIIIIDNOOOOOOUUUUYsaaaaaaaceeeeeiiiiionoooooouuuuyy';

    /**
     * Returns a string with accent to REGEX expression to find any combinations
     * in accent insentive way
     *
     * @param string $text The text.
     * @return string The REGEX text.
     */
    static public function accentToRegex($text)
    {

        $from = str_split(utf8_decode(self::ACCENT_STRINGS));
        $to   = str_split(strtolower(self::NO_ACCENT_STRINGS));
        $text = utf8_decode($text);
        $regex = array();
        foreach ($to as $key => $value) {
            if (isset($regex[$value])) {
                $regex[$value] .= $from[$key];
            } else {
                $regex[$value] = $value;
            }
        }
        foreach ($regex as $rg_key => $rg) {
            $text = preg_replace("/[$rg]/", "_{$rg_key}_", $text);
        }
        foreach ($regex as $rg_key => $rg) {
            $text = preg_replace("/_{$rg_key}_/", "[$rg]", $text);
        }

        return utf8_encode($text);
    }
}

Using this function, joão (or any similar form) will be transformed in:

j[oÒÓÔÕÖØðòóôõöø][aÁÂÃÄÅÆàáâãäåæ][oÒÓÔÕÖØðòóôõöø]

And so we can do the complete search using MongoDB PHP Driver like this:

<?php
// Connection
$mongo = new Mongo("localhost:27017", array("persist" => "x"));

// Selecting database
$db = $mongo->my_db;

// Selecting collection
$collection = $db->my_collection;

// Creating the Query
$search = StringUtil::accentToRegex('joão');
$query =  array('name' =>new MongoRegex("/.*{$search}.*/i"));

// Running and returning results
$cursor = $collection->find(query);

And it's done! Of course you can search multiple fields simultaneosly, but that's up to you...

References