Case and accent insensitive searchs in MongoDB and PHP

Comming from MySQL, programmers gets a little lazy about searching strings. MySQL does all the dirty job for you: search joao and find joao, joão, João, JOÃO and any other variants.

But… when you start using MongoDB, even with his powerfull features, well, you get lost. Let’s find a solution.

First of all, generally you don’t want just search a exact string, but part of it. For this you used to use LIKE clause in WHERE, as bellow:

SELECT * FROM people WHERE name LIKE "%joão%";

And you find joão, with any case, with or without accent, in any part of the field name.

In MongoDB you don’t have LIKE clause, but, much more powerfull, you can use regular expressions. Then you can do:

db.my_collection.find({ name: /.*joão.*/}

and you will find joão in any part of the field name, but case SENSITIVE. For case-insensitive searchs, just adding /i do the job quite well,:

db.my_collection.find({ name: /.*joão.*/i}

Now we really got the real problem. We can do something like /.*jo[aãáâàAÃÁÂÁ]o.*/i and resolv part of the problem… because this matchs joão, joao, joÃo, JOÃO and so on, but not jóão (see the first “o” with accent).

This is the problem in a small size, think about your user using very stranges forms to write the names, or even worst, think about strange names in others languages with stranges accents (I know portuguese accents, but we don’t use, for exemple, n or Ñ like spanish).

As I use MongoDB with PHP, and PHP already knows to handle regular expressions, let’s use it to get all variants for joão with accents.

<?php
/**
 * Description of StringUtil
 *
 * @author  Rafael Goulart
 */

class StringUtil {

    const ACCENT_STRINGS = 'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËẼÌÍÎÏĨÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëẽìíîïĩðñòóôõöøùúûüýÿ';
    const NO_ACCENT_STRINGS = 'SOZsozYYuAAAAAAACEEEEEIIIIIDNOOOOOOUUUUYsaaaaaaaceeeeeiiiiionoooooouuuuyy';

    /**
     * Returns a string with accent to REGEX expression to find any combinations
     * in accent insentive way
     *
     * @param string $text The text.
     * @return string The REGEX text.
     */
    static public function accentToRegex($text)
    {

        $from = str_split(utf8_decode(self::ACCENT_STRINGS));
        $to   = str_split(strtolower(self::NO_ACCENT_STRINGS));
        $text = utf8_decode($text);
        $regex = array();
        foreach ($to as $key => $value) {
            if (isset($regex[$value])) {
                $regex[$value] .= $from[$key];
            } else {
                $regex[$value] = $value;
            }
        }
        foreach ($regex as $rg_key => $rg) {
            $text = preg_replace("/[$rg]/", "_{$rg_key}_", $text);
        }
        foreach ($regex as $rg_key => $rg) {
            $text = preg_replace("/_{$rg_key}_/", "[$rg]", $text);
        }

        return utf8_encode($text);
    }
}

Using this function, joão (or any similar form) will be transformed in:

j[oÒÓÔÕÖØðòóôõöø][aÁÂÃÄÅÆàáâãäåæ][oÒÓÔÕÖØðòóôõöø]

And so we can do the complete search using MongoDB PHP Driver like this:

<?php
// Connection
$mongo = new Mongo("localhost:27017", array("persist" => "x"));

// Selecting database
$db = $mongo->my_db;

// Selecting collection
$collection = $db->my_collection;

// Creating the Query
$search = StringUtil::accentToRegex('joão');
$query =  array('name' =>new MongoRegex("/.*{$search}.*/i"));

// Running and returning results
$cursor = $collection->find(query);

And it’s done! Of course you can search multiple fields simultaneosly, but that’s up to you…

  • Rafael

    OLá chará! Estou trabalhando com java e estudando mongodb para aplicar em uma solução onde trabalho, no entanto, estou tendo dificuldades acredito eu bobas até, preciso realizar uma consulta por um trecho de uma string, populei uma coleção com 500 mil documentos onde existem somente um atributo: usuario.

    O formato das string na coleção estão assim:
    usuario1@host.com.br
    usuario2@host.com.br

    Acontece que eu tento tanto no console do mongo quanto no código em java executar o código assim:

    -console:
    db.logs.find({“usuario” : “/.*usuario.*/”})

    E não resulta em nada, o que há de errado?

    Obrigado!

  • Muito obrigado pela solução!

    Muito embora acredito que não tenha a melhor performace do mundo, ela funciona muitíssimo bem!

    Valeu!

  • Pingback: how to remove permanent marker()

  • cara valeu!! me ajudou muito!!!

  • ” I haven’t jumped for joy over with these yet – however are pretty nice pens. When using promotional gifts as gifts it’s smart to use items which are unrelated. Any number of business presents although give an effect in your advertising process but in addition impressed your audience promotional mugs: an excellent business promotion product promotional mugs have many space to show the message and logo.

  • Pingback: A PHP script for MongoDB accent search needs to translate into JavaScript | BlogoSfera()

  • Make sure you continue track of all of the
    expenses associated with your marketing promotions, such as the indirect costs like shipping,
    handling and storage costs. Exporters should therefore devise an important and most appropriate strategy for exporting in place.
    You are able to use promotional merchandising to capture the minds in
    the people and reach your desired goal, which is to cause them to become aware about your organization’s existence.

  • You are awesome! Cheers!

  • save my life! Thanks man…

  • It was hard to find your blog in google search results.
    I found it on 19 place, you should build a lot of quality backlinks
    , it will help you to get more visitors. I know how to
    help you, just search in google – k2 seo tricks

  • Alan Ribeiro

    Rafael,
    Sua biblioteca funcionou muito bem, contudo, se faço o inverso, ou seja, pesquiso por ‘joao’ sem acento, a pesquisa não encontra os nomes com ‘JOÃO’.

    • Alan Ribeiro

      Para resolver o problema inclui, no final das constantes o seguinte:
      const ACCENT_STRINGS =
      ‘ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËẼÌÍÎÏĨÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëẽìíîïĩðñòóôõöøùúûüýÿaeiouçAEIOUÇ’;
      const NO_ACCENT_STRINGS =
      ‘SOZsozYYuAAAAAAACEEEEEIIIIIDNOOOOOOUUUUYsaaaaaaaceeeeeiiiiionoooooouuuuyyaeioucAEIOUÇ’;

  • Tahir Ata Barry

    Although MondoDB does not offer it, but you can very easily do it in PHP or any other language. Here is how to do it in PHP.

    $cn = new MongoClient($dbHost);
    $db = $cn->selectDB($dbName);
    $col = new MongoCollection($db, $collectionName);

    $cursor = $col->find();
    $cursor = iterator_to_array($cursor);

    foreach ($cursor as $key => $row) {
    $name[$key] = $row[‘name’];
    $email[$key] = $row[’email’];
    }
    //$name is the field to sort on, taken from the above loop
    //You can use SORT_ASC or SORT_DESC

    array_multisort($name, SORT_ASC, $cursor);

    foreach ($cursor as $doc) {
    echo $doc[‘name’].’-‘.$doc[’email’].”;
    }

  • Obrigado pelo código Rafael!!!
    eu estava errando em não usar o decode UTF8 XD

    Abraço Equipe Mestre Search

  • Eduardo Alvarado D

    THANK YOU, YOUR CODE IS AMAZING!

    Obrigado!, Gracias!

  • Kinjal Fdgdfd Patel

    THANK YOU, It really work for me. You save my time

  • Marcos Maia

    Boa tarde Rafael. Sua implementação me ajudou bastante, não sei como te agradecer! Como programo em Python, fiz um port de seu código que atendesse as minhas necessidades, espero que não tenha problema. Eis o link caso mais alguém precisar.
    https://bitbucket.org/marcosmaia1/regexstrings

    • Olá, Marcos.
      Duas coisas:
      – Este artigo é antigo, atualmente o MongoDB possui algum suporte nativo a case insensitive search https://docs.mongodb.com/manual/core/index-case-insensitive/
      – Segundo, uma boa prática: ao criar um código derivado, cite o código original em um comentário ou README.

      []’s

      • Marcos Maia

        Obrigado pelo feedback Rafael. Vou citar corretamente seu artigo no meu código! Abraços!