Multilingual search in Django with Haystack

When you build a website for a Swiss company, as we do at Tangent, you will likely need to support search in three languages.

There are multiple ways to perform multilingual search, and Steve Kearns explains them in this SlideShare presentation.

We chose to implement the solution at indexing time, and it turned out to be unexpectedly easy.

The solution

The solution is really simple. Let's say you want to index a book title in three languages: German, French and Italian. Your book model will look like:

class Book(models.Model):
    title_de = models.CharField(max_length=100)
    title_fr = models.CharField(max_length=100)
    title_it = models.CharField(max_length=100)

Your Haystack index will look like:

class BookIndex(indexes.RealTimeSearchIndex, indexes.Indexable):
    title = indexes.CharField(document=True)

At indexing time, we send each book to Solr once per language, and each language goes to its own Solr core.

------------        "Mein Buch" ---|    ------------
| SOLR FR  |<------ "Mon livre"    ---->| SOLR DE  |
------------        "Mio libro"         ------------
                         |
                         |
                         V
                    ------------
                    | SOLR IT  |
                    ------------

At query time, we will only query the core matching our language:

                          ------------
    "Les miserables" ---->| SOLR FR  | ---> French results 
                          ------------
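The two diagrams above can be sketched in plain Python. This is only an illustration of the idea, not the Haystack implementation: the cores dictionary, index_book and search are hypothetical stand-ins for the three Solr cores, the indexing step and the query step.

```python
# Hypothetical stand-in for the three Solr cores: one list of documents per language.
cores = {"de": [], "fr": [], "it": []}

# A plain dict standing in for a Book instance with one title per language.
book = {"title_de": "Mein Buch", "title_fr": "Mon livre", "title_it": "Mio libro"}


def index_book(book, cores):
    """Indexing time: send the title in each language to the core of that language."""
    for language, documents in cores.items():
        documents.append({"title": book["title_" + language]})


def search(query, language, cores):
    """Query time: look only at the core matching the given language."""
    return [doc for doc in cores[language] if query.lower() in doc["title"].lower()]


index_book(book, cores)
print(search("livre", "fr", cores))  # [{'title': 'Mon livre'}]
print(search("livre", "de", cores))  # []
```

Each core only ever sees one language, so in the real setup each Solr core can be configured with the analyzers of its own language.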

Implementation

1. Config

Define your search engines in your configuration file:

    HAYSTACK_CONNECTIONS = {
        'default':{
            'ENGINE': 'myproject.backend.MultilingualSolrEngine',
            'URL': 'http://127.0.0.1:8080/solr-de',
            },
        'default_de':{
            'ENGINE': 'myproject.backend.MultilingualSolrEngine',
            'URL': 'http://127.0.0.1:8080/solr-de',
            },
        'default_fr':{
            'ENGINE': 'myproject.backend.MultilingualSolrEngine',
            'URL': 'http://127.0.0.1:8080/solr-fr',
            },
        'default_it':{
            'ENGINE': 'myproject.backend.MultilingualSolrEngine',
            'URL': 'http://127.0.0.1:8080/solr-it',
            },
        }

Each connection follows the same pattern, "<name>_<language_code>". Some of you may have noticed the use of a special backend, but that’s step 2.
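Since every entry follows the same pattern, the dictionary could also be generated from a list of language codes. The snippet below is a minimal sketch under that assumption: build_connections is a hypothetical helper, not part of Haystack, and the engine path and base URL are simply the ones used in the configuration above.

```python
def build_connections(languages, default_language,
                      engine="myproject.backend.MultilingualSolrEngine",
                      base_url="http://127.0.0.1:8080/solr-"):
    """Build a HAYSTACK_CONNECTIONS-style dict with one entry per language.

    Hypothetical helper: it only assembles a plain dictionary.
    """
    def connection(language):
        return {"ENGINE": engine, "URL": base_url + language}

    # The plain "default" alias points at the core of the default language.
    connections = {"default": connection(default_language)}
    for language in languages:
        connections["default_" + language] = connection(language)
    return connections


HAYSTACK_CONNECTIONS = build_connections(["de", "fr", "it"], default_language="de")
print(sorted(HAYSTACK_CONNECTIONS))
# ['default', 'default_de', 'default_fr', 'default_it']
```

Writing the dictionary out by hand, as above, is just as valid; the generated form mainly helps when the list of languages changes often.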

2. Add the backend

Copy the following backend. It takes care of indexing by publishing each piece of content to all cores.

from django.conf import settings
from django.utils import translation
from haystack import connections
from haystack.backends.solr_backend import SolrEngine, SolrSearchBackend, SolrSearchQuery
from haystack.constants import DEFAULT_ALIAS

def get_using(language, alias=DEFAULT_ALIAS):
    new_using = alias + "_" + language
    using = new_using if new_using in settings.HAYSTACK_CONNECTIONS else alias
    return using

class MultilingualSolrSearchBackend(SolrSearchBackend):
    def update(self, index, iterable, commit=True, multilingual=True):
        if multilingual:
            # Remember the full active language code so it can be restored afterwards
            initial_language = translation.get_language()
            # retrieve unique backend name
            backends = []
            for language, __ in settings.LANGUAGES:
                using = get_using(language, alias=self.connection_alias)
                # Ensure each backend is called only once
                if using in backends:
                    continue
                else:
                    backends.append(using)
                translation.activate(language)
                backend = connections[using].get_backend()
                backend.update(index, iterable, commit, multilingual=False)
            translation.activate(initial_language)
        else:
            print("[%s]" % self.connection_alias)
            super(MultilingualSolrSearchBackend, self).update(index, iterable, commit)

If you run update_index now, you will see Haystack updating the index of every core.
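The alias lookup and the "call each backend only once" check are easier to follow in isolation. The sketch below reproduces that logic in plain Python, with a hard-coded CONNECTIONS dictionary standing in for settings.HAYSTACK_CONNECTIONS; backends_to_update is a hypothetical helper written for this illustration, and the extra languages "en" and "rm" are only there to show the fallback.

```python
# Stand-in for settings.HAYSTACK_CONNECTIONS (only the keys matter here).
CONNECTIONS = {"default": {}, "default_de": {}, "default_fr": {}, "default_it": {}}


def get_using(language, alias="default", connections=CONNECTIONS):
    """Return "<alias>_<language>" if such a connection exists, else the alias itself."""
    candidate = alias + "_" + language
    return candidate if candidate in connections else alias


def backends_to_update(languages, alias="default"):
    """List the connection aliases an update would visit, each one only once."""
    seen = []
    for language in languages:
        using = get_using(language, alias)
        if using not in seen:
            seen.append(using)
    return seen


print(get_using("fr"))  # default_fr
print(get_using("en"))  # default (no dedicated core, so fall back)
print(backends_to_update(["de", "fr", "it", "en", "rm"]))
# ['default_de', 'default_fr', 'default_it', 'default']
```

Languages without a dedicated core all fall back to the plain alias, which is why the deduplication matters: without it, the fallback connection would be updated once per such language.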

3. Query time now!

To choose the right core when querying Solr through Haystack, we define our own SearchQuery and give it the right "using" alias.

class MultilingualSolrSearchQuery(SolrSearchQuery):
    def __init__(self, using=DEFAULT_ALIAS):
        language = translation.get_language()[:2]
        using = get_using(language)
        super(MultilingualSolrSearchQuery, self).__init__(using)

The engine's job is to link the query and the backend and to provide an entry point.

class MultilingualSolrEngine(SolrEngine):
    backend = MultilingualSolrSearchBackend
    query = MultilingualSolrSearchQuery

4. That’s it

There is nothing else to do. Write your Haystack views and indexes as usual; the rest of your code does not need to change.

Downsides

No solution is perfect, and this one has a few downsides. The ones I saw are:

  • Many cores to manage, deploy and monitor
  • Slow indexing. If you use real-time Haystack indexes like we do, every time you update an instance of an indexed model, a signal triggers an update to the N cores, which significantly slows down the update. Most of the time, we care more about query time than update time. If you don't, this might not be the right solution for you.

Did you like this post? Follow me on Twitter: @atresontani

Published: September 20 2012
