Prerequisites: some knowledge on haystack and elasticsearch and obviously django are required.
When we're using e.g. synonyms or stopwords in our indices, we need to reindex our data in order to have the new settings on board. Elasticsearch documentation rises this problem and suggest how to fix, see Reindex Your Data. Well, let's do with haystack!
Suppose we have this haystack settings:
HAYSTACK_CONNECTIONS = { 'default': { 'ENGINE': 'search.backend.ElasticsearchEngineCustom', 'URL': 'http://127.0.0.1:9200/', 'INDEX_NAME': 'haystack', 'BATCH_SIZE': 1000 } }
As you can see I created a custom backend, but it's not the focus on this article, please read the links below to get an idea:
- Stretching Haystack's ElasticSearch Backend
- Extending Haystack’s Elasticsearch backend
- Search for multiple words elasticsearch haystack
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend class ElasticsearchEngineBackendCustom(ElasticsearchSearchBackend): def __init__(self, connection_alias, **connection_options): super(ElasticsearchEngineBackendCustom, self).__init__(connection_alias, **connection_options) self.setup_complete = True class ElasticsearchEngineCustom(ElasticsearchSearchEngine): backend = ElasticsearchEngineBackendCustom
Setting self.setup_complete to True avoids the 'put index' call from haystack setup and permits us to use INDEX_NAME as index alias name.
Now we have to manage all the haystack index setup via management commands overriding rebuild_index and creating the reindex_index command, let's do it.
In a module named utils, I created the current_index function that is an utility function that return the current index in use and the next version number for the current index (we'll use the number on management commands). Below the code:
INDEX_TEMPLATE = "{}_v{}" def number_sequence(): n = 0 while True: yield n n += 1 def current_index(es_client, index_name): version = number_sequence() index = INDEX_TEMPLATE.format(index_name, version.next()) if not es_client.indices.exists_alias(name=index_name): return INDEX_TEMPLATE.format(index_name, 0), 1 while not es_client.indices.exists(index=index): index = INDEX_TEMPLATE.format(index_name, version.next()) return index, version.next()
The guard for exists_alias is used to know if the rebuild_index has been run at least once. If no alias is present, there is no index. The version number added to the name start from 0. The default index, when the rebuild_index command is called will be haystack_v0.
An objection is the use of while not... because the function search from zero to current index version number and it could be a great number and for each number hits the elasticsearch; it could be more efficient having two indexes name. Take this as a "quick win" solution and improving it!
Let's go to write the reindex_index command, the steps are:
- Delete the new index ignoring the errors
- Create the new index with the current index settings and fields mapping
- Reindex data from current index to new index
- Zero downtime: remove from alias the current index and add the new index
- Delete current index, that is old. The new one is used
from elasticsearch import Elasticsearch from elasticsearch.helpers import reindex from django.core.management.base import BaseCommand from django.conf import settings from search.utils import current_index, INDEX_TEMPLATE class Command(BaseCommand): def __init__(self): super(Command, self).__init__() self.es_client = Elasticsearch(hosts=settings.HAYSTACK_CONNECTIONS['default']['URL']) self.index_alias = settings.HAYSTACK_CONNECTIONS['default']['INDEX_NAME'] self.current_index, version = current_index(self.es_client, self.index_alias) self.new_index = INDEX_TEMPLATE.format(self.index_alias, version) # Update settings with fields mapping self.index_settings = settings.ELASTICSEARCH_INDEX_SETTINGS self.index_settings.update(self.es_client.indices.get_mapping()[self.current_index]) def handle(self, *args, **options): self.es_client.indices.delete(index=self.new_index, ignore=[404, 400]) self.es_client.indices.create(index=self.new_index, body=self.index_settings) reindex(self.es_client, self.current_index, self.new_index) update_aliases = { "actions": [ {"remove": {"index": self.current_index, "alias": self.index_alias}}, {"add": {"index": self.new_index, "alias": self.index_alias}} ] } self.es_client.indices.update_aliases(body=update_aliases) self.es_client.indices.delete(index=self.current_index) print(u"Successfully reindex.")
Now we need to override rebuild_index command with:
- Rewriting clear_index call
- Creating the alias using the INDEX_NAME from haystack settings
- delete all index that match haystack_v*
- create the index "v0" with haystack settings
- create the alias haystack to haystack_v0
- Call the update_index
from elasticsearch import Elasticsearch from django.conf import settings from django.core.management import call_command from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend from haystack.management.commands import rebuild_index from search.utils import INDEX_TEMPLATE class Command(rebuild_index.Command): def __init__(self): super(Command, self).__init__() self.es_client = Elasticsearch(hosts=settings.HAYSTACK_CONNECTIONS['default']['URL']) self.index_name = INDEX_TEMPLATE.format(settings.HAYSTACK_CONNECTIONS['default']['INDEX_NAME'], 0) self.index_alias = settings.HAYSTACK_CONNECTIONS['default']['INDEX_NAME'] def _create_index_alias(self): self.es_client.indices.delete(index=self.index_name.replace("0", "*")) self.es_client.indices.create(index=self.index_name, body=ElasticsearchSearchBackend.DEFAULT_SETTINGS) print(u"Created index {}".format(self.index_name)) self.es_client.indices.put_alias(index=self.index_name, name=self.index_alias) print(u"Added index {}_v0 to alias {}.".format(self.index_name, self.index_alias)) def handle(self, **options): self._create_index_alias() call_command('update_index', **options)
Instead of ElasticsearchSearchBackend.DEFAULT_SETTINGS you should use your elasticsearch settings. The links above explained how to make it.
A nice improvement is having the yes/no options, the rebuild command is destructive.
Any suggestions will be appreciated! That's it. Enjoy! :-)
Cheers
Nessun commento:
Posta un commento