Thursday, June 18, 2015

Django Haystack Elasticsearch: index pdf files

In this article I explain how to index PDF files with a haystack Elasticsearch backend. To follow along you need some knowledge of Django and haystack.
Elasticsearch configuration is not covered.


IMHO, the haystack documentation is not very clear about "FileIndex"; or rather, it is clear, but only for the Solr backend (see Rich Content Extraction). For the Elasticsearch backend you need to get your hands dirty :-)

  • The PDF files to index are located in the Django media directory, under the document folder and its subfolders.

What we need
  1. Retrieve all the files and put some data about them into a list of dictionaries
  2. A PDF file model: haystack requires a model in order to build the index
  3. A custom haystack Elasticsearch backend: we need to override the extract_file_contents method
  4. A file index (the FileIndex class)
Solving 1.

This is simple: walk through the directories and store the full path of each file in a list of dictionaries.
I leave this as an exercise for the reader.
The final result should look like this:

  {"path": "/path/to/media/my_fantastic_pdf.pdf", "url": "media/url/my_fantastic_pdf.pdf"},
  {"path": "...", "url": "..."},
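For readers who want a starting point for the exercise, here is a minimal sketch of one possible `retrieve_files`, not necessarily what I used. The `media_root`, `media_url` and `subfolder` parameters are assumptions added for illustration; in a Django project you would probably pass `settings.MEDIA_ROOT` and `settings.MEDIA_URL`, or wrap the function so it can be called without arguments.

```python
import os


def retrieve_files(media_root, media_url, subfolder="document"):
    """Return a list of {"path": ..., "url": ...} dicts, one per PDF found.

    The three parameters are hypothetical names chosen for this sketch.
    """
    base_dir = os.path.join(media_root, subfolder)
    docs = []
    # os.walk visits base_dir and every subfolder below it.
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for filename in sorted(filenames):
            if not filename.lower().endswith(".pdf"):
                continue  # skip anything that is not a PDF
            full_path = os.path.join(dirpath, filename)
            # Path relative to the media root, with forward slashes for the URL.
            rel_path = os.path.relpath(full_path, media_root).replace(os.sep, "/")
            docs.append({
                "path": full_path,
                "url": media_url.rstrip("/") + "/" + rel_path,
            })
    return docs
```

Each dictionary has exactly the two keys shown above, so it can later be turned into a model instance.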

Solving 2.

First of all we need a model, an unmanaged one:

class PdfFileInfo(models.Model):
    path = models.CharField(max_length=250)
    url = models.CharField(max_length=250)

    objects = PdfFileInfoManager()

    def get_absolute_url(self):
        return self.url

    class Meta:
        managed = False

As you can see we don't have a real table, so we need to create a custom QuerySet and Manager to make up for it.
Searching around the net I found the article "how to quack like a QuerySet", which explains how to mimic the original Django QuerySet with some nice tricks.

Below is the code of the QuerySet:

class PdfFileInfoQuerySet(object):

    def __init__(self):
        # avoid circular dependencies
        from .models import PdfFileInfo

        self.pdf_files = []
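        docs = retrieve_files()  # remember it's your homework :-P
        for pk, doc in enumerate(docs):
            # the pk is needed so that every file gets its own index document
            doc['id'] = pk
            self.pdf_files.append(PdfFileInfo(**doc))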

    def __iter__(self):
        for pdf_file in self.pdf_files:
            yield pdf_file

    def __repr__(self):
        return repr(self.pdf_files)

    def __getitem__(self, k):
        if not isinstance(k, (slice, int, long)):
            raise TypeError
        assert ((not isinstance(k, slice) and (k >= 0))
                or (isinstance(k, slice) and (k.start is None or k.start >= 0)
                    and (k.stop is None or k.stop >= 0))), "Negative indexing is not supported."
        if isinstance(k, slice):
            return self.pdf_files[k]
        return self.pdf_files[k:k + 1][0]

    def count(self):
        return len(self.pdf_files)

    def all(self):
        return self._clone()

    def filter(self, *args, **kwargs):
        return self._clone()

    def exclude(self, *args, **kwargs):
        return self._clone()

    def order_by(self, *ordering):
        return self._clone()

    def _clone(self):
        qs = PdfFileInfoQuerySet()
        qs.pdf_files = self.pdf_files[:]
        return qs

A note on a pitfall: assigning the pk to each model instance ensures that the indexer creates one document per file in the index; otherwise it creates only one document (the last item).
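To see why the pk matters, here is a small simulation, not haystack code. It assumes that the document identifier has the form "app_label.model_name.pk", which is how haystack builds it by default as far as I know; `get_identifier`, `build_index` and the `myapp` label are hypothetical names used only for this sketch.

```python
def get_identifier(app_label, model_name, pk):
    # Assumed identifier format: "app_label.model_name.pk".
    return "%s.%s.%s" % (app_label, model_name, pk)


def build_index(docs, assign_pk):
    """Simulate an index as a dict keyed by document identifier."""
    index = {}
    for position, doc in enumerate(docs):
        pk = position if assign_pk else None
        # Same identifier means the same index document: it gets overwritten.
        index[get_identifier("myapp", "pdffileinfo", pk)] = doc["path"]
    return index


docs = [{"path": "/media/document/a.pdf"}, {"path": "/media/document/b.pdf"}]

without_pk = build_index(docs, assign_pk=False)
with_pk = build_index(docs, assign_pk=True)

print(len(without_pk))  # 1: every file shared the identifier ending in "None"
print(len(with_pk))     # 2: one index document per file
```

Without a pk every instance maps to the same identifier, so each document overwrites the previous one and only the last survives.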

And for the Manager:

class PdfFileInfoManager(models.Manager):
    def all(self):
        return PdfFileInfoQuerySet()

Solving 3.

Creating a custom backend... The "easy" part :-)
I chose pyPdf for extracting the PDF contents. [Python recipe]

class ElasticsearchEngineBackendCustom(ElasticsearchSearchBackend):
    # ... 
    def extract_file_contents(self, file_obj):

        pdf = pyPdf.PdfFileReader(file_obj)

        content = ""
        for num_page in range(0, pdf.getNumPages()):
            content += pdf.getPage(num_page).extractText() + "\n"

        content = (" ".join(content.replace(u"\xa0", " ").strip().split())).encode("ascii", "xmlcharrefreplace")

        pdf_info = {
            'contents': content
        }

        return pdf_info

class ElasticsearchEngineCustom(ElasticsearchSearchEngine):
    backend = ElasticsearchEngineBackendCustom
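The clean-up line in extract_file_contents does a lot in one expression, so here it is in isolation as a small sketch. `normalize_text` is a hypothetical helper name, and the sample string is made up; the expression itself is the one used in the backend above.

```python
def normalize_text(content):
    # Replace non-breaking spaces, collapse every run of whitespace
    # (including newlines) into a single space, then encode to ASCII,
    # turning any remaining non-ASCII character into an XML character reference.
    collapsed = " ".join(content.replace(u"\xa0", " ").strip().split())
    return collapsed.encode("ascii", "xmlcharrefreplace")


raw = u"  Caf\xe9\xa0 menu\n\npage 1 \n"
print(normalize_text(raw))  # the bytes: Caf&#233; menu page 1
```

The result is a byte string, which is why the template later applies the safe filter to the extracted contents.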

You can find more information about extending a backend in my Stack Overflow answer.
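For haystack to use the custom engine, it has to be referenced in the settings. A minimal sketch of such a settings fragment follows; the dotted path `myapp.search_backends`, the URL and the index name are assumptions, so adjust them to wherever your engine class lives and to your Elasticsearch instance.

```python
# settings.py (fragment)
HAYSTACK_CONNECTIONS = {
    "default": {
        # Hypothetical dotted path to the custom engine class defined above.
        "ENGINE": "myapp.search_backends.ElasticsearchEngineCustom",
        # Assumed local Elasticsearch instance and index name.
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    },
}
```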

Solving 4.

Cool, now we have the basis to build the index as the haystack documentation describes.

class FileIndex(indexes.SearchIndex, indexes.Indexable):
    # ...
    def prepare(self, obj):
        data = super(FileIndex, self).prepare(obj)

        # open the PDF in binary mode and close it once the text is extracted
        with open(obj.path, "rb") as pdf_file:
            extracted_data = self._get_backend(None).extract_file_contents(pdf_file)

        t = loader.select_template(('search/indexes/file_text.txt',))
        data['text'] = t.render(Context({'object': obj, 'extracted': extracted_data}))

        return data

    def get_model(self):
        return PdfFileInfo

    def index_queryset(self, using=None):
        return PdfFileInfo.objects.all()

The template search/indexes/file_text.txt is very simple:

{{ extracted.contents|striptags|safe }}

That's it: run the rebuild_index command and see the indexer in action.

This is a working example; it may require some adjustments for your purpose. For example, I think that with a little effort you can index a file that is an "attachment" on one of your Django models.

Any suggestions are welcome!

4 comments:

AKD said...

Good morning Sir,

We can't find PdfFileInfoManager . Do you know how to fix that ?


S... said...


`PdfFileInfoManager` is the custom manager.
You'll have to write it inside your `` file or inside a `` file and then import it.
See for other details.


Unknown said...

Hello Sir ?
Nice Work! Did you manage to search any content of that indexed file?
If so how did you do it!
Best regards

S... said...

Hi Hubert,

> Nice Work!

Thanks! :-)

> Did you manage to search any content of that indexed file?

Only text.
But you can refer to maybe it could be useful for your goal.