A fun aspect of software development is the ever-growing array of technologies. The options can feel endless, whether it's the programming language, database, or where and how the solution will be deployed.
The challenge with so many choices is that it's hard to decide. It's easy to lean toward trending or popular technologies because the documentation and helpful resources are plentiful. If a technology is popular or trending, that usually means it's solid, with noticeable benefits to leveraging it and minimal downsides, if any, right?
…right?
However, as I walk you through my quest, you’ll see that sometimes the best solution is unclear and that certain oversights can dramatically change the outcome.
The Quest
I had been waiting for this day for a long time. The day when I was recognized by the veteran adventurer known as the “Team Lead”. They provided me an opportunity to prove myself, tasking me to scour the brutal lands of “The Internet” to defeat an angry fire-breathing dragon who-…
Well, I'm paraphrasing. They didn't say that exactly; it was more that I was tasked with finding the best approach to searching for entries in a new, large data set. Not quite as epic sounding, but a quest is a quest, and "data set" and "dragon" are close enough.
Our current solution is built using Django and PostgreSQL, so that was my starting point. The new data that needed to be searchable via an API endpoint consisted of short-phrased textual data. For this post, we'll use a stand-in data set composed of movie titles, each with a language designation and multiple translated equivalents. This text-based data set totaled around 15,000 English entries alone and was expected to grow further, possibly to the hundreds of thousands.
Thankfully, the way this data was used would be specific to a particular language, so searching would only have to deal with one language at a time. Specifically, the UX would be an autocomplete suggestion leveraging the browser language setting.
To summarize, the requirements for this quest were:
- PostgreSQL compatible
- Django compatible
- Good performance (a threshold of < 200 ms)
  - Ideally ranging from 10–30 ms
- Capable of searching for long (single or multi-word) and short (one- or two-character) search terms
- Able to paginate results
Game Plan and Considerations
Sometimes I wish I was a wizard from D&D. They have a spell called Augury, which tells the caster whether their next decision would be good or bad. I chose the keyboard warrior class, so I needed a different approach when evaluating the available options.
But who needs spells when you have scientific rigor? In other words, I needed to develop a repeatable way to measure each method and then compare them. So, I came up with the following plan to measure effectiveness:
- Create my stand-in data set
- Measure the performance of the following scenarios:
  a. Once with "long" search terms, i.e., single or multi-words
  b. Another with "short" search terms, i.e., single or two characters
- Record the time spent and the first three results for each iteration
- Get the total average time and the standard deviation (sample) of the results
As English would be the most used language, I created an English-only stand-in data set of around 15,000 entries and wrote a measurement loop structured like this:
```python
from datetime import datetime
import math
from statistics import mean, stdev

from django.db import connection, reset_queries

...

loop = 100  # number of iterations per scenario
delta_time = []     # total time taken per iteration (ms)
sql_exec_time = []  # SQL execution time per iteration (sec)

for _ in range(loop):
    search_term = get_search_term()  # get a random search term
    reset_queries()  # clears django.db.connection.queries

    start = datetime.now()
    results = my_search(search_term, search_method)
    end = datetime.now()

    delta = end - start
    delta_ms = delta.total_seconds() * 1000
    delta_time.append(delta_ms)

    # connection.queries is only populated when DEBUG = True
    queries = connection.queries
    total_time = math.fsum(float(q['time']) for q in queries)
    sql_exec_time.append(total_time)

delta_avg = mean(delta_time)              # delta average
delta_std = stdev(delta_time)             # delta standard deviation (sample)
sql_exec_time_avg = mean(sql_exec_time)   # SQL execution time average
sql_exec_time_std = stdev(sql_exec_time)  # SQL execution time standard deviation
```
Keep it ~~Simple~~ Basic, Silly!
To kickstart this journey, I decided that the most straightforward solution should be my first destination. Quick to set up and quick to tear down. Django provides a variety of ways to search your data; the most basic of these are `contains` and `icontains`. As described in the game plan, I would test two kinds of search terms. Long terms would be one or two entire words like “Star” or “The Ring,” while short terms would consist of one or two characters like "A" or "Be".
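For reference, here is roughly what the basic search looked like at this stage (a sketch assuming the stand-in `Movie` model and a `search_term` supplied by the measurement loop):

```python
queryset = (
    Movie.objects
    .filter(language_code='en')
    .filter(title__icontains=search_term)
)
return list(queryset)
```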
After a hundred iterations, these were the results:
| term type | search method | delta avg (ms) | delta std | SQL exec time avg (sec) | SQL exec time std |
| --- | --- | --- | --- | --- | --- |
| long | icontains | 105.14 | 83.32 | 0.09 | 0.01 |
| short | icontains | 642.79 | 464.4 | 0.11 | 0.02 |
As I predicted, the simple `icontains` search performed well with long search terms but suffered with short ones. Since the delta average measured query time + serialization time + network time + deserialization time, I felt it more accurately represented the search method's performance; the SQL execution time consisted only of its namesake. Another factor to consider when using `icontains` was how the query is executed: `icontains`/`ILIKE` causes a sequential scan, which means we are unable to use indexes for performance improvements. Although long search terms fell comfortably in the desired performance range, short terms did not… sort of.
Well, OK, the numbers don't lie; the short-term delta averages clearly exceed the acceptable range. However, keen-eyed individuals such as yourselves may have noticed something odd. I must ask you to keep this in your adventurer's backpack for now, as it was one of the oversights I mentioned initially and will be addressed later in this journey.
So, for now, I considered the `icontains` path unviable and continued on my journey to fancier alternatives.
Full-Text Search (FTS)
As I continued, I ventured forth for Fancy-Text Sear--
Just kidding. The other promising lead in my search was PostgreSQL's Full-Text Search engine, for which Django conveniently provides functionality in the `django.contrib.postgres.search` module. However, I wasn't familiar with what exactly full-text search was. Luckily, PostgreSQL's documentation is quite extensive.
Full-text search (FTS, or just text search) is a technique for the complex look-up of documents. A document is the unit of searching in a full-text search system, i.e., what the search is searching upon. A common use case, similar to our usage of `icontains`, is providing a search term and returning all documents containing this term.
However, the important difference is the additional utility and flexibility that FTS can provide. There is linguistic support that can handle derived words, e.g., "smoke" and "smoking". Preprocessing and indexing are possible, part of which involves ignoring stop words: words that are useless for searching because they are too common, like "the". Finally, we can sort results by relevancy, which becomes extremely relevant (sorry) when a large number of results are found.
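To see this preprocessing in action, here's a quick sketch that inspects a `tsvector` through Django's database connection (the phrase is just an illustrative example):

```python
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SELECT to_tsvector('english', %s)", ["The smoking gun"])
    print(cursor.fetchone()[0])

# Prints: 'gun':3 'smoke':2
# "the" was dropped as a stop word, and "smoking" was stemmed to "smoke"
```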
Django Provides
Thankfully, Django exposes this search method in a convenient fashion:

```python
Thing.objects.filter(some_column__search='my term')
```

As the documentation describes, the code above creates a `to_tsvector` in the database from the specified field (`some_column`) and a `plainto_tsquery` from the search term (`'my term'`). The results are obtained by matching the query and the vector.
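The same lookup can also be spelled out with the module's `SearchVector` and `SearchQuery` classes, which becomes handy once a language configuration needs to be passed explicitly (a sketch):

```python
from django.contrib.postgres.search import SearchQuery, SearchVector

# Equivalent to some_column__search='my term'
Thing.objects.annotate(
    search=SearchVector('some_column'),  # builds to_tsvector(some_column)
).filter(
    search=SearchQuery('my term'),       # builds plainto_tsquery('my term')
)
```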
Now for my performance test using this new search, my queryset became:

```python
queryset = (
    Movie.objects
    .filter(language_code='en')
    .filter(title__search=search_term)
)
return list(queryset)
```
Performance Results
To my surprise, the fancy new thing I had learned about, which looked to be the solution, was in fact not the solution, at least in terms of performance. It seemed our search method was in another castle.
| term type | search method | delta avg (ms) | delta std | SQL exec time avg (sec) | SQL exec time std |
| --- | --- | --- | --- | --- | --- |
| long | django search | 431.75 | 1711.31 | 0.42 | 1.71 |
| short | django search | 588.26 | 463.19 | 0.58 | 0.46 |
When compared to the `icontains` search, it was only slightly better when searching on short terms and worse on long ones.
I read the docs
Although the benefits of relevancy and a more flexible search were available, the performance of FTS left much to be desired. However, Django's documentation mentions that if you're searching for more than a few hundred records, there will probably be performance problems. This can be remedied if all the fields we're querying on (in our case, just the title field) are contained within one particular model. We can then do two things to improve performance:
1. Generate and save the search vectors in a `SearchVectorField` so they don't need to be generated on every query.
2. Add an index to that field.
Modifications to the model, adding both the index and the `SearchVectorField`, are straightforward when dealing with a single language, as seen below.
```python
# ... other imports
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField


class Movie(models.Model):
    language_code = models.CharField(max_length=4)
    title = models.CharField(max_length=250)
    vector_column = SearchVectorField(null=True)  # new field

    class Meta:
        indexes = [
            GinIndex(fields=["vector_column"]),  # add index
        ]
```
The caveat with the new field is the necessary addition of a database trigger. As described in the Postgres docs, when using a separate column like a `SearchVectorField`, it is necessary to create a `TRIGGER` to update the column whenever the document content (the title column) changes. Thankfully, there are built-in functions to do this: `tsvector_update_trigger` and `tsvector_update_trigger_column`. I needed to create a custom migration like so:
```python
operations = [
    migrations.RunSQL(
        sql='''
            CREATE TRIGGER vector_column_trigger
            BEFORE INSERT OR UPDATE OF title, vector_column
            ON movies_movie
            FOR EACH ROW EXECUTE PROCEDURE
                tsvector_update_trigger(
                    vector_column, 'pg_catalog.english', title
                );

            -- Fire the trigger on every existing row so the new
            -- column is backfilled
            UPDATE movies_movie SET vector_column = NULL;
        ''',
        reverse_sql='''
            DROP TRIGGER IF EXISTS vector_column_trigger
            ON movies_movie;
        '''
    ),
]
```
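With the vectors precomputed, the test queryset could target the new column directly with a `SearchQuery` instead of regenerating vectors via `title__search` (a sketch; a `SearchVectorField` can be filtered against a `SearchQuery`, with the config matching the one used by the trigger):

```python
from django.contrib.postgres.search import SearchQuery

queryset = (
    Movie.objects
    .filter(language_code='en')
    .filter(vector_column=SearchQuery(search_term, config='english'))
)
return list(queryset)
```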
Performance Results
With steps taken to improve performance, well, the results speak for themselves:
| term type | search method | delta avg (ms) | delta std | SQL exec time avg (sec) | SQL exec time std |
| --- | --- | --- | --- | --- | --- |
| long | vector column | 1.128652 | 0.46 | 0.000177 | 0.00039 |
| short | vector column | 0.82339 | 0.40 | 1.1e-05 | 0.00012 |
An incredible improvement compared to the options I had explored: this approach lacked the fragility of the basic search while offering the flexibility and features of FTS, making it the prime contender. However, I realized I hadn't considered one thing. So far, my tests had assumed English only, but what about other languages?
You'll remember in the opening that I said, "certain oversights can dramatically change the outcome". This is one of two oversights.
Languages of The World
The requirement was that a search would target one particular language at a time, and so far that language had only been English. The inclusion of other languages raised a few questions: which languages are we supporting, how does multi-language support work with FTS, and where do you even start?
Updating the Migration
My first lead was in my custom migration's trigger: `pg_catalog.english`. If `english` was an option, that would mean `pg_catalog` had other languages, and I would just have to create a trigger for each language respectively. This was easier said than done, as there is complexity in handling multiple languages. I created a function that builds the appropriate vector based on the row's language. If I wanted to support English and French, the migration for the trigger would now look like this:
```python
operations = [
    migrations.RunSQL(
        sql='''
            CREATE OR REPLACE FUNCTION label_trigger() RETURNS trigger AS $$
            BEGIN
                IF new.language_code = 'en' THEN
                    new.vector_column = to_tsvector('pg_catalog.english', coalesce(new.title, ''));
                ELSE
                    new.vector_column = to_tsvector('pg_catalog.french', coalesce(new.title, ''));
                END IF;
                RETURN new;
            END
            $$ LANGUAGE plpgsql;

            CREATE TRIGGER vector_column_trigger
            BEFORE INSERT OR UPDATE OF title, vector_column
            ON movies_movie
            FOR EACH ROW EXECUTE FUNCTION
                label_trigger();

            -- Backfill existing rows by firing the trigger
            UPDATE movies_movie SET vector_column = NULL;
        ''',
        reverse_sql='''
            DROP TRIGGER IF EXISTS vector_column_trigger
            ON movies_movie;
            DROP FUNCTION IF EXISTS label_trigger;
        '''
    ),
]
```
pg_catalog Languages
So now I had a way to handle multiple languages with FTS while maintaining good performance. But the language support wasn't just for English and French; I needed to determine which languages Postgres supported and check whether all the languages I required were covered. Using psql and the command `\dF`, I got the answer I was looking for.
```
=> \dF
               List of text search configurations
   Schema   |    Name    |              Description
------------+------------+---------------------------------------
 pg_catalog | arabic     | configuration for arabic language
 pg_catalog | danish     | configuration for danish language
 pg_catalog | dutch      | configuration for dutch language
 pg_catalog | english    | configuration for english language
 pg_catalog | finnish    | configuration for finnish language
 pg_catalog | french     | configuration for french language
 pg_catalog | german     | configuration for german language
 pg_catalog | hungarian  | configuration for hungarian language
 pg_catalog | indonesian | configuration for indonesian language
 pg_catalog | irish      | configuration for irish language
 pg_catalog | italian    | configuration for italian language
 pg_catalog | lithuanian | configuration for lithuanian language
 pg_catalog | nepali     | configuration for nepali language
 pg_catalog | norwegian  | configuration for norwegian language
 pg_catalog | portuguese | configuration for portuguese language
 pg_catalog | romanian   | configuration for romanian language
 pg_catalog | russian    | configuration for russian language
 pg_catalog | simple     | simple configuration
 pg_catalog | spanish    | configuration for spanish language
 pg_catalog | swedish    | configuration for swedish language
 pg_catalog | tamil      | configuration for tamil language
 pg_catalog | turkish    | configuration for turkish language
```
Unfortunately, it was not the answer I wanted. The majority of the languages I needed were there but there were still a few missing—namely, Japanese, Vietnamese, Chinese, and Polish. I knew there was a way to create my own config, but after looking at the details, I felt out of my depth.
Custom Configuration
Documentation came to my rescue again, as the outline for creating a custom config was neatly provided. A configuration comprises four types of database objects:

- Text search parsers break documents into tokens and classify each token (for example, as words or numbers).
- Text search dictionaries convert tokens to normalized form and reject stop words.
- Text search templates provide the functions underlying dictionaries (a dictionary simply specifies a template and a set of parameters for the template).
- Text search configurations select a parser and a set of dictionaries for normalizing the tokens produced by the parser.
The next part of the documentation was where I realized this was not a viable option:
Text search parsers and templates are built from low-level C functions; therefore it requires C programming ability to develop new ones, and superuser privileges to install one into a database.
My C skills were lacking, and the timeframe provided made it clear this would be out of scope. It was back to the drawing board for me.
I Called for Aid
Feeling stuck, I reached out to one of my senior colleagues to discuss my predicament. They shifted my focus back onto the basic search, since language would not be a concern there. This was true, but I countered with the glaring performance issue that was still present. That was when they picked up on a strange inconsistency in my performance results.
This was the other oversight, by the way… and thankfully, the last.
The Numbers, What Do They Mean?
| term type | search method | delta avg (ms) | delta std | SQL exec time avg (sec) | SQL exec time std |
| --- | --- | --- | --- | --- | --- |
| long | icontains | 105.14 | 83.32 | 0.09 | 0.01 |
| short | icontains | 642.79 | 464.4 | 0.11 | 0.02 |
| long | django search | 431.75 | 1711.31 | 0.42 | 1.71 |
| short | django search | 588.26 | 463.19 | 0.58 | 0.46 |
| long | vector column | 1.128652 | 0.46 | 0.000177 | 0.00039 |
| short | vector column | 0.82339 | 0.40 | 1.1e-05 | 0.00012 |
When I showed my results, they found it odd that there was a discrepancy in my `icontains` results between the delta columns and the SQL execution time columns. Why did the SQL execution times remain similar while the delta averages varied so greatly? It was something I had never noticed; I had mainly focused on the delta columns, since they included the overhead of Django's execution time. They then noted that no `LIMIT` was being set. I was simply returning all the results at once, leading to slower performance. With a slight modification to my queryset, I redid all my basic search tests.
```python
queryset = (
    Movie.objects
    .filter(language_code='en')
    .filter(title__icontains=search_term)
    [:25]
)
return list(queryset)
```
Performance Results
| term type | search method | delta avg (ms) | delta std | SQL exec time avg (sec) | SQL exec time std |
| --- | --- | --- | --- | --- | --- |
| long | icontains | 21.740007 | 29.52 | 0.02 | 0.029 |
| short | icontains | 2.051341 | 1.21 | 0.001 | 0.001 |
An amazing improvement from such a simple change that I had overlooked. My concerns about fragility were also unfounded: when my team reviewed the results the search returned, they found them acceptable for our use case. Just like that, I was back aboard the basic search train.
Why Not Both?
As I mentioned initially, sometimes there is no clean and clear solution. I had been intent on finding the one “right” solution and only that solution, because that's how it's supposed to work, right? I learned that sometimes that's not the case, as my senior colleague had yet another brilliant suggestion: could we use both?
For languages supported by FTS, we could employ it, and fall back on the basic search otherwise. Both fell within the acceptable performance range, even after the retesting, as long as the vector column and index were used for FTS. Since we could determine the language being used for a search, and it could only ever be one language at a time, the logic could be based on that, as sketched below. It was one of those things that makes perfect sense when explained and leaves you thinking, “why didn’t I think of that?”
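A minimal sketch of what that hybrid dispatch could look like (the `FTS_CONFIGS` mapping and `search_movies` helper are hypothetical names for illustration, not our production code):

```python
from django.contrib.postgres.search import SearchQuery

# Hypothetical mapping from our language codes to pg_catalog configurations
FTS_CONFIGS = {'en': 'english', 'fr': 'french'}  # extend per the \dF output


def search_movies(search_term, language_code, limit=25):
    queryset = Movie.objects.filter(language_code=language_code)
    config = FTS_CONFIGS.get(language_code)
    if config:
        # FTS path: query the precomputed vector with the matching config
        queryset = queryset.filter(
            vector_column=SearchQuery(search_term, config=config)
        )
    else:
        # Fallback for languages without a PostgreSQL configuration
        queryset = queryset.filter(title__icontains=search_term)
    return list(queryset[:limit])
```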
Conclusion
I hope this journey has been rewarding and informative as we explored the capabilities Django and PostgreSQL offer for full-text search. By comparing the use of `icontains` to a more complex search, we have gained a better understanding of the power and limits of PostgreSQL.
Of course, it's important to remember that our expectations of a perfect solution may not always be met. Combining solutions is possible, so staying flexible and open to new ideas is important when searching for one. Additionally, staying organized and documenting progress is essential to identifying potential solutions and determining the most suitable one for the task.