How to speed up "select...like" queries using pg_trgm


In the realm of databases, PostgreSQL stands out for its adaptability and the plethora of features it offers, especially when it comes to handling textual data. An often underappreciated gem in this context is the pg_trgm extension.

What is pg_trgm?

pg_trgm is a PostgreSQL extension that provides functionality for determining the similarity between text strings and improving text searches. It operates by breaking down strings into trigrams, which are groups of three consecutive characters extracted from the text. For instance, the word “Daniele” would be broken down into the trigrams: d, da,dan,ani,nie,iel,ele,le .

This approach facilitates fuzzy searches, identifying results that “resemble” the search term even in the presence of typos or minor variations.

How to Use It?

First and foremost, the extension needs to be enabled in your database with the command:

CREATE EXTENSION pg_trgm;

Once enabled, you can leverage the functions and operators provided by pg_trgm to enhance searches. For instance, to find records similar to a search string, you might use the % operator:

SELECT * FROM your_table WHERE your_text_column % 'word';

This will return all records where your_text_column is similar to ‘word’.

If you are curious and want to see how text are splitted, use show_trgm as shown below.

SELECT show_trgm('Daniele');


--- OUTPUT 
--- {"  d"," da","ani","dan","ele","iel","le ","nie"}

Practical Examples

Suppose we have a songs table with a title column. To find titles similar to “Sultans of swing”, one could write:

SELECT title FROM songs WHERE title % 'Sultens of swing';

This command could successfully retrieve the correct “Sultans of swing” despite the misspellings.

To understand how it works, let’s talk about the % operator.

The % operator can be applied on two text and returns a boolean. Quoting the PG documentation: “Returns true if its arguments have a similarity that is greater than the current similarity threshold set by pg_trgm.similarity_threshold.”

similarity_threshold params is at 0.3 by default. So it consider “similar” two words with a similarity greater or equal to 30%. This can be OK for your scenario or not. In case you need, changing this parameter is quite simple.

To read the current value of pg_trgm.similarity_threshold you can use the following statement:

show pg_trgm.similarity_threshold;

Let’s say we’d like to set pg_trgm.similarity_threshold to 0.5.

Stamement Scope
set pg_trgm.similarity_threshold = 0.5; Session
ALTER DATABASE SET pg_trgm.similarity_threshold = 0.5; Database
ALTER SYSTEM SET pg_trgm.similarity_threshold = 0.5; Cluster

To optimize performance on large tables, it’s advisable to create a GIST or GIN index on the trigrams:

CREATE INDEX idx_title_trgm ON songs USING gist (title gist_trgm_ops);

Now, after this small introduction about pg_trgm and the % operator, let’s focus on the topic of this article.

Other information about the available pg_trgm operators here.

Enhancing LIKE Queries with pg_trgm

Beyond managing fuzzy searches, the pg_trgm extension harbors a trick up its sleeve that can significantly speed up searches using LIKE and ILIKE operators in PostgreSQL, especially those containing wildcard characters % at both ends of the search string. This is particularly useful because, in the absence of pg_trgm, such searches can be very slow on large datasets, as they require a sequential scan of the entire table.

Optimizing LIKE/ILIKE Queries

With pg_trgm installed and properly configured, PostgreSQL can use GIN and GIST trigram-based indexes to accelerate LIKE and ILIKE queries. This method leverages the data’s trigram representation to drastically reduce the data set to be examined during the search.

For example, using the same songs table mentioned earlier, if we want to find all titles containing the string “love”, a typical query might be:

SELECT title FROM songs WHERE title LIKE '%love%';

Without pg_trgm, this query could be slow on a large dataset. However, with a GIN or GIST trigram index on the title column, PostgreSQL can quickly narrow down the search to records that have a higher likelihood of matching the pattern, significantly reducing query execution time.

Creating an Index to Optimize LIKE/ILIKE

To take advantage of this optimization, we can create a GIN index on the column of interest as follows:

CREATE INDEX idx_title_like ON songs USING gin (title gin_trgm_ops);

With this index in place, LIKE and ILIKE queries including patterns with wildcards at both ends can be executed much faster, thanks to trigram-based candidate pre-selection.

Let’s do a real-world test

To do a real-world test, create an half-million records table using the following script.

drop table if exists random_people;
create table random_people as
  with rnd_people as (
    select distinct  
	    ('[0:13]={"Daniele","Debora","Mattia","Jake","Amy","Samuel","Jacopo","Martina","Sofia","Henry","Neil","Tim","John","George"}'::text[])
		    [floor(random()*14)] || ' ' || floor(random()*100000+1)::text || '°'  first_name,
	    ('[0:13]={"Bianchi","Lamborghini","Ferrari","De Tommaso","Rossi","Verdi","Gialli","Caponi","Gallini","Gatti","Ford","Daniel","Harrison","Macdonald"}'::text[])
		    [floor(random()*14)] || ' ' || floor(random()*100000+1)::text || '°' last_name
    from 
      generate_series(1, 10 * 1000 * 50)
  )
  select 
    row_number() over() id, first_name, last_name 
  from 
    rnd_people;

select count(*) from random_people; -- 500,000 records   

EXPLAIN ANALYZE says the query takes c.a. 160ms to execute.

explain analyze select * from random_people where last_name ilike '%tom%';  -- c.a. 160 ms on my machine 

Now, create a GIN index on last_name field using gin_trgm_ops operator class.

CREATE EXTENSION if not exists pg_trgm;
drop index if exists index_last_name_on_random_people_trigram;
CREATE INDEX CONCURRENTLY index_last_name_on_random_people_trigram
ON random_people
USING gin (last_name gin_trgm_ops);

Now, the final test. Re-execute the previous query.

explain analyze select * from random_people where last_name ilike '%tom%'; -- c.a. 0.025 ms on my machine!

Great! From 160ms to 0.025ms (on my machine).

Final Thoughts

Utilizing pg_trgm to optimize LIKE and ILIKE queries is a prime example of how PostgreSQL extensions can enhance application performance and flexibility. It’s important to remember that while trigram-based indexes can significantly speed up these searches, they also require additional space and can impact data insertion times. Therefore, as always in engineering, their implementation should be carefully considered based on the specific requirements of the application.

In conclusion, integrating pg_trgm into your indexing and search strategies can profoundly transform the performance of textual queries. “Sapere aude” (Dare to know) - this extension invites you to explore and fully exploit PostgreSQL’s advanced capabilities for textual data management.

PostgreSQL Trainings

If you want to become proficient in PostgreSQL there are some very popular trainings provided by my company:

If your company needs a custom training, let us know.

Trainings are available remotely and on-site.

Links


Comments

comments powered by Disqus