SQLite for beginners: Full Text Search

learn
SQLite for beginners: Full Text Search

SQLite has an extension called FTS5, which provides full-text search capabilities to your SQLite database.

So, how do you use the full-text search features in SQLite?

Watch this short video to learn more.

Transcript

"I've got this movie review application that's powered by SQLite, and I need to add the ability to search for movies by its title.

So, how should I implement this search feature?

Well, you could use a few SQL operators in your queries to find an answer, such as the: like, glob and regular expression operators.

Let me show you an example of using the like operator.

So I'll open up my movie database in the command line.

Ok next, I'll query for the count from the title table, where the primary_title, is like star wars, but you'll notice I'm using these percent signs, that are wild card characters, before and after the search term, which basically means, we don't care what comes before this phrase, and we don't care about what comes after this phrase, all we care about is that this phrase must be in the primary title, somewhere.

So, how do you think the performance will be, with this query?

In other words, do you think this will be a fast efficient query, or will it be slow?

Let's try it out and see.

Ok, that seemed pretty slow, right?

Let's turn on timing, so we can quantify how slow this query is.

And, now I'll run the query again, and it looks like it takes about 0.8 seconds.

So, how does this compare to expectations? Is it good, is it ok, is it bad?

Well, one of our goals in the applications we build should be to generate a response in less than 100 milliseconds.

Why 100 milliseconds?

Because to us humans, this feels instantaneous.

So, I'll ask again, is our query fast enough?

Nope, it's about 1 order of magnitude slower than our maximum allowed response time, and ideally, it should be 2-3 orders of magnitude faster, than what we're seeing right now.

Ok, so, why is this query so slow?

Well, to answer this question, let's take a look at the query plan, which we can do by prefixing the last query with 'explain query plan', and... the plan is to do a scan on the titles table.

In other words, to get a result, the database has to scan and look at every entry in the titles table, which has about 9.4 million rows.

So is the like operator going to work for us? Well, if this was a one-off, ad-hoc query, then using the like operator could be fine, but if this is a normal query, that's routinely used, then a table scan query isn't going to cut it for us.

Now, as I mentioned earlier, there are a couple of additional operators, the glob and regexp operators, that you could use to search for a term, but these operators have similar performance characteristics to the like operator, so they aren't going to work for us either.

So, how do we avoid doing a table scan?

Well, usually you add an index to improve a query's performance, so can we just add an index to fix our problem?

In other words, could we just add an index on the title tables, primary title column, like this?

Well, we could add this index, and it might slightly improve the query plan, in certain situations, but generally speaking, adding an index to the primary title column, won't provide the type of full-text search performance we need.

Ok, so are we stuck with a poor-performing query, or is there another solution?

Well, the core of SQLite doesn't offer a solution, however, there is an extension we can use to solve our problem.

All right, so what's an extension?

Well, SQLite has been designed in such a way that the core features are somewhat small in number, but, there are several mechanisms for extending SQLite, and these mechanisms are called extensions.

Now, to solve our text-searching problem, we're going to use an extension called the full-text search extension, or FTS5.

Ok, so where do we get this extension from?

I mean, do we need to download it from somewhere and install it?

Well, here's the thing. There are a few extensions that are very popular, and widely used, and the SQLite team has decided to include these extensions with SQLite by default. So if you're using the version of SQLite that came with your computer, or if you're using a package manager like brew, or npm, then SQLite likely already has the full-text search extension installed.

If you want to check to see if your version of SQLite includes the full-text search extension, you can issue the pragma compile, options command, and if you see ENABLE_FTS5, then you know the extension is included with your install.

Ok, so how do you use the full-text search extension?

Well, it works a bit differently from other databases, in that you don't, touch or modify the source table, in other words, we're not going to modify the titles table in any way, instead, we're going to add what's called a virtual table.

Ok, so what's a virtual table?

Well, it's a table you define, similar to normal tables, and you query and maintain the table, with standard SQL statements, like selects, inserts, updates and deletes, but the storage and retrieval of the data, can be implemented in interesting and clever ways.

For example, the fts5 extension, stores and indexes the virtual tables data in a way that's optimized for performing full-text searches.

Let me show you how you might use the FTS5 extension, to improve our query from earlier.

So, the first thing I'll do is create a virtual table, named titles, fts, and we'll use the fts5 extension, then I'll specify what columns should be a part of this virtual table.

So, I'll add the title_id column, which we'll use in queries in just a moment here.

Next, I'll add the primary_title column, followed by the original_title, and that's all the columns we'll add to our virtual table.

Now, as a quick aside, you don't define primary keys in virtual tables and you don't specify datatypes, and if you accidentally specify these things, you'll see an error.

Ok, now we've got this virtual table, so how do we use it?

Well, as I said a moment ago, you interact with this virtual table, just like you would with a normal table, by doing inserts, updates, deletes and selects.

So, let's populate this virtual table by doing an insert.

So, I'll say insert into our new table, titles_fts, and I'll include the title id the primary title and the original title columns, and the values we'll insert will come from the corresponding columns in the titles table.

Now, this is going to take a few seconds, because we're inserting more than nine million records, but I'll ask you a question while we wait.

So, the original titles table, and this new titles_fts, virtual table are logically linked together at this point, in other words, if we do inserts, updates or deletes in this table, we need to make similar updates in this table.

So the question is, how do we ensure that both of these tables are updated in unison, so they're kept in sync?

Well, you could just manually modify both tables in unison, in your code, but there's a good chance that at some point, someone will forget to update both tables, and you'll introduce a bug.

So is there anything we can do to avoid this kind of a bug?

Well, we could use a database trigger.

So how would a trigger work in this scenario?

Well, basically we'd create a trigger for inserts, that automatically inserts the new data into the titles_fts table, after inserting data into the title table.

Then we'd do similar things for updates and deletes.

Now, by using these triggers, we don't have to worry about keeping the tables, titles and titles_fts in synch, because the triggers will handle this for you.

All right, let's head back to the terminal, and we see that this insert was completed successfully.

Ok, now that we've populated the title_fts table from the titles table, we can use it to perform efficient, full-text searches, let me show you.

So, let's query for all movies that include the title 'star wars';

So, I'll count the number of rows from the titles table, which I'll alias as t, then I'll inner join the titles_fts table, which I'll alias as s, then I'll join on the title_id column from both tables, then I'll say where titles_fts matches the term 'star wars';

Ok, so when we used the like operator earlier, we were seeing a response that took about 0.8 seconds, and this new query that's using the full-text search extension is taking just shy of 10 milliseconds, so it's about 2 orders of magnitude faster, cool.

Now, what do you think about this syntax you see here?

It's kind of weird because we're not referencing any columns here, which is odd for a where clause, right?

Ok, so the way to think about this is that we're looking for any records that match the term 'star wars', across all the columns in the titles_fts, virtual table.

So in our case, it's doing a full-text search across these three columns, but what if we wanted to limit the search to certain columns? So is there a way to handle a more limited search?

Yep, all you need to do is modify the match term, to include the column name here, or if you wanted to search across more than one column, you'd specify the column names like this.

Now, the queries we've used so far haven't returned rows, they just returned the count, but let's change this query and have it return the primary_title and the original title, then I'll just limit the rows returned to 10.

So are the 10 titles we see here, the best matches for our search term?

Well, not really, at this point, we're just getting 10 of the more than 4000 matching rows, in no particular sequence.

So, is there a way of sorting the results to get the most relevant records first?

Yep, there is a couple of similar ways to rank your results.

First, there's a function named bm25, which is a ranking function and then there's a rank hidden column. I'll go ahead and add both of these to our last query, and check out the results.

So, both of these new values, that you see here, return a number and the lower the number is, the more relevant the match is, so we could sort by either one of these, to get our true, top 10 results, but you're probably curious what the difference is between these two ranking methods given the fact that they both seem to provide the same value.

Ok, so by default both of these ranking methods return the same value, but the bm25 function can accept additional parameters, that apply different weights to different columns.

For example, if we wanted to apply more weight to the primary title over the original title, then we could pass weight values into the bm25 function, where the first number after the table name applies a weight to the corresponding first column in the virtual table, and this weight applies to the second column, and so on.

So, as you can see, the bm25 ranking function gives you a few more options than the hidden rank column, and you can decide which ranking method to use.

Ok, so to order our results by relevance, we'd just need to order by either the rank or the result of the bm25 function.

Ok, obviously there's much more you can learn about full-text search in SQLite, and you can learn more at this link, but hopefully, this video gives you a bit of a head start.

Hey, here at Mycelial, we're big fans of SQLite, and we're building tools around SQLite that will allow you to synchronize an SQLite database to multiple computers without conflict.

If you are interested in SQLite and what we're doing with it, please subscribe to our channel, and consider joining our newsletter to learn more."

MORE ABOUT MYCELIAL