Skip to main content

SQLite for beginners: Indexes, beyond the basics

· 11 min read
James Moore

If you want to level up your database skills, the area you should probably focus on first is Indexes.

So, are you familiar with the various types of indexes you can create?

Check out this video to learn more.

Transcript

"If you want to level up your database skills, the area you should probably focus on first is Indexes. In other words, ideally, you should know what the various types of indexes are, how they work when you should use them, and so on.

So, do you know what these terms mean? If not, stick with me to learn more.

So, I've got this orders table, in an SQLite database, and this table has 20 million rows in it.

Now, before we start talking about indexes, I want to quickly look at how queries are executed when no indexes exist, and so, at this point, the orders table we're going to be using, has no indexes at all, other than the Primary Key.

Ok, let's start by considering the following simple query, select customer id and the sum of the order total, from the orders table, and of course, we'll group by the customer id.

So, how does the database execute this query?

Well, the database is going to have to scan every single row in the table.

So, what does a table scan look like, under the hood?

Well, SQLite will figure out which database pages contain rows from the orders table, and those pages will get pulled from the disk, and the customer id and order total will get pulled out of the page and used to satisfy the query.

And this process of fetching a page, and pulling the needed values out of the page, repeats over and over until all the order data is retrieved.

Now, did you notice any inefficiencies in what I just described?

The inefficiency I'm referring to is subtle, so I'll give you a hint, it has to do with Input/Output or IO.

So do you see what I'm referring to?

Well, the problem with this query is that we have to pull the entire page from disk, and of course, pages are row-oriented in SQLite, which means database pages contain all the columns in the row or rows that are stored in the page, so to execute this query, we end up having to perform IO to fetch a lot of data that we aren't really interested in.

So, can we improve this query, somehow?

Well, yeah we can, and the solution isn't obvious, but before we try and improve the query, let's measure how long it takes to execute this query, without any optimizations applied.

So, I'll run this query, then I'll fast forward the video, so we don't have to wait, and as you can see, it took about 8 seconds to complete this query.

Ok, so what can we do to improve this query?

Well, the thing you do to improve most queries is to add one or more indexes, so we could add an index here, but would it help?

Well, let's see.

So I'll add an index on the customer id, so to do this, I'll create an index named orders, customer, id, idx on the order tables customer id column.

Ok, now we'll run the query again, and I'll speed up the video, and huh... ok, well the query took just shy of 30 seconds, so, yeah the index we added didn't help, if fact it seemed to make it worse.

So what's going on here?

Well, it seems like the query planner picked an inefficient solution, and if we run the last query again, but we prepend "explain query plan", you see that a scan of the orders table is occurring, and it's also referencing our new index.

So basically, our query is reading every page from the orders table, and it's reading from our new index, for some reason, and it's probably jumping back and forth between the index and the table data.

Ok, so should we get rid of the last index?

Well, probably not because it seems very likely that other queries will need to search by the customer id, as that's pretty common, so are we stuck or is there something else we can do?

What if we created another index that includes both the customer id and order total columns?

Would that help our query?

Let's try and see.

So I'll create our new index on both columns, as you see here.

Now I'll go ahead and run our query again, and check that out, our query finished in about 1 second, which is about 30 times faster than the last time we ran this.

Ok, I bet this result is surprising to some of you, so how is it that this index helps our query?

Well, it helps because this index can now be used as the source of data for our query.

In other words, our index contains all the data needed to calculate a result, and so we don't need to scan the orders table anymore, we can just scan the index to calculate the result.

So why is it more efficient to scan the index instead of the table?

Because, when we're pulling the pages for the index, we're just getting the data we need, we're not having to pull all the unnecessary data, as happens when your doing a table scan like we saw earlier. So there's a lot less IO that needs to happen, to calculate the result.

There's a name for this kind of an index, it's called a covering index.

So, is adding this index a good idea?

Well, that's a judgment call, and here are the things you need to consider when making this decision.

First of all, is this a query that is frequently used? If you do use this query regularly, then adding this index might be the right choice, but there are tradeoffs that you need to keep in mind.

For example, having this index takes disk space, and we can see how much disk space is used by using the sqlite3 analyzer, which is installed by default with the SQLite command line interface.

And as you can see here, this new index consumes about 12.5 % of the total disk space, which is about 373 MB.

Now, if we've got plenty of disk space, then having this index could be totally fine, but disk space isn't the only concern.

The other relevant concern is that the more indexes we have, the worse the performance of mutations. In other words, inserts, updates and deletes can take longer to complete, because the indexes need to be updated on every mutation.

So, should we keep this index? Well, again it's a judgment call, but the more important thing for you to know is that covering indexes can result in greatly improved queries, and it's a tool you should have in your tool belt.

You know, a common query that happens against this orders table is to query for an order by its purchase order id.

So for example, here's a fairly common query for an order by the purchase order number.

Now, as you can see this query is pretty slow, and if we prefix the query with explain query plan, we see it's slow because it's doing a full table scan of the orders table.

In other words, it has to look at all 20 million rows to determine the result.

So, let's go ahead and add an index on the order tables purchase order column.

And if we run the query again, it completes almost instantly which is great, but I want to point something out about the purchase order column.

If you look at a few records, you'll notice that the purchase order column is null for a lot of rows.

So, do you see any potential problems with the purchase order index we created, considering the fact that there seem to be quite a few null values in this column?

Well, the problem is, that the null values are included in the index, but users don't routinely search for null purchase order values, usually, they search for a specific purchase order number.

So, what does this mean?

Well, it means that our index is using space to store the null values, which is wasteful because users don't search for null purchase order numbers.

So, is there any way to save the wasted disk space, that's consumed by null values?

Yep, there is, we can change our index to be what's called a partial index.

But before we fix our index, let's look at the current size of the index, which includes the null values, and as you can see here, the index consumes about 7.7% of the disk space, which is about 249 MB in size.

Ok, so how do we fix this index?

Well, let me show you.

First, I'll drop the purchase order index, then I'll create a new index, just like we did a moment ago, but I'll add a where clause, which specifies that the index should be created only when the purchase order is not null.

Now, I'll run our purchase order query again, and as you can see, it executes just as fast as before.

So, let's take a look at how much disk space our new purchase order index takes, and as you can see here, it's down to about 4.1% or 133 MB in size, so we were able to nearly half the amount of space used by our purchase order index, without any compromise in performance, nice.

Ok, there's one more type of index I want to look at.

So, sometimes users query the orders table by the contact name like this.

But this query is case sensitive and often the users don't get a result from this query, because of this case sensitivity. For example, if I query for this contact in lowercase, I don't get a result, because again, it's case-sensitive.

I'd like to make this query a bit more friendly and forgiving, in other words, I'd like this query to work, in a noncase-sensitive way, so do you have any thoughts on how we can accomplish this?

Well, what if we did this:

we query for the orders where the lowercase version of the contacts column, matches our lowercase name.

So, this works, but it's not a very efficient query.

Do you see why?

Well, the problem is, that this part of the where clause, where we call the lower function, has to be called on the contact column of every single row in our table, to properly evaluate this entire expression, so effectively, this query has to perform an expensive table scan to get a result.

So how do we fix this?

Well, we have to add an index, on the contact column, right? So do you think this create index statement will work for us?

hmm, well let's try it and see, so I'll add the index, and I'll run our query again, and as you can see it's still a really slow query, so why didn't this index work for us?

Well, because it's indexing in a case-sensitive way, but there's a trick we can perform, where instead of indexing on the contact column, instead we index on the lowercase version of the contact column, by calling the lower function right here.

So, now I'll run our query again, and check that out, that was really fast, cool.

And by the way, this kind of index is called an indexed expression.

Hey, here at Mycelial, we're big fans of SQLite, and we're building tools around SQLite that will allow you to synchronize an SQLite database to multiple computers without conflict.

If you are interested in SQLite and what we're doing with it, please subscribe to our channel, and consider joining our newsletter to learn more."