Database statistics are metadata describing the distribution of values within a database. The database management system uses this information to make informed decisions about query optimization and to improve query performance.
Want to learn more about statistics in SQLite? Check out this video.
"SQL is a declarative language, which means you specify what you want, but you don't specify how to get it. The question of how to satisfy a sql query is handled for you, by the database engine, or more specifically, it's handled by the query planner.
In other words, you specify what you want declaratively in SQL, and you pass the SQL on to the database, where the query planner figures out how to fulfill your query in a cost-effective way.
Now, it's tempting to think of the query planner as a magical black box that we can mostly ignore, but I don't think the black-box description is entirely accurate. I say this because the inputs to the query planner are things that you and I control, and if those inputs are bad, then the output of the query planner can be bad as well.
Ok so, what are the inputs to the query planner?
Well, things like queries, tables, indexes and statistics.
Now, in this video, we're going to take a closer look at one of these inputs to the query planner: statistics.
So what are database statistics?
Well, it's dynamic metadata that assists the query planner in making better decisions.
Ok, so how do these statistics help the query planner?
Well, to answer this question, I want to look at some examples.
So, I've got this contacts table in SQLite, which looks like this.
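The exact schema isn't spelled out here, but as a rough sketch, a contacts table like this might look something like the following (the columns beyond name and private are assumptions for illustration):

    -- Hypothetical schema for illustration; the real table may differ.
    CREATE TABLE contacts (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        email   TEXT,
        private INTEGER  -- boolean stored as 0 or 1, since SQLite has no separate boolean type
    );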
Now, we're going to consider how the query planner decides the best way to satisfy a few SQL queries against this contacts table.
So, here's a query for contacts with the name James, and just so you know, there is an index on the name column.
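For illustration, the query and the index might look something like this (the index name here is made up):

    -- Hypothetical index name; any index on the name column behaves the same way.
    CREATE INDEX idx_contacts_name ON contacts (name);

    SELECT * FROM contacts WHERE name = 'James';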
So, let me ask you, do you think the database engine should always use the index to satisfy this query?
Well, not necessarily. If, for example, there are only a few rows in the contacts table, then it might be more efficient to ignore the index and perform a table scan.
Ok, so why would we not want to use the index?
Well, keep in mind that when you're using an index, there's a bit of overhead, because you're having to jump between pages for the index and pages for the table. When there are just a few rows in the table, it can make more sense to fall back to a table scan, because it could cost less overall.
Now, if on the other hand, you've got a large number of rows, then using the index is typically orders of magnitude faster.
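If you want to see which choice the planner actually makes, you can prefix the query with EXPLAIN QUERY PLAN; the exact output text varies between SQLite versions, but it will tell you whether the plan is an index search or a table scan:

    -- Shows whether SQLite plans a SEARCH using the index or a full SCAN of the table.
    EXPLAIN QUERY PLAN
    SELECT * FROM contacts WHERE name = 'James';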
So should the query planner always use an index, when there are many rows?
Well, not necessarily.
For example, consider this private column, which has boolean values and is an indexed column.
Now in this example, we're dealing with a column with few distinct values, just true or false in this case. So let me ask you: if you wanted to run this query, should the query planner use the index, or should it do a table scan?
Well, it depends on the distribution of values.
For example, if 99% of all private values are true, then you'd want this query to perform a table scan, but if you ran this query for the 1% of rows with private set to false, then you'd prefer to use the index.
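As a rough sketch, assuming private is stored as 0 and 1, the two cases would look like this:

    -- If roughly 99% of rows have private = 1, a table scan is likely cheaper here...
    SELECT * FROM contacts WHERE private = 1;

    -- ...but for the rare value, using the index is the better choice.
    SELECT * FROM contacts WHERE private = 0;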
Ok, hopefully you can see from these examples that the query planner should ideally have some statistics about the tables, the indexes, the distribution of data and so on, so that it can make the best decisions on how to satisfy queries.
Ok, now it's not important that you know about the specific optimizations that the query planner uses, but what is important for you to know, is that the query planner is very sophisticated, and it can perform quite efficiently if it has all the proper inputs, such as appropriate indexes, and statistics.
So, where do the statistics come from? I mean, how do they get generated?
Well, in SQLite, you need to manually update statistics, and there are a couple of different ways to do this.
The first way is to run the ANALYZE command.
Now, if you use the ANALYZE command like this, then statistics will be generated for every index in the database, but keep in mind, you can narrow down what's analyzed by specifying a schema name, a table name, or an index name.
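For example (the index name here is the hypothetical one from earlier):

    ANALYZE;                    -- analyze every index in the database
    ANALYZE main;               -- analyze only the 'main' schema
    ANALYZE contacts;           -- analyze only the indexes on the contacts table
    ANALYZE idx_contacts_name;  -- analyze a single index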
Ok, you should know that generating statistics with the ANALYZE command is an expensive process that can take a bit of time, and it puts a write lock on the database while it's running, which could be problematic for you, so this isn't the recommended usage pattern for long-lived databases.
Instead, what's recommended in SQLite is the following.
First, you can set a limit on the amount of scanning performed by ANALYZE, by issuing the following PRAGMA.
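That PRAGMA is analysis_limit; the SQLite documentation suggests values of a few hundred to a thousand rows per index, with 400 being a commonly recommended setting:

    -- Limit ANALYZE to examining roughly the first N rows of each index.
    PRAGMA analysis_limit = 400;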
By setting this analysis limit, we're basically telling the analysis process to only consider a sampling of data when generating statistics, which isn't perfect, but it's usually good enough.
Now, with this analysis limit set, when you run ANALYZE, you'll notice it finishes much more quickly because, again, it's just looking at a sampling of data.
Now, even running ANALYZE like this isn't the recommended usage pattern; instead of calling ANALYZE directly, you should run PRAGMA optimize.
Ok, running PRAGMA optimize is usually a no-op; however, if SQLite determines that updating statistics would improve performance, then running this pragma will trigger an update of the statistics.
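Running it is just:

    -- Usually a no-op; re-runs ANALYZE only where SQLite thinks it would help.
    PRAGMA optimize;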
Now, you don't need to know much about the statistics that are collected, but if you're curious, you can see them by querying the sqlite_stat1 table, and depending on how your version of SQLite was compiled, there could be a few other stat tables as well.
Ok, so what are we looking at here?
Well, we see the table and index names, then we see a stat column.
Now, the first value in the stat column is the approximate number of rows in the index. The second integer is the approximate number of rows in the index that share the same value in the first column of the index, and there could be more integers after this for the other columns in the index, if there are any.
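If you do want to peek, sqlite_stat1 has three columns (tbl, idx, and stat) and can be queried like any other table; the result row shown here is made up, just to illustrate the format:

    SELECT tbl, idx, stat FROM sqlite_stat1;
    -- Hypothetical result row:
    --   contacts | idx_contacts_name | 10000 2
    --   meaning roughly 10000 rows in the index, and about 2 rows per distinct name.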
Now, most developers don't really need to look at this table, let alone understand what it means. The real takeaway from this video is that statistics are a thing the query planner uses, and if you have a long-lived database, you should update the statistics regularly.
Ok so, how often should you update the statistics?
Well, the recommendation is to run the PRAGMA optimize command just before closing each database connection, and if your database connections stay open for long periods of time, then you might want to set a timer to periodically run PRAGMA optimize.
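Put together, one common pattern is to run these two statements on each connection just before closing it, and on a timer for long-lived connections; this is a sketch of the approach described above, not the only valid schedule:

    -- Run just before closing a connection (or periodically on long-lived connections).
    PRAGMA analysis_limit = 400;
    PRAGMA optimize;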
Hey, here at Mycelial, we're big fans of SQLite, and we're building tools around SQLite that will allow you to synchronize an SQLite database to multiple computers without conflict.
If you are interested in SQLite and what we're doing with it, please subscribe to our channel, and consider joining our newsletter to learn more."