SQL Window Functions

learn
SQL Window Functions

SQL window functions allow you to perform calculations on a set of rows that are related to the current row, without the need for self-joins or subqueries. They are used to perform a calculation across a set of table rows that are somehow related to the current row, in a way similar to aggregate functions. Want to learn more? Check out this video

Transcript

"So, what are SQL window functions?

Well, they're SQL functions that syntactically look like this, where you've got a function call here, that gets applied over some sets of rows, and those sets of rows are called windows.

Now one of the important things to recognize here, is that this over clause defines the window, that the function is applied to.

So, a way of thinking about window functions is that you can pick which function to use here, from a list of possible window functions, and you can define which rows the function should be applied against, with this over clause.

Ok, so what functions can you use here?

Well, you can use these aggregate functions, which you're probably already familiar with, or you can use these analytical functions.

Now, there's a significant difference between using these aggregate functions the conventional way, versus using them as window functions.

For example, I've got this table, which contains the results of runners in a 10k race.

So, let's say we'd like to figure out the average amount of time it took to finish the race, so to do this, I'll select the average time from the results table, and if we run this query, we see the result right here.

Ok, now I want to show you how the average function works when it's used as a window function.

So, to change our basic aggregate function call to be a window function, I'll simply add the clause, over, followed by a pair of parenthesis.

Now let's run this query, and check that out, we don't get a single result like we did a moment ago, we get a bunch of rows, with the same result.

So, this is the first big difference between normal aggregate functions and window aggregate functions.

Normal aggregate functions will take many rows and aggregate them into fewer rows, so for example, this SQL statement only returns one row, with the result, but the window aggregate function returned a result in every single row in our query.

In fact, what's going on here will probably make more sense, if I add an asterisk to the select clause.

Now if I run the query again we see every single row in the results table, and, we also see the average time, repeated in every row.

Now, this part of the window function here, the over clause, can be modified to manipulate the windows, or sets of rows, that this function is applied to.

So for example, with an empty over clause like this, the window that the average function is applied to, would be every single time value returned by our query, but by passing some additional information in the over clause, we can change from one big window to potentially multiple smaller windows.

For example, if the over clause includes the phrase, partition by age_group, then we'll be essentially creating a window or grouping of rows, for each age group, as you see here, and the average function will only be applied to the time values within each age_group window.

Let's go run this query, and if we keep an eye on the average column, you'll notice the value changes for each age group, and again, this is because of what you see here, in the over clause.

Next, let's calculate the fastest race times, both overall and by age group.

So, do you have any thoughts on how we can calculate the fastest time?

Well, let's figure out the overall fastest time first.

So, to figure this out I'll use the min function as a window function, passing in the time, then I'll add an empty over clause and I'll alias this as win time.

And if we run this query, we see the overall fastest time shown here, cool.

So, what do we need to do, to get the fastest race time by age group?

Any thoughts?

Well, what we need to do is pretty similar to this line here, so I'll copy and paste this expression, but then I'll partition the windows by age group and I'll alias this as group win time.

Now I'll run the query, and as you can see, we now have a column with the fastest group time.

The next thing I want to calculate is the overall race results, in other words, I want a column in our result set, that shows which place each runner finished overall, so first place, and second place and third place, and so on.

So, the aggregate window functions aren't going to help us here, but there are a few analytic window functions that we can turn to.

Ok, let's start by trying out this row number function.

So I'll call row number, and then I'll use an empty over clause, followed by an alias of place, and now I'll go ahead and run this query.

Now, if you look at our new column, it seems to assign a unique number to each row, but the number seems to be arbitrary. In other words, the row number doesn't correlate with the runner's race time, but there is something we can add to the over clause, to make the row number more meaningful.

Ok, so what can we do?

Well, we can define a sort order to our window, let me show you what I mean.

So, in the over clause, I'll say, order by time.

Ok, so what does this do for us?

Well, it sorts our window by time, and then it applies the row number, so when we run this, the row number should approximately represent the runner's overall placement in the race, based on their finish time.

So, I'll go ahead and run this, and now if you look at the results, the first-place row, seems to have the fastest time, as does the second, third and so on.

Ok, but there is a problem with this solution, which is evident on these two rows here.

So, what are we looking at?

Well, these two runners have the exact same finish time, in other words, these runners tied each other, so they should have the same overall placement value, but as you can see here, with our current solution they have different placement values.

Ok, so how do we fix this?

Well, the row number window function, doesn't seem like it's going to work for us, so we're going to need to look for another solution.

So, do any of these other analytic functions look like they might be helpful to us?

Well, what about this rank function? I mean, we are trying to figure out the placement or ranking of runners, so let's give this function a try.

So, first I'll change the alias of the row number column, to rn, then I'll call our rank function, followed by the over clause, but let me ask you this, do you think we need anything in this over clause?

Well, yeah we're going to need to add something here to order the window so that the rank function knows how to rank the results, so I'll say, order by time, and then I'll run the query.

Now if we look at the place column, and in particular, if we look at the runners that tied, we now see that they both have the same placement or ranking value, which is what we were wanting, and notice that there is no sixth place because there were two people tied for fifth place so the sixth place is skipped as you would expect.

Ok, there's one other function that deals with ranking, that I want to briefly take a look at.

The function I'm referring to is the dense rank function.

I'll go ahead and copy and paste the rank line, then I'll rename rank to dense_ rank and I'll give it a different alias and I'll run this query.

Now, the difference between rank and dense rank, is that the dense rank function doesn't skip any numbers. So for example, rank skipped the sixth position here, because there were two people tied in fifth, but dense rank used the sixth position after the two fifth positions.

So, in our case, dense rank doesn't work for us, but it can be useful in other scenarios.

The next query I want to create is a query that returns the top 3 runners for each age group.

So, any thoughts on how to calculate this?

Well, we can use our rank function again, but this time we'll apply the function to a different window, so to define our window I'll say over, partition by age group, and I'll also order our windows by the time column, then I'll alias the column as group place.

Now, I'll run the query, and as you can see here, we now have an overall ranking, and we have group-specific rankings.

Now, to get the top 3 runners from each group, I'll use a simple subquery.

So I'll select * from our subquery, which I'll wrap in parathesis, then I'll give it an alias of a, and I'll say where a dot group place is less than 4.

Now if we run this query, we get only the top three finishers from each age group, cool.

So, in summary, window functions give you a nice, terse, readable way of doing analysis with SQL.

Now, one final thing. In this video, I used the excellent relational database SQLite, but the concepts we covered apply to all the popular relational databases.

Hey, here at Mycelial, we're big fans of SQLite, and we're building tools around SQLite that will allow you to synchronize an SQLite database to multiple computers without conflict.

If you are interested in SQLite and what we're doing with it, please subscribe to our channel, and consider joining our newsletter to learn more."

MORE ABOUT MYCELIAL