[ad_1]
Data scientists handle time series data on a daily basis and being able to manipulate and analyses these data is a required part of the job. SQL window functions allow you to do just this and is a common data science interview question. So let's talk about what time series data is, when to use them, and how to implement functions to help manage times series data.
What Is Times Series Data?
Time series data are variables within your data that have a time component. This means that each value in this attribute has either a date or time value, sometimes they have both. Here are some examples of times series data:
• The daily stock price for companies because each stock price is associated with a specific day
• The daily average stock index value over the last few years because each value is mapped to a specific day
• Unique visits to a website over a month
• Platform registrations each day
• Monthly sales and revenue
• Daily logins for an app
LAG and LEAD Window Functions
When handling time series data a common calculation is to calculate growth or averages over time. This means that you'll need to either grab the future date or previous date and it's associated values.
Two WINDOW functions that allow you to accomplish this is LAG and LEAD, which are extremely useful for dealing with time related data. The main difference between LAG and LEAD is that LAG gets data from previous rows, while LEAD is the opposite, it fetches data from following rows.
We can use either one of the two functions to compare month over month growth for example. As a data analytics professional, you are very likely to work on time related data, and if you are able to use LAG or LEAD efficiently, you will be a very productive data scientist.
A Data Science Interview Question That Requires A Window Function
Let's go through an advanced data science sql interview question dealing with this window function. You'll see window functions commonly being part of interview questions but you'll also see them a lot in your daily work so it's important to know how to use them.
Let's go through one question from Airbnb called growth of Airbnb. If you want to follow along interactively, you can do so here.
The question is to estimate the growth of Airbnb each year using the number of hosts registered as the growth metric. The rate of growth is calculated by taking ((number of hosts registered in the current year – number of hosts registered in the previous year) / the number of hosts registered in the previous year) * 100.
Output the year, number of hosts in the current year, number of hosts in the previous year, and the rate of growth. Round the rate of growth to the nearest percent and order the result in the ascending order based on the year.
Approach Step 1: Count the host for the current year
The first step is to count hosts by year so we'll need to extract the year from the date values.
SELECT extract(year
FROM host_since::date) AS year,
count(id) current_year_host
FROM airbnb_search_details
WHERE host_since IS NOT NULL
GROUP BY extract(year
FROM host_since::date)
ORDER BY year
Approach Step 2: Count the host for the previous year.
This is where you'll be using the LAG window function. Here you'll create a view where we have the year, number of hosts in that current year, and then number of hosts from the previous year. Use a lag function for the previous year count and take the last year's value and put it in the same row as this year's count. This way you will have 3 columns in your view — year, current year host count, and last year's host count. The LAG function allows you to easily pull the last year's host count in your row. This makes it easy for you to implement any metric like a growth rate because you have all the values you need on one row for SQL to easily calculate a metric. Here's the code for it:
SELECT year,
current_year_host,
LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
FROM
(SELECT extract(year
FROM host_since::date) AS year,
count(id) current_year_host
FROM airbnb_search_details
WHERE host_since IS NOT NULL
GROUP BY extract(year
FROM host_since::date)
ORDER BY year) t1) t2
Approach 3: Implement the growth metric
As mentioned earlier, it's much easier to implement a metric like the one below when all the values are on one row. This is why you perform the LAG function. Implement the growth rate calculation round(((current_year_host – prev_year_host)/(cast(prev_year_host AS numeric)))*100) estimated_growth
SELECT year,
current_year_host,
prev_year_host,
round(((current_year_host – prev_year_host)/(cast(prev_year_host AS numeric)))*100) estimated_growth
FROM
(SELECT year,
current_year_host,
LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
FROM
(SELECT extract(year
FROM host_since::date) AS year,
count(id) current_year_host
FROM airbnb_search_details
WHERE host_since IS NOT NULL
GROUP BY extract(year
FROM host_since::date)
ORDER BY year) t1) t2
[ad_2]
Source by Rakib Raihan