maandag 29 november 2021

LATERAL joins in Snowflake

Introduction

I was studying the LATERAL JOIN construct in order to understand the joining methods of Snowflake and I stumbled upon the article of Dave Abercrombie called "HOW TO: LATERAL FLATTEN AND JSON TUTORIAL". A great article and explanations. I decided to use this article for this blogpost. I've included his demo queries in the this article and investigated the query profiles. 


Prerequisites

First let's setup the demo database with this script.

//=============================================================================
// Set context
//============================================================================

ALTER WAREHOUSE IF EXISTS COMPUTE_WH RESUME IF SUSPENDED;

USE ROLE SYSADMIN;
USE WAREHOUSE Compute_WH;

//=============================================================================
// Setup of the experiment
//=============================================================================

DROP DATABASE IF EXISTS LATERALFLATTEN;

CREATE OR REPLACE DATABASE LATERALFLATTEN;
USE DATABASE LATERALFLATTEN;

CREATE TABLE IF NOT EXISTS snowflake_lateral_test_dept (
    DEPT_ID int,
    DEPT_NAME string
);

INSERT INTO snowflake_lateral_test_dept (DEPT_ID, DEPT_NAME) VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');

SELECT * FROM snowflake_lateral_test_dept

CREATE TABLE IF NOT EXISTS snowflake_lateral_test_emp (EMP_ID int, DEPT_ID int, AGE NUMBER, HEIGHT NUMBER, WEIGHT NUMBER)

INSERT INTO snowflake_lateral_test_emp (EMP_ID, DEPT_ID, AGE, HEIGHT, WEIGHT) VALUES
(1, 1, 25.5, 1025.7, 152.3),
(2 , 1 , 26.9 , 963.3 , 151.4),
(3 , 1 , 25.8 , 1042.7 , 149.6),
(4 , 1 , 26.3 , 1003.3 , 153.7),
(5 , 1 , 23 , 1020.8 , 151.8),
(6 , 1 , 23.5 , 985.3 , 150.1),
(7 , 2 , 32 , 1078.3 , 155.4),
(8 , 2 , 28.4 , 1142.9 , 156.4),
(9 , 2 , 30.2 , 1095.6 , 158),
(10 , 2 , 30.8 , 1089.2 , 158.4),
(11 , 2 , 29.6 , 1146.1 , 164.4),
(12 , 2 , 28.5 , 1089.4 , 164.6),
(13 , 3 , 35.5 , 1188.5 , 171.7),
(14 , 3 , 32.5 , 1235.6 , 169.4),
(15 , 3 , 37.1 , 1212.1 , 169.6),
(16 , 3 , 36.7 , 1224.9 , 168.7),
(17 , 3 , 34.1 , 1190.8 , 168.6),
(18 , 3 , 33.6 , 1177.5 , 173.2),
(19 , 4 , 40.1 , 1281.1 , 182.2),
(20 , 4 , 39.9 , 1306.7 , 180.6),
(21 , 4 , 38.1 , 1311.4 , 175.9),
(22 , 4 , 38.2 , 1287.4 , 184.4),
(23 , 4 , 39.1 , 1294.9 , 177.2),
(24 , 4 , 37.7 , 1320.1 , 179.9)

SELECT * FROM snowflake_lateral_test_emp



Just a normal join

The article starts with a normal join and an aggregate and the query returns a average per department, order by department


-- no subquery at all
  select d.dept_name,
            avg(e.age) as avg_age
      from snowflake_lateral_test_dept d
      join snowflake_lateral_test_emp e
        on d.dept_id = e.dept_id
group by 1
order by 1
;


Results in :




The following Query Profile (execution plan) represents the normal join construct. The JoinFilter is used for removing tuples that can be identified as not possibly matching the condition of a Join further in the query plan. I'm not sure what the purpose is here, it seems that the same number of rows is going in and out (24).




Common Table Expression

With a CTE you can break up the query in procedural parts and it makes queries more readable. Here is the example I borrowed from Dave's article.


with sub as (
 select dept_id,
            avg(age) as avg_age
      from snowflake_lateral_test_emp
group by dept_id
) -- --------------------------------
   select d.dept_name,
              sub.avg_age
    from snowflake_lateral_test_dept d
    join sub
      on d.dept_id = sub.dept_id
order by 1
;


Results in :



The Query Profile is as follows :




With a subquery

Here is an example with the subquery (not correlated).  As Dave explains the subquery is treated like a small table that is used as a inline view.

  select d.dept_name,
         sub.avg_age
   from snowflake_lateral_test_dept d,
        ( --sub
          select dept_id,
                 avg(age) as avg_age
           from snowflake_lateral_test_emp
        group by dept_id
        ) sub
   where d.dept_id = sub.dept_id
order by 1
;


Results in :




Here is the Query Profile and here you can see the two paths: one for the subquery (average) and one for the department:




Correlated subquery

And here is the example of the correlated subquery. 


select d.dept_name,
   ( -- correlated subquery
    select avg(e.age) as avg_age
     from snowflake_lateral_test_emp e
    where e.dept_id = d.dept_id
   ) as avg_age
 from snowflake_lateral_test_dept d
order by 1
;


Resulting in




And here is the Query profile for the correlated subquery. It is a bit different than the previous example. Here is a Filter predicate used with SUM(E.age) is NOT NULL and COUNT(E.AGE) IS NOT NULL.




Here you can see that the LATERAL FLATTEN predicate is used in the Query profile and is also used in the following example.

Lateral join with a correlated subquery

This is the example of the query with the LATERAL keyword. To my knowledge it seems comparable with the OUTER APPLY and the INNER APPLY construct of Microsoft SQL Server family code. I've written about this in this blogpost : joining tables with the OUTER APPLY.


 select d.dept_id,
        d.dept_name,
        e2.avg_age,
        e2.avg_height
    from snowflake_lateral_test_dept d,
 lateral ( -- correlated subquery
            select avg(e1.age) as avg_age,
                   avg(e1.height) as avg_height
              from snowflake_lateral_test_emp e1
             where e1.dept_id = d.dept_id
          ) e2
;


Results in 




With the following Query Profile : 



The LATERAL JOIN (and OUTER APPLY) becomes very handy when you want to return more than one column form the subquery (you have to add more subqueries to the query)

Conclusion

Here I've combined the different Query Profiles in the following diagram.




Hennie