SQL Transitivity -- Community Note

Virtuoso SQL supports tree and graph data structures represented as relations operated on through the use of Transitive SQL Subqueries.

A derived table (i.e., the product of a SELECT query within a FROM clause) can be declared to be transitive. This is achieved by applying the TRANSITIVE modifier to the SELECT keyword alongside other conventional modifiers such as DISTINCT or TOP.

The syntax of this modifier is as follow:

transitive_decl ::= TRANSITIVE <trans_option>[, ...]

trans_opt :: =
             | T_MIN ( intnum )
             | T_MAX ( intnum )
             | T_DISTINCT
             | T_EXISTS
             | T_NO_CYCLES
             | T_CYCLES_ONLY
             | T_NO_ORDER
             | T_SHORTEST_ONLY
             | T_IN <position_list>
             | T_OUT ( <position_list )
             | T_END_FLAG ( intnum )
             | T_FINAL_AS name
             | T_STEP ( <position_or_path_spec> )
             | T_DIRECTION (intnum)

position_list ::=  intnum [,...]

position_or_path_spec ::= intnum 
                        | 'step_no' 
                        | 'path_id'

A transitive derived table is a projection (also known as a query solution) that may comprise four different types of columns:

  • Input
  • Output
  • Step Data
  • Special

When a transitive derived table occurs in a query, the enclosing query must specify an equality condition for either (1) all input columns, (2) all output columns, or (3) both. The designation of input and output columns is for convenience only. The order of query execution will generally be decided by the optimizer, unless overridden with the T_DIRECTION option.

Consider a simplified social network application comprised of the following data:

CREATE TABLE knows ( p1 INT , 
                     p2 INT , 
                     PRIMARY KEY ( p1 , p2 ) 
                   ) ;
ALTER INDEX knows ON knows PARTITION ( p1 INT ) ;
CREATE INDEX knows2 ON knows ( p2 , p1 ) PARTITION ( p2 INT ) ;

INSERT INTO knows VALUES ( 1 , 2 ) ;
INSERT INTO knows VALUES ( 1 , 3 ) ;
INSERT INTO knows VALUES ( 2 , 4 ) ;

All persons have single integer identifiers. There is a row in the knows table if person p1 claims to know person p2.

The most basic query is to find all the people that a given person knows either directly or indirectly.

SELECT * 
FROM ( SELECT TRANSITIVE 
              T_IN ( 1 ) 
              T_OUT ( 2 ) 
              T_DISTINCT    p1, 
                            p2 
       FROM knows )              AS k 
WHERE k.p1 = 1;

The transitive derived table simply selects from the knows table. The enclosing top level query gives an initial value for the input column of the transitive SELECT. This leaves the output column p2 unbound, so the query will iterate over the possible values of p2. Initially, the query loops over the people directly known by 1. In the next stage, it takes the binding of p2 and uses it as a new value of the input column p1 to look for people the first degree contact knows and so on, until no new values are found.

The basic meaning of the transitive modifier is that given initial values for input column(s), the subquery is evaluated to produce values for the output columns. Then these values are fed back as values of input columns and so forth, until some termination condition is reached. If there are equality conditions for columns designated as output but no conditions for columns designated as input, then the same process runs from output to input. The terms input and output do not imply execution order. If there are bindings for both input and output columns in the enclosing query, then the transitive derived table looks for ways of connecting the input and output bindings. If no such way is found, the subquery is empty and causes the whole enclosing query to also have no result. A transitive derived table cannot be the right side of an OUTER JOIN directly but can be wrapped in a derived table that is. In this way, an OUTER JOIN usage is also possible, whether finding a path is optional.

The result set of a transitive subquery can be thought of as a set of paths. A path consists of one or more consecutive bindings for the input columns and is ordered. In our example, a path is p1=1, p1=2, p1=4. This is the path connecting persons 1 and 4. If there are columns in the SELECT that are neither input or output, they too are recorded for each step of the path. The result set may include just the ends of a path, i.e., one row where the input columns have the beginning and the output columns the end of the path. This means there is one row per distinct path. The result set may also include a row for each step on each path.

In this example, we bind both ends of the transitive subquery and ask how person 1 and 4 are connected. Since the columns p1 and p2 have an equality condition, each row of the result set has these at values 1 and 4, respectively.

SELECT * 
FROM ( SELECT TRANSITIVE 
              T_IN ( 1 ) 
              T_OUT ( 2 ) 
              T_DIRECTION 3 
              T_DISTINCT 
              T_SHORTEST_ONLY                          p1, 
                                                       p2, 
                                       T_STEP ( 1 ) AS via , 
                               T_STEP ( 'path_id' ) AS path , 
                               T_STEP ( 'step_no' ) AS step 
       FROM knows )                                            AS k 
WHERE p1 = 1 
AND p2 = 4 ;
P1    P2    VIA    PATH   STEP
1     1     1      0      0
1     3     1      1      0
1     3     3      1      1
1     2     1      2      0
1     2     2      2      1
1     4     1      3      0
1     4     2      3      1
1     4     4      3      2

The three rightmost columns allow returning information in the intermediate steps of the transitive evaluation. t_step (1) means the value of the column at position 1 at the intermediate step. The t_step ('step_no') is the sequence number of the step returned. The t_step ('path_id') is a number identifying the connection path, since there may be many paths joining persons 1 and 4.

In this situation, the result set has one row per step, including a row for the initial and final steps. While the evaluation order may vary internally, the result set is presented as if the query were evaluated from input to output, i.e., looking for people known by 1, finding 2 and 3, then looking for people they know, finding that 2 knows 4, which is a solution, since p2 = 4 was specified in the outer select. If the outer query had p1 = 4 and p2 = 1, there would be an empty result set since there is no path from 4 to 1.

For example, if tables have multipart keys, there can be many input and output columns but there must be an equal number of both, since the engine internally feeds the output back into the input or vice versa. The transitive derived table may be arbitrarily complex.

We may have an application that returns extra information about a step. This could, for example, be a metric of distance. In such a case, a column which is not designated as either input nor output and is not a t_step () function call, will simply be returned as is.

The result set of a transitive subquery will either have one row for each state reached, or have one row for each step on the path to each state reached.

The first example returns only the ends of the paths, i.e., directly and indirectly known person ids. It does not return for each returned id how this person is known, through which set of connections. The second example returns a row for each step on each path. Steps will be returned if the selection has t_step () calls or columns that are neither input or output.

The forms of t_step are:

  • t_step ( <column number> )

    This returns the value that the column, one of the columns designated as input, has at this step. The input or output columns themselves, if there is a condition on them, look equal to the condition. This allows seeing intermediate values of input columns on a path.

  • t_step ( 'step_no' )

    This returns the ordinal number of the step on the path. Step 0 corresponds to the input variables being at the value seen in the enclosing query. Step 1 is one removed from this. Step numbering is assigned as if evaluating from input to output.

Consider this:

SELECT * 
FROM ( SELECT TRANSITIVE 
              T_IN ( 1 ) 
              T_OUT ( 2 ) 
              T_MIN ( 0 ) 
              T_DISTINCT                            p1, 
                                                    p2, 
                                    T_STEP ( 1 ) AS via, 
                            T_STEP ( 'path_id' ) AS path , 
                            T_STEP ( 'step_no' ) AS step 
       FROM knows)                                          AS k 
WHERE p1 = 1 ;
P1    P2    VIA    PATH    STEP
1     1     1      0       0
1     3     1      1       0
1     3     3      1       1
1     2     1      2       0
1     2     2      2       1
1     4     1      3       0
1     4     2      3       1
1     4     4      3       2

This returns four paths, all starting at 1: the path from 1 to 1; the path from 1 to 2; the path from 1 to 3; and the path from 1 to 2 to 4. The path column has values from 0 to 3, distinguishing the four different paths returned. The p1 column is the start of the path, thus always 1 since this is given in the outer query. The p2 column is the end of the path. The via column is the value of p1 at the intermediate step. The step number where via is equal to p1 is 0. The next step number is 1. At the highest step number of each path, p2 and via are the same.

Now, let us do this in reverse:

SELECT * 
FROM ( SELECT TRANSITIVE 
              T_IN ( 1 ) 
              T_OUT ( 2 ) 
              T_MIN ( 0 ) 
              T_DISTINCT                           p1 , 
                                                   p2 , 
                                   T_STEP ( 1 ) AS via , 
                           T_STEP ( 'path_id' ) AS path , 
                           T_STEP ( 'step_no' ) AS step 
       FROM knows)                                         AS k 
WHERE p2 = 4 ;
P1    P2    VIA    PATH    STEP
4     4     4      0       0
2     4     2      1       0
2     4     4      1       1
1     4     1      2       0
1     4     2      2       1
1     4     4      2       2

We give an initial value to p2 and leave p1 free. Now we get three paths: the path from 4 to 4; from 2 to 4; and from 1 to 2 to 4. We enumerate the steps as if counting from input to output, albeit internally the evaluation order is the reverse. Again, step number 0 has the via column equal to p1 and the highest numbered step has via equal to p2.

Descriptions of Transitive Options

  • T_MIN (INTNUM) — This means that paths shorter than the number are not returned. In the examples above, we had t_min at 0, so that a path of zero length (i.e., where the first output equals the outer conditions for the inputs) was also returned.

  • T_MAX (INTNUM) — This gives a maximum length of path. Paths that exceed this step threshold are not returned. A value of 1 means that the subquery is evaluated once, i.e., the outputs of the first evaluation are not fed back into the inputs. Specifying a minimum of 0 and a maximum of 1 means an optional JOIN. Specifying t_min and t_max both as 1 means an ordinary derived table.

  • T_DISTINCT — This means that if a binding of input columns is produced more than once, only the first is used; i.e., the same point is not traversed twice even if many paths lead to it.

  • T_EXISTS — Only one path is generated and returned.

  • T_NO_CYCLES — If a path is found that loops over itself (i.e. a next step has the input values equal to the input values of a previous step on the path), the binding is ignored.

  • T_CYCLES_ONLY — Only paths that have a cycle (i.e., input values of a subsequent step equal the input values of a previous step on the same path) are returned.

  • T_SHORTEST_ONLY — If both ends of the path are given, the evaluation stops at the length of path where the first solution is found. If many paths of equal length are found, they are returned, but longer paths are not sought.

  • T_IN (column_ordinal_positions) — This specifies which column is to be used to indicate the source (or IN) t node.

  • T_OUT (column_ordinal_positions) — This specifies which column is to be used to indicate the destination (or OUT) t node.

    T_END_FLAG (intnum) — This option is used to indicate that the transitive operation should stop when the variable returns a non-zero value. Its value is a positive integer value indicating the ordinal position of the variable in the select list to be used.

  • T_DIRECTION (intnum) — A value of 0 (default) means that the SQL optimizer decides which way the transitive subquery is evaluated. 1 means from input to output, 2 from output to input, 3 from both ends. Supposing we are looking at how two points are related, it makes sense to start expanding the transitive closure at both ends. In the above example, this would be going from p1 to p2 on one side and from p2 to p1 on the other.

Related