Virtuoso SQL supports tree and graph data structures represented as relations operated on through the use of Transitive SQL Subqueries.
A derived table (i.e., the product of a SELECT
query within a FROM
clause) can be declared to be transitive. This is achieved by applying the TRANSITIVE
modifier to the SELECT
keyword alongside other conventional modifiers such as DISTINCT
or TOP
.
The syntax of this modifier is as follow:
transitive_decl ::= TRANSITIVE <trans_option>[, ...]
trans_opt :: =
| T_MIN ( intnum )
| T_MAX ( intnum )
| T_DISTINCT
| T_EXISTS
| T_NO_CYCLES
| T_CYCLES_ONLY
| T_NO_ORDER
| T_SHORTEST_ONLY
| T_IN <position_list>
| T_OUT ( <position_list )
| T_END_FLAG ( intnum )
| T_FINAL_AS name
| T_STEP ( <position_or_path_spec> )
| T_DIRECTION (intnum)
position_list ::= intnum [,...]
position_or_path_spec ::= intnum
| 'step_no'
| 'path_id'
A transitive derived table is a projection (also known as a query solution) that may comprise four different types of columns:
- Input
- Output
- Step Data
- Special
When a transitive derived table occurs in a query, the enclosing query must specify an equality condition for either (1) all input columns, (2) all output columns, or (3) both. The designation of input and output columns is for convenience only. The order of query execution will generally be decided by the optimizer, unless overridden with the T_DIRECTION
option.
Consider a simplified social network application comprised of the following data:
CREATE TABLE knows ( p1 INT ,
p2 INT ,
PRIMARY KEY ( p1 , p2 )
) ;
ALTER INDEX knows ON knows PARTITION ( p1 INT ) ;
CREATE INDEX knows2 ON knows ( p2 , p1 ) PARTITION ( p2 INT ) ;
INSERT INTO knows VALUES ( 1 , 2 ) ;
INSERT INTO knows VALUES ( 1 , 3 ) ;
INSERT INTO knows VALUES ( 2 , 4 ) ;
All persons have single integer identifiers. There is a row in the knows
table if person p1
claims to know person p2
.
The most basic query is to find all the people that a given person knows either directly or indirectly.
SELECT *
FROM ( SELECT TRANSITIVE
T_IN ( 1 )
T_OUT ( 2 )
T_DISTINCT p1,
p2
FROM knows ) AS k
WHERE k.p1 = 1;
The transitive derived table simply selects from the knows
table. The enclosing top level query gives an initial value for the input column of the transitive SELECT
. This leaves the output column p2
unbound, so the query will iterate over the possible values of p2
. Initially, the query loops over the people directly known by 1
. In the next stage, it takes the binding of p2
and uses it as a new value of the input column p1
to look for people the first degree contact knows and so on, until no new values are found.
The basic meaning of the transitive modifier is that given initial values for input column(s), the subquery is evaluated to produce values for the output columns. Then these values are fed back as values of input columns and so forth, until some termination condition is reached. If there are equality conditions for columns designated as output but no conditions for columns designated as input, then the same process runs from output to input. The terms input and output do not imply execution order. If there are bindings for both input and output columns in the enclosing query, then the transitive derived table looks for ways of connecting the input and output bindings. If no such way is found, the subquery is empty and causes the whole enclosing query to also have no result. A transitive derived table cannot be the right side of an OUTER JOIN
directly but can be wrapped in a derived table that is. In this way, an OUTER JOIN
usage is also possible, whether finding a path is optional.
The result set of a transitive subquery can be thought of as a set of paths. A path consists of one or more consecutive bindings for the input columns and is ordered. In our example, a path is p1=1, p1=2, p1=4
. This is the path connecting persons 1
and 4
. If there are columns in the SELECT
that are neither input or output, they too are recorded for each step of the path. The result set may include just the ends of a path, i.e., one row where the input columns have the beginning and the output columns the end of the path. This means there is one row per distinct path. The result set may also include a row for each step on each path.
In this example, we bind both ends of the transitive subquery and ask how person 1
and 4
are connected. Since the columns p1
and p2
have an equality condition, each row of the result set has these at values 1
and 4
, respectively.
SELECT *
FROM ( SELECT TRANSITIVE
T_IN ( 1 )
T_OUT ( 2 )
T_DIRECTION 3
T_DISTINCT
T_SHORTEST_ONLY p1,
p2,
T_STEP ( 1 ) AS via ,
T_STEP ( 'path_id' ) AS path ,
T_STEP ( 'step_no' ) AS step
FROM knows ) AS k
WHERE p1 = 1
AND p2 = 4 ;
P1 P2 VIA PATH STEP
1 1 1 0 0
1 3 1 1 0
1 3 3 1 1
1 2 1 2 0
1 2 2 2 1
1 4 1 3 0
1 4 2 3 1
1 4 4 3 2
The three rightmost columns allow returning information in the intermediate steps of the transitive evaluation. t_step (1)
means the value of the column at position 1 at the intermediate step. The t_step ('step_no')
is the sequence number of the step returned. The t_step ('path_id')
is a number identifying the connection path, since there may be many paths joining persons 1
and 4
.
In this situation, the result set has one row per step, including a row for the initial and final steps. While the evaluation order may vary internally, the result set is presented as if the query were evaluated from input to output, i.e., looking for people known by 1
, finding 2
and 3
, then looking for people they know, finding that 2
knows 4
, which is a solution, since p2 = 4
was specified in the outer select. If the outer query had p1 = 4
and p2 = 1
, there would be an empty result set since there is no path from 4
to 1
.
For example, if tables have multipart keys, there can be many input and output columns but there must be an equal number of both, since the engine internally feeds the output back into the input or vice versa. The transitive derived table may be arbitrarily complex.
We may have an application that returns extra information about a step. This could, for example, be a metric of distance. In such a case, a column which is not designated as either input nor output and is not a t_step ()
function call, will simply be returned as is.
The result set of a transitive subquery will either have one row for each state reached, or have one row for each step on the path to each state reached.
The first example returns only the ends of the paths, i.e., directly and indirectly known person id
s. It does not return for each returned id
how this person is known, through which set of connections. The second example returns a row for each step on each path. Steps will be returned if the selection has t_step ()
calls or columns that are neither input or output.
The forms of t_step
are:
-
t_step ( <column number> )
This returns the value that the column, one of the columns designated as input, has at this step. The input or output columns themselves, if there is a condition on them, look equal to the condition. This allows seeing intermediate values of input columns on a path.
-
t_step ( 'step_no' )
This returns the ordinal number of the step on the path. Step 0 corresponds to the input variables being at the value seen in the enclosing query. Step 1 is one removed from this. Step numbering is assigned as if evaluating from input to output.
Consider this:
SELECT *
FROM ( SELECT TRANSITIVE
T_IN ( 1 )
T_OUT ( 2 )
T_MIN ( 0 )
T_DISTINCT p1,
p2,
T_STEP ( 1 ) AS via,
T_STEP ( 'path_id' ) AS path ,
T_STEP ( 'step_no' ) AS step
FROM knows) AS k
WHERE p1 = 1 ;
P1 P2 VIA PATH STEP
1 1 1 0 0
1 3 1 1 0
1 3 3 1 1
1 2 1 2 0
1 2 2 2 1
1 4 1 3 0
1 4 2 3 1
1 4 4 3 2
This returns four path
s, all starting at 1
: the path
from 1
to 1
; the path
from 1
to 2
; the path
from 1
to 3
; and the path
from 1
to 2
to 4
. The path
column has values from 0
to 3
, distinguishing the four different path
s returned. The p1
column is the start of the path, thus always 1
since this is given in the outer query. The p2
column is the end of the path. The via
column is the value of p1
at the intermediate step
. The step
number where via
is equal to p1
is 0
. The next step
number is 1
. At the highest step
number of each path
, p2
and via
are the same.
Now, let us do this in reverse:
SELECT *
FROM ( SELECT TRANSITIVE
T_IN ( 1 )
T_OUT ( 2 )
T_MIN ( 0 )
T_DISTINCT p1 ,
p2 ,
T_STEP ( 1 ) AS via ,
T_STEP ( 'path_id' ) AS path ,
T_STEP ( 'step_no' ) AS step
FROM knows) AS k
WHERE p2 = 4 ;
P1 P2 VIA PATH STEP
4 4 4 0 0
2 4 2 1 0
2 4 4 1 1
1 4 1 2 0
1 4 2 2 1
1 4 4 2 2
We give an initial value to p2
and leave p1
free. Now we get three path
s: the path
from 4
to 4
; from 2
to 4
; and from 1
to 2
to 4
. We enumerate the step
s as if counting from input to output, albeit internally the evaluation order is the reverse. Again, step
number 0
has the via
column equal to p1
and the highest numbered step has via
equal to p2
.
Descriptions of Transitive Options
-
T_MIN (INTNUM)
— This means thatpath
s shorter than the number are not returned. In the examples above, we hadt_min
at0
, so that apath
of zero length (i.e., where the first output equals the outer conditions for the inputs) was also returned. -
T_MAX (INTNUM)
— This gives a maximum length ofpath
.Path
s that exceed thisstep
threshold are not returned. A value of1
means that the subquery is evaluated once, i.e., the outputs of the first evaluation are not fed back into the inputs. Specifying a minimum of0
and a maximum of1
means an optionalJOIN
. Specifyingt_min
andt_max
both as1
means an ordinary derived table. -
T_DISTINCT
— This means that if a binding of input columns is produced more than once, only the first is used; i.e., the same point is not traversed twice even if manypath
s lead to it. -
T_EXISTS
— Only onepath
is generated and returned. -
T_NO_CYCLES
— If apath
is found that loops over itself (i.e. a nextstep
has the input values equal to the input values of a previousstep
on thepath
), the binding is ignored. -
T_CYCLES_ONLY
— Onlypath
s that have a cycle (i.e., input values of a subsequentstep
equal the input values of a previousstep
on the samepath
) are returned. -
T_SHORTEST_ONLY
— If both ends of thepath
are given, the evaluation stops at the length ofpath
where the first solution is found. If manypath
s of equal length are found, they are returned, but longer paths are not sought. -
T_IN (column_ordinal_positions)
— This specifies which column is to be used to indicate the source (orIN
)t
node. -
T_OUT (column_ordinal_positions)
— This specifies which column is to be used to indicate the destination (orOUT
)t
node.T_END_FLAG (intnum)
— This option is used to indicate that the transitive operation should stop when the variable returns a non-zero value. Its value is a positive integer value indicating the ordinal position of the variable in the select list to be used. -
T_DIRECTION (intnum)
— A value of0
(default) means that the SQL optimizer decides which way the transitive subquery is evaluated.1
means from input to output,2
from output to input,3
from both ends. Supposing we are looking at how two points are related, it makes sense to start expanding the transitive closure at both ends. In the above example, this would be going fromp1
top2
on one side and fromp2
top1
on the other.
Related
- SQL Transitivity Option Documentation
- SPARQL Transitivity – how this functionality is surfaced in SPARQL-BI