Virtuoso SQL supports tree and graph data structures represented as relations operated on through the use of Transitive SQL Subqueries.
A derived table (i.e., the product of a select query within a from clause) can be declared to be transitive. This is achieved by applying the
TRANSITIVE modifier to the
SELECT keyword alongside other conventional modifiers such as
The syntax of this is as follow:
transitive_decl ::= TRANSITIVE <trans_option>[, ...] trans_opt :: = | T_MIN ( intnum ) | T_MAX ( intnum ) | T_DISTINCT | T_EXISTS | T_NO_CYCLES | T_CYCLES_ONLY | T_NO_ORDER | T_SHORTEST_ONLY | T_IN <position_list> | T_OUT ( <position_list ) | T_END_FLAG ( intnum ) | T_FINAL_AS name | T_STEP ( <position_or_path_spec> ) | T_DIRECTION (intnum) position_list ::= intnum [,...] position_or_path_spec ::= intnum | 'step_no' | 'path_id'
A transitive derived table is a projection (query solution) that may comprise four different types of columns:
- Step Data
When a transitive derived table occurs in a query, the enclosing query must specify an equality condition for either (1) all input columns, (2) all output columns, or (3) both. The designation of input and output columns is for convenience only. The order of query execution will generally be decided by the optimizer, unless overridden with the
Consider a simplified social network application comprised of the following data:
CREATE TABLE knows ( p1 INT , p2 INT , PRIMARY KEY ( p1 , p2 ) ) ; ALTER INDEX knows ON knows PARTITION ( p1 INT ) ; CREATE INDEX knows2 ON knows ( p2 , p1 ) PARTITION ( p2 INT ) ; INSERT INTO knows VALUES ( 1 , 2 ) ; INSERT INTO knows VALUES ( 1 , 3 ) ; INSERT INTO knows VALUES ( 2 , 4 ) ;
All persons have single integer identifiers. There is a row in the
knows table if person
p1 claims to know person
The most basic query is to find all the people that a given person knows either directly or indirectly.
SELECT * FROM ( SELECT TRANSITIVE T_IN ( 1 ) T_OUT ( 2 ) T_DISTINCT p1, p2 FROM knows ) AS k WHERE k.p1 = 1;
The transitive derived table simply selects from the
knows table. The enclosing top level query gives an initial value for the input column of the transitive
SELECT. This leaves the output column
p2 unbound, so the query will iterate over the possible values of
p2. Initially, the query loops over the people directly known by
1. In the next stage, it takes the binding of
p2 and uses it as a new value of the input column
p1 to look for people the first degree contact knows and so on, until no new values are found.
The basic meaning of the transitive modifier is that given initial values for input column(s), the subquery is evaluated to produce values for the output columns. Then these values are fed back as values of input columns and so forth, until some termination condition is reached. If there are equality conditions for columns designated as output but no conditions for columns designated as input, then the same process runs from output to input. The terms input and output do not imply execution order. If there are bindings for both input and output columns in the enclosing query, then the transitive derived table looks for ways of connecting the input and output bindings. If no such way is found, the subquery is empty and causes the whole enclosing query to also have no result. A transitive derived table cannot be the right side of an
OUTER JOIN directly but can be wrapped in a derived table that is. In this way, an
OUTER JOIN usage is also possible, whether finding a path is optional.
The result set of a transitive subquery can be thought of as a set of paths. A path consists of one or more consecutive bindings for the input columns and is ordered. In our example, a path is
p1=1, p1=2, p1=4. This is the path connecting persons
4. If there are columns in the
SELECT that are neither input or output, they too are recorded for each step of the path. The result set may include just the ends of a path, i.e., one row where the input columns have the beginning and the output columns the end of the path. This means there is one row per distinct path. The result set may also include a row for each step on each path.
In this example, we bind both ends of the transitive subquery and ask how person
4 are connected. Since the columns
p2 have an equality condition, each row of the result set has these at values
SELECT * FROM ( SELECT TRANSITIVE T_IN ( 1 ) T_OUT ( 2 ) T_DIRECTION 3 T_DISTINCT T_SHORTEST_ONLY p1, p2, T_STEP ( 1 ) AS via , T_STEP ( 'path_id' ) AS path , T_STEP ( 'step_no' ) AS step FROM knows ) AS k WHERE p1 = 1 AND p2 = 4 ;
P1 P2 VIA PATH STEP 1 1 1 0 0 1 3 1 1 0 1 3 3 1 1 1 2 1 2 0 1 2 2 2 1 1 4 1 3 0 1 4 2 3 1 1 4 4 3 2
The three rightmost columns allow returning information in the intermediate steps of the transitive evaluation.
t_step (1) means the value of the column at position 1 at the intermediate step. The
t_step ('step_no') is the sequence number of the step returned. The
t_step ('path_id') is a number identifying the connection path, since there may be many paths joining persons
In this situation, the result set has one row per step, including a row for the initial and final steps. While the evaluation order may vary internally, the result set is presented as if the query were evaluated from input to output, i.e., looking for people known by
3, then looking for people they know, finding that
4, which is a solution, since
p2 = 4 was specified in the outer select. If the outer query had
p1 = 4 and
p2 = 1, there would be an empty result set since there is no path from
For example, if tables have multipart keys, there can be many input and output columns but there must be an equal number of both, since the engine internally feeds the output back into the input or vice versa. The transitive derived table may be arbitrarily complex.
We may have an application that returns extra information about a step. This could, for example, be a metric of distance. In such a case, a column which is not designated as either input nor output and is not a
t_step () function call, will simply be returned as is.
The result set of a transitive subquery will either have one row for each state reached, or have one row for each step on the path to each state reached.
The first example returns only the ends of the paths, i.e., directly and indirectly known person
ids. It does not return for each returned
id how this person is known, through which set of connections. The second example returns a row for each step on each path. Steps will be returned if the selection has
t_step () calls or columns that are neither input or output.
The forms of
t_step ( <column number> )
This returns the value that the column, one of the columns designated as input, has at this step. The input or output columns themselves, if there is a condition on them, look equal to the condition. This allows seeing intermediate values of input columns on a path.
t_step ( 'step_no' )
This returns the ordinal number of the step on the path. Step 0 corresponds to the input variables being at the value seen in the enclosing query. Step 1 is one removed from this. Step numbering is assigned as if evaluating from input to output.
SELECT * FROM ( SELECT TRANSITIVE T_IN ( 1 ) T_OUT ( 2 ) T_MIN ( 0 ) T_DISTINCT p1, p2, T_STEP ( 1 ) AS via, T_STEP ( 'path_id' ) AS path , T_STEP ( 'step_no' ) AS step FROM knows) AS k WHERE p1 = 1 ;
P1 P2 VIA PATH STEP 1 1 1 0 0 1 3 1 1 0 1 3 3 1 1 1 2 1 2 0 1 2 2 2 1 1 4 1 3 0 1 4 2 3 1 1 4 4 3 2
This returns four
paths, all starting at
3; and the
path column has values from
3, distinguishing the four different
paths returned. The
p1 column is the start of the path, thus always
1 since this is given in the outer query. The
p2 column is the end of the path. The
via column is the value of
p1 at the intermediate
step number where
via is equal to
0. The next
step number is
1. At the highest
step number of each
via are the same.
Now, let us do this in reverse:
SELECT * FROM ( SELECT TRANSITIVE T_IN ( 1 ) T_OUT ( 2 ) T_MIN ( 0 ) T_DISTINCT p1 , p2 , T_STEP ( 1 ) AS via , T_STEP ( 'path_id' ) AS path , T_STEP ( 'step_no' ) AS step FROM knows) AS k WHERE p2 = 4 ;
P1 P2 VIA PATH STEP 4 4 4 0 0 2 4 2 1 0 2 4 4 1 1 1 4 1 2 0 1 4 2 2 1 1 4 4 2 2
We give an initial value to
p2 and leave
p1 free. Now we get three
4; and from
4. We enumerate the
steps as if counting from input to output, albeit internally the evaluation order is the reverse. Again,
0 has the
via column equal to
p1 and the highest numbered step has
via equal to
Descriptions of Transitive Options
T_MIN (INTNUM)- This means that
paths shorter than the number are not returned. In the examples above, we had
0, so that a
pathof zero length (i.e., where the first output equals the outer conditions for the inputs) was also returned.
T_MAX (INTNUM)- This gives a maximum length of
Paths that exceed this
stepthreshold are not returned. A value of
1means that the subquery is evaluated once, i.e., the outputs of the first evaluation are not fed back into the inputs. Specifying a minimum of
0and a maximum of
1means an optional
1means an ordinary derived table.
T_DISTINCT- This means that if a binding of input columns is produced more than once, only the first is used; i.e., the same point is not traversed twice even if many
paths lead to it.
T_EXISTS- Only one
pathis generated and returned.
T_NO_CYCLES- If a
pathis found that loops over itself (i.e. a next
stephas the input values equal to the input values of a previous
path), the binding is ignored.
paths that have a cycle (i.e., input values of a subsequent
stepequal the input values of a previous
stepon the same
path) are returned.
T_SHORTEST_ONLY- If both ends of the
pathare given, the evaluation stops at the length of
pathwhere the first solution is found. If many
paths of equal length are found, they are returned, but longer paths are not sought.
T_IN (column_ordinal_positions)- This specifies which column is to be used to indicate the source (or
T_OUT (column_ordinal_positions)- This specifies which column is to be used to indicate the destination (or
T_DIRECTION INTNUM- A value of
0(default) means that the SQL optimizer decides which way the transitive subquery is evaluated.
1means from input to output,
2from output to input,
3from both ends. Supposing we are looking at how two points are related, it makes sense to start expanding the transitive closure at both ends. In the above example, this would be going from
p2on one side and from
p1on the other.
- SPARQL Transitivity – how this functionality is surfaced in SPARQL-BI