Modeling Hierarchies (Part 1)
As a computer users, we deal with hierarchies every day and working with relational data involves hierarchical relationships of varying degrees of complexity from the simple parent-child-grandchild (e.g. Customer -> Order -> Item) to very complex relationships (e.g. Bill of Materials).
So what, exactly, is a hierarchy
A hierarchy is a representation (usually shown graphically as a tree-diagram) of the way in which elements of data are related to one another. Each element is represented in the hierarchy by a ‘Node’ (usually represented as boxes), and each node is related to one, or more, other nodes by ‘Edges’ (usually represented as lines). The way in which nodes, and their edges, is interpreted depends upon the information that the hierarchy is describing. Thus in organizational hierarchies the nodes may represent either “Positions” or “Departments”, and the edges then represent either a ‘reports to’, or an ‘is a member of’ relationship. Conversely in a Bill of Materials hierarchy the nodes typically represent ‘Assembly Units’ and the edges define ‘is made of’ relationships.
The position of a node is defined by the number of incoming links (its ‘indegree’) that define its ‘parents’ and the number of outgoing links (its ‘outdegree’) that define its ‘children’. A node with an indegree of zero and an outdegree greater than zero defines the starting point of a hierarchy and is referred to as a ‘Root’ node. Conversely a node with an indegree greater than zero and an outdegree of zero defines the end point of a hierarchy and is referred to as a ‘leaf’ node. Nodes where both indegree and outdegree are greater than zero are ‘intermediate’ and those where both are zero are ‘isolated’.
Types of hierarchy
Simple hierarchies are comprised of a single family of nodes that are directly related to each other. In this case the hierarchy is represented as a single ‘Tree’. More complex hierarchies may comprise more than one family of nodes and the hierarchy is a collection of ‘SubTrees’. However complex the hierarchy, it is a defining characteristic that it is possible to trace a path from any node in the hierarchy to any other node by following the edges. Another characteristic is that all nodes except the root have an indegree of at least 1. While there are an infinite variety of possible hierarchies that can be defined using Nodes and Edges, there are two basic classes of hierarchy; either ‘Symmetric’ or ‘Asymmetric’.
Symmetric Hierarchies
A symmetric hierarchy is one in which the number of levels is the same for all component subtrees (Figure 1). Notice that the subtrees are not themselves required to be symmetrical with each other, what defines the hierarchy as symmetric is that there is at least one node at each separate level of every subtree.

Such a hierarchy can be modeled using simple relational tables, where one table is used to represent each level of the hierarchyCC (Figure 2). The relationship between records in any pair of tables is always at least one-to-one and there can never be any gaps in the structure.

This is probably the commonest type of hierarchy and handles many scenarios including the both functional (e.g. Customers and Orders) and structural (e.g. Company Organization). However, it is constrained by the limitation that there can never be any gaps in levels of the hierarchy and, if you need to model the case where Level 0 always the direct parent of Level 1 data, but is sometimes the direct parent of Level 2 data, this approach will fail.
Asymmetric Hierarchies
An asymmetric hierarchy is one in which the number of levels is not the same for all component subtrees (Figure 3). Notice that, as with the symmetric hierarchy, the structure of the subtrees does not matter. The defining characteristic is that there is not always a node at every level of every subtree.
It is clear that this type of hierarchy cannot be directly represented by mapping the levels to a set of relational tables because there are gaps in the structure. The first subtree has no “Level 2” nodes, while the second has a leaf node at “Level 1” and nothing below it. The third subtree has both leaf nodes and intermediate nodes at “Level 2” and a Leaf node that exists at “Level 4” but is related directly to a “Level 2” node that also has children at “Level 3”.

Since there are no clear rules about how any given subtree is constructed, the only way in which such a hierarchy can be modeled is by defining the individual relationships that exist between pairs of nodes. Typically this is done by defining a reflexive (also known as ‘self-referential’) table in which each record defines a node and relates specifically to its immediate parent (Figure 4).

However, it must be noted that this approach relies on the fact that any given node can have one, and only one, immediate parent. If a single node can be linked to more than one parent (as is often the case in reality) even this approach fails. The only solution in such cases (often used!) is to break normalization and duplicate the node record for each unique parent.
Working with hierarchies
As stated at the beginning of this paper, the purpose of any hierarchy is to describe how elements of data are related to one another. The reason for doing this is to permit the aggregation of raw data so as to avoid the necessity of working directly with the details. Hierarchies allow us to define meaningful levels of detail, whether these are "Sales Teams" or "Component Assemblies" or "Project Plans" depends only on the interpretation. As long as we are dealing only with symmetric hierarchies there is really no difficulty. SQL allows us to perform the relevant aggregation operations on the series of related tables that describe the hierarchy directly. Thus to retrieve a list of all the nodes at any given level we simply query the relevant table. To find all children of any given node requires only a single join. To traverse the entire hierarchy requires a series of joins but is still directly possible using SQL.
However, as soon as we have to consider the possibility of an asymmetric hierarchy we run into a problem. As noted above, such hierarchies can only be modeled by defining the relationships between pairs of nodes directly. In order to reconstruct the hierarchy we have to perform an operation often referred to as ‘tree-walk’. In other words we have to carry out the following steps:
-
[1] Retrieve the starting key for the root node
-
[2] Find all records where that key is defined as the parent (“Set1”).
-
[3] For the first record in the result set find all records where that key is defined as the parent
-
[4] Repeat [3] until there are no more children
-
[5] Repeat steps [3] and [4] for each record in Set1
Unfortunately basic SQL (which is primarily a set based language) does not directly support such recursive operations, or functions that rely on them. (Interestingly such support was proposed for the SQL3 standard – the so-called ‘RECURSIVE UNION’ operator but it was removed in 1996 and there are no plans to re-introduce it into the standard). Recent versions of some database management systems (including SQL Server 2005) have included enhanced functionality to address some of these issues, but even so it remains problematic. There are basically three alternatives to using SQL:
-
Handle the issue in code. This is the solution adopted most often; tools and applications that rely on hierarchies typically include functionality that will perform a tree-walk transparently. The classic example is the "Bill of Materials Explosion" found in manufacturing and stock control applications. The problem is that such code is difficult to write and is, particularly when dealing with large volumes of data and many levels, comparatively slow to execute.
-
Use a nested set approach. This approach relies upon the fact that a hierarchy can be represented not only as a tree, but also as a nested set of data. The problem with this approach is that it is difficult to handle dynamic, or changing, structures. This approach is described in detail in the following section.
-
Use a mapping approach. This approach relies on the fact that it is possible to create, for every node, a detailed set of the paths which define its location. Such paths are used to define all possible results from ‘walking the tree’ for every node in the hierarchy. This approach is described in detail below.
The Nested Set approach
The nested set model relies upon the fact that instead of being viewed as a tree, a hierarchy can be seen as a set of nested data points. Figure 5 depicts the same hierarchy illustrated in Figure 3 with each node identified by a unique key (a single letter in this case).

This same hierarchy can also be viewed as a nested set by replacing the lines usually used to represent the edges of the hierarchy with nodes that are nested inside each other (Figure 6).

An examination of this diagram shows that it is, in fact, identical to the tree shown at Figure 5. Thus nodes “B”, “C” and “D” are children of “A”. Nodes “G” and “H” are children of “B” and “K” is the only child of “H”. This diagram can be represented in a data table by associating a pair of numbers with each node. These numbers defines the lowest number that exists within the given node’s subtree (if any) and the highest number within that subtree. Figure 7 shows the hierarchy with the first tree numbered.

(You may find it easier to visualize, if you think of the numbers as being defined by starting at the root node of the tree in figure 6 with the number “1” and then visiting each node in the tree in turn by following the edges and working Left-to-Right and Top-to-Bottom. As you enter each node, increment the number by 1 and assign it to the ‘left’ side of the node. When you reach a leaf node, increment the number by one and assign it to the ‘right’. As you return to each node after visiting all of it’s children, again increment the number by 1 and assign it to the ‘right’. This approach is sometimes referred to as ‘Left-Right Numbering’ for this very reason).
The result is that every node defines the limiting values that identify all of its children. It is then possible to retrieve all nodes that fall within a given subtree in a single query by selecting out those nodes whose ‘left’ value is greater than the parent left value and the ‘right’ value is less than that of the parent. This is obvious from figure 7 where Node B defines the range [2-9]. By selecting all nodes where “LEFT > 2 AND RIGHT < 9” we would get nodes “G”, “H” and “K”. This is a very powerful way of handling any type of hierarchy – whether symmetric or asymmetric and neatly gets around the issue of having to walk the tree dynamically by defining the limits as part of the node definition.
It is also easily handled within the context of the standard approach to defining the nodes in a hierarchy as a series of parent-child relationships. All that is needed is two additional columns in the table (usually named ‘lft’ and ‘rgt’ to avoid conflicting with the key words ‘LEFT’ and 'RIGHT’) to hold the limiting values for each node. Figure 8 shows such a structure and the data for the numbered portion of the hierarchy.

The real drawback to this approach is that it is really only suitable for structures that are static. Inserting or deleting a node, or moving a node from one place to another, requires that the entire tree numbering be validated and re-calculated. This can be a lengthy and difficult process since it requires walking the entire tree in the specific order (Top-to-Bottom and Left-to-Right) to ensure that limits correctly reflect the dependent structures. This is not to say that it is not a viable approach, it is widely used when dealing with hierarchies that are essentially static and, as noted, it does get around the limitations of SQL by avoiding the need to use a recursive algorithm. However it is very rigid and is only the best solution in certain, very specialized, circumstances.
The Mapping approach
The third approach breaks with the traditional method of representing a hierarchy as a series of parent-child relationships defined in a single reflexive table. Instead of trying to represent the relationships between pairs of nodes, we describe the hierarchy as the set of all permissible paths between a series of isolated nodes.
In order to do so we simply traverse the hierarchy starting with the root node and note, for each node that we encounter, the identities of all the nodes, that lie between it and the root node. It is important to recognize that this methodology is firmly rooted in the definition of what constitutes a hierarchy – in that it relies on the principle that it is possible to trace an unbroken path between the root and any node. It is also important to recognize that because each node is treated individually, there are no internal dependencies to consider. It does not matter whether the hierarchy is symmetrical or asymmetrical. Any hierarchy can be described in this fashion.
Thus the hierarchy shown at figure 5 can be represented by the data shown in Table 1

The crucial element behind this concept is that instead of storing the data in a single table it is stored in a pair of related tables; the first defines the nodes themselves and the second defines the paths between them. The reason for separating the data into two tables is to allow for the possibility that the same node definition is used in more than one hierarchy. (For example in a hierarchy that links two organizations and defines the structure of each as a subtree, the same ‘department’ node may appear in each). Since there is no need to define the structure directly with the node we can normalize the data. Figure 9 shows the schema for the tables needed

Notice that the node definition table is now merely a look-up that allows us to assign a description to path. It is the structure table that now describes how the nodes are related and it is the path column in this table that is the key to the whole methodology. The path column can be used directly in SQL queries to retrieve sets of data that conform to a specific structural pattern and this allows us to rip data out of the associated tables. For this reason the ‘path’ field is usually referred to as a ‘Rip Field’ and the structure table as a ‘Rip Table’
By attaching data to the structure, rather than to nodes, we achieve a degree of freedom that cannot be handled by any other approach. In the next article I will explore this approach in detail and show how it can be implemented.