2016-10-03

Circular Dependency in Relational Databases

This post is going to have an element of storytelling to it, because while some elementary understanding of SQL will be required, the Cliff's Notes seem to tell it all:

You probably want to avoid circular dependencies in relational databases.

These Cliff's Notes are incomplete, a little vague, a bit of an oversimplification, and up for debate besides. Some RDBMs (relational database managers) handle circular dependency better than others, and some problem domains necessitate circular dependency. Take, for example, table representations of nodes in a graph that allows cyclical paths.

However, not all problems necessitate circular dependency, and some RDBMs prohibit, or aren't capable of handling, circular dependencies. In these cases, we need to design around the limitations of our RDBM, and in any case, it's good to be aware of circular dependencies, just on principle. The purpose of this post is thus to share my own early explorations of database design, and to describe how I broke out of the circular dependency trap on a small side project.

Some project context will help here. I've been working on a passion project for a little while now. The high level idea is to build a "Create Your Own Adventure" game, where all content is user-generated, and the best story line branches rise to the top through a voting system like that used by Reddit and StackOverflow. An important early part of this project involved designing a decent database schema to build the rest of the project on top of.

You may wonder why I chose an RDB (relational database) for this in the first place, rather than an alternative such as some flavor of NoSQL. I plan to host this on the AWS (Amazon Web Services) free tier, so I can learn more about AWS. Unfortunately, at the time I started this project, the AWS free tier didn't offer any option other than RBDs. Incidentally, an RDB is almost certainly best for this type of application due to the interrelationships of different types of data, but that's a topic for another time.

So, how did a circular dependency surface in my design?

One feature I want this app to have is position persistence. When a visitor is reading the story, I want to remember where they left off, so they can conveniently come back and resume reading the story at a later date. My plan was to give the Users table a column called Current, to track a given user's position in the story. In other words, a story node would appear as a foreign key in the Users table.

The Nodes table was designed with an Author column to record who had authored a given story node. This was to support a variety of features, such as prohibiting authors from voting on their own posts, and allowing an author to edit their contributions, while protecting those posts from being edited by others. Unfortunately, you'll notice this means a user (author) is now a foreign key in the Nodes table.

Uh oh.

Now I have a node ID as a foreign key in the Users table, and a user ID as a foreign key in the Nodes table. This is a circular dependency of sorts, and is a block in SQL. You'll find some argument as to whether circular dependencies are fundamentally bad practice, or whether SQL just suffers some unhappy limitations. After all, what's conceptually wrong with tracking current node in the Users table, and author in the Nodes table? There's no conflict in authority over the data, since a user doesn't own its current node in this read-only context, it's just a position marker. To be fair, SQL doesn't understand our motivations, and to argue that an RDBM's designers could have done more toward enabling us to convey these human intuitions to the machine is too philosophical for this post.

Whatever the reason and our opinion of it, MySQL forbids this kind of interdependency. The solution I came to was to create a simple table called Positions, with a compound primary key consisting of two foreign keys, one being a user ID from the Users table, the other being a node ID from the Nodes table. Despite some apparent duplication of information, this works for the simple reason that the Positions table relies on Users and Nodes, but neither Users nor Nodes relies on Positions. I'll try to illustrate the improvement in design without opening Visio:

An arrow (->) implies dependency

Before:

Users -> Nodes
Nodes -> Users

As you can see, this yields Users -> Nodes -> Users, and so on, ad infinitum. This is bad.

After:

Nodes -> Users
Positions -> Users
Positions -> Nodes

We can combine two of these lines together to yield:

Positions -> Nodes -> Users
Positions -> Users

but you can see that this doesn't introduce any loops, or circular dependencies, as my first design did.

Thanks for reading!
- Steven Kitzes

No comments: