Wednesday, January 25, 2017

The Ten Rules of Schema Growth

Data outlives code, and a valuable database supports many applications over time. These ten rules will help grow your database schema without breaking your applications.

1.  Prod is not like dev.

Production is not development. In production, one or more codebases depend on your data, and the rules below should be followed exactingly.

A dev environment can be much more relaxed.  Alone on your development machine experimenting with a new feature, you have no users to break.  You can soften the rules, so long as you harden them when transitioning to production.

2.  Grow your schema, and never break it.

The lack of common vocabulary makes it all too easy to automate the wrong practices. I will use the terms growth and breakage as defined in Rich Hickey's Spec-ulation talk.  In schema terms:

  • growth is providing more schema
  • breakage is removing schema, or changing the meaning of existing schema.

In contrast to these terms, many people use "migrations", "refactoring", or "evolution". These usages tend to focus on repeatability, convenience, and the needs of new programs, ignoring the distinction between growth and breakage. The problem here is obvious: Breakage is bad, so we don't want it to be more convenient!

Using precise language underscores the costs of breakage. Most migrations are easily categorized as growth or breakage by considering the rules below.  Growth migrations are suitable for production, and breakage migrations are, at best, a dev-only convenience. Keep them widely separate.
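
As a rough illustration of that categorization, here is a minimal sketch. It assumes a schema can be modelled as a plain mapping from names to declared types, and it treats a changed type as a stand-in for "changed meaning". The function name is_breaking is hypothetical, and a real check would need to know far more about your database.

```python
def is_breaking(old_schema, new_schema):
    """Return True if moving from old_schema to new_schema breaks existing programs.

    Both schemas are modelled as {name: declared_type} dicts (an assumption
    made for this sketch; a real database has richer schema).
    """
    for name, declared_type in old_schema.items():
        if name not in new_schema:
            return True  # a name was removed
        if new_schema[name] != declared_type:
            return True  # an existing name no longer means what it did
    return False  # every old name survives unchanged; anything else is additive


v1 = {":user/id": "string"}
v2 = {":user/id": "string", ":user/primary-email": "string"}  # adds a name
v3 = {":user/primary-email": "string"}                        # drops :user/id

growth_ok = not is_breaking(v1, v2)  # growth: suitable for production
breaks = is_breaking(v2, v3)         # breakage: a dev-only convenience at best
```

A check like this could gate which migrations are allowed anywhere near production.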

3.  The database is the source of truth.

Schema growth needs to be reproducible from one environment to another.  Reproducibility supports the development and testing of new schema before putting it into production and also the reuse of schema in different databases. Schema growth also needs to be evident in the database itself, so that you can determine what the database has, what it needs, and when growth occurred.

For both of these reasons, the database is the proper source of truth for schema growth. When the database is the source of truth, reproducibility and auditability happen for free via the ordinary query and transaction capabilities of the database.  (If your database is not up to the tasks of queries and transactions, you have bigger problems beyond the scope of this article.)

Storing schema in a database is strictly more powerful than storing schema as text files in source control. The database is the actual home for schema, plus it provides validation, structure, query, transactions, and history. A source control system provides only history and is separate from the data itself.

Note that this does not mean "never put schema information in source control". Source control may be convenient for other reasons, e.g. it may be more readily accessible. You may redundantly store schema in source control, but remember that the database is definitive.
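
One way this might look in a SQL database is sketched below, using Python's built-in sqlite3 module with an in-memory database. The schema_growth table, the grow helper, and the step name are all hypothetical, chosen only for illustration. Each growth step is recorded in the same transaction that applies it, so the database itself can answer what it has and when it grew.

```python
import sqlite3

# isolation_level=None lets us issue BEGIN/COMMIT ourselves, so the DDL and
# its audit record commit together.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE schema_growth ("
    " name TEXT PRIMARY KEY,"
    " applied_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)


def grow(conn, name, ddl):
    """Apply a named growth step at most once; return True if it ran."""
    seen = conn.execute(
        "SELECT 1 FROM schema_growth WHERE name = ?", (name,)
    ).fetchone()
    if seen is not None:
        return False  # the database already has this growth
    conn.execute("BEGIN")
    try:
        conn.execute(ddl)
        conn.execute("INSERT INTO schema_growth (name) VALUES (?)", (name,))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return True


step = ("users-add-primary-email",
        "ALTER TABLE users ADD COLUMN primary_email TEXT")
first_run = grow(conn, *step)
second_run = grow(conn, *step)  # repeating the step is a no-op

# Ask the database what it has and when it grew.
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
history = conn.execute("SELECT name, applied_at FROM schema_growth").fetchall()
```

Because the record lives next to the data, the same grow call can be replayed against dev, test, and production without consulting anything outside the database.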

4.  Growing is adding.

As you acquire more information about your domain, grow your schema to match. You can grow a schema by adding new things, and only by adding new things, for example:

  • adding new attributes to an existing 'type'
  • adding new types
  • adding relationships between types
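
The three kinds of growth above can be sketched against an in-memory SQLite database. The table and column names are invented for the example. The point to notice is that a query written before the growth returns the same answer afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO users (id) VALUES ('ann@example.com')")

OLD_QUERY = "SELECT id FROM users"  # written by an older program
before = conn.execute(OLD_QUERY).fetchall()

# Add a new attribute to an existing type.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
# Add a new type.
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
# Add a relationship between types.
conn.execute(
    "CREATE TABLE user_orders ("
    " user_id TEXT NOT NULL REFERENCES users (id),"
    " order_id TEXT NOT NULL REFERENCES orders (id),"
    " PRIMARY KEY (user_id, order_id))"
)

after = conn.execute(OLD_QUERY).fetchall()  # the old program is undisturbed
```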

5.  Never remove a name.

Removing a named schema component at any level is a breaking change for programs that depend on that name. Never remove a name.

6.  Never reuse a name.

The meaning of a name is established when the name is first introduced. Reusing that name to mean something substantially different breaks programs that depend on that meaning. This can be even worse than removing the name, as the breakage may not be as immediately obvious.

7.  Use aliases.

If you are familiar with database refactoring patterns, the advice in Rules Five and Six may seem stark. After all, one purpose of refactoring is to adopt better names as we discover them. How can we do that if names can never be removed or changed in meaning?

The simple solution is to use more than one alias to refer to the same schema entity. Consider the following example:

  • In iteration 1, users of your system are identified by their email with an attribute named :user/id
  • In iteration 2, you discover that users sometimes have non-email identifiers and that you want to store a user's email even when not using the email as an identifier. In short, you wish that :user/id was named :user/primary-email.

No problem! Just create :user/primary-email as an alias for :user/id. Older programs can continue to use :user/id, and newer programs can use the now-preferred :user/primary-email.
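
How aliases are created depends on your database. The toy model below is not any particular database's mechanism. It is a sketch, with a hypothetical SchemaNames class, showing the essential property: several names resolve to one attribute, and a name, once claimed, can never be claimed again (Rule Six).

```python
class SchemaNames:
    """Maps every name ever introduced to the attribute it identifies."""

    def __init__(self):
        self._attribute_for = {}  # name -> internal attribute id
        self._next_id = 0

    def add_attribute(self, name):
        self._claim(name, self._next_id)
        self._next_id += 1

    def add_alias(self, alias, existing_name):
        self._claim(alias, self._attribute_for[existing_name])

    def resolve(self, name):
        return self._attribute_for[name]

    def _claim(self, name, attribute_id):
        if name in self._attribute_for:
            raise ValueError(f"{name} is already in use and cannot be reused")
        self._attribute_for[name] = attribute_id


names = SchemaNames()
names.add_attribute(":user/id")
names.add_alias(":user/primary-email", ":user/id")

# Data is stored under the internal id, so both names read the same value.
user = {names.resolve(":user/id"): "ann@example.com"}
old_view = user[names.resolve(":user/id")]             # older programs
new_view = user[names.resolve(":user/primary-email")]  # newer programs
```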

8.  Namespace all names.

Namespaces greatly reduce the cost of getting a name wrong, as the same local name can safely have different meanings in different namespaces.  Continuing the previous example, imagine that the local name id is used to refer to a UUID in several namespaces, e.g. :inventory/id, :order/id, and so on. The fact that :user/id is not a UUID is inconsistent, and newer programs should not have to put up with this.

Namespaces let you improve the situation without breaking existing programs. You can introduce :user-v2/id, and new programs can ignore names in the user namespace. If you don't like v2, you can also pick a more semantic name for the new namespace.

9.  Annotate your schema.

Databases are good at storing data about your schema. Adding annotations to your schema can help both human readers and programs make sense of how the schema grew over time. For example:

  • you could annotate names that are not recommended for new programs with a :schema/deprecated flag, or you could get fancier still with :schema/deprecated-at or :schema/deprecated-because. Note that such deprecated names are still never removed (Rule Five).
  • you could provide :schema/see-also or :schema/see-instead pointers to more current conventions. 

In fact, all the database refactoring patterns that are typically implemented as breaking changes could be implemented non-destructively, with the refactoring details recorded as an annotation. For example, the breaking "split column" refactoring might instead be implemented as schema growth:

  • add N new columns
  • (optional) add a :schema/split-into attribute on the original column whose value is the new columns, and possibly even the recipe for the split
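
In a SQL database without built-in schema annotations, one possible approach is an ordinary table that holds them. The sketch below uses SQLite. The schema_annotations table, its columns, and the preferred_name helper are hypothetical, and the annotation names simply mirror the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, primary_email TEXT)")
# One row per (annotated name, annotation). All names here are hypothetical.
conn.execute(
    "CREATE TABLE schema_annotations ("
    " subject TEXT NOT NULL,"
    " annotation TEXT NOT NULL,"
    " value TEXT NOT NULL,"
    " PRIMARY KEY (subject, annotation))"
)
conn.executemany(
    "INSERT INTO schema_annotations VALUES (?, ?, ?)",
    [
        ("users.id", "schema/deprecated", "true"),
        ("users.id", "schema/deprecated-because", "holds an email, not a general id"),
        ("users.id", "schema/see-instead", "users.primary_email"),
    ],
)
conn.commit()


def preferred_name(conn, name):
    """Follow a schema/see-instead pointer if one exists, else keep the name."""
    row = conn.execute(
        "SELECT value FROM schema_annotations"
        " WHERE subject = ? AND annotation = 'schema/see-instead'",
        (name,),
    ).fetchone()
    return row[0] if row else name


# The deprecated column is annotated, not removed (Rule Five).
user_columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
```

New programs can call preferred_name to find the current convention, while older programs keep using the name they already know.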

10. Plan for accretion.

If a system is going to grow at all, then programs must not bake in limiting presumptions.  For example: If a schema states that :user/id is a string, then programs can rely on :user/id being a string and not occasionally an integer or a boolean.  But a program cannot assume that a user entity will be limited to the set of attributes previously seen, or that it understands the semantics of attributes that it has not seen before.
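
To make the contrast concrete, here is a small sketch that models a user entity as a plain dict. The function names and the :user/display-name attribute are invented for the example. The first function relies only on what the schema promises. The second bakes in the presumption that users will never gain attributes.

```python
def greeting(user):
    """Tolerant reader: uses what the schema promises, ignores the rest."""
    user_id = user[":user/id"]  # the schema says this is a string
    return "Hello, " + user_id


def brittle_greeting(user):
    """Anti-pattern: presumes the set of attributes will never grow."""
    extra = set(user) - {":user/id"}
    if extra:
        raise ValueError("unexpected attributes: %s" % sorted(extra))
    return "Hello, " + user[":user/id"]


# After schema growth, users carry an attribute the old programs never saw.
grown_user = {":user/id": "ann@example.com", ":user/display-name": "Ann"}

message = greeting(grown_user)  # keeps working after growth
# brittle_greeting(grown_user) would raise ValueError, even though nothing
# it depends on has changed.
```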

Are these rules specific to a particular database?

No. These rules apply to almost any SQL or NoSQL database.  The rules even apply to the so-called "schemaless" databases.  A better word for schemaless is "schema-implicit", i.e. the schema is implicit in your data and the database has no reified awareness of it.  With an implicit schema, all the rules still apply, except that the database is impotent to help you (no Rule 3).

In Context

Many of the resources on migrations, refactoring, and database evolution emphasize repeatability and the needs of new programs, without making the top-level distinctions of growth vs. breakage and prod vs. dev. As a result, these resources encourage breaking the rules in this article.

Happily, these resources can easily be recast in growth-only terms.  You can grow your schema without breaking your app. You can continuously deploy without continuously propagating breakage.  Here's what it looks like in Datomic.


1 comment:

  1. This article is a prime example of why I'm a fan of Clojure and Datomic. I'm learning fundamentals here that would otherwise take a lifetime of trial and error to discover the hard way. Looking forward to more articles like this. Thank you, Stu.