Tuesday, August 19, 2014

Stuff Happens: Fixing Bad Data in Datomic

Oops!  You just put some bad data in your system, and now you need a way to clean it up.  In this article, we look at how to recover from data errors.  Along the way, we will explore how Datomic models time.

A Motivating Example

ACME Co. buys, sells, and processes things.  Unfortunately, their circa-2003 web interface is not a shining example of UI/UX design.  Befuddled by all the modal screens, managers regularly put bad data into the system.

In fact, manager Steve just accidentally created an inventory record showing that ACME now has 999 Tribbles.  This is ridiculous, since everyone knows that the CEO refuses to deal in Tribbles, citing "a bad experience". In a rather excited voice, Steve says "Quick, please delete the last entry added to the system!"

As is so often the case, job one is to carefully interpret a stakeholder's request.  In particular, the words "delete", "last", and "entry" all warrant careful consideration.

What "Delete" Means

Let's start with "delete".  At first glance, one might think that Steve wants us to do our best to "unhappen" his mistake. But upon reflection, that isn't such a good idea.  The database is a live system, and someone may have used the bad data to make decisions.  If so, then simply excising the mistake will just lead to more confusion later. What if, during the few minutes Tribbles were in the inventory, we sold a Tribble?  Or, more subtly, what if we moved our widget inventory to a different warehouse to make room for the nonexistent Tribbles?

Databases need to remember the history of their data, even (and perhaps especially) when the data is later discovered to be bad.  An easy analogy to drive home the point is source control.  Source control systems act as a simple database for code.  When a bug happens, you care very much about which versions of the code manifest the bug.

So rather than deleting the Tribbles, we want something more like a "reverting commit" in source control; that is, to record that we no longer believe that we have (or in fact ever had) Tribbles, but that during a specific time window we mistakenly believed we had 999 Tribbles.

What "Last" Means

Next, let us consider the temporal word "last".  Happily, ACID databases have a unit of time ordering: transactions. Datomic goes a step further with reified transactions, i.e. transactions that you can manipulate as first-class objects in the system.  In such a system, you might indeed be able to say "Treat the last transaction as a data entry error and record a reverting transaction."

We still need to be careful, though, about what we mean by "last".  Again, the inventory system is a live system, so the last transaction is a moving target.  It would be dangerous and incorrect to blindly revert the last transaction, without making sure that Steve's erroneous transaction is still the last one. Generally, we will want a way to review recent transactions, and then narrow down by some other criteria, e.g. the id for Tribbles.

What "Entry" Means

Finally, let's consider what constitutes an "entry".  This is a domain concept, and might or might not correspond directly to a row in a table, or to a document, or whatever metaphor your database encourages.  Mistakes can have arbitrary and irregular shapes.  When correcting errors, we will want to have granular representations of data, so that we can precisely target the error without also having to modify correct data nearby.

Putting that all together, we need to:
  • locate the precise problem among recent transaction data
  • create a transaction that reverts the bad information...
  • ...without forgetting that the problem happened.  
Let's take this to code.

Retraction in Code

The code examples that follow are in Groovy rather than Java for concision's sake.  You can download the sample code from the Datomic Groovy Examples repository.

Let's begin with our erroneous transaction, creating a record for Tribbles with a unique id, description, and count:
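// Assumes tempid is datomic.Peer.tempid (statically imported)
// and conn is an open Datomic connection.
addTribbles = [[':db/id': tempid(':db.part/user'),
                ':item/id': '0042-TRBL',
                ':item/description': 'Tribble: a low maintenance pet.',
                ':item/count': 999]];
conn.transact(addTribbles).get();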

Finding the Problem

OK, the clock is ticking: we have bad data in the system.  How can we look to see what we have?  For many everyday uses, you want an associative (entity) view of data.  Depending on your language and tools, an entity representation may be a strongly-typed object or a mere dictionary, and may be generated by hand or automated through an ORM.  Regardless, an associative representation allows you to navigate from keys to values.

In Datomic, you can pass a lookup ref (a pair containing a unique key and its value) to the entity API to get an entity:
db = conn.db();
db.entity([':item/id', '0042-TRBL']);
Entities imply a three-way relation of entity / attribute / value.  The entity is implicit in object identity, the attributes are keys (or getter methods), and the values are looked up via the attributes.  Given an entity, you can call keySet to find its keys, or get to look up the value at a particular key.

What entities do not tell you is how, when, or why data got into the system.  One way to improve this is to think in terms of a broader relation.  Instead of a 3-tuple of entity / attribute / value, Datomic uses a datom: a 5-tuple of entity / attribute / value / transaction / added.  The fourth slot, transaction, references an entity that records the time the transaction was added to the system (plus possibly other facts about the transaction).  The fifth slot, added, is a boolean that records whether the datom is being asserted (true) or retracted (false).
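
To make the five slots concrete, here is a minimal sketch in plain Groovy that models datoms as maps and replays them to recover the familiar entity / attribute / value view.  The map keys, the small integer ids, and the currentFacts helper are all made up for illustration; this is a toy model under those assumptions, not how Datomic stores or indexes datoms.

```groovy
// Toy model only: each datom is a map with the five slots described above.
def datoms = [
    [e: 1, a: ':item/count', v: 999, tx: 100, added: true],   // original (mistaken) assertion
    [e: 1, a: ':item/count', v: 999, tx: 101, added: false],  // later retraction of 999
    [e: 1, a: ':item/count', v: 0,   tx: 101, added: true]    // corrected value
]

// Hypothetical helper: replay datoms in transaction order.
// Assertions add a fact; retractions remove it.
def currentFacts(List datoms) {
    def facts = [] as LinkedHashSet
    // Within a single transaction, apply retractions (added == false) first.
    def ordered = datoms.sort(false) { x, y -> x.tx <=> y.tx ?: x.added <=> y.added }
    ordered.each { d ->
        def fact = [d.e, d.a, d.v]
        if (d.added) {
            facts << fact
        } else {
            facts.remove(fact)
        }
    }
    facts as List
}

println(currentFacts(datoms))
```

The entity view only ever shows what survives the replay, while the list of datoms keeps the whole story, including which transaction changed what.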

Datoms allow you to see (and annotate) the complete history of entities in your system.  For example, the following Datalog query will return all the facts about Tribbles.
q('''[:find ?aname ?v ?tx ?added
      :in $ ?e
      :where [?e ?a ?v ?tx ?added]
             [?a :db/ident ?aname]]''',
  db,
  [':item/id', '0042-TRBL']);
We aren't going to cover Datalog syntax here, but will point out a few things in passing:
  • variables begin with ?
  • data patterns in the :where clause are specified in Datom order, i.e. entity, attribute, value, transaction, added
  • the :in clause binds parameters, so this query binds ?e to find only facts about id 0042-TRBL, Tribbles.
The query results, shown below, show us that all facts about Tribbles were added in the same transaction, 13194139534315.

Attribute          Value                            Transaction     Added?
:item/description  Tribble: a low maintenance pet.  13194139534315  true
:item/id           0042-TRBL                        13194139534315  true
:item/count        999                              13194139534315  true

Searching The Log

In the query above, we knew that the problem was Tribbles, and could use that to look up the information.  Let's imagine instead that we know only that the problem just happened. To ask a "when" question, you can explicitly access the database log, which is a time index. In Datomic, the basisT of a database identifies the most recent transaction, and can be passed to the tx-data function inside a query to access the log and return datoms from a specific transaction. The following query shows the entire transaction that added the Tribbles.
log = conn.log();
q('''[:find ?e ?aname ?v ?tx ?added
      :in $ ?log ?tx
      :where [(tx-data ?log ?tx) [[?e ?a ?v _ ?added]]]
             [?a :db/ident ?aname]]''',
  db,
  log,
  db.basisT());
This query shows us not only the troubled Tribble datoms, but also information about the transaction that added them:

Entity          Attribute          Value                            Transaction     Added?
17592186045420  :item/description  Tribble: a low maintenance pet.  13194139534315  true
17592186045420  :item/id           0042-TRBL                        13194139534315  true
17592186045420  :item/count        999                              13194139534315  true
13194139534315  :db/txInstant      2014-05-19T17:20:48.200-00:00    13194139534315  true

The combination of granular datoms, the powerful Datalog query language, and direct access to the database log make it possible to find out everything about a data error.  Now let's fix the error, without losing history of the event.

Fixing the Problem

The transaction that added the Tribbles actually said three things about Tribbles:
  • Tribbles exist
  • Tribbles have a description
  • we have 999 Tribbles  

Simple Retraction

Let's presume that it is innocuous to acknowledge the existence of Tribbles.  Thus we can make a very granular correction, refuting only the absurd notion that we have Tribbles in inventory:
errDoc = 'Error correction entry. We do not sell Tribbles.';
weDontSellTribbles = 
[[':db/add', [':item/id', '0042-TRBL'], ':item/count', 0],
 [':db/add', tempid(':db.part/tx'), ':db/doc', errDoc]];
conn.transact(weDontSellTribbles).get();
This correction adds two facts to the database:
  1. an assertion that we have, in fact, zero Tribbles
  2. a documentation string on the transaction entity (:db.part/tx) explaining the correction

Retraction with Provenance

Another possibility is that ACME does not want even to acknowledge the existence of Tribbles.  So we need to retract the entire Tribble entity.  Also, ACME has a policy of recording a manager's identity along with an error correction.  Here, then, is a transaction that removes the Tribble entity entirely, and credits John Doe with the removal:
retractTribbles = 
[[':db.fn/retractEntity', [':item/id', '0042-TRBL']],
 [':db/add', 
  tempid(':db.part/tx'), 
  ':corrected/by', [':manager/email', 'jdoe@example.com']]];
conn.transact(retractTribbles).get();
:db.fn/retractEntity is a built-in database function that expands to datoms retracting every fact about an entity.  :corrected/by is an attribute specific to the schema of this database, and is an example of how you can extend the information model to capture provenance information for transactions.
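
As a rough picture of what "expands to datoms retracting every fact" means, the sketch below turns the facts currently known about one entity into explicit [:db/retract e a v] forms.  The expandRetraction helper, the integer ids, and the widget item are hypothetical, and the model is deliberately simplified: the built-in function may do more than this (for example, handling references to the entity), so treat it as an illustration rather than a description of Datomic's implementation.

```groovy
// Toy model only: facts are [entity, attribute, value] triples.
def inventoryFacts = [
    [1, ':item/id', '0042-TRBL'],
    [1, ':item/description', 'Tribble: a low maintenance pet.'],
    [1, ':item/count', 0],
    [2, ':item/id', '0001-WDGT']   // a different, made-up item that must be left alone
]

// Hypothetical helper: one explicit retraction per fact about the entity.
def expandRetraction(entityId, List facts) {
    def own = facts.findAll { it[0] == entityId }
    own.collect { [':db/retract', it[0], it[1], it[2]] }
}

expandRetraction(1, inventoryFacts).each { println(it) }
```

Because each retraction names a specific entity, attribute, and value, the correct data nearby (the widget, in this sketch) is untouched.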

You Can Go Back Again

Now that Tribbles have been retracted entirely, they will be invisible to queries or entity calls against the database.  But while they are gone, they are not forgotten.  The history view of a database shows not just the present, but the entire past of the database.  The following query shows the complete history of Tribbles in the database:
hist = conn.db().history();
q('''[:find ?aname ?v ?tx ?added
      :in $ ?e
      :where [?e ?a ?v ?tx ?added]
             [?a :db/ident ?aname]]''',
  hist,
  [':item/id', '0042-TRBL']);
As you can see below, this history includes the original problem transaction, the partial correction, and the complete removal of the entity.

Attribute          Value                            Transaction     Added?
:item/description  Tribble: a low maintenance pet.  13194139534315  true
:item/id           0042-TRBL                        13194139534315  true
:item/count        999                              13194139534315  true
:item/count        999                              13194139534317  false
:item/count        0                                13194139534317  true
:item/id           0042-TRBL                        13194139534318  false
:item/description  Tribble: a low maintenance pet.  13194139534318  false
:item/count        0                                13194139534318  false

Where Are We?

We have now accomplished everything we set out to do:
  • We used Datalog queries to precisely locate problem data, both by recency and by identifying key.
  • We took advantage of the granularity of datoms to create transactions that precisely retracted problem data.
  • We used reified transactions and the history view of the database to remember what went wrong, and how it was fixed.
Time is a fundamental dimension of information systems, and it cannot be easily retrofitted to systems that only support traditional row or document views of information.  In particular, note that the queries above take a database (not a connection) argument.  That database can be filtered with history, asOf, or since to pass different windows of time to the same query.  With a time model, "point-in-time" or "range-of-time" queries can use the same logic as "now" queries.
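
The idea that one query can run against different windows of time can be sketched in plain Groovy.  In this toy model a "database" is just a list of datom maps, and asOf and since are hypothetical closures that filter that list; the names echo the Datomic views mentioned above, but the implementation here is an assumption made for illustration only.

```groovy
// Toy model only: a database value is a list of datoms (maps).
def allDatoms = [
    [e: 1, a: ':item/count', v: 999, tx: 100, added: true],
    [e: 1, a: ':item/count', v: 999, tx: 101, added: false],
    [e: 1, a: ':item/count', v: 0,   tx: 101, added: true]
]

// Hypothetical time views, each a plain filter over the datom list.
def asOf  = { List ds, t -> ds.findAll { it.tx <= t } }   // up to and including t
def since = { List ds, t -> ds.findAll { it.tx > t } }    // only what arrived after t

// One "query", reused unchanged on every view: which counts does this view believe?
def countsIn = { List ds ->
    def live = [] as LinkedHashSet
    def ordered = ds.sort(false) { x, y -> x.tx <=> y.tx ?: x.added <=> y.added }
    ordered.each { d ->
        if (d.a != ':item/count') return   // skip other attributes
        if (d.added) {
            live << d.v
        } else {
            live.remove(d.v)
        }
    }
    live as List
}

println(countsIn(asOf(allDatoms, 100)))   // what we believed just after the mistake
println(countsIn(asOf(allDatoms, 101)))   // what we believed after the correction
println(countsIn(since(allDatoms, 100)))  // what changed after the mistake
```

The query logic in countsIn never mentions time; only the view passed to it changes.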

That all sounds pretty good.  So, should all of your information systems build upon a time model?

Modeling Time: Concerns and Tradeoffs

There are a number of issues to consider when modeling time in an information system; we will consider a few of the most common issues here.
  1. What are the performance concerns when keeping historical data?
  2. Is there some data for which historical information is irrelevant or counterproductive?
  3. But I really want to delete something!
  4. I understand the importance of remembering past mistakes, but I want "clean" history, too.

Performance Concerns

Data structures that preserve past values of themselves are called persistent.  A common first reaction to persistence in databases is "Can I actually afford to remember everything?"  You may remember that a similar objection was raised when Git first became popular: "Can I really afford to keep the entire history of projects on my laptop?"

It is important to understand that history can be managed so that queries are "pay as you go".  If you do not use history in a particular query, that query will perform just as well as if you were not storing history.  Modern immutable databases such as Datomic make such optimizations.

With that out of the way, the other major concern is storage cost. For many systems, particularly transactional systems of record, the balance of this argument is usually in favor of keeping everything. If the information is worth recording at all, it is worth the cost of keeping it.

But some information has little value, and does not warrant history.

Not Everything Needs History

At the other extreme from systems of record you have high-volume data that has little information value per fact, but some utility in the aggregate.  A good example of this is a hit counter on a web site.  While you might be interested to know that 10,000 people are visiting every hour, you probably don't care to know the exact time for each counter increment.

Counters and other high-churn data with little point-in-time value do not benefit from history.

Excision

There are a few scenarios where systems really do need to deliberately and irrevocably lose data.  The most common such scenario is a legal requirement to remove information from a system for privacy or intellectual property reasons.

For such situations, Datomic provides excision, which permanently removes facts from the system. This feature should never be used to fix mistakes.  To continue the source code analogy, this would be the equivalent of saying "Wow, that was a really dodgy bug, let's corrupt source control so it looks like the bug never happened."

Clean History

We have been talking about history as "what we believed in the past".  But from there one might aspire to know "what was true in the past", or at least "what we now think was true at some point in time in the past".

The ability to annotate transactions provides a path to multiple notions of history. For example, you could define a :tx/mistake attribute, and use it to mark transactions that later prove to be erroneous.  You could then query against a filtered database that does not see those transactions at all.
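
Here is a minimal sketch of that idea in plain Groovy.  Datoms are again modeled as maps, :tx/mistake is the hypothetical attribute suggested above, and the ids and the second inventory item are invented for the example; this shows the filtering logic only, not Datomic's own filtering API.

```groovy
// Toy model only: datoms as maps; all ids are made up.
def datoms = [
    [e: 1,   a: ':item/count', v: 999,  tx: 100, added: true],  // the bad entry
    [e: 2,   a: ':item/count', v: 12,   tx: 102, added: true],  // a legitimate entry
    [e: 100, a: ':tx/mistake', v: true, tx: 103, added: true]   // later: flag tx 100 as a mistake
]

// Transactions that carry the :tx/mistake annotation.
def mistaken = datoms.findAll { it.a == ':tx/mistake' && it.v == true && it.added }*.e as Set

// "Clean" view: drop every datom that arrived in a mistaken transaction.
def clean = datoms.findAll { !(it.tx in mistaken) }

println(clean.findAll { it.a == ':item/count' }*.v)
```

Nothing is lost in the process: the unfiltered list still holds the mistaken datoms, so both notions of history remain available.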

Wrapping Up

Stuff happens, and information systems will acquire bad data.  Never delete your mistakes.  Use fine-grained information such as datoms to pinpoint your errors, and use retraction and reified transactions in a persistent database such as Datomic to correct errors while preserving history.

3 comments:

  1. Great post! My wish is to be able to actually use this information sometime :)

    One question. The complete history shows transaction 13194139534317 to have 999 Tribbles (retracted) and 0 Tribbles (accepted). Should the first fact have a different id?

  2. Hi Daniel,

    Both of those datoms were added in the same transaction, and correctly show the same transaction id. This kind of pairing will happen in any transaction that asserts a new value for a :cardinality/one fact.

    Stu

  3. Instead of retracting some facts or an entity, suppose I had a coding mistake and actually retracted some entities, which now cause errors in entities with ref's to the retracted one.

    Would there be a series of operations to effectively invert the transaction which retracted the entity erroneously?
