Tuesday, October 28, 2014

Datomic Pull API

Datomic's new Pull API is a declarative way to make hierarchical selections of information about entities. You supply a pattern to specify which attributes of the entity (and nested entities) you want to pull, and db.pull returns a map for each entity.

Pull API vs. Entity API

The Pull API has two important advantages over the existing Entity API:

Pull uses a declarative, data-driven spec, whereas Entity encourages building results via code. Data-driven specs are easier to build, compose, transmit and store. Pull patterns are smaller than entity code that does the same job, and can be easier to understand and maintain.

Pull API results match standard collection interfaces (e.g. Java maps) in programming languages, whereas Entity results do not. This eliminates the need for an additional allocation/transformation step per entity.
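
Here is a hedged sketch of the contrast in Clojure (ghost-riders is a hypothetical track entity id; db and the d alias for datomic.api are as in the examples below).  The Entity version builds the result in code:

;; walking entities in code
(let [track (d/entity db ghost-riders)]
  {:track/name    (:track/name track)
   :track/artists (mapv (fn [a] {:artist/name (:artist/name a)})
                        (:track/artists track))})

;; the equivalent pull is a single data literal
(d/pull db '[:track/name {:track/artists [:artist/name]}] ghost-riders)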

Wildcards

A pull pattern is a list of attribute specifications.  If you want all attributes, you can use the wildcard (*) specification along with an entity identifier.  (In the examples below, entity identifiers such as led-zeppelin are variables defined in the complete code examples in Clojure and Java.)

;; Clojure API
(d/pull db '[*] led-zeppelin)

;; Java API
db.pull("[*]", ledZeppelin)

A pull result is a map per entity, shown here in edn:

;; result
{:artist/sortName "Led Zeppelin", 
 :artist/name "Led Zeppelin", 
 :artist/type {:db/id 17592186045746}, 
 :artist/country {:db/id 17592186045576}, 
 :artist/gid #uuid "678d88b2-87b0-403b-b63d-5da7465aecc3", 
 :artist/endDay 25, 
 :artist/startYear 1968, 
 :artist/endMonth 9, 
 :artist/endYear 1980, 
 :db/id 17592186050305}

Attributes

You can also specify the attributes you want explicitly, as with :artist/name and :artist/gid below:

;; pattern
[:artist/name :artist/gid]

;; input 
led-zeppelin

;; result
{:artist/gid #uuid "678d88b2-87b0-403b-b63d-5da7465aecc3", 
 :artist/name "Led Zeppelin"}

The underscore prefix reverses the direction of an attribute, so :artist/_country pulls all the artists for a particular country:

;; pattern
[:artist/_country]

;; input
greatBritain

;; result
{:artist/_country [{:db/id 17592186045751} 
                   {:db/id 17592186045755} 
                    ...]}

Components

Datomic component attributes are pulled recursively by default, so the :release/media pattern below automatically returns a release's tracks as well:

;; pattern
[:release/media]

;; input
darkSideOfTheMoon

;; result
  {:release/media
   [{:db/id 17592186121277,
     :medium/format {:db/id 17592186045741},
     :medium/position 1,
     :medium/trackCount 10,
     :medium/tracks
     [{:db/id 17592186121278,
       :track/duration 68346,
       :track/name "Speak to Me",
       :track/position 1,
       :track/artists [{:db/id 17592186046909}]}
      {:db/id 17592186121279,
       :track/duration 168720,
       :track/name "Breathe",
       :track/position 2,
       :track/artists [{:db/id 17592186046909}]}
      {:db/id 17592186121280,
       :track/duration 230600,
       :track/name "On the Run",
       :track/position 3,
       :track/artists [{:db/id 17592186046909}]}
      ...]}]}

Map Specifications

Instead of just an attribute name, you can use a nested map specification to pull related entities.  The pattern below pulls the track's name, plus the :db/id and :artist/name of each of its artists:

;; pattern
[:track/name {:track/artists [:db/id :artist/name]}]

;; input
ghostRiders

;; result
{:track/artists [{:db/id 17592186048186, :artist/name "Bob Dylan"}
                 {:db/id 17592186049854, :artist/name "George Harrison"}],
 :track/name "Ghost Riders in the Sky"}

And of course everything nests arbitrarily, in case you need the release's medium's track's names and artists:

;; pattern
[{:release/media
  [{:medium/tracks
    [:track/name {:track/artists [:artist/name]}]}]}]

;; input
concertForBanglaDesh

;; result
{:release/media
 [{:medium/tracks
   [{:track/artists
     [{:artist/name "Ravi Shankar"} {:artist/name "George Harrison"}],
     :track/name "George Harrison / Ravi Shankar Introduction"}
    {:track/artists [{:artist/name "Ravi Shankar"}],
     :track/name "Bangla Dhun"}]}
  {:medium/tracks
   [{:track/artists [{:artist/name "George Harrison"}],
     :track/name "Wah-Wah"}
    {:track/artists [{:artist/name "George Harrison"}],
     :track/name "My Sweet Lord"}
    {:track/artists [{:artist/name "George Harrison"}],
     :track/name "Awaiting on You All"}
    {:track/artists [{:artist/name "Billy Preston"}],
     :track/name "That's the Way God Planned It"}]}
  ...]}

Try It Out

The Pull API has many other capabilities not shown here.  See the full docs for defaults, limits, bounded and unbounded recursion, and more.  Or check out the examples in Clojure and Java.

Tuesday, August 19, 2014

Stuff Happens: Fixing Bad Data in Datomic

Oops!  You just put some bad data in your system, and now you need a way to clean it up.  In this article, we look at how to recover from data errors.  Along the way, we will explore how Datomic models time.

A Motivating Example

ACME Co. buys, sells, and processes things.  Unfortunately, their circa-2003 web interface is not a shining example of UI/UX design.  Befuddled by all the modal screens, managers regularly put bad data into the system.

In fact, manager Steve just accidentally created an inventory record showing that ACME now has 999 Tribbles.  This is ridiculous, since everyone knows that the CEO refuses to deal in Tribbles, citing "a bad experience". In a rather excited voice, Steve says "Quick, please delete the last entry added to the system!"

As is so often the case, job one is to carefully interpret a stakeholder's request.  In particular, the words "delete", "last", and "entry" all warrant careful consideration.

What "Delete" Means

Let's start with "delete".  At first glance, one might think that Steve wants us to do our best to "unhappen" his mistake. But upon reflection, that isn't such a good idea.  The database is a live system, and someone may have used the bad data to make decisions.  If so, then simply excising the mistake will just lead to more confusion later. What if, during the few minutes Tribbles were in the inventory, we sold a Tribble?  Or, more subtly, what if we moved our widget inventory to a different warehouse to make room for the nonexistent Tribbles?

Databases need to remember the history of their data, even (and perhaps especially) when the data is later discovered to be bad.  An easy analogy to drive home the point is source control.  Source control systems act as a simple database for code.  When a bug happens, you care very much about which versions of the code manifest the bug.

So rather than deleting the Tribbles, we want something more like a "reverting commit" in source control; that is, to record that we no longer believe that we have (or in fact ever had) Tribbles, but that during a specific time window we mistakenly believed we had 999 Tribbles.

What "Last" Means

Next, let us consider the temporal word "last".  Happily, ACID databases have a unit of time ordering: transactions. Datomic goes a step further with reified transactions, i.e. transactions that you can manipulate as first-class objects in the system.  In such a system, you might indeed be able to say "Treat the last transaction as a data entry error and record a reverting transaction."

We still need to be careful, though, about what we mean by "last".  Again, the inventory system is a live system, so the last transaction is a moving target.  It would be dangerous and incorrect to blindly revert the last transaction, without making sure that Steve's erroneous transaction is still the last one. Generally, we will want a way to review recent transactions, and then narrow down by some other criteria, e.g. the id for Tribbles.

What "Entry" Means

Finally, let's consider what constitutes an "entry".  This is a domain concept, and might or might not correspond directly to a row in a table, or to a document, or whatever metaphor your database encourages.  Mistakes can have arbitrary and irregular shapes.  When correcting errors, we will want to have granular representations of data, so that we can precisely target the error without also having to modify correct data nearby.

Putting that all together, we need to:
  • locate the precise problem among recent transaction data
  • create a transaction that reverts the bad information...
  • ...without forgetting that the problem happened.  
Let's take this to code.

Retraction in Code

The code examples that follow are in Groovy rather than Java for concision's sake.  You can download the sample code from the Datomic Groovy Examples repository.

Let's begin with our erroneous transaction, creating a record for Tribbles with a unique id, description, and count:
addTribbles = [[':db/id': tempid(':db.part/user'),
                ':item/id': '0042-TRBL',
                ':item/description': 'Tribble: a low maintenance pet.',
                ':item/count': 999]];
conn.transact(addTribbles).get();

Finding the Problem

OK, the clock is ticking: we have bad data in the system.  How can we look to see what we have?  For many everyday uses, you want an associative (entity) view of data.  Depending on your language and tools, an entity representation may be a strongly-typed object or a mere dictionary, and may be generated by hand or automated through an ORM.  Regardless, an associative representation allows you to navigate from keys to values.

In Datomic, you can pass a lookup ref (a pair containing a unique key and its value) to the entity API to get an entity:
db = conn.db();
db.entity([':item/id', '0042-TRBL']);
Entities imply a three-way relation of entity / attribute / value.  The entity is implicit in object identity, the attributes are keys (or getter methods), and the values are looked up via the attributes.  Given an entity, you can call keySet to find its keys, or get to look up the value at a particular key.
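
As a hedged sketch (in Clojure rather than Groovy, matching the Pull API examples earlier in this blog), that navigation looks like this:
;; d aliases datomic.api; touch realizes all of the entity's attributes
(def tribbles (d/entity (d/db conn) [:item/id "0042-TRBL"]))
(keys (d/touch tribbles))   ;; enumerate the entity's attribute keys
(:item/count tribbles)      ;; => 999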

What entities do not tell you is how, when, or why data got into the system.  One way to improve this is to think in terms of a broader relation.  Instead of a 3-tuple of entity / attribute / value, Datomic uses a datom: a 5-tuple of entity / attribute / value / transaction / added.  The fourth slot, transaction, references an entity that records the time the transaction was added to the system (plus possibly other facts about the transaction).  The fifth slot, added, is a boolean that records whether the datom is being asserted (true) or retracted (false).
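
For example, the datom asserting the Tribble count can be pictured as a 5-tuple (the ids shown match the query results later in this article):
;; [entity          attribute    value  transaction     added?]
   [17592186045420  :item/count  999    13194139534315  true]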

Datoms allow you to see (and annotate) the complete history of entities in your system.  For example, the following Datalog query will return all the facts about Tribbles.
q('''[:find ?aname ?v ?tx ?added
      :in $ ?e
      :where [?e ?a ?v ?tx ?added]
             [?a :db/ident ?aname]]''',
  db,
  [':item/id', '0042-TRBL']);
We aren't going to cover Datalog syntax here, but will point out a few things in passing:
  • variables begin with ?
  • data patterns in the :where clause are specified in Datom order, i.e. entity, attribute, value, transaction, added
  • the :in clause binds parameters, so this query binds ?e to find only facts about id 0042-TRBL, Tribbles.
The query results, shown below, tell us that all facts about Tribbles were added in the same transaction, 13194139534315.

Attribute          Value                            Transaction     Added?
:item/description  Tribble: a low maintenance pet.  13194139534315  true
:item/id           0042-TRBL                        13194139534315  true
:item/count        999                              13194139534315  true

Searching The Log

In the query above, we knew that the problem was Tribbles, and could use that to look up the information.  Let's imagine instead that we know only that the problem just happened. To ask a "when" question, you can explicitly access the database log, which is a time index. In Datomic, the basisT of a database identifies the most recent transaction, and can be passed to the tx-data function inside a query to access the log and return datoms from a specific transaction. The following query shows the entire transaction that added the Tribbles.
log = conn.log();
q('''[:find ?e ?aname ?v ?tx ?added
      :in $ ?log ?tx
      :where [(tx-data ?log ?tx) [[?e ?a ?v _ ?added]]]
             [?a :db/ident ?aname]]''',
  db,
  log,
  db.basisT());
This query shows us not only the troubled Tribble datoms, but also information about the transaction that added them:

Entity          Attribute          Value                            Transaction     Added?
17592186045420  :item/description  Tribble: a low maintenance pet.  13194139534315  true
17592186045420  :item/id           0042-TRBL                        13194139534315  true
17592186045420  :item/count        999                              13194139534315  true
13194139534315  :db/txInstant      2014-05-19T17:20:48.200-00:00    13194139534315  true

The combination of granular datoms, the powerful Datalog query language, and direct access to the database log make it possible to find out everything about a data error.  Now let's fix the error, without losing history of the event.

Fixing the Problem

The transaction that added the Tribbles actually said three things about Tribbles:
  • Tribbles exist
  • Tribbles have a description
  • we have 999 Tribbles  

Simple Retraction

Let's presume that it is innocuous to acknowledge the existence of Tribbles.  Thus we can make a very granular correction, refuting only the absurd notion that we have Tribbles in inventory:
errDoc = 'Error correction entry. We do not sell Tribbles.';
weDontSellTribbles = 
[[':db/add', [':item/id', '0042-TRBL'], ':item/count', 0],
 [':db/add', tempid(':db.part/tx'), ':db/doc', errDoc]];
conn.transact(weDontSellTribbles).get();
This correction adds two facts to the database:
  1. an assertion that we have, in fact, zero Tribbles
  2. a documentation string on the transaction entity (:db.part/tx) explaining the correction

Retraction with Provenance

Another possibility is that ACME does not even want to acknowledge the existence of Tribbles.  So we need to retract the entire Tribble entity.  Also, ACME has a policy of recording a manager's identity along with an error correction.  Here, then, is a transaction that removes the Tribble entity entirely, and credits John Doe with the removal:
retractTribbles = 
[[':db.fn/retractEntity', [':item/id', '0042-TRBL']],
 [':db/add', 
  tempid(':db.part/tx'), 
  ':corrected/by', [':manager/email', 'jdoe@example.com']]];
conn.transact(retractTribbles).get();
:db.fn/retractEntity is a built-in database function that expands to datoms retracting every fact about an entity.  :corrected/by is an attribute specific to the schema of this database, and is an example of how you can extend the information model to capture provenance information for transactions.
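
As a hedged illustration (not literal transactor output), the retractEntity call above expands to granular retractions along these lines, using the entity id from the earlier query results:
;; illustrative expansion of :db.fn/retractEntity
[[:db/retract 17592186045420 :item/id "0042-TRBL"]
 [:db/retract 17592186045420 :item/description "Tribble: a low maintenance pet."]
 [:db/retract 17592186045420 :item/count 0]]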

You Can Go Back Again

Now that Tribbles have been retracted entirely, they will be invisible to queries or entity calls against the database.  But while they are gone, they are not forgotten.  The history view of a database shows not just the present, but the entire past of the database.  The following query shows the complete history of Tribbles in the database:
hist = conn.db().history();
q('''[:find ?aname ?v ?tx ?added
      :in $ ?e
      :where [?e ?a ?v ?tx ?added]
             [?a :db/ident ?aname]]''',
  hist,
  [':item/id', '0042-TRBL']);
As you can see below, this history includes the original problem transaction, the partial correction, and the complete removal of the entity.

Attribute          Value                            Transaction     Added?
:item/description  Tribble: a low maintenance pet.  13194139534315  true
:item/id           0042-TRBL                        13194139534315  true
:item/count        999                              13194139534315  true
:item/count        999                              13194139534317  false
:item/count        0                                13194139534317  true
:item/id           0042-TRBL                        13194139534318  false
:item/description  Tribble: a low maintenance pet.  13194139534318  false
:item/count        0                                13194139534318  false

Where Are We?

We have now accomplished everything we set out to do:
  • We used Datalog queries to precisely locate problem data, both by recency and by identifying key.
  • We took advantage of the granularity of datoms to create transactions that precisely retracted problem data.
  • We used reified transactions and the history view of the database to remember what went wrong, and how it was fixed.
Time is a fundamental dimension of information systems, and it cannot be easily retrofitted to systems that only support traditional row or document views of information.  In particular, note that the queries above take a database (not a connection) argument.  That database value can be filtered with history, asOf, or since to pass different windows of time to the same query.  With a time model, "point-in-time" or "range-of-time" queries can use the same logic as "now" queries.
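
A minimal Clojure sketch, assuming t-before-fix is a point in time captured earlier:
;; the same query logic runs against any of these database values
(def db-now  (d/db conn))                    ;; the present
(def db-then (d/as-of db-now t-before-fix))  ;; as of a past point in time
(def db-hist (d/history db-now))             ;; all assertions and retractions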

That all sounds pretty good.  So, should all of your information systems build upon a time model?

Modeling Time: Concerns and Tradeoffs

There are a number of issues to consider when modeling time in an information system; we will consider a few of the most common issues here.
  1. What are the performance concerns when keeping historical data?
  2. Is there some data for which historical information is irrelevant or counterproductive?
  3. But I really want to delete something!
  4. I understand the importance of remembering past mistakes, but I want "clean" history, too.

Performance Concerns

Data structures that preserve past values of themselves are called persistent.  A common first reaction to persistence in databases is "Can I actually afford to remember everything?"  You may remember that a similar objection was raised when Git first became popular: "Can I really afford to keep the entire history of projects on my laptop?"

It is important to understand that history can be managed so that queries are "pay as you go".  If you do not use history in a particular query, that query will perform just as well as if you were not storing history.  Modern immutable databases such as Datomic make such optimizations.

With that out of the way, the other major concern is storage cost. For many systems, particularly transactional systems of record, the balance of this argument is usually in favor of keeping everything. If the information is worth recording at all, it is worth the cost of keeping it.

But some information has little value, and does not warrant history.

Not Everything Needs History

At the other extreme from systems of record, you have high-volume data that has little information value per fact, but some utility in the aggregate.  A good example of this is a hit counter on a web site.  While you might be interested to know that 10,000 people are visiting every hour, you probably don't care to know the exact time of each counter increment.

Counters and other high-churn data with little point-in-time value do not benefit from history.

Excision

There are a few scenarios where systems really do need to deliberately and irrevocably lose data.  The most common such scenario is a legal requirement to remove information from a system for privacy or intellectual property reasons.

For such situations, Datomic provides excision, which permanently removes facts from the system. This feature should never be used to fix mistakes.  To continue the source code analogy, this would be the equivalent of saying "Wow, that was a really dodgy bug, let's corrupt source control so it looks like the bug never happened."
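
For reference, excision is itself expressed as a transaction.  Here is a hedged sketch (the entity id is hypothetical; see the excision docs for the details):
;; permanently remove every fact about an entity
[{:db/id #db/id[:db.part/user]
  :db/excise 17592186045420}]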

Clean History

We have been talking about history as "what we believed in the past".  But from there one might aspire to know "what was true in the past", or at least "what we now think was true at some point in time in the past".

The ability to annotate transactions provides a path to multiple notions of history. For example, you could define a :tx/mistake attribute, and use it to mark transactions that later prove to be erroneous.  You could then query against a filtered database that does not see those transactions at all.
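
A minimal sketch, assuming you have defined a boolean :tx/mistake attribute:
;; a filtered view that hides all datoms from transactions marked as mistakes
(def trusted-db
  (d/filter (d/db conn)
            (fn [db datom]
              (not (:tx/mistake (d/entity db (:tx datom)))))))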

Wrapping Up

Stuff happens, and information systems will acquire bad data.  Never delete your mistakes.  Use fine grained information such as datoms to pinpoint your errors, and use retraction and reified transactions in a persistent database such as Datomic to correct errors while preserving history.

Monday, March 24, 2014

Datomic Adaptive Indexing

We are pleased to announce new Adaptive Indexing support for Datomic, available today. Adaptive indexing involves both a new index format and a new algorithm for how indexes are maintained. It should prove especially beneficial to those with large databases, high write loads, or large imports.

Some of the benefits:

  • Reduced work to be done per indexing job
There will be fewer index segments written, which should take less time and/or allow for reduced write provisioning when using DynamoDB.

  • Reduced memory requirements
We have revised downwards both the default and recommended settings for memory-index-threshold (now 32m) and memory-index-max (now 512m), for all workloads (see the sketch after this list). This will also reduce memory pressure on peers, which are similarly configured. In addition, indexing itself uses less memory and has better GC characteristics. You will not approach memory-index-max except during imports and bulk loads.

  • Simpler configuration and less variability
You should rarely need to diverge from the defaults, and generally can use the same configuration for imports and ongoing production.

  • Sustainable import rates independent of db size
You will see indexing job times flatten out even as your db size grows linearly, as there is a sub-linear worst-case relationship between db size and indexing job size. All of this is done while minimizing the amount of merging done during reads.
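
As a sketch, the revised defaults mentioned in the list above correspond to the following transactor properties (consult your transactor properties file and the release notices):
memory-index-threshold=32m
memory-index-max=512m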

TL;DR - much better and more predictable performance, using fewer resources.

We expect adaptive indexing to make a tangible difference to most customers, and look forward to your feedback. Be sure to read the release notices for important details.

Thursday, February 13, 2014

Datomic Lookup Refs

Datomic's new lookup refs provide an easy way to identify entities by domain unique identifiers.

Rationale

Datomic supports several approaches to identity, including internal identifiers (entity IDs that are unique within a database) and external identifiers (domain attributes that are designated to be :db.unique/identity).  Lookup refs provide a concise data representation that accords to external ids the powers of internal ids, and can be used interchangeably with internal ids in the Datomic API.  This is powerful, since often, e.g. in the front end of a system or during imports, you already have the external ids in hand.  With lookup refs you can often avoid incorporating a lookup in application code, since the lookup is done implicitly when resolving the lookup ref (thus the name).

Implementation

A lookup ref is simply a java.util.List containing two elements: a unique attribute and a value for that attribute.  Encoded in edn, a lookup ref looks like this:

[:org/email "info@datomic.com"]

Lookup refs can be used in the entity API, or in any of the database index APIs.  For example, the following code retrieves an entity:

// Groovy
db.entity([':org/email', 'info@datomic.com'])
;; Clojure
(d/entity db [:org/email "info@datomic.com"])

Lookup refs can also be used to refer to existing entities in transaction data.  This allows updating transactions to specify :db/id directly instead of using a dummy upsert, as shown below:

;; edn transaction data, using temp :db/id and dummy upsert
[{:db/id #db/id[:db.part/user]
  :org/email "info@datomic.com",
  :org/favoriteColor :blue}]

;; edn transaction data, using lookup ref
[{:db/id [:org/email "info@datomic.com"],
  :org/favoriteColor :blue}]

Similarly, they can be used in transactions to build ref relationships.
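
For example, here is a hedged sketch in edn, where :org/parent is a hypothetical ref attribute:

;; edn transaction data: a lookup ref as the value of a ref attribute
[{:db/id #db/id[:db.part/user]
  :org/email "sales@example.com"
  :org/parent [:org/email "info@datomic.com"]}]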

We expect that lookup refs will expedite a wide variety of use cases, including imports, interactive exploration, and client application development.  

We look forward to your feedback.

Thursday, January 23, 2014

Schema Alteration

Datomic is a database that has flexible, minimal schema. Starting with version 0.9.4470, we have added the ability to alter existing schema attributes after they are first defined. You can alter schema to

  • rename attributes
  • rename your own programmatic identities (uses of :db/ident)
  • add or remove indexes 
  • add or remove uniqueness constraints
  • change attribute cardinality
  • change whether history is retained for an attribute
  • change whether an attribute is treated as a component

Schema alterations use the same transaction API as all other transactions, just as schema installation does.  All schema alterations can be performed while a database is online, without requiring database downtime.  Most schema changes are effective immediately, at the end of the transaction.  There is one exception: adding an index requires a background job to build the new index. You can use the new syncSchema API for detecting when a schema change is available.
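
As a hedged sketch in Clojure (the :item/description attribute is hypothetical, and d aliases datomic.api), adding an index and waiting for the background build might look like this:

;; alter the attribute, then block until the new index is available
(let [result @(d/transact conn [{:db/id :item/description
                                 :db/index true
                                 :db.alter/_attribute :db.part/db}])]
  @(d/sync-schema conn (d/basis-t (:db-after result))))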

When renaming an attribute or identity, you can continue to use the old name as long as you haven't repurposed it. This allows for incremental application updating.
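
Renaming is even smaller; a hedged sketch, assuming a hypothetical :item/count attribute:

;; rename :item/count to :item/quantity by asserting a new :db/ident
;; (the old name remains usable until repurposed)
@(d/transact conn [{:db/id :item/count
                    :db/ident :item/quantity}])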

See the schema alteration docs for the details.

Schema alteration has been our most requested enhancement. We hope you find it useful and look forward to your feedback.

Wednesday, January 8, 2014

Datomic 2013 Recap

2013 was a great year for Datomic.  The value of a flexible information model and immutable data have proven themselves time and again.  Customers have built a variety of powerful systems, taking advantage of
  • ACID transactions
  • pluggable SQL/NoSQL/cloud storage
  • complete access to the history of information
  • the Datalog query language
  • elastic read scalability
  • a granular information model
Over the course of the year, we produced over 40 Datomic releases. The API has been remarkably stable: our commitment to a strong architecture has allowed us to focus on adding features and fleshing out the vision, without the churn of revisiting past decisions.

A major new feature is the Datomic Console, a graphical UI for exploring Datomic databases.  The console provides a great visual introduction to the Datomic information model.  It supports exploring schema, building and executing queries, navigating entities, examining transaction history, and walking raw indexes. 

We made several API additions:
  • Excision, a sound model (and API) for permanent removal of data, with auditability.
  • The log API provides the ability to access the log, which is more properly viewed as a time index.
  • The seekDatoms and entidAt APIs provide advanced capability for accessing Datomic's indexes, augmenting the datoms API.
  • The sync API allows multiple processes to coordinate around points in time-of-record, or relative to local process time.
  • Transaction map expansion automates the creation of arbitrarily nested data.
We also made a number of operational improvements:
  • We added Cassandra to the list of supported storages, joining the existing options of DynamoDB, SQL, filesystem, Couchbase, Infinispan, and Riak.
  • The Starter Edition of the Datomic Pro license makes all storages available, for free.
  • We have added a number of new CloudWatch metrics, and a pluggable metrics API for integration with other systems.
  • The MusicBrainz sample database is a great dataset for exploring Datomic.
  • We continue to track AWS best practices, now supporting IAM roles for distributing credentials and DynamoDB local for testing.
We are looking forward to an equally exciting 2014. We will be delivering a number of new features requested by users, plus a few big surprises.

Many thanks to our customers and early adopters for your support and feedback.

Happy New Year!