
Datomic Best Practices

We are happy to announce the addition of a Best Practices section to the Datomic Development Resources documentation. The Best Practices topic is a tool to help users quickly find preferred usage patterns when approaching common scenarios in Datomic.

The Datomic Best Practices documentation covers a number of common scenarios. Where appropriate, examples are included, as are links to the full documentation for further reading.

See the Best Practices documentation for the full list. We expect to add more subtopics to the Best Practices documentation over time, so check back periodically.

We look forward to your feedback. If you have any comments on the new best practices topics, or any questions you think would be addressed by an additional topic, let us know!

More Power for Datomic Datalog: Negation, Disjunction, and Range Optimizations

Today's Datomic release includes a number of enhancements to Datomic's Datalog query language:
  • Negation, via the new not and not-join clauses
  • Disjunction (or) without using rules, via the new or and or-join clauses
  • Required rule bindings
  • Improved optimization of range predicates
Each is described below, and you can follow the examples from the mbrainz data set in Java or in Clojure.

Negation

The not clause shrinks a result set, removing already-bound tuples whose variables match a set of clauses.  
For example, the following query finds the names of all artists who are not Canadian:
[:find ?name
 :where [?eid :artist/name ?name]
        (not [?eid :artist/country :country/CA])]
The not-join clause generalizes not, and allows you to also specify which variables inside the not-join will unify with the enclosing query.
The following query counts artists who did not release an album in 1970.  The not-join specifies that ?artist unifies with the containing query, and (implicitly) that ?release is local.
[:find (count ?artist) .
 :where [?artist :artist/name]
        (not-join [?artist]
          [?release :release/artists ?artist]
          [?release :release/year 1970])]
As elsewhere in Datomic Datalog, lists of clauses inside not or not-join are connected by an implicit and.

Disjunction

Datomic Datalog has long supported disjunction (or) via rules.  Rules are a powerful abstraction mechanism, allowing you to create named pieces of logic to be reused across multiple queries.  Rules are also the path to recursion in query.  But sometimes you don't need all that, and you just want simple disjunction in situ.
The or clause binds variables that match at least one of a set of predicates.
As an example, consider the way mbrainz models release media. There are four separate enums that all represent vinyl: vinyl, vinyl7, vinyl10, and vinyl12. An or clause can group these all together, finding all the vinyl releases:
[:find (count ?medium) .
 :where (or [?medium :medium/format :medium.format/vinyl7]
            [?medium :medium/format :medium.format/vinyl10]
            [?medium :medium/format :medium.format/vinyl12]
            [?medium :medium/format :medium.format/vinyl])]
The and clause is available as an explicit form inside or clauses.  For example, the following query finds artists that are either groups or female individuals:
[:find (count ?artist) .
 :where (or [?artist :artist/type :artist.type/group]
            (and [?artist :artist/type :artist.type/person]
                 [?artist :artist/gender :artist.gender/female]))]
Note that and clauses are used only inside or – elsewhere in query, and is implicit wherever clauses appear sequentially.
The or-join clause is a generalization of or, just as not-join generalizes not.
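For illustration, here is a sketch (not from the original examples) that counts artists who are either groups or who released something in 1970. Only ?artist unifies with the enclosing query; ?release is local to its branch, which is exactly what or-join allows and plain or does not:
[:find (count ?artist) .
 :where [?artist :artist/name]
        (or-join [?artist]
          [?artist :artist/type :artist.type/group]
          (and [?release :release/artists ?artist]
               [?release :release/year 1970]))]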

All Together

Negation and disjunction can be nested, allowing direct expression of constraints such as "not this or that". For example:
[:find ?name
 :where [?eid :artist/name ?name]
        (not (or [?eid :artist/country :country/US]
                 [?eid :artist/country :country/CA]))]
which means "find all the artists, except those that are from the US or Canada."

Required Rule Bindings

Rules are relational, and do not dictate the order in which variables are bound.   Imagine a rule that finds tracks shorter than a certain maximum length for an artist.  You might write that rule as:
[(short-track ?a ?t ?len ?max)
 [?t :track/artists ?a]
 [?t :track/duration ?len]
 [(< ?len ?max)]]
This rule does not tell the caller anything about when the four variables might be bound, revealing little about its intended use. You might:
  • invoke the rule in a context where no variables are bound yet, finding all possible relationships
  • invoke the rule with all variables bound, as a predicate check
  • use any combination of variable bindings between these two extremes

Often, complete flexibility in binding is exactly what you want.  But in this case, the author of the rule has a specific intent:
  • ?max is a required input, as nothing inside the rule knows how to manufacture a ?max
  • ?a is an intended input (this rule is about artists)
Rules can specify required vars in a list as the first element after the rule name.  The rule head above can thus be rewritten as
(short-track [?a ?max] ?t ?len)
Now the ?a and ?max variables are required. Datomic will not invoke a rule until required variables are bound, and will throw an exception if they cannot be bound.
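For reference, here is the complete rule again, combining the new head with the original body:
[(short-track [?a ?max] ?t ?len)
 [?t :track/artists ?a]
 [?t :track/duration ?len]
 [(< ?len ?max)]]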

Optimization of Range Predicates

Datomic now better leverages indexes to look up range predicates (=, <=, >=, <, and >) where one argument is bound at the start of the query.  For example, in the following query
[:find (count ?artist) .
 :in $ ?max-year
 :where [?artist :artist/startYear ?year]
        [(< ?year ?max-year)]]

Datomic will see that ?max-year is bound, and use the AVET index on :artist/startYear to consider only datoms guaranteed to match the predicate.  This can result in substantial speedups and reduced memory usage for queries whose results are significantly narrowed by range predicates.

Summary

Taken together, the new capabilities above add substantial expressive power and performance to Datomic Datalog.  Get the full details in the docs, and happy querying!








Datomic Pull API

Datomic's new Pull API is a declarative way to make hierarchical selections of information about entities. You supply a pattern to specify which attributes of the entity (and nested entities) you want to pull, and db.pull returns a map for each entity.

Pull API vs. Entity API

The Pull API has two important advantages over the existing Entity API:

Pull uses a declarative, data-driven spec, whereas Entity encourages building results via code. Data-driven specs are easier to build, compose, transmit and store. Pull patterns are smaller than entity code that does the same job, and can be easier to understand and maintain.

Pull API results match standard collection interfaces (e.g. Java maps) in programming languages, where Entity results do not. This eliminates the need for an additional allocation/transformation step per entity.
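
As a quick sketch (Clojure API; led-zeppelin is an entity id defined in the linked code examples below), a pull result is an ordinary map, so standard map access works directly:

;; pull returns a plain map; no per-entity transformation step is needed
(get (d/pull db [:artist/name :artist/startYear] led-zeppelin)
     :artist/name)
;;=> "Led Zeppelin"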

Wildcards

A pull pattern is a list of attribute specifications.  If you want all attributes, you can use the wildcard (*) specification along with an entity identifier.  (In the examples below, entity identifiers such as led-zeppelin are variables defined in the complete code examples in Clojure and Java.)

;; Clojure API
(d/pull db '[*] led-zeppelin)

;; Java API
db.pull("[*]", ledZeppelin)

A pull result is a map per entity, shown here in edn:

;; result
{:artist/sortName "Led Zeppelin",
:artist/name "Led Zeppelin",
:artist/type {:db/id 17592186045746},
:artist/country {:db/id 17592186045576},
:artist/gid #uuid "678d88b2-87b0-403b-b63d-5da7465aecc3",
:artist/endDay 25,
:artist/startYear 1968,
:artist/endMonth 9,
:artist/endYear 1980,
:db/id 17592186050305}

Attributes

You can also specify the attributes you want explicitly, as with :artist/name and :artist/gid below:

;; pattern
[:artist/name :artist/gid]

;; input
led-zeppelin

;; result
{:artist/gid #uuid "678d88b2-87b0-403b-b63d-5da7465aecc3",
:artist/name "Led Zeppelin"}

The underscore prefix reverses the direction of an attribute, so :artist/_country pulls all the artists for a particular country:

;; pattern
[:artist/_country]

;; input
greatBritain

;; result
{:artist/_country [{:db/id 17592186045751}
{:db/id 17592186045755}
...]}

Components

Datomic component attributes are pulled recursively by default, so the :release/media pattern below automatically returns a release's tracks as well:

;; pattern
[:release/media]

;; input
darkSideOfTheMoon

;; result
{:release/media
 [{:db/id 17592186121277,
   :medium/format {:db/id 17592186045741},
   :medium/position 1,
   :medium/trackCount 10,
   :medium/tracks
   [{:db/id 17592186121278,
     :track/duration 68346,
     :track/name "Speak to Me",
     :track/position 1,
     :track/artists [{:db/id 17592186046909}]}
    {:db/id 17592186121279,
     :track/duration 168720,
     :track/name "Breathe",
     :track/position 2,
     :track/artists [{:db/id 17592186046909}]}
    {:db/id 17592186121280,
     :track/duration 230600,
     :track/name "On the Run",
     :track/position 3,
     :track/artists [{:db/id 17592186046909}]}
    ...]}]}

Map Specifications

Instead of just an attribute name, you can use a nested map specification to pull related entities.  The pattern below pulls the :db/id and :artist/name of each artist:

;; pattern
[:track/name {:track/artists [:db/id :artist/name]}]

;; input
ghostRiders

;; result
{:track/artists [{:db/id 17592186048186, :artist/name "Bob Dylan"}
{:db/id 17592186049854, :artist/name "George Harrison"}],
:track/name "Ghost Riders in the Sky"}

And of course everything nests arbitrarily, in case you need the release's medium's track's names and artists:

;; pattern
[{:release/media
  [{:medium/tracks
    [:track/name {:track/artists [:artist/name]}]}]}]

;; input
concertForBanglaDesh

;; result
{:release/media
 [{:medium/tracks
   [{:track/artists
     [{:artist/name "Ravi Shankar"} {:artist/name "George Harrison"}],
     :track/name "George Harrison / Ravi Shankar Introduction"}
    {:track/artists [{:artist/name "Ravi Shankar"}],
     :track/name "Bangla Dhun"}]}
  {:medium/tracks
   [{:track/artists [{:artist/name "George Harrison"}],
     :track/name "Wah-Wah"}
    {:track/artists [{:artist/name "George Harrison"}],
     :track/name "My Sweet Lord"}
    {:track/artists [{:artist/name "George Harrison"}],
     :track/name "Awaiting on You All"}
    {:track/artists [{:artist/name "Billy Preston"}],
     :track/name "That's the Way God Planned It"}]}
  ...]}

Try It Out

The Pull API has many other capabilities not shown here.  See the full docs for defaults, limits, bounded and unbounded recursion, and more.  Or check out the examples in Clojure and Java.

Stuff Happens: Fixing Bad Data in Datomic

Oops!  You just put some bad data in your system, and now you need a way to clean it up.  In this article, we look at how to recover from data errors.  Along the way, we will explore how Datomic models time.

A Motivating Example

ACME Co. buys, sells, and processes things.  Unfortunately, their circa-2003 web interface is not a shining example of UI/UX design.  Befuddled by all the modal screens, managers regularly put bad data into the system.

In fact, manager Steve just accidentally created an inventory record showing that ACME now has 999 Tribbles.  This is ridiculous, since everyone knows that the CEO refuses to deal in Tribbles, citing "a bad experience". In a rather excited voice, Steve says "Quick, please delete the last entry added to the system!"

As is so often the case, job one is to carefully interpret a stakeholder's request.  In particular, the words "delete", "last", and "entry" all warrant careful consideration.

What "Delete" Means

Let's start with "delete".  At first glance, one might think that Steve wants us to do our best to "unhappen" his mistake. But upon reflection, that isn't such a good idea.  The database is a live system, and someone may have used the bad data to make decisions.  If so, then simply excising the mistake will just lead to more confusion later. What if, during the few minutes Tribbles were in the inventory, we sold a Tribble?  Or, more subtly, what if we moved our widget inventory to a different warehouse to make room for the nonexistent Tribbles?

Databases need to remember the history of their data, even (and perhaps especially) when the data is later discovered to be bad.  An easy analogy to drive home the point is source control.  Source control systems act as a simple database for code.  When a bug happens, you care very much about which versions of the code manifest the bug.

So rather than deleting the Tribbles, we want something more like a "reverting commit" in source control; that is, to record that we no longer believe that we have (or in fact ever had) Tribbles, but that during a specific time window we mistakenly believed we had 999 Tribbles.

What "Last" Means

Next, let us consider the temporal word "last".  Happily, ACID databases have a unit of time ordering: transactions. Datomic goes a step further with reified transactions, i.e. transactions that you can manipulate as first-class objects in the system.  In such a system, you might indeed be able to say "Treat the last transaction as a data entry error and record a reverting transaction."

We still need to be careful, though, about what we mean by "last".  Again, the inventory system is a live system, so the last transaction is a moving target.  It would be dangerous and incorrect to blindly revert the last transaction, without making sure that Steve's erroneous transaction is still the last one. Generally, we will want a way to review recent transactions, and then narrow down by some other criteria, e.g. the id for Tribbles.

What "Entry" Means

Finally, let's consider what constitutes an "entry".  This is a domain concept, and might or might not correspond directly to a row in a table, or to a document, or whatever metaphor your database encourages.  Mistakes can have arbitrary and irregular shapes.  When correcting errors, we will want to have granular representations of data, so that we can precisely target the error without also having to modify correct data nearby.

Putting that all together, we need to:
  • locate the precise problem among recent transaction data
  • create a transaction that reverts the bad information...
  • ...without forgetting that the problem happened.  
Let's take this to code.

Retraction in Code

The code examples that follow are in Groovy rather than Java for concision's sake.  You can download the sample code from the Datomic Groovy Examples repository.

Let's begin with our erroneous transaction, creating a record for Tribbles with a unique id, description, and count:
addTribbles = [[':db/id': tempid(':db.part/user'),
                ':item/id': '0042-TRBL',
                ':item/description': 'Tribble: a low maintenance pet.',
                ':item/count': 999]];
conn.transact(addTribbles).get();

Finding the Problem

OK, the clock is ticking: we have bad data in the system.  How can we look to see what we have?  For many everyday uses, you want an associative (entity) view of data.  Depending on your language and tools, an entity representation may be a strongly-typed object or a mere dictionary, and may be generated by hand or automated through an ORM.  Regardless, an associative representation allows you to navigate from keys to values.

In Datomic, you can pass a lookup ref (a pair containing a unique key and its value) to the entity API to get an entity:
db = conn.db();
db.entity([':item/id', '0042-TRBL']);
Entities imply a three-way relation of entity / attribute / value.  The entity is implicit in object identity, the attributes are keys (or getter methods), and the values are looked up via the attributes. Given an entity, you can call keySet to find its keys, or get to look up the value at a particular key.

What entities do not tell you is how, when, or why data got into the system.  One way to improve this is to think in terms of a broader relation.  Instead of a 3-tuple of entity / attribute / value, Datomic uses a datom: a 5-tuple of entity / attribute / value / transaction / added.  The fourth slot, transaction, references an entity that records the time the transaction was added to the system (plus possibly other facts about the transaction).  The fifth slot, added, is a boolean that records whether the datom is being asserted (true) or retracted (false).

Datoms allow you to see (and annotate) the complete history of entities in your system.  For example, the following Datalog query will return all the facts about Tribbles.
q('''[:find ?aname ?v ?tx ?added
      :in $ ?e
      :where [?e ?a ?v ?tx ?added]
             [?a :db/ident ?aname]]''',
  db,
  [':item/id', '0042-TRBL']);
We aren't going to cover Datalog syntax here, but will point out a few things in passing:
  • variables begin with ?
  • data patterns in the :where clause are specified in Datom order, i.e. entity, attribute, value, transaction, added
  • the :in clause binds parameters, so this query binds ?e to find only facts about id 0042-TRBL, Tribbles.
The query results, shown below, tell us that all facts about Tribbles were added in the same transaction, 13194139534315.

Attribute           Value                             Transaction     Added?
:item/description   Tribble: a low maintenance pet.   13194139534315  true
:item/id            0042-TRBL                         13194139534315  true
:item/count         999                               13194139534315  true

Searching The Log

In the query above, we knew that the problem was Tribbles, and could use that to look up the information.  Let's imagine instead that we know only that the problem just happened. To ask a "when" question, you can explicitly access the database log, which is a time index. In Datomic, the basisT of a database identifies the most recent transaction, and can be passed to the tx-data function inside a query to access the log and return datoms from a specific transaction. The following query shows the entire transaction that added the Tribbles.
log = conn.log();
q('''[:find ?e ?aname ?v ?tx ?added
      :in $ ?log ?tx
      :where [(tx-data ?log ?tx) [[?e ?a ?v _ ?added]]]
             [?a :db/ident ?aname]]''',
  db,
  log,
  db.basisT());
This query shows us not only the troubled Tribble datoms, but also information about the transaction that added them:

Entity          Attribute           Value                             Transaction     Added?
17592186045420  :item/description   Tribble: a low maintenance pet.   13194139534315  true
17592186045420  :item/id            0042-TRBL                         13194139534315  true
17592186045420  :item/count         999                               13194139534315  true
13194139534315  :db/txInstant       2014-05-19T17:20:48.200-00:00     13194139534315  true

The combination of granular datoms, the powerful Datalog query language, and direct access to the database log make it possible to find out everything about a data error.  Now let's fix the error, without losing history of the event.

Fixing the Problem

The transaction that added the Tribbles actually said three things about Tribbles:
  • Tribbles exist
  • Tribbles have a description
  • we have 999 Tribbles  

Simple Retraction

Let's presume that it is innocuous to acknowledge the existence of Tribbles.  Thus we can make a very granular correction, refuting only the absurd notion that we have Tribbles in inventory:
errDoc = 'Error correction entry. We do not sell Tribbles.';
weDontSellTribbles =
  [[':db/add', [':item/id', '0042-TRBL'], ':item/count', 0],
   [':db/add', tempid(':db.part/tx'), ':db/doc', errDoc]];
conn.transact(weDontSellTribbles).get();
This correction adds two facts to the database:
  1. an assertion that we have, in fact, zero Tribbles
  2. a documentation string on the transaction entity (:db.part/tx) explaining the correction

Retraction with Provenance

Another possibility is that ACME does not even want to acknowledge the existence of Tribbles.  So we need to retract the entire Tribble entity.  Also, ACME has a policy of recording a manager's identity along with an error correction. Here, then, is a transaction that removes the Tribble entity entirely, and credits John Doe with the removal:
retractTribbles =
  [[':db.fn/retractEntity', [':item/id', '0042-TRBL']],
   [':db/add',
    tempid(':db.part/tx'),
    ':corrected/by', [':manager/email', 'jdoe@example.com']]];
conn.transact(retractTribbles).get();
:db.fn/retractEntity is a built-in database function that expands to datoms retracting every fact about an entity.  :corrected/by is an attribute specific to the schema of this database, and is an example of how you can extend the information model to capture provenance information for transactions.

You Can Go Back Again

Now that Tribbles have been retracted entirely, they will be invisible to queries or entity calls against the database.  But while they are gone, they are not forgotten.  The history view of a database shows not just the present, but the entire past of the database.  The following query shows the complete history of Tribbles in the database:
hist = conn.db().history();
q('''[:find ?aname ?v ?tx ?added
      :in $ ?e
      :where [?e ?a ?v ?tx ?added]
             [?a :db/ident ?aname]]''',
  hist,
  [':item/id', '0042-TRBL']);
As you can see below, this history includes the original problem transaction, the partial correction, and the complete removal of the entity.

Attribute           Value                             Transaction     Added?
:item/description   Tribble: a low maintenance pet.   13194139534315  true
:item/id            0042-TRBL                         13194139534315  true
:item/count         999                               13194139534315  true
:item/count         999                               13194139534317  false
:item/count         0                                 13194139534317  true
:item/id            0042-TRBL                         13194139534318  false
:item/description   Tribble: a low maintenance pet.   13194139534318  false
:item/count         0                                 13194139534318  false

Where Are We?

We have now accomplished everything we set out to do:
  • We used Datalog queries to precisely locate problem data, both by recency and by identifying key.
  • We took advantage of the granularity of datoms to create transactions that precisely retracted problem data.
  • We used reified transactions and the history view of the database to remember what went wrong, and how it was fixed.
Time is a fundamental dimension of information systems, and it cannot be easily retrofitted to systems that only support traditional row or document views of information.  In particular, note that the queries above take a database (not a connection) argument.  That database can be filtered with history, asOf, or since to pass different windows of time to the same query.  With a time model, "point-in-time" or "range-of-time" queries can use the same logic as "now" queries.
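
As a minimal sketch (Clojure API; the connection and date are placeholders), the same query logic can be pointed at different views of a single database value:

(def db-now  (d/db conn))                          ; the current value
(def db-then (d/as-of db-now #inst "2014-05-19"))  ; a point-in-time view
(def db-hist (d/history db-now))                   ; every assertion and retraction
;; any of these can be passed as the database argument to the queries above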

That all sounds pretty good.  So, should all of your information systems build upon a time model?

Modeling Time: Concerns and Tradeoffs

There are a number of issues to consider when modeling time in an information system; we will consider a few of the most common issues here.
  1. What are the performance concerns when keeping historical data?
  2. Is there some data for which historical information is irrelevant or counterproductive?
  3. But I really want to delete something!
  4. I understand the importance of remembering past mistakes, but I want "clean" history, too.

Performance Concerns

Data structures that preserve past values of themselves are called persistent.  A common first reaction to persistence in databases is "Can I actually afford to remember everything?"  You may remember that a similar objection was raised when Git first became popular: "Can I really afford to keep the entire history of projects on my laptop?"

It is important to understand that history can be managed so that queries are "pay as you go".  If you do not use history in a particular query, that query will perform just as well as if you were not storing history.  Modern immutable databases such as Datomic make such optimizations.

With that out of the way, the other major concern is storage cost. For many systems, particularly transactional systems of record, the balance of this argument is usually in favor of keeping everything. If the information is worth recording at all, it is worth the cost of keeping it.

But some information has little value, and does not warrant history.

Not Everything Needs History

At the other extreme from systems of record you have high-volume data that has little information value per fact, but some utility in the aggregate.  A good example of this is a hit counter on a web site.  While you might be interested to know that 10,000 people are visiting every hour, you probably don't care to know the exact time of each counter increment.

Counters and other high-churn data with low point-in-time value do not benefit from history.

Excision

There are a few scenarios where systems really do need to deliberately and irrevocably lose data.  The most common such scenario is a legal requirement to remove information from a system for privacy or intellectual property reasons.

For such situations, Datomic provides excision, which permanently removes facts from the system. This feature should never be used to fix mistakes.  To continue the source code analogy, this would be the equivalent of saying "Wow, that was a really dodgy bug, let's corrupt source control so it looks like the bug never happened."
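
To make the shape concrete, here is a hedged sketch (Clojure API; person-id is a hypothetical entity id, and this is appropriate only for legal removal requirements, never for correcting mistakes):

;; an excision transaction: :db/excise names the entity whose facts
;; will be permanently removed from the system
[{:db/id     (d/tempid :db.part/user)
  :db/excise person-id}]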

Clean History

We have been talking about history as "what we believed in the past". But from there one might aspire to know "what was true in the past", or at least "what we now think was true at some point in the past".

The ability to annotate transactions provides a path to multiple notions of history. For example, you could define a :tx/mistake attribute, and use it to mark transactions that later prove to be erroneous.  You could then query against a filtered database that does not see those transactions at all.
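
A minimal sketch of that idea (Clojure API, assuming a boolean :tx/mistake attribute has been installed in the schema):

;; returns a database value that hides every datom whose transaction
;; has been marked as a mistake
(defn without-mistakes [db]
  (d/filter db
            (fn [db datom]
              (not (:tx/mistake (d/entity db (:tx datom)))))))

;; run any query against (without-mistakes (d/db conn)) instead of (d/db conn)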

Wrapping Up

Stuff happens, and information systems will acquire bad data.  Never delete your mistakes.  Use fine grained information such as datoms to pinpoint your errors, and use retraction and reified transactions in a persistent database such as Datomic to correct errors while preserving history.

Datomic Adaptive Indexing

Today we are pleased to announce new Adaptive Indexing support for Datomic. Adaptive indexing involves both a new index format and a new algorithm for maintaining indexes. It should prove especially beneficial to those with large databases, high write loads, or large imports.

Some of the benefits:

  • Reduced work to be done per indexing job
There will be fewer index segments written, which should take less time and/or allow for reduced write provisioning when using DynamoDB.

  • Reduced memory requirements
We have revised downwards both the default and recommended thresholds for memory-index-threshold (to 32m) and memory-index-max (to 512m), for all workloads. This will also reduce memory pressure on peers, which are similarly configured. In addition, indexing itself uses less memory and has better GC characteristics. You will not approach memory-index-max except during imports and bulk loads.

  • Simpler configuration and less variability
You should rarely need to diverge from the defaults, and generally can use the same configuration for imports and ongoing production.

  • Sustainable import rates independent of db size
You will see indexing job times flatten out even as your db size grows linearly, as there is a sub-linear worst-case relationship between db size and indexing job size. All of this is done while minimizing the amount of merging done during reads.

TL;DR - much better and more predictable performance, using fewer resources.

We expect adaptive indexing to make a tangible difference to most customers, and look forward to your feedback. Be sure to read the release notices for important details.
