Tuesday, January 13, 2015

More Power for Datomic Datalog: Negation, Disjunction, and Range Optimizations

Today's Datomic release includes a number of enhancements to Datomic's Datalog query language:
  • Negation, via the new not and not-join clauses
  • Disjunction (or) without using rules, via the new or and or-join clauses
  • Required rule bindings
  • Improved optimization of range predicates
Each is described below, and you can follow the examples from the mbrainz data set in Java or in Clojure

Negation

The not clause shrinks a result set, removing already-bound tuples whose variables match a set of clauses.  
For example, the following query finds the names of all artists who are not Canadian:
[:find ?name
 :where [?eid :artist/name ?name]
        (not [?eid :artist/country :country/CA])]
The not-join clause generalizes not, and allows you to also specify which variables inside the not-join will unify with the enclosing query.
The following query counts artists who did not release an album in 1970.  The not-join specifies that ?artist unifies with the containing query, and (implicitly) that ?release is local.
[:find (count ?artist) .
 :where [?artist :artist/name]
        (not-join [?artist]
          [?release :release/artists ?artist]
          [?release :release/year 1970])]
As elsewhere in Datomic Datalog, lists of clauses inside not or not-join are connected by an implicit and

Disjunction

Datomic Datalog has long supported disjunction (or) via rules.  Rules are a powerful abstraction mechanism, allowing you to create named pieces of logic to be reused across multiple queries.  Rules are also the path to recursion in query.  But sometimes you don't need all that, and you just want simple disjunction in situ.
The or clause binds variables that match at least one of a set of predicates.
As an example, consider the way mbrainz models release media.   There are four separate enums that all represent vinyl: vinylvinyl-7vinyl-10, andvinyl-12.  An or clause can group these all together, finding all the vinyl releases:
[:find (count ?medium) .
 :where (or [?medium :medium/format :medium.format/vinyl7]
            [?medium :medium/format :medium.format/vinyl10]
            [?medium :medium/format :medium.format/vinyl12]
            [?medium :medium/format :medium.format/vinyl])]
The and clause is available as an explicit form inside or clauses.  For example, the following query finds artists that are either groups or female individuals:
[:find (count ?artist) .
 :where (or [?artist :artist/type :artist.type/group]
            (and [?artist :artist/type :artist.type/person]
                 [?artist :artist/gender :artist.gender/female]))]
Note that and clauses are used only inside or – elsewhere in query, and is implicit wherever clauses appear sequentially.
The or-join clause is a generalization of or, just as not-join generalizes not.

All Together

Negation and disjunction can be nested, allowing direct expression of e.g. 'not this or that', e.g.
[:find ?name
 :where [?eid :artist/name ?name]
        (not (or [?eid :artist/country :country/US])
                 [?eid :artist/country :country/CA])]
which means "find all the artists, except those that are from the US or Canada."
Required Rule Bindings
Rules are relational, and do not dictate the order in which variables are bound.   Imagine a rule that finds tracks shorter than a certain maximum length for an artist.  You might write that rule as:
[(short-track ?a ?t ?len ?max)
 [?t :track/artists ?a]
 [?t :track/duration ?len]
 [(< ?len ?max)]]
This rule does not tell the caller anything about when the four variables might be bound, revealing little about its use.  You might try to
  • invoke the rule in a context where no variables are bound yet, finding all possible relationships
  • invoke the rule with all variables bound, as a predicate check
  • any combination of variable bindings between these two extremes 

Often, complete flexibility in binding is exactly what you want.  But in this case, the author of the rule has a specific intent:
  • ?max is a required input, as nothing inside the rule knows how to manufacture a ?max
  • ?a is an intended input (this rule is about artists)
Rules can specify required vars in a list as the first element after the rule name.  The rule head above can thus be rewritten as
(short-track [?a ?max] ?t ?len)
Now the ?a and ?max variables are required. Datomic will not invoke a rule until required variables are bound, and will throw an exception if they cannot be bound.
Optimization of Range Predicates
Datomic now better leverages indexes to lookup range predicates (=, <=, >=, <. and >) where one argument is bound at the start of the query.  For example, in the following query
[:find (count ?artist) .
 :in $ ?max-year
 :where [?artist :artist/startYear ?year]
        [(< ?year ?max-year)]]

Datomic will see that ?max-year is bound, and use the AVET index on :artist/startYear to consider only datoms guaranteed to match the predicate.  This can result in substantial speedups and reduced memory usage for queries that substantially limit results via range predicates.

Summary

Taken together, the new capabilities above add substantial expressive power and performance to Datomic Datalog.  Get the full details in the docs, and happy querying!