Wednesday, December 18, 2013

Cassandra Support

We are pleased to announce alpha support for Cassandra as a storage service for Datomic, now available in version 0.9.4384.

Cassandra is an elastically scalable, distributed, redundant and highly-available column store. Recent versions of Cassandra added support for compare-and-swap operations via lightweight
transactions using an implementation of the Paxos protocol. Datomic leverages this mechanism to manage a small number of keys per database that require coordinated access, while the bulk of a database's content is written as immutable data to a quorum of replicas in the cluster.
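The conditional-write pattern Datomic relies on for those coordinated keys can be pictured with a plain compare-and-swap loop. The sketch below is a JDK-only simulation of conditionally advancing a database root pointer, not the actual Datomic or Cassandra code; the class and value names are invented for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CasRootPointer {
    // Advance the root pointer only if no other writer got there first,
    // analogous to a lightweight-transaction conditional update in Cassandra.
    static boolean advance(AtomicReference<String> root, String expected, String next) {
        return root.compareAndSet(expected, next);
    }

    public static void main(String[] args) {
        // Simulates the single coordinated key: a pointer to the current database root.
        AtomicReference<String> root = new AtomicReference<>("root-v1");
        boolean first = advance(root, "root-v1", "root-v2");  // succeeds
        boolean stale = advance(root, "root-v1", "root-v3");  // loses the race: pointer moved
        System.out.println(first + " " + stale + " " + root.get());  // true false root-v2
    }
}
```

Only this one pointer needs coordinated access; everything it points to is immutable, so the bulk of the data can be written without any such coordination.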

Cassandra support requires Apache Cassandra 2.0.2 or newer, or an equivalent distribution. Native CQL protocol support must be enabled. Cross-data center deployments are not supported. Cassandra internal security is supported, but optional.

This release represents preliminary support based on requests from users. We are very interested in feedback.

For instructions on configuring Cassandra for use with Datomic, see Setting Up Storage Services.

Monday, November 25, 2013

Using IAM Roles with Datomic on AWS

With today's Datomic release, you can use IAM roles to manage permissions when running in AWS.

Motivation

Datomic's AWS support has been designed according to the principle of least privilege.  When running in AWS, a Datomic transactor or peer needs only the minimum permissions necessary to communicate with various AWS services.  These permissions are documented in Setting Up Storage Services.

But you still need some way to install these minimal permissions on ephemeral virtual hardware. Early versions of AWS left this problem to the developer.  Solutions were tedious and ad hoc, but more importantly, they were risky.  Leaving every application developer the task of passing credentials around is a recipe for credentials lying around in a hundred different places (or even checked into source code repositories).

IAM roles provide a generic solution to this problem.  From the FAQ: "An IAM role allows you to delegate access, with defined permissions, to trusted entities without having to share long term access keys" (emphasis added).  From a developer's perspective, IAM roles get credentials out of your application code.

Implementation

Starting with version 0.9.4314, Datomic supports IAM roles as the default mechanism for conveying credentials in AWS.  What does this mean for developers?
  1. If you are configuring Datomic for the first time, the setup instructions will secure peers and transactors using IAM roles. 
  2. If you have an existing Datomic installation and want to upgrade to roles, Migrating to IAM Roles will walk you through the process.
  3. Using explicit credentials in transactor properties and in connection URIs is deprecated, but will continue to work.  Your existing deployments will not break.
IAM roles make your application both easier to manage and more secure.  Use them.

Friday, November 8, 2013

Datomic Pro Starter Edition

We are happy to announce today the release of Datomic Pro Starter Edition, enabling the use of Datomic for small production deployments at no cost.

Datomic Pro Starter Edition includes most of the benefits of Datomic Pro:
  • Support for all storages
  • A perpetual license with 12 months of updates included
  • Support for the full Datomic programming model
  • Datomic Console included with download
Datomic Pro Starter Edition features community support, and does not include:
  • High Availability transactor support
  • Integrated memcached
  • Running more than 3 processes (2 peers + transactor)
To get started, register and download Datomic Pro Starter Edition.

Datomic Pro Starter Edition lets your team build a fully operational system and deploy to production with no additional steps or costs.

Tuesday, October 29, 2013

Datomic Console

[Update: Watch the intro video.]

The Datomic Console is a graphical UI for exploring Datomic databases.



It supports exploring schema, building and executing queries, navigating entities, examining transaction history, and walking raw indexes.  The Datomic Console is included in Datomic Pro, and is available as a separate download for Datomic Free users.

Exploring Schema

The upper left corner of the console displays a tree view of the attributes defined for the current database.



Query

The Query tab provides two synchronized views of queries: a graphical builder, and the equivalent textual representation.



You can see the results of a query in the Dataset pane on the lower right.


Entities

The Entities tab provides a tree view of an entity, plus the ability to drill in to related entities.


Transactions

The Transactions tab provides a graphical view of the history of your database at scales ranging from days down to seconds.



When you zoom in, the specific datoms in a transaction are displayed in the dataset pane.

Indexes

The Indexes tab allows you to browse ranges within a Datomic index, displaying results in the dataset pane.


And More

This post only scratches the surface; see the full docs for more details.  You can save arbitrary datasets, giving them a name for reuse in subsequent queries.  And, of course, you can use Datomic's time features to work with as-of, since, and historical views of your data.

Friday, October 11, 2013

The Transaction Report Queue

Summary: Functional databases such as Datomic eliminate the burden of deciding what information a downstream observer needs.  Just give 'em everything, lazily.

The Transaction Report Queue

In Datomic, you can monitor all transactions.  Any peer process in the system can request a transaction report queue of every transaction against a particular database.

TxReportWatcher is a simple example of this.  It watches a particular attribute in the database, and prints the entity id and value whenever a change to that attribute appears.  The essence of the code is only a few lines of Java:

final BlockingQueue queue = conn.txReportQueue();

while (true) {
    final Map tx = queue.take();
    Collection results = q(byAttribute, tx.get(TX_DATA), attrid);
    for (Iterator it = results.iterator(); it.hasNext(); ) {
        printList(it.next());
    }
}

There are several things to note here:
  • The Datomic query function q is used to query the transaction data, showing that the full power of the database query language is available while handling transaction events.
  • The TX_DATA map key points to all the data added to the database by this particular transaction.
  • Everything is made of generic data structures accessible from any JVM language: queues, collections, maps, and lists.  (There is no ResultSet API.)
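The take-and-filter loop can be simulated without a running transactor. The sketch below uses only JDK collections; the "tx-data" key and the triple format are hypothetical stand-ins for Datomic's TX_DATA entry and datoms, and the filter method stands in for the Datomic query shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TxReportSketch {
    // Filter a transaction's [entity, attribute, value] triples by attribute,
    // standing in for the Datomic query over TX_DATA.
    static List<List<Object>> byAttribute(List<List<Object>> datoms, String attr) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> d : datoms) {
            if (d.get(1).equals(attr)) out.add(d);
        }
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Map<String, Object>> queue = new LinkedBlockingQueue<>();

        // A fake transaction report: a map whose "tx-data" entry lists the
        // datoms added by the transaction (hypothetical key name).
        queue.put(Map.of("tx-data", List.of(
                List.of(42L, ":user/email", "john@example.com"),
                List.of(42L, ":user/firstName", "John"))));

        Map<String, Object> tx = queue.take();  // blocks until a report arrives
        @SuppressWarnings("unchecked")
        List<List<Object>> datoms = (List<List<Object>>) tx.get("tx-data");
        for (List<Object> d : byAttribute(datoms, ":user/email")) {
            System.out.println(d.get(0) + " " + d.get(2));  // entity id and value
        }
    }
}
```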

Context Matters

How much information does a transaction observer need in order to take useful action?  An easy but naive answer is "just the TX_DATA".

But when you move past toy systems and blog examples, context matters.  For example, when a user places an order in a system, you might want to take different actions based on 
  • that user's order history
  • the current inventory status
  • time limited promotions
  • that user's relation to other users
It is impossible to anticipate in advance the context you might need.  But if you don't provide enough information, observers will have to go back and ask for more.  The biggest risk here is that such asking introduces tighter coupling through the need for coordination: going back to the database with questions that must be coordinated with the time of the event.  Unnecessary coordination is a cause of complexity (and an enemy of scalability).

Is there another way?  You bet!  If you have a functional database, where points in time are immutable values, then you can make the entire database available.  Datomic provides exactly this. In addition to the TX_DATA key, the DB_AFTER key points to the entire database as of the completion of the transaction.  And the DB_BEFORE key points to the entire database immediately before the transaction started.  Because both the "before" and "after" copies of the database are immutable values, no coordination with other processes is required, ever.

A Common Misconception

Developers often raise an objection to this approach:  "Oh, I see, this approach is limited to tiny databases that can fit in memory and be passed around."  Not at all.  Because they are immutable, Datomic databases can be lazily realized, pulling into memory only the parts that are needed.

Moreover, Datomic's indexes provide leverage over your data. Queries do not have to realize the entire database, they can use just the data needed.  Datomic indexes provide leverage for "row", "column", "document", and "graph" access styles, so a wide variety of workloads are efficient.  Different peers will "cache over" their working sets automatically, without you having to plan in advance which machine needs which data.

Composition FTW

Datomic's transaction report queue makes it possible for any peer to observe and respond to transactions, with complete access to all necessary context, and without any coordination with database writes.  Transaction reports are a simple building block for scalable systems.

Monday, July 1, 2013

Datomic MusicBrainz sample database

MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public. We are pleased to release a sample project that uses the MusicBrainz dataset to help people get familiar with using Datomic.
The MusicBrainz dataset makes a great example database for learning, evaluating, or testing Datomic for a couple of reasons:
  • It deals with a domain with which nearly everyone is familiar
  • It is of decent size: 60,438 labels; 664,226 artists; 1,035,592 album releases; and 13,233,625 recorded tracks
  • It comprises a good number of entities, attributes, and relationships
  • It is fun to play with, query, and explore

Schema

The mbrainz-sample schema is an adaptation of a subset of the full MusicBrainz schema. We didn't include some entities, and we made some simplifying assumptions and combined some entities. In particular:
  • We omit any notion of Work
  • We combine Track, Tracklist and Recording into simply "track"
  • We rename Release Group to "abstractRelease"

Abstract Release vs. Release vs. Medium

(Adapted from the MusicBrainz schema docs)
An "abstractRelease" is an abstract "album" entity (e.g. "The Wall" by Pink Floyd). A "release" is something you can buy in your music store (e.g. the 1984 US vinyl release of "The Wall" by Columbia, as opposed to the 2000 US CD release by Capitol Records).
Therefore, when you query for releases e.g. by name, you may see duplicate releases. To find just the "work of art" level album entity, query for abstractRelease.
The media are the physical components comprising a release (disks, CDs, tapes, cartridges, piano rolls). One medium will have several tracks, and the total tracks across all media represent the track list of the release.

Relationship Diagram


Entities

For information about the individual entities and their attributes, please see the schema page in the wiki, or the EDN schema itself.

Getting Started

First get Datomic, and start up a transactor.

Getting the Data

Next download the mbrainz backup:

    # 2.8 GB, md5 4e7d254c77600e68e9dc71b1a2785c53
    wget http://s3.amazonaws.com/mbrainz/datomic-mbrainz-backup-20130611.tar
and extract:
    # this takes a while
    tar -xvf datomic-mbrainz-backup-20130611.tar
Finally, restore the backup:
    # takes a while, but prints progress -- ~150,000 segments in restore
    bin/datomic restore-db file:datomic-mbrainz-backup-20130611 datomic:free://localhost:4334/mbrainz

Getting the Code

Clone the git repo somewhere convenient:
    git clone git@github.com:Datomic/mbrainz-sample.git
    cd mbrainz-sample

Running the examples

From Java

Fire up your favorite IDE, and configure it to use both the included pom.xml and the following Java options when running:

    -Xmx2g -server

From Clojure

Start up a Clojure REPL:
    # from the root of the mbrainz-sample repo
    lein repl
Then connect to the database and run the queries.

Thanks

We would like to thank the MusicBrainz project for defining and compiling a great dataset, and for making it freely available.

Wednesday, June 19, 2013

Component Entities

This post demonstrates Datomic's component entities, and highlights a new way to create components available in today's release.  You can follow along in the code via the sample project.

The code examples use Groovy, a JVM language that combines similarity to Java with concision.  If you are a Java developer new to Groovy, you may want to read this first.

Why Components?

In a database, some entities have their own identities, and others exist only as part of a larger parent entity.  In Datomic, the latter entities are called components, and are reached from the parent via an attribute whose schema includes :db/isComponent true.

As a familiar example, consider orders, line items, and products.  Orders have references to line items, and those references are through a component attribute, since line items have no independent existence outside of an order.  Line items, in turn, have references to products.  References to products are not component references, because products exist regardless of whether or not they are part of any particular order.

The schema for a line item component reference looks like this:

{:db/id #db/id[:db.part/db]
 :db/ident :order/lineItems
 :db/isComponent true
 :db/valueType :db.type/ref
 :db/cardinality :db.cardinality/many
 :db.install/_attribute :db.part/db}

Notice also that line items are :db.cardinality/many, since a single order can have many of them.

Component attributes gain three special abilities in Datomic:
  • you can create components via nested maps in a transaction (new in 0.8.4020)
  • touching an entity recursively touches all its components
  • :db.fn/retractEntity recursively deletes all its components
Each of these abilities is demonstrated below.

Creating Components

To demonstrate line item components, let's create an order for some chocolate and whisky.  First, here is a query for products ?e matching a particular description ?v:

productQuery = '''[:find ?e
                   :in $ ?v
                   :where [?e :product/description ?v]]''';

Now, we can query for the products we want to order:

(chocolate, whisky) = ['Expensive Chocolate', 'Cheap Whisky'].collect {
  q(productQuery, conn.db(), it)[0][0];
}
===> [17592186045454, 17592186045455]

The statement above uses Groovy's multiple assignment to assign chocolate to the first query result, and whisky to the second.

Now that we have some products, we can create an order with some line items. As of today's release, you can do this via nested maps:

order = [['order/lineItems': [['lineItem/product': chocolate,
                               'lineItem/quantity': 1,
                               'lineItem/price': 48.00],
                              ['lineItem/product': whisky,
                               'lineItem/quantity': 2,
                               'lineItem/price': 38.00]],
          'db/id': tempid()]];

The nested maps above expand into two subentities.  Notice that you do not need to create a tempid for the nested line items -- they will be auto-assigned tempids in the same partition as the parent order.
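The shape of that expansion can be sketched in plain Java. The real expansion is performed by the peer library; the negative tempid numbering and the flat [entity, attribute, value] output below are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class NestedMapExpansion {
    private static final AtomicLong tempids = new AtomicLong(-1000);  // made-up tempid scheme

    // Expand a parent's component attribute with nested child maps into flat
    // [entity, attribute, value] additions: each child gets an auto-assigned
    // tempid plus a reference datom from the parent.
    static List<List<Object>> expand(long parentId, String attr, List<Map<String, Object>> children) {
        List<List<Object>> datoms = new ArrayList<>();
        for (Map<String, Object> child : children) {
            long childId = tempids.getAndDecrement();      // auto-assigned tempid
            datoms.add(List.of(parentId, attr, childId));  // parent -> component ref
            child.forEach((a, v) -> datoms.add(List.of(childId, a, v)));
        }
        return datoms;
    }

    public static void main(String[] args) {
        expand(-1L, "order/lineItems",
                List.of(Map.of("lineItem/quantity", 1),
                        Map.of("lineItem/quantity", 2)))
            .forEach(System.out::println);
    }
}
```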

The order above is pure data (a list of maps). This greatly facilitates development, testing, and composition.  When we are ready to put the data in the database, the transaction is as simple as:

conn.transact(order).get();

Touching Components

Now we can query to find the order we just created.  To demonstrate that query can reach anywhere within your data, we will do a multiway join to find the order via product description:

ordersByProductQuery = '''
[:find ?e
 :in $ ?productDesc
 :where [?e :order/lineItems ?item]
        [?item :lineItem/product ?prod]
        [?prod :product/description ?productDesc]]''';

The query above joins

  • from the provided productDesc input to the product entity ?prod
  • from ?prod to the order item ?item
  • from ?item to the order ?e
and returns ?e.

We are going to immediately pass ?e to Datomic's entity API, so let's take a moment to create a Groovy closure qe that automates query + get entity:


qe = { query, db, Object[] more ->
  db.entity(q(query, db, *more)[0][0])
}


Now we can find an order that includes chocolate:


order = qe(ordersByProductQuery, db, 'Expensive Chocolate');


Because the Datomic database is an immutable value in your own address space, entities can be lazily realized.  When you first look at the order, you won't see any attributes at all:


===> {:db/id 17592186045457}


The touch API will realize all the immediate attributes of the order, plus it will recursively realize any components:


order.touch();
===> {:order/lineItems #{{:lineItem/product #, 
                          :lineItem/price 38.00M, 
                          :lineItem/quantity 2, 
                          :db/id 17592186045459} 
                         {:lineItem/product #, 
                          :lineItem/price 48.00M, 
                          :lineItem/quantity 1, 
                          :db/id 17592186045458}}, 
      :db/id 17592186045457}


Notice that the line items are immediately realized, and you can see all their attributes.  However, the products are not immediately realized, since they are not components.   You can, of course, touch them yourself if you want.

Retracting Components

I am not as hungry or thirsty as I thought.  Let's retract that order, using Datomic's :db.fn/retractEntity:


conn.transact([[":db.fn/retractEntity", order[":db/id"]]]).get();


Retracting an entity will retract all its subcomponents, in this case the line items.  To see that the line items are gone, we can count all the line items in our database:


q('''[:find (count ?e)
      :where [?e :order/lineItems]]''',
  db);
===> []


References to non-components will not be retracted.  The products are all still there:


q('''[:find (count ?e)
      :where [?e :product/description]]''',
  db);
===> [[2]]

Conclusion


Components allow you to create substantial trees of data with nested maps, and then treat the entire tree as a single unit for lifecycle management (particularly retraction).  All nested items remain visible as first-class targets for query, so the shape of your data at transaction time does not dictate the shape of your queries.  This is a key value proposition of Datomic when compared to row, column, or document stores.

Wednesday, June 12, 2013

Using Datomic from Groovy, Part 1: Everything is Data

In this post, I will demonstrate transacting and querying against Datomic from Groovy.  The examples shown here are based on the following schema, for a simple social news application:


There are a number of more in-depth samples in the datomic-groovy-examples project on Github.

Why Groovy

Groovy offers four key advantages for a Java programmer using Datomic:
  • Groovy provides interactive development through groovysh, the Groovy shell.  When combined with Datomic's dynamic, data-driven style, this makes it easy to interactively develop code in real time.  The source code for this post has a number of other examples designed for interactive study within the Groovy shell.
  • Groovy's collection literals make it easy to see your data.  Lists and maps are as easy as:
aList = ['John', 'Doe'];
aMap = [firstName: 'John', 
        lastName: 'Doe'];
  • Groovy's closures make it easy to write functions, without the noise of single-method interfaces and anonymous inner classes. For instance, you could grab all the lastNames from a collection of maps
lastNames = people.collect { it['lastName'] }

  • Of the popular expressive languages that target the JVM, Groovy's syntax is most similar to Java's.

Transactions

A Datomic transaction takes a list of data to be added to the database, and returns a future map describing the results. The simplest possible transaction is a list of one sublist that adds an atomic fact, or datom, to the database, using the following shape:

conn.transact([[op, entityId, attributeName, value]]);

The components above are:
  • op is a keyword naming the operation. :db/add adds a datom, and :db/retract retracts a datom.
  • entityId is the numeric id of an entity.  You can use tempid when creating a new entity.
  • attributeName is a keyword naming an attribute.
  • value is the value of an attribute.  The allowed types for an attribute include numerics, strings, dates, URIs, UUIDs, binaries, and references to other entities.
Keywords are names, prefixed with a colon, possibly with a leading namespace prefix separated by the slash char, e.g.

:hereIsAnUnqualifiedName
:the.namespace.comes.before/theName

Putting this all together, you might add a new user's first name with:

conn.transact([[':db/add', newUserId, ':user/firstName', 'John']]);

If you are adding multiple datoms about the same entity, you can use a map instead of a list, with the special keyword :db/id identifying the entity.  For example, the following two transactions are equivalent:

// create an entity with two attributes (map form)
conn.transact([[':db/id': newUserId,
                ':user/firstName': 'John',
                ':user/lastName': 'Doe']]);

// create an entity with two attributes (list form)
conn.transact([[':db/add', newUserId, ':user/firstName', 'John'],
               [':db/add', newUserId, ':user/lastName', 'Doe']]);

Let's look next at composing larger transactions out of smaller building blocks. You have already seen creating a user:

newUser = [[':db/id': newUserId,
            ':user/email': 'john@example.com',
            ':user/firstName': 'John',
            ':user/lastName': 'Doe']];

Notice that this time we did not call transact yet; instead, we just stored data describing the user into newUser.

Now imagine that you have a collection of story ids in hand, and you want to create a new user who upvotes those stories.   Groovy's collect method iterates over a collection, transforming values using a closure with a default single parameter named it. We can use collect to build new assertions that refer to each story in a collection of storyIds:

upvoteStories = storyIds.collect {
  [':db/add', newUserId, ':user/upVotes', it]
}

Now we are ready to build a bigger transaction out of the pieces.  Because transactions are made of data, we don't need a special API for this.  Groovy already has an API for concatenating lists, called +:

conn.transact(upvoteStories + newUser);
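The same composition works from plain Java, since transactions are just lists and maps. The sketch below builds the tx data only; the ids are hypothetical stand-ins for real tempids, and the result is what you would hand to conn.transact(...).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ComposeTx {
    // Build tx data for a new user who upvotes the given stories,
    // composed from smaller pieces by plain list concatenation.
    static List<Object> buildTx(Object newUserId, long[] storyIds) {
        // One :db/add per story id, like the Groovy collect above.
        List<Object> upvotes = new ArrayList<>();
        for (long storyId : storyIds) {
            upvotes.add(List.of(":db/add", newUserId, ":user/upVotes", storyId));
        }
        // The new user as a map, keyed by the special :db/id.
        List<Object> newUser = List.of(Map.of(
                ":db/id", newUserId,
                ":user/email", "john@example.com"));

        List<Object> tx = new ArrayList<>(upvotes);
        tx.addAll(newUser);  // composition needs no special API
        return tx;
    }

    public static void main(String[] args) {
        System.out.println(buildTx("new-user-tempid", new long[]{100L, 101L}).size());  // 3
    }
}
```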

Building Datomic transactions from data has many advantages over imperative or object-oriented approaches:
  • Composition is automatic, and requires no special API.
  • ACID transactionality is scoped to transaction calls, and does not require careful management across separate calls to the database.
  • Because they are data, Datomic transactions are flexible across system topology changes: they can be built offline for later use, serialized, and/or enqueued.

Query

The Datomic query API is named q, and it takes a query plus one or more inputs. The simple query below takes a query plus a single input, the database db, and returns the id of every entity in the system with an email address:

q('''[:find ?e 
      :where [?e :user/email]]''', db);

Keyword constants in the :where clause constrain the results.  Here, :user/email constrains results to only those entities possessing an email.  

Symbols preceded by a question mark are variables, and will be populated by the query engine.  The variable ?e will match every entity id associated with an email.

A query always returns a set of lists, and the :find clause specifies the shape of lists to return.  In this example, the lists are of size one since a single variable ?e is specified by :find.

Note that the query argument to q is notionally a list.  As a convenience, you can pass the query argument either as a list or (as shown here) as an edn string literal.

The next query further constrains the result, to find a specific email address:

q('''[:find ?e
      :in $ ?email
      :where [?e :user/email ?email]]''',
  db, 'editor@example.com');

There are several things to see here.  There are now two inputs to the query: the database itself, and the specific email "editor@example.com" we are looking for.  Since there is more than one input, the inputs must be named by an :in clause.  The :in clause names inputs in the order they appear:
  1. $ is Datomic shorthand for a single database input. 
  2. ?email is bound to the scalar "editor@example.com".
Inputs need not be scalar. The shape [?varname ...] in an :in clause is called a collection binding form, and it binds a collection instead of a single value. The following query looks up two different users by email:

q('''[:find ?e
      :in $ [?email ...]
      :where [?e :user/email ?email]]''',
  db, ['editor@example.com', 'stuarthalloway@datomic.com']);

Another way to join is by having more than one constraint in a :where clause.  Whenever a variable appears more than once, it must match the same set of values in all the locations that it appears.  The following query joins through ?user to find all the comments for a user:

q('''[:find ?comment
      :in $ ?email
      :where [?user :user/email ?email]
             [?comment :comment/author ?user]]''',
  db, 'editor@example.com')

We have only scratched the surface here.  Datomic's query also supports rules, predicates, function calls, cross-database queries (with joins!), aggregates, and even queries against ordinary Java collections without a database.  In fact, the Datalog query language used in Datomic supports a superset of the capabilities of the relational algebra that underpins SQL.

Conclusion

In this installment, you have seen the powerful, compositional nature of programming with generic data.  In Part 2, we will look at the database as a value, and explore the implications of having a lazy, immutable database inside your application process.

Monday, June 3, 2013

Sync


Background

Datomic's approach to updating peers uses a push model. Rather than have every read request route to the same server in order to get consistent data, data is stored immutably, and as soon as there is new information, all peers are notified. This completely eliminates polling any server. Thus, contrary to common presumption, when you ask the connection for the db value, there is no network communication involved: you are immediately given the local value of the db about which the connection was most recently informed.

Everyone sees a valid, consistent view. You can never see partial transactions, corruption/regression of timelines, causal anomalies, etc. Datomic is always 'business rules' valid, and causally consistent.

Motivation

That does not mean that every peer sees the same thing simultaneously. Just as in the real world, it is never the case that everyone sees the same thing "at the same time" in a live distributed system.  There is no inherent shared truth, as you might convey a message to me about X at the speed of light but I can only perceive X at the speed of sound. Thus, I know X is coming, but I might have to wait for it.

This means that some peer A might commit a transaction and tell B about it before B is informed via the normal channels. This is an interesting case, as it has to do with perception and propagation delays. It is not a question of consistency, it is a question of communication synchronization.

It comes up when you would like to read-your-own-writes via other peers (e.g. when a client hits different peer servers via a load balancer), and when there is out-of-band communication of writes (A tells B about its write before the transactor does).

Tools

We've added a new sync API to help you manage these situations.

The first form of sync takes a basis point (T). It returns a future that will be fulfilled with a version of the db that includes point T. This does not cause any additional interaction with the transactor - the future will be filled by the normal communication on the update channels. But it saves you from having to poll for arrival. Most often, you will already have the requested T, and the future will complete immediately. This is the preferred method to use if you have any ability to convey the basis T, either in the message from A to B, or e.g. in cookies as a client hits different peers using a load balancer. You can easily get the basis T for any db value you have in hand.

The second form of sync takes no arguments, and works via 'ping' of the transactor. It promises not to return until all transactions that have been acknowledged by the transactor at the time sync was called have arrived at this peer. Thus if A has successfully committed a transaction and told B about it, and B then calls sync(), the database returned by sync will include A's transaction.
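Both forms can be pictured with a basis-T watcher. The sketch below is a JDK-only simulation, not the Datomic API: a peer tracks the latest basis T it has seen, and sync(t) returns a future that completes as soon as that point has arrived via the normal update channel.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class SyncSketch {
    private long basisT = 0;
    private final List<Waiter> waiters = new ArrayList<>();

    record Waiter(long t, CompletableFuture<Long> future) {}

    // sync(t): future completes once this peer has seen basis point t.
    synchronized CompletableFuture<Long> sync(long t) {
        CompletableFuture<Long> f = new CompletableFuture<>();
        if (basisT >= t) f.complete(basisT);   // already there: completes immediately
        else waiters.add(new Waiter(t, f));    // filled later by the update channel
        return f;
    }

    // Called as transactions arrive on the normal update channel.
    synchronized void onTransaction(long newT) {
        basisT = newT;
        for (Iterator<Waiter> it = waiters.iterator(); it.hasNext(); ) {
            Waiter w = it.next();
            if (basisT >= w.t()) { w.future().complete(basisT); it.remove(); }
        }
    }

    public static void main(String[] args) {
        SyncSketch peer = new SyncSketch();
        CompletableFuture<Long> f = peer.sync(5);  // B waits for the basis T that A conveyed
        System.out.println(f.isDone());            // false: not yet arrived
        peer.onTransaction(5);
        System.out.println(f.join());              // 5
    }
}
```

The zero-argument form differs only in how the target T is chosen: it asks the transactor for the latest acknowledged T and then waits for that point, rather than taking T from the caller.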

Conclusion

While these synchronization tools are powerful, make sure you use them only when necessary. The Datomic defaults were designed to leverage the inherent parallelism possible given immutable, accretion-only semantics and distributed storage. Notifications to peers are sent at the same time as the acknowledgement to the peer submitting the transaction, and thus are as 'simultaneous' as network communication can be. The sync tools need only be utilized to enforce cross-peer causal relationships.

Thursday, May 16, 2013

A Whirlwind Tour of Datomic Query

Introduction

This tour is to help those new to Datomic understand Datomic's built-in datalog by providing a simple domain and schema, and by walking through some use cases.  For a more complete treatment of Datomic's query capabilities, please take a look at the documentation.

Reading the Code

To follow along in code, you can download Datomic Free Edition at http://downloads.datomic.com/free.html, and the sample project at https://github.com/datomic/day-of-datomic. The code examples should be executed interactively, in the order presented in the article, from a Clojure REPL. The complete code is in the query_tour.clj file. The => prefix indicates responses that are printed by the REPL. The use of ellipsis (...) in output indicates that a larger result has been truncated for brevity.

A Simple Schema

The example queries that follow work against a simplified schema that you might use for a social news database, containing users, stories, and comments:

Get Connected

In order to get a database connection, you will need to require namespaces for the Datomic API and for the news application, then setup a sample database.

Listing 1: Getting Connected

(require
  '[datomic.api :as d]
  '[datomic.samples.news :as news])

(def uri "datomic:mem://news")
(def conn (news/setup-sample-db-1 uri))

Get a Database Value

The connection from the previous step is already populated with a schema and sample data, but before you query against it, you must get the current value of the database. Syntactically this is trivial:

Listing 2: Getting a Database Value

(def db (d/db conn))

While the syntax of this step is trivial, the semantic implications are deep. Datomic queries do not happen "over there" in some database process, they happen here, in your process's memory space.

A first query

Users in the system are identified by a :user/email attribute. Let's find all of them:

Listing 3: Finding All Users

(d/q '[:find ?e
       :where [?e :user/email]]
      db)
  
=> #{[17592186045424] [17592186045425]}

The large integers are entity ids. Entity ids are auto-assigned, so you may not see the same values on your system. There are several things to note here:
  • As query is the most commonly used function in Datomic, it has the terse name d/q.
  • The first argument is a query expression. The :where clause indicates what you want to find, "those entities ?e that have a :user/email attribute." The :find clause tells which variables to return.
  • The second argument is the database value we obtained in the previous step.

A simple join

Rather than finding all users, let's query for a particular user. This requires joining two query inputs: the database value, and the email you are seeking:

Listing 4: Finding a Specific User

(d/q '[:find ?e
       :in $ ?email
       :where [?e :user/email ?email]]
     db
     "editor@example.com")
  
=> #{[17592186045425]}

Notice that there are now three arguments to the query: the query expression plus the two inputs. Also, since there are two inputs, there is now an :in clause to name the inputs in the order they appear: $ binds the database, and ?email binds "editor@example.com". Names starting with $ name databases, and names starting with ? name variables.

A database join

The previous join of a database to a single prebound variable may not even look like a join to you – in many databases this operation would be called a parameterized query. So let's look at a more traditional join that leverages indexes in the database to associate more than one "type" of entity. The following query will find all of the editor's comments:

Listing 5: Finding a User's Comments

(d/q '[:find ?comment
       :in $ ?email
       :where [?user :user/email ?email]
              [?comment :comment/author ?user]]
     db
     "editor@example.com")
=> #{[17592186045451]
     [17592186045450]
     ... }

Notice that the :where clause now has two data patterns: one associating users to their emails, and another associating comments to the user that authored them. Because the variable ?user appears in both clauses, the query joins on ?user, finding only comments made by the editor.

Aggregates

You can obtain aggregates by changing the :find clause to include an aggregating function. Instead of finding the editor's comments, let's just count them:

Listing 6: Returning an Aggregate

(d/q '[:find (count ?comment)
       :in $ ?email
       :where [?user :user/email ?email]
              [?comment :comment/author ?user]]
     db
     "editor@example.com")

=> [[10]]

More joins

Queries can make many joins. The following query joins comments to their referents, and joins the referents to :user/email to find any comments that are about people:

Listing 7: Multiple Joins

(d/q '[:find (count ?comment)
       :where [?comment :comment/author]
              [?commentable :comments ?comment]
              [?commentable :user/email]]
      db)
=> []

No results is good news, because you don't want to let things get personal by allowing people to comment on other people.

Schema-aware joins

You cannot comment on people, but what kinds of things can you comment on?

The three slots you have seen in data patterns so far are entity, attribute, and value. Up to this point, the entity and value positions have typically been variables, but the attribute has always been constant.
That need not be the case. Here is a query that finds all the attributes of entities that have been commented on:

Listing 8: A Schema Query

(d/q '[:find ?attr-name
       :where [?ref :comments]
              [?ref ?attr]
              [?attr :db/ident ?attr-name]]
     db)
=> #{[:story/title]
     [:comment/body]
     [:story/url]
     [:comment/author]
     [:comments]}

Schema entities are ordinary entities, like any other data in the system. Rather than returning their entity ids (?attr in the query above), you can join through :db/ident to find the programmatic identifiers that name each attribute. Judging from these attribute names, there are comments about stories and comments about other comments, exactly what you would expect given the schema.

Entities

In client/server databases where query happens in another process, it is often critical to get the answer and all of its details in a single query. (If you spread the work across multiple steps, you run the risk of the data changing between steps, leading to inconsistent results.)

Since Datomic performs queries in-process, against an immutable database value, it is feasible to decompose the work of query into steps. One very common decomposition is to use a query to find entities, and then to use Datomic's entity API to navigate to the relevant details about those entities.
Let's try it with the editor. First, use a query to find the editor's entity id:

Listing 9: Finding an Entity ID

(def editor-id (->> (d/q '[:find ?e
                           :in $ ?email
                           :where [?e :user/email ?email]]
                         db
                         "editor@example.com")
                    ffirst))

Notice that the query returns only the ?e, not any particular attribute values.
Now, you can call d/entity, passing the database value and the entity id to get the editor entity.

Listing 10: Getting an Entity

(def editor (d/entity db editor-id))

When you first look at an entity, it appears to be a mostly empty map:

Listing 11: A Lazy Entity

editor
=> {:db/id 17592186045425}

That is because entities are lazy. Their attribute values will appear once you ask for them:

Listing 12: Requesting an Attribute

(:user/firstName editor)
=> "Edward"

If you are feeling more eager, you can touch an entity to immediately realize all of its attributes:

Listing 13: Touching an Entity

(d/touch editor)
=> {:user/firstName "Edward",
    :user/lastName "Itor",
    :user/email "editor@example.com",
    :db/id 17592186045425}

Are Entities ORM?

No. Entities are used for some of the same purposes that you might use an ORM, but their capabilities are quite different. Entities differ from ORM objects in that the conversion between raw datoms and entities is entirely a mechanical process. There is never any configuration, and relationships are always available and (lazily!) navigable.

Entities also differ from most ORM objects in that relationships can be navigated in either direction. In the previous code example, you saw how the d/touch method would automatically navigate all outbound relationships from an entity. However, you can also navigate inbound relationships, by following the convention of prefixing attribute names with an underscore. For example, a user's comments happen to be modeled as a relationship from the comment to the editor. To reach these comments from the editor entity, you can navigate the :comment/author attribute backwards:

Listing 14: Navigating Backwards

(-> editor :comment/_author)
=> [{:db/id 17592186045441}
    {:db/id 17592186045443}
    ... ]

This process can, of course, be extended as far as you like, e.g. the following example navigates to all the comments people have made on the editor's comments:

Listing 15: Navigating Deeper

(->> editor :comment/_author (mapcat :comments))
=> ({:db/id 17592186045448}
    {:db/id 17592186045450}
    ...)

Time travel

Update-in-place databases can tell you about the present, but most businesses also need to know about the past. Datomic provides this by allowing you to take a value of the database as of a certain point in time.

Given any datom, there are three time-related pieces of data you can request:
  • the transaction entity tx that created the datom
  • the relative time, t, of the transaction
  • the clock time :db/txInstant of the transaction
The transaction entity is available as a fourth optional component of any data pattern. The following query finds the transaction that set the current value for the editor's first name:

Listing 16: Querying for a Transaction

(def txid (->> (d/q '[:find ?tx
                      :in $ ?e
                      :where [?e :user/firstName _ ?tx]]
                    db
                    editor-id)
               ffirst))

Given a transaction id, the d/tx->t function returns the system-relative time that the transaction happened.

Listing 17: Converting Transaction to T

(d/tx->t txid)
=> 1023

Relative time is useful for "happened-before" type questions, but sometimes you want to know the actual wall clock time. This is stored once per transaction, as the :db/txInstant property of the transaction entity:

Listing 18: Getting a Tx Instant

(-> (d/entity (d/db conn) txid) :db/txInstant)
=> #inst "2013-02-20T16:27:11.788-00:00"

Given a t, tx, or txInstant value, you can travel to that point in time with d/as-of. The example below goes back to just before the first name "Edward" was introduced, to see its past value:

Listing 19: Going Back in Time

(def older-db (d/as-of db (dec txid)))
  
(:user/firstName (d/entity older-db editor-id))
=> "Ed"

The example above shows an as-of value of the database being used as an argument to d/entity, but as-of values can be used anywhere a current database value can be used, including as an argument to query.

Auditing

While as-of is useful for looking at a moment in the past, you may also want to perform time-spanning queries. This is particularly true in audit scenarios, where you might want a complete history report of some value.

To perform such queries, you can use the d/history view of a database, which spans all of time. The following query shows the entire history of the editor's :user/firstName attribute:

Listing 20: Querying Across All Time

(def hist (d/history db))

(->> (d/q '[:find ?tx ?v ?op
            :in $ ?e ?attr
            :where [?e ?attr ?v ?tx ?op]]
          hist
          editor-id
          :user/firstName)
     (sort-by first))

=> ([13194139534319 "Ed" true]
    [13194139534335 "Ed" false]
    [13194139534335 "Edward" true])

The optional fifth field in a :where data pattern, named ?op in the example above, matches true if a datom is being asserted, or false if it is being retracted.

Transaction 13194139534319 set the editor's first name to "Ed", and transaction 13194139534335 set it to "Edward". The cardinality of :user/firstName is one, which means that the system will only permit one value of that attribute per entity at any given time. To enforce this constraint, Datomic will automatically retract past values where necessary. Thus the transaction that asserts "Edward" also includes a retraction for the previous value "Ed".

Everything is Data

One of Datomic's design goals is "put declarative power in the hands of developers." Having queries that run in your process goes a long way to meet this objective, but what if you have data that is not in a Datomic database?

Datomic's query engine can run without a database, against arbitrary data structures in your application.
The following example shows the query you began with, this time running against a plain Java list in memory:

Listing 21: Querying Plain Java Data

(d/q '[:find ?e
       :where [?e :user/email]]
     [[1 :user/email "jdoe@example.com"]
      [1 :user/firstName "John"]
      [2 :user/email "jane@example.com"]])
=> #{[1] [2]}

The idea behind POJOs (plain old Java objects) is exactly right, and Datomic encourages you to take it one step further, to plain old lists and maps.

Friday, May 10, 2013

Excision

Motivation

It is a key value proposition of Datomic that you can tell not only what you know, but how you came to know it.  When you add a fact:

conn.transact(list(":db/add", 42, ":firstName", "John"));

Datomic does more than merely record that 42's first name is "John".  Each datom is also associated with a transaction entity, which records the moment (:db/txInstant) the datom was recorded.


Given these reified transactions, it is possible to track the history of information.  Let's say John decides he prefers to go by "Jack":

conn.transact(list(":db/add", 42, ":firstName", "Jack"));

When you assert a new value for a cardinality-one attribute such as :firstName, Datomic will automatically retract any past value (cardinality-one means that you cannot have two first names simultaneously).  So now the database looks like this:
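In terms of raw datoms, the result can be sketched like this. The entity id 42 comes from the transactions above; tx1 and tx2 are placeholders for the two transaction entity ids, and the final field indicates assertion versus retraction:

```clojure
;; [e  a          v      tx  added?]
[42 :firstName "John" tx1 true]   ; tx1 asserts "John"
[42 :firstName "John" tx2 false]  ; tx2 retracts "John"...
[42 :firstName "Jack" tx2 true]   ; ...and asserts "Jack"
```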

Given this information model, it is easy to see that Datomic can support queries that tell you:
  • what you know now
  • what you knew at some point in the past
  • how and when you came to know any particular datom
So far so good, but there is a fly in the ointment.  In certain situations you may be forced to excise data, pulling it out root and branch and forgetting that you ever knew it.  This may happen if you store data that must comply with privacy or IP laws, or you may have a regulatory requirement to keep records for seven years and then "shred" them.  For these scenarios, Datomic provides excision.

Excision in Datomic

You can request excision of data by transacting a new entity with the following attributes:
  • :db/excise is required, and refers to a target entity or attribute to be excised. Thus there are two scenarios - excise all or part of an entity, or excise some or all of the values of a particular attribute.
  • :db.excise/attrs is an optional, cardinality-many, reference attribute that limits an excision to a set of attributes, useful only when the target of excise is an entity. (If :db.excise/attrs are not specified, then all matching attributes will be excised.)
  • :db.excise/beforeT is an optional, long-valued attribute that limits an excision to only datoms whose t is before the specified beforeT, which may be a t or tx id. This can be used with entity or attribute targets.
  • :db.excise/before is an optional, instant-valued attribute that limits an excision to only datoms whose transaction time is before the specified before. This can be used with entity or attribute targets.

Example: Excising Specific Entities

To excise a specific entity, manufacture a new entity with a :db/excise attribute pointing to that entity's id.  For example, if user 42 requests that his personal data be removed from the system, the transaction data would be:

[{:db/id #db/id[db.part/user],
  :db/excise 42}]

Since :db.excise/attrs is not specified in the transaction data above, all datoms about entity 42 will be excised.
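Had user 42 instead asked to remove only selected pieces of his data, you could limit the excision with :db.excise/attrs. This is a sketch; the attribute names :user/email and :user/firstName are illustrative:

```clojure
;; Excise only the :user/email and :user/firstName datoms of entity 42,
;; leaving its other attributes intact.
[{:db/id #db/id[db.part/user],
  :db/excise 42,
  :db.excise/attrs [:user/email :user/firstName]}]
```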

Example: Excising a Window in Time

To excise old values of a particular attribute, you can create an excision for the attribute you want to eliminate, and then limit the excision using either before or beforeT.  Imagine tracking application events that have users, categories, and details.  Your application produces a ton of events, but you don't care about the old ones.  Here is a transaction that will excise all the pre-2012 events:

  [{:db/id #db/id[db.part/user],
    :db/excise :event/user
    :db.excise/before #inst "2012"}
   {:db/id #db/id[db.part/user],
    :db/excise :event/category
    :db.excise/before #inst "2012"}
   {:db/id #db/id[db.part/user],
    :db/excise :event/description
    :db.excise/before #inst "2012"}]

Remembering That You Forgot

It is a key value proposition of Datomic that you can tell not only what you know, but how you came to know it.  This seems to be at odds with excision: if you remember what you forgot, then you didn't really forget it!

You cannot remember what you forgot, but you can remember that you forgot.  Excise attributes are ordinary attributes in the database, and you can query them.  The following query would tell you if datoms about entity 42 have ever been excised:

[:find ?e :where [?e :db/excise 42]]

Once you find those entities, you can of course use the entity API to navigate to the specific attribute and before filters of the excisions.
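As a sketch, assuming a Clojure peer with datomic.api required as d and a connection conn, that navigation might look like:

```clojure
;; Find excision entities that targeted entity 42, then realize their
;; attribute and time filters with the entity API.
(let [db (d/db conn)]
  (for [[eid] (d/q '[:find ?e :where [?e :db/excise 42]] db)]
    (let [excision (d/entity db eid)]
      {:target  (:db/excise excision)
       :attrs   (:db.excise/attrs excision)
       :before  (:db.excise/before excision)
       :beforeT (:db.excise/beforeT excision)})))
```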

Excise attributes are protected from excision, so you cannot erase your tracks.  (Other important datoms such as schema are also protected, see the documentation for full details.)

Handle With Care

Excision is different from any other operation in Datomic.  While excision requests are transactions, excision itself is not transactional.  Excision will happen on the next indexing job.

Excision is permanent and unrecoverable.  Take a backup before performing significant excisions, and use excision only when your domain requires that you deliberately forget certain data.