Wednesday, June 19, 2013

Component Entities

This post demonstrates Datomic's component entities, and highlights a new way to create components available in today's release.  You can follow along in the code via the sample project.

The code examples use Groovy, a JVM language that combines similarity to Java with concision.  If you are a Java developer new to Groovy, you may want to read this first.

Why Components?

In a database, some entities are their own identities, and others exist only as part of a larger parent entity.  In Datomic, the latter entities are called components, and are reached from the parent via an attribute whose schema includes :db/isComponent true

As a familiar example, consider orders, line items, and products.  Orders have references to line items, and those references are through a component attribute, since line items have no independent existence outside of an order.  Line items, in turn, have references to products.  References to products are not component references, because products exist regardless of whether or not they are part of any particular order.

The schema for a line item component reference looks like this:

{:db/id #db/id[:db.part/db]
 :db/ident :order/lineItems
 :db/isComponent true
 :db/valueType :db.type/ref
 :db/cardinality :db.cardinality/many
 :db.install/_attribute :db.part/db}

Notice also that line items are :db.cardinality/many, since a single order can have many of them.

Component attributes gain three special abilities in Datomic:
  • you can create components via nested maps in a transaction (new in 0.8.4020)
  • touching an entity recursively touches all its components
  • :db.fn/retractEntity recursively deletes all its components
Each of these abilities is demonstrated below.

Creating Components

To demonstrate line item components, let's create an order for some chocolate and whisky.  First, here is a query for products ?e matching a particular description ?v:

productQuery = '''[:find ?e
                   :in $ ?v
                   :where [?e :product/description ?v]]''';

Now, we can query for the products we want to order:

(chocolate, whisky) = ['Expensive Chocolate', 'Cheap Whisky'].collect {
  q(productQuery, conn.db(), it)[0][0];
}
===> [17592186045454, 17592186045455]

The statement above uses Groovy's multiple assignment to assign chocolate to the first query result, and whisky to the second.

Now that we have some products, we can create an order with some line items. As of today's release, you can do this via nested maps:

order = [['order/lineItems': [['lineItem/product': chocolate,
                               'lineItem/quantity': 1,
                               'lineItem/price': 48.00],
                              ['lineItem/product': whisky,
                               'lineItem/quantity': 2,
                               'lineItem/price': 38.00]],
          'db/id': tempid()]];

The nested maps above expand into two subentities.  Notice that you do not need to create a tempid for the nested line items -- they will be auto-assigned tempids in the same partition as the parent order.

The order above is pure data (a list of maps). This greatly facilitates development, testing, and composition.  When we are ready to put the data in the database, the transaction is as simple as:

conn.transact(order).get();

Touching Components

Now we can query to find the order we just created.  To demonstrate that query can reach anywhere within your data, we will do a multiway join to find the order via product description:

ordersByProductQuery = '''
[:find ?e
 :in $ ?productDesc
 :where [?e :order/lineItems ?item]
        [?item :lineItem/product ?prod]
        [?prod :product/description ?productDesc]]''';

The query above joins

  • from the provided productDesc input to to the product entity ?prod
  • from ?prod to the order item ?item
  • from ?item to the order ?e
and returns ?e.

We are going to immediately pass ?e to datomic's entity API, so let's take a moment to create a Groovy closure qe that automates query + get entity:


qe = { query, db, Object[] more ->
  db.entity(q(query, db, *more)[0][0])
}


Now we can find an order the includes chocolate:


order = qe(ordersByProductQuery, db, 'Expensive Chocolate');


Because the Datomic database is an immutable value in your own address space, entities can be lazily realized.  When you first look at the order, you won't see any attributes at all:


===> {:db/id 17592186045457}


The touch API will realize all the immediate attributes of the order, plus it will recursively realize any components:


order.touch();
===> {:order/lineItems #{{:lineItem/product #, 
                          :lineItem/price 38.00M, 
                          :lineItem/quantity 2, 
                          :db/id 17592186045459} 
                         {:lineItem/product #, 
                          :lineItem/price 48.00M, 
                          :lineItem/quantity 1, 
                          :db/id 17592186045458}}, 
      :db/id 17592186045457}


Notice that the line items are immediately realized, and you can see all their attributes.  However, the products are not immediately realized, since they are not components.   You can, of course, touch them yourself if you want.

Retracting Components

I am not as hungry or thirsty as I thought.  Let's retract that order, using Datomic's :db.fn/retractEntity:


conn.transact([[":db.fn/retractEntity", order[":db/id"]]]).get();


Retracting an entity will retract all its subcomponents, in this case the line items.  To see that the line items are gone, we can count all the line items in our database:


q('''[:find (count ?e)
      :where [?e :order/lineItems]]''',
  db);
===> []


References to non-components will not be retracted.  The products are all still there:


q('''[:find (count ?e)
      :where [?e :product/description]]''',
  db);
===> [[2]]

Conclusion


Components allow you to create substantial trees of data with nested maps, and then treat the entire tree as a single unit for lifecycle management (particularly retraction).  All nested items remain visible as first-class targets for query, so the shape of your data at transaction time does not dictate the shape of your queries.  This is a key value proposition of Datomic when compared to row, column, or document stores.

Wednesday, June 12, 2013

Using Datomic from Groovy, Part 1: Everything is Data

In this post, I will demonstrate transacting and querying against Datomic from Groovy.  The examples shown here are based on the following schema, for a simple social news application:


There are a number of more in-depth samples in the datomic-groovy-examples project on Github.

Why Groovy

Groovy offers four key advantages for a Java programmer using Datomic:
  • Groovy provides interactive development through groovysh, the Groovy shell.  When combined with Datomic's dynamic, data-driven style, this makes it easy to interactively develop code in real time.  The source code for this post has a number of other examples designed for interactive study within the Groovy shell.
  • Groovy's collection literals make it easy to see your data.  Lists and maps are as easy as:
aList = ['John', 'Doe'];
aMap = [firstName: 'John', 
        lastName: 'Doe'];
  • Groovy's closures make it easy to write functions, without the noise of single-method interfaces and anonymous inner classes. For instance, you could grab all the lastNames from a collection of maps
lastNames = people.collect { it['lastName'] }

  • Of the popular expressive languages that target the JVM, Groovy's syntax is most similar to Java's.

Transactions

A Datomic transaction takes a list of data to be added to the database, and returns a future map describing the results. The simplest possible transaction is a list of one sublist that adds an atomic fact, or datom, to the database, using the following shape:

conn.transact([[op, entityId, attributeName, value]]);

The components above are:
  • op is a keyword naming the operation. :db/add adds a datom, and :db/retract retracts a datom.
  • entityId is the numeric id of an entity.  You can use tempid when creating a new entity.
  • attributeName is a keyword naming an attribute.
  • value is the value of an attribute.  The allowed types for an attribute include numerics, strings, dates, URIs, UUIDs, binaries, and references to other entities.
Keywords are names, prefixed with a colon, possibly with a leading namespace prefix separated by the slash char, e.g.

:hereIsAnUnqualifiedName
:the.namespace.comes.before/theName

Putting this all together, you might add a new user's first name with:

conn.transact([[':db/add', newUserId, ':user/firstName', 'John']]);

If you are adding multiple datoms about the same entity, you can use a map instead of a list, with the special keyword :db/id identifying the entity.  For example, the following two transactions are equivalent:

// create an entity with two attributes (map form)
conn.transact([[':db/id': newUserId,
                ':user/firstName': 'John',
                ':user/lastName': 'Doe']]);

// create an entity with two attributes (list form)
conn.transact([[':db/add' newUserId, ':user/firstName', 'John'],
               [':db/add' newUserId, ':user/lastName', 'Doe']]);

Let's look next at composing larger transactions out of smaller building blocks. You have already seen creating a user:

newUser = [[':db/id': newUserId,
            ':user/email': 'john@example.com',
            ':user/firstName': 'John',
            ':user/lastName': 'Doe']];

Notice that this time we did not call transact yet, instead we just stored data describing the user into newUser.   

Now imagine that you have a collection of story ids in hand, and you want to create a new user who upvotes those stories.   Groovy's collect method iterates over a collection, transforming values using a closure with a default single parameter named it. We can use collect to build new assertions that refer to each story in a collection of storyIds:

upvoteStories = storyIds.collect {
  [':db/add', newUserId, ':user/upVotes', it]
}

Now we are ready to build a bigger transaction out of the pieces.  Because transactions are made of data, we don't need a special API for this.  Groovy already has an API for concatenating lists, called +:

conn.transact(upvoteAllStories + newUser);

Building Datomic transactions from data has many advantages over imperative or object-oriented approaches:
  • Composition is automatic, and requires no special API.
  • ACID transactionality is scoped to transaction calls, and does not require careful management across separate calls to the database.
  • Because they are data, Datomic transactions are flexible across system topology changes: they can be built offline for later use, serialized, and/or enqueued.

Query

The Datomic query API is named q, and it takes a query plus one or more inputs. The simple query below takes a query plus a single input, the database db, and returns the id of every entity in the system with an email address:

q('''[:find ?e 
      :where [?e :user/email]]''', db);

Keyword constants in the :where clause constrain the results.  Here, :user/email constrains results to only those entities possessing an email.  

Symbols preceded by a question mark are variables, and will be populated by the query engine.  The variable ?e will match every entity id associated with an email.

A query always returns a set of lists, and the :find clause specifies the shape of lists to return.  In this example, the lists are of size one since a single variable ?e is specified by :find.

Note that the query argument to q is notionally a list.  As a convenience, you can pass the query argument as either a list or (as shown here) as an edn string literal.

The next query further constrains the result, to find a specific email address:

q('''[:find ?e
      :in $ ?email
      :where [?e :user/email ?email]]''',
  db, 'editor@example.com');

There are several things to see here.  There are now two inputs to the query: the database itself, and the specific email "editor@example.com" we are looking for.  Since there is more than one input, the inputs must be named by an :in clause.  The :in clause names inputs in the order they appear:
  1. $ is Datomic shorthand for a single database input. 
  2. ?email is bound to the scalar "editor@example.com".
Inputs need not be scalar. The shape [?varname ...] in an :in clause is called a collection binding form, and it binds a collection instead of a single value. The following query looks up two different users by email:

q('''[:find ?e
      :in $ [?email ...]
      :where [?e :user/email ?email]]''',
  db, ['editor@example.com', 'stuarthalloway@datomic.com']);

Another way to join is by having more than one constraint in a :where clause.  Whenever a variable appears more than once, it must match the same set of values in all the locations that it appears.  The following query joins through ?user to find all the comments for a user:

q('''[:find ?comment
      :in $ ?email
      :where [?user :user/email ?email]
             [?comment :comment/author ?user]]''',
  db, 'editor@example.com')

We have only scratched the surface here.  Datomic's query also supports rules, predicates, function calls, cross-database queries (with joins!), aggregates, and even queries against ordinary Java collections without a database.  In fact, the Datalog query language used in Datomic supports a superset of the capabilities of the relational algebra that underpins SQL.

Conclusion

In this installment, you have seen the powerful, compositional nature of programming with generic data.  In Part 2, we will look at the database as a value, and explore the implications of having a lazy, immutable database inside your application process.

Monday, June 3, 2013

Sync


Background

Datomic's approach to updating peers uses a push model. Rather than have every read request route to the same server in order to get consistent data, data is stored immutably, and as soon as there is new information, all peers are notified. This completely eliminates polling any server. Thus, contrary to common presumption, when you ask the connection for the db value, there is no network communication involved: you are immediately given the local value of the db about which the connection was most recently informed.

Everyone sees a valid, consistent view. You can never see partial transactions, corruption/regression of timelines, causal anomalies etc. Datomic is always 'business rules' valid, and causally consistent.

Motivation

That does not mean that every peer sees the same thing simultaneously. Just as in the real world, it is never the case that everyone sees the same thing "at the same time" in a live distributed system.  There is no inherent shared truth, as you might convey a message to me about X at the speed of light but I can only perceive X at the speed of sound. Thus, I know X is coming, but I might have to wait for it.

This means that some peer A might commit a transaction and tell B about it before B is informed via the normal channels. This is an interesting case, as it has to do with perception and propagation delays. It is not a question of consistency, it is a question of communication synchronization.

It comes up when you would like to read-your-own-writes via other peers (e.g. when a client hits different peer servers via a load balancer), and when there is out-of-band communication of writes (A tells B about its write before the transactor does).

Tools

We've added a new sync API to help you manage these situations.

The first form of sync takes a basis point (T). It returns a future that will be fulfilled with a version of the db that includes point T. This does not cause any additional interaction with the transactor - the future will be filled by the normal communication on the update channels. But it saves you from having to poll for arrival. Most often, you will already have the requested T, and the future will complete immediately. This is the preferred method to use if you have any ability to convey the basis T, either in the message from A to B, or e.g. in cookies as a client hits different peers using a load balancer. You can easily get the basis T for any db value you have in hand.

The second form of sync takes no arguments, and works via 'ping' of the transactor. It promises not to return until all transactions that have been acknowledged by the transactor at the time sync was called have arrived at this peer. Thus if A has successfully committed a transaction and told B about it, and B then calls sync(), the database returned by sync will include A's transaction.

Conclusion

While these synchronization tools are powerful, make sure you use them only when necessary. The Datomic defaults were designed to leverage the inherent parallelism possible given immutable, accretion-only semantics and distributed storage. Notifications to peers are sent at the same time as the acknowledgement to the peer submitting the transaction, and thus are as 'simultaneous' as network communication can be. The sync tools need only be utilized to enforce cross-peer causal relationships.