Thursday, March 29, 2018

Important Security Update For free: and dev: Storage Protocols

Release 0.9.5697 fixes two security vulnerabilities in Datomic On-Prem transactors running the free: or dev: storage protocol.

For more information, see the release announcement.

The Datomic team would like to thank Caio Vargas, Matheus Bernardes, and Nubank for reporting this issue.

Monday, February 19, 2018

Access Control in Datomic Cloud

In this article, we will look at Datomic access control, covering
  • authentication and authorization
  • network-level access
  • how the bastion works
  • some example deployments

Authentication and Authorization

Datomic integrates with AWS Identity and Access Management (IAM) for authentication and authorization, via a technique that we'll call S3 proxying. Here's how it works:

Every Datomic permission has a hierarchical name. For example, read-only access to database Jan is named access/dbs/db/Jan/read.

Permission names have a 1-1 correspondence with keys in the Datomic system S3 bucket.

The Datomic client signs requests using AWS's Signature Version 4. But instead of using your IAM credentials directly, the Datomic client uses your IAM credentials to retrieve a signing key from S3.

Thus, IAM read permissions of S3 paths act as proxies for Datomic permissions. As a result, you can use all of the ordinary IAM tools (roles, groups, users, policies, etc.) to authorize use of Datomic.

After decades of experience with racing to log into new servers to change the admin password, we think that this "secure by default" approach is pretty cool. But that is not the end of the story, as clients must also have network-level access to Datomic.

Network-Level Access

Datomic Cloud is designed to be accessed by applications running inside a VPC, and (unlike a service!) is never exposed to the Internet. You must make an explicit choice to access Datomic. You could:

  • run an EC2 instance in the Datomic VPC and in the Datomic applications security group
  • peer another VPC with the Datomic VPC 
  • configure a VPN Connection

Each of these approaches has its place for application access, and I will say more about them in a future article. For easy access to Datomic from a developer's laptop we offer the bastion.

How the Bastion Works

The bastion is a dedicated machine with one job only: to enable developer access to Datomic. When you turn the bastion on, you get a barebones AWS Linux instance that does exactly one thing: forwards SSH traffic to your Datomic system.

To connect through the bastion:
  1. run the Datomic socks proxy script on your local machine
  2. add a proxy port argument when creating a system client
  3. the Datomic client sees the proxy port argument and connects to the socks proxy 
  4. the socks proxy forwards encrypted SSH traffic to the bastion 
  5. the bastion forwards Datomic client protocol traffic to Datomic

Access to the bastion is secured using the same IAM + S3 proxying technique used earlier for auth. The bastion has an auto-generated, ephemeral private key that is stored in S3 and secured by IAM.

The bastion is dynamic, and you can turn it on or off at any time. And the client support means that the bastion is entirely transparent to your code, which differs only in the argument used to create the client.
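
As a rough sketch of step 2 above, a client configuration with a proxy port might look like the following. The region, system name, endpoint, and port number are placeholders rather than values from this article, and the exact configuration keys should be checked against the client documentation for your release.

```clojure
(require '[datomic.client.api :as d])

;; Hypothetical values: substitute your own region and system name.
(def cfg
  {:server-type :cloud
   :region      "us-east-1"
   :system      "my-system"
   :query-group "my-system"
   :endpoint    "http://entry.my-system.us-east-1.datomic.net:8182/"
   ;; The only bastion-specific setting: the local port
   ;; where the socks proxy script is listening.
   :proxy-port  8182})

(def client (d/client cfg))
```

Code that runs inside the VPC would presumably use the same map without the :proxy-port entry.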

A Concrete Example

As a concrete example, here is how the Datomic team configures access for some of our Datomic systems:

Our 'ci' system is dedicated to continuous integration, supporting several dozen Jenkins projects. The ci system contains no sensitive data, only canned and generated examples. The ci system runs the Solo Topology with the bastion enabled to allow access by automated tests.

Our 'devdata' system contains non-sensitive data used by the development team (think departmental and sample apps). The devdata system runs the Solo Topology with the bastion enabled to allow access by developers.

Our 'applications' system supports applications and contains real-world data. The applications system is reachable only by deployed application code, and needs to be highly available. So the applications system uses the Production Topology with the bastion disabled, and also uses fine-grained IAM permissions to limit applications to the databases they need.


Datomic is secure by default, integrating directly with AWS IAM and VPC capabilities. The bastion makes it easy for developers to get connected, so you can be up and transacting on a new system in minutes.

To learn more, check out

Or just dive in and get started.

Wednesday, January 24, 2018

High Availability in Datomic Cloud

Based on my rigorous polling, these are the top five questions people have about HA in Datomic Cloud:
  • How does it work?
  • What do I have to do?
  • Hey wait a minute! What about serializability?
  • Is AWS cool?
  • So what else have you got?

How Does It Work?

In the Production Topology, Datomic cluster nodes sit behind an Application Load Balancer (ALB). As any node can handle any request, there is no single point of failure. The starting cluster size of two ensures that nodes remain available in the face of a single node failure. By increasing the cluster size beyond two, you both enhance availability and increase the number of queries the system can handle. An AutoScaling Group (ASG) monitors nodes and automatically replaces nodes that fail health checks.

What Do I Have To Do?

Nothing. HA is automatic.

Hey Wait a Minute! What About Serializability?

Datomic is always transactional, fully serialized, and consistent in both the ACID and CAP senses. Don't waste your life writing code to compensate for partial failures and subtle concurrency bugs when you could be making your application better and shipping it faster.

So how does that square with shared-nothing cluster nodes? The answer is simple: The nodes use DynamoDB to serialize all writes per database.

At any point in time a database has a preferred node for transactions. In normal operation all transactions for a database will flow to/through that node. If for any reason (e.g. a temporary network partition) the preferred node can't be reached, any node can and will handle transactions. Consistency is ensured by conditional writes to DynamoDB. If a node becomes unreachable, Datomic will choose a new preferred node.

Note in particular that:
  1. This is not a master/follower system like Datomic On-Prem and many other databases – nobody is tracking mastership and there are no failover intervals.
  2. This should not be confused with parallel multi-writer systems such as Cassandra. Write availability is governed by the availability of DynamoDB conditional writes and strongly-consistent reads.
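
To make the idea of a conditional write concrete, here is a toy sketch in plain Clojure. It is not Datomic's implementation: an atom stands in for a DynamoDB item, and the names log-tail and try-commit! are invented for illustration. The point is only that a write succeeds when the writer's view of the log is current, so two nodes can never both commit the same position.

```clojure
;; Illustration only: an atom plays the role of a DynamoDB item
;; that records the latest committed position for one database.
(def log-tail (atom {:t 0 :writer nil}))

(defn try-commit!
  "Attempts to advance the log from expected-t to expected-t + 1.
  Returns true if this node won the conditional write, false otherwise."
  [node expected-t]
  (let [current @log-tail]
    (and (= expected-t (:t current))
         (compare-and-set! log-tail
                           current
                           {:t (inc expected-t) :writer node}))))

(try-commit! :node-a 0) ; succeeds, the log is now at t = 1
(try-commit! :node-b 0) ; fails, node-b had a stale view and must re-read
```

A node that loses the race re-reads the current state and retries, which is why no separate mastership tracking is needed in this sketch.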

Is AWS Cool?

Very cool. Datomic Cloud showcases the benefit of designing for AWS vs. porting to AWS, and there is a lot going on behind zeroconf HA.

So What Else Have You Got?

Not all systems need HA. You can prototype a Datomic Cloud system with the (non-HA) Solo Topology for about $1/day.  The topology differences are entirely abstracted away from clients and applications, so you can easily upgrade to the Production Topology later.

For More Information:

Check out the docs, or just jump right in.

Wednesday, January 17, 2018

Datomic Cloud

Datomic on AWS: Easy, Integrated, and Powerful

We are excited to announce the release of Datomic Cloud, making Datomic more accessible than ever before.

Datomic Cloud is a new product intended for greenfield development on AWS. If you are not yet targeting the cloud, check out what customers are saying about the established line of Datomic On-Prem products (Datomic Pro and Enterprise).

Datomic Cloud is accessible through the latest release of the Datomic Client API.

We would love your feedback! Come and join us on the new developer forum.

Datomic Cloud on the AWS Marketplace

Tuesday, December 5, 2017

Datomic Pull :as

Datomic's Pull API provides a declarative way to make hierarchical and nested selections of information about entities.  The 0.9.5656 release enhances the Pull API with a new :as clause that provides control over the returned keys.

As an example, imagine that you want information about Led Zeppelin's tracks from the mbrainz dataset. The following pull pattern navigates to the artist's tracks, using limit to return a single track:

;; pull expression
'[[:track/_artists :limit 1]]

=> #:track{:_artists
           [#:db{:id 17592188757937}]}

The entity id 17592188757937 is not terribly interesting, so you can use a nested pull pattern to request the track name instead:

;; pull pattern
'[{[:track/_artists :limit 1] [:track/name]}]

=> #:track{:_artists [#:track{:name "Black Dog"}]}

That is better, but what if you want different key names? This can happen for reasons including:

  • you are targeting an environment that does not support symbolic names, so you need a string instead of a keyword key
  • you do not want to expose the direction of navigation (e.g. the underscore in :track/_artists)
  • your consumers are expecting a different name

The :as option lets you rename result keys to arbitrary values that you provide, and works at any level of nesting in a pull pattern. The pattern below uses :as twice to rename the two keys in the result:

;; pull expression
'[{[:track/_artists :limit 1 :as "Tracks"]
   [[:track/name :as "Name"]]}]

=> {"Tracks" [{"Name" "Black Dog"}]}
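
For completeness, here is one way the pattern above might be passed to the peer library's pull function. The function name first-track is invented for this sketch, and both arguments are assumed to be supplied by the caller: db is a database value from the mbrainz dataset, and artist is an entity id or lookup ref for Led Zeppelin.

```clojure
(require '[datomic.api :as d])

;; The pattern from the example above, unchanged.
(def track-pattern
  '[{[:track/_artists :limit 1 :as "Tracks"]
     [[:track/name :as "Name"]]}])

(defn first-track
  "Pulls one track for the given artist, with renamed result keys.
  db is a database value; artist is an entity id or lookup ref."
  [db artist]
  (d/pull db track-pattern artist))
```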

To try it out you can grab the latest release, review the Pull grammar, and work through these examples at the REPL.

Thursday, March 23, 2017

New Datomic Training Videos and Getting Started Documentation

We are excited to announce the release of a new set of Day of Datomic training videos!
Filmed at Clojure/Conj in Austin, TX in December of 2016, this series covers everything from the architecture and data model of Datomic to operation and scaling considerations.

The new training sessions provide a great foundation for developing a Datomic-based system. For those of you who have watched the original Day of Datomic videos, the series released today uses the new Datomic Client library for the examples and workshops, so if you haven't yet explored Datomic Clients, now is the perfect opportunity to do so!

If you ever want to refer back to the original Peer-based training videos, don't worry - they're all still available as well.

In addition to an updated Day of Datomic, we've released a fully re-organized and re-written Getting Started section in the Datomic Documentation. We have gathered and incorporated feedback from new and existing users and hope that the new Getting Started is a much more comprehensive and accessible introduction to Datomic.

We look forward to your thoughts and feedback. If you have any comments on the new training videos, the new getting started section, or any additional thoughts, please let us know!

Wednesday, January 25, 2017

The Ten Rules of Schema Growth

Data outlives code, and a valuable database supports many applications over time. These ten rules will help grow your database schema without breaking your applications.

1.  Prod is not like dev.

Production is not development. In production, one or more codebases depend on your data, and the ten rules in this article should be followed exactingly.

A dev environment can be much more relaxed.  Alone on your development machine experimenting with a new feature, you have no users to break.  You can soften the rules, so long as you harden them when transitioning to production.

2.  Grow your schema, and never break it.

The lack of common vocabulary makes it all too easy to automate the wrong practices. I will use the terms growth and breakage as defined in Rich Hickey's Spec-ulation talk.  In schema terms:

  • growth is providing more schema
  • breakage is removing schema, or changing the meaning of existing schema.

In contrast to these terms, many people use "migrations", "refactoring", or "evolution". These usages tend to focus on repeatability, convenience, and the needs of new programs, ignoring the distinction between growth and breakage. The problem here is obvious: Breakage is bad, so we don't want it to be more convenient!

Using precise language underscores the costs of breakage. Most migrations are easily categorized as growth or breakage by considering the rules below.  Growth migrations are suitable for production, and breakage migrations are, at best, a dev-only convenience. Keep them widely separate.

3. The database is the source of truth.

Schema growth needs to be reproducible from one environment to another.  Reproducibility supports the development and testing of new schema before putting it into production and also the reuse of schema in different databases. Schema growth also needs to be evident in the database itself, so that you can determine what the database has, what it needs, and when growth occurred.

For both of these reasons, the database is the proper source of truth for schema growth. When the database is the source of truth, reproducibility and auditability happen for free via the ordinary query and transaction capabilities of the database.  (If your database is not up to the tasks of queries and transactions, you have bigger problems beyond the scope of this article.)
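
In Datomic, for instance, a question like "when did each name enter the schema?" can be asked with an ordinary query. The sketch below uses the peer library; the function name ident-history is invented here, and db is assumed to be a database value supplied by the caller.

```clojure
(require '[datomic.api :as d])

(defn ident-history
  "Returns a set of [ident instant] tuples: each named entity
  together with the time of the transaction that asserted its name."
  [db]
  (d/q '[:find ?ident ?when
         :where
         [?e :db/ident ?ident ?tx]      ; ?tx is the transaction that added the name
         [?tx :db/txInstant ?when]]     ; every transaction records its instant
       db))
```

The result also includes the names that Datomic installs itself, so a real audit would likely filter by namespace.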

Storing schema in a database is strictly more powerful than storing schema as text files in source control. The database is the actual home for schema, plus it provides validation, structure, query, transactions, and history. A source control system provides only history and is separate from the data itself.

Note that this does not mean "never put schema information in source control". Source control may be convenient for other reasons, e.g. it may be more readily accessible. You may redundantly store schema in source control, but remember that the database is definitive.

4.  Growing is adding.

As you acquire more information about your domain, grow your schema to match. You can grow a schema by adding new things, and only by adding new things, for example:

  • adding new attributes to an existing 'type'
  • adding new types
  • adding relationships between types
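
As a small Datomic-flavored illustration of the first bullet, the transaction data below adds one attribute to an existing 'user' type. The attribute :user/nickname is hypothetical and chosen only for the example. Nothing that already exists is touched, so programs that do not know about the new attribute are unaffected.

```clojure
;; Transaction data: pure addition, no existing schema is changed.
[{:db/ident       :user/nickname
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one
  :db/doc         "Hypothetical attribute: an optional display name."}]
```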

5.  Never remove a name.

Removing a named schema component at any level is a breaking change for programs that depend on that name. Never remove a name.

6.  Never reuse a name.

The meaning of a name is established when the name is first introduced. Reusing that name to mean something substantially different breaks programs that depend on that meaning. This can be even worse than removing the name, as the breakage may not be as immediately obvious.

7.  Use aliases.

If you are familiar with database refactoring patterns, the advice in Rules Five and Six may seem stark. After all, one purpose of refactoring is to adopt better names as we discover them. How can we do that if names can never be removed or changed in meaning?

The simple solution is to use more than one alias to refer to the same schema entity. Consider the following example:

  • In iteration 1, users of your system are identified by their email with an attribute named :user/id
  • In iteration 2, you discover that users sometimes have non-email identifiers and that you want to store a user's email even when not using the email as an identifier. In short, you wish that :user/id was named :user/primary-email.

No problem! Just create :user/primary-email as an alias for :user/id. Older programs can continue to use :user/id, and newer programs can use the now-preferred :user/primary-email.
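
In Datomic, one way to get this effect is to assert a second :db/ident on the existing attribute. My understanding of Datomic's schema-change behavior is that the old ident continues to resolve to the same attribute, but treat that as an assumption and verify it against the documentation for your release.

```clojure
;; Transaction data: give the attribute currently named :user/id
;; the additional, now-preferred name :user/primary-email.
[{:db/id    :user/id
  :db/ident :user/primary-email}]
```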

8.  Namespace all names.

Namespaces greatly reduce the cost of getting a name wrong, as the same local name can safely have different meanings in different namespaces.  Continuing the previous example, imagine that the local name id is used to refer to a UUID in several namespaces, e.g. :inventory/id, :order/id, and so on. The fact that :user/id is not a UUID is inconsistent, and newer programs should not have to put up with this.

Namespaces let you improve the situation without breaking existing programs. You can introduce :user-v2/id, and new programs can ignore names in the user namespace. If you don't like v2, you can also pick a more semantic name for the new namespace.
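
Sketched as Datomic transaction data, the new name might be introduced like this. The choice of a unique identity attribute is my assumption about what an id should be, not something this article prescribes.

```clojure
;; Transaction data: a new attribute in a new namespace.
;; :user/id is left exactly as it was.
[{:db/ident       :user-v2/id
  :db/valueType   :db.type/uuid
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity
  :db/doc         "UUID identifier for a user, consistent with :order/id and friends."}]
```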

9.  Annotate your schema.

Databases are good at storing data about your schema. Adding annotations to your schema can help both human readers and programs make sense of how the schema grew over time. For example:

  • you could annotate names that are not recommended for new programs with a :schema/deprecated flag, or you could get fancier still with :schema/deprecated-at or :schema/deprecated-because. Note that such deprecated names are still never removed (Rule Five).
  • you could provide :schema/see-also or :schema/see-instead pointers to more current conventions. 
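
As a sketch of how such annotations could look in Datomic, the data below first defines two annotation attributes and then applies them. The value types are my own guesses at a reasonable design, and the example assumes that :user/id and the :user-v2/id attribute mentioned under Rule 8 both already exist. The two vectors must be submitted as separate transactions, because an attribute has to be installed before it can be used.

```clojure
;; Transaction 1: install the annotation attributes themselves.
(def annotation-schema
  [{:db/ident       :schema/deprecated
    :db/valueType   :db.type/boolean
    :db/cardinality :db.cardinality/one
    :db/doc         "True if new programs should avoid this name."}
   {:db/ident       :schema/see-instead
    :db/valueType   :db.type/ref
    :db/cardinality :db.cardinality/one
    :db/doc         "Points at the name that new programs should prefer."}])

;; Transaction 2: annotate an existing attribute.
;; Note that :user/id is not removed (Rule Five).
(def deprecate-user-id
  [{:db/id              :user/id
    :schema/deprecated  true
    :schema/see-instead :user-v2/id}])
```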

In fact, all the database refactoring patterns that are typically implemented as breaking changes could be implemented non-destructively, with the refactoring details recorded as an annotation. For example, the breaking "split column" refactoring might instead be implemented as schema growth:

  • add N new columns
  • (optional) add a :schema/split-into attribute on the original column whose value is the new columns, and possibly even the recipe for the split

10. Plan for accretion.

If a system is going to grow at all, then programs must not bake in limiting presumptions.  For example: If a schema states that :user/id is a string, then programs can rely on :user/id being a string and not occasionally an integer or a boolean.  But a program cannot assume that a user entity will be limited to the set of attributes previously seen, or that it understands the semantics of attributes that it has not seen before.
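
A tiny Clojure sketch of the difference, using an invented function name and invented sample data: the function below reads the attribute it understands and simply ignores anything else, so it keeps working when the entity later grows.

```clojure
(defn user-summary
  "Uses only the attribute this program understands.
  Unknown attributes are ignored rather than rejected."
  [user]
  {:id (:user/id user)})

;; The entity has grown since this code was written,
;; and the code neither knows nor cares.
(user-summary {:user/id       "jan@example.com"
               :user/nickname "Jan"})
```

A program that instead checked that the entity's keys were exactly #{:user/id} would break the moment the schema grew.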

Are these rules specific to a particular database?

No. These rules apply to almost any SQL or NoSQL database.  The rules even apply to the so-called "schemaless" databases.  A better word for schemaless is "schema-implicit", i.e. the schema is implicit in your data and the database has no reified awareness of it.  With an implicit schema, all the rules still apply, except that the database is impotent to help you (no Rule 3).

In Context

Many of the resources on migrations, refactoring, and database evolution emphasize repeatability and the needs of new programs, without making the top-level distinctions of growth vs. breakage and prod vs. dev. As a result, these resources encourage breaking the rules in this article.

Happily, these resources can easily be recast in growth-only terms.  You can grow your schema without breaking your app. You can continuously deploy without continuously propagating breakage.  Here's what it looks like in Datomic.