Friday, November 2, 2012

Riak and Couchbase Support


We are pleased to announce today preliminary support for two new storage services for Datomic: Riak and Couchbase.

Riak is an elastically scalable, distributed, redundant, and highly available key-value store in the Dynamo model. It is a great option for Datomic users who want to run such a system on their own premises (vs. e.g. DynamoDB on AWS). Because Riak supports only eventual consistency at this time, a Datomic system running on Riak also utilizes Apache ZooKeeper, a highly available coordination service. Datomic uses ZooKeeper for transactor failover coordination, and for the handful of keys per database that need to be updated with CAS (compare-and-swap). The bulk of the data is kept in Riak, immutably, and leverages all of Riak's redundancy and availability characteristics.

Couchbase is an elastically scalable, distributed document store. Like Riak, it supports redundant storage. Unlike Riak, it offers consistency and CAS, trading off for a more conventional availability model, with either manual or automatic failover.

Both solutions are backed by commercial support offerings from the people who make them.

With all three services (Riak, ZooKeeper, Couchbase), Datomic can run on an existing installation alongside other applications without conflict. Thus you can combine your use of Datomic with the other uses at which these storages excel. We consider this hybrid use highly appealing in practice, as different parts of your applications have different requirements.

As always, you can back up and restore to/from these storages and any other, and switch your application from one storage to another simply by changing to a different URI.
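For illustration, here is a minimal Clojure sketch of what that switch looks like from a peer; the URI formats are representative and the host, bucket, and database names are hypothetical (consult the storage docs for the exact formats):

(require '[datomic.api :as d])

;; The same application, pointed at different storages.
;; Only the URI changes; everything else stays the same.
(def riak-uri  "datomic:riak://riak.example.com/datomic/mydb")
(def couch-uri "datomic:couchbase://couch.example.com/datomic/mydb")

(def conn (d/connect riak-uri))   ; or couch-uri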

This makes the set of Datomic storages:

  • In-process memory
  • Transactor-local dev/free mode
  • SQL
  • DynamoDB
  • Riak
  • Couchbase
  • Infinispan


We are very excited about these new options, which greatly expand your choices, especially for non-cloud deployments. Each storage represents different tradeoffs, but what is important is that the choices are yours to make, as you decide what best fits your business and technical requirements.

We look forward to your feedback as we fine-tune these integrations for production use.

Wednesday, October 10, 2012

codeq



Backstory

Programmer Sally: "So, what are you going to do today Bob?"

Programmer Bob: "I'm not happy with the file baz.clj residing in my/ns. So I'm going to go to line 96 and change 2 to 42. I've been thinking about deleting line 124. If I have time, I'm also going to insert some text I've been working on at line 64."

Programmer Sally: (what's wrong with Bob?)

Short Story

codeq (pronounced 'co-deck') is a little application that imports your Git repositories into a Datomic database, then performs language-aware analysis on them, extending the Git model down from the file to the code quantum (codeq) level, and up across repos. By doing so, codeq allows you to:

  • Track change at the program unit level (e.g. function and method definitions)
  • Query your programs and libraries declaratively, with the same cognitive units and names you use while programming
  • Query across repos

The resulting database is highly programmable, and can serve as infrastructure for editors, IDEs, code browsing, analysis and documentation tools.

codeq is open source (EPL), and on GitHub. It works with Datomic Free.

Long Story

We love Git. We use it, and by now so do most of you. Used conservatively, Git provides a good basis for information management (it keeps everything!).

But it is important to understand Git's limits. Without a live connection to the editors, merge tools, etc. that munge the files, Git is relegated to simply discovering what has changed in the filesystem, and wisely just stores new stuff, using content-based addressing to determine what is new.

Any further information we get from such a recording of novelty has to be derived. Diffs derive mechanical change information, and tree and file diffs are at the core of Git's facilities.

Language-aware analyses can derive change information co-aligned with program semantics.

How it works

Git looks in directories and finds files, names them by the SHAs of their contents, and encodes their relationship to the enclosing tree by filename. During the import phase, codeq pretty much transfers Git's model intact.

During the analysis phase, codeq looks in files and finds code segments, names them by their SHAs, and encodes their relationship to the enclosing file by line+column location. It further associates them with programmatic names and semantics (e.g. definition, or usage), if possible. We call this semantic segment a codeq.

Thus, one way to understand codeq is as an extension of the git model and tree down to a finer granularity, aligned to program semantics.

codeq Model



In Git, every repo is an island. When trying to understand the interactions across your entire program, including the many libraries it uses, each with their own Git repo, it becomes useful to combine repo information in a single database. The beauty of content-based addressing and globally unique namespacing (as done by e.g. Java and Clojure) is that such merging is conflict-free. So, codeq supports the importation of multiple repos into the same database and keeps them sorted, superimposing a level above the Git model.

In Detail

Now we can consider a particular scenario - a programmer edits a file, changing one function definition and inserting another.

From the Git perspective, the two files are different blobs, with different SHAs, in different trees under the same name. Anything else it tells you about what has changed is done via dynamic diffing, and is usually expressed in terms of lines and blocks of text.

Since we know (some of) these files are programs, and they are stored immutably, it seems worthwhile to perform a one-time analysis in order to track change more finely. codeq will break down the top level of the file into the corresponding programming-language units (e.g. for Clojure, mostly function-defining forms). It gives each a SHA (if the block of code has never been seen before). It will then look inside and try to determine what the code is about (in context - since the same code can appear e.g. in different namespaces and thus name different things). The 'meaning' is normally some reference to a namespaced program identity, for some purpose (definition, use). So a codeq encodes the location of a code segment in a file, and its semantics.
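To make this concrete, here is a hypothetical sketch of a single codeq as an entity map, using attribute names that appear in the queries later in this post; the :codeq/loc format and the *-id placeholder values are assumptions:

;; A codeq ties a code segment to a location in a file and to the
;; program identity it concerns (the values below are placeholders).
{:codeq/file  file-blob-id      ; the Git blob containing the segment
 :codeq/loc   "96 1 104 20"     ; line+column span within the file
 :codeq/code  code-segment-id   ; the segment itself (:code/text)
 :clj/def     fn-name-id}       ; the :code/name entity it defines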

File edit:



After the edit, the analysis finds the identical first 2 segments in the same place in the new file, a new segment following, and then the old 3rd segment in a new location. The file ends with a new code segment, which is nevertheless 'about' the same thing as the prior 4th segment. As time goes by, we can get timelines for program constructs that are independent of the timelines of the files that contain them, and more closely aligned with our work (what's the history of this function definition?).

This importation and analysis is not a one-shot thing. You continue to use Git normally. You can go back later and import your newer changes from Git, or perform new or enhanced analyses.  Due to the first-class nature of Datomic transactions, the codeq database knows what has happened to it, what has been imported, what analyses have been run, what schemas have been installed etc.

Put on a happy interface

Git has a powerful engine - durable persistent data structures with structural sharing and some fast C code that manipulates them. Unfortunately, it is hidden behind a plethora of command line utilities each of which has a boatload of options and a variety of outputs - let's have a parsing party!

If only we had some declarative query technology with elegant support for recursion, we could turn all this tree walking into a piece of cake. Wait! We can fight 1970s technology with... more 1970s technology: Datalog.


(def rules
 '[;; node-files: the file blobs reachable from a tree node, recursively
   [(node-files ?n ?f) [?n :node/object ?f] [?f :git/type :blob]]
   [(node-files ?n ?f) [?n :node/object ?t] [?t :git/type :tree]
                       [?t :tree/nodes ?n2] (node-files ?n2 ?f)]
   ;; object-nodes: the tree nodes that (transitively) contain an object
   [(object-nodes ?o ?n) [?n :node/object ?o]]
   [(object-nodes ?o ?n) [?n2 :node/object ?o] [?t :tree/nodes ?n2] (object-nodes ?t ?n)]
   ;; commit-files / commit-codeqs: everything reachable from a commit
   [(commit-files ?c ?f) [?c :commit/tree ?root] (node-files ?root ?f)]
   [(commit-codeqs ?c ?cq) (commit-files ?c ?f) [?cq :codeq/file ?f]]
   ;; file-commits / codeq-commits: the commits that include a file or codeq
   [(file-commits ?f ?c) (object-nodes ?f ?n) [?c :commit/tree ?n]]
   [(codeq-commits ?cq ?c) [?cq :codeq/file ?f] (file-commits ?f ?c)]])


You can follow along with the schema, but suffice it to say, that is all the code you need to do the following (a short usage sketch follows the list):

  • Find all the files referenced by a commit
  • Find all the codeqs referenced by a commit
  • Find all the commits including a file
  • Find all the commits including a codeq
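For instance, here is a minimal sketch using the commit-files rule, assuming (require '[datomic.api :as d]), that db is a database value from a codeq connection, and that commit-eid is the entity id of some commit (hypothetical bindings):

;; All the file blobs reachable from a given commit
(d/q '[:find ?f
       :in $ % ?c
       :where (commit-files ?c ?f)]
     db rules commit-eid)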

This query uses those rules to find all of the different definitions of the function datomic.codeq.core/commit, and when they were first defined:

(d/q '[:find ?src (min ?date)
       :in $ % ?name
       :where
       [?n :code/name ?name]
       [?cq :clj/def ?n]
       [?cq :codeq/code ?cs]
       [?cs :code/text ?src]
       [?cq :codeq/file ?f]
       (file-commits ?f ?c)
       [?c :commit/authoredAt ?date]]
     db rules "datomic.codeq.core/commit")


If you don't know Datalog, it's worth the 11 minutes it will take you to learn it.



I hope this gives you a sense of the motivation for codeq, and some excitement for trying it out. It's still early days, and we are definitely looking for help in enhancing the analysis, integrating with tools, supporting other languages etc.

Have fun!

Rich



Monday, September 17, 2012

Datomic Monitoring and Performance

I have just added a new section to the Datomic documentation on monitoring and performance. If you are tuning Datomic, it is a must-read.

I want to call attention in this blog post to one aspect of Datomic monitoring that might not be immediately apparent: you can use Amazon CloudWatch to monitor any instance of Datomic Pro, regardless of which storage you are using, and regardless of whether you are running any of your processes in the AWS cloud.

Part of the value of a good cloud architecture is being able to mix-and-match the pieces.

Thursday, September 6, 2012

REST API

I'm pleased to announce that, starting with version 0.8.3488, Datomic now offers a REST API.

There are a number of reasons to do this; first and foremost is that it will now be possible to access Datomic from non-JVM languages.

How does it work?

The command bin/rest will start up a peer which runs as a stand-alone HTTP server. You can access that service from any application using any language and any HTTP library. That's it!
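As a sketch of what that looks like, the invocation and endpoint path below are illustrative assumptions (see the REST API docs for the actual routes and flags); any HTTP client in any language would do, but here is a Clojure one-liner:

;; Hypothetical: start a REST peer on port 9000 serving a storage alias,
;; e.g. something like: bin/rest -p 9000 dev datomic:dev://localhost:4334/
;; Then hit it with plain HTTP (the path shown is an assumption):
(slurp "http://localhost:9000/data/")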


This greatly enhances your architectural options:


Q&A


So, does this make Datomic client-server? 

Yes and no. First off, the 'servers' are themselves peers, so you still get elastic, configuration-free horizontal query scalability by simply starting/stopping more peers. Second, the Datomic model, which substantially mitigates the problems of client-server databases, is faithfully replicated for clients:

  • A peer service can serve more than one database, against more than one storage (Pro)
  • Clients get the same kind of data-driven API as do peers
  • Clients can get repeatable queries, and multiple queries with the same basis
  • Clients get the time-travel as-of and since capabilities, raw access to indexes and ranges
  • Clients can issue multi-datasource queries, and pass data to query
  • Clients can get a push feed of transaction events, via a Server-Sent Events source

Basically, it's just more options. Build full peers in a JVM language or lightweight clients in any language, or any combination.

Do these clients count towards my peer count? 

No. The clients do not run the Datomic peer library, and thus do not count. Each REST service process counts as a peer.

Does this work with Datomic Free? 

Yes it does.

What's the status of the API? 

It's still alpha, as we want to incorporate your feedback.

Are there client libraries for language _______?

Not yet. This is something we hope each language community will help us build.

Where can I get more details?

Check out the REST API docs


We hope you enjoy this new API and the capabilities it affords. As always, we welcome your feedback and input. If you've been on the sidelines waiting for Datomic to come to your favorite language - welcome!

Monday, September 3, 2012

ElastiCache in 5 minutes

You can add ElastiCache to your AWS-based Datomic system in minutes:
  • Log in to your ElastiCache management console and choose "Launch Cache Cluster"
  • Complete the wizard. You can accept all defaults.
  • Back at the console, select "Cache Security Groups" and authorize your Datomic system's security group (default is "datomic")

  • Back at the console, select "Cache Clusters" and copy the node endpoints string:

  • Paste the node endpoints into your transactor properties file:
    memcached=foo.use1.cache.amazonaws.com:11211
  • Set the node endpoints in your peer application code:
    System.setProperty("datomic.memcachedServers",
                       "foo.use1.cache.amazonaws.com:11211");

That's all there is to it.  Datomic will transparently use ElastiCache. There is no need to configure any cache timeouts, or change any code. Datomic's key names will not conflict with any other use, so you can use the cache for other tasks as well.

You can also configure different caches for different processes in the system. For example, I keep a local memcached process running at the office, so from a cold peer (my development laptop) queries are fast, even when I am connecting to a transactor that is running in the AWS cloud.

Friday, August 10, 2012

Keep Chocolate Love Atomic

Datomic is a database of atomic facts, or datoms, that consist of entity, attribute, value, and transaction. For example, "I love chocolate (as of tx 1000)."

Of course, I am capable of loving many things, so the :loves attribute should be :db.cardinality/many. Here is an abbreviated history of my loves:

; some point in time...
[[:db/add stu :loves :chocolate]
 [:db/add stu :loves :vanilla]]

; later...
[[:db/add stu :loves :octomore]
 [:db/retract stu :loves :vanilla]]

The set of all things I currently love is derived information, and it can be calculated from the history of atomic facts. Based on the transactions above, I currently love :chocolate and :octomore.

Datomic automatically handles this derivation, as can be seen through the entity interface:

(:loves (d/entity db stu))
=> #{:chocolate :octomore}
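Equivalently, via query (a sketch, assuming db and stu are bound as above):

(d/q '[:find ?v
       :in $ ?e
       :where [?e :loves ?v]]
     db stu)
;; => #{[:chocolate] [:octomore]}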

Now, imagine creating a web interface with checkboxes for different things a person might love.  You initially populate the interface with my current loves, pulled from the database. I interact with the system, and you get back a set of checkbox states.

At this point, you should submit adds and retracts only for the new facts I created -- not a set with an add or retract for every UI element. This is a subtle point. If I liked chocolate before, and I didn't uncheck chocolate, what is the harm in saying "Stu likes chocolate" again?

The biggest problem is that you are lying to the database. I didn't repeat my love of chocolate. What if the system also had a user interface more subtle than checkboxes, one that allowed me to reiterate past preferences? You wouldn't be able to tell the difference.

An obvious warning sign is when you find yourself submitting derived information (the set of my likes) when you actually have the facts (what I just said) in hand. Ignoring facts and recording derived information is always perilous -- imagine managing a system that records birthdays and ages, but not birthdates.

A more subtle mistake is to abuse transactions to extract facts from derived information. You have a new derived set in hand, and the database knows how to calculate the previous derived set. Given those two things, you could write a transaction function that takes the two sets and backtracks to figure out what changed.

This approach has a variant of the dishonesty problem mentioned before, in that it provides no way for me to reiterate my love for chocolate. But the other problem with this approach may be even worse: It imposes coordination in the implementation, where no coordination was required by the domain.

Let's say that I choose, at some point in time, to start liking :cheesecake and :nachos. These are atomic choices, requiring no coordination with any historical record. If you send Datomic a set of all checkbox states, and ask it to discover :cheesecake and :nachos inside a transaction, you are manufacturing a coordination job that has no basis in reality. Unnecessary coordination is an enemy of scalability and reuse.

The root cause of confusion here is update-in-place thinking. The checkbox model exposes derived information (the current states) but not the facts (the choices the user made). Given the set of checkbox states, you should do the diff in the web tier as soon as you pull data out of the form. This still has the problem that there is no way to restate that you love chocolate, but now the scope of the problem is localized to its cause -- the checkbox model. You can fix the problem, or not (you often don't care, which is why checkboxes work the way they do). But at least you are not propagating the problem into the permanent record.
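As a minimal sketch of that web-tier diff (the function and bindings here are hypothetical), the before and after sets translate directly into transaction data:

(require '[clojure.set :as set])

;; Diff checkbox state in the web tier, producing atomic adds/retracts.
(defn loves-tx [eid before after]
  (concat
   (for [v (set/difference after before)]   ; newly checked
     [:db/add eid :loves v])
   (for [v (set/difference before after)]   ; newly unchecked
     [:db/retract eid :loves v])))

;; e.g. (loves-tx stu #{:chocolate :octomore}
;;                    #{:chocolate :octomore :cheesecake :nachos})
;; => ([:db/add stu :loves :cheesecake] [:db/add stu :loves :nachos])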

Datomic is built on an understanding that data is created by atomic addition, not by corruptive modification. When your input source has an update-in-place model (such as checkbox states), you should convert to atomic facts before creating a transaction.

Now go eat some chocolate.

Tuesday, July 24, 2012

Datomic Free Edition

We're happy to announce today the release of Datomic Free Edition. This edition is oriented around making Datomic easier to get and use for open source and smaller production deployments.
  • Datomic Free Edition is ... free!
  • The system supports transactor-local storage
  • The peer library includes a memory database and Datomic Datalog
  • The Free transactor and peers are freely redistributable 
  • The transactor supports 2 simultaneous peers
Of particular note here is that Datomic Free Edition comes with a redistributable license, and does not require a personal/business-specific license from us. That means you can download Datomic Free, build e.g. an open source application with it, and ship/include Datomic Free binaries with your software. You can also put the Datomic Free bits into public repositories and package managers (as long as you retain the licenses and copyright notices).

There is a ton of capability included in the Free Edition, including the Datomic in-process memory database (great for testing), and the Datomic datalog engine, which works on both Datomic databases and in-memory collections. That's right, free datalog for everyone.
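For a quick taste of that datalog engine running over plain in-memory collections (the data here is made up), assuming the peer library is on the classpath:

(require '[datomic.api :as d])

;; Query an ordinary Clojure collection of [name age] tuples --
;; no database required.
(d/q '[:find ?name
       :in [[?name ?age]]
       :where [(> ?age 30)]]
     [["Alice" 42] ["Bob" 25] ["Carol" 31]])
;; => #{["Alice"] ["Carol"]}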

You can use Datomic Free Edition in production, and you can use it in commercial applications.

Datomic Free Edition is completely API-compatible with what we are now calling Datomic Pro Edition (the one that pays the bills). Datomic Pro adds the ability to use additional storages like SQL and DynamoDB, support for more peers, as well as high-availability mode for transactors and our new memcached support.

You can read about the editions here. 

Memcache Support

We're happy to announce today transparent integrated support for memcached in Datomic Pro Edition.

One of the nice things about the Datomic architecture is that the index segments kept in storage are immutable. That enables them to be cached extensively. Currently that caching happens inside the peers, which keep segments they have needed thus far in the application process heap.

While this is great for process-local working sets, there is only so much a single machine can cache. So, we've added support for an optional second tier of distributed, shared cache, leveraging a memcached cluster. This tier of cache can be as large as you wish, and is shared between all the peers.

The entire use of memcached is automatic and integrated - just provide the endpoints of your memcached cluster in configuration. The peer protocols will automatically both look in it, and populate it on cache misses. Being based upon immutability, there are no cache coherence problems nor expiration policy woes.
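Concretely, the configuration is one line on each side; the endpoint below is a placeholder:

;; In the transactor properties file:
;;   memcached=my.memcached.host:11211
;; And on each peer, before connecting:
(System/setProperty "datomic.memcachedServers" "my.memcached.host:11211")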

The architecture incorporating memcached looks like this:



The benefits of this are many:


  • You can get a shared cache of arbitrary size - many deployments will be able to fit their entire database in memcached if desired. 
  • If you are using a storage that is not otherwise distributed (e.g. unclustered PostgreSQL), the memcached tier can take almost all of the read load off the single server and distribute it.
  • Even when using a distributed storage like DynamoDB, the memcached tier can reduce your read provisioning and increase speed.
  • Developers can set up a small memcached daemon locally so their DB will always feel 'hot' across process restarts.
  • The memcached tier will enable hybrid strategies where the peers, transactors and memcached are all local but the storage is remote (e.g. DynamoDB).

Datomic is a good citizen in its use of memcached - it doesn't need to 'own' the cluster, and all of the Datomic keys incorporate UUIDs so they won't conflict with other application-level use of the same memcached cluster.

We hope you enjoy this feature, which is included in the Pro Edition at no extra charge.

Datomic Editions and Pricing

Over the past few months we've gotten feedback and input regarding our pricing and licensing, and we've revamped them to make things simpler and clearer. The subscription pricing made people feel as if the offering was a service (it's not), and brought about misgivings about termination, etc., so we've dropped it.


Here's the new offering:


  • We've added Datomic Free Edition - it's free and redistributable.
  • Datomic Pro is licensed software.
  • It is offered with a perpetual license.
  • Maintenance (updates and support) for the first 12 months is included.
  • Maintenance in subsequent years is ~50% of the license fee.
  • Pricing is yearly, and posted up front on the web site.
  • Pricing is based upon the number of processes (transactors + peers) using the software in production in your organization.
  • Development and testing usage doesn't count against your license limits.


We've priced it such that the expenditure for maintenance roughly correlates to our older subscription pricing.


Evaluation has changed a bit as well. In all modes, Datomic Pro will require a license key. You can get a free 30-day eval key via the web site. There is no longer a 'runs for a week without a key' mode.


Note that Datomic is not just for cloud deployments - our SQL and other storage support lets you run it on-premise or in the cloud.


You can get more information on the editions and pricing here.