Monday, July 1, 2013

Datomic MusicBrainz sample database

MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public. We are pleased to release a sample project that uses the MusicBrainz dataset to help people get familiar with using Datomic.
The MusicBrainz dataset makes a great example database for learning, evaluating, or testing Datomic for several reasons:
  • It deals with a domain with which nearly everyone is familiar
  • It is of decent size: 60,438 labels; 664,226 artists; 1,035,592 album releases; and 13,233,625 recorded tracks
  • It comprises a good number of entities, attributes, and relationships
  • It is fun to play with, query, and explore

Schema

The mbrainz-sample schema is an adaptation of a subset of the full MusicBrainz schema. We left out some entities, combined others, and made a few simplifying assumptions. In particular:
  • We omit any notion of Work
  • We combine Track, Tracklist and Recording into simply "track"
  • We rename Release Group to "abstractRelease"

Abstract Release vs. Release vs. Medium

(Adapted from the MusicBrainz schema docs)
An "abstractRelease" is an abstract "album" entity (e.g. "The Wall" by Pink Floyd). A "release" is something you can buy in your music store (e.g. the 1984 US vinyl release of "The Wall" by Columbia, as opposed to the 2000 US CD release by Capitol Records).
Therefore, when you query for releases (e.g. by name), you may see what look like duplicates: several distinct releases of the same album. To find just the "work of art" level album entity, query for abstractRelease.
The media are the physical components comprising a release (disks, CDs, tapes, cartridges, piano rolls). One medium will have several tracks, and the total tracks across all media represent the track list of the release.

Relationship Diagram


Entities

For information about the individual entities and their attributes, please see the schema page in the wiki, or the EDN schema itself.

Getting Started

First get Datomic, and start up a transactor.

Getting the Data

Next download the mbrainz backup:

    # 2.8 GB, md5 4e7d254c77600e68e9dc71b1a2785c53
    wget http://s3.amazonaws.com/mbrainz/datomic-mbrainz-backup-20130611.tar
and extract:
    # this takes a while
    tar -xvf datomic-mbrainz-backup-20130611.tar
Finally, restore the backup:
    # takes a while, but prints progress -- ~150,000 segments in restore
    bin/datomic restore-db file:datomic-mbrainz-backup-20130611 datomic:free://localhost:4334/mbrainz
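
Since the tarball is 2.8 GB, it can be worth checking the download against the MD5 digest quoted above before extracting it. The following is a minimal sketch, not part of the mbrainz-sample project: `verify_md5` is a hypothetical helper name, and it assumes the GNU coreutils `md5sum` command is available (on macOS the equivalent tool is `md5`, with a different output format).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with mbrainz-sample): compare a file's
# MD5 digest with an expected value. Assumes GNU coreutils `md5sum`.
verify_md5() {
    file="$1"
    expected="$2"
    # md5sum prints "<digest>  <filename>"; keep only the digest
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}

# Only run the check when the tarball is present in the current directory.
tarball=datomic-mbrainz-backup-20130611.tar
if [ -f "$tarball" ]; then
    verify_md5 "$tarball" 4e7d254c77600e68e9dc71b1a2785c53
fi
```

If the helper reports a mismatch, the download is probably incomplete or corrupted, and fetching it again is likely quicker than debugging a failed restore.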

Getting the Code

Clone the git repo somewhere convenient:
    # SSH URL; without a GitHub SSH key, use https://github.com/Datomic/mbrainz-sample.git instead
    git clone git@github.com:Datomic/mbrainz-sample.git
    cd mbrainz-sample

Running the examples

From Java

Fire up your favorite IDE, and configure it to use both the included pom.xml and the following Java options when running:

    -Xmx2g -server

From Clojure

Start up a Clojure REPL:
    # from the root of the mbrainz-sample repo
    lein repl
Then connect to the database and run the queries.

Thanks

We would like to thank the MusicBrainz project for defining and compiling a great dataset, and for making it freely available.