
What is Crux?

Introduction

Crux—to use Martin Kleppmann’s phrase—is an unbundled database.

What do we have to gain from turning the database inside out? Simpler code, better scalability, better robustness, lower latency, and more flexibility for doing interesting things with data.

— Martin Kleppmann (2014)

It is a database turned inside out, using:

  • Apache Kafka for the primary storage of transactions and documents as semi-immutable logs.

  • RocksDB or LMDB to host indexes for rich query support.

This decoupled design enables Crux to be maintained as a small and efficient core that can be scaled and evolved to support a large variety of use-cases.

Crux Kafka Node Diagram

Document database with graph queries

Crux is an open source document database with bitemporal graph queries.

Get Started

Reading the documentation is a good way to learn about Crux, so you are in the right place! Let’s continue the journey.

Bitemporal

Crux is a bitemporal database that stores transaction time and valid time histories. While a [uni]temporal database enables "time travel" querying through the transactional sequence of database states from the moment of database creation to its current state, Crux also provides "time travel" querying for a discrete valid time axis without unnecessary design complexity or performance impact. This means a Crux user can populate the database with past and future information regardless of the order in which the information arrives, and make corrections to past recordings to build an ever-improving temporal model of a given domain.

Bitemporal modelling is broadly useful for event-based architectures and is a critical requirement for systems in any industry with strong auditing regulations, where you need to be able to answer the question "what did you know and when did you know it?".

Read more about Bitemporality in Crux or specifically the known uses for these capabilities.

Query

Crux supports a Datalog query interface for reading data and traversing relationships across all documents. Queries are executed so that the results are lazily streamed from the underlying indexes.

Crux is ultimately a store of versioned EDN documents. The fields within these documents are automatically indexed as Entity-Attribute-Value triples to support efficient graph queries.

Schemaless

Crux does not enforce any schema for the documents it stores. One reason for this is that data might come from many different places, and may not ultimately be owned by the service using Crux to query the data. This design enables schema-on-write and/or schema-on-read to be achieved outside of the core of Crux to meet the exact application requirements.

Distributed

Nodes can come and go, with local indexes stored in a Key/Value store such as RocksDB, whilst reading and writing master data to central log topics (only Kafka is currently supported). Queries are not distributed and there is no sharding of documents across nodes.

Crux can also run in a non-distributed "standalone" mode, where the transaction and document logs exist only inside of a local Key/Value store such as RocksDB. This is appropriate for non-critical usage where availability and durability requirements do not justify the need for Kafka.

Eviction

Crux supports eviction of active and historical data to assist with technical compliance for information privacy regulations.

The main transaction log contains only hashes and is immutable. All document content is stored in a dedicated document log that can be evicted by compaction.

Get Started

Introduction

This guide contains simple steps showing how to transact data and run a simple query. However, there are a few topics you might benefit from learning about before you get too far with attempting to use Crux:

  • EDN – the extensible data notation format used throughout the Crux APIs, see Essential EDN for Crux.

  • The Datalog query language – Crux supports an EDN-flavoured version of Datalog. The Queries section within this documentation provides a good overview. You can also find an interactive tutorial for EDN-flavoured Datalog here.

  • Clojure – whilst a complete Java API is provided, a basic understanding of Clojure is recommended – Clojure is a succinct and pragmatic data-oriented language with strong support for immutability and parallelism. See Clojure.org.

Setting Up

Follow the steps below to quickly set up a Crux playground:

Project Dependency

First add Crux as a project dependency:

  juxt/crux-core {:mvn/version "19.09-1.5.0-alpha"}

Start a Crux node

(require '[crux.api :as crux])
(import (crux.api ICruxAPI))

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.standalone/topology
                    :crux.node/kv-store "crux.kv.memdb/kv"
                    :crux.kv/db-dir "data/db-dir-1"
                    :crux.standalone/event-log-dir "data/eventlog-1"
                    :crux.standalone/event-log-kv-store "crux.kv.memdb/kv"}))

For the purposes of this Hello World, we are using the simplest configuration of Crux, where all of the pluggable components are in-memory. There is no Kafka or RocksDB to worry about.

Transacting

(crux/submit-tx
 node
 [[:crux.tx/put
   {:crux.db/id :dbpedia.resource/Pablo-Picasso ; id
    :name "Pablo"
    :last-name "Picasso"}
   #inst "2018-05-18T09:20:27.966-00:00"]]) ; valid time

Querying

(crux/q (crux/db node)
        '{:find [e]
          :where [[e :name "Pablo"]]})

You should get:

#{[:dbpedia.resource/Pablo-Picasso]}

An entity query would be:

(crux/entity (crux/db node) :dbpedia.resource/Pablo-Picasso)

You should get:

{:crux.db/id :dbpedia.resource/Pablo-Picasso
 :name "Pablo"
 :last-name "Picasso"}

Next Steps

Now you know the basics of how to interact with Crux you may want to dive into our tutorials. Otherwise, let’s take a look at the kinds of things you are able to do with Queries.

Queries

Introduction

Crux is a document database that provides you with a comprehensive means of traversing and querying across all of your documents and data without any need to define a schema ahead of time. This is possible because Crux is "schemaless" and automatically indexes the top-level fields in all of your documents to support efficient ad-hoc joins and retrievals. With these capabilities you can quickly build queries that match directly against the relations in your data without worrying too much about the shape of your documents or how that shape might change in future.

Crux is also a graph database. The central characteristic of a graph database is that it can support arbitrary-depth graph queries (recursive traversals) very efficiently by default, without any need for schema-level optimisations. Crux gives you the ability to construct graph queries via a Datalog query language and uses graph-friendly indexes to provide a powerful set of querying capabilities. Additionally, when Crux’s indexes are deployed directly alongside your application you are able to easily blend Datalog and code together to construct highly complex graph algorithms.

This page walks through many of the more interesting queries that run as part of Crux’s default test suite. See test/crux/query_test.clj for the full suite of query tests and how each test listed below runs in the wider context.

Extensible Data Notation (edn) is used as the data format for the public Crux APIs. To gain an understanding of edn see Essential EDN for Crux.

Note that all Crux Datalog queries run using a point-in-time view of the database which means the query capabilities and patterns presented in this section are not aware of valid times or transaction times.

Basic Query

A Datalog query consists of a set of variables and a set of clauses. The result of running a query is a result set (or lazy sequence) of the possible combinations of values that satisfy all of the clauses at the same time. These combinations of values are referred to as "tuples".

The possible values within the result tuples are derived from your database of documents. The documents themselves are represented in the database indexes as "entity–attribute–value" (EAV) facts. For example, a single document {:crux.db/id :myid :color "blue" :age 12} is transformed into two facts [[:myid :color "blue"][:myid :age 12]].

In the most basic case, a Datalog query works by searching for "subgraphs" in the database that match the pattern defined by the clauses. The values within these subgraphs are then returned according to the list of return variables requested in the :find vector within the query.

Our first query runs on a database that contains the following 3 documents which get broken apart and indexed as "entities":

        [{:crux.db/id :ivan
          :name "Ivan"
          :last-name "Ivanov"}

         {:crux.db/id :petr
          :name "Petr"
          :last-name "Petrov"}

         {:crux.db/id :smith
          :name "Smith"
          :last-name "Smith"}]

Note that :ivan, :petr and :smith are edn keywords, which may be used as document IDs in addition to UUIDs.

The following query has 3 clauses, represented as edn vectors within the :where vector. These clauses constrain the result set to match only the entity (or subgraph of interconnected entities) that satisfy all 3 clauses at once:

 '{:find [p1]
   :where [[p1 :name n]
           [p1 :last-name n]
           [p1 :name "Smith"]]}

Let’s try to work out what these 3 clauses do…​

p1 and n are logical variables. Logic variables are often prefixed with ? for clarity but this is optional.

[p1 :name n] is looking for all entities that have a value under the attribute of :name and then binds the corresponding entity ID to p1 and the corresponding value to n. Since all 3 entities in our database have a :name attribute, this clause alone will simply return all 3 entities.

[p1 :last-name n] reuses the variable n from the previous clause, which is significant because it constrains the query to only look for entities where the value of :name (from the first clause) is equal to the value of :last-name (from the second clause). Looking at the documents that were processed by our database, there is only one entity that can possibly be returned, because only one entity has the same value for :name and :last-name.

[p1 :name "Smith"] only serves to reinforce the conclusion from the previous two clauses which is that the variable n can only be matched against the string "Smith" within our database.

…​so what is the actual result of the query? Well that is defined by the :find vector which states that only the values corresponding to p1 should be returned, which in this case is simply :smith (the keyword database ID for the document relating to our protagonist "Smith Smith"). Results are returned as an edn set, which means duplicate results will not appear.

The edn result set only contains the value :smith

#{[:smith]}

Arguments

For the next set of queries we will again use the same set of documents for our database as used in the previous section:

        [{:crux.db/id :ivan
          :name "Ivan"
          :last-name "Ivanov"}

         {:crux.db/id :petr
          :name "Petr"
          :last-name "Petrov"}

         {:crux.db/id :smith
          :name "Smith"
          :last-name "Smith"}]

Query: "Match on entity ID and value"

 {:find '[n]
  :where '[[e :name n]]
  :args [{'e :ivan
          'n "Ivan"}]}

Our first query supplies two arguments to the query via a map within the :args vector. The effect of this is to ensure that, regardless of whether other :name values in the database also equal "Ivan", only the entity with an ID matching our specific :ivan ID is considered within the query. Use of arguments means we can avoid hard-coding values directly into the query clauses.

Result Set:

#{["Ivan"]}

Query: "Match entities with given values"

 {:find '[e]
  :where '[[e :name n]]
  :args [{'n "Ivan"}
         {'n "Petr"}]}

This next query shows how multiple argument values can be mapped to a single field. This allows us to usefully parameterise the input to a query such that we do not have to rerun a single query multiple times (which would be significantly less efficient!).

Result Set:

#{[:petr] [:ivan]}

Query: "Match entities with given value tuples"

 {:find '[e]
  :where '[[e :name n]
           [e :last-name l]]
  :args [{'n "Ivan" 'l "Ivanov"}
         {'n "Petr" 'l "Petrov"
          }]}

Here we see how we can extend the parameterisation to match using multiple fields at once.

Result Set:

#{[:petr] [:ivan]}

Query: "Use predicates with arguments"

 {:find '[n]
  :where '[[(re-find #"I" n)]
           [(= l "Ivanov")]]
  :args [{'n "Ivan" 'l "Ivanov"}
         {'n "Petr" 'l "Petrov"}]}

Something else we can do with arguments is apply predicates to them directly within the clauses. Predicates return either true or false but all predicates used in clauses must return true in order for the given combination of field values to be part of the valid result set. In this case only :name "Ivan" satisfies [(re-find #"I" n)] (which returns a truthy match for any value containing "I").

 #{["Ivan"]}

Query: "Use range constraints with arguments"

 {:find '[age]
  :where '[[(>= age 21)]]
  :args [{'age 22}]}

Finally we can see how we can return an argument that passes all of the predicates by including it in the :find vector. This essentially bypasses any interaction with the data in our database.

Result Set:

#{[22]}

Valid time travel

Congratulations! You already know enough about queries to build a simple CRUD application with Crux. However, your manager has just told you that the new CRUD application you have been designing needs to backfill the historical document versions from the legacy CRUD application. Luckily Crux makes it easy for your application to both insert and retrieve these old versions.

Here we will see how you are able to run queries at a given point in the valid time axis against, implicitly, the most recent transaction time.

First, we transact a very old document into the database with the ID :malcolm and the :name "Malcolm", and specify the valid time instant at which this document became valid in the legacy system: #inst "1986-10-22".

    {:crux.db/id :malcolm :name "Malcolm" :last-name "Sparks"}
    #inst "1986-10-22"

Next we transact a slightly more recent (though still very old!) revision of that same document where the :name has been corrected to "Malcolma", again using a historical timestamp extracted from the legacy system.

    {:crux.db/id :malcolm :name "Malcolma" :last-name "Sparks"}
    #inst "1986-10-24"

We are then able to query at different points in the valid time axis to check for the validity of the correction. We define a query q:

  '{:find [e]
    :where [[e :name "Malcolma"]
            [e :last-name "Sparks"]]}

Firstly we can verify that "Malcolma" was unknown at #inst "1986-10-23".

; Using Clojure: `(api/q (api/db my-crux-system #inst "1986-10-23") q)`

Result Set:

#{}

We can then verify that "Malcolma" is the currently known :name for the entity with ID :malcolm by simply not specifying a valid time alongside the query. This will be the case so long as there are no newer versions (in the valid time axis) of the document that affect the current valid time version.

; Using Clojure: `(api/q (api/db my-crux-system) q)`

Result Set:

#{[:malcolm]}

History API

Full Document History

Crux allows you to retrieve all versions of a document:

(api/submit-tx
  system
  [[:crux.tx/put
    {:crux.db/id :ids.persons/Jeff
     :person/name "Jeff"
     :person/wealth 100}
    #inst "2018-05-18T09:20:27.966"]
   [:crux.tx/put
    {:crux.db/id :ids.persons/Jeff
     :person/name "Jeff"
     :person/wealth 1000}
    #inst "2015-05-18T09:20:27.966"]])

; yields
{:crux.tx/tx-id 1555314836178,
 :crux.tx/tx-time #inst "2019-04-15T07:53:56.178-00:00"}


(api/history system :ids.persons/Jeff)

; yields
[{:crux.db/id ; sha1 hash of document id
  "c7e66f757f198e08a07a8ea6dfc84bc3ab1c6613",
  :crux.db/content-hash ; sha1 hash of document contents
  "6ca48d3bf05a16cd8d30e6b466f76d5cc281b561",
  :crux.db/valid-time #inst "2018-05-18T09:20:27.966-00:00",
  :crux.tx/tx-time #inst "2019-04-15T07:53:55.817-00:00",
  :crux.tx/tx-id 1555314835817}
 {:crux.db/id "c7e66f757f198e08a07a8ea6dfc84bc3ab1c6613",
  :crux.db/content-hash "a95f149636e0a10a78452298e2135791c0203529",
  :crux.db/valid-time #inst "2015-05-18T09:20:27.966-00:00",
  :crux.tx/tx-time #inst "2019-04-15T07:53:56.178-00:00",
  :crux.tx/tx-id 1555314836178}]

Document History Range

Retrievable document versions can be bounded by four time coordinates:

  • valid-time-start

  • tx-time-start

  • valid-time-end

  • tx-time-end

All coordinates are inclusive. All coordinates can be null.

(api/history-range system :ids.persons/Jeff
  #inst "2015-05-18T09:20:27.966"  ; valid-time start or nil
  #inst "2015-05-18T09:20:27.966"  ; transaction-time start or nil
  #inst "2020-05-18T09:20:27.966"  ; valid-time end or nil, inclusive
  #inst "2020-05-18T09:20:27.966") ; transaction-time end or nil, inclusive.

; yields
({:crux.db/id ; sha1 hash of document id
  "c7e66f757f198e08a07a8ea6dfc84bc3ab1c6613",
  :crux.db/content-hash  ; sha1 hash of document contents
  "a95f149636e0a10a78452298e2135791c0203529",
  :crux.db/valid-time #inst "2015-05-18T09:20:27.966-00:00",
  :crux.tx/tx-time #inst "2019-04-15T07:53:56.178-00:00",
  :crux.tx/tx-id 1555314836178}
  {:crux.db/id "c7e66f757f198e08a07a8ea6dfc84bc3ab1c6613",
   :crux.db/content-hash "6ca48d3bf05a16cd8d30e6b466f76d5cc281b561",
   :crux.db/valid-time #inst "2018-05-18T09:20:27.966-00:00",
   :crux.tx/tx-time #inst "2019-04-15T07:53:55.817-00:00",
   :crux.tx/tx-id 1555314835817})


(api/entity (api/db system) "c7e66f757f198e08a07a8ea6dfc84bc3ab1c6613")

; yields
{:crux.db/id :ids.persons/Jeff,
 :person/name "Jeff",
 :person/wealth 100}

Joins

Query: "Join across entities on a single attribute"

Given the following documents in the database

        [{:crux.db/id :ivan :name "Ivan"}
         {:crux.db/id :petr :name "Petr"}
         {:crux.db/id :sergei :name "Sergei"}
         {:crux.db/id :denis-a :name "Denis"}
         {:crux.db/id :denis-b :name "Denis"}]

We can run a query to return a set of tuples that satisfy the join on the attribute :name

 '{:find [p1 p2]
   :where [[p1 :name n]
           [p2 :name n]]}

Result Set:

#{[:ivan :ivan]
  [:petr :petr]
  [:sergei :sergei]
  [:denis-a :denis-a]
  [:denis-b :denis-b]
  [:denis-a :denis-b]
  [:denis-b :denis-a]}

Note that every person is matched with themselves, and the two people who share the name "Denis" produce two additional matches.

Query: "Join with two attributes, including a multi-valued attribute"

Given the following documents in the database

      [{:crux.db/id :ivan :name "Ivan" :last-name "Ivanov"}
       {:crux.db/id :petr :name "Petr" :follows #{"Ivanov"}}]

We can run a query to return the set of entities that :follows the :last-name of the entity with the :name value of "Ivan"

 '{:find [e2]
   :where [[e :last-name l]
           [e2 :follows l]
           [e :name "Ivan"]]}

Result Set:

#{[:petr]}

Note that because Crux is schemaless there is no need to have elsewhere declared that the :follows attribute may take a value of edn type set.

Ordering and Pagination

A Datalog query naturally returns a result set of tuples; however, the tuples can also be consumed as a lazy sequence, so you will always have an implicit order available. Ordinarily this implicit order is not meaningful because the join order and the result order are unlikely to correlate.

The :order-by option is available for use in the query map to explicitly control the result order.

'{:find [time device-id temperature humidity]
  :where [[c :condition/time time]
          [c :condition/device-id device-id]
          [c :condition/temperature temperature]
          [c :condition/humidity humidity]]
  :order-by [[time :desc] [device-id :asc]]}

Use of :order-by will typically require that results are fully realised by the query engine; however, this happens transparently and it will automatically spill to disk when sorting large result sets.

Basic :offset and :limit options are supported; however, typical pagination use-cases will need a more comprehensive approach because :offset will naively scroll through the initial result set each time.

'{:find [time device-id temperature humidity]
  :where [[c :condition/time time]
          [c :condition/device-id device-id]
          [c :condition/temperature temperature]
          [c :condition/humidity humidity]]
  :order-by [[device-id :asc]]
  :limit 10
  :offset 90}

Pagination relies on efficient retrieval of explicitly ordered documents and this may be achieved using a user-defined attribute with values that get sorted in the desired order. You can then use this attribute within your Datalog queries to apply range filters using predicates.

{:find '[time device-id temperature humidity]
 :where '[[c :condition/time time]
          [c :condition/device-id device-id]
          [(>= device-id my-offset)]
          [c :condition/temperature temperature]
          [c :condition/humidity humidity]]
 :order-by '[[device-id :asc]]
 :limit 10
 :args [{'my-offset 990}]}

Additionally, since Crux stores documents and can traverse arbitrary keys as document references, you can model the ordering of document IDs with vector values, e.g. {:crux.db/id :zoe :closest-friends [:amy :ben :chris]}

More powerful ordering and pagination features may be provided in the future. Feel free to open an issue or get in touch to discuss your requirements.

Rules

This example of a rule demonstrates a recursive traversal of entities that are connected to a given entity via the :follow attribute.

'{:find [?e2]
  :where [(follow ?e1 ?e2)]
  :args [{?e1 :1}]
  :rules [[(follow ?e1 ?e2)
           [?e1 :follow ?e2]]
          [(follow ?e1 ?e2)
           [?e1 :follow ?t]
           (follow ?t ?e2)]]}

Lazy Queries

The function crux.api/q takes 2 or 3 arguments: db and q, plus optionally a snapshot which is already opened and managed by the caller (using with-open, for example). The 3-argument version returns a lazy sequence of the results, while the 2-argument version returns a set. A snapshot can be retrieved from a db value via crux.api/new-snapshot.
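
For example, a minimal sketch of lazily consuming results, assuming node is a started node and that new-snapshot is called on the database value:

(let [db (crux/db node)]
  (with-open [snapshot (crux/new-snapshot db)]
    (doseq [tuple (crux/q db snapshot
                          '{:find [e]
                            :where [[e :name "Pablo"]]})]
      (println tuple))))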

Clojure Tips

Quoting

Logic variables used in queries must always be quoted in the :find and :where clauses, which in the most minimal case could look like the following:

(crux/q db
  {:find ['?e]
   :where [['?e :event/employee-code '?code]]})

However it is often convenient to quote entire clauses or even the entire query map rather than each individual use of every logical variable, for instance:

(crux/q db
  '{:find [?e]
    :where [[?e :event/employee-code ?code]]})

Confusion may arise when you later attempt to introduce references to Clojure variables within your query map, such as when using :args. This can be resolved by introducing more granular quoting for specific parts of the query map:

(let [my-code 101214]
  (crux/q db
    {:find '[?e]
     :where '[[?e :event/employee-code ?code]]
     :args [{'?code my-code}]}))

Maps and Vectors in data

Say you have a document like so and you want to add it to a Crux db:

{:crux.db/id :me
 :list ["carrots" "peas" "shampoo"]
 :pockets {:left ["lint" "change"]
           :right ["phone"]}}

Crux breaks down vectors into individual components so the query engine is able to see all elements at the base level. As a result, the query engine does not need to traverse nested structures or use any other kind of search algorithm that would slow the query down. To get the same benefit for maps, instead of nesting them as :pockets {:left thing :right thing} you should use namespaced keys, structuring the data as :pockets/left thing :pockets/right thing so that everything sits at the base level. Like so:

(crux/submit-tx
  node
  [[:crux.tx/put
    {:crux.db/id :me
     :list ["carrots" "peas" "shampoo"]
     :pockets/left ["lint" "change"]
     :pockets/right ["phone"]}]
   [:crux.tx/put
    {:crux.db/id :you
     :list ["carrots" "tomatoes" "wig"]
     :pockets/left ["wallet" "watch"]
     :pockets/right ["spectacles"]}]])

To query inside these vectors the code would be:

(crux/q (crux/db node) '{:find [e l]
                         :where [[e :list l]]
                         :args [{l "carrots"}]})
;; => #{[:you "carrots"] [:me "carrots"]}

(crux/q (crux/db node) '{:find [e p]
                         :where [[e :pockets/left p]]
                         :args [{p "watch"}]})
;; => #{[:you "watch"]}

Note that l and p are returned as single elements because Crux decomposes the vector.

DataScript Differences

This list is not necessarily exhaustive and is based on the partial re-usage of DataScript’s query test suite within Crux’s query tests.

Crux does not support:

  • vars in the attribute position, such as [e ?a "Ivan"] or [e _ "Ivan"]

Crux does not yet support:

  • ground, get-else, get-some, missing?, missing? back-ref

  • destructuring

  • source vars, e.g. function references passed into the query via :args

Note that many of these not yet supported query features can be achieved via simple function calls since you can currently fully qualify any function that is loaded. In future, limitations on available functions may be introduced to enforce security restrictions for remote query execution.

Test queries from DataScript such as "Rule with branches" and "Mutually recursive rules" work correctly with Crux and demonstrate advanced query patterns. See the Crux tests for details.

Transactions

Overview

There are four transaction (write) operations:

Table 1. Write Operations

  Operation        Purpose
  crux.tx/put      Write a version of a document
  crux.tx/cas      Compare and swap the version of a document, if that version is as expected
  crux.tx/delete   Deletes the specific document at a given valid time
  crux.tx/evict    Evicts a document entirely, including all historical versions

A document looks like this:

{:crux.db/id :dbpedia.resource/Pablo-Picasso
 :name "Pablo"
 :last-name "Picasso"}

In practice when using Crux, one calls crux.api/submit-tx with a sequence of transaction operations:

[[:crux.tx/put
 {:crux.db/id :dbpedia.resource/Pablo-Picasso
  :name "Pablo"
  :last-name "Picasso"}
 #inst "2018-05-18T09:20:27.966-00:00"]]

If the transaction contains CAS operations, all CAS operations must pass their pre-condition check or the entire transaction is aborted. This happens at the query node during indexing, and not when submitting the transaction.

For operations containing documents, the id and the document are hashed, and the operation and hash is submitted to the tx-topic in the event log. The document itself is submitted to the doc-topic, using its content hash as key. In Kafka, the doc-topic is compacted, which enables later deletion of documents.

Valid IDs

The following types of :crux.db/id are allowed:

  • Keyword (e.g. {:crux.db/id :my-id} or {:crux.db/id :dbpedia.resource/Pablo-Picasso})

  • UUID (e.g. {:crux.db/id #uuid "6f0232d0-f3f9-4020-a75f-17b067f41203"} or {:crux.db/id #crux/id "6f0232d0-f3f9-4020-a75f-17b067f41203"})

  • URI (e.g. {:crux.db/id #crux/id "mailto:crux@juxt.pro"})

  • URL (e.g. {:crux.db/id #crux/id "https://github.com/juxt/crux"}), including http, https, ftp and file protocols

The #crux/id reader literal will take any string and attempt to coerce it into a valid ID. Use of #crux/id with a valid ID type will also work (e.g. {:crux.db/id #crux/id :my-id}).

URIs and URLs are interpreted using Java classes (java.net.URI and java.net.URL respectively) and therefore you can also use these directly.
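
For illustration only, a transaction vector mixing these ID types might look like the following sketch (the document contents are hypothetical):

[[:crux.tx/put {:crux.db/id #uuid "6f0232d0-f3f9-4020-a75f-17b067f41203"
                :name "A document with a UUID ID"}]
 [:crux.tx/put {:crux.db/id #crux/id "https://github.com/juxt/crux"
                :name "A document with a URL ID"}]]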

Put

Puts a document into Crux. If a document already exists with the given :crux.db/id, a new version of this document will be created at the supplied valid time.

[:crux.tx/put
 {:crux.db/id :dbpedia.resource/Pablo-Picasso :first-name :Pablo} (1)
 #inst "2018-05-18T09:20:27.966-00:00"] (2)
  1. The document itself. Note that the ID must be included as part of the document.

  2. valid time

Note that valid time is optional and defaults to transaction time, which is taken from the Kafka log.

Crux currently writes into the past at a single point, so to overwrite several versions or a range in time, one is required to submit a transaction containing several operations.
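
For example, a sketch (assuming node is a started node) of a single transaction containing several put operations, writing two points on the valid time axis for the same document:

(crux/submit-tx
  node
  [[:crux.tx/put
    {:crux.db/id :dbpedia.resource/Pablo-Picasso :first-name :Pablo}
    #inst "2018-05-18T09:20:27.966-00:00"]
   [:crux.tx/put
    {:crux.db/id :dbpedia.resource/Pablo-Picasso :first-name :Pablo}
    #inst "2018-05-19T09:20:27.966-00:00"]])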

CAS

The CAS operation (compare and swap) swaps an existing document version with a newer one, if the existing document is as expected.

[:crux.tx/cas
 {..} (1)
 {..} (2)
  #inst "2018-05-18T09:21:31.846-00:00"] (3)
  1. Expected Document

  2. New document

  3. valid time
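
For illustration, a concrete CAS operation might look like the following sketch (the document contents are hypothetical):

[:crux.tx/cas
 {:crux.db/id :dbpedia.resource/Pablo-Picasso   ; expected (old) version
  :first-name :Pablo}
 {:crux.db/id :dbpedia.resource/Pablo-Picasso   ; new version
  :first-name :Pablo
  :last-name :Picasso}
 #inst "2018-05-18T09:21:31.846-00:00"]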

Delete

Deletes a document at a given valid time. Historical versions of the document will still be available.

[:crux.tx/delete :dbpedia.resource/Pablo-Picasso
 #inst "2018-05-18T09:21:52.151-00:00"]

Evict

Evicts a document from Crux. Historical versions of the documents will no longer be available.

[:crux.tx/evict :dbpedia.resource/Pablo-Picasso]

Configuration

Nodes

To start a Crux node, use the Java API or the Clojure crux.api.

Within Clojure, we call start-node from within crux.api, passing it a set of options for the node. There are a number of different configuration options a Crux node can have, grouped into topologies.

Table 2. Crux Topologies

  Name         Transaction Log        Topology
  Standalone   Uses local event log   :crux.standalone/topology
  Kafka        Uses Kafka             :crux.kafka/topology
  JDBC         Uses JDBC event log    :crux.jdbc/topology

Use a Kafka node when horizontal scalability is required or when you want the guarantees that Kafka offers in terms of resiliency, availability, and retention of data.

Multiple Kafka nodes participate in a cluster with Kafka as the primary store and as the central means of coordination.

The JDBC node is useful when you don’t want the overhead of maintaining a Kafka cluster. Read more about the motivations of this setup here.

The Standalone node is a single Crux instance which has everything it needs locally. This is good for experimenting with Crux and for small to medium sized deployments, where running a single instance is permissible.

Crux nodes implement the ICruxAPI interface and are the starting point for making use of Crux. Nodes also implement java.io.Closeable and can therefore be lifecycle managed.

Properties

The following properties belong to crux.node, the base topology that the other topologies build upon:

Table 3. crux.node configuration

  Property                  Default Value
  :crux.node/kv-store       'crux.kv.rocksdb/kv
  :crux.node/object-store   'crux.object-store/kv-object-store

The following set of options are used by KV backend implementations, defined within crux.kv:

Table 4. crux.kv options

  Property                                  Description                                    Default Value
  :crux.kv/db-dir                           Directory to store K/V files                   data
  :crux.kv/sync?                            Sync the KV store to disk after every write?   false
  :crux.kv/check-and-store-index-version    Check and store index version upon start?      true
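
These options are passed to crux/start-node together with a topology. For example, a minimal sketch (assuming the juxt/crux-rocksdb dependency is included and the paths are illustrative):

(crux/start-node {:crux.node/topology :crux.standalone/topology
                  :crux.node/kv-store "crux.kv.rocksdb/kv"
                  :crux.kv/db-dir "data/db-dir-1"
                  :crux.kv/sync? true
                  :crux.standalone/event-log-dir "data/eventlog-1"})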

Standalone Node

Using a Crux standalone node is the best way to get started. Once you’ve started a standalone Crux instance as described below, you can then follow the getting started example.

Local Standalone Mode
Table 5. Standalone configuration

  Property                               Description                                                                                   Default Value
  :crux.standalone/event-log-kv-store    Key/Value store to use for standalone event-log persistence                                   'crux.kv.rocksdb/kv
  :crux.standalone/event-log-dir         Directory used to store the event-log and used for backup/restore, e.g. "data/eventlog-1"
  :crux.standalone/event-log-sync?       Sync the event-log backend KV store to disk after every write?                                false

Project Dependency

  juxt/crux-core {:mvn/version "19.09-1.5.0-alpha"}

Getting started

The following code creates a node which runs completely within memory (with both the event-log store and db store using crux.kv.memdb/kv):

(require '[crux.api :as crux])
(import (crux.api ICruxAPI))

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.standalone/topology
                    :crux.node/kv-store "crux.kv.memdb/kv"
                    :crux.kv/db-dir "data/db-dir-1"
                    :crux.standalone/event-log-dir "data/eventlog-1"
                    :crux.standalone/event-log-kv-store "crux.kv.memdb/kv"}))

You can later stop the node if you wish:

(.close node)

RocksDB

RocksDB is used, by default, as Crux’s primary store (in place of the in-memory KV store in the example above). In order to use RocksDB within Crux, however, you must first add RocksDB as a project dependency:

Project Dependency

  juxt/crux-rocksdb {:mvn/version "19.09-1.5.0-alpha"}

Starting a node using RocksDB

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.standalone/topology
                    :crux.node/kv-store "crux.kv.rocksdb/kv"
                    :crux.kv/db-dir "data/db-dir-1"
                    :crux.standalone/event-log-dir "data/eventlog-1"}))

LMDB

An alternative to RocksDB, LMDB provides faster queries in exchange for a slower ingest rate.

Project Dependency

  juxt/crux-lmdb {:mvn/version "19.09-1.5.0-alpha"}

Starting a node using LMDB

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.standalone/topology
                    :crux.node/kv-store "crux.kv.lmdb/kv"
                    :crux.kv/db-dir "data/db-dir-1"
                    :crux.standalone/event-log-dir "data/eventlog-1"
                    :crux.standalone/event-log-kv-store "crux.kv.lmdb/kv"}))

Kafka Nodes

When using Crux at scale it is recommended to use multiple Crux nodes connected via a Kafka cluster.

Local Cluster Mode

Kafka nodes have the following properties:

Table 6. Kafka node configuration

  Property                             Description                                                                 Default value
  :crux.kafka/bootstrap-servers        URL for connecting to Kafka                                                 localhost:9092
  :crux.kafka/tx-topic                 Name of Kafka transaction log topic                                         crux-transaction-log
  :crux.kafka/doc-topic                Name of Kafka documents topic                                               crux-docs
  :crux.kafka/create-topics            Option to automatically create Kafka topics if they do not already exist    true
  :crux.kafka/doc-partitions           Number of partitions for the document topic                                 1
  :crux.kafka/replication-factor       Number of times to replicate data on Kafka                                  1
  :crux.kafka/group-id                 Kafka client group.id                                                       (Either environment variable HOSTNAME, COMPUTERNAME, or a random UUID)
  :crux.kafka/kafka-properties-file    File to supply Kafka connection properties to the underlying Kafka API
  :crux.kafka/kafka-properties-map     Map to supply Kafka connection properties to the underlying Kafka API

Project Dependencies

  juxt/crux-core {:mvn/version "19.09-1.5.0-alpha"}
  juxt/crux-kafka {:mvn/version "19.09-1.5.0-alpha"}

Getting started

Use the API to start a Kafka node, configuring it with the bootstrap-servers property in order to connect to Kafka:

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.kafka/topology
                    :crux.node/kv-store "crux.kv.memdb/kv"
                    :crux.kafka/bootstrap-servers "localhost:9092"}))
Note
If you don’t specify kv-store then by default the Kafka node will use RocksDB. You will need to add RocksDB to your list of project dependencies.
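
If you need to supply additional Kafka client properties (for example to connect over SSL), a sketch using :crux.kafka/kafka-properties-map might look like this (the broker address and property values are illustrative, and the exact properties required depend on your Kafka setup):

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.kafka/topology
                    :crux.node/kv-store "crux.kv.memdb/kv"
                    :crux.kafka/bootstrap-servers "broker1.example.com:9093"
                    :crux.kafka/kafka-properties-map
                    {"security.protocol" "SSL"
                     "ssl.truststore.location" "/path/to/kafka.truststore.jks"}}))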

You can later stop the node if you wish:

(.close node)

Embedded Kafka

Crux is ready to work with an embedded Kafka for when you don’t have an independently running Kafka available to connect to (such as during development).

Project Dependencies

  juxt/crux-core {:mvn/version "19.09-1.5.0-alpha"}
  juxt/crux-kafka-embedded {:mvn/version "19.09-1.5.0-alpha"}

Getting started

(require '[crux.kafka.embedded :as ek])

(def storage-dir "dev-storage")
(def embedded-kafka-options
  {:crux.kafka.embedded/zookeeper-data-dir (str storage-dir "/zookeeper")
   :crux.kafka.embedded/kafka-log-dir (str storage-dir "/kafka-log")
   :crux.kafka.embedded/kafka-port 9092})

(def embedded-kafka (ek/start-embedded-kafka embedded-kafka-options))

You can later stop the Embedded Kafka if you wish:

(.close embedded-kafka)

JDBC Nodes

JDBC Nodes use next.jdbc internally and pass through the relevant configuration options that you can find here.

Local Cluster Mode

Below is the minimal configuration you will need:

Table 7. Minimal JDBC Configuration

  Property             Description
  :crux.jdbc/dbtype    One of: postgresql, oracle, mysql, h2, sqlite
  :crux.jdbc/dbname    Database Name

Depending on the type of JDBC database used, you may also need some of the following properties:

Table 8. Other JDBC Properties

  Property               Description
  :crux.kv/db-dir        For h2 and sqlite
  :crux.jdbc/host        Database Host
  :crux.jdbc/user        Database Username
  :crux.jdbc/password    Database Password

Project Dependencies

  juxt/crux-core {:mvn/version "19.09-1.5.0-alpha"}
  juxt/crux-jdbc {:mvn/version "19.09-1.5.0-alpha"}

Getting started

Use the API to start a JDBC node, configuring it with the required parameters:

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.jdbc/topology
                    :crux.jdbc/dbtype "postgresql"
                    :crux.jdbc/dbname "cruxdb"
                    :crux.jdbc/host "<host>"
                    :crux.jdbc/user "<user>"
                    :crux.jdbc/password "<password>"}))

HTTP

Crux can be used programmatically as a library, but it also ships with an embedded HTTP server that allows clients to use the API remotely via REST.

Remote Cluster Mode

Set the server-port configuration property on a Crux node to expose an HTTP port that will accept REST requests:

Table 9. HTTP Nodes Configuration

  Component     Property      Description
  http-server   server-port   Port for Crux HTTP Server e.g. 8080

Visit the guide on using the REST api for examples of how to interact with Crux over HTTP.

Docker

If you want to experiment with Crux using a demo Docker container from Docker Hub (no JVM/JDK/Clojure install required, only Docker!) then please see the standalone web service example. You can also use this self-contained demonstration image to experiment with the REST API.

Backup and Restore

Crux provides utility APIs for local backup and restore when you are using the standalone mode. For an example of usage, see the standalone web service example.

An additional example of backup and restore, which only applies to a stopped standalone node, is provided here.

In a clustered deployment, only Kafka’s official backup and restore functionality should be relied on to provide safe durability. The standalone mode’s backup and restore operations can instead be used for creating operational snapshots of a node’s indexes for scaling purposes.

Bitemporality

Overview

Crux is optimised for efficient and globally consistent point-in-time queries using a pair of transaction-time and valid-time timestamps.

Ad-hoc systems for bitemporal recordkeeping typically rely on explicitly tracking either valid-from and valid-to timestamps or range types directly within relations. The bitemporal document model that Crux provides is very simple to reason about and it is universal across the entire database, therefore it does not require you to consider which historical information is worth storing in special "bitemporal tables" upfront.

One or more documents may be inserted into Crux via a put transaction at a specific valid-time, defaulting to the transaction time (i.e. now), and each document remains valid until explicitly updated with a new version via put (or cas) or deleted via delete.

Why?

The rationale for bitemporality is also explained in this blog post.

A baseline notion of time that is always available is transaction-time; the point at which data is transacted into the database.

Bitemporality is the addition of another time-axis: valid-time.

Table 10. Time Axes

  Time               Purpose
  transaction-time   Used for audit purposes, technical requirements such as event sourcing.
  valid-time         Used for querying data across time, historical analysis.

transaction-time represents the point at which data arrives into the database. This gives us an audit trail and we can see what the state of the database was at a particular point in time. You cannot write a new transaction with a transaction-time that is in the past.

valid-time is an arbitrary time that can originate from an upstream system, or by default is set to transaction-time. Valid time is what users will typically use for query purposes.

Writes can be made in the past of valid-time as retroactive operations. Users will normally ask "what is the value of this entity at valid-time?" regardless if this history has been rewritten several times at multiple transaction-times. Writes can also be made in the future of valid-time as proactive operations.

Typically you only need to consider using both transaction-time and valid-time for ensuring globally consistent reads across nodes or to query for audit reasons.

Note
In Crux, when transaction-time isn’t specified, it is set to now. When writing data, in case there isn’t any specific valid-time available, valid-time and transaction-time take the same value.

Valid Time

In situations where your database is not the ultimate owner of the data—where corrections to data can flow in from various sources and at various times—use of transaction-time is inappropriate for historical queries.

Imagine you have a financial trading system and you want to perform calculations based on the official 'end of day', that occurs each day at 17:00 hours. Does all the data arrive into your database at exactly 17:00? Or does the data arrive from an upstream source where we have to allow for data to arrive out of order, and where some might always arrive after 17:00?

This can often be the case with high throughput systems where there are clusters of processing nodes, enriching the data before it gets to our store.

In this example, we want our queries to include the straggling bits of data for our calculation purposes, and this is where valid-time comes in. When data arrives into our database, it can come with an arbitrary time-stamp that we can use for querying purposes.

We can tolerate data arriving out of order, as we’re not completely dependent on transaction-time.

In an ecosystem of many systems, where one cannot control the ultimate timeline or other systems’ ability to write into the past, one needs bitemporality to ensure evolving but consistent views of the data.

Transaction Time

For audit reasons, we might wish to know with certainty the value of a given entity-attribute at a given tx-instant. In this case, we want to exclude the possibility of the valid past being amended, so we need a pre-correction view of the data, relying on tx-instant.

To achieve this you can make an as-of query, supplying both ts (valid time) and tx-ts (transaction time).
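
A minimal sketch, assuming node is a started node and the timestamps are illustrative: a database value pinned to both a valid time (ts) and a transaction time (tx-ts) gives a pre-correction view of the data for audit purposes.

(crux/q (crux/db node
                 #inst "2019-01-02"   ; ts (valid time)
                 #inst "2019-01-03")  ; tx-ts (transaction time)
        '{:find [e]
          :where [[e :name "Pablo"]]})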

Only if you want to ensure consistent reads across nodes or to query for audit reasons, would you want to consider using both transaction-time and valid-time.

Domain Time

Valid time is valuable for tracking a consistent view of the entire state of the database, however, unless you explicitly include a timestamp or other temporal component within your documents you cannot currently use this information about valid time inside of your Datalog queries.

Domain time or "user-defined" time is simply the storing of any additional time-related information within your documents, for instance valid-time, duration or timestamps relating to additional temporal life-cycles (e.g. decision, receipt, notification, availability).

Queries that use domain times do not automatically benefit from any kind of native indexes to support efficient execution, however Crux encourages you to build additional layers of functionality to do so. See Decorators.
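
As a minimal illustration (the attribute names and values are hypothetical), a domain time stored as a plain numeric timestamp can be filtered with an ordinary predicate, just like any other attribute:

(crux/q (crux/db node)
        {:find '[order]
         :where '[[order :order/received-at-ms received-at-ms]
                  [(>= received-at-ms cutoff-ms)]]
         :args [{'cutoff-ms 1546300800000}]})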

Known Uses

Recording bitemporal information with your data is essential when dealing with lag, corrections, and efficient auditability:

  • Lag is found wherever there is risk of non-trivial delay until an event can be recorded. This is common between systems that communicate over unreliable networks.

  • Corrections are needed as errors are uncovered and as facts are reconciled.

  • Ad-hoc auditing is an otherwise intensive and slow process requiring significant operational complexity.

With Crux you retain visibility of all historical changes whilst compensating for lag, making corrections, and performing audit queries. By default, deleting data only erases visibility of that data from the current perspective. You may of course still evict data completely as the legal status of information changes.

These capabilities are known to be useful for:

  • Event Sourcing (e.g. retroactive and scheduled events and event-driven computing on evolving graphs)

  • Ingesting out-of-order temporal data from upstream timestamping systems

  • Maintaining a slowly changing dimension for decision support applications

  • Recovering from accidental data changes and application errors (e.g. billing systems)

  • Auditing all data changes and performing data forensics when necessary

  • Responding to new compliance regulations and audit requirements

  • Avoiding the need to set up additional databases for historical data and improving end-to-end data governance

  • Building historical models that factor in all historical data (e.g. insurance calculations)

  • Accounting and financial calculations (e.g payroll systems)

  • Development, simulation and testing

  • Live migrations from legacy systems using ad-hoc batches of backfilled temporal data

  • Scheduling and previewing future states (e.g. publishing and content management)

  • Reconciling temporal data across eventually consistent systems

Applied industry-specific examples include:

  • Legal Documentation – maintain visibility of all critical dates relating to legal documents, including what laws were known to be applicable at the time, and any subsequent laws that may be relevant and applied retrospectively

  • Insurance Coverage – assess the level of coverage for a beneficiary across the lifecycle of care and legislation changes

  • Reconstruction of Trades – readily comply with evolving financial regulations

  • Adverse Events in Healthcare – accurately record a patient’s records over time and mitigate human error

  • Intelligence Gathering – build an accurate model of currently known information to aid predictions and understanding of motives across time

  • Criminal Investigations – efficiently organise analysis and evidence whilst enabling a simple retracing of investigative efforts

Example Queries

Crime Investigations

This example is based on an academic paper:

Indexing temporal data using existing B+-trees
Cheng Hian Goh, Hongjun Lu, Kian-Lee Tan. Data Knowl. Eng., 1996.
DOI: 10.1016/0169-023X(95)00034-P
https://www.comp.nus.edu.sg/~ooibc/stbtree95.pdf
See "7. Support for complex queries in bitemporal databases"

During a criminal investigation it is critical to be able to refine a temporal understanding of past events as new evidence is brought to light, errors in documentation are accounted for, and speculation is corroborated. The paper referenced above gives the following query example:

Find all persons who are known to be present in the United States on day 2 (valid time), as of day 3 (transaction time)

The paper then lists a sequence of entry and departure events at various United States border checkpoints. We as the investigator will step through this sequence to monitor a set of suspects. These events will arrive in an undetermined chronological order based on how and when each checkpoint is able to manually relay the information.

Day 0

Assuming Day 0 for the investigation period is #inst "2018-12-31", the initial documents are ingested using the Day 0 valid time:

  {:crux.db/id :p2
   :entry-pt :SFO
   :arrival-time #inst "2018-12-31"
   :departure-time :na}

  {:crux.db/id :p3
   :entry-pt :LA
   :arrival-time #inst "2018-12-31"
   :departure-time :na}
  #inst "2018-12-31"

The first document shows that Person 2 was recorded entering via :SFO and the second document shows that Person 3 was recorded entering :LA.

Day 1

No new recorded events arrive on Day 1 (#inst "2019-01-01"), so there are no documents available to ingest.

Day 2

A single event arrives on Day 2 showing Person 4 arriving at :NY:

  {:crux.db/id :p4
   :entry-pt :NY
   :arrival-time #inst "2019-01-02"
   :departure-time :na}
  #inst "2019-01-02"
Day 3

Next, we learn on Day 3 that Person 4 departed from :NY, which is represented as an update to the existing document using the Day 3 valid time:

  {:crux.db/id :p4
   :entry-pt :NY
   :arrival-time #inst "2019-01-02"
   :departure-time #inst "2019-01-03"}
  #inst "2019-01-03"
Day 4

On Day 4 we begin to receive events relating to the previous days of the investigation.

First we receive an event showing that Person 1 entered :NY on Day 0 which must ingest using the Day 0 valid time #inst "2018-12-31":

  {:crux.db/id :p1
   :entry-pt :NY
   :arrival-time #inst "2018-12-31"
   :departure-time :na}
  #inst "2018-12-31"

We then receive an event showing that Person 1 departed from :NY on Day 3, so again we ingest this document using the corresponding Day 3 valid time:

  {:crux.db/id :p1
   :entry-pt :NY
   :arrival-time #inst "2018-12-31"
   :departure-time #inst "2019-01-03"}
  #inst "2019-01-03"

Finally, we receive two events relating to Day 4, which can be ingested using the current valid time:

  {:crux.db/id :p1
   :entry-pt :LA
   :arrival-time #inst "2019-01-04"
   :departure-time :na}

  {:crux.db/id :p3
   :entry-pt :LA
   :arrival-time #inst "2018-12-31"
   :departure-time #inst "2019-01-04"}
  #inst "2019-01-04"
Day 5

On Day 5 there is an event showing that Person 2, having arrived on Day 0 (which we already knew), departed from :SFO on Day 5.

  {:crux.db/id :p2
   :entry-pt :SFO
   :arrival-time #inst "2018-12-31"
   :departure-time #inst "2019-01-05"}
  #inst "2019-01-05"
Day 6

No new recorded events arrive on Day 6 (#inst "2019-01-06"), so there are no documents available to ingest.

Day 7

On Day 7 two documents arrive. The first document corrects the previous assertion that Person 3 departed on Day 4, which was misrecorded due to human error. The second document shows that Person 3 has only just departed on Day 7, which is how the previous error was noticed.

  {:crux.db/id :p3
   :entry-pt :LA
   :arrival-time #inst "2018-12-31"
   :departure-time :na}
  #inst "2019-01-04"

  {:crux.db/id :p3
   :entry-pt :LA
   :arrival-time #inst "2018-12-31"
   :departure-time #inst "2019-01-07"}
  #inst "2019-01-07"
Day 8

Two documents have been received relating to new arrivals on Day 8. Note that Person 3 has arrived back in the country again.

  {:crux.db/id :p3
   :entry-pt :SFO
   :arrival-time #inst "2019-01-08"
   :departure-time :na}
  #inst "2019-01-08"

  {:crux.db/id :p4
   :entry-pt :LA
   :arrival-time #inst "2019-01-08"
   :departure-time :na}
  #inst "2019-01-08"
Day 9

On Day 9 we learn that Person 3 also departed on Day 8.

  {:crux.db/id :p3
   :entry-pt :SFO
   :arrival-time #inst "2019-01-08"
   :departure-time #inst "2019-01-08"}
  #inst "2019-01-09"
Day 10

A single document arrives showing that Person 5 entered at :LA earlier that day.

  {:crux.db/id :p5
   :entry-pt :LA
   :arrival-time #inst "2019-01-10"
   :departure-time :na}
  #inst "2019-01-10"
Day 11

Similarly to the previous day, a single document arrives showing that Person 7 entered at :NY earlier that day.

  {:crux.db/id :p7
   :entry-pt :NY
   :arrival-time #inst "2019-01-11"
   :departure-time :na}
  #inst "2019-01-11"
Day 12

Finally, on Day 12 we learn that Person 6 entered at :NY that same day.

  {:crux.db/id :p6
   :entry-pt :NY
   :arrival-time #inst "2019-01-12"
   :departure-time :na}
  #inst "2019-01-12"
Question Time

Let’s review the question we need to answer to aid our investigations:

Find all persons who are known to be present in the United States on day 2 (valid time), as of day 3 (transaction time)

We are able to easily express this as a query in Crux:

  {:find [p entry-pt arrival-time departure-time]
   :where [[p :entry-pt entry-pt]
           [p :arrival-time arrival-time]
           [p :departure-time departure-time]]}
  #inst "2019-01-03"                    ; `as of` transaction time
  #inst "2019-01-02"                    ; `as at` valid time

The answer given by Crux is a simple set of the three relevant people along with the details of their last entry and confirmation that none of them were known to have yet departed at this point:

  #{[:p2 :SFO #inst "2018-12-31" :na]
    [:p3 :LA #inst "2018-12-31" :na]
    [:p4 :NY #inst "2019-01-02" :na]}

Retroactive Data Structures

At a theoretical level Crux has similar properties to retroactive data structures, which are data structures that support "efficient modifications to a sequence of operations that have been performed on the structure […​] modifications can take the form of retroactive insertion, deletion or updating of an operation that was performed at some time in the past".

Crux’s bitemporal indexes are partially persistent due to the immutability of transaction time. This allows you to query any previous version, but only update the latest version. The efficient representation of valid time in the indexes makes Crux "fully retroactive", which is analogous to partial persistence in the temporal dimension, and enables globally-consistent reads.

Crux does not natively implement "non-oblivious retroactivity" (i.e. persisted queries and cascading corrections), although this is an important area of investigation for event sourcing applications, temporal constraints, and reactive bitemporal queries.

In summary, the Crux indexes as a whole could be described as a "partially persistent and fully retroactive data structure".

Tutorials

Space Adventure

The "choose your own adventure" style tutorial series is an interactive story that you can follow along with and complete assignments.

For the no-install browser-based tutorial, follow the Nextjournal edition.

Otherwise, visit the original version in our blog post.

A Bitemporal Tale

For an interactive no-install browser-based tutorial see the Nextjournal edition of the Tale.

Otherwise, see the original version in our blog post.

Essential EDN

edn (Extensible Data Notation)

; Comments start with a semicolon.
; Anything after the semicolon is ignored.

nil         ; also known in other languages as null

; Booleans
true
false

; Strings are enclosed in double quotes
"time travel is fun"
"time traveller's fun"

; Keywords start with a colon. They behave like enums. Kind of
; like symbols in Ruby.
:time
:machine
:time-machine

; Symbols are used to represent identifiers.
; You can namespace symbols by using /. Whatever precedes / is
; the namespace of the symbol.
spoon
kitchen/spoon ; not the same as spoon
kitchen/fork
github/fork   ; you can't eat with this

; Underscore is a valid symbol identifier that has a special
; meaning in Crux Datalog where it is treated like a wildcard
; that prevents binding/unification. These are called "blanks".
_

; Integers and floats
42
3.14159

; Lists are sequences of values
(:widget :sprocket 9 "some text!")

; Vectors allow random access. Kind of like arrays in JavaScript.
[:first 1 2 :fourth]

; Maps are associative data structures that associate the key with its value
{:avocado     2
 :pepper      1
 :lemon-juice 3.5}

; You may use commas for readability. They are treated as whitespace.
{:avocado 2, :pepper 1, :lemon-juice 3.5}

; Sets are collections that contain unique elements.
#{:a :b 88 "huat"}

; Quoting is used by languages to prevent evaluation of an edn data
; structure. In Clojure the apostrophe is used as the short-hand for
; quoting and it enables you to easily construct complex Crux queries.
; Without the apostrophes inside this map, Clojure would expect `a`,
; `b`, and `c` to be valid symbols.
{:find '[a b c]
 :where [['a 'b 'c]]}

; Adapted from https://learnxinyminutes.com/docs/edn/
; License https://creativecommons.org/licenses/by-sa/3.0/deed.en_US
; © 2019 Jason Yeo, Jonathan D Johnston

For further information on EDN, see Official EDN Format

FAQs

Do I need to think about bitemporality to make use of Crux?

Not at all. Many users don’t have an immediate use for business-level time travel queries, in which case transaction time is typically regarded as "enough". However, use of valid time also enables operational advantages such as backfilling and other simple methods for migrating data between live systems in ways that aren’t easy when relying on transaction time alone (i.e. where logs must be replayed, merged and truncated to achieve the same effect). Therefore, it is sensible to use valid time in case you have these operational needs in the future. Valid time is recorded by default whenever you submit transactions.

Comparisons

How does Datalog compare to SQL?

Datalog is a well-established deductive query language that combines facts and rules during execution to achieve the same power as relational algebra + recursion (e.g. SQL with Common Table Expressions). Datalog makes heavy use of efficient joins over granular indexes which removes any need for thinking about upfront normalisation and query shapes. Datalog already has significant traction in both industry and academia.
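
For example, the following sketch (assuming documents with :person/name and :person/manager attributes have been transacted, and a db value is bound to db) joins across entities without any upfront normalisation or join-table design:

(crux/q db
        '{:find [name manager-name]
          :where [[p :person/name name]
                  [p :person/manager m]
                  [m :person/name manager-name]]})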

The EdgeDB team wrote a popular blog post outlining the shortcomings of SQL, and Datalog is the only broadly-proven alternative. Additionally, the use of EDN Datalog from Clojure makes queries "much more programmable" than the equivalent of building SQL strings in any other language, as explained in this blog post.

We plan to provide limited SQL/JDBC support for Crux in the future, potentially using Apache Calcite.

How does Crux compare to Datomic (On-Prem)?

At a high level Crux is bitemporal, document-centric, schemaless, and designed to work with Kafka as an "unbundled" database. Bitemporality provides a user-assigned "valid time" axis for point-in-time queries in addition to the underlying system-assigned "transaction time". The main similarities are that both systems support EDN Datalog queries (though they are not compatible), are written using Clojure, and provide elegant use of the database "as a value".

In the excellent talk "Deconstructing the Database" by Rich Hickey, he outlines many core principles that informed the design of both Datomic and Crux:

  1. Declarative programming is ideal

  2. SQL is the most popular declarative programming language but most SQL databases do not provide a consistent "basis" for running these declarative queries because they do not store and maintain views of historical data by default

  3. Client-server considerations should not affect how queries are constructed

  4. Recording history is valuable

  5. All systems should clearly separate reaction and perception: a transactional component that accepts novelty and passes it to an indexer that integrates novelty into the indexed view of the world (reaction) + a query support component that accepts questions and uses the indexes to answer the questions quickly (perception)

  6. Traditionally a database was a big complicated thing, it was a special thing, and you only had one. You would communicate to it with a foreign language, such as SQL strings. These are legacy design choices

  7. Questions dominate in most applications, or in other words, most applications are read-oriented. Therefore arbitrary read-scalability is a more general problem to address than arbitrary write-scalability (if you need arbitrary write-scalability then you inevitably have to sacrifice system-wide transactions and consistent queries)

  8. Using a cache for a database is not simple and should never be viewed as an architectural necessity: "When does the cache get invalidated? It’s your problem!"

  9. The relational model makes it challenging to record historical data for evolving domains and therefore SQL databases do not provide an adequate "information model"

  10. Accreting "facts" over time provides a real information model and is also simpler than recording relations (composite facts) as seen in a typical relational database

  11. RDF is an attempt to create a universal schema for information using [subject predicate object] triples as facts. However RDF triples are not sufficient because these facts do not have a temporal component (e.g. timestamp or transaction coordinate)

  12. Perception does not require coordination and therefore queries should not affect concurrently executing transactions or cause resource contention (i.e. "stop the world")

  13. "Reified process" (i.e. transaction metadata and temporal indexing) should enable efficient historical queries and make interactive auditing practical

  14. Enabling the programmer to use the database "as a value" is dramatically less complex than working with typical databases in a client-server model and it very naturally aligns with functional programming: "The state of the database is a value defined by the set of facts in effect at a given moment in time."

Rich then outlines how these principles are realised in the original design for Datomic (now "Datomic On-Prem") and this is where Crux and Datomic begin to diverge:

  1. Datomic maintains a global index which can be lazily retrieved by peers from shared "storage". Conversely, a Crux node represents an isolated coupling of local storage and local indexing components together with the query engine. Crux nodes are therefore fully independent aside from the shared transaction log and document log

  2. Both systems rely on existing storage technologies for the primary storage of data. Datomic’s covering indexes are stored in a shared storage service with multiple back-end options. Crux, when used with Kafka, uses basic Kafka topics as the primary distributed store for content and transaction logs.

  3. Datomic peers lazily read from the global index and therefore automatically cache their dynamic working sets. Crux does not use a global index and currently does not offer any node-level sharding either so each node must contain the full database. In other words, each Crux node is like an unpartitioned replica of the entire database, except the nodes do not store the transaction log locally so there is no "master". Crux may support manual node-level sharding in the future via simple configuration. One benefit of manual sharding is that both the size of the Crux node on disk and the long-tail query latency will be more predictable

  4. Datomic uses an explicit "transactor" component, whereas the role of the transactor in Crux is fulfilled by a passive transaction log (e.g. a single-partition Kafka topic) where unconfirmed transactions are optimistically appended, and therefore a transaction in Crux is not confirmed until a node reads from the transaction log and confirms it locally

  5. Datomic’s transactions and transaction functions are processed via a centralised transactor which can be configured for High-Availability using standby transactors. Centralised execution of transaction functions is effectively an optimisation that is useful for managing contention whilst minimising external complexity, and the trade-off is that the use of transaction functions will ultimately impact the serialised transaction throughput of the entire system. Crux does not currently provide a standard means of creating transaction functions but it is an area we are keen to see explored. If transaction functions and other kinds of validations of constraints are needed then it is recommended to use a gatekeeper pattern which involves electing a primary Crux node (e.g. using ZooKeeper) to execute transactions against, thereby creating a similar effect to Datomic’s transactor component

Other differences compared to Crux:

  1. Datomic’s datom model provides a very granular and comprehensive interface for expressing novelty through the assertion and retraction of facts. Crux instead uses documents (i.e. schemaless EDN maps) which are atomically ingested and processed as groups of facts that correspond to top-level fields within each document. This design choice simplifies bitemporal indexing (i.e. the use of valid time + transaction time coordinates) whilst satisfying typical requirements and improving the ergonomics of integration with other document-oriented systems. Additionally, the ordering of values stored under the same key in a document is naturally preserved and can be readily retrieved, whereas Datomic requires explicit modelling of order for cardinality-many attributes. The main downside of Crux’s document model is that re-transacting entire documents to update a single field can be considered inefficient, but this could be mitigated using lower-level compression techniques and content-addressable storage. Retractions in Crux are implicit and deleted documents are simply replaced with empty documents

  2. Datomic enforces a simple information schema for attributes including explicit reference types and cardinality constraints. Crux is schemaless as we believe that schema should be optional and be implemented as higher level "decorators" using a spectrum of schema-on-read and/or schema-on-write designs. Since Crux does not track any reference types for attributes, Datalog queries simply attempt to evaluate and navigate attributes as reference types during execution

  3. Datomic’s Datalog query language is more featureful and has more built-in operations than Crux’s equivalent, however Crux also returns results lazily and can spill to disk when sorting large result sets. Both systems provide powerful graph query possibilities

Note that Datomic Cloud is a separate technology platform that is designed from the ground up to run on AWS, and it is out of scope for this comparison.

In summary, Datomic (On-Prem) is a proven technology with a well-reasoned information model and sophisticated approach to scaling. Crux offloads primary scaling concerns to distributed log storage systems like Kafka (following the "unbundled" architecture) and to standard operational features within platforms like Kubernetes (e.g. snapshotting of nodes with pre-built indexes for rapid horizontal scaling). Unlike Datomic, Crux is document-centric and uses a bitemporal information model to enable business-level use of time-travel queries.

Technical

Is Crux eventually consistent? Strongly consistent? Or something else?

An easy answer is that Crux is "strongly consistent" with ACID semantics.

What consistency does Crux provide?

A Crux ClusterNode system provides sequential consistency by default due to the use of a single unpartitioned Kafka topic for the transaction log. Transactions are executed non-interleaved (i.e. a serial schedule) on every Crux node independently. Being able to read your writes when using the HTTP interface requires stickiness to a particular node. For a cluster of nodes to be linearizable as a whole would require that every node always sees the result of every transaction immediately after it is written. This could be achieved at the cost of non-trivial additional latency. Further reading: http://www.bailis.org/papers/hat-vldb2014.pdf, https://jepsen.io/consistency/models/sequential

How is consistency provided by Crux?

Crux does not try to enforce consistency among nodes: they all consume the log in the same order, but may be at different points in it. A client using the same node will have a consistent view. Reading your own writes can be achieved by passing the transaction time Kafka assigned to the submitted transaction (returned by crux.api/submit-tx) to crux.api/sync, which blocks until that transaction time has been seen by the node.
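
A minimal sketch of this read-your-writes pattern (assuming a started node bound to node and crux.api required as crux):

(let [{:crux.tx/keys [tx-time]}
      (crux/submit-tx node [[:crux.tx/put {:crux.db/id :ivan :name "Ivan"}]])]
  ; block until this node has indexed up to tx-time (nil timeout = default)
  (crux/sync node tx-time nil)
  (crux/entity (crux/db node) :ivan))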

Write consistency across nodes is provided via the :crux.tx/cas (compare-and-swap) operation. The user needs to attempt to perform a CAS, then wait for the transaction time (as above), and check that the entity got updated. More advanced algorithms can be built on top of this. As mentioned above, all CAS operations in a transaction must pass their pre-condition check for the transaction to proceed and get indexed, which enables one to enforce consistency across documents. There is currently no way to check if a transaction got aborted, apart from checking if the write succeeded.
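
A minimal sketch of this CAS-then-check pattern (assuming a started node bound to node and an existing :account entity; the document shapes are illustrative):

(let [old-doc {:crux.db/id :account :balance 100}
      new-doc {:crux.db/id :account :balance 90}
      {:crux.tx/keys [tx-time]}
      (crux/submit-tx node [[:crux.tx/cas old-doc new-doc]])]
  (crux/sync node tx-time nil)
  ; if the pre-condition failed the transaction was skipped, so check
  ; whether the entity now holds the new value
  (= new-doc (crux/entity (crux/db node) :account)))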

Will a lack of schema lead to confusion?

It of course depends.

While Crux does not enforce a schema, the user may do so in a layer above to achieve the semantics of schema-on-read (per node) and schema-on-write (via a gateway node). Crux only requires that the data can be represented as valid EDN documents. Data ingested from different systems can still be assigned qualified keys, which avoids collisions without requiring a shared schema to be defined. Defining such a common schema up front might be prohibitive and Crux instead aims to enable exploration of the data from different sources early. This exploration can also help discover and define the common schema of interest.
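
For example (a sketch with illustrative attribute names), data arriving from two different upstream systems can live in a single document without any shared schema, because the namespaces keep the keys from colliding:

(crux/submit-tx node
  [[:crux.tx/put {:crux.db/id :customer-42
                  :crm.customer/email "jane@example.com"
                  :billing.customer/plan :gold}]])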

Crux only indexes top-level attributes in a document, so to avoid indexing certain attributes, one can currently move them down into a nested map, as nested values aren’t indexed. This is useful both to increase throughput and to save disk space. A smaller index also leads to more efficient queries. We are considering eventually giving more explicit control over what gets indexed.
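
A small sketch of this approach (with illustrative attribute names): only :crux.db/id and :device/name below end up in the indexes, while everything inside :device/raw-payload is stored but not indexed or queryable:

(crux/submit-tx node
  [[:crux.tx/put
    {:crux.db/id :sensor-1
     :device/name "sensor-1"
     :device/raw-payload {:samples [1 2 3 4 5]
                          :notes "stored but not indexed"}}]])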

How does Crux deal with time?

The valid time can be set manually per transaction operation, and might already be defined by an upstream system before reaching Crux. This also helps with integration concerns, for example when a message queue is down and data arrives later than it should.

If not set, Crux defaults valid time to the transaction time, which is the LogAppendTime assigned by the Kafka broker to the transaction record. This time is taken from the local clock of the Kafka broker, which acts as the master wall clock time.

Crux does not rely on clock synchronisation or try to make any guarantees about valid time. Assigning valid time manually needs to be done with care: either there must be a clear owner of the clock, or the exact valid time ordering between different nodes must not strictly matter for the data in question. NTP can mitigate clock drift, potentially to an acceptable degree, but it cannot fully guarantee ordering between nodes.

Feature Support

Does Crux support RDF/SPARQL?

No. We have a simple ingestion mechanism for RDF data in crux.rdf but this is not a core feature. There is also a query translator for a subset of SPARQL. RDF and SPARQL support could eventually be written as a layer on top of Crux as a module, but there are no plans for this by the core team.

Does Crux provide transaction functions?

Not directly, currently. You may use a "gatekeeper" pattern to enforce the desired level of transaction function consistency required.

  As the log is ingested in the same order at all nodes, purely functional transformations of the tx-ops are possible. Enabling experimental support for transaction functions, which are subject to change and undocumented, can be done via the environment variable feature flag `CRUX_ENABLE_TX_FNS`.

Does Crux support the full Datomic/DataScript dialect of Datalog?

No. There is no support for Datomic’s built-in functions, or for accessing the log and history directly. There is also no support for variable bindings or multiple source vars.

Other differences include that :rules and :args (a relation represented as a list of maps, which is joined with the query) are provided in the same query map as the :find and :where clauses. Crux additionally supports the built-in == for unification as well as !=. Both of these unification operators can also take sets of literals as arguments, requiring at least one element to match, which is essentially a form of or.
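
A sketch of the unification operators (assuming documents with :name and :last-name attributes and a db value bound to db):

; at least one element of the set must unify with name
(crux/q db
        '{:find [p]
          :where [[p :name name]
                  [(== name #{"Ivan" "Petr"})]]})

; require two distinct entities that share a last name
(crux/q db
        '{:find [p1 p2]
          :where [[p1 :last-name last-name]
                  [p2 :last-name last-name]
                  [(!= p1 p2)]]})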

Many of these aspects may be subject to change, but compatibility with other Datalog databases is not a goal for Crux.

Any plans for Datalog, Cypher, Gremlin or SPARQL support?

The goal is to support multiple query languages by decoupling the query engine from its syntax, but this is not currently the case. There is a query translator for a subset of SPARQL in crux.sparql.

Does Crux support sharding?

Not currently. We are considering support for sharding the document topic as this would allow nodes to easily consume only the documents they are interested in. At the moment the tx-topic must use a single partition to guarantee transaction ordering. We are also considering support for sharding this topic via partitioning or by adding more transaction topics. Each partition / topic would have its own independent timeline, but Crux would still support cross-shard queries. Sharding is mainly useful to increase throughput.

Does Crux support pull expressions?

No. As each Crux node is its own document store, the documents are local to the query node and can easily be accessed directly via the lower level read operations. We aim to make this more convenient soon.
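
A sketch of fetching documents alongside query results in place of a pull expression (assuming a started node bound to node and an illustrative :user/name attribute):

(let [db (crux/db node)]
  (->> (crux/q db '{:find [e]
                    :where [[e :user/name "patrik"]]})
       (map first)              ; each result tuple is [e]
       (map #(crux/entity db %))))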

We are also considering support for remote document stores via the crux.db.ObjectStore interface, mainly to support larger data sets, but there would still be a local cache. The indexes would stay local as this is key to efficient queries.

Do you have any benchmarks?

We are releasing a public benchmark dashboard in the near future. In the meantime feel free to run your own local tests using the scripts in the /test directory. The RocksDB project has performed some impressive benchmarks which give a strong sense of how far a single Crux node backed by RocksDB can confidently scale. LMDB is generally faster for reads and RocksDB is generally faster for writes.

About Crux

Background

JUXT has been working on Crux since 2017, following a set of experiences where bitemporality proved challenging to implement at-scale using existing off-the-shelf technologies. Crux has been built by a very small core team who have, by necessity, had to keep the requirements and implemented scope to a minimum. Crux represents a strong foundation for further research and development efforts and provides a basis on which JUXT can release and support a broad variety of open source software products.

Vision

Crux ultimately aims to make the use of bitemporal modelling intuitive and accessible to as wide an audience as possible. Whilst bitemporality is only considered absolutely essential for a small number of use-cases, we believe that broad usage of bitemporality could significantly reduce the hidden complexity in our global information systems.

Support

Please contact us if you would like to discuss your support requirements: crux@juxt.pro

Managed Hosting

JUXT offers a Managed Hosting service for Crux to accelerate your development and provide you with a secure and reliable service.

Software

JUXT currently offers bespoke support packages for Crux with SLAs that can be customised to meet your requirements.

Reseller

Are you looking to implement Crux as part of a solution for your client(s)? JUXT can provide Software Support and Managed Hosting on your behalf. Please consider joining our reseller program.

Deployment

Engage JUXT for Deployment Services Support if you need help with the initial process of installing and configuring Crux in your environment.

Contributing

Thanks

Crux would not exist without the community of vibrant open source projects on which it depends, and we hope that the Crux community will serve to extend and reflect our gratitude.

GitHub

We currently use GitHub Issues to work on near-term changes to the Crux codebase and documentation. Please see the issues labelled with "good first issue" if you are looking for ideas to help push Crux forward: https://github.com/juxt/crux/labels/good%20first%20issue

PRs with fixes and improvements to these docs are very welcome.

Commits

Please strive to follow the best-practices for commit messages which are outlined here: https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html

CLA

A Contributor License Agreement (CLA) is necessary for us to ensure that we can support a healthy Crux ecosystem over the indefinite future. Please complete the very short .odt LibreOffice template linked below and email it to us, crux@juxt.pro, along with a reference to your current PR on GitHub.

Releases

For technical changes, see: https://github.com/juxt/crux/blob/master/CHANGELOG.md

API

Javadocs

Please consult the Javadocs for the official Crux API.

REST

Introduction

Crux offers a small REST API that allows you to send transactions and run queries over HTTP. For instance, you could deploy your Crux nodes along with Kafka into a Kubernetes pod running on AWS and interact with Crux from your application purely via HTTP. Using Crux in this manner is a valid use-case but it cannot support all of the features and benefits that running the Crux node inside of your application provides.

Your application only needs to communicate with one Crux node when using the REST API. This Crux node may be placed behind a load balancer which spreads the load over multiple nodes transparently to the application. In addition, different Crux nodes might still be catching up with the head of the transaction log, and since different queries might go to different nodes, you have to be conscious of read consistency issues when designing your application to use Crux in this way. Fortunately, you can mitigate read consistency issues using the ability to query consistent point-in-time snapshots of the database by specifying temporal coordinates along with your queries.

The REST API also provides an experimental endpoint for SPARQL 1.1 Protocol queries under /sparql/, rewriting the query into the Crux Datalog dialect. Only a small subset of SPARQL is supported and no other RDF features are available.

Using the HTTP API

The HTTP interface is provided as a Ring middleware in a Clojure namespace, located at crux/crux-http-server/src/crux/http_server.clj. There is an example of using this middleware in a full example HTTP server configuration: https://github.com/juxt/crux/tree/master/docs/example/standalone_webservice

Whilst CORS may be easily configured for use while prototyping a Single Page Application that uses Crux directly from a web browser, it is currently NOT recommended to expose Crux directly to any untrusted endpoints (including web browsers) in production since the default query API does not sandbox or otherwise restrict the execution of queries.

Index

Table 11. API

uri                        method        description
/                          GET           Returns various details about the state of the database
/document/[content-hash]   GET or POST   Returns the document for a given hash
/documents                 POST          Returns a map of document ids and respective documents for a given set of content hashes submitted in the request body
/entity                    POST          Returns an entity for a given ID and optional valid-time/transaction-time co-ordinates
/entity-tx                 POST          Returns the :put or :cas transaction that most recently set a key
/history/[:key]            GET or POST   Returns the transaction history of a key
/query                     POST          Takes a Datalog query and returns its results
/query-stream              POST          Same as /query but the results are streamed
/sync                      GET           Waits until the Kafka consumer’s lag is back to 0
/tx-log                    GET           Returns a list of all transactions
/tx-log                    POST          The "write" endpoint, to post transactions

GET /

Returns various details about the state of the database. Can be used as a health check.

curl -X GET $nodeURL/
{:crux.kv/kv-store "crux.kv.rocksdb/kv",
 :crux.kv/estimate-num-keys 92,
 :crux.kv/size 72448,
 :crux.zk/zk-active? true,
 :crux.tx-log/consumer-state
   {:crux.kafka.topic-partition/crux-docs-0
      {:offset 25,
       :time #inst "2019-01-08T11:06:41.867-00:00",
       :lag 0},
    :crux.kafka.topic-partition/crux-transaction-log-0
      {:offset 19,
       :time #inst "2019-01-08T11:06:41.869-00:00",
       :lag 0}}}
Note
estimate-num-keys is an (over)estimate of the number of transactions in the log (each of which is a key in RocksDB). RocksDB does not provide an exact key count.
GET/POST /document/[content-hash]

Returns the document stored under that hash, if it exists.

curl -X GET $nodeURL/document/7af0444315845ab3efdfbdfa516e68952c1486f2
{:crux.db/id :foobar, :name "FooBar"}
Note
Hashes for older versions of a document can be obtained with /history-range or /history, under the :crux.db/content-hash keys.
GET/POST /documents

Returns a map from document ids to documents for the given set of ids. Map keys can be returned as #crux/id literals if the preserve-crux-ids param is set to "true".

curl -X POST $nodeURL/documents \
     -H "Content-Type: application/edn" \
     -d '#{"7af0444315845ab3efdfbdfa516e68952c1486f2"}'
{"7af0444315845ab3efdfbdfa516e68952c1486f2" {:crux.db/id :foobar, :name "FooBar"}}
POST /entity

Takes a key and, optionally, a :valid-time and/or :transact-time (defaulting to now). Returns the value stored under that key at those times.

See Bitemporality for more information.

curl -X POST \
     -H "Content-Type: application/edn" \
     -d '{:eid :tommy}' \
     $nodeURL/entity
{:crux.db/id :tommy, :name "Tommy", :last-name "Petrov"}
curl -X POST \
     -H "Content-Type: application/edn" \
     -d '{:eid :tommy :valid-time #inst "1999-01-08T14:03:27.254-00:00"}' \
     $nodeURL/entity
nil
POST /entity-tx

Takes a key and, optionally, :valid-time and/or :transact-time (defaulting to now). Returns the :put or :cas transaction that most recently set that key at those times.

See Bitemporality for more information.

curl -X POST \
     -H "Content-Type: application/edn" \
     -d '{:eid :foobar}' \
     $nodeURL/entity-tx
{:crux.db/id "8843d7f92416211de9ebb963ff4ce28125932878",
 :crux.db/content-hash "7af0444315845ab3efdfbdfa516e68952c1486f2",
 :crux.db/valid-time #inst "2019-01-08T16:34:47.738-00:00",
 :crux.tx/tx-id 0,
 :crux.tx/tx-time #inst "2019-01-08T16:34:47.738-00:00"}
GET/POST /history/[:key]

Returns the transaction history of a key, from newest to oldest transaction time.

curl -X GET $nodeURL/history/:ivan
[{:crux.db/id "a15f8b81a160b4eebe5c84e9e3b65c87b9b2f18e",
  :crux.db/content-hash "c28f6d258397651106b7cb24bb0d3be234dc8bd1",
  :crux.db/valid-time #inst "2019-01-07T14:57:08.462-00:00",
  :crux.tx/tx-id 14,
  :crux.tx/tx-time #inst "2019-01-07T16:51:55.185-00:00"}

 {...}]
POST /query

Takes a Datalog query and returns its results.

curl -X POST \
     -H "Content-Type: application/edn" \
     -d '{:query {:find [e] :where [[e :last-name "Petrov"]]}}' \
     $nodeURL/query
#{[:boris][:ivan]}

Note that you are able to add :full-results? true to the query map to easily retrieve the source documents relating to the entities in the result set. For instance to retrieve all documents in a single query:

curl -X POST \
     -H "Content-Type: application/edn" \
     -d '{:query {:find [e] :where [[e :crux.db/id _]] :full-results? true}}' \
     $nodeURL/query
POST /query-stream

Same as /query but the results are streamed.

GET /sync

Wait until the Kafka consumer’s lag is back to 0 (i.e. when it no longer has pending transactions to write). Timeout is 10 seconds by default, but can be specified as a parameter in milliseconds. Returns the transaction time of the most recent transaction.

curl -X GET $nodeURL/sync?timeout=500
#inst "2019-01-08T11:06:41.869-00:00"
GET /tx-log

Returns a list of all transactions, from oldest to newest transaction time.

curl -X GET $nodeURL/tx-log
({:crux.tx/tx-time #inst "2019-01-07T15:11:13.411-00:00",
  :crux.api/tx-ops [[
    :crux.tx/put "c28f6d258397651106b7cb24bb0d3be234dc8bd1"
    #inst "2019-01-07T14:57:08.462-00:00"]],
  :crux.tx/tx-id 0}

 {:crux.tx/tx-time #inst "2019-01-07T15:11:32.284-00:00",
  ...})
POST /tx-log

Takes a vector of transactions (any combination of :put, :delete, :cas and :evict) and executes them in order. This is the only "write" endpoint.

curl -X POST \
     -H "Content-Type: application/edn" \
     -d '[[:crux.tx/put {:crux.db/id :ivan, :name "Ivan" :last-name "Petrov"}],
          [:crux.tx/put {:crux.db/id :boris, :name "Boris" :last-name "Petrov"}],
          [:crux.tx/delete :maria  #inst "2012-05-07T14:57:08.462-00:00"]]' \
     $nodeURL/tx-log
{:crux.tx/tx-id 7, :crux.tx/tx-time #inst "2019-01-07T16:14:19.675-00:00"}

Clojure

(ns crux.api)

crux.api exposes a union of methods from ICruxNode and ICruxDatasource, with a few lifecycle members added.

ICruxNode
db
  (db
    [node]
    [node ^Date valid-time]
    [node ^Date valid-time ^Date transaction-time]
    "Will return the latest value of the db currently known. Non-blocking.

     When a valid time is specified then returned db value contains only those
     documents whose valid time is not after the specified. Non-blocking.

     When both valid and transaction time are specified returns a db value
     as of the valid and transaction time. Will block until the transaction
     time is present in the index.")
document
  (document [node content-hash]
    "Reads a document from the document store based on its
    content hash.")
history
  (history [node eid]
    "Returns the transaction history of an entity, in reverse
    chronological order. Includes corrections, but does not include
    the actual documents.")
history-range
  (history-range [node eid
                  ^Date valid-time-start
                  ^Date transaction-time-start
                  ^Date valid-time-end
                  ^Date transaction-time-end]
    "Returns the transaction history of an entity, ordered by valid
    time / transaction time in chronological order, earliest
    first. Includes corrections, but does not include the actual
    documents.

    Giving null as any of the date arguments makes the range open
    ended for that value.")
status
  (status [node]
    "Returns the status of this node as a map.")
submit-tx
  (submit-tx [node tx-ops]
    "Writes transactions to the log for processing
     tx-ops datalog style transactions.
     Returns a map with details about the submitted transaction,
     including tx-time and tx-id.")
submitted-tx-updated-entity?
  (submitted-tx-updated-entity? [node submitted-tx eid]
    "Checks if a submitted tx did update an entity.
    submitted-tx must be a map returned from `submit-tx`
    eid is an object that can be coerced into an entity id.
    Returns true if the entity was updated in this transaction.")
submitted-tx-corrected-entity?
  (submitted-tx-corrected-entity? [node submitted-tx ^Date valid-time eid]
    "Checks if a submitted tx did correct an entity as of valid time.
    submitted-tx must be a map returned from `submit-tx`
    valid-time valid time of the correction to check.
    eid is an object that can be coerced into an entity id.
    Returns true if the entity was updated in this transaction.")
sync
  (sync
    [node ^Duration timeout]
    [node ^Date transaction-time ^Duration timeout]
    "If the transaction-time is supplied, blocks until indexing has
    processed a tx with a greater-than transaction-time, otherwise
    blocks until the node has caught up indexing the tx-log
    backlog. Will throw an exception on timeout. The returned date is
    the latest index time when this node has caught up as of this
    call. This can be used as the second parameter in (db valid-time,
    transaction-time) for consistent reads.
    timeout – max time to wait, can be null for the default.
    Returns the latest known transaction time.")
new-tx-log-context
  (new-tx-log-context ^java.io.Closeable [node]
    "Returns a new transaction log context allowing for lazy reading
    of the transaction log in a try-with-resources block using
    (tx-log ^Closeable tx-Log-context, from-tx-id, boolean with-documents?).

    Returns an implementation specific context.")
tx-log
  (tx-log [node tx-log-context from-tx-id with-documents?]
    "Reads the transaction log lazily. Optionally includes
    documents, which allow the contents under the :crux.api/tx-ops
    key to be piped into (submit-tx tx-ops) of another
    Crux instance.
    tx-log-context  a context from (new-tx-log-context node)
    from-tx-id      optional transaction id to start from.
    with-documents? should the documents be included?

    Returns a lazy sequence of the transaction log.")
attribute-stats
  (attribute-stats [node]
    "Returns frequencies of indexed attributes")

ICruxDatasource

Represents the database as of a specific valid and transaction time.

entity
  (entity [db eid]
    "queries a document map for an entity.
    eid is an object which can be coerced into an entity id.
    returns the entity document map.")
entity-tx
  (entity-tx [db eid]
    "returns the transaction details for an entity. Details
    include tx-id and tx-time.
    eid is an object that can be coerced into an entity id.")
new-snapshot
  (new-snapshot ^java.io.Closeable [db]
     "Returns a new implementation specific snapshot allowing for lazy query
     results in a try-with-resources block using (q db  snapshot  query)}.
     Can also be used for
     (history-ascending db snapshot  eid) and
     (history-descending db snapshot  eid)
     returns an implementation specific snapshot")
q
  (q
    [db query]
    [db snapshot query]
    "q[uery] a Crux db.
    query param is a datalog query in map, vector or string form.
    First signature will evaluate eagerly and will return a set or vector
    of result tuples.
    Second signature accepts a db snapshot, see `new-snapshot`.
    Evaluates *lazily* consequently returns lazy sequence of result tuples.")
history-ascending
  (history-ascending
    [db snapshot eid]
    "Retrieves entity history lazily in chronological order
    from and including the valid time of the db while respecting
    transaction time. Includes the documents.")
history-descending
  (history-descending
    [db snapshot eid]
    "Retrieves entity history lazily in reverse chronological order
    from and including the valid time of the db while respecting
    transaction time. Includes the documents.")
valid-time
  (valid-time [db]
    "returns the valid time of the db.
    If valid time wasn't specified at the moment of the db value retrieval
    then valid time will be time of the latest transaction.")
transaction-time
  (transaction-time [db]
    "returns the time of the latest transaction applied to this db value.
    If a tx time was specified when db value was acquired then returns
    the specified time."))

Lifecycle members

start-node
(defn start-node ^ICruxAPI [options])
Note
requires any dependencies on the classpath that the Crux modules may need.

Options:

{:crux.node/topology e.g. "crux.standalone/topology"}

Options are specified as keywords using their long format name, like :crux.kafka/bootstrap-servers etc. See the individual modules used in the specified topology for option descriptions.

Returns a node which implements ICruxAPI and java.io.Closeable. The latter allows the node to be stopped by calling (.close node).

Throws IndexVersionOutOfSyncException if the index needs rebuilding. Throws NonMonotonicTimeException if the clock has moved backwards since the last run (only applicable when using the event log).
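
A minimal usage sketch, assuming juxt/crux-core and a KV store module are on the classpath; the topology is given here in keyword form (matching the Kafka example later in these docs), and depending on the chosen topology further module-specific options, such as data directories, may also be required:

(require '[crux.api :as crux])

(def node
  (crux/start-node {:crux.node/topology :crux.standalone/topology}))

; ... use the node ...

(.close node) ; the node implements java.io.Closeable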

new-api-client
(defn new-api-client ^ICruxAPI [url])

Creates a new remote API client ICruxAPI. The remote client requires valid and transaction time to be specified for all calls to db.

Note
requires either clj-http or http-kit on the classpath, see crux.remote-api-client/internal-http-request-fn for more information.

Param url the URL to a Crux HTTP end-point.

Returns a remote API client.
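
A minimal usage sketch, assuming a Crux HTTP server is reachable at http://localhost:3000 and clj-http or http-kit is on the classpath. Note that, as described above, the remote client needs explicit valid and transaction times for calls to db:

(def client (crux/new-api-client "http://localhost:3000"))

(crux/q (crux/db client #inst "2019-09-19" #inst "2019-09-19")
        '{:find [e]
          :where [[e :crux.db/id _]]})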

new-ingest-client
(defn new-ingest-client ^ICruxAsyncIngestAPI [options])

Starts an ingest client for transacting into Kafka without running a full local node with index.

For valid options, see crux.kafka/default-options. Options are specified as keywords using their long format name, like :crux.kafka/bootstrap-servers etc.

Options:

{:crux.kafka/bootstrap-servers "kafka-cluster-kafka-brokers.crux.svc.cluster.local:9092"
 :crux.kafka/group-id           "group-id"
 :crux.kafka/tx-topic           "crux-transaction-log"
 :crux.kafka/doc-topic          "crux-docs"
 :crux.kafka/create-topics      true
 :crux.kafka/doc-partitions     1
 :crux.kafka/replication-factor 1}

Returns a crux.api.ICruxIngestAPI component that implements java.io.Closeable, which allows the client to be stopped by calling close.
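
A minimal usage sketch, assuming a Kafka broker is reachable at localhost:9092, the remaining options take their defaults, and crux.api/submit-tx can be used with the ingest client in the same way as with a full node:

(with-open [ingest-client (crux/new-ingest-client
                           {:crux.kafka/bootstrap-servers "localhost:9092"})]
  (crux/submit-tx ingest-client
                  [[:crux.tx/put {:crux.db/id :ivan :name "Ivan"}]]))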

Advanced

Patterns

Introduction

Here we document patterns and helpful functions that have been suggested by users. A broad understanding of these patterns will be useful to guide iterations on the next generation of API layers for Crux. Some of these patterns may be appropriate candidates for evolving into easily consumable decorators.

Contributing

PRs are welcome, see Contributing for guidelines. If you would prefer not to sign up to our CLA just point us towards your GitHub gist and we can link to it.

(defn entity-update
  [entity-id new-attrs valid-time]
  (let [entity-prev-value (crux/entity (crux/db node) entity-id)]
    (crux/submit-tx node
      [[:crux.tx/put
        (merge entity-prev-value new-attrs)
        valid-time]])))

; by @spacegangster

Implicit Node

(defn q
  [query]
  (crux/q (crux/db node) query))

(defn entity
  [entity-id]
  (crux/entity (crux/db node) entity-id))

; by @spacegangster

Entities

(defn lookup-vector
 [db eid]
 (if (vector? eid)
   (let [[index value] eid]
     (recur
       db
       (ffirst
         (crux.api/q db
                     {:find ['?e]
                      :where [['?e index value]]}))))
   (crux.api/entity db eid)))

; by @SevereOverfl0w
(defn entity-at
  [entity-id valid-time]
  (crux/entity (crux/db node valid-time) entity-id))

(defn entity-with-adjacent
  [entity-id keys-to-pull]
  (let [db (crux/db node)
        ids->entities
        (fn [ids]
          (cond-> (map #(crux/entity db %) ids)
            (set? ids) set
            (vector? ids) vec))]
    (reduce
      (fn [e adj-k]
        (let [v (get e adj-k)]
          (assoc e adj-k
                 (cond
                   (keyword? v) (crux/entity db v)
                   (or (set? v)
                       (vector? v)) (ids->entities v)
                   :else v))))
      (crux/entity db entity-id)
      keys-to-pull)))

; by @spacegangster

Transaction Ops

; Use spec to validate your operations prior to submission

(clojure.spec.alpha/conform
   (clojure.spec.alpha/or :put :crux.tx/put-op
                          :delete :crux.tx/delete-op
                          :cas :crux.tx/cas-op
                          :evict :crux.tx/evict-op)
   [:crux.tx/cas
    {:crux.db/id #uuid "6f0232d0-f3f9-4020-a75f-17b067f41203"
     :name "John Wayne"
     :username "jwa"}
    {:crux.db/id #uuid "6f0232d0-f3f9-4020-a75f-17b067f41203"
     :name "John Wayne"
     :username "jwa"
     :new-field 2}])

; by @SevereOverfl0w

Decorators

Introduction

Decorators are features built on top of Crux that can be used optionally and created easily without needing to understand how the core of Crux works. Use of decorators is currently an advanced topic and requires a basic understanding of Clojure to configure.

Aggregation

The aggregation.alpha namespace demonstrates a good example of building a highly general aggregation extension to the basic Crux query. The standard api/q function can now be ignored as the aggr/q function is used instead (it wraps api/q with transducer magic).

  (t/deftest test-count-aggregation
    (f/transact-entity-maps!
     *kv*
     [{:crux.db/id :a1 :user/name "patrik" :user/post 1 :post/cost 30}
      {:crux.db/id :a2 :user/name "patrik" :user/post 2 :post/cost 35}
      {:crux.db/id :a3 :user/name "patrik" :user/post 3 :post/cost 5}
      {:crux.db/id :a4 :user/name "niclas" :user/post 1 :post/cost 8}])

    (t/testing "with vector syntax"
      (t/is (= [{:user-name "niclas" :post-count 1 :cost-sum 8}
                {:user-name "patrik" :post-count 3 :cost-sum 70}]
               (aggr/q
                (api/db *api*)
                '{:aggr {:partition-by [?user-name]
                         :select
                         {?cost-sum [0 (+ acc ?post-cost)]
                          ?post-count [0 (inc acc) ?e]}}
                  :where [[?e :user/name ?user-name]
                          [?e :post/cost ?post-cost]]})))))

See the aggregation_test.clj for more examples of usage.

Queries (Advanced)

Racket Datalog

Several Datalog tests from the Racket Datalog examples have been translated and re-used within Crux’s query tests.

  • tutorial.rkt

  • path.rkt

  • revpath.rkt

  • bidipath.rkt

  • sym.rkt

Datalog Research

Several Datalog examples from a classic Datalog paper have been translated and re-used within Crux’s query tests.

What you Always Wanted to Know About Datalog (And Never Dared to Ask)

https://www.semanticscholar.org/paper/What-you-Always-Wanted-to-Know-About-Datalog-(And-Ceri-Gottlob/630444d76e5aa81867344cb11aaddaab8dc8174c
Stefano Ceri, Georg Gottlob, Letizia Tanca, Published in IEEE Trans. Knowl. Data Eng. 1989
DOI:10.1109/69.43410

Specifically:

  • "sgc"

  • 3 examples of "stratified Datalog"

WatDiv SPARQL Tests

Waterloo SPARQL Diversity Test Suite
https://dsg.uwaterloo.ca/watdiv/

WatDiv has been developed to measure how an RDF data management system performs across a wide spectrum of SPARQL queries with varying structural characteristics and selectivity classes.

Benchmarking has been performed against the WatDiv test suite. These tests demonstrate comprehensive RDF subgraph matching. Note that Crux does not natively implement the RDF specification and only a simplified subset of the RDF tests have been translated for use in Crux. See the Crux tests for details.

LUBM Web Ontology Language (OWL) Tests

Lehigh University Benchmark
http://swat.cse.lehigh.edu/projects/lubm/

The Lehigh University Benchmark is developed to facilitate the evaluation of Semantic Web repositories in a standard and systematic way. The benchmark is intended to evaluate the performance of those repositories with respect to extensional queries over a large data set that commits to a single realistic ontology. It consists of a university domain ontology, customizable and repeatable synthetic data, a set of test queries, and several performance metrics.

Benchmarking has been performed against the LUBM test suite. These tests demonstrate extreme stress testing for subgraph matching. See the Crux tests for details.

Kafka Connect Crux

Introduction

A Kafka Connect plugin for transferring data between Crux nodes and Kafka.

The Crux source connector will publish transactions on a node to a Kafka topic, and the sink connector can receive transactions from a Kafka topic and submit them to a node.

Table 12. Currently supported data formats

Data format   Sink/Source
JSON          Both
Avro          Sink
Transit       Source
EDN           Both

To get started with the connector, there are two separate guides (depending on whether you are using a full Confluent Platform installation, or a basic Kafka installation):

Confluent Platform Quickstart

Installing the connector

Use confluent-hub install juxt/kafka-connect-crux:19.09-1.5.0-alpha to download and install the connector from Confluent Hub. The downloaded connector is then placed within your Confluent install’s 'share/confluent-hub-components' folder.

The connector can be used as either a source or a sink. In either case, there should be an associated Crux node to communicate with.

Creating the Crux node

To use our connector, you must first have a Crux node connected to Kafka. To do this, we start by adding the following dependencies to a project:

juxt/crux-core {:mvn/version "19.09-1.5.0-alpha"}
juxt/crux-kafka {:mvn/version "19.09-1.5.0-alpha"}
juxt/crux-http-server {:mvn/version "19.09-1.5.0-alpha"}
juxt/crux-rocksdb {:mvn/version "19.09-1.5.0-alpha"}

Ensure first that you have a running Kafka broker to connect to. We import the dependencies into a file or REPL, then create our Kafka connected 'node' with an associated http server for the connector to communicate with:

(require '[crux.api :as crux]
         '[crux.http-server :as srv])
(import (crux.api ICruxAPI))

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.kafka/topology
                    :crux.kafka/bootstrap-servers "localhost:9092"
                    :server-port 3000}))

(srv/start-http-server node)

Sink Connector

Run the following command within the base of the Confluent folder, to create a worker which connects to the 'connect-test' topic, ready to send messages to the node. This also makes use of connect-file-source, checking for changes in a file called 'test.txt':

./bin/connect-standalone etc/kafka/connect-standalone.properties share/confluent-hub-components/juxt-kafka-connect-crux/etc/local-crux-sink.properties etc/kafka/connect-file-source.properties

Run the following within your Confluent directory, to add a line of JSON to 'test.txt':

echo '{"crux.db/id": "415c45c9-7cbe-4660-801b-dab9edc60c84", "value": "baz"}' >> test.txt

Now, verify that this was transacted within your REPL:

(crux/entity (crux/db node) "415c45c9-7cbe-4660-801b-dab9edc60c84")
==>
{:crux.db/id #crux/id "415c45c9-7cbe-4660-801b-dab9edc60c84", :value "baz"}

Source Connector

Run the following command within the base of the Confluent folder, to create a worker which connects to the 'connect-test' topic, ready to receive messages from the node. This also makes use of 'connect-file-sink', outputting transactions from your node into 'test.sink.txt':

./bin/connect-standalone etc/kafka/connect-standalone.properties share/confluent-hub-components/juxt-kafka-connect-crux/etc/local-crux-source.properties etc/kafka/connect-file-sink.properties

Within your REPL, transact an element into Crux:

(crux/submit-tx node [[:crux.tx/put {:crux.db/id #crux/id "415c45c9-7cbe-4660-801b-dab9edc60c82", :value "baz-source"}]])

Check the contents of 'test.sink.txt' using the command below, and you should see that the transactions were outputted to the 'connect-test' topic:

tail test.sink.txt
==>
[[:crux.tx/put {:crux.db/id #crux/id "415c45c9-7cbe-4660-801b-dab9edc60c82", :value "baz-source"} #inst "2019-09-19T12:31:21.342-00:00"]]

Kafka Quickstart

Installing the connector

Download the connector from Confluent hub, then unzip the downloaded folder:

unzip juxt-kafka-connect-crux-19.09-1.5.0-alpha.zip

Navigate into the base of the Kafka folder, then run the following commands:

cp $CONNECTOR_PATH/lib/*-standalone.jar $KAFKA_HOME/libs
cp $CONNECTOR_PATH/etc/*.properties $KAFKA_HOME/config

The connector can be used as either a source or a sink. In either case, there should be an associated Crux node to communicate with.

Creating the Crux node

To use our connector, you must first have a Crux node connected to Kafka. To do this, we start by adding the following dependencies to a project:

juxt/crux-core {:mvn/version "19.09-1.5.0-alpha"}
juxt/crux-kafka {:mvn/version "19.09-1.5.0-alpha"}
juxt/crux-http-server {:mvn/version "19.09-1.5.0-alpha"}
juxt/crux-rocksdb {:mvn/version "19.09-1.5.0-alpha"}

Ensure first that you have a running Kafka broker to connect to. We import the dependencies into a file or REPL, then create our Kafka connected 'node' with an associated http server for the connector to communicate with:

(require '[crux.api :as crux]
         '[crux.http-server :as srv])
(import (crux.api ICruxAPI))

(def ^crux.api.ICruxAPI node
  (crux/start-node {:crux.node/topology :crux.kafka/topology
                    :crux.kafka/bootstrap-servers "localhost:9092"
                    :server-port 3000}))

(srv/start-http-server node)

Sink Connector

Run the following command within the base of the Kafka folder, to create a worker which connects to the 'connect-test' topic, ready to send messages to the node. This also makes use of connect-file-source, checking for changes in a file called 'test.txt':

./bin/connect-standalone.sh config/connect-standalone.properties config/local-crux-sink.properties config/connect-file-source.properties

Run the following within your Kafka directory, to add a line of JSON to 'test.txt':

echo '{"crux.db/id": "415c45c9-7cbe-4660-801b-dab9edc60c84", "value": "baz"}' >> test.txt

Now, verify that this was transacted within your REPL:

(crux/entity (crux/db node) "415c45c9-7cbe-4660-801b-dab9edc60c84")
==>
{:crux.db/id #crux/id "415c45c9-7cbe-4660-801b-dab9edc60c84", :value "baz"}

Source Connector

Run the following command within the base of the Kafka folder, to create a worker which connects to the 'connect-test' topic, ready to receive messages from the node. This also makes use of 'connect-file-sink', outputting transactions from your node into 'test.sink.txt':

./bin/connect-standalone.sh config/connect-standalone.properties config/local-crux-source.properties config/connect-file-sink.properties

Within your REPL, transact an element into Crux:

(crux/submit-tx node [[:crux.tx/put {:crux.db/id #crux/id "415c45c9-7cbe-4660-801b-dab9edc60c82", :value "baz-source"}]])

Check the contents of 'test.sink.txt' using the command below, and you should see that the transactions were outputted to the 'connect-test' topic:

tail test.sink.txt
==>
[[:crux.tx/put {:crux.db/id #crux/id "415c45c9-7cbe-4660-801b-dab9edc60c82", :value "baz-source"} #inst "2019-09-19T12:31:21.342-00:00"]]

Source Configuration

url
  • Destination URL of Crux HTTP end point

  • Type: String

  • Importance: High

  • Default: "http://localhost:3000"

topic
  • The Kafka topic to publish data to

  • Type: String

  • Importance: High

  • Default: "connect-test"

format
  • Format to send data out as: edn, json or transit

  • Type: String

  • Importance: Low

  • Default: "edn"

mode
  • Mode to use: tx or doc

  • Type: String

  • Importance: Low

  • Default: "tx"

batch.size
  • The maximum number of records the Source task can read from Crux at one time.

  • Type: Int

  • Importance: LOW

  • Default: 2000

Sink Configuration

url
  • Destination URL of Crux HTTP end point

  • Type: String

  • Importance: High

  • Default: "http://localhost:3000"

id.key
  • Record key to use as :crux.db/id

  • Type: String

  • Importance: Low

  • Default: "crux.db/id"