‘Jumble sale’ developer meeting

We had another “jumble sale” developer meeting this week, in which a number of people talked briefly about topics that interested them.

Rhys, Reece and Richard H talked about integrating with third-party authentication and authorization systems. For some of our projects, we have specified a simple API that the third-party authentication provider needs to support. This has been an effective approach, and it proved particularly useful that we wrote a complete test suite against a dummy provider from the outset – that meant we could very easily tell whether the third party had followed the specification, and in principle they could have run the tests themselves (although they didn’t). For another project, we have used SAML to provide single sign-on against existing sign-on systems. We’ve used pac4j from Scala to do this integration, and combined it with MarkLogic authentication to allow login to our systems.
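As a rough illustration of the approach (the names and shape below are invented for this post, not taken from the real specification), it amounts to a small provider interface plus a dummy implementation that the test suite runs against:

```scala
import scala.concurrent.Future

final case class AuthenticatedUser(id: String, roles: Set[String])

// A hypothetical version of the simple API a third-party provider has to implement.
trait AuthenticationProvider {
  def authenticate(username: String, password: String): Future[Option[AuthenticatedUser]]
}

// A dummy provider of the kind the test suite can be written against before any
// real third-party implementation exists.
class DummyAuthenticationProvider extends AuthenticationProvider {
  private val users = Map("alice" -> ("correct-password", Set("editor")))

  def authenticate(username: String, password: String): Future[Option[AuthenticatedUser]] =
    Future.successful(users.get(username).collect {
      case (expectedPassword, roles) if expectedPassword == password =>
        AuthenticatedUser(username, roles)
    })
}
```

Running the same tests first against the dummy provider and then against the real one is what makes it easy to see whether the third party has followed the specification.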

Rodney talked about TypeScript, and how it improves on JavaScript by adding optional static typing. It’s a strict superset of JavaScript, so all valid JavaScript code is valid TypeScript, and it compiles down to JavaScript. However, we haven’t yet looked into how to fit TypeScript compilation into the build tooling of the frameworks we typically work with.

Inigo talked about his investigation of why an AWS-hosted MarkLogic cluster was somehow transferring many terabytes of data per month. The possible causes of this were:

  • the communication between end users and Play
  • … between Play and MarkLogic
  • … between the hosts in the MarkLogic cluster
  • … and from the servers to their S3 backups

He used stats from the AWS load balancer, combined with Firefox’s network tab to analyze page weight, to rule out the communication with end users. By enabling logging on the load balancer between Play and MarkLogic, he was able to rule out the second cause. Transferring data from AWS to S3 is free as long as it all stays within the same region – which it did. Running iftop on the cluster nodes showed which sockets the data was flowing between, and when – which pinned it down to the MarkLogic intra-cluster communication, in regular bursts. Looking at what the system was doing, there was a regular background task that queried all of the documents. It took about 30s, but we hadn’t previously been concerned about the inefficiency since it didn’t affect users. However, because it wasn’t using indices, it needed to transfer a lot of data between the cluster nodes to check the content – which accounted for our data transfer volume. Rewriting the query to use indices made it much faster, and also meant it no longer needed to transfer data between nodes, significantly reducing our AWS data transfer bill. The lesson is that when running a cluster on AWS, we care about different sorts of inefficiency than we would running it on local machines.

Richard B talked about Effective Scala Futures – how to use Scala futures without relying on global state that makes testing hard, and how to ‘bulkhead’ separate thread pools to manage different kinds of task. His slides are here: Effective Scala Futures dev talk
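As a rough sketch of the bulkheading idea (the names below are illustrative, and not taken from the slides), blocking work can run on its own dedicated pool while the caller supplies the execution context used for everything else:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

final case class Report(rows: Seq[String])

trait ReportRepository {
  def loadRows(id: String): Seq[String] // a slow, blocking query
}

object Pools {
  // A dedicated, fixed-size pool for blocking calls, so that a burst of slow
  // queries can't starve the pool serving web requests: the 'bulkhead'.
  val blockingIo: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
}

class ReportService(repository: ReportRepository) {
  // The caller passes in an execution context instead of the service importing
  // a global one, which makes it easy to supply a deterministic context in tests.
  def buildReport(id: String)(implicit ec: ExecutionContext): Future[Report] =
    Future(repository.loadRows(id))(Pools.blockingIo) // blocking work on its own pool
      .map(rows => Report(rows))                      // lightweight mapping on the caller's pool
}
```

Because the execution context is a parameter rather than a global import, tests can substitute one that runs tasks deterministically.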

NixOS and MarkLogic 9

This week, Rodney talked about Nix – the NixOS Linux distro and the NixOps tooling – and Simon talked about the MarkLogic 9 Early Access release that we have been working with for a few months.

Nix Expression Language is a declarative functional language, with aspects that are similar to Haskell, Scheme, and YAML. It’s a domain-specific language that can perform tasks similar to Ansible for configuration, and similar to make for building packages and projects while keeping track of dependencies and when they change.

NixOS is a Linux distribution, built on the Nix package manager, that uses Nix EL to describe the configuration of a machine. It supports atomic upgrades and rollbacks of packages.

NixOps again uses Nix EL, but at a higher level, to describe how to provision a set of machines in a cloud environment like AWS. Unlike tools such as Puppet or Ansible, NixOps describes the complete desired state of the machine – so if you manually install a package that NixOps isn’t expecting, it will uninstall it again.

NixOps will provision machines based on a description in Nix EL. However, Nix builds are run in an isolated environment, so they can’t depend on anything that isn’t explicitly declared as an input – which is what keeps them reproducible.

We’re not currently using Nix for deployment (we generally use Ansible), but Rodney is using it on his desktop, and thinks it is a good set of tools.

We have been using MarkLogic 9 Early Access for a few months. Simon talked about the new features around data integration, security, and manageability, and the improvements to the Query Console.

The ‘template driven extraction’ (TDE) feature allows us to define templates that convert XML or CSV files to row data or triples on insert. It works best on ‘row-like’ data, such as XML with a set of repeated components, and allows you to define anchor elements via XPath, and then a set of relative XPaths that define particular data fields to be extracted from that content. You can then query this data via SQL, or in conjunction with other data via the Optic API.

The ‘Optic API’ allows joins and aggregates over documents, row data and triples – for example, combining a SPARQL query of graph data with a SQL query of relational data. This is valuable because each kind of data is often best represented and queried in its own particular form, but needs to be combined with other data to be useful. For example, purchase data might be best represented as rows in a relational database, but then processed by aggregating it and combining it with metadata about what is being purchased, which is best represented as nodes in a graph database.
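As a rough sketch of what this looks like from Scala via the MarkLogic Java client (the view and column names here are invented, and the exact PlanBuilder calls may differ slightly in the final release), a row query over a TDE-projected view looks something like this:

```scala
import com.marklogic.client.DatabaseClientFactory
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext
import scala.collection.JavaConverters._

object OpticExample extends App {
  val client = DatabaseClientFactory.newClient("localhost", 8000,
    new DigestAuthContext("ml-user", "ml-password"))

  val rowMgr = client.newRowManager()
  val p = rowMgr.newPlanBuilder()

  // Read rows from a hypothetical "purchasing"/"purchases" view (for instance one
  // projected by a TDE template), keeping only the larger purchases.
  val plan = p.fromView("purchasing", "purchases")
    .where(p.gt(p.col("quantity"), p.xs.intVal(10)))

  rowMgr.resultRows(plan).iterator().asScala.foreach { row =>
    println(s"${row.getString("customerId")} bought ${row.getInt("quantity")} of ${row.getString("productId")}")
  }

  client.release()
}
```

Joins to triple data (via a SPARQL source) and aggregates can then be layered onto the same plan, which is where combining the different styles of data becomes genuinely useful.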

The ‘data movement SDK’ extends capabilities that were previously available via tools such as mlcp and corb, and exposes them via a Java API. It supports bulk ingestion and transformation of documents.
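Since we mostly call Java APIs from Scala, a minimal sketch of bulk ingestion with a WriteBatcher might look like this (the host, credentials and documents are placeholders):

```scala
import com.marklogic.client.DatabaseClientFactory
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext
import com.marklogic.client.io.StringHandle

object BulkLoadExample extends App {
  val client = DatabaseClientFactory.newClient("localhost", 8000,
    new DigestAuthContext("ml-user", "ml-password"))

  val dmm = client.newDataMovementManager()

  // Writes are batched and spread across several threads; listeners report progress.
  val batcher = dmm.newWriteBatcher()
    .withBatchSize(100)
    .withThreadCount(4)
    .onBatchSuccess(batch => println(s"Wrote a batch of ${batch.getItems.length} documents"))
    .onBatchFailure((batch, failure) => failure.printStackTrace())

  dmm.startJob(batcher)

  val documents = Seq("/purchases/1.xml" -> "<purchase><id>1</id></purchase>")
  documents.foreach { case (uri, xml) => batcher.add(uri, new StringHandle(xml)) }

  batcher.flushAndWait()
  dmm.stopJob(batcher)
  client.release()
}
```

The SDK also provides a QueryBatcher that works the other way round, feeding batches of matching URIs to listeners that export or transform them.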

‘Entity services’ is a modelling framework for data stored in MarkLogic. It makes it easier to harmonize data from different sources into a consistent model, using the ‘envelope pattern’ to wrap and preserve the original data. It’s not doing anything that couldn’t have been done manually in previous versions of MarkLogic, but it simplifies the code needed to do it.
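The envelope pattern itself is simple – the harmonized view of an entity sits alongside the untouched source document, so nothing is lost in the transformation. A toy sketch in Scala (the element names are illustrative, not MarkLogic’s own entity-services vocabulary):

```scala
import scala.xml.{Elem, XML}

object EnvelopeExample extends App {
  // Wrap a harmonized 'instance' of the entity together with the original
  // source document, so the raw data is preserved alongside the clean model.
  def envelope(canonical: Elem, original: Elem): Elem =
    <envelope>
      <instance>{canonical}</instance>
      <attachments>{original}</attachments>
    </envelope>

  val source = XML.loadString(
    "<contributor><forename>Ada</forename><surname>Lovelace</surname></contributor>")
  val harmonized = <Person><fullName>Ada Lovelace</fullName></Person>

  println(envelope(harmonized, source))
}
```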

The most interesting new security feature allows permissions for XML content to be assigned at the element level, rather than at the document level – for example, hiding elements within a document based on the role of the user, or making elements read-only by role. This is interesting for some of our applications that make particular parts of documents available based on the user.

One of the small but important improvements to the Query Console is that each query tab now displays its own results, which makes working with multiple queries a lot better. It also displays type-ahead autocomplete suggestions for function names – this is also useful, although initially disconcerting.

In all, MarkLogic 9 offers a number of interesting new features. We are currently using the Early Access version of it. We’ve encountered and raised a few issues with it, which have been fixed, and we’re looking forward to it being launched.