‘Jumble sale’ developer meeting

We had another “jumble sale” developer meeting this week, in which a number of people talked briefly about topics that interested them.

Rhys, Reece and Richard H talked about integrating with third-party authentication and authorization systems. For some of our projects, we have specified a simple API that the third-party authentication provider needs to support. This has been an effective approach, and it was particularly useful that we wrote a complete test suite up front against a dummy provider – that meant we could very easily tell whether the third party had followed the specification, and in principle they could have run the tests themselves (although they didn’t). For another project, we have used SAML to provide single sign-on, integrating with existing sign-on systems. We’ve used pac4j from Scala to do this integration, and combined it with MarkLogic authentication to allow login to our systems.
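As a rough sketch of what the pac4j side of this looks like from Scala – the keystore, metadata locations and URLs below are placeholders rather than our real configuration, and the exact configuration class names vary between pac4j versions:

import org.pac4j.saml.client.SAML2Client
import org.pac4j.saml.config.SAML2Configuration

// All paths, passwords and URLs here are illustrative placeholders.
val config = new SAML2Configuration(
  "resource:samlKeystore.jks",   // keystore holding the service-provider key
  "keystore-password",
  "private-key-password",
  "resource:idp-metadata.xml"    // metadata published by the identity provider
)
config.setServiceProviderEntityId("https://example.com/sp")

val saml2Client = new SAML2Client(config)
saml2Client.setCallbackUrl("https://example.com/callback")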

Rodney talked about TypeScript, and how it improves on JavaScript by adding static typing. It’s a strict superset of JavaScript, so all valid JavaScript code is valid TypeScript, and it compiles down to JavaScript. However, we haven’t yet looked into how to transpile TypeScript using the frameworks we typically work with.

Inigo talked about his investigation of why an AWS-hosted MarkLogic cluster was somehow transferring many terabytes of data per month. The possible causes of this were:

  • the communication between end users and Play
  • … between Play and MarkLogic
  • … between the hosts in the ML cluster
  • … and from the servers to their S3 backups

He used stats from the AWS load balancer, combined with Firefox’s network tab to analyze page weight, to rule out the communication with end users. By enabling logging on the load balancer between Play and MarkLogic, he was able to rule out the second cause. Transferring data to S3 is free as long as everything is in the same region – which it was – ruling out the backups. Running iftop on the cluster nodes showed which sockets the data was flowing between, and when – which pinned it down to the MarkLogic intra-cluster communication, in regular bursts.

Looking at what the system was doing, there was a regular background task that queried all of the documents. It took about 30 seconds, but we hadn’t previously been concerned about the inefficiency, since it didn’t affect users. However, because it wasn’t using indices, it needed to transfer a lot of data between the cluster nodes to check the content – which accounted for our data transfer volume. Rewriting the query to use indices not only made it much faster, but also meant it no longer needed to transfer data between nodes, significantly reducing our AWS data transfer bill. The lesson is that when running a cluster on AWS, we care about different sorts of inefficiency than we would running on local machines.

Richard B talked about Effective Scala Futures – how to use Scala futures without using global variables that make testing hard, and how to bulkhead separate threadpools to manage different tasks. His slides are here: Effective Scala Futures dev talk
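The core bulkheading idea looks roughly like this sketch (the pool sizes and task names are invented for illustration): give each kind of work its own bounded pool, and pass the ExecutionContext in explicitly rather than importing the global one, so that tests can substitute their own.

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Separate bounded pools ("bulkheads"): a flood of slow database
// calls can't starve the threads used for outbound HTTP calls.
val dbContext: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
val httpContext: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

// Taking the context as a parameter keeps the code testable.
def loadUser(id: String)(implicit ec: ExecutionContext): Future[String] =
  Future { s"user-$id" } // stand-in for a real database call

val user = loadUser("42")(dbContext)
// a different kind of work would be given httpContext instead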

NixOS and MarkLogic 9

This week, Rodney talked about Nix – the NixOS Linux distro and NixOps tooling, and Simon talked about MarkLogic 9 Early Access that we have been working with for a few months.

Nix Expression Language is a declarative functional language, with aspects that are similar to Haskell, Scheme, and YAML. It’s a domain-specific language that can perform tasks similar to Ansible for configuration, and similar to make for building packages and projects while keeping track of dependencies and when they change.

NixOS is a Linux distribution, built on the Nix package manager, that uses Nix EL to describe the configuration of a machine. It has atomic upgrades and rollbacks of packages.

NixOps again uses Nix EL, but at a higher level to describe how to provision a set of machines in a cloud environment like AWS. Unlike other tools like Puppet or Ansible, NixOps describes the complete desired state of the machine – so if you manually install a package that NixOps isn’t expecting, it will uninstall it again.

NixOps will provision machines based on a description in Nix EL. However, Nix builds are run in an isolated environment, so they depend only on their declared inputs – which is what makes them reproducible.

We’re not currently using Nix for deployment (we generally use Ansible), but Rodney is using it on his desktop, and thinks it is a good set of tools.

We have been using MarkLogic 9 Early Access for a few months. Simon talked about the new features around data integration, security, and manageability, and the improvements to the Query Console.

The ‘template driven extraction’ feature allows us to define templates that convert XML or CSV files to row data or triples on insert. It works best on ‘row-like’ data, such as XML with a set of repeated components, and allows you to define anchor elements via XPath, and then a set of relative XPaths that define particular data fields to be extracted from that content. You can then query this data via a SQL select, or in conjunction with other data via the Optic API.
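For instance, a template along these lines (the store/purchases view and its fields are invented for illustration) would extract one row per purchase element:

<template xmlns="http://marklogic.com/xdmp/tde">
  <context>/purchase</context>  <!-- the anchor element -->
  <rows>
    <row>
      <schema-name>store</schema-name>
      <view-name>purchases</view-name>
      <columns>
        <column>
          <name>product</name>
          <scalar-type>string</scalar-type>
          <val>product</val>  <!-- XPath relative to the anchor -->
        </column>
        <column>
          <name>amount</name>
          <scalar-type>double</scalar-type>
          <val>amount</val>
        </column>
      </columns>
    </row>
  </rows>
</template>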

The ‘Optic API’ allows joins and aggregates over documents, row data and triples – for example, combining a SPARQL query of graph data with a SQL query of relational data. This is valuable because different data is often represented and queried in a form that is particular to the style of data – but needs to be combined with other data to be useful. For example, purchase data might be best represented as a relational database style set of rows, but then processed by aggregating it and combining it with metadata on what is being purchased that is best represented as nodes in a graph database.
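From the Java Client API (callable from Scala), a query against the hypothetical store/purchases view above might look roughly like this sketch – connection details are placeholders, and this is an illustration rather than production code:

import com.marklogic.client.DatabaseClientFactory
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext

val client = DatabaseClientFactory.newClient(
  "localhost", 8000, new DigestAuthContext("admin", "admin"))

// Build a row-oriented plan against the view, and stream the results.
val rowMgr = client.newRowManager()
val p = rowMgr.newPlanBuilder()
val plan = p.fromView("store", "purchases")

rowMgr.resultRows(plan).forEach(row =>
  println(s"${row.getString("product")}: ${row.getDouble("amount")}"))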

The ‘data movement SDK’ extends capabilities that were previously available via tools such as mlcp and corb, and exposes them via a Java API. This allows bulk ingestion of documents, and their transformation.
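A minimal sketch of bulk ingestion with the data movement SDK, again from Scala (hosts, credentials, batch sizes and URIs are all illustrative):

import com.marklogic.client.DatabaseClientFactory
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext
import com.marklogic.client.io.StringHandle

val client = DatabaseClientFactory.newClient(
  "localhost", 8000, new DigestAuthContext("admin", "admin"))

val dmm = client.newDataMovementManager()
val batcher = dmm.newWriteBatcher()
  .withBatchSize(100)   // documents written per batch
  .withThreadCount(4)   // parallel writer threads

dmm.startJob(batcher)
batcher.add("/example/doc-1.xml", new StringHandle("<doc>one</doc>"))
batcher.add("/example/doc-2.xml", new StringHandle("<doc>two</doc>"))
batcher.flushAndWait()  // write any remaining partial batch
dmm.stopJob(batcher)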

‘Entity services’ is a modelling framework for data stored in MarkLogic. It makes it easier to harmonize data from different sources into a consistent model, using the ‘envelope pattern’ to wrap and preserve the original data. It doesn’t do anything that couldn’t have been done manually in previous versions of MarkLogic, but it simplifies the code.
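The envelope pattern itself is straightforward – here is an invented example of the document shape, written as a Scala XML literal:

// The harmonized 'instance' sits alongside the untouched source record.
val envelope =
  <envelope>
    <instance>
      <person>
        <name>Ada Lovelace</name>
        <born>1815</born>
      </person>
    </instance>
    <attachments>
      <!-- the original source data, preserved verbatim -->
      <rec NAME="LOVELACE, ADA" DOB="10/12/1815"/>
    </attachments>
  </envelope>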

The most interesting new security feature allows permissions for XML content to be assigned at the element level, rather than at the document level – for example, hiding elements within a document based on the role of the user, or making elements read-only by role. This is interesting for some of our applications that expose particular parts of documents based on the user.

One of the small but important improvements to the Query Console is that it displays separate results for each query tab, which makes working with it a lot better. It also displays type-ahead autocomplete suggestions for function names – also useful, although initially disconcerting.

In all, MarkLogic 9 offers a number of interesting new features. We are currently using the Early Access version of it. We’ve encountered and raised a few issues with it, which have been fixed, and we’re looking forward to it being launched.


Fantastic Lambdas and How to Deploy Them

As mentioned previously, we are using Terraform to spin up resources in AWS in an automated and repeatable fashion. Mostly it just works, but now and again things get tricky. We hit such a situation when automating the deployment of AWS Lambdas. We were using Terraform to create AWS resources, and then continuously deploying with Ansible: if the lambda source code changed, Ansible would deploy the new version, while privileges and other plumbing were taken care of by Terraform. It all seemed to work well, but trouble was lurking.

The problem was that when setting up the initial version of the lambda in Terraform, we were effectively creating it empty and leaving it up to Ansible to deploy the actual code. This is fine up to the point where you need to run your Terraform script again. Terraform defines its resources declaratively, so if additional resources or changes are needed, you simply run the script again and everything is brought up to date. But when it came to the lambda, it would say to itself “This lambda is declared as being empty, but it isn’t. I’ll fix that!”. So running the Terraform script would wipe the source code. Oops.

We got around this by storing the lambda source code in S3 and always deploying from there. The Terraform script ensures that the bucket and source zip exist, and creates the lambda using that source:

resource "aws_s3_bucket" "source_bucket" {
  bucket = "my-bucket-for-source"
}

resource "aws_s3_bucket_object" "lambda_source" {
  bucket = "${aws_s3_bucket.source_bucket.bucket}"
  key = "source.zip"
  source = "initial_empty_lambda.zip"
}

resource "aws_lambda_function" "my_lambda" {
  function_name = "my_lambda_function"
  s3_bucket = "${aws_s3_bucket.source_bucket.bucket}"
  s3_key = "source.zip"
  runtime = "nodejs4.3"
  environment {
    variables = {
      foo = "bar"
      bez = "baz"
    }
  }
}

Note that creating the S3 object in the way specified (without using the etag attribute) means that Terraform only checks whether the file exists in S3. Importantly, it won’t overwrite an updated zip with the empty one later on…

Meanwhile, the Ansible playbook uploads the latest zip to the S3 bucket and updates the lambda source from it. Now running Terraform will not break the lambda; sanity restored.


Dev meeting – Lambda, Elixir, XQuery, Rider and coding standards

In this week’s dev meeting, we had a ‘jumble sale’ meeting, in which various developers each brought along something short to talk about.

First, Joe talked about AWS Lambda again, in an update to the discussion we’d had a few weeks ago. This was prompted by Amazon’s update to the way that configuration works for Lambda. Previously, to manage lambdas running in separate environments (e.g. dev, test and production), we had to construct zip files containing JSON config files and upload a different file to each environment. Now, Amazon has made environment variables available in lambdas – so you can pass in these parameters instead, which considerably simplifies our deployment process.

Chris talked about the Secret Elixir Club, which is rumoured to meet occasionally at lunchtime to discuss Elixir. Elixir is a dynamically typed functional language that runs on the Erlang VM. Erlang is a little dated as a language, but its environment is excellent – with massive scalability, great support for concurrency via its actor model, and deployments that reputedly achieve ‘nine 9s’ of uptime – i.e. availability of 99.9999999%. Elixir takes this foundation and adds a more modern language, more akin to Ruby and Scala – supporting pattern matching, higher-order functions, and so on.
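For anyone who hasn’t met them, the two features mentioned look like this in Scala (Elixir’s versions are similar in spirit, though the syntax differs):

// Pattern matching with destructuring and a guard...
case class Order(id: Int, total: Double)

def describe(order: Order): String = order match {
  case Order(_, total) if total > 100 => "big order"
  case _                              => "small order"
}

// ...and higher-order functions.
val labels = List(Order(1, 150.0), Order(2, 20.0)).map(describe)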

Reece talked about the new features in XQuery 3.1, which adds arrays, maps, the arrow operator, and string interpolation to XQuery. Some XQuery 3.1 features are available in MarkLogic 9, which is due for release soon and which we’ve been doing some early work with. Unfortunately Simon was ill, and so wasn’t able to talk more about our MarkLogic 9 work.

Inigo talked about our coding standards document. Many companies have long and detailed coding standards documents that aren’t actually followed in practice, and that largely cover cosmetic features of code, like brace style and indentation. Our coding standards document is the following:

  • Strive for immutability
  • Avoid side-effects when possible
  • Prefer package level and overview documentation over method or in-code documentation
  • Write unit tests where possible, and execute them via continuous integration
  • Use existing libraries and third-party code when reasonable
  • Prefer jQuery as a JavaScript library, except when something more complex is necessary. Value accessibility.
  • In user interfaces, prefer allowing forgiveness to asking permission (i.e. make operations reversible, rather than prompting the user to confirm them)
  • For Java coding, read Effective Java, and use it (and apply to .NET and Scala where appropriate)
  • Use logging – effectively

Nikolay effectively summarized these as “be competent and code well”.

Finally, Nikolay talked about Rider, JetBrains’ new IDE for C#. Everyone with experience of Java or Scala is keen on IntelliJ IDEA, which is by far the best IDE for those languages. Rider is the equivalent for C#, and is now in public Early Access. Several of us have used it and found it generally better than Visual Studio, and its early tendency to crash regularly has largely been fixed.


Terraform tricks to cope with conditionally created resources

Terraform is a great tool, and we use it extensively to spin up resources in AWS. It’s very easy to get started: the documentation is great, and once you have built yourself a development environment, it’s just a matter of changing a few config settings to get yourself staging and prod environments too.

It gets more tricky when the different environments are not exact copies of each other. For example, in development you might want to create an SNS topic to use for testing, whereas in staging and production you might want to use an external one.

Creating resources in some environments and not others can be controlled using the “count” parameter on a resource. So, for the SNS topic that we only want in development, we can set an sns_count variable to 1 in development and 0 elsewhere, and use it to create the topic only in development.

But suppose we want to use the ARN of the created topic if it exists, or a fixed ARN of an external topic otherwise. Now it becomes even harder. Suppose that we are creating the SNS topic like this:

resource "aws_sns_topic" "sns_topic" {
  name = "our_topic"
  count = "${var.sns_count}"
}

and we have a variable topic_arn that will contain a fixed topic ARN to use if we have not created one. Terraform does not have full conditional logic, but it does have a simple ternary expression, and a coalesce function which takes the first non-empty string from a list. So you might think that the following string interpolation would work:

"${coalesce(aws_sns_topic.sns_topic.arn, var.topic_arn)}"

Unfortunately it doesn’t, because in the case where we do not create the topic, Terraform complains that there is no arn attribute.

So, to get around this we use this syntax instead:

"${coalesce(join("", aws_sns_topic.sns_topic.*.arn), var.topic_arn)}"

The .*.arn syntax returns a list of the ARNs of all the created topics – which here will contain either one element or none. So we flatten it to a string with the join function, and finally it works as we want. Phew!

Dev meeting – an IntelliJ IDEA XQuery plugin, and AWS Lambda

Reece talked about the work that he’s been doing on an improved XQuery plugin for IntelliJ IDEA. This was inspired by some annoyances he was encountering with the existing XQuery plugin – although it’s generally good, it doesn’t have complete support for MarkLogic’s brand of XQuery, and it doesn’t correctly parse all XQuery.

Reece’s implementation covers more areas of the XQuery language, has better error reporting for issues in your XQuery code, can check for file encoding issues, supports Find Usages, and has a number of other features.

Reece is uploading his XQuery plugin to the JetBrains repository this weekend, so it should soon be publicly available.

Joe and Chris talked about the work that we’ve been doing with AWS Lambda. AWS Lambda is what Joe hoped cloud computing would be all along. AWS EC2 is a server in the cloud, but not fundamentally different from having a server on your own network. AWS Lambda, by contrast, is effectively serverless – you upload a piece of code, and it runs without you worrying about the infrastructure at all. That said, the underlying infrastructure can still make a difference to you, in terms of the latency of invoking the lambda when it hasn’t recently been called.

We are using AWS Lambda as part of a filter for an Amazon SQS queue. We’re testing it using an SBT plugin – SBTMocha. The challenges with testing it are:

  • When uploading the lambda, you don’t need the AWS libraries. However, when you’re testing it locally, you need to have them available. We dealt with this by separating the AWS-specific content from the testable content, and only testing the interesting part that doesn’t depend on AWS.
  • We have a number of environments: dev, qa, production. The configuration for the function needs to vary between each of them. The standard way of dealing with this is to include a “conf” directory as part of the upload, from which the lambda reads its configuration. Our build process in Jenkins constructs an appropriate zip file for the environment.

Using Lambda has been very useful for us – it has removed the need for a server.

Dev meeting – Terraform and CloudFormation

In our dev meeting this week, we discussed Terraform and CloudFormation.

Terraform is a tool for managing infrastructure. We’ve previously talked about Puppet, Ansible, and so on – it’s not the same as them, but is instead the step before those tools. It creates the cloud servers that can then be set up by Puppet or Ansible (although Ansible can be used to create cloud servers too – and Terraform can be used to set up software like MySQL).

Terraform isn’t cloud-agnostic – it has providers that are specific to a platform. This means that you can take full advantage of each platform, but it also means that it isn’t trivial to switch from one to another. We’ve only used it with AWS so far.

You write “.tf” files to configure your infrastructure, in a declarative JSON-esque format. You list things like the servers you want, the groups they’re in, and so on; Terraform will generally work out the order in which these things need to be created. You can apply a prefix to the things it generates, to make it easier to run it repeatedly against the same account.

You install Terraform locally (typically by downloading a binary – it’s written in Go), then point it at the configuration files you have created (a module full of files). You need to make your AWS API key available to it, via a Terraform-specific mechanism (e.g. putting it in a tfvars file). Then you proclaim “Behold! I am the Terraformer!” (at least, Chris does), and execute Terraform. It writes out a state file describing what has been created – and this should be checked in to Git. When you next run Terraform, it will check your state file and make appropriate changes to your system based on it. It also checks the local state against the cloud state, so if you’ve made manual changes to the cloud servers, it will revert them.

Having created the servers, Terraform runs a “provisioner” to do initial setup the first time. This should then hand over to another configuration system. Out of the box, it only supports Chef – and running scripts. In our case, we’ve needed to install Puppet on the server via a script, and then invoke that newly installed Puppet to configure the machine. The provisioning step should be as minimal as possible – Puppet, not the provisioner, should do all the subsequent configuration. Once Puppet is up and running, it will poll the Puppet master to keep the machine up to date and configured.

We’ve generally had a good experience with Terraform – all of our pain has come from Puppet, not from Terraform.

CloudFormation is similar to Terraform, but is AWS-specific. It has a pretty GUI that shows all the components that will be set up. We have been using it for setting up MarkLogic clusters – it’s easiest to start with the standard MarkLogic cluster CloudFormation template, and edit that. It supports some limited conditional logic in the templates – such as creating different things based on region. You can open a template with the live state of a current stack – which provides an easy way of altering the configuration of that stack. There are example CloudFormation templates on the AWS marketplace. The MarkLogic template that we’re using creates servers using MarkLogic AMIs.

We all agreed that using either of these tools was better than managing things via the console – and of the two, we broadly preferred Terraform. We are likely to use Terraform more in future. However, for simple things, we are likely to use Ansible in preference to either – because it can do the creation of AWS servers, as well as doing their configuration, all within one tool.

Silly Scala tricks – the Ternary operator

Scala doesn’t have the C-style ternary operator:

  test ? iftrue : iffalse

preferring to use if:

  if (test) iftrue else iffalse

Dan and I (Inigo) were on a long train journey together and decided to see if we could use Scala’s flexible syntax and support for domain-specific languages to implement our own ternary.

Our best shot is:

// Holds both branches unevaluated, via by-name parameters
class TernaryResults[T](l : => T, r : => T) {
  def left = l
  def right = r
}

// ?= rather than ?, so that (as an assignment-like operator) it
// binds more loosely than ⫶
implicit class Ternary[T](q: Boolean) {
  def ?=(results: TernaryResults[T]): T = if (q) results.left else results.right
}

// ⫶ is a Unicode stand-in for the colon, which is already taken
implicit class TernaryColon[T](left : => T) {
  def ⫶(right : => T) = new TernaryResults(left, right)
}

(1+1 == 2) ?= "Maths works!" ⫶ "Maths is broken"

This is pretty close – it’s abusing Unicode support to have a “colon-like” operator since colon is already taken, and it needs to use ?= rather than ? so the operator precedence works correctly, but it’s recognizably a ternary operator, and it evaluates lazily.