Fantastic Lambdas and How to Deploy Them

As mentioned previously, we are using terraform to spin up resources in AWS in an automated and repeatable fashion. Mostly it just works, but now and again things get tricky. We hit such a situation when automating the deployment of AWS Lambdas. We were using terraform to create AWS resources and then continuously deploying with ansible. So if the lambda source code changed, ansible would deploy the new version, while privileges and other plumbing were taken care of by terraform. It all seemed to work well, but trouble was lurking.

The problem was that when setting up the initial version of the lambda in terraform we were effectively creating it empty and leaving it up to ansible to deploy the actual code. This is fine up to the point you need to run your terraform script once again. Terraform defines its resources declaratively, so if additional resources or changes are needed you simply run the script again and everything is brought up to date. But when it came to the lambda it would say to itself “This lambda is declared as being empty, but it isn’t. I’ll fix that!”. So running the terraform script would wipe the source code. Oops.

We got around this by storing the lambda source code in s3 and always deploying from there. The terraform script ensures that the bucket and source zip exists and creates the lambda using that source:

resource "aws_s3_bucket" "source_bucket" {
  bucket = "my-bucket-for-source"
}

resource "aws_s3_bucket_object" "lambda_source" {
  bucket = "${aws_s3_bucket.source_bucket.bucket}"
  key = "source.zip"
  source = "initial_empty_lambda.zip"
}

resource "aws_lambda_function" "my_lambda" {
  function_name = "my_lambda_function"
  s3_bucket = "${aws_s3_bucket.source_bucket.bucket}"
  s3_key = "source.zip"
  runtime = "nodejs4.3"
  environment {
    variables = {
      foo = "bar"
      bez = "baz"
    }
  }
}

Note that creating the zip in the way specified (without using the etag attribute) means that terraform only checks if the file exists in s3. Importantly it won’t overwrite an updated zip with the empty one later on…

Meanwhile, the ansible playbook uploads the latest zip to the s3 bucket and updates the lambda source using that. So now running terraform will not break the lambda, sanity restored.

 

Terraform tricks to cope with conditionally created resources

Terraform is a great tool and we use it extensively to spin up resources in AWS.  It’s very easy to get started, the documentation is great and once you have built yourself a development environment it’s just a matter of changing a few config settings and you have yourself a staging and prod environment too.

It gets more tricky when the different environments are not exactly copies of each other. For example, in development you might want to create a SNS topic to use for testing whereas in staging and production you might want to use an external one.

Creating resources in some environments and not others can be controlled using the “count” parameter on a resource. So in the case of the SNS topic we only want to create in development, we can set an sns_count variable to be 1 in development and 0 elsewhere and use this to only create the topic in development.

But suppose we want to use the ARN of the created topic if it exists, or a fixed ARN of an external topic otherwise. Now it becomes even harder. Suppose that we are creating the SNS topic like this:

resource "aws_sns_topic" "sns_topic" {
  name = "our_topic"
  count = "${var.sns_count}"
}

and we have a variable topic_arn that will contain a fixed topic ARN to use if we have not created one. Terraform does not have any kind of conditional logic, but has a simple ternary function and there is a coalesce function which takes the first non-empty string from a list. So you might think that the following string interpolation would work:

"${coalesce(aws_sns_topic.sns_topic.arn, var.topic_arn)}"

Unfortunately it doesn’t because in the case where we do not create the topic, Terraform complains that there is no arn attribute.

So, to get around this we use this syntax instead:

"${coalesce(join("", aws_sns_topic.sns_topic.*.arn), var.topic_arn)}"

The .*.arn syntax returns a list of the ARNs for all the created topics, which will be either one or none. So we flatten this to a string with the join function and finally, it works as we want. Phew!

Lessons learnt from production problems

At last week’s dev meeting we swapped war stories about problems encountered in production, how we tracked down the root cause and lessons we learnt. Here are some highlights:

  • Environment is very important. When trying to reproduce a bug, you need to try to replicate the environment as closely as possible. Even very minor changes to libraries, frameworks, subsystems or operating system patches can make a major difference to whether the bug is seen or not.
  • “We haven’t touched anything”. The person telling you that nothing has changed may well believe this is the case. However, if a system mysteriously stops working it’s best to actually verify that this is the case. There was a bit of moaning about Windows at this point. At least with text file based configuration you can easily do a diff. Not so easy when someone has unchecked a vital property in Windows config settings.
  • Test your systems like a real user. We discussed a problem on a website where the functionality in question had been tested thoroughly by people within the organisation. Unfortunately, when real external users came to use the site, it didn’t work as expected. This was due to a private address being used that could only be seen within the corporate network. So it worked for internal users, but not for real customers. If it had been tested via an external network this would have come to light before real users hit problems.
  • The bug may be in old code. It’s possible that the bug you are seeing has been lying dormant for years and is being triggered by some innocuous change. We talked about a situation where a new release would cause a site to have severe performance problems. Much time was spent looking at all the changes going into that release to see if there was some change to database access or the like causing the problem. In the end it transpired that a small change to the cookies used was triggering a latent bug in a script used for load balancing. This script had been running fine in production for years until this seemingly minor change caused it fork processes like crazy and bring down the site.

Surprising output from Specs2

I noticed yesterday that the output from some integration tests running on our CI server were producing large amounts of output. It turned out that 41000 lines of the 43000 lines were coming from a single test. The reason for this is the way Specs2 handles the contain() matcher with lists of Strings. It means that the following:

listOfIds must not(contain(badId))

is effectively the following, checking each string to see if it contains the given one:

listOfIds(0) must not(contain(badId))
listOfIds(1) must not(contain(badId))
listOfIds(2) must not(contain(badId))
....

So if you are looking at a long list, this yields some very verbose output. With types other than String, this seems to work as expected.

Virtuoso Jena Provider Problem

In a project that we’ve just started, we are using OpenLink Virtuoso as a triple store. I encountered a frustrating bug when accessing it via the Jena Provider where submitting a SPARQL query with a top-level LIMIT clause would return one less result than expected. In my case, the first query I tried was an existential query with LIMIT 1, so it caused much head scratching as to why I was getting no results.

Luckily OpenLink are responsive to issues raised on GitHub, so once I raised this issue and created an example project, it was quickly found to be solved by using the latest version of their JDBC4 jar. Problem solved.