Chef staging environment

Current situation

We use Chef to deploy our apps. It is a really great tool: combined with Git, it allows us to work together and change the configuration of our live environment with a simple push.

To achieve that, we have the following process:

Git commit –> Jenkins –> Build agent –> Chef Server

  1. Someone commits to Git
  2. Jenkins monitors the repo; when a new commit appears, it pulls the commit onto a build agent, which tests it (Foodcritic)
  3. If the test succeeds, the build agent runs a script on the Chef Server that pulls the change and loads it into Chef

This worked pretty well. The problem was that a change would usually pass the Foodcritic test, but something else would fail during deployment, something you cannot detect without running chef-client on a server. Because we only had a production environment, every time someone broke Chef, it impacted all our servers. In theory this was not a major problem: we don’t run chef-client as a daemon, so nothing was impacted until someone manually ran chef-client on a server. That was only almost true, because we use AWS autoscaling, which can trigger the build of a new server using Chef.

As you can imagine, it became a big issue when we were working on Chef while new code was being pushed to our servers (we use continuous integration).

What do we want to fix?

  • Detect all possible issues before deploying our Chef changes

The only way to test Chef in a real environment is to run it on a server. You cannot really reuse the same server each time, because some Chef roles could conflict with each other and you would have to uninstall everything before each new test run. Therefore we need to boot a brand new server for each test. What are the possible solutions to automatically boot a new server?

  1. AWS EC2: That’s what we use for production. The only issue is that you pay for a full hour every time you boot an instance.
  2. Another cloud service: AWS works very well for us, and we don’t want to bother with another one.
  3. Our own virtualisation product: It has to be easy to manipulate from our current infrastructure, and it’s better if it is open source.

We chose LXC: it is built into the Linux kernel, the overhead is minimal, and it is really easy to script, as a VM is as simple as a folder.

  • Do not impact production when we work on Chef changes

Possible options?

  1. Use the Chef ENVIRONMENT feature: Chef has a built-in environment feature that lets each server choose which environment it uses, production or staging for example. The problem is that it is not as isolated as we would like. First, deploying a cookbook to the staging environment simply pins the cookbook version in that environment, so you have to bump the cookbook version and be sure you don’t update the version pinned for production. Second, Roles and Data Bags are not versioned: even though you can define attributes per environment (which is really nice), you cannot change a role without impacting the production role. This was the first solution we chose; however, its implementation was quite complex, and it was really easy to break production by playing with staging.
  2. Use a second Chef server: It is perfect. It is completely isolated from production, and we can change the server a client uses by editing /etc/chef/client.rb. The only problem is that we need to pay for a new server, which will require monitoring and maintenance.

We chose the second solution, to be 100% sure that we will not break production.
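
To make the switch concrete, here is a minimal sketch of the relevant line in /etc/chef/client.rb. The hostnames are hypothetical and not taken from our setup; `chef_server_url` is the standard client setting that selects the server. The node would also need credentials that the chosen server accepts.

```ruby
# /etc/chef/client.rb (fragment)
# Hypothetical hostnames, for illustration only.
# Point this node at the staging Chef server:
chef_server_url 'https://chef-staging.example.com'

# To move the node back to production, swap the URL instead:
# chef_server_url 'https://chef-production.example.com'
```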

  • Automatically boot an LXC server using our staging Chef server, and be sure the change succeeds before pushing it to production

There are not a lot of ways to test Chef. We could have developed our own tools to boot an LXC environment, run chef-client in it and then run a series of tests against it. However, we preferred to use Toft, a Ruby library based on Cucumber, RSpec and LXC. Toft handles the launch of the LXC containers and runs the Cucumber tests against them.

The solution

We knew our tools: Chef, Jenkins, LXC and Toft.

We then had to choose a process that was as simple as possible, as transparent as possible for the users (us) and as automated as possible. After several tests, we chose the following process:

Git commit –> Jenkins –> Build agent –> Chef Staging Server –> Toft –> LXC container –> Jenkins –> Build agent –> Chef Production Server

  1. Someone commits to Git
  2. A first Jenkins job monitors the repo; when a new commit appears, it pulls the commit onto a build agent, which tests it (Foodcritic)
  3. If the test succeeds, the build agent runs a script on the Chef STAGING server
  4. Toft runs on our LXC server: it boots an LXC container with the roles we want, then tests the results (expected files, running processes, etc.)
  5. If the Toft run and tests succeed, a second Jenkins job is triggered
  6. The second Jenkins job retrieves the last successful build (validated by Toft), then tells a build agent to pull this specific build onto the Chef PRODUCTION server and load it into Chef

That’s it, a fully automated test solution for Chef.
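
The gating logic of this process can be sketched in a few lines of Ruby. This is only an illustration, not our actual Jenkins configuration: the `Stage` struct, the `run_pipeline` helper and the stage names are hypothetical, and each lambda stands in for a real step (Foodcritic, the upload to the staging server, the Toft run, the upload to production).

```ruby
# Each stage is a name plus a callable that returns true on success.
Stage = Struct.new(:name, :action)

# Runs the stages in order and stops at the first failure.
# Returns the names of the stages that completed successfully.
def run_pipeline(stages)
  completed = []
  stages.each do |stage|
    break unless stage.action.call
    completed << stage.name
  end
  completed
end

# Hypothetical stand-ins for the real steps; the Toft stage is made
# to fail here to show that production is never touched in that case.
stages = [
  Stage.new('foodcritic', -> { true }),
  Stage.new('upload_to_staging', -> { true }),
  Stage.new('toft', -> { false }),
  Stage.new('upload_to_production', -> { true })
]

completed = run_pipeline(stages)
puts completed.inspect
```

Because the simulated Toft stage fails, the list of completed stages stops at the staging upload and the production upload is never reached.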

Posted in Uncategorized | Tagged , , , , , | 1 Comment

The Xmas Gift of IAM Self Service

Update: 9 April 2013
AWS has announced an update to IAM which allows the use of variables which makes this process much easier: http://docs.aws.amazon.com/IAM/latest/UserGuide/PolicyVariables.html. This post has been updated to reflect these changes.

When we created IAM accounts for our team, we wanted them to be as self-service as possible. Ideally, we wanted to provide a username and temporary password and then have users set up their own access key, security certificate and MFA, and manage these themselves from then on. However, this obviously couldn’t be at the expense of security. We needed to ensure that users could only modify their own information, and that if we remove a user from our account, they’re gone.

On the surface, this seemed like a trivial problem: we’d simply generate a policy for full IAM access for each individual user and that would be the end of it:

{
  "Statement": [
    {
      "Sid": "Stmt1356297700489",
      "Action": [
        "iam:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:iam::ACCOUNT#:user/${aws:username}"
      ]
    }
  ]
}

Unfortunately, the major challenge we faced was that IAM permissions through the console appear to be an all-or-nothing system. Even if you have full IAM access for your own user, you can’t do anything, as you don’t have permission to list other users:

[Screenshot: IAM fail]

After much investigation (trial and error), we found the only way to achieve what we wanted was to give groups full read-only IAM access and to control write access with individual policies. While this may seem like a security vulnerability, the IAM console is cleverly designed not to reveal any secret information. It may not be ideal that all users can see the groups that are set up, who has access keys, etc., but none of this information is particularly useful.

This is the policy we’re using at the moment:

{
  "Statement": [
    {
      "Action": [
        "iam:*Password*",
        "iam:*AccessKey*",
        "iam:*SigningCertificate*",
        "iam:*MFADevice*",
        "iam:UpdateLoginProfile"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:iam::ACCOUNT#:user/${aws:username}"
      ]
    },
    {
      "Action": [
        "iam:*MFADevice*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:iam::ACCOUNT#:mfa/${aws:username}"
      ]
    }
  ]
}

Properly formatted code is available in this gist. We have had to change this a couple of times as AWS has made modifications to the names of IAM permissions. Unfortunately, we haven’t found out about these changes until someone has told us they can’t do something.

With our relatively small number of users, setting this up manually wasn’t too onerous but it also should be fairly straightforward to script. If anyone does get around to scripting it, please share! We’ll probably end up doing it soon anyway.
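
As a starting point for such a script, here is a minimal Ruby sketch that builds the policy above for a given account ID and prints it as JSON. The helper name `self_service_policy` and the account ID are hypothetical. The `Version` element is an addition that does not appear in the policy above; it is included on the assumption, based on the AWS documentation, that policy variables such as `${aws:username}` are only expanded when the policy declares version `2012-10-17`. Attaching the policy to users or groups is left out.

```ruby
require 'json'

# Builds the self-service policy as a Ruby hash.
# account_id is the AWS account number (the ACCOUNT# placeholder above).
def self_service_policy(account_id)
  # ${aws:username} is passed through literally; IAM expands it
  # when the policy is evaluated.
  user_arn = 'arn:aws:iam::' + account_id + ':user/${aws:username}'
  mfa_arn  = 'arn:aws:iam::' + account_id + ':mfa/${aws:username}'

  {
    # Assumption: not part of the original policy, added for policy variables.
    'Version' => '2012-10-17',
    'Statement' => [
      {
        'Action' => [
          'iam:*Password*',
          'iam:*AccessKey*',
          'iam:*SigningCertificate*',
          'iam:*MFADevice*',
          'iam:UpdateLoginProfile'
        ],
        'Effect' => 'Allow',
        'Resource' => [user_arn]
      },
      {
        'Action' => ['iam:*MFADevice*'],
        'Effect' => 'Allow',
        'Resource' => [mfa_arn]
      }
    ]
  }
end

# Hypothetical account ID, for illustration only.
policy_json = JSON.pretty_generate(self_service_policy('123456789012'))
puts policy_json
```

Generating the document from code rather than by hand also avoids the kind of brace and quote mistakes that are easy to make when editing JSON manually.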

Hopefully this post will become redundant soon if AWS improves how it handles these permissions.


London ElasticSearch user group #1

One of the great things about working for Lonely Planet is the opportunity to get out and about at meetups and conferences. Last week we organised and sponsored the first London ElasticSearch user group meetup.

We had two talks. Andrew Clegg introduced his ElasticSearch plugin for fast approximations, which builds on the probabilistic data structures provided by Clearspring’s stream-lib. As well as a clever piece of work in its own right, I think it shows one of the strengths of ES, its extensibility. Check out the slides.

Next up was our very own Marc Watts. Marc introduced some of the tools and techniques we’ve employed at LP to roll out ElasticSearch as fast as possible. Marc picked up on some of the speed bumps we found with integration testing, Tire, and monitoring. He also showed some live metrics showing a healthy, happy ElasticSearch cluster.

At LP, we use ElasticSearch as our primary document store, and it’s the source of truth for our ‘view of the world’. By building the whole site on a search engine, we can scale our editorial team and enable them to curate an enormous amount of content. (A different kind of scalability challenge from the ones we normally talk about on this blog.)

Finally, we were lucky enough to have a Q&A with Shay Banon and Uri Boness from ElasticSearch. They were in town to deliver training (highly recommended if you’re looking for a deep dive into ES internals and production usage). Their session covered a ton of interesting and useful information, some of which I’ve tried to précis here (any inaccuracies are down to my note-taking skill, not Shay or Uri):

  • The team aim for consistency across languages for the client libs, e.g. standard names and data structures for low-level methods. Look out for an overhaul for Tire and others.
  • Replacing HTTP with another standard e.g. protobufs or Thrift would not be a huge win: the important thing is really the quality of the underlying HTTP client libs. For example, not all Ruby HTTP libs support keep-alive, parsing headers in Perl takes sooo long
  • For 1.0, some good things to have would be (a) no full restarts for major upgrades (b) better story for loading data into memory (Lucene 4.0 will help) (c) backoffs (d) better story for snapshot/restore
  • New in 0.20 are warmers: on refresh, just before results are available, the warmer searches are run so first users won’t hit disk
  • 0.21+ will focus on upgrade to Lucene 4.0
  • Field data cache loads all fields even if all queries are restricted to a tight filter. This is because other queries might need the proper data
  • Nested queries are much faster than parent/child queries (and probably faster than any B-tree-based document store). Parent/child allows for documents with different lifecycles, but the docs have to be joined in memory
  • Shay doesn’t believe that SSL between nodes is the full story for thorough node security. Right now, you can maybe do something with nginx proxies in front of each node.
  • You can use ES as primary doc store. But recommendation is to have the ability to reindex everything. At least one ES user has PBs of data in ES but also flat files on S3
  • Will there be a way of splitting shards? Well, just 10 shards will get you way far. Also if you don’t identify your partitioning key, you’re in trouble anyway e.g. MongoDB clusters die cos folks add nodes when they’re at 80% capacity, but splitting is expensive and takes more than the remaining 20%… Shay gave a talk on this at BerlinBuzzwords.

The next #lesug meetup will probably take place end of January 2013. For updates and more information:

At some point soon we’ll get set up on MeetUp.com, too.

So thanks very much to our speakers, and Shay and Uri, and everyone who came along for a chat.

Thanks also to Lonely Planet for sponsoring the evening’s pizza and drinks. (This was the first meetup I’ve organised so I was mighty relieved to find I’d ordered enough pizza.)

One last thing. If you’re interested in working with ElasticSearch, we’re hiring.


Fozzie Updates

This week we received a pull request to the Fozzie gem, which enabled developers to turn off the Fozzie Rails Middleware when used within a Rails application.

After some thinking, I felt a better way to handle this was to abstract the Rails-specific functionality into a separate gem, following the same pattern as the RSpec and RSpec Rails gems.

It also felt like a good time to promote Fozzie to version 1.0.0, after some positive feedback I received at the WebPerfDay on Friday.

Therefore, Fozzie 1.0.0 is now up and requires you to add the following to your Rails application to monitor your Controller methods:

gem 'fozzie_rails'

If you want to use Fozzie in your Rails application, but without the Controller monitoring, use:

gem 'fozzie'

A big thank you to all who have so far contributed code and comments to Fozzie.


Performance and metrics tools and resources

As promised, here are all the tools and resources mentioned in our (@mjenno and @davenolan) talk at VelocityConf today.

Graphite and friends

Alternative frontends

Related tools

Holt-Winters

Also

Continuous experimentation

And there will be more posts on CE here soon (if there aren’t, please harass me on Twitter).


lonelyplanet.com Performance Baseline

In our first post on site performance we promised to share some figures baselining our current performance. Since then, we’ve been preparing for our presentation at Velocity EU, so it’s taken longer than hoped to share these figures.

To baseline our performance, we’ve gathered figures for August on major areas of our site. These are full page load times, in seconds, using backbone tests in IE8, averaged across the US, UK and Australia:

Homepage: 8.757

Forum: 4.236

Destination: 5.454

Shop homepage: 5.513

Shop details: 4.918

Things to do list: 5.671 

This list is far from exhaustive. From here, the most important thing we’re doing is working on improving these numbers, and we expect to make great progress over the coming months.

We’re also really excited about the tools we’re building to help us understand how our site performs for real users in all locations, on all browsers. More to come on that! In the interim, we’ve started monitoring more areas of the site with backbone tests so we can give a more complete picture.


How to get a job with our Engineering team

We’re looking to hire enthusiastic, energetic and clever people. Are you great at Linux? AWS? Postgres? Ruby? MS SQL? A little of each, or a specialist in one field?

Send us a quick YouTube or Vimeo video telling us about yourself and what you’re good at. Tell us a bit about:

1. What technology are you excited and passionate about?

2. What do you think is going to take off in the industry in the next year or two?

3. Something entertaining (this one is completely optional).

Feel free to include anything else you like!

We’re in London, so you’ll need to be too.

We do accept standard CVs, but a video gives us a quick glimpse into important things like cultural fit and communication qualities.
