Friday 2 May 2014

Lessons learned from setting up a website on Amazon EC2

I recently got involved with helping someone sort out their website on an Amazon EC2 instance. It had been a few years since I had needed to do anything with EC2, and I realised that I was a novice in this world - the job raised a number of issues around deploying to EC2 and around performance.

So I thought it might be useful to run through them for any other EC2 novices who are asked to do something similar, and who want to learn from my rather blundering progress through it all :-)

Apologies to those of you who are already well familiar with EC2 for covering some of the basics.

The system, moodpin.co.uk, was based on a commercial PHP application, Pintastic.
Pintastic lets you set up a site like pinterest.com or wanelo.com.
These sorts of sites are for creating subject-specific photo-sharing social media systems - like Instagram, Picasa etc., but focussed around communities of shared (usually commercial) interest, for example buying shoes, interior decor and so on.
The common UI they tend to present is big scrolling pages of submitted images related to topics for sharing, comment and discussion.

So this system sends out a lot of notification emails, displays hundreds of images per page - the visual pin board - and, to help with performance, has custom caching built in, triggered by cron jobs.

Hence we have a number of cron jobs, with the caching ones running every couple of minutes. To me this appeared a pretty crude caching mechanism, but my job was not to rewrite the application, just to tweak the code and get it all running OK.
The code mainly uses a standard MVC approach like everything else these days!

So, demonstrating how outdated my knowledge of EC2 and of this application was, I thought: OK, first of all, what platform is it? It was Amazon's own Linux - this uses yum rather than apt for package installs, so as distros go it's perhaps more Red Hat-like than Debian-like.

For those unfamiliar with the basics - go to Amazon Web Services and sign up!
You can then choose to add some of the 40-odd different services that are available under the AWS umbrella.

Once you have signed up for a few of these, you get a management console that links to a control dashboard for each service. The first stop is usually EC2 itself - the one with the computer instances on. From there you can pick an AMI (i.e. an operating system image) and a zone - e.g. US West (Oregon) - and use them to create a new instance. Add an SSH key pair for shell access, fire the instance up and download the .pem file so you can ssh into your new Amazon box.
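
By way of illustration, connecting is just standard ssh with that downloaded key - a minimal sketch, where the key file and the public DNS name are placeholders for whatever your dashboard shows:

    # ssh refuses keys that are world-readable, so lock the .pem file down first
    chmod 400 mykey.pem

    # Amazon Linux AMIs log you in as ec2-user (Ubuntu AMIs use 'ubuntu')
    ssh -i mykey.pem ec2-user@ec2-203-0-113-25.us-west-2.compute.amazonaws.com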

So the client wanted the usual little tweaks to PHP code and CSS - easy stuff, it's just web development ... done in a jiffy (well, after digging through the MVC layers, templating language, cache issues, CSS inheritance etc. of a fairly complex PHP app you have never come across before, when PHP is not exactly your favourite language ... jiffyish, maybe).
Then we got to the more sysadmin-related requests ... let's just say I probably shouldn't rush out and buy a DevOps T-shirt just yet ...

'Get email working'

  1. Try to send an email from the web application - write a plain PHP script that just sends a test email - just run mail from the Linux command line (there is a sketch of this test ladder after the list) ... Got it - there is no MTA installed!
  2. Install an MTA - sendmail. Go back up that stack of tests and they are all working ... hurray, that was easy.
  3. A week or so later ... 'emails stopped working'
  4. Go back to step 1 and, yep, emails have stopped working.
  5. Look at the mail logs and see what the problem is.
  6. Realise that there are masses of emails being sent out ... but all of them are bouncing back as unverified.
  7. Think ... wow, that Pintastic site's notifier is busy - it must be getting lots of traffic *
  8. So why has Amazon started bouncing all the email?
  9. Search Amazon's docs. Amazon has a very minimal test quota allowed for email. Once that quota is filled, unverified email will be blocked.
  10. Amazon has historically been one of the main sources of SPAM machines; that history means it has had to set up a much more elaborate mechanism for validating email than most hosting companies, and it no longer allows direct emailing from EC2 boxes (apart from the minimal test quotas).
  11. So what we need to do is set up our mail to be sent via the Amazon SES service - add the SES service and enable it.
  12. So now we need to send authenticated emails to the Amazon SES gateway, which will then forward them on to the outside world.
  13. Try to get sendmail to send authenticated emails; follow the guide, but it continues to bounce with authentication failures. Give up and install postfix instead, follow the 20 steps of setting up the SASL password etc. (sketched after the list), and eventually it doesn't bounce with authentication errors - hurray!
  14. But the email still bounces. So we need to verify all our sending email addresses - managed via the SES console, or scripted as sketched after the list - or use DKIM to get the whole domain we are sending from verified and signed.
  15. Modify the addresses used by the sending software to ones which we can receive and verify - send the verification emails and confirm them. Our emails are working again.
  16. Leave it a few days, we are not sending email anymore, boooo!
  17. Check all the SES documentation: surprise, surprise, SES also has test-level-only quota limits, and you have to formally apply to get those limits lifted.
  18. Contact the client and get him to make a formal request for quota lifting on his account.
  19. *As part of the investigation, check that email log a little more closely. It seems rather large, and we seem to be using up our quota really quickly ... ah, the default setup for Unix cron sends an email for every job that returns text. The Pintastic cache job returns text, so we are sending a pointless email every two minutes ... or trying to ... whoops. Make sure no cron or other Unix system command is acting as a SPAM bot (the crontab fix is sketched after the list).
  20. A few days later - Amazon say our quota has been lifted
  21. Our emails have started sending again ... and they are still sending today!!!
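
The step 1 test ladder was roughly the following - a sketch with placeholder addresses; the log path is the default on this Amazon Linux box and may differ elsewhere:

    # test the same mail() call the application makes, without the application
    php -r 'var_dump(mail("me@example.com", "PHP test", "Hello from EC2"));'

    # drop a level and test the system mail command itself
    echo "Hello from EC2" | mail -s "CLI test" me@example.com

    # and watch the mail log for bounces and errors
    tail -f /var/log/maillog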
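
The step 13 postfix-to-SES setup boiled down to something like the sketch below (run as root). Treat it as indicative rather than definitive: the SES SMTP endpoint region, and the SES_SMTP_USER / SES_SMTP_PASSWORD credentials (which come from the SES console), are placeholders:

    # postfix, plus the SASL plain-auth mechanism it needs to talk to SES
    yum install -y postfix cyrus-sasl-plain

    # relay all outgoing mail through the SES gateway, authenticated and encrypted
    postconf -e "relayhost = [email-smtp.us-west-2.amazonaws.com]:587"
    postconf -e "smtp_sasl_auth_enable = yes"
    postconf -e "smtp_sasl_security_options = noanonymous"
    postconf -e "smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd"
    postconf -e "smtp_tls_security_level = encrypt"

    # store the SES SMTP credentials and compile them into a postfix lookup table
    echo "[email-smtp.us-west-2.amazonaws.com]:587 SES_SMTP_USER:SES_SMTP_PASSWORD" > /etc/postfix/sasl_passwd
    postmap hash:/etc/postfix/sasl_passwd
    chmod 600 /etc/postfix/sasl_passwd /etc/postfix/sasl_passwd.db

    service postfix restart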
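
Step 14's address verification is point-and-click in the SES console, but if you have the AWS CLI installed it can also be scripted - a sketch, with an illustrative address:

    # SES emails a confirmation link to the address; it stays unverified until clicked
    aws ses verify-email-identity --email-address notifications@moodpin.co.uk

    # and you can check where you stand against the sending quota at the same time
    aws ses get-send-quota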
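
And the step 19 fix for cron's accidental spam-bot impression - a sketch, assuming a crontab entry along these lines (the actual Pintastic cache command will differ):

    # MAILTO="" stops cron emailing the output of every job in this crontab ...
    MAILTO=""
    # ... and/or silence the noisy job itself by discarding its output
    */2 * * * * php /var/www/html/cron/cache.php > /dev/null 2>&1
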
Client's response: OK, thanks ... by the way, since we added all the start-up data, i.e. the uploaded images, the site takes at least two minutes to render the home page - or times out altogether.
Hmmm, I did kinda notice that ... but hey, he hadn't asked me to make the site actually usable speed-wise ... until now!

'Why is the site, really, really slow?'


Hmmm, wow, it really is slow; a lot of the time it just dies. That PHP cache thingy can't be doing much, so what's the problem?

  1. Let's look at the web site. Wow, it takes 5 minutes for the page to come back ... so this isn't exactly ApacheBench territory ... run up a few tabs looking at the home page ... and it starts just returning server timeouts.
  2. So what's happening on the server ... what's killing the box? top tells us that it's Apache killing us here, with 50-odd processes spawning and sucking up all the memory and CPU.
  3. So we check out our Apache config and it's the usual PHP-orientated config of MPM prefork. But what are the values set to? ... They are for a great big multiprocessor Cadillac of a machine, whilst ours is more of a Smart car in its scale.
  4. The lesson is that Amazon AMIs are certainly not smart enough to ship different configs for the different hardware specs of the instances they run on. It appears they default their configs to suit the top-of-the-range instances (since, I guess, those cost the most). If you have a minimal hardware spec box ... you should reconfigure the hardware-related parameters of the software you run on it ... or potentially it will fail.
  5. Slash all those server and client values to the number of servers and processes the box can actually deliver (the back-of-an-envelope sizing is sketched after this list). It was slightly trial and error ... but eventually we got MaxClients 30 instead of 500 etc., and gave it a huge timeout.

    <IfModule prefork.c>
    StartServers       4
    MinSpareServers    2
    MaxSpareServers  10
    ServerLimit      30
    MaxClients       30
    MaxRequestsPerChild  4000
    </IfModule>
  6. Now let's hammer our site again ... hurray, it doesn't completely fall over ... one day it may return a page, but it's still horribly, horribly slow, i.e. 3 minutes absolute top speed, and the more home page requests you pile on, the slower they get.
  7. So let's get some stats: access the page with the browser's web dev network tools. What's taking the time here? Hmmm, web page a second - not great, but acceptable. JS and CSS 0.25 sec - OK. Images, hmmm, images ... for the home page particularly ... 3-6 minutes ... so basically unusable.
  8. So, time to bite the bullet. We know Apache can be slower at serving static files if it's not optimised for the job - especially if resources are limited, since its processes have a bigger memory overhead - which is why the Apache foundation has another web server, Apache Traffic Server, for exactly that.
  9. But what's the standard static server (the one that's grabbed half of Apache's share of the web in the last few years)? Yep, nginx.
  10. So let's set up the front end of our site as nginx acting as a reverse proxy to Apache, which now just does the PHP work, with nginx serving all the images. So modify Apache to serve only on port 8080 on localhost and flip the site over to an nginx front end, with the following nginx conf ...

    server {
            listen       80;
            server_name  moodpin.co.uk;

            # serve the app's cache, CMS and upload directories directly from disk
            location ~ ^/(cache|cms|uploads)/ {
                    root        /var/www/html/;
                    expires     7d;
                    access_log  /var/log/nginx/d-a.direct.log;
            }

            # serve static assets by extension directly, with a long expiry
            location ~* \.(css|rdf|xml|ico|txt|gif|jpg|png|jpeg)$ {
                    root        /var/www/html/;
                    expires     365d;
                    access_log  /var/log/nginx/d-a.direct.log;
            }

            # everything else (i.e. the PHP) is proxied through to Apache
            location / {
                    proxy_pass  http://127.0.0.1:8080/;
            }
    }

    Wow, wow - so take that 3-6 minutes and replace it with 1-2 seconds.
  11. So how many images are on the home page? About 150, plus more with scrolling ... which means roughly one dynamic PHP request to 150-odd static requests per page view - under 1% dynamic, over 99% static content.
    That is a very, very static site - hence the 100x faster speed!
  12. So there you go, client - take that souped-up Smart car and go.
  13. Client replies ... ummm, site's down - server proxy timeout error.
  14. Go to Google and check: it turns out we have to make sure that nginx's timeout settings are greater than Apache's - and the nginx default timeout is 60 seconds.
  15. Raise the nginx *_timeout settings to 10 minutes (sketched after this list) ... sounds bad, but try the site and it consistently delivers pages in 3 seconds or so. I assume the scrolling, constantly-requesting nature of the app makes the required timeout much longer than the apparent time Apache takes to deliver the PHP?
  16. Show the client again; he's happy.
  17. A few days later ... 'this bit of the site's not working now'.
  18. Check the code and discover that there are a handful of JavaScript files used by the system that are not really static - they are PHP templates generating JavaScript that merely appear static. Remove the js file type from the static extensions list in the nginx config. Hurray, the generated JavaScript is served from Apache PHP again, and that bit of the site works.
  19. OK, we are done ... don't run ApacheBench against the site ... if the client actually gets any users and it can't cope - tell him to upgrade his instance.
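
For anyone wanting the step 5 arithmetic, the prefork sizing was basically this back-of-an-envelope calculation - a sketch, where 'httpd' is the Apache process name on this Amazon Linux box and the numbers are illustrative:

    # biggest resident set size of any Apache worker, in KB
    ps -C httpd -o rss= | sort -n | tail -1

    # memory actually free once everything else has taken its share, in MB
    free -m

    # MaxClients ~= memory available to Apache / per-worker size
    # e.g. ~600 MB spare / ~20 MB per worker => MaxClients of around 30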
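
And the step 15 change, for reference - a sketch of the raised proxy timeouts in the nginx location block (nginx defaults to 60 seconds, after which it returns the proxy timeout errors the client saw):

    location / {
            proxy_pass          http://127.0.0.1:8080/;

            # outlast Apache's huge timeout instead of 504-ing at nginx's 60s default
            proxy_read_timeout  600s;
            proxy_send_timeout  600s;
    }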

I hope my tales of DevOps debuggery are useful to you. Bye!