Starting up in the cloud: saving money meets convenience

I was at the second CloudCamp Scotland last night, and two things struck me. Firstly, is it really six months since the previous one? And secondly, it’s amazing how much more sense all this ‘cloud’ stuff makes when you’ve actually tried it out.
There’s still something… not quite right… about how The Cloud is talked about, marketed, conferenced around. Cloud Computing is Capitalised, as if it’s something New and Special and Exciting. We’re “in the cloud”! When I was at the first CloudCamp, I thought I knew what the cloud was, but ended up being even more confused than I was before — mostly by all the jargon, tough talk, and obfuscation of what is in fact a Really Simple Idea.
Imagine you can borrow servers off someone else. Now imagine that there’s actually an unlimited number of them at your disposal, but you only pay for what you use. That’s the cloud.
What you do with the servers, how you configure them, and so on — that’s where the jargon starts creeping in. But fundamentally, all the cloud means is that you don’t have to have servers sitting in your office any more.
A startup’s story
An interesting comment at CloudCamp, relayed from Seedcamp, was about startups using the cloud. Most Seedcamp finalists have been going for a couple of years; when they started, the cloud wasn’t the most viable option, but apparently plenty of them said that if they were starting up now, they’d definitely use it.
Why? Well, firstly, there’s the cost aspect. You only pay for what you use, so you can turn stuff on and off at will, saving you money. There’s also the TechCrunch effect. If you get a ton of traffic and signups, you can quickly scale up the resources you have to meet that demand — no more single-server slashdotting. Then when the spike falls off, you don’t have a load of unused machines sitting around costing you money — you just return them to The Cloud and Bob’s your uncle.
(Aside: I really do have an Uncle Bob. Do you?)
There’s still some kind of crazy magic wall in the way though. I’m a startup. I’ve decided this cloud thing sounds jolly sensible. Now how on earth do I actually use it?
How we used the cloud for a one-off, time-limited project

FestBuzz was a perfect ‘cloud’ project. It’s focused around a specific period of time (August 2009), after which the demand is negligible (but we still wanted to keep it online somewhere). During its lifespan, we wanted to do some complex computational stuff, to have a data-backed website, to be constantly fetching data in, and to be able to roll out one-off mashups. We had absolutely no idea how much traffic we’d get beforehand.
So from the start we designed it in a modular fashion, with The Cloud™ at the forefront of our minds. The main components of the site broke down fairly easily into different units:
- Web server (Apache with mod_wsgi)
- Database (MySQL)
- Django site, including plenty of images/javascript/CSS. Data-driven, but reading data only.
- Django administration site, able to edit the data.
- Data fetching subsystem (effectively, a set of scripts that righteously abused the Twitter API to pull in tweets)
- Data processing subsystem (more Python code that did NLP and other stuff, such as show matching, on the tweets)
- Separately, code hosting/versioning and other file hosting
After developing a skeleton site locally, the first “cloudy” thing we did was to set up a new server, pay-as-you-go, and deploy a Django environment on it. We didn’t even have Apache at this stage. We could have just bought a VPS or remotely rack-mounted box and used that, especially considering the fact we wanted to keep it alive afterwards: running a single server 24/7 is not really one of the most cost effective things you can do in the cloud. However, this was development time: we saved money by turning the server off when we were done coding and turning it back on when we wanted to roll out new updates.
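To put rough numbers on that trade-off, here is a back-of-envelope sketch in Python. The hourly rate and the working pattern are made-up assumptions for illustration, not our actual bill or any provider’s real price list.

```python
# Back-of-envelope comparison: an always-on server versus one that is
# only switched on while we are coding.
# HOURLY_RATE_PENCE and the working pattern are hypothetical figures.

HOURLY_RATE_PENCE = 5  # assumed pay-as-you-go price per server-hour


def monthly_cost_pence(hours_per_day, days_per_month, rate_pence=HOURLY_RATE_PENCE):
    """Cost in pence of running one server for the given hours."""
    return hours_per_day * days_per_month * rate_pence


always_on = monthly_cost_pence(24, 30)  # never switched off
dev_only = monthly_cost_pence(4, 20)    # four hours of coding, twenty days a month

print("always on: %.2f pounds" % (always_on / 100.0))
print("dev only:  %.2f pounds" % (dev_only / 100.0))
```

Under those assumptions the always-on box costs nine times as much, which is why a pay-by-the-hour development server made sense even though a permanent live server is less of a bargain.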
Another early thing we set up was Git code hosting — again, not something that explicitly needed the cloud, but something that happened to use it. This was cool for us because the provider we used didn’t actually charge at all — we basically had flexible, secure code hosting, saved somewhere in the ether that didn’t care if we added another megabyte or ten. We also used Dropbox, another cloud storage provider, to sync and publish relevant documents — as well as Google Mail and Docs, of course.
But those are all cloud services with pretty interfaces. When we started our first pay-by-the-hour server, it was ugly. You basically get a root (or sudo user) prompt and a machine with nothing installed. Pretty similar to buying a ‘real’ server! And if you want more servers? You get more of the same. That’s it. That’s the cloud. An arbitrary number of brand new machines, with arbitrary numbers of CPUs, amounts of RAM and storage, etc. The higher the specs, the more a machine costs to run per hour.
So the dirty way to do Cloud, which is more or less what we did, is to manually take charge.
We started with one machine. Installed everything we needed to get a perfect staging server on it. Configured Apache to serve Django, configured MySQL to run locally, had some of our data scripts running in a screen session. For all intents and purposes it was a single self-contained machine, and for the first week of the site’s launch — as nothing particularly intensive was going on — this was basically the ‘live’ server.
Two things happened before we pushed that server to live, though: we got our cloud hosting provider to image the server, so that we could start unlimited copies of it whenever we wanted, and we pushed all our media files to a content delivery network (CDN). We used Amazon CloudFront, and let me tell you — it’s cheap. I just checked and our entire bill for the month, for serving a ton of small image, JS and CSS files, was about £4. The only problem with CloudFront was versioning; you have to be a little careful when pushing new files, because usually the old ones will still be served up, so use numeric versioning and/or don’t serve files until they’re final.
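The numeric versioning idea can be sketched in a few lines of Python. This is only an illustration: the function names, the ".vN" naming scheme and the CDN hostname are my own assumptions, not what FestBuzz actually ran.

```python
import posixpath


def versioned_name(path, version):
    """Insert a numeric version before the file extension.

    'css/site.css' with version 3 becomes 'css/site.v3.css', so a changed
    file gets a brand new name and the CDN's cached copy of the old one
    never gets in the way.
    """
    root, ext = posixpath.splitext(path)
    return "%s.v%d%s" % (root, version, ext)


def media_url(cdn_base, path, version):
    """Build the full URL a template would reference (cdn_base is hypothetical)."""
    return cdn_base.rstrip("/") + "/" + versioned_name(path, version)
```

For example, media_url("http://cdn.example.com/", "css/site.css", 3) gives "http://cdn.example.com/css/site.v3.css". Bump the number whenever the file changes and update the templates to match.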
OK, so where are we in the cloud? We have one server running 24/7 that’s doing everything, for now. We’ve got another copy of it that we’re starting up, doing some development work on, and shutting down. Thanks to Git, deploying the development code on the live server just means pulling the latest changes into its local checkout, which is fine for our short-term needs.
Things start to heat up a bit, and we want to crank up some more data processing power. We look at the load on our single server, and decide to start up a new server with a decent amount of memory to serve the database. Again, we do everything manually, importing the data and switching over smoothly. We then start up another server to do some data munching and nothing else. The way we’ve written our data systems, the various components can run in parallel quite happily — there’s no dodgy overwriting/blocking/waiting going on. If we hadn’t been thinking about the cloud from the start, we might not have designed it this way, though I’d like to think we would (it’s best practice, after all!).
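One simple way to get that kind of non-overlapping parallelism is to give each worker its own slice of the data. The sketch below is an assumption about how it could be done, not a description of the FestBuzz code; the function names and the modulo scheme are hypothetical.

```python
def belongs_to(tweet_id, worker_index, worker_count):
    """True if this worker is responsible for the given tweet."""
    return tweet_id % worker_count == worker_index


def partition(tweet_ids, worker_count):
    """Split ids into one disjoint list per worker."""
    return [
        [t for t in tweet_ids if belongs_to(t, i, worker_count)]
        for i in range(worker_count)
    ]


# With two workers, shards[0] holds the even ids and shards[1] the odd ones.
shards = partition([101, 102, 103, 104, 105], worker_count=2)
```

Because every id lands in exactly one shard, two processing servers never touch the same tweet and nothing has to wait on a lock.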
This is about as exciting as our setup ever got. But it’s pretty cool. To have farmed out stuff to separate ‘real’ servers, we’d have paid a lot more. As soon as the Festival ended, we could turn off the data and development servers, and move the database back to the web server (mostly to save money, as the DB server was fairly expensive to run). If we’d done this ‘for reals’ we might well have just used a local setup for development and only paid for one live server, but therein lies ruin:
We decided to officially launch FestBuzz at an event run by the Edinburgh Fringe (the ‘Twinge Party’). As part of this, we designed a one-off sub-site that aggregated and displayed tweets about the party itself, and from a comedy tweet session we were running. The event kicked off at 6pm on the 14th; at 5.30pm on the 14th, I got an email from a friend. “Hey, did you realise that FestBuzz is throwing errors?” For some reason, the site monitoring system hadn’t picked up this particular problem (it was pretty random in the end; reversing the order of two import statements fixed it).
Suffice it to say, not a good time for the site to be malfunctioning. Fortunately, the development server was on hand — identical to the live server, but for some bizarre reason, not throwing the same errors. A quick switch and the working server was live, and we all breathed again. If we’d only had a local development server, this change wouldn’t have been an option. As it was, we could easily have started up an imaged version of the same server, fetched the latest code, and deployed that (a better, albeit slightly longer and unnecessary solution). All thanks to the cloud.
The cloud also meant that when we wanted a playground to do a bit of tinkering, we could just start one up, pay something like 20p for the few hours we tinkered, and shut it down again. It gave us a lot of flexibility. But, as outlined above, the way we went about it was very hands-on and, because the scale of the site was small, we mostly did things manually. If you’re only starting two or three new servers, tinkering with them yourself is perfectly reasonable. If you’re starting two or three hundred, you’ll want one of the many management and automation tools out there.
The magic
I feel that doing things the way we did gave us a fairly ground-level view of the whole process. I now have a feeling for how much work it is to manage this stuff manually, and an appreciation of the flexibility granted by being able to just create a server out of thin air for some random task. There are plenty of things we could have done differently, better; plenty of ‘best practices’ and other systems we could have laid down in case the site needed to scale quickly. While costing us time and money in the short term, failsafe procedures certainly pay for themselves should you get slashdotted.
Load balancing. We didn’t do this. Despite using the cloud, we had a single point of failure (SPOF), as the Twinge Party near-disaster showed. Ideally we’d have had multiple read-only web servers available and a load balancer that happily farmed requests out between them.
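For a feel of what a load balancer does, here is a toy round-robin version in Python. It is a sketch under my own assumptions (the class and the server names are hypothetical); in practice you’d use a real load balancer rather than roll your own.

```python
import itertools


class RoundRobinBalancer(object):
    """Hand out backends in turn, skipping any marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.down = set()
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.down.add(backend)

    def mark_up(self, backend):
        self.down.discard(backend)

    def next_backend(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no healthy backends available")
```

Create it with something like RoundRobinBalancer(["web1", "web2", "web3"]) and call next_backend() for each request. When a server starts throwing errors, mark_down() takes it out of rotation and the rest carry on serving.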
Distributed data processing. We didn’t do this either, as our single server seemed to cope with everything just fine. We could have processed data more quickly and efficiently if we’d farmed it out to multiple servers, especially since things like MapReduce are, quite frankly, designed for this sort of stuff. Why didn’t we use it? Time and sufficiency. Our solution was sufficient, if not efficient, and we were very, very short on time. Learning MapReduce and rewriting to accommodate it wasn’t really an option. We did briefly look at Hadoop as well, but the same constraint applied.
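To show the shape of the MapReduce idea, here is a single-machine sketch in Python that counts show mentions in tweets. The show names and function names are hypothetical, and the matching is far cruder than anything FestBuzz did. A real framework such as Hadoop would run the map and reduce steps across many servers.

```python
from collections import defaultdict

SHOWS = ["hamlet", "improv night"]  # hypothetical show names


def map_tweet(text):
    """Map step: emit a (show, 1) pair for every show the tweet mentions."""
    lowered = text.lower()
    return [(show, 1) for show in SHOWS if show in lowered]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: total the mentions for one show."""
    return key, sum(values)


def count_mentions(tweets):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = [pair for tweet in tweets for pair in map_tweet(tweet)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
```

The point of the split is that map_tweet only ever looks at one tweet and reduce_counts only ever looks at one show, so both steps can be spread over as many machines as you care to rent.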
Automatic scaling. If we wanted a new server, we logged into the control panel and turned one on by hand. Using the provider’s API (I never did get the sample API code to work) or an intermediary service, we could have automated some of this, again in case the volume of data really started to scale, for example if we’d gotten our hashtag to trend.
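The decision part of automatic scaling is simple enough to sketch in Python. Everything here is an assumption for illustration: the function names, the backlog-based rule and the limits are hypothetical, and the actual starting and stopping of servers (which depends on your provider’s API) is deliberately left out.

```python
import math


def servers_needed(queue_length, per_server_capacity, minimum=1, maximum=10):
    """Return how many processing servers the current backlog calls for.

    The minimum and maximum limits are hypothetical safety rails: never
    scale to zero, and never run up an unbounded bill.
    """
    wanted = int(math.ceil(queue_length / float(per_server_capacity)))
    return max(minimum, min(maximum, wanted))


def scaling_action(current, queue_length, per_server_capacity):
    """Positive result: start that many servers. Negative: stop that many."""
    return servers_needed(queue_length, per_server_capacity) - current
```

Run something like this on a timer, feed it the size of the tweet backlog, and pass the result to whatever starts and stops machines at your provider.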
In a way, we were lucky: we didn’t need to scale, because stuff didn’t go supernova. I’d like to think that thanks to the cloud and some elbow grease, we would have coped if it had. As it was, our experience with the cloud was enough to educate us, especially in the area of ‘things we should have done but didn’t’, and our somewhat crude solution worked fine for the volumes of traffic and data we had.
I just hope this sheds some light on what this ‘cloud’ malarkey is all about when it comes to startups: saving money, mostly! It’s just an easy way to get tons of computers. What you do with them is up to you.
Rather than thinking of the cloud as anything shiny and new, can I cheekily suggest it’s a return to form for useful networked computing? Looked back on, will cloud storage/apps seem like the logical extension of Unix shell accounts, with the last decade-and-a-bit of offline personal computer storage as an odd blip caused by bandwidth constraints? ;)