
Starting up in the cloud: saving money meets convenience

Startups 24 September 2009 | 1 Comment

Cloud by akakumo on flickr

I was at the second CloudCamp Scotland last night, and two things struck me. Firstly, is it really six months since the previous one? And secondly, it’s amazing how much more sense all this ‘cloud’ stuff makes when you’ve actually tried it out.

There’s still something… not quite right… about how The Cloud is talked about, marketed, conferenced around. Cloud Computing is Capitalised, as if it’s something New and Special and Exciting. We’re “in the cloud”! When I was at the first CloudCamp, I thought I knew what the cloud was, but ended up being even more confused than I was before — mostly by all the jargon, tough talk, and obfuscation of what is in fact a Really Simple Idea.

Imagine you can borrow servers off someone else. Now imagine that there’s actually an unlimited number of them at your disposal, but you only pay for what you use. That’s the cloud.

What you do with the servers, how you configure them, and so on — that’s where the jargon starts creeping in. But fundamentally, all the cloud means is that you don’t have to have servers sitting in your office any more.

A startup’s story

An interesting comment made at CloudCamp by way of Seedcamp was about startups using the cloud. Most Seedcamp finalists have been going for a couple of years; when they started, the cloud wasn’t the most viable option, but apparently plenty of them were saying that if they started up now, they’d definitely use it.

Why? Well, firstly, there’s the cost aspect. You only pay for what you use, so you can turn stuff on and off at will, saving you money. There’s also the TechCrunch effect. If you get a ton of traffic and signups, you can quickly scale up the resources you have to meet that demand — no more single-server slashdotting. Then when the spike falls off, you don’t have a load of unused machines sitting around costing you money — you just return them to The Cloud and Bob’s your uncle.

(Aside: I really do have an Uncle Bob. Do you?)

There’s still some kind of crazy magic wall in the way though. I’m a startup. I’ve decided this cloud thing sounds jolly sensible. Now how on earth do I actually use it?

How we used the cloud for a one-off, time-limited project

FestBuzz

FestBuzz was a perfect ‘cloud’ project. It’s focused around a specific period of time (August 2009), after which the demand is negligible (but we still wanted to keep it online somewhere). During its lifespan, we wanted to do some complex computational stuff, to have a data-backed website, to be constantly fetching data in, and to be able to roll out one-off mashups. We had absolutely no idea how much traffic we’d get beforehand.

So from the start we designed it in a modular fashion, with The Cloud™ at the forefront of our minds. The main components of the site broke down fairly easily into different units:

  • Web server (Apache with mod_wsgi)
  • Database (MySQL)
  • Django site, including plenty of images/javascript/CSS. Data-driven, but reading data only.
  • Django administration site, able to edit the data.
  • Data fetching subsystem (effectively, a set of scripts that righteously abused the Twitter API to pull in tweets)
  • Data processing subsystem (more Python code that did NLP and other stuff, such as show matching, on the tweets)
  • Separately, code hosting/versioning and other file hosting

After developing a skeleton site locally, the first “cloudy” thing we did was to set up a new server, pay-as-you-go, and deploy a Django environment on it. We didn’t even have Apache at this stage. We could have just bought a VPS or remotely rack-mounted box and used that, especially considering the fact we wanted to keep it alive afterwards: running a single server 24/7 is not really one of the most cost effective things you can do in the cloud. However, this was development time: we saved money by turning the server off when we were done coding and turning it back on when we wanted to roll out new updates.
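
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The 5p-per-hour rate and the six-hour coding day are assumptions made up for illustration, not our provider's actual prices, and `monthly_cost_pence` is a hypothetical helper.

```python
def monthly_cost_pence(rate_pence_per_hour, hours_per_day, days=30):
    """Cost of one pay-by-the-hour server over a month, in pence."""
    return rate_pence_per_hour * hours_per_day * days


# Assumed figures, for illustration only.
RATE = 5  # pence per hour

always_on = monthly_cost_pence(RATE, hours_per_day=24)
coding_hours_only = monthly_cost_pence(RATE, hours_per_day=6)

print(f"Always on:         £{always_on / 100:.2f}")
print(f"Coding hours only: £{coding_hours_only / 100:.2f}")
```

Under those assumed figures, a server that only runs while you're coding costs a quarter of one left on all month.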

Another early thing we set up was Git code hosting — again, not something that explicitly needed the cloud, but something that happened to use it. This was cool for us because the provider we used didn’t actually charge at all — we basically had flexible, secure code hosting, saved somewhere in the ether that didn’t care if we added another megabyte or ten. We also used Dropbox, another cloud storage provider, to sync and publish relevant documents — as well as Google Mail and Docs, of course.

But those are all cloud services with pretty interfaces. When we started our first pay-by-the-hour server, it was ugly. You basically get a root (or sudo user) prompt and a machine with nothing installed. Pretty similar to buying a ‘real’ server! And if you want more servers? You get more of the same. That’s it. That’s the cloud. An arbitrary number of brand new machines, with arbitrary numbers of CPUs, amounts of RAM and storage, etc. The higher the specs, the more a machine costs to run per hour.

So the dirty way to do Cloud, which is more or less what we did, is to manually take charge.

We started with one machine. Installed everything we needed to get a perfect staging server on it. Configured Apache to serve Django, configured MySQL to run locally, had some of our data scripts running in a screen session. For all intents and purposes it was a single self-contained machine, and for the first week of the site’s launch — as nothing particularly intensive was going on — this was basically the ‘live’ server.

Two things happened before we pushed that server to live, though: we got our cloud hosting provider to image the server, so that we could start unlimited copies of it whenever we wanted, and we pushed all our media files to a content delivery network (CDN). We used Amazon CloudFront, and let me tell you — it’s cheap. I just checked and our entire bill for the month, for serving a ton of small image, JS and CSS files, was about £4. The only problem with CloudFront was versioning; you have to be a little careful when pushing new files, because usually the old ones will still be served up, so use numeric versioning and/or don’t serve files until they’re final.
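
As a minimal sketch of the numeric-versioning idea: bake a version number into each file name, so that every release gets a URL the CDN has never cached. The hostname and the `versioned_url` helper below are hypothetical and don't come from the FestBuzz code.

```python
import posixpath

# Hypothetical CDN hostname, for illustration only.
CDN_BASE = "http://cdn.example.com"


def versioned_url(path, version):
    """Return a CDN URL with the version number embedded in the file name.

    "css/site.css" at version 3 becomes ".../css/site.v3.css", so a new
    release is a new URL and the stale cached copy is never requested.
    """
    directory, filename = posixpath.split(path.lstrip("/"))
    stem, ext = posixpath.splitext(filename)
    return posixpath.join(CDN_BASE, directory, f"{stem}.v{version}{ext}")


print(versioned_url("css/site.css", 3))
```

The templates then ask for `versioned_url("css/site.css", 3)` rather than hard-coding the path, and bumping the number is the only step needed when a file changes.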

OK, so where are we in the cloud? We have one server running 24/7 that’s doing everything, for now. We’ve got another copy of it that we’re starting up, doing some development work on, and shutting down. Thanks to Git, to deploy the development code on the live server we just need to update the local codebase — that’s fine for our short-term needs.

Things start to heat up a bit, and we want to crank up some more data processing power. We look at the load on our single server, and decide to start up a new server with a decent amount of memory to serve the database. Again, we do everything manually, importing the data and switching over smoothly. We then start up another server to do some data munching and nothing else. The way we’ve written our data systems, the various components can run in parallel quite happily — there’s no dodgy overwriting/blocking/waiting going on. If we hadn’t been thinking about the cloud from the start, we might not have designed it this way, though I’d like to think we would (it’s best practice, after all!).

This is about as exciting as our setup ever got. But it’s pretty cool. To have farmed out stuff to separate ‘real’ servers, we’d have paid a lot more. As soon as the Festival ended, we could turn off the data and development servers, and move the database back to the web server (mostly to save money, as the DB server was fairly expensive to run). If we’d done this ‘for reals’ we might well have just used a local setup for development and only paid for one live server, but therein lies ruin:

We decided to officially launch FestBuzz at an event run by the Edinburgh Fringe (the ‘Twinge Party’). As part of this, we designed a one-off sub-site that aggregated and displayed tweets about the party itself, and from a comedy tweet session we were running. The event kicked off at 6pm on the 14th; at 5.30pm on the 14th, I got an email from a friend. “Hey, did you realise that FestBuzz is throwing errors?” For some reason, the site monitoring system hadn’t picked up this particular problem (it was pretty random in the end; reversing the order of two import statements fixed it).

Suffice it to say, not a good time for the site to be malfunctioning. Fortunately, the development server was on hand — identical to the live server, but for some bizarre reason, not throwing the same errors. A quick switch and the working server was live, and we all breathed again. If we’d only had a local development server, this change wouldn’t have been an option. As it was, we could easily have started up an imaged version of the same server, fetched the latest code, and deployed that (a better, albeit slightly longer and unnecessary solution). All thanks to the cloud.

The cloud also meant that when we wanted a playground to do a bit of tinkering, we could just start one up, pay something like 20p for the few hours we tinkered, and turn it back down again. It gave us a lot of flexibility. But, as outlined above, the way we went about it was very hands-on and — because the scale of the site was small — we mostly did things manually. If you’re only starting two or three new servers, tinkering with them yourself is perfectly reasonable. If you’re starting two or three hundred, then you might want to use the many other tools out there.

The magic

I feel that doing things the way we did gave us a fairly ground-level view of the whole process. I now have a feeling for how much work it is to manage this stuff manually, and an appreciation of the flexibility granted by being able to just create a server out of thin air for some random task. There are plenty of things we could have done differently, better; plenty of ‘best practices’ and other systems we could have laid down in case the site needed to scale quickly. While costing us time and money in the short term, failsafe procedures certainly pay for themselves should you get slashdotted.

Load balancing. We didn’t do this. Despite using the cloud, we had a single point of failure, as the Twinge Party near-disaster showed. Ideally we’d have had multiple available reading servers and a load balancer that happily farmed stuff out.
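
For a feel of what "farming stuff out" means, here is a toy round-robin balancer. It's a sketch of the general technique, not something we ran; the class name and server names are made up, and a real setup would use a dedicated load balancer in front of the web servers.

```python
import itertools


class RoundRobinBalancer:
    """Hand out servers in turn, skipping any that are marked as down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._cycle = itertools.cycle(self._servers)

    def mark_down(self, server):
        self._healthy.discard(server)

    def mark_up(self, server):
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self):
        if not self._healthy:
            raise RuntimeError("no healthy servers available")
        # One full lap of the cycle is enough to find a healthy server.
        for _ in range(len(self._servers)):
            server = next(self._cycle)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
print([balancer.next_server() for _ in range(4)])
```

With something like this in place, a broken server gets marked down and requests carry on going to the others, instead of someone hurriedly swapping machines by hand.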

Distributed data processing. We didn’t do this either, as our single server seemed to cope with everything just fine. We could have processed data more quickly and efficiently if we’d farmed it out to multiple servers, especially since things like Map/Reduce are, quite frankly, designed for this sort of stuff. Why didn’t we use this? Time and sufficiency. Our solution was sufficient, if not efficient, and we were very, very short on time. Learning Map/Reduce and rewriting to accommodate it wasn’t really an option. We did briefly look at Hadoop as well, but the same constraint applied.
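
The shape of the idea fits in a few lines. This is a single-machine sketch of map/reduce applied to show matching, not what FestBuzz actually ran: the function names, show names and tweets are invented, and the matching is a naive substring test.

```python
from collections import Counter
from functools import reduce


def map_tweet(tweet, shows):
    """Map step: count which shows a single tweet mentions."""
    text = tweet.lower()
    return Counter(show for show in shows if show.lower() in text)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_mentions(tweets, shows):
    partial_counts = (map_tweet(tweet, shows) for tweet in tweets)
    return reduce(reduce_counts, partial_counts, Counter())


# Made-up data, for illustration only.
shows = ["Show A", "Show B"]
tweets = [
    "Loved Show A tonight",
    "show a and Show B were both great",
    "Nothing relevant here",
]
print(count_mentions(tweets, shows))
```

Because each tweet is mapped independently and the partial counts merge in any order, the map step could be handed to as many servers as you like, which is exactly the property that makes this style a good fit for the cloud.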

Automatic scaling. If we wanted a new server, we manually entered the control panel and turned one on. Thanks to API magic (I never did get the sample API code to work) or an intermediary, we could have automated some of this stuff — again, in case the volume of data really started to scale, for example, if we’d gotten our hashtag to trend.
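
The decision logic behind that kind of automation can be very small. Here is a sketch with invented thresholds and a hypothetical function name; the calls that would actually start or stop servers through a provider's API are deliberately left out, since I never got that part working.

```python
def desired_server_count(current, load, low=0.3, high=0.75,
                         min_servers=1, max_servers=10):
    """Decide how many servers we want, given average load per server.

    `load` is a fraction between 0 and 1. The thresholds and limits are
    assumptions for illustration, not tuned values.
    """
    if load > high:
        target = current + 1  # struggling: add a server
    elif load < low:
        target = current - 1  # idling: hand one back
    else:
        target = current      # comfortable: leave things alone
    return max(min_servers, min(max_servers, target))


print(desired_server_count(current=2, load=0.9))
```

A script running this every few minutes, and asking the provider for the difference, would replace the trip to the control panel.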

In a way, we were lucky: we didn’t need to scale, because stuff didn’t go supernova. I’d like to think that thanks to the cloud and some elbow grease, we would have coped if it had. As it was, our experience with the cloud taught us plenty, especially in the area of ‘things we should have done but didn’t’, while our somewhat crude solution worked fine for the volumes of traffic and data we had.

I just hope this sheds some light on what this ‘cloud’ malarkey is all about when it comes to startups: saving money, mostly! It’s just an easy way to get tons of computers. What you do with them is up to you.


Link voting: real-time respect

Featured, Online 22 September 2009 | 1 Comment

By clickykbd on flickr

Sometimes life just moves too quickly, y’know?

This post over at RWW is surprisingly thought-provoking for all it’s sponsored. (Aside: What a strange grammatical construction.) I’m not really sure I trust or even believe their random numbers, but the concept of implicit vs explicit voting for sites and the interaction of realtime vs old-school search are both interesting.

Implicit voting

So, implicit voting is where you give a site a silent thumbs-up. The most common way of implicitly voting for a site is just to visit it; this actually works in two ways, the action of clicking to get to the site, and what you do once you’re there. Explicit voting, on the other hand, is where you actively promote the site — for example, by tweeting/retweeting a link to it, or linking to it yourself.

Where does submission to a social news site such as Digg or Hacker News fit in? Well, my first thought is that submitting is explicit voting, but simply voting up (I agree with the submitter that this site is interesting) is implicit. By this measure you could say that retweeting links falls somewhere between implicit and explicit: if you model Twitter as a kind of Digg, with retweets as ‘votes’, you can see the parallels. Is del.icio.us’ing a link implicit or explicit? Bookmarking locally? Linking in IRC?

Anyway, that’s a case of detail.

Tracking explicit voting is fairly easy: look around for mentions of the URL. OK, there’s some magic involved in de-obfuscating and unifying references, but that’s just techie icing. Once you know who’s mentioned the URL and when, you can do all sorts of computations to work out some kind of search ranking system. PageRank is just one approach, but there are modifications and things you can borrow from other search algorithms, especially HITS (one of my favourites!), that exploit the social graph as well. If you have more information (perhaps the entire tweet, or blog post, or whatever), you can even do language analysis and add that extra dimension of understanding to the link. But fundamentally, you’re just looking at links.
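
Since HITS is a favourite, here is a bare-bones sketch of it over a tiny link graph. The people and URLs are invented, and this is the textbook hub/authority iteration rather than any particular search engine's implementation.

```python
import math


def hits(links, iterations=20):
    """Compute hub and authority scores.

    `links` maps each linker (a person or a page) to the list of things
    it links to. Good hubs point at good authorities, and vice versa.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}

    def normalise(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {node: v / norm for node, v in scores.items()}

    for _ in range(iterations):
        # Authority: sum of the hub scores of everything linking in.
        auths = {node: 0.0 for node in nodes}
        for source, targets in links.items():
            for target in targets:
                auths[target] += hubs[source]
        auths = normalise(auths)
        # Hub: sum of the authority scores of everything linked to.
        hubs = normalise({
            node: sum(auths[target] for target in links.get(node, []))
            for node in nodes
        })
    return hubs, auths


# Made-up graph: three people tweeting links.
links = {
    "alice": ["example.com/post"],
    "bob": ["example.com/post", "example.com/other"],
    "carol": ["example.com/post"],
}
hubs, auths = hits(links)
print(max(auths, key=auths.get), max(hubs, key=hubs.get))
```

The appeal for social data is that you get two rankings at once: which links matter, and which people are good at finding them.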

It gets a lot more interesting when you try to work out an implicit measurement system. For votes that are click-throughs, there are ways to measure those, although not perfectly: bit.ly statistics, toolbar trackers, etc. For votes that are based within a site, you’re kind of stuck unless you’re a) the site owner or b) embedded in the user’s browser somewhere. The browser is the best place: there, you can measure if the user has it open in a tab for hours untouched, or if they keep flicking to and from it, etc, etc. But by the very nature of such things, you’re going to get a selective set of data. And what about the aforementioned pasting into IRC/IM/email? What about linking the fact I spent thirty minutes on a site with the fact I tweeted it and then wrote a blog post about it?

It all comes back to user lifestreams, and the fact that today’s communication is far too disjointed for these types of measurements. Which is a shame, I think. Somehow we must be able to combine the wisdom of the crowd with an individual’s self-knowledge: I know that all these sites belong to me, so I know it’s just me voting for the site with a fairly loud mouth. (Unifying the voter isn’t even a necessary step, but I feel it’s important, especially when you consider fun things such as recommendation algorithms and shill detection.)

Real-time information

Let’s assume we have some kind of implicit data about links as well as explicit. A key measurement axis we have is time – so we can spot voting spikes, clusters, etc. The long tail is an interesting quandary, though. Do people searching for a term want the most recent/trendy items, or the ones voted most authoritative over time? It depends on the user, and on the search. Even for a trending topic, a user might be searching for the background, not the latest happenings — so you have to offer both, surely, to satisfy user needs. At what point does a short term voting spike become part of a long-term vote? Would a smoothing function of time work?
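
One candidate for such a smoothing function is exponential decay, where each vote counts for less the older it gets. This is only a sketch of that one option: the function name, the 24-hour half-life and the use of plain hour counts as timestamps are all assumptions.

```python
def decayed_score(vote_times, now, half_life_hours=24.0):
    """Sum of votes, each halving in weight every `half_life_hours`.

    `vote_times` and `now` are hours on the same clock.
    """
    score = 0.0
    for voted_at in vote_times:
        age = max(now - voted_at, 0.0)  # treat future-dated votes as brand new
        score += 0.5 ** (age / half_life_hours)
    return score


# A vote from 48 hours ago plus one from 24 hours ago.
print(decayed_score([0, 24], now=48))
```

A short half-life gives you the trendy ranking and a very long one approaches a plain vote count, so offering both views to the user could be as simple as running the same function twice.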

There’s also the option to embed implicit voting within the search system itself, something like Google’s SearchWiki. If a site provided the information you wanted, you somehow give it a thumbs up. (Of course, users do this explicitly at the moment by tweeting links, though — at least with my own behaviour — that’s not that frequently linked to searching. I’m far more likely to tweet something I’ve browsed to or been sent). This would provide a trackable form of implicit voting, but still nothing near perfect.

User behaviour could be a problem, of course; what would cause a user to vote up a site? Interestingness? Relevance? I vote things up on Hacker News because they’re interesting, but I’d vote things up on Google if they were relevant. In a way, the real-time, sporadic flurry of retweets is a measure of interestingness and timeliness; the time spent on a site is a measure of interestingness and usefulness; the bounce rate and whether it shows up in search results at all is a measure of relevance. What are we measuring? Until we know that, we can’t rank!

The peer-to-peer system proposed by the RWW article’s author, Faroo, is one way of doing things, but I’m somewhat sceptical. I don’t think it’s going to be possible to get quality implicit voting data in sufficient representative quantities to do anything particularly accurate just yet, but as our habits and the way we search and browse change, it may become so.

Update: This TechCrunch post about the star rating distribution on YouTube — and, as a side link, this post about web reputation systems — are both interesting and vaguely related. Especially when you consider the proposed measure of implicit voting for YouTube videos: how many times you rewatch it, or whether you even finished watching it at all. (Is that accurate? If I watch a video for a few seconds, long enough to identify it as a decent version of the Black Knight scene so I can link it to a friend, does that mean I dislike it? Or are situations like that mere noise?)


Forgetting the milk

Productivity 16 September 2009 | 0 Comments

fofurasfelinas ~ flickr

I use Remember the Milk, but I never use it to remember the milk.

Ironic.

Here’s why. I’m not a terribly good GTD-aholic. I am forever thankful for the day I absorbed the GTD principle of “don’t worry about stuff before you have to”, i.e. I schedule tasks in RTM for the day I have to think about them and then forget them completely. It’s fantastic. However, I still keep some pretty terrible habits kicking around: one of them is what my mum used to call “the big shelf”. I think visually, so it doesn’t matter if everything’s in heaps on the floor; as long as I remember which things are in which heap, I’m fine.

This pops up in RTM as todo-list-laziness. I have one list. “Inbox”. It contains all my tasks. This is partly exacerbated by the fact I mostly use RTM via the Gmail gadget which gives me no incentive to use multiple lists. But who cares. It’s fine as it is.

Until I want to make a shopping list, and my current system totally and utterly breaks. I can’t add shopping list items as individual tasks, so to speak; my system gets overcluttered and, since I’m date-driven, I basically have to add nonsensical tasks like “milk today” “eggs today”. Even if I created them in a new list that’d still be the case, though at least then I’d have some separation from actual to-dos. The only solution I can see within RTM, specifically date-driven RTM use, is to add a “Sainsbury’s” task and add my shopping list as a note. Makes it hard to see at a glance what I’ve got to buy, hard to note down suddenly-remembered items, etc.

Good thing I’m a paper junkie, really. It just struck me as extremely amusing that I can’t use an app called Remember the Milk to do just that.


Upcoming startup events in Edinburgh

Startups 14 September 2009 | 0 Comments

I love it when other people do work so I don’t have to. This week’s fluffy pink kudos goes to StartupCafe, for their weekly ‘menu’ of startup events in Edinburgh – something I’ve been half-heartedly meaning to compile myself, and never quite got around to.

This week’s events can be found here.


Tech Media Invest – and a tale of good customer service

Startups 7 September 2009 | 0 Comments

We’re in the Guardian’s Tech Media Invest 100. Hurrah! Also coming up this week, we’re talking to David Lammy (the Minister for Higher Education) and pitching for the first time in about 2.5 months – let’s see if we can remember how it all works…

And an aside, I recently bought a suit on eBay (as you do). It duly arrived, with the trousers marked and one size smaller than described. My general experience with eBay, especially when it says in big letters ‘seller does not offer a returns policy’, is that they basically say ‘sucks to be you’ and you’re left holding the baby, so to speak. However, a really pleasant surprise; I wrote a fairly factual comment (I wasn’t angry per se, just a little disappointed) outlining the two ‘faults’, and the seller got back to me within 10 minutes to offer a return – then replied again a minute later saying ‘just give the trousers to someone they fit, and here’s a refund for one-third of the price you paid’. Wow!

This goes to show two things: when you deal with humans you often get a better deal than you expect from corporates; and being fairly pleasant, not angry, always helps. (Of course, perhaps the overhead of cleaning and selling on the trousers was too much for her to bother with, as well.) Anyway, one satisfied customer here. Hurrah!
