Using sketchy sentiment to pump up your post count

Online 7 April 2010 | 0 Comments

Finally, a post topic that combines both sentiment analysis and the meta-world of professional blogging!

I usually like TechCrunch, but these two articles have annoyed me: ‘Sentiment is split on the iPad’ and ‘More iPad Sentiment Analysis’. Both use poor, crude methods of sentiment analysis to produce posts full of fluff and pretty graphs. Result? Whatever point the blogger wanted to make. (You know what they say about statistics.)

A quick rundown of the problems: spurious classification algorithms, small sample sizes, and non-credible results. An algorithm that analyses every piece of traffic on Twitter and comes up with “51% positive, 49% negative” is Just Plain Wrong. There’s going to be a ton of stuff in the middle (unclassifiable, undecided, even just retweets of blog posts with the word in the title), and any graph should reflect that as well. Even with the neutral stripped out, a result of 51/49 just seems completely nonsensical to me, and I’ve been working with Twitter sentiment for a long time now.
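To make the neutral point concrete, here is a minimal Python sketch. The labels and counts are invented for illustration (they are not taken from either TechCrunch post), and `sentiment_shares` is a hypothetical helper, not something the classifiers in question are known to use. It shows how the same positive and negative counts read very differently once the unclassifiable middle goes back into the denominator.

```python
from collections import Counter


def sentiment_shares(labels, include_neutral=True):
    """Return each label's share of the tweets as a percentage."""
    counts = Counter(labels)
    if not include_neutral:
        # Dropping neutral tweets shrinks the denominator and inflates the rest.
        counts.pop("neutral", None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100 * n / total for label, n in counts.items()}


# Made-up labels for illustration only: most tweets carry no clear sentiment.
labels = ["positive"] * 51 + ["negative"] * 49 + ["neutral"] * 400

print(sentiment_shares(labels, include_neutral=False))  # the "51/49" headline view
print(sentiment_shares(labels))                         # the view with neutral kept
```

With these invented numbers, the 51/49 headline shrinks to about 10% positive and 10% negative against 80% neutral, which is the picture a graph ought to show.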

It’d also be very interesting to know what methods the classifiers use. That’s probably available with some digging, but I fear it’s manual keyword lists that some poor sod had to draw up: “hmm, I think if someone says the iPad is ‘stupid’ that’s probably negative, yah?”

Attensity does better, but what on earth does “not thrilled” mean (weak negative?) and again, where’s the neutral or noise aspect? It’s valuable to know just how many tweets were about the iPad, and how many of those were about sentiment. What if a TechCrunch headline with a negative word got retweeted 2000 times? That’s what we in the trade call “skew”. Plus, classifying on a small sample is just crazy. Why? Surely it can’t be computational limits; were these the only tweets with sentiment information? That’s useful data! Why throw it all away…
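One rough way to keep a heavily retweeted headline from dominating the counts is to collapse retweets before classifying anything. The sketch below is an assumption about how that might be done, not a description of what Attensity or anyone else actually does: `collapse_retweets` is a hypothetical helper, and it only recognises the old-style "RT @user:" prefix.

```python
import re

# Matches one or more leading "RT @user:" prefixes (old-style manual retweets).
RT_PREFIX = re.compile(r"^(?:RT\s+@\w+:?\s*)+", re.IGNORECASE)


def collapse_retweets(tweets):
    """Keep the first copy of each distinct message, ignoring retweet prefixes."""
    seen = set()
    unique = []
    for text in tweets:
        key = RT_PREFIX.sub("", text).strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Made-up tweets for illustration only.
tweets = [
    "iPad is a disaster, says blog",
    "RT @alice: iPad is a disaster, says blog",
    "RT @bob: RT @alice: iPad is a disaster, says blog",
    "Loving my iPad",
]
print(len(tweets), "tweets,", len(collapse_retweets(tweets)), "distinct messages")
```

Reporting both numbers (raw tweets and distinct messages) is probably more honest than either alone, since the retweet volume is itself useful data.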

It also looks like there are some great leaps in logic in terms of distinguishing between “Like the iPad because it might replace iPhone” and “Don’t like the iPad because it won’t replace my iPhone”. How do you automatically extract the difference between “Can’t replace battery” and praise for the battery life? Sigh.

Plus, there’s the key mistake of not showing error, accuracy bounds, or mistakes. Both posts assume the algorithms are 100% correct. While that makes for some pretty graphs, it just isn’t true. With no idea of sample size or result size (e.g. for the battery category above), a result of 5% could just mean that one out of a total of twenty tweets containing the word “battery” was negative. It’s the same for intent to purchase. Not every tweet will have any kind of intent, so if you just took the tweets containing “will” “buy” “iPad” or “won’t” “buy” “iPad”, [...]
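The one-in-twenty example can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, which is my choice for illustration (neither post says what, if anything, was used), and it only covers sampling error; it says nothing about how often the classifier itself is wrong.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# One negative tweet out of twenty mentioning "battery".
low, high = wilson_interval(1, 20)
print(f"Observed 5%, plausible range {low:.1%} to {high:.1%}")
```

For one negative tweet out of twenty, the interval runs from roughly 1% to 24%, so a bare "5%" on a graph hides almost all of the uncertainty.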

Of course, the reason I’m most annoyed at these posts is that I could have helped put together a custom dataset and classifier to provide much more detailed data, and didn’t. But while I can’t go back in time and change things, I can at least point out the flaws in using off the shelf graphs to meet your daily post quota as a pro-blogger.

Why reblogging is great for Google, and for you

Social Media 26 March 2010 | 2 Comments

Disclaimer: This post is personal opinion; the views expressed here are not those of Google, and are not influenced by any relationships the poster may have with the Big G.


There have been arguments raging on and offline about paywalls, the commons, old media versus new media, and ‘information should be free’ for — well, it feels like forever now. One of the (many) components of new media under fire is the army of filthy idea-stealin’ bloggers, people who merrily subscribe to paid content and then go and paraphrase it on their free-to-view blogs (or in some cases, just copy it). Paul Carr makes an excellent point about the commoditisation of facts, the human need for information and thus the Internet hivemind’s tendency to trend towards free.

Information being free is good, for obvious reasons, unless you’re someone who wants to get paid to create it. There are plenty of arguments for well-crafted columns, investigative journalism, paid political pundits and so forth. But here’s a thought about the oft-maligned practice of reblogging, rephrasing, and retweeting.

Language is variable.

The more ways an idea or piece of information is expressed linguistically, the easier it is to find — it’ll match far more search queries, as a simple starting point. Although, in an echo of the Sapir-Whorf hypothesis, perhaps expressing an idea in multiple languages, or with different phrasings and words, could change the way people think about the idea. Even if this happens, the idea reaches far more people than it would have if it were confined to one site, in one language, by one author.

From Google’s point of view, if someone takes a New York Times article, paraphrases it, and links back to it, the data miners jump for joy. Beautiful, delicious data. We learn new things about the relationships between words and concepts — maybe one article said climate change but another global warming. The link-back gives us contextual data that can help too. (Linking to a climate change article with the text “This article on global warming”, for example).

Of course, paraphrasing and rewriting has been going on for years, a staple of the essay or lit review. But as with voice recognition, having the power to implement and use a feedback loop at world-scale is a mind-blowing thing. Google has the power to build an entire semantic web out of paraphrased blog posts, and that’s before we even look at contextual links in Wikipedia or Twitter link summaries. If that’s scary, just think of the magic that happens when you search for something and get a result that isn’t the exact terms you entered, but is the exact concept. With a bit of data, intelligence and an army of semantic web PhDs, it just could happen.

The future of journalism

Startups 17 August 2009 | 1 Comment

will_lion on flickr

Silicon Valley-based seed fund and incubator Y-Combinator have started nudging smart people without ideas in the direction of a few pet ideas they’re keen to fund. While I think on one level this is a great idea, I also can see a few problems: people abandoning existing half-formed ideas to pursue something they think has greater chance of being funded, people shoehorning themselves into topics off-kilter with their actual instincts and skills, and people trying to game the system by applying with a ‘request for startups’ idea but turning it into something else once funded.

Fortunately, the guys behind YC are pretty smart, and have almost certainly thought of more ways this could go wrong than I have. On the “great idea” level it is definitely a good way to 1. learn how the guys with the money and experience actually think, and 2. encourage people to focus on something worthwhile — though, as the HN comments say, if you don’t already have a few ideas of your own then are you really the right sort of person to be jumping into the startup world?

However, above all, these ideas get people thinking. The HN comments are already raging, the TC commenters are partly missing the point and partly supportive, and I expect to see plenty more discussion on the topic as more ideas to be funded emerge.

As to the RFS1 itself – ‘the future of journalism’ – it’s interesting to see people who are already getting it, and others who are already shooting far off the mark. At the moment, journalism’s trying to marry the need to make money with processes and a business model that stem back to the printing press, evolving over the years but still firmly rooted in a concept that people nowadays have trouble dealing with — paying for stuff.

It’s actually quite interesting to think how we’re innovating in this space – almost accidentally. FestBuzz is competing with several types of journalism, online and off; some of our ideas for later down the line marry different types of media, new and old, with data-mining and magic to do cool stuff. The thing about FestBuzz in particular is we found a way to make money off something that’s free to consumers without using advertising as our main source of income. I can’t help but think that some of the lessons we’re learning about how to bridge the print and online industries, how to deal with information producers and consumers, how to make information free to all in a way they want to consume it… all of that fits right in with this idea of reinventing journalism as something that might actually make money rather than die out.

But enough about us. There are plenty of other business models and ideas floating around to kickstart any thoughts you might be having on reinventing journalism starting with the need to make money, not the assumption that people will pay:

Pro-blogging.
Obviously a subject dear to my heart, this ticks the box of ‘paying people to write content’ which is something most journalists like to hear, but on the other hand: reduced barrier to entry, low pay rates, constant small trickle of content is more rewarding than occasional big articles (so the concept of a feature/column is somewhat worn away) – yet big flashy content is needed to attract viewers through viral means (digg etc). Less of a focus on daily news and current affairs, partly due to reduced access and budgets to cover them.

Citizen journalism.
A poncy term for “people on Twitter are on the scene of breaking news first”. People submit their news/pictures to a central site, news agencies pay a subscription fee, extra for exclusivity, some of which goes back to the citizen journalists cited/used in a story. Reduces costs for news organisations to have people on the ground in key locations, democratises news, allowing bloggers and online-only organisations to cover breaking news too, but still somewhat reliant on current business models.

Subscription based news access.
Either online or via Kindle/mobile/iPhone, a centralised news gateway that you pay for (possibly freemium). Challenge: Convincing people to pay, and working out where the money goes. Multiple approaches here: personalised aggregation, collaborative news filtering, topic-specific news streams; access to ‘professional’ articles consolidated into one place; cross-media integration with paper headlines, multimedia, known brands; cross-platform access i.e. RSS on steroids, including filtered twitter, facebook, etc, streams.

Would you pay for an iPhone version of HN tailored to your own preferences (no articles on Erlang for me, please!)? Would you pay for a paper, virtual or physical, that consolidated the best of the day’s current affairs as voted by other people: the Times’ political commentary with the Guardian’s media coverage, the FT’s straight-faced finance with a little bit of Daily Mail celebrity-spotting sprinkled in for tea breaks? Would you pay to consume RSS as you do today but with the ability to collaboratively view it, chatting with other readers? The problem is that, for most of these, the answer is never going to be an immediate ‘yes!’. Maybe you’d get a hit from sales of the iPhone app or other one-off costs, but many things along these lines have been tried and have failed admirably.

Topic-specific physical news.
Instead of paying £x for a paper which you don’t read half of, pay £x/2 for two halves of a different paper and build your own. Again, fairly linked to current models, but a sort of physical hybrid of the stuff above with the need (or desire) to consume a dead-tree version.

The final point for today (I could go on all afternoon, but there’s work to be done!) is on how to think about this stuff.

The above are all ideas I’ve been thinking about for a while, in one form or another – and you can tell where my recent thoughts have been focused. But if you’re set on reinventing an industry, you don’t start from an idea or an application, you start from the industry itself. How does journalism work? How do journalists and news providers make money? What do people consume? What do they pay for (note, they may not be paying in money, but in clicks, in eyeballs, in time..)? What other data can you get about the way people consume news and media, and the way it’s delivered to them? Where are the weaknesses in the value chain? Why does the business model look the way it does? (Hint, it’s in pg’s post.)

Then think about how people of 2010 (not that far away any more) might consume news. What’s different? What happens if the news organisations go out of business, or go online-only? Are their current sources of income viable long-term? Short-term? What other ways can they make money? What do they own and what can they sell? How do journalists get paid? How else can they get paid? Who else can write news? Who else can deliver news? Does ‘news’ mean what’s happening now, or anything that someone somewhere finds current and interesting? How do news organisations gain and retain credibility? How do companies and celebrities rely on the news machine to make money and gain fame? How do paparazzi and news photographers fit in? What would the world look like if there was only one video news channel in each country? What levels of competition and collaboration are necessary to keep ‘good’ reporting alive? What can someone with a BBC badge do that someone with a ‘my-blog.co.uk’ business card can’t? Why?

So many questions. Have a cup of tea and a think. And in, ooh, about two or three months, maybe, I’ll write more about my personal view on these things and how what I’m doing at the moment fits into the picture.

Professional blogging in practice: part 3

Lifestyle 27 February 2009 | 1 Comment

The final part of a series looking into the realities of professional blogging for others. Check out part 1 and part 2 if you missed them!

Outside! - mexicanwave on flickr

The day-to-day life of a blogger can be a lonely one. Although you may be working as part of a large team, you don’t end up face to face with them on a daily basis; generally it’s you, your laptop and… that’s it. The very technologies that allow us to form large, multidisciplinary teleworking teams are also the same ones that cause us to be more isolated than ever.

Fortunately, lone workers, writers, entrepreneurs and stay-at-home parents have all perfected the art of not going off your rocker while you’re alone with your thoughts and nowt else all day. Here are a few of the ‘stay healthy, stay sane’ working and living habits I’ve picked up, both as an entrepreneur and blogger, with a slant towards the cheap ‘n’ cheerful; after all, blogging doesn’t pay that well! [...]

Professional blogging in practice: part 2

Headline, Online 13 February 2009 | 2 Comments

Following on from last week’s post about finding sources, today I’m looking at the rest of the professional blogger’s daily pipeline.

Once you’ve found something to write about, it’s time to sit back, relax and let your blogger instincts do the rest. Right? Perhaps. Once you get into the habit of posting multiple times a day on the same site, a lot of the following stages in a post’s lifecycle do become second nature, but when you’re starting out it’s useful to run through the checklist in your head.

[...]
