August 9, 2004

I am not in a very bloggerly mood in general. There is a reason for this – I am in the process of compiling a blog corpus of 1,000,000 words, which means I need to take 200 random blogs and extract 5000 words from each of them.

This means reading a lot of blogs, and trust me, there are many of them you wouldn’t want to have to read. I can’t just harvest the text blindly; I’ve looked for a tool to do it but haven’t found anything useful, so I need to do it manually, month by month when I’m lucky enough to pick a blog that has monthly archives…

For each blog I pick, I start on Jan 1, 2004 and collect postings until I pass 5000 words. There’s a lot of stuff that needs to be stripped out of the raw text: headers, footers and long quotes from other sources, for instance. I also add the comments (if any) as footnotes to each posting. That takes a lot of time. And more than half of the blogs I look at aren’t usable for my purposes; they just don’t contain enough text after Jan 1.
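
The selection rule itself is simple enough to sketch in Python. This is only an illustration, not the tool that was missing: the function name `collect_sample`, the `(date, text)` input format and the whitespace word count are all assumptions, and the text is assumed to have been cleaned by hand already.

```python
from datetime import date


def collect_sample(posts, start=date(2004, 1, 1), target_words=5000):
    """Collect postings dated on or after `start`, oldest first, until the
    running word count passes `target_words`.

    `posts` is a list of (date, text) pairs (hypothetical format), with
    headers, footers and long quotes already stripped out.
    Returns the selected postings, or None if the blog has too little text.
    """
    selected = []
    total = 0
    for posted, text in sorted(posts, key=lambda p: p[0]):
        if posted < start:
            continue  # ignore anything written before the start date
        selected.append((posted, text))
        total += len(text.split())  # crude word count: split on whitespace
        if total > target_words:
            return selected
    return None  # not enough text after the start date; blog is unusable
```

A blog for which this returns `None` is one of the unusable ones: it never passes 5000 words after Jan 1.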

And it makes me feel that there is too much blogging in the world already.
