Friday, August 21, 2009

Bring the Controversy OR, Research Plan Update

Remember that plan I had to use a pattern classification algorithm to distinguish different eras of SF writing? I got some very lovely expressions of interest and support, but I didn't get the one thing that I needed: someone saying, "Why yes, here are one hundred Golden Age sf stories, all scanned in, digitized and proofread." And unfortunately, what with the new job, finishing grad school, and stepping up responsibilities at Strange Horizons, I'm not going to be able to do the scutwork required for this one. So that goes on the back burner.

So I thought about other research plans for next year. This is the second project I've had to scrap for lack of time, unfortunately. I thought about doing an author-overview of Greg Egan's short fiction, but I'm afraid that will actually be too hard, given the year I'm looking at. Unfortunately, writing pattern recognition algorithms is much easier for me than thinking deeply and writing coherently about patterns and themes in an author's work. So I had to go with my fall-back plan, which I hope will be entertaining for you all:

Can I write a pattern recognition algorithm that reliably distinguishes between fantasy and science fiction?

I know I'm asking for trouble, but this is one of the easiest research projects I can do. I'll limit the input data to short fiction from the last 5 years, so that I'll be mostly comparing apples to apples (and it will be relatively easy to find 100 stories of each type already online--that's the most important part). My feature set will be related to grammatical usage, to see if there are significant style differences between sf and fantasy. The initial training and testing sets will consist of 'core' sf and fantasy: pieces where no one would reasonably dispute their categorization. If a story looks like a doubtful case, I'll save it for later. I may tap the hive mind on occasion to confirm my suspicions and to see if any reasonable dispute arises. For each genre, 50 stories will be used for feature selection and initial training, and the other 50 will form the initial testing set.
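For the curious, the 50/50 split described above could be sketched roughly like this. It is only an illustration, not code from the project: the function name `split_corpus`, the fixed random seed, and the placeholder story titles are all assumptions made for the example.

```python
import random

def split_corpus(stories_by_genre, n_train=50, seed=0):
    """Shuffle each genre's stories and split them into train/test sets.

    stories_by_genre: dict mapping a genre label to a list of stories.
    Returns two lists of (story, label) pairs: (train, test).
    """
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is repeatable
    train, test = [], []
    for label, stories in sorted(stories_by_genre.items()):
        shuffled = list(stories)
        rng.shuffle(shuffled)
        # First n_train stories go to training, the remainder to testing.
        train += [(s, label) for s in shuffled[:n_train]]
        test += [(s, label) for s in shuffled[n_train:]]
    return train, test

# Placeholder titles standing in for 100 real stories per genre.
corpus = {
    "sf": [f"sf_story_{i}" for i in range(100)],
    "fantasy": [f"fantasy_story_{i}" for i in range(100)],
}
train, test = split_corpus(corpus)
```

Splitting within each genre, rather than shuffling all 200 stories together, keeps both sets balanced at 50 sf and 50 fantasy stories.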

If the results look promising on the testing set, then I'll start throwing borderline cases at the algorithm to see how it classifies some of the trickier stories. How will it classify "The Merchant and the Alchemist's Gate," for instance? That should yield some interesting results.

I want to be VERY clear here: if it is in any way successful in classifying the two (and I don't necessarily think it will be, see below), I will NOT be saying that "Thus X is fantasy and Y is science fiction, absolutely and forever, so mote it be!" This will be purely descriptive, not prescriptive, and will simply be another data point to use in the decades-long categorization debate. I'll be doing it because it's fun and relatively easy.

Why do I have any hope of success? Well, one of the last algorithms I wrote was for a grad school project. I asked it to distinguish between fiction and non-fiction using grammatical frequencies as the features. I was very surprised when it was able to correctly classify the two with 92% accuracy based on only 3 features. That's pretty amazing, frankly. Among the things it mis-classified: Ted Chiang's "Exhalation" initially showed up as non-fiction (it didn't in a revised version of the algorithm), a NY Times article on flooding in North Dakota showed up as fiction, as did a Michael Moore essay, and my reviewing manifesto showed up as fiction as well. It's the cases that break the algorithm that are always the most interesting, and I'm hoping that this little science project can contribute a little to the ongoing discussion. However, I'm totally prepared for there to be no significant difference at all; I have a suspicion that adventure writing is adventure writing, whether it uses swords or blasters. But that will be an interesting result all on its own.
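To give a flavor of what "grammatical frequencies as the features" can look like, here is a minimal sketch. It is not the grad school algorithm: the three features below (pronoun rate, quotation-mark rate, and the rate of words ending in "-ed") are hypothetical stand-ins, and the nearest-centroid classifier is simply one easy technique swapped in for illustration.

```python
import re

# Hypothetical feature set: stand-ins chosen for illustration only,
# not the three features used in the original project.
PRONOUNS = {"i", "you", "he", "she", "we", "they",
            "me", "him", "her", "us", "them"}

def features(text):
    """Return three per-word rates: pronouns, quote marks, '-ed' words."""
    words = re.findall(r"[a-z']+", text.lower())
    n = max(len(words), 1)  # avoid dividing by zero on empty text
    pronoun_rate = sum(w in PRONOUNS for w in words) / n
    quote_rate = text.count('"') / n
    ed_rate = sum(w.endswith("ed") for w in words) / n
    return (pronoun_rate, quote_rate, ed_rate)

def centroid(vectors):
    """Average a list of equal-length feature vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def fit(labelled_texts):
    """Compute one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in labelled_texts:
        by_label.setdefault(label, []).append(features(text))
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def classify(model, text):
    """Pick the label whose centroid is nearest (squared Euclidean distance)."""
    vec = features(text)
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(vec, model[label])),
    )
```

With a real corpus, `fit` would be run on the training stories and `classify` on the held-out ones; accuracy is then just the fraction of test stories that get the right label.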

I'll keep you posted as I go along. I plan to present the complete results at the 31st International Conference on the Fantastic in the Arts in March (assuming they approve my abstract--they may laugh it out of the conference). Feel free to throw suggestions (or short stories that have been previously published) my way! I want to make sure I get as diverse a sample set as possible.

3 comments:

Blue Tyson said...

You ever end up doing this Karen? Sounds pretty interesting.

Finding scanned stories is extremely simple, which you probably know now anyway.

Scanning them yourself does take a while - had to do some for my own Egan project, and there are still a few of his I don't have.

Karen Burnham said...

No, I was never able to find the time. Full-time work + finishing grad school sucked out the rest of 2009 and most of 2010. Now I'm done with grad school, but starting research for a book on Greg Egan. I haven't touched any code since last Spring.

I still think it'd be fun, and I hope to get around to it someday.

Blue Tyson said...

Ok, thanks.

Yes, the whole Egan thing is taking a while, certainly.

The only SF work I don't have digital copies of Egan-wise are these :-

Novels

Teranesia - Gollancz 1999 (hopefully they'll do one of these sometime)


Stories

Beyond the Whistle Test - Analog Nov 1989
Fidelity - Asimov's Sep 1991
Before - Interzone 57 Mar 1992
Reification Highway - Interzone 64 Oct 1992

So if you need something, happy to send.

I did do something of a literature search on Egan a while ago if you hadn't got to it already, there might be something you haven't come across :-

http://borderguards.blogspot.com/

bt