Friday, July 3, 2009

Help Needed!

Hi Everyone! I'm calling on the hive mind of awesomeness out there to help me with my next ICFA paper. I've got an idea for a combination of two of my favorite things: science fiction and pattern recognition algorithms.

Here's the idea: I feed a whole bunch of science fiction short stories into a pattern recognition algorithm and then see if it can correctly identify the era of origin for a bunch of other short stories. The three eras I have in mind are the "Golden Age" (1934-1955), the "New Wave" (1964-1980) and "Post-Cyberpunk" (1990-present). The question is, after I train the algorithm on a whole bunch of core sf texts from each of these identifiable eras, would it then be able to correctly place, say, "All You Zombies" as being Golden Age? I've had good luck using this technique to distinguish non-fiction articles from short stories (92% accuracy), and I'd like to expand the approach.
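For readers who want to picture the train-then-classify loop, here is a minimal sketch. The post does not say which algorithm is used, so multinomial naive Bayes over word counts is swapped in purely as a stand-in. The `EraClassifier` name and the toy training sentences are invented placeholders, not real story text and not the program behind the 92% figure.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class EraClassifier:
    """Hypothetical naive Bayes bag-of-words classifier (illustration only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # era -> word frequencies
        self.doc_counts = Counter()              # era -> number of stories
        self.vocab = set()

    def train(self, labeled_stories):
        """labeled_stories: iterable of (era, full_text) pairs."""
        for era, text in labeled_stories:
            tokens = tokenize(text)
            self.word_counts[era].update(tokens)
            self.doc_counts[era] += 1
            self.vocab.update(tokens)

    def predict(self, text):
        """Return the era with the highest log-probability for this text."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_era, best_score = None, -math.inf
        for era in self.doc_counts:
            counts = self.word_counts[era]
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[era] / total_docs)
            for tok in tokens:
                # Add-one smoothing so unseen words do not zero out an era.
                score += math.log((counts[tok] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_era, best_score = era, score
        return best_era


# Made-up one-line "stories" standing in for the ~100 real texts per era.
training = [
    ("Golden Age", "the rocket ship landed on mars and the robot obeyed the engineer"),
    ("Golden Age", "the atomic rocket carried the engineer to venus"),
    ("New Wave", "entropy dissolved his inner landscape into dream and ambiguous myth"),
    ("New Wave", "the dream city fractured into myth and inner space"),
    ("Post-Cyberpunk", "the uploaded mind forked across the network after the singularity"),
    ("Post-Cyberpunk", "nanotech and the network rewrote the uploaded economy"),
]

clf = EraClassifier()
clf.train(training)
print(clf.predict("a robot engineer repaired the rocket"))
```

With real stories the interesting part is which features the classifier ends up leaning on (vocabulary, sentence length, and so on); raw word counts are only the simplest possible choice.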

So here's what I need help with: first off, please attack my premises! How legitimate are these categories? How reasonable are the cut-off dates? Do you think that this sort of classification will be too hard or too easy for a poor little computer program? Are there more interesting questions I could be asking using this sort of technique? Contrariwise, is this approach too reductive?

Next up, I need help tracking down about 100 short stories for each time period. Ideally the stories would be purely sf, no slipstream or other fuzzy genre stories (trying to eliminate variables for the poor little algorithm). They would also be less than 10,000 words long and available in full text online (for ease of data collection).

I've already got some initial ideas of course:

Golden Age
  • Asimov Robot stories
  • Heinlein's Future History stories
  • Bradbury's Martian Chronicles
  • Stanley Weinbaum's "Martian Odyssey"
  • Stories like those found in "Adventures in Time and Space"
  • "Cold Equations"
New Wave
  • Philip K. Dick
  • James Tiptree Jr.
  • Dangerous Visions and Again Dangerous Visions
  • Barrington J. Bayley
  • Philip Jose Farmer
Post-Cyberpunk (probably will need another name for this era)
  • Cory Doctorow
  • Charles Stross
  • Ted Chiang
  • Stephen Baxter
Hopefully that gives you a flavor of what I'm looking for? There's no guarantee that this will work, or even that it will produce an interesting ICFA paper (I'm also kicking around an idea for an XKCD-based paper, for instance). It's early days. But I'd like to give it a try, especially now that my super-sekrit intensive-data-collection reader response theory project is on hold.

Thanks in advance for all comments and suggestions!

7 comments:

Duncan Lawie said...

I'll be fascinated to see if the error rate increases across the three periods - for two reasons. Firstly, I suspect we have a pretty consistent idea of what Golden Age SF looked like, and the material which has survived to be transferred onto the internet may well be pretty consistent. Secondly, the modern era has not yet been through any comprehensive "canonisation", whilst the material, building on the longer history of SF, is less likely to be self-similar.

I've got no idea of where the field of pattern recognition is. Does it make sense to get some human beans to do the same type of sorting of the works as a baseline?

Karen Burnham said...

Duncan- That's a good point about the fact that time hasn't filtered the latest stuff yet--see my inability to even come up with a good name for the period! That's one reason that I'm fishing for more obscure Golden Age & New Wave stories: I want to make sure that I'm not just comparing Asimov/Clarke/Heinlein to the wider diversity of voices writing today.

When it comes to the Pattern Recognition, I'll be doing the initial classification (based on the dates), against which the program's classification will be measured.
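A rough sketch of how that comparison could be scored, one era at a time, so that any difference in error rate between periods shows up. The era boundaries are the ones given in the post; the function names and the sample `results` list are hypothetical.

```python
from collections import Counter

# Era boundaries as stated in the post; None means "through the present".
ERAS = [
    ("Golden Age", 1934, 1955),
    ("New Wave", 1964, 1980),
    ("Post-Cyberpunk", 1990, None),
]


def era_for_year(year):
    """Return the date-based era label, or None if the year falls in a gap."""
    for name, start, end in ERAS:
        if year >= start and (end is None or year <= end):
            return name
    return None


def per_era_accuracy(results):
    """results: iterable of (publication_year, predicted_era) pairs.

    Stories whose year falls outside every era are skipped.
    """
    correct, total = Counter(), Counter()
    for year, predicted in results:
        actual = era_for_year(year)
        if actual is None:
            continue
        total[actual] += 1
        if predicted == actual:
            correct[actual] += 1
    return {era: correct[era] / total[era] for era in total}


# Invented predictions, only to show the shape of the input.
results = [
    (1941, "Golden Age"),
    (1950, "New Wave"),
    (1970, "New Wave"),
    (2005, "Post-Cyberpunk"),
    (1960, "Golden Age"),  # falls between eras, so it is not scored
]
print(per_era_accuracy(results))
```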

Crotchety Old Fan said...

If you really want to test boundaries, you've got to include stories by Cordwainer Smith.

In fact, a very good source for a wide diversity of material (styles, authors) would be the Del Rey series of books - The Best Ofs:
Fredric Brown, Henry Kuttner, Cordwainer Smith, C. L. Moore, Jack Williamson, Fritz Leiber, Robert Bloch, Philip K. Dick, Murray Leinster and multiple others in the series.

I'd also suggest that your periods need a little adjusting. For completeness' sake:
Before the Golden Age - say 1900 to 1934
Golden Age - '34 to '58
post golden age - '59 through '84 or so (traditional works not participating in the 'new wave')
new wave - running parallel from '60 or so through the mid '70s
and then after that you really have to represent multiple branches (I think.)
Hope that helps.

Karen Burnham said...

Yay Cordwainer Smith! You're right, he's an excellent boundary case. Another one I suspect will be problematic is Alfred Bester.

Hmmm... I'll have to think about some of the parallel-in-time categories. I can see where it would be much more complete, but would also make it harder to do initial sorting. Also, every extra category is a multiplier of data needed--I'd need ~90 Doc Smith and Ed Hamilton stories, ~90 post-singularity stories, etc. I'll definitely give it some thought, though.

Crotchety Old Fan said...

It may be that you can engage in a little recursive hair-splitting.
My gut instinct tells me that any system capable of sorting golden age and earlier from later works could then do a second pass on the later works grouping and fairly easily determine whether something is new wave or not by measuring 'closeness' to the older group: those works that fall out as 50% or closer are not new wave - anything that is left is.
If 'gut instinct' bears out in a manual sort, you'll save time on the algorithm AND the number of works needed for your sample.
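The second pass described above could look something like this sketch. The comment does not define 'closeness', so cosine similarity between word-count vectors is an assumption here, as are the function names and the toy texts. The 0.5 threshold is the commenter's figure, and whether it is a sensible cut-off for this particular measure is untested.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Bag-of-words vector for a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def second_pass(later_stories, older_stories, threshold=0.5):
    """Split the 'later' group by closeness to the older group.

    later_stories: dict of title -> text (already sorted as 'later' in pass one)
    older_stories: list of texts from the golden-age-and-earlier group
    """
    older_profile = Counter()
    for text in older_stories:
        older_profile.update(word_counts(text))

    labels = {}
    for title, text in later_stories.items():
        closeness = cosine(word_counts(text), older_profile)
        # Close to the older group: traditional, i.e. not new wave.
        labels[title] = "traditional" if closeness >= threshold else "New Wave"
    return labels


# Placeholder texts, not real stories.
older = ["rocket ship robot engineer", "rocket engineer mars"]
later = {"Story A": "rocket robot engineer", "Story B": "dream myth entropy"}
print(second_pass(later, older))
```

On real prose, raw word counts are dominated by common words like "the", so nearly everything would score as close; some weighting or stop-word filtering would be needed before the threshold means much.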

Karen Burnham said...

While that's true from a gut instinct sorting point of view, unfortunately it's not true from a mathematical-proof-of-validity point of view. The math says that you need a certain number of samples, per class, per # of features used to do classification. So it's hard to get around the "more classes => more samples" requirement.
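To make the "more classes => more samples" arithmetic concrete, here is a small sketch. The ratio of 10 samples per feature is a common rule of thumb assumed for illustration, not a number from this thread, and 9 features is picked only so the per-class total lands near the ~90 stories mentioned earlier.

```python
def stories_needed(n_classes, n_features, samples_per_feature=10):
    """Return (stories per class, total stories) under an assumed rule of thumb.

    samples_per_feature is an assumption; the real ratio depends on the
    classifier and on how well separated the classes are.
    """
    per_class = n_features * samples_per_feature
    return per_class, per_class * n_classes


# Three eras versus five categories, with the same hypothetical 9 features.
print(stories_needed(3, 9))
print(stories_needed(5, 9))
```

The per-class requirement stays fixed as categories are added, so the total grows linearly with the number of classes.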
