May 14, 2012

From: Reddit AMA

What can be done to improve natural language search algorithms? For example, this morning on Wolfram|Alpha I tried: “Time it takes to walk 500km” and it searched “time it”. “Time to walk 500km” and it searched for “walk”. “How long does it take to walk 500km” and it searched “how long does it take”. “Time taken to walk 500km at average human walking speed” and it searched “average human walking speed”.

An interesting question… closely related to a blog post I recently wrote about “artificial stupidity”.

If we change your input to, e.g., “Time it takes to go 500km at 2mph”, then it works just fine.
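
To make the arithmetic behind that reformulated query concrete, here is a minimal sketch in Python. It only illustrates the unit conversion and division involved; it is not how Wolfram|Alpha computes its result, and the function name `travel_time_hours` is made up for this example.

```python
KM_PER_MILE = 1.609344  # exact, by definition of the international mile


def travel_time_hours(distance_km: float, speed_mph: float) -> float:
    """Return the hours needed to cover distance_km at a constant speed_mph."""
    speed_kmh = speed_mph * KM_PER_MILE  # convert mph to km/h so units match
    return distance_km / speed_kmh


hours = travel_time_hours(500, 2)
print(f"{hours:.1f} hours, or about {hours / 24:.1f} days of nonstop travel")
```

Once both the distance and the speed are stated explicitly, the query reduces to one conversion and one division, which is why this phrasing is easy to handle.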

And in fact I think the problem Wolfram|Alpha is having with your input is not so much to do with natural language understanding as such, but rather with having enough knowledge, and handling it correctly.

Wolfram|Alpha has a value for “average human walking speed”, and indeed “500km at average human walking speed” works just fine. The problem is with the “linguistic compression” to e.g. “Time to walk 500km”… which requires extra knowledge.
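
A toy sketch of what undoing that compression involves: the verb “walk” has to be expanded into an explicit speed before any calculation can happen. Everything here is hypothetical, including the `TYPICAL_SPEED_KMH` table, the function names, and the 5 km/h figure, which is an assumed round number rather than Wolfram|Alpha's actual value. It is meant only to show where the extra knowledge enters, not how the real system is built.

```python
import re

# Hypothetical knowledge base: verb -> typical speed in km/h.
# The 5 km/h figure is an assumed round number, not Wolfram|Alpha's value.
TYPICAL_SPEED_KMH = {"walk": 5.0}


def expand_query(query: str):
    """Turn e.g. 'Time to walk 500km' into (distance_km, speed_kmh), or None."""
    # Look for a word directly followed by a distance in km, such as "walk 500km".
    match = re.search(r"\b(\w+)\s+(\d+(?:\.\d+)?)\s*km\b", query.lower())
    if match is None:
        return None
    verb, distance_km = match.group(1), float(match.group(2))
    speed_kmh = TYPICAL_SPEED_KMH.get(verb)
    if speed_kmh is None:
        return None  # the word was parsed, but no speed is known for it
    return distance_km, speed_kmh


def answer_hours(query: str):
    """Return the travel time in hours, or None if the query cannot be expanded."""
    expanded = expand_query(query)
    if expanded is None:
        return None
    distance_km, speed_kmh = expanded
    return distance_km / speed_kmh


print(answer_hours("Time to walk 500km"))
print(answer_hours("Time to swim 500km"))
```

The second query parses just as well as the first, yet it fails because the table has no entry for “swim”. That is the sense in which the difficulty lies in knowledge rather than in language understanding as such.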

We’re always working on upgrades to the linguistics/knowledge frameworks of Wolfram|Alpha… and there’s one particular upgrade that I think might make your examples here work… though I’m not sure.

We’ve been steadily working through different kinds of inputs and domains… and in areas like math where the system is quite mature, I’m very pleased at the level of query success we’re seeing.

Your examples here are exactly the kind of thing we spend a long time analyzing to improve things. You might think that this kind of work would get mired in specifics… but one of our achievements has been to develop frameworks that allow good generalization.

If you have other examples, please send them! (You can use the feedback form at the bottom of any Wolfram|Alpha page; yes, actual humans look at those…) It’s particularly nice to have the kind of “reformulation sequence” that you give here. The way we anonymize our query logs happens to make it difficult for us to piece together such sequences right now.

