Scott Prevost is General Manager and Director of Product for Powerset, the company whose semantic search engine was recently acquired by Microsoft. In this interview he describes the history of Powerset's natural language engine, and explains how it works as part of a hybrid approach to indexing, retrieval, and ranking.
Scott will expand on these topics in his keynote address at Web 3.0 in October.
JU: The notion of search enhanced by natural language understanding has a long history. I was just reading Danny Sullivan's rant about how he's been hearing about this for years, but it's never amounted to anything.
Of course, people are all over the map on this topic, but nonetheless you guys are doing certain demonstrable things, and working on other things. So I'd like to find out more about how the technology -- which was acquired from Xerox, where it had been worked on for a long time -- actually works. What you mean by natural language understanding, how you're applying the technology, and where this is going.
SP: Well, there are a lot of questions tucked in there, but maybe we can start with what we licensed from PARC, what was formerly Xerox PARC. They had been working for 30 years in a linguistic framework called LFG -- lexical functional grammar -- and they built a very robust parser. It's probably parsed more sentences than any other parser in the world.
What it allows us to do is take apart every document that we index, sentence by sentence, uncover its linguistic structure, and then translate that into a semantic representation we can encode in our index.
JU: Can you confirm or deny something that Danny Sullivan reported, which is that it takes on the order of two months to index Wikipedia one time using this method?
SP: [laughs] That's a very, very old number. It all depends on the number of machines, but we do it now on the order of a couple of days.
JU: And it scales linearly?
SP: Yes. And in fact we're working really hard to bring those numbers down. We have a very small data center right now. We're looking at what it takes to stand up a 2 billion document index, and it's absolutely attainable.
I think Danny Sullivan realized, when he wrote another article on the day we launched, that we're doing something different. He called us an understanding engine. It's not the case that we're just applying linguistic technology at runtime, by parsing the query and then trying to use the same old kind of keyword index for retrieval. We're actually doing the heavy lifting at index time.
We're actually reading each sentence in the corpus, pulling out semantic representations, indexing those semantic representations, and then at query time we try to match the meaning of the query to the meaning of what's in the document. That allows us to both increase precision and improve recall.
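To make the index-time idea concrete, here is a minimal sketch in Python. It is an illustration only, not Powerset's implementation: the hand-written `PARSED` facts stand in for the output of a real parser, and the names `build_index`, `role_match`, and `keyword_match` are hypothetical. It shows why matching on roles can be more precise than matching on keywords.

```python
from collections import defaultdict

# Hypothetical parser output: (doc_id, subject, verb, object).
# In a real system these would come from parsing each sentence at index time.
PARSED = [
    ("doc1", "ibm", "acquire", "lotus"),
    ("doc2", "lotus", "acquire", "iris"),
]

def build_index(facts):
    """Map each (subject, verb, object) triple to the documents that state it."""
    index = defaultdict(set)
    for doc_id, subj, verb, obj in facts:
        index[(subj, verb, obj)].add(doc_id)
    return index

def role_match(index, subj=None, verb=None, obj=None):
    """Return documents whose triples agree with every role that was given."""
    hits = set()
    for (s, v, o), docs in index.items():
        if subj in (None, s) and verb in (None, v) and obj in (None, o):
            hits |= docs
    return hits

def keyword_match(facts, words):
    """Baseline: a document matches if it contains all words, in any role."""
    return {doc_id for doc_id, *terms in facts if set(words) <= set(terms)}

index = build_index(PARSED)

# "Who acquired Lotus?" Lotus must fill the object role.
print(sorted(role_match(index, verb="acquire", obj="lotus")))   # ['doc1']

# The keyword baseline cannot tell acquirer from acquired.
print(sorted(keyword_match(PARSED, ["lotus", "acquire"])))      # ['doc1', 'doc2']
```

The keyword baseline returns both documents because both mention "lotus" and "acquire"; the role-aware lookup returns only the one in which Lotus is the thing being acquired.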
JU: When you say semantic representation, what it means -- or anyway what's evident in the current version -- is subject/verb/object triples, basically. That seems to be how things are organized.
SP: That's one small part of what the engine does. It's the part we've exposed in the user interface in a very direct way. But actually those are only three of several dozen semantic roles that we uncover at index time, and all of those roles go into selecting documents, and snippets of documents, when we present the organic results.
JU: Really? So even though the patterns aren't exposed in the advanced query interface, they're still used?
SP: That's right, they're still used.
JU: What would be an example of one of those other patterns, and how it's applied?
SP: So, you ask a question like: "When did Hurricane Katrina strike?" The 'when' is a certain kind of semantic role that we've indexed, separately from the subject, verb, and object. There are a number of other roles like that: location, time, other types of relationships.
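A rough sketch of how a "when" question could be answered from indexed roles follows. Everything in it is an assumption made for illustration: the role names (`subject`, `verb`, `location`, `time`), the hand-written `FACTS`, and the `answer` function are hypothetical and are not taken from Powerset's engine.

```python
# Hypothetical indexed facts, each with whatever roles the parser found.
FACTS = [
    {"doc": "doc1", "subject": "hurricane katrina", "verb": "strike",
     "location": "louisiana", "time": "august 2005"},
    {"doc": "doc2", "subject": "hurricane katrina", "verb": "strike"},
]

def answer(facts, wanted_role, **constraints):
    """Return (value, doc) pairs for facts that satisfy the constraints
    and actually carry the role the question asks about."""
    results = []
    for fact in facts:
        if wanted_role not in fact:
            continue  # this sentence never said when (or where, etc.)
        if all(fact.get(role) == value for role, value in constraints.items()):
            results.append((fact[wanted_role], fact["doc"]))
    return results

# "When did Hurricane Katrina strike?" asks for the time role.
print(answer(FACTS, "time", subject="hurricane katrina", verb="strike"))
# [('august 2005', 'doc1')]
```

The second fact matches on subject and verb but is skipped because it has no time role, which is the sense in which "when" is indexed separately from subject, verb, and object.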
JU: I saw a private demo, about a year ago, in which one of the most striking examples was something like: "Companies acquired by IBM between 1996 and 2003". At that point, I think the light bulb goes on in people's heads about what this could really be.
That class of query isn't exposed yet, but it's an example of what's possible, right?
SP: Absolutely. That's exactly the direction we're moving in. Initially most of the work we've done has been on the index side. Now we're starting to catch up on the query side, which allows us to complete the loop and do queries like that.
JU: The other piece that's visible on the website, in addition to the Wikipedia stuff, is the Freebase material that you've recently integrated. That's an interesting case because there you can pull semantics directly from Freebase. So this becomes more of a query-time interface to something which is already structured and understandable.
SP: Yeah, that's right. Freebase is kind of like Wikipedia, except it's all structured data. Unlike with our core technology, which turns unstructured data into structured data, with Freebase we just go directly to the structured data. But it uses the same linguistic technology on the front end to parse the query, which then gets mapped into a Freebase database call.
But by using linguistic technology to parse the query, we're able to match very flexible ways of saying things. We don't have to imagine every possible way someone might ask for a particular piece of information. The linguistic engine takes care of a lot of that for us.
JU: That's why I can type in something like "Barack Obama's book" and get back the answer Dreams From My Father directly from Freebase.
So, what was the intent of including Freebase along with Wikipedia? What are you trying to show there?
SP: That the linguistic technology can be used with both structured and unstructured data. Freebase just has a lot of really great information.
One of the things about a natural language front end is that it encourages people to ask questions and expect answers. With the Freebase database, it's pretty easy to provide direct answers right at the top of the search results page, which users find to be a nice experience.
Of course you have to be very high-precision, so we've tuned the Freebase stuff for precision rather than recall.
JU: Tell me about the natural language landscape: the variety of approaches that exist, the style that you're using, how that compares to others, how all this fits into the history of the technology.
SP: The technology goes back a long way, three decades or so.
JU: Longer, actually.
SP: Yeah, really since the beginnings of AI people have been trying to use computers to understand and generate language. There have been a number of different approaches: purely symbolic approaches, statistical approaches. We really use a hybrid.
The Xerox technology uses a particular grammatical formalism, and we do use symbolic approaches to our semantic rules. But we also then put these semantic features into our index, and use machine learning and statistical approaches to retrieve and rank results.
It really is a combination. We try not to be religious about these things, but just use best of breed, and choose the right tools for the jobs we're doing.
JU: One of the things that Peter Norvig at Google is always saying is that the real secret to their success is vast quantities of data, and that in the end you don't really need AI, you just need lots and lots of data and the ability to crunch through it.
I assume you would argue that the natural language techniques are also helpful, and that as the quantity of data in your possession grows, the power that it brings to the table will also grow.
SP: Yeah. One thing we try to do with the natural language technology is give a leg up to the statistical and machine learning approaches. If you look at a search engine that just uses keywords, the information you have about the page is pretty slim.
We're trying to capture more information about each page that we index, which enhances our ability to retrieve and rank. For example, it allows us to retrieve documents where there are no keyword matches, but there's a good meaning match.
JU: For example?
SP: So, consider a query like: "What politicians were killed by disease?" Powerset will retrieve documents that don't include the words 'disease' or 'politician' or 'kill', but that are about particular politicians who died from particular diseases.
JU: Is the process of mapping generic terms to specific instances a hybrid of human editorial effort and statistical techniques?
SP: Yeah. We use things like WordNet, which is a giant dictionary or thesaurus of the English language that shows how various word senses relate to each other. We use that with some editing on our own. We also use machine learning techniques to figure out some word relationships, and which are most helpful in retrieval and ranking. So it really is a combination.
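The idea of retrieving on a meaning match rather than a keyword match can be sketched with a tiny hand-written table of word relationships. This is a stand-in for a resource like WordNet, not a use of WordNet itself, and the names `HYPONYMS`, `expand`, and `concept_match` are hypothetical. The real system described here also matches roles and verb meanings, which this sketch leaves out.

```python
# Hand-written stand-in for a lexical resource: general term -> specific terms.
HYPONYMS = {
    "politician": {"senator", "president", "governor"},
    "disease": {"pneumonia", "tuberculosis", "cancer"},
}

def expand(term):
    """A general term matches itself or any of its more specific terms."""
    return {term} | HYPONYMS.get(term, set())

# Toy documents; neither contains the word "politician" or "disease".
DOCS = {
    "doc1": "the senator died of pneumonia",
    "doc2": "the senator won the election",
}

def concept_match(query_terms, docs):
    """Return documents that contain some instance of every query concept."""
    hits = []
    for doc_id, text in docs.items():
        words = set(text.split())
        if all(expand(term) & words for term in query_terms):
            hits.append(doc_id)
    return hits

print(concept_match(["politician", "disease"], DOCS))  # ['doc1']
```

Here `doc1` is returned even though it shares no keyword with the query, because "senator" and "pneumonia" are listed as instances of the two query concepts.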
JU: When did you start this work?
SP: The company was founded three years ago, and I joined two years ago. But of course the work at PARC goes back 30 years.
JU: You obviously have an academic background in this field.
SP: Yeah, I have a Ph.D. in computational linguistics, as do probably about twenty other people at Powerset.
JU: What's your take on how this engine will start to surface through the various Microsoft online properties?
SP: The two areas where we can make a big impact are, first of all, improving core relevance, which is an absolute must for every search engine. And then also the user experience. Some of the technology -- and you start to see it in the Wikipedia search engine that we put out -- some of it really allows us to do different things in the presentation of these results. Things that can save the user time, by getting the answer right on the search results page.
Our goal is to continue to work on improving relevance, and we've shown that by using these semantic features we can drive large relevance improvements, but there's still a lot of work to be done there.
JU: In that case, the improvements would be under the covers. The person using Live Search wouldn't know that you were contributing to the relevance of the result.
SP: That's correct. Another way it can happen is by creating a different quality of snippet or caption, things that highlight the parts that match the query instead of just bolding the keywords. Actually highlight the answer right there on the search results page, so you don't have to click through to determine if it's the right page.
JU: There's a related area called entity extraction, and there's been a lot of action there. For example there's a company called ClearForest, recently acquired by Reuters, which has put a lot of work into entity extraction. What's the story on that front?
SP: A lot of companies are working on this. We have our own in-house effort for name recognition and entity recognition, and this is of course really helpful as a kind of light semantic layer. But for us, it becomes deeper because we can start to relate all kinds of entities to one another, based on where we've seen them, and also with the help of things like Freebase.
To follow up on how you'll see the impact in things like Live Search, beyond the improvement in relevance and in the quality of snippet, I think you'll see features like related searches, and other ways of presenting information similar to the Factz that are shown in our Wikipedia product. I think you'll also see a lot more work on the instant answers, with a database that extends beyond Freebase.
Without committing to particular deliverables, these are the kinds of things I think you can expect to see. And you'll also continue to see growth on powerset.com, where we can be a bit more daring in terms of ways of presenting search results.
JU: Well, thanks for your time. This has been interesting, and I'll be fascinated to see how things unfold over the next few years. I've got a feeling you'll have access to a pile of resources to work with...
SP: Yeah, we're really excited about it. As a startup, it's hard to build a full-scale web search engine. Having the resources available, and the really smart people at Live Search, is just a tremendous boost to us.