ARTIFICIAL INTELLIGENCE AS COMMON SENSE KNOWLEDGE

Professor Douglas Lenat

The goal of AI, and the goal of what I do, is not understanding human cognition, consciousness, and so on. Nor is it replacing humans at tasks per se, but rather mental amplification: in the same way that physics, through engineering, has provided us with physical amplification, so that we can use automobiles to travel faster than our muscles could carry us and microphones to be heard farther than our lungs could power our voices, in AI we use computers as mental amplifiers, so that we can perform tasks, be creative, and exercise common sense better and more usefully than we otherwise could. The paper I am presenting here is a synthesis of three presentations squeezed together, describing how artificial intelligence can produce programs to help with those three kinds of tasks.

Another important point worth making concerns the distinction that was made between initial states and laws of nature. As Professor Wigner said, this is one of the great, miraculous insights of physics and science: to realize what physics should do, and where its source of power comes from, we focus on the laws of nature. In a sense, this is a misleading miracle from the point of view of our goal in AI, because if we analogize to it too strongly it causes us to focus on inference mechanisms rather than on the starting knowledge base. As a result, we try to develop mathematical theories of consciousness, reasoning, and thinking, and worry about equations for consciousness and so forth. That type of work has not panned out in the last thirty years. There are, I suppose, other similarly misleading miracles that early AI inherited. All of these focus on laws of nature, on inference mechanisms, rather than on initial states of knowledge; and our experience has been that programs that try to be general problem solvers, that try to be short, simple models of intelligence, turn out not to be particularly useful, given the goal that I stated.

One of the first useful programs, in that sense, was the Dendral program, developed at Stanford around 1965. As most of you probably know, Dendral took a chemical formula like C20H43N and typed out a list of all possible three-dimensional configurations that could have that formula, all the isomers of that compound. Under a strict topological treatment of a typical problem of this sort there are about forty-three million isomers; but by talking to chemists and extracting their rules of thumb for what they regarded as a plausible isomer, the Dendral builders were able to put in knowledge about chemical topology and fundamental first principles of chemistry and, in many cases, sharply reduce the number of candidates that had to be considered. What these experiences led us to do is to focus not on the inference mechanism but on the knowledge base.

Typically, in all these expert systems, the performance program is a collection of rules. Take the Mycin medical diagnosis program or the XCON system that DEC uses to configure orders. The reason for programming these as IF-THEN rules is basically this: if you ask an expert how to solve some problem, like how the monkey should get the bananas, they say, well, the monkey gets off the box, pulls it over, climbs back on the box, reaches up, grabs the bananas, and pulls. If you are not careful you will believe what they say, and you will build some finely tuned expert program that does just what they say, a hundred thousand lines of Fortran or so. Then you try it out and it does not quite work, and they say, "Oh yes, well, there is one thing I forgot to tell you." You fix that, it still does not work, and they say, "There was just one more thing I meant to tell you." What that kind of experience has led to, after a lot of pain, is the methodology of expert-system rules: take a prototype system, get it running as fast as possible, talk to the expert, and, in case after case where the system breaks down or gets the wrong answer, ask the expert what is wrong in this case, get a new rule, add it to the system, and incrementally approach competence. There is still a long way to go in expert systems. There are many types of knowledge we do not yet know how to represent, causality for example. I will not go into expert systems in much more detail; they are covered in all the leading magazines these days.
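
To make the flavor of this methodology concrete, here is a minimal sketch in Python, purely my illustration: a toy forward-chaining loop over a set of fact tokens. The rule and fact names are made up, and this is not how Mycin or XCON were actually built.

    def run_rules(facts, rules):
        """Fire rules until no rule adds a new fact (forward chaining)."""
        changed = True
        while changed:
            changed = False
            for condition, conclusion in rules:
                if condition <= facts and conclusion not in facts:
                    facts.add(conclusion)
                    changed = True
        return facts

    MONKEY_RULES = [
        # IF the monkey is off the box and the box is elsewhere,
        # THEN the monkey can push the box under the bananas.
        ({"monkey-off-box", "box-elsewhere"}, "box-under-bananas"),
        # IF the monkey is on the box and the box is under the bananas,
        # THEN the monkey can grab the bananas.
        ({"monkey-on-box", "box-under-bananas"}, "monkey-has-bananas"),
        # The rule added later, after the expert says "one thing I forgot":
        # the monkey has to get off the box before it can push the box.
        ({"monkey-on-box", "box-elsewhere"}, "monkey-off-box"),
    ]

    # Facts accumulate monotonically here; a real system would also retract
    # facts such as "monkey-on-box" once the monkey climbs down.
    print(run_rules({"monkey-on-box", "box-elsewhere"}, MONKEY_RULES))

Notice that competence arrives exactly as described: without the third rule the system simply stalls, and adding the forgotten rule lets the chain go through.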

What are the bottlenecks in building expert systems? There seem to be two. The first is the knowledge-acquisition bottleneck: it takes a long time to get knowledge from the heads of experts into a usable form in the machine. In almost every case so far, the successful building of an expert system has required the mediation of a human knowledge engineer sitting at that bottleneck, interfacing between the experts and the growing expert system. We do not really understand knowledge engineering, despite the fact that it is called engineering; it is really much more like an art. The only way we have been able to communicate the skill of being a knowledge engineer has been essentially the guild system of the Middle Ages: getting people involved in building expert systems for several years, having them pick up the necessary skills on their own, and then having them go off and build systems from then on. So there are very few knowledge engineers in the world, perhaps twenty or thirty, with another few coming out each year, whereas, as you can tell from the popular press, the demand and the need for them is perhaps three or four times larger.

The second bottleneck is what we might call the brittleness, or flexibility, bottleneck. Think of all the world's knowledge as a tree, general knowledge at the top and specific knowledge down at the bottom. Almost every existing computer program, certainly every expert system, really covers a narrow slice, a narrow sliver or even just a leaf, of this tree. So, for instance, while Mycin can be considered a medical diagnosis system, what it really does is decide which of five kinds of meningitis you are most likely to have. It does that better than most GPs; in fact it does it better than most experts. However, if you ask it to help treat a broken bicycle, it will tell you which kind of meningitis the bicycle is most likely to have.

So there are basically two problems, both of which have to do with adding knowledge to the system, and several different solutions have been considered. I will go through three of them here.

The first one is the natural language understanding dream. This is an appropriate place to talk about it, given all the work that Roger Schank has done on language understanding here at Yale. This has been a dream for thirty to thirty-five years, perhaps forty by now; ever since it began, it has been ten years away from success, and it still is. The dream is simply to feed books and articles into the computer and have it understand them. To pick an early, perhaps unfairly early, example of what is wrong with this dream: people thought that you could do machine translation by looking up words one at a time. But using a Russian-English dictionary that way, you end up with translations like "The spirit is willing but the flesh is weak" turning into "The vodka is fine but the meat is rotten," or "Out of sight, out of mind" turning into "The man is as blind as a saint." One of the problems that a disciple of Schank's at Berkeley is working on is more or less on this level. He was analyzing old George Burns and Gracie Allen routines. One that he showed me was: George says, "My aunt is in the hospital. She was sick and I visited her and I took her flowers," and Gracie says, "That's terrible, George. You should have brought her flowers." The word "took" reverses its meaning. To take a simpler example, consider "The pen is in the box." There are two kinds of pens, writing implements and corrals. How do you know which one the sentence means? The answer has nothing to do with English, nothing to do with parsing the sentence, with deciding where the nouns are and what "pen" is really supposed to mean and so forth. The answer is simply that you have to know about the world. You have to know what the different kinds of pens are, what their sizes are, what their relative frequencies are. You have to know what boxes are, what their function is, what they are made of, and so forth, and then consider the two cases and decide which is the more plausible meaning. In this case, one is wildly more plausible than the other. You can fool yourselves by looking at a paragraph or a short story, writing down the knowledge you need to understand that story, and thinking you have "cracked" natural language; then you turn the page, and here is a whole new set of facts about the world that you need in order to handle the next story. That is why I think that the natural language understanding dream will stay a dream until we have a large, comprehensive knowledge base of common sense, of the hundreds of thousands of things like the fact that cardboard boxes come in typical sizes and that cardboard can only support a certain weight, and so forth.
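
Here is a toy illustration, in the same illustrative Python as before, of how the "pen in the box" sentence gets disambiguated by world knowledge rather than by parsing. The senses and the sizes are hypothetical stand-ins for entries in a real knowledge base.

    # Rough characteristic sizes, in meters; made-up knowledge-base entries.
    TYPICAL_SIZE_M = {
        "pen (writing implement)": 0.15,
        "pen (animal corral)": 5.0,
        "box (cardboard)": 0.5,
    }

    def plausible_senses(inner_senses, outer):
        """Keep the senses of the contained word that could physically fit."""
        return [s for s in inner_senses
                if TYPICAL_SIZE_M[s] < TYPICAL_SIZE_M[outer]]

    # "The pen is in the box": only the writing implement survives.
    print(plausible_senses(["pen (writing implement)", "pen (animal corral)"],
                           "box (cardboard)"))

A real disambiguator would also weigh relative frequencies, functions, materials, and so on, exactly the facts listed above; size alone happens to settle this sentence.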

The second dream of how we can get knowledge into the computer en masse is the dream of machine learning: let the program go off by itself and learn from the environment the same way that people do. There have been lots of experiments in machine learning, and they have led to some reasonably well-tested and well-understood methods by now. I am not going to discuss any of these, but if you would like to find out more about them, let me recommend MACHINE LEARNING, Volumes One and Two. I will mention one particular kind of learning, with reference to the second of the tasks I listed at the start. We have discussed performance; now we can study creativity, namely learning by discovery.

My thesis at Stanford, about eleven years ago, was the AM (Automated Mathematician) program, which dealt with trying to get a program to make some simple discoveries using the scientific method. AM was given about a hundred and ten initial concepts dealing with set theory and functions, concepts like reversing a list and unioning sets. AM was also given a couple of hundred heuristics, of which my tried and true favorite is the one that says: if you have some function f from A to B, and there is some extreme kind of element of B that you know about, it is usually worth your time and trouble to define the inverse image of that extreme element back in A. So, in the case of AM, where f was set intersection and the extreme kind of set was the empty set, the inverse image of that intersection is "pairs of sets whose intersection is empty," which leads to the concept of disjointness. Heuristic rules, rules of good guessing and good judgment, led the program to define interesting concepts like prime numbers and to decide that there is something interesting about them. That is what I mean when I say that we can get computers to recreate discoveries. There are lots of examples of this.

I am not going to go through any of them, but, in the succeeding decade, we have applied AM and its successor EURISKO to many different domains, including the design of naval ships.
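
To make the inverse-image heuristic above concrete, here is a small sketch, again in illustrative Python, using a brute-force search over a few example sets. AM itself represented concepts as frames and ran hundreds of heuristics; this shows only the single step described.

    from itertools import combinations

    def inverse_image_of_extreme(f, extreme, examples):
        """Collect the argument pairs that f maps to the extreme value."""
        return [(a, b) for a, b in combinations(examples, 2)
                if f(a, b) == extreme]

    sets = [frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})]

    # f is set intersection; the extreme value is the empty set. The inverse
    # image is exactly the pairs of disjoint sets: the new concept,
    # "disjointness", that AM would go on to name and investigate.
    for pair in inverse_image_of_extreme(lambda a, b: a & b, frozenset(), sets):
        print(pair)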

We have looked at the natural language dream and the machine learning dream. The machine learning dream breaks down in this sense: the more you know, the more you can learn, which sounds fine until you consider the converse, namely that if the system does not start out knowing very much, there is not really much you can hope it will learn, completely cut off from the world. That has led us, somewhat reluctantly, to engage in a large project at MCC, which I will spend the rest of this paper discussing: handcrafting a real-world knowledge base, to enable the learning to take place and thereby enable the natural language effort to succeed. That is the two-step process by which we hope to crack this problem. The idea is that we will use real-world knowledge, what might be called common sense knowledge, the half million things that every child knows about the world, to overcome the brittleness and knowledge-acquisition bottlenecks.

To see what the problem is, let us go back to expert systems for a minute. Typically, one works with a tool to extract IF-THEN rules. A typical diagnosis rule might say: IF, during the intake interview, the doctor asks the patient, "Do you have X?" (X could be headaches, stomach pains, etc.) and the patient answers yes, THEN conclude that the patient really does have X. That turns out to be a useful rule, but you can imagine that sooner or later it is going to lead the system to a wrong diagnosis, and eventually the problem gets tracked back to this rule. For instance, X might be a history of insanity in the patient's family, and the patient decided to lie about his family. That causes the person maintaining the system to add an extra little "unless" clause to the rule, which says, "and X is not something the patient would lie to the doctor about." So, time after time, you have to go back and fix the rule up, adding "unless" conditions incrementally, getting it better and better. Yet this "unless" condition really has nothing to do with medical diagnosis per se; it has to do with common sense. In fact, that is one reason it was omitted from the original formulation of the rule to begin with. Now, why can't the system figure out that those possibilities exist? The reason is that an expert system's view of this rule is really this: the expert system does not know anything about doctors and patients and diseases and diagnosis; it just knows that there are some tokens, some symbols, which it is pushing around according to some rules. That is what is missing, and what we hope this half-million-fact knowledge base will supply. So let me give you an example of how that will work. After the rule is typed in, our program, the EURISKO program, asks the person entering the rule to explain every unknown term in it. So, for instance, "asking during the intake interview" is a kind of query, "doctor" is a member of the set of doctors, "patient" is a member of the set of patients, which then has to be explained to the system, and so forth. Now, how does it help to have that kind of reply? Suppose you know that asking a question during the intake interview is a kind of query, and the system already knows everything above that in the network: that a query is one kind of event in the communication process, and so forth. How does that help? Well, suppose you now want to know all the ways in which this rule could fail. Eventually you get to asking all the ways in which asking a question could give you the wrong answer, and those basically reduce to negating the constraints on all of the more general nodes in the network. Some of those constraints are things like: in order for a communication to go from A to B, it has to be true that B can understand what A is saying, and that A actually desired to communicate the truth to B, and so forth. When you negate those constraints, in this particular small example, you get a half dozen possible reasons why the rule could fail, among which is the fact that the patient lied to the doctor: the patient did not really want the doctor to know the true answer. There are lots of other plausible ones that we had not even thought of when we made up the example, but they came up automatically as we followed the process through. For instance, maybe the doctor did not understand what the patient was saying. As you carry this out further, to higher and higher, more and more general concepts, you get less and less plausible but nevertheless still possible explanations: maybe that person was not really a doctor; maybe there were not two people there after all. So much for common sense.
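
Here is a sketch of that failure-mode generation, once more in illustrative Python: walk up the concept taxonomy and negate each ancestor's constraints. The little network below is a hypothetical fragment, not EURISKO's actual knowledge base.

    # Each concept points at its generalization, and each carries constraints.
    PARENT = {
        "intake-question": "query",
        "query": "communication",
    }
    CONSTRAINTS = {
        "intake-question": ["the answerer really is the patient"],
        "query": ["the answerer knows the answer"],
        "communication": ["B can understand what A is saying",
                          "A desires to communicate the truth to B"],
    }

    def failure_modes(concept):
        """Negate the constraints of the concept and of every ancestor."""
        modes = []
        while concept is not None:
            for c in CONSTRAINTS.get(concept, []):
                modes.append("possible failure: NOT(" + c + ")")
            concept = PARENT.get(concept)
        return modes

    # "The patient lied" falls out as the negation of "A desires to
    # communicate the truth to B".
    for mode in failure_modes("intake-question"):
        print(mode)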

Let us turn for a moment to analogy, because one of the hypotheses we are working on nowadays is that analogy is one of the great untapped sources of power, something that people use to be intelligent, to perform, to be creative, to have common sense. If you know about nine things, there are nine times eight divided by two, that is, thirty-six, possible analogies among them. That is not very many, and the odds that any of them will be new and unexpected and useful are more or less negligible. People, on the other hand, know lots of different things, hundreds of thousands or millions of things to analogize among, and so generally have a positive expected gain from analogizing. Hopefully, our program will eventually have that same gain. We want analogy to help solve problems, to help suggest new concepts to add to the system, to help edit and add knowledge to the system, and also to help build up a library of analogies.

Let us see how a little of this works. Suppose that someone has proposed the analogy of medical treatment as warfare against disease. There might be some concepts that occur in one domain but not in the other, and the analogy would then lead to considering the analogous concepts. Secondly, after you decide to consider one of those, such as "susceptible to weapons," and start editing it and filling in information about it, the analogy also helps you guess what values to fill in for many of the slots of the new concept. For instance, what does "susceptible to weapons" make sense for? Using just the analogy we have, it makes sense for "enemy soldiers," which turns out to be correct. The idea is to suggest new concepts to add, and to help in the copy-and-edit phase of entering that knowledge.
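
Here is a sketch of that copy-and-edit step, in the same illustrative Python: given a mapping between the two domains, copy a frame from one and translate every slot value the analogy knows about. The frames and the mapping are made-up examples.

    ANALOGY = {                      # warfare term -> medical term
        "weapons": "drugs",
        "enemy-soldiers": "pathogens",
        "battlefield": "patient's body",
    }

    def copy_and_edit(frame, mapping):
        """Copy a frame, rewriting each slot value the analogy covers."""
        return {slot: mapping.get(value, value)
                for slot, value in frame.items()}

    susceptible_to_weapons = {
        "applies-to": "enemy-soldiers",
        "located-in": "battlefield",
    }

    # The guessed medical analogue, roughly "susceptible to drugs"; a person
    # would then review and edit the copied slots.
    print(copy_and_edit(susceptible_to_weapons, ANALOGY))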

The five-second answer is this: analogy works because there is a lot of common causality in the world, common causes that lead to an overlap between two systems, two phenomena, or whatever. We, as human beings, can observe only a tiny bit of that overlap, a tiny fraction of what is going on in the world. And yet, often enough for it to be useful, often enough for it to be cost-effective, we find that whenever we spot an overlap at the level we can observe, it is worth seeing whether there are additional overlapping features, even though we do not understand the causality behind them. So what it means for us to be working on analogy is essentially that we are taking what used to be treated as a single grand phenomenon, analogical reasoning, and breaking it down, finding four different dimensions along which analogies vary from each other. Now, in this huge four-dimensional matrix of a couple of thousand cells, we look at each of those specific phenomena and write down heuristics for how to find such analogies, how to use them, when to trust them, and when not to trust them. Let me give an example of what the first dimension is like.

The first dimension is the level or degree of match. You can have nearly identical matches, such as between irrigating and raining. Irrigating and teaching match less closely: teaching is transporting knowledge to the minds of students instead of transporting water to the ground around plants. If you did not already have the analogy, it suggests that learning is analogous to growing; if you did have it, it simply gets reinforced. Land tilling matches irrigating in yet another way: in one case you move ground to water, in the other you move water to ground. And at the far, superficial end of the dimension are matches like the fact that juries and doughnuts both come by the dozen.

Already analogy has begun to pay off. Diagramming several medical specialties and the rest of that part of the human-services network of concepts ended up taking us about four hours. After we did that, as an experiment, we looked at the educational professions, and that ended up taking about nine minutes. Basically, it was done by taking every medical concept, such as patients, and adding the educational analogue of it, in this case students. It took so little time because we could copy and edit the information already there.

Our methodology for building this huge body of common sense knowledge is to take a one-volume desk encyclopedia, which has about 30,000 one-paragraph articles, look at 400 mutually distinct kinds of articles and, for each kind of article, represent not only the stated information but, perhaps more importantly, look behind each sentence and ask: what knowledge did the writer of this article presume the reader already had in order to understand this sentence? So, for instance, at one point there was a pair of sentences which said that Napoleon died on St. Helena at a certain time and that Wellington was greatly saddened by the news. Now, if after reading that I ask you who you think lived longer, you will be offended: I just told you. Obviously, Wellington outlived Napoleon. If I ask why, you give me some facts of common sense which it is normally impolite even to mention, like "humans live for a single continuous interval of time," and "in order to be saddened by something you have to be conscious of it, and in order to be conscious you have to be alive."
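
As a toy encoding of the unstated reasoning in the Napoleon example, here is one last illustrative Python fragment. The predicate names are mine; the point is only that the stated fact plus two pieces of common sense entail the conclusion.

    SADDENED_BY_DEATH_OF = {("Wellington", "Napoleon")}   # the stated fact

    def outlived(a, b):
        """Infer that a outlived b from a's reaction to b's death."""
        # Common sense rule 1: to be saddened by news you must be conscious,
        # hence alive, when the news arrives.
        # Common sense rule 2: a human life is one continuous interval, so
        # being alive at b's death means a's death came later.
        return (a, b) in SADDENED_BY_DEATH_OF

    print(outlived("Wellington", "Napoleon"))   # True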

Let me sum up and review what I have presented. We have been assuming that intelligent behavior can be modeled using a collection of heuristics, together with judgments about the usefulness and interestingness of various concepts; and intelligent behavior includes creativity and discovery. The performance aspect of this led to the knowledge engineer's frame of mind, the difference between telling the program what to know versus the more traditional telling it what to do. We also saw that context is important. We have looked at several sources of power for doing the kind of mental amplification that I want to do. We looked at analogy and heuristic reasoning and, indirectly, at some others in our work, such as representing knowledge in various ways. I will not go into that here but, as I mentioned, I think that analogy is going to turn out to be one of the key, as yet untapped, sources of power for machine intelligence. We have seen how having a general body of knowledge could help: by making some simple ties to existing expert-system-level rules, it could make them less brittle. Remember the brittleness bottleneck; we hope to overcome it through this ability to fall back on more general knowledge when the expert rules as provided do not work. We briefly discussed analogy, how it works, and how we are going to use it. In particular, I explained what it means for us to be doing research on it, namely to look at a thousand phenomena, each of which is a very specialized kind of analogical reasoning, and to ask of each a set of questions that lead to heuristics for doing that kind of analogical reasoning.

Let me conclude by directing you to the paper that my group and I published in AI Magazine, Winter 1985/Spring 1986. It is a good summary of the work we have been doing for the last year and intend to do for the coming decade. This decade's work on building up a common sense knowledge base will take about 250 man-years of effort. Then, after the decade of work on common sense succeeds, there will be time to return to learning and discovery, and to return to the work on expert systems and natural language understanding. But the performance side, how to use these techniques, is what we are working on now.