News and Commentary
Origins & Design 19:1

The Simulation of Expertise:
Deeper Blue and the Riddle of Cognition


Timothy McGrew
Department of Philosophy
Western Michigan University
timothy.mcgrew@wmich.edu

Something is going on in the grandmaster’s mind that is not only radically different from what Deeper Blue’s program does, but also inconceivably more efficient. It is a kind of computational miracle that humans can play chess at all.


When Garry Kasparov resigned the sixth game of his chess match with IBM’s “Deeper Blue” on May 11, 1997, a dream came true for the artificial intelligence community. In the very center of the public eye, a machine had bested the world’s top human player at a contest that has stood for centuries as the touchstone of human intellect. At stake was more than a set of games or a multi-million dollar purse; it was a sort of Turing test for intelligence on the chess board, an attempt to simulate human expertise in a high-prestige arena. “If one could devise a successful chess machine,” wrote a group of early researchers, “one would seem to have penetrated to the core of human intellectual endeavor.”1

Many people unacquainted with the history of computer chess seemed to consider the result to have been inevitable. But among the cognoscenti, reactions to the computer’s victory were decidedly mixed. Writing in prospect, Richard Dawkins made no attempt to conceal his delight at the spectacle of a computer program downing the world champion. “Humanity,” he explained, “needs a lesson in humility.”2 But in the auditorium the audience of chess players greeted the victorious IBM team with boos and hisses. A sizable number of Internet commentators, both chess players and computer scientists, dismissed the result as a fluke. Kasparov himself, who had raised the psychological stakes by billing himself as defending the honor of the human mind, had to reap the bitter harvest sown by his bravado; he behaved badly after his loss and probably reduced still further the slim hopes that IBM would sponsor a rematch. Newspaper cartoonists recycled the old theme of humans unplugging the machines, but their humor had a slightly bitter edge to it. If unplugging them is the best we can do, isn’t that really an admission that we have been bested? Have computers, at least in this field, finally unlocked the secrets of human thought?

How Computers Play Chess

It had been an extraordinarily long road to the top. Though some work had been done on the problem in the mid-1940’s, Claude Shannon’s article in the February 1950 issue of Scientific American was the first formal publication by a computer science researcher on what appeared, at first blush, to be an ideal task for the machines: to play chess at the level of the best human players. Superficially nothing could be simpler. Chess, stripped of its noble associations, is an abstract full information game that can be solved, in theory, with finite computational resources. In principle, it is in the same class as tic-tac-toe -- enormously more complicated, of course, yet still a solvable game.

But it soon emerged that the problem is nowhere near as simple as that description makes it seem. The comparison to tic-tac-toe in particular illustrates just how misleading a superficial similarity can be, and what enormous difficulties are hidden behind those qualifying phrases “in principle” and “in theory.” To understand why, and to appreciate fully what an incredible feat it is for the human mind to play a competent game of chess, we need to examine more concretely some of the features of the game that make it resistant to a straight computational solution and some of the artful ways that programmers have tried to overcome these problems.

The first and, in some ways, the most dramatic problem involves the sheer computational complexity of chess. In the initial position of a game of chess, each side has twenty legal moves. But as the center pawns are moved, that number escalates rapidly to the point where a typical middlegame position will offer either side an average of thirty-eight legal moves. Programmers speak of this as the “search width,” the size of the set of options one has at a given turn. Some of these, of course, will be blunders that incur an immediate disadvantage of one sort or another. But in order to play a reasonably strong game of chess, a computer program must generate all of the positions (or “nodes”) that might arise at the next turn and evaluate each one. Having achieved this, the computer must now consider all of the positions arising from the opponent’s responses to each of these moves, and evaluate those. This process is iterated until the computer reaches the point where its time allocation is used up,3 when it must select the move that, according to the evaluations of the terminal nodes at the deepest level it has achieved, maximizes its gains against the opponent’s “best” responses -- the enemy moves that the program evaluates as most effectively minimizing the value of its own position (see Figure 1).

Programmers speak of the successive iterations of this process as the “search depth,” taking one “ply” (a move for just one side) as the standard unit of depth. In a typical middlegame position, the number of positions that must be generated in order to “search” for the most advantageous move is a function of both width (W) and depth (D), namely, W^D.
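To make the process concrete, here is a minimal sketch in Python of the full-width, fixed-depth search just described. The helper functions legal_moves, make_move, and evaluate are hypothetical stand-ins for a real move generator and evaluation function, not any particular program’s code.

    # Minimal sketch of a full-width, fixed-depth search.
    # legal_moves(), make_move(), and evaluate() are hypothetical helpers.

    def search(position, depth, maximizing):
        """Score a position by looking 'depth' ply ahead."""
        if depth == 0:
            return evaluate(position)       # static evaluation at the horizon
        moves = legal_moves(position)       # the "width" W at this ply
        if not moves:                       # mate or stalemate: nothing to expand
            return evaluate(position)
        scores = [search(make_move(position, m), depth - 1, not maximizing)
                  for m in moves]
        # One side picks the maximum score, the other the minimum.
        return max(scores) if maximizing else min(scores)

    def choose_move(position, depth):
        """Select the move whose subtree scores best against best replies."""
        return max(legal_moves(position),
                   key=lambda m: search(make_move(position, m), depth - 1, False))

Since every one of the W legal moves at every ply is expanded, the number of positions at the horizon grows as W^D.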

In other words, the search problem is exponential: it requires drastically more than twice the computational resources for the machine to generate and evaluate twice as many moves ahead. Given the average figure cited above, an eight-ply lookahead requires generation of over two million times the number of nodes that a four-ply lookahead would require, for a total on the order of 4 x 10^12 nodes.

Big numbers can be boggling, and everyone knows that computers are remarkably fast. Does this snowballing complexity of the possibilities in chess really pose a problem for a fast computer? The simple answer is, it does. The number of possible ways to play the first ten moves in a game of chess (twenty ply) has been calculated to be far in excess of the number of fundamental particles in the universe. Even Deeper Blue, running on linked parallel processors that enable it to evaluate 200 million nodes per second, could not push an exhaustive full-width search forward in a typical middlegame situation to a depth of more than about seven ply in a ten-minute time frame. Ten ply would take over a year; twelve ply would take about fifteen centuries.
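Those estimates are easy to check. A back-of-the-envelope calculation in Python, assuming the figures used above (roughly 38 legal moves per position and 200 million evaluations per second), reproduces them:

    # Rough check of the full-width search times quoted above, assuming a
    # branching factor of 38 and 200 million position evaluations per second.
    WIDTH = 38
    NODES_PER_SECOND = 200_000_000

    for depth in (7, 10, 12):
        seconds = WIDTH ** depth / NODES_PER_SECOND
        print(f"{depth:2d} ply: about {seconds / 60:,.0f} minutes "
              f"({seconds / (3600 * 24 * 365.25):,.1f} years)")

Seven ply comes out at about ten minutes, ten ply at about a year, and twelve ply at roughly fifteen centuries, in line with the figures above.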

Faced with this prohibitive mathematical obstacle, programmers resort to a technique called “alpha-beta pruning” that allows the program to dispense with the evaluation of each and every node. Though eminently logical, the alpha-beta strategy is a little difficult to describe in words. Suppose that in a given position, Black has attacked White’s Queen with a Pawn and it is in White’s best interest to retreat her rather than suffer the loss of that piece. Only a limited number of moves will remove her from danger; the rest will simply lose her outright. Under such circumstances, the program does not have to evaluate every Black response to each of White’s options in order to select the best move: the moment it sees that (say) White’s moving the King into the corner allows Black to respond by capturing the Queen with the pawn, the score for the King move is so drastically minimized that it is bound to be worse than retreating the Queen, no matter what other options Black has. At that point, the computer can cease (for that depth) to consider any further responses to the King move: the remainder of the options in that portion of the search tree can be disregarded, since the move is already refuted. In a situation where there is a very small number of viable retreat options for the Queen, this may reduce the search width in that branch of the tree to a handful of branches -- hence the name “pruning” (see Figure 2).

It is easy to show that alpha-beta pruning will always yield the same move decision at any given depth. But to be most effective, such pruning needs to be accompanied by a helpful ordering of moves. In our hypothetical example, it does not narrow the search tree a bit if the capture of the Queen is the last of the opponent’s options to be considered, since by that time an evaluation for every other node has already been generated. Conversely, it narrows it to the maximum extent if the capture of the Queen is the first move to be considered, as all other responses are irrelevant. Ideally, with perfect ordering,4 alpha-beta pruning can reduce the number of nodes that must be evaluated to something on the order of 2W^(D/2).

In actual practice the theoretical ideal is not attained, but the best programs use sufficiently sophisticated move-ordering algorithms to cut the number of nodes down to about twice the ideal value. At typical values of W for chess positions, a constant factor of two or four in front of W^(D/2) is negligible, and we can approximate the benefit of alpha-beta pruning by saying that it leaves roughly the square root of the number of nodes that would otherwise have had to be evaluated. Put in different terms, alpha-beta pruning lets a program reach roughly twice the search depth in a fixed time interval that it could reach without it.
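A sketch of the pruned search, again in Python with the same hypothetical helpers plus an assumed order_moves heuristic that tries forcing moves such as captures first, shows where the cutoff occurs:

    # Sketch of alpha-beta search. legal_moves(), make_move(), evaluate(),
    # and order_moves() are hypothetical stand-ins; order_moves() should put
    # captures and other forcing moves first, since the quality of the
    # ordering is what makes the pruning effective.

    def alphabeta(position, depth, alpha, beta, maximizing):
        if depth == 0:
            return evaluate(position)
        moves = order_moves(position, legal_moves(position))
        if not moves:
            return evaluate(position)       # mate or stalemate
        if maximizing:
            best = float("-inf")
            for move in moves:
                best = max(best, alphabeta(make_move(position, move),
                                           depth - 1, alpha, beta, False))
                alpha = max(alpha, best)
                if alpha >= beta:           # the opponent already has a better
                    break                   # option elsewhere: prune this branch
            return best
        else:
            best = float("inf")
            for move in moves:
                best = min(best, alphabeta(make_move(position, move),
                                           depth - 1, alpha, beta, True))
                beta = min(beta, best)
                if alpha >= beta:           # this line is already refuted;
                    break                   # skip the remaining replies
            return best

The break statements are the “pruning”: once a single refutation, such as the Pawn’s capture of the Queen, has been found, the rest of that branch is never generated, which is how the node count approaches 2W^(D/2) when the ordering is good.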

Despite this reduction, the search problem in chess remains an exponential function of depth. As a consequence, linear improvements in hardware speed yield diminishing returns in search depth. A computer twice as fast cannot look twice as deep; it cannot even look one ply deeper. In the absence of a drastic change in technology, the “brute force” approach to chess has just about reached its limits. No foreseeable advances in hardware will even bring us close to a computational “solution” to chess within the projected lifetime of the universe.

The second problem, if less spectacular, is probably a deeper one: the difficulty of creating a meaningful evaluation function. As the program generates each node of the tree, it evaluates it numerically, typically with values that are expressed as fractions of a pawn’s value. Evaluation of each node is static: it depends solely on the configuration on the board at that time, not on dynamic possibilities that may unfold from the position. Sheer material superiority is generally a large factor in the evaluation, on the well-tested supposition that all else being equal, superior force is decisive in chess. Hence, computers are typically acquisitive, seeking to capture material if no concrete punishment turns up within the limits of the search. Static credits are also awarded for positions possessing certain non-material advantages (centralized pieces, control of open lines, a Rook on the seventh rank, a passed pawn) and debits for positional disadvantages (doubled or isolated pawns, a compromised castled position, a Knight on the edge of the board).
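A toy version of such a static evaluation, written in Python, might look like the sketch below. The weights and the helper predicates (passed_pawns, doubled_pawns, and so on) are purely illustrative assumptions, not Deeper Blue’s or any commercial program’s actual terms.

    # Toy static evaluation: material plus a few positional terms, scored in
    # fractions of a pawn. The weights and helper predicates are illustrative
    # assumptions only. Positive scores favor White, negative favor Black.

    PIECE_VALUES = {"P": 1.0, "N": 3.0, "B": 3.0, "R": 5.0, "Q": 9.0}

    def evaluate(position):
        score = 0.0
        # Material balance: usually the dominant term.
        for piece in position.pieces("white"):
            score += PIECE_VALUES.get(piece, 0.0)
        for piece in position.pieces("black"):
            score -= PIECE_VALUES.get(piece, 0.0)
        # A handful of static positional credits and debits.
        score += 0.5 * (rooks_on_seventh(position, "white")
                        - rooks_on_seventh(position, "black"))
        score += 0.3 * (passed_pawns(position, "white")
                        - passed_pawns(position, "black"))
        score -= 0.2 * (doubled_pawns(position, "white")
                        - doubled_pawns(position, "black"))
        score -= 0.3 * (knights_on_rim(position, "white")
                        - knights_on_rim(position, "black"))
        return score

A serious program carries hundreds or thousands of such terms; the point of the sketch is only that each one is a fixed, static feature of the board, blind to what may unfold from the position.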

But in many types of chess positions, one side has an advantage that does not fall neatly into such categories and does not issue in material gain within the typical search horizon of the program. In such positions, the computer may play moves that are irrelevant or worse, not seeing the danger coming until it is much too late to stop it. While some of these situations can be described sufficiently accurately to enable the computer to navigate their strategic complexities, it is very difficult to sort out the features that make one position relevantly similar to another. At present, some of the most prominent programmers are simply making ad hoc fixes. Bob Hyatt’s program “Crafty,” a refined version of his famous “Cray Blitz” program that won the world computer chess championship several times in the 1980’s, boasts approximately one thousand separate evaluation parameters, and Hyatt adds more specific fixes as inventive humans at the Internet Chess Club find one trick after another that the program’s existing evaluation parameters cannot see through.

This is one respect in which Deeper Blue has an advantage over its competition. IBM hired a bevy of grandmasters, including the current US champion Joel Benjamin, to work intensively with the programming team, to increase the number of parameters and fine-tune their values. Though the IBM team jealously guards the secrets of their program, it is reliably reported that they have something on the order of ten thousand specific parameters. In a special test before the 1997 Kasparov match, they “slowed down” the phenomenal parallel processing hardware on which Deeper Blue was scheduled to run and played what was, roughly, a uniform platform match against one of the best commercial programs. Deeper Blue won every game, proving that increased sophistication in the evaluation department pays off handsomely.

Human Expertise at Chess

The foregoing account suggests that the secret of human expertise in chess lies in the greater number of evaluation parameters human masters use. Deeper Blue’s performance seems to indicate that advances there are narrowing not merely the performance gap but the gap in understanding that separated “stupid” early chess computers from expert humans. But a moment’s reflection shatters this impression. Deeper Blue is running on a machine capable of evaluating 200 million nodes per second. A top grandmaster, at a very generous estimate, can visualize and evaluate perhaps as many as a hundred different possibilities in a minute of concentrated thought. This is a speed difference of eight orders of magnitude, greater than the relative speed gap between the most advanced tactical fighter jet and the average inchworm. Clearly, something is going on in the human grandmaster’s mind that is not only radically different from what Deeper Blue’s program does, but also inconceivably more efficient. In view of the incredible complexity of chess and the limited speed of the human mind, it is a kind of computational miracle that humans can play chess at all.

A major experiment carried out in the late 1930’s and early 1940’s by Dutch psychologist Adriaan de Groot sheds light on the issue of human expertise at chess.5 In one part of the experiment, de Groot asked a wide range of chess players, from grandmasters competing at the great AVRO tournament in 1938 down to club-level players, to evaluate some test positions, giving a running verbal commentary on their thought processes as they did so. Contrary to the mythical image of the chess master as a supergenius or a human computer,6 de Groot’s study shows that grandmasters and lesser players consider about the same number of possible positions (an average of 30-35; one grandmaster considered 76 positions), calculate to about the same maximum depth (between six and seven ply), and take about the same amount of time (between 9 and 13 minutes) to reach a decision.

Of these numbers, the average of 35 positions examined per move is the most dramatic. The clear implication, backed up by de Groot’s reports of the grandmasters’ verbal protocols, is that human chess masters immediately dismiss as irrelevant almost all of the possible moves for both sides in a given position, focusing only on a few alternatives at each ply. But exactly how the grandmasters determine relevance remains a riddle. Crude evaluation parameters of the sort programmers currently use don’t begin to do the job. At one time, when the computational barrier presented by full-width searches seemed insuperable, programmers did try a “selective search” approach modeled (or more accurately, intended to be modeled) on human methods of play. But the resulting programs were so prone to oversights in complex positions that they were consistently defeated by programs designed to do a full-width search.7 For decades now the selective search method has been abandoned.

The failure of selective search programs to produce decent chess play becomes even more puzzling when one considers that human beginners, properly taught, can rapidly learn to play better chess. It is no particular secret how this is done: the beginners are shown various combinations and motifs, then given the opportunity to apply them in exercise positions. With sufficient practice, they are able to produce similar combinations in their own games. The difficulty programmers have in emulating this learning process comes at the level of characterizing the various ideas that the beginners are being asked to absorb. Talented human students do not require a full abstract description of a “type of situation” in which a given motif might be useful; they rapidly see what is pertinent in the examples and, presented with exercises, detect relevant similarities without further prompting. Just where programmers would like to find clues, pedagogic practice defers to the human mind and leaves out the intermediate steps.

Such learning from examples does not give the student an ability to verbalize the abstract essence of the ideas he is absorbing. Indeed, experienced chess players are apt to recognize themselves in William James’s famous description of the tacit knowledge of an expert who

instinctively knows that, in a novel case, this and not that will be the promising course of action. ... The doctor will feel that the patient is doomed, the dentist will have a premonition that the tooth will break, though neither can articulate a reason for his foreboding. The reason lies imbedded, but not yet laid bare, in all the countless previous cases dimly suggested by the actual one, all calling up the same conclusion, which the adept thus finds himself swept on to, he knows not how or why.8

James perhaps overstates the case. The IBM team found human grandmasters sufficiently articulate that they were able to use their input to extend and refine Deeper Blue’s evaluation parameters. But that refinement, as we have already seen, comes nowhere near to the level necessary to enable a program to perform well with a selective search.

Another marked difference between human and computer chess lies in the ability of humans to formulate and follow plans. A human master will orient himself by various guideposts, particularly by pawn structure, and when no sudden combinative strokes are possible he will pursue goals suggested by those guideposts -- occupying an open file with a Rook, maneuvering a Knight to a good outpost square, liquidating an isolated pawn before it can become a liability, and so forth. For such strategic plans, no particularly deep lookahead is necessary; the outlines of the correct plan limit the number of relevant moves one needs to consider. Where such long-range planning is possible, a master may keep his sights fixed on one set of goals for many moves, with the consequence that he does not have to start from scratch in the assessment of a position at every turn.

By contrast, a computer essentially starts over at each move, its search not guided by any overarching ideas. Sometimes this works in the program’s favor, when a sudden break with the pursuit of typical strategic objectives enables it to achieve an immediate advantage. But fairly often the program’s “inhuman” moves are questionable, winning a relatively unimportant pawn while exposing it to serious dangers that lie beyond its search horizon. It is chiefly by this characteristic -- the readiness of the program to abandon the strategically indicated paths in the dubious pursuit of material gains -- that computers can be distinguished from human beings in blind tests.

Failure to appreciate these differences between human and computer approaches to chess probably contributed to the unfounded optimism of early researchers. The importance of the human capacity to sort out patterns, detect relevance, and apply learned knowledge is widely recognized in more sophisticated fields like medicine, but it appears to be every bit as important to human performance in abstract full information games. In construing chess as a computational problem, computer scientists overlooked the extent to which mastery in chess, like expertise in more open-ended pursuits, requires that elusive quality known as intuition.

The Simulation of Expertise

If the differences between computational approaches to chess and human cognition are so marked, why did Kasparov’s defeat generate such excitement and dismay? Part of the answer must be that people are widely uninformed about the actual nature of human problem-solving. Recognizing a grandmaster’s chess abilities as something extraordinary, outsiders are likely to impute to him improbable computational powers and to construe a man-versus-machine chess match as a contest between human and electronic computers. The truth, stranger than this popular fiction, is that chess grandmasters do not work like computers at all and their thought processes have thus far resisted computational simulation.

The point can be made another way. When it comes to the simulation of expertise at chess, we can distinguish a range of interesting questions:

  1. Can a computer program, on suitable hardware, play chess at a level equal to that of the best human players of the game?
  2. Can such a computer program play in a fashion that is indistinguishable, to an expert eye, from the play of a human being?
  3. Can a computer program play chess at this level by means of a program that models the operation of skilled human thought?

Cavils about Kasparov’s play in the 1997 match aside, few people now doubt that the answer to question 1 is “yes.” Computers can attain or surpass human performance levels at chess. But thus far, the answer to question 2, a chessic version of Alan Turing’s famous test for machine intelligence, is an equally straightforward “no.” The trouble is not just that humans sometimes make gross mistakes -- the machine could be programmed to blunder from time to time as well -- but rather that currently programmers have no idea how to enable the machine to select the relevant features of the position or to form and follow plans. Barring a conceptual breakthrough in this direction, computer chess is and will remain detectably inhuman.

More importantly, we can see why Deeper Blue fails the Turing test, why computer simulation doesn’t give us performance with a human feel to it. The only way for programs to achieve high performance levels is to exploit the remarkable speed of computers in a finessed, alpha-beta pruned version of the brute-force search. Perforce, this leads to an inhuman style of play.

In its original conception, the Turing test was intended to settle the question of consciousness. At a distance of half a century even the most enthusiastic naturalists would be reluctant to endorse this quaint relic of behaviorism; consciousness is, and will remain, the “hard problem” for materialists. But the more modest goal of emulating at least the problem-solving capacities of the human brain still draws the artificial intelligence community like a vision of the Holy Grail. On that front, the computer chess saga is indeed a lesson in humility. Deeper Blue’s performance does not advance our understanding of human cognition.

Sources

Binet, A. Psychologie des grands calculateurs et joueurs d’Echecs (Paris: Hachette, 1894).

Dawkins, R. The Selfish Gene, Revised edition (Oxford: Oxford University Press, 1989).

Frey, P., ed. Chess Skill in Man and Machine (New York: Springer-Verlag, 1983).

James, W., Principles of Psychology, Vol. 2 (New York: Dover Publications, 1950). This classic was originally published in 1890.

Knuth, D. and Moore, R. “An analysis of alpha-beta pruning.” Artificial Intelligence 6 (1975), 293-326.

Levy, D. N. L., ed., Computer Games I (New York: Springer-Verlag, 1988).

Endnotes

1. Newell, Shaw and Simon, “Chess-Playing Programs and the Problem of Complexity,” reprinted in Levy (1988), pp. 89-115.

2. Dawkins, R. (1989), p.277.

3. This simplifies matters slightly: sophisticated programs permit search extensions under circumstances where the static evaluation is unlikely to remain stable. But these do not materially affect the mathematics of the search process.

4. For more technical information on alpha-beta pruning, see Knuth and Moore (1975).

5. Neil Charness gives a good overview of de Groot’s research in chapter two of Frey (1983).

6. An image fostered, for example, in Binet (1894).

7. Some simple positions have proved amenable to selective search, or “forward pruning,” strategies. See, e.g., Levy (1988) p.174.

8. James (1950). The entire section on “Reasoning” is pertinent.

Copyright © 1998 Timothy McGrew. All rights reserved. International copyright secured.