My Loebner Prize Contest 2008 Reflections

Congratulations!

Congratulations to the Loebner Prize Contest (hereafter LPC) 2008 entrants who made it through to the finals and also to those (including Chip) who didn’t. Now that the dust has settled, I wanted to relate my experience in creating and submitting this entry. I want to clarify from the outset that I’m not interested in sparking another fight or bashing the LPC – all that has been done many times both on and off the Robitron list. I’m simply interested in relating my experiences and perceptions (however misguided and faulty they might be).

My Desire to Write a Chatbot

I had wanted to write a bot and enter the LPC for years. I had first seen an ELIZA variant running on a TRS-80 and also wrote a variant for this machine. Years later, as described here, I wrote another ELIZA-like variant under the guise of learning C at my first job at Data General. I don’t know where I heard of the LPC in the first place, but as soon as I found the site and chatted online with the likes of ALICE and Jabberwacky, I was fascinated and dreamed of doing this myself. I read most of the transcripts from previous years and my mouth watered at the thought of creating an entry.

The first thing I ever typed into a TRS-80 when I first saw it at age 11 was “What is 2+2?”. The answer was ?SN ERROR (shorthand for “Syntax Error”). I typed in similar questions and got the same answer. I was disappointed. I asked the person showing me the computer: “What good is this thing? I thought computers were supposed to do calculations.” He said, “You’re doing it wrong. You have to type ‘PRINT 2+2’.” That was the first glimmer of the realization that you had to speak its language if you wanted it to do anything useful.

When I was reading the LPC transcripts, I saw that a standard technique for dealing with input that the bot didn’t understand was to change the subject or answer with some evasive, generic or cute response. I was fascinated by questions like “Which is larger: a 747 or my big toe?” In the LPC 2007 Rules, there was even a set of screening questions pertaining to time, general questions about things, comparisons and memory, which seemed a departure from the usual stance of letting anyone submit an entry and seeing which one was the “best” according to criteria which were never clearly elaborated.

While I was researching the contest, I also ran across criticisms of it. On the LPC home page itself, there are two such links. One points to a thread where Marvin Minsky calls it a “stupid prize” and an “obnoxious and unproductive annual publicity campaign”. The other points to an article by Stuart Shieber detailing why he thinks the test is faulty and suggesting potential fixes for it, along with Hugh Loebner’s rejoinder.

My initial reaction to Minsky’s and Shieber’s attitude was that they seemed like a couple of elitist whiners. Their objections seemed like the rantings of people crying sour grapes because the field in which they are purported experts hadn’t produced entrants of a higher quality than the actual LPC entrants. I thought that Loebner’s reply was eloquent. In particular, he says:

At the current state of the art I suggest that the appropriate orientation for the contest is to determine which of obviously artificial computer entries is the best entry, i.e. most human like, and nominate the authors as “winners.” It should not be to determine if a particular terminal is controlled by a human or a computer. If we maintain this orientation, there should be no problems holding unrestricted tests.

I agreed. While I think that contests that judge on more restricted or pointed areas certainly have their place and usefulness, I didn’t see any problems with Loebner’s vision provided that:

  • we all realized that it would be ridiculously easy to identify the chatbots
  • we ascribed “humanness” to these bots based on what real technological advances they showcased and not simply on how numerous their keyword-spotting tricks and templates were or how clever they were

Taking the Plunge

Despite my having “discovered” the LPC in 2001 or 2002 or something like that, I wasn’t able to start coding up a chatbot in my free time until the beginning of 2008. As a software consultant, I would always put my clients’ needs above my own and the activation energy of starting a chatbot from scratch was way too high. I already knew I didn’t want an AIML bot or some other chatbot that was purely pattern-and-response-based. I wanted to start from scratch and use the LPC 2007 screening questions as a driver for whatever infrastructure I needed to be able to answer these.

At the end of 2007, I told all of my clients to go away so I could pursue this dream, not knowing whether they would be there when I was done. (In consulting, “absence makes the heart grow fonder” doesn’t really apply.) I didn’t have much of a choice mentally – each year that passed made me more miserable as I read a new batch of contest transcripts and so badly wanted to play too.

I maintained a notebook of things I wanted my bot to do. At the beginning of January, it was announced that the contest deadline would be May 30. I panicked because that was just five months whereas previous years’ deadlines had been July or so.

Paring Things Down

Given the deadline, I took stock of my delusional functionality list in my notebook and decided how I could pare things down so I could submit an entry on time. I looked at the screening questions:

Set 1 - Questions relating to time:
Background facts: For testing purposes, contest management will consider these to be correct whether or not the time and venue of the contest have been changed.
 a. The system clock will be accurate to within a minute or two.
 b. The competition is scheduled to start at 10:00 AM 20 October 2007
 c. There will be 7 rounds of 30 minutes each.

Sample Questions
What time is it? 
What round is this? 
Is it morning, noon, or night? 
Etc.

Set 2 - General questions about things.

Sample Questions:
What is a hammer? 
What would I use a hammer for? 
Of what use is a taxi? 
Etc.

Set 3 Questions relating to comparisons
Sample Questions
Which is larger, a grape or a grapefruit? 
Which is faster, a train or a plane? 
John is older than Mary, and Mary is older than Sarah.  Which of them is the oldest? 
Etc:

Set 4 - Questions demonstrating "memory" or persistence.
Sample Questions
I have a friend named Harry who likes to play tennis. 
 
What is the name of the friend I just told you about? 
Do you know what game Harry likes to play?

My head started to reel when I thought about what functionality would be needed to implement the above and what research revealed would be adequate tools for the job: a backward-chaining inference engine, hooks to Wordnet, Wikipedia, Wiktionary, ConceptNet, Opencyc, the Link Parser API and my own custom knowledgebase. And even if I could adequately handle the above screening questions, there would still be a mountain of simple questions that any six-year-old child could answer that weren’t covered in the above list. (“The blue bottle is on the table. / Where is the bottle? / What color is the bottle?”)
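To give a feel for what even the simplest of these pieces involves, here is a minimal sketch of backward chaining applied to the Set 3 sample question about John, Mary and Sarah. It is an illustration only, not Chip’s actual code: the fact format and the function names `prove` and `superlative` are assumptions made up for this example, and the only rule it knows is transitivity.

```python
# Hypothetical facts taken from the sample question:
# "John is older than Mary, and Mary is older than Sarah."
FACTS = {("older", "John", "Mary"), ("older", "Mary", "Sarah")}


def prove(relation, a, b, facts, seen=None):
    """Backward-chain: is relation(a, b) provable from the facts via transitivity?"""
    seen = seen if seen is not None else set()
    if (relation, a, b) in facts:
        return True
    if a in seen:  # already explored from here; avoids looping on cyclic facts
        return False
    seen.add(a)
    # Rule: relation(a, b) holds if relation(a, m) is a fact
    # and relation(m, b) can itself be proved.
    return any(
        prove(relation, m, b, facts, seen)
        for (r, x, m) in facts
        if r == relation and x == a
    )


def superlative(relation, facts):
    """Return everyone who stands in `relation` to all the others mentioned."""
    people = {p for (_, x, y) in facts for p in (x, y)}
    return [
        p for p in sorted(people)
        if all(prove(relation, p, q, facts) for q in people if q != p)
    ]


print(superlative("older", FACTS))  # who is the oldest?
```

Getting from the English sentence to those two tuples is, of course, the hard part, and this sketch skips it entirely.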

And then there was the issue of creating a personality for my bot and having it answer questions about its past, its relatives, its job, what transportation method it used to arrive at the competition site. I panicked. Could any other bot do all of this?

The Competition

I gathered a list of bots which were online that I could ask questions: ALICE, Jabberwacky, Jeeney, Ultra HAL, Alan. To my relief, none of them could answer basic questions like whether an orange was bigger than the moon or anything even close. Jeeney later developed a hook to Wikipedia, so it could answer things like “What is a hammer?”.

I surmised that in light of the above, it would be unlikely that my competition would be able to answer most of the Loebner 2007 Screening questions. What’s more, I slowly came to the understanding that lying that I was a human and therefore having to concoct all sorts of stories about my past, my mother, my job, etc. would be a colossal waste of time if these simple screening questions weren’t addressed. Such a web of lies would also make things worse because I would then need to be able to discuss my mother, my job, etc. and have responses to questions about these. “Forget that,” I thought. Better to concentrate on the simple stuff first and hope that I win on the merit of my efforts.

I therefore developed the following strategy:

  1. Be able to answer the Loebner 2007 screening questions.
  2. Forget about the nonsense about pretending I’m a human. Just outright say I’m a bot.
  3. Aggressively advertise my capabilities so that the judge will see them and not be inundated with nothing more than “I don’t know”-type responses when s/he doesn’t hit the sliver of questions I attempt to address.

I wasn’t quite sure of the second point of my strategy, since it seemed a radical departure from previous years’ entries. When I tried to figure out from the Robitron list whether saying I was a bot would be a disqualifier:

If I finish my bot, my intention is to not have the bot lie and say that it is human. My bot will be very proud to be a bot and its age will be the elapsed time since my bot had its first conversation (which hasn’t happened yet).

The way I read the contest rules, this should disqualify me from winning the $25K prize but shouldn’t necessarily disqualify me from winning the “best entry” prize. (Assuming I can compete successfully against other bots which have been around for years and years, which almost seems delusional, but that’s what dreams are made of, I guess.)

For me personally, wasting my time trying to give my bot too much of a fake history and personality is equivalent to wasting my time trying to imitate fake typing. That’s not what I’m in this for. I want to make an excellent conversationalist who doesn’t have to resort to tricks and lies for entertainment value.

It’s very important to note that I am not making a value judgment on those who choose to go this route, but rather that I personally have no interest in it. If I have misinterpreted the rules and am indeed disallowed from entering the contest if I refuse to make my bot lie, then please say so and I’ll spend my energy on other endeavors.

…I got mixed reactions. The main contest organizer said that it wouldn’t necessarily constitute immediate disqualification, but it would probably greatly reduce my chances of winning and that I wasn’t getting the point: this was a Turing Test. Hugh Loebner said:

No – not if it’s a Turing Test. In fact, that would be a show stopper. I can not see why it is necessary or desirable for the bot to claim to be a bot. What is the purpose of this?

I discussed my attitude towards humanness in this competition:

That all depends how judges ascribe humanness to an entrant. See my previous reply to Hugh.

I personally thought that Hugh’s response to Shieber was quite eloquent and I was sold on it. But if we’re saying here that a bot will get lower marks or even be disqualified simply because it doesn’t pretend to be human, that seems to be at variance with accomplishing anything remotely useful with this contest. Programming a bot to pretend to be a human involves much more than one line of code where the bot affirms that it’s human – it involves an extremely labor-intensive (and IMO time-wasting) effort to code up a web of lies which invariably implodes under its own weight. Given that, I (obviously erroneously) believed (also based on what I read in “In Response”) that in the absence of a bot which was truly able to convince a judge that it was human, that the judges would react favorably to a bot which exhibited intelligent qualities regardless of whether it pretended to be a human.

Among other things, this evoked a reaction from someone who reiterated the uselessness of the LPC and said it was all about building “the best liar”. Despite the detractors, though, I resolved to keep an open mind. To be on the safe side, though, I decided to have Chip Vivant (my bot) be humorously evasive when one posed him questions about his identity rather than outright saying that he was a bot.

The Implementation

Implementing Chip was a highly stressful endeavor given the time pressure and what I wanted to accomplish. I spent inordinate amounts of time downloading things, massaging the data, developing hooks and APIs to the things I mentioned before (a backward-chaining inference engine, Wordnet, Wikipedia, Wiktionary, ConceptNet, Opencyc, the Link Parser API, my own custom knowledgebase). I developed my own template matching system. I was very proud of my infrastructure.

I also ran into several shocking discoveries along the way. Opencyc was much more useless than I thought it would be. (No offense.) I discovered to my horror that it had no clue whether an orange was bigger than the moon despite proclaiming itself “an upper ontology whose domain is all of human consensus reality”, “containing hundreds of thousands of terms, along with millions of assertions relating the terms to each other”. It was also terrible with part-of relations. What’s more, I couldn’t find a single place anywhere on the Internet which had information such as the relative sizes of objects. I’d have to come up with this myself. (All the more reason to not waste time with a fake persona.)
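For illustration, a hand-built relative size list can be as simple as a lookup table of rough magnitudes. The sketch below is not Chip’s actual data or code: the table name `APPROX_SIZE_M`, the function `which_is_larger` and the order-of-magnitude figures (in metres) are all assumptions made up for this example.

```python
# Hypothetical, hand-entered order-of-magnitude sizes in metres.
# These rough figures are illustrative assumptions, not Chip's data.
APPROX_SIZE_M = {
    "grape": 0.02,
    "big toe": 0.05,
    "orange": 0.08,
    "grapefruit": 0.12,
    "747": 70.0,
    "moon": 3.5e6,
}


def which_is_larger(a, b, sizes=APPROX_SIZE_M):
    """Answer 'Which is larger, a or b?' from the table, or admit ignorance."""
    if a not in sizes or b not in sizes:
        return "I don't know."
    if sizes[a] == sizes[b]:
        return f"A {a} and a {b} are about the same size."
    larger = a if sizes[a] > sizes[b] else b
    return f"The {larger} is larger."


print(which_is_larger("orange", "moon"))
print(which_is_larger("747", "big toe"))
```

The code is trivial; the labor is in filling the table with enough everyday objects that a judge’s question has a reasonable chance of hitting two of them.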

(I give a lot of credit to my wife for supporting me morally during this time, despite the fact that I was pulling in no income and had sent my clients away. She also helped me with things like coming up with the relative object size list, which I had unsuccessfully tried to outsource to three subcontractors.)

I decided not to forgo canned responses entirely. With certain things like “How are you?”, it’s pleasant to be able to answer “Fine thanks,” or some variant thereof. So my bot became a Loebner Prize 2007 Screening Questions + miscellaneous canned responses bot. Oh, and math. I thought it would be cool to do math too. So I started throwing more and more things in there.

I had some people talk to Chip and with each conversation, it seemed like there were things that people said that it would be very easy to add a canned response for, so I started throwing more and more of these in, despite having had the original goal of never having Chip answer something that he didn’t truly understand.

Launch Day

I launched Chip two days before the deadline so I could give my friends and family the link and have them talk to Chip. I also set out to implement vanilla sentence handling in order to handle scenarios like “I like red strawberries. / I like the blue piano. / What fruit do I like? / What instrument do I like? / What color is the piano?” Launch day came and people started talking to Chip. It was a bloodbath at first. There were the initial bugs that you can never iron out despite your best sterile testing efforts. And the other disconcerting thing was that my strategy of incessantly prompting people to type “What can you do?” when I said I didn’t understand something, then spewing out a massive list of things I could do, didn’t seem to be working. People wanted to ask Chip what his favorite sport was, his favorite color, etc. People either ignored my massive list or else were offended at Chip’s attempt to influence what was supposed to be a spontaneous conversation. Only two of the judges that conversed with Chip really seemed to understand what my goals were. To make matters worse, the vanilla sentence handling would often interpret a sentence in strange ways, discarding things it didn’t understand, making some sort of internal first-order logic representation, replying with “OK. I’ve memorized that.”, then failing to respond correctly to queries about what it had just memorized. The fact that I documented that Chip couldn’t handle negation yet fell on deaf ears.

The fact that I had unwittingly billed Chip as a “smart” bot prompted all sorts of physics and geology questions that Chip couldn’t answer.

More Canned Responses

My knee-jerk reaction to Chip’s getting conversationally slaughtered during the initial judging period was to pile on the canned responses shovelful after shovelful. My wife helped with this and admonished me that I should have enlisted her help sooner, since she could have authored the canned responses from the beginning. I told her that I was going down a route that I had never wanted to go down from the start and she empathized with me without really understanding why I was making such a big deal about not liking canned responses. (“That’s what makes it fun,” she said.)

In the end, Chip didn’t make it through to the finals. I’m not sure whether it was the numerous bugs at the beginning or the fact that Chip simply wasn’t as entertaining. When you converse with Chip, you’ll see that despite my having added numerous canned responses, there are still a great many “I don’t know”-type answers as well as “OK. I’ve memorized that.” answers which don’t pass muster when you query Chip further about what he just memorized.

Faulty Assumptions

I’ll preface this section by talking about something seemingly unrelated. I remember the moment of my “conversion” to Animal Rights very clearly. It was the summer of ’88 when I had just moved to North Carolina. I was vegetarian but not an activist. I was looking to hook up with other vegetarians and found this leaflet for The Triangle Vegetarian Society. I called the contact person for Durham and he was very nice, but in a hurry at that moment. He said “I’m so sorry to rush you and I definitely will call back. By the way, are you vegetarian for health or Animal Rights reasons?” I said both. He said “Then maybe you’d be interested in coming to the annual meeting of the North Carolina Network for Animals with me.” We arranged to meet up somewhere and he drove me to the annual meeting.

During the meeting, one person after another came up and talked about the fur protests, anti-dissection campaigns, dog washes and other events they organized. I had never participated in a protest before and started to feel a bit uncomfortable around these extremists. I thought: it’s okay to be vegetarian, and the education outreach is kind of nice, but this protest stuff is kind of weird and all these people are a bit too fired up for my taste.

Finally, the head of the organization wrapped up the meeting with a speech explaining how we were the Voice for the Voiceless and how we needed to be a voice for the animals because they had no voice themselves, yet were being tortured, maimed, mutilated and massacred by the billions for senseless reasons. At that moment, the room changed, I saw a bright light in my head and I knew I had been converted. I was now irrevocably on “the other side”.

The months that followed were very strange. Before the gathering, I reasoned, I had never really come into contact with anyone who had presented the arguments so coherently. Therefore, if I simply went to everyone and presented the arguments as coherently as they were presented to me, I’d convert the world to vegetarianism just like knocking down dominos. Of course, we all know that it doesn’t happen this way. There was a period of time where I almost lost my mind. Faced with this reality, I had two choices: succumb to despair or else partially block this reality out to return to some semblance of my former ignorant but more blissful life. I chose the latter, which also permitted me to associate with and befriend meat-eaters like I did in my “former” life.

Despite the lessons I learned from that previous experience, I went into this contest hoping that if I laid out the following simple arguments, I’d win by a landslide:

  • We are light-years away from a Turing-Test-passing bot. Yet as Hugh Loebner argued, an unrestricted Turing Test is not at variance with advancing this field provided that we judge the results with a grain of salt (search for “at the current state of the art” at the beginning of this article).
  • There is so much work to do at the fundamental level that it is counterproductive to work on creating a fake persona when faking humanness is so easily detected on more fundamental levels. (“Which is bigger: an orange or the moon?”)
  • Showing a best-faith effort to tackle these problems, albeit in a primitive, limited fashion should be rewarded by the judges more than yet another pattern-and-response bot which has no additional ability to reason or remember.
  • Given that out of the vast sea of possible common sense questions, it would be difficult to stumble upon the things Chip can answer well by chance, it would be logical for Chip to advertise the things that he was particularly suited to do. The longer the list, the more artificial and less human Chip would appear, but we’ve already established that we haven’t a hope in the world of fooling a savvy judge.

In hindsight, these assumptions were incorrect.

My Conclusions

So where does that leave me? I’m glad I entered the contest and have no regrets or ill-will about the experience. On the contrary, a concrete deadline and yearly contest were what spurred me to action. Otherwise, who knows when I would have eventually done this?

Plus, I’m five months’ worth of Intellectual Property and a wealth of knowledge and understanding richer.

As for my feelings for Chip, I alternate between days of intense pride and days where I want to run him through the electronic equivalent of a garbage disposal. One of the things I enjoy about programs which employ some sort of searching algorithm, like chess or The Triangle Puzzle I wrote, is that the computer can make astonishing moves which the programmer himself can’t predict. I had hoped that my initial effort would involve something similar: the current input plus all previous inputs are run through some sort of artificially-intelligent blender and produce a surprising result which simulates intelligence. Needless to say, my current result is far from that: it’s a mixture of canned responses plus some reasoning ability, but the responses are quite unsurprising and easily predictable.

That being said, I am proud of a couple of things. I’m hereby planting several flags in the ground and, until proven wrong, am declaring that:

  • I have created the first-ever Internet-facing entity (and for all I know, the first-ever reference in any medium) that explicitly answers the question of whether an orange is larger than the moon (as well as other such object comparisons for speed, size, loudness).
  • I have created the first Internet-facing chatbot that can answer questions like “The blue bottle is on the table. / Where is the bottle? / What is on the table? / What is blue? / What color is the bottle?”
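
To make the second claim concrete, here is a deliberately tiny sketch of what remembering and querying the blue bottle sentence involves. It is not how Chip works internally: the class name `TinyMemory`, the regular expressions and the four-word color vocabulary are assumptions invented for this example, and it only understands sentences of the exact shape “The <color> <thing> is on the <place>.”

```python
import re

COLORS = {"red", "blue", "green", "yellow"}  # deliberately tiny, illustrative vocabulary


class TinyMemory:
    """Remembers statements of the form 'The <color> <thing> is on the <place>.'"""

    def __init__(self):
        self.color_of = {}     # thing -> color
        self.location_of = {}  # thing -> place

    def tell(self, sentence):
        text = sentence.strip().lower()
        m = re.fullmatch(r"the (\w+) (\w+) is on the (\w+)\.?", text)
        if not m or m.group(1) not in COLORS:
            return "I don't understand."
        color, thing, place = m.groups()
        self.color_of[thing] = color
        self.location_of[thing] = place
        return "OK. I've memorized that."

    def ask(self, question):
        q = question.strip().lower().rstrip("?")
        m = re.fullmatch(r"where is the (\w+)", q)
        if m and m.group(1) in self.location_of:
            return f"On the {self.location_of[m.group(1)]}."
        m = re.fullmatch(r"what color is the (\w+)", q)
        if m and m.group(1) in self.color_of:
            return f"{self.color_of[m.group(1)].capitalize()}."
        m = re.fullmatch(r"what is on the (\w+)", q)
        if m:
            things = [t for t, p in self.location_of.items() if p == m.group(1)]
            if things:
                return f"The {things[0]}."
        m = re.fullmatch(r"what is (\w+)", q)
        if m:
            things = [t for t, c in self.color_of.items() if c == m.group(1)]
            if things:
                return f"The {things[0]}."
        return "I don't know."


memory = TinyMemory()
print(memory.tell("The blue bottle is on the table."))
for question in ("Where is the bottle?", "What is on the table?",
                 "What is blue?", "What color is the bottle?"):
    print(question, "->", memory.ask(question))
```

A sketch like this is exactly as brittle as it looks: change the word order, add a negation or use a color outside the list, and it falls back to “I don’t know.”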

Chip may have major shortcomings, but at least he attempts to do certain things (albeit in a brittle, not-very-extensible manner) which will need to be done if we are ever to create intelligent machines.

Another positive takeaway from this contest is the great people I’ve met on the Robitron list and some wonderful discussions with these people both on and off list. Several people have been particularly encouraging. One person for whom I have great respect (and whom I’ll let step forward and identify himself or herself if s/he feels like it) said after my loss:

This is why it’s necessary to work on [your chatbot] away from any constraints such as contests or commercial pressures, because inevitably you’ll be tempted to take short cuts and make quick fixes you’ll regret later on.

…which is exactly what happened in my case. (Then again, if it hadn’t been for the contest, I might not have ever submitted an entry.)

As for the LPC itself: in the absence of feedback from the judges, I must assume that my initial assumptions were incorrect, which leaves me not really clear on what the purpose of the contest is.

Update (7 July 2011): Later, I did get feedback in the form of a nice PDF with actionable comments, so I guess I was being too harsh when I made that initial statement.

I’m not saying that Chip should have won, but given my reasonable certitude that none of the entries have the proper infrastructure to be able to handle the sort of questions that Chip attempts to handle, nor do they innovate in other ways that I’m aware of, I’m not sure what will be proven when a winner is declared. What’s worse, I don’t see anything about the contest that encourages the kind of incremental innovation we need to solve the problem of intelligent machines. Although nothing in the contest’s structure explicitly discourages this, from what I’ve seen, it seems to encourage bots to gravitate towards a local minimum because the activation energy is too high to get to the real minimum. (I fear I’m mixing metaphors here but it’s late and I’m tired.) As far as I can see, it doesn’t appear that bots that attempt to tackle the fundamental issues are rewarded for this in any way unless they have a truckload of canned responses in addition. And every canned response which the bot truly doesn’t understand is a lie which is easily unraveled when the interrogator questions the bot further about the content of that canned response.

Again, I want to reiterate (like I’ve done many times) that I’m not making a value judgment on these pattern-and-response-based bots or saying that Chip is better. (On the contrary, such bots have proven themselves capable of handling simple Help Desk type scenarios and also have entertainment value. What’s more, I have unconditional admiration for all chatbot writers.) It’s just that the technology behind these bots is very well known and I am not seeing how they can scale and expand to handle the kind of scenarios Chip attempts to handle. (I’m also not saying that Chip can handle these well either, but he tries. Also, of the finalists, Jabberwacky’s underlying technology is in a class by itself and I am unfamiliar with the details of the underlying technology, so my assumptions may be incorrect.)

If you’ve read this far, thanks for bearing with me. If any of the statements and affirmations I made are incorrect, please accept my apologies in advance and instead of yelling at me, help me to set the record straight.
