Originally, the Loebner Prize awarded $2,000 for the most human-seeming chatterbot in the competition. The prize was $3,000 in 2005 and $2,250 in 2006. In 2008, $3,000 was awarded.
In addition, there are two one-time-only prizes that have never been awarded. $25,000 is offered for the first chatterbot that judges cannot distinguish from a real human and which can convince judges that the human is the computer program. $100,000 is the reward for the first chatterbot that judges cannot distinguish from a real human in a Turing test that includes deciphering and understanding text, visual, and auditory input. Once this is achieved, the annual competition will end.
In 2006, the contest was organised by Tim Child (CEO of Televirtual) and Huma Shah. On August 30, the four finalists were announced:
• Rollo Carpenter
• Richard Churchill and Marie-Claire Jenkins
• Noah Duncan
• Robert Medeksza
The contest was held on 17 September in the VR theatre, Torrington Place campus of University College London. The judges included Kevin Warwick, professor of cybernetics at the University of Reading; John Barnden, professor of artificial intelligence and specialist in metaphor research at the University of Birmingham; Victoria Butler-Cole, a barrister; and Graham Duncan-Rowe, a journalist. The latter's experience of the event can be found in an article in Technology Review. The winner was 'Joan', based on Jabberwacky, both created by Rollo Carpenter.
The 2007 competition was held on 21 October in New York City. The judges were: Computer Science Professor Russ Abbott, Philosophy Professor Hartry Field, Psychology Assistant Professor Clayton Curtis and English lecturer Scott Hutchins.
No bot passed the Turing Test, but the judges ranked the three contestants as follows:
• 1st: Robert Medeksza from Zabaware, creator of Ultra Hal Assistant
• 2nd: Noah Duncan, a private entry, creator of Cletus
• 3rd: Rollo Carpenter from Icogno, creator of Jabberwacky
The winner received $2,250 and the annual medal. The runners-up received $250 each.
The 2008 competition was organised by Professor Kevin Warwick, coordinated by Huma Shah, and held on 12 October at the University of Reading, UK. After testing by over one hundred judges during the preliminary phase in June and July 2008, six finalists were selected from the thirteen original artificial conversational entity (ACE) entrants. Five of those invited competed in the finals:
• Brother Jerome, Peter Cole and Benji Adams
• Elbot, Fred Roberts / Artificial Solutions
• Eugene Goostman, Vladimir Veselov, Eugene Demchenko and Sergey Ulasen
• Jabberwacky, Rollo Carpenter
• Ultra Hal, Robert Medeksza
In the finals, each of the judges was given five minutes to conduct simultaneous, split-screen conversations with two hidden entities. Elbot of Artificial Solutions won the 2008 Loebner Prize bronze award for most human-like artificial conversational entity by fooling three of the twelve judges who interrogated it (in the human-parallel comparisons) into believing it was human. This came very close to the 30% traditionally required to consider that a program has actually passed the Turing test. Eugene Goostman and Ultra Hal each deceived one judge into believing it was the human.
Will Pavia, a journalist for The Times, served as a judge in the Loebner finals and has written about his experience of being deceived by Elbot and Eugene. Kevin Warwick and Huma Shah have also reported on the parallel-paired Turing tests.
The 2009 Loebner Prize Competition was held on 6 September 2009 at the Brighton Centre, Brighton, UK, in conjunction with the Interspeech 2009 conference. The prize amount for 2009 was US$3,000.
Entrants were David Levy, Rollo Carpenter, and Mohan Embar, who finished in that order.
The 2010 Loebner Prize Competition was held on 23 October at California State University, Los Angeles. The 2010 competition was the 20th running of the contest.
Official list of winners:
1991 Joseph Weintraub - PC Therapist
1992 Joseph Weintraub - PC Therapist
1993 Joseph Weintraub - PC Therapist
1994 Thomas Whalen - TIPS
1995 Joseph Weintraub - PC Therapist
1996 Jason Hutchens - HeX
1997 David Levy - Converse
1998 Robby Garner - Albert One
1999 Robby Garner - Albert One
2000 Richard Wallace - Artificial Linguistic Internet Computer Entity (A.L.I.C.E.)
2001 Richard Wallace - Artificial Linguistic Internet Computer Entity (A.L.I.C.E.)
2002 Kevin Copple - Ella
2003 Juergen Pirner - Jabberwock
2004 Richard Wallace - Artificial Linguistic Internet Computer Entity (A.L.I.C.E.)
2005 Rollo Carpenter - George
2006 Rollo Carpenter - Joan
2007 Robert Medeksza - Ultra Hal
2008 Fred Roberts - Elbot
2009 David Levy - Do-Much-More
2010 Bruce Wilcox - Suzette
Based on http://en.wikipedia.org/wiki/Loebner_prize licensed under the Creative Commons Attribution-Share-Alike License 3.0
Saturday, December 31, 2011
Loebner Prize I - rules and restrictions
The Loebner Prize is an annual competition in artificial intelligence that awards prizes to the chatterbot considered by the judges to be the most human-like. The format of the competition is that of a standard Turing test. In each round, a human judge simultaneously holds textual conversations with a computer program and a human being via computer. Based upon the responses, the judge must decide which is which.
The contest was launched in 1990 by Hugh Loebner in conjunction with the Cambridge Center for Behavioral Studies, Massachusetts, United States. It has since been associated with Flinders University, Dartmouth College, the Science Museum in London, and most recently the University of Reading. In 2004 and 2005, it was held in Loebner's apartment in New York City.
Within the field of artificial intelligence, the Loebner Prize is somewhat controversial; the most prominent critic, Marvin Minsky, has called it a publicity stunt that does not help the field along.
In addition, the time limit of 5 minutes and the use of untrained and unsophisticated judges have resulted in some wins that may be due to trickery rather than to plausible intelligence, as one can judge from transcripts of winning conversations.
The rules of the competition have varied over the years. Early competitions featured restricted-conversation Turing tests, but since 1995 the discussion has been unrestricted.
For the three entries in 2007, Robert Medeksza, Noah Duncan and Rollo Carpenter, some basic "screening questions" were used by the sponsor to evaluate the state of the technology. These included simple questions about the time, what round of the contest it was, and so on; general knowledge ("What is a hammer for?"); comparisons ("Which is faster, a train or a plane?"); and questions demonstrating memory for preceding parts of the same conversation. "All nouns, adjectives and verbs will come from a dictionary suitable for children or adolescents under the age of 12." Entries did not need to respond "intelligently" to the questions to be accepted.
In 2008, for the first time, the sponsor allowed the introduction of a preliminary phase to the contest, opening the competition to previously disallowed web-based entries judged by a variety of invited interrogators. The available rules do not state how interrogators are selected or instructed. Interrogators (who judge the systems) have limited time: 5 minutes per entity in the 2003 competition, 20-plus minutes per pair in the 2004–2007 competitions, and, since 2008, 5 minutes to conduct simultaneous conversations with a human and the program.
Based on http://en.wikipedia.org/wiki/Loebner_prize licensed under the Creative Commons Attribution-Share-Alike License 3.0
The Turing test VI – Variations of the Turing Test.
Numerous other versions of the Turing test, including those expounded above, have been mooted through the years.
Reverse Turing test and CAPTCHA
A modification of the Turing test wherein the objectives of one or more of the roles have been reversed between machines and humans is termed a reverse Turing test. An example is implied in the work of psychoanalyst Wilfred Bion, who was particularly fascinated by the "storm" that resulted from the encounter of one mind by another. Carrying this idea forward, R. D. Hinshelwood described the mind as a "mind recognizing apparatus," noting that this might be some sort of "supplement" to the Turing test. The challenge would be for the computer to be able to determine if it were interacting with a human or another computer. This is an extension of the original question that Turing attempted to answer, but would, perhaps, offer a high enough standard to define a machine that could "think" in a way that we typically define as characteristically human.
CAPTCHA is a form of reverse Turing test. Before being allowed to perform some action on a website, the user is presented with alphanumerical characters in a distorted graphic image and asked to type them out. This is intended to prevent automated systems from being used to abuse the site. The rationale is that software sufficiently sophisticated to read and reproduce the distorted image accurately does not exist (or is not available to the average user), so any system able to do so is likely to be a human.
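To make the mechanism concrete, here is a minimal sketch in Python (assuming the Pillow imaging library is installed); the challenge length, the tile sizes and the light per-character rotation are all illustrative choices, and a real CAPTCHA generator would add noise, warping and anti-segmentation measures that this toy version omits.

```python
# Minimal CAPTCHA-style sketch: render a random alphanumeric challenge
# with a small random rotation per character, then check the user's answer.
import random
import string
from PIL import Image, ImageDraw, ImageFont

def make_captcha(length=5, size=(200, 60)):
    text = "".join(random.choices(string.ascii_uppercase + string.digits, k=length))
    img = Image.new("RGB", size, "white")
    font = ImageFont.load_default()
    x = 10
    for ch in text:
        # Draw each character on its own transparent tile, rotate it slightly,
        # and paste the tile into the challenge image.
        tile = Image.new("RGBA", (30, 40), (255, 255, 255, 0))
        ImageDraw.Draw(tile).text((5, 10), ch, fill="black", font=font)
        tile = tile.rotate(random.uniform(-30, 30), expand=True)
        img.paste(tile, (x, 10), tile)
        x += 28
    return text, img

def verify(expected, answer):
    return answer.strip().upper() == expected

if __name__ == "__main__":
    answer, image = make_captcha()
    image.save("captcha.png")  # the distorted image shown to the user
    print(verify(answer, input("Type the characters you see: ")))
```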
Software that can defeat CAPTCHAs with some accuracy by analyzing patterns in the generating engine is being actively developed.
"Fly on the wall" Turing test
The "fly on the wall" variation of the Turing test changes the original Turing-test parameters in three ways. First, parties A and B communicate with each other rather than with party C, who plays the role of a detached observer ("fly on the wall") rather than of an interrogator or other participant in the conversation. Second, party A and party B may each be either a human or a computer of the type being tested. Third, it is specified that party C must not be informed as to the identity (human versus computer) of either participant in the conversation. Party C's task is to determine which of four possible participant combinations (human A/human B, human A/computer B, computer A/human B, computer A/computer B) generated the conversation. At its most rigorous, the test is conducted in numerous iterations, in each of which the identity of each participant is determined at random (e.g., using a fair-coin toss) and independently of the determination of the other participant's identity, and in each of which a new human observer is used (to prevent the discernment abilities of party C from improving through conscious or unconscious pattern recognition over time). The computer passes the test for human-level intelligence if, over the course of a statistically significant number of iterations, the respective parties C are unable to determine with better-than-chance frequency which participant combination generated the conversation.
The "fly on the wall" variation increases the scope of intelligence being tested in that the observer is able to evaluate not only the participants' ability to answer questions but their capacity for other aspects of intelligent communication, such as the generation of questions or comments regarding an existing aspect of a conversation subject ("deepening"), the generation of questions or comments regarding new subjects or new aspects of the current subject ("broadening"), and the ability to abandon certain subject matter in favor of other subject matter currently under discussion ("narrowing") or new subject matter or aspects thereof ("shifting").
The Bion-Hinshelwood extension of the traditional test is applicable to the "fly on the wall" variation as well, enabling the testing of intellectual functions involving the ability to recognize intelligence: If a computer placed in the role of party C (reset after each iteration to prevent pattern recognition over time) can identify conversation participants with a success rate equal to or higher than the success rate of a set of humans in the party-C role, the computer is functioning at a human level with respect to the skill of intelligence recognition.
Subject matter expert Turing test
Another variation is described as the subject matter expert Turing test, where a machine's response cannot be distinguished from an expert in a given field. This is also known as a "Feigenbaum test" and was proposed by Edward Feigenbaum in a 2003 paper.
Immortality test
The immortality-test variation of the Turing test would determine whether a person's essential character has been reproduced with enough fidelity that the reproduction cannot be distinguished from the original person.
Minimum Intelligent Signal Test
The Minimum Intelligent Signal Test, proposed by Chris McKinstry, is another variation of Turing's test, where only binary responses are permitted. It is typically used to gather statistical data against which the performance of artificial intelligence programs may be measured.
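A loose sketch of how such binary items might be scored is given below; the questions are invented examples rather than McKinstry's actual corpus, and the point is only that a responder's hit rate can be compared against the 50% chance baseline.

```python
# Illustrative scoring for a Minimum Intelligent Signal Test style item bank.
# The questions and answers here are invented examples for demonstration.
import random

ITEMS = [
    ("Is the sun larger than the moon?", True),
    ("Can a cat fly unaided?", False),
    ("Is water wet?", True),
    ("Is five less than three?", False),
]

def score(respond):
    """respond(question) -> bool; returns the fraction of items answered correctly."""
    hits = sum(1 for question, truth in ITEMS if respond(question) == truth)
    return hits / len(ITEMS)

if __name__ == "__main__":
    random_baseline = score(lambda q: random.choice([True, False]))
    print(f"random guesser: {random_baseline:.2f} (chance level is 0.50 on average)")
```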
Meta Turing test
Yet another variation is the Meta Turing test, in which the subject being tested (say, a computer) is classified as intelligent if it has created something that the subject itself wants to test for intelligence.
Hutter Prize
The organizers of the Hutter Prize believe that compressing natural language text is a hard AI problem, equivalent to passing the Turing test.
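A crude, single-number version of such a compression score can be computed with off-the-shelf tools; the Python sketch below uses the standard zlib and bz2 modules, which are far weaker than the purpose-built compressors used in the actual Hutter Prize, so it only illustrates the shape of the metric.

```python
# Rough illustration of compression as a single-number benchmark.
# Off-the-shelf zlib/bz2 are only stand-ins for the specialised
# compressors entered in the real Hutter Prize competition.
import bz2
import zlib

def compression_scores(text: str) -> dict:
    data = text.encode("utf-8")
    return {
        "raw bytes": len(data),
        "zlib": len(zlib.compress(data, level=9)),
        "bz2": len(bz2.compress(data, compresslevel=9)),
    }

if __name__ == "__main__":
    sample = ("The question and answer method seems to be suitable for "
              "introducing almost any one of the fields of human endeavour "
              "that we wish to include. ") * 50
    for name, size in compression_scores(sample).items():
        print(f"{name:>9}: {size} bytes")
    # Smaller compressed sizes indicate a better model of the text's
    # regularities, which the Hutter Prize treats as a proxy for intelligence.
```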
The data compression test has some advantages over most versions and variations of a Turing test, including:
• It gives a single number that can be directly used to compare which of two machines is "more intelligent."
• It does not require the computer to lie to the judge.
The main disadvantages of using data compression as a test are:
• It is not possible to test humans this way.
• It is unknown what particular "score" on this test—if any—is equivalent to passing a human-level Turing test.
Other tests based on compression or Kolmogorov Complexity
A related approach to Hutter's prize, which appeared in the late 1990s, is the inclusion of compression problems in an extended Turing test. Two major advantages of some of these tests are their applicability to nonhuman intelligences and their absence of a requirement for human testers.
Based on http://en.wikipedia.org/wiki/Turing_test licensed under the Creative Commons Attribution-Share-Alike License 3.0
Chatbots III - Commercial functions for Chatbots
Automated conversational systems have now progressed to the point where large companies such as Lloyds Banking Group, Royal Bank of Scotland, Renault and Citroën already use them instead of call centers to provide a first point of contact. Chatbots can also be deployed via Twitter or Windows Live Messenger; one example is the Robocoke chatbot for Coca-Cola Hungary, which provides users with information about the Coca-Cola brand and can also give party and concert recommendations across Hungary. Chatbots of this kind are often used for marketing purposes.
Popular online portals like eBay and PayPal also use multilingual virtual agents to offer online support to their customers. For example, PayPal uses the chatterbot Louise to handle queries in English and the chatterbot Léa to handle queries in French. Developed by VirtuOz, the two agents handle 400,000 conversations a month and have been in operation on PayPal websites since September 2008.
Malicious chatterbots are frequently used to fill chat rooms with spam and advertising, or to entice people into revealing personal information, such as bank account numbers. They are commonly found on Yahoo! Messenger, Windows Live Messenger, AOL Instant Messenger and other instant messaging protocols. There has also been a published report of a chatterbot used in a fake personal ad on a dating service's website.
Chatbot competitions focus on the Turing test or on more specific goals. Two such annual contests are the Loebner Prize and The Chatterbox Challenge.
Based on http://en.wikipedia.org/wiki/Chatbots licensed under the Creative Commons Attribution-Share-Alike License 3.0
The Turing test V – Strengths and Weaknesses of the Turing Test.
Strengths
Tractability
The philosophy of mind, psychology, and modern neuroscience have been unable to provide definitions of "intelligence" and "thinking" that are sufficiently precise and general to be applied to machines. Without such definitions, the central questions of the philosophy of artificial intelligence cannot be answered. The Turing test, even if imperfect, at least provides something that can actually be measured. As such, it is a pragmatic solution to a difficult philosophical question.
Breadth of subject matter
The power of the Turing test derives from the fact that it is possible to talk about anything. Turing wrote that "the question and answer method seems to be suitable for introducing almost any one of the fields of human endeavor that we wish to include." John Haugeland adds that "understanding the words is not enough; you have to understand the topic as well."
In order to pass a well-designed Turing test, the machine must use natural language, reason, have knowledge and learn. The test can be extended to include video input, as well as a "hatch" through which objects can be passed: this would force the machine to demonstrate the skill of vision and robotics as well. Together, these represent almost all of the major problems of artificial intelligence.
The Feigenbaum test is designed to take advantage of the broad range of topics available to a Turing test. It compares the machine against the abilities of experts in specific fields such as literature or chemistry.
Weaknesses
The Turing test is based on the assumption that human beings can judge a machine's intelligence by comparing its behaviour with human behaviour. Every element of this assumption has been questioned: the human's judgement, the value of comparing only behaviour and the value of comparing against a human. Because of these and other considerations, some AI researchers have questioned the usefulness of the test. In practice, the test's results can easily be dominated not by the computer's (pseudo-?) intelligence, but by the attitudes, skill or naiveté of the questioner.
Human intelligence vs intelligence in general
The Turing test does not directly test whether the computer behaves intelligently - it tests only whether the computer behaves like a human being. Since human behavior and intelligent behavior are not exactly the same thing, the test can fail to accurately measure intelligence in two ways:
Some human behavior is unintelligent
The Turing test requires that the machine be able to execute all human behaviors, regardless of whether they are intelligent. It even tests for behaviors that we may not consider intelligent at all, such as the susceptibility to insults, the temptation to lie or, simply, a high frequency of typing mistakes. If a machine cannot imitate human behavior in detail, it fails the test.
This objection was raised by The Economist, in an article entitled "Artificial Stupidity" published shortly after the first Loebner prize competition in 1992. The article noted that the first Loebner winner's victory was due, at least in part, to its ability to "imitate human typing errors." Turing himself had suggested that programs add errors into their output, so as to be better "players" of the game.
Some intelligent behavior is inhuman
The Turing test does not test for highly intelligent behaviors, such as the ability to solve difficult problems or come up with original insights. In fact, it specifically requires deception on the part of the machine: if the machine is more intelligent than a human being it must deliberately avoid appearing too intelligent. If it were to solve a computational problem that is impossible for any human to solve, then the interrogator would know the program is not human, and the machine would fail the test.
Because it cannot measure intelligence that is beyond the ability of humans, the test cannot be used in order to build or evaluate systems that are more intelligent than humans. Because of this, several test alternatives that would be able to evaluate superintelligent systems have been proposed.
Real intelligence vs simulated intelligence
The Turing test is concerned strictly with how the subject acts — the external behaviour of the machine. In this regard, it takes a behaviourist or functionalist approach to the study of intelligence. The example of ELIZA suggests that a machine passing the test may be able to simulate human conversational behavior by following a simple (but large) list of mechanical rules, without thinking or having a mind at all.
John Searle has argued that external behavior cannot be used to determine if a machine is "actually" thinking or merely "simulating thinking." His Chinese room argument is intended to show that, even if the Turing test is a good operational definition of intelligence, it may not indicate that the machine has a mind, consciousness, or intentionality (intentionality is a philosophical term for the power of thoughts to be "about" something).
Turing anticipated this line of criticism in his original paper, writing:
I do not wish to give the impression that I think there is no mystery about consciousness. There is, for instance, something of a paradox connected with any attempt to localise it. But I do not think these mysteries necessarily need to be solved before we can answer the question with which we are concerned in this paper. — Alan Turing, (Turing 1950)
Naivete of interrogators and the anthropomorphic fallacy
The Turing test assumes that the interrogator is sophisticated enough to determine the difference between the behaviour of a machine and the behaviour of a human being, though critics argue that this is not a skill most people have.
Turing does not specify the precise skills and knowledge required by the interrogator in his description of the test, but he did use the term "average interrogator": "[the] average interrogator would not have more than 70 per cent chance of making the right identification after five minutes of questioning". Shah & Warwick (2009c) show that experts are fooled, and that interrogator strategy ("power" vs. "solidarity") affects correct identification, the latter being more successful.
Chatterbot programs such as ELIZA have repeatedly fooled unsuspecting people into believing that they are communicating with human beings. In these cases, the "interrogator" is not even aware of the possibility that they are interacting with a computer. To successfully appear human, there is no need for the machine to have any intelligence whatsoever and only a superficial resemblance to human behaviour is required. Most would agree that a "true" Turing test has not been passed in "uninformed" situations like these.
Early Loebner prize competitions used "unsophisticated" interrogators who were easily fooled by the machines. Since 2004, the Loebner Prize organizers have deployed philosophers, computer scientists, and journalists among the interrogators. However, even some of these experts have been deceived by the machines.
Michael Shermer points out that human beings consistently choose to consider non-human objects as human whenever they are allowed the chance, a mistake called the anthropomorphic fallacy: They talk to their cars, ascribe desire and intentions to natural forces (e.g., "nature abhors a vacuum"), and worship the sun as a human-like being with intelligence. If the Turing test is applied to religious objects, Shermer argues, then inanimate statues, rocks, and places have consistently passed the test throughout history. This human tendency towards anthropomorphism effectively lowers the bar for the Turing test, unless interrogators are specifically trained to avoid it.
Impracticality and irrelevance: the Turing test and AI research
Mainstream AI researchers argue that trying to pass the Turing Test is merely a distraction from more fruitful research. Indeed, the Turing test is not an active focus of much academic or commercial effort—as Stuart Russell and Peter Norvig write: "AI researchers have devoted little attention to passing the Turing test." There are several reasons.
First, there are easier ways to test their programs. Most current research in AI-related fields is aimed at modest and specific goals, such as automated scheduling, object recognition, or logistics. In order to test the intelligence of the programs that solve these problems, AI researchers simply give them the task directly, rather than going through the roundabout method of posing the question in a chat room populated with computers and people.
Second, creating life-like simulations of human beings is a difficult problem on its own that does not need to be solved to achieve the basic goals of AI research. Believable human characters may be interesting in a work of art, a game, or a sophisticated user interface, but they are not part of the science of creating intelligent machines, that is, machines that solve problems using intelligence. Russell and Norvig suggest an analogy with the history of flight: Planes are tested by how well they fly, not by comparing them to birds. "Aeronautical engineering texts," they write, "do not define the goal of their field as 'making machines that fly so exactly like pigeons that they can fool other pigeons.'"
Turing, for his part, never intended his test to be used as a practical, day-to-day measure of the intelligence of AI programs; he wanted to provide a clear and understandable example to aid in the discussion of the philosophy of artificial intelligence. As such, it is not surprising that the Turing test has had so little influence on AI research — the philosophy of AI, writes John McCarthy, "is unlikely to have any more effect on the practice of AI research than philosophy of science generally has on the practice of science."
Based on http://en.wikipedia.org/wiki/Turing_test licensed under the Creative Commons Attribution-Share-Alike License 3.0
Chatbots II - History of Chatbots and their development over time.
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably - on the basis of the conversational content alone - between the program and a real human. The notoriety of Turing's proposed test stimulated great interest in Joseph Weizenbaum's program ELIZA, published in 1966, which seemed to be able to fool users into believing that they were conversing with a real human. However Weizenbaum himself did not claim that ELIZA was genuinely intelligent, and the Introduction to his paper presented it more as a debunking exercise:
[In] artificial intelligence ... machines are made to behave in wondrous ways, often sufficient to dazzle even the most experienced observer. But once a particular program is unmasked, once its inner workings are explained ... its magic crumbles away; it stands revealed as a mere collection of procedures ... The observer says to himself "I could have written that". With that thought he moves the program in question from the shelf marked "intelligent", to that reserved for curios ... The object of this paper is to cause just such a re-evaluation of the program about to be "explained". Few programs ever needed it more.
ELIZA's key method of operation (copied by chatbot designers ever since) involves the recognition of cue words or phrases in the input, and the output of corresponding pre-prepared or pre-programmed responses which can move the conversation forward in an apparently meaningful way (e.g. by responding to any input that contains the word 'MOTHER' with 'TELL ME MORE ABOUT YOUR FAMILY'). Thus an illusion of understanding is generated, even though the processing involved has been merely superficial. ELIZA showed that such an illusion is surprisingly easy to generate, because human judges are so ready to give the benefit of the doubt when conversational responses are capable of being interpreted as "intelligent". Thus the key technique here - which characterises a program as a chatbot rather than as a serious natural language processing system - is the production of responses which are sufficiently vague and non-specific that they can be understood as "intelligent" in a wide range of conversational contexts. The emphasis is typically on vagueness and unclarity, rather than any conveying of genuine information.
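A minimal sketch of this cue-word technique is shown below; the handful of rules is invented for illustration and comes nowhere near Weizenbaum's full DOCTOR script, but it reproduces the basic recognise-a-keyword, emit-a-canned-response loop.

```python
# Toy ELIZA-style responder: scan the input for cue words and emit a
# pre-programmed reply; fall back to a content-free prompt otherwise.
# The rules below are invented for illustration, not Weizenbaum's script.
import random
import re

RULES = [
    (re.compile(r"\bmother\b|\bfather\b|\bfamily\b", re.I),
     ["TELL ME MORE ABOUT YOUR FAMILY.",
      "WHO ELSE IN YOUR FAMILY COMES TO MIND?"]),
    (re.compile(r"\bi am (.+)", re.I),
     ["HOW LONG HAVE YOU BEEN {0}?",
      "WHY DO YOU SAY YOU ARE {0}?"]),
    (re.compile(r"\bi feel (.+)", re.I),
     ["WHY DO YOU FEEL {0}?"]),
]
FALLBACKS = ["PLEASE GO ON.", "I SEE. TELL ME MORE.", "HOW DOES THAT MAKE YOU FEEL?"]

def respond(user_input: str) -> str:
    for pattern, templates in RULES:
        match = pattern.search(user_input)
        if match:
            reply = random.choice(templates)
            # Echo the captured fragment (if any) back inside the canned template.
            return reply.format(*(g.rstrip(".!?") for g in match.groups()))
    return random.choice(FALLBACKS)

if __name__ == "__main__":
    print(respond("I am worried about my mother"))
    print(respond("I feel lost lately"))
```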
More recently, however, interface designers have come to appreciate that humans' readiness to interpret computer output as genuinely conversational - even when it is actually based on rather simple pattern-matching - can be exploited for useful purposes. Most people prefer to engage with programs that are human-like, and this gives chatbot-style techniques a potentially useful role in interactive systems that need to elicit information from users, as long as that information is relatively straightforward and falls into predictable categories. Thus, for example, online help systems can usefully employ chatbot techniques to identify the area of help that users require, potentially providing a "friendlier" interface than a more formal search or menu system. This sort of usage holds the prospect of moving chatbot technology from Weizenbaum's "shelf ... reserved for curios" to that marked "genuinely useful computational methods".
The classic historic early chatterbots are ELIZA (1966) and PARRY (1972). More recent notable programs include A.L.I.C.E., Jabberwacky and D.U.D.E (Agence Nationale pour la Recherche and CNRS 2006). While ELIZA and PARRY were used exclusively to simulate typed conversation, many chatterbots now include functional features such as games and web searching abilities. In 1984 a book called The Policeman's Beard is Half Constructed was published, allegedly written by the chatbot Racter (though the program as released would not have been capable of doing so).
One pertinent field of AI research is natural language processing. Usually, weak AI fields employ specialized software or languages created specifically for the narrow function required. For example, A.L.I.C.E. uses a markup language called AIML, which is specific to its function as a conversational agent and has since been adopted by various other developers of so-called Alicebots. Nevertheless, A.L.I.C.E. is still purely based on pattern-matching techniques without any reasoning capabilities, the same technique ELIZA was using back in 1966. This is not strong AI, which would require sapience and logical reasoning abilities.
Jabberwacky learns new responses and context based on real-time user interactions, rather than being driven from a static database. Some more recent chatterbots also combine real-time learning with evolutionary algorithms which optimise their ability to communicate based on each conversation held, with one notable example being Kyle, winner of the 2009 Leodis AI Award. Still, there is currently no general purpose conversational artificial intelligence, and some software developers focus on the practical aspect, information retrieval.
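The following toy sketch illustrates the general learn-from-conversation idea in a drastically simplified form (it is not Carpenter's actual algorithm): every user utterance is stored as a candidate reply to whatever the bot last said, so the bot's repertoire grows out of past conversations rather than a hand-written script.

```python
# Minimal learn-from-conversation bot: each user utterance is stored as a
# possible reply to the bot's previous line, so responses come from past
# conversations rather than a static script. Illustrative simplification only.
import random
from collections import defaultdict

class LearningBot:
    def __init__(self):
        self.replies = defaultdict(list)   # bot line -> user replies seen after it
        self.last_bot_line = None

    def respond(self, user_line: str) -> str:
        # Learn: what the user just said is a plausible reply to our last line.
        if self.last_bot_line is not None:
            self.replies[self.last_bot_line].append(user_line)
        # Respond: reuse something a user once said after this same line, or
        # fall back to any previously learned utterance.
        candidates = self.replies.get(user_line)
        if not candidates:
            learned = [u for utterances in self.replies.values() for u in utterances]
            candidates = learned or ["Tell me something, I am still learning."]
        self.last_bot_line = random.choice(candidates)
        return self.last_bot_line

if __name__ == "__main__":
    bot = LearningBot()
    for line in ["hello", "how are you?", "hello", "fine thanks"]:
        print("user:", line, "| bot:", bot.respond(line))
```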
Based on http://en.wikipedia.org/wiki/Chatbots licensed under the Creative Commons Attribution-Share-Alike License 3.0
The Turing test IV – Different versions of the Turing Test and how they matched against each other.
There are at least three primary versions of the Turing test, two of which are offered in "Computing Machinery and Intelligence" and one that Saul Traiger describes as the "Standard Interpretation." While there is some debate regarding whether the "Standard Interpretation" is that described by Turing or, instead, based on a misreading of his paper, these three versions are not regarded as equivalent, and their strengths and weaknesses are distinct.
The Imitation Game
Turing's original game, as we have seen, described a simple party game involving three players. Player A is a man, player B is a woman and player C (who plays the role of the interrogator) is of either sex. In the Imitation Game, player C is unable to see either player A or player B, and can only communicate with them through written notes. By asking questions of player A and player B, player C tries to determine which of the two is the man and which is the woman. Player A's role is to trick the interrogator into making the wrong decision, while player B attempts to assist the interrogator in making the right one.
Sterrett refers to this as the "Original Imitation Game Test." Turing proposes that the role of player A be filled by a computer, so that the computer's task is to pretend to be a woman and attempt to trick the interrogator into making an incorrect evaluation. The success of the computer is determined by comparing the outcome of the game when player A is a computer against the outcome when player A is a man. If, as Turing puts it, "the interrogator decide[s] wrongly as often when the game is played [with the computer] as he does when the game is played between a man and a woman", it may be argued that the computer is intelligent. Other commentators, in contrast to Sterrett's opinion, posit that Turing did not expect the design of the machine to imitate a woman when compared against a human.
The second version appears later in Turing's 1950 paper. As with the Original Imitation Game Test, the role of player A is performed by a computer, the difference being that the role of player B is now to be performed by a man rather than a woman.
"Let us fix our attention on one particular digital computer C. Is it true that by modifying this computer to have an adequate storage, suitably increasing its speed of action, and providing it with an appropriate programme, C can be made to play satisfactorily the part of A in the imitation game, the part of B being taken by a man?"
In this version, both player A (the computer) and player B are trying to trick the interrogator into making an incorrect decision.
The standard interpretation
Common understanding has it that the purpose of the Turing Test is not specifically to determine whether a computer is able to fool an interrogator into believing that it is a human, but rather whether a computer could imitate a human. While there is some dispute whether this interpretation was intended by Turing — Sterrett believes that it was and thus conflates the second version with this one, while others, such as Traiger, do not — this has nevertheless led to what can be viewed as the "standard interpretation." In this version, player A is a computer and player B a person of either gender. The role of the interrogator is not to determine which is male and which is female, but which is a computer and which is a human.
Imitation Game vs. Standard Turing Test
There has arisen some controversy over which of the alternative formulations of the test Turing intended. Sterrett argues that two distinct tests can be extracted from his 1950 paper and that, pace Turing's remark, they are not equivalent. The test that employs the party game and compares frequencies of success is referred to as the "Original Imitation Game Test," whereas the test consisting of a human judge conversing with a human and a machine is referred to as the "Standard Turing Test," noting that Sterrett equates this with the "standard interpretation" rather than the second version of the imitation game. Sterrett agrees that the Standard Turing Test (STT) has the problems that its critics cite but feels that, in contrast, the Original Imitation Game Test (OIG Test) so defined is immune to many of them, due to a crucial difference: Unlike the STT, it does not make similarity to human performance the criterion, even though it employs human performance in setting a criterion for machine intelligence. A man can fail the OIG Test, but it is argued that it is a virtue of a test of intelligence that failure indicates a lack of resourcefulness: The OIG Test requires the resourcefulness associated with intelligence and not merely "simulation of human conversational behaviour." The general structure of the OIG Test could even be used with non-verbal versions of imitation games.
Still other writers have interpreted Turing as proposing that the imitation game itself is the test, without specifying how to take into account Turing's statement that the test that he proposed using the party version of the imitation game is based upon a criterion of comparative frequency of success in that imitation game, rather than a capacity to succeed at one round of the game.
Saygin has suggested that the original game may be a way of proposing a less biased experimental design, as it hides the participation of the computer.
Should the interrogator know about the computer?
Turing never makes clear whether the interrogator in his tests is aware that one of the participants is a computer. To return to the Original Imitation Game, he states only that player A is to be replaced with a machine, not that player C is to be made aware of this replacement. When Colby, FD Hilf, S Weber and AD Kramer tested PARRY, they assumed that the interrogators did not need to know that one or more of those being interviewed was a computer during the interrogation. As Ayse Saygin and others have highlighted, this makes a big difference to the implementation and outcome of the test. Huma Shah and Kevin Warwick, who have organised practical Turing tests, argue that knowing or not knowing may make a difference to some judges' verdicts. Judges in the finals of the parallel-paired Turing tests staged in the 18th Loebner Prize were not explicitly told, though some did assume that each hidden pair contained one human and one machine. Spelling errors gave away the hidden humans, while machines were identified by their speed of response and lengthier utterances. In an experimental study of Gricean maxim violations that also used the Loebner transcripts, Ayse Saygin found significant differences between the responses of participants who knew and did not know that computers were involved.
Based on http://en.wikipedia.org/wiki/Turing_test licensed under the Creative Commons Attribution-Share-Alike License 3.0
The Turing test III – The Turing Colloquiums and the Loebner Prize.
1990 was the fortieth anniversary of the first publication of Turing's "Computing Machinery and Intelligence" paper, and thus saw renewed interest in the test. Two significant events occurred in that year: the first was the Turing Colloquium, held at the University of Sussex in April, which brought together academics and researchers from a wide variety of disciplines to discuss the Turing Test in terms of its past, present, and future; the second was the formation of the annual Loebner Prize competition. However, after nineteen Loebner Prize competitions, the contest is not viewed as contributing toward the science of machine intelligence, nor as alleviating the controversy surrounding the usefulness of Turing's test.
The Loebner Prize provides an annual platform for practical Turing Tests, with the first competition held in November 1991. It is underwritten by Hugh Loebner; the Cambridge Center for Behavioral Studies in Massachusetts, United States, organized the Prizes up to and including the 2003 contest. As Loebner described it, one reason the competition was created was to advance the state of AI research, at least in part because no one had taken steps to implement the Turing Test despite 40 years of discussing it.
The first Loebner Prize competition in 1991 led to a renewed discussion of the viability of the Turing Test and the value of pursuing it, in both the popular press and academia. The first contest was won by a mindless program with no identifiable intelligence that managed to fool naive interrogators into making the wrong identification. This highlighted several of the shortcomings of the Turing test (discussed below): the winner won, at least in part, because it was able to "imitate human typing errors"; the unsophisticated interrogators were easily fooled; and some researchers in AI have come to feel that the test is merely a distraction from more fruitful research.
The silver (text only) and gold (audio and visual) prizes have never been won. However, the competition has awarded the bronze medal every year for the computer system that, in the judges' opinions, demonstrates the "most human" conversational behavior among that year's entries. Artificial Linguistic Internet Computer Entity (A.L.I.C.E.) has won the bronze award on three occasions in recent times (2000, 2001, 2004). Learning AI Jabberwacky won in 2005 and 2006.
The Loebner Prize tests conversational intelligence; winners are typically chatterbot programs, or Artificial Conversational Entities (ACEs). Early Loebner Prize rules restricted conversations: each entry and hidden human conversed on a single topic, so the interrogators were restricted to one line of questioning per entity interaction. The restricted-conversation rule was lifted for the 1995 Loebner Prize. The interaction duration between judge and entity has varied across Loebner Prizes. In Loebner 2003, at the University of Surrey, each interrogator was allowed five minutes to interact with an entity, machine or hidden human. Between 2004 and 2007, the interaction time allowed was more than twenty minutes. In 2008, the interrogation duration allowed was five minutes per pair, because the organiser, Kevin Warwick, and coordinator, Huma Shah, consider this to be the duration for any test, as Turing stated in his 1950 paper: " ... making the right identification after five minutes of questioning". They felt Loebner's longer test, implemented in the 2006 and 2007 Loebner Prizes, was inappropriate for the state of artificial conversation technology. It is ironic that the 2008 winning entry, Elbot, does not mimic a human; its personality is that of a robot, yet Elbot deceived three human judges into believing it was the human during human-parallel comparisons.
During the 2009 competition, held in Brighton, UK, the communication program restricted judges to ten minutes for each round: five minutes to converse with the human and five minutes to converse with the program. This was to test the alternative reading of Turing's prediction, that the five-minute interaction was to be with the computer. For the 2010 competition, the sponsor again increased the interaction time between interrogator and system to 25 minutes (Rules for the 20th Loebner Prize contest).
In November 2005, the University of Surrey hosted an inaugural one-day meeting of artificial conversational entity developers, attended by winners of practical Turing Tests in the Loebner Prize: Robby Garner, Richard Wallace and Rollo Carpenter. Invited speakers included David Hamill, Hugh Loebner (sponsor of the Loebner Prize) and Huma Shah.
In parallel to the 2008 Loebner Prize held at the University of Reading, the Society for the Study of Artificial Intelligence and the Simulation of Behaviour (AISB) hosted a one-day symposium to discuss the Turing Test, organised by John Barnden, Mark Bishop, Huma Shah and Kevin Warwick. The speakers included the Royal Institution's Director, Baroness Susan Greenfield, Selmer Bringsjord, Turing's biographer Andrew Hodges, and consciousness scientist Owen Holland. No agreement emerged for a canonical Turing Test, though Bringsjord suggested that a sizeable prize would result in the Turing Test being passed sooner.
2012 will see a celebration of Turing’s life and scientific impact, with a number of major events taking place throughout the year. Most of these will be linked to places with special significance in Turing’s life, such as Cambridge, Manchester, and Bletchley Park. The Alan Turing Year is coordinated by the Turing Centenary Advisory Committee (TCAC), representing a range of expertise and organisational involvement in the 2012 celebrations. Supporting organisations for the Alan Turing Year include the ACM, the ASL, the SSAISB, the BCS, the BCTCS, Bletchley Park, the BMC, the BLC, the CCS, the Association CiE, the EACSL, the EATCS, FoLLI, IACAP, the IACR, the KGS, and LICS.
Supporting TCAC is Turing100. With the aim of taking Turing's idea of a thinking machine, as depicted in Hollywood movies such as Blade Runner, to a wider audience including children, Turing100 was set up to organise a special Turing test event celebrating the 100th anniversary of Turing's birth in June 2012, at the place where the mathematician broke codes during the Second World War: Bletchley Park. The Turing100 team comprises Kevin Warwick (Chair), Huma Shah (coordinator), Ian Bland, Chris Chapman and Marc Allen; supporters include Rory Dunlop and Loebner winners Robby Garner and Fred Roberts.
Based on http://en.wikipedia.org/wiki/Turing_test licensed under the Creative Commons Attribution-Share-Alike License 3.0
The Turing test II - Turning points in the history of the Turing Tests.
Blay Whitby lists four major turning points in the history of the Turing Test: the publication of "Computing Machinery and Intelligence" in 1950, the announcement of Joseph Weizenbaum's ELIZA in 1966, Kenneth Colby's creation of PARRY, first described in 1972, and the Turing Colloquium in 1990. Sixty years after its introduction, continued argument over Turing's 'can machines think?' experiment led to its reconsideration for the 21st century through the AISB's 'Towards a comprehensive intelligence test' symposium, held 29 March - 1 April 2010 at De Montfort University, UK.
ELIZA works by examining a user's typed comments for keywords. If a keyword is found, a rule that transforms the user's comments is applied, and the resulting sentence is returned. If a keyword is not found, ELIZA responds either with a generic riposte or by repeating one of the earlier comments. In addition, Weizenbaum developed ELIZA to replicate the behaviour of a Rogerian psychotherapist, allowing ELIZA to be "free to assume the pose of knowing almost nothing of the real world." With these techniques, Weizenbaum's program was able to fool some people into believing that they were talking to a real person, with some subjects being "very hard to convince that ELIZA [...] is not human." Thus, ELIZA is claimed by some to be one of the programs (perhaps the first) able to pass the Turing Test, although this view is highly contentious.
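A minimal sketch of this keyword-and-transformation mechanism is given below in Python; the rules and pronoun reflections are invented examples, and the real ELIZA (running Weizenbaum's DOCTOR script) used ranked keywords, decomposition rules and a memory mechanism that this sketch omits.

import random
import re

# Illustrative ELIZA-style responder: keyword rules transform the user's words,
# otherwise a generic riposte is returned. The rules shown here are invented examples.
REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you", "your": "my", "you": "I"}

RULES = [
    (r"i need (.*)", ["Why do you need {0}?", "Would it really help you to get {0}?"]),
    (r"i am (.*)",   ["How long have you been {0}?", "Why do you think you are {0}?"]),
    (r"my (.*)",     ["Tell me more about your {0}."]),
]
GENERIC = ["Please go on.", "I see.", "Very interesting."]

def reflect(fragment):
    # swap first- and second-person words so the echoed phrase reads naturally
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())

def eliza_respond(sentence):
    text = sentence.lower().strip().rstrip(".!?")
    for pattern, templates in RULES:
        match = re.match(pattern, text)
        if match:
            return random.choice(templates).format(*(reflect(g) for g in match.groups()))
    return random.choice(GENERIC)   # no keyword found: fall back to a generic riposte

print(eliza_respond("I am afraid of computers"))
# e.g. "How long have you been afraid of computers?"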
Colby's PARRY has been described as "ELIZA with attitude": it attempts to model the behaviour of a paranoid schizophrenic, using a similar (if more advanced) approach to that employed by Weizenbaum. In order to validate the work, PARRY was tested in the early 1970s using a variation of the Turing Test. A group of experienced psychiatrists analysed a combination of real patients and computers running PARRY through teletype machines. Another group of 33 psychiatrists were shown transcripts of the conversations. The two groups were then asked to identify which of the "patients" were human and which were computer programs. The psychiatrists were able to make the correct identification only 48 per cent of the time — a figure consistent with random guessing.
In the 21st century, ELIZA- and PARRY-style techniques have been developed into malware systems such as CyberLover, which preys on Internet users by convincing them to "reveal information about their identities or to lead them to visit a web site that will deliver malicious content to their computers" (iTWire, 2007). CyberLover, a one-trick-pony software program developed in Russia, has emerged as a "Valentine-risk", flirting with people "seeking relationships online in order to collect their personal data" (V3, 2010).
John Searle's 1980 paper Minds, Brains, and Programs proposed an argument against the Turing Test known as the "Chinese room" thought experiment. Searle argued that software (such as ELIZA) could pass the Turing Test simply by manipulating symbols it does not understand. Without understanding, it could not be described as "thinking" in the same sense people do. Therefore, Searle concludes, the Turing Test cannot prove that a machine can think.
Arguments such as that proposed by Searle and others working on the philosophy of mind sparked off a more intense debate about the nature of intelligence, the possibility of intelligent machines and the value of the Turing test that continued through the 1980s and 1990s.
Based on http://en.wikipedia.org/wiki/Turing_test licensed under the Creative Commons Attribution-Share-Alike License 3.0
The Turing test I – The Turing Test and how it came to be.
The Turing test is a test of a machine's ability to demonstrate intelligence. A human judge engages in a natural language conversation with one human and one machine, each of which tries to appear human. All participants are separated from one another. If the judge cannot reliably tell the machine from the human, the machine is said to have passed the test. In order to test the machine's intelligence rather than its ability to render words into audio, the conversation is limited to a text-only channel such as a computer keyboard and screen.
The test was introduced by Alan Turing in his 1950 paper Computing Machinery and Intelligence, which opens with the words: "I propose to consider the question, 'Can machines think?'" Since "thinking" is difficult to define, Turing chooses to "replace the question by another, which is closely related to it and is expressed in relatively unambiguous words." Turing's new question is: "Are there imaginable digital computers which would do well in the [Turing test]"? This question, Turing believed, is one that can actually be answered. In the remainder of the paper, he argued against all the major objections to the proposition that "machines can think".
In the years since 1950, the test has proven to be both highly influential and widely criticized, and it is an essential concept in the philosophy of artificial intelligence.
The question of whether it is possible for machines to think has a long history, which is firmly entrenched in the distinction between dualist and materialist views of the mind. From the perspective of dualism, the mind is non-physical (or, at the very least, has non-physical properties) and, therefore, cannot be explained in purely physical terms. The materialist perspective argues that the mind can be explained physically, and thus leaves open the possibility of minds that are artificially produced.
In 1936, philosopher Alfred Ayer considered the standard philosophical question of other minds: how do we know that other people have the same conscious experiences that we do? In his book Language, Truth and Logic Ayer suggested a protocol to distinguish between a conscious man and an unconscious machine: "The only ground I can have for asserting that an object which appears to be conscious is not really a conscious being, but only a dummy or a machine, is that it fails to satisfy one of the empirical tests by which the presence or absence of consciousness is determined." (This suggestion is very similar to the Turing test, but it is not certain that Ayer's popular philosophical classic was familiar to Turing.)
Researchers in the United Kingdom had been exploring "machine intelligence" for up to ten years prior to the founding of the field of AI research in 1956. It was a common topic among the members of the Ratio Club, an informal group of British cybernetics and electronics researchers that included Alan Turing, after whom the test is named.
Turing, in particular, had been tackling the notion of machine intelligence since at least 1941 and one of the earliest-known mentions of "computer intelligence" was made by him in 1947. In Turing's report, "Intelligent Machinery", he investigated "the question of whether or not it is possible for machinery to show intelligent behaviour" and, as part of that investigation, proposed what may be considered the forerunner to his later tests:
It is not difficult to devise a paper machine which will play a not very bad game of chess. Now get three men as subjects for the experiment. A, B and C. A and C are to be rather poor chess players, B is the operator who works the paper machine. ... Two rooms are used with some arrangement for communicating moves, and a game is played between C and either A or the paper machine. C may find it quite difficult to tell which he is playing.
When Turing published "Computing Machinery and Intelligence" he had been considering the possibility of artificial intelligence for many years, though this was the first published paper by Turing to focus exclusively on the notion.
Turing begins his 1950 paper with the claim "I propose to consider the question 'Can machines think?'" As he highlights, the traditional approach to such a question is to start with definitions, defining both the terms "machine" and "intelligence". Turing chooses not to do so; instead he replaces the question with a new one, "which is closely related to it and is expressed in relatively unambiguous words." In essence he proposes to change the question from "Do machines think?" to "Can machines do what we (as thinking entities) can do?" The advantage of the new question, Turing argues, is that it draws "a fairly sharp line between the physical and intellectual capacities of a man."
To demonstrate this approach Turing proposes a test inspired by a party game, known as the "Imitation Game", in which a man and a woman go into separate rooms and guests try to tell them apart by writing a series of questions and reading the typewritten answers sent back. In this game both the man and the woman aim to convince the guests that they are the other. Turing proposes recreating the game as follows:
We now ask the question, "What will happen when a machine takes the part of A in this game?" Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, "Can machines think?"
Later in the paper Turing suggests an "equivalent" alternative formulation involving a judge conversing only with a computer and a man. While neither of these formulations precisely matches the version of the Turing Test that is more generally known today, he proposed a third in 1952. In this version, which Turing discussed in a BBC radio broadcast, a jury asks questions of a computer and the role of the computer is to make a significant proportion of the jury believe that it is really a man.
Turing's paper considered nine putative objections, which include all the major arguments against artificial intelligence that have been raised in the years since the paper was published. (See Computing Machinery and Intelligence.)
Turing predicted that machines would eventually be able to pass the test; in fact, he estimated that by the year 2000, machines with 10⁹ bits (about 119.2 MiB, or 125 megabytes) of memory would be able to fool thirty percent of human judges in a five-minute test. He also predicted that people would then no longer consider the phrase "thinking machine" contradictory. He further predicted that machine learning would be an important part of building powerful machines, a claim considered plausible by contemporary researchers in artificial intelligence.
In a paper submitted to the 19th Midwest Artificial Intelligence and Cognitive Science Conference, Dr. Shane T. Mueller predicted that a modified Turing Test called a "Cognitive Decathlon" could be accomplished within five years.
By extrapolating an exponential growth of technology over several decades, futurist Raymond Kurzweil predicted that Turing test-capable computers would be manufactured in the near future. In 1990, he set the year around 2020. By 2005, he had revised his estimate to 2029.
The Long Bet Project is a wager of $20,000 between Mitch Kapor (pessimist) and Kurzweil (optimist) about whether a computer will pass a Turing Test by the year 2029. The bet specifies the conditions in some detail.
Based on http://en.wikipedia.org/wiki/Turing_test licensed under the Creative Commons Attribution-Share-Alike License 3.0
Chatbot I - What exactly is a Chatbot?
A chatbot (or chatterbot, or chat bot) is a computer program designed to simulate an intelligent conversation with one or more human users via auditory or textual methods. Traditionally, the aim of such simulation has been to fool the user into thinking that the program's output has been produced by a human (the Turing test). Programs playing this role are sometimes referred to as Artificial Conversational Entities, talk bots or chatterboxes. More recently, however, chatbot-like methods have been used for practical purposes such as online help, personalised service, or information acquisition, in which case the program is functioning as a type of conversational agent. What distinguishes a chatbot from more sophisticated natural language processing systems is the simplicity of the algorithms used. Although many chatbots do appear to interpret human input intelligently when generating their responses, many simply scan for keywords within the input and pull a reply with the most matching keywords, or the most similar wording pattern, from a textual database.
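The keyword-scanning approach described above can be sketched in a few lines of Python; the reply database, keywords and scoring below are hypothetical and far cruder than anything used in a deployed system.

import re

# Minimal keyword-scanning chatbot: return the canned reply whose keywords
# best overlap the user's input. The database entries are invented examples.
REPLY_DATABASE = [
    ({"hours", "open", "closing"}, "We are open from 9am to 5pm, Monday to Friday."),
    ({"price", "cost", "much"},    "Prices start at 10 euros; see the catalogue for details."),
    ({"hello", "hi", "hey"},       "Hello! How can I help you today?"),
]
FALLBACK = "Sorry, I did not understand that. Could you rephrase?"

def respond(user_input):
    words = set(re.findall(r"[a-z']+", user_input.lower()))
    best_score, best_reply = 0, FALLBACK
    for keywords, reply in REPLY_DATABASE:
        score = len(keywords & words)     # how many of this entry's keywords appear in the input
        if score > best_score:
            best_score, best_reply = score, reply
    return best_reply

print(respond("Hi, how much does it cost?"))   # -> the pricing reply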
The term "ChatterBot" was originally coined by Michael Mauldin (Creator of the first Verbot, Julia) in 1994 to describe these conversational programs.
Based on http://en.wikipedia.org/wiki/Chatbots licensed under the Creative Commons Attribution-Share-Alike License 3.0
The term "ChatterBot" was originally coined by Michael Mauldin (Creator of the first Verbot, Julia) in 1994 to describe these conversational programs.
Based on http://en.wikipedia.org/wiki/Chatbots licensed under the Creative Commons Attribution-Share-Alike License 3.0
Natural language processing IV – Standardizing and Developing NLP.
Standardization in NLP
An ISO sub-committee is working to ease interoperability between lexical resources and NLP programs. The sub-committee is part of ISO/TC37 and is called ISO/TC37/SC4. Some ISO standards have already been published, but most are still under development, mainly concerning lexicon representation (see LMF), annotation and the data category registry.
Journals
• Computational Linguistics
• Language Resources and Evaluation
• Linguistic Issues in Language Technology
Organizations and conferences
Associations
• Association for Computational Linguistics (ACL)
• Association for Machine Translation in the Americas (AMTA)
• Asian Federation of Natural Language Processing Associations (AFNLP)
• Australasian Language Technology Association (ALTA)
• Spanish Society of Natural Language Processing (SEPLN)
• Mexican Association of Natural Language Processing (AMPLN)
Conferences
Major conferences include:
• Annual Meeting of the Association for Computational Linguistics (aka ACL conference)
• International Conference on Computational Linguistics (COLING)
• International Conference on Language Resources and Evaluation (LREC)
• Conference on Intelligent Text Processing and Computational Linguistics (CICLing)
• Empirical Methods in Natural Language Processing (EMNLP)
Software tools
• General Architecture for Text Engineering (GATE)
• Modular Audio Recognition Framework
• MontyLingua
• Natural Language Toolkit (NLTK): a Python library suite
Based on http://en.wikipedia.org/wiki/Natural_language_processing licensed under the Creative Commons Attribution-Share-Alike License 3.0
Natural language processing III – Evaluating NLP
Objectives
The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system, in order to determine whether (or to what extent) the system meets the goals of its designers or the needs of its users. Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify an NLP problem precisely, thus going beyond the vagueness of tasks defined only as language understanding or language generation. A precise set of evaluation criteria, which mainly includes evaluation data and evaluation metrics, enables several teams to compare their solutions to a given NLP problem.
Short history of evaluation in NLP
The first evaluation campaign on written texts seems to have been a campaign dedicated to message understanding in 1987 (Pallet 1998). The Parseval/GEIG project then compared phrase-structure grammars (Black 1991). A series of campaigns within the Tipster project was carried out on tasks like summarization, translation and searching (Hirschman 1998). In 1994, in Germany, the Morpholympics compared German taggers. The Senseval and Romanseval campaigns were then conducted with the objective of semantic disambiguation. In 1996, the Sparkle campaign compared syntactic parsers in four different languages (English, French, German and Italian). In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999). In 2004, during the Technolangue/Easy project, 13 parsers for French were compared. Large-scale evaluations of dependency parsers were performed in the context of the CoNLL shared tasks in 2006 and 2007. In Italy, the EVALITA campaign was conducted in 2007 to compare various tools for Italian. In France, within the ANR-Passage project (end of 2007), 10 parsers for French were compared (see the Passage web site).
Adda G., Mariani J., Paroubek P., Rajman M. 1999 L'action GRACE d'évaluation de l'assignation des parties du discours pour le français. Langues vol-2
Black E., Abney S., Flickinger D., Gdaniec C., Grishman R., Harrison P., Hindle D., Ingria R., Jelinek F., Klavans J., Liberman M., Marcus M., Reukos S., Santoni B., Strzalkowski T. 1991 A procedure for quantitatively comparing the syntactic coverage of English grammars. DARPA Speech and Natural Language Workshop
Hirschman L. 1998 Language understanding evaluation: lessons learned from MUC and ATIS. LREC Granada
Pallet D.S. 1998 The NIST role in automatic speech recognition benchmark tests. LREC Granada
Different types of evaluation
Depending on the evaluation procedures, a number of distinctions are traditionally made in NLP evaluation.
• Intrinsic vs. extrinsic evaluation
Intrinsic evaluation considers an isolated NLP system and characterizes its performance mainly with respect to a gold standard result, pre-defined by the evaluators. Extrinsic evaluation, also called evaluation in use, considers the NLP system in a more complex setting, either as an embedded system or serving a precise function for a human user. The extrinsic performance of the system is then characterized in terms of its utility with respect to the overall task of the complex system or the human user. For example, consider a syntactic parser that is based on the output of some new part-of-speech (POS) tagger. An intrinsic evaluation would run the POS tagger on some labelled data and compare its output to the gold standard (correct) output. An extrinsic evaluation would run the parser first with some other POS tagger and then with the new POS tagger, and compare the parsing accuracy.
• Black-box vs. glass-box evaluation
Black-box evaluation requires one to run an NLP system on a given data set and to measure a number of parameters related to the quality of the process (speed, reliability, resource consumption) and, most importantly, to the quality of the result (e.g. the accuracy of data annotation or the fidelity of a translation). Glass-box evaluation looks at the design of the system, the algorithms that are implemented, the linguistic resources it uses (e.g. vocabulary size), etc. Given the complexity of NLP problems, it is often difficult to predict performance only on the basis of glass-box evaluation, but this type of evaluation is more informative with respect to error analysis or future developments of a system.
• Automatic vs. manual evaluation
In many cases, automatic procedures can be defined to evaluate an NLP system by comparing its output with the gold standard (or desired) one. Although the cost of producing the gold standard can be quite high, automatic evaluation can be repeated as often as needed without much additional cost (on the same input data). However, for many NLP problems the definition of a gold standard is a complex task, and it can prove impossible when inter-annotator agreement is insufficient. Manual evaluation is performed by human judges, who are instructed to estimate the quality of a system, or most often of a sample of its output, based on a number of criteria. Although, thanks to their linguistic competence, human judges can be considered the reference for a number of language processing tasks, there is also considerable variation across their ratings. This is why automatic evaluation is sometimes referred to as objective evaluation, while the human kind appears more subjective.
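As a concrete illustration of automatic, intrinsic evaluation against a gold standard, the short Python sketch below computes tag-level accuracy for a hypothetical part-of-speech tagger; the tags and figures are invented for the example.

# Automatic intrinsic evaluation: compare system output with a gold standard
# and report accuracy. The gold and system tags below are invented examples.
gold_tags   = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "."]
system_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN", "."]

def tagging_accuracy(system, gold):
    assert len(system) == len(gold), "outputs must be aligned token for token"
    correct = sum(1 for s, g in zip(system, gold) if s == g)
    return correct / len(gold)

print(f"accuracy = {tagging_accuracy(system_tags, gold_tags):.2%}")
# -> accuracy = 85.71%  (six of the seven tags match the gold standard)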
Shared tasks (Campaigns)
• BioCreative
• Message Understanding Conference
• Technolangue/Easy
• Text Retrieval Conference
• Evaluation exercises on Semantic Evaluation (SemEval)
Based on http://en.wikipedia.org/wiki/Natural_language_processing licensed under the Creative Commons Attribution-Share-Alike License 3.0
Natural language processing II - Commonly researched tasks in NLP.
The following is a list of some of the most commonly researched tasks in NLP. Note that some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks. What distinguishes these tasks from other potential and actual NLP tasks is not only the volume of research devoted to them but the fact that for each one there is typically a well-defined problem setting, a standard metric for evaluating the task, standard corpora on which the task can be evaluated, and competitions devoted to the specific task.
• Automatic summarization: Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.
• Coreference resolution: Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names that they refer to. The more general task of co-reference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).
• Discourse analysis: This rubric includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).
• Machine translation: Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.
• Morphological segmentation: Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. In languages such as Turkish, however, such an approach is not possible, as each dictionary entry has thousands of possible word forms.
• Named entity recognition (NER): Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Note that, although capitalization can aid in recognizing named entities in languages such as English, this information cannot aid in determining the type of named entity, and in any case is often inaccurate or insufficient. For example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized. Furthermore, many other languages in non-Western scripts (e.g. Chinese or Arabic) do not have any capitalization at all, and even languages with capitalization may not consistently use it to distinguish names. For example, German capitalizes all nouns, regardless of whether they refer to names, and French and Spanish do not capitalize names that serve as adjectives.
• Natural language generation: Convert information from computer databases into readable human language.
• Natural language understanding: Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate.
• Optical character recognition (OCR): Given an image representing printed text, determine the corresponding text.
• Part-of-speech tagging: Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Note that some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is also prone to such ambiguity because it is a tonal language during verbalization, and this inflection-like information is not readily conveyed by the characters of its orthography.
• Parsing: Determine the parse tree (grammatical analysis) of a given sentence. The grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).
• Question answering: Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?").
• Relationship extraction: Given a chunk of text, identify the relationships among named entities (e.g. who is the wife of whom).
• Sentence breaking (also known as sentence boundary disambiguation): Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).
• Speech recognition: Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text to speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process.
• Speech segmentation: Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition and typically grouped with it.
• Topic segmentation and recognition: Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment.
• Word segmentation: Separate a chunk of continuous text into separate words. For a language like English, this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.
• Word sense disambiguation: Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
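As a toy illustration of the word sense disambiguation task above, the following sketch uses a simplified Lesk-style heuristic: it picks the sense whose gloss shares the most words with the surrounding context. The two senses of "bank" and their glosses are invented examples, not entries from WordNet or any real dictionary.

# Toy dictionary-overlap word sense disambiguation in the spirit of the
# simplified Lesk algorithm. Senses and glosses below are invented.

SENSES = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land alongside a river or stream",
}

def disambiguate(context, senses=SENSES):
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("the bank accepts deposits and pays interest"))  # bank/finance
print(disambiguate("we sat on the grassy bank of the river"))       # bank/river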
In some cases, sets of related tasks are grouped into subfields of NLP that are often considered separately from NLP as a whole. Examples include:
• Information retrieval (IR): This is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.
• Information extraction (IE): This is concerned in general with the extraction of semantic information from text. This covers tasks such as named entity recognition, co-reference resolution, relationship extraction, etc.
• Speech processing: This covers speech recognition, text-to-speech and related tasks.
Other tasks include:
• Stemming
• Text simplification
• Text-to-speech
• Text-proofing
• Natural language search
• Query expansion
• Truecasing
Based on http://en.wikipedia.org/wiki/Natural_language_processing licensed under the Creative Commons Attribution-Share-Alike License 3.0
Natural language processing I - Natural language processing and how it evolved over time.
Natural language processing (NLP) is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages. In theory, natural-language processing is a very attractive method of human-computer interaction. Natural language understanding is sometimes referred to as an AI-complete problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it.
NLP has significant overlap with the field of computational linguistics, and is often considered a sub-field of artificial intelligence.
Modern NLP algorithms are grounded in machine learning, especially statistical machine learning. Research into modern statistical NLP algorithms requires an understanding of a number of disparate fields, including linguistics, computer science, and statistics. For a discussion of the types of algorithms currently used in NLP, see the article on pattern recognition.
The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably — on the basis of the conversational content alone — between the program and a real human. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.
Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?"
During the 1970s, many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.
Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's Law and the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.
Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in the success of these systems. As a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data.
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
NLP through machine learning
As described above, modern approaches to natural language processing (NLP) are grounded in machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules. The machine-learning paradigm calls instead for using general learning algorithms — often, although not always, grounded in statistical inference — to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural, "corpora") is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.
As an example, consider the task of part of speech tagging, i.e. determining the correct part of speech of each word in a given sentence, typically one that has never been seen before. A typical machine-learning-based implementation of a part of speech tagger proceeds in two steps, a training step and an evaluation step. The first step — the training step — makes use of a corpus of training data, which consists of a large number of sentences, each of which has the correct part of speech attached to each word. (An example of such a corpus in common use is the Penn Treebank. This includes (among other things) a set of 500 texts from the Brown Corpus, containing examples of various genres of text, and 2500 articles from the Wall Street Journal.) This corpus is analyzed and a learning model is generated from it, consisting of automatically-created rules for determining the part of speech for a word in a sentence, typically based on the nature of the word in question, the nature of surrounding words, and the most likely part of speech for those surrounding words. The model that is generated is typically the best model that can be found that simultaneously meets two conflicting objectives: To perform as well as possible on the training data, and to be as simple as possible (so that the model avoids overfitting the training data, i.e. so that it generalizes as well as possible to new data rather than only succeeding on sentences that have already been seen). In the second step (the evaluation step), the model that has been learned is used to process new sentences. An important part of the development of any learning algorithm is testing the model that has been learned on new, previously unseen data. It is critical that the data used for testing is not the same as the data used for training; otherwise, the testing accuracy will be unrealistically high.
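The following is a minimal sketch of this two-step (train, then evaluate) workflow, assuming NLTK and its bundled sample of the Penn Treebank are available (after installing nltk and downloading the 'treebank' corpus); the 90/10 split and the unigram/bigram back-off taggers are illustrative choices rather than the specific method described above.

# Minimal train-then-evaluate sketch for part-of-speech tagging using NLTK's
# sample of the Penn Treebank. Assumes nltk is installed and the 'treebank'
# corpus has been downloaded; split and tagger choice are illustrative.
import nltk
from nltk.corpus import treebank

tagged_sents = treebank.tagged_sents()        # sentences with gold POS tags
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Training step: learn tag statistics from the annotated training sentences.
unigram = nltk.UnigramTagger(train_sents)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

# Evaluation step: tag held-out sentences and compare against the gold tags.
correct = total = 0
for sent in test_sents:
    words = [word for word, _ in sent]
    predicted = bigram.tag(words)
    for (_, gold_tag), (_, predicted_tag) in zip(sent, predicted):
        total += 1
        correct += int(gold_tag == predicted_tag)

print("held-out accuracy:", correct / total)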
Many different classes of machine learning algorithms have been applied to NLP tasks. Common to all of these algorithms is that they take as input a large set of "features" that are generated from the input data. As an example, for a part-of-speech tagger, typical features might be the identity of the word being processed, the identity of the words immediately to the left and right, the part-of-speech tag of the word to the left, and whether the word being considered or its immediate neighbors are content words or function words. The algorithms differ, however, in the nature of the rules generated. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system. In addition, models that make soft decisions are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data).
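The sketch below shows the kind of feature set just described for a single token; the feature names and the extra suffix and capitalization features are hypothetical additions, and the resulting dictionaries could be fed to any standard classifier.

# Sketch of feature extraction for a part-of-speech tagger: the word itself,
# its immediate neighbours, and the tag already assigned to the word on the
# left. Feature names are invented; real taggers typically use many more.

def extract_features(words, i, prev_tag):
    """Return a feature dictionary for the word at position i."""
    return {
        "word": words[i].lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "prev_tag": prev_tag,
        "is_capitalized": words[i][0].isupper(),
        "suffix_3": words[i][-3:].lower(),
    }

sentence = ["The", "cat", "sat", "on", "the", "mat"]
print(extract_features(sentence, 2, prev_tag="NOUN"))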
Systems based on machine-learning algorithms have many advantages over hand-produced rules:
• The learning procedures used during machine learning automatically focus on the most common cases, whereas when writing rules by hand it is often not obvious at all where the effort should be directed.
• Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to unfamiliar input (e.g. containing words or structures that have not been seen before) and to erroneous input (e.g. with misspelled words or words accidentally omitted). Generally, handling such input gracefully with hand-written rules — or more generally, creating systems of hand-written rules that make soft decisions — is extremely difficult and error-prone.
• Systems based on automatically learning the rules can be made more accurate simply by supplying more input data. However, systems based on hand-written rules can only be made more accurate by increasing the complexity of the rules, which is a much more difficult task. In particular, there is a limit to the complexity of systems based on hand-crafted rules, beyond which the systems become more and more unmanageable. However, creating more data to input to machine-learning systems simply requires a corresponding increase in the number of man-hours worked, generally without significant increases in the complexity of the annotation process.
Based on http://en.wikipedia.org/wiki/Natural_language_processing licensed under the Creative Commons Attribution-Share-Alike License 3.0
Automated online assistant II – Chatbots applied to automated customer service.
Customer service may be provided by a person (e.g., a sales and service representative) or by automated means, such as Internet sites. An advantage of automated means is an increased ability to provide service 24 hours a day, which can at least complement customer service provided by people.
However, in the Internet era, a challenge has been to maintain and/or enhance the personal experience while making use of the efficiencies of online commerce. Writing in Fast Company, entrepreneur and customer systems innovator Micah Solomon has made the point that "Online customers are literally invisible to you (and you to them), so it's easy to shortchange them emotionally. But this lack of visual and tactile presence makes it even more crucial to create a sense of personal, human-to-human connection in the online arena."
Automated means can be based entirely on self-service, but may also rely to a greater or lesser extent on artificial intelligence.
Examples of customer service by artificial means are automated online assistants that can be seen as avatars on websites. They can help enterprises reduce their operating and training costs. These assistants are driven by chatterbots, and a major underlying technology of such systems is natural language processing.
Online and telephone customer service
Artificial intelligence is implemented in automated online assistants that can be seen as avatars on web pages. They can help enterprises reduce their operating and training costs. A major underlying technology of such systems is natural language processing.
Similar techniques may be used in the answering machines of call centres, such as speech recognition software that allows computers to handle the first level of customer support, text mining and natural language processing for better customer handling, agent training by automatic mining of best practices from past interactions, support automation, and many other technologies to improve agent productivity and customer satisfaction.
Based on http://en.wikipedia.org/wiki/Applications_of_artificial_intelligence and http://en.wikipedia.org/wiki/Automated_customer_service#Automated_customer_service licensed under the Creative Commons Attribution-Share-Alike License 3.0
Automated online assistant I – Interactive Online Characters.
An automated online assistant is a program that uses artificial intelligence to provide customer service or other assistance on a website. Such an assistant basically consists of a dialog system, an avatar, and an expert system to provide specific expertise to the user.
Automated online assistants can provide customer service 24 hours a day, 7 days a week, and may at least complement customer service provided by humans.
Usage
Large companies such as Lloyds Banking Group, Royal Bank of Scotland, Renault and Citroën are now using automated online assistants instead of call centres to provide a first point of contact. Also, IKEA has an automated online assistant in their help center.
Automated online assistants can also be implemented via Twitter or Windows Live Messenger; one example is Robocoke for Coca Cola Hungary. This automated online assistant provides users with information about the Coca Cola brand, but it can also give users party and concert recommendations across Hungary.
Popular online portals like eBay and PayPal also use multilingual virtual agents to offer online support to their customers. For example, PayPal uses Louise to handle queries in English and Léa to handle queries in French. Developed by VirtuOz, the two agents handle 400,000 conversations a month and have been operating on PayPal websites since September 2008.
Components
1. Dialog system
The main function of the dialog system of an automated online assistant is to translate human-generated input into a format the assistant can use for further processing by its expert system, and to render whatever solutions or replies it generates back into language the human user understands, ideally in a way that is as natural and user-friendly as possible. A major underlying technology of such systems is natural language processing.
In addition, the dialog systems of many automated online assistants have integrated chatterbots, giving them some ability to engage in small talk or casual conversation unrelated to the scope of their expert systems, or simply to make the dialog feel more natural.
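A deliberately tiny sketch of this division of labour is shown below: the dialog system normalizes the user's text and routes it either to a stub "expert system" (here just a hard-coded FAQ lookup) or to a small-talk fallback standing in for an integrated chatterbot. All rules and answers are invented for illustration.

# Minimal dialog-system sketch: expert-system lookup first, then small talk.
# The FAQ topics and replies are invented examples.

FAQ = {
    "opening hours": "We are open 24 hours a day, 7 days a week.",
    "reset password": "Use the 'Forgot password' link on the login page.",
}

SMALL_TALK = {
    "hello": "Hello! How can I help you today?",
    "thanks": "You're welcome!",
}

def respond(user_input):
    text = user_input.lower().strip()
    # 1. Try the expert-system component first.
    for topic, answer in FAQ.items():
        if topic in text:
            return answer
    # 2. Fall back to chatterbot-style small talk.
    for cue, reply in SMALL_TALK.items():
        if cue in text:
            return reply
    # 3. Graceful default when nothing matches.
    return "Sorry, I didn't understand that. Could you rephrase?"

print(respond("Hello there"))
print(respond("What are your opening hours?"))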
2. Avatar
The avatar of an automated online assistant may be called an interactive online character or automated character. It makes the automated online assistant a form of embodied agent. It aims to enhance human-computer interaction by simulating real-world conversations and experience. Such an interaction model can be constructed to guide conversations in planned directions or allow characters to guide natural language exchanges.
Because such characters can express the social roles and emotions of real people, they can increase the trust that users place in online experiences. Greater interactivity increases the perceived realism and effectiveness of such "actors", which translates into more successful online services and commerce.
3. Other components
An automated online assistant also has an expert system that provides the specific service; its scope depends on the purpose of the assistant.
Servers and other systems that keep the automated assistant online may also be regarded as components of it.
Based on http://en.wikipedia.org/wiki/Interactive_online_characters licensed under the Creative Commons Attribution-Share-Alike License 3.0
Cleverbot, and how it differs from a chatterbot
Cleverbot is an AI web application that learns how to mimic human conversations by conversing with humans. It was created by AI veteran Rollo Carpenter, who also created a similar web application called Jabberwacky. Cleverbot differs from traditional chatterbots in that the user is not holding a conversation with a bot that directly responds to entered text. Instead, when the user enters text, the algorithm selects previously entered phrases from its database of 20 million conversations. It has been claimed that, "talking to Cleverbot is a little like talking with the collective community of the internet." Cleverbot was featured on The Gadget Show in March 2011.
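The retrieval-based behaviour described above can be sketched, very loosely, as picking the stored reply whose original prompt best matches the new input. The three-line "conversation log" and the word-overlap similarity below are invented stand-ins for Cleverbot's far larger database and matching machinery.

# Loose sketch of retrieval-based response selection: reply with whatever
# followed the most similar previously seen utterance. The "log" is invented.

PAST_EXCHANGES = [
    ("how are you today", "I'm fine, thanks. How are you?"),
    ("what is your favourite film", "I like science fiction movies."),
    ("do you like music", "Yes, I listen to music all the time."),
]

def word_overlap(a, b):
    """Jaccard similarity between the word sets of two strings."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def reply(user_input):
    # Normalize: lowercase and strip punctuation before matching.
    text = "".join(c for c in user_input.lower() if c.isalnum() or c.isspace())
    _, best_reply = max(PAST_EXCHANGES,
                        key=lambda pair: word_overlap(text, pair[0]))
    return best_reply

print(reply("How are you?"))   # -> I'm fine, thanks. How are you?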
Based on http://en.wikipedia.org/wiki/Cleverbot licensed under the Creative Commons Attribution-Share-Alike License 3.0
Hugh Loebner – The man behind the Loebner Prize
Hugh Loebner (born March 26, 1942) is notable as the sponsor of the Loebner Prize, an embodiment of the Turing test. He is an American inventor, holding six United States Patents. He is also an outspoken social activist for the decriminalization of prostitution.
Loebner prize
Loebner established the Loebner Prize in 1990. He pledged to give $100,000 and a solid gold medal to the first programmer able to write a program whose communicative behavior can fool humans into thinking that the program is human. The competition is repeated annually and has been hosted by various organizations. Within the field of artificial intelligence, the Loebner Prize is somewhat controversial; the most prominent critic, Marvin Minsky, has called it a publicity stunt that does not help the field along.
Loebner also likes to point out that, unlike the solid gold medal for the Loebner prize, the gold medals of the Olympic Games are not solid gold, but are made of silver covered with a thin layer of gold.
Fascinated by Alan Turing's imitation game, and considering creating a system himself to pass it, Loebner realised that even if he were to succeed in developing a computer that could pass the Turing test, no avenue existed by which to prove it.
In his letter of December 30, 1988 to Dr. Robert Epstein, Loebner authorized Dr. Epstein to move forward with a contest, and referring to the Turing Test, Loebner wrote: "Robert, in years to come, there may be richer prizes, and more prestigious contests, but gads, this will always be the oldest." Establishing the Loebner Prize, he introduced the Turing Test to a wider public, and stimulated interest in this science. It remains Hugh Loebner’s desire to advance AI, and for the Turing Test to serve as a tool to measure the state of the art: "There is a nobility in this endeavour. If we humans can succeed in developing an artificial intellect it will be a measure of the scope of our intellect" (from: In Response, 1994).
Prostitution
Loebner has been quite open about his visits to prostitutes. In 1994, after a campaign by officials in New York City to arrest customers of prostitutes, he wrote an opposing letter to The New York Times, and it was published. In 1996 he authored a Magna Carta for Sex Work, or Manifesto of Sexual Freedom, in which he denounced the criminalization of consensual sexual acts and asked all like-minded people to join a protest on 6/9/96 (a play on the 69 sex position). In interviews he has said that he believes he is too old for the young, attractive women he is interested in, and that they would not have sex with him were it not for the money. He has compared the oppression of prostitutes and their customers to the oppression that Alan Turing faced because of his homosexual behavior.
Personal life
Loebner holds a Ph.D. in demography from the University of Massachusetts, Amherst. He is divorced, lives in New York City and owns Crown Industries, a manufacturer of crowd-control stanchions and brass fittings, which is the major sponsor of the Loebner Prize in the US.
Based on http://en.wikipedia.org/wiki/Hugh_Loebner licensed under the Creative Commons Attribution-Share-Alike License 3.0
Dialog systems that enhance interactions between humans and computers
A dialog system or conversational agent (CA) is a computer system intended to converse with a human, with a coherent structure. Dialog systems have employed text, speech, graphics, haptics, gestures and other modes for communication on both the input and output channel.
What does and does not constitute a dialog system may be debatable. The typical GUI wizard does engage in some sort of dialog, but it includes very few of the common dialog system components, and dialog state is trivial.
Components of Dialog systems
There are many different architectures for dialog systems. Which components are included in a dialog system, and how those components divide up responsibilities, differs from system to system. Central to any dialog system is the dialog manager, a component that manages the state of the dialog and the dialog strategy. A typical activity cycle in a dialog system contains the following phases:
1. The user speaks, and the input is converted to plain text by the system's input recognizer/decoder, which may include:
o automatic speech recognizer (ASR)
o gesture recognizer
o handwriting recognizer
2. The text is analyzed by a Natural language understanding unit (NLU), which may include:
o proper name identification
o part-of-speech tagging
o syntactic/semantic parser
3. The semantic information is analyzed by the dialog manager (see section below), along with a task manager that has knowledge of the specific task domain.
4. The dialog manager produces output using an output generator, which may include:
o natural language generator
o gesture generator
o layout engine
5. Finally, the output is rendered using an output renderer, which may include:
o text-to-speech engine (TTS)
o talking head
o robot or avatar
Dialog systems that are based on a text-only interface (e.g. text-based chat) contain only stages 2-4.
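As a concrete illustration, a text-only system covering stages 2-4 can be wired together roughly as follows. This is a minimal sketch under assumed component names and a toy weather domain; a real NLU unit, task manager and output generator would each be far more elaborate.

import re

# Hypothetical sketch of stages 2-4 for a text-only dialog system:
# NLU -> dialog manager (consulting a task manager) -> output generation.
# Component names and the toy weather domain are illustrative assumptions.

KNOWN_CITIES = {"london", "paris", "reading"}

def nlu(utterance: str) -> dict:
    """Stage 2: crude 'understanding' -- extract an intent and a city slot."""
    words = re.findall(r"[a-z]+", utterance.lower())
    intent = "get_weather" if "weather" in words else "unknown"
    city = next((w for w in words if w in KNOWN_CITIES), None)
    return {"intent": intent, "city": city}

def task_manager(city: str) -> str:
    """Task-domain knowledge: a stand-in for a real weather service."""
    return f"cloudy with light rain in {city.title()}"

def dialog_manager(semantics: dict) -> dict:
    """Stage 3: decide the next dialog act based on the parsed input."""
    if semantics["intent"] != "get_weather":
        return {"act": "clarify"}
    if semantics["city"] is None:
        return {"act": "ask_city"}
    return {"act": "inform", "report": task_manager(semantics["city"])}

def generate(decision: dict) -> str:
    """Stage 4: turn the dialog manager's decision into natural language."""
    if decision["act"] == "inform":
        return f"It is {decision['report']}."
    templates = {
        "clarify": "Sorry, I can only talk about the weather.",
        "ask_city": "Which city are you interested in?",
    }
    return templates[decision["act"]]

print(generate(dialog_manager(nlu("What is the weather in Reading?"))))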
Dialog manager
The dialog manager is the core component of the dialog system. It maintains the history of the dialog, adopts a particular dialog strategy (see below), retrieves content (stored in files or databases), and decides on the best response to the user. The dialog manager maintains the dialog flow.
The design of the dialog manager has evolved over time; common approaches include:
• finite-state machine
• frame-based: The system has several slots to be filled. The slots can be filled in any order. This supports a mixed-initiative dialog strategy (see the sketch at the end of this section).
• information-state based
The dialog flow can have the following strategies:
• System-initiative dialog: The system is in control to guide the dialog at each step.
• Mixed-initiative dialog: Users can barge in and change the dialog direction. The system follows the user's request but tries to direct the user back to the original course. This is the most commonly used dialog strategy in today's dialog systems.
• User-initiative dialog: The user takes the lead, and the system responds to whatever the user directs.
The dialog manager can be connected with an expert system to give the ability to respond with specific expertise.
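The frame-based approach mentioned above lends itself to a compact sketch: the dialog manager keeps a frame of slots that the user may fill in any order, and the system prompts for whatever is still missing. The slot names, the booking domain and the "slot=value" input convention below are hypothetical simplifications.

# Hypothetical sketch of a frame-based dialog manager: a frame of slots that
# the user may fill in any order, with the system prompting for whatever is
# still missing. Slot names, the booking domain and the "slot=value" input
# convention are illustrative assumptions.

REQUIRED_SLOTS = ("origin", "destination", "date")

def update_frame(frame: dict, user_input: str) -> None:
    """Fill any slot the user mentions, regardless of order (naive parsing)."""
    for token in user_input.split():
        for slot in REQUIRED_SLOTS:
            prefix = slot + "="
            if token.startswith(prefix):
                frame[slot] = token[len(prefix):]

def next_prompt(frame: dict) -> str:
    """Ask for the first unfilled slot, or confirm once the frame is complete."""
    for slot in REQUIRED_SLOTS:
        if slot not in frame:
            return f"Please tell me the {slot} of your trip."
    return (f"Booking a trip from {frame['origin']} to "
            f"{frame['destination']} on {frame['date']}.")

frame = {}
update_frame(frame, "destination=Paris date=2011-05-01")   # slots given out of order
print(next_prompt(frame))                                  # asks for the origin
update_frame(frame, "origin=London")
print(next_prompt(frame))                                  # confirms the completed frame

Because the user can volunteer any slot at any turn while the system keeps steering towards the unfilled ones, even this simple loop approximates the mixed-initiative behaviour described above.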
Types of systems
Dialog systems fall into the following categories, which are listed here along a few dimensions. Many of the categories overlap and the distinctions may not be well established.
• by modality
o text-based
o spoken dialog system
o graphical user interface
o multi-modal
• by device
o telephone-based systems
o PDA systems
o in-car systems
o robot systems
o desktop/laptop systems
   - native
   - in-browser systems
   - in-virtual machine
o in-virtual environment
o robots
• by style
o command-based
o menu-driven
o natural language
o speech graffiti
• by initiative
o system initiative
o user initiative
o mixed initiative
Applications
Dialog systems can support a broad range of applications in business enterprises, education, government, healthcare, and entertainment. For example:
• Responding to customers' questions about products and services via a company’s website or intranet portal
• Customer service agent knowledge base: Allows agents to type in a customer’s question and guide them with a response
• Guided selling: Facilitating transactions by providing answers and guidance in the sales process, particularly for complex products being sold to novice customers
• Help desk: Responding to internal employee questions, e.g., responding to HR questions
• Website navigation: Guiding customers to relevant portions of complex websites (acting as a website concierge)
• Technical support: Responding to technical problems, such as diagnosing a problem with a product or device
• Personalized service: Conversational agents can leverage internal and external databases to personalize interactions, such as answering questions about account balances, providing portfolio information, or delivering frequent flier or membership information
• Training or education: They can provide problem-solving advice while the user learns
• Simple dialog systems are widely used to decrease human workload in call centres. In this and other industrial telephony applications, the functionality provided by dialog systems is known as interactive voice response or IVR.
In some cases, conversational agents can interact with users using artificial characters. These agents are then referred to as embodied agents.
Toolkits and architectures
• VoiceXML (VXML): a dialog markup language (primarily for telephony), initially developed by AT&T, then administered by an industry consortium, and finally published as a W3C specification. Commercial systems include:
o Quack.com QXML Development Environment [company bought by AOL]
• AIML NLP system
• ChatScript NLP system, by Bruce Wilcox
• SALT: multimodal dialog markup language developed by Microsoft
• CSLU Toolkit: a state-based speech interface prototyping environment
• VoiceBrowse: architecture enabling the dynamic production of dialogue driven by unstructured online (Internet) sources
Based on http://en.wikipedia.org/wiki/Conversational_agent licensed under the Creative Commons Attribution-Share-Alike License 3.0