Interronet Chatbot Benchmarking Method

Standard Edition, Version 1.0

Introduction

The ICBM benchmark evaluates chatbots and assigns each one a mark according to a given set of criteria. It can be adapted to a specific context by replacing the default coefficients with more suitable values. It can also be extended with other criteria.

Instructions for use

The tester should ask the tested chatbot the questions listed in the table below (in green) for each criterion and assign each answer a mark between 0 and 10 by comparing it with the proposed answer (in brown). If the chatbot's answer is "off the subject" or "wrong", the mark should be 0. If it is equivalent to the proposed answer (though not necessarily identical), the mark should be 10. Intermediate marks are possible but should be avoided.
The mark for a criterion is the average of the marks obtained for the questions of that criterion.
For some criteria, it has not been possible to define questions. In this case, the tester must assign a mark by following the instructions in green.
The global mark is the average of the criterion marks, weighted by the associated coefficients.
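
As an illustration only, the scoring rule above can be sketched in Python. The names CriterionResult and global_mark and the sample marks are assumptions made for this example and are not part of ICBM. Only the coefficients for T00, T01 and T02 are taken from the grid.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CriterionResult:
    """Marks collected for one criterion (hypothetical structure)."""
    criterion_id: str     # e.g. "T00"
    coefficient: float    # value of the "Coef" column
    marks: list           # one mark (0 to 10) per question asked

    def mark(self) -> float:
        # Criterion mark: plain average of the marks of its questions.
        return mean(self.marks)


def global_mark(results) -> float:
    """Average of the criterion marks, weighted by their coefficients."""
    total_weight = sum(r.coefficient for r in results)
    if total_weight == 0:
        raise ValueError("at least one coefficient must be non-zero")
    return sum(r.coefficient * r.mark() for r in results) / total_weight


# Sample marks are invented for the example; coefficients come from the grid.
results = [
    CriterionResult("T00", 10, [10, 10, 10]),
    CriterionResult("T01", 10, [10, 0]),
    CriterionResult("T02", 8, [0, 0]),
]

for r in results:
    print(r.criterion_id, r.mark())
print("Global mark:", round(global_mark(results), 2))
```

Replacing a default coefficient, as the introduction allows, only changes the weights in this computation. The marks themselves are still assigned by the tester.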

Evaluation Grid

Learning and restitution of information
Id Title Description and Test instructions Coef Mark
T00 Simple restitution The chatbot must be able to validate or to invalidate an interrogative proposition for which the affirmative form has already been learnt.
"the cows eat grass", "do the cows eat grass ?"=> "Yes"
"I like you", "I like you ?"=> "Yes"
"A knife is used to cut", "Is a knife used to cut ?"=> "Yes"
10
T01 Restitution with Joker ("Be" verb) The chatbot must be able to answer a question with a joker (or an unknown) if an equivalent affirmative proposition, using the verb "be", has already been learnt.
"my father is Guyto", "who is my father ?"=> "Guyto"
"a whale is a big cetacean", "what is a whale ?"=> "a big cetacean"
10
T02 Restitution with Joker (general case) The chatbot must be able to answer a question with a joker (or an unknown) if an equivalent affirmative proposition has already been learnt.
"the cows eat grass", "what does a cow eat ?" => "grass"
"I sing every morning in the shower.", "where do I sing every morning ?" => "In the shower."
8
T03 Correct processing of negatives
"I like chocolate", "I don't like chocolate ?" => "It is wrong."
"SinCity is not the capital of WonderLand", "What is the capital of WonderLand" => "I don't know."
7
T04 Discrepancy detection The chatbot must be able to detect a discrepancy when it receives a fact that is incompatible with what it has learnt previously.
"a cow eats grass.", "a cow does not eat grass." => "It is wrong"
"a bird is not a mammal", "a bird is a mammal." => "It is wrong"
8
T05 Capacity to guess from the keywords (if the structure of a question is incorrect) When the chatbot cannot understand a question (possibly because the question contains mistakes), it must try to guess the answer by analysing the keywords of the question and of the known facts.
"a car has 4 wheels", "how many wheels car" => "4"
"the capital of Japan is Tokyo", "what has the capital of Japan ?" => "I'm not sure but I would say Tokyo."
3
T06 Capacity to synthesize information The system must be able to extract pieces of information from different facts about a subject and to gather them in the answer.
"A sailboat is a boat.", "A sailboat is propelled by the wind.", "What is a sailboat ?" => "It is a boat. It is propelled by the wind."
"Milk contains calcium", "The color of milk is white.", "One adds bacteria to milk to make yogurt", "babies drink milk", "what is milk" => "It is a drink. One adds bacteria to it to make yogurt. Babies drink it. It contains calcium. Its color is white."
3



Reasoning
Id Title Description and Test instructions Coef Mark
T10 Inheritance use
"a pizzeria is a restaurant", "a restaurant sells food", "what does a pizzeria sell ?" => "It sells food."
"Flipper is a tursiops", "a tursiops is a dolphin", "a dolphin can jump", "Can Flipper jump ?" => "Yes"
7
T11 Use of "monofact" deduction rules The chatbot must be able to deduce a fact from another fact and to use this deduced fact to answer a question.
"Kim is my wife", "who is the husband of Kim ?" => "You"
"Camelot is the capital of Albion", "Where is Camelot ?" => "In Albion."
7
T12 Use of "multifact" deduction rules The chatbot must be able to deduce a fact by combining several known facts and to use the deduced fact to answer a question.
"Joe is the father of Sam", "Sam is the brother of Bill", "who is the father of Bill" => "Joe"
"turquoise is a color", "My car is turquoise", "what is the color of my car ?" => "turquoise"
"Sam likes chocolate", "chocolate is a food", "which food does Sam like ?" => "chocolate"
6
T13 Explanation of the reasoning The chatbot must be able to explain how it found the answer to a question.
"Kim is my daughter", "who is the father of Kim ?", "How do you know ?" => "I know that "Kim is your daughter". So "you are the father of Kim"."
"Paris is the capital of France", "Rome is the capital of France ?", "how do you know ?" => "I know that "Paris is the capital of France"."
7
T14 Capacity to count The chatbot must be able to count the objects that have a given property.
"Australia is a country", "How many countries are in Oceania ?" => At least 1 (Australia).
"a week has 7 days.", "there are how many days in a week ?" => 7
4



Syntax, spelling, grammar
Id Title Description and Test instructions Coef Mark
T20 Tolerance to misspelling In case of minor misspellings, the system should give the same answer as if there were no mistake. Double or trailing blanks must be processed correctly.
" What is the capital of Tunysia ? " => "Tunis"
"a cow eat grass.", "a cow do not eat grass" => "It is wrong."
3
T21 Detection of the form of the input. Without using the punctuation, the system must be able to determine the type of the input: question, assertion or order.
"Paris is what ." => "A town." (or any other answer showing that the system has understood that it is a question)
"Paris is the capital of France" => "Ok" (or any other answer showing that the system has understood that it is a piece of information)
"Tell me where is Paris" => an answer showing that the system has understood that it has received an order
5
T22 Support of incomplete sentences The system should be able, in some cases, to complete the input to form a correct question.
"I am 40. And you ?" => The answer must be the same as for "How old are you ?"
"Capital of France ?" => The answer must be the same as for "What is the capital of France ?"
3
T23 Expansion of pronouns The pronouns and possessive adjectives must be replaced by their "expanded" form.
"The profession of Victor Hugo is writer", "I am Victor Hugo", "What is my profession" => "writer"
"Juliet loves Romeo", "she loves him ?" => "Yes"
7
T24 Support of synonyms If two words have been defined as synonyms in the dictionary, the use of one of them instead of the other should not have an impact on the answer.
"The profession of Victor Hugo is writer", "What is the job of Victor Hugo ?" => "writer"
"The president of the United States is George Walker Bush", "Who is the president of the USA ?" => "George Walker Bush"
5
T25 Recognition of contracted forms The usual contracted forms of the supported languages must be processed correctly.
Example: "He's Bernard's brother."
"Patrick is French. He's Bernard's brother. Who is the brother of Bernard ?" => "Patrick"
"he doesn't eat meat", "does he eat meat ?" => "No"
4



Multilingualism
Id Title Description and Test instructions Coef Mark
T30 Several languages understood The system must be able to understand at least two languages.
"Je suis Victor Hugo" => "ok"
"¿ Qué comen las vacas ?" => An answer showing that the question has been understood
"los perros comen carne" => "ok"
"gli uccelli stanno cantando" => "ok"
"Quel age as tu ?" => An answer showing that the question has been understood
"die Vögel singen" => "ok"
7
T31 Adaptation to the language of the user The answer must be given in the same language as the one used by the user.
"Où est Madrid ?", "¿ Dónde está Madrid ?", "Where is Madrid ?" => "En Espagne.", "En España.", "In Spain."
"Quelle heure est-il ?", "¿ Qué hora es ?", "What time is it ?" => an answer in the language of the question
5
T32 Reusability of a learnt fact in another language
Example: "Kim is my daughter", "Qui est ma fille ?" => "Kim"
"the cows eat fishes", "Que mangent les vaches ?" => "des poissons"
"Tatiana es mi hija.", "Who is my Daughter ?" => "Tatiana"
7
T33 Possibility to add new languages It must be possible to add support for a new language just by adding the necessary dictionaries (at least for a basic support).
The tester must look in the documentation of the chatbot to see how to add support for a new language. If it is possible, the tester must test it, using some of the sentences listed above (translated into the tested language, of course).
3
T34 Capacity of translation The system must be able to translate a text from one supported language to another (without using an internet request).
"traduction félicitations" => "congratulations"
"translate in english los coches de la ciudad" => "the cars of the city"
"traducir en francés llorar cuando te vayas" => "pour pleurer quand vous partez"
2



Support of commands
Id Title Description and Test instructions Coef Mark
T40 Association of hard-coded processing with some verbs The system must be able to call a hard-coded procedure from a natural-language order.
Availability of some utilities: memo, calculate, play...
The tester must look in the documentation of the chatbot to see which commands are supported and which utilities are accessible through requests to the chatbot. Examples: "Show me picture_name", "calculate expression", "memo", "play music file"
7
T41 Capacity to forget The system must be able to forget or delete some facts previously learnt (subject to access rights, of course).
"my brother is Bernard", "forget about Bernard", "who is my brother" => "I don't know"
"forget about France" => "Sorry. You do not have the necessary access rights."
5
T42 Capacity to give more details The system must be able to give more or less information on a subject (depending on its configuration), and the user should be able to force it to give all the available information.
"Victor Hugo is a writer", "Victor Hugo is the father of Leopoldine", "Who is Victor Hugo ?", "Be more precise" => "a writer", "the father of Leopoldine"
"a A380 is a plane", "what is a A380", "Be more precise" => "Sorry. I have told you all I know."
3
T43 Capacity to learn new processes (interactive programmability) The system must invite the user to describe an action when that action is not programmed.
The tester must check that there is a mechanism for teaching the chatbot new procedures interactively.
"Call William" => "I don't know how to call. Could you tell me how to do this ?"
"Create a new document" => "I don't know how to create a new document. Could you tell me how to do this ?"
5



Basic knowledge
Id Title Description and Test instructions Coef Mark
T50 Geography The system must be able to answer some simple questions about this domain.
"What is the capital of France ?" => "Paris"
"What is Texas?" => "It is a state of the United States."
"What is the capital of Fiji islands ?" => "Suva"
"Who is the president of Russia ?" => "Vladimir Putin"
"Where is Montevideo ?" => In Uruguay
1
T51 People The system must be able to answer some simple questions about famous people.
Example: "Who is people_name?"
"Who is Marilyn ?" => "An actress"
"Who is Abdelaziz Bouteflika?" => "The president of Algeria."
"Who was Julius Caesar ?" => "A Roman general and statesman."
"What was the profession of Elvis Presley ?" => "singer"
"What is the birth date of Winston Churchill ?" => "1874"
1
T52 Classes The system must be able to give a description of some common types of objects (and also to tell the relations between these types).
Example: "What is a car ?", "Is a plane an animal ?"
"What is a car ?" => "A vehicle"
"Is a plane an animal ?" => No
"why ?" => explanation => "a vehicle is not an animal" ...
"What is a bird ?" => description of a bird
"What is a plane used for ?" => "to fly in the air"
"What is the colour of a strawberry ?" => red
"what is the earth ?" => "a planet"
"a wolf is a canine." => "Yes, I know."
"Red is a color ?" => "Yes."
"How many days does a week have ?" => "7"
4
T53 Knowledge of itself The system must be able to answer some simple questions about itself.
"Who are you ?" => Name of the chatbot
"How old are you ?" => Information on the age
"What is your birth date ?" => The "activation date" of the chatbot
"What can you do ?" => The abilities of the chatbot
"Who is your creator ?" => The name of the creator of the chatbot
"Do you speak spanish ?" => Yes or no
3



Human like features / conviviality
Id Title Description and Test instructions Coef Mark
T60 Capacity to simulate feelings The system must be able to simulate feelings or moods. This feature could be customisable or random.
"How do you feel today ?" => Human answer
"You are very smart" => Human answer
"do you like music ?" => Human answer
"Leave me alone" => Human answer
2
T61 Capacity to avoid repeating The system will avoid repeating itself even if the user repeats his question.
"You are not human.", "You are not human" => no repetition
"What is the capital of Italy ?", "What is the capital of Italy ?" => no repetition
3



Opening / Other features
Id Title Description and Test instructions Coef Mark
T70 Capacity to animate the conversation The system may decide to ask the user questions or to tell the user something that was not asked.
"Ok, that's all" => question from the chatbot to the user
If the user is silent for a long time, the chatbot must ask him a question.
2
T71 Multi user The application must be able to manage several interlocutors at one time (without confusion).
Machine 1: "I am user1", "My father is father1" => "ok"
Machine 2: "I am user2", "My father is father2" => "ok"
Machine 1: "Who is my father" => "father1"
Machine 2: "Who is my father" => "father2"
Machine 1: "Who is the father of user2 ?" => "father2"
3
T72 Customisability of the answers The user must be able to configure the way the system answers requests. For instance, the volubility, the "exhaustiveness", the strictness, the interactivity, ... should be customisable.
The tester must search in the documentation of the chatbot to check if the answers can be customised and, if they can, he must test and evaluate it.
2
T73 Possibility to have customisable reflex answers The user (or the administrator of the system) should be able to define "reflex answers" for some types of questions. These answers are sent back to the user by the system without "thinking".
Example: "Merci" => "De rien"
The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
3
T74 Connectable to classical databases The application must be able to use the information stored in a classical database to answer questions (ideally via ODBC or some other standard protocol).
The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
3
T75 Use of Internet to search information If the system does not have the answer to a question, it should offer to search the Internet for it.
"who is Josephine Baker ?" => "search on internet ?", "A singer"
"how many planets in the solar system ?" => "search on internet ?", "9"
"weather paris" => The weather forecast for Paris
3
T76 Usable through Web, Email, WAP The system should be able to receive emails and HTTP or WAP requests, to process them and to send the answer to the user.
The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
2
T77 Usable from other applications The application must provide a mechanism to communicate with other applications (API, COM, DDE, ...).
The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
1
T78 Usable by voice / Voice generation The user must be able to use voice to communicate with the system. Typically, the application must be connectable to a "voice recognition/text to speech" software.
The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
4
T79 Multiplatform The application must be available on different platforms (Windows, Mac, Unix, Linux, ...)
The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
1
T7A "enrichable" dictionaries, grammars, facts, deduction rules, reflexes. The user must be able to enrich the dictionaries used by the application (at least with a text editor). The same applies to the grammar files, the file containing the facts (assertions), the file of the deduction rules and the file containing the reflex answers.
The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
3
T7B Robustness, performance, convenience The application must have the usual qualities required of all software.
The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
5



Average mark: __  



For remarks or questions about this document, please email Patrick Télégone.

You will find more information at http://www.interronet.com.