The ICBM benchmark evaluates chatterbots and assigns them a mark
according to a given set of criteria. It can be adapted to a specific
context by replacing the default coefficients with more suitable values. It
can also be extended with other criteria.
The tester should ask the tested chatbot the questions listed in the table below
(in green) for each criterion and assign a mark between 0 and 10 to each
answer by comparing it with the proposed answer (in brown). If the chatbot's
answer is off the subject or wrong, the mark should be 0.
If it is equivalent to the proposed brown answer (but not necessarily
identical), the mark should be 10. Intermediate marks are possible but
should be avoided.
The mark for a criterion is the average of the marks obtained for the
questions of that criterion.
For some criteria, it has not been possible to define questions. In this
case, the tester must assign a mark by following the instructions in green.
The global mark is the average of the criterion marks, weighted
by the associated coefficients.
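The marking rules above can be sketched as a short calculation. This is only an illustration: the function names and the sample marks are hypothetical and not part of the benchmark; the coefficients used are the defaults listed below for T00, T01 and T02.

```python
from statistics import mean

def criterion_mark(question_marks):
    """Average of the marks (0 to 10) given to the questions of one criterion."""
    return mean(question_marks)

def global_mark(criteria):
    """Weighted average of the criterion marks.

    `criteria` maps a criterion id to a (coefficient, mark) pair.
    """
    total_coef = sum(coef for coef, _ in criteria.values())
    return sum(coef * mark for coef, mark in criteria.values()) / total_coef

# Hypothetical marks for three criteria (not real results).
results = {
    "T00": (10, criterion_mark([10, 10, 0])),
    "T01": (10, criterion_mark([10, 10])),
    "T02": (8, criterion_mark([0, 10])),
}
print(round(global_mark(results), 2))
```

Adapting the benchmark to a specific context then amounts to changing the coefficients in `results`, and adding a criterion amounts to adding an entry.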
Learning and restitution of information
| Id |
Title |
Description and Test instructions |
Coef |
Mark |
| T00 |
Simple restitution |
The chatbot must be able to validate or to invalidate an interrogative proposition for which the affirmative form has already been learnt.
| "the cows eat grass", "do the cows eat grass ?" | => "Yes" |
| "I like you", "I like you ?" | => "Yes" |
| "A knife is used to cut", "Is a knife used to cut ?" | => "Yes" |
| 10
|
| T01 |
Restitution with Joker ("Be" verb) |
The chatbot must be able to answer a question with a joker (or an unknown) if an equivalent affirmative proposition, using the verb "be", has already been learnt.
| "my father is Guyto", "who is my father ?" | => "Guyto" |
| "a whale is a big cetacean", "what is a whale ?" | => "a big cetacean" |
| 10
|
| T02 |
Restitution with Joker (general case) |
The chatbot must be able to answer a question with a joker (or an unknown) if an equivalent affirmative proposition has already been learnt.
| "the cows eat grass", "what does a cow eat ?" | => "grass" |
| "I sing every morning in the shower.", "where do I sing every morning ?" | => "In the shower." |
| 8
|
| T03 |
Correct processing of negatives |
| "I like chocolate", "I don't like chocolate ?" | => "It is wrong." |
| "SinCity is not the capital of WonderLand", "What is the capital of WonderLand" | => "I don't know." |
| 7
|
| T04 |
Discrepancy detection |
The chatbot must be able to detect a discrepancy when it "receives" a fact that is incompatible with what it has learnt previously.
| "a cow eat grass.", "a cow does not eat grass." | => "It is wrong" |
| "a bird is not a mammal", "a bird is a mammal." | => "It is wrong" |
| 8
|
| T05 |
Capacity to guess by the keywords (if the structure of a question is incorrect) |
When the chatbot cannot understand a question (possibly because of mistakes in this question), it must try to guess the answer by analysing the keywords of the question and of the known facts.
| "a car has 4 wheels", "how many wheels car" | => "4" |
| "the capital of Japan is Tokyo", "what has the capital of Japan ?" | => "I'm not sure but I would say Tokyo." |
| 3
|
| T06 |
Capacity to synthesize information |
The system must be able to extract pieces of information from different
facts relative to a subject and to gather them in the answer.
| "A sailboat is a boat.", "A sailboat is propelled by the wind.", "What is a sailboat ?" | => "It is a boat. It is propelled by the wind." |
| "Milk contains calcium", "The color of milk is white.", "One add bacteria to milk to make yoghurt", "babies drink milk", "what is milk" | => "It is a drink. One adds bacteria to it to make yogurt. Babies drink it. It contains calcium. Its color is white." |
| 3
|
Reasoning
| Id |
Title |
Description and Test instructions |
Coef |
Mark |
| T10 |
Inheritance use |
| "a pizzeria is a restaurant", "a restaurant sells food", "what does a pizzeria sells ?" | => "It sells food." |
| "Flipper is a tursiops", "a tursiops is a dolphin", "a dolphin can jump", "Can Flipper jump ?" | => Yes |
| 7
|
| T11 |
Use of "monofact" deduction rules |
The chatbot must be able to deduce a fact from another fact and to use this deduced fact to answer a question.
| "Kim is my wife", "who is the husband of Kim ?" | => "You" |
| "Camelot is the capital of Albion", "Where is Camelot ?" | => "In Albion." |
| 7
|
| T12
| Use of "multifact" deduction rules
| The chatbot must be able to deduce a fact by combining several known facts and to use the deduced fact to answer a question.
| "Joe is the father of Sam", "Sam is the brother of Bill", "who is the father of Bill" | => "Joe" |
| "turquoise is a color", "My car is turquoise", "what is the color of my car ?" | => "turquoise" |
| "Sam likes chocolate", "chocolate is a food", "which food does Sam like ?" | => "chocolate" |
| 6
|
| T13
| Explanation of the reasoning
| The chatbot must be able to explain how it found the answer to a question.
| "Kim is my daughter", "who is the father of Kim ?", "How do you know ?" | => "I know that "Kim is your daughter". So "you are the father of Kim"." |
| "Paris is the capital of France", "Rome is the capital of France ?", "how do you know ?" | => "I know that "Paris is the capital of France"." |
| 7
|
| T14
| Capacity to count
| The chatbot must be able to count the objects that have a given property.
| "Australia is a country", "How many countries are in Oceania ?" | => At least 1 (Australia). |
| "a week has 7 days.", "there are how many days in a week ?" | => 7 |
| 4
|
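As an illustration of the inheritance criterion (T10) in the table above, here is a minimal sketch of how a property can be looked up along a chain of "is a" facts. The data structures and the function name are hypothetical, and the sketch assumes each subject has at most one parent; it does not describe how any particular chatbot works.

```python
# Hypothetical fact store built from the T10 test sentences.
is_a = {"pizzeria": "restaurant", "Flipper": "tursiops", "tursiops": "dolphin"}
properties = {
    "restaurant": {("sells", "food")},
    "dolphin": {("can", "jump")},
}

def inherited_properties(subject):
    """Collect the properties of `subject` and of every ancestor reachable via is_a."""
    found = set()
    seen = set()
    current = subject
    while current is not None and current not in seen:
        seen.add(current)  # guard against cyclic "is a" facts
        found |= properties.get(current, set())
        current = is_a.get(current)
    return found

print(("sells", "food") in inherited_properties("pizzeria"))
print(("can", "jump") in inherited_properties("Flipper"))
```

Both lookups succeed because the property is found on an ancestor rather than on the subject itself.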
Syntax, spelling, grammar
| Id |
Title |
Description and Test instructions |
Coef |
Mark |
| T20 |
Tolerance to misspelling |
For small misspellings, the system should give the same answer as if there were no mistake. Double or trailing blanks must be processed correctly.
| " What is the capital of Tunysia ? " | => "Tunis" |
| "a cow eat grass.", "a cow do not eat grass" | => "It is wrong." |
| 3 |
| T21 |
Detection of the form of the input. |
Without using the punctuation, the system must be able to determine the type of the input: question, assertion or order.
| "Paris is what ." | => "A town." (or any other answer showing that the system has understood that it is a question) |
| "Paris is the capital of France" | => "Ok" (or any other answer showing that the system has understood that it is a piece of information) |
| "Tell me where is Paris" | => an answer showing that the system has understood that it received an order |
| 5
|
| T22 |
Support of incomplete sentences |
The system should be able, in some cases, to complete the input to form a correct question.
| "I am 40. And you ?" | => The answer must be the same as for "How old are you ?" |
| "Capital of France ?" | => The answer must be the same as for "What is the capital of France ?" |
| 3
|
| T23 |
Expansion of pronouns |
The pronouns and possessive adjectives must be replaced by their "expanded" form.
| "The profession of Victor Hugo is writer", "I am Victor Hugo", "What is my profession" | => "writer" |
| "Juliet loves Romeo", "she loves him ?" | => "Yes" |
| 7
|
| T24 |
Support of synonyms |
If two words have been defined as synonyms in the dictionary, the use of one of them instead of the other should not have an impact on the answer.
| "The profession of Victor Hugo is writer", "What is the job of Victor Hugo ?" | => "writer" |
| "The president of the United States is Georges Walter Bush", "Who is the president of the Usa ?" | => "Georges Walter Bush" |
| 5
|
| T25 |
Recognition of contracted forms |
The usual contracted forms of supported languages must be processed correctly. Example: "He's the Bernard's brother."
| "Patrick is French. He's the Bernard's brother. Who is the brother of Bernard ?" | => "Patrick" |
| "he doesn't eat meat", "does he eat meat ?" | => "No" |
| 4
|
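The tolerance criterion (T20) in the table above can be illustrated with a minimal input-normalisation sketch. It assumes a small hypothetical vocabulary and uses Python's standard `difflib` module for approximate matching; the cutoff value is an arbitrary choice, and a real chatbot may handle misspellings quite differently.

```python
import difflib

# Hypothetical dictionary of known words (lower case).
VOCABULARY = ["what", "is", "the", "capital", "of", "tunisia"]

def normalize(text):
    """Collapse extra blanks and map slightly misspelt words to known words."""
    words = text.lower().split()  # split() drops leading, trailing and double blanks
    fixed = []
    for word in words:
        close = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
        fixed.append(close[0] if close else word)  # keep unknown tokens unchanged
    return " ".join(fixed)

print(normalize("  What is  the capital of Tunysia ? "))
```

After this step, the misspelt question can be processed as if it had been typed correctly.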
Multilingualism
| Id |
Title |
Description and Test instructions |
Coef |
Mark |
| T30 |
Several languages understood
| The system must be able to understand at least two languages.
| "Je suis Victor Hugo" | => "ok" |
| "¿ Qué comen las vacas ?" | => An answer showing that the question has been understood |
| "los perros comen carne" | => "ok" |
| "gli uccelli stanno cantando" | => "ok" |
| "Quel age as tu ?" | => An answer showing that the question has been understood |
| "die Vögel singen" | => "ok" |
| 7
|
| T31 |
Adaptation to the language of the user
| The answer must be issued in the same language as the one used by the user.
| "Où est Madrid ?", "dondé esta Madrid", "Where is Madrid ?" | => "En espagne", "En España.", "In Spain" |
| "Quelle heure est-il ?", "Qué hora es ?", "What time is it ?" | => an answer in the language of the question |
| 5
|
| T32
| Reusability of a learnt fact in another language
| Example: "Kim is my daughter", "Qui est ma fille" => ~Kim
| "the cows eat fishes", "Que mangent les vaches ?" | => "des poissons" |
| "Tatiana es mi hija.", "Who is my Daughter ?" | => "Tatiana" |
| 7
|
| T33
| Possibility to add new languages
| It must be possible to add support for a new language just by adding the necessary dictionaries (at least for a basic support).
| The tester must look in the documentation of the chatbot to see how to add
support for a new language. If it is possible, the tester must test it, using
some of the sentences listed above (but in the tested language, of course).
|
| 3
|
| T34
| Capacity of translation
| The system is able to translate a text from one supported language to another (without using an internet request)
| "traduction félicitations" | => "congratulations" |
| "translate in english los coches de la ciudad" | => "the cars of the city" |
| "traducir en francés llorar cuando te vayas" | => "pour pleurer quand vous partez" |
| 2
|
Support of commands
| Id |
Title |
Description and Test instructions |
Coef |
Mark |
| T40
| Association of hard-coded processing with some verbs
| The system must be able to call a hard-coded procedure from a natural-language order, for example "Show me picture_name". Some utilities should also be available: memo, calculate, play...
| The tester must look in the documentation of the chatbot to see which
commands are supported and which utilities are accessible through requests to
the chatbot.
Example: "Show me picture_name", "calculate expression", "memo", "play music file"
|
| 7
|
| T41
| Capacity to forget
| The system must be able to forget or suppress some facts previously learnt (with management of access rights of course)
| "my brother is Bernard", "forget about Bernard", "who is my brother" | => "I don't know" |
| "forget about France" | => "Sorry. You do not have the necessary access rights." |
| 5
|
| T42
| Capacity to give more details
| The system must be able to give more or less information on a subject (depending on its configuration), and the user should be able to force it to give all the available information.
| "Victor Hugo is a writer", "Victor Hugo is the father of Leopoldine", "Who is Victor Hugo ?", "Be more precise" | => "a writer", "the father of Leopoldine" |
| "a A380 is a plane", "what is a A380", "Be more precise" | => "Sorry. I have told you all I know." |
| 3
|
| T43
| Capacity to learn new processes (interactive programmability)
| The system must ask the user to describe an action when it is not programmed.
| The tester must check that there is a mechanism for teaching new procedures to the chatbot interactively.
| "Call William" | => "I don't know how to call. Could you tell me how to do this ?" |
| "Create a new document" | => "I don't know how to create a new document. Could you tell me how to do this ?" |
| 5
|
Opening / Other features
| Id |
Title |
Description and Test instructions |
Coef |
Mark |
| T70
| Capacity to animate the conversation
| The system could decide to ask the user questions or to tell the user something that was not asked.
| "Ok, that's all" | => question from the chatbot to the user |
| If the user is silent for a long time, the chatbot must ask him a question.
|
| 2
|
| T71
| Multi user
| The application must be able to manage several interlocutors at one time (without confusion).
| Machine 1: "I am user1", "My father is father1" | => "ok" |
| Machine 2: "I am user2", "My father is father2" | => "ok" |
| Machine 1: "Who is my father" | => "father1" |
| Machine 2: "Who is my father" | => "father2" |
| Machine 1: "Who is the father of user2 ?" | => "father2" |
| 3
|
| T72
| Customisability of the answers
| The user must be able to configure the way the system answers requests. For instance, the volubility, the "exhaustiveness", the strictness, the interactivity, ... should be customisable.
| The tester must search in the documentation of the chatbot to check if the
answers can be customized and if they can, he must test and evaluate it. |
| 2
|
| T73
| Possibility to have customisable reflex answers
| The user (or the administrator of the system) should be able to define "reflex answers" for some types of questions. These answers are sent back to the user by the system without "thinking". Example: "Merci" => "De rien"
| The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it. |
| 3
|
| T74
| Connectable to classical databases
| The application must be able to use the information stored in a classical database to answer questions (ideally via ODBC or some other standard protocol).
| The tester must search in the documentation of the chatbot to check if it is
possible and if it is, he must test and evaluate it. |
| 3
|
| T75
| Use of Internet to search information
| If the system does not have the answer to a question, it should offer to search on the Internet.
| "who is Josephine Baker ?" | => "search on internet ?", "A singer" |
| "how many planets in the solar system ?" | => "search on internet ?", "9" |
| "weather paris" | => The weather forecast for Paris |
| 3
|
| T76
| Usable through Web, Email, WAP
| The system should be able to receive emails, HTTP or WAP requests, to process them and to send the answer to the user.
| The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
|
| 2
|
| T77
| Usable from other applications
| The application must provide a mechanism to communicate with other applications (API, COM, DDE, ...)
| The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
|
| 1
|
| T78
| Usable by voice / Voice generation
| The user must be able to use voice to communicate with the system. Typically, the application must be connectable to a "voice recognition/text to speech" software.
| The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
|
| 4
|
| T79
| Multiplatform
| The application must be available on different platforms (Windows, Mac, Unix, Linux, ...)
| The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
|
| 1
|
| T7A
| "Enrichable" dictionaries, grammars, facts, deduction rules, reflexes
| The user must be able to enrich the dictionaries used by the application (at least with a text editor). Same thing for the grammar files, the file containing the facts (assertions), the file of the deduction rules and the file containing the reflex answers.
| The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
|
| 3
|
| T7B
| Robustness, performance, convenience
| The application must have the usual qualities required of all software.
| The tester must search in the documentation of the chatbot to check if it is possible and if it is, he must test and evaluate it.
|
| 5
|