Topics for Today
- Types of evaluation
- The Turing Test
- Do we want to model human performance?
  - The bird/airplane argument examined
How can we evaluate AI systems?
- Formal definition of success, when possible
  - E.g., for games
- For tools, such as e-mail interfaces, scheduling tools, MT support tools:
  - Customer satisfaction (questionnaires)
  - Improvement in performance (define formal measures)
- Formal measures are common in research
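To make "formal measures" concrete, here is a minimal sketch of precision, recall, and F1 computed against gold-standard labels (the function name and example tags are illustrative, not from the lecture):

```python
# Minimal sketch: scoring system output against gold annotations
# for one target label (precision / recall / F1).

def precision_recall_f1(system, gold, target):
    """Score a system against gold-standard labels for one target label."""
    tp = sum(1 for s, g in zip(system, gold) if s == target and g == target)
    fp = sum(1 for s, g in zip(system, gold) if s == target and g != target)
    fn = sum(1 for s, g in zip(system, gold) if s != target and g == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: part-of-speech tagging, scoring the "NOUN" label.
gold   = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]
system = ["NOUN", "NOUN", "NOUN", "ADJ", "VERB"]
print(precision_recall_f1(system, gold, "NOUN"))
```

Research evaluations commonly report measures of this kind, computed over manually annotated test data.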
Often manual annotation is required for evaluation
- Natural language disambiguation
  - Word senses, parts of speech, syntax, ?
- Information retrieval: which documents do you want?
- Diagnosis: actual diseases; a reasonable diagnosis? (sometimes already exists)
- Summarization, machine translation: are these reasonable? Can't expect exact matches!
- Manual annotations are needed for training/development too
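Because exact matches can't be expected for summarization or MT output, automatic metrics typically score word or n-gram overlap with a reference text instead. A simplified, BLEU-like unigram precision is sketched below (illustrative only; real metrics are considerably more elaborate):

```python
# Rough sketch: overlap-based scoring of candidate output against a
# reference, with clipped counts so repeated words aren't over-credited.
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate words that also appear in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, ref[w]) for w, n in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

print(unigram_precision("the cat sat on the mat", "a cat is on the mat"))
```

Such metrics reward reasonable outputs that share content with a human-written reference, without requiring an exact match.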
The Turing Test
- Judge computers by human behavior

      Room 1                    Room 2
      Person 1 ---------------- Computer responding
      Person 2 ---------------- Person responding

- Can the computer fool Person 1 into thinking it's a person?
The Turing Test: What do you think?
- +: focus on behavior instead of the internal algorithm used
- -: (Allen, keynote address, AAAI-97) defines success only in terms of human intelligence
  (Also not well founded: the computer could act like a crazy person, and a person could act like a computer)
Should we use Humans as our models?
- Pro: they are our best examples!
- Anti: the bird/airplane argument
  - People tried to build machines with flapping wings
  - The Wright brothers ignored flapping wings and solved the problem in a different way
  - Maybe machines must do things differently than people (animals) do them
Current practice: Both and Neither
- Many AI researchers use probability theory and formal logic, without claims of cognitive validity
- But: evaluation and annotation drive the work
  - Equivalent to in-depth analysis and observation of human behavior