Abstract
Our work is driven by the hypothesis that, for a program to answer questions, explain its answers, and engage in a dialog just as a human does, it must have an explicit representation of knowledge. Such explicit representations occur naturally in many situations, such as designs created by engineers, software requirements expressed in the Unified Modeling Language, or process flow diagrams created for a manufacturing process. Automated approaches based on natural language processing have progressed on tasks such as named entity recognition, fact extraction, and relation learning, but they cannot generate expressive representations with high accuracy. In this paper, we report on our effort to systematically curate a knowledge base for a substantial fraction of a biology textbook. Although the experience and the process offer insights of their own, three aspects are especially instructive for the future development of knowledge bases by both manual and automatic methods: (1) consider imposing a simplifying abstract structure on natural language sentences so that the surface form is closer to the target logical form to be extracted; (2) adopt an upper ontology that is strongly motivated and influenced by natural language; (3) develop a set of syntactic and semantic guidelines that capture how the conceptual distinctions in the ontology may be realized in natural language. Because this representation has effectively enabled reasoning, explanation, and dialog, it provides a concrete target for what automated methods should learn.