Ben Goertzel
with Cassio Pennachin & Nil Geisweiller & the OpenCog Team

Engineering General Intelligence, Part 2: The CogPrime Architecture for Integrative, Embodied AGI

September 19, 2013

This book is dedicated by Ben Goertzel to his beloved, departed grandfather, Leo Zwell - an amazingly warm-hearted, giving human being who was also a deep thinker and excellent scientist, who got Ben started on the path of science. As a careful experimentalist, Leo would have been properly skeptical of the big hypotheses made here - but he would have been eager to see them put to the test!

Preface

Welcome to the second volume of Engineering General Intelligence! This is the second half of a two-part technical treatise aimed at outlining a practical approach to engineering software systems with general intelligence at the human level and ultimately beyond. Our goal here is an ambitious one and not a modest one: machines with flexible problem-solving ability, open-ended learning capability, creativity and, eventually, their own kind of genius.

Part 1 set the stage, dealing with a variety of general conceptual issues related to the engineering of advanced AGI, as well as presenting a brief overview of the CogPrime design for Artificial General Intelligence. Now here in Part 2 we plunge deep into the nitty-gritty, and describe the multiple aspects of CogPrime with a fairly high degree of detail. First we describe the CogPrime software architecture and knowledge representation in detail; then we review the "cognitive cycle" via which CogPrime perceives and acts in the world and reflects on itself. We then turn to various forms of learning: procedural, declarative (e.g. inference), simulative and integrative. Methods of enabling natural language functionality in CogPrime are then discussed; and the volume concludes with a chapter summarizing the argument that CogPrime can lead to human-level (and eventually perhaps greater) AGI, and a chapter giving a "thought experiment" describing the internal dynamics via which a completed CogPrime system might solve the problem of obeying the request "Build me something with blocks that I haven't seen before."

Reading this book without having read Engineering General Intelligence, Part 1 first is not especially recommended, since the prequel not only provides context for this one, but also defines a number of specific terms and concepts that are used here without explanation (for example, Part 1 has an extensive Glossary). However, the impatient reader who has not mastered Part 1, or the reader who has finished Part 1 but is tempted to hop through Part 2 nonlinearly, might wish to first skim the final two chapters, and then return to reading in linear order.

While the majority of the text here was written by the lead author Ben Goertzel, the overall work and underlying ideas have been very much a team effort, with major input from the secondary authors Cassio Pennachin and Nil Geisweiller, and large contributions from various other contributors as well. Many chapters have specifically indicated coauthors; but the contributions from various collaborating researchers and engineers go far beyond these. The creation of the AGI approach and design presented here is a process that has occurred over a long period of time among a community of people; and this book is in fact a quite partial view of the existent body of knowledge and intuition regarding CogPrime.
For example, beyond the ideas presented here, there is a body of work on the OpenCog wiki site, and then the OpenCog codebase itself. More extensive introductory remarks may be found in the Preface of Part 1, including a brief history of the book and acknowledgements to some of those who helped inspire it. Also, one brief comment from the Preface of Part 1 bears repeating: at several places in this volume, as in its predecessor, we will refer to the "current" CogPrime implementation (in the OpenCog framework); in all cases this refers to the OpenCog software system as of late 2013.

We fully realize that this book is not "easy reading", and that the level and nature of exposition varies somewhat from chapter to chapter. We have done our best to present these very complex ideas as clearly as we could, given our own time constraints, and the lack of commonly understood vocabularies for discussing many of the concepts and systems involved. Our hope is that the length of the book, and the conceptual difficulty of some portions, will be considered as compensated by the interest of the ideas we present. For, make no mistake: for all their technicality and subtlety, we find the ideas presented here incredibly exciting. We are talking about no less than the creation of machines with intelligence, creativity and genius equaling and ultimately exceeding that of human beings. This is, in the end, the kind of book that we (the authors) all hoped to find when we first entered the AI field: a reasonably detailed description of how to go about creating thinking machines. The fact that so few treatises of this nature, and so few projects explicitly aimed at the creation of advanced AGI, exist, is something that has perplexed us since we entered the field. Rather than just complain about it, we have taken matters into our own hands, and worked to create a design and a codebase that we believe capable of leading to human-level AGI and beyond. We feel tremendously fortunate to live in times when this sort of pursuit can be discussed in a serious, scientific way.

Online Appendices

Just one more thing before getting started! This book originally had even more chapters than the ones currently presented in Parts 1 and 2. In order to decrease length and increase focus, however, a number of chapters dealing with peripheral - yet still relevant and interesting - matters were moved to online appendices. These may be downloaded in a single PDF file at http://goertzel.org/engineering_general_intelligence_appendices_B-H.pdf. The titles of these appendices are:

• Appendix A: Possible Worlds Semantics and Experiential Semantics
• Appendix B: Steps Toward a Formal Theory of Cognitive Structure and Dynamics
• Appendix C: Emergent Reflexive Mental Structures
• Appendix D: GOLEM: Toward an AGI Meta-Architecture Enabling Both Goal Preservation and Radical Self-Improvement
• Appendix E: Lojban++: A Novel Linguistic Mechanism for Teaching AGI Systems
• Appendix F: PLN and the Brain
• Appendix G: Possible Worlds Semantics and Experiential Semantics
• Appendix H: Propositions About Environments in Which CogPrime Components are Useful

None of these are critical to understanding the key ideas in the book, which is why they were relegated to online appendices. However, reading them will deepen your understanding of the conceptual and formal perspectives underlying the CogPrime design.
September 2013
Ben Goertzel

Contents

Section I Architectural and Representational Mechanisms

19 The OpenCog Framework
   19.1 Introduction
      19.1.1 Layers of Abstraction in Describing Artificial Minds
      19.1.2 The OpenCog Framework
   19.2 The OpenCog Architecture
      19.2.1 OpenCog and Hardware Models
      19.2.2 The Key Components of the OpenCog Framework
   19.3 The AtomSpace
      19.3.1 The Knowledge Unit: Atoms
      19.3.2 AtomSpace Requirements and Properties
      19.3.3 Accessing the Atomspace
      19.3.4 Persistence
      19.3.5 Specialized Knowledge Stores
   19.4 MindAgents: Cognitive Processes
      19.4.1 A Conceptual View of CogPrime Cognitive Processes
      19.4.2 Implementation of MindAgents
      19.4.3 Tasks
      19.4.4 Scheduling of MindAgents and Tasks in a Unit
      19.4.5 The Cognitive Cycle
   19.5 Distributed AtomSpace and Cognitive Dynamics
      19.5.1 Distributing the AtomSpace
      19.5.2 Distributed Processing

20 Knowledge Representation Using the Atomspace
   20.1 Introduction
   20.2 Denoting Atoms
      20.2.1 Meta-Language
      20.2.2 Denoting Atoms
   20.3 Representing Functions and Predicates
      20.3.1 Execution Links
      20.3.2 Denoting Schema and Predicate Variables
      20.3.3 Variable and Combinator Notation
      20.3.4 Inheritance Between Higher-Order Types
      20.3.5 Advanced Schema Manipulation

21 Representing Procedural Knowledge
   21.1 Introduction
   21.2 Representing Programs
   21.3 Representational Challenges
   21.4 What Makes a Representation Tractable?
   21.5 The Combo Language
   21.6 Normal Forms Postulated to Provide Tractable Representations
      21.6.1 A Simple Type System
      21.6.2 Boolean Normal Form
      21.6.3 Number Normal Form
      21.6.4 List Normal Form
      21.6.5 Tuple Normal Form
      21.6.6 Enum Normal Form
      21.6.7 Function Normal Form
      21.6.8 Action Result Normal Form
   21.7 Program Transformations
      21.7.1 Reductions
      21.7.2 Neutral Transformations
      21.7.3 Non-Neutral Transformations
   21.8 Interfacing Between Procedural and Declarative Knowledge
      21.8.1 Programs Manipulating Atoms
   21.9 Declarative Representation of Procedures

Section II The Cognitive Cycle

22 Emotion, Motivation, Attention and Control
   22.1 Introduction
   22.2 A Quick Look at Action Selection
   22.3 Psi in CogPrime
   22.4 Implementing Emotion Rules atop Psi's Emotional Dynamics
      22.4.1 Grounding the Logical Structure of Emotions in the Psi Model
   22.5 Goals and Contexts
      22.5.1 Goal Atoms
   22.6 Context Atoms
   22.7 Ubergoal Dynamics
      22.7.1 Implicit Ubergoal Pool Modification
      22.7.2 Explicit Ubergoal Pool Modification
   22.8 Goal Formation
   22.9 Goal Fulfillment and Predicate Schematization
   22.10 Context Formation
   22.11 Execution Management
   22.12 Goals and Time

23 Attention Allocation
   23.1 Introduction
   23.2 Semantics of Short and Long Term Importance
      23.2.1 The Precise Semantics of STI and LTI
      23.2.2 STI, STIFund, and Juju
      23.2.3 Formalizing LTI
      23.2.4 Applications of LTIburst versus LTIcont
   23.3 Defining Burst LTI in Terms of STI
   23.4 Valuing LTI and STI in terms of a Single Currency
   23.5 Economic Attention Networks
      23.5.1 Semantics of Hebbian Links
      23.5.2 Explicit and Implicit Hebbian Relations
   23.6 Dynamics of STI and LTI Propagation
      23.6.1 ECAN Update Equations
      23.6.2 ECAN as Associative Memory
   23.7 Glocal Economic Attention Networks
      23.7.1 Experimental Explorations
   23.8 Long-Term Importance and Forgetting
   23.9 Attention Allocation via Data Mining on the System Activity Table
   23.10 Schema Credit Assignment
   23.11 Interaction between ECANs and other CogPrime Components
      23.11.1 Use of PLN and Procedure Learning to Help ECAN
      23.11.2 Use of ECAN to Help Other Cognitive Processes
   23.12 MindAgent Importance and Scheduling
   23.13 Information Geometry for Attention Allocation
      23.13.1 Brief Review of Information Geometry
      23.13.2 Information-Geometric Learning for Recurrent Networks: Extending the ANGL Algorithm
      23.13.3 Information Geometry for Economic Attention Allocation: A Detailed Example

24 Economic Goal and Action Selection
   24.1 Introduction
   24.2 Transfer of STI "Requests for Services" Between Goals
   24.3 Feasibility Structures
   24.4 Goal Based Schema Selection
      24.4.1 A Game-Theoretic Approach to Action Selection
   24.5 SchemaActivation
   24.6 GoalBasedSchemaLearning

25 Integrative Procedure Evaluation
   25.1 Introduction
   25.2 Procedure Evaluators
      25.2.1 Simple Procedure Evaluation
      25.2.2 Effort Based Procedure Evaluation
      25.2.3 Procedure Evaluation with Adaptive Evaluation Order
   25.3 The Procedure Evaluation Process
      25.3.1 Truth Value Evaluation
      25.3.2 Schema Execution

Section III Perception and Action

26 Perceptual and Motor Hierarchies
   26.1 Introduction
   26.2 The Generic Perception Process
      26.2.1 The ExperienceDB
   26.3 Interfacing CogPrime with a Virtual Agent
      26.3.1 Perceiving the Virtual World
      26.3.2 Acting in the Virtual World
   26.4 Perceptual Pattern Mining
      26.4.1 Input Data
      26.4.2 Transaction Graphs
      26.4.3 Spatiotemporal Conjunctions
      26.4.4 The Mining Task
   26.5 The Perceptual-Motor Hierarchy
   26.6 Object Recognition from Polygonal Meshes
      26.6.1 Algorithm Overview
      26.6.2 Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes
      26.6.3 Creating Adjacency Graphs from PPNodes
      26.6.4 Clustering in the Adjacency Graph
      26.6.5 Discussion
   26.7 Interfacing the Atomspace with a Deep Learning Based Perception-Action Hierarchy
      26.7.1 Hierarchical Perception Action Networks
      26.7.2 Declarative Memory
      26.7.3 Sensory Memory
      26.7.4 Procedural Memory
      26.7.5 Episodic Memory
      26.7.6 Action Selection and Attention Allocation
   26.8 Multiple Interaction Channels

27 Integrating CogPrime with a Compositional Spatiotemporal Deep Learning Network
   27.1 Introduction
   27.2 Integrating CSDLNs with Other AI Frameworks
   27.3 Semantic CSDLN for Perception Processing
   27.4 Semantic CSDLN for Motor and Sensorimotor Processing
   27.5 Connecting the Perceptual and Motoric Hierarchies with a Goal Hierarchy

28 Making DeSTIN Representationally Transparent
   28.1 Introduction
   28.2 Review of DeSTIN Architecture and Dynamics
      28.2.1 Beyond Gray-Scale Vision
   28.3 Uniform DeSTIN
      28.3.1 Translation-Invariant DeSTIN
      28.3.2 Mapping States of Translation-Invariant DeSTIN into the Atomspace
      28.3.3 Scale-Invariant DeSTIN
      28.3.4 Rotation Invariant DeSTIN
      28.3.5 Temporal Perception
   28.4 Interpretation of DeSTIN's Activity
      28.4.1 DeSTIN's Assumption of Hierarchical Decomposability
      28.4.2 Distance and Utility
   28.5 Benefits and Costs of Uniform DeSTIN
   28.6 Imprecise Probability as a Tool for Linking CogPrime and DeSTIN
      28.6.1 Visual Attention Focusing
      28.6.2 Using Imprecise Probabilities to Guide Visual Attention Focusing
      28.6.3 Sketch of Application to DeSTIN

29 Bridging the Symbolic/Subsymbolic Gap
   29.1 Introduction
   29.2 Simplified OpenCog Workflow
   29.3 Integrating DeSTIN and OpenCog
      29.3.1 Mining Patterns from DeSTIN States
      29.3.2 Probabilistic Inference on Mined Hypergraphs
      29.3.3 Insertion of OpenCog-Learned Predicates into DeSTIN's Pattern Library
   29.4 Multisensory Integration, and Perception-Action Integration
      29.4.1 Perception-Action Integration
      29.4.2 Thought-Experiment: Eye-Hand Coordination
   29.5 A Practical Example: Using Subtree Mining to Bridge the Gap Between DeSTIN and PLN
      29.5.1 The Importance of Semantic Feedback
   29.6 Some Simple Experiments with Letters
      29.6.1 Mining Subtrees from DeSTIN States Induced via Observing Letterforms
      29.6.2 Mining Subtrees from DeSTIN States Induced via Observing Letterforms
   29.7 Conclusion

Section IV Procedure Learning

30 Procedure Learning as Program Learning
   30.1 Introduction
      30.1.1 Program Learning
   30.2 Representation-Building
   30.3 Specification Based Procedure Learning

31 Learning Procedures via Imitation, Reinforcement and Correction
   31.1 Introduction
   31.2 IRC Learning
      31.2.1 A Simple Example of Imitation/Reinforcement Learning
      31.2.2 A Simple Example of Corrective Learning
   31.3 IRC Learning in the PetBrain
      31.3.1 Introducing Corrective Learning
   31.4 Applying A Similar IRC Methodology to Spontaneous Learning

32 Procedure Learning via Adaptively Biased Hillclimbing
   32.1 Introduction
   32.2 Hillclimbing
   32.3 Entity and Perception Filters
      32.3.1 Entity filter
      32.3.2 Entropy perception filter
   32.4 Using Action Sequences as Building Blocks
   32.5 Automatically Parametrizing the Program Size Penalty
      32.5.1 Definition of the complexity penalty
      32.5.2 Parameterizing the complexity penalty
      32.5.3 Definition of the Optimization Problem
   32.6 Some Simple Experimental Results
   32.7 Conclusion

33 Probabilistic Evolutionary Procedure Learning
   33.1 Introduction
      33.1.1 Explicit versus Implicit Evolution in CogPrime
   33.2 Estimation of Distribution Algorithms
   33.3 Competent Program Evolution via MOSES
      33.3.1 Statics
      33.3.2 Dynamics
      33.3.3 Architecture
      33.3.4 Example: Artificial Ant Problem
      33.3.5 Discussion
      33.3.6 Conclusion
   33.4 Integrating Feature Selection Into the Learning Process
      33.4.1 Machine Learning, Feature Selection and AGI
      33.4.2 Data- and Feature-Focusable Learning Problems
      33.4.3 Integrating Feature Selection Into Learning
      33.4.4 Integrating Feature Selection into MOSES Learning
      33.4.5 Application to Genomic Data Classification
   33.5 Supplying Evolutionary Learning with Long-Term Memory
   33.6 Hierarchical Program Learning
      33.6.1 Hierarchical Modeling of Composite Procedures in the AtomSpace
      33.6.2 Identifying Hierarchical Structure In Combo Trees via MetaNodes and Dimensional Embedding
   33.7 Fitness Function Estimation via Integrative Intelligence

Section V Declarative Learning

34 Probabilistic Logic Networks
   34.1 Introduction
   34.2 A Simple Overview of PLN
      34.2.1 Forward and Backward Chaining
   34.3 First Order Probabilistic Logic Networks
      34.3.1 Core FOPLN Relationships
      34.3.2 PLN Truth Values
      34.3.3 Auxiliary FOPLN Relationships
      34.3.4 PLN Rules and Formulas
      34.3.5 Inference Trails
   34.4 Higher-Order PLN
      34.4.1 Reducing HOPLN to FOPLN
   34.5 Predictive Implication and Attraction
   34.6 Confidence Decay
      34.6.1 An Example
   34.7 Why is PLN a Good Idea?

35 Spatiotemporal Inference
   35.1 Introduction
   35.2 Related Work on Spatio-temporal Calculi
   35.3 Uncertainty with Distributional Fuzzy Values
   35.4 Spatio-temporal Inference in PLN
   35.5 Examples
      35.5.1 Spatiotemporal Rules
      35.5.2 The Laptop is Safe from the Rain
      35.5.3 Fetching the Toy Inside the Upper Cupboard
   35.6 An Integrative Approach to Planning

36 Adaptive, Integrative Inference Control
   36.1 Introduction
   36.2 High-Level Control Mechanisms
      36.2.1 The Need for Adaptive Inference Control
   36.3 Inference Control in PLN
      36.3.1 Representing PLN Rules as GroundedSchemaNodes
      36.3.2 Recording Executed PLN Inferences in the Atomspace
      36.3.3 Anatomy of a Single Inference Step
      36.3.4 Basic Forward and Backward Inference Steps
      36.3.5 Interaction of Forward and Backward Inference
      36.3.6 Coordinating Variable Bindings
      36.3.7 An Example of Problem Decomposition
      36.3.8 Example of Casting a Variable Assignment Problem as an Optimization Problem
      36.3.9 Backward Chaining via Nested Optimization
   36.4 Combining Backward and Forward Inference Steps with Attention Allocation to Achieve the Same Effect as Backward Chaining (and Even Smarter Inference Dynamics)
      36.4.1 Breakdown into MindAgents
   36.5 Hebbian Inference Control
   36.6 Inference Pattern Mining
   36.7 Evolution As an Inference Control Scheme
   36.8 Incorporating Other Cognitive Processes into Inference
   36.9 PLN and Bayes Nets

37 Pattern Mining
   37.1 Introduction
   37.2 Finding Interesting Patterns via Program Learning
   37.3 Pattern Mining via Frequent/Surprising Subgraph Mining
   37.4 Fishgram
      37.4.1 Example Patterns
      37.4.2 The Fishgram Algorithm
      37.4.3 Preprocessing
      37.4.4 Search Process
      37.4.5 Comparison to other algorithms

38 Speculative Concept Formation
   38.1 Introduction
   38.2 Evolutionary Concept Formation
   38.3 Conceptual Blending
      38.3.1 Outline of a CogPrime Blending Algorithm
      38.3.2 Another Example of Blending
   38.4 Clustering
   38.5 Concept Formation via Formal Concept Analysis
      38.5.1 Calculating Membership Degrees of New Concepts
      38.5.2 Forming New Attributes
      38.5.3 Iterating the Fuzzy Concept Formation Process

Section VI Integrative Learning

39 Dimensional Embedding
   39.1 Introduction
   39.2 Link Based Dimensional Embedding
   39.3 Harel and Koren's Dimensional Embedding Algorithm
      39.3.1 Step 1: Choosing Pivot Points
      39.3.2 Step 2: Similarity Estimation
      39.3.3 Step 3: Embedding
   39.4 Embedding Based Inference Control
   39.5 Dimensional Embedding and InheritanceLinks

40 Mental Simulation and Episodic Memory
   40.1 Introduction
   40.2 Internal Simulations
   40.3 Episodic Memory

41 Integrative Procedure Learning
   41.1 Introduction
      41.1.1 The Diverse Technicalities of Procedure Learning in CogPrime
   41.2 Preliminary Comments on Procedure Map Encapsulation and Expansion
   41.3 Predicate Schematization
      41.3.1 A Concrete Example
   41.4 Concept-Driven Schema and Predicate Creation
      41.4.1 Concept-Driven Predicate Creation
      41.4.2 Concept-Driven Schema Creation
   41.5 Inference-Guided Evolution of Pattern-Embodying Predicates
      41.5.1 Rewarding Surprising Predicates
      41.5.2 A More Formal Treatment
   41.6 PredicateNode Mining
   41.7 Learning Schema Maps
      41.7.1 Goal-Directed Schema Evolution
   41.8 Occam's Razor

42 Map Formation
   42.1 Introduction
   42.2 Map Encapsulation
   42.3 Atom and Predicate Activity Tables
   42.4 Mining the AtomSpace for Maps
      42.4.1 Frequent Itemset Mining for Map Mining
      42.4.2 Evolutionary Map Detection
   42.5 Map Dynamics
   42.6 Procedure Encapsulation and Expansion
      42.6.1 Procedure Encapsulation in More Detail
      42.6.2 Procedure Encapsulation in the Human Brain
   42.7 Maps and Focused Attention
   42.8 Recognizing and Creating Self-Referential Structures
      42.8.1 Encouraging the Recognition of Self-Referential Structures in the AtomSpace

Section VII Communication Between Human and Artificial Minds

43 Communication Between Artificial Minds
   43.1 Introduction
   43.2 A Simple Example Using a PsyneseVocabulary Server
      43.2.1 The Psynese Match Schema
   43.3 Psynese as a Language
   43.4 Psynese Mindplexes
      43.4.1 AGI Mindplexes
   43.5 Psynese and Natural Language Processing
      43.5.1 Collective Language Learning

44 Natural Language Comprehension
   44.1 Introduction
   44.2 Linguistic Atom Types
   44.3 The Comprehension and Generation Pipelines
   44.4 Parsing with Link Grammar
      44.4.1 Link Grammar vs. Phrase Structure Grammar
   44.5 The RelEx Framework for Natural Language Comprehension
      44.5.1 RelEx2Frame: Mapping Syntactico-Semantic Relationships into FrameNet Based Logical Relationships
      44.5.2 A Priori Probabilities For Rules
      44.5.3 Exclusions Between Rules
      44.5.4 Handling Multiple Prepositional Relationships
      44.5.5 Comparatives and Phantom Nodes
   44.6 Frame2Atom
      44.6.1 Examples of Frame2Atom
      44.6.2 Issues Involving Disambiguation
   44.7 Syn2Sem: A Semi-Supervised Alternative to RelEx and RelEx2Frame
   44.8 Mapping Link Parses into Atom Structures
      44.8.1 Example Training Pair
   44.9 Making a Training Corpus
      44.9.1 Leveraging RelEx to Create a Training Corpus
      44.9.2 Making an Experience Based Training Corpus
      44.9.3 Unsupervised, Experience Based Corpus Creation
   44.10 Limiting the Degree of Disambiguation Attempted
   44.11 Rule Format
      44.11.1 Example Rule
   44.12 Rule Learning
   44.13 Creating a Cyc-Like Database via Text Mining
   44.14 PROWL Grammar
      44.14.1 Brief Review of Word Grammar
      44.14.2 Word Grammar's Logical Network Model
      44.14.3 Link Grammar Parsing vs Word Grammar Parsing
      44.14.4 Contextually Guided Greedy Parsing and Generation Using Word Link Grammar
   44.15 Aspects of Language Learning
      44.15.1 Word Sense Creation
      44.15.2 Feature Structure Learning
      44.15.3 Transformation and Semantic Mapping Rule Learning
   44.16 Experiential Language Learning
   44.17 Which Path(s) Forward?

45 Language Learning via Unsupervised Corpus Analysis
   45.1 Introduction
   45.2 Assumed Linguistic Infrastructure
   45.3 Linguistic Content To Be Learned
      45.3.1 Deeper Aspects of Comprehension
   45.4 A Methodology for Unsupervised Language Learning from a Large Corpus
      45.4.1 A High Level Perspective on Language Learning
      45.4.2 Learning Syntax
      45.4.3 Learning Semantics
   45.5 The Importance of Incremental Learning
   45.6 Integrating Language Learned via Corpus Analysis into CogPrime's Experiential Learning

46 Natural Language Generation
   46.1 Introduction
   46.2 SegSim for Sentence Generation
      46.2.1 NLGen: Example Results
   46.3 Experiential Learning of Language Generation
   46.4 Sem2Syn
   46.5 Conclusion

47 Embodied Language Processing
   47.1 Introduction
   47.2 Semiosis
   47.3 Teaching Gestural Communication
   47.4 Simple Experiments with Embodiment and Anaphor Resolution
   47.5 Simple Experiments with Embodiment and Question Answering
      47.5.1 Preparing/Matching Frames
      47.5.2 Frames2RelEx
      47.5.3 Example of the Question Answering Pipeline
      47.5.4 Example of the PetBrain Language Generation Pipeline
   47.6 The Prospect of Massively Multiplayer Language Teaching

48 Natural Language Dialogue
   48.1 Introduction
      48.1.1 Two Phases of Dialogue System Development
   48.2 Speech Act Theory and its Elaboration
   48.3 Speech Act Schemata and Triggers
      48.3.1 Notes Toward Example SpeechActSchema
   48.4 Probabilistic Mining of Trigger Contexts
   48.5 Conclusion

Section VIII From Here to AGI

49 Summary of Argument for the CogPrime Approach
   49.1 Introduction
   49.2 Multi-Memory Systems
   49.3 Perception, Action and Environment
   49.4 Developmental Pathways
   49.5 Knowledge Representation
   49.6 Cognitive Processes
      49.6.1 Uncertain Logic for Declarative Knowledge
      49.6.2 Program Learning for Procedural Knowledge
      49.6.3 Attention Allocation
      49.6.4 Internal Simulation and Episodic Knowledge
      49.6.5 Low-Level Perception and Action
      49.6.6 Goals
   49.7 Fulfilling the "Cognitive Equation"
   49.8 Occam's Razor
      49.8.1 Mind Geometry
   49.9 Cognitive Synergy
      49.9.1 Synergies that Help Inference
   49.10 Synergies that Help MOSES
      49.10.1 Synergies that Help Attention Allocation
      49.10.2 Further Synergies Related to Pattern Mining
      49.10.3 Synergies Related to Map Formation
   49.11 Emergent Structures and Dynamics
   49.12 Ethical AGI
   49.13 Toward Superhuman General Intelligence
      49.13.1 Conclusion

50 Build Me Something I Haven't Seen: A CogPrime Thought Experiment
   50.1 Introduction
   50.2 Roles of Selected Cognitive Processes
   50.3 A Semi-Narrative Treatment
   50.4 Conclusion

A Glossary
   A.1 List of Specialized Acronyms
   A.2 Glossary of Specialized Terms

References

Section I Architectural and Representational Mechanisms

Chapter 19: The OpenCog Framework

19.1 Introduction

The primary burden of this book is to explain the CogPrime architecture for AGI - the broad outline of the design, the main dynamics it is intended to display once complete, and the reasons why we believe it will be capable of leading to general intelligence at the human level and beyond. The crux of CogPrime lies in its learning algorithms and how they are intended to interact together synergetically, making use of CogPrime's knowledge representation and other tools.
Before we can get to this, however, we need to elaborate some of the "plumbing" within which these learning dynamics occur. We will start out with a brief description of the OpenCog framework, in which implementation of CogPrime has been, gradually and incrementally, occurring for the last few years.

19.1.1 Layers of Abstraction in Describing Artificial Minds

There are multiple layers intervening between a conceptual theory of mind and a body of source code. How many layers to explicitly discuss is a somewhat arbitrary decision, but one way to picture it is exemplified in Table 19.1. In Part 1 of this work we have concerned ourselves mainly with levels 5 and 6 in the table: mathematical/conceptual modeling of cognition and philosophy of mind (with occasional forays into levels 3 and 4). Most of Part 2, on the other hand, deals with level 4 (mathematical/conceptual AI design), verging into level 3 (high-level software design). This chapter however will focus on somewhat lower-level material, mostly level 3 with some dips into level 2. We will describe the basic architecture of CogPrime as a software system, implemented as "OpenCogPrime" within the OpenCog Framework (OCF). The reader may want to glance back at Chapter 6 of Part 1 before proceeding through this one, to get a memory-refresh on basic CogPrime terminology. Also, OpenCog and OpenCogPrime are open-source, so the reader who wishes to dig into the source code (mostly C++, some Python and Scheme) is welcome to; directions to find the code are on the opencog.org website.

Table 19.1: Levels of abstraction in CogPrime's implementation and design

   Level 1 - Source code
   Level 2 - Detailed software design
   Level 3 - Software architecture: largely programming-language-independent, but not hardware-architecture-independent; much of the material in this chapter, for example, and most of the OpenCog Framework
   Level 4 - Mathematical and conceptual AI design: e.g., the sort of characterization of CogPrime given in most of this Part of this book
   Level 5 - Abstract mathematical modeling of cognition: e.g. the SRAM model discussed in Chapter 7 of Part 1, which could be used to inspire or describe many different AI systems
   Level 6 - Philosophy of mind: e.g. Patternism, the Mind-World Correspondence Principle

19.1.2 The OpenCog Framework

The OpenCog Framework forms a bridge between the mathematical structures and dynamics of CogPrime's concretely implemented mind, and the nitty-gritty realities of modern computer technology. While CogPrime could in principle be implemented in a quite different infrastructure, in practice the CogPrime design has been developed closely in conjunction with OpenCog, so that a qualitative understanding of the nature of the OCF is fairly necessary for an understanding of how CogPrime is intended to function, and a detailed understanding of the OCF is necessary for doing concrete implementation work on CogPrime.

Marvin Minsky, in a personal conversation with one of the authors (Goertzel), once expressed the opinion that a human-level general intelligence could probably be implemented on a 486 PC, if we just knew the algorithm. We doubt this is the case - at least not unless the 486 PC were supplied with masses of external memory and allowed to proceed much, much slower than any human being - and it is certainly not the case for CogPrime. By current computing hardware standards, a CogPrime system is a considerable resource hog.
And it will remain so for a number of years, even considering technological progress. It is one of the jobs of the OCF to manage the system's gluttonous behavior. It is the software layer that abstracts the real-world efficiency compromises from the rest of the system; this is why we call it a "Mind OS": it provides services, rules, and protection to the Atoms and cognitive processes (see Section 19.4) that live on top of it, which are then allowed to ignore the software architecture they live on.

And so, the nature of the OCF is strongly influenced by the quantitative requirements imposed on the system, as well as the general nature of the structure and dynamics that it must support. The large number and great diversity of Atoms needed to create a significantly intelligent CogPrime demands that we pay careful attention to such issues as concurrent, distributed processing, and scalability in general. The number of Nodes and Links that we will need in order to create a reasonably complete CogPrime is still largely unknown. But our experiments with learning, natural language processing, and cognition over the past few years have given us an intuition for the question. We currently believe that we are likely to need billions - but probably not trillions, and almost surely not quadrillions - of Atoms in order to achieve a high degree of general intelligence. Hundreds of millions strikes us as possible but overly optimistic.

In fact we have already run CogPrime systems utilizing hundreds of millions of Atoms, though in a simplified dynamical regime with only a couple of very simple processes acting on most of them.

The operational infrastructure of the OCF is an area where pragmatism must reign over idealism. What we describe here is not the ultimate possible "mind operating system" to underlie a CogPrime system, but rather a workable practical solution given the hardware, networking and software infrastructure readily available today at reasonable prices. Along these lines, it must be emphasized that the ideas presented in this chapter are the result of over a decade of practical experimentation by the authors and their colleagues with implementations of related software systems. The journey began in earnest in 1997 with the design and implementation of the Webmind AI Engine at Intelligenesis Corp., which itself went through a few major design revisions; and then in 2001-2002 the Novamente Cognition Engine was architected and implemented, and evolved progressively until 2008, when a subset of it was adapted for open-sourcing as OpenCog. Innumerable mistakes were made, and lessons learned, along this path. The OCF as described here is significantly different, and better, than these previous architectures, thanks to these lessons, as well as to the changing landscape of concurrent, distributed computing over the past few years.

The design presented here reflects a mix of realism and idealism, and we haven't seen fit here to describe all the alternatives that were pursued on the route to what we present. We don't claim the approach we've chosen is ideal, but it's in use now within the OpenCog system, and it seems both workable in practice and capable of effectively supporting the entire CogPrime design.
No doubt it will evolve in some respects as implementation progresses; one of the principles kept in mind during the design and development of OpenCog was modularity, enabling substantial modifications to particular parts of the framework to occur without requiring wholesale changes throughout the codebase.

19.2 The OpenCog Architecture

19.2.1 OpenCog and Hardware Models

The job of the OCF is closely related to the nature of the hardware on which it runs. The ideal hardware platform for CogPrime would be a massively parallel hardware architecture, in which each Atom was given its own processor and memory. The closest thing would have been the Connection Machine [189]: a CM5 was once built with 64000 processors and local RAM for each processor. But even 64000 processors wouldn't be enough for a highly intelligent CogPrime to run in a fully parallelized manner, since we're sure we need more than 64000 Atoms. Connection Machine style hardware seems to have perished in favor of more standard SMP (Symmetric Multi-Processing) machines. It is true that each year we see SMP machines with more and more processors on the market, and more and more cores per processor. However, the state of the art is still in the hundreds of cores range, many orders of magnitude from what would be necessary for a one-Atom-per-processor CogPrime implementation.

So, at the present time, technological and financial reasons have pushed us to implement the OpenCog system using a relatively mundane and standard hardware architecture. If the CogPrime project is successful in the relatively near term, the first human-level OpenCogPrime system will most likely live on a network of high-end commodity SMP machines. These are machines with dozens of gigabytes of RAM and several processor cores, perhaps dozens but not thousands. A highly intelligent CogPrime would require a cluster of dozens and possibly hundreds or thousands of such machines. We think it's unlikely that tens of thousands will be required, and extremely unlikely that hundreds of thousands will be.

Given this sort of architecture, we need effective ways to swap Atoms back and forth between disk and RAM, and carefully manage the allocation of processor time among the various cognitive processes that demand it. The use of a widely-distributed network of weaker machines for peripheral processing is a serious possibility, and we have some detailed software designs addressing this option; but for the near future we believe that this can best be used as augmentation to core CogPrime processing, which must remain on a dedicated cluster.

Of course, the use of specialized hardware is also a viable possibility, and we have considered a host of possibilities such as:

• True supercomputers like those created by IBM or Cray (which these days are distributed systems, but with specialized, particularly efficient interconnection frameworks and overall control mechanisms)
• GPU supercomputers such as the Nvidia Tesla (which are currently being used for vision processing systems considered for hybridization with OCP, such as DeSTIN and Hugo de Garis's Parcone)
• Custom chips designed to implement the various CogPrime algorithms and data structures in hardware
• More speculatively, it might be possible to use evolutionary quantum computing or adiabatic quantum computing a la D-Wave (http://dwave.com) to accelerate CogPrime procedure learning.
All these possibilities and many more are exciting to envision, but the CogPrime architecture does not require any of them in order to be successful.

19.2.2 The Key Components of the OpenCog Framework

Given the realities of implementing CogPrime on clustered commodity servers, as we have seen above, the three key questions that have to be answered in the OCF design are:

1. How do we store CogPrime's knowledge?
2. How do we enable cognitive processes to act on that knowledge, refining and improving it?
3. How do we enable scalable, distributed knowledge storage and cognitive processing of that knowledge?

The remaining sections of this chapter are dedicated to answering each of these questions in more detail.

While the basic landscape of concurrent, distributed processing is largely the same as it was a decade ago - we're still dealing with distributed networks of multiprocessor von Neumann machines - we can draw on advancements in both computer architecture and software. The former is materialized in the increasing availability of multiple real and virtual cores in commodity processors. The latter reflects the emergence of a number of tools and architectural patterns, largely thanks to the rise of "big data" problems and businesses. Companies and projects dealing with massive datasets face challenges that aren't entirely like those of building CogPrime, but which share many useful similarities.

These advances are apparent mostly in the architecture of the AtomSpace, a distributed knowledge store for efficient storage of hypergraphs and their use by CogPrime's cognitive dynamics. The AtomSpace, like many NoSQL datastores, is heavily distributed, utilizing local caches for read and write operations, and a special-purpose design for eventual consistency guarantees.

We also attempt to minimize the complexities of multi-threading in the scheduling of cognitive dynamics, by allowing those to be deployed either as agents sharing a single OS process, or, preferably, as processes of their own. Cognitive dynamics communicate through message queues, which are provided by a sub-system that hides the deployment decision, so the messages exchanged are the same whether delivered within a process, to another process in the same machine, or to a process in another machine in the cluster.

19.3 The AtomSpace

As alluded to above and in Chapter 13, and discussed more fully in Chapter 20 below, the foundation of CogPrime's knowledge representation is the Atom, an object that can be either a Node or a Link. CogPrime's hypergraph is implemented as the AtomSpace, a specialized datastore that comes along with an API designed specifically for CogPrime's requirements.

19.3.1 The Knowledge Unit: Atoms

Atoms are used to represent every kind of knowledge in the system's memory in one way or another. The particulars of Atoms and how they represent knowledge will be discussed in later chapters; here we present only a minimal description in order to motivate the design of the AtomSpace. From that perspective, the most important properties of Atoms are:

• Every Atom has an AtomHandle, which is a universal ID across a CogPrime deployment (possibly involving thousands of networked machines). AtomHandles are the keys for accessing Atoms in the AtomSpace, and once a handle is assigned to an Atom it can't be changed or reused.
• Atoms have TruthValue and AttentionValue entities associated with them, each of which is a small collection of numbers; there are multiple versions of truth values, with varying degrees of detail. TruthValues are context-dependent, and useful Atoms will typically have multiple TruthValues, indexed by context.
• Some Atoms are Nodes, and may have names.
• Atoms that are Links have a list of targets, of variable size (as in CogPrime's hypergraph, Links may connect more than two Nodes).

Some Atom attributes are immutable, such as Node names and, most importantly, Link targets, called outgoing sets in AtomSpace lingo. One can remove a Link, but not change its targets. This enables faster implementation of some neighborhood searches, as well as indexing. Truth and attention values, on the other hand, are mutable, an essential requirement for CogPrime.

For performance reasons, some types of knowledge have alternative representations. These alternative representations are necessary for space or speed reasons, but knowledge stored that way can always be translated back into Atoms in the AtomSpace as needed. So, for instance, procedures are represented as program trees in a ProcedureRepository, which allows for faster execution, but the trees can be expanded into a set of Nodes and Links if one wants to do reasoning on a specific program.

19.3.2 AtomSpace Requirements and Properties

The major high-level requirements for the AtomSpace are the following:

• Store Atoms indexed by their immutable AtomHandles as compactly as possible, while still enabling very efficient modification of the mutable properties of an Atom (TruthValues and AttentionValues).
• Perform queries as fast as possible.
• Keep the working set of all Atoms currently being used by CogPrime's cognitive dynamics in RAM.
• Save and restore hypergraphs to disk, a more traditional SQL or non-SQL database, or other structure such as binary files, XML, etc.
• Hold hypergraphs consisting of billions or trillions of Atoms, scaling up to petabytes of data.
• Be transparently distributable across a cluster of machines.

The design trade-offs in the AtomSpace implementation are driven by the needs of CogPrime. The datastore is implemented in a way that maximizes the performance of the cognitive dynamics running on top of it. From this perspective, the AtomSpace differs from most datastores, as the key decisions aren't made in terms of flexibility, consistency, reliability and other common criteria for databases. It is a very specialized database. Among the factors that motivate the AtomSpace's design, we can highlight a few:

1. Atoms tend to be small objects, with very few exceptions (Links with many targets, or Atoms with many different context-derived TruthValues).
2. Atom creation and deletion are common events, and occur according to complex patterns that may vary a lot over time, even for a particular CogPrime instance.
3. Atoms involved in CogPrime's cognitive dynamics at any given time need to live in RAM. However, the system still needs the ability to save sets of Atoms to disk in order to preserve RAM, and then retrieve them later when they become contextually relevant.
4. Some Atoms will remain around for a really long time; others will be ephemeral and get removed shortly after they're created. Removal may be to disk, as outlined above, or plain deletion.
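To ground the foregoing, here is a minimal C++ sketch of the Atom data model described in Section 19.3.1. It is purely illustrative - the type and field names are our own, not OpenCog's actual declarations - but it captures the split between immutable identity (handle, name, outgoing set) and mutable valuation (truth and attention values):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // Illustrative sketch only; not the actual OpenCog declarations.
    using AtomHandle = std::uint64_t;  // universal ID, never changed or reused
    using ContextId  = std::uint64_t;  // key for context-indexed truth values

    struct TruthValue {        // simplest variant: strength plus confidence
        double strength;
        double confidence;
    };

    struct AttentionValue {    // importance on several time scales
        double sti;            // Short Term Importance
        double lti;            // Long Term Importance
        double vlti;           // Very Long Term Importance
    };

    struct Atom {
        const AtomHandle handle;                 // immutable identity
        const std::string name;                  // nonempty only for named Nodes
        const std::vector<AtomHandle> outgoing;  // Link targets ("outgoing set");
                                                 // empty for Nodes, frozen for Links
        std::map<ContextId, TruthValue> tv;      // mutable, context-dependent
        AttentionValue av;                       // mutable
    };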
Besides storing Atoms, the AtomSpace also contains a number of indices for fast Atom retrieval according to several criteria. It can quickly search for Atoms given their type, importance, truth value, arity, targets (for Links), name (for Nodes), and any combination of the above. These are built-in indexes. The AtomSpace also allows cognitive processes to create their own indexes, based on the evaluation of a Procedure over the universe of Atoms, or a subset of that universe specified by the process responsible for the index.

The AtomSpace also allows pattern matching queries for a given Atom structure template, which allows for fast search for small subgraphs displaying some desirable properties. In addition to pattern matching, it provides neighborhood searches. Although it doesn't implement any graph-traversal primitives, it's easy for cognitive processes to do so on top of the pattern matching and neighborhood primitives.

Note that, since CogPrime's hypergraph is quite different from a regular graph, using a graph database without modification would probably be inadequate. While it's possible to automatically translate a hypergraph into a regular graph, that process is expensive for large knowledge bases, and leads to higher space requirements, reducing the overall system's scalability.

In terms of database taxonomy, the AtomSpace lies somewhere between a key-value store and a document store, as there is some structure in the contents of each value (an Atom's properties are well defined, and listed above), but no built-in flexibility to add more contents to an existing Atom.

We will now discuss the above requirements in more detail, starting with querying the AtomSpace, followed by persistence to disk, and then handling of specific forms of knowledge that are best handled by specialized stores.

19.3.3 Accessing the Atomspace

The AtomSpace provides an API, which allows the basic operations of creating new Atoms, updating their mutable properties, searching for Atoms and removing Atoms. More specifically, the API supports the following operations:

• Create and store a new Atom. There are special methods for Nodes and Links, in the latter case with multiple convenience versions depending on the number of targets and other properties of the Link.
• Remove an Atom. This requires the validation that no Links currently point to that Atom, otherwise they'd be left dangling.
• Look up one or more Atoms. This includes several variants, such as:
  - Look up an Atom by AtomHandle;
  - Look up a Node by name;
  - Find Links with an Atom as target;
  - Pattern matching, i.e., find Atoms satisfying some predicate, which is designated as a "search criterion" by some cognitive process, and results in the creation of a specific index for that predicate;
  - Neighborhood search, i.e., find Atoms that are within some radius of a given centroid Atom;
  - Find Atoms by type (this can be combined with the previous queries, resulting in type-specific versions);
  - Find Atoms by some AttentionValue criteria, such as the top N most important Atoms, or those with importance above some threshold (can also be combined with previous queries);
  - Find Atoms by some TruthValue criteria, similar to the previous one (can also be combined with other queries);
  - Find Atoms based on some temporal or spatial association, a query that relies on the specialized knowledge stores mentioned below.
  Queries can be combined, and the Atom type, AttentionValue and TruthValue criteria are often used as filters for other queries, preventing the result set size from exploding.
• Manipulate an Atom, retrieving or modifying its AttentionValue and TruthValue. In the modification case, this causes the respective indexes to be updated.
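Continuing the illustrative sketch above, a toy in-memory AtomSpace showing the create / look-up / remove contract might look as follows. The names are hypothetical, and none of the indexing, pattern matching, or distribution machinery discussed in this chapter is included:

    #include <unordered_map>

    // Toy AtomSpace over the Atom sketch above; hypothetical API names.
    class AtomSpace {
    public:
        AtomHandle addNode(const std::string& name) {
            return store(Atom{next_++, name, {}, {}, {}});
        }
        AtomHandle addLink(std::vector<AtomHandle> targets) {
            return store(Atom{next_++, "", std::move(targets), {}, {}});
        }
        // Look up by handle; throws std::out_of_range for unknown handles.
        Atom& get(AtomHandle h) { return atoms_.at(h); }

        // Find Links that have 'target' in their outgoing set.
        std::vector<AtomHandle> incoming(AtomHandle target) const {
            std::vector<AtomHandle> result;
            for (const auto& [h, a] : atoms_)
                for (AtomHandle t : a.outgoing)
                    if (t == target) { result.push_back(h); break; }
            return result;
        }
        // Removal is valid only if no Link points at the Atom;
        // otherwise that Link would be left dangling.
        bool remove(AtomHandle h) {
            if (!incoming(h).empty()) return false;
            return atoms_.erase(h) > 0;
        }
    private:
        AtomHandle store(Atom a) {
            AtomHandle h = a.handle;
            atoms_.emplace(h, std::move(a));
            return h;
        }
        AtomHandle next_ = 1;  // handles are never reused
        std::unordered_map<AtomHandle, Atom> atoms_;
    };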
19.3.4 Persistence

In many planned CogPrime deployment scenarios, the amount of knowledge that needs to be stored is too vast to fit in RAM, even if one considers a large cluster of machines hosting the AtomSpace and the cognitive processes. The AtomSpace must then be able to persist subsets of that knowledge to disk, and reload them later when necessary.

The decision of whether to keep an Atom in RAM or remove it is made based on its AttentionValue, through the process of economic attention allocation that is the topic of Chapter 23. AttentionValue determines how important an Atom is to the system, and there are multiple levels of importance. For the persistence decisions, the ones that matter are Long Term Importance (LTI) and Very Long Term Importance (VLTI).

LTI is used to estimate the probability that the Atom will be necessary or useful in the not too distant future. If this value is low, below a threshold i1, then it is safe to remove the Atom from RAM, a process called forgetting. When the decision to forget an Atom has been made, VLTI enters the picture. VLTI is used to estimate the probability that the Atom will be useful eventually, at some distant point in the future. If VLTI is high enough, the forgotten Atom is persisted to disk so it can be reloaded. Otherwise, the Atom is permanently forgotten.

When an Atom has been forgotten, a proxy is kept in its place. The proxy is more compact than the original Atom, preserving only a crude measure of its LTI. When the proxy's LTI increases above a second threshold i2, the system understands that the Atom has become relevant again, and loads it from disk. Eventually, it may happen that the proxy doesn't become important enough over a very long period of time. In this case, the system should remove even the proxy, if its LTI is below a third threshold i3.

Other actions, usually taken by the system administrator, can cause the removal of Atoms and their proxies from RAM. For instance, in a CogPrime system managing information about a number of users of some information system, the deletion of a user from the system would cause all that user's specific Atoms to be removed. When Atoms are saved to disk and have no proxies in RAM, they can only be reloaded by the system administrator. When reloaded, they will be disconnected from the rest of the AtomSpace, and should be given special attention in order to pursue the creation of new Links with the other Atoms in the system.

It's important that the values of i1, i2, and i3 be set correctly. Otherwise, one or more of the following problems may arise:

• If i1 and i2 are too close, the system may spend a lot of resources on saving and loading Atoms.
• If i1 is set too high, important Atoms will be excluded from the system's dynamics, decreasing its intelligence.
• If i3 is set too high, the system will forget very quickly and will have to spend resources re-creating necessary but no longer available evidence.
• If either i1 or i3 is set too low, the system will consume significantly more resources than it needs on knowledge storage, sacrificing cognitive processes.

Generally, we want to enforce a degree of hysteresis for the freezing and defrosting process.
What we mean is that:

   i2 - i1 > ε1 > 0
   i1 - i3 > ε2 > 0

This ensures that when Atoms are reloaded, their importance is still above the threshold for saving, so they will have a chance to be part of cognitive dynamics and become more important, and won't be removed again too quickly. It also ensures that saved Atoms stay in the system for a period of time before their proxies are removed and they're definitively forgotten.

Another important consideration is that forgetting individual Atoms makes little sense, because, as pointed out above, Atoms are relatively small objects. So the forgetting process should prioritize the removal of clusters of highly interconnected Atoms whenever possible. In that case, it's possible that a large subset of those Atoms will only have relations within the cluster, so their proxies aren't needed and the memory savings are maximized.
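The threshold logic just described is compact enough to state in code. The following sketch is our own illustrative rendering of the freeze/defrost decisions, with invented names; the actual policy lives in the economic attention allocation machinery of Chapter 23:

    // Hypothetical rendering of the forgetting thresholds i1, i2, i3.
    struct ForgettingThresholds {
        double i1;  // LTI below i1: forget the Atom (drop it from RAM)
        double i2;  // proxy LTI above i2: reload ("defrost") from disk
        double i3;  // proxy LTI below i3: delete the proxy permanently

        // Hysteresis conditions: i2 - i1 > eps1 > 0 and i1 - i3 > eps2 > 0.
        bool hasHysteresis(double eps1, double eps2) const {
            return eps1 > 0.0 && (i2 - i1) > eps1
                && eps2 > 0.0 && (i1 - i3) > eps2;
        }
    };

    enum class MemoryAction { Keep, FreezeToDisk, DeletePermanently,
                              Reload, DropProxy };

    // Decision for an Atom currently resident in RAM.
    MemoryAction onResidentAtom(double lti, double vlti,
                                const ForgettingThresholds& t,
                                double vltiThreshold) {
        if (lti >= t.i1) return MemoryAction::Keep;
        // LTI too low: forget. VLTI decides disk proxy versus permanent loss.
        return vlti >= vltiThreshold ? MemoryAction::FreezeToDisk
                                     : MemoryAction::DeletePermanently;
    }

    // Decision for the proxy of a frozen (disk-resident) Atom.
    MemoryAction onProxy(double proxyLti, const ForgettingThresholds& t) {
        if (proxyLti >= t.i2) return MemoryAction::Reload;     // defrost
        if (proxyLti <  t.i3) return MemoryAction::DropProxy;  // forget for good
        return MemoryAction::Keep;
    }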
19.3.5 Specialized Knowledge Stores

Some specific kinds of knowledge are best stored in specialized data structures, which allow big savings in space, query time, or both. The information provided by these specialized stores isn't as flexible as it would be if the knowledge were stored in full-fledged Node and Link form, but most of the time CogPrime doesn't need the fully flexible format. Translation between the specialized formats and Nodes and Links is always possible, when necessary.

We note that the ideal set of specialized knowledge stores is application-domain specific. The stores we have deemed necessary reflect the preschool-based roadmap towards AGI, and are likely sufficient to get us through most of that roadmap, but not sufficient nor particularly adequate for an architecture where self-modification plays a key role. These specialized stores are a pragmatic compromise between performance and formalism, and their existence and design would need to be revised once CogPrime is mostly functional.

19.3.5.1 Procedure Repository

Procedural knowledge, meaning knowledge that can be used both for the selection and execution of actions, has a specialized requirement - this knowledge needs to be executable by the system. While it will be possible, and conceptually straightforward, to execute a procedure that is stored as a set of Atoms in the AtomSpace, it is much simpler, faster, and safer to rely on a specialized repository.

Procedural knowledge in CogPrime is stored as programs in a special-purpose LISP-like programming language called Combo. The motivation and details of this language are the subject of Chapter 21. Each Combo program is associated with a Node (a GroundedProcedureNode, to be more precise), and the AtomHandle of that Node is used to index the procedure repository, where the executable version of the program is kept, along with specifications of the necessary inputs for its evaluation and what kind of output to expect. Combo programs can also be saved to disk and loaded, like regular Atoms. There is a text representation of Combo for this purpose.

Program execution can be very fast, or, in cognitive dynamics terms, very slow, if it involves interacting with the external world. Therefore, the procedure repository should also facilitate the storage of program states during the execution of procedures. Concurrent execution of many procedures is possible with no significant overhead.
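As a sketch of the repository idea - again with invented names, not the actual OpenCog code - the essential structure is a map from the handle of a GroundedProcedureNode to the executable form of its Combo program:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    using AtomHandle = std::uint64_t;

    struct ComboProgram {
        std::string source;   // textual Combo, also used for disk save/load
        // ... the compiled program tree, input/output specifications, and
        //     any suspended execution state would live here
    };

    class ProcedureRepository {
    public:
        void add(AtomHandle groundedProcedureNode, ComboProgram p) {
            programs_[groundedProcedureNode] = std::move(p);
        }
        // Fetch the executable form for a GroundedProcedureNode, if present.
        const ComboProgram* find(AtomHandle groundedProcedureNode) const {
            auto it = programs_.find(groundedProcedureNode);
            return it == programs_.end() ? nullptr : &it->second;
        }
    private:
        std::unordered_map<AtomHandle, ComboProgram> programs_;
    };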
19.3.5.4 System Activity Table Set

The last relevant specialized store is the System Activity Table Set, which is described in more detail in Chapter 23. This set of tables records, with fine-grained temporal associations, the most important activities that take place inside CogPrime. There are different tables for recording cognitive process activity (at the level of MindAgents, to be described in the next section), for maintaining a history of the level of achievement of each important goal in the system, and for recording other important aspects of the system state, such as the most important Atoms and contexts.

19.4 MindAgents: Cognitive Processes

The AtomSpace holds the system's knowledge, but those Atoms are inert. How is that knowledge used and made useful? That is the province of cognitive dynamics. These dynamics, in a CogPrime system, can be considered on two levels.

First, we have the cognitive processes explicitly programmed into CogPrime's source code. These are what we call Concretely-Implemented Mind Dynamics, or CIM-Dynamics. Their implementation in software happens through objects called MindAgents. We use the term CIM-Dynamic to discuss a conceptual cognitive process, and the term MindAgent for its actual implementation and execution dynamics.

The second level corresponds to the dynamics that emerge from the system's self-organizing activity, based on the cooperative action of the CIM-Dynamics on the shared AtomSpace.

Most of the material in the following chapters is concerned with particular CIM-Dynamics in the CogPrime system. In this section we will simply give some generalities about CIM-Dynamics as abstract processes and as software processes, which are largely independent of the actual AI content of the CIM-Dynamics. In practice the CIM-Dynamics involved in a CogPrime system are fairly stereotyped in form, although diverse in the actual dynamics they induce.

19.4.1 A Conceptual View of CogPrime Cognitive Processes

We return now to the conceptual trichotomy of cognitive processes presented in Chapter 3 of Part 1, according to which CogPrime cognitive processes may be divided into:

• Control processes;
• Global cognitive processes;
• Focused cognitive processes.

In practical terms, these may be considered as three categories of CIM-Dynamic.

Control Process CIM-Dynamics are hard to stereotype. Examples are the homeostatic adaptation of the parameters associated with the various other CIM-Dynamics, and the CIM-Dynamics concerned with the execution of procedures, especially those whose execution is made lengthy by interactions with the external world. Control Processes tend to focus on a limited and specialized subset of Atoms or other entities, and carry out specialized mechanical operations on them (e.g. adjusting parameters, interpreting procedures).

To an extent, this may be considered a "grab bag" category containing CIM-Dynamics that are not global or focused cognitive processes according to the definitions of the latter two categories. However, it is a nontrivial observation about the CogPrime system that the CIM-Dynamics that are not global or focused cognitive processes are all explicitly concerned with system control in some way or another, so this grouping makes sense.

Global and Focused Cognitive Process CIM-Dynamics all have a common aspect to their structure. Beyond that, there are aspects in which Global versus Focused CIM-Dynamics diverge from each other in stereotyped ways.
In most cases, the process undertaken by a Global or Focused CIM-Dynamic involves two parts: a selection process and an actuation process. Schematically, such a CIM-Dynamic typically looks something like this:

1. Fetch a set of Atoms that it is judged will be useful to process, according to some selection process.
2. Operate on these Atoms, possibly together with previously selected ones (this is what we sometimes call the actuation process of the CIM-Dynamic).
3. Go back to step 1.

The major difference between Global and Focused cognitive processes lies in the selection process. In the case of a Global process, the selection process is very broad, sometimes yielding the whole AtomSpace, or a significant subset of it. This means that the actuation process must be very simple, or the activation of this CIM-Dynamic must be very infrequent. On the other hand, in the case of a Focused process, the selection process is very narrow, yielding only a small number of Atoms, which can then be processed more intensively and expensively, on a per-Atom basis.

Common selection processes for Focused cognitive processes are fitness-oriented selectors, which pick one or a set of Atoms from the AtomSpace with a probability based on some numerical quantity associated with the Atom, such as properties of TruthValue or AttentionValue. There are also more specific selection processes, which choose for example Atoms obeying some particular combination of relationships in relation to some other Atoms; say, choosing only Atoms that inherit from some given Atom already being processed. There is a notion, described in the PLN book, of an Atom Structure Template; this is basically just a predicate that applies to Atoms, such as

P(X).tv equals ((InheritanceLink X cat) AND (EvaluationLink eats (X, cheese))).tv

which is a template that matches everything that inherits from cat and eats cheese. Templates like this allow a much more refined selection than the above fitness-oriented selection process. Selection processes can be created by composing a fitness-oriented process with further restrictions, such as templates, or simpler type-based restrictions.

19.4.2 Implementation of MindAgents

MindAgents follow a very simple design. They need to provide a single method through which they can be enacted, and they should execute their actions in atomic, incremental steps, where each step should be relatively quick. This design enables collaborative scheduling of MindAgents, at the cost of allowing "opportunistic" agents to have more than their fair share of resources. We rely on CogPrime developers to respect the above guidelines, instead of trying to enforce exact resource allocations on the software level.

Each MindAgent can have a set of system parameters that guide its behavior. For instance, a MindAgent dedicated to inference can provide drastically different conclusions if its parameters tell it to select a small set of Atoms for processing each time, but to spend significant time on each Atom, rather than selecting many Atoms and doing shallow inferences on each one. It's expected that multiple copies of the same MindAgent will exist in the cluster, delivering different dynamics thanks to those parameters.
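As an illustration of the stereotyped selection/actuation loop and the incremental-step design just described, here is a minimal sketch in Python. All names are invented, and the STI-proportional sampling stands in for whatever fitness-oriented selector a real Focused CIM-Dynamic would use.

import random

class Atom:
    def __init__(self, name, sti):
        self.name, self.sti = name, sti   # sti: short-term importance

def select_by_sti(atoms, n):
    # fitness-oriented selector: sample n Atoms with probability ~ STI
    weights = [max(a.sti, 0.0) for a in atoms]
    if sum(weights) == 0:
        return []
    return random.choices(atoms, weights=weights, k=n)

class FocusedMindAgent:
    # narrow selection, expensive per-Atom actuation, run in small atomic steps
    def __init__(self, atoms, batch_size=3):
        self.atoms, self.batch_size = atoms, batch_size

    def run_one_step(self):
        # one quick increment: select a few Atoms, then actuate on each
        for atom in select_by_sti(self.atoms, self.batch_size):
            self.actuate(atom)

    def actuate(self, atom):
        print("processing", atom.name)   # real per-Atom work would go here

FocusedMindAgent([Atom("cat", 0.9), Atom("mat", 0.1)]).run_one_step()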
In addition to their main action method, MindAgents can also communicate with other MindAgents through message queues. CogPrime has, in its runtime configuration, a list of available MindAgents and their locations in the cluster. Communications between MindAgents typically take the form of specific, one-time requests, which we call Tasks.

The default action of MindAgents and the processing of Tasks constitute the cognitive dynamics of CogPrime. Nearly everything that takes place within a CogPrime deployment is done by either a MindAgent (including the control processes), a Task, or specialized code handling AtomSpace internals or communications with the external world. We now talk about how those dynamics are scheduled.

MindAgents live inside a process called a CogPrime Unit. One machine in a CogPrime cluster can contain one or more Units, and one Unit can contain one or more MindAgents. In practice, given the way the AtomSpace is distributed, which requires a control process in each machine, it typically makes more sense to have a single Unit per machine, as this enables all MindAgents in that machine to make direct function calls to the AtomSpace, instead of using more expensive inter-process communication.

There are exceptions to the above guideline, to accommodate various situations:

1. Very specific MindAgents may not need to communicate with other agents, or only do so very rarely, so it makes sense to give them their own process.
2. MindAgents whose implementation is a poor fit for the collaborative, small-increment processing design described above should also be given their own process, so they don't interfere with the overall dynamics on that machine.
3. MindAgents whose priority is either much higher or much lower than that of other agents in the same machine should be given their own process, so operating system-level scheduling can be relied upon to reflect those very different priority levels.

19.4.3 Tasks

It is not convenient for CogPrime to do all its work directly via the action of MindAgent objects embodying CIM-Dynamics. This is especially true for MindAgents embodying focused cognitive processes. These have their selection algorithms, which are ideally suited to guarantee that, over the long run, the right Atoms get selected and processed. This, however, doesn't address the issue that, on many occasions, it may be necessary to quickly process a specific set of Atoms in order to execute an action or rapidly respond to some demand. These actions tend to be one-time, rather than the recurring patterns of mind dynamics.

While it would be possible to design MindAgents so that they could both cover their long-term processing needs and rapidly respond to urgent demands, we found it much simpler to augment the MindAgent framework with an additional scheduling mechanism that we call the Task framework. In essence, this is a ticketing system, designed to handle cases where MindAgents or Schema spawn one-off tasks to be executed - things that need to be done only once, rather than repeatedly and iteratively as with the things embodied in MindAgents.

For instance, "grab the most important Atoms from the AtomSpace and do shallow PLN reasoning to derive immediate conclusions from them" is a natural job for a MindAgent. But "do a search to find entities that satisfy this particular predicate P" is a natural job for a Task.

Tasks have AttentionValues and target MindAgents. When a Task is created it is submitted to the appropriate Unit and then put in a priority queue. The Unit will allocate some resources to processing the more important Tasks, as we'll see next.
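The following is a minimal sketch in Python, with invented names, of the Task ticketing mechanism: one-off requests carrying an importance (derived from the Task's AttentionValue) and a target MindAgent, kept in a priority queue that the owning Unit drains as resources permit.

import heapq, itertools

class TaskQueue:
    def __init__(self):
        self._heap, self._tie = [], itertools.count()

    def submit(self, importance, target_agent, payload):
        # heapq is a min-heap, so negate importance to pop the highest first
        heapq.heappush(self._heap,
                       (-importance, next(self._tie), target_agent, payload))

    def __len__(self):
        return len(self._heap)

    def pop_most_important(self):
        _, _, agent, payload = heapq.heappop(self._heap)
        return agent, payload

q = TaskQueue()
q.submit(0.9, "PLNAgent", "find entities satisfying predicate P")
q.submit(0.2, "PLNAgent", "low-priority cleanup")
print(q.pop_most_important())   # the predicate search comes out first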
19.4.4 Scheduling of MindAgents and Tasks in a Unit

Within each Unit we have one or more MindAgents, a Task queue and, optionally, a subset of the distributed AtomSpace. If that subset isn't held in the Unit, it's held in another process running on the same machine.

If there is more than one Unit per machine, their relative priorities are handled by the operating system's scheduler. In addition to the Units, CogPrime has an extra maintenance process per machine, whose job is to handle changes in those priorities as well as reconfigurations caused by MindAgent migration, and machines joining or leaving the CogPrime cluster.

So, at the Unit level, attention allocation in CogPrime has two aspects: how MindAgents and Tasks receive attention from CogPrime, and how Atoms receive attention from different MindAgents and Tasks. The topic of this Section is the former. The latter is dealt with elsewhere, in two ways:

• in Chapter 23, which discusses the dynamic updating of the AttentionValue structures associated with Atoms, and how these determine how much attention various focused cognitive process MindAgents pay to them.
• in the discussion of various specific CIM-Dynamics, each of which may make choices of which Atoms to focus on in its own way (though generally making use of AttentionValue and TruthValue in doing so).

The attention allocation subsystem is also pertinent to MindAgent scheduling, because it includes dynamics that update the ShortTermImportance (STI) values associated with MindAgents, based on the usefulness of MindAgents for achieving system goals. In this chapter, we will not enter into such cognitive matters, but will merely discuss the mechanics by which these STI values are used to control processor allocation to MindAgents.

Each instance of a MindAgent has its own AttentionValue, which is used to schedule processor time within the Unit. That scheduling is done by a Scheduler object which controls a collection of worker threads, whose size is a system parameter. The Scheduler aims to allocate worker threads to the MindAgents in a way that's roughly proportional to their STI, but it needs to account for starvation, as well as the need to process the Tasks in the task queue.

This is an area in which we can safely borrow from reasonably mature computer science research. The requirements of cognitive dynamics scheduling are far from unique, so this is not a topic where new ideas need to be invented for OpenCog; rather, designs need to be crafted meeting CogPrime's specific requirements based on state-of-the-art knowledge and experience.

One example scheduler design has two important inputs: the STI associated with each MindAgent, and a parameter determining how much of the resources should go to the MindAgents versus the Task queue. In the CogPrime implementation, the Scheduler maps the MindAgent STIs to a set of priority queues, and each queue is run a number of times per cycle. Ideally one wants to keep the number of queues small, and rely on multiple Units and the OS-level scheduler to handle widely different priority levels. When the importance of a MindAgent changes, one just has to reassign it to a new queue, which is a cheap operation that can be done between cycles. MindAgent insertions and removals are handled similarly.

Finally, Task execution is currently handled via allocating a certain fixed percentage of processor time, each cycle, to executing the top Tasks on the queue. Adaptation of this percentage may be valuable in the long term, but has not yet been implemented.
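Here is a minimal sketch, in Python with invented names, of the example scheduler design just described: MindAgent STIs are bucketed into a small set of priority queues, higher-priority queues are run more times per cycle, and a fixed budget of each cycle goes to the top Tasks. It assumes agents expose the run_one_step method and a Task queue like those in the earlier sketches.

class UnitScheduler:
    def __init__(self, runs_per_queue=(4, 2, 1), tasks_per_cycle=5):
        self.queues = [[] for _ in runs_per_queue]   # queue 0 = highest priority
        self.runs_per_queue = runs_per_queue
        self.tasks_per_cycle = tasks_per_cycle       # fixed Task budget per cycle

    def assign(self, agent):
        # cheap STI -> queue mapping; reassignment happens between cycles
        idx = min(int((1.0 - agent.sti) * len(self.queues)),
                  len(self.queues) - 1)
        self.queues[idx].append(agent)

    def run_cycle(self, task_queue):
        for queue, runs in zip(self.queues, self.runs_per_queue):
            for _ in range(runs):
                for agent in queue:
                    agent.run_one_step()
        for _ in range(min(self.tasks_per_cycle, len(task_queue))):
            agent_name, payload = task_queue.pop_most_important()
            # dispatch the one-off request to its target MindAgent here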
Control processes are also implemented as MindAgents, and processed in the same way as the other kinds of CIM-Dynamics, although they tend to have fairly low importance.

19.4.5 The Cognitive Cycle

We have mentioned the concept of a "cycle" in the discussion about scheduling, without explaining what we mean. Let's address that now. All the Units in a CogPrime cluster are kept in sync by a global cognitive cycle, whose purpose is described in Section II.

We mentioned above that each machine in the CogPrime cluster has a housekeeping process. One of its tasks is to keep track of the cognitive cycle, broadcasting when the machine has finished its cycle, and listening to similar broadcasts from its counterparts in the cluster. When all the machines have completed a cycle, a global counter is updated, and each machine is then free to begin the next cycle.

One potential annoyance with this global cognitive cycle is that some machines may complete their cycle much faster than others, and then sit idle while the stragglers finish their jobs. CogPrime addresses this issue in two ways:

• Over the long run, a load balancing process will reassign MindAgents from overburdened machines to underutilized ones. The MindAgent migration process is described in the next section.
• Over a shorter time horizon, during which a machine's configuration is fixed, there are two heuristics to minimize the waste of processor time without breaking the overall cognitive cycle coordination:
  - The Task queue in each of the machine's Units can be processed more extensively than it would be by default; in extreme cases, the machine can go through the whole queue.
  - Background-process MindAgents can be given extra activations, as their activity is unlikely to throw the system out of sync, unlike that of more focused and goal-oriented processes.

Both heuristics are implemented by the scheduler inside each Unit, which has one boolean trigger for each heuristic. The triggers are set by the housekeeping process when it observes that the machine has been frequently idle over the recent past, and reset if the situation changes.

19.5 Distributed AtomSpace and Cognitive Dynamics

As hinted above, realistic CogPrime deployments will be spread across reasonably large clusters of co-located machines. This section describes how this distributed deployment scenario is planned for in the design of the AtomSpace and the MindAgents, and how the cognitive dynamics take place in such a scenario. We won't review the standard principles of distributed computing here, but will focus on specific issues that arise when CogPrime is spread across a relatively large number of machines. The two key issues that need to be handled are:

• How to distribute knowledge (i.e., the AtomSpace) in a way that doesn't impose a large performance penalty?
• How to allocate resources (i.e., machines) to the different cognitive processes (MindAgents) in a way that's flexible and dynamic?

19.5.1 Distributing the AtomSpace

The design of a distributed AtomSpace was guided by the following high-level requirements:

1. Scale up, transparently, to clusters of dozens to hundreds of machines, without requiring a single central master server.
2. The ability to store portions of an Atom repository on a number of machines in a cluster, where each machine also runs some MindAgents.
The distribution of Atoms across the machines should benefit from the fact that the cognitive processes on one machine are likely to access local Atoms more often than remote ones.
3. Provide transparent access to all Atoms in RAM to all machines in the cluster, even if at different latency and performance levels.
4. For local access to Atoms in the same machine, performance should be as close as possible to what one would have in a similar, but non-distributed, AtomSpace.
5. Allow multiple copies of the same Atom to exist in different machines of the cluster, but only one copy per machine.
6. As Atoms are updated fairly often by cognitive dynamics, provide a mechanism for eventual consistency. This mechanism needs not only to propagate changes to the Atoms, but sometimes to reconcile incompatible changes, such as when two cognitive processes update an Atom's TruthValue in opposite ways. Consistency is less important than efficiency, but should be guaranteed eventually.
7. Resolution of inconsistencies should be guided by the importance of the Atoms involved, so the more important ones are resolved more quickly.
8. System configuration can explicitly order the placement of some Atoms on specific machines, and mark a subset of those Atoms as immovable, which should ensure that local copies are always kept.
9. Atom placement across machines, aside from the immovable Atoms, should be dynamic, rebalancing based on the frequency of access to the Atom by the different machines.

The first requirement follows obviously from our estimates of how many machines CogPrime will require to display advanced intelligence.

The second requirement above means that we don't have two kinds of machines in the cluster, where some are processing servers and some are database servers. Rather, we prefer each machine to store some knowledge and host some processes acting on that knowledge. This design assumes that there are simple heuristic ways to partition the knowledge across the machines, resulting in allocations that, most of the time, give the MindAgents local access to the Atoms they need most often. Alas, there will always be some cases in which a MindAgent needs an Atom that isn't available locally. In order to keep the design of the MindAgents simple, this leads to the third requirement, transparency, and to the fourth one, performance.

This partition design, on the other hand, means that there must be some replication of knowledge, as there will always be some Atoms that are needed often by MindAgents on different machines. This leads to requirement five (allow redundant copies of an Atom). However, as MindAgents frequently update the mutable components of Atoms, requirements six and seven are needed, to minimize the impact of conflicts on system performance while striving to guarantee that conflicts are eventually resolved, with priority proportional to the importance of the impacted Atoms.

19.5.1.1 Mechanisms of Managing Distributed Atomspaces

When one digs into the details of distributed AtomSpaces, a number of subtleties emerge. Going into these in full detail here would not be appropriate, but we will make a few comments, just to give a flavor of the sorts of issues involved.

To discuss these issues clearly, some special terminology is useful. In this context, it is useful to reserve the word "Atom" for its pure, theoretical definition, viz: "a Node is uniquely determined
by its name. A Link is uniquely determined by its outgoing set". Atoms sitting in RAM may then be called "Realized Atoms". Thus, given a single, pure "abstract/theoretical" Atom, there might be two different Realized Atoms, on two different servers, having the same name/outgoing-set. It's OK to think of a RealizedAtom as a clone of the pure, abstract Atom, and to talk about it that way. Analogously, we might call Atoms living on disk, or flying on a wire, "Serialized Atoms"; and, when need be, use specialized terms like "ZMQ-serialized Atoms", "BerkeleyDB-serialized Atoms", etc.

An important and obvious coherency requirement is: "If a MindAgent asks for the Handle of an Atom at time A, and then asks, later on, for the Handle of the same Atom, it should receive the same Handle."

By the "AtomSpace", in general, we mean the container(s) that are used to store the set of Atoms used in an OpenCog system, both in RAM and on disk. In the case of an AtomSpace that is distributed across multiple machines or other data stores, we will call each of these an "AtomSpace portion".

Atoms and Handles

Each OpenCog Atom is associated with a Handle object, which is used to identify the Atom uniquely. The Handle is a sort of "key" used, at the infrastructure level, to compactly identify the Atom. In a single-machine, non-distributed AtomSpace, one can effectively just use long ints as Handles, and assign successive ints as Handles to successively created new Atoms. In a distributed AtomSpace, it's a little subtler.

Perhaps the cleanest approach in this case is to use a hash of the serialized Atom data as the Handle for an Atom. That way, if an Atom is created in any portion, it will inherently have the same Handle as any of its clones. The issue of Handle collisions then occurs: it is possible, though rare, that two different Atoms will be assigned the same Handle via the hashing function. This situation can be identified by checking, when an Atom is imported into a portion, whether there is already some Atom in that portion with the same Handle but different fundamental aspects. In the rare occasion where this situation does occur, one of the Atoms must then have its Handle changed. Changing an Atom's Handle everywhere it's referenced in RAM is not a big deal, so long as it only happens occasionally. However, some sort of global record of Handle changes should be kept, to avoid confusion in the process of loading saved Atoms from disk. If a loaded AtomSpace contains Atoms that have changed Handle since the file was saved, the Atom loading process needs to know about this.

The standard mathematics of hash-function collisions (the "birthday problem") shows that if one has a space of H possible Handles, one will get two Atoms with the same Handle after about 1.25 × √H tries, on average. Rearranging, to accommodate N Atoms while expecting only around one collision, one needs a Handle space of around N² possible Handles. The number of bits needed to encode N² is twice as many as the number needed to encode N. So, if one wants to minimize collisions, one may need to make Handles twice as long, thus taking up more memory. However, this memory cost can be palliated via introducing "local Handles" separate from the global, system-wide Handles. The local Handles are used internally within each local AtomSpace, and each local AtomSpace contains a translation table going back and forth between local and global Handles. Local Handles may be long ints, allocated sequentially to each new Atom entered into a portion. Persistence to disk would always use the global Handles.
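Below is a minimal sketch, in Python with invented names, of both mechanisms: hash-derived global Handles (so clones on different machines agree on the Handle), and a per-portion translation table to short local Handles; the memory tradeoff between the two is analyzed next.

import hashlib, itertools

def global_handle(atom_type, name=None, outgoing=()):
    # hash of the Atom's identifying data: type + name for Nodes,
    # type + outgoing set for Links; truncated here to 64 bits
    ident = "|".join([atom_type, name or ""] + [str(h) for h in outgoing])
    return int.from_bytes(hashlib.sha256(ident.encode()).digest()[:8], "big")

class HandleTable:
    # per-portion table translating local Handles <-> global Handles
    def __init__(self):
        self._next_local = itertools.count(1)
        self._local_of, self._global_of = {}, {}

    def localize(self, ghandle):
        if ghandle not in self._local_of:
            local = next(self._next_local)   # sequential long-int local Handle
            self._local_of[ghandle] = local
            self._global_of[local] = ghandle
        return self._local_of[ghandle]

    def globalize(self, local):
        return self._global_of[local]

# Two clones of the same pure Atom, created on different servers, agree:
g = global_handle("ConceptNode", "cat")
assert g == global_handle("ConceptNode", "cat")
table = HandleTable()
assert table.globalize(table.localize(g)) == g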
To understand the memory tradeoffs involved in these solutions, assume that the global Handles are k times as long as the local Handles, and suppose that the average Handle occurs r times in the local AtomSpace. Then the memory inflation ratio of the local/global solution, as opposed to a solution using only the shorter local Handles, would be

(1 + k + r)/r = 1 + (k + 1)/r

If k = 2 and r = 10 (each Handle is used 10 times on average, which is realistic based on current real-world OpenCog AtomSpaces), then the ratio is just 1.3x - suggesting that using hash codes for global Handles, and local Handles to save memory in each local AtomSpace, is acceptable memory-wise.

19.5.1.2 Distribution of Atoms

Given the goal of maximizing the probability that an Atom will be local to the machines of the MindAgents that need it, the two big decisions are how to allocate Atoms to machines, and then how to reconcile the results of MindAgents acting on those Atoms.

The initial allocation of Atoms to machines may be done via explicit system configuration, for Atoms known to have different levels of importance to specific MindAgents. That is, after all, how MindAgents are initially allocated to machines as well. One may, for instance, create a CogPrime cluster where one machine (or group) focuses on visual perception, one focuses on language processing, one focuses on abstract reasoning, etc. In that case one can hard-wire the location of Atoms.

What if one wants to have three abstract-reasoning machines in one's cluster? Then one can define an abstract-reasoning zone consisting of three Atom repository portions. One can hard-wire that Atoms created by MindAgents in the zone must always remain in that zone - but they can potentially be moved among different portions within that zone, as well as replicated across two or all of the machines, if need be. By default they would still initially be placed in the same portion as the MindAgent that created them.

However Atoms are initially placed in portions, sometimes it will be appropriate to move them. And sometimes it will be appropriate to clone an Atom, so there's a copy of it in a different portion from where it already exists. Various algorithms could work for this, but the following is one simple mechanism:

• When an Atom A in machine M1 is requested by a MindAgent in machine M2, then a clone of A is temporarily created in M2.
• When an Atom is forgotten (due to low LTI), a check is made as to whether it has any clones, and any links to it are changed into links to its clones.
• The LTI of an Atom may get a boost if that Atom has no clones (the amount of this boost is a parameter that may be adjusted).

19.5.1.3 MindAgents and the Distributed AtomSpace

In the context of a distributed AtomSpace, the interactions between MindAgents and the knowledge store become subtler, as we'll now discuss.

When a MindAgent wants to create an Atom, it will make this request of the local AtomSpace process, which hosts a subset of the whole AtomSpace. It can, on Atom creation, specify whether the Atom is immovable or not.
In the former case, it will initially only be accessible by the MindAgents in the local machine.

The process of assigning the new Atom an AtomHandle needs to be taken care of in a way that doesn't introduce a central master. One way to achieve that is to make Handles hierarchical, so the higher-order bits indicate the machine. This, however, means that AtomHandles are no longer immutable. A better idea is to automatically allocate a subset of the AtomHandle universe to each machine. The initial use of those AtomHandles is the privilege of that machine but, as Atoms migrate or are cloned, the Handles can move through the cluster.

When a MindAgent wants to retrieve one or more Atoms, it will perform a query on the local AtomSpace subset, just as it would with a single-machine repository. Along with the regular query parameters, it may specify whether the request should be processed locally only, or globally. Local queries will be fast, but may fail to retrieve the desired Atoms, while global queries may take a while to return. In the approach outlined above for MindAgent dynamics and scheduling, this would just cause the MindAgent to wait until results are available. Queries designed to always return a set of Atoms can have a third mode, which is "prioritize local Atoms". In this case the AtomSpace, when processing a query that looks for Atoms matching a certain pattern, would try to find all local responses before asking other machines.
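A minimal sketch, in Python with invented names, of the three query modes just described; real queries would run against indexed AtomSpace portions rather than plain lists.

def query(pattern, local_portion, remote_portions, mode="local"):
    # mode: "local" (fast, may miss), "global" (complete, may be slow),
    # or "prefer_local" (exhaust local matches before asking other machines)
    local_hits = [a for a in local_portion if pattern(a)]
    if mode == "local":
        return local_hits
    if mode == "prefer_local" and local_hits:
        return local_hits
    remote_hits = [a for portion in remote_portions
                     for a in portion if pattern(a)]
    return local_hits + remote_hits

atoms_here = ["cat", "dog"]
atoms_elsewhere = [["catfish", "carp"]]
print(query(lambda a: a.startswith("ca"), atoms_here, atoms_elsewhere, "global"))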
19.5.1.4 Conflict Resolution

A key design decision when implementing a distributed AtomSpace is the trade-off between consistency and efficiency. There is no universal answer to this conflict, but the usage scenarios for CogPrime, current and planned, tend to fall into the same broad category as far as consistency goes. CogPrime's cognitive processes are relatively indifferent to conflicts and capable of working well with outdated data, especially if the conflicts are temporary. For applications such as the AGI Preschool, it is unlikely that outdated properties of single Atoms will have a large, noticeable impact on the system's behavior; even if that were to happen on rare occasions, this kind of inconsistency is often present in human behavior as well. On the other hand, CogPrime assumes fairly fast access to Atoms by the cognitive processes, so efficiency shouldn't be too heavily penalized.

The robustness against mistakes and the need for performance mean that a distributed AtomSpace should follow the principle of "eventual consistency". This means that conflicts are allowed to arise, and even to persist for a while, but a mechanism is needed to reconcile them. Before describing conflict resolution, which in CogPrime is a bit more complicated than in most applications, we note that there are two kinds of conflicts. The simple one happens when an Atom that exists in multiple machines is modified in one machine, and that change isn't immediately propagated. The less obvious one happens when some process creates a new Atom in its local AtomSpace repository, but that Atom conceptually "already exists" elsewhere in the system. Both scenarios are handled in the same way, and can become complicated when, instead of a single change or creation, one needs to reconcile multiple operations.

The way to handle conflicts is to have a special-purpose control process, a reconciliation MindAgent, with one copy running on each machine in the cluster. This MindAgent keeps track of all recent write operations on that machine (Atom creations or changes). Each time the reconciliation MindAgent is called, it processes a certain number of Atoms in the recent-writes list. It chooses the Atoms to process based on a combination of their STI, LTI and recency of creation/change. Highest priority is given to Atoms with higher STI and LTI that have been around longer. Lowest priority is given to Atoms with low STI or LTI that have been very recently changed - both because they may change again in the very near future, and because they may be forgotten before it's worth resolving any conflicts. This will be the case with most perceptual Atoms, for instance. By tuning how many Atoms this reconciliation MindAgent processes each time it's activated, we can tweak the consistency vs efficiency trade-off.

When the AtomReconciliation agent processes an Atom, what it does is:

• Searches all the machines in the cluster to see if there are other equivalent Atoms (for Nodes, these are Atoms with the same name and type; for Links, these are Atoms with the same type and targets).
• If it finds equivalent Atoms, and there are conflicts to be reconciled, such as different TruthValues or AttentionValues, the decision of how to handle the conflicts is made by a special probabilistic reasoning rule, called the Rule of Choice (see Chapter 34). Basically, this means:
  - It decides whether to merge the conflicting Atoms. We always merge Links, but some Nodes may have different semantics, such as Nodes representing different procedures that have been given the same name.
  - In the case that the two Atoms A and B should be merged, it creates a new Atom C that has all the same immutable properties as A and B. It merges their TruthValues according to the probabilistic revision rule (see Chapter 34). The AttentionValues are merged by prioritizing the higher importances.
  - In the case that two Nodes should be allowed to remain separate, it allocates one of them (say, B) a new name. Optionally, it also evaluates whether a SimilarityLink should be created between the two different Nodes.

Another use for the reconciliation MindAgent is maintaining approximate consistency between clones, which can be created by the AtomSpace itself, as described above in Subsection 19.5.1.2. When the system knows about the multiple clones of an Atom, it keeps note of these versions in a list, which is processed periodically by a conflict resolution MindAgent, in order to prevent the clones from drifting too far apart by the actions of local cognitive processes in each machine.
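Here is a minimal sketch in Python, with invented names, of merging two clones' mutable values during reconciliation: TruthValues combined by a simplified, count-weighted form of revision (the actual PLN revision rule and the Rule of Choice, described in Chapter 34, are more sophisticated), and AttentionValues merged by keeping the higher importances, as stated above.

def revise_truth(s1, n1, s2, n2):
    # count-weighted average of strengths; evidence counts add
    n = n1 + n2
    return (n1 * s1 + n2 * s2) / n, n

def merge_attention(sti1, lti1, sti2, lti2):
    # prioritize the higher importances
    return max(sti1, sti2), max(lti1, lti2)

# Example: two clones with strengths .8 (20 observations) and .6 (10):
print(revise_truth(0.8, 20, 0.6, 10))   # -> (0.733..., 30)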
19.5.2 Distributed Processing

The OCF infrastructure as described above already contains a lot of distributed processing implicit in it. However, it doesn't tell you how to make the complex cognitive processes that are part of the CogPrime design distributed unto themselves - say, how to make PLN or MOSES themselves distributed. This turns out to be quite possible, but becomes quite intricate and specific depending on the particular algorithms involved. For instance, the current MOSES implementation is highly amenable to distributed and multiprocessor implementation, but in a way that depends subtly on the specifics of MOSES and has little to do with the role of MOSES in CogPrime as a whole. So we will not delve into these topics here.

Another possibility worth mentioning is broadly distributed processing, in which CogPrime intelligence is spread across thousands or millions of relatively weak machines networked via the Internet. Even if none of these machines is exclusively devoted to CogPrime, the total processing power may be massive, and massively valuable. The use of this kind of broadly distributed computing resource to help CogPrime is quite possible, but involves numerous additional control problems which we will not address here.

A simple case is massive global distribution of MOSES fitness evaluation. In the case where fitness evaluation is isolated and depends only on local data, this is extremely straightforward. In the more general case where fitness evaluation depends on knowledge stored in a large AtomSpace, it requires a subtler design, wherein each globally distributed MOSES subpopulation contains a pool of largely similar genotypes and a cache of relevant parts of the AtomSpace, which is continually refreshed during the fitness evaluation process. This can work so long as each globally distributed lobe has a reasonably reliable high-bandwidth, low-latency connection to a machine containing a large AtomSpace.

On the more mundane topic of distributed processing within the main CogPrime cluster, three points are worth discussing:

• Distributed communication and coordination between MindAgents.
• Allocation of machines to functional groups, and MindAgent migration.
• Machines entering and leaving the cluster.

19.5.2.1 Distributed Communication and Coordination

Communications between MindAgents, Units and other CogPrime components are handled by a message queue subsystem. This subsystem provides a unified API, so the agents involved are unaware of the location of their partners: distributed messages, inter-process messages in the same machine, and intra-process messages in the same Unit are sent through the same API, and delivered to the same target queues. This design enables transparent distribution of MindAgents and other components.

In the simplest case, of MindAgents within the same Unit, messages are delivered almost immediately, and will be available for processing by the target agent the next time it's enacted by the scheduler. In the case of messages sent to other Units or other machines, they're delivered to the messaging subsystem component of that Unit, which has a dedicated thread for message delivery. That subsystem is scheduled for processing just like any other control process, although it tends to have a reasonably high importance, to ensure speedy delivery.

The same messaging API and subsystem is used for control-level communications, such as the coordination of the global cognitive cycle. The cognitive cycle completion message can be used to carry other housekeeping content as well.
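The following is a minimal sketch in Python, with invented names, of the location-transparent messaging API: senders address a target agent's queue, and the subsystem decides whether delivery is intra-Unit, inter-process, or across the network.

from collections import deque

class MessageBus:
    def __init__(self):
        self.queues = {}     # agent id -> inbox
        self.location = {}   # agent id -> (machine, unit); used for routing

    def register(self, agent_id, machine, unit):
        self.queues[agent_id] = deque()
        self.location[agent_id] = (machine, unit)

    def send(self, to_agent, message):
        # same call regardless of whether the target is local or remote;
        # a real implementation would pick direct delivery vs. a network hop
        # based on self.location[to_agent]
        self.queues[to_agent].append(message)

    def receive_all(self, agent_id):
        # drained by the agent the next time the scheduler enacts it
        inbox = self.queues[agent_id]
        messages = list(inbox)
        inbox.clear()
        return messages

bus = MessageBus()
bus.register("InferenceAgent", machine="m1", unit="u1")
bus.send("InferenceAgent", "cycle-completed")
print(bus.receive_all("InferenceAgent"))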
19.5.2.2 Functional Groups and MindAgent Migration

A CogPrime cluster is composed of groups of machines dedicated to various high-level cognitive tasks: perception processing, language processing, background reasoning, procedure learning, action selection and execution, goal achievement planning, etc. Each of these high-level tasks will probably require a number of machines, which we call functional groups. Most of the support needed for functional groups is provided transparently by the mechanisms for distributing the AtomSpace and by the communications layer.

The main issue is how many resources (i.e., how many machines) to allocate to each functional group. The initial allocation is determined by human administrators via the system configuration - each machine in the cluster has a local configuration file which tells it exactly which processes to start, along with the collection of MindAgents to be loaded onto each process and their initial AttentionValues.

Over time, however, it may be necessary to modify this allocation, adding machines to overworked or highly important functional groups. For instance, one may add more machines to the natural language and perception processing groups during periods of heavy interaction with humans in the preschool environment, while repurposing those machines to procedure learning and background inference during periods in which the agent controlled by CogPrime is resting or 'sleeping'.

This allocation of machines is driven by attention allocation in much the same way that processor time is allocated to MindAgents. Functional groups can be represented by Atoms, and their importance levels are updated according to the importance of the system's top-level goals, and the usefulness of each functional group to their achievement. Thus, once the agent is engaged by humans, the goals of pleasing them and better understanding them would become highly important, and would thus drive the STI of the language understanding and language generation functional groups.

Once there is an imbalance between a functional group's STI and its share of the machines in the cluster, a control process CIM-Dynamic is triggered to decide how to reconfigure the cluster. This CIM-Dynamic works approximately as follows:

• First, it decides how many extra machines to allocate to each under-represented functional group.
• Then, it ranks the machines not already allocated to those groups based on a combination of their workload and the aggregate STI of their MindAgents and Units. The goal is to identify machines that are both relatively unimportant and working under capacity.
• It will then migrate the MindAgents of those machines to other machines in the same functional group (or just remove them if clones exist), freeing them up.
• Finally, it will decide how best to allocate the new machines to each functional group. This decision is heavily dependent on the nature of the work done by the MindAgents in that group, so in CogPrime these decisions will be somewhat hardcoded, as is the set of functional groups. For instance, background reasoning can be scaled just by adding extra inference MindAgents to the new machines without too much trouble, but communicating with humans requires MindAgents responsible for dialog management, and it doesn't make sense to clone those, so it's better to just give more resources to each MindAgent without increasing their numbers.

The migration of MindAgents becomes, indirectly, a key driver of Atom migration. As MindAgents move or are cloned to new machines, the AtomSpace repository in the source machine should send clones of the Atoms most recently used by these MindAgents to the target machine(s), anticipating a very likely distributed request that would create those clones in the near future anyway. If the MindAgents are moved but not cloned, the local copies of those Atoms in the source machine can then be (locally) forgotten.
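Here is a minimal sketch in Python, with invented names, of the first two steps of the reconfiguration CIM-Dynamic described above: computing how many machines each under-represented group is owed, and ranking donor machines so the least important, least busy ones are freed first.

def machines_owed(groups, total_machines):
    # groups: {name: {"sti": float, "machines": int}} -> {name: deficit}
    total_sti = sum(g["sti"] for g in groups.values())
    owed = {}
    for name, g in groups.items():
        target = round(total_machines * g["sti"] / total_sti)
        if target > g["machines"]:
            owed[name] = target - g["machines"]
    return owed

def donor_ranking(machines):
    # machines: list of (id, workload, aggregate agent STI); free the
    # relatively unimportant, under-capacity machines first
    return sorted(machines, key=lambda m: (m[2], m[1]))

# Example: the reasoning group deserves one more machine than it has.
groups = {"language": {"sti": 0.6, "machines": 3},
          "reasoning": {"sti": 0.4, "machines": 1}}
print(machines_owed(groups, 4))   # -> {'reasoning': 1}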
19.5.2.3 Adding and Removing Machines

Given the support for MindAgent migration and cloning outlined above, the issue of adding new machines to the cluster becomes a specific application of the heuristics just described. When a new machine is added to the cluster, CogPrime initially decides on a functional group for it, based both on the importance of each functional group and on its current performance - if a functional group consistently delays the completion of the cognitive cycle, it should get more machines, for instance. When the machine is added to a functional group, it is then populated with the most important or most resource-starved MindAgents in that group, a decision that is made by economic attention allocation.

Removal of a machine follows a similar process. First the system checks whether the machine can be safely removed from its current functional group, without greatly impacting its performance. If that's the case, the non-cloned MindAgents in that machine are distributed among the remaining machines in the group, following the heuristic described above for migration. Any local-only Atoms in that machine's AtomSpace container are migrated as well, provided their LTI is high enough.

In the situation in which removing a machine M1 would have an intolerable impact on the functional group's performance, a control process selects another functional group to lose a machine M2. Then, the MindAgents and Atoms in M1 are migrated to M2, which goes through the regular removal process first.

In principle, one might use the insertion or removal of machines to perform a global optimization of resource allocation within the system, but that process tends to be much more expensive than the simpler heuristics we just described. We believe these heuristics can give us most of the benefits of global re-allocation at a fraction of the disturbance to the system's overall dynamics during their execution.

Chapter 20
Knowledge Representation Using the Atomspace

20.1 Introduction

CogPrime's knowledge representation must be considered on two levels: implicit and explicit. This chapter considers mainly explicit knowledge representation, with a focus on the representation of declarative knowledge. We will describe the Atom knowledge representation, a generalized hypergraph formalism which comprises a specific vocabulary of Node and Link types, used to represent declarative knowledge but also, to a lesser extent, other types of knowledge as well. Other mechanisms for representing procedural, episodic, attentional, and intentional knowledge will be handled in later chapters, as will the subtleties of implicit knowledge representation.

The AtomSpace Node and Link formalism is the most obviously distinctive aspect of the OpenCog architecture, from the point of view of a software developer building AI processes in the OpenCog framework. And yet the features of CogPrime that are most important, in terms of our theoretical reasons for estimating it likely to succeed as an advanced AGI system, are not really dependent on the particulars of the AtomSpace representation. What's important about the AtomSpace knowledge representation is mainly that it provides a flexible means for compactly representing multiple forms of knowledge, in a way that allows them to interoperate - where by "interoperate" we mean that e.g.
a fragment of a chunk of declarative knowledge can link to a fragment of a chunk of attentional or procedural knowledge; or a chunk of knowledge in one category can overlap with a chunk of knowledge in another category (as when the same link has both a (declarative) truth value and an (attentional) importance value).

In short, any representational infrastructure sufficiently flexible to support

• compact representation of all the key categories of knowledge playing dominant roles in human memory
• the flexible creation of specialized sub-representations for various particular subtypes of knowledge in all these categories, enabling compact and rapidly manipulable expression of knowledge of these subtypes
• the overlap and interlinkage of knowledge of various types, including that represented using specialized sub-representations

will probably be acceptable for CogPrime's purposes. However, precisely formulating these general requirements is tricky, and is significantly more difficult than simply articulating a single acceptable representational scheme, like the current OpenCog Atom formalism. The Atom formalism satisfies the relevant general requirements and has proved workable from a practical software perspective.

In terms of the Mind-World Correspondence Principle introduced in Chapter 10, the important point regarding the Atom representation is that it must be flexible enough to allow the compact and rapidly manipulable representation of knowledge that has aspects spanning the multiple common human knowledge categories, in a manner that allows easy implementation of cognitive processes that will manifest the Mind-World Correspondence Principle in everyday human-like situations. The actual manifestation of mind-world correspondence is the job of the cognitive processes acting on the AtomSpace - the job of the AtomSpace is to be an efficient and flexible enough representation that these cognitive processes can manifest mind-world correspondence in everyday human contexts given highly limited computational resources.

20.2 Denoting Atoms

First we describe the textual notation we'll use to denote various sorts of Atoms throughout the following chapters. The discussion will also serve to give some particular examples of cognitively meaningful Atom constructs.

20.2.1 Meta-Language

As always occurs when discussing (even partially) logic-based systems, when discussing CogPrime there is some potential for confusion between logical relationships inside the system, and logical relationships being used to describe parts of the system. For instance, we can state as observers that two Atoms inside CogPrime are equivalent, and this is different from stating that CogPrime itself contains an Equivalence relation between these two Atoms. Our formal notation needs to reflect this difference.

Since we will not be doing any fancy mathematical analyses of CogPrime structures or dynamics here, there is no need to formally specify the logic being used for the metalanguage. Standard predicate logic may be assumed. So, for example, we will say things like

(IntensionalInheritanceLink Ben monster).TruthValue.strength = .5

This is a metalanguage statement, which means that the strength field of the TruthValue object associated with the link (IntensionalInheritance Ben monster) is equal to .5. This is different from saying
This is different than saying EquivalenceLink ExOutLink GetStrength ExOutLink GetTruthValue IntensionalInheritanceLink Ben monster NumberNode 0.5 EFTA00624175 20.2 Denoting Atoms 29 which refers to an equivalence relation represented inside CogPrime. The former refers to an equals relationship observed by the authors of the book, but perhaps never represented explicitly inside CogPrime. In the first example above we have used the C++ convention structure_variable_name.field_name for denoting elements of composite structures; this convention will be stated formally below. In the second example we have used schema corresponding to TruthValue and Strength; these schema extract the appropriate fields from the Atoms they're applied to, so that e.g. ExOutLink GetTruthValue A returns the number A.TruthValue Following a convention from mathematical logic, we will also sometimes use the special symbol I - to mean "implies in the metalanguage." For example, the first-order PLN deductive inference strength rule may be written InheritanceLink A B csAB> InheritanceLink B C <BBC> i- InheritanceLink A C <sAC> where sAC sAB sBC + (1-sAB) ( sC - se sBC ) / (1- se ) This is different from saying ForAll SA, SB, SC, SsAB, Ss8C, SsAC ExtensionalImplicationLink_HOJ AND InheritanceLink SA $8 <$sAE> InheritanceLink SB SC <$sBC> AND InheritanceLink SA SC <$sAC> $sAC SsAB SsBC (1-$2AB) ($2C - Sae $28C) / (1- SsB) which is the most natural representation of the independence-based PLN deduction rule (for strength-only truth values) as a logical statement within CogPrime. In the latter expression the variables $A, $sAB, and so forth represent actual Variable Atoms within CogPrime. In the former expression the variables represent concrete, non-Variable Atoms within CogPrime, which however are being considered as variables within the metalanguage. (As explained in the PLN book, a link labeled with "Hos" refers to a "higher order judgment", meaning a relationship that interprets its relations as entities with particular truth values. For instance, ImplicationLink_HOJ Inh SX stupid <.9> Inh SX rich <.97 EFTA00624176 30 20 Knowledge Representation Using the Atomspace means that if (Init $X stupid) has a strength of .9, then (Inh $X rich) has a strength of .9). WIKISOURCE:AtomNotation 20.2.2 Denoting Atoms Atoms are the basic objects making up CogPrime knowledge. They come in various types, and are associated with various dynamics, which are embodied in MindAgents. Generally speaking Atoms are endowed with TruthValue and AttentionValue objects. They also sometimes have names, and other associated Values as previously discussed. In the following subsections we will explain how these are notated, and then discuss specific notations for Links and Nodes, the two types of Atoms in the system. 20.2.2.1 Names In order to denote an Atom in discussion, we have to call it something. Relatedly but separately, Atoms may also have names within the CogPrime system. (As a matter of implementation, in the current OpenCog version, no Links have names; whereas, all Nodes have names, but some Nodes have a null name, which is conceptually the same as not having a name.) (name,type) pairs mast be considered as unique within each Unit within a OpenCog system, otherwise they can't be used effectively to reference Atoms. 
It's OK if two different OpenCog Units both have SchemaNodes named "+", but not if one OpenCog Unit has two SchemaNodes both named "+" - this latter situation is disallowed on the software level, and is assumed in discussions not to occur.

Some Atoms have natural names. For instance, the SchemaNode corresponding to the elementary schema function + may quite naturally be named "+". The NumberNode corresponding to the number .5 may naturally be named ".5", and the CharacterNode corresponding to the character c may naturally be named "c". These cases are the minority, however. For instance, a SpecificEntityNode representing a particular instance of + has no natural name, nor does a SpecificEntityNode representing a particular instance of c.

Names should not be confused with Handles. Atoms have Handles, which are unique identifiers (in practice, numbers) assigned to them by the OpenCog core system; and these Handles are how Atoms are referenced internally, within OpenCog, nearly all the time. Accessing of Atoms by name is a special case - not all Atoms have names, but all Atoms have Handles. An example of accessing an Atom by name is looking up the CharacterNode representing the letter "c" by its name "c". There would then be two possible representations for the word "cat":

1. this word might be associated with a ListLink - and the ListLink corresponding to "cat" would be a list of the Handles of the nodes named "c", "a", and "t".
2. for expedience, the word might be associated with a WordNode named "cat".

In the case where an Atom has multiple versions - this may happen, for instance, if the Atom is considered in a different context (via a ContextLink) - each version has a VersionHandle, so that accessing an AtomVersion requires specifying an AtomHandle plus a VersionHandle. See Chapter 19 for more information on Handles.

OpenCog never assigns Atoms names on its own; in fact, Atom names are assigned only in the two sorts of cases just mentioned:

1. Via preprocessing of perceptual inputs (e.g. the names of NumberNodes, CharacterNodes)
2. Via hard-wiring of names for SchemaNodes and PredicateNodes corresponding to built-in elementary schema (e.g. +, AND, Say)

If an Atom A has a name n in the system, we may write

A.name = n

On the other hand, if we want to assign an Atom an external name, we may make a metalanguage assertion such as

L1 := (InheritanceLink Ben animal)

indicating that we decided to name that link L1 for our discussions, even though inside OpenCog it has no name. In denoting (nameless) Atoms we may use arbitrary names like L1. This is more convenient than using a Handle-based notation in which Atoms would be referred to as 1, 3433322, etc.; but sometimes we will use the Handle notation as well.

Some ConceptNodes and conceptual PredicateNodes or SchemaNodes may correspond with human-language words or phrases like cat, bite, and so forth. This will be the minority case; more such nodes will correspond to parts of human-language concepts or fuzzy collections of human-language concepts. In discussions in this book, however, we will often invoke the unusual case in which Atoms correspond to individual human-language concepts. This is because such examples are the easiest ones to write about and discuss intuitively. The preponderance of named Atoms in the examples in the book implies no similar preponderance of named Atoms in the real OpenCog system.
It is merely easier to talk about a hypothetical Atom named "cat" than about a hypothetical Atom with Handle 434. It is not impossible that an OpenCog system represents "cat" as a single ConceptNode, but it is just as likely that it will represent "cat" as a map composed of many different nodes without any of these having natural names. Each OpenCog system works out for itself, implicitly, which concepts to represent as single Atoms and which in distributed fashion.

For another example,

ListLink
   CharacterNode "c"
   CharacterNode "a"
   CharacterNode "t"

corresponds to the character string ("c", "a", "t") and would naturally be named using the string cat. In the system itself, however, this ListLink need not have any name.

20.2.2.2 Types

Atoms also have types. When it is necessary to explicitly indicate the type of an Atom, we will use the keyword Type, as in

A.Type = InheritanceLink
N_345.Type = ConceptNode

On the other hand, there is also a built-in schema HasType, which lets us say

EvaluationLink HasType A InheritanceLink
EvaluationLink HasType N_345 ConceptNode

This covers the case in which type evaluation occurs explicitly in the system, which is useful if the system is analyzing its own emergent structures and dynamics.

Another option currently implemented in OpenCog is to explicitly restrict the type of a variable using TypedVariableLink, as follows:

TypedVariableLink
   VariableNode $X
   VariableTypeNode "ConceptNode"

Note also that we will frequently remove the suffix Link or Node from a type name, writing e.g.

Inheritance
   Concept A
   Concept B

instead of

InheritanceLink
   ConceptNode A
   ConceptNode B

20.2.2.3 Truth Values

The truth value of an Atom is a bundle of information describing how true the Atom is, in one of several different senses depending on the Atom type. It is encased in a TruthValue object associated with the Atom. Most of the time, we will denote the truth value of an Atom in <>'s following the expression denoting the Atom. This very handy notation may be used in several different ways.

A complication is that some Atoms may have CompositeTruthValues, which consist of different estimates of their truth value made by different sources, which for whatever reason have not been reconciled (maybe no process has gotten around to reconciling them, maybe they correspond to different truth values in different contexts and thus logically need to remain separate, maybe their reconciliation is being delayed pending accumulation of more evidence, etc.). In this case we can still assume that an Atom has a default truth value, which corresponds to the highest-confidence truth value that it has, in the Universal Context.

Most frequently, the notation is used with a single number in the brackets, e.g.

A <.4>

to indicate that the Atom A has truth value .4; or

IntensionalInheritanceLink Ben monster <.5>

to indicate that the IntensionalInheritance relation between Ben and monster has truth value strength .5. In this case, <tv> indicates (roughly speaking) that the truth value of the Atom in question involves a probability distribution with a mean of tv. The precise semantics of the strength values associated with OpenCog Atoms is described in Probabilistic Logic Networks (see Chapter 34). Please note, though: this notation does not imply that the only data retained in the system about the distribution is the single number .5.
If we want to refer to the truth value of an Atom A in the context C, we can use the construct

ContextLink <truth value>
   C
   A

Sometimes, Atoms in OpenCog are labeled with two truth value components as defined by PLN: strength and weight-of-evidence. To denote these two components, we might write

IntensionalInheritanceLink Ben scary <.9,.1>

indicating that there is a relatively small amount of evidence in favor of the proposition that Ben is very scary. We may also put the TruthValue indicator in a different place, e.g. using indent notation,

IntensionalInheritanceLink <.9,.1>
   Ben
   scary

This is mostly useful when dealing with long and complicated constructions.

If we want to denote a composite truth value (whose components correspond to different "versions" of the Atom), we can use a list notation, e.g.

IntensionalInheritance (<.9,.1>, <.5,.9>(h,123), <.6,.7>(c,655))
   Ben
   scary

where e.g.

<.5,.9>(h,123)

denotes the TruthValue version of the Atom indexed by Handle 123. The h denotes that the AtomVersion indicated by the VersionHandle (h,123) is a Hypothetical Atom, in the sense described in the PLN book. Some versions may not have any index Handles.

The semantics of composite TruthValues are described in the PLN book, but roughly they are as follows. Any version not indexed by a VersionHandle is a "primary TruthValue" that gives the truth value of the Atom based on some body of evidence. A version indexed by a VersionHandle is either contextual or hypothetical, as indicated notationally by the c or h in its VersionHandle. So, for instance, if a TruthValue version for Atom A has VersionHandle (h,123), that means it denotes the truth value of Atom A under the hypothetical context represented by the Atom with Handle 123. If a TruthValue version for Atom A has VersionHandle (c,655), this means it denotes the truth value of Atom A in the context represented by the Atom with Handle 655.

Alternately, truth values may sometimes be expressed in <L,U,b> or <L,U,b,N> format, defined in terms of indefinite probability theory as presented in the PLN book and recalled in Chapter 34. For instance,

IntensionalInheritanceLink Ben scary <.7,.9,.8,20>

has the semantics that there is an estimated 80% chance that, after 20 more observations have been made, the estimated strength of the link will lie in the interval (.7,.9).

The notation may also be used to specify a TruthValue probability distribution, e.g.

A <g(5,7,12)>

would indicate that the truth value of A is given by distribution g with parameters (5,7,12); or

A <M>

where M is a table of numbers, would indicate that the truth value of A is approximated by the table M.

The <> notation for truth value is an unabashedly incomplete and ambiguous notation, but it is very convenient. If we want to specify, say, that the truth value strength of IntensionalInheritanceLink Ben monster is in fact the number .5, and no other truth value information is retained in the system, then we need to say

(IntensionalInheritance Ben monster).TruthValue = [(strength, .5)]

(where a hashtable form is assumed for TruthValue objects, i.e. a list of name-value pairs). But this kind of issue will rarely arise here, and the <> notation will serve us well.
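The composite-truth-value bookkeeping just described can be illustrated with a small Python sketch. The class names and the default-selection rule shown here are simplifications assumed for illustration; the actual PLN TruthValue machinery is richer.

from dataclasses import dataclass

@dataclass(frozen=True)
class SimpleTruthValue:
    # A <strength, weight-of-evidence> pair, as in <.9,.1> above.
    strength: float      # roughly, the mean of a probability distribution
    confidence: float    # weight of evidence, in [0, 1]

class CompositeTruthValue:
    # Several unreconciled TV versions; the default is taken to be the
    # highest-confidence version carrying no VersionHandle (Universal Context).
    def __init__(self):
        self.versions = {}   # key: None, or ('h'|'c', context_handle)

    def set(self, tv, version_handle=None):
        self.versions[version_handle] = tv

    def default(self):
        primary = [tv for k, tv in self.versions.items() if k is None] or \
                  list(self.versions.values())
        return max(primary, key=lambda tv: tv.confidence)

tv = CompositeTruthValue()
tv.set(SimpleTruthValue(.9, .1))                  # primary version
tv.set(SimpleTruthValue(.5, .9), ('h', 123))      # hypothetical, context Handle 123
tv.set(SimpleTruthValue(.6, .7), ('c', 655))      # contextual, context Handle 655
print(tv.default())   # SimpleTruthValue(strength=0.9, confidence=0.1)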
20.2.2.4 Attention Values

The AttentionValue object associated with an Atom does not need to be notated nearly as often as the truth value. When it does, however, we can use similar notational methods.

AttentionValues may have several components, but the two critical ones are called short-term importance (STI) and long-term importance (LTI). Furthermore, multiple STI values are retained: for each (Atom, MindAgent) pair there may be a MindAgent-specific STI value for that Atom. The pragmatic import of these values will become clear in a later chapter when we discuss attention allocation. Roughly speaking, the long-term importance is used to control memory usage: when memory gets scarce, the Atoms with the lowest LTI values are removed. On the other hand, the short-term importance is used to control processor time allocation: MindAgents, when they decide which Atoms to act on, will generally (though not exclusively) choose the ones that have proved most useful to them in the recent past, and additionally those that have been useful for other MindAgents in the recent past.

We will use the double bracket <<>> to denote attention value (in the rare cases where such denotation is necessary). So, for instance,

Cow_7 <<.5>>

will mean the node Cow_7 has an importance of .5; whereas

Cow_7 <<STI = .1, LTI = .8>>

or simply

Cow_7 <<.1, .8>>

will mean the node Cow_7 has short-term importance .1 and long-term importance .8. Of course, we can also use the style

(IntensionalInheritanceLink Ben monster).AttentionValue = [(STI, .1), (LTI, .8)]

where appropriate.
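The two roles - LTI for memory retention, STI for processing priority - are easy to illustrate with a toy Python sketch. The structures and the forgetting policy below are simplified stand-ins invented for this example, not the actual OpenCog attention allocation code.

from dataclasses import dataclass, field

@dataclass
class AttentionValue:
    sti: float = 0.0                      # short-term importance: processor time
    lti: float = 0.0                      # long-term importance: memory retention
    agent_sti: dict = field(default_factory=dict)  # optional per-MindAgent STI values

def forget(named_avs, max_atoms):
    # When memory is scarce, drop the lowest-LTI Atoms first (toy policy).
    survivors = sorted(named_avs, key=lambda p: p[1].lti, reverse=True)[:max_atoms]
    return dict(survivors)

atoms = {
    "Cow_7":  AttentionValue(sti=.1, lti=.8),
    "Junk_3": AttentionValue(sti=.0, lti=.05),
    "Ben":    AttentionValue(sti=.5, lti=.6),
}
atoms = forget(atoms.items(), max_atoms=2)
print(sorted(atoms))   # ['Ben', 'Cow_7'] - Junk_3 had the lowest LTI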
20.2.2.5 Links

Links are represented using a simple notation that has already occurred many times in this book. For instance,

Inheritance A B
Similarity A B

Note that the symmetry or otherwise of a link is not implicit in this notation: SimilarityLinks are symmetrical, InheritanceLinks are not. When this distinction is necessary, it will be explicitly made.

20.3 Representing Functions and Predicates

SchemaNodes and PredicateNodes contain functions internally; and Links may also usefully be considered as functions. We now briefly discuss the representations and notations we will use to indicate functions in various contexts.

Firstly, we will make some use of the currying notation drawn from combinatory logic, in which adjacency indicates function application. So, for instance, using currying, f x means the function f evaluated at the argument x; and (f x y) means (f(x))(y). If we want to specify explicitly that a block of terminology is being specified using currying, we will use the notation @[expression]; for instance, @[f x y z] means ((f(x))(y))(z).

We will also frequently use conventional notation to refer to functions, such as f(x,y). Of course, this is consistent with the currying convention if (x,y) is interpreted as a list and f is then a function that acts on 2-element lists. We will have many other occasions than this to use list notation. Also, we will sometimes use a non-curried notation, most commonly with Links, so that e.g.

InheritanceLink x y

does not mean a curried evaluation but rather means InheritanceLink(x,y).

20.3.0.6 Execution Output Links

In the case where f refers to a schema, the occurrence of the combination f x in the system is represented by

ExOutLink f x

or graphically, as a tree with the ExOutLink as the parent of f and x. Note that, just as when we write f (g x) we mean to apply f to the result of applying g to x, similarly when we write

ExOutLink f (ExOutLink g x)

we mean the same thing. So, for instance,

EvaluationLink (ExOutLink g x) y <.8>

means that the result of applying g to x is a predicate r, so that r(y) evaluates to True with strength .8.

This approach, in its purest incarnation, does not allow multi-argument schemata. Now, multi-argument schemata are never actually necessary, because one can use argument currying to simulate multiple arguments. However, this is often awkward, and things become simpler if one introduces an explicit tupling operator, which we call ListLink. Simply enough,

ListLink A1 ... An

denotes the ordered list (A1, ..., An).

20.3.1 Execution Links

ExecutionLinks give the system an easy way to record acts of schema execution. These are ternary links of the form

ExecutionLink S A B

where S is a SchemaNode and A and B are Atoms. In words, this says: the procedure represented by SchemaNode S has taken input A and produced output B. There may also be schemata that produce no output, or take no input. But these are treated as PredicateNodes, to be discussed below; their activity is recorded by EvaluationLinks, not ExecutionLinks.

The TruthValue of an ExecutionLink records how frequently the result encoded in the ExecutionLink occurs. Specifically,

• the TruthValue of (ExecutionLink S A B) tells you the probability of getting B as output, given that you have run schema S on input A;
• the TruthValue of (ExecutionLink S A) tells you the probability that, if S is run, it is run on input A.

Often it is useful to record the time at which a given act of schema execution was carried out; in that case one uses the atTime link, writing e.g.

atTimeLink
   T
   ExecutionLink S A B

where T is a TimeNode, or else one uses an implicit method such as storing the time-stamp of the ExecutionLink in a core-level data structure called the TimeServer. The implicit method is logically equivalent to explicitly using atTime, and is treated the same way by PLN inference, but provides significant advantages in terms of memory usage and lookup speed.

For purposes of logically reasoning about schema, it is useful to create binary links representing ExecutionLinks with some of their arguments fixed. We name these as follows:

ExecutionLink1 A B means: X so that ExecutionLink X A B
ExecutionLink2 A B means: X so that ExecutionLink A X B
ExecutionLink3 A B means: X so that ExecutionLink A B X

Finally, a SchemaNode may be associated with a structure called a Graph. Where S is a SchemaNode,

Graph(S) = { (x,y) : ExecutionLink S x y }

Sometimes the graph of a SchemaNode may be explicitly embodied as a ConceptNode; other times, it may be constructed implicitly by a MindAgent in analyzing the SchemaNode (e.g. the inference MindAgent).

Note that the set of ExecutionLinks describing a SchemaNode may not define that SchemaNode exactly, because some of them may be derived by inference. This means that the model of a SchemaNode contained in its ExecutionLinks may not actually be a mathematical function, in the sense of assigning only one output to each input. One may have

ExecutionLink S X A <.5>
ExecutionLink S X B <.5>

meaning that the system does not know whether S(X) evaluates to A or to B. So the set of ExecutionLinks modeling a SchemaNode may constitute a non-function relation, even if the schema inside the SchemaNode is a function.
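The following Python sketch shows how a store of ExecutionLink-style records yields both the frequency-style TruthValues and the Graph(S) relation just described. It is a toy model: the actual system stores these as Atoms referenced by Handles, not as Python tuples.

from collections import defaultdict

class ExecutionRecorder:
    # Toy record of (schema, input, output) triples with frequency-style TVs.
    def __init__(self):
        self.links = defaultdict(int)   # (schema, inp, out) -> count
        self.runs = defaultdict(int)    # (schema, inp) -> count

    def record(self, schema, inp, out):
        self.links[(schema, inp, out)] += 1
        self.runs[(schema, inp)] += 1

    def tv(self, schema, inp, out):
        # P(output = out | schema run on inp): the TV of (ExecutionLink S A B)
        return self.links[(schema, inp, out)] / self.runs[(schema, inp)]

    def graph(self, schema):
        # Graph(S) = { (x, y) : ExecutionLink S x y } over the recorded history
        return {(i, o) for (s, i, o) in self.links if s == schema}

rec = ExecutionRecorder()
rec.record("round_up", 2.3, 3)
rec.record("round_up", 2.3, 3)
rec.record("round_up", 2.3, 2)   # e.g. an uncertain, inference-derived record
print(rec.tv("round_up", 2.3, 3))   # 0.666... - a non-function relation
print(rec.graph("round_up"))        # {(2.3, 3), (2.3, 2)}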
Finally, what of the case where f x represents the action of a built-in system function f on an argument x? This is an awkward case, which would not be necessary if the CogPrime system were revised so that all cognitive functions were carried out using SchemaNodes. However, in the current CogPrime version, where most cognitive functions are carried out using C++ MindAgent objects, if we want CogPrime to study its own cognitive behavior in a statistical way, we need BuiltInSchemaNodes that refer to MindAgents rather than to ComboTrees (or else we need to represent MindAgents using ComboTrees, which will become practicable once we have a sufficiently efficient Combo interpreter). The semantics here is thus basically the same as where f refers to a schema. For instance, we might have

ExecutionLink FirstOrderInferenceMindAgent (L1, L2) L3

where L1, L2 and L3 are links such that L3 is derived from L1 and L2 according to the first-order PLN deduction rules.

20.3.1.1 Predicates

Predicates are related but not identical to schema, both conceptually and notationally. PredicateNodes involve predicate schema which output TruthValue objects. But there is a difference between a SchemaNode embodying a predicate schema and a PredicateNode: a PredicateNode doesn't output a TruthValue, it adjusts its own TruthValue as a result of the output of its own internal predicate schema.

The record of the activity of a PredicateNode is given not by an ExecutionLink but rather by an

EvaluationLink P A <tv>

where P is a PredicateNode, A is its input, and <tv> is the truth value assumed by the EvaluationLink corresponding to the PredicateNode being fed the input A. There is also the variant

EvaluationLink P <tv>

for the case where the PredicateNode P embodies a schema that takes no inputs.¹

A simple example of a PredicateNode is the predicate GreaterThan. In this case we have, for instance,

EvaluationLink GreaterThan (5, 6) <0>
EvaluationLink GreaterThan (5, 3) <1>

and we also have

EquivalenceLink
   GreaterThan
   ExOutLink
      And
      ListLink
         ExOutLink Not LessThan
         ExOutLink Not EqualTo

Note how the variables have been stripped out of the expression; see the PLN book for more explanation of this.

We will also encounter many commonsense-semantics predicates such as isMale, with e.g.

EvaluationLink isMale Ben_Goertzel <1>

Schemata that return no outputs are treated as predicates, and handled using EvaluationLinks. The truth value of such a predicate, as a default, is considered as True if execution is successful, and False otherwise.

And, analogously to the Graph operator for SchemaNodes, we have for PredicateNodes the SatisfyingSet operator, defined so that the SatisfyingSet of a predicate is the set whose members are the elements that satisfy the predicate. Formally, that is,

S = SatisfyingSet P

means

TruthValue(MemberLink X S) equals TruthValue(EvaluationLink P X)

This operator allows the system to carry out advanced logical operations like higher-order inference and unification.

¹ Actually, if P does take some inputs, EvaluationLink P <tv> is defined too, and tv corresponds to the average of P(X) over all inputs X; this is explained in more depth in the PLN book.
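A small Python sketch of the SatisfyingSet semantics follows. It is illustrative only: truth values here are crisp numbers rather than full PLN TruthValue objects, and the function names are invented for the example.

def evaluation_links(predicate, candidates):
    # Toy EvaluationLink table: input -> truth value of (EvaluationLink P input)
    return {x: predicate(x) for x in candidates}

def satisfying_set(predicate, candidates):
    # SatisfyingSet P: membership strength of X equals TruthValue(EvaluationLink P X)
    return {x: tv for x, tv in evaluation_links(predicate, candidates).items() if tv > 0}

# A crisp GreaterThan-style predicate, curried on its second argument:
greater_than_4 = lambda x: 1.0 if x > 4 else 0.0
print(evaluation_links(greater_than_4, [3, 5, 6]))   # {3: 0.0, 5: 1.0, 6: 1.0}
print(satisfying_set(greater_than_4, [3, 5, 6]))     # {5: 1.0, 6: 1.0}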
20.3.2 Denoting Schema and Predicate Variables

CogPrime sometimes uses variables to represent the expressions inside schemata and predicates, and sometimes uses variable-free, combinatory-logic-based representations. It is thus important to distinguish between two sorts of variables that may exist in CogPrime, either inside compound schemata or predicates, or in the AtomSpace as VariableNodes:

• Variable Atoms, which may be quantified (bound to existential or universal quantifiers) or unquantified;
• variables that are used solely as function arguments or local variables inside the "Combo tree" structures used inside some ProcedureNodes (PredicateNodes or SchemaNodes), to be described below, and are not related to Variable Atoms.

Examples of quantified variables represented by Variable Atoms are $X and $Y in

ForAll $X <.0001>
   ExtensionalImplicationLink
      ExtensionalInheritanceLink $X human
      ThereExists $Y
         AND
            ExtensionalInheritanceLink $Y human
            EvaluationLink parent_of ($X, $Y)

An example of an unquantified Variable Atom is $X in

ExtensionalImplicationLink <.3>
   ExtensionalInheritanceLink $X human
   ThereExists $Y
      AND
         ExtensionalInheritanceLink $Y human
         EvaluationLink parent_of ($X, $Y)

This ImplicationLink says that 30% of humans are parents: a more useful statement than the ForAll link given above, which says that it is very, very unlikely to be true that all humans are parents.

We may also say, for instance,

SatisfyingSet( EvaluationLink eats (cat, $X) )

to refer to the set of X so that eats(cat, X).

On the other hand, suppose we have the implication

Implication
   Evaluation f $X
   Evaluation f
      ExOut reverse $X

where f is a PredicateNode embodying a mathematical operator acting on pairs of NumberNodes, and reverse is an operator that reverses a list. So, this implication says that the f predicate is commutative. Now, suppose that f is grounded by the formula

f(a,b) = (a > b - 1)

embodied in a Combo tree object (which is not commutative, but that is not the point), stored in the ProcedureRepository and linked to the PredicateNode for f. These f-internal variables, which are expressed here using the letters a and b, are not VariableNodes in the CogPrime AtomTable. The notation we use for such variables within the textual Combo language, which goes with the Combo tree formalism, is to replace a and b in this example with #1 and #2, so that the above grounding would be denoted

f -> (#1 > #2 - 1)

In the current version, it is assumed that type restrictions are always crisp, not probabilistically truth-valued. This assumption may be revisited in a later version of the system.

20.3.2.1 Links as Predicates

It is conceptually important to recognize that CogPrime link types may be interpreted as predicates. For instance, when one says

InheritanceLink cat animal <.8>

indicating an Inheritance relation between cat and animal with strength .8, effectively one is declaring that one has a predicate giving an output of .8. Depending on the interpretation of InheritanceLink as a predicate, one has either the predicate

InheritanceLink cat $X

acting on the input animal, or the predicate

InheritanceLink $X animal

acting on the input cat, or the predicate

InheritanceLink $X $Y

acting on the list input (cat, animal).

This means that, if we wanted to, we could do away with all Link types except OrderedLink and UnorderedLink, and represent all other Link types as PredicateNodes embodying appropriate predicate schema. This is not the approach taken in the current codebase.
However, the situation is somewhat similar to that with CIM-Dynamics:

• In the future we will likely create a revision of CogPrime that regularly revises its own vocabulary of Link types, in which case an explicit representation of link types as predicate schema will be appropriate.
• In the shorter term, it can be useful to treat link types as virtual predicates, meaning that one lets the system create SchemaNodes corresponding to them, and hence do some meta-level reasoning about its own link types.

20.3.3 Variable and Combinator Notation

One of the most important aspects of combinatory logic, from a CogPrime perspective, is that it allows one to represent arbitrarily complex procedures and patterns without using variables in any direct sense. In CogPrime, variables are optional, and the choice of whether or how to use them may be made (by CogPrime itself) on a contextual basis. This section deals with the representation of variable expressions in a variable-free way, in a CogPrime context. The general theory underlying this is well known, and is usually expressed in terms of the elimination of variables from lambda calculus expressions (lambda lifting). Here we will not present this theory, but will restrict ourselves to presenting a simple, hopefully illustrative example, and then discussing some conceptual implications.

20.3.3.1 Why Eliminating Variables is So Useful

Before launching into the specifics, a few words about the general utility of variable-free expression may be worthwhile. Some expressions look simpler to the trained human eye with variables, and some look simpler without them. However, the main reason why eliminating all variables from an expression is sometimes very useful is that there are automated program-manipulation techniques that work much more nicely on programs (schemata, in CogPrime lingo) without any variables in them.

As will be discussed later (e.g. in Chapter 33 on evolutionary learning, although the same process is also useful for supporting probabilistic reasoning on procedures), in order to mine patterns among multiple schema that all try to do the same (or related) things, we want to put schema into a kind of "hierarchical normal form." The normal form we wish to use generalizes Holman's Elegant Normal Form (which is discussed in Moshe Looks' PhD thesis) to program trees rather than just Boolean trees. But putting computer programs into a useful, nicely-hierarchically-structured normal form is a hard problem - it requires one to have a pretty nice and comprehensive set of program transformations. And the only general, robust, systematic program transformation methods that exist in the computer science literature require one to remove the variables from one's programs, so that one can use the theory of functional programming (which ties in with the theory of monads in category theory, and a lot of beautiful related math).

In large part, we want to remove variables so we can use functional programming tools to normalize programs into a standard and pretty hierarchical form, in order to mine patterns among them effectively. However, we don't always want to be rid of variables, because sometimes, from a logical reasoning perspective, theorem-proving is easier with the variables left in (and sometimes not). So, we want to have the option to use variables, or not.
20.3.3.2 An Example of Variable Elimination

Consider the PredicateNode

AND
   InheritanceLink X cat
   eats X mice

Here we have used a syntactically sugared representation involving the variable X. How can we get rid of the X? Recall the C combinator (from combinatory logic), defined by

C f x y = f y x

Using this tool,

InheritanceLink X cat

becomes

C InheritanceLink cat X

and

eats X mice

becomes

C eats mice X

so that overall we have

AND
   C InheritanceLink cat
   C eats mice

where the C combinators essentially give instructions as to where the virtual argument X should go.

In this case the variable-free representation is basically just as simple as the variable-based representation, so there is nothing to lose and a lot to gain by getting rid of the variables. This won't always be the case - sometimes execution efficiency will be significantly enhanced by the use of variables.
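The effect of the C combinator can be checked directly in Python. The curried stand-ins for InheritanceLink and eats below are invented for the example, and a tuple stands in for AND.

# C f x y = f y x : swaps the (curried) argument order.
C = lambda f: lambda x: lambda y: f(y)(x)

# Toy curried "link constructors" standing in for InheritanceLink and eats:
inheritance = lambda x: lambda y: ("InheritanceLink", x, y)
eats = lambda x: lambda y: ("eats", x, y)

# Variable-based predicate: X -> AND(InheritanceLink X cat, eats X mice)
with_var = lambda X: (inheritance(X)("cat"), eats(X)("mice"))

# Variable-free form: AND(C InheritanceLink cat, C eats mice)
p1 = C(inheritance)("cat")   # awaits the virtual argument X
p2 = C(eats)("mice")
without_var = lambda X: (p1(X), p2(X))

assert with_var("Tom") == without_var("Tom")   # same semantics, no explicit X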
20.3.4 Inheritance Between Higher-Order Types

Next, this section deals with the somewhat subtle matter of inheritance between higher-order types. This is needed, for example, when one wants to cross over or mutate two complex schemata in an evolutionary learning context. One encounters questions like: when mutation replaces a schema that takes integer input, can it replace it with one that takes general numerical input? How about vice versa? These questions get more complex when the inputs and outputs of schema may themselves be schema with complex higher-order types. However, they can be dealt with elegantly using some basic mathematical rules.

Denote the type of a mapping from type T to type S as T -> S, and use the shorthand inh to mean "inherits from". Then the basic rule we use is that

T1 -> S1 inh T2 -> S2 iff (T2 inh T1) and (S1 inh S2)

In other words, we assume higher-order type inheritance is contravariant. The reason is that, if R1 = T1 -> S1 is to be a special case of R2 = T2 -> S2, then one has to be able to use the latter everywhere one uses the former. This means that any input R2 takes has to also be taken by R1 (hence T2 inherits from T1). And it means that the outputs R2 gives must be able to be accepted by any function that accepts outputs of R1 (hence S1 inherits from S2).

This type of issue comes up in programming language design fairly frequently, and there are a number of research papers debating the pros and cons of contravariance versus covariance for complex type inheritance. However, for the purpose of schema type inheritance in CogPrime, the greater logical consistency of the contravariant approach holds sway.

For instance, in this approach, INT -> INT is not a subtype of NO -> INT (where NO denotes NUMBER), because NO -> INT is the type that includes all functions which take a real and return an int, and an INT -> INT function does not take a real. Rather, the containment is the other way around: every NO -> INT function is an example of an INT -> INT function. For example, consider the NO -> INT function that takes every real number and rounds it up to the nearest integer. Considered as an INT -> INT function, this is simply the identity function: it is the function that takes an integer and rounds it up to the nearest integer.

Tupling of types, of course, is different: it is covariant. If one has an ordered pair whose elements are of different types, say (T1, T2), then we have

(T1, S1) inh (T2, S2) iff (T1 inh T2) and (S1 inh S2)

As a mnemonic formula, we may say

(general -> specific) inherits from (specific -> general)
(specific, specific) inherits from (general, general)

In schema learning, we will also have use for abstract type constructions, such as

(T1, T2) where T1 inherits from T2

Notationally, we will refer to variable types as Xv1, Xv2, etc., and then denote the inheritance relationships by using numerical indices, e.g. using [1 inh 2] to denote that Xv1 inh Xv2. So, for example,

(INT, VOID) inh (Xv1, Xv2)

is true, because there are no restrictions on the variable types, and we can just assign Xv1 = INT, Xv2 = VOID. On the other hand,

[ (INT, VOID) inh (Xv1, Xv2), 1 inh 2 ]

is false, because the restriction Xv1 inh Xv2 is imposed, but it is not true that INT inh VOID.

The following list gives some examples of type inheritance, using the elementary types INT, FLOAT (FL), NUMBER (NO), CHAR and STRING (STR), with the elementary type inheritance relationships INT inh NUMBER, FLOAT inh NUMBER, and CHAR inh STRING:

• ( NO -> FL ) inh ( INT -> FL )
• ( FL -> INT ) inh ( FL -> NO )
• ( ( INT -> FL ) -> ( FL -> INT ) ) inh ( ( NO -> FL ) -> ( FL -> NO ) )
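These rules are mechanical enough to check with a short program. The following Python sketch (with an invented tuple encoding of types) implements the contravariant rule for function types and the covariant rule for pairs, and reproduces the examples above.

# Elementary inheritance relations from the examples (child -> parents).
ELEM = {"INT": {"NUMBER"}, "FLOAT": {"NUMBER"}, "CHAR": {"STRING"}}

def inh(a, b):
    # Does type a inherit from type b? Functions ('->') are contravariant in
    # the argument and covariant in the result; tuples are covariant throughout.
    if a == b:
        return True
    if isinstance(a, str) and isinstance(b, str):
        return b in ELEM.get(a, set())
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b) == 3:
        if a[0] == b[0] == "->":                        # (T1 -> S1) inh (T2 -> S2)
            return inh(b[1], a[1]) and inh(a[2], b[2])  # iff T2 inh T1 and S1 inh S2
        if a[0] == b[0] == "pair":                      # (T1, S1) inh (T2, S2)
            return inh(a[1], b[1]) and inh(a[2], b[2])  # iff T1 inh T2 and S1 inh S2
    return False

print(inh(("->", "NUMBER", "INT"), ("->", "INT", "INT")))        # True
print(inh(("->", "INT", "INT"), ("->", "NUMBER", "INT")))        # False
print(inh(("->", "FLOAT", "INT"), ("->", "FLOAT", "NUMBER")))    # True
print(inh(("pair", "INT", "INT"), ("pair", "NUMBER", "NUMBER"))) # True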
20.3.5 Advanced Schema Manipulation

Now we describe some special schema for manipulating schema, which seem to be very useful in certain contexts.

20.3.5.1 Listification

First, there are two ways to represent n-ary relations in CogPrime's Atom-level knowledge representation language: using lists, as in

f_list (x1, ..., xn)

or using currying, as in

f_curry x1 ... xn

To make conversion between list and curried forms easier, we have chosen to introduce special schema (combinators) just for this purpose:

listify f = f_list, so that f_list (x1, ..., xn) = f x1 ... xn
unlistify (listify f) = f

For instance,

kick_curry Ben Ken

denotes (kick_curry Ben) Ken, which means that kick is applied to the argument Ben to yield a predicate schema applied to Ken. This is the curried style. The list style is

kick_list (Ben, Ken)

where kick is viewed as taking as its argument the list (Ben, Ken). The conversion between the two is done by

listify kick_curry = kick_list
unlistify kick_list = kick_curry

As a more detailed example of unlistification, let us utilize a simple mathematical example, the function (X - 1)^2. If we use the notations - and pow to denote SchemaNodes embodying the corresponding operations, then this formula may be written in node-and-link form as

ExOutLink
   pow
   ListLink
      ExOutLink
         -
         ListLink
            X
            1
      2

But to get rid of the nasty variable X, we need to first unlistify the functions pow and -, and then apply the C and B combinators a couple of times to move the variable X to the front. The B combinator (see Combinatory Logic REF) is recalled below:

B f g h = f (g h)

This is accomplished as follows (using the standard convention of left-associativity for the application operator, as in the tree representation given in Section 20.3.0.6):

pow(-(x, 1), 2)
(unlistify pow) (-(x, 1)) 2
C (unlistify pow) 2 (-(x,1))
C (unlistify pow) 2 ((unlistify -) x 1)
C (unlistify pow) 2 (C (unlistify -) 1 x)
B (C (unlistify pow) 2) (C (unlistify -) 1) x

yielding the final schema

B (C (unlistify pow) 2) (C (unlistify -) 1)

By the way, a variable-free representation of this schema in CogPrime would look like

ExOutLink
   ExOutLink
      B
      ExOutLink
         ExOutLink
            C
            ExOutLink unlistify pow
         2
   ExOutLink
      ExOutLink
         C
         ExOutLink unlistify -
      1

The main thing to be observed is that the introduction of these extra schema lets us remove the variable X. The size of the schema is increased slightly in this case, but only slightly - an increase that is well justified by the elimination of the many difficulties that explicit variables would bring to the system. Furthermore, there is a shorter rendition, which looks like

ExOutLink
   ExOutLink
      B
      ExOutLink
         ExOutLink C pow_curried
         2
   ExOutLink
      ExOutLink C -_curried
      1

This rendition uses alternate variants of the - and pow schema, labeled -_curried and pow_curried, which do not act on lists but are curried in the manner of combinatory logic and Haskell. It is 13 lines, whereas the variable-bearing version is 9 lines - a minor increase in length that brings a lot of operational simplification.

20.3.5.2 Argument Permutation

In dealing with list relationships, there will sometimes be use for an argument-permutation operator; let us call it P, defined as follows:

(P p f) (v1, ..., vn) = f (p (v1, ..., vn))

where p is a permutation on n letters. This deals with the case where we want to say, for instance, that

Equivalence parent(x,y) child(y,x)

Instead of positing variable names x and y that span the two relations parent(x,y) and child(y,x), what we can instead say in this example is

Equivalence parent (P (2,1) child)

For the case of two-argument functions, argument permutation is basically doing on the list level what the C combinator does in the curried function domain. On the other hand, in the case of n-argument functions with n > 2, argument permutation doesn't correspond to any of the standard combinators.

Finally, let's conclude with a similar example in a more standard predicate logic notation, involving both combinators and the argument permutation operator introduced above. We will translate the variable-laden predicate

likes(y,x) AND likes(x,y)

into the equivalent combinatory logic tree. Let us first recall the combinator S, whose function is to distribute an argument over two terms:

S f g x = (f x) (g x)

Assume that the two inputs are going to be given to us as a list. The combinatory logic representation is then

S (B AND (B (P (2,1) likes))) likes

We now show how this would be evaluated to produce the correct expression:

S (B AND (B (P (2,1) likes))) likes (x,y)

S gets evaluated first, to produce

(B AND (B (P (2,1) likes)) (x,y)) (likes (x,y))

now the first B:

AND ((B (P (2,1) likes)) (x,y)) (likes (x,y))

now the second one:

AND ((P (2,1) likes) (x,y)) (likes (x,y))

now P:

AND (likes (y,x)) (likes (x,y))

which is what we wanted.
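The (X - 1)^2 derivation can be verified mechanically. Below is a Python sketch with curried definitions of B and C, listify/unlistify specialized to the 2-ary case, and the P operator; all the names are ad hoc for this illustration.

# Standard combinators, curried:
B = lambda f: lambda g: lambda x: f(g(x))   # B f g h = f (g h)
C = lambda f: lambda x: lambda y: f(y)(x)   # C f x y = f y x

listify   = lambda f: lambda args: f(args[0])(args[1])   # 2-ary case only
unlistify = lambda f: lambda x: lambda y: f((x, y))

# List-style pow and minus, as in the (X - 1)^2 example:
pow_list   = lambda args: args[0] ** args[1]
minus_list = lambda args: args[0] - args[1]

# The derived variable-free schema: B (C (unlistify pow) 2) (C (unlistify -) 1)
schema = B(C(unlistify(pow_list))(2))(C(unlistify(minus_list))(1))
for x in (0, 3, 10):
    assert schema(x) == (x - 1) ** 2   # same behavior, no explicit variable

# listify and unlistify are mutually inverse on this 2-ary example:
assert listify(unlistify(pow_list))((3, 2)) == pow_list((3, 2))

# Argument permutation: (P p f)(v1,...,vn) = f(p(v1,...,vn))
P = lambda perm: lambda f: lambda args: f(tuple(args[i - 1] for i in perm))
parent = lambda args: f"parent({args[0]},{args[1]})"
child = P((2, 1))(parent)                  # child(y,x) defined as parent(x,y)
assert child(("y", "x")) == parent(("x", "y"))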
Chapter 21
Representing Procedural Knowledge

21.1 Introduction

We now turn to CogPrime's representation and manipulation of procedural knowledge. In a sense this is the most fundamental kind of knowledge, since intelligence is most directly about action selection, and it is procedures which generate actions.

CogPrime involves multiple representations for procedures, including procedure maps and (for sensorimotor procedures) neural nets or similar structures. Its most basic procedural knowledge representation, however, is the program. The choice to use programs to represent procedures was made after considerable reflection - they are not, of course, the only choice, as other representations such as recurrent neural networks possess identical representational power, and are preferable in some regards (e.g. resilience with respect to damage). Ultimately, however, we chose programs due to their consilience with the software and hardware underlying CogPrime (and every other current AI program). CogPrime is a program; current computers and operating systems are optimized for executing and manipulating programs; and we humans now have many tools for formally and informally analyzing and reasoning about programs. The human brain probably doesn't represent most procedures as programs in any simple sense, but CogPrime is not intended to be an emulation of the human brain. So, the representation of procedures as programs is one major case where CogPrime deviates from the human cognitive architecture in the interest of more effectively exploiting its own hardware and software infrastructure.

CogPrime represents procedures as programs in an internal programming language called "Combo." While Combo has a textual representation, described online at the OpenCog wiki, this isn't one of its more important aspects (and may be redesigned slightly or wholly without affecting system intelligence or architecture); the essence of Combo programs lies in their tree representation, not their text representation. One could fairly consider Combo a dialect of LISP, although it's not equivalent to any standard dialect, and it hasn't particularly been developed with this in mind.

In this chapter we discuss the key concepts underlying the Combo approach to program representation, seeking to make clear at each step the motivations for doing things in the manner proposed.

In terms of the overall CogPrime architecture diagram given in Chapter 6 of Part 1, this chapter is about the box labeled "Procedure Repository." The latter, in OpenCog, is a specialized component connected to the AtomSpace, storing Combo tree representations of programs; each program in the repository is linked to a SchemaNode in the AtomSpace, ensuring full connectivity between procedural and declarative knowledge.

21.2 Representing Programs

What is a "program" anyway? What distinguishes a program from an arbitrary representation of a procedure? The essence of programmatic representations is that they are well-specified, compact, combinatorial, and hierarchical:

• Well-specified: unlike sentences in natural language, programs are unambiguous; two distinct programs can be precisely equivalent.
• Compact: programs allow us to compress data on the basis of their regularities.
Accordingly, for the purposes of this chapter, we do not consider overly constrained representations, such as the well-known conjunctive and disjunctive normal forms for Boolean formulae, to be programmatic. Although they can express any Boolean function (data), they dramatically limit the range of data that can be expressed compactly, compared to unrestricted Boolean formulae.
• Combinatorial: programs access the results of running other programs (e.g. via function application), as well as delete, duplicate, and rearrange these results (e.g. via variables or combinators).
• Hierarchical: programs have intrinsic hierarchical organization, and may be decomposed into subprograms.

Eric Baum has advanced a theory "under which one understands a problem when one has mental programs that can solve it and many naturally occurring variations" [Bau06]. In this perspective - which we find an agreeable way to think about procedural knowledge, though perhaps an overly limited perspective on mind as a whole - one of the primary goals of artificial general intelligence is systems that can represent, learn, and reason about such programs [Bau06, Bau04]. Furthermore, integrative AGI systems such as CogPrime may contain subsystems operating on programmatic representations. Would-be AGI systems with no direct support for programmatic representation will clearly need to represent procedures and procedural abstractions somehow. Alternatives such as recurrent neural networks have serious downsides, including opacity and inefficiency, but also have their advantages (e.g. recurrent neural nets can be robust with regard to damage, and learnable via biologically plausible algorithms).

Note that the problem of how to represent programs for an AGI system dissolves in the unrealistic case of unbounded computational resources. The solution is algorithmic information theory [Cha08], extended recently to the case of sequential decision theory [Hut05a]. The latter work defines the universal algorithmic agent AIXI, which in effect simulates all possible programs that are in agreement with the agent's set of observations. While AIXI is uncomputable, the related agent AIXItl may be computed, and is superior to any other agent bounded by time t and space l [Hut05a]. The choice of a representational language for programs¹ is of no consequence, as it will merely introduce a bias that will disappear within a constant number of time steps.²

Our goal in this chapter is to provide practical techniques for approximating the ideal provided by algorithmic probability, based on what Pei Wang has termed the assumption of insufficient knowledge and resources [Wan06], and assuming an AGI architecture that's at least vaguely humanlike in nature, and operates largely in everyday human environments, but uses programs to represent many procedures. Given these assumptions, how programs are represented is of paramount importance, as we shall see in the next two sections, where we give a conceptual formulation of what we mean by tractable program representations, and introduce tools for formalizing such representations. The fourth section delves into effective techniques for representing programs.

¹ As well as a language for proofs in the case of AIXItl.
² The universal distribution converges quickly.
A key concept throughout is syntactic-semantic correlation, meaning that programs which are similar on the syntactic level will, within certain constraints, tend also to be similar in terms of their behavior (i.e. on the semantic level). Lastly, the fifth section changes direction a bit and discusses the translation of programmatic structure into declarative form for the purposes of logical inference.

In the future, we will experimentally validate that these normal forms and heuristic transformations do in fact increase the syntactic-semantic correlation in program spaces, as has been shown so far only in the Boolean case. We would also like to explore the extent to which even stronger correlation, and additional tractability properties, can be observed when realistic probabilistic constraints on "natural" environment and task spaces are imposed.

The importance of a good programmatic representation of procedural knowledge becomes quite clear when one thinks about it in terms of the Mind-World Correspondence Principle introduced in Chapter 10. That principle states, roughly, that transition paths between world-states should map naturally onto transition paths between mind-states. This suggests that there should be a natural, smooth mapping between real-world action series and the corresponding series of internal states. Where internal states are driven by explicitly given programs, this means that the transitions between internal program states should nicely mirror transitions between the states of the real world as it interacts with the system controlled by the program. The extent to which this is true will depend on the specifics of the programming language - and it will be true to a much greater extent, on the whole, if the programming language displays high syntactic-semantic correlation for behaviors that commonly occur when the program is used to control the system in the real world. So, the various technical issues mentioned above and considered below, regarding the qualities desired in a programmatic representation, are merely the manifestation of the general Mind-World Correspondence Principle in the context of procedural knowledge, under the assumption that procedures are represented as programs. The material in this chapter may be viewed as an approach to ensuring the validity of the Mind-World Correspondence Principle for programmatically-represented procedural knowledge, for CogPrime systems concerned with achieving humanly meaningful goals in everyday human environments.

21.3 Representational Challenges

Despite the advantages outlined in the previous section, there are a number of challenges in working with programmatic representations:

• Open-endedness - in contrast to some other knowledge representations current in machine learning, programs vary in size and "shape", and there is no obvious problem-independent upper bound on program size. This makes it difficult to represent programs as points in a fixed-dimensional space, or to learn programs with algorithms that assume such a space.
• Over-representation - often, syntactically distinct programs will be semantically identical (i.e. represent the same underlying behavior or functional mapping). Lacking prior knowledge, many algorithms will inefficiently sample semantically identical programs repeatedly [Loo07a, GIW01].
• Chaotic Execution - programs that are very similar, syntactically, may be very different, semantically.
This presents difficulties for many heuristic search algorithms, which require syntactic and semantic distance to be correlated [Loo07b, TVCC05].
• High resource-variance - programs in the same space vary greatly in the space and time they require to execute.

It's easy to see how the latter two issues may present a challenge for mind-world correspondence! Chaotic execution makes it hard to predict whether a program will indeed manifest state-sequences mapping nicely onto corresponding world-sequences; and high resource-variance makes it hard to predict whether, for a given program, this sort of mapping can be achieved for relevant goals given available resources.

Based on these concerns, it is no surprise that search over program spaces quickly succumbs to combinatorial explosion, and that heuristic search methods are sometimes no better than random sampling [LP02]. However, alternative representations of procedures also have their difficulties, and so far we feel the thornier aspects of programmatic representation are generally an acceptable price to pay in light of the advantages. For some special cases in CogPrime we have made a different choice - e.g. when we use DeSTIN for sensory perception (see Chapter 28) we utilize a more specialized representation comprising a hierarchical network of more specialized elements. DeSTIN doesn't have problems with resource variance or chaotic execution, though it does suffer from over-representation. It is not very open-ended, which helps increase its efficiency in the perceptual processing domain, but may limit its applicability to more abstract cognition. In short, we feel that, for general representation of cognitive procedures, the benefits of programmatic representation outweigh the costs; but for some special cases such as low-level perception and motor procedures, this may not be true, and one may do better to opt for a more specialized, more rigid but less problematic representation.

It would be possible to modify CogPrime to use, say, recurrent neural nets for procedure representation, rather than programs in an explicit language. However, this would rate as a rather major change in the architecture, and would cause multiple problems in other aspects of the system. For example, programs are reasonably straightforward to reason about using PLN inference, whereas reasoning about the internals of recurrent neural nets is drastically more problematic, though not impossible. The choice of a procedure representation approach for CogPrime has been made considering not only procedural knowledge in itself, but the interaction of procedural knowledge with other sorts of knowledge. This reflects the general synergetic nature of the CogPrime design.

There are also various computation-theoretic issues regarding programs; however, we suspect these are not particularly relevant to the task of creating human-level AGI, though they may rear their heads when one gets into the domain of super-human, profoundly self-modifying AGI systems. For instance, in the context of the difficulties caused by over-representation and high resource-variance, one might observe that determinations of e.g. programmatic equivalence for the former, and e.g. halting behavior for the latter, are uncomputable. But we feel that, given the assumption of insufficient knowledge and resources, these concerns dissolve into the larger issue of computational intractability and the need for efficient heuristics.
Determining the equivalence of two Boolean formulae over 500 variables by computing and comparing their truth tables is trivial from a computability standpoint, but, in the words of Leonid Levin, "only math nerds would call 2^500 finite" [Lev94]. Similarly, a program that never terminates is a special case of a program that runs too slowly to be of interest to us.

One of the key ideas underlying our treatment of programmatic knowledge is that, in order to tractably learn and reason about programs, an AI system must have prior knowledge of programming language semantics. That is, in the approach we advocate, the mechanism whereby programs are executed is assumed known a priori, and assumed to remain constant across many problems. One may then craft AI methods that make specific use of the programming language semantics, in various ways. Of course, in the long run a sufficiently powerful AGI system could modify these aspects of its procedural knowledge representation; but in that case, according to our approach, it would also need to modify various aspects of its procedure learning and reasoning code accordingly.

Specifically, we propose to exploit prior knowledge about program structure by enforcing programs to be represented in normal forms that preserve their hierarchical structure, and by heuristically simplifying them based on reduction rules. Accordingly, one formally equivalent programming language may be preferred over another by virtue of making these reductions and transformations more explicit and concise to describe and to implement. The current OpenCogPrime system uses a simple LISP-like language called Combo (which takes both tree form and textual form) to represent procedures, but this is not critical; the main point is using some language or language variant that is "tractable" in the sense of providing a context in which the semantically useful reductions and transformations we've identified are naturally expressible and easily usable.

21.4 What Makes a Representation Tractable?

Creating a comprehensive formalization of the notion of a tractable program representation would constitute a significant achievement; and we will not answer that summons here. We will, however, take a step in that direction by enunciating a set of positive principles for tractable program representations, corresponding closely to the list of representational challenges above.

While the discussion in this section is essentially conceptual rather than formal, we will use a bit of notation to ensure clarity of expression: S denotes a space of programmatic functions of the same type (e.g. all pure LISP λ-expressions mapping from lists to numbers), and B denotes a metric space of behaviors.

In the case of a deterministic, side-effect-free program, execution maps from programs in S to points in B, which will have separate dimensions for the function's output across various inputs of interest, as well as dimensions corresponding to the time and space costs of executing the program. In the case of a program that interacts with an external environment, or is intrinsically nondeterministic, execution will map from S to probability distributions over points in B, which will contain additional dimensions for any side-effects of interest that programs in S might have.

Note the distinction between syntactic distance, measured as e.g. tree-edit distance between programs in S, and semantic distance, measured between programs' corresponding points in, or probability distributions over, B. We assume that semantic distance accurately quantifies our
We assume that semantic distance accurately quantifies our EFTA00624200 54 21 Representing Procedural Knowledge preferences in terms of a weighting on the dimensions of B; i.e., if variation along some axis is of great interest, our metric for semantic distance should reflect this. Let P be a probability distribution over B that describes our knowledge of what sorts of problems we expect to encounter, let R(n) C S be all the programs in our representation with (syntactic) size no greater than n. We will say that R(n) d-covers the pair (B,P) to extent p if the probability that, for a random behavior b E B chosen according to 12, there is some program in R whose behavior is within semantic distance d of 6, is greater than or equal to p. Then, some among the various properties of tractability that seem important based on the above discussion are as follows: • for fixed d, p quickly goes to I as n increases, • for fixed p, d quickly goes to 0 as n increases, • for fixed d and p, the minimal n needed for R(n) to d-cover (B, P) to extent p should be as small as possible, • ceteris paribus, syntactic and semantic distance (measured according to P) are highly cor- related. This is closely related to the Mind-Brain Correspondence Principle articulated in Chapter 10, and to the geometric formulation of cognitive synergy posited in Appendix ??. Syntactic distance has to do with distance along paths in mind-space related to formal program structures, and semantic distance has to do with distance along paths in mind-space and world-space corresponding to the record of the program's actual behavior. If syntax-semantics correlation failed, then there would be paths through mind-space (related to formal program structures) that were poorly matched to their closest corresponding paths through the rest of mind-space and world-space, hence causing a failure (or significant diminution) of cognitive synergy and mind-world correspondence. Since execution time and memory usage considerations may be incorporated into the defini- tion of program behavior, minimizing chaotic execution and managing resource variance emerges conceptually here as subcases of maximizing correlation between syntactic and semantic dis- tance. Minimizing over-representation follows from the desire for small program size: roughly speaking the less over-representation there is, the smaller average program size can be achieved. In some cases one can achieve fairly strong results about tractability of representations without any special assumptions about P: for example in prior work we have shown that adoption of an appropriate hierarchical normal form can generically increase correlation between syntactic and semantic distance in the space of Boolean functions I?, Loonbl. In this case we may say that we have a generically tractable representation. However, to achieve tractable representation of more complex programs, some fairly strong assumptions about P will be necessary. This should not be philosophically disturbing, since it's clear that human intelligence has evolved in a manner strongly conditioned by certain classes of environments; and similarly, what we need to do to create a viable program representation system for pragmatic AGI usage, is to achieve tractability relative to the distribution P corresponding to the actual problems the AGI is going to need to solve. 
Formalizing the distributions P of real-world interest is a difficult problem, and one we will not address here (recall the related, informal discussions of Chapter 9, where we considered the various important peculiarities of the human everyday world). However, we hypothesize that the representations presented in the following section may be tractable to a significant extent irrespective of P,³ and even more powerfully tractable with respect to this as-yet unformalized distribution. As weak evidence in favor of this hypothesis, we note that many of the representations presented have proved useful so far in various narrow problem-solving situations.

³ Specifically, with only weak biases that prefer smaller and faster programs with hierarchical decompositions.

21.5 The Combo Language

The current version of OpenCogPrime uses a simple language called Combo, which is an example of a language in which the transformations we consider important for AGI-focused program representation are relatively simple and natural. Here we illustrate the Combo language by example, referring the reader to the OpenCog wiki site for a formal presentation.

The main use of the Combo language in OpenCog is behind the scenes, i.e. using tree representations of Combo programs; but there is also a human-readable syntax, and an interpreter that allows humans to write Combo programs when needed. The main use of Combo, however, is not for human-coded programs, but rather for programs that are learned via various AI methods.

In Combo all expressions are in prefix form as in LISP, but the left parenthesis is placed after the operator instead of before it. For example (a toy evaluator for such expressions follows the list):

• +(4 5) is a 0-ary expression that returns 4 + 5.
• and(#1 0<(#2)) is a binary expression of type bool × float -> bool that returns true if and only if the first input is true and the second input is positive. #n designates the n-th input.
• fact(1) := if(0<(#1) *(#1 fact(+(#1 -1))) 1) is a recursive definition of factorial.
• and_seq(goto(stick) grab(stick) goto(owner) drop) is a 0-ary expression with side effects; it evaluates a sequence of actions until completion or failure of one of them. Each action is executed in the environment the agent is connected to, and returns action_success upon success or action_failure otherwise. The action sequence returns action_success if it completes, or action_failure if it does not.
• if(near(owner self) lick(owner) and_seq(goto(owner) wag)) is a 0-ary expression with side effects; it means that if, at the time of its evaluation, the agent referred to as self (here a virtual pet) is near its owner, then lick the owner; otherwise go to the owner and wag the tail.
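To give a feel for the evaluation semantics of such expressions, here is a toy Python evaluator for Combo-like expression trees. This is not the actual Combo interpreter; the tuple encoding and the operator table are assumptions made only for the example.

import operator

# A tiny evaluator for Combo-like trees: (operator, child, child, ...).
# '#n' tokens name the n-th input, as in the examples above.
OPS = {"+": operator.add, "*": operator.mul, "0<": lambda x: x > 0,
       "and": lambda a, b: a and b}

def ev(tree, inputs):
    if isinstance(tree, str) and tree.startswith("#"):
        return inputs[int(tree[1:]) - 1]     # #1 is the first input
    if isinstance(tree, tuple):
        op, *kids = tree
        if op == "if":                        # lazily evaluated conditional
            cond, then_, else_ = kids
            return ev(then_ if ev(cond, inputs) else else_, inputs)
        return OPS[op](*(ev(k, inputs) for k in kids))
    return tree                               # a constant leaf

print(ev(("+", 4, 5), []))                             # 9
print(ev(("and", "#1", ("0<", "#2")), [True, 3.5]))    # True
print(ev(("and", "#1", ("0<", "#2")), [True, -2.0]))   # False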
21.6 Normal Forms Postulated to Provide Tractable Representations

We now present a series of normal forms for programs, postulated to provide tractable representations in the contexts relevant to human-level, roughly human-like general intelligence.

21.6.1 A Simple Type System

We use a simple type system to distinguish between the various normal forms introduced below. This is necessary to convey the minimal information needed to correctly apply the basic functions in our canonical forms. Various systems and applications may of course augment these with additional type information, up to and including the satisfaction of arbitrary predicates (e.g. a type for prime numbers). This can be overlaid on top of our minimalist system to convey additional bias in selecting which transformations to apply, introducing constraints as necessary. For instance, a call to a function expecting a prime number, called with a potentially composite argument, may be wrapped in a conditional testing the argument's primality. A similar technique is used in the normal form for functions to deal with list arguments that may be empty.

Normal forms are provided for Boolean and number primitive types, and for the following parametrized types:

• list types, list_T, where T is any type;
• tuple types, tuple_{T1,T2,...,TN}, where all the Ti are types, and N is a positive natural number;
• enum types, {s1, s2, ..., sN}, where N is a positive number and all the si are unique identifiers;
• function types T1, T2, ..., TN -> O, where O and all the Ti are types;
• action result types.

A list of type list_T is an ordered sequence of any number of elements, all of which must have type T. A tuple of type tuple_{T1,T2,...,TN} is an ordered sequence of exactly N elements, where every ith element is of type Ti. An enum of type {s1, s2, ..., sN} is some element si from the set. Action result types concern side-effectful interaction with some world external to the system (but perhaps simulated, of course), and will be described in detail in their subsection below. Other types may certainly be added at a later date, but we believe that those listed above provide sufficient expressive power to conveniently encompass a wide range of programs, and serve as a compelling proof of concept.

The normal form for a type T is a set of elementary functions with codomain T, a set of constants of type T, and a tree grammar. Internal nodes of expressions described by the grammar are elementary functions, and leaves are either U_var or U_constant, where U is some type (often U = T). Sentences in a normal form grammar may be transformed into normal form expressions. The set of expressions that may be generated is a function of a set of bound variables and a set of external functions that must be provided (both bound variables and external functions are typed). The transformation is as follows:

• T_constant leaves are replaced with constants of type T;
• T_var leaves are replaced with either bound variables matching type T, or expressions of the form f(expr1, expr2, ..., exprM), where f is an external function of type T1, T2, ..., TM -> T, and each expri is a normal form expression of type Ti (given the available bound variables and external functions).

21.6.2 Boolean Normal Form

The elementary functions are and, or, and not. The constants are {true, false}. The grammar is:

bool_root -> or_form | and_form | literal | bool_constant
literal   -> bool_var | not( bool_var )
or_form   -> or( {and_form | literal}(2,) )
and_form  -> and( {or_form | literal}(2,) )

The construct foo(x,) refers to x or more matches of foo (e.g. {x | y}(2,) is a sequence of two or more items, where each item is either an x or a y).
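The alternation constraint imposed by this grammar (or-forms may not sit directly under or-forms, nor and-forms under and-forms) is easy to check mechanically. A Python sketch, with an ad hoc tuple encoding of Boolean trees:

def is_literal(t):
    return isinstance(t, str) or (isinstance(t, tuple) and t[0] == "not"
                                  and isinstance(t[1], str))

def in_normal_form(t):
    # Check the alternation required by the grammar: or() children must be
    # and_forms or literals, and vice versa, each with arity >= 2.
    if isinstance(t, bool) or is_literal(t):
        return True
    op, *kids = t
    if op not in ("and", "or"):
        return False
    other = {"or": "and", "and": "or"}[op]
    return len(kids) >= 2 and all(
        is_literal(k) or (isinstance(k, tuple) and k[0] == other
                          and in_normal_form(k))
        for k in kids)

print(in_normal_form(("or", ("and", "x", "y"), ("not", "z"))))   # True
print(in_normal_form(("or", ("or", "x", "y"), "z")))             # False: or under or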
21.6.3 Number Normal Form

The elementary functions are * (times) and + (plus). The constants are some subset of the rationals (e.g. those with IEEE single-precision floating-point representations). The grammar is:

    num_root   → times_form | plus_form | num_constant | num_var
    times_form → *( (num_constant | plus_form) plus_form{1,} ) | num_var
    plus_form  → +( (num_constant | times_form) times_form{1,} ) | num_var

21.6.4 List Normal Form

For list types list_T, the elementary functions are list (an n-ary list constructor) and append. The only constant is the empty list (nil). The grammar is:

    list_T_root → append_form | list_form | list_T_var | list_T_constant
    append_form → append( (list_form | list_T_var){2,} )
    list_form   → list( T_root{1,} )

21.6.5 Tuple Normal Form

For tuple types tuple_{T1,T2,...,TN}, the only elementary function is the tuple constructor (tuple). The constants are T1_constant × T2_constant × ... × TN_constant. The normal form is either a constant, a var, or

    tuple( T1_root T2_root ... TN_root )

21.6.6 Enum Normal Form

Enums are atomic tokens with no internal structure - accordingly, there are no elementary functions. The constants for the enum {s1, s2, ..., sN} are the si. The normal form is either a constant or a variable.

21.6.7 Function Normal Form

For T1, T2, ..., TN → O, the normal form is a lambda-expression of arity N whose body is of type O. The list of variable names for the lambda-expression is not a "proper" argument - it does not have a normal form of its own. Assuming that none of the Ti is a list type, the body of the lambda-expression is simply in the normal form for type O (with the possibility of the lambda-expression's arguments appearing with their appropriate types). If one or more Ti are list types, then the body is a call to the split function with all arguments in normal form. Split is a family of functions with type signatures

    (T1, list_T1, T2, list_T2, ..., Tk, list_Tk → O), tuple_{list_T1, O}, tuple_{list_T2, O}, ..., tuple_{list_Tk, O} → O

To evaluate split( f, tuple(l1, o1), tuple(l2, o2), ..., tuple(lk, ok) ), the list arguments l1, l2, ..., lk are examined sequentially. If some li is found that is empty, then the result is the corresponding value oi. If all li are nonempty, we deconstruct each of them into xi : xsi, where xi is the first element of the list and xsi is the rest. The result is then f(x1, xs1, x2, xs2, ..., xk, xsk). The split function thus acts as an implicit case statement to deconstruct lists only if they are nonempty.
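The behavior of split is easy to pin down with a small sketch. The Python below is an illustrative rendering of the semantics just described (the names are ours), treating each tuple(li, oi) as a (list, default) pair:

    # Sketch of the split semantics (illustrative Python, not the Combo primitive).
    # Each (lst, default) pair is examined in order; if any list is empty its
    # default is returned, otherwise f is applied to all heads and tails.
    def split(f, *pairs):
        heads_and_tails = []
        for lst, default in pairs:
            if not lst:
                return default          # first empty list wins
            heads_and_tails += [lst[0], lst[1:]]
        return f(*heads_and_tails)

    # Example: sum the heads of two lists, or fall back to 0 / -1.
    f = lambda x, xs, y, ys: x + y
    print(split(f, ([1, 2], 0), ([10], -1)))   # 11
    print(split(f, ([],  0), ([10], -1)))      # 0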
21.6.8 Action Result Normal Form

An action result type act corresponds to the result of taking an action in some world. Every action result type has a corresponding world type, world. Associated with action results and worlds are two special sorts of functions:

• Perceptions - functions that take a world as their first argument and regular (non-world and non-action-result) types as their remaining arguments, and return regular types. Unlike other function types, the result of evaluating a perception call may be different at different times, because the world will have different configurations at different times.
• Actions - functions that take a world as their first argument and regular types as their remaining arguments, and return action results (of the type associated with the type of their world argument). As with perceptions, the result of evaluating an action call may be different at different times. Furthermore, actions may have side effects in the associated world that they are called in. Thus, unlike any other sort of function, actions must be evaluated, even if their return values are ignored.

Other sorts of functions acting on worlds (e.g. ones that take multiple worlds as arguments) are disallowed.

Note that an action result expression cannot appear nested inside an expression of any other type. Consequently, there is no way to convert e.g. an action result to a Boolean, although conversion in the opposite direction is permitted. This is required because mathematical operations in our language have classical mathematical semantics; x and y must equal y and x, which will not generally be the case if x or y can have side-effects. Instead, there are special sequential versions of the logical functions which may be used. The elementary functions for action result types are andseq (sequential and, equivalent to C's short-circuiting &&), orseq (sequential or, equivalent to C's short-circuiting ||), and fails (negates success to failure and vice versa). The constants may vary from type to type but must at least contain success and failure, indicating absolute success/failure in execution.⁴

⁴ A do(arg1, arg2, ..., argN) statement (known as progn in Lisp), which evaluates its arguments sequentially regardless of success or failure, is equivalent to andseq(orseq(arg1, success), orseq(arg2, success), ..., orseq(argN, success)).

The normal form is as follows:

    act_root    → orseq_form | andseq_form | seqlit
    seqlit      → act | fails( act )
    act         → act_constant | act_var
    orseq_form  → orseq( (andseq_form | seqlit){2,} )
    andseq_form → andseq( (orseq_form | seqlit){2,} )

21.7 Program Transformations

A program transformation is any type-preserving mapping from expressions to expressions. Transformations may be guaranteed to preserve semantics. When doing program evolution there is an intermediate category of fitness-preserving transformations that may alter semantics, but not fitness. In general, the only way that fitness-preserving transformations will be uncovered is by scoring programs that have had their semantics potentially transformed, to determine their fitness - which is what most fitness functions do. On the other hand, if the fitness function is encompassed in the program itself, so that a candidate directly outputs its fitness, then only semantics-preserving transformations are needed.

21.7.1 Reductions

These are semantics-preserving transformations that do not increase some size measure (typically number of symbols), and are idempotent. For example, and(x, x, y) → and(x, y) is a reduction for Boolean expressions. A set of canonical reductions is defined for every type that has a normal form. For numerical functions, the simplifier in a computer algebra system may be used. The full list of reductions is omitted for brevity. An expression is reduced if it maps to itself under all canonical reductions for its type, and all of its children are reduced.

Another important set of reductions are the compressive abstractions, which reduce or keep constant the size of expressions by introducing new functions. Consider

    list( *(+(a p q) r) *(+(b p q) r) *(+(c p q) r) )

which contains 19 symbols. Transforming this to

    f(x) := *(+(x p q) r)
    list( f(a) f(b) f(c) )

reduces the total number of symbols to 15. One can generalize this notion to consider compressive abstractions across a set of programs. Compressive abstractions appear to be rather expensive to uncover, although not prohibitively so; the computation may easily be parallelized, and may rely heavily on subtree mining [TODO ref].
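The following Python sketch illustrates the flavor of a reduction engine (it is a toy, not OpenCog's reduct library): rules rewrite tuple-represented expressions, children are reduced first, and rewriting iterates to a fixpoint, never increasing the symbol count.

    # Toy reduction engine (illustrative only).
    def reduce_expr(e, rules):
        if not isinstance(e, tuple):
            return e
        e = tuple([e[0]] + [reduce_expr(c, rules) for c in e[1:]])  # children first
        changed = True
        while changed:
            changed = False
            for rule in rules:
                new = rule(e)
                if new is not None and new != e:
                    e, changed = new, isinstance(new, tuple)
                    if not isinstance(e, tuple):
                        return e
        return e

    def dedup_and(e):        # and(x, x, y) -> and(x, y)
        if e[0] == 'and':
            seen, args = set(), []
            for a in e[1:]:
                if a not in seen:
                    seen.add(a); args.append(a)
            if len(args) < len(e) - 1:
                return args[0] if len(args) == 1 else tuple(['and'] + args)
        return None

    def if_true(e):          # if(true, a, b) -> a
        if e[0] == 'if' and e[1] is True:
            return e[2]
        return None

    print(reduce_expr(('and', 'x', 'x', ('if', True, 'y', 'z')),
                      [dedup_and, if_true]))   # -> ('and', 'x', 'y')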
21.7.1.1 A Simple Example of Reduction

We now give a simple example of how CogPrime's reduction engine can transform a program into a semantically equivalent but shorter one. Consider the following program and the chain of reductions:

1. We start with the expression

    if(P and_seq(if(P A B) B) and_seq(A B))

2. A reduction rule permits reducing the conditional if(P A B) to if(true A B). Indeed, if P is true, then the first branch is evaluated and P must still be true.

    if(P and_seq(if(true A B) B) and_seq(A B))

3. Then a rule can reduce if(true A B) to A.

    if(P and_seq(A B) and_seq(A B))

4. And finally another rule replaces the conditional by one of its branches, since they are identical.

    and_seq(A B)

Note that the reduced program is not only smaller (3 symbols instead of 11) but a bit faster too. Of course it is not generally true that smaller programs are faster, but in the restricted context of our experiments it has often been the case.

21.7.2 Neutral Transformations

Semantics-preserving transformations that are not reductions are not useful on their own - they can only have value when followed by transformations from some other class. They are thus more speculative than reductions, and more costly to consider. We will refer to these as neutral transformations [Ols95].

• Abstraction - given an expression E containing non-overlapping subexpressions E1, E2, ..., EN, let E' be E with all Ei replaced by the unbound variables v1, v2, ..., vN. Define the function f(v1, v2, ..., vN) = E', and replace E with f(E1, E2, ..., EN). Abstraction is distinct from compressive abstraction because only a single call to the new function f is introduced.⁵
• Inverse abstraction - replace a call to a user-defined function with the body of the function, with arguments instantiated (note that this can also be used to partially invert a compressive abstraction).
• Distribution - let E be a call to some function f, and let E' be a subexpression of E's ith argument that is a call to some function g, such that f is distributive over g's arguments, or a subset thereof. We shall refer to the actual arguments to g in these positions in E' as x1, x2, ..., xn. Now, let D(F) be the function that is obtained by evaluating E with its ith argument (the one containing E') replaced with the expression F. Distribution is replacing E with E', and then replacing each xj (1 ≤ j ≤ n) with D(xj). For example, consider

    +(x *(y if(cond a b)))

Since both + and * are distributive over the result branches of if, there are two possible distribution transformations, giving the expressions

    if(cond +(x *(y a)) +(x *(y b)))
    +(x if(cond *(y a) *(y b)))

• Inverse distribution (factorization) - the opposite of distribution. This is nearly a reduction; the exceptions are expressions such as f(g(x)), where f and g are mutually distributive.
• Arity broadening - given a function f, modify it to take an additional argument of some type. All calls to f must be correspondingly broadened to pass it an additional argument of the appropriate type.
• List broadening - given a function f with some ith argument x of type T, modify f to instead take an argument y of type list_T, which gets split into x : xs. All calls to f with ith argument x' must be replaced by corresponding calls with ith argument list(x').⁶

⁵ In compressive abstraction there must be at least two calls in order to avoid increasing the number of symbols.
⁶ Analogous tuple-broadening transformations may be defined as well, but are omitted for brevity.
• Conditional insertion - an expression x is replaced by if(true, x, y), where y is some expression of the same type as x.

As a technical note, action result expressions (which may cause side-effects) complicate neutral transformations. Specifically, abstractions and compressive abstractions must take their arguments lazily (i.e. not evaluate them before the function call itself is evaluated) in order to be neutral. Furthermore, distribution and inverse distribution may only be applied when f has no side-effects that will vary (e.g. be duplicated or halved) in the new expression, or affect the nested computation (e.g. change the result of a condition within a conditional). Another way to think about this issue is to consider the action result type as a lazy domain-specific language embedded within a pure functional language (where evaluation order is unspecified). Spector has performed an empirical study of the tradeoffs in lazy vs. eager function abstraction for program evolution [Spe96].

The number of neutral transformations applicable to any given program grows quickly with program size.⁷ Furthermore, synthesis of complex programs and abstractions does not seem to be possible without them. Thus, a key hypothesis of any approach to AGI requiring significant program synthesis, without assuming the currently infeasible computational capacities required to brute-force the problem, is that the inductive bias to select promising neutral transformations can be learned and/or programmed. Referring back to the initial discussion of what constitutes a tractable representation, we speculate that perhaps, whereas well-chosen reductions are valuable for generically increasing program representation tractability, well-chosen neutral transformations will be valuable for increasing program representation tractability relative to distributions P to which the transformations have some (possibly subtle) relationship.

⁷ Exact calculations are given by Olsson [Ols95].
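To make one of the transformations above concrete, the Python sketch below performs abstraction on a tuple-represented expression, replacing chosen subexpressions with formal arguments #1, #2, ... of a new function. The representation and function names are our own illustrative stand-ins for real Combo machinery:

    # Sketch of the "abstraction" neutral transformation (illustrative only).
    def abstract(expr, subexprs):
        """Return (new_function_body, argument_list) for expr."""
        def replace(e):
            if e in subexprs:
                return '#%d' % (subexprs.index(e) + 1)
            if isinstance(e, tuple):
                return tuple([e[0]] + [replace(c) for c in e[1:]])
            return e
        return replace(expr), list(subexprs)

    # E = *(+(a p q) r), abstracting the subexpression a:
    body, args = abstract(('*', ('+', 'a', 'p', 'q'), 'r'), ['a'])
    print(body)   # ('*', ('+', '#1', 'p', 'q'), 'r') -- i.e. f with f(a) = E
    print(args)   # ['a'] -- E is replaced by the single call f(a)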
21.7.3 Non-Neutral Transformations

Non-neutral transformations are the general class defined by removal, replacement, and insertion of subexpressions, acting on expressions in normal form, and preserving the normal form property. Clearly these transformations are sufficient to convert any normal form expression into any other. What is desired is a subclass of the non-neutral transformations that is combinatorially complete, where each individual transformation is nonetheless a semantically small step.

The full set of transformations for Boolean expressions is given in [Loo06]. For numerical expressions, the transcendental functions sin, log, and exp are used to construct transformations. These obviate the need for division (a/b = e^(log(a) - log(b))) and subtraction (a - b = a + -1 * b). For lists, transformations are based on insertion of new leaves (e.g. to append function calls), and "deepening" of the normal form by insertion of subclauses (see [Loo06] for details). For tuples, we take the union of the transformations of all the subtypes. For other mixed-type expressions the union of the non-neutral transformations for all types must be considered as well. For enum types the only transformation is replacing one symbol with another. For function types, the transformations are based on function composition. For action result types, actions are inserted/removed/altered, akin to the treatment of Boolean literals for the Boolean type.

We propose an additional class of non-neutral transformations based on the marvelous fold function (sketched in Python at the end of this section):

    fold(f, v, l) = if(empty(l), v, f(first(l), fold(f, v, rest(l))))

With fold we can express a wide variety of iterative constructs, with guaranteed termination and a bias towards low computational complexity. In fact, fold allows us to represent exactly the primitive recursive functions [Hut99].

Even considering only this reduced space of possible transformations, in many cases there are still too many possible programs "nearby" some target to effectively consider all of them. For example, many probabilistic model-building algorithms, such as learning the structure of a Bayesian network from data, can require time cubic in the number of variables (in this context each independent non-neutral transformation can correspond to a variable). Especially as the size of the programs we wish to learn grows, and as the number of typologically matching functions increases, there will simply be too many variables to consider each one intensively, let alone apply a quadratic-time algorithm.

To alleviate this scaling difficulty, we propose three techniques. The first is to consider each potential variable (i.e. independent non-neutral transformation) to heuristically determine its usefulness in expressing constructive semantic variation. For example, a Boolean transformation that collapses the overall expression into a tautology is assumed to be useless.⁸ The second is heuristic coupling rules that allow us to calculate, for a pair of transformations, the expected utility of applying them in conjunction. Finally, while fold is powerful, it may need to be augmented by other methods in order to provide tractable representation of complex programs that would normally be written using numerous variables with diverse scopes. One approach that we have explored involves application of Sinot's ideas about director strings as combinators. In Sinot's approach, special program tree nodes are labeled with director strings, and special algebraic operators interrelate these strings. One then achieves the representational efficiency of local variables with diverse scopes, without needing to do any actual variable management. Reductions and other (non-)neutral transformation rules related to broadening and reducing variable scope may then be defined using the director string algebra.

⁸ This is heuristic because such a transformation might be useful together with other transformations.
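Here is the fold sketch promised above - a direct Python transcription of the defining equation (illustrative; the real Combo primitive operates on Combo lists):

    # fold(f, v, l) = if(empty(l), v, f(first(l), fold(f, v, rest(l))))
    def fold(f, v, lst):
        if not lst:
            return v
        return f(lst[0], fold(f, v, lst[1:]))

    # Iterative constructs with guaranteed termination, e.g.:
    print(fold(lambda x, acc: x + acc, 0, [1, 2, 3, 4]))       # sum -> 10
    print(fold(lambda x, acc: [x * 2] + acc, [], [1, 2, 3]))   # map -> [2, 4, 6]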
21.8 Interfacing Between Procedural and Declarative Knowledge

Finally, another critical aspect of procedural knowledge is its interfacing with declarative knowledge. We now discuss the referencing of declarative knowledge within procedures, and the referencing of the details of procedural knowledge within CogPrime's declarative knowledge store.

21.8.1 Programs Manipulating Atoms

Above we have used Combo syntax implicitly, referring to Appendix ?? for the formal definitions. Now we introduce one additional, critical element of Combo syntax: the capability to explicitly reference declarative knowledge within procedures. For this purpose Combo must contain the following types:

    Atom, Node, Link, TruthValue, AtomType, AtomTable

Atom is the union of Node and Link. So a type Node within a Combo program refers to a Node in CogPrime's AtomTable. The mechanisms used to evaluate these entities during program evaluation are discussed in Chapter 25.

For example, suppose one wishes to write a Combo program that creates Atoms embodying the predicate-argument relationship eats(cat, fish), represented

    Evaluation eats (cat, fish)

aka

    Evaluation
        eats
        List cat fish

To do this, one could say for instance

    new-link(EvaluationLink
        new-node(PredicateNode "eats")
        new-link(ListLink
            new-node(ConceptNode "cat")
            new-node(ConceptNode "fish"))
        (new-stv .99 .99))

21.9 Declarative Representation of Procedures

Next, we consider the representation of program tree internals using declarative data structures. This is important if we want OCP to inferentially understand what goes on inside programs. In itself, it is more of a "bookkeeping" issue than a deep conceptual issue, however.

First, note that each of the entities that can live at an internal node of a program can also live in its own Atom. For example, a number in a program tree corresponds to a NumberNode; an argument in a Combo program already corresponds to some Atom; and an operator in a program can be wrapped up in a SchemaNode all its own, and considered as a one-leaf program tree. Thus, one can build a kind of virtual, distributed program tree by linking a number of ProcedureNodes (i.e. PredicateNodes or SchemaNodes) together. All one needs in order to achieve this is an analogue of the @ symbol (as defined in Section 20.3 of Chapter 20) for relating ProcedureNodes. This is provided by the ExecutionLink type, where (ExecutionLink f g) essentially means the same as f g in curried notation, or

      @
     / \
    f   g

The same generalized evaluation rules used inside program trees may be thought of in terms of ExecutionLinks; formally, they are crisp ExtensionalImplicationLinks among ExecutionLinks. Note that we are here using ExecutionLink as a curried function; that is, we are looking at (ExecutionLink f g) as a function that takes an argument x, where the truth value of (ExecutionLink f g) x represents the probability that executing f, on input g, will give output x.

One may then construct combinator expressions linking multiple ExecutionLinks together; these are the analogues of program trees. For example, using ExecutionLinks, one equivalent of y = x + x^2 is:

    Hypothetical
        SequentialAND
            ExecutionLink
                pow
                List v1 2
                v2
            ExecutionLink
                +
                List v1 v2
                v3

Here v1, v2, v3 are variables which may be internally represented via combinators. This AND is sequential in case the evaluation order inside the program interpreter makes a difference.

As a practical matter, it seems there is no purpose to explicitly storing program trees in conjunction-of-ExecutionLinks form. The information in the ExecutionLink conjunct is already there in the program tree. However, the PLN reasoning system, when reasoning on program trees, may carry out this kind of expansion internally as part of its analytical process.
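To illustrate the "virtual program tree" idea, the Python sketch below represents the two ExecutionLinks for y = x + x^2 as records mapping output variables to (function, inputs) pairs, then evaluates the chain. The dictionary representation and the execute function are our own illustrative stand-ins, not an OpenCog API:

    # Illustrative chaining of ExecutionLink-style records (hypothetical API).
    import math

    # Each entry: output variable <- (function, input variables/constants)
    links = {
        'v2': ('pow', ['v1', 2]),    # ExecutionLink pow (v1, 2) -> v2
        'v3': ('+',   ['v1', 'v2'])  # ExecutionLink +   (v1, v2) -> v3
    }
    funcs = {'pow': lambda a, b: math.pow(a, b), '+': lambda a, b: a + b}

    def execute(var, bindings):
        if var in bindings:                  # already bound (e.g. the input v1)
            return bindings[var]
        f, args = links[var]
        vals = [a if isinstance(a, (int, float)) else execute(a, bindings)
                for a in args]
        bindings[var] = funcs[f](*vals)
        return bindings[var]

    print(execute('v3', {'v1': 3.0}))   # 3 + 3^2 = 12.0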
Section II
The Cognitive Cycle

Chapter 22
Emotion, Motivation, Attention and Control
Co-authored with Zhenhua Cai

22.1 Introduction

This chapter begins the heart of the book: the part that explains how the CogPrime design aims to implement roughly human-like general intelligence, at the human level and ultimately beyond. First, here in Section II we explain how CogPrime can be used to implement a simplistic animal-like agent without much learning: an agent that perceives, acts and remembers, and chooses actions that it thinks will achieve its goals, but doesn't do any sophisticated learning or reasoning or pattern recognition to help it better perceive, act, remember or figure out how to achieve its goals. We're not claiming CogPrime is the best way to implement such an animal-like agent, though we suggest it's not a bad way, and depending on the complexity and nature of the desired behaviors, it could be the best way. We have simply chosen to split off the parts of CogPrime needed for animal-like behavior and present them first, prior to presenting the various "knowledge creation" (learning, reasoning and pattern recognition) methods that constitute the more innovative and interesting part of the design.

In Stan Franklin's terms, what we explain here in Section II is how a basic cognitive cycle may be achieved within CogPrime. In that sense, the portion of CogPrime explained in this Section is somewhat similar to the parts of Stan's LIDA architecture that have currently been worked out in detail. However, while LIDA has not yet been extended in detail (in theory or implementation) to handle advanced learning, cognition and language, those aspects of CogPrime have been developed, and in fact constitute the largest portion of this book.

Looking back to the integrative diagram from Chapter 5, the cognitive cycle is mainly about integrating vaguely LIDA-like structures and mechanisms with heavily Psi-like structures and mechanisms - but doing so in a way that naturally links in with perception and action mechanisms "below," and more abstract and advanced learning mechanisms "above."

In terms of the general theory of general intelligence, the basic CogPrime cognitive cycle can be seen to have a foundational importance in biasing the CogPrime system toward the problem of controlling an agent in an environment requiring a variety of real-time and near-real-time responses based on a variety of kinds of knowledge. Due to its basis in human and animal cognition, the CogPrime cognitive cycle likely incorporates many useful biases in ways that are not immediately obvious, but that would become apparent if comparing intelligent agents controlled by such a cycle versus intelligent agents controlled via other means.

The cognitive cycle also provides a framework in which other cognitive processes, relating to various aspects of the goals and environments relevant to human-level general intelligence, may conveniently dynamically interoperate. The "Mind OS" aspect of the CogPrime architecture provides general mechanisms in which various cognitive processes may interoperate on a common knowledge store; the cognitive cycle goes further and provides a specific dynamical pattern in which multiple cognitive processes may intersect. Its effective operation places strong demands on the cognitive synergy between the various cognitive processes involved, but also provides a framework that encourages this cognitive synergy to develop and persist.

Finally, it should be stressed that the cognitive cycle is neither all-powerful nor wholly pervasive in CogPrime's dynamics.
It's critical for the real-time interaction of a CogPrime-controlled agent with a virtual or physical world; but there may be many processes within CogPrime that most naturally operate outside such a cycle. For instance, humans will habitually do deep intellectual thinking (even something so abstract as mathematical theorem proving) within a cognitive cycle somewhat similar to the one they use for practical interaction with the external world. But there's no reason that CogPrime systems need to be constrained in this way. Deviating from a cognitive-cycle-based dynamic may cause a CogPrime system to deviate further from human-likeness in its intelligence, but may also help it to perform better than humans on some tasks, e.g. tasks like scientific data analysis or mathematical theorem proving that benefit from styles of information processing that humans aren't particularly good at.

22.2 A Quick Look at Action Selection

We will begin our exposition of CogPrime's cognitive cycle with a quick look at action selection. As Stan Franklin likes to point out, the essence of an intelligent agent is that it does things; it takes actions. The particular mechanisms of action selection in CogPrime are a bit involved and will be given in Chapter 24; in this chapter we will give the basic idea of the action selection mechanism and then explain how a variant of the Psi model (described in Chapter 4 of Part 1 above) is used to handle motivation (emotions, drives, goals, etc.) in CogPrime, including the guidance of action selection.

The crux of CogPrime's action selection mechanism is as follows (a toy numerical illustration of the selection rule appears after this list):

• the action selector chooses procedures that seem likely to help achieve important goals in the current context
  - Example: If the goal is to create a block structure that will surprise Bob, and there is plenty of time, one procedure worth choosing might be a memory search procedure for remembering situations involving Bob and physical structures. Alternately, if there isn't much time, one procedure worth choosing might be a procedure for building the base of a large structure - as this will give something to use as part of whatever structure is eventually created. Another procedure worth choosing might be one that greedily assembles structures from blocks without any particular design in mind.
• to support the action selector, the system builds implications of the form Context & Procedure → Goal, where Context is a predicate evaluated based on the agent's situation
  - Example: If Bob has asked the agent to do something, and it knows that Bob is very insistent on being obeyed, then implications such as
    • "Bob instructed to do X" and "do X" → "please Bob" <.9, .9>
    will be utilized
  - Example: If the agent wants to make a tower taller, then implications such as
    • "$T is a blocks structure" and "place block atop $T" → "make $T taller" <.9, .9>
    will be utilized
• the truth values of these implications are evaluated based on experience and inference
  - Example: The above implication involving Bob could be evaluated based on experience, by assessing it against remembered episodes involving Bob giving instructions
  - Example: The same implication could be evaluated based on inference, using analogy to experiences with instructions from other individuals similar to Bob; or using things Bob has explicitly said, combined with knowledge that Bob's self-descriptions tend to be reasonably accurate
• importance values are propagated between goals using economic attention allocation (and inference is used to learn subgoals from existing goals)
  - Example: If Bob has told the agent to do X, and the agent has then derived (from the goal of pleasing Bob) the goal of doing X, then the "please Bob" goal will direct some of its currency to the "do X" goal (which the latter goal can then pass to its subgoals, or spend on executing procedures)
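As promised above, here is a toy numerical illustration of the selection rule: choose the procedure whose Context & Procedure → Goal implication, weighted by the context's truth value and the goal's importance, scores highest. All names and numbers below are made up for illustration; the actual CogPrime mechanism is given in Chapter 24.

    # Toy scoring of candidate procedures (illustrative numbers/names only).
    # (context_truth, implication_strength, implication_confidence, goal_STI)
    candidates = {
        'recall_Bob_structures': (0.9, 0.8, 0.7, 0.6),
        'build_large_base':      (0.9, 0.6, 0.9, 0.6),
        'greedy_assembly':       (0.9, 0.5, 0.5, 0.6),
    }

    def score(context, strength, confidence, goal_sti):
        # Expected payoff: how true the context is, times how strongly (and
        # how confidently) the procedure is believed to achieve the goal,
        # times the goal's current importance.
        return context * strength * confidence * goal_sti

    best = max(candidates, key=lambda k: score(*candidates[k]))
    print(best)   # 'recall_Bob_structures' under these made-up numbers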
These various processes are carried out in a manner orchestrated by Dörner's Psi model as refined by Joscha Bach (as reviewed in Chapter 4 above), which supplies (among other features):

• a specific theory regarding what "demands" should be used to spawn the top-level goals
• a set of (four) interrelated system parameters governing overall system state in a useful manner reminiscent of human and animal psychology
• a systematic theory of how various emotions (wholly or partially) emerge from more fundamental underlying phenomena

22.3 Psi in CogPrime

The basic concepts of the Psi approach to motivation, as reviewed in Chapter 4 of Part 1 above, are incorporated in CogPrime as follows (note that the following list includes many concepts that will be elaborated in more detail in later chapters):

• Demands are GroundedPredicateNodes (GPNs), i.e. Nodes that have their truth value computed at each time by some internal C++ code or some Combo procedure in the ProcedureRepository
  - Examples: Alertness, perceived novelty, internal novelty, reward from teachers, social stimulus
  - Humans and other animals have familiar demands such as hunger, thirst and excretion; to create an AGI closely emulating a human or (say) a dog, one may wish to simulate these in one's AGI system as well
• Urges are also GPNs, with their truth values defined in terms of the truth values of the Nodes for corresponding Demands. However, in CogPrime we have chosen the term "Ubergoal" instead of Urge, as this is more evocative of the role that these entities play in the system's dynamics (they are the top-level goals).
• Each system comes with a fixed set of Ubergoals (and only very advanced CogPrime systems will be able to modify their Ubergoals)
  - Example: Stay alert and alive now and in the future; experience and learn new things now and in the future; get reward from the teachers now and in the future; enjoy rich social interactions with other minds now and in the future
  - A more advanced CogPrime system could have abstract (but experientially grounded) ethical principles among its Ubergoals, e.g. an Ubergoal to promote joy, an Ubergoal to promote growth and an Ubergoal to promote choice, in accordance with the ethics described in [Goe10]
• The ShortTermImportance of an Ubergoal indicates the urgency of the goal, so if the Demand corresponding to an Ubergoal is within its target range, then the Ubergoal will have zero STI. But all Ubergoals can be given maximal LTI, to guarantee they don't get deleted.
  - Example: If the system is in an environment continually providing an adequate level of novelty (according to its Ubergoal), then the Ubergoal corresponding to external novelty will have low STI but high LTI. The system won't expend resources seeking novelty. But then, if the environment becomes more monotonous, the urgency of the external novelty goal will increase, and its STI will increase correspondingly, and resources will begin getting allocated toward improving the novelty of the stimuli received by the agent.
• Pleasure is a GPN, and its internal truth value computing program compares the satisfaction of Ubergoals to their expected satisfaction
  - Of course, there are various mathematical functions (e.g. p'th power averages¹ for different p) that one can use to average the satisfaction of multiple Ubergoals; and choices here, i.e. different specific ways of calculating Pleasure, could lead to systems with different "personalities"

¹ The p'th power average is defined as (Σ xᵢᵖ)^(1/p).

• Goals are Nodes or Links that are on the system's list of goals (the GoalPool). Ubergoals are automatically Goals, but there will also be many other Goals as well
  - Example: The Ubergoal of getting reward from teachers might spawn subgoals like "getting reward from Bob" (if Bob is a teacher), or "making teachers smile" or "create surprising new structures" (if the latter often garners teacher reward). The subgoal of "create surprising new structures" might, in the context of a new person entering the agent's environment with a bag of toys, lead to the creation of a subgoal of asking for a new toy of the sort that could be used to help create new structures. Etc.
• Psi's memory is CogPrime's AtomTable, with associated structures like the ProcedureRepository (explained in Chapter 19), the SpaceServer and TimeServer (explained in Chapter 26), etc.
  - Examples: The knowledge of what blocks look like and the knowledge that tall structures often fall down go in the AtomTable; specific procedures for picking up blocks of different shapes go in the ProcedureRepository; the layout of a room or a pile of blocks at a specific point in time goes in the SpaceServer; the series of events involved in the building-up of a tower are temporally indexed in the TimeServer.
  - In Psi and MicroPsi, these same phenomena are stored in memory in a rather different way, yet the basic Psi motivational dynamics are independent of these representational choices
• Psi's "motive selection" process is carried out in CogPrime by economic attention allocation, which allocates ShortTermImportance to Goal nodes
  - Example: The flow of importance from "get reward from teachers" to "get reward from Bob" to "make an interesting structure with blocks" is an instance of what Psi calls "motive selection". No action is being taken yet, but choices are being made regarding what specific goals are going to be used to guide action selection.
• Psi's action selection plays the same role as CogPrime's action selection, with the clarification that in CogPrime this is a matter of selecting which procedures (i.e. schemata) to run, rather than which individual actions to execute. However, this notion exists in Psi as well, which accounts for "automatized behaviors" that are similar to CogPrime schemata; the only (minor) difference here is that in CogPrime automatized behaviors are the default case.
  - Example: If the goal "make an interesting structure with blocks" has a high STI, then it may be used to motivate choice of a procedure to execute, e.g. a procedure that finds an interesting picture or object seen before and approximates it with blocks, or a procedure that randomly constructs something and then filters it based on interestingness. Once a blocks-structure-building procedure is chosen, this procedure may invoke the execution of sub-procedures such as those involved with picking up and positioning particular blocks.
• Psi's planning is carried out via various learning processes in CogPrime, including PLN plus procedure learning methods like MOSES or hillclimbing
  - Example: If the agent has decided to build a blocks structure emulating a pyramid (which it saw in a picture), and it knows how to manipulate and position individual blocks, then it must figure out a procedure for carrying out individual-block actions that will result in production of the pyramid. In this case, a very inexperienced agent might use MOSES or hillclimbing and "guidedly-randomly" fiddle with different construction procedures until it hit on something workable. A slightly more experienced agent would use reasoning based on prior structures it had built, to figure out a rational plan (like: "start with the base, then iteratively pile on layers, each one slightly smaller than the previous.")
• The modulators are system parameters which may be represented by PredicateNodes, and which must be incorporated appropriately in the dynamics of various MindAgents, e.g.
  - activation affects action selection. For instance this may be effected by a process that, each cycle, causes a certain amount of STICurrency to pass to schemata satisfying certain properties (those involving physical action, or terminating rapidly). The amount of currency passed in this way would be proportional to the activation
  - resolution level affects perception schemata and MindAgents, causing them to expend less effort in processing perceptual data
  - certainty affects inference and pattern mining and concept creation processes, causing them to place less emphasis on certainty in guiding their activities, i.e. to be more accepting of uncertain conclusions.
To give a single illustrative example: lV1Len backward chaining inference is being used to find values for variables, a "fitness target" of the form strength x confidence Ls sometimes used: this may be replaced with strengths' x cortfidence2-P, where activation parameter affects the exponent p, so when p tends to 0 confidence is more important, when p tends to 2 strength is more important and when p tends to 1 strength and confidence are equally important. - selection threshold may be used to effect a process that, each cycle, causes a certain amount of STICurrency (proportional to the selection threshold) to pass to the Goal Atoms that were wealthiest at the previous cycle. Based on this run-down, Psi and CogPrime may seem very similar, but that's because we have focused here only on the motivation and emotion aspect. Psi uses a very different knowledge representation than CogPrime; and in the Psi architecture diagram, nearly all of CogPrime is pushed into the role of "background processes that operate in the memory box." According to the theoretical framework underlying CogPrime, the multiple synergetic processes operating in the memory box are actually the crux of general intelligence. But getting the motivation/emotion framework right Ls also very important, and Psi seems to do an admirable job of that. 22.4 Implementing Emotion Rules atop Psi's Emotional Dynamics Human motivations are largely determined by human emotions, which are the result of human- ity's evolutionary heritage and embodiment, which are quite different than the heritage and embodiment of current AI systems. So, if we want to create AGI systems that lack humanlike bodies, and didn't evolve to adapt to the same environments as humans did, yet still have vaguely human-like emotional and motivational structures, the latter will need to be explicitly engineered or taught in some way. For instance, if one wants to make a CogPrime agent display anger, something beyond Psi's model of emotion needs to be coded into the agent to enable this. After all, the rule that when angry the agent has some propensity to harm other beings, is not implicit in Psi and needs to be programmed in. However, making use of Psi's emotion model, anger could be characterized as an emotion consisting of high arousal, low resolution, strong motive dominance, few background checks, strong goal-orientedness (as the Psi model suggests) and a propensity to cause harm to agents or objects. This is much simpler than specifying a large set of detailed rules characterizing angry behavior. The "anger" example brings up the point that desirability of giving AGI systems closely humanlike emotional and motivational systems is questionable. After all we humans cause ourselves a lot of problems with these aspects of our mind/brains, and we sometimes put our more ethical and intellectual sides at war with our emotional and motivational systems. Looking into the future, an AGI with greater power than humans yet a humanlike motivational and emotional system, could be a very dangerous thing. On the other hand, if an AGI's motivational and emotional system is too different from human nature, we might have trouble understanding it, and it understanding us. This problem shouldn't be overblown - it seems possible that an AGI with a more orderly and rational motivational system than the human one might be able to understand us intellectually very, well, and that we might be able to understand it well using our analytical tools. 
22.4 Implementing Emotion Rules atop Psi's Emotional Dynamics

Human motivations are largely determined by human emotions, which are the result of humanity's evolutionary heritage and embodiment, which are quite different than the heritage and embodiment of current AI systems. So, if we want to create AGI systems that lack humanlike bodies, and didn't evolve to adapt to the same environments as humans did, yet still have vaguely human-like emotional and motivational structures, the latter will need to be explicitly engineered or taught in some way.

For instance, if one wants to make a CogPrime agent display anger, something beyond Psi's model of emotion needs to be coded into the agent to enable this. After all, the rule that when angry the agent has some propensity to harm other beings is not implicit in Psi, and needs to be programmed in. However, making use of Psi's emotion model, anger could be characterized as an emotion consisting of high arousal, low resolution, strong motive dominance, few background checks, strong goal-orientedness (as the Psi model suggests), and a propensity to cause harm to agents or objects. This is much simpler than specifying a large set of detailed rules characterizing angry behavior.

The "anger" example brings up the point that the desirability of giving AGI systems closely humanlike emotional and motivational systems is questionable. After all, we humans cause ourselves a lot of problems with these aspects of our mind/brains, and we sometimes put our more ethical and intellectual sides at war with our emotional and motivational systems. Looking into the future, an AGI with greater power than humans yet a humanlike motivational and emotional system could be a very dangerous thing. On the other hand, if an AGI's motivational and emotional system is too different from human nature, we might have trouble understanding it, and it understanding us. This problem shouldn't be overblown - it seems possible that an AGI with a more orderly and rational motivational system than the human one might be able to understand us intellectually very well, and that we might be able to understand it well using our analytical tools. However, if we want to have mutual empathy with an AGI system, then its motivational and emotional framework had better have at least some reasonable overlap with our own. The value of empathy for ethical behavior was stressed extensively in Chapter 12 of Part 1.

This is an area where experimentation is going to be key. Our initial plan is to supply CogPrime with rough emulations of some but not all human emotions. We see no need to take explicit pains to simulate emotions like anger, jealousy and hatred. On the other hand, joy, curiosity, sadness, wonder, fear and a variety of other human emotions seem both natural in the context of a robotically or virtually embodied CogPrime system, and valuable in terms of allowing mutual human/CogPrime empathy.

22.4.1 Grounding the Logical Structure of Emotions in the Psi Model

To make this point in a systematic way, we point out that Ortony et al's [OCC90] "cognitive theory of emotions" can be grounded in CogPrime's version of Psi in a natural way. This theory captures a wide variety of human and animal emotions in a systematic logical framework, so that grounding their framework in CogPrime Psi goes a long way toward explaining how CogPrime Psi accounts for a broad spectrum of human emotions. The essential idea of the cognitive theory of emotions can be seen in Figure 22.1. What we see there is that common emotions can be defined in terms of a series of choices:

• Is it positive or negative?
• Is it a response to an agent, an event or an object?
• Is it focused on consequences for oneself, or for another?
  - If on another, is it good or bad for the other?
  - If on oneself, is it related to some event whose outcome is uncertain?
• If it's related to an uncertain outcome, did the expectation regarding the outcome get fulfilled or not?

Figure 22.1 shows how each set of answers to these questions leads to a different emotion. For instance: what is a negative emotion, responding to events, focusing on another, and undesirable to the other? Pity.

In the list of questions, we see that two of them - positive vs. negative, and expectation fulfillment vs. otherwise - are foundational in the Psi model. The other questions are evaluations that an intelligent agent would naturally make, but aren't bound up with Psi's emotion/motivation infrastructure in such a deep way. Thus, the cognitive theory of emotion emerges as a combination of some basic Psi factors with some more abstract cognitive properties (good vs. bad for another; agents vs. events vs. objects).

Fig. 22.1: Ontology of Emotions, from [OCC90]

22.5 Goals and Contexts

Now we dig deeper into the details of motivation in CogPrime. Just as we have both explicit (local) and implicit (global) memory in CogPrime, we also have both explicit and implicit goals. An explicit goal is formulated as a Goal Atom, and then MindAgents specifically orient the system's activity toward achievement of that goal. An implicit goal is something that the system works toward, but in a more loosely organized way, and without necessarily explicitly representing the knowledge that it is working toward that goal.
Here we will focus mainly on explicit motivation, beginning with a description of Goal Atoms, and the Contexts in which Goals are worked toward via executing Procedures. Figure 22.2 gives a rough depiction of the relationship between goals, procedures and context, in a simple example relevant to an OpenCogPrime-controlled virtual agent in a game world.

Fig. 22.2: Context, Procedures and Goals. Examples of the basic "goal/context/procedure" triad in a simple game-agent situation.

22.5.1 Goal Atoms

A Goal Atom represents a target system state, and is true to the extent that the system satisfies the conditions it represents. A Context Atom represents an observed state of the world/mind, and is true to the extent that the state it defines is observed. Taken together, these two Atom types provide the infrastructure CogPrime needs to orient its actions in specific contexts toward specific goals. Not all of CogPrime's activity is guided by these Atoms; much of it is non-goal-directed and spontaneous, or ambient as we sometimes call it. But it is important that some of the system's activity - and in some cases, a substantial portion - is controlled explicitly via goals.

Specifically, a Goal Atom is simply an Atom (usually a PredicateNode, sometimes a Link, and potentially another type of Atom) that has been selected by the GoalRefinement MindAgent as one that represents a state of the atom space which the system finds important to achieve. The extent to which an Atom is considered a Goal Atom at a particular point in time is determined by how much of a certain kind of financial instrument called an RFS (Request For Service) it possesses (as will be explained in Chapter 24).

A CogPrime instance must begin with some initial Ubergoals (aka top-level supergoals), but may then refine these goals in various ways using inference. Immature, "childlike" CogPrime systems cannot modify their Ubergoals, nor add nor delete Ubergoals. Advanced CogPrime systems may be allowed to modify, add or delete Ubergoals, but this is a critical and subtle aspect of system dynamics that must be treated with great care.

22.6 Context Atoms

Next, a Context is simply an Atom that is used as the source of a ContextLink, for instance

    Context
        quantum_computing
        Inheritance Ben amateur

or

    Context
        game_of_fetch
        PredictiveAttraction
            Evaluation give (ball, teacher)
            Satisfaction

The former simply says that Ben is an amateur in the context of quantum computing. The latter says that in the context of the game of fetch, giving the ball to the teacher implies satisfaction.
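A small sketch may clarify how context-relative truth values behave. The class below is an illustrative stand-in of our own devising (the actual implementation attaches VersionHandles to an Atom's truth value object, as discussed in the implementation note below): a context-free truth value plus per-context overrides.

    # Illustrative context-dependent truth values (hypothetical class, not
    # the OpenCog implementation).  A truth value is a (strength, confidence)
    # pair; None stands for the universal (context-free) context.
    class Atom:
        def __init__(self, name, default_tv=(0.5, 0.0)):
            self.name = name
            self.tv = {None: default_tv}

        def set_tv(self, tv, context=None):
            self.tv[context] = tv

        def get_tv(self, context=None):
            # Fall back to the context-free truth value if no version exists.
            return self.tv.get(context, self.tv[None])

    # "Ben is an amateur" -- generally false-ish, true in the context of
    # quantum computing:
    inh = Atom('Inheritance Ben amateur')
    inh.set_tv((0.1, 0.9))                          # context-free
    inh.set_tv((0.9, 0.8), 'quantum_computing')     # ContextLink-style version
    print(inh.get_tv('quantum_computing'))   # (0.9, 0.8)
    print(inh.get_tv('blocks_building'))     # falls back to (0.1, 0.9)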
A more complex instance pertinent to our running example would be

    Context
        Evaluation
            Recently
            List
                Minute
                Evaluation
                    Ask
                    List
                        Bob
                        ThereExists $X
                            And
                                Evaluation Build List self $X
                                Evaluation surprise List $X Bob
        AverageQuantifier $Y
            PredictiveAttraction
                And
                    Evaluation Build List self $Y
                    Evaluation surprise List $Y Jim
                Satisfaction

which says that, if the context is that Bob has recently asked for something surprising to be built, then one strategy for getting satisfaction is to build something that seems likely to surprise Jim.

An implementation-level note: in the current OpenCogPrime implementation of CogPrime, ContextLinks are implicit rather than explicit entities. An Atom can contain a ComplexTruthValue which in turn contains a number of VersionHandles. Each VersionHandle associates a Context or a Hypothetical with a TruthValue. This accomplishes the same thing as a formal ContextLink, but without the creation of a ContextLink object. However, we continue to use ContextLinks in this book and other documents about CogPrime; and it's quite possible that future CogPrime implementations might handle them differently.

22.7 Ubergoal Dynamics

In the early phases of a CogPrime system's cognitive development, the goal system dynamics will be quite simple. The Ubergoals are supplied by human programmers, and the system's adaptive cognition is used to derive subgoals. Attentional currency allocated to the Ubergoals is then passed along to the subgoals, as judged appropriate. As the system becomes more advanced, however, more interesting phenomena may arise regarding Ubergoals: implicit and explicit Ubergoal creation.

22.7.1 Implicit Ubergoal Pool Modification

First of all, implicit Ubergoal creation or destruction may occur. Implicit Ubergoal destruction may occur when there are multiple Ubergoals in the system, and some prove easier to achieve than others. The system may then decide not to bother achieving the more difficult Ubergoals. Appropriate parameter settings may mitigate this phenomenon, of course.

Implicit Ubergoal creation may occur if some Goal Node G arises that inherits as a subgoal from multiple Ubergoals. This Goal G may then come to act implicitly as an Ubergoal, in that it may get more attentional currency than any of the Ubergoals.

Also, implicit Ubergoal creation may occur via forgetting. Suppose that G becomes a goal via inferred inheritance from one or more Ubergoals. Then, suppose the system forgets why this inheritance exists, and that in fact the reason becomes obsolete, but the system doesn't realize that and keeps the inheritance there. Then, G is an implicit Ubergoal in a strong sense: it gobbles up a lot of attentional currency, potentially more than any of the actual Ubergoals, but actually doesn't help achieve the Ubergoals, even though the system thinks it does. This kind of dynamic is obviously very bad and should be avoided - and can be avoided with appropriate tuning of system parameters (so that the system pays a lot of attention to making sure that its subgoaling-related inferences are correct and are updated in a timely way).

22.7.2 Explicit Ubergoal Pool Modification

An advanced CogPrime system may be given the ability to explicitly modify its Ubergoal pool. This is a very interesting but very subtle type of dynamic, which is not currently well understood and which potentially could lead to dramatically unpredictable behaviors.
However, modification, creation and deletion of goals is a key aspect of human psychology, and the granting of this capability to mature CogPrime systems must be seriously considered.

In the case that Ubergoal pool modification is allowed, one useful heuristic may be to make implicit Ubergoals into explicit Ubergoals. For instance: if an Atom is found to consistently receive a lot of RFSs, and has a long time-scale associated with it, then the system should consider making it an Ubergoal. But this heuristic is certainly not sufficient, and any advanced CogPrime system that is going to modify its own Ubergoals should definitely be tuned to put a lot of thought into the process! The science of Ubergoal pool dynamics basically does not exist at the moment, and one would like to have some nice mathematical models of the process prior to experimenting with it in any intelligent, capable CogPrime system. Although Schmidhuber's Gödel machine [Sch06] has the theoretical capability to modify its ubergoal (note that CogPrime is, in some ways, a Gödel machine), there is currently no mathematics allowing us to assess the time and space complexity of such a process in a realistic context, given a certain safety confidence target.

22.8 Goal Formation

Goal formation in CogPrime is done via PLN inference. In general, what PLN does for goal formation is to look for predicates that can be proved to probabilistically imply the existing goals. These new predicates will then tend to receive RFS currency, according to the logic of RFSs to be outlined in Chapter 24, which (according to goal-driven attention allocation dynamics) will make the system more likely to enact procedures that lead to their satisfaction.

As an example of the goal formation process, consider the case where ExternalNovelty is an Ubergoal. The agent may then learn that whenever Bob gives it a picture to look at, its quest for external novelty is satisfied to a significant degree. That is, it learns

    Attraction
        Evaluation give (Bob, me, picture)
        ExternalNovelty

where Attraction A B measures how much A, as opposed to its absence, implies B (as explained in Chapter 34; a toy calculation is sketched below). This information allows the agent (the Goal Formation MindAgent) to nominate the atom

    EvaluationLink give (Bob, me, picture)

as a goal (a subgoal of the original Ubergoal). This is an example of goal refinement, which is one among many ways that PLN can create new goals from existing ones.
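The toy calculation promised above: estimating attraction(A, B) ~ P(B|A) - P(B|not A) from a log of experiences. The event log and predicate names are invented for illustration; the precise PLN formulation is given in Chapter 34.

    # Toy estimate of Attraction from an experience log (illustrative only).
    def attraction(events, a, b):
        """events: list of dicts mapping predicate name -> bool."""
        def cond_prob(given_a):
            relevant = [e for e in events if e[a] == given_a]
            if not relevant:
                return 0.0
            return sum(e[b] for e in relevant) / len(relevant)
        return cond_prob(True) - cond_prob(False)

    # Experience log: did Bob give a picture, and was novelty satisfied?
    log = [{'give_picture': True,  'novelty': True}] * 8 + \
          [{'give_picture': True,  'novelty': False}] * 2 + \
          [{'give_picture': False, 'novelty': True}] * 3 + \
          [{'give_picture': False, 'novelty': False}] * 7

    print(round(attraction(log, 'give_picture', 'novelty'), 2))  # 0.8 - 0.3 = 0.5

A high attraction value like this is what licenses nominating "Bob gives me a picture" as a subgoal of ExternalNovelty.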
22.9 Goal Fulfillment and Predicate Schematization

When there is a Goal Atom G important in the system (i.e. holding a lot of RFS), the GoalFulfillment MindAgent seeks SchemaNodes S that it has reason to believe, if enacted, will cause G to become true (satisfied). It then adds these to the ActiveSchemaPool, an object to be discussed below. The dynamics by which the GoalFulfillment process works will be discussed in Chapter 24 below. For example, if a Context Node C has a high truth value at that time (because it is currently satisfied), and is involved in a relation

    Attraction
        C
        PredictiveAttraction S G

(for some SchemaNode S and Goal Node G), then this SchemaNode S is likely to be selected by the GoalFulfillment process for execution. This is the fully formalized version of the Context & Schema → Goal notion discussed frequently above. The process may also allow the importance of various schemata S to bias its choices of which schemata to execute. For instance, following up previous examples, we might have

    Attraction
        Evaluation near List self Bob
        PredictiveAttraction
            Evaluation ask List Bob "Show me a picture"
            ExternalNovelty

Of course this is a very simplistic relationship, but it's similar to a behavior a young child might display. A more advanced agent would utilize a more abstract relationship that distinguishes various situations in which Bob is nearby, and also involves expressing a concept rather than a particular sentence.

The formation of these schema-context-goal triads may occur according to generic inference mechanisms. However, a specially-focused PredicateSchematization MindAgent is very useful here as a mechanism of inference control, increasing the number of such relations that will exist in the system.

22.10 Context Formation

New contexts are formed by a combination of processes:

• The MapEncapsulation MindAgent, which creates Context Nodes embodying repeated patterns in the perceived world. This process encompasses
  - Maps creating Context Nodes involving Atoms that have high STI at the same time
    • Example: A large number of Atoms related to towers could be joined into a single map, which would then be a ConceptNode pointing to "tower-related ideas, procedures and experiences"
  - Maps creating Context Nodes that are involved in a temporal activation pattern that recurs at multiple points in the system's experience.
    • Example: There may be a common set of processes involved in creating a building out of blocks: first build the base, then the walls, then the roof. This could be encapsulated as a temporal map embodying the overall nature of the process. In this case, the map contains information of the nature: first do things related to this, then do things related to this, then do things related to this...
• A set of concept creation MindAgents (see Chapter 38), which fuse and split Context Nodes to create new ones.
  - The concept of a building and the concept of a person can be merged to create the concept of a BuildingMan
  - The concept of a truck built with Legos can be subdivided into trucks you can actually carry Lego blocks with, versus trucks that are "just for show" and can't really be loaded with objects and then carry them around

22.11 Execution Management

The GoalFulfillment MindAgent chooses schemata that are found likely to achieve current goals, but it doesn't actually execute these schemata. What it does is to take these schemata and place them in a container called the ActiveSchemaPool. The ActiveSchemaPool contains a set of schemata that have been determined to be reasonably likely, if enacted, to significantly help with achieving the current goal-set. I.e., everything in the active schema pool should be a schema S such that it has been concluded that

    Attraction
        C
        PredictiveAttraction S G

- where C is a currently applicable context and G is one of the goals in the current goal pool - has a high truth value compared to what could be obtained from other known schemata, or from other schemata that could reasonably be expected to be found via reasoning.

The decision of which schemata in the ActiveSchemaPool to enact is made by an object called the ExecutionManager, which is invoked each time the SchemaActivation MindAgent is executed.
The ExecutionManager is used to select which schemata to execute, based on doing reasoning and consulting memory regarding which active schemata can usefully be executed simultaneously without causing destructive interference (and hopefully causing constructive interference). This process will also sometimes (indirectly) cause new schemata to be created and/or other schemata from the AtomTable to be made active. This process is described more fully in Chapter 24 on action selection. For instance, if the agent is involved in building a blocks structure intended to surprise or please Bob, then it might simultaneously carry out some blocks-manipulation schema, and also a schema involving looking at Bob to garner his approval. If it can do the blocks manipulation without constantly looking at the blocks, this should be unproblematic for the agent.

22.12 Goals and Time

The CogPrime system maintains an explicit list of "Ubergoals", which, as will be explained in Chapter 24, receive attentional currency which they may then allocate to their subgoals according to a particular mechanism. However, there is one subtle factor involved in the definition of the Ubergoals: time. The truth value of an Ubergoal is typically defined as the average level of satisfaction of some Demand over some period of time - but the time scale of this averaging can be very important. In many cases, it may be worthwhile to have separate Ubergoals corresponding to the same Demand but doing their truth-value time-averaging over different time scales. For instance, corresponding to Demands such as Novelty or Health, we may posit both long-term and short-term versions, leading to Ubergoals such as CurrentNovelty, LongTermNovelty, CurrentHealth, LongTermHealth, etc. (a toy illustration of such two-time-scale averaging appears at the end of this section). Of course, one could also wrap multiple Ubergoals corresponding to a single Demand into a single Ubergoal combining estimates over multiple time scales; this is not a critical issue, and the only point of splitting Demands into multiple Ubergoals is that it can make things slightly simpler for other cognitive processes.

For instance, if the agent has a goal of pleasing Bob, and it knows Bob likes to be presented with surprising structures and ideas, then the agent has some tricky choices to make. Among other choices it must balance between focusing on

• creating things and then showing them to Bob
• studying basic knowledge and improving its skills.

Perhaps studying basic knowledge and skills will give it a foundation to surprise Bob much more dramatically in the mid-term future ... but in the short run will not allow it to surprise Bob much at all, because Bob already knows all the basic material. This is essentially a variant of the general "exploration versus exploitation" dichotomy, which lacks any easy solution. Young children are typically poor at carrying out this kind of balancing act, and tend to focus overly much on near-term satisfaction. There are also significant cultural differences in the heuristics with which adult humans face these issues; e.g. in some contexts Oriental cultures tend to focus more on mid to long term satisfaction whereas Western cultures are more short term oriented.
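The toy illustration promised above: one Demand feeding two Ubergoals whose truth values are exponential moving averages over different horizons. Constants and names are invented purely for illustration:

    # Two Ubergoals tracking one Demand over different time scales (toy model).
    class TimeScaledGoal:
        def __init__(self, name, horizon):
            self.name = name
            self.alpha = 1.0 / horizon    # shorter horizon -> faster tracking
            self.tv = 0.5

        def update(self, satisfaction):
            self.tv += self.alpha * (satisfaction - self.tv)

    current = TimeScaledGoal('CurrentNovelty', horizon=5)
    longterm = TimeScaledGoal('LongTermNovelty', horizon=100)

    for t in range(50):
        s = 0.9 if t < 40 else 0.1     # novelty drops off near the end
        current.update(s)
        longterm.update(s)

    print(round(current.tv, 2), round(longterm.tv, 2))
    # CurrentNovelty reacts sharply to the recent drop; LongTermNovelty
    # barely moves -- so the two Ubergoals can drive different behaviors.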
Chapter 23
Attention Allocation
Co-authored with Joel Pitt, Matt Iklé and Rui Liu

23.1 Introduction

The critical factor shaping real-world general intelligence is resource constraint. Without this issue, we could just have simplistic program-space-search algorithms like AIXItl instead of complicated systems like the human brain or CogPrime. Resource constraint is managed implicitly within various components of CogPrime, for instance in the population size used in evolutionary learning algorithms, and the depth of forward or backward chaining inference trees in PLN. But there is also a component of CogPrime that manages resources in a global and cognitive-process-independent manner: the attention allocation component.

The general principles the attention allocation process should follow are easy enough to see: history should be used as a guide, and an intelligence should make probabilistic judgments based on its experience, guessing which resource-allocation decisions are likely to maximize its goal-achievement. The problem is that this is a difficult learning and inference problem, and to carry it out with excellent accuracy would require a limited-resources intelligent system to spend nearly all its resources deciding what to pay attention to, and nearly none of them actually paying attention to anything else. Clearly this would be a very poor allocation of an AI system's attention! So simple heuristics are called for, to be supplemented by more advanced and expensive procedures on those occasions where time is available and correct decisions are particularly crucial.

Attention allocation plays, to a large extent, a "meta" role in enabling mind-world correspondence. Without effective attention allocation, the other cognitive processes can't do their jobs of helping an intelligent agent to achieve its goals in an environment, because they won't be able to pay attention to the most important parts of the environment, and won't get computational resources at the times when they need them. Of course this need could be addressed in multiple different ways. For example, in a system with multiple complex cognitive processes, one could have attention allocation handled separately within each cognitive process, and then a simple "top layer" of attention allocation managing the resources allocated to each cognitive process. On the other hand, one could also do attention allocation via a single dynamic, pervasive both within and between individual cognitive processes. The CogPrime design gravitates more toward the latter approach, though also with some specific mechanisms within various MindAgents; and efforts have been made to have these specific mechanisms modulated by the generic attention allocation structures and dynamics wherever possible.

In this chapter we will dig into the specifics of how these attention allocation issues are addressed in the CogPrime design. In short, they are addressed via a set of mechanisms and equations for dynamically adjusting importance values attached to Atoms and MindAgents. Different importance values exist pertinent to different time scales, most critically the short-term (STI) and long-term (LTI) importances. The use of two separate time-scales here reflects fundamental aspects of human-like general intelligence and real-world computational constraints. The dynamics of STI is oriented partly toward the need for real-time responsiveness, and partly toward the more thoroughgoing need for cognitive processes to run at speeds vaguely resembling the speed of "real time" social interaction.
The dynamics of LTI is based on the fact that some data tends to be useful over long periods of time, years or decades in the case of human life, but the practical capability to store large amounts of data in a rapidly accessible way is limited. One could imagine environments in which very-long-term multiple-type memory was less critical than it is in typical human-friendly environments; and one could envision AGI systems carrying out tasks in which real-time responsiveness was unnecessary (though even then some attention focusing would certainly be necessary). For AGI systems like these, an attention allocation system based on STI and LTI with CogPrime-like equations would likely be inappropriate. But for an AGI system intended to control a vaguely human-like agent in an environment vaguely resembling everyday human environments, the focus on STI and LTI values, and the dynamics proposed for these values in CogPrime, appear to make sense.

Two basic innovations are involved in the mechanisms attached to these STI and LTI importance values:

• treating attention allocation as a data mining problem: the system records information about what it's done in the past and what goals it's achieved in the past, and then recognizes patterns in this history and uses them to guide its future actions, via probabilistically adjusting the (often context-specific) importance values associated with internal terms, actors and relationships, and adjusting the "effort estimates" associated with Tasks
• using an artificial-economics approach to update the importance values (attached to Atoms, MindAgents, and other actors in the CogPrime system) that regulate system attention. (And, more speculatively, using an information-geometry-based approach to execute the optimization involved in the artificial economics approach efficiently and accurately.)

The integration of these two aspects is crucial. The former aspect provides fundamental data about what's of value to the system, and the latter aspect allows this fundamental data to be leveraged to make sophisticated and integrative judgments rapidly. The need for the latter, rapid-updating aspect exists partly because of the need for real-time responsiveness, imposed by the need to control a body in a rapidly dynamic world, and the prominence in the architecture of an animal-like cognitive cycle. The need for the former, data-mining aspect (or something functionally equivalent) exists because, in the context of the tasks involved in human-level general intelligence, the assignment of credit problem is hard - the relations between various entities in the mind and the mind's goals are complex, and identifying and deploying these relationships is a difficult learning problem requiring application of sophisticated intelligence.
Both of these aspects of attention allocation dynamics may be used in computationally lightweight or computationally sophisticated manners:

• For routine use in real-time activity:
  - "data mining" consists of forming HebbianLinks (involved in the associative memory and inference control; see Section 23.5), where the weight of the link from Atom A to Atom B is based on the probability of shared utility of A and B
  - economic attention allocation consists of spreading ShortTermImportance and LongTermImportance "artificial currency" values (both grounded in the universal underlying "juju" currency value defined further below) between Atoms, according to specific equations that somewhat resemble neural net activation equations but respect the conservation of currency
• For use in cases where large amounts of computational resources are at stake based on localized decisions, so that allocation of substantial resources to specific instances of attention allocation is warranted:
  - "data mining" may be more sophisticated, including use of PLN, MOSES and pattern mining to recognize patterns regarding what probably deserves more attention in what contexts
  - economic attention allocation may involve more sophisticated economic calculations, involving the expected future values of various "expenditures" of resources

The particular sort of "data mining" going on here is definitely not exactly what the human brain does, but we believe this is a case where slavish adherence to neuroscience would be badly suboptimal (even if the relevant neuroscience were well known, which is not the case). Doing attention allocation entirely in a distributed, formal-neural-net-like way is, we believe, extremely and unnecessarily inefficient; given realistic resource constraints, it leads to the rather poor attention allocation that we experience every day in our ordinary waking state of consciousness. Several aspects of attention allocation can be fruitfully done in a distributed, neural-net-like way, but not having a logically centralized repository of system-history information (regardless of whether it's physically distributed or not) seems intrinsically problematic in terms of effective attention allocation. And we argue that, even for those aspects of attention allocation that are best addressed in terms of distributed, vaguely neural-net-like dynamics, an artificial-economics approach has significant advantages over a more strictly neural-net-like approach, due to the greater ease of integration with other cognitive mechanisms such as forgetting and data mining.

23.2 Semantics of Short and Long Term Importance

We now specify the two types of importance value (short and long term) that play a key role in CogPrime dynamics. Conceptually, ShortTermImportance (STI) is defined as

STI(A) = P(A will be useful in the near future)

whereas LongTermImportance (LTI) is defined as

LTI(A) = P(A will be useful eventually, in the foreseeable future)

Given a time-scale T, in general we can define an importance value relative to T as

I_T(A) = P(A will be useful during the next T seconds)

In the ECAN module in CogPrime, we deal only with STI and LTI rather than any other importance values, and the dynamics of STI and LTI are dealt with by treating them as two separate "artificial currency" values, which however are interconvertible via being mutually grounded in a common currency called "juju."
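As a toy illustration of the time-scale-relative importance I_T (illustrative code, not the ECAN implementation), one can estimate I_T(A) from a recorded usage history by asking how often a window of length T following a randomly chosen moment contains a use of A:

import random

def importance_at_timescale(use_times: list[float], horizon: float,
                            t_start: float, t_end: float,
                            samples: int = 2000) -> float:
    """Monte Carlo estimate of I_T(A) = P(A will be useful during the next
    T seconds), where the 'present moment' is drawn uniformly from the
    recorded interval [t_start, t_end]."""
    hits = 0
    for _ in range(samples):
        now = random.uniform(t_start, t_end)
        if any(now < u <= now + horizon for u in use_times):
            hits += 1
    return hits / samples

uses = [1.0, 2.5, 3.0, 50.0]   # moments at which Atom A proved useful
print(importance_at_timescale(uses, horizon=1.0, t_start=0, t_end=60))    # STI-like: small T
print(importance_at_timescale(uses, horizon=60.0, t_start=0, t_end=60))   # LTI-like: large T

The same usage trace yields a small value at a short horizon and a large value at a long horizon, which is the qualitative STI/LTI split described in the example that follows.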
For instance, if the agent is intensively concerned with trying to build interesting blocks structures, then knowledge about interpreting biology research paper abstracts is likely to be of very little current importance. So its biological knowledge will get low STI; but - assuming the agent expects to use biology again - that knowledge should maintain reasonably high LTI, so it can remain in memory for future use. And if, in its brainstorming about what blocks structures to build, the system decides to use some biological diagrams as inspiration, STI can always spread to some of the biology-related Atoms, increasing their relevance and getting them more attention.

While the attention allocation system contains mechanisms to convert STI to LTI, it also has parameter settings biasing it to spend its juju on both kinds of importance - i.e., it contains an innate bias to both focus its attention judiciously and manage its long-term memory conscientiously.

Because in CogPrime most computations involving STI and LTI are required to be very rapid (as they're done for many Atoms in the memory very frequently), in most cases when dealing with these quantities it will be appropriate to sacrifice accuracy for efficiency. On the other hand, it's useful to occasionally be able to carry out expensive, highly accurate computations involving importance. An example where doing expensive computations about attention allocation might pay off would be the decision whether to use biology-related or engineering-related metaphors in creating blocks structures to please a certain person. In this case it could be worth doing a few steps of inference to figure out whether there's a greater intensional similarity between that person's interests and biology or engineering, and then using the results to adjust the STI levels of whichever of the two comes out most similar. This would not be a particularly expensive inference to carry out, but it's still much more effort than what can be expended on Atoms in the memory most of the time. Most attention allocation in CogPrime involves simple neural-net type spreading dynamics rather than explicit reasoning.

Figure 23.1 illustrates the key role of LTI in the forgetting process. Figure 23.2 illustrates the key role of STI in maintaining a "moving bubble of attention", which we call the system's AttentionalFocus.

23.2.1 The Precise Semantics of STI and LTI

Now we precisiate the above definitions of STI and LTI. First, we introduce the notion of reward. Reward is something that Goals give to Atoms. In principle a Goal might give an Atom reward in various different forms, though in the design given here, reward will be given in units of a currency called juju. The process by which Goals assign reward to Atoms is part of the "assignment of credit" process (and we will later discuss the various time-scales on which assignment of credit may occur, and their relationship to the time-scale parameter within LTI). Next, we define

J(A, t_1, t_2, r) = the expected amount of reward A will receive between t_1 and t_2 time-steps in the future, if its STI has percentile rank r among all Atoms in the AtomTable

Fig. 23.1: LongTermImportance and Forgetting (decision flow: an Atom either stays in the AtomSpace, or is removed from the AtomSpace and then either saved to the backup store or permanently deleted).

The percentile rank r of an Atom is the rank of that Atom in a list of Atoms ordered by decreasing STI, divided by the total number of Atoms.
The reason for using a percentile rank instead of the STI value itself is that at any given time only a limited number of Atoms can be given attention, so all Atoms below a certain percentile rank - depending on the amount of available resources - will simply be ignored. J is a fine-grained measure of how worthwhile it is expected to be to increase A's STI, in terms of getting A rewarded by Goals. For practical purposes it is useful to collapse J(A, t_1, t_2, r) to a single number:

J(A, t_1, t_2) = \frac{\sum_r J(A, t_1, t_2, r)\, w_r}{\sum_r w_r}

where the w_r weight the different percentile ranks (and should be chosen to be monotone increasing in r). This is a single-number measure of the responsiveness of an Atom's utility to its STI level. So, for instance, if A has a lot of STI and it turns out to be rewarded, then J(A, t_1, t_2) will be high. On the other hand, if A has little STI, then whether it gets rewarded or not will not influence J(A, t_1, t_2) much. To simplify notation, it's also useful to define the single-time-point version J(A, t) = J(A, t, t).

23.2.1.1 Formalizing STI

Using these definitions, one simple way to make the STI definition precise is:

STI_{thresh}(A, t) = P\big( J(A, t, t + t_{short}) > S_{threshold} \big)

where S_{threshold} demarcates the "attentional focus boundary." This is a way of saying that we don't want to give STI to Atoms that would not get rewarded if they were given attention.

Fig. 23.2: Formation of the AttentionalFocus. The dynamics of STI is configured to encourage the emergence of richly cross-connected networks of Atoms with high STI (above a threshold called the AttentionalFocusBoundary), passing STI among each other as long as this is useful and forming new HebbianLinks among each other. The collection of these Atoms is called the AttentionalFocus.

Or, one could make the STI definition precise in a fuzzier way, and define

STI_{fuzzy}(A, t) = \sum_{s=0}^{\infty} J(A, t+s)\, c^{s}

for some appropriate parameter c (or something similar, with a decay function less severe than exponential). In either case, the goal of the ECAN subsystem, regarding STI, is to assign each Atom A an STI value that corresponds as closely as possible to the theoretical STI values defined by whichever one of the above equations is selected (or some other similar equation).

23.2.2 STI, STIFund, and Juju

But how can one estimate these probabilities in practice? In some cases they may be estimated via explicit inference. But often they must be estimated by heuristics. The estimative approach taken in the current CogPrime design is an artificial economy, in which each Atom maintains a certain fund of artificial currency. In the current proposal this currency is called juju, and is the same currency used to value LTI. Let us call the amount of juju owned by Atom A the STIFund of A. Then, one way to formalize the goal of the artificial economy is to state that: if one ranks all Atoms by the wealth of their STIFund, and separately ranks all Atoms by their theoretical STI value, the rankings should be as close as possible to the same. One may also formalize the goal in terms of value correlation instead of rank correlation, of course. Proving conditions under which the STIFund values will actually correlate well with the theoretical STI values is an open math problem. Heuristically, one may map STIFund values into theoretical STI values by a mapping such as

A.STI = \alpha + \beta\, \frac{A.STIFund - STIFund_{min}}{STIFund_{max} - STIFund_{min}}

where STIFund_{min} = \min_X X.STIFund and STIFund_{max} = \max_X X.STIFund, the extrema being taken over all Atoms X in the AtomTable. However, we don't currently have rigorous grounding for any particular functional form for such a mapping; the above is just a heuristic approximation.
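The rank-correlation goal and the affine mapping heuristic can be spelled out in a few lines (an illustrative sketch; α and β are free parameters, as in the text):

def percentile_ranks(sti_funds: dict[str, float]) -> dict[str, float]:
    """Rank Atoms by decreasing STIFund; percentile rank = rank / number of Atoms."""
    ordered = sorted(sti_funds, key=sti_funds.get, reverse=True)
    n = len(ordered)
    return {atom: (i + 1) / n for i, atom in enumerate(ordered)}

def heuristic_sti(sti_funds: dict[str, float],
                  alpha: float = 0.0, beta: float = 1.0) -> dict[str, float]:
    """Map STIFund values to estimated theoretical STI via the affine heuristic
    STI(A) = alpha + beta * (A.STIFund - min) / (max - min)."""
    lo, hi = min(sti_funds.values()), max(sti_funds.values())
    span = (hi - lo) or 1.0
    return {a: alpha + beta * (v - lo) / span for a, v in sti_funds.items()}

funds = {"dog": 12.0, "cat": 3.0, "tower": 7.5}
print(percentile_ranks(funds))
print(heuristic_sti(funds))

Note that any monotone mapping preserves the rank-correlation criterion; the affine form is simply the cheapest choice.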
The artificial economy approach leads to a variety of supporting heuristics. For instance, one such heuristic is: if A has been used at time t, then it will probably be useful at time t + s, for small s. Based on this heuristic, whenever a MindAgent uses an Atom A, it may wish to increase A's STIFund (so as to hopefully increase the correlation of A's STIFund with its theoretical STI). It does so by transferring some of its juju to A's STIFund.

23.2.3 Formalizing LTI

Similarly to STI, with LTI we will define theoretical LTI values, and posit an LTIFund associated with each Atom, which seeks to create values correlated with the theoretical LTI. For LTI, the theoretical issues are subtler. There is a variety of different ways to precisiate the above loose conceptual definition of LTI. For instance, one can (and we will below) create formalizations of both:

1. LTI_cont(A) = (some time-weighting or normalization of) the expected value of A's total usefulness over the long-term future
2. LTI_burst(A) = the probability that A ever becomes highly useful at some point in the long-term future

(here "cont" stands for "continuous"). Each of these may be formalized in similar but nonidentical ways. These two forms of LTI may be viewed as extremes along a continuum; one could posit a host of intermediary LTI values between them. For instance, one could define

LTI_p(A) = the p'th power average of the expectation of the utility of A over brief time intervals, measured over the long-term future

(the p'th power average of a quantity X being defined as (E[X^p])^{1/p}). Then we would have LTI_burst = LTI_∞ and LTI_cont = LTI_1, and could vary p to vary the sharpness of the LTI computation. This might be useful in some contexts, but our guess is that it's overkill in practice, and that looking at LTI_burst and LTI_cont is enough (or more than enough; the current OCP code uses only one LTI value, and that has not been problematic so far).

23.2.4 Applications of LTI_burst versus LTI_cont

It seems that the two forms of LTI discussed above might be of interest in different contexts, depending on the different ways that Atoms may be used so as to achieve reward. If an Atom is expected to get rewarded for the results of its being selected by MindAgents that carry out diffuse, background thinking (and hence often select low-STI Atoms from the AtomTable), then it may be best associated with LTI_cont. On the other hand, if an Atom is expected to get rewarded for the results of its being selected by MindAgents that are focused on intensive foreground thinking (and hence generally only select Atoms with very high STI), it may be best associated with LTI_burst.

In principle, Atoms could be associated with particular LTI_p based on the particulars of the selection mechanisms of the MindAgents expected to lead to their reward. But the issue with this is that it would result in Atoms carrying around an excessive abundance of different LTI_p values for various p, resulting in memory bloat; and it would also require complicated analyses of MindAgent dynamics. If we do need more than one LTI value, one would hope that two will be enough, for memory conservation reasons. And of course, if an Atom has only one LTI value associated with it, this can reasonably be taken to stand in for the other one: either of LTI_burst or LTI_cont may, in the absence of information to the contrary, be taken as an estimate of the other.
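Incidentally, the endpoints of this continuum follow from standard facts about power averages. Writing u_A for the (bounded) utility of A over a randomly chosen brief interval in the long-term future, one may state the correspondence as:

\mathrm{LTI}_p(A) = \big( \mathbb{E}[\,u_A^{\,p}\,] \big)^{1/p}
\qquad
\mathrm{LTI}_1(A) = \mathbb{E}[u_A] \quad \text{(time-averaged usefulness: the continuous form)}
\qquad
\lim_{p \to \infty} \mathrm{LTI}_p(A) = \operatorname{ess\,sup} u_A \quad \text{(peak usefulness: the burst form)}

That is, raising p shifts the measure from rewarding steady mild usefulness toward rewarding rare bursts of high usefulness.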
23.2.4.1 LTI with Various Time Lags

The issue of the p value in the average in the definition of LTI is somewhat similar to (though orthogonal to) the point that there are many different interpretations of LTI, achieved via considering various time-lags. Our guess is that a small set of time-lags will be sufficient. Perhaps one wants an exponentially increasing series of time-lags: i.e., to calculate LTI over k cycles, where k is drawn from {τ, 2τ, 4τ, 8τ, ...}.

The time-lag in LTI seems related to the time-lag in the system's goals. If a Goal object is disseminating juju, and the Goal has an intrinsic time scale of t, then it may be interested in LTI on time-scale t. So when a MA (MindAgent) is acting in pursuit of that goal, it should spend a bunch of its juju on LTI on time-scale t. Complex goals may be interested in multiple time-scales (for instance, a goal might place greater value on things that occur in the next hour, but still have nonzero interest in things that occur in a week), and hence may have different levels of interest in LTI on multiple time-scales.

23.2.4.2 Formalizing Burst LTI

Regarding burst LTI, two approaches to formalization seem to be the threshold version

LTI_{burst,thresh}(A) = P(A will receive a total of at least S_{threshold} amount of normalized stimulus during some time interval of length t_{short} in the next t_{long} time steps)

and the fuzzy version

LTI_{burst,fuzzy}(A, t) = \sum_{s=0}^{\infty} J(A, t+s, t+s+t_{short})\, f(s, t_{long})

where f(\cdot, t_{long}) : R^+ \times R^+ \to R^+ is a nonincreasing function that remains roughly constant in its first argument up till a point t_{long} steps in the future, and then begins slowly decaying.

23.2.4.3 Formalizing Continuous LTI

The threshold version of continuous LTI is quite simply

LTI_{cont,thresh}(A, t, t_{long}) = STI_{thresh}(A, t), with the time-scale t_{long} in place of t_{short}

That is, smooth threshold LTI is just like smooth threshold STI, but the time-scale involved is longer. On the other hand, the fuzzy version of smooth LTI is

LTI_{cont,fuzzy}(A, t) = \sum_{s=0}^{\infty} J(A, t+s)\, f(s, t_{long})

using the same decay function f that was introduced above in the context of burst LTI.

23.3 Defining Burst LTI in Terms of STI

It is straightforward to define burst LTI in terms of STI, rather than directly in terms of juju. We have

LTI_{burst,thresh}(A, t) = P\big( \exists\, s \le t_{long} : STI_{thresh}(A, t+s) \big)

Or, using the fuzzy definitions, we obtain instead the approximate equation

LTI_{burst,fuzzy}(A, t) \approx \sum_{s=0}^{\infty} \alpha(s)\, STI_{fuzzy}(A, t+s)\, f(s, t_{long})

where

\alpha(s) = \frac{1 - c}{1 - c^{s+1}}

or the more complex exact equation

LTI_{burst,fuzzy}(A, t) = \sum_{s=0}^{\infty} STI_{fuzzy}(A, t+s)\, \Big( f(s, t_{long}) - \sum_{r=1}^{\infty} c^{r} f(s-r, t_{long}) \Big)

23.4 Valuing LTI and STI in terms of a Single Currency

We now further discuss the approach of defining LTIFund and STIFund in terms of a single currency: juju (which, as noted, corresponds in the current ECAN design to normalized stimulus). In essence, we can think of STIFund and LTIFund as different forms of financial instrument, which are both grounded in juju. Each Atom has two financial instruments attached to it: "STIFund of Atom A" and "LTIFund of Atom A" (or more, if multiple versions of LTI are used).
These financial instruments have the peculiarity that, although many agents can put juju into any one of them, no record is kept of who put juju in which one. Rather, the MA's are acting so as to satisfy the system's Goals, and are adjusting the STIFund and LTIFund values in a heuristic manner that is expected to approximately maximize the total utility propagated from Goals to Atoms. Finally, each of these financial instruments has a value that gets updated by a specific update equation.

To understand the logic of this situation better, consider the point of view of a Goal with a certain amount of resources (juju, to be used as reward), and a certain time-scale on which its satisfaction is to be measured. Suppose that the Goal has a certain amount of juju to expend on getting itself satisfied. This Goal clearly should allocate some of its juju toward getting processor time allocated to the right Atoms to serve its ends in the near future; and some of its juju toward ensuring that, in the future, the memory will contain the Atoms it will want to see processor time allocated to. Thus, it should allocate some of its juju toward boosting the STIFund of Atoms that it thinks will (if chosen by appropriate MindAgents) serve its needs in the near future, and some of its juju toward boosting the LTIFund of Atoms that it thinks will serve its needs in the future (if they remain in RAM). Thus, when a Goal invokes a MindAgent (giving the MindAgent the juju it needs to access Atoms and carry out its work), it should tell this MindAgent to put some of its juju into LTIFunds and some into STIFunds.

If a MindAgent receives a certain amount of juju each cycle, independently of what the system Goals are explicitly telling it, then this should be viewed as reflecting an implicit goal of "ambient cognition", and the balance of STI and LTI associated with this implicit goal must be a system parameter.

In general, the trade-off between STI and LTI boils down to the weighting between near and far future that is intrinsic to a particular Goal. Simplistically: if a Goal values getting processor allocated to the right stuff immediately 25 times more than getting processor allocated to the right stuff 20K cycles in the future, then it should be willing to spend 25 times more of its juju on STI than on LTI_{20K cycles}. (This simplistic picture is complicated a little by the relationship between different time-scales. For instance, boosting LTI_{20K cycles}(A) will have the indirect effect of increasing the odds that A will still be in memory 20K cycles in the future.)

However, this isn't the whole story, because multiple Goals are setting the importance values of the same set of Atoms. If M1 pumps all its juju into STI for certain Atoms, then M2 may decide it's not worthwhile for it to bother competing with M1 in the STI domain, and spend its juju on LTI instead.

Note that the current system doesn't allow a MA to change its mind about LTI allocations. One can envision a system where a MindAgent could, in January, pay juju to have Atom A kept around for a year, but then change its mind in June, 6 months later, and ask for some of the money back. But this would require an expensive accounting procedure, keeping track of how much of each Atom's LTI had been purchased by which MindAgent; so it seems a poor approach. A more interesting alternative would be to allow MA's to retain adjustable "reserve funds" of juju.
This would mean that a MindAgent would never see a purpose in setting a long-time-scale LTI value (say, LTI over a year) once, instead of repeatedly setting a short-time-scale LTI value, unless a substantial transaction cost were incurred with each transaction of adjusting an Atom's LTI. Introducing a transaction cost, plus an adjustable per-MindAgent juju reserve fund, and LTI's on multiple time scales, would give the LTI framework considerable flexibility. (To prevent MA's from hoarding their juju, one could place a tax rate on reserve juju.)

The conversion rate between STI and LTI becomes an interesting matter; though it seems not a critical one, since in the practical dynamics of the system it's juju that is used to produce STI and LTI. In the current design there is no apparent reason to spread STI of one Atom to LTI of another Atom, or convert the STI of an Atom into LTI of that same Atom, etc. - but such an application might come up. (For the rest of this paragraph, let's just consider LTI with one time scale, for simplicity.) Each Goal will have its own preferred conversion rate between STI and LTI, based on its own balancing of different time scales. But each Goal will also have a limited amount of juju; hence one can only trade a certain amount of STI for LTI, if one is trading with a specific goal G. One could envision a centralized STI-for-LTI market where different MA's would trade with each other, but this seems overcomplicated, at least at the present stage.

As a simpler software design point, this all suggests a value in associating each Goal with a parameter telling how much of its juju it wants to spend on STI versus LTI. Or, more subtly, how much of its juju it wants to spend on LTI on various time-scales. On the other hand, in a simple ECAN implementation this balance may be assumed constant across all Goals.

23.5 Economic Attention Networks

Economic Attention Networks (ECANs) are dynamical systems based on the propagation of STI and LTI values. They are similar in many respects to Hopfield nets, but are based on a different conceptual foundation involving the propagation of amounts of (conserved) currency rather than neural-net activation. Further, ECANs are specifically designed for integration with a diverse body of cognitive processes as embodied in integrative AI designs such as CogPrime. A key aspect of the CogPrime design is the imposition of ECAN structure on the CogPrime AtomSpace. Specifically, ECANs have been designed to serve two main purposes within CogPrime: to serve as an associative memory for the network, and to facilitate effective allocation of the attention of other cognitive processes to appropriate knowledge items.

An ECAN is simply a graph, consisting of un-typed nodes and links, and also "Hebbian" links that may have types such as HebbianLink, InverseHebbianLink, or SymmetricHebbianLink. Each node and link in an ECAN is weighted with two currency values, called STI (short-term importance) and LTI (long-term importance); and each Hebbian link is weighted with a probabilistic truth value. The equations of an ECAN explain how the STI, LTI and Hebbian link weight values get updated over time. As alluded to above, the metaphor underlying these equations is the interpretation of STI and LTI values as (separate) artificial currencies.
The fact that STI and LTI are currencies means that, except in unusual instances where the ECAN controller decides to introduce inflation or deflation and explicitly manipulate the amount of currency in circulation, the total amounts of STI and LTI in the system are conserved. This fact makes the dynamics of an ECAN dramatically different from those of an attractor neural network. In addition to STI and LTI as defined above, the ECAN equations also contain the notion of an AttentionalFocus (AF), consisting of those Atoms in the ECAN with the highest STI values (represented by the S_{threshold} value in the above equations). These Atoms play a privileged role in the system and, as such, are treated using an alternate set of equations.

23.5.1 Semantics of Hebbian Links

Conceptually, the probability value of a HebbianLink from A to B is the odds that if A is in the AF, so is B; and correspondingly, the InverseHebbianLink from A to B is weighted with the odds that if A is in the AF, then B is not. An ECAN will often be coupled with a "Forgetting" process that removes low-LTI Atoms from memory according to certain heuristics. A critical aspect of the ECAN equations is that Atoms periodically spread their STI and LTI to other Atoms that connect to them via HebbianLinks and InverseHebbianLinks; this is the ECAN analogue of activation spreading in neural networks.

Multiple varieties of HebbianLink may be constructed, for instance:

• Asymmetric HebbianLinks, whose semantics are as mentioned above: the truth value of HebbianLink A B denotes the probability that if A is in the AF, so is B
• Symmetric HebbianLinks, whose semantics are that: the truth value of SymmetricHebbianLink A B denotes the probability that if one of A or B is in the AF, both are

It is also worth noting that one can combine ContextLinks with HebbianLinks, to express contextual associations such as: in context C, there is a strong HebbianLink between A and B.

23.5.2 Explicit and Implicit Hebbian Relations

In addition to explicit HebbianLinks, it can be useful to treat other links implicitly as HebbianLinks. For instance, if ConceptNodes A and B are found to connote similar concepts, and a SimilarityLink is formed between them, then this gives reason to believe that maybe a SymmetricHebbianLink between A and B should exist as well. One could incorporate this insight in CogPrime in at least three ways:

• creating HebbianLinks paralleling other links (such as SimilarityLinks)
• adding "Hebbian weights" to other links (such as SimilarityLinks)
• implicitly interpreting other links (such as SimilarityLinks) as HebbianLinks

Further, these strategies may potentially be used together. There are some obvious semantic relationships to be used in interpreting other link types implicitly as HebbianLinks: for instance, Similarity maps into SymmetricHebbian, and Inheritance A B maps into Hebbian A B. One may express these as inference rules, e.g.

SimilarityLink A B <tv_1>
|-
SymmetricHebbianLink A B <tv_2>

where tv_2.s = tv_1.s. Clearly, tv_2.c < tv_1.c; but the precise magnitude of tv_2.c must be determined by a heuristic formula. One option is to set tv_2.c = c · tv_1.c, where the constant c is set empirically via data mining the System Activity Tables to be described below.
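For instance, the third option might look as follows (a minimal sketch; the truth-value representation as a (strength, confidence) pair and the default constant c are assumptions made for illustration):

def similarity_to_hebbian(tv_strength: float, tv_confidence: float,
                          c: float = 0.3) -> tuple[float, float]:
    """Implicitly read SimilarityLink A B <s, conf> as SymmetricHebbianLink A B:
    strength carries over unchanged (tv_2.s = tv_1.s), while confidence is
    discounted by the empirically tuned constant c < 1 (tv_2.c = c * tv_1.c)."""
    return tv_strength, c * tv_confidence

print(similarity_to_hebbian(0.8, 0.9))  # -> (0.8, 0.27)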
23.6 Dynamics of STI and LTI Propagation

We now get more specific about how some of these ideas are implemented in the currently implemented ECAN subsystem of CogPrime. We'll discuss mostly STI here, because in the current design LTI works basically the same way.

MindAgents send out stimulus to Atoms whenever they use them (or else, sometimes, just for the purpose of increasing the Atom's STI); and before these stimulus values are used to update the STI levels of the receiving Atoms, they are normalized: divided by the total amount of stimulus sent out by the MindAgent in that cycle, and multiplied by the total amount of STI currency that the MindAgent decided to spend in that cycle. The normalized stimulus is what has above been called juju. This normalization preserves fairness among MA's, and conservation of currency. (The reason "stimuli" exist, separately from STI, is that stimulus-sending needs to be very computationally cheap: in general it's done frequently by each MA each cycle, and we don't want each action a MA takes to invoke some costly importance-updating computation.) Then, Atoms exchange STI according to certain equations (related to HebbianLinks and other links), and have their STI values updated according to certain equations (which involve, among other operations, transferring STI to the "central bank").

23.6.1 ECAN Update Equations

The CogServer is understood to maintain a kind of central bank of STI and LTI funds. When a non-ECAN MindAgent finds an Atom valuable, it sends that Atom a certain amount of Stimulus, which results in that Atom's STI and LTI values being increased (via equations to be presented below, that transfer STI and LTI funds from the CogServer to the Atoms in question). Then, the ECAN ImportanceUpdating MindAgent carries out multiple operations, including some that transfer STI and LTI funds from some Atoms back to the CogServer. There are multiple ways to embody this process equationally; here we briefly describe two variants.

23.6.1.1 Definition and Analysis of Variant 1

We now define a specific set of equations in accordance with the ECAN conceptual framework described above. We define s = [s_1, \ldots, s_n] to be the vector of STI values, and

C = \begin{bmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nn} \end{bmatrix}

to be the connection matrix of Hebbian probability values, where it is assumed that the existence of a HebbianLink and of an InverseHebbianLink between A and B are mutually exclusive possibilities. We also define

C_{LTI} = \begin{bmatrix} g_{11} & \cdots & g_{1n} \\ \vdots & \ddots & \vdots \\ g_{n1} & \cdots & g_{nn} \end{bmatrix}

to be the matrix of LTI values for each of the corresponding links.

We assume an updating scheme in which, periodically, a number of Atoms are allocated Stimulus amounts, which causes the corresponding STI values to change according to the equations

\forall i : \; s_i = s_i - rent + wages

where rent and wages are given by

rent = \begin{cases} \langle Rent \rangle \cdot \max\Big(0, \log\dfrac{20\, s_i}{recentMaxSTI}\Big), & \text{if } s_i > e \\ 0, & \text{if } s_i \le e \end{cases}

wages = \begin{cases} \dfrac{\langle Wage \rangle \langle Stimulus \rangle}{\sum_{i=1}^{n} p_i}, & \text{if } p_i = 1 \\ \dfrac{\langle Wage \rangle \langle Stimulus \rangle}{n - \sum_{i=1}^{n} p_i}, & \text{if } p_i = 0 \end{cases}

where P = [p_1, \ldots, p_n], with p_i \in \{0, 1\}, is the cue pattern for the pattern that is to be retrieved. All quantities enclosed in angle brackets are system parameters, and LTI updating is accomplished using a completely analogous set of equations.

The changing STI values then cause updating of the connection matrix, according to the "conjunction" equations. First define

norm_i = \begin{cases} s_i / recentMaxSTI, & \text{if } s_i \ge 0 \\ s_i / recentMinSTI, & \text{if } s_i < 0 \end{cases}

Next define

conj = Conjunction(s_i, s_j) = norm_i \times norm_j

and

c'_{ij} = \langle ConjDecay \rangle\, conj + (1 - conj)\, c_{ij}
Finally, update the matrix elements by setting

c_{ij} = \begin{cases} c'_{ij}, & \text{if } c_{ij} > 0 \\ -\,c'_{ij}, & \text{if } c_{ij} < 0 \end{cases}

We are currently also experimenting with updating the connection matrix in accordance with the equations suggested by Storkey (1997, 1998, 1999).

A key property of these equations is that both the wages paid to, and the rent paid by, each node are positively correlated with the node's STI value. That is, the more important nodes are paid more for their services, but they also pay more in rent. A fixed percentage of the links with the lowest LTI values is then forgotten (which corresponds equationally to setting the LTI to 0).

Separately from the above, the process of Hebbian probability updating is carried out via a diffusion process in which some nodes "trade" STI utilizing a diffusion matrix D, a version of the connection matrix C normalized so that D is a left stochastic matrix. D acts on a similarly scaled vector v, normalized so that v is equivalent to a probability vector of STI values.

The decision about which nodes diffuse in each diffusion cycle is carried out via a decision function. We currently are working with two types of decision functions: a standard threshold function, by which nodes diffuse if and only if they are in the AF; and a stochastic decision function, in which nodes diffuse with probability

\frac{\tanh(shape\,(s_i - FocusBoundary)) + 1}{2}

where shape and FocusBoundary are parameters.

The details of the diffusion process are as follows. First, construct the diffusion matrix from the entries in the connection matrix: if c_{ij} \ge 0, then d_{ij} = c_{ij}; else, set d_{ji} = -c_{ij}. Next, we normalize the columns of D to make D a left stochastic matrix. In so doing, we ensure that each node spreads no more than a \langle MaxSpread \rangle proportion of its STI, by setting:

if \sum_{i=1}^{n} d_{ij} > \langle MaxSpread \rangle :

d_{ij} = d_{ij} \times \frac{\langle MaxSpread \rangle}{\sum_{i=1}^{n} d_{ij}}, \text{ for } i \ne j; \qquad d_{jj} = 1 - \langle MaxSpread \rangle

else:

d_{jj} = 1 - \sum_{i=1,\, i \ne j}^{n} d_{ij}

Now we obtain a scaled STI vector v by setting

minSTI = \min_{i} s_i, \qquad maxSTI = \max_{i} s_i, \qquad v_i = \frac{s_i - minSTI}{maxSTI - minSTI}

The diffusion matrix is then used to update the node STIs,

v' = Dv

and the STI values are rescaled to the interval [minSTI, maxSTI].

In both the rent-and-wage stage and the diffusion stage, the total STI and LTI funds of the system each separately form a conserved quantity: in the case of diffusion, the vector v is simply the total STI times a probability vector. To maintain overall system funds within homeostatic bounds, a mid-cycle tax and rent-adjustment can be triggered if necessary; the equations currently used for this are:

1. \langle Rent \rangle is reset in proportion to the product of the recent stimulus awarded before the update and \langle Wage \rangle;
2. tax = x/n, where x is the distance from the current AtomSpace funds to the center of the homeostatic range for AtomSpace funds;
3. \forall i : \; s_i = s_i - tax

23.6.1.2 Investigation of Convergence Properties of Variant 1

Now we investigate some of the properties that the above ECAN equations display when we use an ECAN defined by them as an associative memory network in the manner of a Hopfield network. We consider a situation where the ECAN is supplied with memories via a "training" phase in which one imprints it with a series of binary patterns of the form P = [p_1, \ldots, p_n], with p_i \in \{0, 1\}. Noisy versions of these patterns are then used as cue patterns during the retrieval process. We obviously desire that the ECAN retrieve the stored pattern corresponding to a given cue pattern. In order to achieve this goal, the ECAN must converge to the correct fixed point.
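Before turning to the convergence analysis, the following simulation sketch may help fix ideas. It is a simplified rendering of one Variant 1 rent/wage/diffusion cycle under stated assumptions (natural-log rent, the cue-pattern wage rule above, and a diffusion matrix built from the nonnegative part of C with columns capped at ⟨MaxSpread⟩); the parameter values are illustrative, and the code is not the OpenCog implementation.

import numpy as np

def ecan_cycle(s, C, cue, Rent=0.05, Wage=1.0, Stimulus=1.0,
               e=0.1, MaxSpread=0.8):
    """One simplified Variant-1 ECAN update of STI vector s, with connection
    matrix C (positive entries standing for HebbianLinks)."""
    n = len(s)
    recent_max = max(s.max(), 1e-9)

    # Rent: charged only to nodes whose STI exceeds the small threshold e.
    rent = np.where(s > e,
                    Rent * np.maximum(0.0, np.log(20 * np.maximum(s, 1e-9) / recent_max)),
                    0.0)

    # Wages: the stimulus budget is split among cue nodes and non-cue nodes.
    in_cue = cue.sum()
    wages = np.where(cue == 1,
                     Wage * Stimulus / max(in_cue, 1),
                     Wage * Stimulus / max(n - in_cue, 1))
    s = s - rent + wages

    # Diffusion: make a left-stochastic matrix D, each column spreading at most
    # MaxSpread of its node's (scaled) STI; the diagonal keeps the remainder.
    D = np.maximum(C, 0.0).astype(float)
    for j in range(n):
        D[j, j] = 0.0
        col = D[:, j].sum()
        if col > 0:
            spread = min(MaxSpread, col)
            D[:, j] *= spread / col
        D[j, j] = 1.0 - D[:, j].sum()

    lo, hi = s.min(), s.max()
    span = hi - lo if hi > lo else 1.0
    v = (s - lo) / span          # scale STI to a [0, 1] vector
    v = D @ v                    # spread STI along Hebbian links (sum conserved)
    return lo + v * span         # rescale back to [minSTI, maxSTI]

s = np.array([1.0, 0.5, 0.1, 0.05])
C = np.array([[0, .4, 0, 0], [.4, 0, .3, 0], [0, .3, 0, .2], [0, 0, .2, 0]], float)
cue = np.array([1, 1, 0, 0])
print(ecan_cycle(s, C, cue))

Because every column of D sums to one, the diffusion step conserves the total of the scaled STI vector, mirroring the currency-conservation property of the full equations.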
Theorem 23.1. For a given value of e in the STI rent calculation, there is a subset of hyperbolic decision functions for which the ECAN dynamics converge to an attracting fixed point.

Proof. Rent is zero whenever e \le s_i \le recentMaxSTI/20, so we consider this case first. The updating process for the rent and wage stage can then be written as f(s) = s + constant. The next stage is governed by the hyperbolic decision function

g(s) = \frac{\tanh(shape\,(s - FocusBoundary)) + 1}{2}

The entire updating sequence is obtained by the composition (g \circ f)(s), whose derivative is then

(g \circ f)'(s) = \operatorname{sech}^2(f(s)) \cdot shape \cdot \frac{1}{2}

which has magnitude less than 1 whenever -2 < shape < 2. We next consider the case s_i > recentMaxSTI/20. The function f now takes the form

f(s) = s - \log\Big(\frac{20 s}{recentMaxSTI}\Big) + constant

and we have

(g \circ f)'(s) = \operatorname{sech}^2(f(s)) \cdot shape \cdot \Big(\frac{1}{2} - \frac{1}{2s}\Big)

which has magnitude less than 1 whenever |shape| < \big| \frac{2s}{s-1} \big|. Choosing the shape parameter to satisfy 0 < shape < \min\big(2, \big|\frac{2s}{s-1}\big|\big) then guarantees that |(g \circ f)'| < 1. Finally, g \circ f maps the closed interval [recentMinSTI, recentMaxSTI] into itself, so applying the Contraction Mapping Theorem completes the proof.

23.6.1.3 Definition and Analysis of Variant 2

The ECAN variant described above has performed completely acceptably in our experiments so far; however, we have also experimented with an alternate variant, with different convergence properties. In Variant 2, the dynamics of the ECAN are specifically designed so that a certain conceptually intuitive function serves as a Lyapunov function of the dynamics.

At a given time t, for a given Atom indexed i, we define two quantities:

OUT_i(t) = the total amount that Atom i pays in rent, tax and diffusion during the time-t iteration of ECAN
IN_i(t) = the total amount that Atom i receives in diffusion, stimulus and welfare during the time-t iteration of ECAN

(welfare is a new concept, to be introduced below). We then define

DIFF_i(t) = |IN_i(t) - OUT_i(t)|

and define AVDIFF(t) as the average of DIFF_i(t) over all i in the ECAN.

The design goal of Variant 2 of the ECAN equations is to ensure that, if the parameters are tweaked appropriately, AVDIFF can serve as a (deterministic or stochastic, depending on the details) Lyapunov function for ECAN dynamics. This implies that with appropriate parameters the ECAN dynamics will converge toward a state where AVDIFF = 0, meaning that no Atom is making any profit or incurring any loss. It must be noted that this kind of convergence is not always desirable, and sometimes one might want the parameters set otherwise. But if one wants the STI components of an ECAN to converge to some specific values, as for instance in a classic associative memory application, Variant 2 can guarantee this easily.

In Variant 2, each ECAN cycle begins with rent collection and welfare distribution, which occurs via collecting rent via the Variant 1 equation, and then performing the following two steps:

• Step A: calculate X, defined as the positive part of the total amount by which AVDIFF has been increased via the overall rent collection process.
• Step B: redistribute X to needy Atoms as follows: for each Atom z, calculate the positive part of OUT_z − IN_z, defined as deficit(z). Distribute X + ε wealth among all Atoms z, giving each Atom a percentage of X that is proportional to deficit(z), but never so much as to cause OUT < IN for any Atom (the welfare being given counts toward IN).
Here ε > 0 ensures AVDIFF decrease; ε = 0 may be appropriate if convergence is not required in a certain situation. Step B is the welfare step, which guarantees that rent collection will decrease AVDIFF. Step A calculates the amount by which the rich have been made poorer, and uses this to make the poor richer. In the case that the sum of deficit(z) over all nodes z is less than X, a mid-cycle rent adjustment may be triggered, calculated so that step B will decrease AVDIFF. (I.e., we cut rent on the rich, if the poor don't need their money to stay out of deficit.)

Similarly, in each Variant 2 ECAN cycle there is a wage-paying process, which involves the wage-paying equation from Variant 1 followed by two steps.

• Step A: calculate Y, defined as the positive part of the total amount by which AVDIFF has been increased via the overall wage payment process.
• Step B: exert taxation based on the surplus Y as follows: for each Atom z, calculate the positive part of IN_z − OUT_z, defined as surplus(z). Collect Y + ε₁ wealth from all Atoms z, collecting from each node a percentage of Y that is proportional to surplus(z), but never so much as to cause IN < OUT for any node (the new STI being collected counts toward OUT).

In case the total of surplus(z) over all nodes z is less than Y, one may trigger a mid-cycle wage adjustment, calculated so that step B will decrease AVDIFF. I.e., we cut wages, since there is not enough surplus to support them.

Finally, in the Variant 2 ECAN cycle, diffusion is done a little differently, via iterating the following process: if AVDIFF has increased during the diffusion round so far, then choose a random node whose diffusion would decrease AVDIFF, and let it diffuse; if AVDIFF has decreased during the diffusion round so far, then choose a random node whose diffusion would increase AVDIFF, and let it diffuse. In carrying out these steps, we avoid letting the same node diffuse twice in the same round. This algorithm does not let all Atoms diffuse in each cycle, but it stochastically lets a lot of diffusion happen in a way that keeps AVDIFF roughly constant. The iteration may be modified to bias toward an average decrease in AVDIFF.

The random element in the diffusion step, together with the logic of the rent/welfare and wage/tax steps, combines to yield the result that for Variant 2 of ECAN dynamics, AVDIFF is a stochastic Lyapunov function. The details of the proof will be omitted, but the outline of the argument should be clear from the construction of Variant 2. And note that by setting the ε and ε₁ parameters to 0, the convergence requirement can be eliminated, allowing the network to evolve more spontaneously, as may be appropriate in some contexts; these parameters allow one to explicitly adjust the convergence rate.

One may also derive results pertaining to the meaningfulness of the attractors, in various special cases. For instance, if we have a memory consisting of a set M of m nodes, and we imprint the memory on the ECAN by stimulating m nodes during an interval of time, then we want to be able to show that the condition where precisely those m nodes are in the AF is a fixed-point attractor. However, this is not difficult, because one must only show that if these m nodes and none others are in the AF, this condition will persist.
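The alternating diffusion rule can be sketched as follows (illustrative code; here the per-node effects on AVDIFF are assumed known in advance, whereas the real system would compute them from the current IN and OUT tallies):

import random

def variant2_round(effects: dict[str, float]) -> float:
    """Variant-2 diffusion round sketch. effects[z] is the (assumed known)
    change that letting node z diffuse would make to AVDIFF. Following the
    text: if AVDIFF has risen so far this round, pick a random node whose
    diffusion lowers it, and vice versa; no node diffuses twice."""
    pending = dict(effects)
    drift = 0.0  # net AVDIFF change so far this round
    while pending:
        if drift > 0:
            pool = [z for z, d in pending.items() if d < 0]
        else:
            pool = [z for z, d in pending.items() if d > 0]
        if not pool:
            break
        z = random.choice(pool)
        drift += pending.pop(z)
    return drift  # stays near zero: AVDIFF is held roughly constant

print(variant2_round({"a": 0.2, "b": -0.15, "c": 0.1, "d": -0.18}))

Biasing the rule (e.g., preferring the negative pool slightly more often) yields the average-decrease variant mentioned above.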
23.6.2 ECAN as Associative Memory

We have carried out experiments gauging the performance of Variant 1 of ECAN as an associative memory, using the implementation of ECAN within CogPrime, and using both the conventional and Storkey Hebbian updating formulas. As with a Hopfield net memory, the memory capacity (defined as the number of memories that can be retrieved from the network with high accuracy) depends on the sparsity of the network, with denser networks leading to greater capacity. In the ECAN case the capacity also depends on a variety of parameters of the ECAN equations, and the precise unraveling of these dependencies is a subject of current research.

However, one interesting dependency has already been uncovered in our preliminary experimentation, which has to do with the size of the AF versus the size of the memories being stored. Define the size of a memory (a pattern being imprinted) as the number of nodes that are stimulated during imprinting of that memory. In a classical Hopfield net experiment, the mean size of a memory is usually around, say, 0.2-0.5 of the number of neurons. In typical CogPrime associative memory situations, we believe the mean size of a memory will be one or two orders of magnitude smaller than that, so that each memory occupies only a relatively small portion of the overall network.

What we have found is that the memory capacity of an ECAN is generally comparable to that of a Hopfield net with the same number of nodes and links, if and only if the ECAN parameters are tuned so that the memories being imprinted can fit into the AF. That is, the AF threshold or (in the hyperbolic case) shape parameter must be tuned so that the size of the memories is not so large that the active nodes in a memory cannot stably fit into the AF. This tuning may be done adaptively, by testing the impact of different threshold/shape values on various memories of the appropriate size; or potentially a theoretical relationship between these quantities could be derived, but this has not been done yet. This is a reasonably satisfying result given the cognitive foundation of ECAN: in loose terms, what it means is that ECAN works best for remembering things that fit into its focus of attention during the imprinting process.

23.7 Glocal Economic Attention Networks

In order to transform ordinary ECANs into glocal ECANs, one may proceed in essentially the same manner as with glocal Hopfield nets, as discussed in Chapter 13 of Part 1. In the language normally used to describe CogPrime, this would be termed a "map encapsulation" heuristic. As with glocal Hopfield nets, one may proceed most simply via creating a fixed pool of nodes intended to provide locally-representative keys for the maps formed as attractors of the network. Links may then be formed to these key nodes, with weights and STI and LTI values adapted by the usual ECAN algorithms.

23.7.1 Experimental Explorations

To compare the performance of glocal ECANs with glocal Hopfield networks in a simple context, we ran experiments using ECAN in the manner of a Hopfield network. That is, a number of nodes take on the equivalent role of the neurons that are presented patterns to be stored. These patterns are imprinted by setting the nodes corresponding to active bits to have their STI within the AF, whereas nodes corresponding to inactive bits of the pattern are below the AF threshold.
Link weight updating then occurs, using one of several update rules; in this case the update rule of [SV99] was used. Attention was spread using a diffusion approach, by representing the weights of Hebbian links between pattern nodes within a left stochastic Markov matrix, and multiplying it by the vector of normalised STI values to give a vector representing the new distribution of STI.

To explore the effects of key nodes on ECAN Hopfield networks, in [Goe08b] we used the palimpsest testing scenario of [SV99], where all the local neighbours of the imprinted pattern, within a single bit change, are tested. Each neighbouring pattern is used as input to try to retrieve the original pattern. If all the retrieved patterns are the same as the original (within a given tolerance), then the pattern is deemed successfully retrieved, and recall of the previous pattern is attempted via its neighbours. The number of patterns this can repeat for successfully is called the palimpsest storage of the network.

As an example, consider one simple experiment that was run with recollection of 10 x 10 pixel patterns (so, 100 nodes, each corresponding to a pixel in the grid), a Hebbian link density of 30%, and with 1% of links being forgotten before each pattern is imprinted. The results demonstrated that, when the mean palimpsest storage is calculated for each of 0, 1, 5 and 10 key nodes, we find that the storage is 22.6, 22.4, 24.9, and 26.0 patterns respectively, indicating that key nodes do improve memory recall on average.

23.8 Long-Term Importance and Forgetting

Now we turn to the forgetting process (carried out by the Forgetting MindAgent), which is driven by LTI dynamics, but has its own properties as well. Overall, the goal of the "forgetting" process is to maximize the total utility of the Atoms in the AtomSpace throughout the future. The most basic heuristic toward this end is to remove the Atoms with the lowest LTI, but this isn't the whole story. Clearly, the decision to remove an Atom from RAM should depend on factors beyond just the LTI of the Atom. For example, one should also take into account the expected difficulty in reconstituting the given Atom from other Atoms. Suppose the system has the relations:

"dogs are animals"
"animals are cute"
"dogs are cute"

and the strength of the third relation is not dissimilar from what would be obtained by deduction and revision from the first two relations and others in the system. Then, even if the system judges it will be very useful to know dogs are cute in the future, it may reasonably choose to remove "dogs are cute" from memory anyway, because it knows it can be so easily reconstituted, by a few inference steps for instance. Thus, as well as removing the lowest-LTI Atoms, the Forgetting MindAgent should also remove Atoms meeting certain other criteria, such as the combination of:

• low STI
• easy reconstitutability in terms of other Atoms that have LTI not less than its own
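A minimal sketch of such a forgetting rule follows (the thresholds and the reconstitutability flag are illustrative assumptions; in CogPrime the latter would itself be estimated, e.g. by checking whether a short inference chain regenerates the Atom):

def choose_atoms_to_forget(atoms, lti_floor=0.1, sti_floor=0.1):
    """atoms: list of dicts with 'lti', 'sti' and 'reconstitutable' entries.
    Forget (remove from RAM) the lowest-LTI Atoms, plus Atoms combining low
    STI with easy reconstitutability from equal-or-higher-LTI Atoms."""
    to_forget = []
    for a in atoms:
        if a["lti"] < lti_floor:
            to_forget.append(a)
        elif a["sti"] < sti_floor and a["reconstitutable"]:
            to_forget.append(a)
    return to_forget

kb = [
    {"name": "dogs are cute", "lti": 0.6, "sti": 0.05, "reconstitutable": True},
    {"name": "dogs are animals", "lti": 0.7, "sti": 0.2, "reconstitutable": False},
    {"name": "old shopping list", "lti": 0.02, "sti": 0.0, "reconstitutable": False},
]
print([a["name"] for a in choose_atoms_to_forget(kb)])
# -> ['dogs are cute', 'old shopping list']

Note that "dogs are cute" is dropped despite its decent LTI, exactly because it is cheaply derivable from what remains.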
23.9 Attention Allocation via Data Mining on the System Activity Table

In this section we'll discuss an object called the System Activity Table, which contains a number of subtables recording various activities carried out by the various objects in the CogPrime system. These tables may be used for sophisticated attention allocation processes, according to an approach in which importance values and HebbianLink weight values are calculated via direct data mining of a centralized knowledge store (the System Activity Table). This approach provides highly accurate attention allocation, but at the cost of significant computational effort.

The System Activity Table is actually a set of tables, with multiple components. The precise definition of the tables will surely be adapted based on experience as the work with CogPrime progresses; what is described here is a reasonable first approximation.

First, there is a MindAgent Activity Table, which includes, for each MindAgent in the system, a table such as Table 23.1 (in which the time-points recorded are the last T system cycles, and the Atom-lists recorded are lists of Handles for Atoms).

System Cycle | Effort Spent | Memory Used | Atom Combo 1 Utilized | Atom Combo 2 Utilized | ...
Now          | 3.3          | 4000        | Atom21, Atom44        | Atom44, Atom47, Atom345 | ...
Now - 1      | 0.4          | 6079        | Atom123, Atom33       | Atom345                 | ...
...          | ...          | ...         | ...                   | ...                     | ...

Table 23.1: Example MindAgent Table

The MindAgent's activity table records, for that MindAgent and for each system cycle, which Atom-sets were acted on by that MindAgent at that point in time. Similarly, a table of this nature must be maintained for each Task-type, e.g. InferenceTask, MOSESCategorizationTask, etc. The Task tables are used to estimate Effort values for various Tasks, which are used in the procedure execution process. If it can be estimated how much spatial and temporal resources a Task is likely to use, via comparison to a record of previous similar tasks (in the Task table), then a MindAgent can decide whether it is appropriate to carry out this Task (versus some other one, or versus some simpler process not requiring a Task) at a given point in time, a process to be discussed in a later chapter.

In addition to the MindAgent and Task-type tables, it is convenient if tables are maintained corresponding to various goals in the system (as shown in Table 23.2), including the Ubergoals but also potentially derived goals of high importance. For each goal, at minimum, the degree of achievement of the goal at a given time must be recorded. Optionally, at each point in time, the degree of achievement of a goal relative to some particular Atoms may be recorded.

System Cycle | Total Achievement | Achievement for Atom44 | Achievement for set {Atom44, Atom233} | ...
Now          | .8                | .4                     | .5                                    | ...
Now - 1      | .9                | .5                     | .55                                   | ...
...          | ...               | ...                    | ...                                   | ...

Table 23.2: Example Goal Table

Typically the list of Atom-specific goal-achievements will be short, and will be different for different goals and different time points. Some goals may be applied to specific Atoms or Atom sets; others may only be applied more generally. The basic idea is that attention allocation and credit assignment may be effectively carried out via data mining on these tables, as sketched below.
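As a toy version of this kind of data mining (with a hypothetical, flattened table layout), one can estimate asymmetric HebbianLink weights from how often two Atoms were utilized in the same system cycle:

from collections import defaultdict
from itertools import combinations

def mine_hebbian_weights(activity_rows):
    """activity_rows: per-cycle lists of Atoms utilized together (flattened
    from the MindAgent tables). Returns P(B utilized | A utilized) estimates,
    usable as asymmetric HebbianLink weights."""
    used_in = defaultdict(set)
    for cycle, atoms in enumerate(activity_rows):
        for a in atoms:
            used_in[a].add(cycle)
    weights = {}
    for a, b in combinations(used_in, 2):
        both = len(used_in[a] & used_in[b])
        weights[(a, b)] = both / len(used_in[a])
        weights[(b, a)] = both / len(used_in[b])
    return weights

rows = [["Atom21", "Atom44"], ["Atom44", "Atom47"], ["Atom21", "Atom44", "Atom47"]]
print(mine_hebbian_weights(rows)[("Atom21", "Atom44")])  # -> 1.0

Co-utilization is used here as a cheap proxy for the "shared utility" probability mentioned earlier; more sophisticated mining would condition on goal achievement from the Goal tables as well.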
23.10 Schema Credit Assignment

And how do we apply a similar approach to clarifying the semantics of schema credit assignment? From the above-described System Activity Tables, one can derive information of the form

Achieve(G, E, T): "Goal G was achieved to extent E at time T"

which may be grounded as, for example:

Similarity
    E
    ExOut
        GetTruthValue
        Evaluation
            atTime
                HypLink G

and more refined versions such as

Achieve(G, E, T, A, P): "Goal G was achieved to extent E using Atoms A (with parameters P) at time T"

Enact(S, I, T1, O, T2): "Schema S was enacted on inputs I at time T1, producing outputs O at time T2"

The problem of schema credit assignment is then, in essence: given a goal G and a distribution of times D, figure out what schema to enact in order to cause G's achievement at some time in the future, where the desirability of times is weighted by D.

The basic method used is the learning of predicates of the form

ImplicationLink
    F(C, P1, ..., Pn)
    G

where

• the Pi are Enact() statements in which the T1 and T2 are variable, and the S, I and O may be concrete or variable
• C is a predicate representing a context
• G is an Achieve() statement, whose arguments may be concrete or abstract
• F is a Boolean function

Typically, the variable expressions in the T1 and T2 positions will be of the form T + offset, where offset is a constant value and T is a time value representing the time of inception of the whole compound schema. T may then be defined as T0 - offset0, where offset0 is a constant value and T0 is a variable denoting the time of achievement of the goal.

In CogPrime, these predicates may be learned by a combination of statistical pattern mining, PLN inference and MOSES or hill-climbing procedure learning.

The choice of what action to take at a given point in time is then a probabilistic decision. Based on the time-distribution D given, the system will know a certain number of expressions C = F(C, P1, ..., Pn) of the type described above. Each of these will be involved in an ImplicationLink with a certain estimated strength. It may select the "compound schema" C with the highest strength (a toy sketch of this selection step is given below).

One might think to introduce other criteria here, e.g. to choose the schema with the highest strength but the lowest cost of execution. However, it seems better to include all pertinent criteria in the goal, so that if one wants to consider cost of execution, one assumes the existence of a goal that incorporates cost of execution (which may be measured in multiple ways, of course) as part of its internal evaluation function.

Another issue that arises is whether to execute multiple C simultaneously. In many cases this won't be possible, because two different C's will contradict each other. It seems simplest to assume that C's that can be fused together into a single plan of action are presented to the schema execution process as a single fused C. In other words, the fusion is done during the schema learning process rather than the execution process.
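A toy sketch of the selection step just described, under the simplifying assumption that each candidate compound schema is summarized by a single ImplicationLink strength (the names and numbers here are invented for illustration):

# Illustrative sketch: selecting a compound schema by implication strength.
import random

# (schema_name, strength of ImplicationLink F(C, P1..Pn) -> Achieve(G))
candidates = [("fetch_plan_A", 0.7), ("fetch_plan_B", 0.2), ("fetch_plan_C", 0.1)]

def select_schema(candidates, greedy=True):
    if greedy:                            # pick the highest-strength plan
        return max(candidates, key=lambda c: c[1])[0]
    names, strengths = zip(*candidates)   # or sample proportionally to strength
    return random.choices(names, weights=strengths, k=1)[0]

print(select_schema(candidates))  # fetch_plan_A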
A question emerges regarding how this process deals with false causality, e.g. with a schema that, due to the existence of a common cause, often happens to occur immediately prior to the occurrence of a given goal. For instance, roosters crowing often occurs prior to the sun rising. This matter is discussed in more depth in the PLN book and The Hidden Pattern; but in brief, the answer is: in the current approach, if the system's experience suggests that roosters crowing causes the sun to rise, then if the system wants to cause the sun to rise, it may well cause a rooster to crow. Once this fails, the system will no longer hold the false belief, and afterwards will choose a different course of action. Furthermore, if it holds background knowledge indicating that roosters crowing is not likely to cause the sun to rise, then this background knowledge will be invoked by inference to discount the strength of the ImplicationLink pointing from rooster-crowing to sun-rising, so that the link will never be strong enough to guide schema execution in the first place.

The problem of credit assignment thus becomes a problem of creating appropriate heuristics to guide inference of ImplicationLinks of the form described above. Assignment of credit is then implicit in the calculation of truth values for these links. The difficulty is that the predicates F involved may be large and complex.

23.11 Interaction between ECANs and other CogPrime Components

We have described above a number of interactions between attention allocation and other aspects of CogPrime; in this section we gather a few comments on these interactions, and some additional ones.

23.11.1 Use of PLN and Procedure Learning to Help ECAN

MOSES or hillclimbing may be used to help mine the SystemActivityTable for patterns of usefulness, and to create HebbianLinks reflecting these patterns. PLN inference may be carried out on HebbianLinks by treating (HebbianLink A B) as a virtual predicate evaluation relationship, i.e. as

EvaluationLink
    HebbianPredicate
    (A, B)

PLN inference on HebbianLinks may then be used to update node importance values, because node importance values are essentially node probabilities corresponding to HebbianLinks. And similarly, MindAgent-relative node importance values are node probabilities corresponding to MindAgent-relative HebbianLinks.

Note that, conceptually, the nature of this application of PLN is different from most other uses of PLN in CogPrime. Here, the purpose of PLN is not to draw conclusions about the outside world, but rather about what the system should focus its resources on in which context. PLN, used in this context, effectively constitutes a nonlinear-dynamical iteration governing the flow of attention through the CogPrime system. Finally, inference on HebbianLinks leads to the emergence of maps, via the recognition of clusters in the graph of HebbianLinks.

23.11.2 Use of ECAN to Help Other Cognitive Processes

First of all, associative-memory functionality is directly important in CogPrime because it is used to drive concept creation. The CogPrime heuristic called "map formation" creates new Nodes corresponding to prominent attractors in the ECAN, a step that (according to our preliminary results) not only increases the memory capacity of the network beyond what can be achieved with a pure ECAN, but also enables attractors to be explicitly manipulated by PLN inference.

Equally important to associative memory is the capability of ECANs to facilitate effective allocation of the attention of other cognitive processes to appropriate knowledge items (Atoms). For example, one key role of ECANs in CogPrime is to guide the forward and backward chaining processes of PLN (Probabilistic Logic Network) inference. At each step, the PLN inference chainer is faced with a great number of inference steps (branches) from which to choose; and a choice is made using a statistical "bandit problem" mechanism that selects each possible inference step with a probability proportional to its expected "desirability." In this context, there is considerable appeal in the heuristic of weighting inference steps using probabilities proportional to the STI values of the Atoms they contain. One thus arrives at a combined PLN/ECAN dynamic as follows (a toy simulation follows the list):

1. An inference step is carried out, involving a choice among multiple possible inference steps, which is made using STI-based weightings (and made among Atoms that LTI weightings have deemed valuable enough to remain in RAM)
2. The Atoms involved in the inference step are rewarded with STI and LTI proportionally to the utility of the inference step (how much it increases the confidence of Atoms in the system's memory)
3. The ECAN operates, and multiple Atoms' importance values are updated
4. Return to Step 1 if the inference isn't finished
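The following toy simulation shows the shape of this loop; the numbers and update rules here are invented for illustration and are not the actual OpenCog equations:

# Toy sketch of the combined PLN/ECAN dynamic (all quantities invented).
import random

sti = {"A": 1.0, "B": 0.5, "C": 0.1}            # Atom STI values
steps = [("A", "B"), ("B", "C"), ("A", "C")]     # candidate inference steps

for cycle in range(10):
    # 1. choose an inference step, weighted by the STI of its Atoms
    weights = [sti[x] + sti[y] for x, y in steps]
    step = random.choices(steps, weights=weights, k=1)[0]
    # 2. reward the Atoms involved, in proportion to the step's utility
    utility = random.random()                    # stand-in for confidence gain
    for atom in step:
        sti[atom] += 0.1 * utility
    # 3. ECAN operates: a crude diffusion/decay of importance
    total = sum(sti.values())
    for atom in sti:
        sti[atom] = 0.9 * sti[atom] + 0.1 * total / len(sti)

print(sti)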
An analogous interplay may occur between ECANs and MOSES. It seems intuitively clear that the same attractor-convergence properties highlighted in the above analysis of associative-memory behavior will also be highly valuable for the application of ECANs to attention allocation. If a collection of Atoms is often collectively useful for some cognitive process (such as PLN), then the associative-memory-type behavior of ECANs means that once a handful of the Atoms in the collection are found useful in a certain inference process, the other Atoms in the collection will get their STI significantly boosted, and will be likely to get chosen in subsequent portions of that same inference process. This is exactly the sort of dynamics one would like to see occur. Systematic experimentation with these interactions between ECAN and other CogPrime processes is one of our research priorities going forward.

23.12 MindAgent Importance and Scheduling

So far we have discussed economic transactions between Atoms and Atoms, and between Atoms and Units. MindAgents have played an indirect role, via spreading stimulation to Atoms, which causes them to get paid wages by the Unit. Now it is time to discuss the explicit role of MindAgents in economic transactions. This has to do with the integration of economic attention allocation with the Scheduler that schedules the core MindAgents involved in the basic cognitive cycle. This integration may be done in many ways, but one simple approach is the following (a toy sketch appears at the end of this section):

1. When a MindAgent utilizes an Atom, this results in sending stimulus to that Atom. (Note that we don't want to make MindAgents pay for using Atoms individually; that would penalize MindAgents that use more Atoms, which doesn't really make much sense.)
2. MindAgents then get currency from the Lobe (as defined in Chapter 19) periodically, and get extra currency based on usefulness for goal achievement as determined by the credit assignment process. The Scheduler then gives more processor time to MindAgents with more STI.
3. However, any MindAgent with LTI above a certain minimum threshold will get some minimum amount of processor time (i.e. get scheduled at least once each N cycles).

As a final note: in a multi-Lobe Unit, the Unit may use the different LTI values of MindAgents in different Lobes to control the distribution of MindAgents among Lobes: e.g. a very important (high-LTI) MindAgent might get cloned across multiple Lobes.
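A minimal sketch of such a scheduling policy, assuming hypothetical (STI, LTI) pairs per MindAgent; the allocation rule is an illustrative assumption, not the actual Scheduler:

# Illustrative sketch: processor time proportional to STI, with a guaranteed
# minimum for MindAgents whose LTI exceeds a threshold.
def schedule(agents, total_slices, lti_threshold, min_slices=1):
    """agents: dict name -> (sti, lti); returns name -> slice count."""
    alloc = {name: min_slices if lti > lti_threshold else 0
             for name, (sti, lti) in agents.items()}
    remaining = total_slices - sum(alloc.values())
    total_sti = sum(sti for sti, _ in agents.values()) or 1.0
    for name, (sti, _) in agents.items():
        alloc[name] += int(remaining * sti / total_sti)
    return alloc

agents = {"PLN": (5.0, 0.9), "MOSES": (2.0, 0.9), "Forgetting": (0.1, 0.4)}
print(schedule(agents, total_slices=100, lti_threshold=0.5))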
23.13 Information Geometry for Attention Allocation

Appendix ?? outlines some very broad ideas regarding the potential utilization of information geometry and related ideas for modeling cognition. In this section, we present some more concrete and detailed experiments inspired by the same line of thinking. We model CogPrime's Economic Attention Networks (ECAN) component using information geometric language, and then use this model to propose a novel information geometric method of updating ECAN networks (based on an extension of Amari's ANGL algorithm). Tests on small networks suggest that information-geometric methods have the potential to vastly improve ECAN's capability to shift attention from current preoccupations to desired preoccupations. However, there is a high computational cost associated with the simplest implementations of these methods, which has so far prevented us from carrying out large-scale experiments. We are exploring the possibility of circumventing these issues by using sparse matrix algorithms on GPUs.

23.13.1 Brief Review of Information Geometry

"Information geometry" is a branch of applied mathematics concerned with the application of differential geometry to spaces of probability distributions. In [GI11] we have suggested some extensions to traditional information geometry aimed at allowing it to better model general intelligence. However, for the concrete technical work in this chapter, the traditional formulation of information geometry will suffice.

One of the core mathematical constructs underlying information geometry is the Fisher Information (FI), a statistical quantity which has a variety of applications ranging far beyond statistical data analysis, including physics [Fri98], psychology and AI. Put simply, FI is a formal way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends. FI forms the basis of the Fisher-Rao metric, which has been proved the only Riemannian metric on the space of probability distributions satisfying certain natural properties regarding invariance with respect to coordinate transformations. Typically θ in the FI is considered to be a real multidimensional vector; however, [Dab00] has presented an FI variant that imposes basically no restrictions on the form of θ. Here the multidimensional FI will suffice, but the more general version is needed if one wishes to apply FI to AGI more broadly, e.g. to declarative and procedural as well as attentional knowledge.

In the set-up underlying the definition of the ordinary finite-dimensional Fisher information, the probability function for X, which is also the likelihood function for θ ∈ ℝⁿ, is a function f(X; θ); it is the probability mass (or probability density) of the random variable X conditional on the value of θ. The partial derivative with respect to θᵢ of the log of the likelihood function is called the score with respect to θᵢ. Under certain regularity conditions, it can be shown that the first moment of the score is 0. The second moment is the Fisher information:

\mathcal{I}(\theta)_i = E\left[ \left( \frac{\partial}{\partial \theta_i} \ln f(X;\theta) \right)^2 \,\Big|\, \theta \right]

where, for any given value of θᵢ, the expression E[...|θ] denotes the conditional expectation over values for X with respect to the probability function f(X; θ) given θ. Note that 0 ≤ \mathcal{I}(\theta)_i < ∞. Also note that, in the usual case where the expectation of the score is zero, the Fisher information is also the variance of the score.
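As a quick numerical illustration of these definitions (a standard textbook example, not drawn from the experiments above): for a Bernoulli variable with parameter θ, the Fisher information is 1/(θ(1−θ)), and a Monte Carlo estimate of the score's moments reproduces this:

# Numerical check: the variance of the score equals the Fisher information.
import numpy as np

theta = 0.3
x = np.random.binomial(1, theta, size=200000)
score = x / theta - (1 - x) / (1 - theta)   # d/dtheta of ln f(x; theta)
print(np.mean(score))                       # approximately 0 (first moment)
print(np.var(score))                        # approx 1/(theta*(1-theta)) = 4.76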
One can also look at the whole Fisher information matrix

\mathcal{I}(\theta)_{ij} = E\left[ \frac{\partial \ln f(X;\theta)}{\partial \theta_i} \, \frac{\partial \ln f(X;\theta)}{\partial \theta_j} \right]

which may be interpreted as a metric g_{ij}, which provably is the only "intrinsic" metric on probability distribution space. In this notation we have \mathcal{I}(\theta)_i = \mathcal{I}(\theta)_{i,i}.

Dabak [Dab99] has shown that the geodesic between two parameter vectors θ and θ' is given by the exponentially weighted curve

\gamma(t)(x) = \frac{ f(x;\theta)^{1-t} \, f(x;\theta')^{t} }{ \int f(y;\theta)^{1-t} \, f(y;\theta')^{t} \, dy }

under the weak condition that the log-likelihood ratios with respect to f(X; θ) and f(X; θ') are finite. Also, along this sort of curve, the sum of the Kullback-Leibler distances between θ and θ', known as the J-divergence, equals the integral of the Fisher information along the geodesic connecting θ and θ'. This suggests that if one is attempting to learn a certain parameter vector based on data, and one has a certain other parameter vector as an initial value, it may make sense to use algorithms that try to follow the Fisher-Rao geodesic between the initial condition and the desired conclusion. This is what Amari [Ama85, ANO01] calls "natural gradient" based learning, a conceptually powerful approach which subtly accounts for dependencies between the components of θ.

23.13.2 Information-Geometric Learning for Recurrent Networks: Extending the ANGL Algorithm

Now we move on to discuss the practicalities of information-geometric learning within CogPrime's ECAN component. As noted above, Amari [Ama85, ANO01] introduced the natural gradient as a generalization of the direction of steepest descent on the space of loss functions of the parameter space. Issues with the original implementation include the requirement of calculating both the Fisher information matrix and its inverse. To resolve these and other practical considerations, Amari [Ama98] proposed an adaptive version of the algorithm, the Adaptive Natural Gradient Learning (ANGL) algorithm. Park, Amari, and Fukumizu [PAF00] extended ANGL to a variety of stochastic models including stochastic neural networks, multi-dimensional regression, and classification problems.

In particular, they showed that, assuming a particular form of stochastic feedforward neural network and under a specific set of assumptions concerning the form of the probability distributions involved, a version of the Fisher information matrix can be written as

G(\theta) = E_x\left[ E\left[ \left( \frac{r'}{r} \right)^2 \right] (\nabla H)(\nabla H)^T \right]

Although Park et al. considered only feedforward neural networks, their result also holds for more general neural networks, including the ECAN network. What is important is the decomposition of the probability distribution as

p(y \mid x; \theta) = \prod_{i=1}^{L} r\left( y_i - H_i(x, \theta) \right)

where y = H(x; θ) + ξ, with y = (y_1, ..., y_L)^T, H = (H_1, ..., H_L)^T, ξ = (ξ_1, ..., ξ_L)^T, and where ξ is added noise with density r. If we assume further that each ξ_i has the same form as a Gaussian distribution with zero mean and standard deviation σ, then the Fisher information matrix simplifies further to

G(\theta) = \frac{1}{\sigma^2} E_x\left[ (\nabla H)(\nabla H)^T \right]

The adaptive estimate of the inverse Fisher matrix is given by

\hat{G}^{-1}_{t+1} = (1 + \varepsilon_t)\hat{G}^{-1}_t - \varepsilon_t (\hat{G}^{-1}_t \nabla H)(\hat{G}^{-1}_t \nabla H)^T

and the loss function for our model takes the form

\ell(x, y; \theta) = -\sum_{i=1}^{L} \log r\left( y_i - H_i(x, \theta) \right)

The learning algorithm for our connection matrix weights B is then given by

B_{t+1} = B_t - \eta_t \hat{G}^{-1}_t \nabla \ell(B_t)
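To make the flavor of this update concrete, here is a small sketch of a natural-gradient step using an explicit empirical Fisher matrix on a toy linear-Gaussian model; the adaptive inverse estimate is omitted for brevity, and the model, sizes and learning rate are assumptions for illustration:

# Sketch of natural-gradient learning in the spirit of ANGL (toy model).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                   # inputs
true_B = np.array([1.0, -2.0, 0.5])
y = X @ true_B + 0.1 * rng.normal(size=500)     # y = H(x;B) + Gaussian noise

B = np.zeros(3)
eta, sigma2 = 0.5, 0.01
for _ in range(50):
    grad_H = X                                   # gradient of H(x;B) = x.B
    G = (grad_H.T @ grad_H) / (len(X) * sigma2)  # empirical Fisher matrix
    resid = y - X @ B
    grad_loss = -(X.T @ resid) / (len(X) * sigma2)   # grad of neg log-lik
    B = B - eta * np.linalg.solve(G, grad_loss)      # natural-gradient step
print(B)   # converges close to true_B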
23.13.3 Information Geometry for Economic Attention Allocation: A Detailed Example

Fig. 23.3: Results from Experiment 1 (sum of squares of errors versus number of nodes)

We now present the results of a series of small-scale, exploratory experiments comparing the original ECAN process running alone with the ECAN process coupled with ANGL. We are interested in determining which of these two lines of processing results in focusing attention more accurately.

The experiment started with base patterns of various sizes to be determined by the two algorithms. In the training stage, noise was added, generating a number of instances of noisy base patterns. The learning goal is to identify the underlying base patterns from the noisy patterns, as this will indicate how well the different algorithms can focus attention on relevant versus irrelevant nodes. Next, the ECAN process was run, resulting in the determination of the connection matrix C. In order to apply the ANGL algorithm, we need the gradient, ∇H, of the ECAN training process with respect to the input x. While calculating the connection matrix C, we used Monte Carlo simulation to simultaneously calculate an approximation to ∇H.

Fig. 23.4: Results from Experiment 2 (sum of squares of errors versus training noise, 16 nodes)

After ECAN training was completed, we bifurcated the experiment. In one branch, we ran fuzzed cue patterns through the retrieval process. In the other, we first applied the ANGL algorithm, optimizing the weights in the connection matrix, prior to running the retrieval process on the same fuzzed cue patterns. At a constant value of σ = 0.8, we ran several samples through each branch with pattern sizes of 4 x 4, 7 x 7, 10 x 10, 15 x 15, and 20 x 20. The results are shown in Figure 23.3. We also ran several experiments comparing the sum of squares of the errors to the input training noise as measured by the value of σ; see Figures 23.4 and ??.

These results suggest two major advantages of the ECAN+ANGL combination compared to ECAN alone. Not only was the performance of the combination better in every trial, save for one involving a small number of nodes and little noise, but the combination clearly scales significantly better, both as the number of nodes increases and as the training noise increases.

Chapter 24 Economic Goal and Action Selection

24.1 Introduction

A significant portion of CogPrime's dynamics is explicitly goal-driven - that is, based on trying (inasmuch as possible within the available resources) to figure out which actions will best help the system achieve its goals, given the current context. A key aspect of this explicit activity is guided by the process of "goal and action selection" - prioritizing goals, and then prioritizing actions based on these goals. We have already outlined the high-level process of action selection, in Chapter 22 above. Now we dig into the specifics of the process, showing how action selection is dynamically entwined with goal prioritization, and how both processes are guided by economic attention allocation as described in Chapter 23.

While the basic structure of CogPrime's action selection aspect is fairly similar to MicroPsi (due to the common foundation in Dorner's Psi model), the dynamics are less similar. MicroPsi's dynamics are a little closer to being a formal neural net model, whereas ECAN's economic foundation tends to push it in different directions.
The CogPrime goal and action selection design involves some simple simulated financial mechanisms, building on the economic metaphor of ECAN, that are different from, and more complex than, anything in MicroPsi.

The main actors (apart from the usual ones like the AtomTable, economic attention allocation, etc.) in the tale to be told here are as follows:

• Structures:
    - UbergoalPool
    - ActiveSchemaPool
• MindAgents:
    - GoalBasedSchemaSelection
    - GoalBasedSchemaLearning
    - GoalAttentionAllocation
    - FeasibilityUpdating
    - SchemaActivation

The UbergoalPool contains the Atoms that the system considers as top-level goals. These goals must be treated specially by attention allocation: they must be given funding by the Unit so that they can use it to pay for getting themselves achieved. The weighting among different top-level goals is achieved via giving them different amounts of currency. STICurrency is the key kind here, but of course ubergoals must also get some LTICurrency so they won't be forgotten. (Inadvertently deleting your top-level supergoals from memory is generally considered to be a bad thing... it's in a sense a sort of suicide...)

24.2 Transfer of STI "Requests for Services" Between Goals

Transfer of "attentional funds" from goals to subgoals, and from schema modules to other schema modules in the same schema, takes place via a mechanism of promises of funding (or "requests for service," to be called "RFS's" from here on). This mechanism relies upon and interacts with ordinary economic attention allocation, but also has special properties. Note that we will sometimes say that an Atom "issues" an RFS or "transfers" currency, while what we really mean is that some MindAgent working on that Atom issues an RFS or transfers currency.

The logic of these RFS's is as follows. If agent A issues an RFS of value x to agent B, then:

1. When B judges it appropriate, B may redeem the note and ask A to transfer currency of value x to B.
2. A may withdraw the note from B at any time.

(There is also a little more complexity here, in that we will shortly introduce the notion of RFS's whose value is defined by a set of constraints. But this complexity does not contradict the two above points.) The total value of the RFS's possessed by an Atom may be referred to as its "promise." A rough schematic depiction of this RFS process is given in Figure 24.1.

Now we explain how RFS's may be passed between goals. Given two predicates A and B, if A is being considered as a goal, then B may be considered as a subgoal of A (and A the supergoal of B) if there exists a Link of the form

PredictiveImplication B A

i.e., achieving B may help to achieve A. Of course, the strength of this link and its temporal characteristics are important in terms of quantifying how strongly and how usefully B is a subgoal of A.

Supergoals (not only top-level ones, aka ubergoals) allocate RFS's to subgoals as follows. Supergoal A may issue an RFS to subgoal B if it is judged that achievement (i.e., predicate satisfaction) of B implies achievement of A. This may proceed recursively: subgoals may allocate RFS's to subsubgoals according to the same justification.

Unlike actual currency, RFS's are not conserved. However, the actual payment of real currency upon redemption of RFS's obeys the conservation of real currency. This means that agents need to be responsible in issuing and withdrawing RFS's.
In practice this may be ensured by having agents follow a couple of simple rules in this regard:

1. If B and C are two alternatives for achieving A, and A has x units of currency, then A may promise both B and C x units of currency. Whoever asks for redemption of the promise first will get the money, and then the promise will be rescinded from the other one.
2. On the other hand, if the achievement of A requires both B and C to be achieved, then B and C may be granted RFS's that are defined by constraints. If A has x units of currency, then B and C receive an RFS tagged with the constraint (B + C <= x). This means that in order to redeem the note, either one of B or C must confer with the other one, so that they can simultaneously request constraint-consistent amounts of money from A.

Fig. 24.1: The RFS Propagation Process. An illustration of the process via which RFS's propagate from goals to abstract procedures, and finally must get cashed out to pay for the execution of actual concrete procedures that are estimated relatively likely to lead to goal fulfillment.

As an example of the role of constraints, consider the goal of playing fetch successfully (a subgoal of "get reward"). Then suppose it is learned that:

ImplicationLink
    SequentialAND
        get_ball
        deliver_ball
    play_fetch

where SequentialAND A B is the conjunction of A and B, but with B occurring after A in time. Then, if play_fetch has $10 in STICurrency, it may know it has $10 to spend on a combination of get_ball and deliver_ball. In this case both get_ball and deliver_ball would be given RFS's labeled with the constraint:

RFS.get_ball + RFS.deliver_ball <= $10

The issuance of RFS's embodying constraints is different from (and generally carried out prior to) the evaluation of whether the constraints can be fulfilled. (A minimal sketch of this constraint bookkeeping is given below.)

An ubergoal may rescind offers of reward for service at any time. And, generally, if a subgoal gets achieved and has not spent all the money it needed, the supergoal will not offer any more funding to the subgoal (until/unless it needs that subgoal achieved again). As there are no ultimate sources of RFS in OCP besides ubergoals, promise may be considered a measure of "goal-related importance."

Transfer of RFS's among Atoms is carried out by the GoalAttentionAllocation MindAgent.
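A minimal sketch of the constraint-bearing RFS bookkeeping just described, applied to the fetch example; the class and the redemption protocol shown are illustrative assumptions, not the actual OpenCog mechanism:

# Illustrative sketch of a jointly constrained RFS.
class ConstrainedRFS:
    def __init__(self, budget, parties):
        self.budget = budget                  # e.g. play_fetch's $10
        self.claimed = {p: 0.0 for p in parties}

    def redeem(self, party, amount):
        """Grant the request only if all claims stay within the joint budget."""
        if sum(self.claimed.values()) + amount <= self.budget:
            self.claimed[party] += amount
            return True
        return False

rfs = ConstrainedRFS(10.0, ["get_ball", "deliver_ball"])
print(rfs.redeem("get_ball", 6.0))      # True
print(rfs.redeem("deliver_ball", 6.0))  # False: would exceed the $10 budget
print(rfs.redeem("deliver_ball", 4.0))  # True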
24.3 Feasibility Structures

Next, there is a numerical data structure associated with goal Atoms, which is called the feasibility structure. The feasibility structure of an Atom G indicates the feasibility of achieving G as a goal using various amounts of effort. It contains triples of the form (t, p, E) indicating the truth value t of achieving goal G to degree p using effort E. Feasibility structures must be updated periodically, via scanning the links coming into an Atom G; this may be done by a FeasibilityUpdating MindAgent. Feasibility may be calculated for any Atom G for which there are links of the form:

Implication
    Execution S
    G

for some S. Once a schema has actually been executed on various inputs, its cost of execution on other inputs may be empirically estimated. But this is not the only case in which feasibility may be estimated. For example, if goal G inherits from goal G1, and most children (e.g. subgoals) of G1 are achievable with a certain feasibility, then probably G is achievable with a similar feasibility as well. This allows feasibility estimation even in cases where no plan for achieving G yet exists, e.g. if the plan can be produced via predicate schematization, but such schematization has not yet been carried out.

Feasibility then connects with importance as follows. Important goals will get more STICurrency to spend, and thus will be able to spawn more costly schemata. So the GoalBasedSchemaSelection MindAgent, when choosing which schemata to push into the ActiveSchemaPool, will be able to choose more costly schemata corresponding to goals with more STICurrency to spend.

24.4 Goal Based Schema Selection

Next, the GoalBasedSchemaSelection (GBSS) MindAgent selects schemata to be placed into the ActiveSchemaPool. It does this by choosing goals G, and then choosing schemata that are alleged to be useful for achieving these goals. It chooses goals via a fitness function that combines promise and feasibility. This involves solving an optimization problem: figuring out how to maximize the odds of getting a lot of goal-important stuff done within the available amount of (memory and space) effort. Potentially this optimization problem can get quite subtle, but initially some simple heuristics are satisfactory. (One subtlety involves handling dependencies between goals, as represented by constraint-bearing RFS's.)

Given a goal, the GBSS MindAgent chooses a schema to achieve that goal via the heuristic of selecting the one that maximizes a fitness function balancing the estimated effort required to achieve the goal via executing the schema, with the estimated probability that executing the schema will cause the goal to be achieved (a toy version of such a fitness function is sketched below).

When searching for schemata to achieve G, and estimating their effort, one factor to be taken into account is the set of schemata already in the ActiveSchemaPool. Some schemata S may simultaneously achieve two goals; or two schemata achieving different goals may have significant overlap of modules. In this case G may be able to get achieved using very little or no effort (no additional effort, if there is already a schema S in the ActiveSchemaPool that is going to cause G to be achieved). But if G "decides" it can be achieved via a schema S already in the ActiveSchemaPool, then it should still notify the ActiveSchemaPool of this, so that G can be added to S's index (see below). If the other goal G1 that placed S in the ActiveSchemaPool decides to withdraw S, then S may need to hit up G for money, in order to keep itself in the ActiveSchemaPool with enough funds to actually execute.
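A toy version of such a fitness function; the functional form, the budget check and all numbers are assumptions for illustration, not the actual GBSS heuristics:

# Illustrative sketch: balance success probability against estimated effort.
def schema_fitness(p_success, effort, sti_budget):
    if effort > sti_budget:             # the goal can't afford this schema
        return 0.0
    return p_success / (1.0 + effort)   # cost-discounted odds of success

candidates = {"plan_cheap": (0.5, 2.0), "plan_thorough": (0.9, 12.0)}
budget = 8.0
best = max(candidates, key=lambda s: schema_fitness(*candidates[s], budget))
print(best)   # "plan_cheap": plan_thorough exceeds the available budget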
24.4.1 A Game-Theoretic Approach to Action Selection

Min Jiang has observed that, mathematically, the problem of action selection (represented in CogPrime as the problem of goal-based schema selection) can be modeled in terms of game theory, as follows:

• the intelligent agent is one player, the world is another player
• the agent's model of the world lets it make probabilistic predictions of how the world may respond to what the agent does (i.e. lets it estimate what mixed strategy the world is following, considering the world as a game player)
• the agent itself chooses schemata probabilistically, so it is also following a mixed strategy
• so, in principle, the agent can choose schemata that it thinks will lead to a mixed Nash equilibrium¹

But the world's responses are very high-dimensional, which means that finding a mixed Nash equilibrium even approximately is a very hard computational problem. Thus, in a sense, the crux of the problem seems to come down to feature identification. If the world's response (real or predicted) can be represented as a low-dimensional set of features, then these features can be considered as the world's "move" in the game, and the game theory problem becomes tractable via approximation schemes. But without the reduction of the world to a low-dimensional set of features, finding the mixed Nash equilibrium even approximately will not be computationally tractable.

Some AI theorists would argue that this division into "feature identification" versus "action selection" is unnecessarily artificial; for instance, Hawkins [HB04] or Arel [ARC09b] might suggest using a single hierarchical neural network to do both of them. But the brain, after all, contains many different regions, with different architectures and dynamics. In the visual cortex, it seems that feature extraction and object classification are done separately. And it seems that in the brain, action selection has a lot to do with the basal ganglia, whereas feature extraction is done in the cortex. So the neural analogy provides some inspiration for an architecture in which feature identification and action selection are separated.

There is literature discussing numerical methods for calculating approximate Nash equilibria; however, this is an extremely tricky topic in the CogPrime context because action selection must generally be done in real-time. Like perception processing, this may be an area calling for the use of parallel processing hardware. For instance, a neural network algorithm for finding mixed Nash equilibria could be implemented on a GPU supercomputer, enabling rapid real-time action selection based on a reduced-dimensionality model of the world produced by intelligent feature identification.

Consideration of the application of game theory in this context brings out an important point, which is that to do reasonably efficient and intelligent action selection, the agent needs some rapidly-evaluable model of the world, i.e. some way to rapidly evaluate the predicted response of the world to a hypothetical action by the agent. In the game theory approach (or any other sufficiently intelligent approach), for the agent to evaluate the fitness of a schema-set S for achieving certain goals in a certain context, it has to (explicitly or implicitly) estimate:

• how the world will respond if the agent does S
• how the agent could usefully respond to the world's response (call this action-set S1)
• how the world will respond to the agent doing S1
• etc.

and so, to rapidly evaluate the fitness of S, the agent needs to be able to quickly estimate how the world will respond. This may be done via simulation (sketched below), or via inference (which however will rarely be fast enough, unless with a very accurate inference control mechanism), or by learning some compacted model of the world, as represented for instance in a hierarchical neural network.

¹ In game theory, a Nash equilibrium is a situation in which no player can do better by unilaterally changing its strategy.
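A minimal sketch of the simulation option: estimating a policy's fitness by Monte Carlo rollouts against a cheap world model. The model, policy and goal here are toy assumptions, standing in for whatever fast predictive model the agent has learned:

# Illustrative sketch: fitness estimation via rollouts against a world model.
import random

def world_model(state, action):
    """Cheap stochastic prediction of the world's response."""
    return state + action + random.choice([-1, 0, 1])

def rollout_value(state, policy, goal, depth=3, samples=100):
    hits = 0
    for _ in range(samples):
        s = state
        for _ in range(depth):
            s = world_model(s, policy(s))
        hits += (s >= goal)
    return hits / samples      # estimated probability of achieving the goal

print(rollout_value(0, policy=lambda s: 1, goal=3))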
24.5 SchemaActivation

And what happens with schemata that are actually in the ActiveSchemaPool? Let us assume that each of these schemata is a collection of modules (subprograms), connected via ActivationLinks, which have the semantics: (ActivationLink A B) means that if the schema that placed module A in the schema pool is to be completed, then after A is activated, B should be activated. (We will have more to say about schemata, and their modularization, in Chapter 25.)

When a goal places a schema in the ActiveSchemaPool, it grants that schema an RFS equal in value to the total or some fraction of the promissory+real currency it has in its possession. The heuristics for determining how much currency to grant may become sophisticated; but initially we may just have a goal give a schema all its promissory currency; or, in the case of a top-level supergoal, all its actual currency.

When a module within a schema actually executes, it must redeem some of its promissory currency to turn it into actual currency, because executing costs money (paid to the Lobe). Once a schema is done executing, if it hasn't redeemed all its promissory currency, it gives the remainder back to the goal that placed it in the ActiveSchemaPool. When a module finishes executing, it passes promissory currency to the other modules to which it points with ActivationLinks.

The network of modules in the ActiveSchemaPool is a digraph (whose links are ActivationLinks), because some modules may be shared among different overall schemata. Each module must be indexed via which schemata contain it, and each schema must be indexed via which goal(s) want it in the ActiveSchemaPool.

24.6 GoalBasedSchemaLearning

Finally, we have the process of trying to figure out how to achieve goals, i.e. trying to learn links between ExecutionLinks and goals G. This process should be focused on goals that have high importance but for which feasible achievement-methodologies are not yet known. Predicate schematization is one way of achieving this; another is MOSES procedure evolution.

Chapter 25 Integrative Procedure Evaluation

25.1 Introduction

Procedural knowledge must be learned, an often subtle and difficult process - but it must also be enacted. Procedure enaction is not as tricky a topic as procedure learning, but it is still far from trivial, and involves the real-time interaction of procedures, during the course of execution, with other knowledge. In this brief chapter we explain how this process may be most naturally and flexibly carried out, in the context of CogPrime's representation of procedures as programs ("Combo trees"). While this may seem a somewhat "mechanical", implementation-level topic, it also involves some basic conceptual points, on which CogPrime as an AGI design does procedure evaluation fundamentally differently from narrow-AI systems or conventional programming language interpreters. Basically, what makes CogPrime Combo tree evaluation somewhat subtle is the interfacing between the Combo evaluator itself and the rest of the CogPrime system.
In the CogPrime design, Procedure objects (which contain Combo trees, and are associated with ProcedureNodes) are evaluated by ProcedureEvaluator objects. Different ProcedureEvaluator objects may evaluate the same Combo tree in different ways. Here we explain these various sorts of evaluation - how they work and what they mean.

25.2 Procedure Evaluators

In this section we will mention three different ProcedureEvaluators:

• Simple procedure evaluation
• Effort-based procedure evaluation, which is more complex but is required for integration of inference with procedure evaluation
• Adaptive evaluation order based procedure evaluation

In the following section we will delve more thoroughly into the interactions between inference and procedure evaluation.

Another related issue is the modularization of procedures. This issue however is actually orthogonal to the distinction between the three ProcedureEvaluators mentioned above. Modularity simply requires that particular nodes within a Combo tree be marked as "module roots", so that they may be extracted from the Combo tree as a whole and treated as separate modules (in other words, sub-routines), if the ExecutionManager judges this appropriate.

25.2.1 Simple Procedure Evaluation

The SimpleComboTreeEvaluator simply does Combo tree evaluation as described earlier. When an Atom is encountered, it looks into the AtomTable to evaluate the object. In the case that a Schema refers to an ungrounded SchemaNode (one that is not defined by a ComboTree as defined in Chapter 19), and an appropriate EvaluationLink value isn't in the AtomTable, there's an evaluation failure, and the whole procedure evaluation returns the truth value (.5, 0): i.e., a zero-weight-of-evidence truth value, which is essentially equivalent to returning no value. In the case that a Predicate refers to an ungrounded PredicateNode, and an appropriate EvaluationLink isn't in the AtomTable, then some very simple "default thinking" is done: the truth value of the predicate on the given arguments is assigned to be the TruthValue of the corresponding PredicateNode (which is defined as the mean truth value of the predicate across all arguments known to CogPrime).

25.2.2 Effort Based Procedure Evaluation

The next step is to introduce the notion of "effort": the amount of effort that the CogPrime system must undertake in order to carry out a procedure evaluation. The notion of effort is encapsulated in Effort objects, which may take various forms. The simplest Effort objects measure only elapsed processing time; more advanced Effort objects take into consideration other factors such as memory usage.

An effort-based Combo tree evaluator keeps a running total of the effort used in evaluating the Combo tree. This is necessary if inference is to be used to evaluate Predicates, Schemata, Arguments, etc. Without some control of effort expenditure, the system could do an arbitrarily large amount of inference to evaluate a single Atom.

The matter of evaluation effort is nontrivial, because in many cases a given node of a Combo tree may be evaluated in more than one way, with a significant effort differential between the different methodologies. If a Combo tree node refers to a predicate or schema that is very costly to evaluate, then the ProcedureEvaluator managing the evaluation of the Combo tree must decide whether to evaluate it directly (expensive) or estimate the result using inference (cheaper but less accurate).
This decision depends on how much effort the ProcedureEvaluator has to play with, and what percentage of this effort it finds judicious to apply to the particular Combo tree node in question. In the relevant prototypes we built within OpenCog, this kind of decision was made based on some simple heuristics inside ProcedureEvaluator objects. However, it's clear that, in general, more powerful intelligence must be applied here, so one needs ProcedureEvaluators that - in cases of sub-procedures that are both important and highly expensive - use PLN inference to figure out how much effort to assign to a given subproblem.

The simplest useful kind of effort-based Combo tree evaluator is the EffortIntervalComboTreeEvaluator, which utilizes an Effort object that contains three numbers (yes, no, max). The yes parameter tells it how much effort should be expended to evaluate an Atom if there is a ready answer in the AtomTable. The no parameter tells it how much effort should be expended in the case that there is not a ready answer in the AtomTable. The max parameter tells it how much effort should be expended, at maximum, to evaluate all the Atoms in the Combo tree, before giving up. Zero effort, in the simplest case, may be heuristically defined as simply looking into the AtomTable - though in reality this does of course take effort, and a more sophisticated treatment would incorporate this as a factor as well. Quantification of amounts of effort is nontrivial, but a simple heuristic guideline is to assign one unit of effort for each inference step. Thus, for instance:

• (yes, no, max) = (0, 5, 1000) means that if an Atom can be evaluated by AtomTable lookup, this is done, but if AtomTable lookup fails, a minimum of 5 inference steps are done to try to do the evaluation. It also says that no more than 1000 evaluations will be done in the course of evaluating the Combo tree.
• (yes, no, max) = (3, 5, 1000) says the same thing, but with the change that even if evaluation could be done by direct AtomTable lookup, 3 inference steps are tried anyway, to try to improve the quality of the evaluation.

25.2.3 Procedure Evaluation with Adaptive Evaluation Order

While tracking effort enables the practical use of inference within Combo tree evaluation, if one has truly complex Combo trees, then a higher degree of intelligence is necessary to guide the evaluation process appropriately. The order of evaluation of a Combo tree may be determined adaptively, based on up to three things:

• The history of evaluation of the Combo tree
• Past history of evaluation of other Combo trees, as stored in a special AtomTable consisting only of relationships about Combo tree evaluation-order probabilities
• New information entering into CogPrime's primary AtomTable during the course of evaluation

ProcedureEvaluator objects may be selected at runtime by cognitive schemata, and they may also utilize schemata and MindAgents internally. The AdaptiveEvaluationOrderComboTreeEvaluator is more complex than the other ProcedureEvaluators discussed, and will involve various calls to CogPrime MindAgents, particularly those concerned with PLN inference.
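A minimal sketch of the (yes, no, max) policy, with hypothetical stand-ins for AtomTable lookup and for single inference steps; everything named here is an assumption for illustration:

# Illustrative sketch of an effort-interval evaluation policy.
def evaluate_atom(atom, table, infer, effort, budget):
    yes, no, _ = effort
    if atom in table:
        value = table[atom]
        steps = yes               # optionally refine even a cached answer
    else:
        value, steps = None, no   # no ready answer: fall back to inference
    for _ in range(min(steps, budget["left"])):
        value = infer(atom, value)    # one inference step costs one unit
        budget["left"] -= 1
    return value

budget = {"left": 1000}                           # the "max" component
table = {"A": 0.9}
infer = lambda atom, v: 0.5 if v is None else v   # trivial stand-in for PLN
print(evaluate_atom("A", table, infer, (0, 5, 1000), budget))  # 0.9, no steps
print(evaluate_atom("B", table, infer, (0, 5, 1000), budget))  # 0.5, inferred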
25.3 The Procedure Evaluation Process

Now we give a more thorough treatment of the procedure evaluation process, as embodied in the effort-based or adaptive-evaluation-order evaluators discussed above. The process of procedure evaluation is somewhat complex, because it encompasses three interdependent processes:

• The mechanics of procedure evaluation, which in the CogPrime design involves traversing Combo trees in an appropriate order. When a Combo tree node referring to a predicate or schema is encountered during Combo tree traversal, the process of predicate evaluation or schema execution must be invoked.
• The evaluation of the truth values of predicates - which involves a combination of inference and (in the case of grounded predicates) procedure evaluation.
• The computation of the truth values of schemata - which may involve inference as well as procedure evaluation.

We now review each of these processes.

25.3.1 Truth Value Evaluation

What happens when the procedure evaluation process encounters a Combo tree node that represents a predicate or compound term? The same thing as when some other CogPrime process decides it wants to evaluate the truth value of a PredicateNode or CompoundTermNode: the generic process of predicate evaluation is initiated. This process is carried out by a TruthValueEvaluator object. There are several varieties of TruthValueEvaluator, which fall into the following hierarchy:

TruthValueEvaluator
    DirectTruthValueEvaluator (abstract)
        SimpleDirectTruthValueEvaluator
    InferentialTruthValueEvaluator (abstract)
        SimpleInferentialTruthValueEvaluator
    MixedTruthValueEvaluator

A DirectTruthValueEvaluator evaluates a grounded predicate by directly executing it on one or more inputs; an InferentialTruthValueEvaluator evaluates via inference based on the previously recorded, or specifically elicited, behaviors of other related predicates or compound terms. A MixedTruthValueEvaluator contains references to a DirectTruthValueEvaluator and an InferentialTruthValueEvaluator, and contains a weight that tells it how to balance the outputs from the two.

Direct truth value evaluation has two cases. In one case, there is a given argument for the predicate; then, one simply plugs this argument in to the predicate's internal Combo tree, and passes the problem off to an appropriately selected ProcedureEvaluator. In the other case, there is no given argument, and one is looking for the truth value of the predicate in general. In this latter case, some estimation is required. It is not plausible to evaluate the truth value of a predicate on every possible argument, so one must sample a number of arguments and then record the resulting probability distribution. A greater or smaller number of samples may be taken, based on the amount of effort that's been allocated to the evaluation process (a minimal sketch follows below). It's also possible to evaluate the truth value of a predicate in a given context (information that's recorded via embedding in a ContextLink); in this case the random sampling is restricted to inputs that lie within the specified context.
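A minimal sketch of such sampling-based direct evaluation, with the predicate and argument pool as toy assumptions; the (strength, weight-of-evidence) pairing loosely mirrors the truth value convention used above:

# Illustrative sketch: estimate a predicate's general truth value by sampling.
import random

def estimate_truth_value(predicate, argument_pool, effort):
    n = max(1, int(10 * effort))          # more effort, more samples
    sample = random.choices(argument_pool, k=n)
    hits = sum(1 for arg in sample if predicate(arg))
    return hits / n, n                    # (strength, weight of evidence)

is_even = lambda x: x % 2 == 0
print(estimate_truth_value(is_even, list(range(100)), effort=5.0))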
(In the case where a specific argument is given, one searches for similar predicates that have been evaluated on similar arguments.) If more effort is available, then a more sophisticated strategy may be taken. Generally, an InferentialTruthValueEvaluator may invoke a SchemaNode that embodies an inference control strategy for guiding the truth value estimation process. These SchemaNodes may then be learned like any others.

Finally, a MixedTruthValueEvaluator operates by consulting a DirectTruthValueEvaluator and/or an InferentialTruthValueEvaluator as necessary, and merging the results. Specifically, in the case of an ungrounded PredicateNode, it simply returns the output of the InferentialTruthValueEvaluator it has chosen. But in the case of a GroundedPredicateNode, it returns a weighted average of the directly evaluated and inferred values, where the weight is a parameter. In general, this weighting may be done by a SchemaNode that is selected by the MixedTruthValueEvaluator; and these schemata may be adaptively learned.

25.3.2 Schema Execution

Finally, schema execution is handled similarly to truth value evaluation, but it's a bit simpler in the details. Schemata have their outputs evaluated by SchemaExecutor objects, which in turn invoke ProcedureEvaluator objects. We have the hierarchy:

SchemaExecutor
    DirectSchemaExecutor (abstract)
        SimpleDirectSchemaExecutor
    InferentialSchemaExecutor (abstract)
        SimpleInferentialSchemaExecutor
    MixedSchemaExecutor

A DirectSchemaExecutor evaluates the output of a schema by directly executing it on some inputs; an InferentialSchemaExecutor evaluates via inference based on the previously recorded, or specifically elicited, behaviors of other related schemata. A MixedSchemaExecutor contains references to a DirectSchemaExecutor and an InferentialSchemaExecutor, and contains a weight that tells it how to balance the outputs from the two (not always obvious, depending on the output type in question).

Contexts may be used in schema execution, but they're used only indirectly, via being passed to TruthValueEvaluators used for evaluating truth values of PredicateNodes or CompoundTermNodes that occur internally in schemata being executed.
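A minimal sketch of the merging step just described, with the weighting scheme as an assumption (in the full design this weight could itself be supplied by a learned SchemaNode):

# Illustrative sketch of MixedTruthValueEvaluator-style merging.
def mixed_truth_value(direct, inferred, w_direct=0.7):
    if direct is None:                 # ungrounded predicate: inference only
        return inferred
    return w_direct * direct + (1 - w_direct) * inferred

print(mixed_truth_value(direct=0.9, inferred=0.6))   # 0.81
print(mixed_truth_value(direct=None, inferred=0.6))  # 0.6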
Section III Perception and Action

Chapter 26 Perceptual and Motor Hierarchies

26.1 Introduction

Having discussed declarative, attentional, intentional and procedural knowledge, we are left only with sensorimotor and episodic knowledge to complete our treatment of the basic CogPrime "cognitive cycle" via which a CogPrime system can interact with an environment and seek to achieve its goals therein. The cognitive cycle in its most basic form leaves out the most subtle and unique aspects of CogPrime, which all relate to learning in various forms. But nevertheless it is the foundation on which CogPrime is built, and within which the various learning processes dealing with the various forms of memory all interact.

The CogPrime cognitive cycle is more complex in many respects than it would need to be if not for the need to support diverse forms of learning. And this learning-driven complexity is present to some extent in the contents of the present chapter as well. If learning weren't an issue, perception and actuation could more likely be treated as wholly (or near-wholly) distinct modules, operating according to algorithms and structures independent of cognition.
But our suspicion is that this sort of approach is unlikely to be adequate for achieving high levels of perception and action capability under real-world conditions. Instead, we suspect, it's necessary to create perception and action processes that operate fairly effectively on their own, but are capable of cooperating with cognition to achieve yet higher levels of functionality.

And the benefit in such an approach goes both ways. Cognition helps perception and actuation deal with difficult cases, where the broad generalization that is cognition's specialty is useful for appropriately biasing perception and actuation based on subtle environmental regularities. And the patterns involved in perception and actuation help cognition, by supplying a rich reservoir of structures and processes to use as analogies for reasoning and learning at various levels of abstraction. The prominence of visual and other sensory metaphors in abstract cognition is well known [Arn69, Gar00]; and according to Lakoff and Nunez [LN00], even pure mathematics is grounded in physical perception and action in very concrete ways.

We begin by discussing the perception and action mechanisms required to interface CogPrime with an agent operating in a virtual world. We then turn to the more complex mechanisms needed to effectively interface CogPrime with a robot possessing vaguely humanoid sensors and actuators, focusing largely on vision processing. This discussion leads up to deeper discussions in Chapters 27, 28 and 29, where we describe in detail the strategy that would be used to integrate CogPrime with the DeSTIN framework for AGI perception/action (which was described in some detail in Chapter 4 of Part 1).

In terms of the integrative cognitive architecture presented in Chapter 5, the material presented in the chapters in this section has mostly to do with the perceptual and motor hierarchies, also touching on the pattern recognition and imprinting processes that play a role in the interaction between these hierarchies and the conceptual memory.

The commitment to a hierarchical architecture for perception and action is not critical for the CogPrime design as a whole - one could build a CogPrime with non-hierarchical perception and action modules, and the rest of the system would be about the same. The role of hierarchy here is a reflection of the obvious hierarchical structure of the everyday human environment, and of the human body. In a world marked by hierarchical structure, a hierarchically structured perceptual system is advantageous. To control a body marked by hierarchical structure, a hierarchically structured action system is advantageous. It would be possible to create a CogPrime system without this sort of in-built hierarchical structure, and have it gradually self-adapt in such a way as to grow its own internal hierarchical structure, based on its experience in the world. However, this might be a case of pushing the "experiential learning" perspective too far. The human brain definitely has hierarchical structure built into it; it doesn't need to learn to experience the world in hierarchical terms; and there seems to be no good reason to complicate an AGI's early development phase by forcing it to learn the basic facts of the world's and its body's hierarchality.

26.2 The Generic Perception Process

We have already discussed the generic action process of CogPrime, in Chapter 25 on procedure evaluation. Action sequences are generated by Combo programs, which execute primitive actions, including those corresponding to actuator control signals as well as those corresponding to, say, mathematical or cognitive operations. In some cases the actuator control signals may directly dictate movements; in other cases they may supply inputs and/or parameters to other software (such as DeSTIN, in the integrated CogBot architecture to be described below).

What about the generic perception process? We distinguish sensation from perception, in a CogPrime context, by defining:

• perception as what occurs when some signal from the outside world registers itself in either a CogPrime Atom, or some other sort of node (e.g. a DeSTIN node) that is capable of serving as the target of a CogPrime Link
• sensation as any "preprocessing" that occurs between the impact of some signal on some sensor, and the creation of a corresponding perception

Once perceptual Atoms have been created, various perceptual MindAgents come into play, taking perceptual schemata (schemata whose arguments are perceptual nodes or relations therebetween) and applying them to Atoms recently created (creating appropriate ExecutionLinks to store the results). The need to have special, often modality-specific perception MindAgents to do this, instead of just leaving it to the generic SchemaExecution MindAgent, has to do with computational efficiency, scheduling and parameter settings. Perception MindAgents are doing schema execution urgently, and doing it with parameter settings tuned for perceptual processing. This means that, except in unusual circumstances, newly received stimuli will be processed immediately by the appropriate perceptual schemata.

Some newly formed perceptual Atoms will have links to existing Atoms, ready-made at their moment of creation. CharacterInstanceNodes and NumberInstanceNodes are examples; they are born linked to the appropriate CharacterNodes and NumberNodes. Of course, Atoms representing perceived relationships, perceived groupings, etc., will not have ready-made links and will have to grow such links via various cognitive processes. Also, the ContextFormation MindAgent looks at perceptual Atom creation events and creates ContextNodes accordingly; and this must be timed so that the ContextNodes are entered into the system rapidly, so that they can be used by the processes doing initial-stage link creation for new perceptual Atoms.

In a full CogPrime configuration, newly created perceptual nodes and perceptual schemata may reside in special perception-oriented Units, so as to ensure that perceptual processes occur rapidly, not delayed by slower cognitive processes.

26.2.1 The ExperienceDB

Separate from the ordinary perception process, it may also be valuable for there to be a direct route from the system's sensory sources to a special "ExperienceDB" database that records all of the system's experience. This does not involve perceptual schemata at all, nor is it left up to the sensory source; rather, it is carried out by the CogPrime server at the point where it receives input from the sensory source. This experience database is a record of what the system has seen in the past, and may be mined by the system in the future for various purposes.
The creation of new perceptual Atoms may also be stored in the experience database, but this must be handled with care, as it can pose a large computational expense; it will often be best to store only a subset of these.

Obviously, such an ExperienceDB is something that has no correlate in the human mind/brain. This is a case where CogPrime takes advantage of the non-brainlike properties of its digital computer substrate. The CogPrime perception process is intended to work perfectly well without access to the comprehensive database of experiences potentially stored in the ExperienceDB. However, a complete record of a mind's experience is a valuable thing, and there seems no reason for the system not to exploit it fully. Advantages like this allow the CogPrime system to partially compensate for its lack of some of the strengths of the human brain as an AI platform, such as massive parallelism.

26.3 Interfacing CogPrime with a Virtual Agent

We now discuss some of the particularities of connecting CogPrime to a virtual world (such as Second Life, Multiverse, or Unity3D, to name some of the virtual world / gaming platforms to which OpenCog has already been connected in practice).

26.3.1 Perceiving the Virtual World

The most complex, high-bandwidth sensory data coming in from a typical virtual world is visual data, so that will be our focus here. We consider three modes in which a virtual world may present visual data to CogPrime (or any other system):

• Object vision: CogPrime receives information about polygonal objects and their colors, textures and coordinates (each object is a set of contiguous polygons, and sometimes objects have "type" information, e.g. cube or sphere)
• Polygon vision: CogPrime receives information about polygons and their colors, textures and coordinates
• Pixel vision: CogPrime receives information about pixels and their colors and coordinates

In each case, coordinates may be given either in "world coordinates" or in "relative coordinates" (relative to the gaze). This distinction is not a huge deal, since within an architecture like CogPrime, supplying schemata for coordinate transformation is trivial; and, even if treated as a machine learning task, this sort of coordinate transformation is not very difficult to learn. Our current approach is to prefer relative coordinates, as this approach is more natural in terms of modern Western human psychology; but we note that in some other cultures world coordinates are preferred and considered more psychologically natural.

Currently we have not yet done any work with pixel vision in virtual worlds. We have been using object vision for most of our experiments, and consider a combination of polygon vision and object vision as the "right" approach for early AGI experiments in a virtual worlds context. The problem with pure object vision is that it removes the possibility for CogPrime to understand object segmentation. If, for instance, CogPrime perceives a person as a single object, then how can it recognize a head as a distinct sub-object? Feeding the system a prefigured hierarchy of objects, sub-objects and so forth seems inappropriate in the context of an experiential learning system. On the other hand, the use of polygon vision instead of pixel vision seems to meet no such objections. This may take different forms on different platforms.
For instance, in our work with a Minecraft-like world in the Unity3D environment, we have relied heavily on virtual objects made of blocks, in which case the polygons of most interest are the faces of the blocks. Momentarily sticking with the object vision case for simplicity, examples of the perceptions emanating from the virtual world perceptual preprocessor into CogPrime are things like:

1. I am at world-coordinates $W
2. Object with metadata $M is at world-coordinates $W
3. Part of object with metadata $M is at world-coordinates $W
4. Avatar with metadata $M is at world-coordinates $W
5. Avatar with metadata $M is carrying out animation $A
6. Statements in natural language, from the pet owner

The perceptual preprocessor takes these signals and translates them into Atoms, making use of the special Atomspace mechanisms for efficiently indexing spatial and temporal information (the TimeServer and SpaceServer) as appropriate.

26.3.1.1 Transforming Real-World Vision into Virtual Vision

One approach to enabling CogPrime to handle visual data coming from the real world is to transform this data into data of the type CogPrime sees in the virtual world. While this is not the approach we are taking in our current work, we do consider it a viable strategy, and we briefly describe it here. One approach along these lines would involve multiple phases:

• Use a camera eye and a LiDAR (Light Detection And Ranging, used for high-resolution topographic mapping) sensor in tandem, so as to avoid having to deal with stereo vision
• Using the above two inputs, create a continuous 3D contour map of the perceived visual world
• Use standard mathematical transforms to polygon-ize the 3D contour map into a large set of small polygons
• Use heuristics to merge together the small polygons, obtaining a smaller set of larger polygons (but retaining the large set of small polygons for the system to reference in cases where a high level of detail is necessary)
• Feed the polygons into the perceptual pattern mining subsystem, analogously to the polygons that come in from the virtual world

In this approach, preprocessing is used to make the system see the physical world in a manner analogous to how it sees the virtual world. This is quite different from the DeSTIN-based approach to CogPrime vision that we will discuss in Chapter 28, but may well also be feasible.

26.3.2 Acting in the Virtual World

Complementing the perceptual preprocessor is the action postprocessor: code that translates the actions and action-sequences generated by CogPrime into instructions the virtual world can understand (such as "launch thus-and-thus animation"). Due to the particularities of current virtual world architectures, the current OpenCogPrime system carries out actions via executing pre-programmed high-level procedures, such as "move forward one step", "bend over forward" and so forth. Example action commands are:

1. Move($D, $S) : $D is a distance, $S is a speed
2. Turn($A, $S) : $A is an angle, $S is a speed
3. Pitch($A, $S) : turn vertically up/down... [for birds only]
4. Jump($D, $H, $S) : $H is a maximum height, at the center of the jump
5. Say($T) : $T is text; for agents with linguistic capability, which is not enabled in the current version
6. pick_up($O) : $O is an object
7. put_down($O)
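To make the postprocessing step concrete, here is a minimal sketch of a postprocessor that maps high-level action commands like those above onto messages a virtual world client can consume. The animation names and message format are invented for illustration; they are not the actual OpenCog or Multiverse protocol.

    import json

    # Hypothetical mapping from CogPrime action commands to virtual-world
    # instructions. Each entry names the animation the world expects and
    # the parameters it takes, in order.
    ACTION_TABLE = {
        "Move":     ("walk_animation",    ["distance", "speed"]),
        "Turn":     ("turn_animation",    ["angle", "speed"]),
        "Jump":     ("jump_animation",    ["distance", "height", "speed"]),
        "pick_up":  ("grab_animation",    ["object_id"]),
        "put_down": ("release_animation", ["object_id"]),
    }

    def postprocess(action_name, *args):
        """Translate one CogPrime action command into a world instruction."""
        animation, param_names = ACTION_TABLE[action_name]
        return json.dumps({
            "command": animation,
            "params": dict(zip(param_names, args)),
        })

    # Example: the Combo-generated command pick_up($O), with $O bound to
    # object 42, becomes a message the virtual world client can execute.
    print(postprocess("pick_up", 42))
    print(postprocess("Move", 1.0, 0.5))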
This is admittedly a crude approach, and if a robot simulator rather than a typical virtual world were used, it would be possible for CogPrime to emanate detailed servomotor control commands rather than high-level instructions such as these. However, as noted in Chapter 16 of Part 1, at the moment there is no such thing as a "massive multiplayer robot simulator," and so the choice is between a multi-participant virtual environment (like the Multiverse environment currently used with the PetBrain) or a small-scale robot simulator. Our experiments with virtual worlds so far have used the high-level approach described here; but we are also experimenting with using physical robots and corresponding simulators, as will be described below.

26.4 Perceptual Pattern Mining

Next we describe how perceptual pattern mining may be carried out, to recognize meaningful structures in the stream of data produced via perceiving a virtual or physical world. In this subsection we discuss the representation of knowledge, and then in the following subsection we discuss the actual mining. We discuss the process in the context of virtual-world perception as outlined above, but the same processes apply to robotic perception, whether one takes the "physical world as virtual world" approach described above or a different sort of approach such as the DeSTIN hybridization approach described below.

26.4.1 Input Data

First, we may assume that each perception is recorded as a set of "transactions", each of which is of the form

(Time, 3D coordinates, object type)

or

(Time, 3D coordinates, action type)

Each transaction may also come with an additional list of (attribute, value) pairs, where the list of attributes is dependent upon the object or action type. Transactions are represented as Atoms, and don't need to be a specific Atom type - but are referred to here by the special name "transactions" simply to make the discussion clear.

Next, define a transaction template as a transaction with location and time information set to wild cards - and potentially, some other attributes set to wild cards. (These are implemented in terms of Atoms involving VariableNodes.) For instance, some transaction templates in the current virtual-world might be informally represented as:

• Reward
• Red cube
• kick
• move_forward
• Cube
• Cube, size 5
• inc
• Teacher

26.4.2 Transaction Graphs

Next we may conceive a transaction graph, whose nodes are transactions and whose links are labeled with labels like after, SimAND, SSeqAND (short for SimultaneousSequentialAND), near, in_front_of, and so forth (and whose links are weighted as well). We may also conceive a transaction template graph, whose nodes are transaction templates, and whose links are the same as in the transaction graph. Examples of transaction template graphs are

near(Cube, Teacher)
SSeqAND(move_forward, Reward)

where Cube, Teacher, etc. are transaction templates, since Time and 3D coordinates are left unspecified. And finally, we may conceive a transaction template relationship graph (TTRG), whose nodes may be any of: transactions; transaction templates; basic spatiotemporal predicates evaluated at tuples of transactions or transaction templates.
For instance

SimAND(near(Cube, Teacher), above(Cube, Chair))

26.4.3 Spatiotemporal Conjunctions

Define a temporal conjunction as a conjunction involving SimultaneousAND and SequentialAND operators (including SSeqAND as a special case of SeqAND: the special case that interests us in the short term). The conjunction is therefore ordered, e.g.

A SSeqAND B SimAND C SSeqAND D

We may assume that the order of operations favors SimAND, so that no parenthesizing is necessary. Next, define a basic spatiotemporal conjunction as a temporal conjunction that conjoins terms that are either

• transactions, or
• transaction templates, or
• basic spatiotemporal predicates applied to tuples of transactions or transaction templates

I.e. a basic spatiotemporal conjunction is a temporal conjunction of nodes from the transaction template relationship graph. An example would be:

(hold ball) SimAND (near(me, teacher)) SSeqAND Reward

This assumes that the hold action has an attribute that is the type of object held, so that hold ball in the above temporal conjunction is a shorthand for the transaction template specified by

action type: hold
object_held_type: ball

This example says that if the agent is holding the ball and is near the teacher, then shortly after that, the agent will get a reward.

26.4.4 The Mining Task

The perceptual mining task, then, is to find basic spatiotemporal conjunctions that are interesting. What constitutes interestingness is multifactorial, and includes whether the conjunction:

• involves important Atoms (e.g. Reward)
• has a high temporal cohesion (i.e. the strength of the time relationships embodied in the SimAND and SeqAND links is high)
• has a high spatial cohesion (i.e. the near() relationships have high strength)
• has a high frequency
• has a high surprise value (its frequency is far from what would be predicted by its component sub-conjunctions)

Note that a conjunction can be interesting without satisfying all these criteria; e.g. if it involves something important and has a high temporal cohesion, we want to find it regardless of its spatial cohesion. In preliminary experiments we have worked with a provisional definition of "interestingness" as the combination of frequency and temporal cohesion, but this must be extended; and one may even wish to have the combination function optimized over time (slowly), where the fitness function is defined in terms of the STI and LTI of the concepts generated.

26.4.4.1 A Mining Approach

One tractable approach to perceptual pattern mining is greedy and iterative, involving the following steps:

1. Build an initial transaction template graph G
2. Greedily mine some interesting basic spatiotemporal conjunctions from it, adding each interesting conjunction found as a new node in G (so that G becomes a transaction template relationship graph), repeating step 2 until boredom results or time runs out

The same TTRG may be maintained over time, but of course will require a robust forgetting mechanism once the history gets long or the environment gets nontrivially complex. The greedy mining step may involve simply grabbing SeqAND or SimAND links with probability determined by the importance and/or interestingness of their targets, and the probabilistic strength and temporal strength of the temporal AND relationship, and then creating conjunctions based on these links (which then become new nodes in the TTRG, so they can be built up into larger conjunctions).
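The following is a minimal sketch of this greedy step, under simplifying assumptions of our own (interestingness collapsed to importance times temporal strength, TTRG links stored as tuples); it is meant only to illustrate the control flow, not the actual OpenCog mining code.

    import random

    # A TTRG edge: (label, source, target, strength). Labels "SimAND" and
    # "SSeqAND" carry temporal-cohesion strengths in [0, 1]; nodes are
    # template names or previously mined conjunctions (tuples).
    def greedy_mine(edges, importance, iterations=100, threshold=0.5):
        """One pass of greedy conjunction mining over a TTRG (illustrative)."""
        mined = set()
        for _ in range(iterations):
            # Sample a temporal link, biased toward important targets and
            # strong temporal AND relationships.
            weights = [importance.get(src, 0.1) * importance.get(tgt, 0.1) * s
                       for (label, src, tgt, s) in edges]
            total = sum(weights)
            if total == 0:
                break
            r, acc = random.uniform(0, total), 0.0
            for (label, src, tgt, s), w in zip(edges, weights):
                acc += w
                if acc >= r:
                    conj = (src, label, tgt)
                    # Keep the conjunction if it looks interesting; it then
                    # becomes a new TTRG node, available for building up
                    # larger conjunctions.
                    if s >= threshold and conj not in mined:
                        mined.add(conj)
                        importance[conj] = importance.get(src, 0.1) * s
                    break
        return mined

    edges = [("SimAND", "hold ball", "near(me, teacher)", 0.9),
             ("SSeqAND", "near(me, teacher)", "Reward", 0.8)]
    importance = {"Reward": 1.0, "hold ball": 0.5, "near(me, teacher)": 0.5}
    print(greedy_mine(edges, importance))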
26.5 The Perceptual-Motor Hierarchy

The perceptual pattern mining approach described above is "flat," in the sense that it simply proposes to recognize patterns in a stream of perceptions, without imposing any kind of explicitly hierarchical structure on the pattern recognition process or the memory of perceptual patterns. This is different from how the human visual system works, with its clear hierarchical structure, and also different from many contemporary vision architectures, such as DeSTIN or Hawkins' Numenta system, which also utilize hierarchical neural networks. However, the approach described above may easily be made hierarchical within the CogPrime architecture, and this is likely the most effective way to deal with complex visual scenes.

Most simply, in this approach, a hierarchy may be constructed corresponding to different spatial regions within the visual field. The RegionNodes at the lowest level of the hierarchy correspond to small spatial regions, the ones at the next level up correspond to slightly larger spatial regions, and so forth. Each RegionNode also corresponds to a certain interval of time, and there may be different RegionNodes corresponding to the same spatial region but with different time-durations attached to them. RegionNodes may correspond to overlapping rather than disjoint regions.

Within each region mapped by a RegionNode, then, perceptual pattern mining as defined in the previous section may occur. The patterns recognized in a region are linked to the corresponding RegionNode - and are then fed as inputs to the RegionNodes corresponding to larger, encompassing regions; and as suggestions-to-guide-pattern-recognition to nearby RegionNodes on the same level. This architecture embodies the fundamental hierarchical structure/dynamic observed in the human visual cortex. Thus, the hierarchy induces a dynamic of patterns-within-patterns-within-patterns, and the heterarchy induces a dynamic of patterns-spawning-similar-patterns. Also, patterns found in a RegionNode should be used to bias the pattern-search in the RegionNodes corresponding to smaller, contained regions: for instance, if many of the sub-regions corresponding to a certain region have revealed parts of a face, then the pattern-mining processes in the remaining sub-regions may be instructed to look for other face-parts.

This architecture permits the hierarchical dynamics utilized in standard hierarchical vision models, such as Jeff Hawkins' and other neural net models, but within the context of CogPrime's pattern-mining approach to perception. It is a good example of the flexibility intrinsic to the CogPrime architecture.

Finally, why have we called it a perceptual-motor hierarchy above? This is because, due to the embedding of the perceptual hierarchy in CogPrime's general Atom-network, the percepts in a certain region will automatically be linked to actions occurring in that region. So, there may be some perception-cognition-action interplay specific to a region, occurring in parallel with the dynamics in the hierarchy of multiple regions. Clearly this mirrors some of the complex dynamics occurring in the human brain, and is also reflected in the structure of sophisticated perceptual-motor approaches like DeSTIN, to be discussed below.
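Before moving on, here is a minimal sketch of the region hierarchy just described. For simplicity it assumes quadtree-style disjoint regions (though, as noted above, regions may overlap) and leaves the per-region mining function abstract; all names are hypothetical.

    class RegionNode:
        """A node mining patterns over one spatial region of the visual field."""

        def __init__(self, bounds, level, parent=None):
            self.bounds = bounds     # (x0, y0, x1, y1) region of the field
            self.level = level
            self.parent = parent
            self.patterns = []       # patterns mined within this region

        def mine(self, percepts, mine_fn):
            # Mine only the percepts falling inside this region's bounds,
            # then feed the results upward to the encompassing region.
            x0, y0, x1, y1 = self.bounds
            local = [p for p in percepts
                     if x0 <= p["x"] < x1 and y0 <= p["y"] < y1]
            self.patterns = mine_fn(local)
            if self.parent is not None:
                self.parent.patterns.extend(self.patterns)

    def build_hierarchy(bounds, levels):
        """Quadtree-style hierarchy: each node has four half-size children."""
        root = RegionNode(bounds, levels)
        frontier = [root]
        for level in range(levels - 1, 0, -1):
            next_frontier = []
            for node in frontier:
                x0, y0, x1, y1 = node.bounds
                mx, my = (x0 + x1) / 2, (y0 + y1) / 2
                for b in [(x0, y0, mx, my), (mx, y0, x1, my),
                          (x0, my, mx, y1), (mx, my, x1, y1)]:
                    next_frontier.append(RegionNode(b, level, parent=node))
            frontier = next_frontier
        return root, frontier   # root plus the lowest-level regions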
26.6 Object Recognition from Polygonal Meshes

Next we describe a more specific perceptual pattern recognition algorithm - a strategy for identifying objects in a visual scene that is perceived as a set of polygons. It is not a thoroughly detailed algorithmic approach, but rather a high-level description of how this may be done effectively within the CogPrime design. It is offered here largely as an illustration of how specialized perceptual data processing algorithms may be designed and implemented within the CogPrime framework.

We deal here with an agent whose perception of the world, at any point in time, is understood to consist of a set of polygons, each one described in terms of a list of corners. The corners may be assumed to be described in coordinates relative to the viewing eye of the agent.

What we mean by "identifying objects" here is something very simple. We don't mean identifying that a particular object is a chair, or is Ben's brown chair, or anything like that - we simply mean identifying that a given collection of polygons is meaningfully grouped into an object. That is the task considered here. The object could be a single block, it could be a person, or it could be a tower of blocks (which appears as a single object until it is taken apart). Of course, not all approaches to polygon-based vision processing would require this sort of phase: it would be possible, as an alternative, to simply compare the set of polygons in the visual field to a database of prior experience and then do object identification (in the present sense) based on this database-comparison. But in the approach described in this section, one begins instead with an automated segmentation of the set of perceived polygons into a set of objects.

26.6.1 Algorithm Overview

The algorithm described here falls into three stages:

1. Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes.
2. Creating Adjacency Graphs from PPNodes.
3. Clustering in the Adjacency Graph.

Each of these stages involves a number of details, not all of which have been fully resolved: this section just gives a conceptual overview. We will speak in terms of objects such as PolygonNode, PPNode and so forth, because inside the CogPrime AI engine, observed and conceived entities are represented as nodes in a graph. However, this terminology is not very important here, and what we call a PolygonNode here could just as well be represented in a host of other ways, within the overall CogPrime framework.

26.6.2 Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes

A PolygonNode represents a polygon observed at a point in time. A PPNode represents a series of PolygonNodes that are heuristically guessed to represent the same polygon at different moments in time. Before "object permanence" is learned, the heuristics for recognizing PPNodes will only work in the case of a persistent polygon that, over an interval of time, is experiencing relative motion within the visual field, but is never leaving the visual field. For example, a reasonable heuristic is:

If P1 occurs at time t, P2 occurs at time s where s is very close to t, and P1 and P2 are similar in shape, size, color and position, then P1 and P2 should be grouped together into the same PPNode.
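A minimal sketch of this grouping heuristic follows. The similarity threshold and the polygon features used are invented for illustration; real heuristics would be tuned, and would treat the similarity dimensions separately.

    def similar(p1, p2, tol=0.2):
        """Crude similarity check between two observed polygons (illustrative).

        Each polygon is a dict with 'centroid', 'area' and 'color' features;
        tol is an invented tuning parameter."""
        d = sum((a - b) ** 2
                for a, b in zip(p1["centroid"], p2["centroid"])) ** 0.5
        return (d < tol
                and abs(p1["area"] - p2["area"]) / max(p1["area"], 1e-9) < tol
                and p1["color"] == p2["color"])

    def group_ppnodes(observations, dt=0.1):
        """Group time-stamped PolygonNode observations into PPNodes.

        observations: list of (time, polygon) pairs, in time order.
        Returns a list of PPNodes, each a list of observations judged to
        be the same persistent polygon seen at successive moments."""
        ppnodes = []
        for t, poly in observations:
            for pp in ppnodes:
                last_t, last_poly = pp[-1]
                if t - last_t <= dt and similar(poly, last_poly):
                    pp.append((t, poly))
                    break
            else:
                ppnodes.append([(t, poly)])   # start a new PPNode
        return ppnodes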
More advanced heuristics would deal carefully with the case where some of these similarities did not hold, which would allow us to deal, e.g., with the case where an object was rapidly changing color.

In the case where the polygons are coming from a simulation world like OpenSim, then from our positions as programmers and world-masters, we can see that what a PPNode is supposed to correspond to is a certain side of a certain OpenSim object; but it doesn't appear immediately that way to CogPrime when controlling an agent in OpenSim, since CogPrime isn't perceiving OpenSim objects, it's perceiving polygons. On the other hand, in the case where polygons are coming from software that postprocesses the output of a LiDAR-based vision system, then the piecing together of PPNodes from PolygonNodes is really necessary.

26.6.3 Creating Adjacency Graphs from PPNodes

Having identified PPNodes, we may then draw a graph between PPNodes, a PPGraph (also called an "Adjacency Graph"), wherein the links are AdjacencyLinks (with weights indicating the degree to which the two PPNodes tend to be adjacent, over time). A more refined graph might also involve SpatialCoordinationLinks (with weights indicating the degree to which the vector between the centroids of the two PPNodes tends to be consistent over time). We may then use this graph to do object identification:

• First-level objects may be defined as clusters in the graph of PPNodes.
• One may also make a graph between first-level objects, an ObjectGraph, with the same kinds of links as in the PPGraph. Second-level objects may be defined as clusters in the ObjectGraph.

The "strength" of an identified object may be assigned as the "quality" of the cluster (measured in terms of how tight the cluster is, and how well separated it is from other clusters). As an example, consider a robot with two parts: a body and a head. The whole body may have a moderate strength as a first-level object, but the head and body individually will have significantly greater strengths as first-level objects. On the other hand, the whole body should have a fairly high strength as a second-level object. It seems convenient (though not necessary) to have a PhysicalObjectNode type to represent the objects recognized via clustering; but the first versus second level object distinction should not need to be made on the Atom type level.

Building the adjacency graph requires a mathematical formula defining what it means for two PPNodes to be adjacent. Creating this formula may require a little tuning. For instance, the adjacency between two PPNodes PP1 and PP2 may be defined as the average over time of the adjacency of the PolygonNodes PP1(t) and PP2(t) observed at each time t. (A p'th power average 1 may be used here, and different values of p may be tried.) Then, the adjacency between two (simultaneous) PolygonNodes P1 and P2 may be defined as the average over all x in P1 of the minimum over all y in P2 of sim(x,y), where sim(·,·) is an appropriately scaled similarity function. This latter average could arguably be made a maximum; or perhaps even better a p'th power average with large p, which approximates a maximum.

1 The p'th power average of x_1, ..., x_n is defined as ((x_1^p + ... + x_n^p)/n)^(1/p).
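A minimal sketch of this adjacency computation follows, using the p'th power average just defined. We read sim(·,·) as a scaled vertex-to-vertex distance (so that the inner minimum picks out the nearest vertex of the other polygon); that reading, and the final mapping of separation to an adjacency weight, are our own simplifications.

    def p_power_average(values, p):
        """The p'th power average: ((x_1^p + ... + x_n^p)/n)^(1/p)."""
        return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

    def sim(x, y, scale=1.0):
        # Stand-in for the "appropriately scaled similarity function":
        # here, simply scaled Euclidean distance between two vertices.
        return (sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5) / scale

    def polygon_separation(p1, p2, p=4.0):
        """Average over x in p1 of the minimum over y in p2 of sim(x, y).
        The outer average is a p'th power average; large p approximates
        a maximum, as discussed in the text."""
        return p_power_average(
            [min(sim(x, y) for y in p2) for x in p1], p)

    def ppnode_adjacency(pp1, pp2, p=2.0):
        """Adjacency weight of two PPNodes: high when, averaged over time,
        their simultaneous polygon observations are close together."""
        seps = [polygon_separation(poly1, poly2)
                for (t1, poly1), (t2, poly2) in zip(pp1, pp2) if t1 == t2]
        if not seps:
            return 0.0
        return 1.0 / (1.0 + p_power_average(seps, p))  # map to (0, 1]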
26.6.4 Clustering in the Adjacency Graph

As noted above, the idea is that objects correspond to clusters in the adjacency graph. This means we need to implement some hierarchical clustering algorithm that is tailored to find clusters in symmetric weighted graphs. Probably some decent algorithms of this character exist; if not, it would be fairly easy to define one, e.g. by adapting some standard hierarchical clustering algorithm to deal with graphs rather than vectors.

Clusters will then be mapped into PhysicalObjectNodes, interlinked appropriately via PhysicalPartLinks and AdjacencyLinks. (E.g. there would be a PhysicalPartLink between the PhysicalObjectNode representing a head and the PhysicalObjectNode representing a body, where the body is considered as including the head.)

26.6.5 Discussion

It seems probable that, for simple scenes consisting of a small number of simple objects, clustering for object recognition will be fairly unproblematic. However, there are two cases that are potentially tricky:

• Sub-objects: e.g. the head and torso of a body, which may move separately; or the nose of the head, which may wiggle; or the legs of a walking dog; etc.
• Coordinated objects: e.g. if a character's hat is on a table, and then later on his head, then when it's on his head we basically want to consider him and his hat as the same object, for some purposes.

These examples show that partitioning a scene into objects is a borderline-cognitive rather than purely lower-level-perceptual task, which cannot be hard-wired in any very simple way. We also note that, for complex scenes, clustering may not work perfectly for object recognition, and some reasoning may be needed to aid with the process. Intuitively, these may correspond to scenes that, in human perceptual psychology, require conscious attention and focus in order to be accurately and usefully perceived.

26.7 Interfacing the Atomspace with a Deep Learning Based Perception-Action Hierarchy

We have discussed how one may do perception processing such as object recognition within the Atomspace, and this is indeed a viable strategy. But an alternate approach is also interesting, and likely more valuable in the case of robotic perception/action: build a separate perceptual-motor hierarchy, and link it in with the Atomspace. This approach is appealing in large part because a lot of valuable and successful work has already been done using neural networks and related architectures for perception and actuation. And it is not necessarily contradictory to doing perception processing in the Atomspace - obviously, one may have complementary, synergetic perception processing occurring in two different parts of the architecture.

This section reviews some general ideas regarding the interfacing of CogPrime with deep learning hierarchies for perception and action; the following chapter then discusses one example of this in detail, involving the DeSTIN deep learning architecture.

26.7.1 Hierarchical Perception/Action Networks

CogPrime could be integrated with a variety of different hierarchical perception/action architectures. For the purpose of this section, however, we will consider a class of architectures that is neither completely general nor extremely specific. Many of the ideas to be presented here are in fact more broadly applicable beyond the architecture described here.

The following assumptions will be made about the HPANs (Hierarchical Perception/Action Networks) to be hybridized with CogPrime. It may be best to use multiple HPANs, at least one for declarative/sensory/episodic knowledge (we'll call this the "primary HPAN") and one for procedural knowledge. An HPAN for intentional knowledge (a goal hierarchy; in DeSTIN called the "critic hierarchy") may be valuable as well.
We assume that each HPAN has the following properties:

1. It consists of a network of nodes, endowed with a learning algorithm, whose connectivity pattern is largely but not entirely hierarchical (and whose hierarchy contains feedback, feedforward and lateral connections)
2. It contains a set of input nodes, receiving perceptual inputs, at the bottom of the hierarchy
3. It has a set of output nodes, which may span multiple levels of the hierarchy. The "output nodes" send informational signals to cognitive processes lying outside the HPAN, or else control signals to actuators, which may be internal or external.
4. Other nodes besides I/O nodes may potentially be observed or influenced by external processes; for instance they may receive stimulation
5. Link weights in the HPAN get updated via some learning algorithm that is, roughly speaking, "statistically Hebbian," in the sense that, on the whole, when a set of nodes gets activated together for a period of time, the set will tend to become an attractor. By an attractor we mean: a set S of nodes such that the activation of a subset of S during a brief interval tends to lead to the activation of the whole set S during a reasonably brief interval to follow
6. As an approximate but not necessarily strict rule, nodes higher in the hierarchy tend to be involved in attractors corresponding to events or objects localized in larger spacetime regions

Examples of specific hierarchical architectures broadly satisfying these requirements are the visual pattern recognition networks constructed by Hawkins [HB06] and [PCP01], and Arel's DeSTIN system discussed earlier (and in more depth in following chapters). The latter appears to fit the requirements particularly snugly, due to having dynamics very well suited to the formation of a complex array of attractors, and a richer methodology for producing outputs. These are all not only HPANs but have a more particular structure that in Chapter 27 is called a Compositional Spatiotemporal Deep Learning Network or CSDLN.

The particulars of the use of HPANs with OpenCog are perhaps best explained via enumeration of memory types and control operations.

26.7.2 Declarative Memory

The key idea here is linkage of primary HPAN attractors to CogPrime ConceptNodes via MemberLinks. This is in accordance with the notion of glocal memory, in the language of which the HPAN attractors are the maps and the corresponding ConceptNodes are the keys. Put simply, when an HPAN attractor is recognized, MemberLinks are created between the HPAN nodes comprising the main body of the attractor, and a ConceptNode in the AtomTable representing the attractor. MemberLink weights may be used to denote fuzzy attractor membership. Activation may spread from HPAN nodes to ConceptNodes, and STI may spread from ConceptNodes to HPAN nodes; a conversion rate between HPAN activation and STI currency must be maintained by the CogPrime central bank (see Chapter 23), for ECAN purposes.

Both abstract and concrete knowledge may be represented in this way. For instance, the Eiffel Tower would correspond to one attractor, the general shape of the Eiffel Tower would correspond to another, and the general notion of a "tower" would correspond to yet another. As these three examples are increasingly abstract, the corresponding attractors would be weighted increasingly heavily on the upper levels of the hierarchy.
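A minimal sketch of this map/key linkage and the activation/STI currency exchange follows. The class names, and the treatment of MemberLink weights as floats in [0,1] denoting fuzzy membership, are illustrative assumptions.

    class ConceptNode:
        def __init__(self, name):
            self.name = name
            self.sti = 0.0        # short-term importance (ECAN currency)
            self.members = {}     # hpan_node -> MemberLink weight

    RATE = 0.01   # assumed conversion rate between HPAN activation and
                  # STI currency, maintained by the "central bank"

    def key_for_attractor(attractor, name):
        """Create the 'key' ConceptNode for a recognized HPAN attractor
        (the 'map'). attractor: dict of HPAN nodes -> fuzzy membership."""
        key = ConceptNode(name)
        key.members = dict(attractor)   # MemberLinks with fuzzy weights
        return key

    def spread_activation_to_key(key, hpan_activation):
        """HPAN activation flows to the key ConceptNode as STI."""
        for node, w in key.members.items():
            key.sti += RATE * w * hpan_activation.get(node, 0.0)

    def spread_sti_to_map(key, hpan_activation, amount):
        """STI spent on the key stimulates the attractor's HPAN nodes,
        tending to reactivate the whole map."""
        key.sti -= amount
        for node, w in key.members.items():
            hpan_activation[node] = (hpan_activation.get(node, 0.0)
                                     + w * amount / RATE)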
26.7.3 Sensory Memory

CogPrime may also use its primary HPAN to store memories of sense-perceptions and low-level abstractions therefrom. MemberLinks may join concepts in the AtomTable to percept-attractors in the HPAN. If the HPAN is engineered to associate specific neural modules with specific spatial regions or specific temporal intervals, then this may be accounted for by automatically indexing the ConceptNodes corresponding to attractors centered in those modules in the AtomTable's TimeServer and SpaceServer objects, which index Atoms according to time and space. An attractor representing something specific like the Eiffel Tower, or Bob's face, would be weighted largely in the lower levels of the hierarchy, and would correspond mainly to sensory rather than conceptual memory.

26.7.4 Procedural Memory

The procedural HPAN may be used to learn procedures such as low-level motion primitives that are more easily learned using HPAN training than using more abstract procedure learning methods. For example, a Combo tree learned by MOSES in CogPrime might contain a primitive corresponding to the predicate-argument relationship pick_up(ball); but the actual procedure for controlling a robot hand to pick up a ball might be expressed as an activity pattern within the low-level procedural HPAN. A procedure P stored in the low-level procedural HPAN would be represented in the AtomTable as a ConceptNode C linked to key nodes in the HPAN attractor corresponding to P. The invocation of P would be accomplished by transferring STI currency to C and then allowing ECAN to do its work.

On the other hand, CogPrime's interfacing of the high-level procedural HPAN with the CogPrime ProcedureRepository is intimately dependent on the particulars of the MOSES procedure learning algorithm. As will be outlined in more depth in Chapter 33, MOSES is a complex, multi-stage process that tries to find a program maximizing some specified fitness function, and that involves doing the following within each "deme" (a deme being an island of roughly-similar programs):

1. casting program trees into a hierarchical normal form
2. evaluating the program trees on a fitness function
3. building a model distinguishing fit versus unfit program trees, which involves:
   3a. figuring out what program tree features the model should include;
   3b. building the model using a learning algorithm
4. generating new program trees that are inferred likely to give high fitness, based on the model
5. returning to step 1 with these new program trees

There is also a system for managing the creation and deletion of demes. The weakest point in CogPrime's current MOSES-based approach to procedure learning appears to be step 3. And the main weakness is conceptual rather than algorithmic; what is needed is to replace the current step 3 with something that uses long-term memory to do model-building and feature-selection, rather than (like the current code) doing these things in a manner that's restricted to the population of program trees being evolved to optimize a particular fitness function. One promising approach to resolving this issue is to replace step 3b (and, to a limited extent, 3a) with an interconnection between MOSES and a procedural HPAN. An HPAN can do supervised categorization, and can be designed to handle feature selection in a manner integrated with categorization, and also to integrate long-term memory into its categorization decisions.
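The shape of this proposed replacement can be sketched as follows: a toy skeleton, under our own simplifying assumptions and emphatically not the actual MOSES code, in which the step-3 model builder is a pluggable component, so that an HPAN-backed classifier with access to long-term memory could stand in for the default per-deme learner.

    import random

    def normal_form(program):
        # Placeholder for MOSES's reduction to hierarchical normal form
        # (step 1).
        return program

    def default_model_builder(scored):
        """Deme-local stand-in for step 3: model the fitter half of the
        deme and sample from it. An HPAN-backed builder, consulting
        long-term memory for feature selection (3a) and model building
        (3b), would replace this function."""
        median = sorted(f for _, f in scored)[len(scored) // 2]
        fit = [p for p, f in scored if f >= median]
        return lambda: random.choice(fit)

    def deme_iteration(programs, fitness,
                       model_builder=default_model_builder, n_new=10):
        """One illustrative MOSES-style deme iteration, pluggable step 3."""
        normalized = [normal_form(p) for p in programs]        # step 1
        scored = [(p, fitness(p)) for p in normalized]         # step 2
        model = model_builder(scored)                          # step 3
        return [model() for _ in range(n_new)]                 # step 4

    # Toy usage: "programs" are numbers, fitness favors larger values.
    print(deme_iteration([1, 5, 3, 9, 2], fitness=lambda p: p))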
26.7.5 Episodic Memory

In a hybrid CogPrime/HPAN architecture, episodic knowledge may be handled via a combination of:

1. using a traditional approach to store a large ExperienceDB of actual experienced episodes (including sensory inputs and actions; and also the states of the most important items in memory during the experience)
2. using the Atomspace (with its TimeServer and SpaceServer components) to store declarative knowledge about experiences
3. using dimensional embedding to index the AtomSpace's episodic knowledge in a spatiotemporally savvy way, as described in Chapter 40
4. training a large HPAN to summarize the scope of experienced episodes (this could be the primary HPAN used for declarative and sensory memory, or could potentially be a separate episodic HPAN)

Such a network should be capable of generating imagined episodes based on cues, as well as recalling real episodes. The HPAN would serve as a sort of index into the memory of episodes. There would be HebbianLinks from the AtomTable into the episodic HPAN.

For instance, suppose that the agent once built an extremely tall tower of blocks, taller than any others in its memory. Perhaps it wants to build another very tall tower, so it wants to summon up the memory of that previous occasion, to see if there is possibly guidance therein. It then proceeds by thinking about tallness and towerness at the same time, which stimulates the relevant episode, because at the time of building the extremely tall tower, the agent was thinking a lot about tallness (so thoughts of tallness are part of the episodic memory).

26.7.6 Action Selection and Attention Allocation

CogPrime's action selection mechanism chooses procedures based on which ones are estimated most likely to achieve current goals given current context, and places these in an "active procedure pool" where an ExecutionManager object mediates their execution.

Attention allocation spans all components of CogPrime, including an HPAN if one is integrated. Attention flows between the two components due to the conversion of STI to and from HPAN activation. And in this manner assignment of credit flows from GoalNodes into the HPAN, because this kind of simultaneous activation may be viewed as "rewarding" an HPAN link. So, the HPAN may receive reward signals from GoalNodes via ECAN, because when a ConceptNode gets rewarded, if the ConceptNode points to a set of nodes, these nodes get some of the reward.

26.8 Multiple Interaction Channels

Now we discuss a broader issue regarding the interfacing between CogPrime and the external world. The only currently existing embodied OpenCog applications, PetBrain and CogBot, are based on a loosely human-like model of perception and action, in which a single CogPrime instance controls a single mobile body; but this of course is not the only way to do things. More generally, what we can say is that a variety of external-world events come into a CogPrime system from physical or virtual world sensors, plus from other sources such as database interfaces, Web spiders, and/or other sources. The external systems providing CogPrime with data may be generically referred to as sensory sources (and in the terminology we adopt here, once Atoms have been created to represent external data, then one is dealing with perceptions rather than sensations). The question arises how to architect a CogPrime system, in general, for dealing with a variety of sensory sources.
We introduce the notion of an "interaction channel": a collection of sensory sources that is intended to be considered as a whole as a synchronous stream, and that is also able to receive CogPrime actions - in the sense that when CogPrime carries out actions relative to the interaction channel, this directly affects the perceptions that CogPrime receives from the interaction channel. A CogPrime meant to have conversations with 10 separate users at once might have 10 interaction channels. A human mind has only one interaction channel in this sense (although humans may become moderately adept at processing information from multiple external-world sources, coming in through the same interaction channel).

Multiple-interaction-channel digital psychology may become extremely complex - and hard for us, with our single interaction channels, to comprehend. This is one among many cases where a digital mind, with its more flexible architecture, will have a clear advantage over our human minds with their fixed and limited neural architectures. For simplicity, however, in the following chapters we will often focus on the single-interaction-channel case.

Events coming in through an interaction channel are presented to the system as new perceptual Atoms, and relationships amongst these. In the multiple interaction channel case, the AttentionValues of these newly created Atoms require special treatment. Not only do they require special rules, they require additional fields to be added to the AttentionValue object, beyond what has been discussed so far. We require newly created perceptual Atoms to be given a high initial STI. And we also require them to be given a high amount of a quantity called "interaction-channel STI." To support this, the AttentionValue objects of Atoms must be expanded to contain interaction-channel STI values; and the ImportanceUpdating MindAgent must compute interaction-channel importance separately from ordinary importance.

And, just as we have channel-specific AttentionValues, we may also have channel-specific TruthValues. This allows the system to separately account for the frequency of a given perceptual item in a given interaction channel. However, no specific mechanism is needed for these; they are merely contextual truth values, to be interpreted within a ContextNode associated with the interaction channel.

Chapter 27
Integrating CogPrime with a Compositional Spatiotemporal Deep Learning Network

27.1 Introduction

Many different approaches to "low-level" perception and action processing are possible within the overall CogPrime framework. We discussed several in the previous chapter, all elaborations of the general hierarchical pattern recognition approach. Here we describe one sophisticated approach to hierarchical pattern recognition based perception in more detail: the tight integration of CogPrime with a sophisticated hierarchical perception/action oriented learning system such as the DeSTIN architecture reviewed in Chapter 4 of Part 1.

We introduce here the term "Compositional Spatiotemporal Deep Learning Network" (CSDLN), to refer to deep learning networks whose hierarchical structure directly mirrors the hierarchical structure of spacetime.
In the language of Chapter 26, a CSDLN is a special kind of HPAN (hierarchical perception/action network), which has the special property that each of its nodes refers to a certain spatiotemporal region and is concerned with predicting what happens inside that region. Current exemplifications of the CSDLN paradigm include the DeSTIN architecture [ARC09a] that we will focus on here, along with Jeff Hawkins' Numenta "HTM" system [HB06] 1, Itamar Arel's HDRN 2 system (the proprietary, closed-source sibling of DeSTIN), Dileep George's spin-off from Numenta 3, and work by Mohamad Tarifi [TSH11], Bundzel and Hashimoto [BH10], and others. CSDLNs are reasonably well proven as an approach to intelligent sensory data processing, and have also been hypothesized as a broader foundation for artificial general intelligence at the human level and beyond [HB06, ARC09a].

While CSDLNs have been discussed largely in the context of perception, the specific form of CSDLN we will pursue here goes beyond perception processing, and involves the coupling of three separate hierarchies, for perception, action and goals/reinforcement [G+10]. The "action" CSDLNs discussed here correspond to the procedural HPAN discussed in Chapter 26. Abstract learning and self-understanding are then hypothesized as related to systems of attractors emerging from the close dynamic coupling of the upper levels of the three hierarchies.

1 While the Numenta system is the best-known CSDLN architecture, other CSDLNs appear more impressively functional in various respects; and many CSDLN-related ideas existed in the literature well before Numenta's advent.
2 http://binatix.com
3 http://vicarioussystems.com

DeSTIN is our paradigm case of this sort of CSDLN, but most of the considerations given here would apply to any CSDLN of this general character. CSDLNs embody a certain conceptual model of the nature of intelligence, and to integrate them appropriately with a broader architecture, one must perform the integration not only on the level of software code but also on the level of conceptual models. Here we focus on the problem of integrating an extended version of the DeSTIN CSDLN system with the CogPrime integrative AGI (artificial general intelligence) system. The crux of the issue here is how to map DeSTIN's attractors into CogPrime's more abstract, probabilistic "weighted, labeled hypergraph" representation (called the Atomspace). The main conclusion reached is that in order to perform this mapping in a conceptually satisfactory way, one requires a system of hierarchies involving the structure of DeSTIN's network but the semantic structures of the Atomspace. The DeSTIN perceptual hierarchy is augmented by motor and goal hierarchies, leading to a tripartite "extended DeSTIN". In this spirit, three "semantic" hierarchies are proposed, corresponding to the three extended-DeSTIN CSDLN hierarchies and explicitly constituting an intermediate level of representation between attractors in DeSTIN and the habitual cognitive usage of CogPrime Atoms and Atom-networks. For simple reference we refer to this as the "Semantic CSDLN" approach.
A "tripartite semantic CSDLN" consisting of interlinked semantic perceptual, motoric and goal hierarchies could be coupled with DeSTIN or another CSDLN architecture to form a novel AGI approach; or (our main focus here) it may be used as a glue between an CSDLN and and a more abstract semantic network such as the cognitive Atoms in CogPrime's Atomspace. One of the core intuitions underlying this integration is that, in order to achieve the desired level of functionality for tasks like picture interpretation and assembly of complex block struc- tures, a convenient route is to perform a fairly tight integration of a highly capable CSDLN like DeSTIN with other CogPrime components. For instance, we believe it's necessary to go deeper than just using DeSTIN as an input/output layer for CogPrime, by building associative links between the nodes inside DeSTIN and those inside the Atomspace. This "tightly linked integration" approach is obviously an instantiation of the general cogni- tive synergy principle, which hypothesizes particular properties that the interactions between components in an integrated AGI system should display, in order for the overall system to dis- play significant general intelligence using limited computational resources. Simply piping output from an CSDLN to other components, and issuing control signals from these components to the CSDLN, is likely an inadequate mode of integration, incapable of leveraging the full potential of CSDLNs; what we are suggesting here is a much tighter and more synergetic integration. In terms of the general principle of mind-world correspondence, the conceptual justification for CSDLN/CogPrime integration would be that the everyday human world contains many com- positional spatiotemporal structures relevant to human goals, but also contains many relevant patterns that are not most conveniently cast into a compositional spatiotemporal hierarchy. Thus, in order to most effectively perceive, remember, represent, manipulate and enact the full variety of relevant patterns in the world, it is sensible to have a cognitive structure containing a CSDLN as a significant component, but not the only component. EFTA00624295 27.3 Semantic CSDLN for Perception Processing 149 27.2 Integrating CSDLNs with Other AI Frameworks CSDLNs represent knowledge as attractor patterns spanning multiple levels of hierarchical net- works, supported by nonlinear dynamics and (at least in the case of the overall DeSTIN design) involving cooperative activity of perceptual, motor and control networks. These attractors are learned and adapted via a combination of methods including localized pattern recognition al- gorithms and probabilistic inference. Other AGI paradigms represent and learn knowledge in a host of other ways. How then can CSDLNs be integrated with these other paradigms? A very simple form of integration, obviously, would be to use a CSDLN as a sensorimotor cortex for another AI system that's focused on more abstract cognition. In this approach, the CSDLN would stream state-vectors to the abstract cognitive system, and the abstract cognitive system would stream abstract cognitive inputs to the CSDLN (which would then consider them together with its other inputs). One thing missing in this approach is the possibility of the abstract cognitive system's insights biasing the judgments inside the CSDLN. 
Also, abstract cognition systems aren't usually well prepared to handle a stream of quantitative state vectors (even ones representing intelligent compressions of raw data). An alternate approach is to build a richer intermediate layer, which in effect translates between the internal language of the CSDLN and the internal language of the other AI system involved. The particulars, and the viability, of this will depend on the particulars of the other AI system. What we'll consider here is the case where the other AI system contains explicit symbolic representations of patterns (including patterns abstracted from observations that may have no relation to its prior knowledge or any linguistic terms). In this case, we suggest, a viable approach may be to construct a "semantic CSDLN" to serve as an intermediary. The semantic CSDLN has the same hierarchical structure as a CSDLN, but inside each node it contains abstract patterns rather than numerical vectors.

This approach has several potential major advantages: the other AI system is not presented with a large volume of numerical vectors (which it may be unprepared to deal with effectively); the CSDLN can be guided by the other AI system, without needing to understand symbolic control signals; and the intermediary semantic CSDLN can serve as a sort of "blackboard" which the CSDLN and the other AI system can update in parallel, and be guided by in parallel, thus providing a platform encouraging "cognitive synergy".

The following sections go into more detail on the concept of semantic CSDLNs. The discussion mainly concerns the specific context of DeSTIN/CogPrime integration, but the core ideas would apply to the integration of any CSDLN architecture with any other AI architecture involving uncertain symbolic representations susceptible to online learning.

27.3 Semantic CSDLN for Perception Processing

In the standard perceptual CSDLN hierarchy, a node N on level k (considering level 1 as the bottom) corresponds to a spatiotemporal region S with size s_k (s_k increasing monotonically and usually exponentially with k), and has children on level k-1 corresponding to spatiotemporal regions that collectively partition S. For example, a node on level 3 might correspond to a 16x16 pixel region S of 2D space over a time period of 10 seconds, and might have 4 level-2 children corresponding to disjoint 8x8 regions of 2D space over 10 seconds, collectively composing S.

This kind of hierarchy is very effective for recognizing certain types of visual patterns. However, it is cumbersome for recognizing some other types of patterns, e.g. the pattern that a face typically contains two eyes beside each other, but at variable distance from each other. One way to remedy this deficiency is to extend the definition of the hierarchy, so that nodes do not refer to fixed spatial or temporal positions, but only to relative positions. In this approach, the internals of a node are basically the same as in a CSDLN, and the correspondence of the nodes on level k with regions of size s_k is retained, but the relationships between the nodes are quite different. For instance, a variable-position node of this sort could contain several possible 2D pictures of an eye, but be nonspecific about where the eye is located in the 2D input image.
Figure 27.1 depicts this "semantic-perceptual CSDLN" idea heuristically 4, showing part of a semantic-perceptual CSDLN indicating the parts of a face, and also the connections between the semantic-perceptual CSDLN, a standard perceptual CSDLN, and a higher-level cognitive semantic network like CogPrime's Atomspace.

More formally, in the suggested "semantic-perceptual CSDLN" approach, a node N on level k, instead of pointing to a set of level k-1 children, points to a small (but not necessarily connected) semantic network, such that the nodes of the semantic network are (variable-position) level k-1 nodes; and the edges of the semantic network possess labels representing spatial or temporal relationships, for example horizontally_aligned, vertically_aligned, right_side, left_side, above, behind, immediately_right, immediately_left, immediately_above, immediately_below, after, immediately_after. The edges may also be weighted either with numbers or probability distributions, indicating the quantitative weight of the relationship indicated by the label. So for example, a level 3 node could have a child network of the form

horizontally_aligned(N1, N2)

where N1 and N2 are variable-position level 2 nodes. This would mean that N1 and N2 are along the same horizontal axis in the 2D input but don't need to be immediately next to each other. Or one could say, e.g.

on_axis_perpendicular_to(N1, N2, N3, N4)

meaning that N1 and N2 are on an axis perpendicular to the axis between N3 and N4. It may be that the latter sort of relationship is fundamentally better in some cases, because horizontally_aligned is still tied to a specific orientation in an absolute space, whereas on_axis_perpendicular_to is fully relative. But it may be that both sorts of relationship are useful.

Next, development of learning algorithms for semantic CSDLNs seems a tractable research area. First of all, it would seem that, for instance, the DeSTIN learning algorithms could straightforwardly be utilized in the semantic CSDLN case, once the local semantic networks involved in the network are known. So at least for some CSDLN designs, the problem of learning the semantic networks may be decoupled somewhat from the learning occurring inside the nodes. DeSTIN nodes deal with clustering of their inputs, and calculation of probabilities based on these clusters (and based on the parent node states). The difference between the semantic CSDLN and the traditional DeSTIN CSDLN has to do with what the inputs are.

Regarding learning the local semantic networks, one relatively straightforward approach would be to data mine them from a standard CSDLN.

4 The perceptual CSDLN shown is unrealistically small for complex vision processing (only 4 layers), and only a fragment of the semantic-perceptual CSDLN is shown (a node corresponding to the category face, and then a child network containing nodes corresponding to several components of a typical face). In a real semantic-perceptual CSDLN, there would be many other nodes on the same level as the face node, many other parts to the face subnetwork besides the eyes, nose and mouth depicted here; the eye, nose and mouth nodes would also have child subnetworks; there would be links from each semantic node to centroids within a large number of perceptual nodes; and there would also be many nodes not corresponding clearly to any single English language concept like eye, nose, face, etc.
Fig. 27.1: Simplified depiction of the relationship between a semantic-perceptual CSDLN, a traditional perceptual CSDLN (like DeSTIN), and a cognitive semantic network (like CogPrime's AtomSpace).

That is, if one runs a standard CSDLN on a stream of inputs, one can then run a frequent pattern mining algorithm to find semantic networks (using a given vocabulary of semantic relationships) that occur frequently in the CSDLN as it processes input. A subnetwork that is identified via this sort of mining can then be grouped together in the semantic CSDLN, and a parent node can be created and pointed to it. Also, the standard CSDLN can be searched for frequent patterns involving the clusters (referring to DeSTIN here, where the nodes contain clusters of input sequences) inside the nodes in the semantic CSDLN. Thus, in the "semantic DeSTIN" case, we have a feedback interaction wherein:

1) the standard CSDLN is formed via processing input;
2) frequent pattern mining on the standard CSDLN is used to create subnetworks and corresponding parent nodes in the semantic CSDLN;
3) the newly created nodes in the semantic CSDLN get their internal clusters updated via standard DeSTIN dynamics;
4) the clusters in the semantic nodes are used as seeds for frequent pattern mining on the standard CSDLN, returning us to Step 2 above.

After the semantic CSDLN is formed via mining the perceptual CSDLN, it may be used to bias the further processing of the perceptual CSDLN. For instance, in DeSTIN each node carries out probabilistic calculations involving knowledge of the prior probability of the "observation" coming into that node over a given interval of time. In the current DeSTIN version, this prior probability is drawn from a uniform distribution, but it would be more effective to draw the prior probability from the semantic network - observations matching things represented in the semantic network would get a higher prior probability. One could also use subtler strategies, such as using imprecise probabilities in DeSTIN [Goe11], and assigning a greater confidence to probabilities involving observations contained in the semantic network.

Finally, we note that the nodes and networks in the semantic CSDLN may either

• be linked into the nodes and links in a semantic network such as CogPrime's AtomSpace, or
• actually be implemented in terms of an abstract semantic network language like CogPrime's AtomSpace (the strategy to be suggested in Chapter 29).

This allows us to think of the semantic CSDLN as a kind of bridge between the standard CSDLN and the cognitive layer of an AI system. In an advanced implementation, the cognitive network may be used to suggest new relationships between nodes in the semantic CSDLN, based on knowledge gained via inference or language.

27.4 Semantic CSDLN for Motor and Sensorimotor Processing

Next we consider a semantic CSDLN that focuses on movement rather than sensation. In this case, rather than a 2D or 3D visual space, one is dealing with an n-dimensional configuration space (C-space). This space has one dimension for each degree of freedom of the agent in question. The more joints with more freedom of movement an agent has, the higher the dimensionality of its configuration space.
27.4 Semantic CSDLN for Motor and Sensorimotor Processing

Next we consider a semantic CSDLN that focuses on movement rather than sensation. In this case, rather than a 2D or 3D visual space, one is dealing with an n-dimensional configuration space (C-space). This space has one dimension for each degree of freedom of the agent in question. The more joints with more freedom of movement an agent has, the higher the dimensionality of its configuration space.

Using the notion of configuration space, one can construct a semantic-motoric CSDLN hierarchy analogous to the semantic-perceptual CSDLN hierarchy. However, the curse of dimensionality demands a thoughtful approach here. A square of side 2 can be tiled with 4 squares of side 1, but a 50-dimensional cube of side 2 can be tiled with 2^50 50-dimensional cubes of side 1. If one is to build a CSDLN hierarchy in configuration space analogous to that in perceptual space, some sort of sparse hierarchy is necessary.

There are many ways to build a sparse hierarchy of this nature, but one simple approach is to build a hierarchy where the nodes on level k represent motions that combine the motions represented by nodes on level k-1. In this case the most natural semantic label predicates would seem to be things like simultaneously, after, immediately_after, etc. So a level k node represents a sort of "motion plan" composed by chaining together (serially and/or in parallel) the motions encoded in level k-1 nodes; a sketch of such a node is given below. Overlapping regions of C-space correspond to different complex movements that share some of the same component movements, e.g. if one is trying to slap one person while elbowing another, or run while kicking a soccer ball forwards. Also note that the semantic CSDLN approach reveals perception and motor control to have essentially similar hierarchical structures, more so than with the traditional CSDLN approach and its fixed-position perceptual nodes.

Just as the semantic-perceptual CSDLN is naturally aligned with a traditional perceptual CSDLN, similarly a semantic-motoric CSDLN may be naturally aligned with a "motor CSDLN". A typical motoric hierarchy in robotics might contain a node corresponding to a robot arm, with children corresponding to the hand, upper arm and lower arm; the hand node might then contain child nodes corresponding to each finger, etc. This sort of hierarchy is intrinsically spatiotemporal, because each individual action of each joint of an actuator like an arm is intrinsically bounded in space and time. Perhaps the most ambitious attempt along these lines is [AM01], which shows how perceptual and motoric hierarchies are constructed and aligned in an architecture for intelligent automated vehicle control.

Figure 27.2 gives a simplified illustration of the potential alignment between a semantic-motoric CSDLN and a purely motoric hierarchy (like the one posited above in the context of extended DeSTIN). In the figure, the motoric hierarchy is assumed to operate somewhat like DeSTIN, with nodes corresponding to (at the lowest level) individual servomotors, and (on higher levels) natural groupings of servomotors. The node corresponding to a set of servos is assumed to contain centroids of clusters of trajectories through configuration space. The task of choosing an appropriate action is then executed by finding the appropriate centroids for the nodes.

Note an asymmetry between perception and action here. In perception the basic flow is bottom-up, with top-down flow used for modulation and for "imaginative" generation of percepts. In action, the basic flow is top-down, with bottom-up flow used for modulation and for imaginative, "fiddling around" style generation of actions.
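To illustrate what a level-k "motion plan" node might look like, here is a minimal sketch reusing the SemanticCSDLNNode and SemanticEdge classes from the sketch in Section 27.3; the particular sub-motions and labels are invented for illustration:

    # Indices of hypothetical level 1 motion nodes
    EXTEND_ARM, OPEN_HAND, CLOSE_HAND, RETRACT_ARM = 0, 1, 2, 3

    # A level 2 "motion plan": extend the arm while opening the hand,
    # then close the hand, then (at some later point) retract the arm.
    get_object = SemanticCSDLNNode(
        level=2,
        children=[EXTEND_ARM, OPEN_HAND, CLOSE_HAND, RETRACT_ARM],
        edges=[
            SemanticEdge("simultaneously", EXTEND_ARM, OPEN_HAND),
            SemanticEdge("immediately_after", OPEN_HAND, CLOSE_HAND),
            SemanticEdge("after", CLOSE_HAND, RETRACT_ARM),
        ],
    )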
The semantic-motoric hierarchy then contains abstractions of the C-space centroids from the motoric hierarchy - i.e., actions that bind together different C-space trajectories that correspond to the same fundamental action carried out in different contexts or under different constraints. Similarly to the perceptual case, the semantic hierarchy here serves as a glue between lower-level function and higher-level cognitive semantics.

Fig. 27.2: Simplified depiction of the relationship between a semantic-motoric CSDLN, a motor control hierarchy (illustrated by the hierarchy of servos associated with a robot arm), and a cognitive semantic network (like CogPrime's AtomSpace). In the figure, only a fragment of the semantic-motoric CSDLN is shown (a node corresponding to the "get object" action category, and then a child network containing nodes corresponding to several components of the action). In a real semantic-motoric CSDLN, there would be many other nodes on the same level as the get-object node, many other parts to the get-object subnetwork besides the ones depicted here; the subnetwork nodes would also have child subnetworks; there would be links from each semantic node to centroids within a large number of motoric nodes; and there might also be many nodes not corresponding clearly to any single English language concept like "grasp object", etc.

27.5 Connecting the Perceptual and Motoric Hierarchies with a Goal Hierarchy

One way to connect perceptual and motoric CSDLN hierarchies is using a "semantic-goal CSDLN" bridging the semantic-perceptual and semantic-motoric CSDLNs. The semantic-goal CSDLN would be a "semantic CSDLN" loosely analogous to the perceptual and motor semantic CSDLNs - and could optionally be linked into the reinforcement hierarchy of a tripartite CSDLN like extended DeSTIN. Each node in the semantic-goal CSDLN would contain implications of the form "Context & Procedure -> Goal", where Goal is one of the AI system's overall goals or a subgoal thereof, and Context and Procedure refer to nodes in the perceptual and motoric semantic CSDLNs respectively. For instance, a semantic-goal CSDLN node might contain an implication of the form "I perceive my hand is near object X & I grasp object X -> I possess object X." This would be useful if "I possess object X" were a subgoal of some higher-level system goal, e.g. if X were a food object and the system had the higher-level goal of obtaining food.
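A goal-node implication of this kind is easy to render as a data structure; the following sketch is illustrative only (the field names, and the idea of attaching a single numeric strength, are our own simplifications rather than a specification of the semantic-goal CSDLN):

    import dataclasses

    @dataclasses.dataclass
    class GoalImplication:
        context: str      # names a node in the semantic-perceptual hierarchy
        procedure: str    # names a node in the semantic-motoric hierarchy
        goal: str         # a system goal or subgoal
        strength: float   # estimated P(goal achieved | context holds & procedure run)

    # The worked example from the text (the 0.9 is an invented number):
    imp = GoalImplication(
        context="hand is near object X",
        procedure="grasp object X",
        goal="possess object X",
        strength=0.9,
    )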
To the extent that the system's goals can be decomposed into hierarchies of progressively more and more spatiotemporally localized subgoals, this sort of hierarchy will make sense, leading to a tripartite hierarchy as loosely depicted in Figure 27.3. One could attempt to construct an overall AGI approach based on a tripartite hierarchy of this nature, counting on the upper levels of the three hierarchies to come together dynamically to form an integrated cognitive network, yielding abstract phenomena like language, self, reasoning and mathematics. On the other hand, one may view this sort of hierarchy as a portion of a larger integrative AGI architecture, containing a separate cognitive network with a less rigidly hierarchical structure and less of a tie to the spatiotemporal structure of physical reality. The latter view is the one we are primarily taking within the CogPrime AGI approach: we view the perceptual, motoric and goal hierarchies as "lower level" subsystems connected to a "higher level" system based on the CogPrime AtomSpace and centered on its abstract cognitive processes.

Learning of the subgoals and implications in the goal hierarchy is of course a complex matter, which may be addressed via a variety of algorithms, including online clustering (for subgoals or implications) or supervised learning (for implications, the "supervision" being purely internal and provided by goal or subgoal achievement).

Fig. 27.3: Simplified illustration of the proposed interoperation of perceptual, motoric and goal semantic CSDLNs. The diagram is simplified in many ways: e.g. only a handful of nodes in each hierarchy is shown (rather than the whole hierarchy), lines without arrows are used to indicate bidirectional arrows, and nearly all links are omitted. The purpose is just to show the general character of interaction between the components in a simplified context.

Chapter 28
Making DeSTIN Representationally Transparent
Co-authored with Itamar Arel

28.1 Introduction

In this chapter and the next we describe one particular incarnation of the above ideas on semantic CSDLNs in more depth: the integration of CogPrime with the DeSTIN architecture reviewed in Chapter 4 of Part 1.

One of the core intuitions underlying this integration is that, in order to achieve the desired level of functionality for tasks like picture interpretation and assembly of complex block structures, it will be necessary to integrate DeSTIN (or some similar system) and CogPrime components fairly tightly - going deeper than just using DeSTIN as an input/output layer for CogPrime, by building a number of explicit linkages between the nodes inside DeSTIN and CogPrime respectively.

The general DeSTIN design has been described in talks as comprising three crosslinked hierarchies, handling perception, action and reinforcement; but so far only the perceptual hierarchy (also called the "spatiotemporal inference network") has been implemented or described in detail in publications. In this chapter we will focus on DeSTIN's perception hierarchy. We will explain DeSTIN's perceptual dynamics and representations as we understand them, more thoroughly than was done in the brief review above; and we will describe a series of changes to the DeSTIN design, made in the spirit of easing DeSTIN/OpenCog integration. In the following chapter we will draw action and reinforcement into the picture, deviating somewhat in the details from the manner in which these things would be incorporated into a standalone DeSTIN, but pursuing the same concepts in an OpenCog integration context.

What we describe here is a way to make a "Uniform DeSTIN", in which the internal representation of perceived visual forms is independent of affine transformations (translation, scaling, rotation and shear).
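For reference, the affine transformations in question can be written out explicitly; this is standard material, not specific to DeSTIN. In two dimensions, an affine map sends a point x to

    x \mapsto A x + b

where b is a translation vector and the invertible matrix A can (up to reflection) be factored into rotation, scaling and shear components:

    A = R(\theta)\, S(s_1, s_2)\, H(h), \qquad
    R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \quad
    S(s_1, s_2) = \begin{pmatrix} s_1 & 0 \\ 0 & s_2 \end{pmatrix}, \quad
    H(h) = \begin{pmatrix} 1 & h \\ 0 & 1 \end{pmatrix}

Uniform DeSTIN aims to make internal representations approximately invariant under maps of this form.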
This "representational transparency" means that, when Uniform DeSTIN perceives a pattern, no matter how that pattern is shifted or linearly transformed, the way Uniform DeSTIN represents that pattern internally is going to be basically the same. This makes it easy to look at a collection of DeSTIN states, obtained by exposing a DeSTIN perception network to the world at different points in time, and see the commonalities in what they are perceiving and how they are interpreting it. By contrast, in the original version of DeSTIN (here called "classic DeSTIN"), it may take significant effort to connect the internal representation of a visual pattern and the representation of its translated or linearly transformed versions. The uniformity of Uniform DeSTIN makes it easier for humans to inspect DeSTIN's state and understand what's going on, and also (more to the point) makes it easier for other AI components to recognize patterns in sets of DeSTIN states. The latter fact is critical for the DeSTIN/OpenCog integration.

28.2 Review of DeSTIN Architecture and Dynamics

The hierarchical architecture of DeSTIN's spatiotemporal inference network comprises multiple layers of "nodes", each layer comprising multiple instantiations of an identical processing unit. Each node corresponds to a particular spatiotemporal region, and uses a statistical learning algorithm to characterize the sequences of patterns that are presented to it by nodes in the layer beneath it. More specifically, at the very lowest layer of the hierarchy nodes receive as input raw data (e.g. pixels of an image) and continuously construct a belief state that attempts to characterize the sequences of patterns viewed. The second layer, and all those above it, receive as input the belief states of nodes at their corresponding lower layers, and attempt to construct belief states that capture regularities in their inputs. Each node also receives as input the belief state of the node above it in the hierarchy (which constitutes "contextual" information, utilized in the node's prediction process).

Inside each node, an online clustering algorithm is used to identify regularities in the sequences received by that node. The centroids of the clusters learned are stored in the node and comprise the basic visual patterns recognized by that node. The node's "belief" regarding what it is seeing is then understood as a probability density function defined over the centroids at that node. The equations underlying this centroid formation and belief updating process are identical for every node in the architecture, and were given in their original form in [ARC09a], though the current open-source DeSTIN codebase reflects some significant improvements not yet reflected in the publication record.

In short, the way DeSTIN represents an item of knowledge is as a probability distribution over "network activity patterns" in its hierarchical network. An activity pattern, at each point in time, comprises an indication of which centroids in each node are most active, meaning they have been identified as most closely resembling what that node has perceived, as judged in the context of the perceptions of the other nodes in the system. Based on this methodology, the DeSTIN perceptual network serves the critical role of building and maintaining a model of the state of the world as visually perceived.
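The actual centroid-formation and belief-update equations are given in [ARC09a]; the toy Python sketch below is not that math, but conveys its general shape - a distance-based belief over centroids plus an online clustering step (the parent-belief modulation described above is omitted for brevity):

    import numpy as np

    def node_update(x, centroids, lr=0.05, temperature=1.0):
        """One illustrative DeSTIN-style node update (not the real equations).

        x         -- the node's current input vector (child beliefs, or raw pixels)
        centroids -- array of shape (n_centroids, dim), updated in place
        """
        d = np.linalg.norm(centroids - x, axis=1)          # distance to each centroid
        belief = np.exp(-d / temperature)                  # nearer centroids get more belief
        belief /= belief.sum()                             # normalize to a pdf over centroids
        winner = int(np.argmin(d))                         # online clustering: nudge the
        centroids[winner] += lr * (x - centroids[winner])  # winning centroid toward the input
        return belief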
This methodology allows for powerful unsupervised classification. If shown a variety of real-world scenes, DeSTIN will automatically form internal structures corresponding to the various natural categories of objects shown in the scenes, such as trees, chairs, people, etc., and also to the various natural categories of events it sees, such as reaching, pointing, falling. In order to demonstrate the informativeness of these internal structures, experiments have been done using DeSTIN's states as input feature vectors for supervised learning algorithms, enabling high-accuracy supervised learning of classification models from labeled image data [KAR10]. A closely related algorithm developed by the same principal researcher (Itamar Arel) has proven extremely successful at audition tasks such as phoneme recognition [ABS+11].

28.2.1 Beyond Gray-Scale Vision

The DeSTIN approach may easily be extended to other senses beyond gray-scale vision. For color vision, it suffices to replace the one-dimensional signals coming into DeSTIN's lower layer with 3D signals representing points in the color spectrum; the rest of the DeSTIN process may be carried over essentially without modification (see the sketch at the end of this subsection). Extension to further senses is also relatively straightforward on the mathematical and software structure level, though it may of course require significant additional tuning and refinement of details. For instance, olfaction does not lend itself well to hierarchical modeling, but audition and haptics (touch) do:

• for auditory perception, one could use a DeSTIN architecture in which each layer is one-dimensional rather than two-dimensional, with position representing pitch. Or one could use two dimensions, for pitch and volume. This results in a system quite similar to the DeSTIN-like system shown to perform outstanding phoneme recognition in [ABS+11], and is conceptually similar to Hierarchical Hidden Markov Models (HHMMs), which have proven quite successful in speech recognition and which Ray Kurzweil has argued are the central mechanism of human intelligence [Kur12]. Note also recent results published by Microsoft Research, showing dramatic improvements over prior speech recognition results based on use of a broadly HHMM-like deep learning system.

• for haptic perception, one could use a DeSTIN architecture in which the lower layer of the network possesses a 2D topology reflecting the topology of the surface of the body. Similar to the somatosensory cortex in the human brain, the map could be distorted so that more "pixels" are used for regions of the body from which more data is available (e.g. currently this might be the fingertips, if these were implemented using Syntouch technology [Fis11], which has proved excellent at touch-based object identification). Input could potentially be multidimensional if multiple kinds of haptic sensors were available, e.g. temperature, pressure and movement as in the Syntouch case.

Augmentation of DeSTIN to handle action as well as perception is also possible, and will be discussed in Chapter 29.
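To make the color-vision remark concrete at the software level, the change is only to the shape of the lowest-layer input; everything in this numpy fragment is illustrative (the 4 x 4 patch size anticipates the example used later in this chapter):

    import numpy as np

    gray_patch = np.random.rand(4, 4)      # classic DeSTIN input: a 4x4 gray-scale patch
    color_patch = np.random.rand(4, 4, 3)  # color input: each "pixel" is a 3D color point

    # In both cases a node just sees a flat feature vector; the clustering,
    # belief-update and hierarchy mechanics are carried over unchanged.
    gray_input = gray_patch.ravel()    # dimension 16
    color_input = color_patch.ravel()  # dimension 48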
28.3 Uniform DeSTIN

It would be possible to integrate DeSTIN in its original form with OpenCog or other AI systems with symbolic aspects, by using an unsupervised machine learning algorithm to recognize patterns in sets of states of the DeSTIN network as originally defined. However, this pattern recognition task becomes much easier if one suitably modifies DeSTIN, so as to make the commonalities between semantically similar states more obviously perceptible. This can be done by making the library of patterns recognized within each DeSTIN node invariant with respect to translation, scale, rotation and shear - a modification we call "Uniform DeSTIN." This "uniformization" decreases DeSTIN's degree of biological mimicry, but eases integration of DeSTIN with symbolic AI methods.

28.3.1 Translation-Invariant DeSTIN

The first revision to "classic DeSTIN" to be suggested here is: all the nodes on the same level of the DeSTIN hierarchy should share the same library of patterns. In the context of classic DeSTIN (i.e. in the absence of further changes to DeSTIN to be suggested below, which extend the type of patterns usable by DeSTIN), this means: the nodes on the same level should share the same list of centroids. This makes DeSTIN's pattern recognition capability translation-invariant. This translation invariance can be achieved without any change to the algorithms for updating centroids and matching inputs to centroids.

In this approach, it's computationally feasible to have a much larger library of patterns utilized by each node, as compared to classic DeSTIN. Suppose we have an n x n pixel grid, where the lowest level has nodes corresponding to 4 x 4 squares and each node above the lowest level groups a 2 x 2 square of children. Then there are (n/4)^2 nodes on the lowest level, and on the k'th level there are (n/2^(k+1))^2 nodes. This means that, without increasing computational complexity (actually decreasing it, under reasonable assumptions), in translation-invariant Uniform DeSTIN we can have a factor of (n/2^(k+1))^2 - i.e., the number of nodes on level k - more centroids on level k.

One can achieve a much greater decrease in computational complexity (with the same amount of centroid increase) via use of a clever data structure like a cover tree [BKL06] to store the centroids at each level. Then the nearest-neighbor matching of input patterns to the library (centroid) patterns would be very rapid, much faster than linearly comparing the input to each pattern in the list.
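To indicate the flavor of this speed-up, the fragment below uses scipy's k-d tree as a stand-in for a cover tree (a real cover tree behaves better when the "intrinsic dimensionality" is low relative to the raw dimension, but the nearest-neighbor interface is the same):

    import numpy as np
    from scipy.spatial import cKDTree

    # One shared centroid library for an entire level (translation invariance)
    level_centroids = np.random.rand(10000, 16)  # e.g. 10,000 centroids of dimension 16
    index = cKDTree(level_centroids)

    patch = np.random.rand(16)              # an input for any node on this level
    dist, centroid_id = index.query(patch)  # fast nearest-neighbor lookup, instead of
                                            # a linear scan over all 10,000 centroids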
28.3.1.1 Conceptual Justification for Uniform DeSTIN

Generally speaking, one may say that if the class of images that the system will see is invariant with respect to linear translations, then without loss of generality, we can assume that the library of patterns at each node on the same level is the same.

In reality this assumption isn't quite going to hold. For instance, for an eye attached to a person or humanoid robot, the top of the pixel grid will probably look at a person's hair more often than the bottom, because the person stands right-side-up more often than they stand upside-down, and because they will often fixate the center of their view on a person's face, etc. For this reason, we can recognize our friend's face better if we're looking at them directly, with their face centered in our vision. However, we suggest that this kind of peculiarity is not really essential to vision processing for general intelligence. There's no reason you can't have an intelligent vision system that recognizes a face just as well whether it's centered in the visual field or not. (In fact you could straightforwardly explicitly introduce this kind of bias within a translation-invariant DeSTIN, but it's not clear this is a useful direction.) By and large, it seems to us that in a DeSTIN system exposed to a wide variety of real-world inputs in complex situations, the library of patterns in the different nodes at the same level would turn out to be substantially the same. Even if they weren't exactly the same, they would be close to the same, embodying essentially the same regularities. But of course, this sameness would be obscured, because centroid 7 in a certain node X on level 4 might actually be the same as centroid 18 in some other node Y on level 4 - and there would be no way to tell that centroid 7 in node X and centroid 18 in node Y were actually referring to the same pattern, without doing a lot of work.

28.3.1.2 Comments on Biological Realism

Translation-invariant DeSTIN deviates further from human brain structure than classic DeSTIN, but this is for good reason. The brain has a lot of neurons, since adding new neurons was fairly easy and cheap for evolution; and it tends to do things in a massively parallel manner, with great redundancy. For the brain, it's not so problematically expensive to have the functional equivalent of a lot of DeSTIN nodes on the same level, all simultaneously using and learning libraries of patterns that are essentially identical to each other. Using current computer technology, on the other hand, this sort of strategy is rather inefficient. In the brain, messaging between separated regions is expensive, whereas replicating function redundantly is cheap. In most current computers (with some partial exceptions such as GPUs), messaging between separated regions is fairly cheap (so long as those regions are stored on the same machine), whereas replicating function redundantly is expensive. Thus, even in cases where the same concept and abstract mathematical algorithm can be effectively applied in both the brain and a computer, the specifics needed for efficient implementation may be quite different.

28.3.2 Mapping States of Translation-Invariant DeSTIN into the Atomspace

Mapping classic DeSTIN's states into a symbolic pattern-manipulation engine like OpenCog is possible, but relatively cumbersome. Doing the same thing with Uniform DeSTIN is much more straightforward. In Uniform DeSTIN, for example, Cluster 7 means the same thing in ANY node on level 4. So after a Uniform DeSTIN system has seen a fair number of images, you can be pretty sure its library of patterns is going to be relatively stable. Some clusters may come and go as learning progresses, but there's going to be a large and solid library of clusters at each level that persists, because all of its member clusters occur reasonably often across a variety of inputs.

Define a DeSTIN state-tree as a (quaternary) tree with one node for each DeSTIN node, and, living at each node, a small list of (integer pattern_code, float weight) pairs. That is, at each node, the state-tree has a short list of the patterns that closely match a given state at that node. The weights may be assumed to lie between 0 and 1. The integer pattern codes have the same meaning for every node on the same level. As you feed DeSTIN inputs, at each point in time it will have a certain state, representable as a state-tree. So, suppose you have a large database of DeSTIN state-trees, obtained by showing various inputs to DeSTIN over a long period of time. Then, you can do various kinds of pattern recognition on this database of state-trees.

More formally, define a state-subtree as a (quaternary) tree with a single integer at each node. Two state-subtrees may have various relationships with each other within a single state-tree - for instance they may be adjacent to each other, or one may appear atop or below the other, etc.
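A state-tree of this kind is a very simple data structure; here is one possible Python rendering (the names are ours):

    import dataclasses
    from typing import List, Tuple

    @dataclasses.dataclass
    class StateTreeNode:
        # (pattern_code, weight) pairs; in Uniform DeSTIN a pattern code has
        # the same meaning in every node on the same level
        patterns: List[Tuple[int, float]]
        children: List["StateTreeNode"]  # 4 children (quaternary), or none at the bottom

    def top_pattern(node: StateTreeNode) -> int:
        """The single best-matching pattern code at a node - the integer that a
        state-subtree would record there."""
        return max(node.patterns, key=lambda pw: pw[1])[0]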
In these terms, one interesting kind of pattern recognition to do is: recognize frequent state-subtrees in the stored library of state-trees; and then recognize frequent relationships between these frequent state-subtrees. The latter relationships will form a kind of "image grammar," conceptually similar and formally related to those described in [ZM06]. Further, temporal patterns may be recognized in the same way as spatial ones, as part of the state-subtree grammar (e.g. state-subtree A often occurs right before state-subtree B; state-subtree C often occurs right before and right below state-subtree D; etc.).

The flow of activation from OpenCog back down to DeSTIN is also fairly straightforward in the context of translation-invariant DeSTIN. If relationships have been stored between concepts in OpenCogPrime's memory and grammatical patterns between state-subtrees, then whenever concept C becomes important in OpenCogPrime's memory, this can cause a top-down increase in the probability of matching inputs to those DeSTIN node centroids whose activation would cause the DeSTIN state-tree to contain the grammatical patterns corresponding to concept C.

28.3.3 Scale-Invariant DeSTIN

The next step, moving beyond translation invariance, is to make DeSTIN's pattern recognition mostly (though not wholly) scale invariant. We will describe a straightforward way to map centroids on one level of DeSTIN into centroids on the other levels of DeSTIN. This means that when a centroid has been learned on one level, it can be experimentally ported to all the other levels, to see if it may be useful there too. To make the explanation of this mapping clear, we reiterate some DeSTIN basics in slightly different language:

• A centroid on Level N is: a spatial arrangement (e.g. k x k square lattice) of beliefs on Level N-1. (More generally it is a spatiotemporal arrangement of such beliefs, but we will ignore this for the moment.)
• A belief on Level N is: a probability distribution over centroids on Level N. For heuristic purposes one can think about this as a mixture of Gaussians, though this won't always be the best model.
• Thus, a belief on Level N is: a probability distribution over spatial (or more generally, spatiotemporal) arrangements of beliefs on Level N-1.

On Level 1, the role of centroids is played by simple k x k squares of pixels. Level 1 beliefs are probability distributions over these small pixel squares. Level 2 centroids are hence spatial arrangements of probability distributions over small pixel-squares; and Level 2 beliefs are probability distributions over spatial arrangements of probability distributions over small pixel-squares.

A small pixel-square S may be mapped into a single pixel P via a heuristic algorithm such as the following (rendered in code below):

• if S has more black than white pixels, then P is black
• if S has more white than black pixels, then P is white
• if S has an equal number of white and black pixels, then use some heuristic. For instance, if S is 4 x 4 you could look at the central 2 x 2 square and assign P to the color that occurs most often there. If that is also a tie, then you can just arbitrarily assign P to the color that occurs in the upper left corner of S.

A probability distribution over small pixel-squares may then be mapped into a probability distribution over pixel values (B or W).
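A minimal Python rendering of this downsampling heuristic, for binary images only (the function name is ours):

    import numpy as np

    def square_to_pixel(S):
        """Map a k x k binary pixel-square to a single pixel (1 = black, 0 = white)."""
        black = int(S.sum())
        white = S.size - black
        if black != white:
            return int(black > white)
        # tie: vote over the central 2x2 square (assumes k is even, e.g. k = 4)
        k = S.shape[0]
        center = S[k // 2 - 1:k // 2 + 1, k // 2 - 1:k // 2 + 1]
        cb = int(center.sum())
        if 2 * cb != center.size:
            return int(2 * cb > center.size)
        return int(S[0, 0])  # still tied: fall back to the upper-left pixel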
A probability distribution over the two values B and W may be approximately mapped into a single pixel value - the one that occurs most often in the distribution, with a random choice made to break a tie. This tells us how to map Level 2 beliefs into spatial arrangements of pixels; and thus, it tells us how to map Level 2 beliefs into Level 1 beliefs.

But this tells us how to map Level N beliefs into Level N-1 beliefs, inductively. Remember, a Level N belief is a probability distribution (pdf for short) over spatial arrangements of beliefs on Level N-1. For example: a Level 3 belief is a pdf over arrangements of Level 2 beliefs. But since we can map Level 2 beliefs into Level 1 beliefs, this means we can map a Level 3 belief into a pdf over arrangements of Level 1 beliefs - which means we can map a Level 3 belief into a Level 2 belief. Etc.

Of course, this also tells us how to map Level N centroids into Level N-1 centroids. A Level N centroid is a pdf over arrangements of Level N-1 beliefs; a Level N-1 centroid is a pdf over arrangements of Level N-2 beliefs. But Level N-1 beliefs can be mapped into Level N-2 beliefs, so Level N centroids can be represented as pdfs over arrangements of Level N-2 beliefs, and hence mapped into Level N-1 centroids.

In practice, one can implement this idea by moving from the bottom up. Given the mapping from Level 1 "centroids" to pixels, one can iterate through the Level 1 beliefs and identify which pixels they correspond to. Then one can iterate through the Level 2 beliefs and identify which Level 1 beliefs they correspond to. Etc. Each Level N belief can be explicitly linked to a corresponding Level N-1 belief. Synchronously, as one moves up the hierarchy, Level N centroids can be explicitly linked to corresponding Level N-1 centroids.

Since there are in principle more possible Level N beliefs than Level N-1 beliefs, the mapping from Level N beliefs to Level N-1 beliefs is many-to-one. This is a reason not to simply maintain a single centroid pool across levels. However, when a new centroid C is added to the Level N pool, it can be mapped into a Level N-1 centroid to be added to the Level N-1 pool (if not there already). And, it can also be used to spawn a Level N+1 centroid, drawn randomly from the set of possible Level N+1 centroids that map into C. Also, note that it is possible to maintain a single centroid numbering system across levels, so that a reference like "centroid #175" has only one meaning in an entire DeSTIN network, even though some of these centroids may only be meaningful above a certain level in the network.
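The bottom-up linking procedure just described can be illustrated for the base step (Level 2 down to Level 1); higher levels follow by the same recursion. For simplicity this sketch collapses each belief to its single most probable value, whereas the text's construction keeps whole distributions; square_to_pixel is the function from the previous sketch:

    import numpy as np

    def collapse(belief):
        """Collapse a belief - a list of (value, probability) pairs - to its mode."""
        return max(belief, key=lambda vp: vp[1])[0]

    def map_level2_belief_down(belief2):
        """Map a Level 2 belief to a Level 1 belief (illustrative).

        belief2: list of (arrangement, prob) pairs, where an arrangement is a
        grid (list of rows) of Level 1 beliefs, each itself a list of
        (pixel_square, prob) pairs."""
        out = []
        for arrangement, p in belief2:
            # replace each Level 1 belief by a single pixel, yielding a pixel-square
            pixels = np.array([[square_to_pixel(collapse(b)) for b in row]
                               for row in arrangement])
            out.append((pixels, p))
        return out  # a pdf over pixel-squares, i.e. a Level 1 belief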
28.3.4 Rotation-Invariant DeSTIN

With a little more work, one can make DeSTIN rotation and shear invariant as well. (The basic idea in this section, in the context of rotation, is due to Jade O'Neill (private communication).) Considering rotation first:

• When comparing an input A at a Level N node with a Level N centroid B, consider various rotations of A, and see which rotation gives the closest match.
• When you match a centroid to an input observation-or-belief, record the rotation angle corresponding to the match.

The second of these points implies the tweaked definitions:

• A centroid on Level N is: a spatial arrangement (e.g. k x k square lattice) of beliefs on Level N-1.
• A belief on Level N is: a probability distribution over (angle, centroid) pairs on Level N.

From these it follows that a belief on Level N is: a probability distribution over (angle, spatial arrangement of beliefs) pairs on Level N-1.

An additional complexity here is that two different (angle, centroid) pairs (on the same level) could be (exactly or approximately) equal to each other. This necessitates an additional step of "centroid simplification", in which ongoing checks are made to see whether there are any two centroids C1, C2 on the same level such that there exist angles A1, A2 for which (A1, C1) is very close to (A2, C2). In this case the two centroids may be merged into one.

To apply these same ideas to shear, one may simply replace "rotation angle" in the above by "(rotation angle, shear factor) pair."
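A toy version of rotation-tolerant matching, using only 90-degree rotations via np.rot90 (a real implementation would sample finer angles, e.g. with scipy.ndimage.rotate, and handle the resulting interpolation artifacts):

    import numpy as np

    def best_rotation_match(A, centroids):
        """Match patch A against a centroid library under rotation (illustrative).

        A         -- a square 2D array
        centroids -- array of shape (n_centroids, A.size)
        Returns (centroid_id, quarter_turns, distance) for the closest match."""
        best = (None, None, np.inf)
        for q in range(4):  # 0, 90, 180, 270 degrees
            Ra = np.rot90(A, k=q).ravel()
            d = np.linalg.norm(centroids - Ra, axis=1)
            i = int(np.argmin(d))
            if d[i] < best[2]:
                best = (i, q, float(d[i]))
        return best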
28.3.5 Temporal Perception

Translation and scale invariant DeSTIN can be applied perfectly well if the inputs to DeSTIN, at Level 1, are movies rather than static images. Then, in the simplest version, Level 1 consists of pixel cubes instead of pixel squares, etc. (the third dimension in the cube representing time). The scale invariance achieved by the methods described above would then be scale invariance in time as well as in space.

In this context, one may enable rectangular shapes as well as cubes. That is, one can look at a Level N centroid consisting of m time-slices of a k x k arrangement of Level N-1 beliefs - without requiring that m = k. This would make the centroid learning algorithm a little more complex, because at each level one would want to consider centroids with various values of m, from m = 1, ..., k (and potentially m > k also).

28.4 Interpretation of DeSTIN's Activity

Uniform DeSTIN constitutes a substantial change in how DeSTIN does its business of recognizing patterns in the world - conceptually as well as technically. To explicate the meaning of these changes, we briefly present our favored interpretation of DeSTIN's dynamics.

The centroids in the DeSTIN library represent points in "spatial pattern space", i.e. they represent exemplary spatial patterns. DeSTIN's beliefs, as probability distributions over centroids, represent guesses as to which of the exemplary spatial patterns are the best models of what's currently being seen in a certain space-time region. This matching between observations and centroids might seem to be a simple matter of "nearest neighbor matching"; but the subtle point is, it's not immediately obvious how to best measure the distance between observations and centroids. The optimal way of measuring distance is going to depend on context; that is to say, on the actual distribution of observations in the system's real environment over time.

DeSTIN's algorithm for calculating the belief at a node, based on the observation and centroids at that node plus the beliefs at other nearby nodes, is essentially a way of tweaking the distance measurement between observations and centroids, so that this measurement accounts for the context (the historical distribution of observations). There are many possible ways of doing this tweaking. Ideally one could use probability theory explicitly, but that's not always going to be computationally feasible, so heuristics may be valuable, and various versions of DeSTIN have contained various heuristics in this regard.

The various ways of "uniformizing" DeSTIN described above (i.e. making its pattern recognition activity approximately invariant with respect to affine transformations) don't really affect this story - they just improve the algorithm's ability to learn based on small amounts of data (and its rapidity at learning from data in general), by removing the need for the system to repeatedly re-learn transformed versions of the same patterns. So the uniformization just lets DeSTIN carry out its basic activity faster and using less data.

28.4.1 DeSTIN's Assumption of Hierarchical Decomposability

Roughly speaking, DeSTIN will work well to the extent that the average distance between each part of an actually observed spatial pattern, and the closest centroid pattern, is not too large (note: the choice of distance measure in this statement is potentially subtle). That is: DeSTIN's set of centroids is supposed to provide a compact model of the probability distribution of spatial patterns appearing in the experience of the cognitive system of which DeSTIN is a part.

DeSTIN's effective functionality relies on the assumption that this probability distribution is hierarchically decomposable - i.e. that the distribution of spatial patterns appearing over a k x k region can be compactly expressed, to a reasonable degree of approximation, as a spatial combination of the distributions of spatial patterns appearing over (k/4) x (k/4) regions. This assumption of hierarchical decomposability greatly simplifies the search problem that DeSTIN faces, but also restricts DeSTIN's capability to deal with more general spatial patterns that are not easily hierarchically decomposable. However, the benefits of this approach seem to outweigh the costs, given that visual patterns in the environments humans naturally encounter do seem (intuitively at least) to have this hierarchical property.

28.4.2 Distance and Utility

Above we noted that the choice of distance measure involved in the assessment of DeSTIN's effective functionality is subtle. Further above, we observed that the function of DeSTIN's belief assessment is basically to figure out the contextually best way to measure the distance between the observation and the centroids at a node. These comments were both getting at the same point. But what is the right measure of distance between two spatial patterns?

Ultimately, the right measure is: the probability that the two patterns A and B can be used in the same way. That is: the system wants to identify observation A with centroid B if it has useful action-patterns involving B, and it can substitute A for B in these patterns without loss. This is difficult to calculate in general, though - a rough proxy, which it seems will often be acceptable, is to measure the distance between A and B in terms of both

• the basic (extensional) distance between the physical patterns they embody (e.g. pixel by pixel distance)
• the contextual (intensional) distance, i.e. the difference between the contexts in which they occur.

Via enabling the belief in a node's parent to play a role in modulating a certain node's belief, DeSTIN's core algorithm enables contextual/intensional factors to play a role in distance assessment.
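The two-component proxy distance can be written down directly; the blending weight alpha below is our own invention and would need tuning (or learning) in practice:

    import numpy as np

    def pattern_distance(A, B, ctx_A, ctx_B, alpha=0.5):
        """Proxy distance between two spatial patterns (illustrative).

        A, B         -- the patterns themselves, as flat arrays (e.g. pixels)
        ctx_A, ctx_B -- vectors summarizing the contexts in which each pattern
                        has historically occurred
        """
        extensional = np.linalg.norm(A - B)          # pixel-by-pixel difference
        intensional = np.linalg.norm(ctx_A - ctx_B)  # difference between typical contexts
        return alpha * extensional + (1 - alpha) * intensional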
28.5 Benefits and Costs of Uniform DeSTIN

We now summarize the main benefits and costs of Uniform DeSTIN a little more systematically. The key point we have made here regarding Uniform DeSTIN and representational transparency may be summarized as follows:

• Define an "affine perceptual equivalence class" as a set of percepts that are equivalent to each other, or nearly so, under affine transformation. An example would be views of the same object from different perspectives or distances.
• Suppose one has an embodied agent using DeSTIN for visual perception, whose perceptual stream tends to include a lot of reasonably large affine perceptual equivalence classes.
• Then, supposing the "mechanics" of DeSTIN can be transferred to the Uniform DeSTIN case without dramatic loss of performance, Uniform DeSTIN should be able to recognize patterns based on many fewer examples than classic DeSTIN. As soon as Uniform DeSTIN has learned to recognize one element of a given affine perceptual equivalence class, it can recognize all of them; whereas classic DeSTIN must learn each element of the equivalence class separately.

So, roughly speaking, the number of cases required for unsupervised training of Uniform DeSTIN will be less than that for classic DeSTIN, by a ratio equal to the average size of the affine perceptual equivalence classes in the agent's perceptual stream.

Counterbalancing this, we have the performance cost of comparing the input to each node against a much larger set of centroids (in Uniform DeSTIN as opposed to classic DeSTIN). However, if a cover tree or other efficient data structure is used, this cost is not so onerous. The cost of nearest-neighbor queries in a cover tree storing n items (in this case, n centroids) is O(c^12 log n), where the constant c represents the "intrinsic dimensionality" of the data; and in practice the cover tree search algorithm seems to perform quite well. So, the added time cost for online clustering in Uniform DeSTIN as opposed to classic DeSTIN is a factor on the order of the log of the number of nodes in the DeSTIN tree. We believe this moderate added time cost is well worth paying, to gain a significant decrease in the number of training examples required for unsupervised learning.

Beyond increases in computational cost, there is also the risk that the online clustering may just not work as well when one has so many clusters in each node. This is the sort of problem that can really only be identified, and dealt with, during extensive practice - since the performance of any clustering algorithm is largely determined by the specific distribution of the data it's dealing with. It may be necessary to improve DeSTIN's online clustering in some way to make Uniform DeSTIN work optimally, e.g. improving its ability to form clusters with markedly non-spherical shapes. This ties in to a point raised in Chapter 29 - the possibility of supplementing traditional clusters with predicates learned by CogPrime, which may live inside DeSTIN nodes alongside centroids. Each such predicate in effect defines a (generally nonconvex) "cluster".
28.6 Imprecise Probability as a Tool for Linking CogPrime and DeSTIN

One key aspect of vision processing is the ability to preferentially focus attention on certain positions within a perceived visual scene. In this section we describe a novel strategy for enabling this in a hybrid CogPrime/DeSTIN system, via use of imprecise probabilities. In fact the basic idea suggested here applies to any probabilistic sensory system, whether deep-learning-based or not, and whether oriented toward vision or some other sensory modality. However, for the sake of concreteness, we will focus here on the case of DeSTIN/CogPrime integration.

28.6.1 Visual Attention Focusing

Since visual input streams contain vast amounts of data, it's beneficial for a vision system to be able to focus its attention specifically on the most important parts of its input. Sometimes knowledge of what's important will come from cognition and long-term memory, but sometimes it may come from mathematical heuristics applied to the visual data itself.

In the human visual system the latter kind of "low level attention focusing" is achieved largely in the context of the eye changing its focus frequently, looking preferentially at certain positions in the scene [Int09]. This works because the center of the eye corresponds to a greater density of neurons than the periphery. So for example, consider a computer vision algorithm like SIFT (Scale-Invariant Feature Transform) [Low99], which (as shown in Figure 28.1) mathematically isolates certain points in a visual scene as "keypoints" which are particularly important for identifying what the scene depicts (e.g. these may be corners, or easily identifiable curves in edges). The human eye, when looking at a scene, would probably spend a greater percentage of its time focusing on the SIFT keypoints than on random points in the image.

The human visual system's strategy for low-level attention focusing is obviously workable (at least in contexts similar to those in which the human eye evolved), but it's also somewhat complex, requiring the use of subtle temporal processing to interpret even static scenes. We suggest here that there may be a simpler way to achieve the same thing, in the context of vision systems that are substantially probabilistic in nature, via using imprecise probabilities. The crux of the idea is to represent the most important data, e.g. keypoints, using imprecise probability values with greater confidence.

Similarly, cognition-guided visual attention-focusing occurs when a mind's broader knowledge of the world tells it that certain parts of the visual input may be more interesting to study than others. For example, in a picture of a person walking down a dark street, the contours of the person may not be tremendously striking visually (according to SIFT or similar approaches); but even so, if the system as a whole knows that it's looking at a person, it may decide to focus extra visual attention on anything person-like. This sort of cognition-guided visual attention focusing, we suggest, may be achieved similarly to visual attention focusing guided by lower-level cues - by increasing the confidence of the imprecise probabilities associated with those aspects of the input that are judged more cognitively significant.
28.6.2 Using Imprecise Probabilities to Guide Visual Attention Focusing

Suppose one has a vision system that internally constructs probabilistic values corresponding to small local regions in visual input (these could be pixels or voxels, or something a little larger), and then (perhaps via a complex process) assigns probabilities to different interpretations of the input based on combinations of these input-level probabilities. For this sort of vision system, one may be able to achieve focusing of attention via appropriately replacing the probabilities with imprecise probabilities. Such an approach may be especially interesting in hierarchical vision systems that also involve the calculation of probabilities corresponding to larger regions of the visual input. Examples of the latter include deep learning based vision systems like HTM or DeSTIN, which construct nested hierarchies corresponding to larger and larger regions of the input space, and calculate probabilities associated with each of the regions on each level, based in part on the probabilities associated with other related regions.

In this context, we now state the basic suggestion of the section:

1. Assign higher confidence to the low-level probabilities that the vision system creates corresponding to the local visual regions that one wants to focus attention on (based on cues from visual preprocessing or cognitive guidance)
2. Carry out the vision system's processing using imprecise probabilities rather than single-number probabilities
3. Wherever the vision system makes a decision based on "the most probable choice" from a number of possibilities, change the system to make a decision based on "the choice maximizing the product (expectation * confidence)".

28.6.3 Sketch of Application to DeSTIN

Internally to DeSTIN, probabilities are assigned to clusters associated with local regions of the visual input. If a system such as SIFT is run as a preprocessor to DeSTIN, then those small regions corresponding to SIFT keypoints may be assumed semantically meaningful, and internal DeSTIN probabilities associated with them can be given a high confidence. A similar strategy may be taken if a cognitive system such as OpenCog is run together with DeSTIN, feeding DeSTIN information on which portions of a partially-processed image appear most cognitively relevant. The probabilistic calculations inside DeSTIN can be replaced with corresponding calculations involving imprecise probabilities. And critically, there is a step in DeSTIN where, among a set of beliefs about the state in each region of an image (on each of a set of hierarchical levels), the one with the highest probability is selected. In accordance with the above recipe, this step should be modified to select the belief with the highest probability * confidence.
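A toy rendering of the modified selection step; the (strength, confidence) pairing mirrors the truth-value style used elsewhere in CogPrime, and all names here are ours:

    import dataclasses

    @dataclasses.dataclass
    class ImpreciseProb:
        strength: float    # point estimate (or interval midpoint) of the probability
        confidence: float  # how much evidence backs the estimate, in [0, 1]

    def select_belief(beliefs):
        """Pick the winning belief by strength * confidence, not strength alone."""
        return max(beliefs, key=lambda b: b.strength * b.confidence)

    # A high-confidence percept (say, at a SIFT keypoint) beats a slightly
    # more probable but low-confidence alternative:
    print(select_belief([ImpreciseProb(0.65, 0.9), ImpreciseProb(0.70, 0.3)]))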
28.6.3.1 Conceptual Justification

What is the conceptual justification for this approach?

One justification is obtained by assuming that each percept has a certain probability of being erroneous, and that those percepts that appear to more closely embody the semantic meaning of the visual scene are less likely to be erroneous. This follows conceptually from the assumption that the perceived world tends to be patterned and structured, so that being part of a statistically significant pattern is (perhaps weak) evidence of being real rather than artifactual. Under this assumption, the proposed approach will maximize the accuracy of the system's judgments.

A related justification is obtained by observing that this algorithmic approach follows from the consideration of the perceived world as mutable. Consider a vision system that has the capability to modify even the low-level percepts that it intakes - i.e. to use what it thinks and knows to modify what it sees. The human brain certainly has this potential [Cha09]. In this case, it will make sense for the system to place some constraints regarding which of its percepts it is more likely to modify. Confidence values semantically embody this - a higher confidence being sensibly assigned to percepts that the system considers should be less likely to be modified based on feedback from its higher (more cognitive) processing levels. In that case, a higher confidence should be given to those percepts that seem to more closely embody the semantic meaning of the visual scene - which is exactly what we're suggesting here.

28.6.3.2 Enabling Visual Attention Focusing in DeSTIN via Imprecise Probabilities

We now refer back to the mathematical formulation of DeSTIN summarized in Section 4.3.1 of Chapter 4 above, in the context of which the application of imprecise probability based attention focusing to DeSTIN is almost immediate. The probabilities P(o|s) may be assigned greater or lesser confidence depending on the assessed semantic criticality of the observation o in question. So for instance, if one is using SIFT as a preprocessor to DeSTIN, then one may assign the probabilities P(o|s) higher confidence if they correspond to observations o of SIFT keypoints than if they do not.

These confidence levels may then be propagated throughout DeSTIN's probabilistic mathematics. For instance, if one were using Walley's interval probabilities, then one could carry out the probabilistic equations using interval arithmetic. Finally, one wishes to replace Equation 4.3.1.2 in Chapter 4 with

c = argmax_s [ b'(s).strength * b'(s).confidence ]    (28.1)

or some similar variant. The effect of this is that hypotheses based on high-confidence observations are more likely to be chosen, which of course has a large impact on the dynamics of the DeSTIN network.

Fig. 28.1: The SIFT algorithm finds keypoints in an image, i.e. localized features that are particularly useful for identifying the objects in an image. The top row shows images that are matched against the image in the middle row. The bottom-row image shows some of the keypoints used to perform the matching (i.e. these keypoints demonstrate the same features in the top-row images and their transformed middle-row counterparts). SIFT keypoints are identified via a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations.

Chapter 29
Bridging the Symbolic/Subsymbolic Gap

29.1 Introduction

While it's widely accepted that human beings carry out both symbolic and subsymbolic processing, as integral parts of their general intelligence, the precise definition of "symbolic" versus "subsymbolic" is a subtle issue, which different AI researchers will approach in different ways depending on their differing overall perspectives on AI.
Nevertheless, the intuitive meaning of the concepts is commonly understood:

• "subsymbolic" refers to things like pattern recognition in high-dimensional quantitative sensory data, and real-time coordination of multiple actuators taking multidimensional control signals
• "symbolic" refers to things like natural language grammar and (certain or uncertain) logical reasoning, that are naturally modeled in terms of manipulation of symbolic tokens according to particular (perhaps experientially learned) rules

Views on the relationship between these two aspects of intelligence in human and artificial cognition are quite diverse, including perspectives such as:

1. Symbolic representation and reasoning are the core of human-level intelligence; subsymbolic aspects of intelligence are of secondary importance and can be thought of as pre- or post-processors to symbolic representation and reasoning
2. Subsymbolic representation and learning are the core of human intelligence; symbolic aspects of intelligence
   a. emerge from the subsymbolic aspects as needed; or,
   b. arise via a relatively simple, thin layer on top of subsymbolic intelligence, that merely applies subsymbolic intelligence in a slightly different way
3. Symbolic and subsymbolic aspects of intelligence are best considered as different subsystems, which
   a. have a significant degree of independent operation, but also need to coordinate closely together; or,
   b. operate largely separately and can be mostly considered as discrete modules

In evolutionary terms, it is clear that subsymbolic intelligence came first, and that most of the human brain is concerned with the subsymbolic intelligence that humans share with other animals. However, this observation doesn't have clear implications regarding the relationship between symbolic and subsymbolic intelligence in the context of everyday cognition.

In the history of the AI field, the symbolic/subsymbolic distinction was sometimes aligned with the dichotomy between logic-based and rule-based AI systems (on the symbolic side) and neural networks (on the subsymbolic side) [P88b]. However, this dichotomy has become much blurrier in the last couple decades, with developments such as neural network models of language parsing [CH11] and logical reasoning [LB10], and symbolic approaches to perception and action [SR04]. Integrative approaches have also become more common, with one of the major traditional symbolic AI systems, ACT-R, spawning a neural network version [LA93] with parallel structures and dynamics to the traditional explicitly symbolic version, and a hybridization with a computational neuroscience model [JL08]; and another one, SOAR, incorporating perception processing components as separate modules [Lai12]. The field of "neural-symbolic computing" has emerged, covering the emergence of symbolic rules from neural networks, and the hybridization of neural networks with explicitly symbolic systems [HH07].

Our goal here is not to explore the numerous deep issues involved with the symbolic/subsymbolic dichotomy, but rather to describe the details of a particular approach to symbolic/subsymbolic integration, inspired by Perspective 3a in the above list: the consideration of symbolic and subsymbolic aspects of intelligence as different subsystems, which have a significant degree of independent operation, but also need to coordinate closely together.
We believe this kind of integration can serve a key role in the quest to create human-level general intelligence. The approach presented here is at the beginning rather than the end of its practical implementation; what we are describing here is the initial design intention of a project in progress, which is sure to be revised in some respects as implementation and testing proceed. We will focus mainly on the tight integration of a subsymbolic system enabling gray-scale vision processing into a cognitive architecture with significant symbolic aspects, and will then briefly explain how the same ideas can be used for color vision, and multi-sensory and perception-action integration.

The approach presented here begins with two separate AI systems, OpenCog (introduced in Chapter 6.3) and DeSTIN (introduced in Chapter 4.3.1) - both currently implemented in open-source software. Here are the relevant features of each as they pertain to our current effort of bridging the symbolic/subsymbolic gap:

• OpenCog is centered on a "weighted, labeled hypergraph" knowledge representation called the Atomspace, and features a number of different, sophisticated cognitive algorithms acting on the Atomspace. Some of these cognitive algorithms are heavily symbolic in focus (e.g. a probabilistic logic engine); others are more subsymbolic in nature (e.g. a neural net like system for allocating attention and assigning credit). However, OpenCog in its current form cannot deal with high-dimensional perceptual input, nor with detailed real-time control of complex actuators. OpenCog is now being used to control intelligent characters in an experimental virtual world, where the perceptual inputs are the 3D coordinate locations of objects or small blocks; and the actions are movement commands like "step forward", "turn head to the right."
• DeSTIN is a deep learning system consisting of a hierarchy of processing nodes, in which the nodes on higher levels correspond to larger regions of space-time, and each node carries out prediction regarding events in the space-time region to which it corresponds. Feedback and feedforward dynamics between nodes combine with the predictive activity within nodes, to create a complex nonlinear dynamical system whose state self-organizes to reflect the state of the world being perceived. The specifics of DeSTIN's dynamics have been designed in what we consider a particularly powerful way, and the system has shown good results on small-scale test problems [KAR10]. So far DeSTIN has been utilized only for vision processing, but a similar proprietary system has been used for auditory data as well; and DeSTIN was designed to work together with an accompanying action hierarchy.

These two systems were not originally designed to work together, but we will describe a method for achieving their tight integration via:

1. Modifying DeSTIN in several ways, so that
   a. the patterns in its states over time will have more easily recognizable regularities
   b. its nodes are able to scan their inputs not only for simple statistical patterns (DeSTIN "centroids"), but also for patterns recognized by routines supplied to it by an external source (e.g. another AI system such as OpenCog)
2. Utilizing one of OpenCogPrime's cognitive processes (the "Fishgram" frequent subhypergraph mining algorithm) to recognize patterns in sets of DeSTIN states, and then recording these patterns in OpenCogPrime's Atomspace knowledge store
3. Utilizing OpenCogPrime's other cognitive processes to abstract concepts and draw conclusions from the patterns recognized in DeSTIN states by Fishgram
4. Exporting the concepts and conclusions thus formed to DeSTIN, so that its nodes can explicitly scan for their presence in their inputs, thus allowing the results of symbolic cognition to explicitly guide subsymbolic perception
5. Creating an action hierarchy corresponding closely to DeSTIN's perceptual hierarchy, and also corresponding to the actuators of a particular robot. This allows action learning to be done via an optimization approach ([LKP05], [YKL+04]), where the optimization algorithm uses DeSTIN states corresponding to perceived actuator states as part of its inputs.

The ideas presented here are compatible with those described in [Goe11a], but different in emphasis. That paper described a strategy for integrating OpenCog and DeSTIN via creating an intermediate "semantic CSDLN" hierarchy to translate between OpenCog and DeSTIN, in both directions. In the approach suggested here, this semantic CSDLN hierarchy exists conceptually but not as a separate software object: it exists as the combination of

• OpenCog predicates exported to DeSTIN and used alongside DeSTIN centroids, inside DeSTIN nodes
• OpenCog predicates living in the OpenCog knowledge repository (the Atomspace), and interconnected in a hierarchical way using OpenCog nodes and links (thus reflecting DeSTIN's hierarchical structure within the Atomspace).

This hierarchical network of predicates, spanning the two software systems, plays the role of a semantic CSDLN as described in [Goe11a].

29.2 Simplified OpenCog Workflow

The dynamics inside an OpenCog system may be highly complex, defying simple flowcharting, but from the point of view of OpenCog-DeSTIN integration, one important pattern of information flow through the system is as follows (a toy code sketch of this loop is given after the list):

1. Perceptions come into the Atomspace. In the current OpenCog system, these are provided via a proxy to the game engine where the OpenCog-controlled character interacts. In an OpenCog-DeSTIN hybrid, they will be provided via DeSTIN.
2. Hebbian learning builds HebbianLinks between perceptual Atoms representing percepts that have frequently co-occurred.
3. PLN inference, concept blending and other methods act on these perceptual Atoms and their HebbianLinks, forming links between them and linking them to other Atoms stored in the Atomspace reflecting prior experience and generalizations therefrom.
4. Attention allocation gives higher short- and long-term importance values to those Atoms that appear likely to be useful based on the links they have obtained.
5. Based on the system's current goals and subgoals (the latter learned from the top-level goals using PLN), and the goal-related links in the Atomspace, the OpenPsi mechanism triggers the PLN-based planner, which chooses a series of high-level actions judged likely to help the system achieve its goals in the current context.
6. The chosen high-level actions are transformed into series of lower-level, directly executable actions. In the current OpenCog system, this is done by a set of hand-coded rules based on the specific mechanics of the game engine where the OpenCog-controlled character interacts. In an OpenCog-DeSTIN hybrid, the lower-level action sequence will be chosen by an optimization method acting based on the motor control and perceptual hierarchies.
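The following Python sketch is a deliberate caricature of the six steps above, included only to make the flow of information concrete. None of these classes or functions belong to the actual OpenCog codebase; names such as Atomspace, hebbian_update and the plan callback are assumptions made for this toy example, and step 3 (inference and blending) is omitted entirely.

    class Atom:
        def __init__(self, name):
            self.name = name
            self.sti = 0.0          # short-term importance

    class Atomspace:
        def __init__(self):
            self.atoms = {}
            self.hebbian = {}       # (name, name) -> link weight in [0, 1]

        def add_percept(self, name):
            # Step 1: a percept enters the Atomspace as an Atom.
            return self.atoms.setdefault(name, Atom(name))

    def hebbian_update(space, active, rate=0.1):
        # Step 2: strengthen HebbianLinks between co-occurring percepts.
        for a in active:
            for b in active:
                if a is not b:
                    w = space.hebbian.get((a.name, b.name), 0.0)
                    space.hebbian[(a.name, b.name)] = w + rate * (1.0 - w)

    def cognitive_cycle(space, percept_names, plan):
        active = [space.add_percept(n) for n in percept_names]   # step 1
        hebbian_update(space, active)                            # step 2
        # Step 3 (PLN inference, concept blending) omitted here.
        for a in active:                                         # step 4
            a.sti += sum(w for (src, _), w in space.hebbian.items()
                         if src == a.name)
        high_level = plan(space, active)                         # step 5
        return [("execute", act) for act in high_level]          # step 6

Here the plan callback stands in for the OpenPsi-triggered, PLN-based planner of step 5; in the real system each of these steps is a sophisticated process in its own right.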
This pattern of information flow omits numerous aspects of OpenCog cognitive dynamics, but gives the key parts of the picture in terms of the interaction of OpenCog cognition with perception and action. Most of the other aspects of the dynamics have to do with the interaction of multiple cognitive processes acting on the Atomspace, and the interaction between the Atomspace and several associated specialized memory stores, dealing with procedural, episodic, temporal and spatial aspects of knowledge. From the present point of view, these additional aspects may be viewed as part of Step 3 above, wrapped up in the phrase "and other methods act on these perceptual Atoms." However, it's worth noting that in order to act appropriately on perceptual Atoms, a lot of background cognition regarding more abstract conceptual Atoms (often generalized from previous perceptual Atoms) may be drawn on. This background inference incorporates both symbolic and subsymbolic aspects, but goes beyond the scope of the present discussion, as its particulars do not impinge on the particulars of DeSTIN-OpenCog integration.

OpenCog also possesses a specialized facility for natural language comprehension and generation [LGE10] [Goe10b], which may be viewed as a parallel perception/action pathway, bypassing traditional human-like sense perception and dealing with text directly. Integrating OpenCogPrime's current linguistics processes with DeSTIN-based auditory and visual processing is a deep and important topic, but one we will bypass here, for the sake of brevity and because it is not our current research priority.

29.3 Integrating DeSTIN and OpenCog

The integration of DeSTIN and OpenCog involves two key aspects:

• recognition of patterns in sets of DeSTIN states, and exportation of these patterns into the OpenCog Atomspace
• use of OpenCog-created concepts within DeSTIN nodes, alongside statistically derived "centroids"

From here on, unless specified otherwise, when we mention "DeSTIN" we will mean "Uniform DeSTIN", as presented in Chapter 28; this is an extension of the "classic DeSTIN" defined in [ARK09].

29.3.1 Mining Patterns from DeSTIN States

The first step toward using OpenCog tools to mine patterns from sets of DeSTIN states is to represent these states in Atom form in an appropriate way. A simple but workable approach, restricting attention for the moment to purely spatial patterns, is to use the six predicates:

• hasCentroid(node N, int k)
• hasParentCentroid(node N, int k)
• hasNorthNeighborCentroid(node N, int k)
• hasSouthNeighborCentroid(node N, int k)
• hasEastNeighborCentroid(node N, int k)
• hasWestNeighborCentroid(node N, int k)

For instance, hasNorthNeighborCentroid(N, 3) means that N's north neighbor has centroid #3. One may also consider the predicates

• hasParent(node N, node M)
• hasNorthNeighbor(node N, node M)
• hasSouthNeighbor(node N, node M)
• hasEastNeighbor(node N, node M)
• hasWestNeighbor(node N, node M)

Now suppose we have a stored set of DeSTIN states, saved from the application of DeSTIN to multiple different inputs. What we want to find are predicates P that are conjunctions of instances of the above ten predicates, and that occur frequently in the stored set of DeSTIN states. (A sketch of the translation from a DeSTIN state into these relations is given below.)
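To make the translation concrete, here is a minimal Python sketch, assuming one layer of a DeSTIN hierarchy is summarized as a 2D grid of winning centroid ids. The function name and the helper callbacks parent_of and parent_centroid are illustrative assumptions, not part of the DeSTIN or OpenCog APIs.

    def state_to_relations(grid, parent_of, parent_centroid):
        """grid[r][c] is the winning centroid id at node (r, c) of one layer;
        parent_of and parent_centroid are assumed helpers returning a node's
        parent node (on the layer above) and that parent's centroid id."""
        rels = []
        rows, cols = len(grid), len(grid[0])
        for r in range(rows):
            for c in range(cols):
                n = (r, c)
                rels.append(("hasCentroid", n, grid[r][c]))
                rels.append(("hasParent", n, parent_of(n)))
                rels.append(("hasParentCentroid", n, parent_centroid(n)))
                if r > 0:
                    rels.append(("hasNorthNeighbor", n, (r - 1, c)))
                    rels.append(("hasNorthNeighborCentroid", n, grid[r - 1][c]))
                if r < rows - 1:
                    rels.append(("hasSouthNeighbor", n, (r + 1, c)))
                    rels.append(("hasSouthNeighborCentroid", n, grid[r + 1][c]))
                if c > 0:
                    rels.append(("hasWestNeighbor", n, (r, c - 1)))
                    rels.append(("hasWestNeighborCentroid", n, grid[r][c - 1]))
                if c < cols - 1:
                    rels.append(("hasEastNeighbor", n, (r, c + 1)))
                    rels.append(("hasEastNeighborCentroid", n, grid[r][c + 1]))
        return rels

Each stored DeSTIN state thus becomes a flat set of relation triples, which is exactly the form of input over which frequent-conjunction mining can operate.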
A simple example of such a predicate would be the conjunction of

• hasNorthNeighbor($N, $M)
• hasParentCentroid($N, 5)
• hasParentCentroid($M, 5)
• hasNorthNeighborCentroid($N, 6)
• hasWestNeighborCentroid($M, 4)

This predicate could be evaluated at any pair of nodes ($N, $M) on the same DeSTIN level. If it is true for atypically many of these pairs, then it is a "frequent pattern", and should be detected and stored.

OpenCogPrime's pattern mining component, Fishgram, exists precisely for the purpose of mining this sort of conjunction from sets of relationships stored in the Atomspace. It may be applied to this problem as follows:

• Translate each DeSTIN state into a set of relationships drawn from: hasNorthNeighbor, hasSouthNeighbor, hasEastNeighbor, hasWestNeighbor, hasCentroid, hasParent
• Import these relationships, describing each DeSTIN state, into the OpenCog Atomspace
• Run pattern mining on this Atomspace.

29.3.2 Probabilistic Inference on Mined Hypergraphs

Patterns mined from DeSTIN states can then be reasoned on by OpenCogPrime's PLN inference engine, allowing analogy and generalization.

Suppose centroids 5 and 617 are estimated to be similar, either via DeSTIN's built-in similarity metric or, more interestingly, via OpenCog inference on the Atom representations of these centroids. As an example of the latter: 5 could represent a person's nose and 617 could represent a rabbit's nose. In this case, DeSTIN might not judge the two centroids particularly similar on a purely visual level; but OpenCog may know that the images corresponding to both of these centroids are called "noses" (e.g. perhaps via noticing people indicate these images in association with the word "nose"), and may thus infer (using a simple chain of PLN inferences) that these centroids seem probabilistically similar. If 5 and 617 are estimated to be similar, then a predicate like

    ANDLink
      EvaluationLink
        hasNorthNeighbor
        ListLink $N $M
      EvaluationLink
        hasParentCentroid
        ListLink $N 5
      EvaluationLink
        hasParentCentroid
        ListLink $M 5
      EvaluationLink
        hasNorthNeighborCentroid
        ListLink $N 6
      EvaluationLink
        hasWestNeighborCentroid
        ListLink $M 4

mined from DeSTIN states, could be extended via PLN analogical reasoning to

    ANDLink
      EvaluationLink
        hasNorthNeighbor
        ListLink $N $M
      EvaluationLink
        hasParentCentroid
        ListLink $N 617
      EvaluationLink
        hasParentCentroid
        ListLink $M 617
      EvaluationLink
        hasNorthNeighborCentroid
        ListLink $N 6
      EvaluationLink
        hasWestNeighborCentroid
        ListLink $M 4
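The core substitution step of this analogy is simple enough to sketch in a few lines of Python. This is an illustrative toy, not PLN itself: the triple representation, the function names, and the crude strength discounting are all assumptions made for the example.

    # Toy sketch of the analogy step above: given a mined conjunction and a
    # pair of centroids judged similar, produce the analogous conjunction,
    # discounting its strength by the similarity (a crude stand-in for the
    # PLN truth-value calculation).

    CENTROID_PREDICATES = {"hasCentroid", "hasParentCentroid",
                           "hasNorthNeighborCentroid", "hasSouthNeighborCentroid",
                           "hasEastNeighborCentroid", "hasWestNeighborCentroid"}

    def analogous_pattern(pattern, old_c, new_c, strength, similarity):
        """pattern: list of (predicate, variable, value) triples."""
        new_pattern = []
        for pred, var, val in pattern:
            if pred in CENTROID_PREDICATES and val == old_c:
                new_pattern.append((pred, var, new_c))
            else:
                new_pattern.append((pred, var, val))
        return new_pattern, strength * similarity

    # The nose example: substitute centroid 617 for centroid 5.
    pattern = [("hasNorthNeighbor", "$N", "$M"),
               ("hasParentCentroid", "$N", 5),
               ("hasParentCentroid", "$M", 5),
               ("hasNorthNeighborCentroid", "$N", 6),
               ("hasWestNeighborCentroid", "$M", 4)]
    new_pat, s = analogous_pattern(pattern, old_c=5, new_c=617,
                                   strength=0.8, similarity=0.7)

The resulting pattern is a hypothesis, not an observation; its discounted strength reflects the fact that it was obtained by analogy rather than mined directly from DeSTIN states.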
29.3.3 Insertion of OpenCog-Learned Predicates into DeSTIN's Pattern Library

Suppose one has used Fishgram, as described in the earlier part of this chapter, to recognize predicates embodying frequent or surprising patterns in a set of DeSTIN states or state-sequences. The next natural step is to add these frequent or surprising patterns to DeSTIN's pattern library, so that the pattern library contains not only classic DeSTIN centroids, but also these corresponding "image grammar" style patterns. Then, when a new input comes into a DeSTIN node, in addition to being compared to the centroids at the node, it can be fed as input to the predicates associated with the node. What is the advantage of this approach, compared to DeSTIN without these predicates? The capability for more compact representation of a variety of spatial patterns.

In many cases, a spatial pattern that would require a large number of DeSTIN centroids to represent can be represented by a single, fairly compact predicate. It is an open question whether these sorts of predicates are really critical for human-like vision processing; however, our intuition is that they do have a role in human as well as machine vision. In essence, DeSTIN is based on a fancy version of nearest-neighbor search, applied in a clever way on multiple levels of a hierarchy, using context-savvy probabilities to bias the matching. But we suspect there are many visual patterns that are more compactly and intuitively represented using a more flexible language, such as OpenCog predicates formed by combining elementary predicates involving appropriate spatial and temporal relations.

For example, consider the archetypal spatial pattern of a face: either two eyes that are next to each other, or sunglasses, above a nose, which is in turn above a mouth. (This is an oversimplified toy example, posited for illustration only; the same point applies to more complex and realistic patterns.) One could represent this in OpenCogPrime's Atom language as something like:

    AND
      InheritanceLink N B_nose
      InheritanceLink M B_mouth
      EvaluationLink
        above
        ListLink E N
      EvaluationLink
        above
        ListLink N M
      OR
        AND
          MemberLink E1 E
          MemberLink E2 E
          EvaluationLink
            next_to
            ListLink E1 E2
          InheritanceLink E1 B_eye
          InheritanceLink E2 B_eye
        AND
          InheritanceLink E B_sunglasses

where e.g. B_eye is a DeSTIN belief that corresponds roughly to recognition of the spatial pattern of a human eye. To represent this using ordinary DeSTIN centroids, one couldn't represent the OR explicitly; instead one would need to split it into two different sets of centroids, corresponding to the eye case and the sunglasses case, unless the DeSTIN pattern library contained a belief corresponding to "eyes or sunglasses." But the question then becomes: how would classic DeSTIN actually learn a belief like this? In the suggested architecture, pattern mining on the database of DeSTIN states is proposed as an algorithm for learning such beliefs.

This sort of predicate-enhanced DeSTIN will have advantages over the traditional version only if the actual distribution of images observed by the system contains many (reasonably high probability) images modeled accurately by predicates involving disjunctions and/or negations as well as conjunctions. If the system's perceived world is simpler than this, then good old DeSTIN will work just as well, and the OpenCog-learned predicates are a needless complication.

Without these sorts of predicates, how might DeSTIN be extended to include beliefs like "eyes or sunglasses"? One way would be to couple DeSTIN with a reinforcement learning subsystem that reinforces the creation of beliefs that prove useful for the system as a whole. If reasoning in terms of faces (independent of whether they have eyes or sunglasses) got the system reward, presumably it could learn to form the concept "eyes or sunglasses." We believe this would also be a workable approach, but that, given the strengths and weaknesses of contemporary computer hardware, the proposed DeSTIN-OpenCog approach will prove considerably simpler and more effective. A sketch of how a predicate-enhanced DeSTIN node might score its input against both centroids and exported predicates is given below.
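The following Python fragment illustrates one way such node-level matching could work, under stated assumptions: that a node's input is a flat numeric vector, that similarity to a centroid can be scored via inverse distance, and that an OpenCog-exported predicate can be treated as a boolean function over that same input. None of this is actual DeSTIN code; it is a sketch of the design idea only.

    import math

    def node_belief(input_vec, centroids, predicates):
        """centroids: list of vectors learned statistically by the node;
        predicates: list of boolean functions supplied by OpenCog (e.g. the
        mined Fishgram conjunctions, compiled into executable form)."""
        scores = []
        for c in centroids:
            d = math.dist(input_vec, c)
            scores.append(1.0 / (1.0 + d))       # nearest-neighbor style score
        for p in predicates:
            scores.append(1.0 if p(input_vec) else 0.0)   # predicate match
        total = sum(scores) or 1.0
        return [s / total for s in scores]       # normalized belief distribution

The point of the sketch is that the exported predicates simply become extra dimensions of the node's belief state, so downstream dynamics (prediction, feedback, feedforward) need not distinguish between statistically learned centroids and symbolically derived patterns.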
29.4 Multisensory Integration, and Perception-Action Integration

In Chapter 28.2.1 we briefly indicated how DeSTIN could be extended beyond vision to handle other senses such as audition and touch. If one had multiple perception hierarchies corresponding to multiple senses, the easiest way to integrate them within an OpenCog context would be to use OpenCog as the communication nexus: representing DeSTIN centroids in the various modality-specific hierarchies as OpenCog Atoms (PerceptualCentroidNodes), and building HebbianLinks in OpenCogPrime's Atomspace between these PerceptualCentroidNodes as appropriate, based on their association. So, for instance, the sound of a person's footsteps would correspond to a certain belief (probability distribution over centroids) in the auditory DeSTIN network, and the sight of a person's feet stepping would correspond to a certain belief (probability distribution over centroids) in the visual DeSTIN network; and the OpenCog Atomspace would contain links between the centroids assigned high weights in these two belief distributions. Importance spreading between these various PerceptualCentroidNodes would then cause a dynamic wherein seeing feet stepping would bias the system to think it was hearing footsteps, and hearing footsteps would bias it to think it was seeing feet stepping.

Now suppose there are similarities between the belief distributions for the visual appearance of dogs and the visual appearance of cats. Via the intermediary of the Atomspace, this would bias the auditory and haptic DeSTIN hierarchies to assume a similarity between the auditory and haptic characteristics of dogs and the analogous characteristics of cats. This is because PLN analogical reasoning would extrapolate from, e.g.,

• HebbianLinks joining cat-related visual PerceptualCentroidNodes and dog-related visual PerceptualCentroidNodes
• HebbianLinks joining cat-related visual PerceptualCentroidNodes to cat-related haptic PerceptualCentroidNodes; and others joining dog-related visual PerceptualCentroidNodes to dog-related haptic PerceptualCentroidNodes

to yield HebbianLinks joining cat-related haptic PerceptualCentroidNodes and dog-related haptic PerceptualCentroidNodes. This sort of reasoning would then cause the system, for example, upon touching a cat, to vaguely expect to hear dog-like things. Such simple analogical reasoning will be right sometimes and wrong sometimes: a cat walking sounds a fair bit like a dog walking, and cat and dog growls sound fairly similar, but a cat meowing doesn't sound much like a dog barking. More refined inferences of the same basic sort may be used to get the details right as the system explores and understands the world more accurately. A toy sketch of this cross-modal importance spreading follows.
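The sketch below is purely illustrative; the data structures (a dictionary of short-term importance values and a dictionary of HebbianLink weights) are assumptions standing in for OpenCog's attention allocation machinery.

    def spread_importance(sti, hebbian_links, active_nodes, rate=0.5):
        """sti: dict mapping PerceptualCentroidNode -> short-term importance;
        hebbian_links: dict mapping (source, target) -> link weight.
        Activity in 'active_nodes' (e.g. auditory footstep centroids) raises
        the importance of HebbianLinked nodes in other modalities (e.g. the
        visual centroids for stepping feet)."""
        for (src, tgt), w in hebbian_links.items():
            if src in active_nodes:
                sti[tgt] = sti.get(tgt, 0.0) + rate * w * sti.get(src, 0.0)
        return sti

The raised importance values would then be fed back down as biases on the belief states of the corresponding DeSTIN nodes, completing the cross-modal loop described above.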
29.4.1 Perception-Action Integration

While experimentation with DeSTIN has so far been restricted to perception processing, the system was designed from the beginning with robotics applications in mind, involving integration of perception with action and reinforcement learning. As OpenCog already handles reinforcement learning on a high level (via OpenPsi), our approach to robot control using DeSTIN and OpenCog involves creating a control hierarchy parallel to DeSTIN's perceptual hierarchy, and doing motor learning using optimization algorithms guided by reinforcement signals delivered from OpenPsi and incorporating DeSTIN perceptual states as part of their input information.

Our initial research goal, where action is concerned, is not to equal the best purely control-theoretic algorithms at fine-grained control of robots carrying out specialized tasks, but rather to achieve basic perception/control/cognition integration in the rough manner of a young human child. A two-year-old child is not particularly well coordinated, but is capable of coordinating actions involving multiple body parts using an integration of perception and action with unconscious and deliberative reasoning. Current robots, in some cases, can carry out specialized actions with great accuracy, but they lack this sort of integration, and thus generally have difficulty carrying out actions effectively in unforeseen environments and circumstances.

We will create an action hierarchy with nodes corresponding to different parts of the robot body, where e.g. the node corresponding to an arm would have child nodes corresponding to a shoulder, elbow, wrist and hand; and the node corresponding to a hand would have child nodes corresponding to the fingers of the hand; etc. Physical self-perception is then achieved by creating a DeSTIN "action-perception" hierarchy with nodes corresponding to the states of body components. In the simplest case this means the lowest-level nodes will correspond to individual servomotors, and their inputs will be numerical vectors characterizing servomotor states. If one is dealing with a robot endowed with haptic technology, e.g. SynTouch fingertips, then numerical vectors characterizing haptic inputs may be used alongside these.

The configuration space of an action-perception node, corresponding to the degrees of freedom of the servomotors of the body part the node represents, may be approximated by a set of "centroid" vectors. When an action is learned by the optimization method used for this purpose, the learning involves movements of the servomotors corresponding to many different nodes, and thus creates a series of "configuration vectors" in each node. These configuration vector series may be subjected to online clustering, similar to percepts in a DeSTIN perceptual hierarchy (a minimal sketch is given after this list). The result is a library of "codewords", corresponding to discrete trajectories of movement, associated with each node. The libraries may be shared by identical body parts (e.g. shared among legs, or among fingers), but will be distinct otherwise. Each coordinated whole-body action thus results in a series of (node, centroid) pairs, which may be mined for patterns, similarly to the perception case.

The set of predicates needed to characterize states in this action-perception hierarchy is simpler than the one described for visual perception above; here one requires only

• hasCentroid(node N, int k)
• hasParentCentroid(node N, int k)
• hasParent(node N, node M)
• hasSibling(node N, node M)

and most of the patterns will involve specific nodes rather than node variables: the different nodes in a DeSTIN vision hierarchy are more interchangeable (in terms of their involvement in various patterns) than, say, a leg and a finger.
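As a concrete (and purely illustrative) example of the online clustering just described, the fragment below maintains one node's codeword library with a simple moving-centroid update. The threshold-and-update rule is an assumption made for the sketch; the actual incremental clustering scheme is a design choice to be settled experimentally.

    import math

    def update_codewords(codewords, config_vec, threshold=0.5, rate=0.1):
        """codewords: list of centroid vectors for one action-perception node.
        Each new servomotor configuration vector either refines the nearest
        existing codeword or, if nothing is close enough, founds a new one."""
        if codewords:
            nearest = min(codewords, key=lambda c: math.dist(c, config_vec))
            if math.dist(nearest, config_vec) < threshold:
                for i in range(len(nearest)):
                    nearest[i] += rate * (config_vec[i] - nearest[i])
                return codewords
        codewords.append(list(config_vec))
        return codewords

Running each node's configuration-vector series through such an update yields the per-node codeword libraries, and each whole-body action then reduces to the series of (node, codeword) pairs to be mined for patterns as in the perception case.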
In a pure DeSTIN implementation, the visual and action-perception hierarchies would be directly linked. In the context of OpenCog integration, it is simplest to link the two via OpenCog, in a sense using cognition as a bridge between action and perception. It is unclear whether this strategy will be sufficient in the long run, but we believe it will be more than adequate for experimentation with robotic perceptual-motor coordination in a variety of everyday tasks.

OpenCogPrime's Hebbian learning process can be used to find common associations between action-perception states and visual-perception states, via mining a data store containing time-stamped state records from both hierarchies. Importance spreading along the HebbianLinks learned in this way can then be used to bias the weights in the belief states of the nodes in both hierarchies. So, for example, the action-perception patterns related to clenching the fist would be Hebbianly correlated with the visual-perception patterns related to seeing a clenched fist. When a clenched fist was perceived via servomotor data, importance spreading would increase the weighting of visual patterns corresponding to clenched fists, within the visual hierarchy. When a clenched fist was perceived via visual data, importance spreading would increase the weighting of servomotor data patterns corresponding to clenched fists, within the action-perception hierarchy.

29.4.2 Thought-Experiment: Eye-Hand Coordination

How would DeSTIN-OpenCog integration as described here carry out a simple task of eye-hand coordination? Of course, the details of such a feat, as actually achieved, would be too intricate to describe in a brief space, but it is still meaningful to describe the basic ideas. Consider the case of a robot picking up a block, in plain sight immediately in front of the robot, via pinching it between two fingers and then lifting it. In this case:

• The visual scene, including the block, is perceived by DeSTIN, and appropriate patterns in various DeSTIN nodes are formed
• Predicates corresponding to the distribution of patterns among DeSTIN nodes are activated and exported to the OpenCog Atomspace
• Recognition that a block is present is carried out, either by
  - PLN inference within OpenCog, drawing the conclusion that a block is present from the exported predicates, using ImplicationLinks comprising a working definition of a "block"; or
  - a predicate comprising the definition of "block", previously imported into DeSTIN from OpenCog and utilized within DeSTIN nodes as a basic pattern to be scanned for. This option would obtain only if the system had perceived many blocks in the past, justifying the automation of block recognition within the perceptual hierarchy.
• OpenCog, motivated by one of its higher-level goals, chooses "picking up the block" as a subgoal. So it allocates effort to finding a procedure whose execution, in the current context, has a reasonable likelihood of achieving the goal of picking up the block. For instance, the motivating goal could be curiosity (which might make the robot want to see what lies under the block), or the desire to please the agent's human teacher (in case the human teacher likes presents, and will reward the robot for giving it a block).