I apply recent work on "learning to think" (2015) and on PowerPlay (2011) to the incremental training of an increasingly general problem solver that continually learns to solve new tasks without forgetting previous skills. The problem solver is a single recurrent neural network (or similar general-purpose computer) called ONE. ONE is unusual in the sense that it is trained in various ways, e.g., by black box optimization / reinforcement learning / artificial evolution as well as by supervised / unsupervised learning. For example, ONE may learn through neuroevolution to control a robot through environment-changing actions, and learn through unsupervised gradient descent to predict future inputs and vector-valued reward signals, as suggested in 1990. User-given tasks can be defined through extra goal-defining input patterns, also proposed in 1990. Suppose ONE has already learned many skills. Now a copy of ONE can be re-trained to learn a new skill, e.g., through neuroevolution without a teacher. Here it may profit from re-using previously learned subroutines, but it may also forget previous skills. Then ONE is retrained in PowerPlay style (2011) on stored input/output traces of (a) ONE's copy executing the new skill and (b) previous instances of ONE whose skills are still considered worth memorizing. Simultaneously, ONE is retrained on old traces (even those of unsuccessful trials) to become a better predictor, without additional expensive interaction with the environment. More and more control and prediction skills are thus collapsed into ONE, as in the chunker-automatizer system of the neural history compressor (1991). This forces ONE to relate partially analogous skills (with shared algorithmic information) to each other, creating common subroutines in the form of shared subnetworks of ONE, greatly speeding up subsequent learning of additional, novel but algorithmically related skills.
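
The consolidation loop described in this abstract can be sketched in a toy setting. This is a minimal illustration under assumed details, not the paper's implementation: ONE is reduced to a single linear map, "skills" are stored input/output traces of two tasks distinguished by a goal-defining input bit, and PowerPlay-style retraining becomes plain gradient descent on the union of the stored traces. All target behaviours and names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for ONE: a single linear map trained on stored
# input/output traces. Task identity is supplied by an extra
# goal-defining input bit, as in the abstract.
def make_trace(task_bit, n=200):
    x = rng.normal(size=(n, 3))
    goal = np.full((n, 1), float(task_bit))
    inp = np.hstack([x, goal])
    # Hypothetical target behaviours: task 0 copies the first input,
    # task 1 copies it and adds an offset signalled by the goal bit.
    out = x[:, [0]] + task_bit
    return inp, out

def consolidate(W, traces, lr=0.1, epochs=500):
    # PowerPlay-style consolidation, reduced here to gradient descent
    # on the union of stored traces of all skills worth memorizing.
    for _ in range(epochs):
        for inp, out in traces:
            pred = inp @ W
            W = W - lr * inp.T @ (pred - out) / len(inp)
    return W

W = np.zeros((4, 1))
traces = [make_trace(0), make_trace(1)]
W = consolidate(W, traces)
errors = [float(np.mean((inp @ W - out) ** 2)) for inp, out in traces]
```

Because both skills share structure (the same input-copying subroutine), a single weight vector can represent both, with the goal bit selecting the variant; after consolidation, both stored traces are reproduced with near-zero error.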

* 17 pages, 107 references

On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models

Nov 30, 2015

Juergen Schmidhuber

* 36 pages, 1 figure. arXiv admin note: substantial text overlap with arXiv:1404.7828

In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

* Neural Networks, Vol 61, pp 85-117, Jan 2015

* 88 pages, 888 references

Self-delimiting (SLIM) programs are a central concept of theoretical computer science, particularly algorithmic information & probability theory, and asymptotically optimal program search (AOPS). To apply AOPS to (possibly recurrent) neural networks (NNs), I introduce SLIM NNs. Neurons of a typical SLIM NN have threshold activation functions. During a computational episode, activations spread from input neurons through the SLIM NN until the computation activates a special halt neuron. The weights of the NN's used connections define its program. Halting programs form a prefix code. Resetting the NN to its initial state costs no more than the latest program execution. Since prefixes of SLIM programs influence their suffixes (weight changes occurring early in an episode influence which weights are considered later), SLIM NN learning algorithms (LAs) should execute weight changes online during activation spreading. This can be achieved by applying AOPS to growing SLIM NNs. To efficiently teach a SLIM NN to solve many tasks, such as correctly classifying many different patterns or solving many different robot control tasks, each connection keeps a list of tasks it is used for. These lists may be efficiently updated during training. To evaluate the overall effect of currently tested weight changes, a SLIM NN LA needs to re-test performance only on the efficiently computable union of tasks potentially affected by the current weight changes. Future SLIM NNs will be implemented on 3-dimensional brain-like multi-processor hardware. Their LAs will minimize the task-specific total wire length of used connections, to encourage efficient solutions of subtasks by subsets of neurons that are physically close. The novel class of SLIM NN LAs is currently being probed in ongoing experiments to be reported in separate papers.
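
A minimal runnable sketch of the episode semantics described above (an illustration under assumed details, not the paper's algorithm): threshold neurons, activations spreading from the input until a designated halt neuron fires, and the set of connections actually used during the episode playing the role of the episode's "program".

```python
import numpy as np

def run_slim_episode(W, x, halt=-1, threshold=0.5, max_steps=20):
    """Spread activations through a threshold network until the halt
    neuron fires. Returns the final activations and the set of used
    connections, which serves as the episode's program."""
    a = x.astype(float).copy()
    used = set()
    for _ in range(max_steps):
        # record which connections carry signal this step
        for j in np.nonzero(a)[0]:
            for i in np.nonzero(W[:, j])[0]:
                used.add((int(i), int(j)))
        a = (W @ a > threshold).astype(float)
        if a[halt] > 0:  # halt neuron fired: the program is self-delimiting
            break
    return a, used

# A 3-neuron chain: input neuron 0 drives neuron 1, which drives the
# halt neuron 2 (weights chosen arbitrarily for the illustration).
W = np.zeros((3, 3))
W[1, 0] = 1.0
W[2, 1] = 1.0
final, used = run_slim_episode(W, np.array([1.0, 0.0, 0.0]), halt=2)
```

Only the two connections that actually carried signal end up in `used`; the many possible unused weights contribute nothing to the program, which is the sense in which halting episodes define self-delimiting programs over used connections.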

* 15 pages

Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes

Apr 15, 2009

Juergen Schmidhuber

* Short version: J. Schmidhuber. Simple Algorithmic Theory of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes. Journal of SICE 48(1), 21-32, 2009

* 35 pages, 3 figures, based on KES 2008 keynote and ALT 2007 / DS 2007 joint invited lecture


Simple Algorithmic Principles of Discovery, Subjective Beauty, Selective Attention, Curiosity & Creativity

Sep 05, 2007

Juergen Schmidhuber

* 15 pages, 3 highly compressible low-complexity drawings. Joint Invited Lecture for Algorithmic Learning Theory (ALT 2007) and Discovery Science (DS 2007), Sendai, Japan, 2007


2006: Celebrating 75 years of AI - History and Outlook: the Next 25 Years

Aug 31, 2007

Juergen Schmidhuber

When Kurt Goedel laid the foundations of theoretical computer science in 1931, he also introduced essential concepts of the theory of Artificial Intelligence (AI). Although much of subsequent AI research has focused on heuristics, which still play a major role in many practical AI applications, in the new millennium AI theory has finally become a full-fledged formal science, with important optimality results for embodied agents living in unknown environments, obtained through a combination of theory a la Goedel and probability theory. Here we look back at important milestones of AI history, mention essential recent results, and speculate about what we may expect from the next 25 years, emphasizing the significance of the ongoing dramatic hardware speedups, and discussing Goedel-inspired, self-referential, self-improving universal problem solvers.

* 14 pages; preprint of invited contribution to the Proceedings of the ``50th Anniversary Summit of Artificial Intelligence'' at Monte Verita, Ascona, Switzerland, 9-14 July 2006


Goedel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements

Dec 17, 2006

Juergen Schmidhuber

* Variants published in "Adaptive Agents and Multi-Agent Systems II", LNCS 3394, p. 1-23, Springer, 2005: ISBN 978-3-540-25260-3; as well as in Proc. ICANN 2005, LNCS 3697, p. 223-233, Springer, 2005 (plenary talk); as well as in "Artificial General Intelligence", Series: Cognitive Technologies, Springer, 2006: ISBN-13: 978-3-540-23733-4

* 29 pages, 1 figure, minor improvements, updated references


Artificial Intelligence (AI) has recently become a real formal science: the new millennium brought the first mathematically sound, asymptotically optimal, universal problem solvers, providing a new, rigorous foundation for the previously largely heuristic field of General AI and embedded agents. At the same time there has been rapid progress in practical methods for learning true sequence-processing programs, as opposed to traditional methods limited to stationary pattern association. Here we will briefly review some of the new results, and speculate about future developments, pointing out that the time intervals between the most notable events of human history (spanning over 40,000 years, or 2^9 human lifetimes) have shrunk exponentially, apparently converging to zero within the next few decades. Or is this impression just a by-product of the way humans allocate memory space to past events?

* Speed Prior: clarification / 15 pages, to appear in "Challenges to Computational Intelligence"

Most traditional artificial intelligence (AI) systems of the past 50 years are either very limited, or based on heuristics, or both. The new millennium, however, has brought substantial progress in the field of theoretically optimal and practically feasible algorithms for prediction, search, inductive inference based on Occam's razor, problem solving, decision making, and reinforcement learning in environments of a very general type. Since inductive inference is at the heart of all inductive sciences, some of the results are relevant not only for AI and computer science but also for physics, provoking nontraditional predictions based on Zuse's thesis of the computer-generated universe.

* 23 pages, updated refs, added Goedel machine overview, corrected computing history timeline. To appear in B. Goertzel and C. Pennachin, eds.: Artificial General Intelligence

We present a novel, general, optimally fast, incremental way of searching for a universal algorithm that solves each task in a sequence of tasks. The Optimal Ordered Problem Solver (OOPS) continually organizes and exploits previously found solutions to earlier tasks, efficiently searching not only the space of domain-specific algorithms but also the space of search algorithms. Essentially, we extend the principles of optimal nonincremental universal search to build an incremental universal learner that is able to improve itself through experience. In illustrative experiments, our self-improver becomes the first general system that learns to solve all n-disk Towers of Hanoi tasks (solution size 2^n - 1) for n up to 30, profiting from previously solved, simpler tasks involving samples of a simple context-free language.
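
For scale: the Towers of Hanoi benchmark mentioned above has shortest solutions of size 2^n - 1, which is why blind enumeration fails for large n and incremental reuse of solutions to simpler tasks matters. A standard recursive solver (not OOPS itself, just an illustration of the growth) looks like this:

```python
def hanoi(n, src="A", aux="B", dst="C"):
    """Move n disks from src to dst; returns the move list, whose
    length is the well-known shortest solution size 2^n - 1."""
    if n == 0:
        return []
    return (hanoi(n - 1, src, dst, aux)      # clear the top n-1 disks
            + [(src, dst)]                   # move the largest disk
            + hanoi(n - 1, aux, src, dst))   # restack the n-1 disks
```

At n = 30 the move list has 2^30 - 1, i.e. over a billion, entries, so a general searcher must exploit structure shared with previously solved tasks rather than search for the move sequence blindly.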

* Machine Learning, 54, 211-254, 2004.

* 43 pages, 2 figures, short version at NIPS 2002 (added 1 figure and references; streamlined presentation)

The probability distribution P from which the history of our universe is sampled represents a theory of everything or TOE. We assume P is formally describable. Since most (uncountably many) distributions are not, this imposes a strong inductive bias. We show that P(x) is small for any universe x lacking a short description, and study the spectrum of TOEs spanned by two Ps, one reflecting the most compact constructive descriptions, the other the fastest way of computing everything. The former derives from generalizations of traditional computability, Solomonoff's algorithmic probability, Kolmogorov complexity, and objects more random than Chaitin's Omega, the latter from Levin's universal search and a natural resource-oriented postulate: the cumulative prior probability of all x incomputable within time t by this optimal algorithm should be 1/t. Between both Ps we find a universal cumulatively enumerable measure that dominates traditional enumerable measures; any such CEM must assign low probability to any universe lacking a short enumerating program. We derive P-specific consequences for evolving observers, inductive reasoning, quantum physics, philosophy, and the expected duration of our universe.
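
Two standard definitions behind the abstract's two Ps may help (textbook forms, not quoted from the paper): Solomonoff-style algorithmic probability for the "most compact constructive description" end of the spectrum, and the resource-oriented postulate for the "fastest computation" end.

```latex
% Algorithmic probability of a prefix x under a universal prefix machine U:
% programs p are self-delimiting, so the sum converges.
M(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-\ell(p)}

% Resource-oriented postulate from the abstract: the cumulative prior
% probability of all x incomputable within time t by the optimal
% algorithm is 1/t.
\sum_{x \,:\, \mathrm{time}(x) > t} P(x) \;=\; \frac{1}{t}
```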

* Sections 1-5 in: Hierarchies of generalized Kolmogorov complexities and nonenumerable universal measures computable in the limit. International Journal of Foundations of Computer Science 13(4):587-612 (2002). Section 6 in: The Speed Prior: A New Simplicity Measure Yielding Near-Optimal Computable Predictions. In J. Kivinen and R. H. Sloan, editors, Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, Lecture Notes in Artificial Intelligence, pages 216--228. Springer, 2002.

* 10 theorems, 50 pages, 100 refs, 20000 words. Minor revisions: added references; improved readability

Improving Speaker-Independent Lipreading with Domain-Adversarial Training

Aug 04, 2017

Michael Wand, Juergen Schmidhuber

* Accepted at Interspeech 2017


Algorithm Selection as a Bandit Problem with Unbounded Losses

Jul 09, 2008

Matteo Gagliolo, Juergen Schmidhuber

* 15 pages, 2 figures


A Frequency-Domain Encoding for Neuroevolution

Dec 28, 2012

Jan Koutník, Juergen Schmidhuber, Faustino Gomez


On the Size of the Online Kernel Sparsification Dictionary

Jun 18, 2012

Yi Sun, Faustino Gomez, Juergen Schmidhuber

We analyze the size of the dictionary constructed from online kernel sparsification, using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Using this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we show that under certain technical conditions, the size of the dictionary will always grow sub-linearly in the number of data points, and, as a consequence, the kernel linear regressor constructed from the resulting dictionary is consistent.
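
A common way to build such a dictionary online is the approximate-linear-dependence (ALD) test of Engel et al.; the sketch below assumes that construction (the abstract does not restate it), with a hypothetical RBF kernel and tolerance `nu`.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # hypothetical kernel choice for the illustration
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def build_dictionary(points, kernel=rbf, nu=1e-2):
    """Online sparsification via the ALD test: a point joins the
    dictionary only if its feature-space image lies more than nu away
    (in squared distance) from the span of the current dictionary."""
    dictionary = []
    for x in points:
        if not dictionary:
            dictionary.append(x)
            continue
        K = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
        k_vec = np.array([kernel(a, x) for a in dictionary])
        # small ridge term keeps the Gram matrix solve stable
        alpha = np.linalg.solve(K + 1e-10 * np.eye(len(K)), k_vec)
        delta = kernel(x, x) - k_vec @ alpha   # ALD residual
        if delta > nu:
            dictionary.append(x)
    return dictionary

# Dense 1-D sample: the dictionary stays far smaller than the data,
# consistent with the sub-linear growth analyzed in the abstract.
points = [np.array([t]) for t in np.linspace(0.0, 1.0, 200)]
dictionary = build_dictionary(points)
```

Because additions require `delta > nu`, dictionary atoms stay well separated in feature space; it is this separation, together with the eigen-decay of the covariance operator, that bounds the dictionary's growth.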

* ICML2012


Multi-column Deep Neural Networks for Image Classification

Feb 13, 2012

Dan Cireşan, Ueli Meier, Juergen Schmidhuber

* CVPR 2012, p. 3642-3649

* 20 pages, 14 figures, 8 tables


Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments

Mar 29, 2011

Yi Sun, Faustino Gomez, Juergen Schmidhuber


Phoneme recognition in TIMIT with BLSTM-CTC

Apr 21, 2008

Santiago Fernández, Alex Graves, Juergen Schmidhuber

* 8 pages


Multi-Dimensional Recurrent Neural Networks

May 14, 2007

Alex Graves, Santiago Fernandez, Juergen Schmidhuber

* 10 pages, 10 figures
