Models, code, and papers for "Ke Chen":
In general, video games not only prevail in entertainment but have also become an alternative means of knowledge learning, skill acquisition, and assistance with medical treatment and health care in education, vocational/military training, and medicine. Video games also provide an ideal test bed for AI research. To a large extent, however, video game development is still a laborious and costly process, and there are many technical challenges ranging from game generation to intelligent agent creation. Unlike traditional methodologies, at the Machine Learning and Perception Lab at the University of Manchester (MLP@UoM) we advocate applying machine learning to different tasks in video game development to address several challenges systematically. In this paper, we give an overview of the main progress made recently in MLP@UoM and outline future research directions in learning-based video game development arising from our work.
Motivated by the advantages achieved by implicit analogue networks for solving online linear equations, this letter designs a novel implicit neural model based on conventional explicit gradient neural networks by introducing a positive-definite mass matrix. In addition to retaining the advantages of implicit neural dynamics, the proposed implicit gradient neural networks achieve globally exponential convergence to the unique theoretical solution of linear equations, as well as global stability even in no-solution and multi-solution situations. Simulation results verify the theoretical convergence analysis of the proposed neural dynamics.
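To make the idea of a gradient flow with a mass matrix concrete, here is a minimal numerical sketch. The abstract does not state the model's equations, so the form M x'(t) = -gamma * A^T (A x(t) - b), the forward-Euler discretisation, and the names `implicit_gradient_flow`, `gamma`, `dt`, and `steps` are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np

def implicit_gradient_flow(A, b, M, gamma=1.0, dt=1e-3, steps=20000):
    """Forward-Euler integration of the assumed dynamics
    M x'(t) = -gamma * A^T (A x(t) - b), starting from x(0) = 0."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - b)                   # gradient of 0.5 * ||A x - b||^2
        x_dot = np.linalg.solve(M, -gamma * grad)  # resolve the mass matrix on the left-hand side
        x = x + dt * x_dot
    return x

# Toy system with the unique solution x = (2, 3).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
M = np.array([[2.0, 0.5], [0.5, 1.0]])  # hypothetical symmetric positive-definite mass matrix
x = implicit_gradient_flow(A, b, M)
print(x)
```

Setting M to the identity recovers an ordinary explicit gradient flow, which is one way to compare the two kinds of dynamics on the same toy problem.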
Image registration is an important task in many image processing applications. It aims to align two or more images so that useful information can be extracted through comparison, combination or superposition. This is achieved by constructing an optimal transformation which ensures that the template image becomes similar to a given reference image. Although many models exist, designing a model capable of modelling large and smooth deformation fields continues to pose a challenge. This paper proposes a novel variational model for image registration using the Gaussian curvature as a regulariser. The model is motivated by the surface restoration work in geometric processing [Elsey and Esedoglu, Multiscale Model. Simul., (2009), pp. 1549-1573]. An effective numerical solver is provided for the model using an augmented Lagrangian method. Numerical experiments show that the new model outperforms three competing models based on, respectively, a linear curvature [Fischer and Modersitzki, J. Math. Imaging Vis., (2003), pp. 81-85], the mean curvature [Chumchob, Chen and Brito, Multiscale Model. Simul., (2011), pp. 89-128] and the diffeomorphic demon model [Vercauteren et al., NeuroImage, (2009), pp. 61-72] in terms of robustness and accuracy.
Speech emotion recognition plays an important role in building more intelligent and human-like agents. Because emotional speech data are difficult to collect, an increasingly popular solution is to leverage a related and rich source corpus to help with the target corpus. However, domain shift between the corpora poses a serious challenge, making domain shift adaptation difficult even for the recognition of positive/negative emotions. In this work, we propose class-wise adversarial domain adaptation to address this challenge by reducing the shift for all classes between different corpora. Experiments on the well-known corpora EMODB and Aibo demonstrate that our method is effective even when only a very limited number of labeled target examples is provided.
Human action recognition refers to automatically recognizing human actions from a video clip, and is one of the most challenging tasks in computer vision. In reality, a video stream is often weakly annotated with a set of relevant human action labels at a global level, rather than having each label assigned to a specific video episode corresponding to a single action, which leads to a multi-label learning problem. Furthermore, there are a great number of meaningful human actions in reality, but it would be extremely difficult, if not impossible, to collect and annotate video clips for all of them, which leads to a zero-shot learning scenario. To the best of our knowledge, no work has addressed all the above issues together in human action recognition. In this paper, we formulate a real-world human action recognition task as a multi-label zero-shot learning problem and propose a framework to tackle this problem. Our framework simultaneously tackles the issue of unknown temporal boundaries between different actions for multi-label learning and exploits side information regarding the semantic relationship between different human actions for zero-shot learning. As a result, our framework leads to a joint latent embedding representation for multi-label zero-shot human action recognition. The joint latent embedding is learned with two component models by exploring the temporal coherence underlying video data and the intrinsic relationship between the visual and semantic domains. We evaluate our framework with different settings, including a novel data split scheme designed especially for evaluating multi-label zero-shot learning, on two weakly annotated multi-label human action datasets: Breakfast and Charades. The experimental results demonstrate the effectiveness of our framework in multi-label zero-shot human action recognition.
Deep reinforcement learning (DRL) has proven to be an effective tool for creating general video-game AI. However, most current DRL video-game agents learn end-to-end from the video output of the game, which is superfluous for many applications and creates a number of additional problems. More importantly, working directly on pixel-based raw video data is substantially different from what a human player does. In this paper, we present a novel method which enables DRL agents to learn directly from object information. This is obtained via an object embedding network (OEN) that compresses a set of object feature vectors of different lengths into a single fixed-length unified feature vector representing the current game state and carries out the DRL simultaneously. We evaluate our OEN-based DRL agent by comparing it to several state-of-the-art approaches on a selection of games from the GVG-AI Competition. Experimental results suggest that our object-based DRL agent yields performance comparable to that of the approaches used in our comparative study.
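One common way to turn a variable-size set of object descriptors into a fixed-length state vector is a shared per-object layer followed by pooling. The sketch below illustrates only that general pattern; it is not the OEN architecture, whose details the abstract does not give. It assumes a varying number of objects with the same number of features each, and the weights `W`, the bias `b`, the sizes (4 features, 8 hidden units), and the function name `embed_objects` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))  # hypothetical shared weights: 4 object features -> 8 hidden units
b = np.zeros(8)              # hypothetical shared bias

def embed_objects(objects):
    """Map an (n_objects, 4) array to a fixed-length (8,) vector for any n_objects >= 1."""
    hidden = np.maximum(objects @ W + b, 0.0)  # same ReLU layer applied to every object
    return hidden.max(axis=0)                  # max-pool over the object axis

state_a = embed_objects(rng.normal(size=(3, 4)))  # game state with 3 objects
state_b = embed_objects(rng.normal(size=(7, 4)))  # game state with 7 objects
print(state_a.shape, state_b.shape)
```

Because the pooling step ignores object order, the resulting vector does not change if the objects are listed in a different sequence, which is a convenient property when the game reports objects in arbitrary order.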
Many of the leading approaches for video understanding are data-hungry and time-consuming, failing to capture the gist of spatial-temporal evolution in an efficient manner. The latest research shows that CNNs can reason about static relations between entities in images. To further exploit this capacity for reasoning about dynamic evolution, we introduce a novel network module called DenseImage Network (DIN) with two main contributions. 1) A novel compact representation of video which distills its significant spatial-temporal evolution into a matrix called DenseImage, primed for efficient video encoding. 2) A simple yet powerful learning strategy for video understanding based on DenseImage and a temporal-order-preserving CNN, which contains a local temporal correlation constraint capturing temporal evolution at multiple time scales with different filter widths. Extensive experiments on two recent challenging benchmarks demonstrate that our DenseImage Network can accurately capture the common spatial-temporal evolution between similar actions, even with enormous visual variations or different time scales. Moreover, we obtain state-of-the-art results in action and gesture recognition with much lower time and memory cost, indicating its immense potential in video representation and understanding.
A proper semantic representation for encoding side information is key to the success of zero-shot learning. In this paper, we explore two alternative semantic representations especially for zero-shot human action recognition: textual descriptions of human actions and deep features extracted from still images relevant to human actions. Such side information is accessible on the Web at little cost, which paves a new way of gaining side information for large-scale zero-shot human action recognition. We investigate different encoding methods to generate semantic representations for human actions from such side information. Based on our zero-shot visual recognition method, we conducted experiments on UCF101 and HMDB51 to evaluate the two proposed semantic representations. The results suggest that our proposed text- and image-based semantic representations considerably outperform traditional attributes and word vectors for zero-shot human action recognition. In particular, the image-based semantic representations yield favourable performance even though the representation is extracted from a small number of images per class.
Zero-shot learning for visual recognition, e.g., object and action recognition, has recently attracted a lot of attention. However, it remains challenging to bridge the semantic gap between visual features and their underlying semantics and to transfer knowledge to semantic categories unseen during learning. Unlike most existing zero-shot visual recognition methods, we propose a stagewise bidirectional latent embedding framework consisting of two subsequent learning stages for zero-shot visual recognition. In the bottom-up stage, a latent embedding space is first created by exploring the topological and labeling information underlying training data of known classes via a proper supervised subspace learning algorithm, and the latent embeddings of training data are used to form landmarks that guide embedding the semantics underlying unseen classes into this learned latent space. In the top-down stage, semantic representations of unseen-class labels in a given label vocabulary are then embedded into the same latent space to preserve the semantic relatedness between all different classes via our proposed semi-supervised Sammon mapping with the guidance of landmarks. Thus, the resultant latent embedding space allows for predicting the label of a test instance with a simple nearest-neighbor rule. To evaluate the effectiveness of the proposed framework, we have conducted extensive experiments on four benchmark datasets in object and action recognition, i.e., AwA, CUB-200-2011, UCF101 and HMDB51. The experimental results under comparative studies demonstrate that our proposed approach yields state-of-the-art performance under inductive and transductive settings.
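The final prediction step mentioned in the abstract above, a nearest-neighbor rule in the latent space, can be sketched as follows. This assumes the test instance and the unseen-class labels have already been embedded into the same space; the Euclidean distance, the function name `predict_label`, and the toy two-dimensional embeddings and class names are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def predict_label(z, class_embeddings, class_names):
    """Return the name of the class whose latent embedding is closest to z."""
    distances = np.linalg.norm(class_embeddings - z, axis=1)  # Euclidean distance to each class
    return class_names[int(np.argmin(distances))]

# Hypothetical latent embeddings of three unseen classes and one test instance.
class_names = ["zebra", "whale", "otter"]
class_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
z_test = np.array([0.1, 0.8])
predicted = predict_label(z_test, class_embeddings, class_names)
print(predicted)
```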
Music information retrieval faces a challenge in modeling contextualized musical concepts formulated by a set of co-occurring tags. In this paper, we investigate the suitability of our recently proposed approach based on a Siamese neural network for meeting this challenge. By means of tag features and probabilistic topic models, the network captures contextualized semantics from tags via unsupervised learning. This leads to a distributed semantics space and a potential solution to the out-of-vocabulary problem, which has yet to be sufficiently addressed. We explore the nature of the resultant music-based semantics and address computational needs. We conduct experiments on three public music tag collections, namely CAL500, MagTag5K and Million Song Dataset, and compare our approach to a number of state-of-the-art semantics learning approaches. Comparative results suggest that this approach outperforms previous approaches in terms of semantic priming and music tag completion.
Zero Shot Learning (ZSL) enables a learning model to classify instances of a class unseen during training. While most research in ZSL focuses on single-label classification, few studies have addressed multi-label ZSL, where an instance is associated with a set of labels simultaneously, due to the difficulty of modeling the complex semantics conveyed by a set of labels. In this paper, we propose a novel approach to multi-label ZSL via concept embedding learned from collections of public users' annotations of multimedia. Thanks to concept embedding, multi-label ZSL can be done by efficiently mapping an instance's input features onto the concept embedding space in a manner similar to that used in single-label ZSL. Moreover, our semantic learning model is capable of embedding an out-of-vocabulary label by inferring its meaning from its co-occurring labels. Thus, our approach allows labels both seen and unseen during concept embedding learning to be used in the aforementioned instance mapping, which makes multi-label ZSL more flexible and suitable for real applications. Experimental results of multi-label ZSL on images and music tracks suggest that our approach outperforms a state-of-the-art multi-label ZSL model and can deal with a scenario involving out-of-vocabulary labels without re-training the semantics learning model.
Procedural content generation (PCG) is of great interest to game design and development as it generates game content automatically. Motivated by the recent learning-based PCG framework and other existing PCG works, we propose an alternative approach to online content generation and adaptation in Super Mario Bros (SMB). Unlike most existing works in SMB, our approach exploits the synergy between rule-based and learning-based methods to produce constructive primitives, which are quality yet controllable game segments in SMB. As a result, a complete quality game level can be generated online by integrating relevant constructive primitives via controllable parameters regarding geometrical features and procedure-level properties. Adaptive content can also be generated in real time by dynamically selecting proper constructive primitives via an adaptation criterion, e.g., dynamic difficulty adjustment (DDA). Our approach has several favorable properties in terms of content quality assurance, generation efficiency and controllability. Extensive simulation results demonstrate that the proposed approach can generate controllable yet quality game levels online and adaptable content for DDA in real time.
To overcome the weakness of a total variation based model for image restoration, various high order (typically second order) regularization models have been proposed and studied recently. In this paper we analyze and test a fractional-order derivative based total $\alpha$-order variation model, which can outperform the currently popular high order regularization models. There exist several previous works using total $\alpha$-order variations for image restoration; however, first, no analysis has yet been done and, second, all tested formulations, which differ from each other, utilize zero Dirichlet boundary conditions, which are not realistic (while non-zero boundary conditions violate the definitions of fractional-order derivatives). This paper first reviews some results on fractional-order derivatives and then rigorously analyzes the theoretical properties of the proposed total $\alpha$-order variational model. It then develops four algorithms for solving the variational problem, one based on the variational Split-Bregman idea and three based on direct solution of the discretise-optimization problem. Numerical experiments show that, in terms of restoration quality and solution efficiency, the proposed model can produce results for smooth images that are highly competitive with two established high order models: the mean curvature and the total generalized variation.
One of the biggest challenges in multimedia information retrieval and understanding is to bridge the semantic gap by properly modeling concept semantics in context. The presence of out-of-vocabulary (OOV) concepts exacerbates this difficulty. To address the semantic gap issues, we formulate the problem of learning contextualized semantics from descriptive terms and propose a novel Siamese architecture to model these semantics. By means of pattern aggregation and probabilistic topic models, our Siamese architecture captures contextualized semantics from the co-occurring descriptive terms via unsupervised learning, which leads to a concept embedding space of the terms in context. Furthermore, the co-occurring OOV concepts can be easily represented in the learnt concept embedding space. The main properties of the concept embedding space are demonstrated via visualization. Using various settings in semantic priming, we have carried out a thorough evaluation by comparing our approach to a number of state-of-the-art methods on six annotation corpora in different domains, i.e., MagTag5K, CAL500 and Million Song Dataset in the music domain as well as Corel5K, LabelMe and SUNDatabase in the image domain. Experimental results on semantic priming suggest that our approach outperforms those state-of-the-art methods considerably in various aspects.
Procedural content generation (PCG) has recently become one of the hottest topics in computational intelligence and game AI research. Among a variety of PCG techniques, search-based approaches (SBPCG) overwhelmingly dominate PCG development at present. While SBPCG leads to promising results and successful applications, it poses a number of challenges ranging from representation to evaluation of the content being generated. In this paper, we present an alternative yet generic PCG framework, named learning-based procedural content generation (LBPCG), to provide potential solutions to several challenging problems in existing PCG techniques. By exploring and exploiting information gained in game development and public beta tests via data-driven learning, our framework can generate robust content adaptable to end users or target players online with minimal interruption to their experience. Furthermore, we develop enabling techniques to implement the various models required in our framework. As a proof of concept, we have developed a prototype based on the classic open-source first-person shooter game, Quake. Simulation results suggest that our framework is promising in generating quality content.
We propose a novel feature selection strategy to discover language-independent acoustic features that tend to be responsible for emotions regardless of languages, linguistics and other factors. Experimental results suggest that the discovered language-independent feature subset yields performance comparable to the full feature set on various emotional speech corpora.
This paper presents our investigations on emotional state categorization from speech signals with a psychologically inspired computational model, compared against human performance under the same experimental setup. Based on psychological studies, we propose a multistage categorization strategy which allows an automatic categorization model to be established flexibly for a given emotional speech categorization task. We apply the strategy to the Serbian Emotional Speech Corpus (GEES) and the Danish Emotional Speech Corpus (DES), where human performance was reported in previous psychological studies. Our work is the first attempt to apply machine learning to the GEES corpus, for which only human recognition rates were available prior to our study. Unlike previous work on the DES corpus, our work focuses on a comparison to human performance under the same experimental settings. Our studies suggest that psychology-inspired systems yield behaviours that, to a great extent, resemble what humans perceive, and that their performance is close to that of humans under the same experimental setup. Furthermore, our work also uncovers some differences between machines and humans in terms of emotional state recognition from speech.
In this paper we examine a formalization of feature distribution learning (FDL) in information-theoretic terms relying on the analytical approach and on the tools already used in the study of the information bottleneck (IB). It has been conjectured that the behavior of FDL algorithms could be expressed as an optimization problem over two information-theoretic quantities: the mutual information of the data with the learned representations and the entropy of the learned distribution. In particular, such a formulation was offered in order to explain the success of the most prominent FDL algorithm, sparse filtering (SF). This conjecture was, however, left unproven. In this work, we aim at providing preliminary empirical support to this conjecture by performing experiments reminiscent of the work done on deep neural networks in the context of the IB research. Specifically, we borrow the idea of using information planes to analyze the behavior of the SF algorithm and gain insights on its dynamics. A confirmation of the conjecture about the dynamics of FDL may provide solid ground to develop information-theoretic tools to assess the quality of the learning process in FDL, and it may be extended to other unsupervised learning algorithms.
In this paper we formally analyse the use of sparse filtering algorithms to perform covariate shift adaptation. We provide a theoretical analysis of sparse filtering by evaluating the conditions required to perform covariate shift adaptation. We prove that sparse filtering can perform adaptation only if the conditional distribution of the labels has a structure explained by a cosine metric. To overcome this limitation, we propose a new algorithm, named periodic sparse filtering, and carry out the same theoretical analysis regarding covariate shift adaptation. We show that periodic sparse filtering can perform adaptation under the looser and more realistic requirement that the conditional distribution of the labels has a periodic structure, which may be satisfied, for instance, by user-dependent data sets. We experimentally validate our theoretical results on synthetic data. Moreover, we apply periodic sparse filtering to real-world data sets to demonstrate that this simple and computationally efficient algorithm is able to achieve competitive performance.
In this paper we present a theoretical analysis to understand sparse filtering, a recent and effective algorithm for unsupervised learning. The aim of this research is not to show whether or how well sparse filtering works, but to understand why and when sparse filtering does work. We provide a thorough theoretical analysis of sparse filtering and its properties, and further offer an experimental validation of the main outcomes of our theoretical analysis. We show that sparse filtering works by explicitly maximizing the entropy of the learned representation through the maximization of the proxy of sparsity, and by implicitly preserving mutual information between original and learned representations through the constraint of preserving a structure of the data, specifically the structure defined by relations of neighborhoodness under the cosine distance. Furthermore, we empirically validate our theoretical results with artificial and real data sets, and we apply our theoretical understanding to explain the success of sparse filtering on real-world problems. Our work provides a strong theoretical basis for understanding sparse filtering: it highlights assumptions and conditions for success behind this feature distribution learning algorithm, and provides insights for developing new feature distribution learning algorithms.
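For readers unfamiliar with the algorithm analysed above, the following is a minimal sketch of the sparse filtering objective as it is commonly formulated (Ngiam et al., 2011): a soft absolute value of linear features, L2 normalisation of each feature across examples, L2 normalisation of each example across features, and an L1 penalty. The abstract does not spell this out, so the formula, the smoothing constant `eps`, the function name, and the toy data shapes are assumptions for illustration.

```python
import numpy as np

def sparse_filtering_objective(W, X, eps=1e-8):
    """X has shape (n_inputs, n_examples); W has shape (n_features, n_inputs)."""
    F = np.sqrt((W @ X) ** 2 + eps)                   # soft absolute value of linear features
    F = F / np.linalg.norm(F, axis=1, keepdims=True)  # normalise each feature across examples
    F = F / np.linalg.norm(F, axis=0, keepdims=True)  # normalise each example to unit L2 norm
    return F.sum()                                    # L1 penalty (all entries are positive)

rng = np.random.default_rng(0)
n_inputs, n_examples, n_features = 6, 5, 3
X = rng.normal(size=(n_inputs, n_examples))  # toy data, one example per column
W = rng.normal(size=(n_features, n_inputs))  # toy weight matrix
value = sparse_filtering_objective(W, X)
print(value)
```

In practice this scalar would be minimised over W with a gradient-based optimiser. Because every example is rescaled to unit L2 norm, the value always lies between the number of examples and that number times the square root of the number of features.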