Models, code, and papers for "Yi Zeng":
Knowledge plays a central role in human and artificial intelligence. One of the key characteristics of knowledge is its structured organization. Knowledge can be and should be presented in multiple levels and multiple views to meet people's needs in different levels of granularities and from different perspectives. In this paper, we stand on the view point of granular computing and provide our understanding on multi-level and multi-view of knowledge through granular knowledge structures (GKS). Representation of granular knowledge structures, operations for building granular knowledge structures and how to use them are investigated. As an illustration, we provide some examples through results from an analysis of proceeding papers. Results show that granular knowledge structures could help users get better understanding of the knowledge source from set theoretical, logical and visual point of views. One may consider using them to meet specific needs or solve certain kinds of problems.
Named entity recognition (NER) is a foundational technology for information extraction. This paper presents a flexible NER framework compatible with different languages and domains. Inspired by the idea of distant supervision (DS), this paper enhances the representation by increasing the entity-context diversity without relying on external resources. We choose different layer stacks and sub-network combinations to construct the bilateral networks. This strategy can generally improve model performance on different datasets. We conduct experiments on five languages, such as English, German, Spanish, Dutch and Chinese, and biomedical fields, such as identifying the chemicals and gene/protein terms from scientific works. Experimental results demonstrate the good performance of this framework.
Artificial Intelligence principles define social and ethical considerations to develop future AI. They come from research institutes, government organizations and industries. All versions of AI principles are with different considerations covering different perspectives and making different emphasis. None of them can be considered as complete and can cover the rest AI principle proposals. Here we introduce LAIP, an effort and platform for linking and analyzing different Artificial Intelligence Principles. We want to explicitly establish the common topics and links among AI Principles proposed by different organizations and investigate on their uniqueness. Based on these efforts, for the long-term future of AI, instead of directly adopting any of the AI principles, we argue for the necessity of incorporating various AI Principles into a comprehensive framework and focusing on how they can interact and complete each other.
As a fundamental problem in many different fields, link prediction aims to estimate the likelihood of an existing link between two nodes based on the observed information. Since this problem is related to many applications ranging from uncovering missing data to predicting the evolution of networks, link prediction has been intensively investigated recently and many methods have been proposed so far. The essential challenge of link prediction is to estimate the similarity between nodes. Most of the existing methods are based on the common neighbor index and its variants. In this paper, we propose to calculate the similarity between nodes by the correlation coefficient. This method is found to be very effective when applied to calculate similarity based on high order paths. We finally fuse the correlation-based method with the resource allocation method, and find that the combined method can substantially outperform the existing methods, especially in sparse networks.
Facial recognition is changing the way we live in and interact with our society. Here we discuss the two sides of facial recognition, summarizing potential risks and current concerns. We introduce current policies and regulations in different countries. Very importantly, we point out that the risks and concerns are not only from facial recognition, but also realistically very similar to other biometric recognition technology, including but not limited to gait recognition, iris recognition, fingerprint recognition, voice recognition, etc. To create a responsible future, we discuss possible technological moves and efforts that should be made to keep facial recognition (and biometric recognition in general) developing for social good.
Since many online music services emerged in recent years so that effective music recommendation systems are desirable. Some common problems in recommendation system like feature representations, distance measure and cold start problems are also challenges for music recommendation. In this paper, I proposed a triplet neural network, exploiting both positive and negative samples to learn the representation and distance measure between users and items, to solve the recommendation task.
With the emergence of diverse data collection techniques, objects in real applications can be represented as multi-modal features. What's more, objects may have multiple semantic meanings. Multi-modal and Multi-label (MMML) problem becomes a universal phenomenon. The quality of data collected from different channels are inconsistent and some of them may not benefit for prediction. In real life, not all the modalities are needed for prediction. As a result, we propose a novel instance-oriented Multi-modal Classifier Chains (MCC) algorithm for MMML problem, which can make convince prediction with partial modalities. MCC extracts different modalities for different instances in the testing phase. Extensive experiments are performed on one real-world herbs dataset and two public datasets to validate our proposed algorithm, which reveals that it may be better to extract many instead of all of the modalities at hand.
Deep neural network based methods have made a significant breakthrough in salient object detection. However, they are typically limited to input images with low resolutions ($400\times400$ pixels or less). Little effort has been made to train deep neural networks to directly handle salient object detection in very high-resolution images. This paper pushes forward high-resolution saliency detection, and contributes a new dataset, named High-Resolution Salient Object Detection (HRSOD). To our best knowledge, HRSOD is the first high-resolution saliency detection dataset to date. As another contribution, we also propose a novel approach, which incorporates both global semantic information and local high-resolution details, to address this challenging task. More specifically, our approach consists of a Global Semantic Network (GSN), a Local Refinement Network (LRN) and a Global-Local Fusion Network (GLFN). GSN extracts the global semantic information based on down-sampled entire image. Guided by the results of GSN, LRN focuses on some local regions and progressively produces high-resolution predictions. GLFN is further proposed to enforce spatial consistency and boost performance. Experiments illustrate that our method outperforms existing state-of-the-art methods on high-resolution saliency datasets by a large margin, and achieves comparable or even better performance than them on widely-used saliency benchmarks. The HRSOD dataset is available at https://github.com/yi94code/HRSOD.
This paper focuses on network pruning for image retrieval acceleration. Prevailing image retrieval works target at the discriminative feature learning, while little attention is paid to how to accelerate the model inference, which should be taken into consideration in real-world practice. The challenge of pruning image retrieval models is that the middle-level feature should be preserved as much as possible. Such different requirements of the retrieval and classification model make the traditional pruning methods not that suitable for our task. To solve the problem, we propose a new Progressive Local Filter Pruning (PLFP) method for image retrieval acceleration. Specifically, layer by layer, we analyze the local geometric properties of each filter and select the one that can be replaced by the neighbors. Then we progressively prune the filter by gradually changing the filter weights. In this way, the representation ability of the model is preserved. To verify this, we evaluate our method on two widely-used image retrieval datasets,i.e., Oxford5k and Paris6K, and one person re-identification dataset,i.e., Market-1501. The proposed method arrives with superior performance to the conventional pruning methods, suggesting the effectiveness of the proposed method for image retrieval.
Session based model is widely used in recommend system. It use the user click sequence as input of a Recurrent Neural Network (RNN), and get the output of the RNN network as the vector embedding of the session, and use the inner product of the vector embedding of session and the vector embedding of the next item as the score that is the metric of the interest to the next item. This method can be used for the "match" stage for the recommendation system whose item number is very big by using some index method like KD-Tree or Ball-Tree and etc.. But this method repudiate the variousness of the interest of user in a session. We generated the model to modify the vector embedding of session to a symmetric matrix embedding, that is equivalent to a quadratic form on the vector space of items. The score is builded as the value of the vector embedding of next item under the quadratic form. The eigenvectors of the symmetric matrix embedding corresponding to the positive eigenvalues are conjectured to represent the interests of user in the session. This method can be used for the "match" stage also. The experiments show that this method is better than the method of vector embedding.
We present a real-time method for synthesizing highly complex human motions using a novel training regime we call the auto-conditioned Recurrent Neural Network (acRNN). Recently, researchers have attempted to synthesize new motion by using autoregressive techniques, but existing methods tend to freeze or diverge after a couple of seconds due to an accumulation of errors that are fed back into the network. Furthermore, such methods have only been shown to be reliable for relatively simple human motions, such as walking or running. In contrast, our approach can synthesize arbitrary motions with highly complex styles, including dances or martial arts in addition to locomotion. The acRNN is able to accomplish this by explicitly accommodating for autoregressive noise accumulation during training. Our work is the first to our knowledge that demonstrates the ability to generate over 18,000 continuous frames (300 seconds) of new complex human motion w.r.t. different styles.
This paper aims to develop a new and robust approach to feature representation. Motivated by the success of Auto-Encoders, we first theoretical summarize the general properties of all algorithms that are based on traditional Auto-Encoders: 1) The reconstruction error of the input can not be lower than a lower bound, which can be viewed as a guiding principle for reconstructing the input. Additionally, when the input is corrupted with noises, the reconstruction error of the corrupted input also can not be lower than a lower bound. 2) The reconstruction of a hidden representation achieving its ideal situation is the necessary condition for the reconstruction of the input to reach the ideal state. 3) Minimizing the Frobenius norm of the Jacobian matrix of the hidden representation has a deficiency and may result in a much worse local optimum value. We believe that minimizing the reconstruction error of the hidden representation is more robust than minimizing the Frobenius norm of the Jacobian matrix of the hidden representation. Based on the above analysis, we propose a new model termed Double Denoising Auto-Encoders (DDAEs), which uses corruption and reconstruction on both the input and the hidden representation. We demonstrate that the proposed model is highly flexible and extensible and has a potentially better capability to learn invariant and robust feature representations. We also show that our model is more robust than Denoising Auto-Encoders (DAEs) for dealing with noises or inessential features. Furthermore, we detail how to train DDAEs with two different pre-training methods by optimizing the objective function in a combined and separate manner, respectively. Comparative experiments illustrate that the proposed model is significantly better for representation learning than the state-of-the-art models.
Narrative story generation is a challenging problem because it demands the generated sentences with tight semantic connections, which has not been well studied by most existing generative models. To address this problem, we propose a skeleton-based model to promote the coherence of generated stories. Different from traditional models that generate a complete sentence at a stroke, the proposed model first generates the most critical phrases, called skeleton, and then expands the skeleton to a complete and fluent sentence. The skeleton is not manually defined, but learned by a reinforcement learning method. Compared to the state-of-the-art models, our skeleton-based model can generate significantly more coherent text according to human evaluation and automatic evaluation. The G-score is improved by 20.1% in the human evaluation. The code is available at https://github.com/lancopku/Skeleton-Based-Generation-Model
This paper aims to develop a new architecture that can make full use of the feature maps of convolutional networks. To this end, we study a number of methods for video-based person re-identification and make the following findings: 1) Max-pooling only focuses on the maximum value of a receptive field, wasting a lot of information. 2) Networks with different streams even including the one with the worst performance work better than networks with same streams, where each one has the best performance alone. 3) A full connection layer at the end of convolutional networks is not necessary. Based on these studies, we propose a new convolutional architecture termed Three-Stream Convolutional Networks (TSCN). It first uses different streams to learn different aspects of feature maps for attentive spatio-temporal fusion of video, and then merges them together to study some union features. To further utilize the feature maps, two architectures are designed by using the strategies of multi-scale and upsampling. Comparative experiments on iLIDS-VID, PRID-2011 and MARS datasets illustrate that the proposed architectures are significantly better for feature extraction than the state-of-the-art models.
In recent years, the mean field theory has been applied to the study of neural networks and has achieved a great deal of success. The theory has been applied to various neural network structures, including CNNs, RNNs, Residual networks, and Batch normalization. Inevitably, recent work has also covered the use of dropout. The mean field theory shows that the existence of depth scales that limit the maximum depth of signal propagation and gradient backpropagation. However, the gradient backpropagation is derived under the gradient independence assumption that weights used during feed forward are drawn independently from the ones used in backpropagation. This is not how neural networks are trained in a real setting. Instead, the same weights used in a feed-forward step needs to be carried over to its corresponding backpropagation. Using this realistic condition, we perform theoretical computation on linear dropout networks and a series of experiments on dropout networks. Our empirical results show an interesting phenomenon that the length gradients can backpropagate for a single input and a pair of inputs are governed by the same depth scale. Besides, we study the relationship between variance and mean of statistical metrics of the gradient and shown an emergence of universality. Finally, we investigate the maximum trainable length for deep dropout networks through a series of experiments using MNIST and CIFAR10 and provide a more precise empirical formula that describes the trainable length than original work.
With more encrypted network traffic gets involved in the Internet, how to effectively identify network traffic has become a top priority in the field. Accurate identification of the network traffic is the footstone of basic network services, say QoE, bandwidth allocation, and IDS. Previous identification methods either cannot deal with encrypted traffics or require experts to select tons of features to attain a relatively decent accuracy.In this paper, we present a Deep Learning based end-to-end network traffic identification framework, termed TEST, to avoid the aforementioned problems. CNN and LSTM are combined and implemented to help the machine automatically extract features from both special and time-related features of the raw traffic. The presented framework has two layers of structure, which made it possible to attain a remarkable accuracy on both encrypted traffic classification and intrusion detection tasks. The experimental results demonstrate that our model can outperform previous methods with a state-of-the-art accuracy of 99.98%.
In this study, a multi-task deep neural network is proposed for skin lesion analysis. The proposed multi-task learning model solves different tasks (e.g., lesion segmentation and two independent binary lesion classifications) at the same time by exploiting commonalities and differences across tasks. This results in improved learning efficiency and potential prediction accuracy for the task-specific models, when compared to training the individual models separately. The proposed multi-task deep learning model is trained and evaluated on the dermoscopic image sets from the International Skin Imaging Collaboration (ISIC) 2017 Challenge - Skin Lesion Analysis towards Melanoma Detection, which consists of 2000 training samples and 150 evaluation samples. The experimental results show that the proposed multi-task deep learning model achieves promising performances on skin lesion segmentation and classification. The average value of Jaccard index for lesion segmentation is 0.724, while the average values of area under the receiver operating characteristic curve (AUC) on two individual lesion classifications are 0.880 and 0.972, respectively.
In this work, we propose a very simple deep learning network for image classification which comprises only the very basic data processing components: cascaded principal component analysis (PCA), binary hashing, and block-wise histograms. In the proposed architecture, PCA is employed to learn multistage filter banks. It is followed by simple binary hashing and block histograms for indexing and pooling. This architecture is thus named as a PCA network (PCANet) and can be designed and learned extremely easily and efficiently. For comparison and better understanding, we also introduce and study two simple variations to the PCANet, namely the RandNet and LDANet. They share the same topology of PCANet but their cascaded filters are either selected randomly or learned from LDA. We have tested these basic networks extensively on many benchmark visual datasets for different tasks, such as LFW for face verification, MultiPIE, Extended Yale B, AR, FERET datasets for face recognition, as well as MNIST for hand-written digits recognition. Surprisingly, for all tasks, such a seemingly naive PCANet model is on par with the state of the art features, either prefixed, highly hand-crafted or carefully learned (by DNNs). Even more surprisingly, it sets new records for many classification tasks in Extended Yale B, AR, FERET datasets, and MNIST variations. Additional experiments on other public datasets also demonstrate the potential of the PCANet serving as a simple but highly competitive baseline for texture classification and object recognition.
Telecommunication (Telco) outdoor position recovery aims to localize outdoor mobile devices by leveraging measurement report (MR) data. Unfortunately, Telco position recovery requires sufficient amount of MR samples across different areas and suffers from high data collection cost. For an area with scarce MR samples, it is hard to achieve good accuracy. In this paper, by leveraging the recently developed transfer learning techniques, we design a novel Telco position recovery framework, called TLoc, to transfer good models in the carefully selected source domains (those fine-grained small subareas) to a target one which originally suffers from poor localization accuracy. Specifically, TLoc introduces three dedicated components: 1) a new coordinate space to divide an area of interest into smaller domains, 2) a similarity measurement to select best source domains, and 3) an adaptation of an existing transfer learning approach. To the best of our knowledge, TLoc is the first framework that demonstrates the efficacy of applying transfer learning in the Telco outdoor position recovery. To exemplify, on the 2G GSM and 4G LTE MR datasets in Shanghai, TLoc outperforms a nontransfer approach by 27.58% and 26.12% less median errors, and further leads to 47.77% and 49.22% less median errors than a recent fingerprinting approach NBL.