Models, code, and papers for "Junjie Yin":
In this paper, we propose Evebot, an innovative, sequence to sequence (Seq2seq) based, fully generative conversational system for the diagnosis of negative emotions and prevention of depression through positively suggestive responses. The system consists of an assembly of deep-learning based models, including Bi-LSTM based model for detecting negative emotions of users and obtaining psychological counselling related corpus for training the chatbot, anti-language sequence to sequence neural network, and maximum mutual information (MMI) model. As adolescents are reluctant to show their negative emotions in physical interaction, traditional methods of emotion analysis and comforting methods may not work. Therefore, this system puts emphasis on using virtual platform to detect signs of depression or anxiety, channel adolescents' stress and mood, and thus prevent the emergence of mental illness. We launched the integrated chatbot system onto an online platform for real-world campus applications. Through a one-month user study, we observe better results in the increase in positivity than other public chatbots in the control group.
In this paper, we present a hierarchical path planning framework called SG-RL (subgoal graphs-reinforcement learning), to plan rational paths for agents maneuvering in continuous and uncertain environments. By "rational", we mean (1) efficient path planning to eliminate first-move lags; (2) collision-free and smooth for agents with kinematic constraints satisfied. SG-RL works in a two-level manner. At the first level, SG-RL uses a geometric path-planning method, i.e., Simple Subgoal Graphs (SSG), to efficiently find optimal abstract paths, also called subgoal sequences. At the second level, SG-RL uses an RL method, i.e., Least-Squares Policy Iteration (LSPI), to learn near-optimal motion-planning policies which can generate kinematically feasible and collision-free trajectories between adjacent subgoals. The first advantage of the proposed method is that SSG can solve the limitations of sparse reward and local minima trap for RL agents; thus, LSPI can be used to generate paths in complex environments. The second advantage is that, when the environment changes slightly (i.e., unexpected obstacles appearing), SG-RL does not need to reconstruct subgoal graphs and replan subgoal sequences using SSG, since LSPI can deal with uncertainties by exploiting its generalization ability to handle changes in environments. Simulation experiments in representative scenarios demonstrate that, compared with existing methods, SG-RL can work well on large-scale maps with relatively low action-switching frequencies and shorter path lengths, and SG-RL can deal with small changes in environments. We further demonstrate that the design of reward functions and the types of training environments are important factors for learning feasible policies.
Object recognition techniques using convolutional neural networks (CNN) have achieved great success. However, state-of-the-art object detection methods still perform poorly on large vocabulary and long-tailed datasets, e.g. LVIS. In this work, we analyze this problem from a novel perspective: each positive sample of one category can be seen as a negative sample for other categories, making the tail categories receive more discouraging gradients. Based on it, we propose a simple but effective loss, named equalization loss, to tackle the problem of long-tailed rare categories by simply ignoring those gradients for rare categories. The equalization loss protects the learning of rare categories from being at a disadvantage during the network parameter updating. Thus the model is capable of learning better discriminative features for objects of rare classes. Without any bells and whistles, our method achieves AP gains of 4.1% and 4.8% for the rare and common categories on the challenging LVIS benchmark, compared to the Mask R-CNN baseline. With the utilization of the effective equalization loss, we finally won the 1st place in the LVIS Challenge 2019. Code has been made available at: https: //github.com/tztztztztz/eql.detectron2
In Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech. Building a typical pipeline-based front-end which consists of multiple individual components requires extensive efforts. In this paper, we proposed a unified sequence-to-sequence front-end model for Mandarin TTS that converts raw texts to linguistic features directly. Compared to the pipeline-based front-end, our unified front-end can achieve comparable performance in polyphone disambiguation and prosody word prediction, and improve intonation phrase prediction by 0.0738 in F1 score. We also implemented the unified front-end with Tacotron and WaveRNN to build a Mandarin TTS system. The synthesized speech by that got a comparable MOS (4.38) with the pipeline-based front-end (4.37) and close to human recordings (4.49).
In this paper, we propose a hybrid text normalization system using multi-head self-attention. The system combines the advantages of a rule-based model and a neural model for text preprocessing tasks. Previous studies in Mandarin text normalization usually use a set of hand-written rules, which are hard to improve on general cases. The idea of our proposed system is motivated by the neural models from recent studies and has a better performance on our internal news corpus. This paper also includes different attempts to deal with imbalanced pattern distribution of the dataset. Overall, the performance of the system is improved by over 1.5% on sentence-level and it has a potential to improve further.
A novel centerline extraction framework is reported which combines an end-to-end trainable multi-task fully convolutional network (FCN) with a minimal path extractor. The FCN simultaneously computes centerline distance maps and detects branch endpoints. The method generates single-pixel-wide centerlines with no spurious branches. It handles arbitrary tree-structured object with no prior assumption regarding depth of the tree or its bifurcation pattern. It is also robust to substantial scale changes across different parts of the target object and minor imperfections of the object's segmentation mask. To the best of our knowledge, this is the first deep-learning based centerline extraction method that guarantees single-pixel-wide centerline for a complex tree-structured object. The proposed method is validated in coronary artery centerline extraction on a dataset of 620 patients (400 of which used as test set). This application is challenging due to the large number of coronary branches, branch tortuosity, and large variations in length, thickness, shape, etc. The proposed method generates well-positioned centerlines, exhibiting lower number of missing branches and is more robust in the presence of minor imperfections of the object segmentation mask. Compared to a state-of-the-art traditional minimal path approach, our method improves patient-level success rate of centerline extraction from 54.3% to 88.8% according to independent human expert review.
Computerized automatic methods have been employed to boost the productivity as well as objectiveness of hand bone age assessment. These approaches make predictions according to the whole X-ray images, which include other objects that may introduce distractions. Instead, our framework is inspired by the clinical workflow (Tanner-Whitehouse) of hand bone age assessment, which focuses on the key components of the hand. The proposed framework is composed of two components: a Mask R-CNN subnet of pixelwise hand segmentation and a residual attention network for hand bone age assessment. The Mask R-CNN subnet segments the hands from X-ray images to avoid the distractions of other objects (e.g., X-ray tags). The hierarchical attention components of the residual attention subnet force our network to focus on the key components of the X-ray images and generate the final predictions as well as the associated visual supports, which is similar to the assessment procedure of clinicians. We evaluate the performance of the proposed pipeline on the RSNA pediatric bone age dataset and the results demonstrate its superiority over the previous methods.
A common difficulty in applications of machine learning is the lack of any general principle for guiding the choices of key parameters of the underlying neural network. Focusing on a class of recurrent neural networks - reservoir computing systems that have recently been exploited for model-free prediction of nonlinear dynamical systems, we uncover a surprising phenomenon: the emergence of an interval in the spectral radius of the neural network in which the prediction error is minimized. In a three-dimensional representation of the error versus time and spectral radius, the interval corresponds to the bottom region of a "valley." Such a valley arises for a variety of spatiotemporal dynamical systems described by nonlinear partial differential equations, regardless of the structure and the edge-weight distribution of the underlying reservoir network. We also find that, while the particular location and size of the valley would depend on the details of the target system to be predicted, the interval tends to be larger for undirected than for directed networks. The valley phenomenon can be beneficial to the design of optimal reservoir computing, representing a small step forward in understanding these machine-learning systems.
Fine-grained image classification is to recognize hundreds of subcategories belonging to the same basic-level category, such as 200 subcategories belonging to the bird, which is highly challenging due to large variance in the same subcategory and small variance among different subcategories. Existing methods generally first locate the objects or parts and then discriminate which subcategory the image belongs to. However, they mainly have two limitations: (1) Relying on object or part annotations which are heavily labor consuming. (2) Ignoring the spatial relationships between the object and its parts as well as among these parts, both of which are significantly helpful for finding discriminative parts. Therefore, this paper proposes the object-part attention model (OPAM) for weakly supervised fine-grained image classification, and the main novelties are: (1) Object-part attention model integrates two level attentions: object-level attention localizes objects of images, and part-level attention selects discriminative parts of object. Both are jointly employed to learn multi-view and multi-scale features to enhance their mutual promotions. (2) Object-part spatial constraint model combines two spatial constraints: object spatial constraint ensures selected parts highly representative, and part spatial constraint eliminates redundancy and enhances discrimination of selected parts. Both are jointly employed to exploit the subtle and local differences for distinguishing the subcategories. Importantly, neither object nor part annotations are used in our proposed approach, which avoids the heavy labor consumption of labeling. Comparing with more than 10 state-of-the-art methods on 4 widely-used datasets, our OPAM approach achieves the best performance.
As a popular meta-learning approach, the model-agnostic meta-learning (MAML) algorithm has been widely used due to its simplicity and effectiveness. However, the convergence of the general multi-step MAML still remains unexplored. In this paper, we develop a new theoretical framework, under which we characterize the convergence rate and the computational complexity of multi-step MAML. Our results indicate that $N$-step MAML attains the convergence with linearly increasing complexity with $N$ under a properly chosen inner stepsize. We then take a further step to develop a more efficient Hessian-free MAML. We first show that the existing zeroth-order Hessian estimator contains a constant-level estimation error so that the MAML algorithm can perform unstably. To address this issue, we propose a novel Hessian estimator via a gradient-based Gaussian smoothing method, and show that it achieves a much smaller estimation bias and variance, and the resulting algorithm achieves the same performance guarantee as the original MAML under mild conditions. Our experiments validate our theory and demonstrate the effectiveness of the proposed Hessian estimator.
Recently, convolutional neural networks (CNNs) have shown great success on the task of monocular depth estimation. A fundamental yet unanswered question is: how CNNs can infer depth from a single image. Toward answering this question, we consider visualization of inference of a CNN by identifying relevant pixels of an input image to depth estimation. We formulate it as an optimization problem of identifying the smallest number of image pixels from which the CNN can estimate a depth map with the minimum difference from the estimate from the entire image. To cope with a difficulty with optimization through a deep CNN, we propose to use another network to predict those relevant image pixels in a forward computation. In our experiments, we first show the effectiveness of this approach, and then apply it to different depth estimation networks on indoor and outdoor scene datasets. The results provide several findings that help exploration of the above question.
In financial markets, abnormal trading behaviors pose a serious challenge to market surveillance and risk management. What is worse, there is an increasing emergence of abnormal trading events that some experienced traders constitute a collusive clique and collaborate to manipulate some instruments, thus mislead other investors by applying similar trading behaviors for maximizing their personal benefits. In this paper, a method is proposed to detect the hidden collusive cliques involved in an instrument of future markets by first calculating the correlation coefficient between any two eligible unified aggregated time series of signed order volume, and then combining the connected components from multiple sparsified weighted graphs constructed by using the correlation matrices where each correlation coefficient is over a user-specified threshold. Experiments conducted on real order data from the Shanghai Futures Exchange show that the proposed method can effectively detect suspect collusive cliques. A tool based on the proposed method has been deployed in the exchange as a pilot application for futures market surveillance and risk management.
Real-time motion detection in non-stationary scenes is a difficult task due to dynamic background, changing foreground appearance and limited computational resource. These challenges degrade the performance of the existing methods in practical applications. In this paper, an optical flow based framework is proposed to address this problem. By applying a novel strategy to utilize optical flow, we enable our method being free of model constructing, training or updating and can be performed efficiently. Besides, a dual judgment mechanism with adaptive intervals and adaptive thresholds is designed to heighten the system's adaptation to different situations. In experiment part, we quantitatively and qualitatively validate the effectiveness and feasibility of our method with videos in various scene conditions. The experimental results show that our method adapts itself to different situations and outperforms the state-of-the-art real-time methods, indicating the advantages of our optical flow based method.
Graph or network data is ubiquitous in the real world, including social networks, information networks, traffic networks, biological networks and various technical networks. The non-Euclidean nature of graph data poses the challenge for modeling and analyzing graph data. Recently, Graph Neural Network (GNN) is proposed as a general and powerful framework to handle tasks on graph data, e.g., node embedding, link prediction and node classification. As a representative implementation of GNNs, Graph Attention Networks (GAT) is successfully applied in a variety of tasks on real datasets. However, GAT is designed to networks with only positive links and fails to handle signed networks which contain both positive and negative links. In this paper, we propose Signed Graph Attention Networks (SiGAT), generalizing GAT to signed networks. SiGAT incorporates graph motifs into GAT to capture two well-known theories in signed network research, i.e., balance theory and status theory. In SiGAT, motifs offer us the flexible structural pattern to aggregate and propagate messages on the signed network to generate node embeddings. We evaluate the proposed SiGAT method by applying it to the signed link prediction task. Experimental results on three real datasets demonstrate that SiGAT outperforms feature-based method, network embedding method and state-of-the-art GNN-based methods like signed graph convolutional networks (SGCN).
Video object detection is more challenging compared to image object detection. Previous works proved that applying object detector frame by frame is not only slow but also inaccurate. Visual clues get weakened by defocus and motion blur, causing failure on corresponding frames. Multi-frame feature fusion methods proved effective in improving the accuracy, but they dramatically sacrifice the speed. Feature propagation based methods proved effective in improving the speed, but they sacrifice the accuracy. So is it possible to improve speed and performance simultaneously? Inspired by how human utilize impression to recognize objects from blurry frames, we propose Impression Network that embodies a natural and efficient feature aggregation mechanism. In our framework, an impression feature is established by iteratively absorbing sparsely extracted frame features. The impression feature is propagated all the way down the video, helping enhance features of low-quality frames. This impression mechanism makes it possible to perform long-range multi-frame feature fusion among sparse keyframes with minimal overhead. It significantly improves per-frame detection baseline on ImageNet VID while being 3 times faster (20 fps). We hope Impression Network can provide a new perspective on video feature enhancement. Code will be made available.
This paper concerns the problem of detecting the use of information hiding at anti-copying 2D barcodes. Prior hidden information detection schemes are either heuristicbased or Machine Learning (ML) based. The key limitation of prior heuristics-based schemes is that they do not answer the fundamental question of why the information hidden at a 2D barcode can be detected. The key limitation of prior MLbased information schemes is that they lack robustness because a printed 2D barcode is very much environmentally dependent, and thus an information hiding detection scheme trained in one environment often does not work well in another environment. In this paper, we propose two hidden information detection schemes at the existing anti-copying 2D barcodes. The first scheme is to directly use the pixel distance to detect the use of an information hiding scheme in a 2D barcode, referred as to the Pixel Distance Based Detection (PDBD) scheme. The second scheme is first to calculate the variance of the raw signal and the covariance between the recovered signal and the raw signal, and then based on the variance results, detects the use of information hiding scheme in a 2D barcode, referred as to the Pixel Variance Based Detection (PVBD) scheme. Moreover, we design advanced IC attacks to evaluate the security of two existing anti-copying 2D barcodes. We implemented our schemes and conducted extensive performance comparison between our schemes and prior schemes under different capturing devices, such as a scanner and a camera phone. Our experimental results show that the PVBD scheme can correctly detect the existence of the hidden information at both the 2LQR code and the LCAC 2D barcode. Moreover, the probability of successfully attacking of our IC attacks achieves 0.6538 for the 2LQR code and 1 for the LCAC 2D barcode.
Recent years have witnessed the successful marriage of finance innovations and AI techniques in various finance applications including quantitative trading (QT). Despite great research efforts devoted to leveraging deep learning (DL) methods for building better QT strategies, existing studies still face serious challenges especially from the side of finance, such as the balance of risk and return, the resistance to extreme loss, and the interpretability of strategies, which limit the application of DL-based strategies in real-life financial markets. In this work, we propose AlphaStock, a novel reinforcement learning (RL) based investment strategy enhanced by interpretable deep attention networks, to address the above challenges. Our main contributions are summarized as follows: i) We integrate deep attention networks with a Sharpe ratio-oriented reinforcement learning framework to achieve a risk-return balanced investment strategy; ii) We suggest modeling interrelationships among assets to avoid selection bias and develop a cross-asset attention mechanism; iii) To our best knowledge, this work is among the first to offer an interpretable investment strategy using deep reinforcement learning models. The experiments on long-periodic U.S. and Chinese markets demonstrate the effectiveness and robustness of AlphaStock over diverse market states. It turns out that AlphaStock tends to select the stocks as winners with high long-term growth, low volatility, high intrinsic value, and being undervalued recently.
Object detection is a fundamental problem in image understanding. One popular solution is the R-CNN framework and its fast versions. They decompose the object detection problem into two cascaded easier tasks: 1) generating object proposals from images, 2) classifying proposals into various object categories. Despite that we are handling with two relatively easier tasks, they are not solved perfectly and there's still room for improvement. In this paper, we push the "divide and conquer" solution even further by dividing each task into two sub-tasks. We call the proposed method "CRAFT" (Cascade Region-proposal-network And FasT-rcnn), which tackles each task with a carefully designed network cascade. We show that the cascade structure helps in both tasks: in proposal generation, it provides more compact and better localized object proposals; in object classification, it reduces false positives (mainly between ambiguous categories) by capturing both inter- and intra-category variances. CRAFT achieves consistent and considerable improvement over the state-of-the-art on object detection benchmarks like PASCAL VOC 07/12 and ILSVRC.
Stochastic gradient descent (SGD) has been found to be surprisingly effective in training a variety of deep neural networks. However, there is still a lack of understanding on how and why SGD can train these complex networks towards a global minimum. In this study, we establish the convergence of SGD to a global minimum for nonconvex optimization problems that are commonly encountered in neural network training. Our argument exploits the following two important properties: 1) the training loss can achieve zero value (approximately), which has been widely observed in deep learning; 2) SGD follows a star-convex path, which is verified by various experiments in this paper. In such a context, our analysis shows that SGD, although has long been considered as a randomized algorithm, converges in an intrinsically deterministic manner to a global minimum.