Models, code, and papers for "Gus Xia":
We contribute a pop-song automation framework for lead melody generation and accompaniment arrangement. The framework reflects the major procedures of human music composition, generating both lead melody and piano accompaniment by a unified strategy. Specifically, we take chord progression as an input and propose three models to generate a structured melody with piano accompaniment textures. First, the harmony alternation model transforms a raw input chord progression to an altered one to better fit the specified music style. Second, the melody generation model generates the lead melody and other voices (melody lines) of the accompaniment using seasonal ARMA (Autoregressive Moving Average) processes. Third, the melody integration model integrates melody lines (voices) together as the final piano accompaniment. We evaluate the proposed framework using subjective listening tests. Experimental results show that the generated melodies are rated significantly higher than the ones generated by bi-directional LSTM, and our accompaniment arrangement result is comparable with a state-of-the-art commercial software, Band in a Box.
Variational Autoencoders(VAEs) have already achieved great results on image generation and recently made promising progress on music generation. However, the generation process is still quite difficult to control in the sense that the learned latent representations lack meaningful music semantics. It would be much more useful if people can modify certain music features, such as rhythm and pitch contour, via latent representations to test different composition ideas. In this paper, we propose a new method to inspect the pitch and rhythm interpretations of the latent representations and we name it disentanglement by augmentation. Based on the interpretable representations, an intuitive graphical user interface is designed for users to better direct the music creation process by manipulating the pitch contours and rhythmic complexity.
With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short sequence generation, symbolic music generation remains a challenging problem since the structure of compositions are usually complicated. In this study, we attempt to solve the melody generation problem constrained by the given chord progression. This music meta-creation problem can also be incorporated into a plan recognition system with user inputs and predictive structural outputs. In particular, we explore the effect of explicit architectural encoding of musical structure via comparing two sequential generative models: LSTM (a type of RNN) and WaveNet (dilated temporal-CNN). As far as we know, this is the first study of applying WaveNet to symbolic music generation, as well as the first systematic comparison between temporal-CNN and RNN for music generation. We conduct a survey for evaluation in our generations and implemented Variable Markov Oracle in music pattern discovery. Experimental results show that to encode structure more explicitly using a stack of dilated convolution layers improved the performance significantly, and a global encoding of underlying chord progression into the generation procedure gains even more.
Automated melodic phrase detection and segmentation is a classical task in content-based music information retrieval and also the key towards automated music structure analysis. However, traditional methods still cannot satisfy practical requirements. In this paper, we explore and adapt various neural network architectures to see if they can be generalized to work with the symbolic representation of music and produce satisfactory melodic phrase segmentation. The main issue of applying deep-learning methods to phrase detection is the sparse labeling problem of training sets. We proposed two tailored label engineering with corresponding training techniques for different neural networks in order to make decisions at a sequential level. Experiment results show that the CNN-CRF architecture performs the best, being able to offer finer segmentation and faster to train, while CNN, Bi-LSTM-CNN and Bi-LSTM-CRF are acceptable alternatives.
Analogy-making is a key method for computer algorithms to generate both natural and creative music pieces. In general, an analogy is made by partially transferring the music abstractions, i.e., high-level representations and their relationships, from one piece to another; however, this procedure requires disentangling music representations, which usually takes little effort for musicians but is non-trivial for computers. Three sub-problems arise: extracting latent representations from the observation, disentangling the representations so that each part has a unique semantic interpretation, and mapping the latent representations back to actual music. In this paper, we contribute an explicitly-constrained variational autoencoder (EC$^2$-VAE) as a unified solution to all three sub-problems. We focus on disentangling the pitch and rhythm representations of 8-beat music clips conditioned on chords. In producing music analogies, this model helps us to realize the imaginary situation of "what if" a piece is composed using a different pitch contour, rhythm pattern, or chord progression by borrowing the representations from other pieces. Finally, we validate the proposed disentanglement method using objective measurements and evaluate the analogy examples by a subjective study.
Human exhibit rich gender cues in both appearance and behavior. In computer vision domain, gender recognition from facial appearance have been extensively studied, while facial behavior based gender recognition studies remain rare. In this work, we first demonstrate that facial expressions influence the gender patterns presented in 3D face, and gender recognition performance increases when training and testing within the same expression. In further, we design experiments which directly extract the morphological changes resulted from facial expressions as features, for expression-based gender recognition. Experimental results demonstrate that gender can be recognized with considerable accuracy in Happy and Disgust expressions, while Surprise and Sad expressions do not convey much gender related information. This is the first work in the literature which investigates expression-based gender classification with 3D faces, and reveals the strength of gender patterns incorporated in different types of expressions, namely the Happy, the Disgust, the Surprise and the Sad expressions.