The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017) who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this article, we study the effect of code duplication to machine learning models showing that reported metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present an "errata" for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them, and release tools to help the community avoid this problem in future research.

Click to Read Paper
Deep Neural Networks have been shown to succeed at a range of natural language tasks such as machine translation and text summarization. While tasks on source code (ie, formal languages) have been considered recently, most work in this area does not attempt to capitalize on the unique opportunities offered by its known syntax and structure. In this work, we introduce SmartPaste, a first task that requires to use such information. The task is a variant of the program repair problem that requires to adapt a given (pasted) snippet of code to surrounding, existing source code. As first solutions, we design a set of deep neural models that learn to represent the context of each variable location and variable usage in a data flow-sensitive way. Our evaluation suggests that our models can learn to solve the SmartPaste task in many cases, achieving 58.6% accuracy, while learning meaningful representation of variable usages.

Click to Read Paper
Summarization of long sequences into a concise statement is a core problem in natural language processing, requiring non-trivial understanding of the input. Based on the promising results of graph neural networks on highly structured data, we develop a framework to extend existing sequence encoders with a graph component that can reason about long-distance relationships in weakly structured data such as text. In an extensive evaluation, we show that the resulting hybrid sequence-graph models outperform both pure sequence models as well as pure graph models on a range of summarization tasks.

Click to Read Paper
Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code's known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures. In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects.

* Published in ICLR 2018. arXiv admin note: text overlap with arXiv:1705.07867
Click to Read Paper
Attention mechanisms in neural networks have proved useful for problems in which the input and output do not have fixed dimension. Often there exist features that are locally translation invariant and would be valuable for directing the model's attention, but previous attentional architectures are not constructed to learn such features specifically. We introduce an attentional neural network that employs convolution on the input tokens to detect local time-invariant and long-range topical attention features in a context-dependent way. We apply this architecture to the problem of extreme summarization of source code snippets into short, descriptive function name-like summaries. Using those features, the model sequentially generates a summary by marginalizing over two attention mechanisms: one that predicts the next summary token based on the attention weights of the input tokens and another that is able to copy a code token as-is directly into the summary. We demonstrate our convolutional attention neural network's performance on 10 popular Java projects showing that it achieves better performance compared to previous attentional mechanisms.

* Code, data and visualization at http://groups.inf.ed.ac.uk/cup/codeattention/
Click to Read Paper
Combining abstract, symbolic reasoning with continuous neural reasoning is a grand challenge of representation learning. As a step in this direction, we propose a new architecture, called neural equivalence networks, for the problem of learning continuous semantic representations of algebraic and logical expressions. These networks are trained to represent semantic equivalence, even of expressions that are syntactically very different. The challenge is that semantic representations must be computed in a syntax-directed manner, because semantics is compositional, but at the same time, small changes in syntax can lead to very large changes in semantics, which can be difficult for continuous neural architectures. We perform an exhaustive evaluation on the task of checking equivalence on a highly diverse class of symbolic algebraic and boolean expression types, showing that our model significantly outperforms existing architectures.

* Accepted to ICML 2017
Click to Read Paper
In recent years, multi-label classification has attracted a significant body of research, motivated by real-life applications, such as text classification and medical diagnoses. Although sparsely studied in this context, Learning Classifier Systems are naturally well-suited to multi-label classification problems, whose search space typically involves multiple highly specific niches. This is the motivation behind our current work that introduces a generalized multi-label rule format -- allowing for flexible label-dependency modeling, with no need for explicit knowledge of which correlations to search for -- and uses it as a guide for further adapting the general Michigan-style supervised Learning Classifier System framework. The integration of the aforementioned rule format and framework adaptations results in a novel algorithm for multi-label classification whose behavior is studied through a set of properly defined artificial problems. The proposed algorithm is also thoroughly evaluated on a set of multi-label datasets and found competitive to other state-of-the-art multi-label classification methods.

Click to Read Paper
Graphs are ubiquitous data structures for representing interactions between entities. With an emphasis on the use of graphs to represent chemical molecules, we explore the task of learning to generate graphs that conform to a distribution observed in training data. We propose a variational autoencoder model in which both encoder and decoder are graph-structured. Our decoder assumes a sequential ordering of graph extension steps and we discuss and analyze design choices that mitigate the potential downsides of this linearization. Experiments compare our approach with a wide range of baselines on the molecule generation task and show that our method is more successful at matching the statistics of the original dataset on semantically important metrics. Furthermore, we show that by using appropriate shaping of the latent space, our model allows us to design molecules that are (locally) optimal in desired properties.

* 8 pages, 5 figures
Click to Read Paper
Generative models for source code are an interesting structured prediction problem, requiring to reason about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. The generative procedure interleaves grammar-driven expansion steps with graph augmentation and neural message passing steps. An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines.

Click to Read Paper
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit code's abundance of patterns. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.

* Website accompanying this survey paper can be found at https://ml4code.github.io
Click to Read Paper
We introduce the problem of learning distributed representations of edits. By combining a "neural editor" with an "edit encoder", our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data. Our evaluation yields promising results that suggest that our neural network models learn to capture the structure and semantics of edits. We hope that this interesting task and data source will inspire other researchers to work further on this problem.

Click to Read Paper