Notes on Machine Learning on Source Code
In this document I am collecting summary paragraphs of the resources (mainly papers) found in the awesome ml on src repo by source{d}. Each summary is drawn from the paper's abstract, introduction and conclusions, which is a faster way to judge whether a paper is interesting than reading it in full.

Program Synthesis and Induction


  • Summary: NL2Bash provides a dataset (9K English-command pairs covering 100 unique Bash utilities) and a baseline method for mapping English sentences to Bash commands. The idea is that you could do file manipulation, search or other scripting by stating your goals in English. The work evaluates the Seq2seq, CopyNet and Tellina models. CopyNet seems to do best at the sub-token level, obtaining a top-1 command structure accuracy of 49% and a top-1 full command accuracy of 36%.
  • Tags: semantic parsing (mapping natural language to machine-interpretable formal representations), nl2code (natural language to code)
  • Summary: The paper proposes a neural network architecture (coarse-to-fine) which decomposes the semantic parsing process into two stages. In the first stage the decoder produces a rough sketch of the meaning representation (e.g. executable queries or logical forms), while in the second stage the decoder fills in the missing details, conditioning on both the natural language input and the sketch itself. As an example, given the table schema of the database and the natural language question "What record company did conductor Mikhail Snitko record for after 1996?", the model first produces the sketch: WHERE > AND = and then the structured meaning: SELECT Record Company WHERE (Year of Recording > 1996) AND (Conductor = Mikhail Snitko). Coarse-to-fine was tested on three tasks: natural language to logical form, natural language to source code, and natural language to SQL. Experimental results show competitive performance compared to previous systems, despite the simple sequence decoders.
  • Tags: semantic parsing, nl2code
  • Notes: I have also done a blog post about this paper.
  • Summary: The paper proposes STAMP (Syntax- and Table-Aware seMantic Parser), a generative model that maps natural language questions into SQL queries while bypassing the main issue of existing approaches: the mismatch between question words and table contents. This is achieved by taking into account the structure of the table and the syntax of the SQL language. The approach improved the execution accuracy on the WikiSQL dataset from 69% to 74.4%. The method is based on pointer networks with three output channels (the column, the value and the SQL channel), each predicting its own kind of output, and STAMP learns which channel to switch to at each time step of the decoding process.
  • Tags: semantic parsing, WikiSQL, nl2code, nl2sql
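
To make the two-stage idea in the coarse-to-fine summary above more concrete, here is a minimal, non-neural sketch in Python. It only illustrates the data flow (sketch first, details second) on the example question from that summary. The function names decode_sketch and fill_sketch are hypothetical, both stages are hand-written stand-ins for what the paper implements as learned decoders, and the column names and values are supplied by hand rather than predicted.

```python
from typing import List, Tuple


def decode_sketch(question: str) -> List[str]:
    """Stage 1 (hypothetical stand-in): return a coarse sketch of the query.

    A real coarse-to-fine model would predict this from the question;
    here it is hard-coded for the single example question.
    """
    return ["WHERE", ">", "AND", "="]


def fill_sketch(sketch: List[str], select_column: str,
                conditions: List[Tuple[str, str]]) -> str:
    """Stage 2 (hypothetical stand-in): fill columns and values into the sketch.

    Each operator in the sketch is paired, in order, with one
    (column, value) condition.
    """
    operators = [tok for tok in sketch if tok not in ("WHERE", "AND")]
    if len(operators) != len(conditions):
        raise ValueError("sketch and conditions disagree on the number of clauses")
    clauses = [f"({col} {op} {val})"
               for op, (col, val) in zip(operators, conditions)]
    return f"SELECT {select_column} WHERE " + " AND ".join(clauses)


question = ("What record company did conductor Mikhail Snitko "
            "record for after 1996?")
sketch = decode_sketch(question)
query = fill_sketch(
    sketch,
    "Record Company",
    [("Year of Recording", "1996"), ("Conductor", "Mikhail Snitko")],
)
print(query)
# SELECT Record Company WHERE (Year of Recording > 1996) AND (Conductor = Mikhail Snitko)
```

The point of the split is that the second stage never has to decide the overall shape of the query; it only has to fill slots in a structure that is already fixed by the sketch.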

Source Code Analysis and Language modeling

TODO

Neural Network Architectures and Algorithms

TODO

Embeddings in Software Engineering

TODO

Program Translation

TODO

Code Suggestion and Completion

TODO

Program Repair and Bug Detection

TODO

APIs and Code Mining

TODO

Code Optimization

TODO

Topic Modeling

TODO

Sentiment Analysis

TODO

Code Summarization

TODO

Clone Detection

TODO

Differentiable Interpreters

TODO

Binary Data Modeling

TODO
