Models, code, and papers for "Giovanni Da San Martino":

Experiments in Detecting Persuasion Techniques in the News

Nov 15, 2019
Seunghak Yu, Giovanni Da San Martino, Preslav Nakov

Many recent political events, like the 2016 US Presidential elections or the 2018 Brazilian elections, have drawn the attention of institutions and of the general public to the role of the Internet and social media in influencing the outcome of these events. We argue that a safe democracy is one in which citizens have tools to make them aware of propaganda campaigns. We propose a novel task: performing fine-grained analysis of texts by detecting all fragments that contain propaganda techniques as well as their type. We further design a novel multi-granularity neural network, and we show that it outperforms several strong BERT-based baselines.

* NeurIPS-2019 workshop on AI for Social Good 
* arXiv admin note: substantial text overlap with arXiv:1910.02517 

A tree-based kernel for graphs with continuous attributes

Dec 20, 2016
Giovanni Da San Martino, Nicolò Navarin, Alessandro Sperduti

The availability of graph data with node attributes that can be either discrete or real-valued is constantly increasing. While existing kernel methods are effective techniques for dealing with graphs having discrete node labels, their adaptation to non-discrete or continuous node attributes has been limited, mainly because of computational issues. Recently, a few kernels especially tailored for this domain, and that trade predictive performance for computational efficiency, have been proposed. In this paper, we propose a graph kernel for complex and continuous node attributes, whose features are tree structures extracted from specific graph visits. The kernel manages to keep the same complexity as state-of-the-art kernels while implicitly using a larger feature space. We further present an approximated variant of the kernel which reduces its complexity significantly. Experimental results obtained on six real-world datasets show that the kernel is the best performing one on most of them. Moreover, in most cases the approximated version reaches performance comparable to current state-of-the-art kernels in terms of classification accuracy while greatly shortening the running times.

* This work has been submitted to the IEEE Transactions on Neural Networks and Learning Systems for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible 

An Empirical Study on Budget-Aware Online Kernel Algorithms for Streams of Graphs

Jul 20, 2016
Giovanni Da San Martino, Nicolò Navarin, Alessandro Sperduti

Kernel methods are considered an effective technique for on-line learning. Many approaches have been developed for compactly representing the dual solution of a kernel method when the problem imposes memory constraints. However, no work in the literature is specifically tailored to streams of graphs. Motivated by the fact that the size of the feature space representation of many state-of-the-art graph kernels is relatively small and thus it is explicitly computable, we study whether executing kernel algorithms in the feature space can be more effective than the classical dual approach. We study three different algorithms and various strategies for managing the budget. Efficiency and efficacy of the proposed approaches are experimentally assessed on relatively large graph streams exhibiting concept drift. It turns out that, when strict memory budget constraints have to be enforced, working in feature space, given the current state of the art on graph kernels, is more than a viable alternative to dual approaches, both in terms of speed and classification performance.

* Author's version of the manuscript, to appear in Neurocomputing (ELSEVIER) 
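
The idea of learning directly in an explicit feature space under a memory budget can be sketched as follows. This is only an illustrative sketch, not one of the three algorithms studied in the paper: it assumes each graph has already been mapped to a sparse feature dictionary, uses a plain perceptron update, and enforces the budget by dropping the smallest-magnitude weights. The function name `train_budgeted_perceptron` and the feature identifiers are hypothetical.

```python
def train_budgeted_perceptron(stream, budget):
    """Perceptron over explicit sparse features with a cap on stored weights.

    stream: iterable of (features, label) pairs, where features is a dict
            mapping feature id -> value and label is -1 or +1.
    budget: maximum number of weights kept in memory (assumed strategy).
    """
    weights = {}
    mistakes = 0
    for features, label in stream:
        # Score the example directly in feature space (no dual representation).
        score = sum(weights.get(f, 0.0) * v for f, v in features.items())
        if label * score <= 0:
            mistakes += 1
            # Standard perceptron update on a mistake.
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) + label * v
            # Enforce the budget: keep only the largest-magnitude weights.
            if len(weights) > budget:
                keep = sorted(weights, key=lambda f: abs(weights[f]), reverse=True)[:budget]
                weights = {f: weights[f] for f in keep}
    return weights, mistakes


# Toy stream of three "graphs", each already converted to sparse features.
stream = [
    ({"t1": 1.0, "t2": 1.0}, +1),
    ({"t3": 2.0}, -1),
    ({"t1": 1.0, "t4": 1.0}, +1),
]
weights, mistakes = train_budgeted_perceptron(stream, budget=2)
```

Pruning by weight magnitude is just one possible budget-management policy; other policies (for example, removing the oldest features) would slot into the same place in the loop.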

Ordered Decompositional DAG Kernels Enhancements

Dec 28, 2015
Giovanni Da San Martino, Nicolò Navarin, Alessandro Sperduti

In this paper, we show how the Ordered Decomposition DAGs (ODD) kernel framework, which allows the definition of graph kernels from tree kernels, makes it easy to define new state-of-the-art graph kernels. Here we consider a fast graph kernel based on the Subtree kernel (ST), and we propose various enhancements to increase its expressiveness. The proposed DAG kernel has the same worst-case complexity as the one based on ST, but an improved expressivity due to an augmented set of features. Moreover, we propose a novel weighting scheme for the features, which can be applied to other kernels of the ODD framework. These improvements allow the proposed kernels to improve on the classification performance of the ST-based kernel on several real-world datasets, reaching state-of-the-art performance.

* Neurocomputing, Volume 192, 5 June 2016, Pages 92--103 
* Paper accepted for publication in Neurocomputing 

Graph Kernels exploiting Weisfeiler-Lehman Graph Isomorphism Test Extensions

Sep 22, 2015
Giovanni Da San Martino, Nicolò Navarin, Alessandro Sperduti

In this paper we present a novel graph kernel framework inspired by the Weisfeiler-Lehman (WL) isomorphism tests. Any WL test comprises a relabelling phase of the nodes based on test-specific information extracted from the graph, for example the set of neighbours of a node. We defined a novel relabelling and derived two kernels of the framework from it. The novel kernels are very fast to compute and achieve state-of-the-art results on five real-world datasets.

* Neural Information Processing, Volume 8835 of the series Lecture Notes in Computer Science pp 93-100, 2014 Springer International Publishing 
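
As a rough illustration of the relabelling phase mentioned in the abstract, the sketch below performs one classic WL step, in which each node's new label is derived from its own label and the sorted labels of its neighbours. This is the standard neighbour-based relabelling given as an example above, not the novel relabelling the paper proposes; the function name `wl_relabel` and the dictionary-based graph representation are assumptions.

```python
def wl_relabel(adjacency, labels):
    """One Weisfeiler-Lehman relabelling step.

    adjacency: dict mapping node -> list of neighbouring nodes.
    labels:    dict mapping node -> current (sortable, hashable) label.
    Returns a dict mapping node -> new integer label.
    """
    # Signature = own label plus the sorted multiset of neighbour labels.
    signatures = {
        node: (labels[node], tuple(sorted(labels[n] for n in adjacency[node])))
        for node in adjacency
    }
    # Compress each distinct signature to a small integer, deterministically.
    compressed = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
    return {node: compressed[signatures[node]] for node in adjacency}


# Path graph a - b - c with identical initial labels.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
labels = {"a": 0, "b": 0, "c": 0}
new_labels = wl_relabel(adjacency, labels)
```

After one step the two endpoints share a label while the middle node gets a different one, since it has two neighbours rather than one. A WL-style kernel would then compare graphs by counting how often each label occurs across iterations.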

Findings of the NLP4IF-2019 Shared Task on Fine-Grained Propaganda Detection

Oct 20, 2019
Giovanni Da San Martino, Alberto Barrón-Cedeño, Preslav Nakov

We present the shared task on Fine-Grained Propaganda Detection, which was organized as part of the NLP4IF workshop at EMNLP-IJCNLP 2019. There were two subtasks. FLC is a fragment-level task that asks for the identification of propagandist text fragments in a news article and also for the prediction of the specific propaganda technique used in each such fragment (18-way classification task). SLC is a sentence-level binary classification task asking to detect the sentences that contain propaganda. A total of 12 teams submitted systems for the FLC task, 25 teams did so for the SLC task, and 14 teams eventually submitted a system description paper. For both subtasks, most systems managed to beat the baseline by a sizable margin. The leaderboard and the data from the competition are available at

* NLP4IF@EMNLP-2019 
* propaganda, disinformation, fake news. arXiv admin note: text overlap with arXiv:1910.02517 

Proppy: A System to Unmask Propaganda in Online News

Dec 14, 2019
Alberto Barrón-Cedeño, Giovanni Da San Martino, Israa Jaradat, Preslav Nakov

We present proppy, the first publicly available real-world, real-time propaganda detection system for online news, which aims at raising awareness, thus potentially limiting the impact of propaganda and helping fight disinformation. The system constantly monitors a number of news sources, deduplicates and clusters the news into events, and organizes the articles about an event on the basis of the likelihood that they contain propagandistic content. The system is trained on known propaganda sources using a variety of stylistic features. The evaluation results on a standard dataset show state-of-the-art results for propaganda detection.

* Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-2019) 
* propaganda, disinformation, fake news 

Building Chatbots from Forum Data: Model Selection Using Question Answering Metrics

Oct 02, 2017
Martin Boyanov, Ivan Koychev, Preslav Nakov, Alessandro Moschitti, Giovanni Da San Martino

We propose to use question answering (QA) data from Web forums to train chatbots from scratch, i.e., without dialog training data. First, we extract pairs of question and answer sentences from the typically much longer texts of questions and answers in a forum. We then use these shorter texts to train seq2seq models in a more efficient way. We further improve the parameter optimization using a new model selection strategy based on QA measures. Finally, we propose to use extrinsic evaluation with respect to a QA task as an automatic evaluation method for chatbots. The evaluation shows that the model achieves a MAP of 63.5% on the extrinsic task. Moreover, it can answer correctly 49.5% of the questions when they are similar to questions asked in the forum, and 47.3% of the questions when they are more conversational in style.

* RANLP-2017 

Fine-Grained Analysis of Propaganda in News Articles

Oct 06, 2019
Giovanni Da San Martino, Seunghak Yu, Alberto Barrón-Cedeño, Rostislav Petrov, Preslav Nakov

Propaganda aims at influencing people's mindset with the purpose of advancing a specific agenda. Previous work has addressed propaganda detection at the document level, typically labelling all articles from a propagandistic news outlet as propaganda. Such noisy gold labels inevitably affect the quality of any learning system trained on them. A further issue with most existing systems is the lack of explainability. To overcome these limitations, we propose a novel task: performing fine-grained analysis of texts by detecting all fragments that contain propaganda techniques as well as their type. In particular, we create a corpus of news articles manually annotated at the fragment level with eighteen propaganda techniques and we propose a suitable evaluation measure. We further design a novel multi-granularity neural network, and we show that it outperforms several strong BERT-based baselines.

* EMNLP-2019 

Global Thread-Level Inference for Comment Classification in Community Question Answering

Nov 20, 2019
Shafiq Joty, Alberto Barrón-Cedeño, Giovanni Da San Martino, Simone Filice, Lluís Màrquez, Alessandro Moschitti, Preslav Nakov

Community question answering, a recent evolution of question answering in the Web context, allows a user to quickly consult the opinion of a number of people on a particular topic, thus taking advantage of the wisdom of the crowd. Here we try to help the user by deciding automatically which answers are good and which are bad for a given question. In particular, we focus on exploiting the output structure at the thread level in order to make more consistent global decisions. More specifically, we exploit the relations between pairs of comments at any distance in the thread, which we incorporate in a graph-cut and in an ILP framework. We evaluated our approach on the benchmark dataset of SemEval-2015 Task 3. Results improved over the state of the art, confirming the importance of using thread-level information.

* EMNLP-2015 
* community question answering, thread-level inference, graph-cut, inductive logic programming 

Team QCRI-MIT at SemEval-2019 Task 4: Propaganda Analysis Meets Hyperpartisan News Detection

Apr 06, 2019
Abdelrhman Saleh, Ramy Baly, Alberto Barrón-Cedeño, Giovanni Da San Martino, Mitra Mohtarami, Preslav Nakov, James Glass

In this paper, we describe our submission to SemEval-2019 Task 4 on Hyperpartisan News Detection. Our system relies on a variety of engineered features originally used to detect propaganda. This is based on the assumption that biased messages are propagandistic in the sense that they promote a particular political cause or viewpoint. We trained a logistic regression model with features ranging from simple bag-of-words to vocabulary richness and text readability features. Our system achieved 72.9% accuracy on the test data that is annotated manually and 60.8% on the test data that is annotated with distant supervision. Additional experiments showed that significant performance improvements can be achieved with better feature pre-processing.

* Hyperpartisanship, propaganda, news media, fake news, SemEval-2018 

Cross-Language Question Re-Ranking

Oct 04, 2017
Giovanni Da San Martino, Salvatore Romeo, Alberto Barron-Cedeno, Shafiq Joty, Lluis Marquez, Alessandro Moschitti, Preslav Nakov

We study how to find relevant questions in community forums when the language of the new questions is different from that of the existing questions in the forum. In particular, we explore the Arabic-English language pair. We compare a kernel-based system with a feed-forward neural network in a scenario where a large parallel corpus is available for training a machine translation system, bilingual dictionaries, and cross-language word embeddings. We observe that both approaches degrade the performance of the system when working on the translated text, especially the kernel-based system, which depends heavily on a syntactic kernel. We address this issue using a cross-language tree kernel, which compares the original Arabic tree to the English trees of the related questions. We show that this kernel almost closes the performance gap with respect to the monolingual system. On the neural network side, we use the parallel corpus to train cross-language embeddings, which we then use to represent the Arabic input and the English related questions in the same space. The results also improve to close to those of the monolingual neural network. Overall, the kernel system shows a better performance compared to the neural network in all cases.

* SIGIR 2017: 1145-1148 
* SIGIR-2017; Community Question Answering; Cross-language Approaches; Question Retrieval; Kernel-based Methods; Neural Networks; Distributed Representations 

Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims. Task 1: Check-Worthiness

Aug 08, 2018
Pepa Atanasova, Alberto Barron-Cedeno, Tamer Elsayed, Reem Suwaileh, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, Preslav Nakov

We present an overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims, with focus on Task 1: Check-Worthiness. The task asks to predict which claims in a political debate should be prioritized for fact-checking. In particular, given a debate or a political speech, the goal was to produce a ranked list of its sentences based on their worthiness for fact checking. We offered the task in both English and Arabic, based on debates from the 2016 US Presidential Campaign, as well as on some speeches during and after the campaign. A total of 30 teams registered to participate in the Lab and seven teams actually submitted systems for Task 1. The most successful approaches used by the participants relied on recurrent and multi-layer neural networks, as well as on combinations of distributional representations, on matching claims' vocabulary against lexicons, and on measures of syntactic dependency. The best systems achieved mean average precision of 0.18 and 0.15 on the English and on the Arabic test datasets, respectively. This leaves large room for further improvement, and thus we release all datasets and the scoring scripts, which should enable further research in check-worthiness estimation.

* CLEF-2018 
* Computational journalism, Check-worthiness, Fact-checking, Veracity 
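
Since the task above is scored by mean average precision over ranked lists of sentences, a small sketch of that measure may help. It assumes each debate is given as a list of booleans in ranked order (True where the sentence is check-worthy) and that every check-worthy sentence appears somewhere in the list; the function names are hypothetical and the lab's released scoring scripts may differ in detail.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of True/False relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-list average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy "debates": flags are listed in the order the system ranked the sentences.
debates = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = 1/2
]
score = mean_average_precision(debates)
```

For the toy input the two lists score 5/6 and 1/2, so the mean is 2/3.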

Addressing Community Question Answering in English and Arabic

Oct 18, 2016
Giovanni Da San Martino, Alberto Barrón-Cedeño, Salvatore Romeo, Alessandro Moschitti, Shafiq Joty, Fahad A. Al Obaidli, Kateryna Tymoshenko, Antonio Uva

This paper studies the impact of different types of features applied to learning to re-rank questions in community Question Answering. We tested our models on two datasets released in SemEval-2016 Task 3 on "Community Question Answering". Task 3 targeted real-life Web fora both in English and Arabic. Our models include bag-of-words features (BoW), syntactic tree kernels (TKs), rank features, embeddings, and machine translation evaluation features. To the best of our knowledge, structural kernels have barely been applied to the question re-ranking task, where they have to model paraphrase relations. In the case of the English question re-ranking task, we compare our learning to rank (L2R) algorithms against a strong baseline given by the Google-generated ranking (GR). The results show that i) the shallow structures used in our TKs are sufficiently robust to noisy data and ii) improving GR is possible, but effective BoW features and TKs along with an accurate model of GR features in the used L2R algorithm are required. In the case of the Arabic question re-ranking task, for the first time we applied tree kernels on syntactic trees of Arabic sentences. Our approaches to both tasks obtained the second best results on SemEval-2016 subtasks B on English and D on Arabic.

* presented at Second WebQA workshop, SIGIR2016

Tanbih: Get To Know What You Are Reading

Oct 04, 2019
Yifan Zhang, Giovanni Da San Martino, Alberto Barrón-Cedeño, Salvatore Romeo, Jisun An, Haewoon Kwak, Todor Staykovski, Israa Jaradat, Georgi Karadzhov, Ramy Baly, Kareem Darwish, James Glass, Preslav Nakov

We introduce Tanbih, a news aggregator with intelligent analysis tools to help readers understand what's behind a news story. Our system displays news grouped into events and generates media profiles that show the general factuality of reporting, the degree of propagandistic content, hyper-partisanship, leading political ideology, general frame of reporting, and stance with respect to various claims and topics of a news outlet. In addition, we automatically analyse each article to detect whether it is propagandistic and to determine its stance with respect to a number of controversial topics.

* EMNLP-2019 
