Toxicity detection remains a relevant task, especially in the context of developing safe and fair LMs. Nevertheless, labeled binary toxicity classification corpora are not available for all languages, which is understandable given the resource-intensive nature of the annotation process. Ukrainian, in particular, is among the languages lacking such resources. To our knowledge, no toxicity classification corpus for Ukrainian existed before this work. In this study, we aim to fill this gap by investigating cross-lingual knowledge transfer techniques and creating labeled corpora by: (i)~translating from an English corpus, (ii)~filtering toxic samples using keywords, and (iii)~annotating with crowdsourcing. We compare LLM prompting and other cross-lingual transfer approaches with and without fine-tuning, offering insights into the most robust and efficient baselines.
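The keyword-based filtering step (ii) can be pictured with a minimal sketch; the lexicon, tokenization, and variable names below are illustrative assumptions, not the released pipeline:

```python
# A minimal sketch of keyword filtering (step ii); the keyword set is a
# placeholder for a curated Ukrainian obscenity/offense lexicon.
TOXIC_KEYWORDS = {"keyword_1", "keyword_2"}  # placeholder lexicon

def likely_toxic(text: str) -> bool:
    """Flag a text if any token matches the toxic keyword lexicon."""
    tokens = text.lower().split()
    return any(tok.strip(".,!?") in TOXIC_KEYWORDS for tok in tokens)

raw_texts = ["some sentence", "another one"]  # toy examples
candidates = [t for t in raw_texts if likely_toxic(t)]  # passed on to crowd annotation (iii)
```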
Text detoxification is a textual style transfer (TST) task in which a text is paraphrased from a toxic surface form, e.g., one featuring rude words, into a neutral register. Recently, text detoxification methods have found applications in various tasks, such as detoxification of Large Language Models (LLMs) (Leong et al., 2023; He et al., 2024; Tang et al., 2023) and combating toxic speech in social networks (Deng et al., 2023; Mun et al., 2023; Agarwal et al., 2023). All these applications are extremely important for ensuring safe communication in the modern digital world. However, the previous approaches to parallel text detoxification corpus collection -- ParaDetox (Logacheva et al., 2022) and APPADIA (Atwell et al., 2022) -- were explored only in a monolingual setup. In this work, we aim to extend the ParaDetox pipeline to multiple languages, presenting MultiParaDetox to automate parallel detoxification corpus collection for potentially any language. We then experiment with different text detoxification models -- from unsupervised baselines to LLMs and models fine-tuned on the presented parallel corpora -- showing the clear benefit of a parallel corpus for obtaining state-of-the-art text detoxification models for any language.
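As a rough illustration of what fine-tuning on such a parallel corpus looks like, here is a minimal sketch with a multilingual seq2seq model; the choice of mT5 and the toy pair are assumptions, not the paper's exact setup:

```python
# A hedged sketch: fine-tuning a multilingual seq2seq model on toxic -> neutral
# parallel pairs. Model choice and data are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"  # assumption: any multilingual seq2seq model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

pairs = [("this is damn nonsense", "this does not make sense")]  # toy pair

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for toxic, neutral in pairs:
    inputs = tokenizer(toxic, return_tensors="pt")
    labels = tokenizer(neutral, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy on the neutral target
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```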
Despite the extensive number of labeled datasets in the NLP text classification field, a persistent imbalance in data availability across languages remains evident. Ukrainian, in particular, is a language that can still benefit from the continued refinement of cross-lingual methodologies. To our knowledge, there is a tremendous lack of Ukrainian corpora for typical text classification tasks. In this work, we leverage state-of-the-art advances in NLP, exploring cross-lingual knowledge transfer methods that avoid manual data curation: large multilingual encoders and translation systems, LLMs, and language adapters. We test these approaches on three text classification tasks -- toxicity classification, formality classification, and natural language inference -- providing a "recipe" for the optimal setups.
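One of the listed methods, zero-shot transfer with a large multilingual encoder, can be sketched as follows; the model choice and placeholder input are assumptions for illustration:

```python
# A hedged sketch of zero-shot cross-lingual transfer: fine-tune a multilingual
# encoder (here XLM-R, an assumption) on English task data, then predict
# directly on Ukrainian text with no Ukrainian labels.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# ... fine-tune on English task data here ...

uk_text = "..."  # Ukrainian input at inference time (placeholder)
batch = tokenizer(uk_text, return_tensors="pt", truncation=True)
with torch.no_grad():
    pred = model(**batch).logits.argmax(-1)  # label transferred zero-shot
```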
Text detoxification is the task of transferring the style of text from toxic to neutral. While there are approaches yielding promising results in a monolingual setup, e.g., (Dale et al., 2021; Hallinan et al., 2022), cross-lingual transfer for this task remains a challenging open problem (Moskovskiy et al., 2022). In this work, we present a large-scale study of strategies for cross-lingual text detoxification: given a parallel detoxification corpus for one language, the goal is to transfer the detoxification ability to another language for which no such corpus is available. Moreover, we are the first to explore a new task in which text translation and detoxification are performed simultaneously, providing several strong baselines for it. Finally, we introduce new automatic detoxification evaluation metrics with higher correlations with human judgments than previous benchmarks. We also assess the most promising approaches with manual markup, determining the best strategy for transferring text detoxification knowledge between languages.
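Automatic detoxification evaluation typically combines several per-sentence signals; the sketch below shows one common product form from prior work, not necessarily the new metrics this paper proposes, and LaBSE for content similarity is an assumption:

```python
# A hedged sketch of a joint detoxification score in the spirit of
# J = STA * SIM * FL (style accuracy, content similarity, fluency, each in [0, 1]).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/LaBSE")  # assumption

def content_similarity(source: str, output: str) -> float:
    emb = embedder.encode([source, output], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def joint_score(sta: float, sim: float, fl: float) -> float:
    # Product form: the output must be non-toxic, meaning-preserving, and fluent at once.
    return sta * sim * fl
```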
This paper presents the best-performing approach, alias "Adam Smith", for the SemEval-2023 Task 4: "Identification of Human Values behind Arguments". The goal of the task was to create systems that automatically identify the values within textual arguments. We train transformer-based models until they reach their minimum loss or maximum F1-score. Ensembling the models by selecting one global decision threshold that maximizes the F1-score leads to the best-performing system in the competition. Ensembling based on stacking with logistic regressions shows the best performance on an additional dataset provided to evaluate robustness ("Nahj al-Balagha"). Apart from outlining the submitted system, we demonstrate that the large ensemble model is not necessary and that the system size can be significantly reduced.
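The global-threshold selection can be pictured with a short sketch; the averaging mode and grid are illustrative assumptions:

```python
# A minimal sketch of the single global decision threshold: scan candidates on
# validation probabilities and keep the one maximizing the (here: macro) F1-score.
import numpy as np
from sklearn.metrics import f1_score

def best_global_threshold(probs: np.ndarray, labels: np.ndarray) -> float:
    """probs, labels: (n_samples, n_values) arrays for the multi-label task."""
    grid = np.linspace(0.05, 0.95, 91)
    scores = [
        f1_score(labels, (probs >= t).astype(int), average="macro", zero_division=0)
        for t in grid
    ]
    return float(grid[int(np.argmax(scores))])
```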
The Explainable Detection of Online Sexism task poses the problem of explainable sexism detection through fine-grained categorisation of sexist cases across three subtasks. Our team experimented with different ways to combat class imbalance throughout the subtasks using data augmentation and loss alteration techniques. We tackled the challenge by utilising ensembles of Transformer models trained on different datasets, tested to find the balance between performance and interpretability. This solution ranked us in the top 40\% of teams for each of the tracks.
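One common loss-alteration option for class imbalance is sketched below; inverse-frequency weighting is an assumption here, as the exact scheme used is not specified:

```python
# A hedged sketch of loss alteration for class imbalance: inverse-frequency
# class weights in the cross-entropy loss.
import torch
from collections import Counter

def class_weights(labels: list, num_classes: int) -> torch.Tensor:
    counts = Counter(labels)
    total = len(labels)
    return torch.tensor(
        [total / (num_classes * counts.get(c, 1)) for c in range(num_classes)],
        dtype=torch.float,
    )

train_labels = [0, 0, 0, 0, 1, 1, 2]  # toy imbalanced labels
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights(train_labels, num_classes=3))
```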
Interpretability and human oversight are fundamental pillars of deploying complex NLP models in real-world applications. However, applying explainability and human-in-the-loop methods requires technical proficiency. Despite existing toolkits for model understanding and analysis, options to integrate human feedback are still limited. We propose IFAN, a framework for real-time explanation-based interaction with NLP models. Through IFAN's interface, users can provide feedback on selected model explanations, which is then integrated through adapter layers to align the model with human rationales. We show the system to be effective in debiasing a hate speech classifier with minimal performance loss. IFAN also offers a visual admin system and API to manage models (and datasets) as well as control access rights. A demo is live at https://ifan.ml/
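The adapter mechanism behind feedback integration can be sketched generically; this is an illustration of the idea, not IFAN's actual implementation:

```python
# A generic sketch of adapter-based feedback integration: the base model stays
# frozen and only a small residual bottleneck is trained on feedback-corrected
# examples. All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(torch.relu(self.down(hidden)))  # residual bottleneck

# During feedback integration, only adapter parameters receive gradients, e.g.:
# for p in base_model.parameters(): p.requires_grad = False
```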
Misleading information spreads on the Internet at incredible speed and can, in some cases, lead to irreparable consequences. Developing fake news detection technologies is therefore becoming essential. While substantial work has been done in this direction, one limitation of current approaches is that they focus on a single language and do not use multilingual information. In this work, we propose Multiverse -- a new feature based on multilingual evidence that can be used for fake news detection and can improve existing approaches. We first confirm the hypothesis that cross-lingual evidence is a useful feature for fake news detection through a manual experiment on a set of known true and fake news items. We then compare our fake news classification system based on the proposed feature with several baselines on two multi-domain datasets of general-topic news and one fake COVID-19 news dataset, showing that, in combination with linguistic features, it yields significant improvements.
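The cross-lingual evidence idea can be sketched as a single feature; `retrieve` and `embed` are assumed helpers (a search backend and a multilingual sentence encoder), not part of any released code:

```python
# A hedged sketch of a Multiverse-style feature: retrieve foreign-language
# coverage of a claim and use the best cross-lingual similarity as one feature.
import numpy as np

def multiverse_feature(claim: str, retrieve, embed, langs=("de", "fr", "es")) -> float:
    q = embed(claim)
    best = 0.0
    for lang in langs:
        for doc in retrieve(claim, lang):  # foreign-language articles about the claim
            d = embed(doc)
            sim = float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9))
            best = max(best, sim)
    return best  # low cross-lingual support can signal fake news
```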
Detoxification is the task of generating text in a polite style while preserving the meaning and fluency of the original toxic text. Existing detoxification methods are designed to work in one specific language. This work investigates multilingual and cross-lingual detoxification and the behavior of large multilingual models in this setting. Unlike previous works, we aim to make large language models able to perform detoxification without direct fine-tuning in a given language. Experiments show that multilingual models are capable of performing multilingual style transfer. However, they are not able to perform cross-lingual detoxification, so direct fine-tuning on the target language remains unavoidable.
Formality is an important characteristic of text documents. Automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks, such as retrieving texts with a desired formality level, integration into language learning and document editing platforms, or evaluating the desired conversation tone of chatbots. Recently, two large-scale datasets featuring formality annotation were introduced for multiple languages. However, they were primarily used for training style transfer models, while detecting text formality on its own may also be a useful application. This work proposes the first systematic study of formality detection methods based on current (and more classic) machine learning methods and delivers the best-performing models for public use. We conducted three types of experiments -- monolingual, multilingual, and cross-lingual. The study shows that BiLSTM-based models outperform transformer-based ones for the formality classification task. We release formality detection models for several languages, yielding state-of-the-art results and possessing tested cross-lingual capabilities.
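For concreteness, a minimal sketch of the kind of BiLSTM classifier the study found strongest is shown below; the embedding size, hidden size, and pooling are illustrative choices, not the released architecture:

```python
# A hedged sketch of a BiLSTM formality classifier; hyperparameters are assumptions.
import torch
import torch.nn as nn

class BiLSTMFormalityClassifier(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden: int = 256, num_classes: int = 2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        states, _ = self.lstm(self.emb(token_ids))
        return self.fc(states.mean(dim=1))  # mean-pool over time, then classify
```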