Mark Dredze

A Closer Look at Claim Decomposition

Mar 18, 2024
Miriam Wanner, Seth Ebner, Zhengping Jiang, Mark Dredze, Benjamin Van Durme

As generated text becomes more commonplace, it is increasingly important to evaluate how well-supported such text is by external knowledge sources. Many approaches for evaluating textual support rely on some method for decomposing text into its individual subclaims which are scored against a trusted reference. We investigate how various methods of claim decomposition -- especially LLM-based methods -- affect the result of an evaluation approach such as the recently proposed FActScore, finding that it is sensitive to the decomposition method used. This sensitivity arises because such metrics attribute overall textual support to the model that generated the text even though error can also come from the metric's decomposition step. To measure decomposition quality, we introduce an adaptation of FActScore, which we call DecompScore. We then propose an LLM-based approach to generating decompositions inspired by Bertrand Russell's theory of logical atomism and neo-Davidsonian semantics and demonstrate its improved decomposition quality over previous methods.
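
The sensitivity described above can be illustrated with a toy sketch. It scores a text as the fraction of its subclaims that a reference supports, then swaps the decomposition method while holding the text and the reference fixed. The decomposers (`by_sentence`, `by_clause`), the exact-match verifier, and the example text are all hypothetical stand-ins, not the paper's methods, and this is not the FActScore implementation.

```python
from typing import Callable, List

def support_score(text: str,
                  decompose: Callable[[str], List[str]],
                  is_supported: Callable[[str], bool]) -> float:
    """Fraction of subclaims produced by `decompose` that the reference supports."""
    subclaims = decompose(text)
    if not subclaims:
        return 0.0
    return sum(1 for claim in subclaims if is_supported(claim)) / len(subclaims)

def by_sentence(text: str) -> List[str]:
    """Hypothetical coarse decomposer: one subclaim per sentence."""
    return [s.strip() for s in text.split(".") if s.strip()]

def by_clause(text: str) -> List[str]:
    """Hypothetical finer decomposer: also split each sentence on ' and '."""
    clauses: List[str] = []
    for sentence in by_sentence(text):
        clauses.extend(c.strip() for c in sentence.split(" and ") if c.strip())
    return clauses

# Toy trusted reference: a claim counts as supported only on exact match (assumption).
REFERENCE = {"Paris is in France", "Paris is the capital"}

def in_reference(claim: str) -> bool:
    return claim in REFERENCE

text = "Paris is in France and Paris is on Mars. Paris is the capital."

coarse = support_score(text, by_sentence, in_reference)  # 1 of 2 subclaims supported
fine = support_score(text, by_clause, in_reference)      # 2 of 3 subclaims supported
print(coarse, fine)
```

The same text receives two different scores purely because of the decomposition step, which is the kind of metric-side error the abstract distinguishes from errors made by the model that generated the text.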

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

Mar 13, 2024
Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze

LLMs have demonstrated impressive performance in answering medical questions, such as passing scores on medical licensing examinations. However, medical board exam questions or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises USMLE Step 2&3 style clinical questions. Both datasets are structured as multiple-choice question-answering tasks, where each question is accompanied by an expert-written explanation. We evaluate four LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. The inconsistency between automatic and human evaluations of model-generated explanations highlights the need to develop new metrics to support future research on explainable medical QA.

Evaluating Biases in Context-Dependent Health Questions

Mar 07, 2024
Sharon Levy, Tahilin Sanchez Karver, William D. Adler, Michelle R. Kaufman, Mark Dredze

Chat-based large language models have the opportunity to empower individuals lacking high-quality healthcare access to receive personalized information across a variety of topics. However, users may ask underspecified questions that require additional context for a model to correctly answer. We study how large language model biases are exhibited through these contextual questions in the healthcare domain. To accomplish this, we curate a dataset of sexual and reproductive healthcare questions that are dependent on age, sex, and location attributes. We compare models' outputs with and without demographic context to determine group alignment among our contextual questions. Our experiments reveal biases in each of these attributes, where young adult female users are favored.

An Eye on Clinical BERT: Investigating Language Model Generalization for Diabetic Eye Disease Phenotyping

Nov 15, 2023
Keith Harrigian, Tina Tang, Anthony Gonzales, Cindy X. Cai, Mark Dredze

Diabetic eye disease is a major cause of blindness worldwide. The ability to monitor relevant clinical trajectories and detect lapses in care is critical to managing the disease and preventing blindness. Alas, much of the information necessary to support these goals is found only in the free text of the electronic medical record. To fill this information gap, we introduce a system for extracting evidence from clinical text of 19 clinical concepts related to diabetic eye disease and inferring relevant attributes for each. In developing this ophthalmology phenotyping system, we are also afforded a unique opportunity to evaluate the effectiveness of clinical language models at adapting to new clinical domains. Across multiple training paradigms, we find that BERT language models pretrained on out-of-distribution clinical data offer no significant improvement over BERT language models pretrained on non-clinical data for our domain. Our study tempers recent claims that language models pretrained on clinical data are necessary for clinical NLP tasks and highlights the importance of not treating clinical language data as a single homogeneous domain.

* Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2023, December 10th, 2023, New Orleans, United States, 24 pages

Selecting Shots for Demographic Fairness in Few-Shot Learning with Large Language Models

Nov 14, 2023
Carlos Aguirre, Kuleen Sasse, Isabel Cachola, Mark Dredze

Recently, work in NLP has shifted to few-shot (in-context) learning, with large language models (LLMs) performing well across a range of tasks. However, while fairness evaluations have become a standard for supervised methods, little is known about the fairness of LLMs as prediction systems. Further, common fairness methods require access to model weights or are applied during finetuning, neither of which is applicable in few-shot learning. Do LLMs exhibit prediction biases when used for standard NLP tasks? In this work, we explore the effect of shots, which directly affect the performance of models, on the fairness of LLMs as NLP classification systems. We consider how different shot selection strategies, both existing and new demographically sensitive methods, affect model fairness across three standard fairness datasets. We discuss how future work can include LLM fairness evaluations.

MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

May 26, 2023
Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David Rosenberg

Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P -- that is, minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE). We have observed that models trained in this way may "over-generalize", in the sense that they produce non-human-like text. Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P relative to Q, is a better reflection of how a human would evaluate text generated by a model. Hence, we propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies. We evaluate models trained with this objective on synthetic data settings (where P is known) and real data, and show that the resulting models yield better generated text without complex decoding strategies. Our code and models are publicly available at https://github.com/bloomberg/mixce-acl2023

* ACL 2023 (22 pages)
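
As a rough illustration of the objective described in the MixCE abstract, the sketch below computes the forward and reverse cross-entropies for two small discrete distributions and mixes them. The mixing weight `eta`, the function names, and the toy distributions are assumptions made for illustration only; the released code at the URL above is the place to look for the actual training objective.

```python
import math
from typing import Sequence

def cross_entropy(a: Sequence[float], b: Sequence[float]) -> float:
    """H(a, b) = -sum_x a(x) * log b(x); terms with a(x) == 0 contribute nothing."""
    return -sum(ai * math.log(bi) for ai, bi in zip(a, b) if ai > 0)

def mix_ce(p: Sequence[float], q: Sequence[float], eta: float) -> float:
    """Weighted mix of forward and reverse cross-entropy.

    `eta` is a hypothetical mixing weight in [0, 1]:
    eta = 1 recovers the forward cross-entropy (the MLE objective),
    eta = 0 recovers the reverse cross-entropy.
    """
    forward = cross_entropy(p, q)  # expectation under the data distribution P
    reverse = cross_entropy(q, p)  # expectation under the model distribution Q
    return eta * forward + (1.0 - eta) * reverse

# Toy distributions over three outcomes (assumed values, all strictly positive).
P = [0.7, 0.2, 0.1]  # data distribution
Q = [0.4, 0.3, 0.3]  # model distribution

print("forward:", cross_entropy(P, Q))
print("reverse:", cross_entropy(Q, P))
print("mixed  :", mix_ce(P, Q, eta=0.5))
```

The reverse term here is computable only because the toy P is fully known, which mirrors the synthetic setting mentioned in the abstract. With real data P is not available in closed form, so this direct computation does not carry over as written.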

Generalizing Fairness using Multi-Task Learning without Demographic Information

May 22, 2023
Carlos Aguirre, Mark Dredze

To ensure the fairness of machine learning systems, we can include a fairness loss during training based on demographic information associated with the training data. However, we cannot train debiased classifiers for most tasks since the relevant datasets lack demographic annotations. Can we utilize demographic data for a related task to improve the fairness of our target task? We demonstrate that demographic fairness objectives transfer to new tasks trained within a multi-task framework. We adapt a single-task fairness loss to a multi-task setting to exploit demographic labels from a related task in debiasing a target task. We explore different settings with missing demographic data and show how our loss can improve fairness even without in-task demographics, across various domains and tasks.

BloombergGPT: A Large Language Model for Finance

Mar 30, 2023
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. As a next step, we plan to release training logs (Chronicles) detailing our experience in training BloombergGPT.

Do Text-to-Text Multi-Task Learners Suffer from Task Conflict?

Dec 13, 2022
David Mueller, Nicholas Andrews, Mark Dredze

Traditional multi-task learning architectures train a single model across multiple tasks through a shared encoder followed by task-specific decoders. Learning these models often requires specialized training algorithms that address task conflict in the shared parameter updates, which otherwise can lead to negative transfer. A new type of multi-task learning within NLP homogenizes multi-task architectures as a shared encoder and language model decoder, which does surprisingly well across a range of diverse tasks. Does this new architecture suffer from task conflicts that require specialized training algorithms? We study how certain factors in the shift towards text-to-text models affect multi-task conflict and negative transfer, finding that both directional conflict and transfer are surprisingly constant across architectures.

* Findings of EMNLP 2022

Using Open-Ended Stressor Responses to Predict Depressive Symptoms across Demographics

Nov 15, 2022
Carlos Aguirre, Mark Dredze, Philip Resnik

Stressors are related to depression, but this relationship is complex. We investigate the relationship between open-ended text responses about stressors and depressive symptoms across gender and racial/ethnic groups. First, we use topic models and other NLP tools to find thematic and vocabulary differences when reporting stressors across demographic groups. We train language models using self-reported stressors to predict depressive symptoms, finding a relationship between stressors and depression. Finally, we find that differences in stressors translate to downstream performance differences across demographic groups.

* Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2022, November 28th, 2022, New Orleans, United States & Virtual, http://www.ml4h.cc, 6 pages
