Yiqun Yao

Tele-FLM Technical Report

Apr 25, 2024

Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimal trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by bits per byte (BPB) on a textual corpus. Moreover, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.
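The abstract reports language modeling quality as BPB. As a rough illustration of what that metric measures, the sketch below converts per-token negative log-likelihoods (in nats) into bits per UTF-8 byte. The function name `bits_per_byte` and the sample loss values are assumptions made for illustration; they are not taken from the report.

```python
import math


def bits_per_byte(token_nlls_nats, text):
    """Convert per-token negative log-likelihoods (nats) to bits per UTF-8 byte.

    Normalizing by bytes instead of tokens makes the score comparable across
    models that use different tokenizers.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_bits = sum(token_nlls_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# Hypothetical per-token losses for a short mixed Chinese/English string.
text = "你好, world"
nlls = [2.1, 1.7, 0.9, 3.2]
bpb = bits_per_byte(nlls, text)
```

Lower BPB means the model compresses the text better. Because the denominator is the byte count, a tokenizer that splits the same text into more or fewer tokens does not change the scale of the metric.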

CatCode: A Comprehensive Evaluation Framework for LLMs On the Mixture of Code and Text

Mar 04, 2024

Zhenru Lin, Yiqun Yao, Yang Yuan

Large language models (LLMs) such as ChatGPT are increasingly proficient in understanding and generating a mixture of code and text. Evaluation based on such $\textit{mixture}$ can lead to a more comprehensive understanding of the models' abilities in solving coding problems. However, in this context, current evaluation methods are either limited in task coverage or lack standardization. To address this issue, we propose using category theory as a framework for evaluation. Specifically, morphisms within a code category can represent code debugging and transformation, functors between two categories represent code translation, and functors between a code category and a natural language category represent code generation, explanation, and reproduction. We present an automatic evaluation framework called $\textbf{CatCode}$ ($\textbf{Cat}$egory $\textbf{Code}$) that can comprehensively assess the coding abilities of LLMs, including ChatGPT, Text-Davinci, and CodeGeeX.

* 10 pages, 5 figures

FLM-101B: An Open LLM and How to Train It with $100K Budget

Sep 17, 2023

Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Xuying Meng, Siqi Fan, Peng Han, Jing Li, Li Du, Bowen Qin, Zheng Zhang, Aixin Sun, Yequan Wang

Large language models (LLMs) have achieved remarkable success in NLP and multimodal tasks, among others. Despite these successes, two main challenges remain in developing LLMs: (i) high computational cost, and (ii) fair and objective evaluations. In this paper, we report a solution to significantly reduce LLM training cost through a growth strategy. We demonstrate that a 101B-parameter LLM with 0.31T tokens can be trained with a budget of 100K US dollars. Inspired by IQ tests, we also consolidate an additional range of evaluations on top of existing evaluations that focus on knowledge-oriented abilities. These IQ evaluations include symbolic mapping, rule understanding, pattern mining, and anti-interference. Such evaluations minimize the potential impact of memorization. Experimental results show that our model, named FLM-101B, trained with a budget of 100K US dollars, achieves performance comparable to powerful and well-known models, e.g., GPT-3 and GLM-130B, especially on the additional range of IQ evaluations. The checkpoint of FLM-101B is released at https://huggingface.co/CofeAI/FLM-101B.

2x Faster Language Model Pre-training via Masked Structural Growth

May 04, 2023

Yiqun Yao, Zheng Zhang, Jing Li, Yequan Wang

Acceleration of large language model pre-training is a critical issue in current NLP research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems related to progressive growth: the growth schedule and the growth operator. For the growth schedule, existing work has explored multi-stage expansion of depth and feedforward layers. However, the impact of each dimension on the schedule's efficiency is still an open question. For the growth operator, existing work relies on the initialization of new weights to inherit knowledge and achieves only non-strict function preservation, limiting further optimization of training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including growth schedules involving all possible dimensions and strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve a speed-up of 80% for Bert-base and 120% for Bert-large pre-training. Moreover, MSG is able to improve fine-tuning performance at the same time.
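The abstract does not spell out the growth operators, so the following is only a minimal sketch of the general idea of a masked, function-preserving expansion, applied to one dimension (the hidden width of a two-layer MLP). New hidden units receive arbitrary random weights, and a mask of zeros on those units keeps the network output unchanged at the moment of growth. All names (`mlp`, `grow_hidden`) and the toy shapes are assumptions for illustration, not the paper's implementation.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp(x, W1, W2, mask):
    """Two-layer MLP whose hidden activations are scaled by a per-unit mask."""
    hidden = relu(matvec(W1, x))
    hidden = [h * m for h, m in zip(hidden, mask)]
    return matvec(W2, hidden)


def grow_hidden(W1, W2, mask, extra, rng):
    """Add `extra` hidden units with random weights and a mask of 0.

    Because masked units contribute exactly zero, the output is preserved
    no matter how the new weights are initialized.
    """
    in_dim = len(W1[0])
    new_W1 = W1 + [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(extra)]
    new_W2 = [row + [rng.gauss(0, 1) for _ in range(extra)] for row in W2]
    new_mask = mask + [0.0] * extra
    return new_W1, new_W2, new_mask


rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]  # hidden x input
W2 = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]  # output x hidden
mask = [1.0] * 4
x = [0.5, -1.0, 2.0]

before = mlp(x, W1, W2, mask)
W1g, W2g, maskg = grow_hidden(W1, W2, mask, extra=2, rng=rng)
after = mlp(x, W1g, W2g, maskg)  # same output as `before`
```

In a training loop one would then raise the new mask entries from 0 toward 1 over some number of steps so the added capacity is phased in gradually; that schedule is left out of this sketch.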

Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales

Apr 29, 2023

Yiqun Yao, Yequan Wang

As language models scale up, it becomes increasingly expensive to verify research ideas because conclusions on small models do not trivially transfer to large ones. A possible solution is to establish a generic system that directly predicts some metrics for large models solely based on the results and hyperparameters from small models. Existing methods based on scaling laws require hyperparameter search on the largest models, which is impractical with limited resources. We address this issue by presenting our discoveries indicating that Maximal Update parametrization (muP) enables accurate fitting of scaling laws for hyperparameters close to common loss basins, without any search. Thus, different models can be directly compared on large scales with loss prediction even before the training starts. We propose a new paradigm as a first step towards reliable academic research for any model scale without heavy computation. Code will be publicly available shortly.

* Updated figures and references
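To make the idea of predicting large-model loss from small-model results concrete, here is a minimal sketch that fits a pure power law, loss = a * N^(-b), by least squares in log-log space and extrapolates it. The functional form (no irreducible-loss term), the function names, and the data points are all assumptions for illustration; the abstract does not state the form used in the paper, and the numbers below are synthetic.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) via ordinary least squares on log-log data."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, size):
    return a * size ** (-b)


# Synthetic "small model" results generated from a known power law.
small_sizes = [1e6, 1e7, 1e8]
small_losses = [10.0 * s ** (-0.1) for s in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(a, b, 1e9)  # extrapolate to a larger model
```

The abstract's point is that such a fit is only trustworthy if the small and large models are trained with comparable hyperparameters, which muP is reported to provide without a search at the largest scale.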

MUSER: MUltimodal Stress Detection using Emotion Recognition as an Auxiliary Task

May 17, 2021

Yiqun Yao, Michalis Papakostas, Mihai Burzo, Mohamed Abouelenien, Rada Mihalcea

The capability to automatically detect human stress can benefit artificially intelligent agents involved in affective computing and human-computer interaction. Stress and emotion are both human affective states, and stress has proven to have important implications for the regulation and expression of emotion. Although a series of methods have been established for multimodal stress detection, limited steps have been taken to explore the underlying inter-dependence between stress and emotion. In this work, we investigate the value of emotion recognition as an auxiliary task to improve stress detection. We propose MUSER -- a transformer-based model architecture and a novel multi-task learning algorithm with a speed-based dynamic sampling strategy. Evaluations on the Multimodal Stressed Emotion (MuSE) dataset show that our model is effective for stress detection with both internal and external auxiliary tasks, and achieves state-of-the-art results.

* NAACL 2021 accepted

Concept Learning through Deep Reinforcement Learning with Memory-Augmented Neural Networks

Nov 15, 2018

Jing Shi, Jiaming Xu, Yiqun Yao, Bo Xu

Deep neural networks have shown superior performance in many regimes at remembering familiar patterns given large amounts of data. However, the standard supervised deep learning paradigm is still limited when it must learn new concepts efficiently from scarce data. In this paper, we present a memory-augmented neural network motivated by the process of human concept learning. The training procedure, imitating the course of human concept formation, learns how to distinguish samples from different classes and aggregate samples of the same kind. To better utilize the advantages that originate from human behavior, we propose a sequential process during which the network decides how to remember each sample at every step. In this sequential process, a stable and interactive memory serves as an important module. We validate our model on some typical one-shot learning tasks and also on an exploratory outlier detection problem. In all the experiments, our model is highly competitive, matching or outperforming strong baselines.

* 27 pages, 2 figures

Cascaded Mutual Modulation for Visual Reasoning

Sep 06, 2018

Yiqun Yao, Jiaming Xu, Feng Wang, Bo Xu

Visual reasoning is a special visual question answering problem that is multi-step and compositional by nature, and also requires intensive text-vision interactions. We propose CMM: Cascaded Mutual Modulation as a novel end-to-end visual reasoning model. CMM includes a multi-step comprehension process for both question and image. In each step, we use a Feature-wise Linear Modulation (FiLM) technique to enable the textual and visual pipelines to control each other. Experiments show that CMM significantly outperforms most related models and reaches state-of-the-art results on two visual reasoning benchmarks: CLEVR and NLVR, built from synthetic and natural language, respectively. Ablation studies confirm that both our multi-step framework and our visual-guided language modulation are critical to the task. Our code is available at https://github.com/FlamingHorizon/CMM-VR.

* to appear in EMNLP 2018
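For readers unfamiliar with FiLM, the sketch below shows the basic operation the abstract refers to: a conditioning vector is mapped to a per-channel scale (gamma) and shift (beta), which are then applied to a feature map. It illustrates one modulation direction only and is not the CMM architecture; the function names, weights, and tiny shapes are assumptions chosen for illustration.

```python
def linear(W, b, v):
    """Affine map: W @ v + b, with W given as a list of rows."""
    return [sum(w * vi for w, vi in zip(row, v)) + bi for row, bi in zip(W, b)]


def film(features, cond, Wg, bg, Wb, bb):
    """Feature-wise linear modulation.

    `features` is a list of channels, each a flat list of spatial values.
    `cond` is the conditioning vector (for example, a question encoding).
    Each channel c is transformed as gamma[c] * x + beta[c].
    """
    gamma = linear(Wg, bg, cond)
    beta = linear(Wb, bb, cond)
    return [
        [g * f + b for f in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Toy example: 2 channels with 2 spatial positions each.
features = [[1.0, 2.0], [3.0, 4.0]]
cond = [1.0, 2.0]
Wg, bg = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]   # gamma = [1.0, 2.0]
Wb, bb = [[0.0, 0.0], [0.0, 0.0]], [0.5, -1.0]  # beta  = [0.5, -1.0]

modulated = film(features, cond, Wg, bg, Wb, bb)
# -> [[1.5, 2.5], [5.0, 7.0]]
```

Mutual modulation, as described in the abstract, would apply the same kind of operation in both directions, with visual features also producing scales and shifts for the textual pipeline.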

Hierarchical Memory Networks for Answer Selection on Unknown Words

Sep 28, 2016

Jiaming Xu, Jing Shi, Yiqun Yao, Suncong Zheng, Bo Xu

Recently, end-to-end memory networks have shown promising results on the question answering task. They encode past facts into an explicit memory and perform reasoning by making multiple computational steps on the memory. However, memory networks conduct the reasoning on sentence-level memory to output coarse semantic vectors and do not apply any attention mechanism to focus on words, which may cause the model to lose some detailed information, especially when the answers are rare or unknown words. In this paper, we propose a novel Hierarchical Memory Network, dubbed HMN. First, we encode the past facts into sentence-level memory and word-level memory, respectively. Then, $k$-max pooling is applied after the reasoning module on the sentence-level memory to sample the $k$ sentences most relevant to a question, and these sentences are fed into an attention mechanism on the word-level memory to focus on the words in the selected sentences. Finally, the prediction is jointly learned over the outputs of the sentence-level reasoning module and the word-level attention mechanism. The experimental results demonstrate that our approach successfully conducts answer selection on unknown words and achieves better performance than memory networks.

* 10 pages, to appear in COLING 2016
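The sentence-selection step in the abstract can be sketched as follows: given one relevance score per sentence, keep the indices of the top-scoring sentences. This is a minimal illustration with made-up scores; the function name `k_max_indices` and the choice to return the selected indices in their original order are assumptions, not details confirmed by the abstract.

```python
def k_max_indices(scores, k):
    """Return the indices of the k highest scores, in original order."""
    if k <= 0:
        return []
    # Rank indices by score (highest first) and keep the first k.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore the original sentence order among the selected indices.
    return sorted(top)


# Hypothetical sentence-level relevance scores for five stored facts.
sentence_scores = [0.1, 0.7, 0.05, 0.9, 0.3]
selected = k_max_indices(sentence_scores, k=2)
# -> [1, 3]
```

Only the words of the selected sentences would then be passed to the word-level attention, which keeps that second stage small even when the memory holds many facts.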
