The recent surge in open-source Large Language Models (LLMs), such as LLaMA, Falcon, and Mistral, provides diverse options for AI practitioners and researchers. However, most LLMs have only released partial artifacts, such as the final model weights or inference code, and technical reports increasingly limit their scope to high-level design choices and surface statistics. These choices hinder progress in the field by degrading transparency into the training of LLMs and forcing teams to rediscover many details in the training process. We present LLM360, an initiative to fully open-source LLMs, which advocates for all training code and data, model checkpoints, and intermediate results to be made available to the community. The goal of LLM360 is to support open and collaborative AI research by making the end-to-end LLM training process transparent and reproducible by everyone. As a first step of LLM360, we release two 7B parameter LLMs pre-trained from scratch, Amber and CrystalCoder, including their training code, data, intermediate checkpoints, and analyses (at https://www.llm360.ai). We are committed to continually pushing the boundaries of LLMs through this open-source effort. More large-scale and stronger models are underway and will be released in the future.
In this paper, we compare the performances of FAISS and FENSHSES on nearest neighbor search in Hamming space--a fundamental task with ubiquitous applications in nowadays eCommerce. Comprehensive evaluations are made in terms of indexing speed, search latency and RAM consumption. This comparison is conducted towards a better understanding on trade-offs between nearest neighbor search systems implemented in main memory and the ones implemented in secondary memory, which is largely unaddressed in literature.
A growing interest has been witnessed recently in building nearest neighbor search solutions within Elasticsearch--one of the most popular full-text search engines. In this paper, we focus specifically on Hamming space nearest neighbor search using Elasticsearch. By combining three techniques: bit operation, substring filtering and data preprocessing with permutation, we develop a novel approach called FENSHSES (Fast Exact Neighbor Search in Hamming Space on Elasticsearch), which achieves dramatic speed-ups over the existing term match baseline. This will empower Elasticsearch with the capability of fast information retrieval even when documents (e.g., texts, images and sounds) are represented with binary codes--a common practice in nowadays semantic representation learning.
Having the right assortment of shipping boxes in the fulfillment warehouse to pack and ship customer's online orders is an indispensable and integral part of nowadays eCommerce business, as it will not only help maintain a profitable business but also create great experiences for customers. However, it is an extremely challenging operations task to strategically select the best combination of tens of box sizes from thousands of feasible ones to be responsible for hundreds of thousands of orders daily placed on millions of inventory products. In this paper, we present a machine learning approach to tackle the task by formulating the box design problem prescriptively as a generalized version of weighted $k$-medoids clustering problem, where the parameters are estimated through a variety of descriptive analytics. We test this machine learning approach on fulfillment data collected from Walmart U.S. eCommerce, and our approach is shown to be capable of improving the box utilization rate by more than $10\%$.
In this paper, we describe our end-to-end content-based image retrieval system built upon Elasticsearch, a well-known and popular textual search engine. As far as we know, this is the first time such a system has been implemented in eCommerce, and our efforts have turned out to be highly worthwhile. We end up with a novel and exciting visual search solution that is extremely easy to be deployed, distributed, scaled and monitored in a cost-friendly manner. Moreover, our platform is intrinsically flexible in supporting multimodal searches, where visual and textual information can be jointly leveraged in retrieval. The core idea is to encode image feature vectors into a collection of string tokens in a way such that closer vectors will share more string tokens in common. By doing that, we can utilize Elasticsearch to efficiently retrieve similar images based on similarities within encoded sting tokens. As part of the development, we propose a novel vector to string encoding method, which is shown to substantially outperform the previous ones in terms of both precision and latency. First-hand experiences in implementing this Elasticsearch-based platform are extensively addressed, which should be valuable to practitioners also interested in building visual search engine on top of Elasticsearch.
We revisit skip-gram negative sampling (SGNS), a popular neural-network based approach to learning distributed word representation. We first point out the ambiguity issue undermining the SGNS model, in the sense that the word vectors can be entirely distorted without changing the objective value. To resolve this issue, we rectify the SGNS model with quadratic regularization. A theoretical justification, which provides a novel insight into quadratic regularization, is presented. Preliminary experiments are also conducted on Google's analytical reasoning task to support the modified SGNS model.
Recovering matrices from compressive and grossly corrupted observations is a fundamental problem in robust statistics, with rich applications in computer vision and machine learning. In theory, under certain conditions, this problem can be solved in polynomial time via a natural convex relaxation, known as Compressive Principal Component Pursuit (CPCP). However, all existing provable algorithms for CPCP suffer from superlinear per-iteration cost, which severely limits their applicability to large scale problems. In this paper, we propose provable, scalable and efficient methods to solve CPCP with (essentially) linear per-iteration cost. Our method combines classical ideas from Frank-Wolfe and proximal methods. In each iteration, we mainly exploit Frank-Wolfe to update the low-rank component with rank-one SVD and exploit the proximal step for the sparse term. Convergence results and implementation details are also discussed. We demonstrate the scalability of the proposed approach with promising numerical experiments on visual data.
Recovering a low-rank tensor from incomplete information is a recurring problem in signal processing and machine learning. The most popular convex relaxation of this problem minimizes the sum of the nuclear norms of the unfoldings of the tensor. We show that this approach can be substantially suboptimal: reliably recovering a $K$-way tensor of length $n$ and Tucker rank $r$ from Gaussian measurements requires $\Omega(r n^{K-1})$ observations. In contrast, a certain (intractable) nonconvex formulation needs only $O(r^K + nrK)$ observations. We introduce a very simple, new convex relaxation, which partially bridges this gap. Our new formulation succeeds with $O(r^{\lfloor K/2 \rfloor}n^{\lceil K/2 \rceil})$ observations. While these results pertain to Gaussian measurements, simulations strongly suggest that the new norm also outperforms the sum of nuclear norms for tensor completion from a random subset of entries. Our lower bound for the sum-of-nuclear-norms model follows from a new result on recovering signals with multiple sparse structures (e.g. sparse, low rank), which perhaps surprisingly demonstrates the significant suboptimality of the commonly used recovery approach via minimizing the sum of individual sparsity inducing norms (e.g. $l_1$, nuclear norm). Our new formulation for low-rank tensor recovery however opens the possibility in reducing the sample complexity by exploiting several structures jointly.
Illumination variation remains a central challenge in object detection and recognition. Existing analyses of illumination variation typically pertain to convex, Lambertian objects, and guarantee quality of approximation in an average case sense. We show that it is possible to build V(vertex)-description convex cone models with worst-case performance guarantees, for non-convex Lambertian objects. Namely, a natural verification test based on the angle to the constructed cone guarantees to accept any image which is sufficiently well-approximated by an image of the object under some admissible lighting condition, and guarantees to reject any image that does not have a sufficiently good approximation. The cone models are generated by sampling point illuminations with sufficient density, which follows from a new perturbation bound for point images in the Lambertian model. As the number of point images required for guaranteed verification may be large, we introduce a new formulation for cone preserving dimensionality reduction, which leverages tools from sparse and low-rank decomposition to reduce the complexity, while controlling the approximation error with respect to the original cone.