Alibaba
"Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility"
Liquid AI
"a traditional instruct model without reasoning traces."
Inception
Diffusion large language model (dLLM).
Google DeepMind
Knowledge cutoff still January 2025. Announce: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/
Zyphra
For BCI, 'thought-to-text'. Training dataset calcs: (2M hours * 3,600 seconds/hour * 256 samples/second) / 32 samples/token = 57.6B tokens (refined to 45.1B after rigorous filtering); 150,000 steps * 2.16M tokens/batch = 324B total tokens seen during training. Announce: https://www.zyphra.com/post/zuna
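Working, as a quick Python check of the figures in this note:

```python
# Sanity check of the dataset arithmetic above.
hours = 2_000_000                       # 2M hours of recordings
samples = hours * 3_600 * 256           # seconds/hour * samples/second
print(f"{samples / 32 / 1e9:.1f}B")     # 32 samples/token -> 57.6B tokens
steps, batch = 150_000, 2_160_000       # 150,000 steps * 2.16M tokens/batch
print(f"{steps * batch / 1e9:.0f}B")    # 324B tokens seen during training
```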
xAI
No details provided. Announce: https://x.com/elonmusk/status/2023829664318583105
Prime Intellect
Base: GLM-4.5-Air-Base, INTELLECT-3 model. 106BA12B.
Anthropic
1M context. Announce: https://www.anthropic.com/news/claude-sonnet-4-6 Showing GMMLU (Global MMLU by Cohere).
Cohere
70+ languages. Showing GMMLU (Global MMLU by Cohere).
Alibaba
"Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility"
JD Open Source
48B-A3B.
MiniMax
230B-A10B. HLE shown without tools.
Z.AI
744B-A40B. Announce: https://z.ai/blog/glm-5
Nanbeige
SOTA for size (3B)
Alibaba
Base: Qwen3-VL-30B-A3B-Instruct. "an embodied foundation model grounded in physical reality."
Anthropic
Shanghai AI Laboratory/SenseTime
1000TA22B. Assumes base model of Qwen3. "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data"
StepFun
196B-A11B.
Independent
Warning for inappropriate content. Base: Llama-3.1-Nemotron-8B. "trained it on an extended 4chan dataset" "the original, gpt4chan (by Yannic Kilcher) scored especially high in truthfulness (that was b4 benchmaxxing)... outperformed the base tune (the unabliterated one), it also changed its political alignment... People were initially joking about the "alignment tax", I think there's a none trivial substance in all of this. It seems to me just above a marginal error or statistical noise."
Arcee AI
400BA13B. "we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large."
Allen AI
Base: Qwen3-32B. SERA=Soft-verified Efficient Repository Agents. "SERA was built largely by a single Ai2 researcher." https://allenai.org/blog/open-coding-agents "SERA-32B was trained using Soft Verified Generation (SVG), a simple and efficient method that is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. The total cost for data generation and training is approximately $2,000 (40 GPU-days)."
Moonshot AI
1TA32B. 1T parameters and 384 experts. Open source SOTA. "Kimi K2.5 builds on Kimi K2 [15.5T tokens] with continued pretraining over approximately 15T mixed visual and text tokens. [+ 15T=30.5T]"
Z.AI
30B-A3B.
Google DeepMind
Lower MMLU score than the previous MedGemma 1 27B (67.2 vs 87). Announce: https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/
Microsoft
Base: Qwen3-32B.
NVIDIA
"EDEN (environmentally-derived evolutionary network) family of metagenomic foundation models, including a 28 billion parameter model trained on 9.7 trillion nucleotide tokens from BaseData1 . This dataset, at the time of training, contained more than 10 billion novel genes from over 1 million new species, and is intentionally enriched for environmental and host-associated metagenomes, phage sequences, and mobile genetic elements, enabling the model to learn from diverse and novel cross-species evolutionary mechanisms and apply them to key challenges in human health."
Baichuan
"new-generation medical-enhanced large language model"
DeepSeek-AI
39.5BA3.8B. "we explore conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embeddings for O(1) lookup."
Stanford
Uses a leave-one-out contrastive learning approach to align brain activity (EEG), heart activity (ECG), and respiratory signals. 130+ disease categories and 19–20+ clinical PSG channels. Dataset ~12.63B (Calculated based on 585,000 hours of data across 3 modality groups using 5-second window tokens) x 10 epochs.
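Working: one reading that reproduces the ~12.63B figure (an assumption; the note is ambiguous about whether the 10 epochs are already folded in):

```python
# 5-second windows, one token per window per modality group (assumed).
hours, groups, window_s, epochs = 585_000, 3, 5, 10
tokens = hours * 3_600 * groups / window_s    # ~1.26B unique tokens
print(f"{tokens * epochs / 1e9:.2f}B")        # 12.64B total tokens seen
```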
Independent
112GB dataset=30B tokens x 0.5 epochs = 15B tokens.
AI21
52B-A12B. Pre-training tokens from Jamba=1.2T + 500B mid-training.
Liquid AI
For on-device agentic applications. "Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning."
MiroMindAI
Base: Qwen3 235B-A22B. Official demo: https://dr.miromind.ai
TII
Base model: Falcon-H1 (May/2025). Announce: https://huggingface.co/blog/tiiuae/falcon-h1r-7b
DeepSeek-AI
27BA4.14B. Scaling tested with 3B MoE on 1T tokens=334:1. "Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability."
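A minimal sketch of the constraint the quote describes, assuming (not stated in the excerpt) that the manifold is the set of doubly stochastic mixing matrices and the projection is Sinkhorn-style normalization; the identity matrix lies on that manifold, which is what restores the identity-mapping property:

```python
import numpy as np

def sinkhorn_project(M, iters=30):
    # Alternating row/column normalization drives a non-negative matrix
    # toward the doubly stochastic manifold (Sinkhorn-Knopp).
    M = np.abs(M) + 1e-9
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)   # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

H = sinkhorn_project(np.random.rand(4, 4))
print(H.sum(axis=0), H.sum(axis=1))         # both ~[1. 1. 1. 1.]
# The identity matrix is a fixed point, so a residual stream mixed by I
# passes through unchanged (the restored identity-mapping property).
```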
IQuestLab
"IQuest-Coder-V1 captures the dynamic evolution of software logic, delivering state-of-the-art performance across critical dimensions" https://github.com/IQuestLab/IQuest-Coder-V1
SK Hynix
519BA33B.
LG
236BA23B. “EXAONE”=“EXpert AI for EveryONE”.
UZH
Base Model: Qwen 3. 600B tokens of historical data only, restricted to text predating each cutoff year (1913, 1929, 1933, 1939, 1946).
Tencent
Project page: https://wedlm.github.io/ "WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. We instantiate WeDLM on both Qwen2.5-7B and Qwen3-8B, utilizing 100B tokens for continued training and 10B tokens for SFT."
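A toy illustration of the Topological Reordering idea from the quote (hypothetical token values; the real system operates on KV caches and position embeddings):

```python
# Pack observed tokens at the physical front so strict causal attention
# lets every masked slot see all of them; logical order lives in the ids.
MASK = None
tokens    = ["The", MASK, "sat", MASK, "mat"]   # hypothetical partial sequence
positions = list(range(len(tokens)))

observed = [(t, p) for t, p in zip(tokens, positions) if t is not MASK]
masked   = [(t, p) for t, p in zip(tokens, positions) if t is MASK]
reordered = observed + masked            # physical order fed to the model

print([t for t, _ in reordered])   # ['The', 'sat', 'mat', None, None]
print([p for _, p in reordered])   # [0, 2, 4, 1, 3] (logical positions kept)
# The observed prefix is contiguous, so it is reusable as a standard
# prefix cache while the masked tail is generated in parallel.
```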
Upstage AI
South Korean. 102BA12B. Releasing 31/Dec.
Z.AI
355B-A32B. "context window has been expanded from 128K to 200K tokens"
NVIDIA
"NitroGen is a unified vision-to-action model designed to play video games directly from raw frames. It takes video game footage as input and outputs gamepad actions... trained on 40,000 hours of gameplay videos across more than 1,000 games."
Xiaomi
309BA15B.
Google DeepMind
"FunctionGemma, a specialized version of our Gemma 3 270M model tuned for function calling. It is designed as a strong base for further training into custom, fast, private, local agents that translate natural language into executable API actions."
Google DeepMind
Base model: Gemma 3. Dataset: Gemma 3 4B checkpoint (4T) + pretraining (2T)=6T.
Google DeepMind
Announce: https://deepmind.google/models/gemini/flash/
NVIDIA
Knowledge cutoff November 28, 2025 (post).
Allen AI
Base Model: Olmo 3 7B. Announce: https://allenai.org/blog/bolmo
Consortium
A fully open language model developed in Europe.
Inclusion AI
Base Model: Ling-flash-2.0 (103B total parameters with 6.1B activated). "largest diffusion language model to date"
OpenAI
"GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations." Announce: https://openai.com/index/introducing-gpt-5-2/ MMLU is for Spanish.
ServiceNow
Motif-Technologies
Mistral
SWE-bench Verified=72.2%.
Nanbeige4-3B-Base
Tencent
406BA32B.
MBZUAI
8.5x more tokens trained than K2 (1.4T v 12T). Project page: https://ifm.ai/k2/
Arcee AI
26BA3B. "we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large."
Amazon
"Nova 2 Pro is Amazon's most intelligent reasoning model that can process text, images, video, and speech to generate text."
Mistral
675BA41B. "Mistral Large 3 joins the ranks of frontier instruction-fine-tuned open-source models." EU tech doc: https://legal.cms.mistral.ai/assets/1e37fffd-7ea5-469b-822f-05dcfbb43623
DeepSeek-AI
The word 'Speciale' may be a reference to Ferrari. "It shows gold-medal performance in the IOI 2025, ICPC World Final 2025, IMO 2025, and CMO 2025." API: https://api-docs.deepseek.com/news/news251201
DeepSeek-AI
"DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled testtime compute. "
NVIDIA
Base Model: Qwen3-8B
Prime Intellect
Base: GLM-4.5-Air-Base model. 106BA12B. Announce: https://www.primeintellect.ai/blog/intellect-3
Microsoft
"Fara-7B is Microsoft's first agentic small language model (SLM) designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA)...Current production baselines leverage Qwen 2.5-VL (7B)."
Anthropic
"the best model in the world for coding, agents, and computer use." Announce: https://www.anthropic.com/news/claude-opus-4-5
NVIDIA
"Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning...We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens"
Tencent
Base model: Qwen2.5-VL-7B-Instruct. "GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. " Project page: https://ekonwang.github.io/geo-vista/
Allen AI
Announce: https://allenai.org/blog/olmo3
Google DeepMind
"The knowledge cutoff date for Gemini 3 Pro was January 2025."
xAI
PleIAs
"The name is both a nod to French origins and to the unusual shape of the model: with 80 layers, Baguettotron is currently the deepest SLM in its size range."
Baidu
Very low performance on ALPrompt. 2.4T params confirmed: https://global.chinadaily.com.cn/a/202511/13/WS691571bda310d6866eb29500.html
OpenAI
Personality change via fine-tuning. GPQA (no tools) increased from GPT-5=85.7 to GPT-5.1=88.1. MMLU is for Spanish.
NVIDIA
Base model: Qwen3-8B (36T) + 150B continual training. "TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks"
Tsinghua
"JustRL, a simple recipe with fixed hyperparameters, achieves state-of-the-art performance on two different 1.5B base models (54.5% and 64.3% across 9 math benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training remains stable over thousands of steps without intervention. This suggests the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline."
Baidu
28B-A3B. Open-sourced 12/Nov/2025 from Jun/2025 release.
Google DeepMind
"Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called HOPE, showing promising results in language modeling, continual learning, and long-context reasoning tasks." Announce: https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/ May be released after paper is public.
Moonshot AI
1TA32B. 1T parameters and 384 experts. Open source SOTA. HLE=51.0 on text-only subset, compare to Grok-4 HLE=50.7 also on text-only, but Grok-4 HLE=44.4 on HLE full, ∴ Kimi K2 Thinking HLE≈44 full (estimated).
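Working for the estimate:

```python
# Proportional scaling assumption, not a published score.
kimi_text, grok_text, grok_full = 51.0, 50.7, 44.4
print(round(kimi_text * grok_full / grok_text, 1))   # 44.7, i.e. ~44
```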
Inclusion AI
1TA50B.
Generalist
"GEN-0, a new class of embodied foundation models built for multimodal training directly on high-fidelity raw physical interaction. Its architecture builds on the strengths of vision and language models while also going beyond them—natively designed to capture human-level reflexes and physical commonsense. One core feature is Harmonic Reasoning, in which the models are trained to simultaneously think and act seamlessly... GEN-0 is pretrained on our in-house robotics dataset, which includes over 270,000 hours of real-world diverse manipulation data, growing at a rate of 10,000 hours a week and accelerating."
"Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy... We train our models on the Pile uncopyrighted dataset (Gao et al., 2020). The raw text is processed with the Llama 3 tokenizer (Grattafiori et al., 2024), resulting in a training set of ∼230B tokens."
Moonshot AI
48B-A3B. "Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short, long, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA)—a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory."
MiniMax
230B-A10B.
Cambridge/LBNL
MACE-MH-1 (Multi-Head 1). Features Multiple Heads (OMAT PBE, OMOL r2scan, OC20) to maintain high accuracy across domains.
DeepSeek-AI
Representing 1D text as 2D vision tokens achieves huge compression. Encoder/Decoder: DeepEncoder 380M (80M SAM-base + 300M CLIP-large), DeepSeek-3B-MoE (A570M).
Microsoft
"we trained UserLM-8b to simulate the “user” role in conversation (by training it to predict user turns in a large corpus of conversations called WildChat)."
Salesforce
"diffusion coder trained on TPU [Google TPU v4-1024 VM]"
Samsung
"Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers"
IBM
32B-A9B. Announce: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models
Z.AI
355B-A32B. "context window has been expanded from 128K to 200K tokens"
InclusionAI
1T-A48.5B.
Anthropic
The Claude Sonnet 4.5 "system card" is an absolute farce. Announce: https://www.anthropic.com/news/claude-sonnet-4-5
Google DeepMind
2. "vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task." Available to select partners.
Google DeepMind
1. "vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission." Available to all devs.
TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.
Alibaba
"Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. "
Alibaba
"Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response."... "pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion)."
DeepSeek-AI
Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2
Perceptron
"perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in."
xAI
"2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model."
Google DeepMind
"Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points." Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/
Alibaba
"Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference."
MBZUAI
"Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets."
JHU
"a modern multilingual encoder trained on 3T tokens and 1833 languages. We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data" Announce: https://huggingface.co/blog/mmbert
Baidu
Baidu
Kuaishou
46B-A2.5B.
Tilde AI
"language data from across Europe"
Alibaba
GPQA score is SuperGPQA. "our biggest model yet, with over 1 trillion parameters"
Moonshot AI
1TA32B. 1T parameters and 384 experts. Open source SOTA.
ETH Zürich
"Apertus – Latin for “open”" 1,811 languages. Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html
Meituan
560B-A18.6B–31.3B (27B on average). Announce: https://lmsys.org/blog/2025-09-01-sglang-longcat-flash/
Baichuan
Base: Qwen2.5. "medical augmented reasoning model"
Microsoft
MAI=Microsoft artificial intelligence. "MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. We will be rolling MAI-1-preview out for certain text use cases within Copilot"
xAI
"We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks." Announce: https://x.ai/news/grok-code-fast-1
Nous Research
Based on Llama 3. Announce: https://hermes4.nousresearch.com/
NVIDIA
"pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens."
DeepSeek-AI
Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2
NVIDIA
Announce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/
Google DeepMind
OpenAI
Announce: https://openai.com/index/introducing-gpt-5/. MMLU is based on ES and PT translated from EN.
OpenAI
116.8B total parameters and 5.1B “active” parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/
OpenAI
20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/
Anthropic
Z.AI
355B-A32B.
China Telecom Artificial Intelligence Research Institute
Shanghai AI Laboratory/SenseTime
41T tokens assumes a Qwen3 base (36T) plus the 5T multimodal continued pretraining. "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data"
StepFun
321B-A38B. https://x.com/CyouSakura/status/1948767450751009227
Alibaba
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
Kuaishou
200BA40B. In training as of Jul/2025. "to address the overthinking problem in reasoning-intensive tasks"
Kuaishou
"to address the overthinking problem in reasoning-intensive tasks"
Alibaba
480B-A35B.
Alibaba
235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.
Allen AI
37B-A20B. "We adopt the OLMo-2 7B setup, starting from a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert. We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO."
LG
“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: "To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct, a pre-filtering step that removes problems that are easy for the model to avoid inefficient training."
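A sketch of that accuracy-based pre-filtering step; `generate` and `is_correct` are hypothetical stand-ins:

```python
# Drop any RL training problem the SFT model already solves 8/8 times,
# keeping only samples that still carry a learning signal.
def filter_for_rl(problems, generate, is_correct, n=8):
    kept = []
    for prob in problems:
        answers = [generate(prob) for _ in range(n)]
        if not all(is_correct(prob, a) for a in answers):
            kept.append(prob)   # at least one failure: still informative
    return kept
```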
Moonshot AI
1TA32B. 1T parameters and 384 experts. Open source SOTA.
Reka AI
Mistral
Non-reasoning.
xAI
2.4T? https://x.com/kalomaze/status/1942996555088134592 "The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before."
Kwaipilot
200BA40B.
Kwaipilot
Microsoft
"Pre-training: 5T tokens; Reasoning training: 150B tokens" "At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. "
Google DeepMind
Related paper: https://arxiv.org/abs/2504.06225. Dataset was Gemma 2 9B on 8T tokens + 2T tokens adapted.
Google DeepMind
Multimodal model. Text MMLU score is for the medical subset only (87.0).
TNG
Assembly-of-Experts method merging V3-0324, R1, and R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46
Consortium
"Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights"
Apple
"We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024)."
Tencent
80B-A13B. 'We have open-sourced Hunyuan-A13B-Pretrain, Hunyuan-A13B-Instruct, Hunyuan-A13B-Instruct-FP8, Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.'
Inception
Diffusion large language model (dLLM).
Microsoft
"distillation from Microsoft’s Phi models...Mu is an efficient 330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture"
Google DeepMind
See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/
ICONNAI
"ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving."
MiniMax
456B-A45.9B. Announce: https://www.minimax.io/news/minimaxm1
Mistral
Magistral Small=24B. Announce: https://mistral.ai/news/magistral
EleutherAI
"Comma v0.1-2T is a decoder-only transformer that uses the same architecture as Llama 3. Training was done in two stages: first on 1.93 trillion tokens with a cosine learning rate schedule, and second a "cool-down" training phase on 75.5 billion tokens from high-quality sources. The final model is the average of 10 checkpoints during this cool-down phase. Both training phases use a batch size of 8.3 million tokens per step. Training was performed using lingua on 512 AMD MI300A GPUs."
Xiaohongshu/RedNote
142B-A14B. "dots.llm1, a large-scale MoE model that activates 14 billion parameters out of a total of 142 billion parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models."
Google DeepMind
"an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be the generally available, stable version starting in a couple of weeks, ready for enterprise-scale applications."
Xiaomi
"[2025.05.30] During the RL training, by continuously expanding the training window size (from 32K to 48K), the performance of MiMo-7B-RL-0530 on AIME24 can be continuously improved and eventually surpass that of DeepSeek R1... MiMo-7B-Base is pre-trained on approximately 25 trillion tokens."
Google DeepMind
"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."
Google DeepMind
"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."
DeepSeek-AI
Censorship increased significantly. "overall performance is now approaching that of leading models, such as o3 and Gemini 2.5 Pro." MMLU shows MMLU-Redux score with lower error rate.
Fractal Analytics
Base R1-distilled-14B model, based on Qwen 14B. Media release.
Alibaba
"the first long-context LRM trained with reinforcement learniing for long-context reasoning."
Anthropic
"Claude Opus 4 is our most intelligent model to date, pushing the frontier in coding, agentic search, and creative writing. With advanced reasoning and powerful collaboration capabilities…Both models can also alternate between reasoning and tool use—like web search—to improve responses…Claude Opus 4 can work continuously for hours on complex, long-running tasks"
TII
"hybrid architecture that combines the strengths of the classical Transformer-based attention mechanism with the State Space Model (SSM), known for its superior long-context memory and computational efficiency."
Google DeepMind
"Gemini Diffusion’s external benchmark performance is comparable to much larger models [like Gemini-2.0-Flash-Lite], whilst also being faster."
Google DeepMind
Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M).
Alibaba
"We introduce the third scaling paradigm for scaling LLMs: leverages parallel computation during both training and inference time (Parallel Scaling, or ParScale)... ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget." MMLU shows for 1.8B models, not the 4.7B models.
OpenAI
o3 base. "codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result."
TII
"Falcon-Edge series - a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture."
Windsurf
"SWE-1, optimized for the entire software engineering process, not just the task of coding."
Prime Intellect
QwQ-32B base. Announce: https://www.primeintellect.ai/blog/intellect-2-release Finished training 30/Apr/2025: https://app.primeintellect.ai/intelligence/intellect-2
Huawei
718B-A39B. Trained on 6,000 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers).
Mistral
Multimodal. 50B param estimate based on "Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments of four GPUs and above.". Note: "With the launches of Mistral Small in March and Mistral Medium today, it’s no secret that we’re working on something ‘large’ over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we’re excited to ‘open’ up what’s to come :) "
IBM
"the model is only partially trained—it has only seen 2.5T of a planned 15T or more training tokens...Granite 4.0 Tiny-Preview, specifically, is a fine-grained hybrid mixture of experts (MoE) model, with 7B total parameters and only 1B active parameters at inference time... Like its predecessors in Granite 3.2 and Granite 3.3, Granite 4.0 Tiny Preview offers toggleable thinking on and thinking off functionality (though its reasoning-focused post-training is very much incomplete)."
Amazon
Announce: https://aws.amazon.com/blogs/aws/amazon-nova-premier-our-most-capable-model-for-complex-tasks-and-teacher-for-model-distillation/
Microsoft
"Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning."
IBM
"During Christmas of 2024, IBM, Princeton, CMU, and UIUC released, Bamba v1, a performant Mamba2 based pretrained model with full data lineage trained to 2T tokens. Since then, we have been busy cooking an update with new datasets. Today, we are excited to release Bamba v2, trained for an additional 1T tokens that significantly improves on Bamba v1. The L1 and L2 leaderboard scores outperform Llama 3.1 8B, which was trained with nearly 5x the amount of data. All of this with the inference speedup that we get from Mamba2 based architecture, which with the latest vLLM is 2-2.5x faster than similar sized transformer models."
Alibaba
Qwen3-235B-A22B. Qwen3-30B-A3B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages"
Alibaba
Record data ratio 60,000:1. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages"
Baidu
Announce: https://x.com/Baidu_Inc/status/1915603080336597310
Baidu
Announce: https://x.com/Baidu_Inc/status/1915603080336597310
Microsoft
DeepSeek-R1 base. "MAI-DS-R1, a new open weights DeepSeek R1 model variant... post-trained by the Microsoft AI team to improve its responsiveness on blocked topics and its risk profile, while maintaining its reasoning capabilities and competitive performance."
Google DeepMind
Context in=1M, out=64k. Knowledge cutoff Jan/2025. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
OpenAI
https://openai.com/index/introducing-o3-and-o4-mini/ MMLU shows a translated LOTE.
OpenAI
https://openai.com/index/introducing-o3-and-o4-mini/ MMLU shows a translated LOTE.
Microsoft
"the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens"
IBM
"Built on top of an updated Granite 3.3 base model and fine-tuned through multi-stage reinforcement learning using TPO and Group Relative Policy Optimization (GRPO), both Granite 3.3 Instruct models demonstrated significant improvement on the highly technical benchmarks conventionally associated with “reasoning” capabilities."
Zhipu AI (Tsinghua)
Family: GLM-4-32B-Base-0414, GLM-4-32B-0414, GLM-Z1-32B-0414 (reasoning), GLM-Z1-Rumination-32B-0414 (reasoning + deep research).
AI Singapore
"Based on Llama 3.1 70B. SEA-LION v3.5, our first set of hybrid reasoning models trained on Southeast Asian data. Mode selection is managed through the tokenizer’s chat template and offers versatile functionality, handling both complex reasoning tasks and general text generation."
OpenAI
Outperforms GPT‑4o "across the board, with major gains in coding and instruction following. They also have larger context windows—supporting up to 1 million tokens of context—and are able to better use that context with improved long-context comprehension. They feature a refreshed knowledge cutoff of June 2024."
Google DeepMind
"trained on Atlantic spotted dolphin sounds, we anticipate its potential utility for researchers studying other cetacean species, like bottlenose or spinner dolphins... Developed by Google, this AI model makes use of specific Google audio technologies: the SoundStream tokenizer efficiently represents dolphin sounds, which are then processed by a model architecture suited for complex sequences. This ~400M parameter model is optimally-sized to run directly on the Pixel phones WDP uses in the field."
ServiceNow
SLAM - ServiceNow Language Models Lab. The first release in the Apriel model family, designed to support research on foundation models.
ByteDance
200B-A20B. "Seed-Thinking-v1.5, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed-Thinking-v1.5 achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding."
Huawei
"with Huawei Noah’s Ark Lab, we [Hong Kong University] release Dream 7B (Diffusion reasoning model), the most powerful open diffusion large language model to date."
NVIDIA
Llama-3.1-8B-Instruct base. 4M context window.
Together
Base DeepSeek-R1-Distill-Qwen-14B.
Huawei
Trained on 8,192 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers).
NVIDIA
https://research.nvidia.com/labs/adlr/nemotronh/
NVIDIA
Llama 3.1 405B base. "Llama-3.1-Nemotron-Ultra-253B-v1 is a large language model (LLM) which is a derivative of Meta Llama-3.1-405B-Instruct (AKA the reference model). It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. The model supports a context length of 128K tokens. This model fits on a single 8xH100 node for inference."
Meta AI
2T-A288B. Announced Apr/2025, abandoned Jul/2025. "We also trained a teacher model, Llama 4 Behemoth, that outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks such as MATH-500 and GPQA Diamond... 288B active parameters, 16 experts, and nearly two trillion total parameters."
Meta AI
400B-A17B. "Our most powerful open source multimodal model. 17B active params x 128 experts, 400B total params"
Meta AI
200 languages, "includes diverse text, image, and video datasets."
Google DeepMind
"Sec-Gemini v1 achieves this by combining Gemini’s advanced capabilities with near real-time cybersecurity knowledge and tooling. This combination allows it to achieve superior performance on key cybersecurity workflows, including incident root cause analysis, threat analysis, and vulnerability impact understanding."
DeepSeek-AI
Gemma-2-27B base. "Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models... The models will be released and open-sourced."
Featherless AI
"As demonstrated with our Qwerky-72B-Preview and prior models such as QRWKV6-32B Instruct Preview, we have successfully converted Qwen 2.5 72B into a RWKV variant without requiring a pretrain on the base model or retraining the model from scratch. Enabling us to test and validate the more efficient RWKV Linear attention" Dataset from Qwen2.5=18,000 tokens.
Deep Cogito
"We are releasing early checkpoints of models in sizes 3B, 8B, 14B, 32B and 70B trained using this methodology, starting from pretrained Llama / Qwen base checkpoints."
Google DeepMind
"a therapeutics-focused agentic system powered by Gemini 2.0 Pro. Agentic-Tx is equipped with 18 tools, including: TxGemma as a tool for multi-step reasoning"
Google DeepMind
"a suite of efficient, generalist large language models (LLMs) capable of therapeutic property prediction as well as interactive reasoning and explainability. Unlike task-specific models, TxGemma synthesizes information from diverse sources, enabling broad application across the therapeutic development pipeline."
Google DeepMind
Context in=1M, out=64k. Knowledge cutoff Jan/2025. HLE SOTA. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
DeepSeek-AI
Non-reasoning. Significant increase in benchmark performance compared to original V3 from Dec/2024: MMLU-Pro: 75.9 ➜ 81.2, GPQA: 59.1 ➜ 68.4. 37B active.
NVIDIA
Meta Llama-3.3-70B-Instruct derivative "that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. The model supports a context length of 128K tokens."
LG
“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio dropped: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T (Dec/2024) -> EXAONE-3.5 32B (also Deep)=6.5T. Announce: https://www.lgresearch.ai/news/view?seq=543
Mistral
"Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance."
Baidu
424B-A47B. Announce: https://x.com/Baidu_Inc/status/1901094083508220035
Baidu
Allen AI
"the first fully-open model (all data, code, weights, and details are freely available) to outperform GPT3.5-Turbo and GPT-4o mini on a suite of popular, multi-skill academic benchmarks. It is comparable to the leading open-weight models while requiring only a fraction of training compute."
Cohere
Context=256k. "Command A is an open weights research release of a 111 billion parameter model optimized for demanding enterprises that require fast, secure, and high-quality AI. Compared to other leading proprietary and open-weights models Command A delivers maximum performance with minimum hardware costs, excelling on business-critical agentic and multilingual tasks while being deployable on just two GPUs."
Google DeepMind
Gemini 2.0 Pro (cloud). "The second model is Gemini Robotics, a state-of-the-art Vision-Language-Action (VLA) model that connects strong embodied reasoning priors to dexterous low-level control of real-world robots to solve challenging manipulation tasks. As a generalist VLA, Gemini Robotics can perform a wide array of diverse and complicated tasks, while also closely following language guidance and generalizing to distribution shifts in instructions, visuals, and motions. To emphasize the flexibility and generality of the Gemini Robotics models, we also introduce an optional specialization stage, which demonstrates how Gemini Robotics can be adapted for extreme dexterity, for advanced reasoning in difficult generalization settings, and for controlling completely new robot embodiments."
Google DeepMind
Gemini 2.0 Flash (on device). "The first model is Gemini Robotics-ER, a VLM with strong embodied reasoning capabilities at its core, exhibiting generalization across a wide range of embodied reasoning tasks while also maintaining its core foundation model capabilities. Gemini Robotics-ER exhibits strong performance on multiple capabilities critical for understanding the physical world, ranging from 3D perception to detailed pointing to robot state estimation and affordance prediction via code."
Google DeepMind
Trained on 1T more tokens than Gemma 2. "introduces vision understanding abilities, a wider coverage of languages and longer context – at least 128K tokens."
Reka AI
"performs competitively with proprietary models such as OpenAI o1-mini, making it a good foundation to build applications that require low latency or on-device deployment. It is currently the best open model in its size category."
Alibaba
Update to QwQ-32B-Preview released Nov/2024. Scores 1/5 on latest ALPrompt 2024 H2. Qwen with Questions=QwQ.
AI21
"The AI21 Jamba 1.6 family of models is state-of-the-art, hybrid SSM-Transformer instruction following foundation models. The Jamba models are the most powerful & efficient long-context models on the market, which deliver up to 2.5X faster inference than leading models of comparable sizes."
AMD
"trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts."
Alibaba
"top 25 languages by number of speakers, including English, Chinese, Hindi, Spanish, Arabic, French, Bengali, Portuguese, Russian, Urdu, Indonesian, German, Japanese, Swahili, Filipino, Tamil, Vietnamese, Turkish, Italian, Javanese, Korean, Hausa, Persian, Thai, and Burmese. These 25 languages support over 90% of the global population..."
IBM
"The new Granite 3.2 8B Instruct [offers] experimental chain-of-thought reasoning capabilities "
Cohere
"C4AI Command R7B Arabic is an open weights research release of a 7 billion parameter custom model with advanced capabilities optimized for the Arabic language (MSA dialect) along with English. The model excels at tasks that enterprises care about: instruction following, length control, RAG, and responding in the correct language. It also demonstrates excellent general purpose knowledge and understanding of Arabic language and cultures."
OpenAI
"Our largest and best model for chat" https://openai.com/index/introducing-gpt-4-5/ "GPT-4.5 is not a frontier model, but it is OpenAI’s largest LLM, improving on GPT-4’s computational efficiency by more than 10x. While GPT-4.5 demonstrates increased world knowledge, improved writing ability, and refined personality over previous models, it does not introduce net-new frontier capabilities compared to previous reasoning releases, and its performance is below that of o1, o3-mini, and deep research on most preparedness evaluations."
Tencent
"Based on Turbo S, by introducing technologies such as long thinking chains, retrieval enhancement and reinforcement learning, Hunyuan also launched the reasoning model T1 with deep thinking. This model has been fully launched on Tencent Yuanbao ( Tencent Hunyuan T1 model is open to all users ) , users can choose Deepseek R1 or Tencent Hunyuan T1 model to answer. The official version of Tencent Hunyuan T1 model will be launched soon, providing API access and other services to the outside world."
Tencent
Fast thinking ("Instant reply"). "This is also the first time in the industry that the Mamba architecture has been successfully applied losslessly to a very large MoE model."
Microsoft
"Training data: 5T tokens, 2.3M speech hours, and 1.1T image-text tokens" Announce: https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/
Microsoft
"Phi-4-mini’s training data includes a wide variety of sources, totaling 5 trillion tokens, and is a combination of publicly available documents filtered for quality, selected high-quality educational data, and code newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (e.g., science, daily activities, theory of mind, etc.) high quality chat format supervised data covering various topics to reflect human preferences" Announce: https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/
Inception
Diffusion large language model (dLLM). Very low 'IQ' performance (0/5 on all ALPrompts). Fast: 1,000tok/s. https://x.com/inceptionailabs/status/1894847921474150456
Alibaba
"As a sneak peek into our upcoming QwQ-Max release, this version offers a glimpse of its enhanced capabilities, with ongoing refinements and an official Apache 2.0-licensed open-source launch of QwQ-Max and Qwen2.5-Max planned soon." Announce: https://x.com/Alibaba_Qwen/status/1894130603513319842
Anthropic
Knowledge cutoff now November 2024 (was April 2024). "the first hybrid reasoning model on the market." https://www.anthropic.com/news/claude-3-7-sonnet
Moonshot AI
"Scaling law experiments indicate that Muon achieves ∼ 2× computational efficiency compared to AdamW with compute optimal training." https://github.com/MoonshotAI/Moonlight?tab=readme-ov-file
Figure
Likely based on OpenVLA 7B (Jun/2024, based on Llama 2 7B) or Molmo 7B-O (Sep/2024, based on OLMo-7B-1024 with OpenAI CLIP). "high quality, multi-robot, multi-operator dataset of diverse teleoperated behaviors, ~500 hours in total. To generate natural language-conditioned training pairs, we use an auto-labeling VLM to generate hindsight instructions. The VLM processes segmented video clips from the onboard robot cameras, prompted with: "What instruction would you have given the robot to get the action seen in this video?" All items handled during training are excluded from evaluations to prevent contamination. Architecture Our system comprises two main components: S2, a VLM backbone, and S1, a latent-conditional visuomotor transformer. S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data. It processes monocular robot images and robot state information (consisting of wrist pose and finger positions) after projecting them into vision-language embedding space. Combined with natural language commands specifying desired behaviors, S2 distills all semantic task-relevant information into a single continuous latent vector, passed to S1 to condition its low-level actions. S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level control. It relies on a fully convolutional, multi-scale vision backbone for visual processing, initialized from pretraining done entirely in simulation. While S1 receives the same image and state inputs as S2, it processes them at a higher frequency to enable more responsive closed-loop control. The latent vector from S2 is projected into S1's token space and concatenated with visual features from S1's vision backbone along the sequence dimension, providing task conditioning. S1 outputs full upper body humanoid control at 200hz, including desired wrist poses, finger flexion and abduction control, and torso and head orientation targets. We append to the action space a synthetic "percentage task completion" action, allowing Helix to predict its own termination condition, which makes it easier to sequence multiple learned behaviors."
Figure
"high quality, multi-robot, multi-operator dataset of diverse teleoperated behaviors, ~500 hours in total. To generate natural language-conditioned training pairs, we use an auto-labeling VLM to generate hindsight instructions. The VLM processes segmented video clips from the onboard robot cameras, prompted with: "What instruction would you have given the robot to get the action seen in this video?" All items handled during training are excluded from evaluations to prevent contamination. Architecture Our system comprises two main components: S2, a VLM backbone, and S1, a latent-conditional visuomotor transformer. S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data. It processes monocular robot images and robot state information (consisting of wrist pose and finger positions) after projecting them into vision-language embedding space. Combined with natural language commands specifying desired behaviors, S2 distills all semantic task-relevant information into a single continuous latent vector, passed to S1 to condition its low-level actions. S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level control. It relies on a fully convolutional, multi-scale vision backbone for visual processing, initialized from pretraining done entirely in simulation. While S1 receives the same image and state inputs as S2, it processes them at a higher frequency to enable more responsive closed-loop control. The latent vector from S2 is projected into S1's token space and concatenated with visual features from S1's vision backbone along the sequence dimension, providing task conditioning. S1 outputs full upper body humanoid control at 200hz, including desired wrist poses, finger flexion and abduction control, and torso and head orientation targets. We append to the action space a synthetic "percentage task completion" action, allowing Helix to predict its own termination condition, which makes it easier to sequence multiple learned behaviors."
Baichuan
Medical LLM. Huge increase to 20T training tokens, well above the standard for a 14B-param model.
Arc Institute
"Evo 2 is a state of the art DNA language model for long context modeling and design. Evo 2 models DNA sequences at single-nucleotide resolution at up to 1 million base pair context length using the StripedHyena 2 architecture. Evo 2 was pretrained using Savanna. Evo 2 was trained autoregressively on OpenGenome2, a dataset containing 8.8 trillion tokens from all domains of life." Greg Brockman co-author.
Perplexity
Censorship reduced, based on DeepSeek-R1.
xAI
https://x.ai/blog/grok-3 My full analysis: https://lifearchitect.ai/whats-in-grok/
Mistral
"Mistral Saba is a 24B parameter model trained on meticulously curated datasets from across the Middle East and South Asia."
Barcelona Supercomputing Center
"The final [pre-training] dataset is composed of 55.51% FineWeb-Edu, 25.32% Colossal Oscar, 8.38% Wikipedia, 7.17% Aya Collection, and 3.63% StarCoder, totalling 315 billion tokens."
Nous Research
Based on Llama 3 8B. GPQA score based on GPT-4o's analysis of the chart :-/ "one of the first models in the world to unify Reasoning (long chains of thought that improve answer accuracy) and normal LLM response modes into one model." https://x.com/NousResearch/status/1890148004029759612
Shanghai AI Laboratory/SenseTime
OREAL=Outcome REwArd-based reinforcement Learning.
Google DeepMind
Context=2M. Disappointing benchmarks, this is the 'pro' (medium) not 'ultra' (large) model. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
Stanford
Based on Qwen2.5-32B-Instruct. "we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24)."
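A sketch of budget forcing; `generate` and the "</think>" delimiter are hypothetical stand-ins:

```python
# Cap thinking by forcing the end-of-thinking delimiter at the budget,
# or lengthen it by appending "Wait" so the model re-checks its steps.
def think_with_budget(prompt, generate, extra_waits=0, budget=4096):
    trace = generate(prompt, stop="</think>", max_tokens=budget)
    for _ in range(extra_waits):        # suppress the natural stop and
        trace += "\nWait"               # nudge the model to double-check
        trace += generate(prompt + trace, stop="</think>", max_tokens=budget)
    return prompt + trace + "</think>"  # forcefully close the thinking
```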
OpenAI
GPQA=79.7 for 'high' thinking. ALPrompt 2025H1=1/5. My analysis is that this model’s performance is very poor, with responses often becoming disordered and illogical. OpenAI compared o3-mini to OpenAI’s software engineers, and it performed very poorly (o3-mini=0%, o1=12%). "o3-mini models have the lowest performance, with scores of 0%… We suspect o3-mini’s low performance is due to poor instruction following and confusion about specifying tools in the correct format. The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance." (o3-mini paper, p31)
Mistral
MMLU=base, -Pro=base, GPQA=instruct. "When quantized, Mistral Small 3 can be run privately on a single RTX 4090 or a Macbook with 32GB RAM." "Mistral Small 3 is neither trained with RL nor synthetic data"
Allen AI
Lower MMLU score than Llama 3.1 405B base.
Alibaba
"Qwen2.5-Max emerges as a milestone in MoE development, featuring an impressive 325 billion parameters. The model has been pretrained on over 20 trillion tokens and further refined with advanced post-training methodologies such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)." https://wandb.ai/byyoung3/ml-news/reports/Qwen2-5-Max-Advancing-Large-Scale-Mixture-of-Expert-Models---VmlldzoxMTEyMjUyNg
SambaNova
"efficient byte-level processing at scale... [compared to tokenizer-based LMs:] 5x less training data, excelling in coding tasks, and decoding up to 2x faster. Its token-free design also brings added flexibility, avoiding tokenizer quirks while naturally extending to multimodal applications without any architecture tweaks."
ByteDance
VLM. SoTA agent 'computer use' model as of 23/Jan/2025.
ByteDance
Includes 2.4B param ViT. "Doubao-1.5-pro uses a sparse MoE architecture. In the pre-training stage, the performance of the MoE model activated with only a small number of parameters can exceed that of ultra-large dense pre-trained models such as Llama3.1-405B. Through the study of the sparsity scaling law, the team determined the sparse ratio that balances performance and efficiency, and determined based on the MoE scaling law that a model activated with a small number of parameters can achieve the performance of a world-class model."
Moonshot AI
"our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities---e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista---matching OpenAI's o1". GPQA score is my estimate from pp13–14, noting that "the scores above come from an internal long-cot model with much smaller model size than k1.5 long-CoT model."
DeepSeek-AI
"DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks"
OpenAI
Protein sequence model. "The model was trained on examples of protein sequences from many species, as well as information on which proteins tend to interact with one another. While that’s a lot of data, it’s just a fraction of what OpenAI’s flagship chatbots were trained on, making GPT-4b an example of a “small language model” that works with a focused data set." https://www.technologyreview.com/2025/01/17/1110086/openai-has-created-an-ai-model-for-longevity-science/
Kyutai
"Helium-1 preview, an initial version of our new backbone language model with 2B parameters, targeting edge and mobile devices... We use token level distillation of a 7B parameters model to train Helium-1 preview."
Shanghai AI Laboratory/SenseTime
"InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale." Playground: https://internlm-chat.intern-ai.org.cn/
MiniMax
A45.9B (45.9B active parameters). "The pre-training corpus for MiniMax-Text-01 encompasses a comprehensive and meticulously curated dataset, incorporating diverse sources including academic literature, books, web content, and programming code... repeatedly training high-quality documents can lead to enhanced downstream performance, with certain high-quality domains being trained up to 50 times... Our findings indicate that low-quality data suffer a substantial decrease in performance after training for more than two epochs, while high-quality data can be effectively trained for up to four epochs" Login playground: https://www.hailuo.ai/
Berkeley
"To generate our training data we use QwQ-32B-Preview, an open-source model with reasoning capabilities comparable to o1-preview. We curate the data mixture (see later section) to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality. We then rewrite QwQ traces with GPT-4o-mini into a well-formatted version, inspired by Still-2, to improve data quality and ease parsing... Rejection Sampling: We discard QwQ samples if they are incorrect according to the solutions provided in datasets."
NVIDIA
VLM. MMMU=47.33. "VILA project becomes part of Cosmos Nemotron family" https://github.com/NVlabs/Cosmos-Nemotron Vision Encoder: SigLIP-400M, Language Encoder: Yi-34B https://blogs.nvidia.com/blog/nemotron-model-families/
NVIDIA
WFM (world foundation model). "The models range in size from 4 billion to 14 billion parameters, with Nano being the smallest and Ultra being the largest... Cosmos WFM models were trained on 9,000 trillion tokens [9,000T] from 20 million hours of real-world human interactions, environment, industrial, robotics, and driving data..." https://techcrunch.com/2025/01/06/nvidia-releases-its-own-brand-of-world-models/ Actual working: https://lifearchitect.ai/cosmos/
Prime Intellect
Llama-2-7B base. "METAGENE-1 is a 7B parameter metagenomic foundation model designed for pathogen detection and pandemic monitoring, trained on over 1.5 trillion base pairs [∼370 billion tokens (≈1.69 trillion base pairs)] of DNA and RNA collected via metagenomic sequencing of wastewater."
Rubik's AI
Likely a Llama 3.1 405B wrapper. ALPrompt 2024H1=5/5. ALPrompt 2024H2=2/5. ALPrompt 2025H1=1/5. This is a strange model: slow and smart, but not as performant as o1. No arch details at all.
Renmin
"1.08T tokens for training. Among them are 481B English web data, 138B general English knowledge, 227B code pre-training data, 16.7B code instruction data, 93.8B mathematics pre-training data, 15.5B mathematics instruction data, and 108B Chinese data."
DeepSeek-AI
37B active. Explain: https://threadreaderapp.com/thread/1872318161883959485.html Announce: https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file
"We found the EON-8B model (a domain-adapted Llama 3.1-8B variant) to be 75x and 6x cost effective in comparison to GPT-4 and GPT-4o respectively (Figure 4)... On tasks seen during training, the EON-8B model outperformed base Llama-3-8B-Instruct and its performance was comparable to SOTA GPT models."
OpenAI
SoTA model for Dec/2024. Parameter estimate is very rough centrepoint for range 400B-52T.
RWKV
RWKV (pronounced RwaKuv) is an RNN: "multilingual, supporting over 100 languages and code.". Full run is 332B tokens of 3.1T dataset.
International
"a proper workhorse model, for retrieval, classification, etc." https://bsky.app/profile/howard.fm/post/3ldod2afps62x
IBM
IBM
"trained by IBM, Princeton, CMU, and UIUC on completely open data. At inference time, the model demonstrates 2.5x throughput improvement and 2x latency speedup compared to standard transformers in vLLM."
OpenAI
"o1-2024-12-17 sets new state-of-the-art results on several benchmarks, improving cost-efficiency and performance."
TII
"We conducted a single large-scale pretraining run on the 7B model, using 1024 H100 GPU chips, leveraging 14 trillion tokens... upscaled the 7B model to a 10B parameters model by duplicating the redundant layers and continuing pre-training with 2 trillion tokens of high-quality data."
Cohere
Cohere
VLM.
Meta AI
Byte Latent Transformer (BLT): "a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance".
Meta AI
"autoregressive sentence prediction in an embedding space." 7.7T tokens is a misprint, should be 2.2T as in paper.
Microsoft
Use unsloth: https://huggingface.co/unsloth/phi-4-GGUF & https://www.reddit.com/r/singularity/comments/1i0kso4/i_fixed_4_bugs_in_microsofts_opensource_phi4_model/
Google DeepMind
Gemini 2.0 Flash was first model released, 11/Dec/2024. "New Modalities: Gemini 2.0 introduces native image generation and controllable text-to-speech capabilities" Announce: https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/ Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
International
"Fully Open Source" with pre-training code, configurations, training and fine-tuning datasets, and intermediate checkpoints.
Cerebras
"For Sandia’s trillion parameter training run, Cerebras configured a 55 terabyte MemoryX device."
Shanghai AI Laboratory/SenseTime
Benchmarks are estimates based on Qwen2.5 72B Instruct as the base LLM (InternVL 2.5=InternViT-6B-448px-V2.5 5.5B + Qwen2.5-72B-Instruct). "Notably, Qwen2-VL processed a cumulative total of 1.4T tokens, while our InternVL2.5-78B is trained on just ∼120B tokens [of vision]." Dataset: "...we identify repetitive generation as one of the most detrimental issues. In many open-source or synthetic datasets, a small number of repetitive samples—comprising merely thousands of examples in our Stage 2 data mixture—can cause the model to spiral into repetitive loops, particularly in long-form outputs or CoT reasoning tasks. This phenomenon undermines the effectiveness of test-time scaling strategies. To address this challenge and support future research, we designed an efficient data filtering pipeline to remove low-quality samples, thereby minimizing the risk of repetitive generation." Repo: https://github.com/OpenGVLab/InternVL
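The filtering pipeline itself isn't published; a crude stand-in for a repetitive-sample detector might look like:

    from collections import Counter

    def is_repetitive(text, n=8, max_share=0.2):
        """Flag a sample if any single n-gram accounts for more than
        `max_share` of all n-grams (a simple loop detector)."""
        words = text.split()
        if len(words) < n * 2:
            return False
        ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        _, top_count = Counter(ngrams).most_common(1)[0]
        return top_count / len(ngrams) > max_share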
Meta AI
Drop-in replacement for Llama 3.1 70B, comparable performance to Llama 3.1 405B.
LG
“EXAONE”=“EXpert AI for EveryONE”. Training tokens: EXAONE-3.0 7.8B used 8T (Aug/2024); this release (Dec/2024) trains 7.8B on 9T but 32B on only 6.5T, a much lower tokens-to-parameters ratio.
Ruliad
No evals. Llama 3.1 8B base.
Sail
SEA languages. Continual pretraining based on Qwen2.5. Project page: https://sea-sailor.github.io/blog/sailor2/
PleIAs
Trained on the Jean Zay supercomputer, 192x H100s for 20 days. Dataset is new CC + Synthetic: https://huggingface.co/datasets/PleIAs/common_corpus
OpenAI
"a version of our most intelligent model that thinks longer for the most reliable responses" System card about safety only: https://cdn.openai.com/o1-system-card-20241205.pdf
Amazon
Multimodal, same performance as Llama 3.2 90B ∴ est 90B. Model card was hidden: https://assets.amazon.science/9f/a3/ae41627f4ab2bde091f1ebc6b830/the-amazon-nova-family-of-models-technical-report-and-model-card.pdf via https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card
Consortium
24 official languages are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish. "we use 400 Nvidia H100 GPUs of the Marenostrum 5 supercomputer" Also: https://eurollm.io/
Nous Research
"About 14 DGXes scattered around the globe. Sometimes more sometimes less, it varies depending on availability. On average, around 112 H100s." https://x.com/bloc97_/status/1863675225810043331 "we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware."
Prime Intellect
Training complete 22/Nov/2024. Fully distributed training: "the first decentralized training run of a 10-billion-parameter model, inviting anyone to contribute compute and participate. This brings us one step closer towards open source AGI."
Alibaba
Scores 1/5 on latest ALPrompt 2024 H2. QwQ="Qwen with Questions".
OpenGPT-X
24 EU languages (60% non-English): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv. https://opengpt-x.de/models/teuken-7b-de/ & paper date is Sep/2024.
Allen AI
Open Language Model (OLMo) 2 Apache 2.0 license for research and educational use. Paper coming. Data: 5 trillion tokens (1.2 epochs of 4T tokens) + 100B tokens (3 runs) + 300B tokens (1 run) merged. https://huggingface.co/allenai/OLMo-2-1124-13B & playground: https://playground.allenai.org/
CMU
Unreleased, but will be replicated. "a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models"
Moonshot AI
Reasoning, maths only. Very little info available. Chinese. Long context. No paper.
Alibaba
No evals. Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset.
Allen AI
Llama 3.1 post-training, worse performance on most benchmarks. Post training methods include new Reinforcement Learning with Verifiable Rewards (RLVR). "We perform supervised fine-tuning on new capability-focused synthetic data mixed with existing instruction datasets. We then perform preference tuning on on-policy synthetic preference data. We finish training Llama Tülu3 with our new method, Reinforcement Learning with Verifiable Rewards."
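RLVR in one line: the learned reward model is replaced by a programmatic check. A minimal sketch (my illustration; the verifier functions are assumptions):

    def rlvr_reward(prompt, completion, verify):
        """Reinforcement Learning with Verifiable Rewards: reward is 1 only
        if a deterministic verifier (e.g. exact-match on a math answer, or a
        constraint checker) accepts the completion; no reward model at all."""
        return 1.0 if verify(prompt, completion) else 0.0

    # Example verifier for a math task with a known gold answer:
    def exact_match(gold):
        return lambda prompt, completion: completion.strip().endswith(gold)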
OpenAI
Material decrease in benchmark scores (GPQA: -13.37%, MMLU: -3.38%) compared to Aug/2024. Pruned? Quantized? https://github.com/openai/simple-evals
DeepSeek-AI
Scores 0/5 on latest ALPrompt 2024 H2 "DeepSeek-R1-Lite is currently still in the iterative development stage. It currently only supports web usage and does not support API calls. The base model used by DeepSeek-R1-Lite is also a relatively small model, unable to fully unleash the potential of long reasoning chains. At present, we are continuously iterating on the inference series models. In the future, the official DeepSeek-R1 model will be fully open-sourced. We will publicly release the technical report and deploy API services." https://mp-weixin-qq-com.translate.goog/s/e1YnTxZlzFvjcmrLLTA8fw?_x_tr_sl=zh-CN&_x_tr_tl=en&_x_tr_hl=zh-TW
XiaoduoAI
SLM
Mistral
Open-weights multimodal model built on top of Mistral Large 2.
Fireworks
"a compound AI model specialized in complex reasoning, that interweaves multiple open models at the inference layer. "
Alibaba
https://qwenlm.github.io/blog/qwen2.5-coder-family/ Jack Clark from Anthropic is saying it’s actually 18T tokens from Qwen2.5 + 5.5T tokens for a total of 23.5T tokens. That doesn’t seem right from my interpretation of the technical report.
TensorOpera
Gold standard for dataset documentation
Tencent
"Hunyuan-Large is pre-trained on 7T tokens, which contains nearly 1.5T tokens of high-quality and diverse synthetic data." "389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens."
AI Singapore
SEA-LION (Southeast Asian Languages In One Network) is "a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region." The Gemma2 9B CPT SEA-LIONv3 base model has undergone continued pre-training from the base Gemma-2-9B model. News: https://www.techinasia.com/news/ai-singapore-boosts-sea-ai-sealion-v3-model
AMD
1 billion parameter LMs trained from scratch using 1.3T tokens on a cluster of AMD Instinct MI250 GPUs.
Hugging Face
Base and instruct versions, with Apache 2.0 license
Cohere
"Aya Expanse, a family of highly performant multilingual models that excels across 23 languages and outperforms other leading open-weights models...we have collaborated with over 3,000 researchers from 119 countries to expand cutting-edge multilingual research... 220 language ambassadors from around the world who have been part of this release"
Anthropic
Absurd naming scheme. Paper addendum pp51-64: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf#page=51
IBM
Announce: https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models
IBM
Announce: https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models
aiXcoder
Dataset: The Stack
NVIDIA
Related paper: https://arxiv.org/abs/2410.01257
Mistral
"Introducing the world’s best edge models"
01-ai
"New MoE hybrid expert architecture" and https://x.com/01AI_Yi/status/1845776529185476613
Zyphra
Mamba2 "trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM"
NVIDIA
"a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized...reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length."
Inflection AI
Inference via Intel Gaudi® 3 128 GB, on-premise available. Minimum spend $100 credits.
Inflection AI
Inference via Intel Gaudi® 3 128 GB, on-premise available. Minimum spend $100 credits.
Liquid AI
40BA12B. Some controversy/concern over company. Liquid Foundation Models (LFM). "Human preference optimization techniques have not been applied extensively to our models yet."
Salesforce
Code coming soon: https://github.com/SalesforceAIResearch/SFRJudge "we opt to focus on datasets that evaluate modern (2023 and beyond) LLM responses, as older datasets likely contain lower quality responses from less capable models, with correspondingly stale annotations. We supplement human-annotated data with synthetically generated data to endow our judge models with specific capabilities (e.g., following fine-grained rubrics in evaluation)"
BAAI
VLM. Dataset estimates are based on the unrelated UW/Salesforce dataset MINT-1T (3.4B images, 927M documents) https://arxiv.org/abs/2406.11271v1
NVIDIA
Flamingo clone. "we use Qwen2-72B-Instruct as the default text-only LLM backbone. We also employ Nous-Hermes-2-Yi-34B for ablation study and faster experimentation... we use InternViT-6B as the default vision encoder"
China Telecom Artificial Intelligence Research Institute
Trained on Chinese GPUs: "Ascend Atlas 800T A2 training server – a Huawei product listed as supporting the Kunpeng 920 7265 or Kunpeng 920 5250 processors" https://www.theregister.com/2024/10/02/china_telecom_model_trained_local_tech/
China Telecom Artificial Intelligence Research Institute
Trained on Chinese GPUs: "Ascend Atlas 800T A2 training server – a Huawei product listed as supporting the Kunpeng 920 7265 or Kunpeng 920 5250 processors" https://www.theregister.com/2024/10/02/china_telecom_model_trained_local_tech/
AMD
Small language model (SLM). Trained on AMD Instinct™ MI250 accelerators. "Pretrain Dataset: We employed the SlimPajama and Project Gutenberg dataset to pretrain the 135M model. Project Gutenberg is a library of over 70,000 free eBooks approximately. This sums up to 670B tokens"
Meta AI
Vision (VLM)
Meta AI
Text (LLM). "Pre-training. [For Llama 3.2 3B] We prune the models from their 8B siblings and use logits from the 8B and 70B models as token-level targets (token-level distillation). We then use knowledge distillation to recover performance."
Allen AI
LLaVA-style arch: Qwen2 (or OLMo) LLM + CLIP ViT. Multimodal Open Language Model built by Ai2. Announce: https://molmo.allenai.org/blog
Google DeepMind
Sparse MoE. Context window=2M. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
Alibaba
Microsoft
16x3.8B "only 6.6B activate parameters". GRIN=GRadient-INformed. "GRIN MoE is pre-trained on 4T tokens as a Causal Language Model. The same training dataset has been used to train Phi-3 dense models"
Google DeepMind
RAG/RIG: "the LLM is fine-tuned to produce natural language Data Commons queries alongside statistics"
OpenAI
Jina AI
HTML->Markdown. Specialist small model; outperforms GPT-4o general model, does not outperform Gemini Pro 1.5.
Mistral
"Pixtral was trained to be a drop-in replacement for Mistral Nemo 12B."
DeepSeek-AI
"DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct."
01-ai
6B=3T tokens, 9B=+0.8T tokens, 9B-Coder=+2.4T tokens=6.2T tokens. See Yi 1.5 34B in this table
Allen AI
Open Language (OL) Mixture of Experts (MoE). "We train OLMoE-1B-7B for 5 trillion tokens, however, some recent dense models train significantly longer, such as Llama 3 with 15 trillion tokens. To the best of our knowledge, there has been no large MoE that has been overtrained as much as OLMoE-1B-7B. Specifically, taking the active parameters of OLMoE-1B-7B, our token multiplier is around 5,000 (5T / 1B). There are likely benefits to training even longer, but to what degree overtraining is effective for MoEs and how it differs from dense models still requires more research."
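The token-multiplier arithmetic from the quote, as a quick check:

    # Token multiplier = training tokens / active parameters.
    olmoe = 5e12 / 1e9        # 5T tokens on 1B active params -> 5,000
    llama3_8b = 15e12 / 8e9   # 15T tokens on 8B params -> 1,875
    print(olmoe, llama3_8b)   # 5000.0 1875.0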
Consortium
Polish Large Language Model. Not yet available as of Sep/2024
Salesforce
64K sequence length. Released under Apache-2.0.
Magic
Context=100M tokens equals ~10 million lines of code or ~750 novels.
Cartesia
On-device. "hybrid architecture based on Mamba-2, with feedforward and sliding window attention layers interspersed"
Google DeepMind
Announce: https://x.com/OfficialLoganK/status/1828480085353234535 1M context for all modalities. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
Aleph Alpha
Stanford
Test-Time Training (TTT) layers. Real-time learning by Stanford, UC, and Meta. Potential for frontier models in 2025+.
AI21
Jamba 1.5 Mini (12B active/52B total) and Jamba 1.5 Large (94B active/398B total) are also optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.
Microsoft
Microsoft
NVIDIA
Pruned and distilled from Nemotron-4 15B: https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
Sarvam AI
Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
xAI
MMLU-Pro=75.5=SOTA. Claude 3.5S MMLU-Pro=72.83. "Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo." [Alan: Grok is Heinlein, Sixth Column is also Heinlein: https://en.wikipedia.org/wiki/Sixth_Column ]
LG
“EXAONE”=“EXpert AI for EveryONE”
TII
https://huggingface.co/spaces/tiiuae/falcon-mamba-playground
Writer
Medical. MMLU Medical Genetics=94.0
Writer
Financial. "across a variety of real-world financial use cases. It outperformed popular models like Claude 3.5 Sonnet, GPT-4o, and Mixtral-8x7b"
Zyphra
Mamba2
NVIDIA
Pruned and distilled from Nemotron-4 15B: https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/
Mistral
Fits on a single node for inference.
Meta AI
Announce: https://ai.meta.com/blog/meta-llama-3-1/ Model card: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md
OpenAI
Omnimodel. "OpenAI would not disclose exactly how large GPT-4o mini is, but said it’s roughly in the same tier as other small AI models, such as Llama 3 8b, Claude Haiku and Gemini 1.5 Flash." https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini-a-small-ai-model-powering-chatgpt/ "tested GPT-4o to identify potential risks, which we have addressed and plan to share the details of in the forthcoming GPT-4o system card and Preparedness scorecard." And related paper about instruction hierarchy: https://arxiv.org/abs/2404.13208
Mistral
With NVIDIA. "Drop-in replacement of Mistral 7B". "trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs" https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/
Mistral
"Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length."
Mistral
"We’re contributing Mathstral to the science community to bolster efforts in advanced mathematical problems requiring complex, multi-step logical reasoning."
Microsoft
Notable finetune of GPT4-0125-preview "outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting"
Consortium
AKA TriLM. "Spectra LLM suite, the first open suite of LLMs spanning multiple bit-widths, including FloatLMs, QuantLMs, and TriLMs, ranging from 99M to 3.9B parameters trained on 300B tokens."
DeepL
"Built using our own groundbreaking, specialized LLM technology and proprietary training data, designed specifically for translation"
Hugging Face
Dataset includes new Cosmopedia v2 synthetic data. 135M and 360M models, each trained on 600B tokens from Smollm-Corpus. 1.7B model trained on 1T tokens from Smollm-Corpus.
Vectara
"At <10B parameters it's an LLM trained to provide optimal results for RAG and structured outputs."
Google DeepMind
LLM-as-a-Judge autorater. Foundational Large Autorater Models (FLAMe). Uses an instruction-tuned PaLM-2-24B model. Unrelated to Microsoft FLAME Jan/2023.
StepFun
Launched early Jul/2024: https://pandaily.com/stepfun-releases-three-large-models-of-the-step-series/ "StepFun, founded in April 2023 with the mission to “Scale-up possibilities for everyone,” unites top talent in artificial intelligence from both domestic and international backgrounds, and is dedicated to advancing toward AGI. The company has already launched the Step series of foundation models, which includes Step-2, a cutting-edge trillion-parameter Mixture of Experts (MoE) language model; Step-1.5V, a powerful multimodal large model; and Step-1V, an innovative image generation model, among others."
H2O.ai
Runs natively and fully offline on mobile phone. "H2O-Danube3 is a family of decoder only LLM models that use the general Llama model architecture adopting core principles from Llama 2 and Mistral with custom parameters determining the shape of each layer and total parameter count. We use the Mistral tokenizer..." MMLU for chat=54.74, base=55.18 via https://huggingface.co/h2oai/h2o-danube3-4b-base
Microsoft
"the training dataset follows a specific structure, we develop a custom tokenizer. Alphanumeric node names are tokenized at a character level, while special terms such as ‘causes’, ‘Does’, ‘cause’, ‘Yes’, and ‘No’ are tokenized at the word level... Our training setup consists of around 175k instances of sequential chains with size of chains ranging from 3 to 6 nodes... All models are trained for 100 epochs. [LifeArchitect.ai estimate is 12 tokens per node x 6 nodes x 175,000 instances x 100 epochs = 1.26B tokens]" Based on GPT-2 arch.
SenseTime
"The model training was based on over 10TB tokens [sic, taken as 10T tokens instead of 10TB=2T tokens] of high-quality training data, including a large amount of synthetically-generated reasoning chain data, which help to enhance its reasoning capabilities." & "The updates include SenseNova 5o, the first real-time multimodal model in China, which provides a new AI interaction model on par with GPT-4o’s streaming interaction capabilities"
Kyutai
"1. The model is fine-tuned on 100K transcripts generated by Helium itself. 2. These transcripts are highly detailed, heavily annotated with emotion and style, and conversational. 3. Text to Speech Engine is further fine-tuned on 20 hours of audio recorded by Alice and licensed."
Shanghai AI Laboratory/SenseTime
"The release of InternLM2.5 series contains 7B model size for now and we are going to release the 1.8B and 20B versions soon" [20B released around 1/Aug/2024]
BAAI
Technical architecture testing only; the tokens-to-parameters ratio is too low for decent performance.
Renmin
"YuLan's training is finished on Jan, 2024 and has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks."
Baidu
"Ernie Bot has reached 300 million users since its launch [on 16/Mar/2023, public Aug/2023]" Jun/2024
Google DeepMind
Announce: https://blog.google/technology/developers/google-gemma-2/
OpenAI
"LLM Critics Help Catch LLM Bugs" Announce: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/
Apple
Vision model based on T5-XXL. Modalities: RGB, Caption, Bounding boxes, Semantic segmentation, Depth, Human poses, Surface normals, CLIP, DINOv2, ImageBind, Metadata, Canny edges, SAM edges, SAM instances, Color palette. Project page: https://4m.epfl.ch/
EvolutionaryScale
Biology large language model: "sequence, structure, and function are all masked and predicted during training, ESM3 can generate in all three modalities." 1.4B only released.
Huawei
https://x.com/faridofanani96/status/1804079517193113850/photo/1
Anthropic
MMLU=90.4 with prompting. Model card: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf
DeepSeek-AI
DeepSeek-V2 with additional 6 trillion tokens.
International
New dataset: 240T tokens: 8× larger than previous SOTA dataset. DCLM-Pool is 240T, DCLM-Baseline is 3.8T: "we combine our 3.8T DCLM-BASELINE with the StarCoder and ProofPile2 data to arrive at a 4.1T token dataset. We train a 7B model for 2.5T tokens" and "We release the DCLM benchmark, framework, models, and datasets at https://datacomp.ai/dclm."
NVIDIA
Open-source equiv of Mar/2023 GPT-4 (1.76T MoE≈340B, 13T), same param count but 2x the tokens of May/2023 PaLM 2 (340B, 3.6T), competitor to Nov/2023 Grok-1 (314B, 6T). Trained on 6,144 H100s. ~1.3TB for inference. 50+ natural and 40+ coding languages. Trained between December 2023 and May 2024. MMLU 0-shot for instruct=78.7, 5-shot for base=81.1. Permalink for paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b
Apple
https://lifearchitect.ai/apple/ Likely to be the Apple OpenELM model (Apr/2024). "two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute". https://machinelearning.apple.com/research/introducing-apple-foundation-models The server-based model is possibly Ferret, although it is more properly called a multimodal model (not just language). It could also be Apple GPT based on their Ajax framework: https://archive.md/f3C0r
UCSC
"we explore alternative methods for mixing tokens without relying on matrix multiplications." Compared with Transformer++ based on Llama-2, not to be confused with the pre-GPT-3 American Express Transformer++ paper from 2/Mar/2020. Instead, Transformer++ is defined in the Mamba paper: 'Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020)'
Galileo
Based on DeBERTA-large (440M). RoBERTa=162B token dataset.
Alibaba
Instruct MMLU=82. Instruct GPQA=41.9. https://qwenlm.github.io/blog/qwen2/
Alibaba
https://qwenlm.github.io/blog/qwen2/
Kunlun Tech
CN + EN. "(MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model."
CMU
Analysis: https://tridao.me/blog/2024/mamba2-part1-model/
International
"first fully open-sourced bilingual LLM with comparable performance to existing state-of-the-art LLMs... we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided."
LLM360
"K2-65B is a fully reproducible LLM outperforming Llama 2 70B using 35% less compute."
Mistral
Fluent in 80+ programming languages
Cohere
01-ai
Still training as of May/2024: https://appserversrc.8btc.cn/FnDYlEC4STBhphu6M3NL4CKH43FW dead link, use: https://finance.china.com.cn/roll/20240513/6116857.shtml
01-ai
Meta AI
Multimodal
Google DeepMind
Fine-tuned + prompted Gemini (Dec/2023). "The results of LearnLM-Tutor reproduce the performance of Gemini Pro, for example an MMLU score of 0.72 and MATH score of 0.33."
Cerebras
https://www.cerebras.net/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy "For the 50% sparse model, we utilized 45 billion tokens of pretraining data, while an additional 100 billion tokens were used for the 70% model. This represents approximately 2% to 8% of the original 2 trillion tokens used to train the base Llama-2 model."
Google DeepMind
1M context length. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
OpenAI
gpt-4o-2024-05-13 no longer easily available, so hidden in the Model Table rankings. Omnimodel. ‘[GPT-4o is] likely an early checkpoint of GPT-5’. https://twitter.com/drjimfan/status/1790089671365767313 ELO: https://twitter.com/LiamFedus/status/1790064963966370209 Demo: https://youtu.be/DQacCB9tDaw
TII
Announce: https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas
Fujitsu
Japanese. CPU trained: 158,976+ A64FX CPUs (7M+ cores), zero GPUs. https://en.wikipedia.org/wiki/Fugaku_(supercomputer)
01-ai
Uses 600B more training tokens than Yi 1.0 (Nov/2023).
Microsoft
With Tsinghua. You Only Cache Once (YOCO). Long context "1M context length with near-perfect needle retrieval accuracy"
DeepSeek-AI
Huge dataset, 12% Chinese "Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B".
Independent
"results on the ”Needle In A Haystack”(NIAH) tests indicate that ChuXin-1M performs well across all context window lengths up to 1M."
RWKV
RWKV (pronounced RwaKuv) is an RNN: https://twitter.com/BlinkDL_AI/status/1787834625211158562
ELLIS
New method LSTM to xLSTM, see also RNNs. Code/weights don't seem to be released. https://github.com/AI-Guru/xlstm-resources
IBM
MMLU=50 for 8B model only. Dataset: publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub.
Alibaba
https://twitter.com/JustinLin610/status/1787584325367529509
Google DeepMind
Med-Gemini-M 1.0 and Med-Gemini-L 1.0 (Pro and Ultra finetunes) "For language tasks that require less complex reasoning, such as summarizing medical notes and creating referral letters, we introduce Med-Gemini-M 1.0 by fine-tuning the Gemini 1.0 Pro model. For other tasks that require more advanced reasoning, we introduce Med-Gemini-L 1.0 by fine-tuning the Gemini 1.0 Ultra model using a self-training method to enable the models to efficiently use web search."
Microsoft
Precursor to phi.
BAAI
Also known as FLM-2. "We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research." Discussion paper Jul/2024: https://arxiv.org/abs/2407.02783
Alibaba
Worse performance on GPQA (72B=36.3, 110B=35.9).
Snowflake AI Research
"Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating."
SenseTime
GPT-4 scale; low media coverage; no demo in Western world. https://www.techinasia.com/sensetime-pauses-trading-stock-rises-30-model-launch
Apple
On-device model (laptop, phone). Open-source Efficient Language Models (OpenELM). https://venturebeat.com/ai/apple-releases-openelm-small-open-source-ai-models-designed-to-run-on-device/
Microsoft
Preview only, benchmarks being investigated as of May/2024.
Microsoft
"phi3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second."
Meta AI
Instruct MMLU-Pro=56.2
Zyphra
Mamba1
Amazon
HLAT=High-quality LLM pre-trained on AWS Trainium. Same arch as Llama 7B. The pre-training was performed on up to 64 Amazon EC2 trn1.32xlarge instances, totalling up to 1024 AWS Trainium accelerators. Read more about Trainium: https://www.aboutamazon.com/news/aws/what-you-need-to-know-about-the-aws-ai-chips-powering-amazons-partnership-with-anthropic
Hugging Face
Clone of Flamingo now using Mistral 7B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS)
Reka AI
https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model
Microsoft
Base model = mixtral-8x22b.
EleutherAI
Hugging Face H4
mixtral-8x22b finetune using Odds Ratio Preference Optimization (ORPO).
Cohere
RAG + semantic search, possibly backed by Command-R+.
OpenAI
This is such a significantly better model that I've added it here. This GPQA=46.5%, old GPT-4 GPQA=36%. https://twitter.com/EpochAIResearch/status/1778463039932584205 MMLU scores are unclear, but may have improved by 1%: https://twitter.com/OpenAI/status/1778602770784002136. Final benchmarks are here: https://archive.md/6Cc0Z
Tsinghua
MoE option=https://huggingface.co/openbmb/MiniCPM-MoE-8x2B
Apple
Vicuna base, multimodal. Extension of Ferret from Oct/2023.
Mistral
MoE=22Bx8, seq=65536.
Sail
SEA languages. Based on Qwen-1.5. https://github.com/sail-sg/sailor-llm "Generally Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs."
MIT
Tsinghua
Fine-tune of Mistral-7B and CodeLlama-70B.
Cohere
"Purpose-built to excel at real-world enterprise use cases." Announce with no arch details: https://txt.cohere.com/command-r-plus-microsoft-azure/
Silo AI
Viking "uses an architecture similar to Llama 2, with flash attention, rotary embeddings, grouped query attention and supports a 4k sequence length."
Nous Research
1.58-bit quantized (ternary weights) means we can run a 70B model in ~14GB VRAM. See also BitNet b1.58
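The memory arithmetic behind that claim (ternary weights store log2(3)≈1.58 bits each):

    import math

    params = 70e9
    bits_per_weight = math.log2(3)              # ternary {-1, 0, +1} -> ~1.585 bits
    gb = params * bits_per_weight / 8 / 1e9     # bits -> bytes -> GB
    print(round(bits_per_weight, 2), round(gb, 1))  # ~1.58 bits, ~13.9GB (~14GB VRAM)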
International
Apple
FLAN-T5 (Oct/2022) finetune.
Alibaba
MoE. "Of particular significance is the fact that, through upcycling, the necessity for training an equivalent volume of tokens as in the original model has been eliminated." I assumed half of the original 3T tokens
xAI
Context=128k.
AI21
MoE. Open weights, licensed under Apache 2.0. Announce: https://arxiv.org/abs/2403.19887
MosaicML
MoE. Trained for $10M on 3,072 NVIDIA H100s connected by 3.2Tbps Infiniband.
Stability AI
Context window=16,384. Trained on The Stack dataset.
Sakana AI
Japanese. Model merge 'our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel7B-002' https://sakana.ai/evolutionary-model-merge/
Rakuten Group
Japanese. Mistral 7B derivative.
Independent
Tiny model (378M) for testing
RWKV
RWKV (pronounced RwaKuv) is an RNN: Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost)
Apple
VLM, outperforms Flamingo 80B (Apr/2022) across benchmarks. 2T text tokens + ~10B+ other text (estimate). Unreleased.
Covariant
Commercial, multimodal for robotics
Cohere
RAG and tool use
DeepSeek-AI
Vision, based on DeepSeek-LLM-7B
Fudan University
Llama 2 7B backbone with new matrices ('reshaping the embedding matrix and prediction layer')
Stability AI
Mentioned in Stability release about Intel chips 11/Mar/2024; availability unknown.
Inflection AI
SRIBD/CUHK
Qwen 1.8B as base. Medical focus.
Anthropic
Original MMLU=86.8 (GPT-4=86.4). MMLU=88.2 with CoT prompting. Original GPQA=50.4. 200k context, 1M for researchers.
NVIDIA
Unbabel
Commercial product, Llama-2 as base.
Google DeepMind
MMLU=35. RNN.
Google DeepMind
MMLU=49.5. RNN.
Microsoft
SambaNova
CoE: Collection of experts: Llama2 7B/13B/70B, Mistral 7B, DeepSeek Coder 1.3B/6.7B/33B, Falcon 40B, DePlot, CLIP, Llava.
Cohere
mT5 base.
HF
Synthetic data (25B tokens trained for 6 epochs) + code. MMLU=32.4
Silo AI
Uses a BLOOM architecture with ALiBi embeddings to allow for context window extrapolation. "While model architecture for the initial model has been kept simple, future models under progress will support additional capabilities, such as flash attention, rotary embeddings and grouped query attention."
HF/ServiceNow
The Stack v2=900B tokens, 5 epochs to 4.3T tokens
ByteDance
Trained using 12,288 A100 GPUs, replicating MT-NLG size
ByteDance
Trained using 12,288 A100 GPUs, replicating GPT-3 size
Mistral
Optimised for latency and cost.
Mistral
MMLU=81.2 (same as Flan-PaLM 2 340B, higher than PaLM 2 340B MMLU=78.3), 32k context window. API only (not open source).
Reliance
11 Indian languages like Hindi, Tamil, and Marathi
Apple
Internal employee model only
Reka AI
Reka AI
My testing shows very poor performance equiv with tiny model
Google DeepMind
MMLU=64.3 (Llama 2 70B=68.9, ChatGPT 20B=70). Text only. Probably dense. Largest trained dataset (6T) besides frontier models.
Google DeepMind
Sparse MoE. Context window=1M and 10M for research. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
Alibaba
Meta AI
Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
BRAIN
Satire (and hilarious). Probably Llama 2 with aggressive prompt. Wired interview: https://archive.md/toxHq
ChatDB
Based on DeepSeek-Coder 6.7B.
AI Singapore
MPT base. MMLU=26.87. Southeast Asian languages like Thai, Vietnamese and Bahasa Indonesia. https://www.computerweekly.com/feature/Sea-Lion-explained-Southeast-Asias-first-large-language-model
Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.
Allen AI
Open Language Model (OLMo)
NVIDIA
Project page: https://audioflamingo.github.io/
Cerebras
Spanish, Catalan. Bloom-7.1B (341B tok) + continued pre-training on 140B tok. Trained on Cerebras hardware.
AIWaves.cn
Llama? 'All Weaver models are initialized from powerful open-source LLMs.' English waitlist: https://www.wawawriter.com/en/
Mistral
Leaked, proper version soon: https://venturebeat.com/ai/mistral-ceo-confirms-leak-of-new-open-source-ai-model-nearing-gpt-4-performance/
iFlyTek
"Pre-trained on a massive high-quality data set with a total of more than 3 trillion tokens, and then fine-tuned on diversified alignment data."
iFlyTek
GPT-4 competitor. https://www.shine.cn/biz/tech/2401304331/
Apple
MLLM and diffusion model initialized from LLaVA-7B (Llama 2 + Vicuna) + StableDiffusion-v1.5.
Meta AI
Paper link is to 34B from Aug/2023. This 70B model finished training Jan/2024.
RWKV
RWKV (pronounced RwaKuv) is an RNN: Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost), Trained on 1.1 Trillion Tokens across 100+ languages. Original paper: https://arxiv.org/abs/2305.13048
LMU
Extends Llama 2 7B to 10B using 534 languages.
Cornell
Used bytes instead of tokens. 4 bytes≈1 token, so 150B bytes≈37.5B tokens
DeepSeek-AI
"Surpasses existing closed-source models like Codex and GPT-3.5... permissive license that allows for both research and unrestricted commercial use."
Tencent
Fusion of Llama-2-7B (2T tok), OpenLLaMA-7B (2T tok), and MPT-7B (1T tok).
Adept
"Fuyu-Heavy is the world’s third-most-capable multimodal model, behind only GPT4-V and Gemini Ultra, which are 10-20 times bigger." Token estimate is based on Adept Persimmon-8B using many more tokens.
OrionStar
English, Chinese, Japanese, Korean, and other languages.
Shanghai AI Laboratory/SenseTime
Zhipu AI (Tsinghua)
Best Chinese model to date based on analysis. Follows OpenAI roadmap. MMLU=81.5. 'hundreds of billions of parameters' https://www.chatglm.cn/
DeepSeek-AI
MoE activated parameters are 10-15% of dense, so I need to rethink ALScore for MoE. 'preliminary efforts to scale up DeepSeekMoE to 145B'
DeepSeek-AI
Chinese/English. Outperforms Llama 2. MMLU=71.3 outperforms GPT-3.5.
Tencent
"We pre-train LLAMA PRO’s expanded blocks on 80B tokens using open-source code and math data for 2830 GPU Hours (16 NVIDIA H800 GPUs for about 7 days)."
Writer
Palmyra X V2, Palmyra X V3, Palmyra X V4. https://venturebeat.com/ai/why-writers-palmyra-llm-is-the-little-ai-model-that-could-for-enterprises/
SUTD/Independent
'Overtrained' using 2,727 tokens per parameter. Dataset was 1T tokens: 3 epochs to 3T seen. Singapore.
JPMorgan
Document spatial layout structure.
Cambridge
"Uses 4-body equivariant messages; covers 89 elements; supports fine-tuning for ab initio accuracy with minimal data."
Allen AI
600TB dataset (plus 120+ fine-tuning datasets) includes '1B image-text pairs, 1T text tokens, 180M video clips, 130M interleaved image & text, 3M 3D assets, and 1M agent trajectories.'
Microsoft
"To obtain WaveCoder models, we choose StarCoder-15B, CodeLLaMa (7B and 13B), DeepseekCoder-6.7B as the base model and fine-tune all the base models for 3 epochs"
Huawei
Finance + law fine-tune of PanGu-π
Huawei
Dense, named PanGu-π
Wenge
Dataset=240TB filtered to 10.6TB for 2.65T tokens
BAAI
VLM. Gemini clone. Outperforms Flamingo 80B. The Pile for text, but only sampled 3.6B tokens (1.4% of the dataset).
Google DeepMind
Available to 'white-listed' orgs only.
Upstage AI
South Korean. Llama-2 arch. SOTA for its size (Dec/2023).
Deci
4.4x times faster than Mistral. English only.
Mistral
MMLU=75.3% (GPT-3.5-turbo 20B=70%, Llama 2 70B=68.9%)
Mistral
MoE=7Bx8, aka mistral-small. 'Concretely, Mixtral has 45B total parameters but only uses 12B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12B model.'
Together
RedPajama (C4), new arch beyond just Transformers
Nexusflow.ai
Based on CodeLlama. 'surpasses GPT-4 by up to 7% in function calling success rates in human-generated use cases involving nested and composite functions.'
Google DeepMind
Original MMLU=83.7. MMLU=90.04 with prompting. Chinchilla (20:1), dense, maybe 600B-2000T. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/
CMU
The Pile, new arch beyond just Transformers. 2.7B MMLU=26.2. 7B MMLU=33.3.
Berkeley/JHU
Paper is 25MB. First Large Vision Model (LVM); no text. Based on Llama and LAION 5B (1.49B).
Alibaba
Llama 2 for Southeast Asian (SEA) languages: Vietnamese 🇻🇳, Indonesian 🇮🇩, Thai 🇹🇭, Malay 🇲🇾, Khmer🇰🇭, Lao🇱🇦, Tagalog🇵🇭 and Burmese🇲🇲
Perplexity
Web access. Higher 'freshness' and 'truth' scores.
Meta AI
Based on NLLB and older models. https://github.com/facebookresearch/seamless_communication
Google DeepMind
Robotics, builds on RT-1
IEIT
Chinese + EN dataset includes The Pile: DM, arXiv, Wikipedia, Books3, Stack Exchange, FreeLaw, and medical.
EPFL
Llama 2 trained on med data using NVIDIA Megatron-LM. "outperforms Llama-2-70B, GPT-3.5 (text-davinci-003, 8-shot), and Flan-PaLM on multiple medical reasoning tasks."
Microsoft
Proving maths is not memorized. Uses a GPT-2-style model. By Sébastien Bubeck’s team.
Berkeley
Llama 2 7B -> OpenChat 7B -> Starling-7B (RLAIF)
Inflection AI
“now the 2nd best LLM in the world”. Finished training 19/Nov/2023, waiting for fine-tuning and release.
Anthropic
Less hallucinations, 200k context length, tool use
Allen AI
Llama 2 finetune with RLHF direct preference optimization (DPO).
NVIDIA
8B released, 22B internal.
NVIDIA
Used to train HelpSteer (16/Nov/2023): https://arxiv.org/abs/2311.09528
Microsoft
Llama 2 13B (2T) -> Orca 2 (GPT-4 finetune). Still an imitation model, overhyped: The False Promise of Imitating Proprietary LLMs https://arxiv.org/abs/2305.15717
Microsoft
https://twitter.com/SebastienBubeck/status/1724854157004190095
Microsoft
VLM, Flamingo alt
Google DeepMind
Combiner + autoregressive transformer for video/audio/text
NTU
Evolution of Persimmon-9.3B and Fuyu 8B
Samsung
Gauss Language specializing in generating texts, Gauss Code on software and code description and Gauss Image for image creation.
xAI
Context window=8192. UI: https://twitter.com/TobyPhln/status/1721053802235621734
xAI
Announced Nov/2023, trained Jul/2023
01-ai
Controversy about Llama 2 base. https://twitter.com/kaifulee/status/1724673131875377465 MMLU=76.3 (PaLM 2=78.3) Outperforms Llama 2. Chinese and English. https://www.bloomberg.com/news/articles/2023-11-05/kai-fu-lee-s-open-source-01-ai-bests-llama-2-according-to-hugging-face
OpenAI
https://openai.com/blog/new-models-and-developer-products-announced-at-devday
Google DeepMind
Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M). "850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters, each exhibiting better validation loss and one-shot downstream evaluations than independently trained counterparts."
Kunlun Tech
CN + EN.
Moonshot AI
Chinese. Long context. No paper.
Jina AI
Alternative to text-embedding-ada-002. Related v1 paper: https://arxiv.org/abs/2307.11224
Adept
VLM. 8B available under open licence, Medium size is closed
Baidu
Dense (confirmed). English-dubbed launch video (2h52m): https://twitter.com/i/broadcasts/1yNGaZaeallJj & https://youtu.be/wYozcsavRuM
Hugging Face H4
Mistral with 'aligned' data removed from dataset
Google DeepMind
VLM. Next iteration of PaLI via Pathways. https://lifearchitect.ai/pathways/
NVIDIA
"The largest LLM pretrained with retrieval before instruction tuning."
Apple
Vicuna base, multimodal
XLANG Lab
https://arxiv.org/abs/2310.06830
KAUST/Shenzhen
Arabic. Llama 2 + RLAIF
Reka AI
Multi-modal. No public arch info. Researchers from DeepMind, Google, Baidu and Meta building enterprise models
Google DeepMind
Robotics using UL2. 'RT-1 model trained using the robotic data mixture as RT-1-X, and the RT-2 model trained using the robotic data mixture as RT-2-X.'
Waymo
LLM for autonomous vehicle forecasting. https://youtu.be/jrMMNmN21I8?t=1560
Wayve
World model, generates video. Uses T5-large 770M for language + all vision parameters
Alibaba
Chinese. Full name is 'Tongyi Qianwen' 通义千问. 'Lags behind both GPT-3.5 and GPT-4'. Originally 7B/14B params Apr/2023
Meta AI
Unreleased to date. Context window=32,768 tokens (compare to Llama 2=4096 tokens)
Hessian AI/LAION
Llama 2 'extended' and pretrained on 2000B Llama 2 tokens + 65B tokens of German
Mistral
Apache 2.0, Sliding Window Attention (SWA) to handle longer sequences at smaller cost
Microsoft
Baichuan
Great paper. Chinese-English bilingual dataset
ThirdAI
CPU trained
Deci
Faster inference (4.8× throughput of Llama 2)
IBM
ModuleFormer is based on the Sparse Mixture of Experts (MoE).
Singapore
Multimodal. Vicuna 7B + other modalities
Microsoft
Textbooks only. 30B-token dataset
Apple
Apple's Transformer model for iOS 17 + macOS Sonoma. Announce is actually Jun/2023. GPT-2 base? 128 token context window
Adept
Open Apache license and publicly accessible weights.
BAAI
Trained for a $100k compute budget (on a cluster of 24 DGX-A800 GPU 8×80G servers for 21 days).
TII
Major milestone for open source models (largest open dense model to date).
Tencent
Independent
Satire. MMLU=100. 'phi-CTNL (pronounced “fictional”) that achieves perfect results across diverse academic benchmarks'
IBM
Original trained on 1T tokens, update 15/Feb/2024 trained on 2.5T tokens: granite-13b-chat-v2 (v2.1.0). "At IBM, we curated 6.48TB of data to train our LLM Granite.13B. This was reduced to 2.07 TB after pre-processing, a 68% decrease."
Inception
Arabic, trained in Abu Dhabi, UAE using Cerebras.
Meta AI
Outperforms GPT-3.5. Initial Llama 2 (2T tokens) trained on 500B tokens of code, 100B tokens of python
Hugging Face
Clone of Flamingo using Llama-1 65B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS)
UI/NVIDIA
RAG Atlas
AzaleAI
Indonesian fine-tune of WizardLM (which is a Llama fine-tune).
Microsoft
Assume Llama-2 fine-tune. Outperforms text-davinci-003. May merge this entry with the Apr/2023 7B release
Boston University
Fine-tune of Llama 2, family includes merges with Beluga, Dolphin, and Camel fine-tunes.
Stability AI
Best-performing openly available language model for Japanese speakers.
Stability AI
Context window=16,384. Trained on The Stack dataset.
Stanford
Uses LAION OpenFlamingo 9B, based on LLaMA-7B text + 1.3B vision
LightOn
First finetuned version of Falcon with RLHF. Enterprise: https://www.lighton.ai/paradigm
Together
32k context window instead of 4k (Llama 2)
Google DeepMind
Uses PaLM 1. Already outperformed by Med-PaLM 2. Med-PaLM Multimodal (Med-PaLM M).
Cerebras
Runs on devices with as little as 3GB of memory [iPhone, Macbook] when quantized to 4-bit
Stability AI
Fine-tuned Llama 2. Non-commercial use license. Codename was FreeWilly2
Stability AI
Fine-tuned LLaMA-1. Non-commercial use license. Codename was FreeWilly1
Shanghai AI Laboratory/CUHK
Proto-AGI. 12 modalities (text, image, point cloud, audio, video, infrared, hyperspectral, X-ray, time-series, tabular, Inertial Measurement Unit (IMU), and graph data).
Meta AI
Context window=4096. MMLU=68.9 (GPT-3.5=70.0, GPT-4=86.4)
(Undisclosed)
GPT-J (2021) finetune/module.
Anthropic
More HHH, 200k context length
IDEAS/DeepMind
256k context length
Tsinghua
Protein language model
Salesforce
8K sequence length. Released under Apache-2.0.
360 cn
Reka AI
No public arch info. Researchers from DeepMind, Google, Baidu and Meta building enterprise models
Microsoft
Proto-AGI. Multimodal large language model (MLLM). a multimodal large language model with grounding capability built upon KOSMOS-1
a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation
Inflection AI
Comparable with benchmarking results from InternLM 104B, 1-2% better. ‘Inflection-1 was trained using thousands of NVIDIA H100 GPUs on a very large dataset.’
Microsoft
Code model. ‘breaking existing scaling laws by training a 1.3B-parameter model, which we call phi-1, for roughly 8 passes over 7B tokens (slightly over 50B total tokens seen) followed by finetuning on less than 200M tokens.’
Shanghai AI Laboratory/SenseTime
Outperforms ChatGPT, LLaMA on RACE-h, Chinese + English
Meta AI
OPT-175B with new dialogue data
Microsoft
LLaMA -> Vicuna -> Orca (GPT-4 finetune). Still an imitation model, overhyped: The False Promise of Imitating Proprietary LLMs https://arxiv.org/abs/2305.15717
ETH Zürich
GPT-2 trained on leaked passwords
Google DeepMind
Iterative coding model trained on Google's monorepo. Jacob: https://twitter.com/jacobaustin132/status/1663972128176128002
Magic
Context window=5M
OpenAI
Unreleased, includes step by step research
Cambridge/Tencent
Proto-AGI. 6 modalities (text, image/video, audio, depth, thermal, and IMU/accelerometer/gyroscope/compass). Based on Vicuna.
TII
Abu Dhabi
Refact
LiON vs Adam, code, RedPajama+The Stack
UW
LLaMA-65B via QLoRA
Meta AI
LLaMA-65B with nearly no fine-tuning, no RLHF
Asus/TWS
BLOOMZ finetune? Chinese, Taiwan's first LLM. Subscription hardware: https://archive.md/cVdJt
Salesforce
InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001'
“What we found in our work is that it’s not really the sort of size of model — that the larger is not always better,” Deepmind VP Zoubin Ghahramani said in a press briefing ahead of today’s announcement. “That’s why we’ve provided a family of models of different sizes. We think that actually parameter count is not really a useful way of thinking about the capabilities of models and capabilities are really to be judged by people using the models and finding out whether they’re useful in the tests that they try to achieve with these models.”
HF/ServiceNow
MosaicML
'Llongboi'. Apache 2.0 license suitable for commercial use. Base 7B LLM trained on 1T tokens outperforms LLaMA and GPT-3. 64K+ context length. $200k to train from scratch.
Inflection AI
No indication of params/tokens. Devs from DeepMind.
NVIDIA
No paper yet
Amazon
No official information at all. 2nd hand via Jack Clark: https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon '$65m training run. Specifically, they trained a 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips (using 1,720 P4d nodes). It took 48 days to train.'
Microsoft
LLaMA 7B self-instructed fine-tune.
MosaicML
More 1B models coming with different datasets. Many more.
Stability AI
"Contains 1.5 trillion tokens, roughly 3x the size of The Pile. These models will be trained on up to 1.5 trillion tokens. The context length for these models is 4096 tokens."
Databricks
Fine-tuned Pythia 12B
EleutherAI
Berkeley
LLaMA base. Academic licence only.
Character.ai
No details released.
Bloomberg
Video: https://youtu.be/m2Scj2SO85Y Underperforms GPT-3, based on BLOOM. Tokens: 'We select a model size motivated by Hoffmann et al. (2022) and train a 50 billion parameter model on 569 billion tokens from our corpus of over 700 billion tokens to produce a model that is competitive with larger models.'
LAION
Uses LLaMA-7B. Demo: https://7164d2142d11.ngrok.app/
Nomic
chatbot trained on ~800k GPT-3.5-Turbo Generations based on LLaMa
Cerebras
20:1 tokens to parameters as per https://lifearchitect.ai/chinchilla/
Huawei
Sparse. 1.085T parameters named PanGu-Σ.
up to 64k context window [48k words or about 96 pages -Alan]
Google DeepMind
Recently, our next iteration, Med-PaLM 2, consistently performed at an “expert” doctor level on medical exam questions, scoring 85%. This is an 18% improvement from Med-PaLM’s previous performance and far surpasses similar AI models.
OpenAI
Original MMLU=86.4. MMLU=90.1 with prompting. Proto-AGI. 1.76T parameters MoE.
Stanford
Stanford Alpaca: An Instruction-following LLaMA model'
AI21
Together
"An instruction-tuned 20 billion parameter language model, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between Together, LAION, and Ontocord.ai."
Microsoft
Proto-AGI. Multimodal large language model (MLLM). Raven’s Progressive Matrices as real images, not digits as in testing of text-davinci-003 at https://lifearchitect.ai/ravens/
Meta AI
Researchers only, noncommercial only. 'LLaMA-65B is competitive with the best models, Chinchilla70B and PaLM-540B.'
Fudan University
Major bandwidth issues: https://www.reuters.com/technology/china-fudan-university-team-apologises-after-chatgpt-style-platform-crashes-2023-02-21/
Writer
Only up to 5B available open-source. "Trained on over 300 billion tokens of text data, and the size of the resulting model is over 20 billion parameters." https://writer.com/product/cowrite/
Aleph Alpha
‘Control’ means instruction tuned
Meta AI
Based on GPT-J 6.7B + access to other models via API
Amazon
Models <1B with vision CoT
Microsoft
T5 for Excel formulas, very small 60M params, "We start from a dataset of 927M formulas" estimate 10x multiplier for 9B tokens
Google DeepMind
Collab between Google & DeepMind. Makes 1% less errors than humans
Meta AI
Instruct
Anthropic
RLAIF=reinforcement learning with AI feedback
Baidu
OpenAI
Instruct with strict policies ("extremely limited")
OpenAI
Together
RWKV
RWKV (pronounced RwaKuv) is an RNN: https://www.reddit.com/r/MachineLearning/comments/yxt8sa/r_rwkv4_7b_release_an_attentionfree_rnn_language/
Meta AI
scientific only
DeepMind
SED 420M (diffusion text model)
BigScience
fine-tuned
BigScience
fine-tuned
Microsoft
Trained on ~5TB data, 2GB model download. 'In general we see an improvement in model performance as we increase the number of training tokens. Interestingly, larger models did not necessarily result in better performance for robot navigation. Even though larger models consistently presented better loss values for action prediction on a static dataset, (Fig. 7 b), when it comes to real-time deployment the larger network capacity introduces inference delays that become a disadvantage and lead to earlier crashes. For example, while LiDAR perception measurements arrive to the vehicle every 0.077s (13Hz), the largest model of 24 layers takes on average 0.023s for inference with a RTX3090 GPU, roughly 40% longer the 3 layer model (0.016s). These time differences can amount to even larger performance gaps in small embedded systems, and further emphasize the importance of multiple downstream task architectures sharing a common representation branch for real-time robotics applications.'
T5=1T tokens + LM-adapted T5 as 100B tokens
NVIDIA
Tsinghua
Llama 2 13B -> OpenChat 13B
13% English tokens and 87% Chinese
Tsinghua
DeepMind
Chatbot as a fine-tuned version of Chinchilla 70B
PaLM Vision model, new datasets of 10B multilingual text-image pairs
NVIDIA
Microsoft
Abstractive text summarization; 710M params; outperforms PaLM 540B. "Due to the limited computational resource, Z-Code++LARGE is trained with only 500B tokens instead of 1T tokens as that for mT5 training."
Meta AI
Meta AI
Tsinghua
50% English (200B tokens), so included here
Amazon
Wikipedia and mC4 only. seq2seq
OpenAI
Several models: 8 sizes, NLP, Code, FIM/non-FIM. 100B tokens for 6.9B params... beyond Chinchilla
Unnamed. Writes >3% of internal Google code.
Huawei
Python via GitHub
Meta AI
54.5B MoE, 3.3B dense. 200+ languages
AI21
J-1 fine-tuned with RBG law corpus
BigScience
PaLM fine-tuned on LaTeX/arXiv maths
Microsoft
XL: GPT-3 175B in paper, GPT-J 2.7B released
Yandex
Megatron-LM clone, Russian/English: https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6
Allen AI
Based on T5. Demo only
DeepMind
Context window=100,000. Params: 364M (wiki), 975M (PG-19), 826M (books), music=?, 770M (ImageNet).
Independent
Warning for inappropriate content. GPT-J.
Stanford
GPT-J with synthetic data
Unifying Language model. C4 only.
DeepMind
Proto-AGI. Generalist agent (LLM, VLM, robot)
Chatbot with tiny walled garden demo TBA
Meta AI
Only 30B available (Jun/2022)
Hugging Face
Based on T5.
Meta AI
Python and JavaScript
TII
Arabic. "World’s largest high-quality cross-domain Arabic dataset, combining web data with books, poetry, news articles, and technical information"
Sber
60 languages. Only 1.3B model available
Meta AI
Based on BART; compared to GPT-2
Salesforce
Code
LightOn
Params corrected 25/Apr/2022
DeepMind
First to double training tokens with each increase in model size (compute-optimal scaling)
Salesforce
"Text-to-Text Transfer Transformer". Code. Large introduced in https://arxiv.org/pdf/2207.01780.pdf
EleutherAI
Latest model to Feb/2022
Meta AI
LLM with multimodal capabilities
Baidu
Meta AI
Multilingual: 30 languages, 16 families.
Meta AI
13B & 1,100B (1.1T) param models.
DeepMind
Dataset: https://lifearchitect.ai/whats-in-my-ai/
Anthropic
Internal research only
DeepMind
With retrieval
Aleph Alpha
Devs from EleutherAI
Microsoft
RoBERTa=162B token dataset.
Submission to benchmarks. Original dataset was BookCorpus + Wikipedia: https://arxiv.org/pdf/1810.04805.pdf
Coteries
French only. GPT-J.
Microsoft/NVIDIA
Fine-tuned LaMDA
Cohere
Stealth release; trained on 'ebooks and webpages'. 52B: https://crfm.stanford.edu/helm/v1.0/?models=1
Baidu
Chatbot. Reddit comments + Chinese social media
Allen AI
Chatbot
OpenAI
Code
AI21
Emulated GPT-3 dataset
Meta AI
Chatbot
EleutherAI
Popular
Chatbot
Huawei/Sberbank
Russian GPT-3 with input from Huawei
OpenAI
No RLHF (base only). Popular: 3.1M words per minute. Dataset: https://lifearchitect.ai/whats-in-my-ai/
Meta AI
My favourite model until GPT-3 and GPT-4 came along: https://github.com/facebookresearch/fairseq/blob/main/examples/megatron_11b/README.md
American Express
Not to be confused with the more common usage of 'Transformer++': the ~2023 Llama-based Transformer++ referenced in the Mamba paper.
Dialogue model. Trained on a 61B-token corpus for 164 epochs → ~10T tokens processed!
"Text-to-Text Transfer Transformer". C4 + NLP language problems. "compared the following three configurations: First, the standard baseline model, which was pre-trained on 235 ≈ 34B tokens; second, the baseline trained instead for about 1 trillion tokens (i.e. the same amount of pre-training used for T5), which we refer to as “baseline-1T”; and third, T5-Base."
NVIDIA
Meta AI
calcs: "In total, this batch size and number of steps corresponds to pre-training on 235 ≈ 34B tokens. This is considerably less than BERT (Devlin et al., 2018), which used roughly 137B tokens, or RoBERTa (Liu et al., 2019c), which used roughly 2.2T tokens. Using only 2 35 tokens results in a reasonable computational budget while still providing a sufficient amount of pre-training for acceptable performance. We consider the effect of pre-training for more steps in Sections 3.6 and 3.7. Note that 2 35 tokens only covers a fraction of the entire C4 data set, so we never repeat any data during pre-training." https://arxiv.org/pdf/1910.10683.pdf MMLU shows RoBERTa-base 125M only=27.9 (not 355M)
OpenAI
WebText 10B token corpus × 4 epochs → 40B tokens processed. Reddit outbound only
"BERT — 128 000 tokens per step × 1 000 000 steps → 128 B tokens processed"
OpenAI
"GPT-1 — 984 M tokens corpus × 100 epochs × 1 token per word → 98.4 B tokens processed" Books only. "We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens." =3,276,800
Fast.ai
"ULMFiT — 103 M tokens corpus × 14 epochs → 1.44 B tokens processed" "Corpus size. WikiText-103 contains about 103 million word-level tokens. Training schedule. The reference pre-training run trains for 14 full epochs on that corpus. Total tokens seen. 103 M tokens × 14 epochs → roughly 1.44 billion token prediction steps." Aussie Prof Jeremy Howard: https://www.abc.net.au/news/science/2023-11-15/jeremy-howard-taught-ai-to-the-world-and-helped-invent-chatgpt/103092474
"Transformer Big — 32 768 tokens per step × 300 000 steps → 9.83 B tokens processed" "We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs... For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens."
"Transformer Base — 32 768 tokens per step × 100 000 steps → 3.28 B tokens processed" "We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs... For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens."