LLM Timeline

Interactive timeline of Large Language Model releases (2017–2026)

769 models · 188 organizations · 642 open weights

2026: 34 models
Alibaba logo

Qwen3.5-27B

Alibaba

2026-02
27B
modelopen

"Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility"

Liquid AI logo

LFM2-24B-A2B

Liquid AI

2026-02
24B
modelopen

"a traditional instruct model without reasoning traces."

Inception logo

Mercury 2

Inception

2026-02
180B
modelopen

Diffusion large language model (dLLM).

Google DeepMind logo

Gemini 3.1 Pro

Google DeepMind

2026-02
3000B
modelopen

Knowledge cutoff is still January 2025. Announce: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/

Z

ZUNA

Zyphra

2026-02
0.38B
modelopen

For BCI, 'thought-to-text'. Training dataset calculations: (2M hours × 3,600 seconds/hour × 256 samples/second) / 32 samples/token = 57.6B tokens (refined to 45.1B after rigorous filtering); 150,000 steps × 2.16M tokens/batch = 324B total tokens seen during training. Announce: https://www.zyphra.com/post/zuna
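
A quick Python sketch of the dataset arithmetic above (constants are from this entry only):

    # Reproduce the ZUNA token accounting quoted above.
    hours = 2_000_000                      # 2M hours of recordings
    samples = hours * 3_600 * 256          # 256 samples/second
    tokens = samples / 32                  # 32 samples per token
    print(f"raw tokens: {tokens / 1e9:.1f}B")                 # 57.6B

    steps, batch_tokens = 150_000, 2_160_000
    print(f"tokens seen: {steps * batch_tokens / 1e9:.0f}B")  # 324B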

xAI logo

Grok 4.2

xAI

2026-02
3000B
modelopen

No details provided. Announce: https://x.com/elonmusk/status/2023829664318583105

PI

INTELLECT-3.1

Prime Intellect

2026-02
106B
modelopen

Base: GLM-4.5-Air-Base, via the INTELLECT-3 model. 106B-A12B.

Anthropic logo

Claude Sonnet 4.6

Anthropic

2026-02
400B
modelopen

1M context. Showing GMMLU (Global MMLU by Cohere). Announce: https://www.anthropic.com/news/claude-sonnet-4-6

Cohere logo

Tiny Aya

Cohere

2026-02
3.35B
modelopen

70+ languages. Showing GMMLU (Global MMLU by Cohere).

Alibaba logo

Qwen3.5-397B-A17B

Alibaba

2026-02
397B
modelopen

"Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility"

JO

JoyAI-LLM Flash

JD Open Source

2026-02
48B
modelopen

48B-A3B.

MiniMax logo

MiniMax-M2.5

MiniMax

2026-02
230B
modelopen

230B-A10B. HLE showing without tools.

Z

GLM-5

Z.AI

2026-02
744B
modelopen

744B-A40B. Announce: https://z.ai/blog/glm-5

N

Nanbeige4.1-3B

Nanbeige

2026-02
3B
modelopen

SOTA for size (3B)

Alibaba logo

RynnBrain-30B-A3B

Alibaba

2026-02
30B
modelopen

Base: Qwen3-VL-30B-A3B-Instruct. "an embodied foundation model grounded in physical reality."

Anthropic logo

Claude Opus 4.6

Anthropic

2026-02
5000B
modelopen

SA

Intern-S1-Pro

Shanghai AI Laboratory/SenseTime

2026-02
1000B
modelopen

1T-A22B. Assumes a Qwen3 base model. "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data"

S

Step 3.5 Flash

StepFun

2026-02
196B
modelopen

196B-A11B.

I

Assistant_Pepe_8B

Independent

2026-01
8B
modelopen

Warning for inappropriate content. Base: Llama-3.1-Nemotron-8B. "trained it on an extended 4chan dataset" "the original, gpt4chan (by Yannic Kilcher) scored especially high in truthfulness (that was b4 benchmaxxing)... outperformed the base tune (the unabliterated one), it also changed its political alignment... People were initially joking about the "alignment tax", I think there's a none trivial substance in all of this. It seems to me just above a marginal error or statistical noise."

AA

Trinity-Large

Arcee AI

2026-01
400B
modelopen

400B-A13B. "we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large."

Allen AI logo

SERA

Allen AI

2026-01
32B
modelopen

Base: Qwen3-32B. SERA=Soft-verified Efficient Repository Agents. "SERA was built largely by a single Ai2 researcher." https://allenai.org/blog/open-coding-agents "SERA-32B was trained using Soft Verified Generation (SVG), a simple and efficient method that is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. The total cost for data generation and training is approximately $2,000 (40 GPU-days)."

Moonshot AI logo

Kimi K2.5

Moonshot AI

2026-01
1000B
modelopen

1T-A32B. 1T parameters and 384 experts. Open source SOTA. "Kimi K2.5 builds on Kimi K2 [15.5T tokens] with continued pretraining over approximately 15T mixed visual and text tokens. [15.5T + 15T = 30.5T]"

Z

GLM-4.7-Flash

Z.AI

2026-01
30B
modelopen

30B-A3B.

Google DeepMind logo

MedGemma 1.5 4B

Google DeepMind

2026-01
4B
modelopen

Lower MMLU score compared to previous MedGemma 1 27B (67.2 v 87). Announce: https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/

Microsoft logo

FrogBoss

Microsoft

2026-01
32B
modelopen

Base: Qwen3-32B.

N

EDEN

NVIDIA

2026-01
28B
modelclosed

"EDEN (environmentally-derived evolutionary network) family of metagenomic foundation models, including a 28 billion parameter model trained on 9.7 trillion nucleotide tokens from BaseData1 . This dataset, at the time of training, contained more than 10 billion novel genes from over 1 million new species, and is intentionally enriched for environmental and host-associated metagenomes, phage sequences, and mobile genetic elements, enabling the model to learn from diverse and novel cross-species evolutionary mechanisms and apply them to key challenges in human health."

B

Baichuan-M3

Baichuan

2026-01
235B
modelopen

"new-generation medical-enhanced large language model"

DeepSeek-AI logo

Engram

DeepSeek-AI

2026-01
39.5B
modelpartial

39.5B-A3.8B. "we explore conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embeddings for O(1) lookup."
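
A minimal sketch of the idea named in the quote — hashing the trailing N-gram of token IDs into an embedding table for O(1) lookup per position. The polynomial hash, table size, and module shape are illustrative assumptions, not details from the Engram paper:

    import torch
    import torch.nn as nn

    class NGramMemory(nn.Module):
        # Hash the trailing n-gram at each position into a fixed table: O(1) lookup.
        def __init__(self, table_size=2**20, dim=512, n=3):
            super().__init__()
            self.table = nn.Embedding(table_size, dim)
            self.table_size, self.n = table_size, n

        def forward(self, token_ids):                  # (batch, seq)
            h = token_ids.clone()
            for k in range(1, self.n):                 # mix in the previous n-1 tokens
                shifted = torch.roll(token_ids, shifts=k, dims=1)
                shifted[:, :k] = 0                     # pad positions before the start
                h = h * 1_000_003 + shifted            # simple polynomial hash (assumption)
            return self.table(h % self.table_size)     # (batch, seq, dim)

    mem = NGramMemory()
    print(mem(torch.randint(0, 50_000, (2, 16))).shape)  # torch.Size([2, 16, 512])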

S

SleepFM

Stanford

2026-01
0.091B
modelopen

Uses a leave-one-out contrastive learning approach to align brain activity (EEG), heart activity (ECG), and respiratory signals. 130+ disease categories and 19–20+ clinical PSG channels. Dataset ≈1.26B tokens (585,000 hours of data across 3 modality groups, using 5-second window tokens) × 10 epochs ≈ 12.63B tokens seen during training.
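
The dataset arithmetic as a sketch (constants from this entry; the even split across the three modality groups is an assumption):

    hours = 585_000
    windows = hours * 3_600 / 5            # 5-second window tokens
    tokens = windows * 3                   # 3 modality groups (EEG, ECG, respiratory)
    print(f"dataset: {tokens / 1e9:.2f}B tokens")            # ~1.26B
    print(f"seen over 10 epochs: {tokens * 10 / 1e9:.1f}B")  # ~12.6B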

I

TimeCapsuleLLM-v2-1800-1875

Independent

2026-01
1.2B
modelopen

112GB dataset ≈ 30B tokens × 0.5 epochs = 15B tokens.

A

Jamba2

AI21

2026-01
52B
modelopen

52B-A12B. Pre-training tokens: 1.2T from Jamba + 500B mid-training.

Liquid AI logo

LFM2.5

Liquid AI

2026-01
1.2B
modelopen

For on-device agentic applications. "Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning."

M

MiroThinker v1.5

MiroMindAI

2026-01
235B
modelopen

Base: Qwen3 235B-A22B. Official demo: https://dr.miromind.ai

TII logo

Falcon-H1R

TII

2026-01
7B
modelopen

Base model: Falcon-H1 (May/2025). Announce: https://huggingface.co/blog/tiiuae/falcon-h1r-7b

2025: 235 models
DeepSeek-AI logo

mHC 27B

DeepSeek-AI

2025-12
27B
modelclosed

27B-A4.14B. Scaling tested with 3B MoE on 1T tokens=334:1. "Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability."

I

IQuest-Coder-V1

IQuestLab

2025-12
40B
modelopen

"IQuest-Coder-V1 captures the dynamic evolution of software logic, delivering state-of-the-art performance across critical dimensions" https://github.com/IQuestLab/IQuest-Coder-V1

SH

A.X K1

SK Hynix

2025-12
519B
modelopen

519B-A33B.

L

K-EXAONE

LG

2025-12
236B
modelopen

236B-A23B. “EXAONE”=“EXpert AI for EveryONE”.

U

Ranke-4B

UZH

2025-12
4B
modelclosed

Base Model: Qwen 3. Trained on 600B tokens of period data only, with variants cut off at 1913, 1929, 1933, 1939, and 1946.

Tencent logo

WeDLM

Tencent

2025-12
8B
modelopen

Project page: https://wedlm.github.io/ "WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions... We instantiate WeDLM on both Qwen2.5-7B and Qwen3-8B, utilizing 100B tokens for continued training and 10B tokens for SFT."

UA

SOLAR Open

Upstage AI

2025-12
102B
modelopen

South Korean. 102B-A12B. Releasing 31/Dec.

Z

GLM-4.7

Z.AI

2025-12
355B
modelopen

355B-A32B. "context window has been expanded from 128K to 200K tokens"

N

NitroGen

NVIDIA

2025-12
0.493B
modelopen

"NitroGen is a unified vision-to-action model designed to play video games directly from raw frames. It takes video game footage as input and outputs gamepad actions... trained on 40,000 hours of gameplay videos across more than 1,000 games."

X

MiMo-V2-Flash

Xiaomi

2025-12
309B
modelopen

309B-A15B.

Google DeepMind logo

FunctionGemma

Google DeepMind

2025-12
0.27B
modelopen

"FunctionGemma, a specialized version of our Gemma 3 270M model tuned for function calling. It is designed as a strong base for further training into custom, fast, private, local agents that translate natural language into executable API actions."

Google DeepMind logo

T5Gemma 2

Google DeepMind

2025-12
4B
modelopen

Base model: Gemma 3. Dataset: Gemma 3 4B checkpoint (4T) + pretraining (2T)=6T.

Google DeepMind logo

Gemini 3 Flash

Google DeepMind

2025-12
200B
modelopen

Announce: https://deepmind.google/models/gemini/flash/

N

NVIDIA-Nemotron-3-Nano-30B-A3B

NVIDIA

2025-12
30B
modelopen

Knowledge cutoff November 28, 2025 (post).

Allen AI logo

Bolmo

Allen AI

2025-12
7B
modelopen

Base Model: Olmo 3 7B. Announce: https://allenai.org/blog/bolmo

C

EuroLLM-22B

Consortium

2025-12
22B
modelopen

A fully open language model developed in Europe.

IA

LLaDA2.0 Flash

Inclusion AI

2025-12
103B
modelopen

Base Model: Ling-flash-2.0: 103B total parameters with 6.1B activated. "largest diffusion language model to date"

OpenAI logo

GPT-5.2

OpenAI

2025-12
3000B
modelopen

"GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations." Announce: https://openai.com/index/introducing-gpt-5-2/ MMLU is for Spanish.

S

Apriel-1.6-15B-Thinker

ServiceNow

2025-12
15B
modelopen

MT

Motif 2 12.7B

Motif-Technologies

2025-12
12.7B
modelopen

Mistral logo

Devstral 2

Mistral

2025-12
123B
modelopen

SWE-bench Verified=72.2%.

N3

Nanbeige4-3B-Base

Nanbeige

2025-12
3B
modelopen

Tencent logo

HY 2.0

Tencent

2025-12
406B
modelopen

406B-A32B.

M

K2-V2

MBZUAI

2025-12
70B
modelopen

8.5x more tokens trained than K2 (1.4T v 12T). Project page: https://ifm.ai/k2/

AA

Trinity-Mini

Arcee AI

2025-12
26B
modelopen

26B-A3B. "we worked closely with Prime Intellect. They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large."

Amazon logo

Nova 2 Pro

Amazon

2025-12
200B
modelopen

"Nova 2 Pro is Amazon's most intelligent reasoning model that can process text, images, video, and speech to generate text."

Mistral logo

Mistral Large 3

Mistral

2025-12
675B
modelopen

675B-A41B. "Mistral Large 3 joins the ranks of frontier instruction-fine-tuned open-source models." EU tech doc: https://legal.cms.mistral.ai/assets/1e37fffd-7ea5-469b-822f-05dcfbb43623

DeepSeek-AI logo

DeepSeek-V3.2-Speciale

DeepSeek-AI

2025-12
685B
modelopen

The word 'Speciale' may be a reference to Ferrari. "It shows gold-medal performance in the IOI 2025, ICPC World Final 2025, IMO 2025, and CMO 2025." API: https://api-docs.deepseek.com/news/news251201

DeepSeek-AI logo

DeepSeek-Math-V2

DeepSeek-AI

2025-11
685B
modelopen

"DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled testtime compute. "

N

Orchestrator-8B

NVIDIA

2025-11
8B
modelopen

Base Model: Qwen3-8B

PI

INTELLECT-3

Prime Intellect

2025-11
106B
modelopen

Base: GLM-4.5-Air-Base model. 106BA12B. Announce: https://www.primeintellect.ai/blog/intellect-3

Microsoft logo

Fara-7B

Microsoft

2025-11
7B
modelopen

"Fara-7B is Microsoft's first agentic small language model (SLM) designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA)...Current production baselines leverage Qwen 2.5-VL (7B)."

Anthropic logo

Claude Opus 4.5

Anthropic

2025-11
5000B
modelopen

"the best model in the world for coding, agents, and computer use." Announce: https://www.anthropic.com/news/claude-opus-4-5

N

Nemotron Elastic

NVIDIA

2025-11
12B
modelopen

"Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning...We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens"

Tencent logo

GeoVista

Tencent

2025-11
7B
modelopen

Base model: Qwen2.5-VL-7B-Instruct. "GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. " Project page: https://ekonwang.github.io/geo-vista/

Allen AI logo

OLMo 3

Allen AI

2025-11
32B
modelopen

Announce: https://allenai.org/blog/olmo3

Google DeepMind logo

Gemini 3 Pro

Google DeepMind

2025-11
3000B
modelopen

"The knowledge cutoff date for Gemini 3 Pro was January 2025."

xAI logo

Grok 4.1

xAI

2025-11
3000B
modelopen

P

Baguettotron

PleIAs

2025-11
0.321B
modelopen

"The name is both a nod to French origins and to the unusual shape of the model: with 80 layers, Baguettotron is currently the deepest SLM in its size range."

Baidu logo

ERNIE-5.0-Preview-1022

Baidu

2025-11
2400B
modelopen

Very low performance on ALPrompt. 2.4T params confirmed: https://global.chinadaily.com.cn/a/202511/13/WS691571bda310d6866eb29500.html

OpenAI logo

GPT-5.1

OpenAI

2025-11
2000B
modelopen

Personality change via fine-tuning. GPQA (no tools) increased from GPT-5=85.7 to GPT-5.1=88.1. MMLU is for Spanish.

N

TiDAR

NVIDIA

2025-11
8B
modelopen

Base model: Qwen3-8B (36T) + 150B continual training. "TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks"

T

JustRL-Nemotron-1.5B

Tsinghua

2025-11
1.5B
modelopen

"JustRL, a simple recipe with fixed hyperparameters, achieves state-of-the-art performance on two different 1.5B base models (54.5% and 64.3% across 9 math benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training remains stable over thousands of steps without intervention. This suggests the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline."

Baidu logo

ERNIE-4.5-VL-28B-A3B-Thinking

Baidu

2025-11
28B
modelopen

28B-A3B. Open-sourced 12/Nov/2025 from Jun/2025 release.

Google DeepMind logo

HOPE

Google DeepMind

2025-11
1.3B
modelpartial

"Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called HOPE, showing promising results in language modeling, continual learning, and long-context reasoning tasks." Announce: https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/ May be released after paper is public.

Moonshot AI logo

Kimi K2 Thinking

Moonshot AI

2025-11
1000B
modelopen

1T-A32B. 1T parameters and 384 experts. Open source SOTA. HLE=51.0 on the text-only subset, compared to Grok-4 HLE=50.7 also on text-only; but Grok-4 HLE=44.4 on full HLE, ∴ Kimi K2 Thinking HLE≈44 full (estimated).
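
The "∴" estimate is simple proportional scaling, made explicit below (scores from this entry):

    # Scale Kimi's text-only HLE by Grok-4's full/text-only ratio.
    kimi_text, grok_text, grok_full = 51.0, 50.7, 44.4
    print(round(kimi_text * grok_full / grok_text, 1))  # 44.7 -> HLE ≈ 44 full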

IA

Ling-1T

Inclusion AI

2025-11
1000B
modelopen

1T-A50B.

G

GEN-0

Generalist

2025-11
10B
modelpartial

"GEN-0, a new class of embodied foundation models built for multimodal training directly on high-fidelity raw physical interaction. Its architecture builds on the strengths of vision and language models while also going beyond them—natively designed to capture human-level reflexes and physical commonsense. One core feature is Harmonic Reasoning, in which the models are trained to simultaneously think and act seamlessly... GEN-0 is pretrained on our in-house robotics dataset, which includes over 270,000 hours of real-world diverse manipulation data, growing at a rate of 10,000 hours a week and accelerating."

W

CALM

Wechat

2025-10
1.82B
modelopen

"Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy... We train our models on the Pile uncopyrighted dataset (Gao et al., 2020). The raw text is processed with the Llama 3 tokenizer (Grattafiori et al., 2024), resulting in a training set of ∼230B tokens."

Moonshot AI logo

Kimi-Linear

Moonshot AI

2025-10
48B
modelopen

48B-A3B. "Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short, long, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA)—a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory."

MiniMax logo

MiniMax-M2

MiniMax

2025-10
230B
modelopen

230B-A10B.

C

MACE-MH-1

Cambridge/LBNL

2025-10
0.025B
modelopen

MACE-MH-1 (Multi-Head 1). Features Multiple Heads (OMAT PBE, OMOL r2scan, OC20) to maintain high accuracy across domains

DeepSeek-AI logo

DeepSeek-OCR

DeepSeek-AI

2025-10
3B
modelopen

2D vision tokens for 1D text achieves huge compression. Encoder/Decoder: DeepEncoder 380M (80M SAM-base + 300M CLIP-large), DeepSeek-3B-MoE (A570M).

Microsoft logo

UserLM-8b

Microsoft

2025-10
8B
modelopen

"we trained UserLM-8b to simulate the “user” role in conversation (by training it to predict user turns in a large corpus of conversations called WildChat)."

Salesforce logo

CoDA

Salesforce

2025-10
1.7B
modelopen

"diffusion coder trained on TPU [Google TPU v4-1024 VM]"

S

TRM

Samsung

2025-10
0.007B
modelopen

"Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers"

IBM logo

Granite-4.0 Small

IBM

2025-10
32B
modelopen

32B-A9B. Announce: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models

Z

GLM-4.6

Z.AI

2025-09
355B
modelopen

355B-A32B. "context window has been expanded from 128K to 200K tokens"

I

Ring-1T-preview

InclusionAI

2025-09
1000B
modelopen

1T-A48.5B.

Anthropic logo

Claude Sonnet 4.5

Anthropic

2025-09
400B
modelopen

The Claude Sonnet 4.5 "system card" is an absolute farce. Announce: https://www.anthropic.com/news/claude-sonnet-4-5

Google DeepMind logo

Gemini Robotics 1.5

Google DeepMind

2025-09
200B
modelopen

2. "vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task." Available to select partners.

Google DeepMind logo

Gemini Robotics-ER 1.5

Google DeepMind

2025-09
30B
modelopen

1. "vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission." Available to all devs.

Google logo

TimesFM-ICF

Google

2025-09
0.2B
modelclosed

TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.

Alibaba logo

Qwen3-Max

Alibaba

2025-09
1000B
modelopen

"Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. "

Alibaba logo

Qwen3-Omni

Alibaba

2025-09
30B
modelopen

"Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response."... "pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion)."

DeepSeek-AI logo

DeepSeek-V3.1-Terminus

DeepSeek-AI

2025-09
685B
modelopen

Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2

P

Isaac 0.1

Perceptron

2025-09
2B
modelopen

"perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in."

xAI logo

Grok 4 Fast

xAI

2025-09
3000B
modelopen

"2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model."

Google DeepMind logo

VaultGemma

Google DeepMind

2025-09
1B
modelopen

"Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points." Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/

Alibaba logo

Qwen3-Next-80B-A3B

Alibaba

2025-09
80B
modelopen

"Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference."

M

K2-Think

MBZUAI

2025-09
32B
modelopen

"Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets."

J

mmBERT

JHU

2025-09
0.307B
modelopen

"a modern multilingual encoder trained on 3T tokens and 1833 languages. We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data" Announce: https://huggingface.co/blog/mmbert

Baidu logo

ERNIE X1.1

Baidu

2025-09
modelopen

Baidu logo

ERNIE-4.5-21B-A3B-Thinking

Baidu

2025-09
21B
modelopen

K

Klear-46B-A2.5B

Kuaishou

2025-09
46B
modelopen

46B-A2.5B.

TA

TildeOpen-30b

Tilde AI

2025-09
30B
modelopen

"language data from across Europe"

Alibaba logo

Qwen3-Max-Preview

Alibaba

2025-09
1000B
modelopen

GPQA score is SuperGPQA. "our biggest model yet, with over 1 trillion parameters"

Moonshot AI logo

Kimi K2-Instruct-0905

Moonshot AI

2025-09
1000B
modelopen

1T-A32B. 1T parameters and 384 experts. Open source SOTA.

EZ

Apertus

ETH Zürich

2025-09
70B
modelopen

"Apertus – Latin for “open”" 1,811 languages. Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html

M

LongCat-Flash

Meituan

2025-09
560B
modelopen

560B-A18.6B–31.3B (27B on average). Announce: https://lmsys.org/blog/2025-09-01-sglang-longcat-flash/

B

Baichuan-M2

Baichuan

2025-09
32B
modelopen

Base: Qwen2.5. "medical augmented reasoning model"

Microsoft logo

MAI-1-preview

Microsoft

2025-08
500B
modelopen

MAI=Microsoft artificial intelligence. "MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. We will be rolling MAI-1-preview out for certain text use cases within Copilot"

xAI logo

grok-code-fast-1

xAI

2025-08
800B
modelopen

"We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks." Announce: https://x.ai/news/grok-code-fast-1

NR

Hermes 4

Nous Research

2025-08
405B
modelopen

Based on Llama 3. Announce: https://hermes4.nousresearch.com/

N

Jet-Nemotron-4B

NVIDIA

2025-08
4B
modelopen

"pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. The models are then trained on 350B tokens."

DeepSeek-AI logo

DeepSeek-V3.1-Base

DeepSeek-AI

2025-08
685B
modelopen

Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2

N

Nemotron Nano 2

NVIDIA

2025-08
12.31B
modelopen

Announce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/

Google DeepMind logo

Gemma 3 270M

Google DeepMind

2025-08
0.27B
modelopen

OpenAI logo

GPT-5

OpenAI

2025-08
1700B
modelopen

Announce: https://openai.com/index/introducing-gpt-5/. MMLU is based on ES and PT translated from EN.

OpenAI logo

gpt-oss-120b

OpenAI

2025-08
120B
modelopen

116.8B total parameters and 5.1B “active” parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/

OpenAI logo

gpt-oss-20b

OpenAI

2025-08
20B
modelopen

20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/

Anthropic logo

Claude Opus 4.1

Anthropic

2025-08
2000B
modelopen

Z

GLM-4.5

Z.AI

2025-07
355B
modelopen

355B-A32B.

CT

T1

China Telecom Artificial Intelligence Research Institute

2025-07
115B
modelopen

SA

Intern-S1

Shanghai AI Laboratory/SenseTime

2025-07
235B
modelopen

41T tokens assumes a Qwen3 base model (36T Qwen3 pre-training + 5T multimodal). "Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data"

S

Step 3

StepFun

2025-07
321B
modelopen

321B-A38B. https://x.com/CyouSakura/status/1948767450751009227

Alibaba logo

Qwen3-235B-A22B-Thinking-2507

Alibaba

2025-07
235B
modelopen

235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.

K

KAT-V1-200B

Kuaishou

2025-07
200B
modelclosed

200B-A40B. In training as of Jul/2025. "to address the overthinking problem in reasoning-intensive tasks"

K

KAT-V1-40B

Kuaishou

2025-07
40B
modelopen

"to address the overthinking problem in reasoning-intensive tasks"

Alibaba logo

Qwen3-Coder-480B-A35B-Instruct

Alibaba

2025-07
480B
modelopen

480B-A35B.

Alibaba logo

Qwen3-235B-A22B-Instruct-2507

Alibaba

2025-07
235B
modelopen

235B-A22B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages" MMLU score is MMLU-Redux.

Allen AI logo

FlexOlmo

Allen AI

2025-07
37B
modelopen

37B-A20B. "We adopt the OLMo-2 7B setup, starting from a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert. We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO."

L

EXAONE 4.0

LG

2025-07
32B
modelopen

“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: "To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct — a pre-filtering step that removes problems that are easy for the model, to avoid inefficient training."

Moonshot AI logo

Kimi K2

Moonshot AI

2025-07
1000B
modelopen

1T-A32B. 1T parameters and 384 experts. Open source SOTA.

RA

Reka Flash 3.1

Reka AI

2025-07
21B
modelopen

Mistral logo

Devstral Medium

Mistral

2025-07
50B
modelopen

Non-reasoning.

xAI logo

Grok 4

xAI

2025-07
3000B
modelopen

2.4T? https://x.com/kalomaze/status/1942996555088134592 "The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before."

K

KAT-V1-200B

Kwaipilot

2025-07
200B
modelopen

200B-A40B.

K

KAT-V1-40B

Kwaipilot

2025-07
40B
modelopen

Microsoft logo

Phi-4-mini-flash-reasoning

Microsoft

2025-07
3.8B
modelopen

"Pre-training: 5T tokens; Reasoning training: 150B tokens" "At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. "

Google DeepMind logo

T5Gemma

Google DeepMind

2025-07
9B
modelopen

Related paper: https://arxiv.org/abs/2504.06225. Dataset was Gemma 2 9B on 8T tokens + 2T tokens adapted.

Google DeepMind logo

MedGemma 1 27B

Google DeepMind

2025-07
27B
modelopen

Multimodal model. Text MMLU score for med only=87.0.

T

R1T2 Chimera

TNG

2025-07
685B
modelopen

Assembly of Experts-method of V3-0324, R1, R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46

C

Spectra 1.1

Consortium

2025-06
3.6B
modelopen

"Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights"

Apple logo

DiffuCoder

Apple

2025-06
7B
modelopen

"We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024)."

Tencent logo

Hunyuan-A13B

Tencent

2025-06
80B
modelopen

80B-A13B. 'We have open-sourced Hunyuan-A13B-Pretrain , Hunyuan-A13B-Instruct , Hunyuan-A13B-Instruct-FP8 , Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.'

Inception logo

Mercury

Inception

2025-06
90B
modelopen

Diffusion large language model (dLLM).

Microsoft logo

Mu

Microsoft

2025-06
0.5B
modelopen

"distillation from Microsoft’s Phi models...Mu is an efficient 330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture"

Google DeepMind logo

Gemini Robotics On-Device

Google DeepMind

2025-06
20B
modelopen

See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/

I

ICONN-1

ICONNAI

2025-06
88B
modelopen

"ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving."

MiniMax logo

MiniMax-M1

MiniMax

2025-06
456B
modelopen

456B-A45.9B. Announce: https://www.minimax.io/news/minimaxm1

Mistral logo

Magistral Medium

Mistral

2025-06
50B
modelopen

Magistral Small=24B. Announce: https://mistral.ai/news/magistral

E

Comma v0.1-2T

EleutherAI

2025-06
7B
modelopen

"Comma v0.1-2T is a decoder-only transformer that uses the same architecture as Llama 3. Training was done in two stages: first on 1.93 trillion tokens with a cosine learning rate schedule, and second a "cool-down" training phase on 75.5 billion tokens from high-quality sources. The final model is the average of 10 checkpoints during this cool-down phase. Both training phases use a batch size of 8.3 million tokens per step. Training was performed using lingua on 512 AMD MI300A GPUs."

X

dots.llm1

Xiaohongshu/RedNote

2025-06
142B
modelopen

142B-A14B. "dots.llm1, a large-scale MoE model that activates 14 billion parameters out of a total of 142 billion parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models."

Google DeepMind logo

Gemini 2.5 Pro 06-05

Google DeepMind

2025-06
400B
modelopen

"an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be the generally available, stable version starting in a couple of weeks, ready for enterprise-scale applications."

X

MiMo-7B-RL-0530

Xiaomi

2025-05
7B
modelopen

"[2025.05.30] During the RL training, by continuously expanding the training window size (from 32K to 48K), the performance of MiMo-7B-RL-0530 on AIME24 can be continuously improved and eventually surpass that of DeepSeek R1... MiMo-7B-Base is pre-trained on approximately 25 trillion tokens."

Google DeepMind logo

DeepTransformers

Google DeepMind

2025-05
1.3B
modelclosed

"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."

Google DeepMind logo

Atlas

Google DeepMind

2025-05
1.3B
modelclosed

"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture."

DeepSeek-AI logo

DeepSeek-R1-0528

DeepSeek-AI

2025-05
685B
modelopen

Censorship increased significantly. "overall performance is now approaching that of leading models, such as o3 and Gemini 2.5 Pro." MMLU shows MMLU-Redux score with lower error rate.

FA

Fathom-R1-14B

Fractal Analytics

2025-05
14B
modelopen

Base R1-distilled-14B model, based on Qwen 14B. Media release.

Alibaba logo

QwenLong-L1-32B

Alibaba

2025-05
32B
modelopen

"the first long-context LRM trained with reinforcement learniing for long-context reasoning."

Anthropic logo

Claude Opus 4

Anthropic

2025-05
6000B
modelopen

"Claude Opus 4 is our most intelligent model to date, pushing the frontier in coding, agentic search, and creative writing. With advanced reasoning and powerful collaboration capabilities…Both models can also alternate between reasoning and tool use—like web search—to improve responses…Claude Opus 4 can work continuously for hours on complex, long-running tasks"

TII logo

Falcon-H1

TII

2025-05
34B
modelopen

"hybrid architecture that combines the strengths of the classical Transformer-based attention mechanism with the State Space Model (SSM), known for its superior long-context memory and computational efficiency."

Google DeepMind logo

Gemini Diffusion

Google DeepMind

2025-05
40B
modelopen

"Gemini Diffusion’s external benchmark performance is comparable to much larger models [like Gemini-2.0-Flash-Lite], whilst also being faster."

Google DeepMind logo

Gemma 3n

Google DeepMind

2025-05
4B
modelopen

Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M).

Alibaba logo

ParScale

Alibaba

2025-05
4.7B
modelopen

"We introduce the third scaling paradigm for scaling LLMs: leverages parallel computation during both training and inference time (Parallel Scaling, or ParScale)... ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget." MMLU shows for 1.8B models, not the 4.7B models.

OpenAI logo

codex-1

OpenAI

2025-05
600B
modelopen

o3 base. "codex-1, a version of OpenAI o3 optimized for software engineering. It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result."

TII logo

Falcon-Edge

TII

2025-05
3B
modelopen

"Falcon-Edge series - a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture."

W

SWE-1

Windsurf

2025-05
50B
modelopen

"SWE-1, optimized for the entire software engineering process, not just the task of coding."

PI

INTELLECT-2

Prime Intellect

2025-05
32B
modelopen

QwQ-32B base. Announce: https://www.primeintellect.ai/blog/intellect-2-release Finished training 30/Apr/2025: https://app.primeintellect.ai/intelligence/intellect-2

H

Pangu Ultra MoE

Huawei

2025-05
718B
modelclosed

718B-A39B. Trained on 6,000 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers).

Mistral logo

Mistral Medium 3

Mistral

2025-05
50B
modelopen

Multimodal. 50B param estimate based on "Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments of four GPUs and above.". Note: "With the launches of Mistral Small in March and Mistral Medium today, it’s no secret that we’re working on something ‘large’ over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we’re excited to ‘open’ up what’s to come :) "

IBM logo

Granite-4.0-Tiny-Preview

IBM

2025-05
7B
modelopen

"the model is only partially trained—it has only seen 2.5T of a planned 15T or more training tokens...Granite 4.0 Tiny-Preview, specifically, is a fine-grained hybrid mixture of experts (MoE) model, with 7B total parameters and only 1B active parameters at inference time... Like its predecessors in Granite 3.2 and Granite 3.3, Granite 4.0 Tiny Preview offers toggleable thinking on and thinking off functionality (though its reasoning-focused post-training is very much incomplete)."

Amazon logo

Nova Premier

Amazon

2025-04
470B
modelopen

Announce: https://aws.amazon.com/blogs/aws/amazon-nova-premier-our-most-capable-model-for-complex-tasks-and-teacher-for-model-distillation/

Microsoft logo

Phi-4-reasoning-plus

Microsoft

2025-04
14B
modelopen

"Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning."

IBM logo

Bamba-9B-v2

IBM

2025-04
9B
modelopen

"During Christmas of 2024, IBM, Princeton, CMU, and UIUC released, Bamba v1, a performant Mamba2 based pretrained model with full data lineage trained to 2T tokens. Since then, we have been busy cooking an update with new datasets. Today, we are excited to release Bamba v2, trained for an additional 1T tokens that significantly improves on Bamba v1. The L1 and L2 leaderboard scores outperform Llama 3.1 8B, which was trained with nearly 5x the amount of data. All of this with the inference speedup that we get from Mamba2 based architecture, which with the latest vLLM is 2-2.5x faster than similar sized transformer models."

Alibaba logo

Qwen3-235B-A22B

Alibaba

2025-04
235B
modelopen

Qwen3-235B-A22B. Also released: Qwen3-30B-A3B. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages"

Alibaba logo

Qwen3-0.6B

Alibaba

2025-04
0.6B
modelopen

Record data ratio 60,000:1. "Qwen3 is pre-trained on 36 trillion tokens across 119 languages"

Baidu logo

ERNIE X1 Turbo

Baidu

2025-04
modelopen

Announce: https://x.com/Baidu_Inc/status/1915603080336597310

Baidu logo

ERNIE 4.5 Turbo

Baidu

2025-04
modelopen

Announce: https://x.com/Baidu_Inc/status/1915603080336597310

Microsoft logo

MAI-DS-R1

Microsoft

2025-04
685B
modelopen

DeepSeek-R1 base. "MAI-DS-R1, a new open weights DeepSeek R1 model variant... post-trained by the Microsoft AI team to improve its responsiveness on blocked topics and its risk profile, while maintaining its reasoning capabilities and competitive performance."

Google DeepMind logo

Gemini 2.5 Flash Preview

Google DeepMind

2025-04
80B
modelopen

Context in=1M, out=64k. Knowledge cutoff Jan/2025. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

OpenAI logo

o4-mini

OpenAI

2025-04
200B
modelopen

https://openai.com/index/introducing-o3-and-o4-mini/ MMLU shows a translated LOTE.

OpenAI logo

o3

OpenAI

2025-04
600B
modelopen

https://openai.com/index/introducing-o3-and-o4-mini/ MMLU shows a translated LOTE.

Microsoft logo

BitNet b1.58 2B4T

Microsoft

2025-04
2B
modelopen

"the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. Trained on a corpus of 4 trillion tokens"

IBM logo

Granite 3.3 8B Instruct

IBM

2025-04
8B
modelopen

"Built on top of an updated Granite 3.3 base model and fine-tuned through multi-stage reinforcement learning using TPO and Group Relative Policy Optimization (GRPO), both Granite 3.3 Instruct models demonstrated significant improvement on the highly technical benchmarks conventionally associated with “reasoning” capabilities."

ZA

GLM-4-0414

Zhipu AI (Tsinghua)

2025-04
32B
modelopen

Family: GLM-4-32B-Base-0414, GLM-4-32B-0414, GLM-Z1-32B-0414 (reasoning), GLM-Z1-Rumination-32B-0414 (reasoning + deep research).

AS

SEA-LION v3.5 70B R

AI Singapore

2025-04
70B
modelopen

"Based on Llama 3.1 70B. SEA-LION v3.5, our first set of hybrid reasoning models trained on Southeast Asian data. Mode selection is managed through the tokenizer’s chat template and offers versatile functionality, handling both complex reasoning tasks and general text generation."

OpenAI logo

GPT-4.1

OpenAI

2025-04
300B
modelopen

Outperforms GPT‑4o "across the board, with major gains in coding and instruction following. They also have larger context windows—supporting up to 1 million tokens of context—and are able to better use that context with improved long-context comprehension. They feature a refreshed knowledge cutoff of June 2024."

Google DeepMind logo

DolphinGemma

Google DeepMind

2025-04
0.4B
modelopen

"trained on Atlantic spotted dolphin sounds, we anticipate its potential utility for researchers studying other cetacean species, like bottlenose or spinner dolphins... Developed by Google, this AI model makes use of specific Google audio technologies: the SoundStream tokenizer efficiently represents dolphin sounds, which are then processed by a model architecture suited for complex sequences. This ~400M parameter model is optimally-sized to run directly on the Pixel phones WDP uses in the field."

S

Apriel-5B

ServiceNow

2025-04
5B
modelopen

SLAM - ServiceNow Language Models Lab. The first release in the Apriel model family, designed to support research on foundation models.

ByteDance logo

Seed-Thinking-v1.5

ByteDance

2025-04
200B
modelopen

200B-A20B. "Seed-Thinking-v1.5, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed-Thinking-v1.5 achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding."

H

Dream 7B

Huawei

2025-04
7B
modelopen

"with Huawei Noah’s Ark Lab, we [Hong Kong University] release Dream 7B (Diffusion reasoning model), the most powerful open diffusion large language model to date."

N

UltraLong-8B

NVIDIA

2025-04
8B
modelopen

Llama-3.1-8B-Instruct base. 4M context window.

T

Deepcoder-14B-Preview

Together

2025-04
14B
modelopen

Base DeepSeek-R1-Distill-Qwen-14B.

H

Pangu Ultra

Huawei

2025-04
135B
modelopen

Trained on 8,192 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers).

N

Nemotron-H-56B-Base

NVIDIA

2025-04
56B
modelopen

https://research.nvidia.com/labs/adlr/nemotronh/

N

Llama-3.1-Nemotron-Ultra-253B

NVIDIA

2025-04
253B
modelopen

Llama 3.1 405B base. "Llama-3.1-Nemotron-Ultra-253B-v1 is a large language model (LLM) which is a derivative of Meta Llama-3.1-405B-Instruct (AKA the reference model). It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. The model supports a context length of 128K tokens. This model fits on a single 8xH100 node for inference."

Meta AI logo

Llama 4 Behemoth

Meta AI

2025-04
2000B
modelclosed

2T-A288B. Announced Apr/2025, abandoned Jul/2025. "We also trained a teacher model, Llama 4 Behemoth, that outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks such as MATH-500 and GPQA Diamond... 288B active parameters, 16 experts, and nearly two trillion total parameters."

Meta AI logo

Llama 4 Maverick

Meta AI

2025-04
400B
modelopen

400B-A17B. "Our most powerful open source multimodal model. 17B active params x 128 experts, 400B total params"

Meta AI logo

Llama 4 Scout

Meta AI

2025-04
109B
modelopen

200 languages, "includes diverse text, image, and video datasets."

Google DeepMind logo

Sec-Gemini v1

Google DeepMind

2025-04
400B
modelopen

"Sec-Gemini v1 achieves this by combining Gemini’s advanced capabilities with near real-time cybersecurity knowledge and tooling. This combination allows it to achieve superior performance on key cybersecurity workflows, including incident root cause analysis, threat analysis, and vulnerability impact understanding."

DeepSeek-AI logo

DeepSeek-GRM-27B

DeepSeek-AI

2025-04
27B
modelopen

Gemma-2-27B base. "Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models... The models will be released and open-sourced."

FA

Qwerky-72B

Featherless AI

2025-04
72B
modelopen

"As demonstrated with our Qwerky-72B-Preview and prior models such as QRWKV6-32B Instruct Preview, we have successfully converted Qwen 2.5 72B into a RWKV variant without requiring a pretrain on the base model or retraining the model from scratch. Enabling us to test and validate the more efficient RWKV Linear attention" Dataset from Qwen2.5=18,000 tokens.

DC

Cogito 70B

Deep Cogito

2025-04
70B
modelopen

"We are releasing early checkpoints of models in sizes 3B, 8B, 14B, 32B and 70B trained using this methodology, starting from pretrained Llama / Qwen base checkpoints."

Google DeepMind logo

Agentic-Tx

Google DeepMind

2025-03
400B
modelopen

"a therapeutics-focused agentic system powered by Gemini 2.0 Pro. Agentic-Tx is equipped with 18 tools, including: TxGemma as a tool for multi-step reasoning"

Google DeepMind logo

TxGemma

Google DeepMind

2025-03
27B
modelopen

"a suite of efficient, generalist large language models (LLMs) capable of therapeutic property prediction as well as interactive reasoning and explainability. Unlike task-specific models, TxGemma synthesizes information from diverse sources, enabling broad application across the therapeutic development pipeline."

Google DeepMind logo

Gemini 2.5 Pro Preview

Google DeepMind

2025-03
400B
modelopen

Context in=1M, out=64k. Knowledge cutoff Jan/2025. HLE SOTA. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

DeepSeek-AI logo

DeepSeek-V3 0324

DeepSeek-AI

2025-03
685B
modelopen

Non-reasoning. Significant increase in benchmark performance compared to original V3 from Dec/2024: MMLU-Pro: 75.9 ➜ 81.2, GPQA: 59.1 ➜ 68.4. 37B active.

N

Llama-3.3-Nemotron-Super-49B-v1

NVIDIA

2025-03
49B
modelopen

Meta Llama-3.3-70B-Instruct derivative "that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. The model supports a context length of 128K tokens."

L

EXAONE Deep

LG

2025-03
32B
modelopen

“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio dropped: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T (Dec/2024) -> EXAONE-3.5 32B (also Deep)=6.5T. Announce: https://www.lgresearch.ai/news/view?seq=543

Mistral logo

Mistral Small 3.1

Mistral

2025-03
24B
modelopen

"Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance."

Baidu logo

ERNIE 4.5

Baidu

2025-03
424B
modelopen

424B-A47B. Announce: https://x.com/Baidu_Inc/status/1901094083508220035

Baidu logo

X1

Baidu

2025-03
modelopen

Allen AI logo

OLMo 2 32B

Allen AI

2025-03
32B
modelopen

"the first fully-open model (all data, code, weights, and details are freely available) to outperform GPT3.5-Turbo and GPT-4o mini on a suite of popular, multi-skill academic benchmarks. It is comparable to the leading open-weight models while requiring only a fraction of training compute."

Cohere logo

Command A

Cohere

2025-03
111B
modelopen

Context=256k. "Command A is an open weights research release of a 111 billion parameter model optimized for demanding enterprises that require fast, secure, and high-quality AI. Compared to other leading proprietary and open-weights models Command A delivers maximum performance with minimum hardware costs, excelling on business-critical agentic and multilingual tasks while being deployable on just two GPUs."

Google DeepMind logo

Gemini Robotics

Google DeepMind

2025-03
200B
modelclosed

Gemini 2.0 Pro (cloud). "The second model is Gemini Robotics, a state-of-the-art Vision-Language-Action (VLA) model that connects strong embodied reasoning priors to dexterous low-level control of real-world robots to solve challenging manipulation tasks. As a generalist VLA, Gemini Robotics can perform a wide array of diverse and complicated tasks, while also closely following language guidance and generalizing to distribution shifts in instructions, visuals, and motions. To emphasize the flexibility and generality of the Gemini Robotics models, we also introduce an optional specialization stage, which demonstrates how Gemini Robotics can be adapted for extreme dexterity, for advanced reasoning in difficult generalization settings, and for controlling completely new robot embodiments."

Google DeepMind logo

Gemini Robotics-ER

Google DeepMind

2025-03
30B
modelclosed

Gemini 2.0 Flash (on device). "The first model is Gemini Robotics-ER, a VLM with strong embodied reasoning capabilities at its core, exhibiting generalization across a wide range of embodied reasoning tasks while also maintaining its core foundation model capabilities. Gemini Robotics-ER exhibits strong performance on multiple capabilities critical for understanding the physical world, ranging from 3D perception to detailed pointing to robot state estimation and affordance prediction via code."

Google DeepMind logo

Gemma 3

Google DeepMind

2025-03
27B
modelopen

Trained on 1T more tokens than Gemma 2. "introduces vision understanding abilities, a wider coverage of languages and longer context – at least 128K tokens."

RA

Reka Flash 3

Reka AI

2025-03
21B
modelopen

"performs competitively with proprietary models such as OpenAI o1-mini, making it a good foundation to build applications that require low latency or on-device deployment. It is currently the best open model in its size category."

Alibaba logo

QwQ-32B

Alibaba

2025-03
32B
modelopen

Update to QwQ-32B-Preview released Nov/2024. Scores 1/5 on latest ALPrompt 2024 H2. QwQ = "Qwen with Questions".

A

Jamba 1.6

AI21

2025-03
398B
modelopen

"The AI21 Jamba 1.6 family of models is state-of-the-art, hybrid SSM-Transformer instruction following foundation models. The Jamba models are the most powerful & efficient long-context models on the market, which deliver up to 2.5X faster inference than leading models of comparable sizes."

A

Instella-3B

AMD

2025-03
3B
modelopen

"trained from scratch on AMD Instinct™ MI300X GPUs. Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts."

Alibaba logo

Babel-83B

Alibaba

2025-03
83B
modelopen

"top 25 languages by number of speakers, including English, Chinese, Hindi, Spanish, Arabic, French, Bengali, Portuguese, Russian, Urdu, Indonesian, German, Japanese, Swahili, Filipino, Tamil, Vietnamese, Turkish, Italian, Javanese, Korean, Hausa, Persian, Thai, and Burmese. These 25 languages support over 90% of the global population..."

IBM logo

Granite-3.2-8B-Instruct

IBM

2025-02
8B
modelopen

"The new Granite 3.2 8B Instruct [offers] experimental chain-of-thought reasoning capabilities "

Cohere logo

C4AI Command R7B Arabic

Cohere

2025-02
7B
modelopen

"C4AI Command R7B Arabic is an open weights research release of a 7 billion parameter custom model with advanced capabilities optimized for the Arabic language (MSA dialect) along with English. The model excels at tasks that enterprises care about: instruction following, length control, RAG, and responding in the correct language. It also demonstrates excellent general purpose knowledge and understanding of Arabic language and cultures."

OpenAI logo

GPT-4.5

OpenAI

2025-02
3000B
modelopen

"Our largest and best model for chat" https://openai.com/index/introducing-gpt-4-5/ "GPT-4.5 is not a frontier model, but it is OpenAI’s largest LLM, improving on GPT-4’s computational efficiency by more than 10x. While GPT-4.5 demonstrates increased world knowledge, improved writing ability, and refined personality over previous models, it does not introduce net-new frontier capabilities compared to previous reasoning releases, and its performance is below that of o1, o3-mini, and deep research on most preparedness evaluations."

Tencent logo

Hunyuan T1

Tencent

2025-02
389B
modelclosed

"Based on Turbo S, by introducing technologies such as long thinking chains, retrieval enhancement and reinforcement learning, Hunyuan also launched the reasoning model T1 with deep thinking. This model has been fully launched on Tencent Yuanbao ( Tencent Hunyuan T1 model is open to all users ) , users can choose Deepseek R1 or Tencent Hunyuan T1 model to answer. The official version of Tencent Hunyuan T1 model will be launched soon, providing API access and other services to the outside world."

Tencent logo

Hunyuan Turbo S

Tencent

2025-02
389B
modelclosed

Fast thinking ("Instant reply"). "This is also the first time in the industry that the Mamba architecture has been successfully applied losslessly to a very large MoE model."

Microsoft logo

Phi-4-multimodal

Microsoft

2025-02
5.6B
modelopen

"Training data: 5T tokens, 2.3M speech hours, and 1.1T image-text tokens" Announce: https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/

Microsoft logo

Phi-4-mini

Microsoft

2025-02
3.8B
modelopen

"Phi-4-mini’s training data includes a wide variety of sources, totaling 5 trillion tokens, and is a combination of publicly available documents filtered for quality, selected high-quality educational data, and code newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (e.g., science, daily activities, theory of mind, etc.) high quality chat format supervised data covering various topics to reflect human preferences" Announce: https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/

Inception logo

Mercury Coder Small

Inception

2025-02
40B
modelclosed

Diffusion large language model (dLLM). Very low 'IQ' performance (0/5 on all ALPrompts). Fast: 1,000tok/s. https://x.com/inceptionailabs/status/1894847921474150456

Alibaba logo

QwQ-Max-Preview

Alibaba

2025-02
325B
modelclosed

"As a sneak peek into our upcoming QwQ-Max release, this version offers a glimpse of its enhanced capabilities, with ongoing refinements and an official Apache 2.0-licensed open-source launch of QwQ-Max and Qwen2.5-Max planned soon." Announce: https://x.com/Alibaba_Qwen/status/1894130603513319842

Anthropic logo

Claude 3.7 Sonnet

Anthropic

2025-02
175B
modelclosed

Knowledge cutoff now November 2024 (was April 2024). "the first hybrid reasoning model on the market." https://www.anthropic.com/news/claude-3-7-sonnet

Moonshot AI logo

Moonlight

Moonshot AI

2025-02
16B
modelopen

"Scaling law experiments indicate that Muon achieves ∼ 2× computational efficiency compared to AdamW with compute optimal training." https://github.com/MoonshotAI/Moonlight?tab=readme-ov-file

F

S2

Figure

2025-02
7B
modelclosed

Likely based on OpenVLA 7B (Jun/2024, based on Llama 2 7B) or Molmo 7B-O (Sep/2024, based on OLMo-7B-1024 with OpenAI CLIP). "high quality, multi-robot, multi-operator dataset of diverse teleoperated behaviors, ~500 hours in total. To generate natural language-conditioned training pairs, we use an auto-labeling VLM to generate hindsight instructions. The VLM processes segmented video clips from the onboard robot cameras, prompted with: "What instruction would you have given the robot to get the action seen in this video?" All items handled during training are excluded from evaluations to prevent contamination. Architecture Our system comprises two main components: S2, a VLM backbone, and S1, a latent-conditional visuomotor transformer. S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data. It processes monocular robot images and robot state information (consisting of wrist pose and finger positions) after projecting them into vision-language embedding space. Combined with natural language commands specifying desired behaviors, S2 distills all semantic task-relevant information into a single continuous latent vector, passed to S1 to condition its low-level actions. S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level control. It relies on a fully convolutional, multi-scale vision backbone for visual processing, initialized from pretraining done entirely in simulation. While S1 receives the same image and state inputs as S2, it processes them at a higher frequency to enable more responsive closed-loop control. The latent vector from S2 is projected into S1's token space and concatenated with visual features from S1's vision backbone along the sequence dimension, providing task conditioning. S1 outputs full upper body humanoid control at 200hz, including desired wrist poses, finger flexion and abduction control, and torso and head orientation targets. We append to the action space a synthetic "percentage task completion" action, allowing Helix to predict its own termination condition, which makes it easier to sequence multiple learned behaviors."

F

S1

Figure

2025-02
0.08B
modelclosed

"high quality, multi-robot, multi-operator dataset of diverse teleoperated behaviors, ~500 hours in total. To generate natural language-conditioned training pairs, we use an auto-labeling VLM to generate hindsight instructions. The VLM processes segmented video clips from the onboard robot cameras, prompted with: "What instruction would you have given the robot to get the action seen in this video?" All items handled during training are excluded from evaluations to prevent contamination. Architecture Our system comprises two main components: S2, a VLM backbone, and S1, a latent-conditional visuomotor transformer. S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data. It processes monocular robot images and robot state information (consisting of wrist pose and finger positions) after projecting them into vision-language embedding space. Combined with natural language commands specifying desired behaviors, S2 distills all semantic task-relevant information into a single continuous latent vector, passed to S1 to condition its low-level actions. S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level control. It relies on a fully convolutional, multi-scale vision backbone for visual processing, initialized from pretraining done entirely in simulation. While S1 receives the same image and state inputs as S2, it processes them at a higher frequency to enable more responsive closed-loop control. The latent vector from S2 is projected into S1's token space and concatenated with visual features from S1's vision backbone along the sequence dimension, providing task conditioning. S1 outputs full upper body humanoid control at 200hz, including desired wrist poses, finger flexion and abduction control, and torso and head orientation targets. We append to the action space a synthetic "percentage task completion" action, allowing Helix to predict its own termination condition, which makes it easier to sequence multiple learned behaviors."

B

Baichuan-M1-14B

Baichuan

2025-02
14B
modelopen

Medical LLM. 20T training tokens is a huge increase over the standard for a 14B-param model.

AI

Evo 2

Arc Institute

2025-02
40B
modelopen

"Evo 2 is a state of the art DNA language model for long context modeling and design. Evo 2 models DNA sequences at single-nucleotide resolution at up to 1 million base pair context length using the StripedHyena 2 architecture. Evo 2 was pretrained using Savanna. Evo 2 was trained autoregressively on OpenGenome2, a dataset containing 8.8 trillion tokens from all domains of life." Greg Brockman co-author.

P

R1 1776

Perplexity

2025-02
685B
modelopen

Censorship reduced, based on DeepSeek-R1.

xAI logo

Grok-3

xAI

2025-02
3000B
modelclosed

https://x.ai/blog/grok-3 My full analysis: https://lifearchitect.ai/whats-in-grok/

Mistral logo

Mistral Saba

Mistral

2025-02
24B
modelopen

"Mistral Saba is a 24B parameter model trained on meticulously curated datasets from across the Middle East and South Asia."

BS

Salamandra

Barcelona Supercomputing Center

2025-02
40B
modelopen

"The final [pre-training] dataset is composed of 55.51% FineWeb-Edu, 25.32% Colossal Oscar, 8.38% Wikipedia, 7.17% Aya Collection, and 3.63% StarCoder, totalling 315 billion tokens."

NR

DeepHermes 3 Preview

Nous Research

2025-02
8B
modelopen

Based on Llama 3 8B. GPQA score based on GPT-4o's analysis of the chart :-/ "one of the first models in the world to unify Reasoning (long chains of thought that improve answer accuracy) and normal LLM response modes into one model." https://x.com/NousResearch/status/1890148004029759612

SA

OREAL-32B

Shanghai AI Laboratory/SenseTime

2025-02
32B
modelopen

OREAL=Outcome REwArd-based reinforcement Learning.

Google DeepMind logo

Gemini 2.0 Pro

Google DeepMind

2025-02
200B
modelclosed

Context=2M. Disappointing benchmarks, this is the 'pro' (medium) not 'ultra' (large) model. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

S

s1-32B

Stanford

2025-02
32B
modelopen

Based on Qwen2.5-32B-Instruct. "we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to doublecheck its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24)."
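The budget-forcing trick lends itself to a few lines of pseudocode. A hedged sketch follows; the generate callable, delimiter string, and crude token counting are illustrative assumptions, not the s1 authors' implementation.

```python
# Sketch of "budget forcing": suppress the end-of-thinking delimiter and
# append "Wait" until a minimum thinking budget is spent, or cut thinking
# off at a maximum budget. `generate(text, stop)` is an assumed API that
# returns the next chunk of generated text.

def budget_forced_think(generate, prompt, min_tokens=512, max_tokens=4096,
                        end_think="<|end_think|>"):
    thinking = ""
    while True:
        chunk = generate(prompt + thinking, stop=end_think)
        thinking += chunk
        n = len(thinking.split())          # crude token count for the sketch
        if n >= max_tokens:                # budget exhausted: force termination
            break
        if n < min_tokens:
            thinking += " Wait"            # forces the model to re-check, per the paper
            continue
        break
    return prompt + thinking + end_think   # the answer phase would follow
```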

OpenAI logo

o3-mini

OpenAI

2025-01
20B
modelclosed

GPQA=79.7 for 'high' thinking. ALPrompt 2025H1=1/5. My analysis is that this model’s performance is very poor, with responses often becoming disordered and illogical. OpenAI compared o3-mini to OpenAI’s software engineers, and it performed very poorly (o3-mini=0%, o1=12%). "o3-mini models have the lowest performance, with scores of 0%… We suspect o3-mini’s low performance is due to poor instruction following and confusion about specifying tools in the correct format. The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance." (o3-mini paper, p31)

Mistral logo

Mistral Small 3

Mistral

2025-01
24B
modelopen

MMLU=base, -Pro=base, GPQA=instruct. "When quantized, Mistral Small 3 can be run privately on a single RTX 4090 or a Macbook with 32GB RAM." "Mistral Small 3 is neither trained with RL nor synthetic data"

Allen AI logo

Llama-3.1-Tulu-3-405B

Allen AI

2025-01
405B
modelopen

Lower MMLU score than Llama 3.1 405B base.

Alibaba logo

Qwen2.5-Max

Alibaba

2025-01
325B
modelclosed

"Qwen2.5-Max emerges as a milestone in MoE development, featuring an impressive 325 billion parameters. The model has been pretrained on over 20 trillion tokens and further refined with advanced post-training methodologies such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)." https://wandb.ai/byyoung3/ml-news/reports/Qwen2-5-Max-Advancing-Large-Scale-Mixture-of-Expert-Models---VmlldzoxMTEyMjUyNg

S

EvaByte

SambaNova

2025-01
6.5B
modelopen

"efficient byte-level processing at scale... [compared to tokenizer-based LMs:] 5x less training data, excelling in coding tasks, and decoding up to 2x faster. Its token-free design also brings added flexibility, avoiding tokenizer quirks while naturally extending to multimodal applications without any architecture tweaks."

ByteDance logo

UI-TARS-72B

ByteDance

2025-01
72B
modelopen

VLM. SoTA agent 'computer use' model as of 23/Jan/2025.

ByteDance logo

Doubao-1.5-pro

ByteDance

2025-01
300B
modelclosed

Includes 2.4B param ViT. "Doubao-1.5-pro uses a sparse MoE architecture. In the pre-training stage, the performance of the MoE model activated with only a small number of parameters can exceed that of ultra-large dense pre-trained models such as Llama3.1-405B. Through the study of the sparsity scaling law, the team determined the sparse ratio that balances performance and efficiency, and determined based on the MoE scaling law that a model activated with a small number of parameters can achieve the performance of a world-class model."

Moonshot AI logo

Kimi k1.5

Moonshot AI

2025-01
500B
modelclosed

"our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities---e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista---matching OpenAI's o1". GPQA score is my estimate from pp13–14, noting that "the scores above come from an internal long-cot model with much smaller model size than k1.5 long-CoT model."

DeepSeek-AI logo

DeepSeek-R1

DeepSeek-AI

2025-01
685B
modelopen

"DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks"

OpenAI logo

GPT-4b

OpenAI

2025-01
8B
modelclosed

Protein sequence model. "The model was trained on examples of protein sequences from many species, as well as information on which proteins tend to interact with one another. While that’s a lot of data, it’s just a fraction of what OpenAI’s flagship chatbots were trained on, making GPT-4b an example of a “small language model” that works with a focused data set." https://www.technologyreview.com/2025/01/17/1110086/openai-has-created-an-ai-model-for-longevity-science/

K

Helium-1

Kyutai

2025-01
2B
modelopen

"Helium-1 preview, an initial version of our new backbone language model with 2B parameters, targeting edge and mobile devices... We use token level distillation of a 7B parameters model to train Helium-1 preview."

SA

InternLM3

Shanghai AI Laboratory/SenseTime

2025-01
8B
modelopen

"InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale." Playground: https://internlm-chat.intern-ai.org.cn/

MiniMax logo

MiniMax-Text-01

MiniMax

2025-01
456B
modelopen

A45.9B. "The pre-training corpus for MiniMax-Text-01 encompasses a comprehensive and meticulously curated dataset, incorporating diverse sources including academic literature, books, web content, and programming code... repeatedly training high-quality documents can lead to enhanced downstream performance, with certain high-quality domains being trained up to 50 times... Our findings indicate that low-quality data suffer a substantial decrease in performance after training for more than two epochs, while high-quality data can be effectively trained for up to four epochs" Login playground: https://www.hailuo.ai/

B

Sky-T1-32B-Preview

Berkeley

2025-01
32B
modelopen

"To generate our training data we use QwQ-32B-Preview, an open-source model with reasoning capabilities comparable to o1-preview. We curate the data mixture (see later section) to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality. We then rewrite QwQ traces with GPT-4o-mini into a well-formatted version, inspired by Still-2, to improve data quality and ease parsing... Rejection Sampling: We discard QwQ samples if they are incorrect according to the solutions provided in datasets."

N

Cosmos Nemotron 34B

NVIDIA

2025-01
34B
modelopen

VLM. MMMU=47.33. "VILA project becomes part of Cosmos Nemotron family" https://github.com/NVlabs/Cosmos-Nemotron Vision Encoder: SigLIP-400M, Language Encoder: Yi-34B https://blogs.nvidia.com/blog/nemotron-model-families/

N

Cosmos 1.0

NVIDIA

2025-01
14B
modelopen

WFM (world foundation model). "The models range in size from 4 billion to 14 billion parameters, with Nano being the smallest and Ultra being the largest." "Cosmos WFM models were trained on 9,000 trillion tokens [9,000T] from 20 million hours of real-world human interactions, environment, industrial, robotics, and driving data..." https://techcrunch.com/2025/01/06/nvidia-releases-its-own-brand-of-world-models/ Actual working: https://lifearchitect.ai/cosmos/
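Quick sanity check on the quoted Cosmos numbers (my arithmetic, from the figures above):

```python
# 9,000T tokens over 20M hours of video works out to ~125K tokens per
# second of footage.
tokens, hours = 9_000e12, 20e6
print(tokens / hours)          # 4.5e8 tokens per hour
print(tokens / hours / 3600)   # 125,000 tokens per second
```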

PI

METAGENE-1

Prime Intellect

2025-01
7B
modelopen

Llama-2-7B base. "METAGENE-1 is a 7B parameter metagenomic foundation model designed for pathogen detection and pandemic monitoring, trained on over 1.5 trillion base pairs [∼370 billion tokens (≈1.69 trillion base pairs)] of DNA and RNA collected via metagenomic sequencing of wastewater."
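Back-of-envelope check on the bracketed conversion, using only the stated figures:

```python
# ~1.69T base pairs against ~370B tokens implies roughly 4.6 base pairs
# per BPE token.
base_pairs, tokens = 1.69e12, 370e9
print(f"{base_pairs / tokens:.2f} bp/token")   # ≈ 4.57
```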

RA

Sonus-1 Reasoning

Rubik's AI

2025-01
405B
modelclosed

Likely a Llama 3.1 405B wrapper. ALPrompt 2024H1=5/5. ALPrompt 2024H2=2/5. ALPrompt 2025H1=1/5. This is a strange model: slow and smart, but not as performant as o1. No arch details at all.

254 models
R

YuLan-Mini

Renmin

2024-12
2.4B
modelopen

"1.08T tokens for training. Among them are 481B English web data, 138B general English knowledge, 227B code pre-training data, 16.7B code instruction data, 93.8B mathematics pre-training data, 15.5B mathematics instruction data, and 108B Chinese data."

DeepSeek-AI logo

DeepSeek-V3

DeepSeek-AI

2024-12
685B
modelopen

37B active. Explain: https://threadreaderapp.com/thread/1872318161883959485.html Announce: https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file

L

EON-8B

LinkedIn

2024-12
8B
modelclosed

"We found the EON-8B model (a domain-adapted Llama 3.1-8B variant) to be 75x and 6x cost effective in comparison to GPT-4 and GPT-4o respectively (Figure 4)... On tasks seen during training, the EON-8B model outperformed base Llama-3-8B-Instruct and its performance was comparable to SOTA GPT models."

OpenAI logo

o3-preview

OpenAI

2024-12
600B
modelclosed

SoTA model for Dec/2024. Parameter estimate is very rough centrepoint for range 400B-52T.

R

RWKV-7 Goose

RWKV

2024-12
0.4B
modelopen

RWKV (pronounced RwaKuv) is an RNN: "multilingual, supporting over 100 languages and code." Full run is 332B tokens of the 3.1T dataset.

I

ModernBERT

International

2024-12
0.395B
modelopen

"a proper workhorse model, for retrieval, classification, etc." https://bsky.app/profile/howard.fm/post/3ldod2afps62x

IBM logo

Granite 3.1 8B

IBM

2024-12
8B
modelopen

IBM logo

Bamba-9B

IBM

2024-12
9B
modelopen

"trained by IBM, Princeton, CMU, and UIUC on completely open data. At inference time, the model demonstrates 2.5x throughput improvement and 2x latency speedup compared to standard transformers in vLLM."

OpenAI logo

o1-2024-12-17

OpenAI

2024-12
200B
modelclosed

"o1-2024-12-17 sets new state-of-the-art results on several benchmarks, improving cost-efficiency and performance."

TII logo

Falcon 3

TII

2024-12
10B
modelopen

"We conducted a single large-scale pretraining run on the 7B model, using 1024 H100 GPU chips, leveraging 14 trillion tokens... upscaled the 7B model to a 10B parameters model by duplicating the redundant layers and continuing pre-training with 2 trillion tokens of high-quality data."

Cohere logo

Command R7B

Cohere

2024-12
7B
modelopen

Cohere logo

Maya

Cohere

2024-12
8B
modelopen

VLM.

Meta AI logo

BLT

Meta AI

2024-12
8B
modelopen

Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance

Meta AI logo

Large Concept Model

Meta AI

2024-12
7B
modelopen

"autoregressive sentence prediction in an embedding space." 7.7T tokens is a misprint, should be 2.2T as in paper.

Microsoft logo

Phi-4

Microsoft

2024-12
14B
modelopen

Use unsloth: https://huggingface.co/unsloth/phi-4-GGUF & https://www.reddit.com/r/singularity/comments/1i0kso4/i_fixed_4_bugs_in_microsofts_opensource_phi4_model/

Google DeepMind logo

Gemini 2.0 Flash exp

Google DeepMind

2024-12
30B
modelclosed

Gemini 2.0 Flash was first model released, 11/Dec/2024. "New Modalities: Gemini 2.0 introduces native image generation and controllable text-to-speech capabilities" Announce: https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/ Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

I

Moxin-7B

International

2024-12
7B
modelopen

"Fully Open Source" with pre-training code, configurations, training and fine-tuning datasets, and intermediate checkpoints.

C

1T

Cerebras

2024-12
1000B
modelclosed

"For Sandia’s trillion parameter training run, Cerebras configured a 55 terabyte MemoryX device."

SA

InternVL 2.5

Shanghai AI Laboratory/SenseTime

2024-12
78B
modelopen

Benchmarks are estimates based on Qwen2.5 72B Instruct as the base LLM (InternVL 2.5=InternViT-6B-448px-V2.5 5.5B + Qwen2.5-72B-Instruct). "Notably, Qwen2-VL processed a cumulative total of 1.4T tokens, while our InternVL2.5-78B is trained on just ∼120B tokens [of vision]." Dataset: "we identify repetitive generation as one of the most detrimental issues. In many open-source or synthetic datasets, a small number of repetitive samples—comprising merely thousands of examples in our Stage 2 data mixture—can cause the model to spiral into repetitive loops, particularly in long-form outputs or CoT reasoning tasks. This phenomenon undermines the effectiveness of test-time scaling strategies. To address this challenge and support future research, we designed an efficient data filtering pipeline to remove low-quality samples, thereby minimizing the risk of repetitive generation." Repo: https://github.com/OpenGVLab/InternVL

Meta AI logo

Llama 3.3

Meta AI

2024-12
70B
modelopen

Drop-in replacement for Llama 3.1 70B, comparable performance to Llama 3.1 405B.

L

EXAONE-3.5

LG

2024-12
32B
modelopen

“EXAONE”=“EXpert AI for EveryONE”. The token-to-parameter ratio dropped: EXAONE-3 (Aug/2024) was 7.8B params on 8T tokens; this release (Dec/2024) pairs 7.8B with 9T but the 32B with only 6.5T.

R

Deepthought-8B

Ruliad

2024-12
8B
modelopen

No evals. Llama 3.1 8B base.

S

Sailor2

Sail

2024-12
20B
modelopen

SEA languages. Continual pretraining based on Qwen2.5. Project page: https://sea-sailor.github.io/blog/sailor2/

P

Pleias 1.0

PleIAs

2024-12
3B
modelopen

Trained on the Jean Zay supercomputer, 192x H100s for 20 days. Dataset is new CC + Synthetic: https://huggingface.co/datasets/PleIAs/common_corpus

OpenAI logo

o1

OpenAI

2024-12
200B
modelclosed

"a version of our most intelligent model that thinks longer for the most reliable responses" System card about safety only: https://cdn.openai.com/o1-system-card-20241205.pdf

Amazon logo

Nova Pro

Amazon

2024-12
90B
modelclosed

Multimodal, same performance as Llama 3.2 90B ∴ est 90B. Model card was hidden: https://assets.amazon.science/9f/a3/ae41627f4ab2bde091f1ebc6b830/the-amazon-nova-family-of-models-technical-report-and-model-card.pdf via https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card

C

EuroLLM

Consortium

2024-12
9B
modelopen

24 official languages are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish. "we use 400 Nvidia H100 GPUs of the Marenostrum 5 supercomputer" Also: https://eurollm.io/

NR

DisTrO 15B

Nous Research

2024-12
15B
modelopen

"About 14 DGXes scattered around the globe. Sometimes more sometimes less, it varies depending on availability. On average, around 112 H100s." https://x.com/bloc97_/status/1863675225810043331 "we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware."

PI

INTELLECT-1

Prime Intellect

2024-11
10B
modelopen

Training complete 22/Nov/2024. Fully distributed training: "the first decentralized training run of a 10-billion-parameter model, inviting anyone to contribute compute and participate. This brings us one step closer towards open source AGI."

Alibaba logo

QwQ-32B-Preview

Alibaba

2024-11
32B
modelopen

Scores 1/5 on the latest ALPrompt (2024 H2). QwQ = "Qwen with Questions".

OX

Teuken-7B

OpenGPT-X

2024-11
7B
modelopen

24 EU languages (60% non-English): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv. https://opengpt-x.de/models/teuken-7b-de/ & paper date is Sep/2024.

Allen AI logo

OLMo 2

Allen AI

2024-11
13B
modelopen

Open Language Model (OLMo) 2 Apache 2.0 license for research and educational use. Paper coming. Data: 5 trillion tokens (1.2 epochs of 4T tokens) + 100B tokens (3 runs) + 300B tokens (1 run) merged. https://huggingface.co/allenai/OLMo-2-1124-13B & playground: https://playground.allenai.org/

C

Bi-Mamba

CMU

2024-11
2.7B
modelclosed

Unreleased, but will be replicated. "a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models"

Moonshot AI logo

k0-math

Moonshot AI

2024-11
100B
modelclosed

Reasoning, maths only. Very little info available. Chinese. Long context. No paper.

Alibaba logo

Marco-o1

Alibaba

2024-11
7B
modelopen

No evals. Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset.

Allen AI logo

TÜLU 3

Allen AI

2024-11
70B
modelopen

Llama 3.1 post-training, worse performance on most benchmarks. Post training methods include new Reinforcement Learning with Verifiable Rewards (RLVR). "We perform supervised fine-tuning on new capability-focused synthetic data mixed with existing instruction datasets. We then perform preference tuning on on-policy synthetic preference data. We finish training Llama Tülu3 with our new method, Reinforcement Learning with Verifiable Rewards."

OpenAI logo

gpt-4o-2024-11-20

OpenAI

2024-11
200B
modelclosed

Material decrease in benchmark scores (GPQA: -13.37%, MMLU: -3.38%) compared to Aug/2024. Pruned? Quantized? https://github.com/openai/simple-evals

DeepSeek-AI logo

DeepSeek-R1-Lite

DeepSeek-AI

2024-11
67B
modelclosed

Scores 0/5 on latest ALPrompt 2024 H2 "DeepSeek-R1-Lite is currently still in the iterative development stage. It currently only supports web usage and does not support API calls. The base model used by DeepSeek-R1-Lite is also a relatively small model, unable to fully unleash the potential of long reasoning chains. At present, we are continuously iterating on the inference series models. In the future, the official DeepSeek-R1 model will be fully open-sourced. We will publicly release the technical report and deploy API services." https://mp-weixin-qq-com.translate.goog/s/e1YnTxZlzFvjcmrLLTA8fw?_x_tr_sl=zh-CN&_x_tr_tl=en&_x_tr_hl=zh-TW

X

Xmodel-LM

XiaoduoAI

2024-11
1.1B
modelopen

SLM

Mistral logo

Pixtral Large

Mistral

2024-11
124B
modelopen

Open-weights multimodal model built on top of Mistral Large 2.

F

f1

Fireworks

2024-11
modelclosed

"a compound AI model specialized in complex reasoning, that interweaves multiple open models at the inference layer."

Alibaba logo

Qwen2.5-Coder

Alibaba

2024-11
32.5B
modelopen

https://qwenlm.github.io/blog/qwen2.5-coder-family/ Jack Clark from Anthropic says it's actually 18T tokens from Qwen2.5 + 5.5T tokens, for a total of 23.5T tokens; that doesn't match my interpretation of the technical report.

T

Fox-1

TensorOpera

2024-11
1.6B
modelopen

Gold standard for dataset documentation

Tencent logo

Hunyuan-Large

Tencent

2024-11
389B
modelopen

"Hunyuan-Large is pre-trained on 7T tokens, which contains nearly 1.5T tokens of high-quality and diverse synthetic data." "389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens."

AS

SEA-LIONv3

AI Singapore

2024-11
9.24B
modelopen

"SEA-LION is a collection of Large Language Models (LLMs) which has been pretrained and instruct-tuned for the Southeast Asia (SEA) region. The Gemma2 9B CPT SEA-LIONv3 base model has undergone continued pre-training from the base Gemma-2-9B model. SEA-LION stands for Southeast Asian Languages In One Network." News: https://www.techinasia.com/news/ai-singapore-boosts-sea-ai-sealion-v3-model

A

AMD OLMo

AMD

2024-11
1B
modelopen

1 billion parameter LMs trained from scratch using 1.3T tokens on a cluster of AMD Instinct MI250 GPUs.

Hugging Face logo

SmolLM2

Hugging Face

2024-11
1.7B
modelopen

Base and instruct versions, with Apache 2.0 license

Cohere logo

Aya-Expanse-32B

Cohere

2024-10
32B
modelopen

"Aya Expanse, a family of highly performant multilingual models that excels across 23 languages and outperforms other leading open-weights models...we have collaborated with over 3,000 researchers from 119 countries to expand cutting-edge multilingual research... 220 language ambassadors from around the world who have been part of this release"

Anthropic logo

Claude 3.5 Sonnet (new)

Anthropic

2024-10
175B
modelclosed

Absurd naming scheme. Paper addendum pp51-64: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf#page=51

IBM logo

Granite 3.0 8B

IBM

2024-10
8B
modelopen

Announce: https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models

IBM logo

Granite-3.0-3B-A800M-Instruct

IBM

2024-10
3B
modelopen

Announce: https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models

A

aiXcoder-7B

aiXcoder

2024-10
7B
modelopen

Dataset: The Stack

N

Llama-3.1-Nemotron-70B

NVIDIA

2024-10
70B
modelopen

Related paper: https://arxiv.org/abs/2410.01257

Mistral logo

Ministral 8B

Mistral

2024-10
8B
modelopen

"Introducing the world’s best edge models"

01-ai logo

Yi-Lightning

01-ai

2024-10
200B
modelclosed

"New MoE hybrid expert architecture" and https://x.com/01AI_Yi/status/1845776529185476613

Z

Zamba2-7B

Zyphra

2024-10
7B
modelopen

Mamba2 "trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM"

N

nGPT

NVIDIA

2024-10
1B
modelopen

"a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized...reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length."

IA

Inflection-3 Pi (3.0)

Inflection AI

2024-10
1200B
modelclosed

Inference via Intel Gaudi® 3 128 GB, on-premise available. Minimum spend $100 credits.

IA

Inflection-3 Productivity (3.0)

Inflection AI

2024-10
1200B
modelclosed

Inference via Intel Gaudi® 3 128 GB, on-premise available. Minimum spend $100 credits.

Liquid AI logo

LFM-40B

Liquid AI

2024-09
40B
modelopen

40BA12B. Some controversy/concern over company. Liquid Foundation Models (LFM). "Human preference optimization techniques have not been applied extensively to our models yet."

Salesforce logo

SFR-LLaMA-3.1-70B-Judge

Salesforce

2024-09
70B
modelclosed

Code coming soon: https://github.com/SalesforceAIResearch/SFRJudge "we opt to focus on datasets that evaluate modern (2023 and beyond) LLM responses, as older datasets likely contain lower quality responses from less capable models, with correspondingly stale annotations. We supplement human-annotated data with synthetically generated data to endow our judge models with specific capabilities (e.g., following fine-grained rubrics in evaluation)"

B

Emu3

BAAI

2024-09
8B
modelopen

VLM. Dataset estimates are based on the unrelated UW/Salesforce dataset MINT-1T (3.4B images, 927M documents) https://arxiv.org/abs/2406.11271v1

N

NVLM 1.0

NVIDIA

2024-09
72B
modelopen

Flamingo clone. "we use Qwen2-72B-Instruct as the default text-only LLM backbone. We also employ Nous-Hermes-2-Yi-34B for ablation study and faster experimentation... we use InternViT-6B as the default vision encoder"

CT

Unnamed 1T

China Telecom Artificial Intelligence Research Institute

2024-09
1000B
modelclosed

Trained on Chinese GPUs: "Ascend Atlas 800T A2 training server – a Huawei product listed as supporting the Kunpeng 920 7265 or Kunpeng 920 5250 processors" https://www.theregister.com/2024/10/02/china_telecom_model_trained_local_tech/

CT

TeleChat2-115B

China Telecom Artificial Intelligence Research Institute

2024-09
115B
modelopen

Trained on Chinese GPUs: "Ascend Atlas 800T A2 training server – a Huawei product listed as supporting the Kunpeng 920 7265 or Kunpeng 920 5250 processors" https://www.theregister.com/2024/10/02/china_telecom_model_trained_local_tech/

A

AMD-Llama-135m

AMD

2024-09
0.135B
modelopen

Small language model (SLM). Trained on AMD Instinct™ MI250 accelerators. "Pretrain Dataset: We employed the SlimPajama and Project Gutenberg dataset to pretrain the 135M model. Project Gutenberg is a library of over 70,000 free eBooks approximately. This sums up to 670B tokens"

Meta AI logo

Llama 3.2 90B

Meta AI

2024-09
90B
modelopen

Vision (VLM)

Meta AI logo

Llama 3.2 3B

Meta AI

2024-09
3.21B
modelopen

Text (LLM). "Pre-training. [For Llama 3.2 3B] We prune the models from their 8B siblings and use logits from the 8B and 70B models as token-level targets (token-level distillation). We then use knowledge distillation to recover performance."

Allen AI logo

Molmo

Allen AI

2024-09
72B
modelopen

ViT: Llava as Qwen2 (or Olmo) + CLIP. Multimodal Open Language Model built by Ai2. Announce: https://molmo.allenai.org/blog

Google DeepMind logo

Gemini-1.5-Pro-002

Google DeepMind

2024-09
200B
modelclosed

Sparse MoE. Context window=2M. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

Alibaba logo

Qwen2.5

Alibaba

2024-09
72B
modelopen

Microsoft logo

GRIN MoE

Microsoft

2024-09
60B
modelopen

16x3.8B "only 6.6B activate parameters". GRIN=GRadient-INformed. "GRIN MoE is pre-trained on 4T tokens as a Causal Language Model. The same training dataset has been used to train Phi-3 dense models"

Google DeepMind logo

Data-Gemma

Google DeepMind

2024-09
27B
modelopen

RAG/RIG: "the LLM is fine-tuned to produce natural language Data Commons queries alongside statistics"

OpenAI logo

o1-preview

OpenAI

2024-09
200B
modelclosed

JA

Reader-LM

Jina AI

2024-09
1.54B
modelopen

HTML->Markdown. Specialist small model; outperforms GPT-4o general model, does not outperform Gemini Pro 1.5.
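A hedged usage sketch via Hugging Face transformers; the checkpoint name (jinaai/reader-lm-1.5b) and bare-HTML prompting are assumptions based on Jina's published reader-lm releases, so check the model card before relying on this:

```python
# Assumed checkpoint and prompt format; reader-lm models convert raw HTML
# into Markdown as plain text generation.
from transformers import pipeline

pipe = pipeline("text-generation", model="jinaai/reader-lm-1.5b",
                trust_remote_code=True)
html = "<h1>Title</h1><p>Hello <b>world</b>!</p>"
print(pipe(html, max_new_tokens=128)[0]["generated_text"])
```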

Mistral logo

Pixtral-12b-240910

Mistral

2024-09
12B
modelopen

"Pixtral was trained to be a drop-in replacement for Mistral Nemo 12B."

DeepSeek-AI logo

DeepSeek-V2.5

DeepSeek-AI

2024-09
236B
modelopen

"DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct."

01-ai logo

Yi-Coder

01-ai

2024-09
9B
modelopen

6B=3T tokens, 9B=+0.8T tokens, 9B-Coder=+2.4T tokens=6.2T tokens. See Yi 1.5 34B in this table

Allen AI logo

OLMoE-1B-7B

Allen AI

2024-09
6.9B
modelopen

Open Language (OL) Mixture of Experts (MoE). "We train OLMoE-1B-7B for 5 trillion tokens, however, some recent dense models train significantly longer, such as Llama 3 with 15 trillion tokens. To the best of our knowledge, there has been no large MoE that has been overtrained as much as OLMoE-1B-7B. Specifically, taking the active parameters of OLMoE-1B-7B, our token multiplier is around 5,000 (5T / 1B). There are likely benefits to training even longer, but to what degree overtraining is effective for MoEs and how it differs from dense models still requires more research."
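The "token multiplier" arithmetic, spelled out:

```python
# Token multiplier = training tokens divided by *active* parameters.
print(5e12 / 1e9)    # OLMoE-1B-7B: 5T tokens / 1B active ≈ 5,000
print(15e12 / 8e9)   # Llama 3 8B (dense): 15T / 8B ≈ 1,875, for contrast
```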

C

PLLuM

Consortium

2024-08
20B
modelopen

Polish Large Language Model. Not yet available as of Sep/2024

Salesforce logo

xLAM

Salesforce

2024-08
141B
modelopen

64K sequence length. Released under Apache-2.0.

M

LTM-2-mini

Magic

2024-08
20B
modelclosed

Context=100M tokens equals ~10 million lines of code or ~750 novels.
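The equivalences quoted above imply these rough conversion rates:

```python
# 100M tokens of context ≈ 10M lines of code ≈ 750 novels.
context = 100e6
print(context / 10e6)   # 10 tokens per line of code
print(context / 750)    # ≈ 133,333 tokens per novel
```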

C

Rene

Cartesia

2024-08
1.3B
modelopen

On-device. "hybrid architecture based on Mamba-2, with feedforward and sliding window attention layers interspersed"

Google DeepMind logo

Gemini 1.5 Flash-8B

Google DeepMind

2024-08
8B
modelclosed

Announce: https://x.com/OfficialLoganK/status/1828480085353234535 1M context for all modalities. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

AA

Pharia-1-LLM-7B

Aleph Alpha

2024-08
7B
modelopen

S

TTT-Linear

Stanford

2024-08
1.3B
modelopen

Test-Time Training (TTT) layers. Real-time learning by Stanford, UC, and Meta. Potential for frontier models in 2025+.

A

Jamba 1.5

AI21

2024-08
398B
modelopen

Jamba 1.5 Mini (12B active/52B total) and Jamba 1.5 Large (94B active/398B total) are also optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.

Microsoft logo

phi-3.5-MoE

Microsoft

2024-08
60B
modelopen

Microsoft logo

phi-3.5-mini

Microsoft

2024-08
3.8B
modelopen

N

Minitron-4B

NVIDIA

2024-08
4B
modelopen

Pruned and distilled from Nemotron-4 15B: https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/

SA

sarvam-2b

Sarvam AI

2024-08
2B
modelopen

Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

xAI logo

Grok-2

xAI

2024-08
400B
modelopen

MMLU-Pro=75.5=SOTA. Claude 3.5S MMLU-Pro=72.83. "Grok-2 has been tested on the LMSYS leaderboard under the name "sus-column-r." At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo." [Alan: Grok is Heinlein, Sixth Column is also Heinlein: https://en.wikipedia.org/wiki/Sixth_Column ]

L

EXAONE 3.0

LG

2024-08
7.8B
modelopen

“EXAONE”=“EXpert AI for EveryONE”

TII logo

Falcon Mamba 7B

TII

2024-08
7B
modelopen

https://huggingface.co/spaces/tiiuae/falcon-mamba-playground

W

Palmyra-Med-70B

Writer

2024-07
70B
modelopen

Medical. MMLU Medical Genetics=94.0

W

Palmyra-Fin-70B

Writer

2024-07
70B
modelopen

Financial. "across a variety of real-world financial use cases. It outperformed popular models like Claude 3.5 Sonnet, GPT-4o, and Mixtral-8x7b"

Z

Zamba2-small

Zyphra

2024-07
2.7B
modelopen

Mamba2

N

Minitron-8B

NVIDIA

2024-07
8B
modelopen

Pruned and distilled from Nemotron-4 15B: https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/

Mistral logo

Mistral Large 2

Mistral

2024-07
123B
modelopen

Fits on a single node for inference.

Meta AI logo

Llama 3.1 405B

Meta AI

2024-07
405B
modelopen

Announce: https://ai.meta.com/blog/meta-llama-3-1/ Model card: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md

OpenAI logo

GPT-4o mini

OpenAI

2024-07
8B
modelclosed

Omnimodel. "OpenAI would not disclose exactly how large GPT-4o mini is, but said it’s roughly in the same tier as other small AI models, such as Llama 3 8b, Claude Haiku and Gemini 1.5 Flash." https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini-a-small-ai-model-powering-chatgpt/ "tested GPT-4o to identify potential risks, which we have addressed and plan to share the details of in the forthcoming GPT-4o system card and Preparedness scorecard." And related paper about instruction hierarchy: https://arxiv.org/abs/2404.13208

Mistral logo

NeMo

Mistral

2024-07
12B
modelopen

With NVIDIA. "Drop-in replacement of Mistral 7B". "trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs" https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/

Mistral logo

Codestral Mamba

Mistral

2024-07
7B
modelopen

"Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length."

Mistral logo

Mathstral

Mistral

2024-07
7B
modelopen

"We’re contributing Mathstral to the science community to bolster efforts in advanced mathematical problems requiring complex, multi-step logical reasoning."

Microsoft logo

SpreadsheetLLM

Microsoft

2024-07
1760B
modelclosed

Notable finetune of GPT-4-0125-preview: "outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting"

C

Spectra

Consortium

2024-07
3.9B
modelopen

AKA TriLM. "Spectra LLM suite, the first open suite of LLMs spanning multiple bit-widths, including FloatLMs, QuantLMs, and TriLMs, ranging from 99M to 3.9B parameters trained on 300B tokens."

D

next-gen

DeepL

2024-07
7B
modelopen

"Built using our own groundbreaking, specialized LLM technology and proprietary training data, designed specifically for translation"

Hugging Face logo

SmolLM

Hugging Face

2024-07
1.7B
modelopen

Dataset includes new Cosmopedia v2 synthetic data. 135M and 360M models, each trained on 600B tokens from Smollm-Corpus. 1.7B model trained on 1T tokens from Smollm-Corpus.

V

Mockingbird

Vectara

2024-07
9B
modelopen

"At <10B parameters it's an LLM trained to provide optimal results for RAG and structured outputs."

Google DeepMind logo

FLAMe

Google DeepMind

2024-07
24B
modelclosed

LLM-as-a-Judge autorater. Foundational Large Autorater Models (FLAMe). Uses an instruction-tuned PaLM-2-24B model. Unrelated to Microsoft FLAME Jan/2023.

S

Step-2

StepFun

2024-07
1000B
modelclosed

Launched early Jul/2024: https://pandaily.com/stepfun-releases-three-large-models-of-the-step-series/ "StepFun, founded in April 2023 with the mission to “Scale-up possibilities for everyone,” unites top talent in artificial intelligence from both domestic and international backgrounds, and is dedicated to advancing toward AGI. The company has already launched the Step series of foundation models, which includes Step-2, a cutting-edge trillion-parameter Mixture of Experts (MoE) language model; Step-1.5V, a powerful multimodal large model; and Step-1V, an innovative image generation model, among others."

H

H2O-Danube3-4B

H2O.ai

2024-07
4B
modelopen

Runs natively and fully offline on mobile phone. "H2O-Danube3 is a family of decoder only LLM models that use the general Llama model architecture adopting core principles from Llama 2 and Mistral with custom parameters determining the shape of each layer and total parameter count. We use the Mistral tokenizer..." MMLU for chat=54.74, base=55.18 via https://huggingface.co/h2oai/h2o-danube3-4b-base

Microsoft logo

Causal Axioms

Microsoft

2024-07
0.067B
modelclosed

"the training dataset follows a specific structure, we develop a custom tokenizer. Alphanumeric node names are tokenized at a character level, while special terms such as ‘causes’, ‘Does’, ‘cause’, ‘Yes’, and ‘No’ are tokenized at the word level... Our training setup consists of around 175k instances of sequential chains with size of chains ranging from 3 to 6 nodes... All models are trained for 100 epochs. [LifeArchitect.ai estimate is 12 tokens per node x 6 nodes x 175,000 instances x 100 epochs = 1.26B tokens]" Based on GPT-2 arch.

S

SenseNova 5.5

SenseTime

2024-07
600B
modelclosed

"The model training was based on over 10TB tokens [sic, taken as 10T tokens instead of 10TB=2T tokens] of high-quality training data, including a large amount of synthetically-generated reasoning chain data, which help to enhance its reasoning capabilities." & "The updates include SenseNova 5o, the first real-time multimodal model in China, which provides a new AI interaction model on par with GPT-4o’s streaming interaction capabilities"

K

Helium 7B

Kyutai

2024-07
7B
modelopen

"1. The model is fine-tuned on 100K transcripts generated by Helium itself. 2. These transcripts are highly detailed, heavily annotated with emotion and style, and conversational. 3. Text to Speech Engine is further fine-tuned on 20 hours of audio recorded by Alice and licensed."

SA

InternLM2.5

Shanghai AI Laboratory/SenseTime

2024-07
20B
modelopen

"The release of InternLM2.5 series contains 7B model size for now and we are going to release the 1.8B and 20B versions soon" [20B released around 1/Aug/2024]

B

Tele-FLM-1T

BAAI

2024-07
1000B
modelopen

Technical arch testing only, ratio is too low for decent performance.

R

YuLan-Base-12B

Renmin

2024-07
12B
modelopen

"YuLan's training is finished on Jan, 2024 and has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks."

Baidu logo

ERNIE 4.0 Turbo

Baidu

2024-06
200B
modelclosed

"Ernie Bot has reached 300 million users since its launch [on 16/Mar/2023, public Aug/2023]" Jun/2024

Google DeepMind logo

Gemma 2

Google DeepMind

2024-06
27B
modelopen

Announce: https://blog.google/technology/developers/google-gemma-2/

OpenAI logo

CriticGPT

OpenAI

2024-06
3B
modelclosed

"LLM Critics Help Catch LLM Bugs" Announce: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/

Apple logo

4M-21

Apple

2024-06
3B
modelopen

Vision model based on T5-XXL. Modalities: RGB, Caption, Bounding boxes, Semantic segmentation, Depth, Human poses, Surface normals, CLIP, DINOv2, ImageBind, Metadata, Canny edges, SAM edges, SAM instances, Color palette. Project page: https://4m.epfl.ch/

E

ESM3

EvolutionaryScale

2024-06
98B
modelpartial

Biology large language model: "sequence, structure, and function are all masked and predicted during training, ESM3 can generate in all three modalities." 1.4B only released.

H

PanGu 5.0 Super

Huawei

2024-06
1000B
modelpartial

https://x.com/faridofanani96/status/1804079517193113850/photo/1

Anthropic logo

Claude 3.5 Sonnet

Anthropic

2024-06
70B
modelclosed

MMLU=90.4 with prompting. Model card: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf

DeepSeek-AI logo

DeepSeek-Coder-V2

DeepSeek-AI

2024-06
236B
modelopen

DeepSeek-V2 with additional 6 trillion tokens.

I

DCLM-Baseline 7B 2.6T

International

2024-06
7B
modelpartial

New dataset: 240T tokens: 8× larger than previous SOTA dataset. DCLM-Pool is 240T, DCLM-Baseline is 3.8T: "we combine our 3.8T DCLM-BASELINE with the StarCoder and ProofPile2 data to arrive at a 4.1T token dataset. We train a 7B model for 2.5T tokens" and "We release the DCLM benchmark, framework, models, and datasets at https://datacomp.ai/dclm."

N

Nemotron-4-340B

NVIDIA

2024-06
340B
modelopen

Open-source equiv of Mar/2023 GPT-4 (1760MoE≈340B, 13T), same param count but 2x the tokens of May/2023 PaLM 2 (340B, 3.6T), competitor to Nov/2023 Grok-1 (314B, 6T). Trained on 6,144 H100s. ~1.3TB for inference. 50+ natural and 40+ coding languages. Trained between December 2023 and May 2024. MMLU 0-shot for instruct=78.7, 5-shot for base=81.1. Permalink for paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b

Apple logo

Apple On-Device model Jun/2024

Apple

2024-06
3.04B
modelopen

https://lifearchitect.ai/apple/ Likely to be the Apple OpenELM model (Apr/2024). "two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute". https://machinelearning.apple.com/research/introducing-apple-foundation-models The server-based model is possibly Ferret, although it is more properly called a multimodal model (not just language). It could also be Apple GPT based on their Ajax framework: https://archive.md/f3C0r

U

MatMul-Free LM

UCSC

2024-06
2.7B
modelopen

"we explore alternative methods for mixing tokens without relying on matrix multiplications." Compared with Transformer++ based on Llama-2, not to be confused with the pre-GPT-3 American Express Transformer++ paper from 2/Mar/2020. Instead, Transformer++ is defined in the Mamba paper: 'Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020)'

G

Luna

Galileo

2024-06
0.44B
modelopen

Based on DeBERTa-large (440M). RoBERTa=162B-token dataset.

Alibaba logo

Qwen2

Alibaba

2024-06
72B
modelopen

Instruct MMLU=82. Instruct GPQA=41.9. https://qwenlm.github.io/blog/qwen2/

Alibaba logo

Qwen2-57B-A14B

Alibaba

2024-06
57B
modelopen

https://qwenlm.github.io/blog/qwen2/

KT

Skywork MoE 16x13B

Kunlun Tech

2024-06
146B
modelopen

CN + EN. "(MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model."

C

Mamba-2

CMU

2024-05
2.7B
modelopen

Analysis: https://tridao.me/blog/2024/mamba2-part1-model/

I

MAP-Neo

International

2024-05
7B
modelopen

"first fully open-sourced bilingual LLM with comparable performance to existing state-of-the-art LLMs... we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided."

L

K2

LLM360

2024-05
65B
modelopen

"K2-65B is a fully reproducible LLM outperforming Llama 2 70B using 35% less compute."

Mistral logo

Codestral

Mistral

2024-05
22B
modelopen

Fluent in 80+ programming languages

Cohere logo

Aya-23-35B

Cohere

2024-05
35B
modelopen

01-ai logo

Yi-XLarge

01-ai

2024-05
2000B
modelclosed

Still training as of May/2024: https://appserversrc.8btc.cn/FnDYlEC4STBhphu6M3NL4CKH43FW dead link, use: https://finance.china.com.cn/roll/20240513/6116857.shtml

01-ai logo

Yi-Large

01-ai

2024-05
1000B
modelclosed

Meta AI logo

Chameleon

Meta AI

2024-05
34B
modelopen

Multimodal

Google DeepMind logo

LearnLM

Google DeepMind

2024-05
1500B
modelpartial

Fine-tuned + prompted Gemini (Dec/2023). "The results of LearnLM-Tutor reproduce the performance of Gemini Pro, for example an MMLU score of 0.72 and MATH score of 0.33."

C

Sparse Llama 7B

Cerebras

2024-05
7B
modelopen

https://www.cerebras.net/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy "For the 50% sparse model, we utilized 45 billion tokens of pretraining data, while an additional 100 billion tokens were used for the 70% model. This represents approximately 2% to 8% of the original 2 trillion tokens used to train the base Llama-2 model."

Google DeepMind logo

Gemini 1.5 Flash

Google DeepMind

2024-05
8B
modelclosed

1M context length. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

OpenAI logo

GPT-4o

OpenAI

2024-05
200B
modelclosed

gpt-4o-2024-05-13 no longer easily available, so hidden in the Model Table rankings. Omnimodel. ‘[GPT-4o is] likely an early checkpoint of GPT-5’. https://twitter.com/drjimfan/status/1790089671365767313 ELO: https://twitter.com/LiamFedus/status/1790064963966370209 Demo: https://youtu.be/DQacCB9tDaw

TII logo

Falcon 2 11B

TII

2024-05
11B
modelopen

Announce: https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas

F

Fugaku-LLM

Fujitsu

2024-05
13B
modelopen

Japanese. CPU trained: 158,976+ A64FX CPUs (7M+ cores), zero GPUs. https://en.wikipedia.org/wiki/Fugaku_(supercomputer)

01-ai logo

Yi 1.5 34B

01-ai

2024-05
34.4B
modelopen

Uses 600B more training tokens than Yi 1.0 (Nov/2023).

Microsoft logo

YOCO

Microsoft

2024-05
3B
modelopen

With Tsinghua. You Only Cache Once (YOCO). Long context "1M context length with near-perfect needle retrieval accuracy"

DeepSeek-AI logo

DeepSeek-V2

DeepSeek-AI

2024-05
236B
modelopen

Huge dataset, 12% Chinese "Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B".

I

ChuXin

Independent

2024-05
1.6B
modelopen

"results on the ”Needle In A Haystack”(NIAH) tests indicate that ChuXin-1M performs well across all context window lengths up to 1M."

R

RWKV-v6 Finch

RWKV

2024-05
7.63B
modelopen

RWKV (pronounced RwaKuv) is an RNN: https://twitter.com/BlinkDL_AI/status/1787834625211158562

E

xLSTM

ELLIS

2024-05
2.7B
modelclosed

New method LSTM to xLSTM, see also RNNs. Code/weights don't seem to be released. https://github.com/AI-Guru/xlstm-resources

IBM logo

Granite Code

IBM

2024-05
34B
modelopen

MMLU=50 for 8B model only. Dataset: publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub.

Alibaba logo

Qwen-Max

Alibaba

2024-05
300B
modelclosed

https://twitter.com/JustinLin610/status/1787584325367529509

Google DeepMind logo

Med-Gemini-L 1.0

Google DeepMind

2024-05
200B
modelclosed

Med-Gemini-M 1.0 and Med-Gemini-L 1.0 (Pro and Ultra finetunes) "For language tasks that require less complex reasoning, such as summarizing medical notes and creating referral letters, we introduce Med-Gemini-M 1.0 by fine-tuning the Gemini 1.0 Pro model. For other tasks that require more advanced reasoning, we introduce Med-Gemini-L 1.0 by fine-tuning the Gemini 1.0 Ultra model using a self-training method to enable the models to efficiently use web search."

Microsoft logo

TinyStories

Microsoft

2024-04
0.033B
modelopen

Precursor to phi.

B

Tele-FLM

BAAI

2024-04
52B
modelopen

Also known as FLM-2. "We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research." Discussion paper Jul/2024: https://arxiv.org/abs/2407.02783

Alibaba logo

Qwen-1.5 110B

Alibaba

2024-04
111B
modelopen

Worse performance on GPQA (72B=36.3, 110B=35.9).

SA

Arctic

Snowflake AI Research

2024-04
480B
modelopen

"Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating."

S

SenseNova 5.0

SenseTime

2024-04
600B
modelclosed

GPT-4 scale; low media coverage; no demo in Western world. https://www.techinasia.com/sensetime-pauses-trading-stock-rises-30-model-launch

Apple logo

OpenELM

Apple

2024-04
3.04B
modelopen

On-device model (laptop, phone). Open-source Efficient Language Models (OpenELM). https://venturebeat.com/ai/apple-releases-openelm-small-open-source-ai-models-designed-to-run-on-device/

Microsoft logo

phi-3-medium

Microsoft

2024-04
14B
modelopen

Preview only, benchmarks being investigated as of May/2024.

Microsoft logo

phi-3-mini

Microsoft

2024-04
3.8B
modelopen

"phi3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second."

Meta AI logo

Llama 3 70B

Meta AI

2024-04
70B
modelopen

Instruct MMLU-Pro=56.2

Z

Zamba 7B

Zyphra

2024-04
7B
modelopen

Mamba1

Amazon logo

HLAT

Amazon

2024-04
7B
modelclosed

HLAT=High-quality LLM pre-trained on AWS Trainium. Same arch as Llama 7B. Pre-training is performed on up to 64 Amazon EC2 trn1.32xlarge instances, totalling up to 1,024 AWS Trainium accelerators. Read more about Trainium: https://www.aboutamazon.com/news/aws/what-you-need-to-know-about-the-aws-ai-chips-powering-amazons-partnership-with-anthropic

Hugging Face logo

Idefics2

Hugging Face

2024-04
8.4B
modelopen

Clone of Flamingo now using Mistral 7B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS)

RA

Reka Core

Reka AI

2024-04
300B
modelclosed

https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model

Microsoft logo

WizardLM-2-8x22B

Microsoft

2024-04
141B
modelopen

Base model = mistral-8x22b.

E

Pile-T5

EleutherAI

2024-04
11B
modelopen

HF

Zephyr 141B-A35B

Hugging Face H4

2024-04
35B
modelopen

mixtral-8x22b finetune using Odds Ratio Preference Optimization (ORPO).

Cohere logo

Rerank 3

Cohere

2024-04
104B
modelopen

RAG + semantic search, possibly backed by Command-R+.

OpenAI logo

gpt-4-turbo-2024-04-09

OpenAI

2024-04
70B
modelclosed

This is such a significantly better model that I've added it here. This GPQA=46.5%, old GPT-4 GPQA=36%. https://twitter.com/EpochAIResearch/status/1778463039932584205 MMLU scores are unclear, but may have improved by 1%: https://twitter.com/OpenAI/status/1778602770784002136. Final benchmarks are here: https://archive.md/6Cc0Z

T

MiniCPM-2.4B

Tsinghua

2024-04
2.4B
modelopen

MoE option=https://huggingface.co/openbmb/MiniCPM-MoE-8x2B

Apple logo

Ferret-UI

Apple

2024-04
13B
modelopen

Vicuna base, multimodal. Extension of Ferret from Oct/2023.

Mistral logo

mixtral-8x22b

Mistral

2024-04
141B
modelopen

MoE=22Bx8, seq=65536.

S

Sailor

Sail

2024-04
7B
modelopen

SEA languages. Based on Qwen-1.5. https://github.com/sail-sg/sailor-llm "Generally Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs."

M

JetMoE-8B

MIT

2024-04
8B
modelopen

T

Eurus

Tsinghua

2024-04
70B
modelopen

Fine-tune of Mistral-7B and CodeLlama-70B.

Cohere logo

Command-R+

Cohere

2024-04
104B
modelopen

Purpose-built to excel at real-world enterprise use cases. Announce with no arch details: https://txt.cohere.com/command-r-plus-microsoft-azure/

SA

Viking

Silo AI

2024-04
33B
modelopen

'Viking uses an architecture similar to Llama 2, with flash attention, rotary embeddings, grouped query attention and supports a 4k sequence length.'

NR

OLMo-Bitnet-1B

Nous Research

2024-04
1B
modelopen

1.58-bit quantized (ternary weights) means we can run a 70B model in ~14GB VRAM. See also BitNet b1.58
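The ~14GB claim is simple arithmetic, assuming an ideal packing of log2(3) ≈ 1.58 bits per ternary weight (real kernels pack less tightly):

# Ternary-weight memory estimate for a 70B model
import math
params = 70e9
bits_per_weight = math.log2(3)            # ~1.585 bits for {-1, 0, +1}
print(params * bits_per_weight / 8 / 1e9) # ~13.9 GB, matching the ~14GB claim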

I

Aurora-M

International

2024-03
15.5B
modelopen

Apple logo

ReALM-3B

Apple

2024-03
3B
modelclosed

FLAN-T5 (Oct/2022) finetune.

Alibaba logo

Qwen1.5-MoE-A2.7B

Alibaba

2024-03
14.3B
modelopen

MoE. "Of particular significance is the fact that, through upcycling, the necessity for training an equivalent volume of tokens as in the original model has been eliminated." I assumed half of the original 3T tokens

xAI logo

Grok-1.5

xAI

2024-03
180B
modelopen

Context=128k.

A

Jamba 1

AI21

2024-03
52B
modelopen

MoE. Open weights, licensed under Apache 2.0. Announce: https://arxiv.org/abs/2403.19887

M

DBRX

MosaicML

2024-03
132B
modelopen

MoE. Trained for $10M on 3,072 NVIDIA H100s connected by 3.2Tbps Infiniband.

Stability AI logo

Stable Code Instruct 3B

Stability AI

2024-03
2.7B
modelopen

Context window=16,384. Trained on The Stack dataset.

SA

EvoLLM-JP

Sakana AI

2024-03
10B
modelopen

Japanese. Model merge 'our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel7B-002' https://sakana.ai/evolutionary-model-merge/

RG

RakutenAI-7B

Rakuten Group

2024-03
7B
modelopen

Japanese. Mistral 7B derivative.

I

Parakeet

Independent

2024-03
0.378B
modelopen

Tiny model (378M) for testing

R

RWKV-v5 EagleX

RWKV

2024-03
7.52B
modelopen

RWKV (pronounced RwaKuv) is an RNN: Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost)

Apple logo

MM1

Apple

2024-03
30B
modelclosed

VLM, outperforms Flamingo 80B (Apr/2022) across benchmarks. 2T text tokens + ~10B+ other text (estimate). Unreleased.

C

RFM-1

Covariant

2024-03
8B
modelpartial

Commercial, multimodal for robotics

Cohere logo

Command-R

Cohere

2024-03
35B
modelopen

RAG and tool use

DeepSeek-AI logo

DeepSeek-VL

DeepSeek-AI

2024-03
7B
modelopen

Vision, based on DeepSeek-LLM-7B

FU

AnyGPT

Fudan University

2024-03
7B
modelopen

Llama 2 7B backbone with new matrices ('reshaping the embedding matrix and prediction layer')

Stability AI logo

Stable Beluga 2.5

Stability AI

2024-03
70B
modelopen

Mentioned in Stability release about Intel chips 11/Mar/2024, availability unknown.

IA

Inflection-2.5

Inflection AI

2024-03
1200B
modelopen

S

Apollo

SRIBD/CUHK

2024-03
7B
modelopen

Qwen 1.8B as base. Medical focus.

Anthropic logo

Claude 3 Opus

Anthropic

2024-03
2500B
modelopen

Original MMLU=86.8 (GPT-4=86.4). MMLU=88.2 with CoT prompting. Original GPQA=50.4. 200k context, 1M for researchers.

N

Nemotron-4 15B

NVIDIA

2024-02
15B
modelopen

U

TowerLLM

Unbabel

2024-02
7B
modelopen

Commercial product, Llama-2 as base.

Google DeepMind logo

Hawk

Google DeepMind

2024-02
7B
modelopen

MMLU=35. RNN.

Google DeepMind logo

Griffin

Google DeepMind

2024-02
14B
modelopen

MMLU=49.5. RNN.

Microsoft logo

BitNet b1.58

Microsoft

2024-02
70B
modelopen

S

Samba-1

SambaNova

2024-02
1400B
modelpartial

CoE (Collection of Experts): Llama 2 7B/13B/70B, Mistral 7B, DeepSeek Coder 1.3B/6.7B/33B, Falcon 40B, DePlot, CLIP, LLaVA.

Cohere logo

Aya-101

Cohere

2024-02
13B
modelopen

mT5 base.

H

Cosmo-1B

HF

2024-02
1.8B
modelopen

Synthetic data (25B tokens of synthetic data for 6 epochs + code). MMLU=32.4

SA

Poro

Silo AI

2024-02
34.2B
modelopen

'Uses a BLOOM architecture with ALiBi embeddings to allow for context window extrapolation. While model architecture for the initial model has been kept simple, future models under progress will support additional capabilities, such as flash attention, rotary embeddings and grouped query attention.'

H

StarCoder 2

HF/ServiceNow

2024-02
15B
modelopen

The Stack v2=900B tokens, 5 epochs to 4.3T tokens
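The quoted figures imply slightly under five full passes; a minimal check:

# StarCoder 2 epoch arithmetic
corpus, tokens_trained = 900e9, 4.3e12
print(tokens_trained / corpus)            # ~4.8 epochs, i.e. just under 5 full passes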

ByteDance logo

530B

ByteDance

2024-02
530B
modelclosed

Trained using 12,288 A100 GPUs, replicating MT-NLG size

ByteDance logo

175B

ByteDance

2024-02
175B
modelclosed

Trained using 12,288 A100 GPUs, replicating GPT-3 size

Mistral logo

Mistral Small

Mistral

2024-02
7B
modelopen

Optimised for latency and cost.

Mistral logo

Mistral Large

Mistral

2024-02
300B
modelopen

MMLU=81.2 (same as Flan-PaLM 2 340B, higher than PaLM 2 340B MMLU=78.3), 32k context window. API only (not open source).

R

Hanooman

Reliance

2024-02
40B
modelopen

11 Indian languages like Hindi, Tamil, and Marathi

Apple logo

Ask

Apple

2024-02
20B
modelclosed

Internal employee model only

RA

Reka Edge

Reka AI

2024-02
7B
modelopen

RA

Reka Flash

Reka AI

2024-02
21B
modelopen

My testing shows very poor performance, equivalent to a tiny model.

Google DeepMind logo

Gemma

Google DeepMind

2024-02
7B
modelopen

MMLU=64.3 (Llama 2 70B=68.9, ChatGPT 20B=70). Text only. Probably dense. Largest training dataset (6T tokens) outside of the frontier models.

Google DeepMind logo

Gemini 1.5 Pro

Google DeepMind

2024-02
200B
modelopen

Sparse MoE. Context window=1M and 10M for research. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

Alibaba logo

Qwen-1.5 72B

Alibaba

2024-02
72B
modelopen

Meta AI logo

MobileLLM

Meta AI

2024-02
1B
modelopen

Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

B

GOODY-2

BRAIN

2024-02
modelopen

Satire (and hilarious). Probably Llama 2 with aggressive prompt. Wired interview: https://archive.md/toxHq

C

Natural-SQL-7B

ChatDB

2024-02
7B
modelopen

Based on DeepSeek-Coder 6.7B.

AS

Sea-Lion

AI Singapore

2024-02
7.5B
modelopen

MPT base. MMLU=26.87. Southeast Asian languages like Thai, Vietnamese and Bahasa Indonesia. https://www.computerweekly.com/feature/Sea-Lion-explained-Southeast-Asias-first-large-language-model

Google logo

TimesFM

Google

2024-02
0.2B
modelopen

Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.

Allen AI logo

OLMo

Allen AI

2024-02
7B
modelopen

Open Language Model (OLMo)

N

Audio Flamingo

NVIDIA

2024-02
1B
modelpartial

Project page: https://audioflamingo.github.io/

C

FLOR-6.3B

Cerebras

2024-01
6.3B
modelopen

Spanish, Catalan. Bloom-7.1B (341B tok) + continued pre-training on 140B tok. Trained on Cerebras hardware.

A

Weaver

AIWaves.cn

2024-01
34B
modelopen

Llama? 'All Weaver models are initialized from powerful open-source LLMs.' English waitlist: https://www.wawawriter.com/en/

Mistral logo

miqu 70b

Mistral

2024-01
70B
modelopen

Leaked, proper version soon: https://venturebeat.com/ai/mistral-ceo-confirms-leak-of-new-open-source-ai-model-nearing-gpt-4-performance/

I

iFlytekSpark-13B

iFlyTek

2024-01
13B
modelopen

'Pre-trained on a massive high-quality dataset with a total of more than 3 trillion tokens, and then fine-tuned on diversified alignment data.'

I

Xinghuo 3.5 (Spark)

iFlyTek

2024-01
200B
modelopen

GPT-4 competitor. https://www.shine.cn/biz/tech/2401304331/

Apple logo

MGIE

Apple

2024-01
7B
modelopen

MLLM and diffusion model initialized from LLaVA-7B (Llama 2 + Vicuna) + StableDiffusion-v1.5.

Meta AI logo

CodeLlama-70B

Meta AI

2024-01
70B
modelopen

Paper link is to 34B from Aug/2023. This 70B model finished training Jan/2024.

R

RWKV-v5 Eagle 7B

RWKV

2024-01
7.52B
modelopen

RWKV (pronounced RwaKuv) is an RNN: Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost), Trained on 1.1 Trillion Tokens across 100+ languages. Original paper: https://arxiv.org/abs/2305.13048

L

MaLA-500

LMU

2024-01
10B
modelopen

Extends Llama 2 7B to 10B using 534 languages.

C

MambaByte

Cornell

2024-01
0.972B
modelclosed

Used bytes instead of tokens. 4 bytes≈1 token, so 150B bytes≈37.5B tokens
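The byte-to-token conversion as a one-line check (the 4-bytes-per-token ratio is this entry's own approximation):

# MambaByte byte-to-token equivalence
bytes_seen, bytes_per_token = 150e9, 4
print(bytes_seen / bytes_per_token / 1e9) # ~37.5B token-equivalents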

DeepSeek-AI logo

DeepSeek-Coder

DeepSeek-AI

2024-01
33B
modelopen

'Surpasses existing closed-source models like Codex and GPT-3.5... permissive license that allows for both research and unrestricted commercial use.'

Tencent logo

FuseLLM

Tencent

2024-01
7B
modelopen

Fusion of Llama-2-7B (2T tok), OpenLLaMA-7B (2T tok), and MPT-7B (1T tok).

A

Fuyu-Heavy

Adept

2024-01
120B
modelpartial

'Fuyu-Heavy is the world’s third-most-capable multimodal model, behind only GPT4-V and Gemini Ultra, which are 10-20 times bigger.' Token estimate is based on Adept Persimmon-8B using many more tokens.

O

Orion-14B

OrionStar

2024-01
14B
modelopen

English, Chinese, Japanese, Korean, and other languages.

SA

InternLM2

Shanghai AI Laboratory/SenseTime

2024-01
20B
modelopen

ZA

GLM-4

Zhipu AI (Tsinghua)

2024-01
200B
modelopen

Best Chinese model to date based on analysis. Follows OpenAI roadmap. MMLU=81.5. 'hundreds of billions of parameters' https://www.chatglm.cn/

DeepSeek-AI logo

DeepSeekMoE

DeepSeek-AI

2024-01
16B
modelclosed

MoE activated parameters are 10-15% of dense, so I need to rethink ALScore for MoE. 'preliminary efforts to scale up DeepSeekMoE to 145B'

DeepSeek-AI logo

DeepSeek

DeepSeek-AI

2024-01
67B
modelopen

Chinese/English. Outperforms Llama 2. MMLU=71.3 outperforms GPT-3.5.

Tencent logo

LLaMA Pro

Tencent

2024-01
8.3B
modelopen

We pre-train LLAMA PRO’s expanded blocks on 80B tokens using open-source code and math data for 2830 GPU Hours (16 NVIDIA H800 GPUs for about 7 days).
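The GPU-hours and wall-clock figures are mutually consistent; a quick check:

# LLaMA Pro wall-clock check
gpu_hours, gpus = 2830, 16
print(gpu_hours / gpus / 24)              # ~7.4 days, consistent with "about 7 days"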

W

Palmyra X

Writer

2024-01
72B
modelopen

Palmyra X V2, Palmyra X V3, Palmyra X V4. https://venturebeat.com/ai/why-writers-palmyra-llm-is-the-little-ai-model-that-could-for-enterprises/

S

TinyLlama

SUTD/Independent

2024-01
1.1B
modelopen

'Overtrained' using 2,727 tokens per parameter. Dataset was 1T: 3 epochs to 3T seen. Singapore.
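Where the 2,727 figure comes from (tokens seen divided by parameters):

# TinyLlama tokens-per-parameter ratio
tokens_seen, params = 3e12, 1.1e9         # 1T-token dataset x 3 epochs
print(tokens_seen / params)               # ~2727 tokens/param (Chinchilla-optimal is ~20)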

J

DocLLM

JPMorgan

2024-01
7B
modelclosed

Document spatial layout structure.

149 models
C

MACE-MP-0

Cambridge

2023-12
0.00469B
modelopen

"Uses 4-body equivariant messages; covers 89 elements; supports fine-tuning for ab initio accuracy with minimal data."

Allen AI logo

Unified-IO 2

Allen AI

2023-12
7B
modelopen

600TB dataset (plus 120+ fine-tuning datasets) includes '1B image-text pairs, 1T text tokens, 180M video clips, 130M interleaved image & text, 3M 3D assets, and 1M agent trajectories.'

Microsoft logo

WaveCoder-DS-6.7B

Microsoft

2023-12
6.7B
modelclosed

To obtain WaveCoder models, we choose StarCoder-15B, CodeLlama (7B and 13B), and DeepSeek-Coder-6.7B as base models and fine-tune each for 3 epochs.

H

YunShan

Huawei

2023-12
7B
modelclosed

Finance + law fine-tune of PanGu-π

H

PanGu-Pi

Huawei

2023-12
7B
modelclosed

Dense, named PanGu-π

W

YAYI 2

Wenge

2023-12
30B
modelopen

Dataset=240TB filtered to 10.6TB for 2.65T tokens

B

Emu2

BAAI

2023-12
37B
modelopen

VLM. Gemini clone. Outperforms Flamingo 80B. The Pile for text, but only sampled 3.6B tokens (1.4% of the dataset).

Google DeepMind logo

MedLM

Google DeepMind

2023-12
modelpartial

Available to 'white-listed' orgs only.

UA

SOLAR-10.7B

Upstage AI

2023-12
10.7B
modelopen

South Korean. Llama-2 arch. SOTA for its size (Dec/2023).

D

DeciLM-7B

Deci

2023-12
7.04B
modelopen

4.4× faster than Mistral. English only.

Mistral logo

Mistral-medium

Mistral

2023-12
180B
modelopen

MMLU=75.3% (GPT-3.5-turbo 20B=70%, Llama 2 70B=68.9%)

Mistral logo

mixtral-8x7b-32kseqlen

Mistral

2023-12
46.7B
modelopen

MoE=7Bx8, aka mistral-small. 'Concretely, Mixtral has 45B total parameters but only uses 12B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12B model.'
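Why the total/active gap: with top-2-of-8 routing, each token runs the shared (attention, embedding, norm) parameters plus only two expert FFNs. A minimal sketch; the shared/expert split below is inferred from the quoted 45B/12B figures, not an official breakdown:

# Mixtral-style MoE: total vs active parameters (split inferred, not official)
shared = 1.0e9                            # attention, embeddings, norms (assumed)
n_experts, expert_size, top_k = 8, 5.5e9, 2
total = shared + n_experts * expert_size  # 45B, as quoted
active = shared + top_k * expert_size     # 12B per token, as quoted
print(total / 1e9, active / 1e9)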

T

StripedHyena 7B

Together

2023-12
7.65B
modelopen

RedPajama (C4), new arch beyond just Transformers

N

NexusRaven-V2 13B

Nexusflow.ai

2023-12
modelopen

Based on CodeLlama. 'surpasses GPT-4 by up to 7% in function calling success rates in human-generated use cases involving nested and composite functions.'

Google DeepMind logo

Gemini Ultra 1.0

Google DeepMind

2023-12
1500B
modelopen

Original MMLU=83.7. MMLU=90.04 with prompting. Chinchilla (20:1), dense, maybe 600B-2000B. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/

C

Mamba

CMU

2023-12
2.8B
modelopen

The Pile, new arch beyond just Transformers. 2.7B MMLU=26.2. 7B MMLU=33.3.

B

LVM-3B

Berkeley/JHU

2023-12
3B
modelclosed

Paper is 25MB. First Large Vision Model (LVM); no text. Based on Llama and LAION 5B (1.49B).

Alibaba logo

SeaLLM-13b

Alibaba

2023-12
13B
modelopen

Llama 2 for Southeast Asian (SEA) languages: Vietnamese 🇻🇳, Indonesian 🇮🇩, Thai 🇹🇭, Malay 🇲🇾, Khmer🇰🇭, Lao🇱🇦, Tagalog🇵🇭 and Burmese🇲🇲

P

pplx-70b-online

Perplexity

2023-11
70B
modelopen

Web access. Higher 'freshness' and 'truth' scores.

Meta AI logo

SeamlessM4T-Large v2

Meta AI

2023-11
2.3B
modelopen

Based on NLLB and older models. https://github.com/facebookresearch/seamless_communication

Google DeepMind logo

Q-Transformer

Google DeepMind

2023-11
modelclosed

Robotics, builds on RT-1

I

Yuan 2.0

IEIT

2023-11
102.6B
modelopen

Chinese + EN dataset includes The Pile components: DM Mathematics, arXiv, Wikipedia, Books3, Stack Exchange, FreeLaw, and medical data.

E

MEDITRON

EPFL

2023-11
70B
modelopen

Llama 2 trained on med data using NVIDIA Megatron-LM. "outperforms Llama-2-70B, GPT-3.5 (text-davinci-003, 8-shot), and Flan-PaLM on multiple medical reasoning tasks."

Microsoft logo

Transformers-Arithmetic

Microsoft

2023-11
0.1B
modelclosed

Proving maths is not memorized. Uses GPT-2-style model. Sébastien Bubeck

B

Starling-7B

Berkeley

2023-11
7B
modelopen

Llama 2 7B -> OpenChat 7B -> Starling-7B (RLAIF)

IA

Inflection-2

Inflection AI

2023-11
1200B
modelopen

“now the 2nd best LLM in the world”. Finished training 19/Nov/2023, waiting for fine-tuning and release.

Anthropic logo

Claude 2.1

Anthropic

2023-11
130B
modelopen

Fewer hallucinations, 200k context length, tool use

Allen AI logo

TÜLU 2

Allen AI

2023-11
70B
modelopen

Llama 2 finetune aligned with direct preference optimization (DPO).

N

Nemotron-3 22B

NVIDIA

2023-11
22B
modelopen

8B released, 22B internal.

N

Nemotron-2 43B

NVIDIA

2023-11
43B
modelclosed

Used to train HelpSteer (16/Nov/2023): https://arxiv.org/abs/2311.09528

Microsoft logo

Orca 2

Microsoft

2023-11
13B
modelpartial

Llama 2 13B (2T) -> Orca 2 (GPT-4 finetune). Still an imitation model, overhyped: The False Promise of Imitating Proprietary LLMs https://arxiv.org/abs/2305.15717

Microsoft logo

Phi-2

Microsoft

2023-11
2.7B
modelopen

https://twitter.com/SebastienBubeck/status/1724854157004190095

Microsoft logo

Florence-2

Microsoft

2023-11
0.771B
modelopen

VLM, Flamingo alt

Google DeepMind logo

Mirasol3B

Google DeepMind

2023-11
3B
modelclosed

Combiner + autoregressive transformer for video/audio/text

N

OtterHD-8B

NTU

2023-11
8B
modelopen

Evolution of Persimmon-9.3B and Fuyu 8B

S

Gauss

Samsung

2023-11
7B
modelpartial

Gauss Language specializes in generating text, Gauss Code in software and code description, and Gauss Image in image creation.

xAI logo

Grok-1

xAI

2023-11
314B
modelopen

Context window=8192. UI: https://twitter.com/TobyPhln/status/1721053802235621734

xAI logo

Grok-0

xAI

2023-11
33B
modelclosed

Announced Nov/2023, trained Jul/2023

01-ai logo

Yi-34B

01-ai

2023-11
34.4B
modelopen

Controversy about Llama 2 base. https://twitter.com/kaifulee/status/1724673131875377465 MMLU=76.3 (PaLM 2=78.3) Outperforms Llama 2. Chinese and English. https://www.bloomberg.com/news/articles/2023-11-05/kai-fu-lee-s-open-source-01-ai-bests-llama-2-according-to-hugging-face

OpenAI logo

GPT-4 Turbo

OpenAI

2023-11
70B
modelopen

https://openai.com/blog/new-models-and-developer-products-announced-at-devday

Google DeepMind logo

MatFormer

Google DeepMind

2023-10
0.85B
modelopen

Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M). "850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters, each exhibiting better validation loss and one-shot downstream evaluations than independently trained counterparts."

KT

Skywork-13B

Kunlun Tech

2023-10
13B
modelopen

CN + EN.

Moonshot AI logo

Kimi Chat

Moonshot AI

2023-10
100B
modelopen

Chinese. Long context. No paper.

JA

jina-embeddings-v2

Jina AI

2023-10
0.435B
modelopen

Alternative to text-embedding-ada-002. Related v1 paper: https://arxiv.org/abs/2307.11224

A

Fuyu

Adept

2023-10
8B
modelopen

VLM. 8B available under open licence, Medium size is closed

Baidu logo

ERNIE 4.0

Baidu

2023-10
1000B
modelopen

Dense (confirmed). English-dubbed launch video (2h52m): https://twitter.com/i/broadcasts/1yNGaZaeallJj & https://youtu.be/wYozcsavRuM

HF

Zephyr

Hugging Face H4

2023-10
7.3B
modelopen

Mistral with 'aligned' data removed from dataset

Google DeepMind logo

PaLI-3

Google DeepMind

2023-10
5B
modelclosed

VLM. Next iteration of PaLI via Pathways. https://lifearchitect.ai/pathways/

N

Retro 48B

NVIDIA

2023-10
48B
modelopen

'The largest LLM pretrained with retrieval before instruction tuning.'

Apple logo

Ferret

Apple

2023-10
13B
modelopen

Vicuna base, multimodal

XL

Lemur

XLANG Lab

2023-10
70B
modelopen

https://arxiv.org/abs/2310.06830

K

AceGPT

KAUST/Shenzhen

2023-10
13B
modelopen

Arabic. Llama 2 + RLAIF

RA

Yasa-1

Reka AI

2023-10
modelpartial

Multi-modal. No public arch info. Researchers from DeepMind, Google, Baidu and Meta building enterprise models

Google DeepMind logo

RT-X

Google DeepMind

2023-10
55B
modelopen

Robotics using UL2. 'RT-1 model trained using the robotic data mixture as RT-1-X, and the RT-2 model trained using the robotic data mixture as RT-2-X.'

W

MotionLM

Waymo

2023-09
0.09B
modelclosed

LLM for autonomous vehicle forecasting. https://youtu.be/jrMMNmN21I8?t=1560

W

GAIA-1

Wayve

2023-09
9B
modelclosed

World model, generates video. Uses T5-large 770M for language + all vision parameters

Alibaba logo

Qwen

Alibaba

2023-09
72B
modelopen

Chinese. Full name is 'Tongyi Qianwen' 通义千问. 'Lags behind both GPT-3.5 and GPT-4'. Originally 7B/14B params Apr/2023

Meta AI logo

Llama 2 Long

Meta AI

2023-09
70B
modelclosed

Unreleased to date. Context window=32,768 tokens (compare to Llama 2=4096 tokens)

HA

LeoLM

Hessian AI/LAION

2023-09
13B
modelopen

Llama 2 'extended' and pretrained on 2000B Llama 2 tokens + 65B tokens of German

Mistral logo

Mistral 7B

Mistral

2023-09
7.3B
modelopen

Apache 2.0, Sliding Window Attention (SWA) to handle longer sequences at smaller cost

Microsoft logo

Kosmos-2.5

Microsoft

2023-09
1.3B
modelclosed

B

Baichuan 2

Baichuan

2023-09
13B
modelopen

Great paper. Chinese-English bilingual dataset

T

BOLT2.5B

ThirdAI

2023-09
2.5B
modelopen

CPU trained

D

DeciLM

Deci

2023-09
5.7B
modelopen

Faster inference (4.8× throughput of Llama 2)

IBM logo

MoLM

IBM

2023-09
8B
modelopen

ModuleFormer is based on the Sparse Mixture of Experts (MoE).

S

NExT-GPT

Singapore

2023-09
7B
modelopen

Multimodal. Vicuna 7B + other modalities

Microsoft logo

Phi-1.5

Microsoft

2023-09
1.3B
modelopen

Textbooks only. 30B-token dataset

Apple logo

UniLM

Apple

2023-09
0.034B
modelopen

Apple's Transformer model for iOS 17 + macOS Sonoma. Announce is actually Jun/2023. GPT-2 base? 128 token context window

A

Persimmon-8B

Adept

2023-09
8B
modelopen

Open Apache license and publicly accessible weights.

B

FLM-101B

BAAI

2023-09
101B
modelopen

Trained on a $100k compute budget (a cluster of 24 DGX-A800 servers, 8×80G GPUs each, for 21 days).

TII logo

Falcon 180B

TII

2023-09
180B
modelopen

Major milestone for open source models (largest open dense model to date).

Tencent logo

Hunyuan

Tencent

2023-09
100B
modelopen

I

phi-CTNL

Independent

2023-09
0.1B
modelopen

Satire. MMLU=100. 'phi-CTNL (pronounced “fictional”) that achieves perfect results across diverse academic benchmarks'

IBM logo

Granite

IBM

2023-09
13B
modelopen

Original trained on 1T tokens, update 15/Feb/2024 trained on 2.5T tokens: granite-13b-chat-v2 (v2.1.0). "At IBM, we curated 6.48TB of data to train our LLM Granite.13B. This was reduced to 2.07 TB after pre-processing, a 68% decrease."

Inception logo

Jais

Inception

2023-08
13B
modelopen

Arabic, trained in Abu Dhabi, UAE using Cerebras.

Meta AI logo

Code Llama 34B

Meta AI

2023-08
34B
modelopen

Outperforms GPT-3.5. Initialized from Llama 2 (2T tokens), then trained on 500B tokens of code, plus 100B tokens of Python for the Python variant.

Hugging Face logo

IDEFICS

Hugging Face

2023-08
80B
modelopen

Clone of Flamingo using Llama-1 65B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS)

U

Raven

UI/NVIDIA

2023-08
11B
modelclosed

RAG; builds on Atlas.

A

DukunLM

AzaleAI

2023-08
13B
modelopen

Indonesian fine-tune of WizardLM (which is a Llama fine-tune).

Microsoft logo

WizardLM

Microsoft

2023-08
70B
modelopen

Assume Llama-2 fine-tune. Outperforms text-davinci-003. May merge this entry with the Apr/2023 7B release

BU

Platypus

Boston University

2023-08
70B
modelopen

Fine-tune of Llama 2, family includes merges with Beluga, Dolphin, and Camel fine-tunes.

Stability AI logo

Japanese StableLM Alpha 7B

Stability AI

2023-08
7B
modelopen

Best-performing openly available language model for Japanese speakers.

Stability AI logo

Stable Code 3B

Stability AI

2023-08
2.7B
modelopen

Context window=16,384. Trained on The Stack dataset.

S

Med-Flamingo

Stanford

2023-07
8.3B
modelopen

Uses LAION OpenFlamingo 9B, based on LLaMA-7B text + 1.3B vision

L

Alfred-40B-0723

LightOn

2023-07
40B
modelopen

First finetuned version of Falcon with RLHF. Enterprise: https://www.lighton.ai/paradigm

T

LLaMA-2-7B-32K

Together

2023-07
7B
modelopen

32k context window instead of 4k (Llama 2)

Google DeepMind logo

Med-PaLM M

Google DeepMind

2023-07
540B
modelclosed

Uses PaLM 1. Already outperformed by Med-PaLM 2. Med-PaLM Multimodal (Med-PaLM M).

C

BTLM-3B-8K

Cerebras

2023-07
3B
modelopen

Runs on devices with as little as 3GB of memory [iPhone, Macbook] when quantized to 4-bit

Stability AI logo

Stable Beluga 2

Stability AI

2023-07
70B
modelopen

Fine-tuned Llama 2. Non-commercial use license. Codename was FreeWilly2

Stability AI logo

Stable Beluga 1

Stability AI

2023-07
65B
modelopen

Fine-tuned LLaMA-1. Non-commercial use license. Codename was FreeWilly1

SA

Meta-Transformer

Shanghai AI Laboratory/CUHK

2023-07
2B
modelopen

Proto-AGI. 12 modalities (text, image, point cloud, audio, video, infrared, hyperspectral, X-ray, time-series, tabular, Inertial Measurement Unit (IMU), and graph data).

Meta AI logo

Llama 2

Meta AI

2023-07
70B
modelopen

Context window=4096. MMLU=68.9 (GPT-3.5=70.0, GPT-4=86.4)

(

WormGPT

(Undisclosed)

2023-07
6B
modelpartial

GPT-J (2021) finetune/module.

Anthropic logo

Claude 2

Anthropic

2023-07
130B
modelopen

More HHH, 200k context length

I

LongLLaMA

IDEAS/DeepMind

2023-07
7B
modelopen

256k context length

T

xTrimoPGLM

Tsinghua

2023-07
100B
modelclosed

Protein language model

Salesforce logo

XGen

Salesforce

2023-07
7B
modelopen

8K sequence length. Released under Apache-2.0.

3C

Zhinao (Intellectual Brain)

360 cn

2023-07
100B
modelopen

RA

Yasa

Reka AI

2023-06
modelpartial

No public arch info. Researchers from DeepMind, Google, Baidu and Meta building enterprise models

Microsoft logo

Kosmos-2

Microsoft

2023-06
1.6B
modelopen

Proto-AGI. Multimodal large language model (MLLM). a multimodal large language model with grounding capability built upon KOSMOS-1

Google logo

AudioPaLM

Google

2023-06
340B
modelclosed

'A unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation.'

IA

Inflection-1

Inflection AI

2023-06
120B
modelopen

Comparable with benchmarking results from InternLM 104B, 1-2% better. ‘Inflection-1 was trained using thousands of NVIDIA H100 GPUs on a very large dataset.’

Microsoft logo

Phi-1

Microsoft

2023-06
1.3B
modelclosed

Code model. ‘breaking existing scaling laws by training a 1.3B-parameter model, which we call phi-1, for roughly 8 passes over 7B tokens (slightly over 50B total tokens seen) followed by finetuning on less than 200M tokens.’

SA

InternLM

Shanghai AI Laboratory/SenseTime

2023-06
104B
modelclosed

Outperforms ChatGPT, LLaMA on RACE-h, Chinese + English

Meta AI logo

BlenderBot 3x

Meta AI

2023-06
175B
modelopen

OPT-175B with new dialogue data

Microsoft logo

Orca

Microsoft

2023-06
13B
modelpartial

LLaMA -> Vicuna -> Orca (GPT-4 finetune). Still an imitation model, overhyped: The False Promise of Imitating Proprietary LLMs https://arxiv.org/abs/2305.15717

EZ

PassGPT

ETH Zürich

2023-06
modelclosed

GPT-2 trained on leaked passwords

Google DeepMind logo

DIDACT

Google DeepMind

2023-06
modelclosed

Iterative coding model trained on Google's monorepo. Jacob: https://twitter.com/jacobaustin132/status/1663972128176128002

M

LTM-1

Magic

2023-06
modelclosed

Context window=5M

OpenAI logo

GPT-4 MathMix

OpenAI

2023-05
1760B
modelclosed

Unreleased, includes step by step research

C

PandaGPT

Cambridge/Tencent

2023-05
13B
modelopen

Proto-AGI. 6 modalities (text, image/video, audio, depth, thermal, and IMU/accelerometer/gyroscope/compass). Based on Vicuna.

TII logo

Falcon

TII

2023-05
40B
modelopen

Abu Dhabi

R

202305-refact2b-mqa-lion

Refact

2023-05
1.6B
modelpartial

LiON vs Adam, code, RedPajama+The Stack

U

Guanaco

UW

2023-05
65B
modelopen

LLaMA-65B via QLoRA

Meta AI logo

LIMA

Meta AI

2023-05
65B
modelclosed

LLaMA-65B with nearly no fine-tuning, no RLHF

A

Formosa (FFM)

Asus/TWS

2023-05
176B
modelpartial

BLOOMZ finetune? Chinese, Taiwan's first LLM. Subscription hardware: https://archive.md/cVdJt

Salesforce logo

CodeT5+

Salesforce

2023-05
16B
modelopen

'InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001.'

Google logo

PaLM 2

Google

2023-05
340B
modelopen

“What we found in our work is that it’s not really the sort of size of model — that the larger is not always better,” Deepmind VP Zoubin Ghahramani said in a press briefing ahead of today’s announcement. “That’s why we’ve provided a family of models of different sizes. We think that actually parameter count is not really a useful way of thinking about the capabilities of models and capabilities are really to be judged by people using the models and finding out whether they’re useful in the tests that they try to achieve with these models.”

H

StarCoder

HF/ServiceNow

2023-05
15.5B
modelopen

M

MPT

MosaicML

2023-05
7B
modelopen

'Llongboi'. Apache 2.0 license suitable for commercial use. Base 7B LLM trained on 1T tokens outperforms LLaMA and GPT-3. 64K+ context length. $200k to train from scratch.

IA

Pi

Inflection AI

2023-05
60B
modelopen

No indication of params/tokens. Devs from DeepMind.

N

GPT-2B-001

NVIDIA

2023-05
2B
modelopen

No paper yet

Amazon logo

Titan

Amazon

2023-04
200B
modelopen

No official information at all. 2nd hand via Jack Clark: https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon '$65m training run. Specifically, they trained a 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips (using 1,720 P4d nodes). It took 48 days to train.'
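The chip and node counts in that report are internally consistent, assuming the 8 A100s per P4d node:

# Titan cluster check
nodes, gpus_per_node = 1720, 8            # P4d = 8x A100 per node
print(nodes * gpus_per_node)              # 13,760 A100s, as reported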

Microsoft logo

WizardLM

Microsoft

2023-04
7B
modelopen

LLaMA 7B self-instructed fine-tune.

M

MPT

MosaicML

2023-04
1.3B
modelopen

More 1B models coming with different datasets. Many more.

Stability AI logo

StableLM

Stability AI

2023-04
65B
modelopen

Dataset contains 1.5 trillion tokens, roughly 3x the size of The Pile. These models will be trained on up to 1.5 trillion tokens. The context length for these models is 4096 tokens.

D

Dolly 2.0

Databricks

2023-04
12B
modelopen

Fine-tuned Pythia 12B

E

Pythia

EleutherAI

2023-04
12B
modelopen

B

Koala-13B

Berkeley

2023-04
13B
modelopen

LLaMA base. Academic licence only.

C

C1.2

Character.ai

2023-03
20B
modelopen

No details released.

B

BloombergGPT

Bloomberg

2023-03
50B
modelclosed

Video: https://youtu.be/m2Scj2SO85Y Underperforms GPT-3, based on BLOOM. Tokens: 'We select a model size motivated by Hoffmann et al. (2022) and train a 50 billion parameter model on 569 billion tokens from our corpus of over 700 billion tokens to produce a model that is competitive with larger models.'

L

OpenFlamingo-9B

LAION

2023-03
8.3B
modelopen

Uses LLaMA-7B. Demo: https://7164d2142d11.ngrok.app/

N

GPT4All-LoRa

Nomic

2023-03
7B
modelopen

Chatbot trained on ~800k GPT-3.5-Turbo generations, based on LLaMA.

C

Cerebras-GPT

Cerebras

2023-03
13B
modelopen

20:1 tokens to parameters as per https://lifearchitect.ai/chinchilla/
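What the 20:1 ratio implies for the largest model in the family, as a one-liner:

# Chinchilla-style 20:1 token budget for Cerebras-GPT 13B
params, ratio = 13e9, 20
print(params * ratio / 1e9)               # 260B training tokens implied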

H

PanGu-Sigma

Huawei

2023-03
1085B
modelclosed

Sparse. 1.085T parameters named PanGu-Σ.

Google logo

CoLT5

Google

2023-03
5.2B
modelclosed

up to 64k context window [48k words or about 96 pages -Alan]

Google DeepMind logo

Med-PaLM 2

Google DeepMind

2023-03
340B
modelclosed

Recently, our next iteration, Med-PaLM 2, consistently performed at an “expert” doctor level on medical exam questions, scoring 85%. This is an 18% improvement from Med-PaLM’s previous performance and far surpasses similar AI models.

OpenAI logo

GPT-4 Classic (gpt-4-0314 & gpt-4-0613, non-Turbo)

OpenAI

2023-03
1760B
modelopen

Original MMLU=86.4. MMLU=90.1 with prompting. Proto-AGI. 1.76T parameters MoE.

S

Alpaca

Stanford

2023-03
7B
modelopen

'Stanford Alpaca: An Instruction-following LLaMA model.'

A

Jurassic-2

AI21

2023-03
178B
modelopen

T

GPT-NeoX-Chat-Base-20B

Together

2023-03
20B
modelopen

'Instruction-tuned 20 billion parameter language model, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between Together, LAION, and Ontocord.ai.'

Microsoft logo

Kosmos-1

Microsoft

2023-02
1.6B
modelclosed

Proto-AGI. Multimodal large language model (MLLM). Raven’s Progressive Matrices as real images, not digits as in testing of text-davinci-003 at https://lifearchitect.ai/ravens/

Meta AI logo

LLaMA-65B

Meta AI

2023-02
65B
modelopen

Researchers only, noncommercial only. 'LLaMA-65B is competitive with the best models, Chinchilla70B and PaLM-540B.'

FU

MOSS

Fudan University

2023-02
16B
modelopen

Major bandwidth issues: https://www.reuters.com/technology/china-fudan-university-team-apologises-after-chatgpt-style-platform-crashes-2023-02-21/

W

Palmyra

Writer

2023-02
20B
modelopen

Only up to 5B available open-source. 'Trained on over 300 billion tokens of text data, and the size of the resulting model is over 20 billion parameters.' https://writer.com/product/cowrite/

AA

Luminous Supreme Control

Aleph Alpha

2023-02
70B
modelopen

‘Control’ means instruction tuned

Meta AI logo

Toolformer+Atlas 11B+NLLB 54B

Meta AI

2023-02
6.7B
modelclosed

Based on GPT-J 6.7B + access to other models via API

Amazon logo

Multimodal-CoT

Amazon

2023-02
0.738B
modelopen

Models <1B with vision CoT

Microsoft logo

FLAME

Microsoft

2023-01
0.06B
modelclosed

T5 for Excel formulas, very small 60M params, "We start from a dataset of 927M formulas" estimate 10x multiplier for 9B tokens

60 models
Google DeepMind logo

Med-PaLM 1

Google DeepMind

2022-12
540B
modelclosed

Collab between Google & DeepMind. Makes 1% fewer errors than humans

Meta AI logo

OPT-IML

Meta AI

2022-12
175B
modelopen

Instruct

Anthropic logo

RL-CAI

Anthropic

2022-12
52B
modelclosed

RLAIF=reinforcement learning with AI feedback

Baidu logo

ERNIE-Code

Baidu

2022-12
0.56B
modelopen

Google logo

RT-1

Google

2022-12
0.035B
modelclosed

OpenAI logo

ChatGPT (gpt-3.5-turbo)

OpenAI

2022-11
20B
modelopen

Instruct with strict policies ("extremely limited")

OpenAI logo

text-davinci-003

OpenAI

2022-11
modelopen

T

GPT-JT

Together

2022-11
6B
modelopen

R

RWKV-4

RWKV

2022-11
14B
modelopen

RWKV (pronounced RwaKuv) is an RNN: https://www.reddit.com/r/MachineLearning/comments/yxt8sa/r_rwkv4_7b_release_an_attentionfree_rnn_language/

Meta AI logo

Galactica

Meta AI

2022-11
120B
modelopen

scientific only

DeepMind logo

SED

DeepMind

2022-11
modelclosed

SED 420M (diffusion text model)

B

mT0

BigScience

2022-11
13B
modelopen

fine-tuned

B

BLOOMZ

BigScience

2022-11
176B
modelopen

fine-tuned

Microsoft logo

PACT

Microsoft

2022-10
modelopen

Trained on ~5TB data, 2GB model download. 'In general we see an improvement in model performance as we increase the number of training tokens. Interestingly, larger models did not necessarily result in better performance for robot navigation. Even though larger models consistently presented better loss values for action prediction on a static dataset, (Fig. 7 b), when it comes to real-time deployment the larger network capacity introduces inference delays that become a disadvantage and lead to earlier crashes. For example, while LiDAR perception measurements arrive to the vehicle every 0.077s (13Hz), the largest model of 24 layers takes on average 0.023s for inference with a RTX3090 GPU, roughly 40% longer the 3 layer model (0.016s). These time differences can amount to even larger performance gaps in small embedded systems, and further emphasize the importance of multiple downstream task architectures sharing a common representation branch for real-time robotics applications.'

Google logo

Flan-T5

Google

2022-10
11B
modelopen

T5=1T tokens + LM-adapted T5 as 100B tokens

Google logo

Flan-PaLM

Google

2022-10
540B
modelclosed

Google logo

U-PaLM

Google

2022-10
540B
modelclosed

N

VIMA

NVIDIA

2022-10
0.2B
modelopen

T

OpenChat

Tsinghua

2022-09
13B
modelopen

Llama 2 13B -> OpenChat 13B

W

WeLM

Wechat

2022-09
10B
modelopen

13% English tokens and 87% Chinese

T

CodeGeeX

Tsinghua

2022-09
13B
modelopen

DeepMind logo

Sparrow

DeepMind

2022-09
70B
modelclosed

Chatbot as a fine-tuned version of Chinchilla 70B

Google logo

PaLI

Google

2022-09
17B
modelclosed

PaLM Vision model, new datasets of 10B multilingual text-image pairs

N

NeMo Megatron-GPT 20B

NVIDIA

2022-09
20B
modelopen

Microsoft logo

Z-Code++

Microsoft

2022-08
0.71B
modelclosed

abstractive text summarization, 710M, outperforms PaLM 540B. "Due to the limited computational resource, Z-Code++LARGE is trained with only 500B tokens instead of 1T tokens as that for mT5 training."

Meta AI logo

Atlas

Meta AI

2022-08
11B
modelopen

Meta AI logo

BlenderBot 3

Meta AI

2022-08
175B
modelopen

T

GLM-130B

Tsinghua

2022-08
130B
modelopen

50% English (200B tokens), so included here

Amazon logo

AlexaTM 20B

Amazon

2022-08
20B
modelopen

Wikipedia and mC4 only. seq2seq

OpenAI logo

6.9B FIM

OpenAI

2022-07
6.9B
modelclosed

Several models: 8 sizes, NLP, Code, FIM/non-FIM. 100B tokens for 6.9B params... beyond chinchilla

Google logo

‘monorepo-Transformer’

Google

2022-07
0.5B
modelclosed

Unnamed. Writes >3% of internal Google code.

H

PanGu-Coder

Huawei

2022-07
2.6B
modelclosed

Python via GH

Meta AI logo

NLLB

Meta AI

2022-07
54.5B
modelopen

54.5B MOE, 3.3B dense. 200+ languages

A

J-1 RBG

AI21

2022-07
178B
modelopen

J-1 fine-tuned with RBG law corpus

B

BLOOM (tr11-176B-ml)

BigScience

2022-07
176B
modelopen

Google logo

Minerva

Google

2022-06
540B
modelclosed

PaLM finetuned on LaTeX/arXiv maths

Microsoft logo

GODEL-XL

Microsoft

2022-06
2.7B
modelopen

XL: GPT-3 175B in paper, GPT-J 2.7B released

Y

YaLM 100B

Yandex

2022-06
100B
modelopen

Megatron-LM clone, Russian/English: https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6

Allen AI logo

Unified-IO

Allen AI

2022-06
2.8B
modelclosed

Based on T5. Demo only

DeepMind logo

Perceiver AR

DeepMind

2022-06
1B
modelclosed

Context window=100,000. Params: 364M (Wikipedia), 975M (PG-19), 826M (Books), 770M (ImageNet), music=?

Google logo

LIMoE

Google

2022-06
5.6B
modelclosed

I

GPT-4chan

Independent

2022-06
6B
modelopen

Warning for inappropriate content. GPT-J.

S

Diffusion-LM

Stanford

2022-05
0.3B
modelopen

GPT-J with synthetic data

Google logo

UL2 20B

Google

2022-05
20B
modelclosed

Unifying Language model. C4 only.

DeepMind logo

Gato (Cat)

DeepMind

2022-05
1B
modelclosed

Proto-AGI. Generalist agent (LLM, VLM, robot)

Google logo

LaMDA 2

Google

2022-05
137B
modelpartial

Chatbot with tiny walled garden demo TBA

Meta AI logo

OPT-175B

Meta AI

2022-05
175B
modelopen

Only 30B available (Jun/2022)

Hugging Face logo

Tk-Instruct

Hugging Face

2022-04
11B
modelopen

Based on T5.

Meta AI logo

InCoder

Meta AI

2022-04
6.7B
modelopen

Python and JavaScript

TII logo

NOOR

TII

2022-04
10B
modelclosed

Arabic. "World’s largest high-quality cross-domain Arabic dataset, combining web data with books, poetry, news articles, and technical information"

S

mGPT

Sber

2022-04
13B
modelpartial

60 languages. Only 1.3B model available

Google logo

PaLM-Coder

Google

2022-04
540B
modelclosed

Google logo

PaLM

Google

2022-04
540B
modelclosed

Meta AI logo

SeeKeR

Meta AI

2022-03
2.7B
modelopen

BART and compared to GPT-2

Salesforce logo

CodeGen

Salesforce

2022-03
16B
modelopen

Code

L

VLM-4

LightOn

2022-03
10B
modelopen

Params corrected 25/Apr/2022

DeepMind logo

Chinchilla

DeepMind

2022-03
70B
modelclosed

First to double tokens per size increase

Salesforce logo

CodeT5

Salesforce

2022-03
0.7B
modelopen

"Text-to-Text Transfer Transformer". Code. Large introduced in https://arxiv.org/pdf/2207.01780.pdf

E

GPT-NeoX-20B

EleutherAI

2022-02
20B
modelopen

Latest model to Feb/2022

Meta AI logo

CM3

Meta AI

2022-01
13B
modelopen

LLM with multimodal capabilities

24 models
Baidu logo

ERNIE 3.0 Titan

Baidu

2021-12
260B
modelopen

Meta AI logo

XGLM

Meta AI

2021-12
7.5B
modelopen

Multilingual: 30 languages, 16 families.

Meta AI logo

Fairseq

Meta AI

2021-12
1100B
modelopen

13B & 1100B param models.

DeepMind logo

Gopher

DeepMind

2021-12
280B
modelclosed

Dataset: https://lifearchitect.ai/whats-in-my-ai/

Google logo

GLaM

Google

2021-12
1200B
modelclosed

Anthropic logo

Anthropic-LM 52B

Anthropic

2021-12
52B
modelclosed

Internal research only

DeepMind logo

RETRO

DeepMind

2021-12
7.5B
modelclosed

with retrieval

AA

Luminous

Aleph Alpha

2021-11
200B
modelopen

Devs from EleutherAI

Microsoft logo

DeBERTaV3

Microsoft

2021-11
1.5B
modelopen

RoBERTa=162B token dataset.

Google logo

BERT-480

Google

2021-11
480B
modelclosed

Submission to benchmarks. Original dataset was BookCorpus + Wikipedia: https://arxiv.org/pdf/1810.04805.pdf

Google logo

BERT-200

Google

2021-11
200B
modelclosed

Submission to benchmarks. Original dataset was BookCorpus + Wikipedia: https://arxiv.org/pdf/1810.04805.pdf

C

Cedille FR-Boris

Coteries

2021-11
6B
modelopen

French only. GPT-J.

M

MT-NLG

Microsoft/NVIDIA

2021-10
530B
modelclosed

Google logo

FLAN

Google

2021-09
137B
modelclosed

Fine-tuned LaMDA

Cohere logo

Command xlarge

Cohere

2021-09
52.4B
modelopen

Stealth 'ebooks and webpages'. 52B: https://crfm.stanford.edu/helm/v1.0/?models=1

Baidu logo

PLATO-XL

Baidu

2021-09
11B
modelopen

Chatbot. Reddit comments + CN social

Allen AI logo

Macaw

Allen AI

2021-09
11B
modelpartial

Chatbot

OpenAI logo

Codex

OpenAI

2021-08
12B
modelopen

Code

A

Jurassic-1

AI21

2021-08
178B
modelopen

Emulated GPT-3 dataset

Meta AI logo

BlenderBot 2.0

Meta AI

2021-07
9.4B
modelopen

Chatbot

E

GPT-J

EleutherAI

2021-06
6B
modelopen

Popular

Google logo

LaMDA

Google

2021-06
137B
modelclosed

Chatbot

H

ruGPT-3

Huawei/Sberbank

2021-02
1.3B
modelopen

Russian GPT-3 with input from Huawei

Google logo

Switch

Google

2021-01
1600B
modelopen

4 models
OpenAI logo

GPT-3

OpenAI

2020-05
175B
modelopen

No RLHF (base only). Popular: 3.1M wpm. Dataset: https://lifearchitect.ai/whats-in-my-ai/

Meta AI logo

Megatron-11B

Meta AI

2020-04
11B
modelopen

My favourite model until GPT-3 and GPT-4 came along: https://github.com/facebookresearch/fairseq/blob/main/examples/megatron_11b/README.md

AE

Transformer++

American Express

2020-03
0.212B
modelclosed

Not to be confused with the more common usage of Transformer++, the ~2023 Transformer++ based on Llama. See Mamba paper.

Google logo

Meena

Google

2020-01
2.6B
modelclosed

Dialogue model. Trained on a 61B-token corpus for 164 epochs → ≈10T tokens seen!
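The arithmetic behind that exclamation mark:

# Meena tokens-seen check
corpus, epochs = 61e9, 164
print(corpus * epochs / 1e12)             # ~10T tokens processed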

4 models
Google logo

T5

Google

2019-10
11B
modelopen

"Text-to-Text Transfer Transformer". C4 + NLP language problems. "compared the following three configurations: First, the standard baseline model, which was pre-trained on 235 ≈ 34B tokens; second, the baseline trained instead for about 1 trillion tokens (i.e. the same amount of pre-training used for T5), which we refer to as “baseline-1T”; and third, T5-Base."

N

Megatron-LM

NVIDIA

2019-09
8.3B
modelopen

Meta AI logo

RoBERTa

Meta AI

2019-07
0.355B
modelopen

calcs: "In total, this batch size and number of steps corresponds to pre-training on 2^35 ≈ 34B tokens. This is considerably less than BERT (Devlin et al., 2018), which used roughly 137B tokens, or RoBERTa (Liu et al., 2019c), which used roughly 2.2T tokens. Using only 2^35 tokens results in a reasonable computational budget while still providing a sufficient amount of pre-training for acceptable performance. We consider the effect of pre-training for more steps in Sections 3.6 and 3.7. Note that 2^35 tokens only covers a fraction of the entire C4 data set, so we never repeat any data during pre-training." https://arxiv.org/pdf/1910.10683.pdf MMLU shows RoBERTa-base 125M only=27.9 (not 355M)

OpenAI logo

GPT-2

OpenAI

2019-02
1.5B
modelopen

WebText 10B token corpus × 4 epochs → 40B tokens processed. Reddit outbound only

3 models
Google logo

BERT

Google

2018-10
0.34B
modelopen

"BERT — 128 000 tokens per step × 1 000 000 steps → 128 B tokens processed"

OpenAI logo

GPT-1

OpenAI

2018-06
0.117B
modelopen

"GPT-1 — 984 M tokens corpus × 100 epochs × 1 token per word → 98.4 B tokens processed" Books only. "We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens." =3,276,800

F

ULMFiT

Fast.ai

2018-01
0.034B
modelopen

"ULMFiT — 103 M tokens corpus × 14 epochs → 1.44 B tokens processed" "Corpus size. WikiText-103 contains about 103 million word-level tokens. Training schedule. The reference pre-training run trains for 14 full epochs on that corpus. Total tokens seen. 103 M tokens × 14 epochs → roughly 1.44 billion token prediction steps." Aussie Prof Jeremy Howard: https://www.abc.net.au/news/science/2023-11-15/jeremy-howard-taught-ai-to-the-world-and-helped-invent-chatgpt/103092474

2 models
Google logo

Transformer (big)

Google

2017-06
0.213B
modelopen

"Transformer Big — 32 768 tokens per step × 300 000 steps → 9.83 B tokens processed" "We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs... For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens."

Google logo

Transformer (base)

Google

2017-06
0.065B
modelopen

"Transformer Base — 32 768 tokens per step × 100 000 steps → 3.28 B tokens processed" "We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs... For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens."