Merge pull request #247 from kellewic/community-contributions-kellewic

Added frontier model testing table compared to basic models and human…
Ed Donner
2025-03-15 08:18:58 -04:00
committed by GitHub

@@ -0,0 +1,71 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "00f05a05-d989-4bf7-b1f1-9418e25ecd58",
"metadata": {},
"source": [
"# The Product Pricer Continued\n",
"\n",
"I tested numerous frontier models from OpenAI, Anthropic, and Google, plus others served through the Groq API.\n",
"\n",
"Here are the results of all tests, including the ones from Day 3, showing how the frontier models stacked up against the basic models and human performance.\n",
"\n",
"Rows are ordered by Error, from best (lowest) to worst (highest).\n",
"\n",
"Each model was run once, on 2025-03-09, so every figure comes from a single run.\n",
"\n",
"The main repo is at [https://github.com/kellewic/llm](https://github.com/kellewic/llm)"
]
},
{
"cell_type": "markdown",
"id": "a69cc81a-e582-4d04-8e12-fd83e120a7d1",
"metadata": {},
"source": [
"| Rank | Model | Error ($) | RMSLE | Hits (%) | Chart Link |\n",
"|------|-----------------------------------|-----------|-------|----------|------------|\n",
"| 1 | **gemini-2.0-flash** | 73.48 | 0.56 | 56.4% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gemini-2.0-flash.png) |\n",
"| 2 | **gpt-4o-2024-08-06** | 75.66 | 0.89 | 57.6% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gpt-4o-2024-08-06.png) |\n",
"| 3 | **gemini-2.0-flash-lite** | 76.42 | 0.61 | 56.0% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gemini-2.0-flash-lite.png) |\n",
"| 4 | **gpt-4o-mini (original)** | 81.61 | 0.60 | 51.6% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gpt-4o-mini.png) |\n",
"| 5 | **claude-3-5-haiku-20241022** | 85.25 | 0.62 | 50.8% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-5-haiku-20241022.png) |\n",
"| 6 | **claude-3-5-sonnet-20241022** | 88.97 | 0.61 | 49.2% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-5-sonnet-20241022.png) |\n",
"| 7 | **claude-3-7-sonnet-20250219** | 89.41 | 0.62 | 55.2% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-7-sonnet-20250219.png) |\n",
"| 8 | **mistral-saba-24b** | 98.02 | 0.82 | 44.8% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/mistral-saba-24b.png) |\n",
"| 9 | **llama-3.3-70b-versatile** | 98.24 | 0.70 | 44.8% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/llama-3.3-70b-versatile.png) |\n",
"| 10 | **gpt-4o-mini (fine-tuned)** | 101.49 | 0.81 | 41.2% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_tuning/gpt_fine_tuned.png) |\n",
"| 11 | **Random Forest Regressor** | 105.10 | 0.89 | 37.6% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/random_forest_pricer.png) |\n",
"| 12 | **deepseek-r1-distill-llama-70b** | 109.09 | 0.67 | 48.4% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/deepseek-r1-distill-llama-70b.png) |\n",
"| 13 | **Linear SVR** | 110.91 | 0.92 | 29.2% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/svr_pricer.png) |\n",
"| 14 | **Word2Vec LR** | 113.14 | 1.05 | 22.8% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/word2vec_lr_pricer.png) |\n",
"| 15 | **Bag of Words LR** | 113.60 | 0.99 | 24.8% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/bow_lr_pricer.png) |\n",
"| 16 | **Human Performance** | 126.55 | 1.00 | 32.0% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/human_pricer.png) |\n",
"| 17 | **Average** | 137.17 | 1.19 | 15.2% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/average_pricer.png) |\n",
"| 18 | **Linear Regression** | 139.20 | 1.17 | 15.6% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/linear_regression_pricer.png) |\n",
"| 19 | **deepseek-r1-distill-qwen-32b** | 151.59 | 0.80 | 38.4% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/deepseek-r1-distill-qwen-32b.png) |"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
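
The notebook reports Error ($), RMSLE, and Hits (%) without defining them. The sketch below shows one plausible way to compute the three columns from a list of guesses and true prices. It is an assumption, not the author's code: Error is taken to be the mean absolute error in dollars, RMSLE the root mean squared difference of `log(1 + price)`, and a "hit" a guess within $40 or within 20% of the true price. The function name `score` and the `abs_tol` and `rel_tol` thresholds are hypothetical.

```python
import math


def score(guesses, truths, abs_tol=40.0, rel_tol=0.2):
    """Return (mean absolute error, RMSLE, hit rate in percent).

    Assumed definitions (not confirmed by the notebook):
      - Error: mean of |guess - truth| in dollars
      - RMSLE: sqrt(mean((log1p(guess) - log1p(truth))^2))
      - Hit:   |guess - truth| < abs_tol, or relative error < rel_tol
    """
    if len(guesses) != len(truths) or not truths:
        raise ValueError("guesses and truths must be non-empty and the same length")

    n = len(truths)
    abs_errors = []
    squared_log_errors = []
    hits = 0

    for guess, truth in zip(guesses, truths):
        guess = max(guess, 0.0)  # a negative price guess is clamped to zero
        error = abs(guess - truth)
        abs_errors.append(error)
        squared_log_errors.append((math.log1p(guess) - math.log1p(truth)) ** 2)
        # Guard the relative check so a zero true price cannot divide by zero
        if error < abs_tol or (truth > 0 and error / truth < rel_tol):
            hits += 1

    mean_abs_error = sum(abs_errors) / n
    rmsle = math.sqrt(sum(squared_log_errors) / n)
    hit_rate = 100.0 * hits / n
    return mean_abs_error, rmsle, hit_rate


if __name__ == "__main__":
    error, rmsle, hit_rate = score([100, 50, 300], [110, 100, 200])
    print(f"Error=${error:.2f} RMSLE={rmsle:.2f} Hits={hit_rate:.1f}%")
```

Under these definitions the columns can disagree, which fits the table: a model that is usually close but occasionally far off can show a good hit rate and RMSLE alongside a large dollar error.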