Added table of frontier model test results, compared against basic models and a human evaluator

Shay Harding
2025-03-09 17:58:56 +00:00
parent 3a2eb97cf2
commit 579d0bb906

@@ -0,0 +1,71 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "00f05a05-d989-4bf7-b1f1-9418e25ecd58",
"metadata": {},
"source": [
"# The Product Pricer Continued\n",
"\n",
"I tested numerous frontier models from OpenAI, Anthropic, and Google, plus others served through the Groq API.\n",
"\n",
"The table below lists the results of all tests, including the ones from Day 3, and shows how the frontier models stacked up.\n",
"\n",
"Rows are ordered by Error ($), from best (lowest) to worst (highest).\n",
"\n",
"I ran each model once on 2025-03-09.\n",
"\n",
"Main repo at [https://github.com/kellewic/llm](https://github.com/kellewic/llm)"
]
},
{
"cell_type": "markdown",
"id": "a69cc81a-e582-4d04-8e12-fd83e120a7d1",
"metadata": {},
"source": [
"| Rank | Model | Error ($) | RMSLE | Hits (%) | Chart Link |\n",
"|------|-----------------------------------|-----------|-------|----------|------------|\n",
"| 1 | **gemini-2.0-flash** | 73.48 | 0.56 | 56.4% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gemini-2.0-flash.png) |\n",
"| 2 | **gpt-4o-2024-08-06** | 75.66 | 0.89 | 57.6% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gpt-4o-2024-08-06.png) |\n",
"| 3 | **gemini-2.0-flash-lite** | 76.42 | 0.61 | 56.0% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gemini-2.0-flash-lite.png) |\n",
"| 4 | **gpt-4o-mini (original)** | 81.61 | 0.60 | 51.6% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gpt-4o-mini.png) |\n",
"| 5 | **claude-3-5-haiku-20241022** | 85.25 | 0.62 | 50.8% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-5-haiku-20241022.png) |\n",
"| 6 | **claude-3-5-sonnet-20241022** | 88.97 | 0.61 | 49.2% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-5-sonnet-20241022.png) |\n",
"| 7 | **claude-3-7-sonnet-20250219** | 89.41 | 0.62 | 55.2% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-7-sonnet-20250219.png) |\n",
"| 8 | **mistral-saba-24b** | 98.02 | 0.82 | 44.8% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/mistral-saba-24b.png) |\n",
"| 9 | **llama-3.3-70b-versatile** | 98.24 | 0.70 | 44.8% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/llama-3.3-70b-versatile.png) |\n",
"| 10 | **gpt-4o-mini (fine-tuned)** | 101.49 | 0.81 | 41.2% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_tuning/gpt_fine_tuned.png) |\n",
"| 11 | **Random Forest Regressor** | 105.10 | 0.89 | 37.6% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/random_forest_pricer.png) |\n",
"| 12 | **deepseek-r1-distill-llama-70b** | 109.09 | 0.67 | 48.4% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/deepseek-r1-distill-llama-70b.png) |\n",
"| 13 | **Linear SVR** | 110.91 | 0.92 | 29.2% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/svr_pricer.png) |\n",
"| 14 | **Word2Vec LR** | 113.14 | 1.05 | 22.8% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/word2vec_lr_pricer.png) |\n",
"| 15 | **Bag of Words LR** | 113.60 | 0.99 | 24.8% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/bow_lr_pricer.png) |\n",
"| 16 | **Human Performance** | 126.55 | 1.00 | 32.0% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/human_pricer.png) |\n",
"| 17 | **Average** | 137.17 | 1.19 | 15.2% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/average_pricer.png) |\n",
"| 18 | **Linear Regression** | 139.20 | 1.17 | 15.6% | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/linear_regression_pricer.png) |\n",
"| 19 | **deepseek-r1-distill-qwen-32b** | 151.59 | 0.80 | 38.4% | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/deepseek-r1-distill-qwen-32b.png) |"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}