Added frontier model testing table compared to basic models and human evaluator

2025-03-09 17:58:56 +00:00
parent 3a2eb97cf2
commit 579d0bb906
1 changed files with 71 additions and 0 deletions
--- a/week6/community-contributions/week6_day4_frontier_model_testing.ipynb
+++ b/week6/community-contributions/week6_day4_frontier_model_testing.ipynb
@@ -0,0 +1,71 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "00f05a05-d989-4bf7-b1f1-9418e25ecd58",
+   "metadata": {},
+   "source": [
+    "# The Product Pricer Continued\n",
+    "\n",
+    "I tested frontier models from OpenAI, Anthropic, and Google, plus several others served through the Groq API.\n",
+    "\n",
+    "The table below shows the results of all tests, including those from Day 3, and how the frontier models compare.\n",
+    "\n",
+    "Rows are ordered by Error, from lowest (best) to highest (worst).\n",
+    "\n",
+    "I ran each model once on 2025-03-09.\n",
+    "\n",
+    "Main repo at [https://github.com/kellewic/llm](https://github.com/kellewic/llm)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a69cc81a-e582-4d04-8e12-fd83e120a7d1",
+   "metadata": {},
+   "source": [
+    "| Rank | Model                             | Error ($) | RMSLE | Hits (%) | Chart Link |\n",
+    "|------|-----------------------------------|-----------|-------|----------|------------|\n",
+    "| 1    | **gemini-2.0-flash**              | 73.48     | 0.56  | 56.4%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gemini-2.0-flash.png) |\n",
+    "| 2    | **gpt-4o-2024-08-06**             | 75.66     | 0.89  | 57.6%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gpt-4o-2024-08-06.png) |\n",
+    "| 3    | **gemini-2.0-flash-lite**         | 76.42     | 0.61  | 56.0%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gemini-2.0-flash-lite.png) |\n",
+    "| 4    | **gpt-4o-mini (original)**        | 81.61     | 0.60  | 51.6%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/gpt-4o-mini.png) |\n",
+    "| 5    | **claude-3-5-haiku-20241022**     | 85.25     | 0.62  | 50.8%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-5-haiku-20241022.png) |\n",
+    "| 6    | **claude-3-5-sonnet-20241022**    | 88.97     | 0.61  | 49.2%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-5-sonnet-20241022.png) |\n",
+    "| 7    | **claude-3-7-sonnet-20250219**    | 89.41     | 0.62  | 55.2%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/claude-3-7-sonnet-20250219.png) |\n",
+    "| 8    | **mistral-saba-24b**              | 98.02     | 0.82  | 44.8%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/mistral-saba-24b.png) |\n",
+    "| 9    | **llama-3.3-70b-versatile**       | 98.24     | 0.70  | 44.8%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/llama-3.3-70b-versatile.png) |\n",
+    "| 10   | **gpt-4o-mini (fine-tuned)**      | 101.49    | 0.81  | 41.2%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_tuning/gpt_fine_tuned.png) |\n",
+    "| 11   | **Random Forest Regressor**       | 105.10    | 0.89  | 37.6%    | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/random_forest_pricer.png) |\n",
+    "| 12   | **deepseek-r1-distill-llama-70b** | 109.09    | 0.67  | 48.4%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/deepseek-r1-distill-llama-70b.png) |\n",
+    "| 13   | **Linear SVR**                    | 110.91    | 0.92  | 29.2%    | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/svr_pricer.png) |\n",
+    "| 14   | **Word2Vec LR**                   | 113.14    | 1.05  | 22.8%    | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/word2vec_lr_pricer.png) |\n",
+    "| 15   | **Bag of Words LR**               | 113.60    | 0.99  | 24.8%    | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/bow_lr_pricer.png) |\n",
+    "| 16   | **Human Performance**             | 126.55    | 1.00  | 32.0%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/human_pricer.png) |\n",
+    "| 17   | **Average**                       | 137.17    | 1.19  | 15.2%    | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/average_pricer.png) |\n",
+    "| 18   | **Linear Regression**             | 139.20    | 1.17  | 15.6%    | [📊](https://github.com/kellewic/llm/blob/main/basic_model_training/linear_regression_pricer.png) |\n",
+    "| 19   | **deepseek-r1-distill-qwen-32b**  | 151.59    | 0.80  | 38.4%    | [📊](https://github.com/kellewic/llm/blob/main/frontier_model_test/deepseek-r1-distill-qwen-32b.png) |"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
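
The notebook reports Error ($), RMSLE, and Hits (%) without defining them. The sketch below shows one plausible way to compute such columns, assuming Error is the mean absolute error in dollars, RMSLE is the root mean squared logarithmic error, and a hit is a prediction within a tolerance of the true price. The function name `score` and the tolerances (`abs_tol=40.0`, `rel_tol=0.2`) are hypothetical and not taken from the notebook.

```python
import math


def score(predictions, truths, abs_tol=40.0, rel_tol=0.2):
    """Return (mean absolute error, RMSLE, hit rate in percent).

    abs_tol and rel_tol are assumed tolerances, not values from the notebook:
    a prediction counts as a hit if it is within abs_tol dollars of the true
    price or within rel_tol (as a fraction) of the true price.
    """
    if len(predictions) != len(truths) or not truths:
        raise ValueError("predictions and truths must be non-empty and equal in length")

    n = len(truths)
    abs_errors = [abs(p - t) for p, t in zip(predictions, truths)]

    # log1p(x) = log(1 + x); negative guesses are clamped to 0 so the log is defined.
    sq_log_errors = [
        (math.log1p(max(p, 0.0)) - math.log1p(max(t, 0.0))) ** 2
        for p, t in zip(predictions, truths)
    ]

    hits = sum(
        1
        for err, t in zip(abs_errors, truths)
        if err < abs_tol or err < rel_tol * t
    )

    mean_abs_error = sum(abs_errors) / n
    rmsle = math.sqrt(sum(sq_log_errors) / n)
    hit_rate = 100.0 * hits / n
    return mean_abs_error, rmsle, hit_rate
```

Under these assumptions, `score([100, 50, 10], [110, 100, 10])` counts two of the three guesses as hits: the first is off by $10, the third is exact, and the second misses both tolerances. Because RMSLE penalizes relative error rather than dollar error, a model can rank well on Error and poorly on RMSLE, as gpt-4o-2024-08-06 does in the table.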