
{
"cells": [
{
"cell_type": "markdown",
"id": "1bf0f654",
"metadata": {},
"source": [
"# Custom Price Estimator\n",
"\n",
"This notebook mirrors the week 6 day 5 fine-tuning workflow and pushes it a little further with the goal of beating the $76 average error target on the shared product dataset."
]
},
{
"cell_type": "markdown",
"id": "4b4a89e6",
"metadata": {},
"source": [
"## Plan\n",
"- Load the curated `Item` objects that we prepared earlier in week 6.\n",
"- Create train/validation splits sized for a stronger fine-tune than the baseline.\n",
"- Package the conversations in JSONL format and launch an OpenAI fine-tuning job.\n",
"- Retrieve the tuned model, score it with the shared tester, and aim for < $76 average error."
]
},
{
"cell_type": "markdown",
"id": "8dc5b7b0",
"metadata": {},
"source": [
"## Environment Setup\n",
"Pull in the packages, load API keys from `.env`, and make sure we can talk to both the OpenAI and Hugging Face services used elsewhere in the course."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6332b2b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import pickle\n",
"import random\n",
"import re\n",
"from pathlib import Path\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from dotenv import load_dotenv\n",
"from huggingface_hub import login\n",
"\n",
"from items import Item\n",
"from testing import Tester\n",
"from openai import OpenAI\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14eb4e29",
"metadata": {},
"outputs": [],
"source": [
"# Load secrets from the .env file so the OpenAI client picks them up.\n",
"load_dotenv(override=True)\n",
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'set-your-openai-key')\n",
"os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'set-your-hf-token')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b07a6cab",
"metadata": {},
"outputs": [],
"source": [
"# Log in to Hugging Face once per session (needed for the tokenizer used in Item).\n",
"hf_token = os.environ['HF_TOKEN']\n",
"if hf_token and hf_token != 'set-your-hf-token':\n",
"    login(hf_token, add_to_git_credential=True)\n",
"else:\n",
"    print('⚠️ Provide a valid HF_TOKEN in your .env if you need to download tokenizer weights.')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "113d520b",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"%matplotlib inline\n"
]
},
{
"cell_type": "markdown",
"id": "04ae4263",
"metadata": {},
"source": [
"## Load the Week 6 Dataset\n",
"We reuse the curated pickled `Item` objects. If the pickle files are missing, circle back to the earlier data curation notebook to regenerate them."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ca7ca03",
"metadata": {},
"outputs": [],
"source": [
"# Let's avoid curating all our data again! Load the pickle files:\n",
"with open('train_lite.pkl', 'rb') as file:\n",
"    train = pickle.load(file)\n",
"\n",
"with open('test_lite.pkl', 'rb') as file:\n",
"    test = pickle.load(file)\n",
"\n",
"len(train), len(test)\n"
]
},
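{
"cell_type": "markdown",
"id": "a3f1c201",
"metadata": {},
"source": [
"As a quick sanity check, peek at one loaded item. This is a sketch that only touches the `test_prompt()` and `price` attributes the rest of the notebook already relies on, and `train[0]` is an arbitrary pick."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3f1c202",
"metadata": {},
"outputs": [],
"source": [
"# Inspect one curated Item: its true price and the start of its prompt text.\n",
"sample = train[0]\n",
"print(f'True price: ${sample.price:.2f}')\n",
"print(sample.test_prompt()[:500])\n"
]
},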
{
"cell_type": "markdown",
"id": "35e6dde7",
"metadata": {},
"source": [
"We will widen the training split beyond the day 5 baseline to squeeze out better accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0ea1ba91",
"metadata": {},
"outputs": [],
"source": [
"TRAIN_SIZE = 400\n",
"VAL_SIZE = 100\n",
"RANDOM_SEED = 42\n",
"\n",
"rng = random.Random(RANDOM_SEED)\n",
"shuffled = train[:]\n",
"rng.shuffle(shuffled)\n",
"fine_tune_train = shuffled[:TRAIN_SIZE]\n",
"fine_tune_validation = shuffled[TRAIN_SIZE:TRAIN_SIZE+VAL_SIZE]\n",
"\n",
"len(fine_tune_train), len(fine_tune_validation)\n"
]
},
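{
"cell_type": "markdown",
"id": "b4d2e301",
"metadata": {},
"source": [
"`matplotlib` was imported up top but not used yet, so put it to work: a quick histogram of the training-split prices confirms the sample looks reasonable before we spend money on a fine-tune. A sketch only; the bin count is an arbitrary choice."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4d2e302",
"metadata": {},
"outputs": [],
"source": [
"# Price distribution of the fine-tune training split (sketch; 40 bins is arbitrary).\n",
"prices = [item.price for item in fine_tune_train]\n",
"plt.figure(figsize=(8, 4))\n",
"plt.hist(prices, bins=40, color='skyblue')\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.title(f'Training prices (n={len(prices)}, avg=${np.mean(prices):.2f})')\n",
"plt.show()\n"
]
},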
{
"cell_type": "markdown",
"id": "4a1c67fa",
"metadata": {},
"source": [
"## Step 1 — Build Training Conversations\n",
"Frontier models handled the unaltered prompt, but for the fine-tune we keep the instruction tight and make the assistant answer nothing more than the 'Price is $' prefix followed by the number. At inference time we send that same prefix as the start of the assistant turn, so the tuned model only has to continue with the digits."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "436b78b5",
"metadata": {},
"outputs": [],
"source": [
"SYSTEM_MESSAGE = 'You are an ecommerce pricing assistant. Respond with the price only, no text before or after.'\n",
"ASSISTANT_PREFIX = 'Price is $'\n",
"\n",
"def clean_user_prompt(item):\n",
"    prompt = item.test_prompt().replace(' to the nearest dollar', '')\n",
"    return prompt.replace(ASSISTANT_PREFIX, '')\n",
"\n",
"def messages_for_training(item):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
"        {\"role\": \"user\", \"content\": clean_user_prompt(item)},\n",
"        {\"role\": \"assistant\", \"content\": f'{ASSISTANT_PREFIX}{item.price:.2f}'}\n",
"    ]\n",
"\n",
"def messages_for_inference(item):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
"        {\"role\": \"user\", \"content\": clean_user_prompt(item)},\n",
"        {\"role\": \"assistant\", \"content\": ASSISTANT_PREFIX}\n",
"    ]\n",
"\n",
"messages_for_training(fine_tune_train[0])\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ecf456c2",
"metadata": {},
"outputs": [],
"source": [
"def make_jsonl(items):\n",
"    lines = []\n",
"    for item in items:\n",
"        lines.append(json.dumps({\"messages\": messages_for_training(item)}))\n",
"    return '\\n'.join(lines)\n",
"\n",
"def write_jsonl(items, filename):\n",
"    payload = make_jsonl(items)\n",
"    with open(filename, 'w') as f:\n",
"        f.write(payload)\n",
"\n",
"write_jsonl(fine_tune_train, 'fine_tune_train.jsonl')\n",
"write_jsonl(fine_tune_validation, 'fine_tune_validation.jsonl')\n",
"\n",
"Path('fine_tune_train.jsonl').stat().st_size, Path('fine_tune_validation.jsonl').stat().st_size\n"
]
},
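{
"cell_type": "markdown",
"id": "c5e3f401",
"metadata": {},
"source": [
"Before uploading, confirm the file really is one JSON object per line with the 3-turn `messages` shape the fine-tuning API expects. A minimal check over the files written above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5e3f402",
"metadata": {},
"outputs": [],
"source": [
"# Validate the JSONL: every line must parse and carry a 3-turn 'messages' list.\n",
"with open('fine_tune_train.jsonl') as f:\n",
"    records = [json.loads(line) for line in f]\n",
"assert all(len(r['messages']) == 3 for r in records), 'unexpected turn count'\n",
"print(f'{len(records)} records parse cleanly; first record:')\n",
"print(json.dumps(records[0], indent=2)[:400])\n"
]
},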
{
"cell_type": "markdown",
"id": "7dfde306",
"metadata": {},
"source": [
"Upload the datasets so the fine-tuning job can consume them."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c522928",
"metadata": {},
"outputs": [],
"source": [
"with open('fine_tune_train.jsonl', 'rb') as file:\n",
"    train_file = openai.files.create(file=file, purpose='fine-tune')\n",
"train_file\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3660112",
"metadata": {},
"outputs": [],
"source": [
"with open('fine_tune_validation.jsonl', 'rb') as file:\n",
"    validation_file = openai.files.create(file=file, purpose='fine-tune')\n",
"validation_file\n"
]
},
{
"cell_type": "markdown",
"id": "9eaf47e1",
"metadata": {},
"source": [
"## Step 2 — Launch the Fine-Tune\n",
"Weights & Biases logging is optional but handy for tracking metrics over time."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d758ba4b",
"metadata": {},
"outputs": [],
"source": [
"# W&B integration config passed to the fine-tuning job below; file ids shown for reference.\n",
"wandb_integration = {\"type\": \"wandb\", \"wandb\": {\"project\": \"gpt-pricer\"}}\n",
"train_file.id, validation_file.id\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7152b9b",
"metadata": {},
"outputs": [],
"source": [
"fine_tune_job = openai.fine_tuning.jobs.create(\n",
"    training_file=train_file.id,\n",
"    validation_file=validation_file.id,\n",
"    model='gpt-4o-mini-2024-07-18',\n",
"    seed=RANDOM_SEED,\n",
"    hyperparameters={\"n_epochs\": 2, \"learning_rate_multiplier\": 1.5},\n",
"    integrations=[wandb_integration],  # optional; remove if W&B is not connected in the OpenAI dashboard\n",
"    suffix='emmy-pricer'\n",
")\n",
"fine_tune_job\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd047075",
"metadata": {},
"outputs": [],
"source": [
"job_id = fine_tune_job.id\n",
"job_id\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd830d14",
"metadata": {},
"outputs": [],
"source": [
"openai.fine_tuning.jobs.retrieve(job_id)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2b25992",
"metadata": {},
"outputs": [],
"source": [
"openai.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10).data\n"
]
},
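{
"cell_type": "markdown",
"id": "d6f40501",
"metadata": {},
"source": [
"Rather than re-running the retrieve cell by hand, a small polling loop can wait for the job to reach a terminal state. A sketch: the 60-second interval is an arbitrary choice, and polling manually works just as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6f40502",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"# Poll until the job succeeds, fails, or is cancelled (sketch; 60s interval is arbitrary).\n",
"TERMINAL_STATUSES = {'succeeded', 'failed', 'cancelled'}\n",
"while True:\n",
"    job = openai.fine_tuning.jobs.retrieve(job_id)\n",
"    print(f'status: {job.status}')\n",
"    if job.status in TERMINAL_STATUSES:\n",
"        break\n",
"    time.sleep(60)\n"
]
},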
{
"cell_type": "markdown",
"id": "f0328367",
"metadata": {},
"source": [
"If you connected Weights & Biases under Settings → Integrations in the OpenAI dashboard, sync the run for richer charts."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5995f1d6",
"metadata": {},
"outputs": [],
"source": [
"import wandb\n",
"from wandb.integration.openai.fine_tuning import WandbLogger\n",
"\n",
"wandb.login()\n",
"WandbLogger.sync(fine_tune_job_id=job_id, project='gpt-pricer')\n"
]
},
{
"cell_type": "markdown",
"id": "7961d020",
"metadata": {},
"source": [
"## Step 3 — Evaluate the Tuned Model\n",
"Once the job is complete, grab the resulting model name and use the shared tester harness to verify we cleared the $76 average error goal."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7742bad2",
"metadata": {},
"outputs": [],
"source": [
"fine_tuned_model_name = openai.fine_tuning.jobs.retrieve(job_id).fine_tuned_model\n",
"assert fine_tuned_model_name, 'Job has not finished yet; re-run this cell once the status is succeeded.'\n",
"fine_tuned_model_name\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d18cc45",
"metadata": {},
"outputs": [],
"source": [
"def get_price(text):\n",
"    cleaned = text.replace('$', '').replace(',', '').strip()\n",
"    match = re.search(r'[-+]?\\d*\\.?\\d+', cleaned)\n",
"    return float(match.group()) if match else 0.0\n",
"\n",
"def gpt_pricer(item):\n",
"    response = openai.chat.completions.create(\n",
"        model=fine_tuned_model_name,\n",
"        messages=messages_for_inference(item),\n",
"        seed=RANDOM_SEED,\n",
"        max_tokens=8\n",
"    )\n",
"    reply = response.choices[0].message.content\n",
"    return get_price(reply)\n"
]
},
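{
"cell_type": "markdown",
"id": "e7a51601",
"metadata": {},
"source": [
"Before the full evaluation (which makes one API call per test item), run a cheap smoke test: confirm the parser handles typical replies and that a single live prediction lands in a sensible range. `test[0]` is an arbitrary pick."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7a51602",
"metadata": {},
"outputs": [],
"source": [
"# Smoke test the parser, then one live prediction, before the full (paid) run.\n",
"assert get_price('Price is $99.99') == 99.99\n",
"assert get_price('$1,234.50') == 1234.5\n",
"sample = test[0]\n",
"predicted = gpt_pricer(sample)\n",
"print(f'Predicted ${predicted:.2f} vs actual ${sample.price:.2f} (error ${abs(predicted - sample.price):.2f})')\n"
]
},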
{
"cell_type": "code",
"execution_count": null,
"id": "3a491e4b",
"metadata": {},
"outputs": [],
"source": [
"Tester.test(gpt_pricer, test)\n"
]
}
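,
{
"cell_type": "markdown",
"id": "f8b62701",
"metadata": {},
"source": [
"`Tester` prints its own summary; as a standalone cross-check of the headline number, re-score a small random sample with plain numpy. A sketch only: the sample size of 25 is arbitrary, and this re-spends a little API budget."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f8b62702",
"metadata": {},
"outputs": [],
"source": [
"# Independent average-error estimate on a random sample (sketch; 25 items is arbitrary).\n",
"sample_items = random.Random(RANDOM_SEED).sample(test, 25)\n",
"errors = [abs(gpt_pricer(item) - item.price) for item in sample_items]\n",
"print(f'Sample average error: ${np.mean(errors):.2f} (target: under $76)')\n"
]
}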
],
"metadata": {
"kernelspec": {
"display_name": "llm-engineering (3.12.10)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}