{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1bf0f654",
   "metadata": {},
   "source": [
    "# Custom Price Estimator\n",
    "\n",
    "This notebook mirrors the week 6 day 5 fine-tuning workflow and pushes it a little further, aiming to beat the $76 average-error target on the shared product dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b4a89e6",
   "metadata": {},
   "source": [
    "## Plan\n",
    "- Load the curated `Item` objects that we prepared earlier in week 6.\n",
    "- Create train/validation splits sized for a stronger fine-tune than the baseline.\n",
    "- Package the conversations in JSONL format and launch an OpenAI fine-tuning job.\n",
    "- Retrieve the tuned model, score it with the shared tester, and aim for < $76 average error."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8dc5b7b0",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "Pull in the packages, load API keys from `.env`, and make sure we can talk to both the OpenAI and Hugging Face services used elsewhere in the course."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6332b2b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "import pickle\n",
    "import random\n",
    "import re\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from dotenv import load_dotenv\n",
    "from huggingface_hub import login\n",
    "from openai import OpenAI\n",
    "\n",
    "from items import Item\n",
    "from testing import Tester\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14eb4e29",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load secrets from the .env file so the OpenAI client picks them up.\n",
    "# The placeholder fallbacks make a missing key obvious rather than silently empty.\n",
    "load_dotenv(override=True)\n",
    "os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'set-your-openai-key')\n",
    "os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'set-your-hf-token')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b07a6cab",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Log in to Hugging Face once per session (needed for the tokenizer used in Item).\n",
    "hf_token = os.environ['HF_TOKEN']\n",
    "if hf_token and hf_token != 'set-your-hf-token':\n",
    "    login(hf_token, add_to_git_credential=True)\n",
    "else:\n",
    "    print('⚠️ Provide a valid HF_TOKEN in your .env if you need to download tokenizer weights.')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "113d520b",
   "metadata": {},
   "outputs": [],
   "source": [
    "openai = OpenAI()\n",
    "%matplotlib inline\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04ae4263",
   "metadata": {},
   "source": [
    "## Load the Week 6 Dataset\n",
    "We reuse the curated pickled `Item` objects. If the pickle files are missing, circle back to the earlier data curation notebook to regenerate them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ca7ca03",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Avoid curating all the data again: load the pickled Item objects.\n",
    "with open('train_lite.pkl', 'rb') as file:\n",
    "    train = pickle.load(file)\n",
    "\n",
    "with open('test_lite.pkl', 'rb') as file:\n",
    "    test = pickle.load(file)\n",
    "\n",
    "len(train), len(test)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "35e6dde7",
   "metadata": {},
   "source": [
    "We widen the training split beyond the day 5 baseline to squeeze out better accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ea1ba91",
   "metadata": {},
   "outputs": [],
   "source": [
    "TRAIN_SIZE = 400\n",
    "VAL_SIZE = 100\n",
    "RANDOM_SEED = 42\n",
    "\n",
    "# Shuffle a copy so the curated ordering of `train` is left untouched.\n",
    "rng = random.Random(RANDOM_SEED)\n",
    "shuffled = train[:]\n",
    "rng.shuffle(shuffled)\n",
    "fine_tune_train = shuffled[:TRAIN_SIZE]\n",
    "fine_tune_validation = shuffled[TRAIN_SIZE:TRAIN_SIZE + VAL_SIZE]\n",
    "\n",
    "len(fine_tune_train), len(fine_tune_validation)\n"
   ]
},
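  {
   "cell_type": "markdown",
   "id": "3f9a1c2e",
   "metadata": {},
   "source": [
    "As a quick sanity check on the split, the sketch below plots the price distribution of the fine-tune training slice (assuming each `Item` exposes a numeric `price`, as the training-message code later in this notebook does)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c8d2e1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check: the fine-tune slice should roughly match the full set's price range.\n",
    "prices = [item.price for item in fine_tune_train]\n",
    "plt.hist(prices, bins=40, color='skyblue')\n",
    "plt.title(f'Prices in fine-tune training slice (n={len(prices)})')\n",
    "plt.xlabel('Price ($)')\n",
    "plt.ylabel('Count')\n",
    "plt.show()\n"
   ]
  },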
  {
   "cell_type": "markdown",
   "id": "4a1c67fa",
   "metadata": {},
   "source": [
    "## Step 1 — Build Training Conversations\n",
    "Frontier models handled the unaltered prompt, but for the fine-tune we keep the instruction tight and make the assistant answer just the numeric price."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "436b78b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "SYSTEM_MESSAGE = 'You are an ecommerce pricing assistant. Respond with the price only, no text before or after.'\n",
    "ASSISTANT_PREFIX = 'Price is $'\n",
    "\n",
    "def clean_user_prompt(item):\n",
    "    # Strip the rounding instruction and the pre-seeded answer prefix from the shared prompt.\n",
    "    prompt = item.test_prompt().replace(' to the nearest dollar', '')\n",
    "    return prompt.replace(ASSISTANT_PREFIX, '')\n",
    "\n",
    "def messages_for_training(item):\n",
    "    # Full conversation, including the ground-truth price, for the JSONL training file.\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
    "        {\"role\": \"user\", \"content\": clean_user_prompt(item)},\n",
    "        {\"role\": \"assistant\", \"content\": f'{ASSISTANT_PREFIX}{item.price:.2f}'}\n",
    "    ]\n",
    "\n",
    "def messages_for_inference(item):\n",
    "    # Same conversation, but the assistant turn stops at the prefix so the model completes the price.\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
    "        {\"role\": \"user\", \"content\": clean_user_prompt(item)},\n",
    "        {\"role\": \"assistant\", \"content\": ASSISTANT_PREFIX}\n",
    "    ]\n",
    "\n",
    "messages_for_training(fine_tune_train[0])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ecf456c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_jsonl(items):\n",
    "    # One JSON object per line, as the fine-tuning API expects.\n",
    "    lines = []\n",
    "    for item in items:\n",
    "        lines.append(json.dumps({\"messages\": messages_for_training(item)}))\n",
    "    return '\\n'.join(lines)\n",
    "\n",
    "def write_jsonl(items, filename):\n",
    "    payload = make_jsonl(items)\n",
    "    with open(filename, 'w') as f:\n",
    "        f.write(payload)\n",
    "\n",
    "write_jsonl(fine_tune_train, 'fine_tune_train.jsonl')\n",
    "write_jsonl(fine_tune_validation, 'fine_tune_validation.jsonl')\n",
    "\n",
    "Path('fine_tune_train.jsonl').stat().st_size, Path('fine_tune_validation.jsonl').stat().st_size\n"
   ]
},
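  {
   "cell_type": "markdown",
   "id": "8b2e6d4c",
   "metadata": {},
   "source": [
    "Before uploading, it is cheap to confirm the files are well-formed: the sketch below re-parses every line of the training JSONL and checks the expected three-message shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e7d3b5a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Validate the JSONL before uploading: every line must parse and hold 3 messages.\n",
    "with open('fine_tune_train.jsonl') as f:\n",
    "    rows = [json.loads(line) for line in f]\n",
    "assert len(rows) == TRAIN_SIZE\n",
    "assert all(len(row['messages']) == 3 for row in rows)\n",
    "print(f'{len(rows)} training rows look valid')\n"
   ]
  },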
  {
   "cell_type": "markdown",
   "id": "7dfde306",
   "metadata": {},
   "source": [
    "Upload the datasets so the fine-tuning job can consume them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c522928",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('fine_tune_train.jsonl', 'rb') as file:\n",
    "    train_file = openai.files.create(file=file, purpose='fine-tune')\n",
    "train_file\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3660112",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('fine_tune_validation.jsonl', 'rb') as file:\n",
    "    validation_file = openai.files.create(file=file, purpose='fine-tune')\n",
    "validation_file\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eaf47e1",
   "metadata": {},
   "source": [
    "## Step 2 — Launch the Fine-Tune\n",
    "Weights & Biases logging is optional but handy for tracking metrics over time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d758ba4b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: pass this dict via integrations=[wandb_integration] in jobs.create\n",
    "# if the W&B integration is enabled on your OpenAI account.\n",
    "wandb_integration = {\"type\": \"wandb\", \"wandb\": {\"project\": \"gpt-pricer\"}}\n",
    "train_file.id, validation_file.id\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7152b9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "fine_tune_job = openai.fine_tuning.jobs.create(\n",
    "    training_file=train_file.id,\n",
    "    validation_file=validation_file.id,\n",
    "    model='gpt-4o-mini-2024-07-18',\n",
    "    seed=RANDOM_SEED,\n",
    "    hyperparameters={\"n_epochs\": 2, \"learning_rate_multiplier\": 1.5},\n",
    "    suffix='emmy-pricer'\n",
    ")\n",
    "fine_tune_job\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd047075",
   "metadata": {},
   "outputs": [],
   "source": [
    "job_id = fine_tune_job.id\n",
    "job_id\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cd830d14",
   "metadata": {},
   "outputs": [],
   "source": [
    "openai.fine_tuning.jobs.retrieve(job_id)\n"
   ]
},
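  {
   "cell_type": "markdown",
   "id": "7e4b9a0d",
   "metadata": {},
   "source": [
    "Rather than re-running the retrieve cell by hand, a small polling loop (a sketch built on the same `retrieve` call used above) can wait until the job reaches a terminal state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b3c5d7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "# Poll every 30 seconds until the job reaches a terminal status.\n",
    "while True:\n",
    "    job = openai.fine_tuning.jobs.retrieve(job_id)\n",
    "    print(job.status)\n",
    "    if job.status in ('succeeded', 'failed', 'cancelled'):\n",
    "        break\n",
    "    time.sleep(30)\n"
   ]
  },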
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2b25992",
   "metadata": {},
   "outputs": [],
   "source": [
    "openai.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10).data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0328367",
   "metadata": {},
   "source": [
    "If you connected Weights & Biases under Settings → Integrations in the OpenAI dashboard, sync the run for richer charts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5995f1d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import wandb\n",
    "from wandb.integration.openai.fine_tuning import WandbLogger\n",
    "\n",
    "wandb.login()\n",
    "WandbLogger.sync(fine_tune_job_id=job_id, project='gpt-pricer')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7961d020",
   "metadata": {},
   "source": [
    "## Step 3 — Evaluate the Tuned Model\n",
    "Once the job is complete, grab the resulting model name and use the shared tester harness to verify we cleared the $76 average-error goal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7742bad2",
   "metadata": {},
   "outputs": [],
   "source": [
    "fine_tuned_model_name = openai.fine_tuning.jobs.retrieve(job_id).fine_tuned_model\n",
    "fine_tuned_model_name\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d18cc45",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_price(text):\n",
    "    # Pull the first number out of a reply such as 'Price is $42.99'.\n",
    "    cleaned = text.replace('$', '').replace(',', '').strip()\n",
    "    match = re.search(r'[-+]?\\d*\\.?\\d+', cleaned)\n",
    "    return float(match.group()) if match else 0.0\n",
    "\n",
    "def gpt_pricer(item):\n",
    "    # The assistant turn ends with 'Price is $', so the model only has to emit digits.\n",
    "    response = openai.chat.completions.create(\n",
    "        model=fine_tuned_model_name,\n",
    "        messages=messages_for_inference(item),\n",
    "        seed=RANDOM_SEED,\n",
    "        max_tokens=8\n",
    "    )\n",
    "    reply = response.choices[0].message.content\n",
    "    return get_price(reply)\n"
   ]
},
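  {
   "cell_type": "markdown",
   "id": "2d6e8f1b",
   "metadata": {},
   "source": [
    "Before spending tokens on the full test set, a few cheap assertions (a sketch against the `get_price` helper defined above) confirm the reply parser behaves as expected."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a1f4c9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Spot-check the parser on typical model replies.\n",
    "assert get_price('Price is $99.99') == 99.99\n",
    "assert get_price('$1,299.00') == 1299.0\n",
    "assert get_price('roughly 42 dollars') == 42.0\n",
    "assert get_price('no number here') == 0.0\n",
    "print('get_price checks passed')\n"
   ]
  },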
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a491e4b",
   "metadata": {},
   "outputs": [],
   "source": [
    "Tester.test(gpt_pricer, test)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llm-engineering (3.12.10)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
|