{
"cells": [
{
"cell_type": "markdown",
"id": "40978455-23da-4159-bf08-15d9e8f79984",
"metadata": {},
"source": [
"# 🔍 Predicting Item Prices from Descriptions (Part 1)\n",
"A complete pipeline from raw text to fine-tuned frontier and open-source models\n",
"\n",
"---\n",
"In this project, we aim to **predict item prices based solely on their textual descriptions**.\n",
"\n",
"We approach the problem with a structured 8-part pipeline:\n",
"\n",
"- 🧩 **Part 1: Data Curation & Preprocessing** : We aggregate, clean, analyze, and balance the dataset, then export it in .pkl format and push it to the Hugging Face Hub for the next step: model training and evaluation.\n",
"\n",
"- ⚔️ **Part 2: Traditional ML vs Frontier LLMs** : We compare traditional machine learning models (LR, SVR, XGBoost) using vectorized text inputs (BoW, Word2Vec) against LLMs such as GPT-4o, LLaMA, and DeepSeek. ❗ Who will predict better: handcrafted features or massive pretraining?\n",
"\n",
"- 🧠 **Part 3: E5 Embeddings & RAG** : We compare XGBoost on **contextual dense embeddings** vs. Word2Vec, and test whether **RAG** boosts GPT-4o Mini’s price predictions. 📦 Do contextual embeddings and retrieval improve price prediction?\n",
"\n",
"- 🔧 **Part 4: Fine-Tuning GPT-4o Mini** : We fine-tune GPT-4o Mini on our curated dataset and compare performance before and after.\n",
"🤖 Can a fine-tuned GPT-4o Mini beat its own zero-shot performance?\n",
"\n",
"- 🦙 **Part 5: Evaluating LLaMA 3.1 8B Quantized** : We run LLaMA 3.1 (8B, quantized) using the same evaluation setup to see how well an open-source base model performs with no fine-tuning.\n",
"\n",
"- ⚙️ **Part 6: Fine-Tuning LLaMA 3.1 with QLoRA** : We fine-tune LLaMA 3.1 using QLoRA and explore key hyperparameters, tracking **training and validation loss** to monitor overfitting and select the best configuration.\n",
"\n",
"- 🧪 **Part 7: Evaluating Fine-Tuned LLaMA 3.1 8B (Quantized)** : After fine-tuning LLaMA 3.1, we evaluate its performance and see how it stacks up against the other models.\n",
"\n",
"- 🏆 **Part 8: Summary & Leaderboard** : Who comes out on top? Let’s find out. We wrap up with final model rankings and key insights across ML, embeddings, RAG, and fine-tuned frontier and open-source models.\n",
"\n",
"---\n",
"- ➡️ Data Curation & Preprocessing\n",
"- Model Benchmarking – Traditional ML vs LLMs\n",
"- E5 Embeddings & RAG\n",
"- Fine-Tuning GPT-4o Mini\n",
"- Evaluating LLaMA 3.1 8B Quantized\n",
"- Fine-Tuning LLaMA 3.1 with QLoRA\n",
"- Evaluating Fine-Tuned LLaMA\n",
"- Summary & Leaderboard\n",
"\n",
"---\n",
"\n",
"Let’s begin with Part 1.\n",
"\n",
"# 🧩 Part 1: Data Curation & Preprocessing\n",
"\n",
"- Tasks:\n",
"  - Load and filter the dataset, then prepare each datapoint\n",
"  - Explore, visualize, and balance the price distribution\n",
"  - Export to .pkl and upload to the HF Hub\n",
"- 🧑‍💻 Skill Level: Advanced\n",
"- ⚙️ Hardware: ✅ CPU is sufficient; no GPU required\n",
"- 🛠️ Requirements: 🔑 Hugging Face Token\n",
"\n",
"---\n",
"📢 Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dcf2f470",
"metadata": {},
"outputs": [],
"source": [
"!uv pip install transformers"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ddbb5eb0-9ab7-4675-b195-0bf4055b9320",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import sys\n",
"import random\n",
"import pickle\n",
"import importlib\n",
"from dotenv import load_dotenv\n",
"from huggingface_hub import login\n",
"from datasets import Dataset, DatasetDict\n",
"from collections import Counter, defaultdict\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa916b7a-9044-4461-b29a-815d47973e75",
"metadata": {},
"outputs": [],
"source": [
"# import datasets\n",
"# print(datasets.__version__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6cf6e19-1276-4b37-8f9b-6acf1473a7c6",
"metadata": {},
"outputs": [],
"source": [
"# environment\n",
"\n",
"load_dotenv(override=True)\n",
"hf_token = os.getenv('HF_TOKEN')\n",
"if not hf_token:\n",
"    # Stop here instead of calling login() with an empty token\n",
"    raise ValueError(\"❌ HF_TOKEN is missing; add it to your .env file\")\n",
"\n",
"login(hf_token, add_to_git_credential=True)"
]
},
{
"cell_type": "markdown",
"id": "a1637a14-b2df-4286-a8d6-ddae413f4a8a",
"metadata": {},
"source": [
"## ⚙️ Data Loading & Curation (Simultaneously)\n",
"We load and curate the data at the same time using `loaders.py` and `items.py`.\n",
"- Datasets come from: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories\n",
"- `loaders.py` handles parallel loading and filtering of products.\n",
"- `items.py` defines the `Item` class to clean, validate, and prepare each datapoint (title, description, price, ...) for modeling.\n",
"\n",
"\n",
"🛠️ Note: Data is filtered to include items priced between 1 and 999 USD.\n",
"\n",
"💡 Comments have been added in both files to clarify the processing logic.\n",
"\n",
"⚠️ Loading 2.8M+ items can take 40+ minutes on a regular laptop.\n",
"\n",
"⚠️ Set WORKER in `loaders.py` to match your system's capacity; too many workers may crash your machine."
]
},
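{
"cell_type": "markdown",
"id": "sketch-price-filter-note",
"metadata": {},
"source": [
"The price filter mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `loaders.py`: the names `MIN_PRICE`, `MAX_PRICE`, `parse_price`, and `in_price_range`, as well as the sample records, are hypothetical, and the real loader may parse prices differently."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-price-filter-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: keep only records priced between 1 and 999 USD.\n",
"# All names and sample data below are illustrative assumptions.\n",
"MIN_PRICE = 1.0\n",
"MAX_PRICE = 999.0\n",
"\n",
"\n",
"def parse_price(raw):\n",
"    \"\"\"Return the price as a float, or None if it cannot be parsed.\"\"\"\n",
"    try:\n",
"        return float(str(raw).replace('$', '').replace(',', ''))\n",
"    except ValueError:\n",
"        return None\n",
"\n",
"\n",
"def in_price_range(raw):\n",
"    \"\"\"True when the parsed price falls inside [MIN_PRICE, MAX_PRICE].\"\"\"\n",
"    price = parse_price(raw)\n",
"    return price is not None and MIN_PRICE <= price <= MAX_PRICE\n",
"\n",
"\n",
"sample_records = [\n",
"    {'title': 'USB cable', 'price': '7.99'},\n",
"    {'title': 'Studio monitor', 'price': '1,299.00'},\n",
"    {'title': 'Unknown item', 'price': 'None'},\n",
"]\n",
"kept = [r for r in sample_records if in_price_range(r['price'])]\n",
"print([r['title'] for r in kept])"
]
},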
{
"cell_type": "code",
"execution_count": null,
"id": "8b89273c-e02f-4c15-8394-5d948a266bfc",
"metadata": {},
"outputs": [],
"source": [
"sys.path.append('./helpers')\n",
"import helpers.items\n",
"import helpers.loaders\n",
"\n",
"importlib.reload(helpers.items)\n",
"importlib.reload(helpers.loaders)\n",
"\n",
"from helpers.items import Item # noqa: E402\n",
"from helpers.loaders import ItemLoader # noqa: E402"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "260a123b-8f34-4c66-bcac-1c3b25e95d7f",
"metadata": {},
"outputs": [],
"source": [
"dataset_names = [\n",
"    \"Automotive\",\n",
"    \"Electronics\",\n",
"    \"Office_Products\",\n",
"    \"Tools_and_Home_Improvement\",\n",
"    \"Cell_Phones_and_Accessories\",\n",
"    \"Toys_and_Games\",\n",
"    \"Appliances\",\n",
"    \"Musical_Instruments\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b482032-cba9-4ee9-9451-9b7dc9f41be6",
"metadata": {},
"outputs": [],
"source": [
"items = []\n",
"for dataset_name in dataset_names:\n",
"    loader = ItemLoader(dataset_name)\n",
"    items.extend(loader.load())\n",
"\n",
"# Now, time for a coffee break!!\n",
"# By the way, the larger datasets are listed first; it speeds up the process."
]
},
{
"cell_type": "markdown",
"id": "145d0648-e01d-46b9-ad42-f10b69fccbc3",
"metadata": {},
"source": [
"## 🔍 Inspecting a Sample Datapoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0185985d-5f67-4e4b-ac66-95b5b293231f",
"metadata": {},
"outputs": [],
"source": [
"print(f\"A grand total of {len(items):,} items\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b0c0ae8-c0ec-4f6f-b847-800da379c01b",
"metadata": {},
"outputs": [],
"source": [
"# Investigate the first item from the list\n",
"\n",
"datapoint = items[0]\n",
"\n",
"# Access various attributes\n",
"title = datapoint.title\n",
"details = datapoint.details\n",
"price = datapoint.price\n",
"category = datapoint.category\n",
"\n",
"print(f\"Datapoint: {datapoint}\")\n",
"print('*' * 40)\n",
"print(f\"Title: {title}\")\n",
"print('*' * 40)\n",
"print(f\"Details: {details}\")\n",
"print('*' * 40)\n",
"print(f\"Price: ${price}\")\n",
"print('*' * 40)\n",
"print(f\"Category: {category}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e05ed6e4-1cbc-46a4-be2f-4832b99e5ec3",
"metadata": {},
"outputs": [],
"source": [
"# The prompt that will be used during training\n",
"print(items[0].prompt)\n",
"print('*' * 40)\n",
"# The prompt that will be used during testing\n",
"print(items[0].test_prompt())"
]
},
{
"cell_type": "markdown",
"id": "f66e714d-2bae-458e-a0f6-1ce78d0696b3",
"metadata": {},
"source": [
"## 📊 Data Visualization"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd50ae2c-b34e-4be7-bd74-62055e4d5b2d",
"metadata": {},
"outputs": [],
"source": [
"# Set the default figure size; a bare plt.figure() would not carry over to later cells\n",
"plt.rcParams['figure.figsize'] = (15, 6)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c736b038-2dcd-40b9-8ae9-d17271f1ff81",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of token counts\n",
"\n",
"tokens = [item.token_count for item in items]\n",
"plt.title(f\"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\\n\")\n",
"plt.xlabel('Length (tokens)')\n",
"plt.ylabel('Count')\n",
"plt.hist(tokens, rwidth=0.7, color=\"blue\", bins=range(0, 300, 10))\n",
"plt.show()"
]
},
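{
"cell_type": "markdown",
"id": "sketch-token-coverage-note",
"metadata": {},
"source": [
"As a numeric complement to the histogram, one can check what share of items fits under a given token budget. The sketch below is illustrative only: `token_coverage`, `MAX_TOKENS`, and the sample counts are hypothetical, and the cutoff of 180 is an arbitrary example value rather than a setting taken from `items.py`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-token-coverage-code",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sketch: share of items at or below a token cutoff.\n",
"import numpy as np\n",
"\n",
"MAX_TOKENS = 180  # arbitrary example cutoff (assumption)\n",
"\n",
"\n",
"def token_coverage(token_counts, max_tokens):\n",
"    \"\"\"Fraction of items whose token count is at most max_tokens.\"\"\"\n",
"    counts = np.asarray(token_counts)\n",
"    return float((counts <= max_tokens).mean())\n",
"\n",
"\n",
"sample_tokens = [95, 140, 176, 178, 179, 210]  # made-up values\n",
"share = token_coverage(sample_tokens, MAX_TOKENS)\n",
"print(f\"{share:.0%} of the sample items fit within {MAX_TOKENS} tokens\")\n",
"\n",
"# With the real data: token_coverage([item.token_count for item in items], MAX_TOKENS)"
]
},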
|
||
{
|
||
"attachments": {
|
||
"image.png": {
|
||
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjYAAAHLCAYAAADbUtJvAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAEGISURBVHhe7d17XFVl3v//9+asIqikAoqglaZ5wmOYB9TCtBypscM05Snnl41ojXlPNzWlTQedybytu0mn8ZRa5phIjlmBinguNexgHrA8ZaB5AjVFkev3xy3r61oggija4vV8PNbj0f5cn703+2Ltvd9day30GGOMAAAAXMDLWQAAAPi1ItgAAADXINjAxuPxlGpbsWKF865FeDweJSQkOMsoxpIlSzR27Fhn+ar65ptv5PF45Ovrq6ysLOdwhZk0aZLuu+8+NWzYUB6PR7Gxsc4WSVJsbGyR/fDCLTs729Z/8uRJvfDCC2rcuLH8/f0VEhKi7t27KzMz09bntGLFiiKPfeE2bNgw512umbFjx8rj8TjLRcTGxqp58+bOchG7d++Wx+PRzJkznUOl4rkG7/m1a9dq7NixOnbsmHOoWD/++KOeeuopdevWTTVq1Ljo6y2ci4ttd911l61/x44d+u1vf6uaNWuqatWq6tixoxYtWmTrQcUg2MBm3bp1tq1Pnz6qUqVKkXqbNm2cd0U5LFmyRC+++KKzfFVNnTpVkpSfn69Zs2Y5hyvMlClTtGfPHvXo0UO1a9d2DlvefvvtIvvhsmXL5Ovrq9tuu02hoaFW74kTJxQbG6tp06ZpxIgRSklJ0YwZM9SxY0f98ssvtsd1atOmTZHnWbdunQYMGCBJuvfee513cY2wsDCtW7dOd999t3PourV27Vq9+OKLpQ42O3fu1HvvvSc/Pz/16dPHOWwpnAvn9swzz0iO/WD37t2KiYnR9u3bNWXKFM2fP1+1a9dWfHy8FixYcMGjokIYoAQDBw401apVc5ZLRZIZPny4s4xiDB8+3FTk2/H06dMmJCTEtGrVytSrV880btzY2VJhzp07Z/33rbfearp162YbL8nMmTONJDN16lRb/cknnzTVqlUz33//va1+uQoKCkyjRo1MZGSk7ee91saMGVOq/aZbt27m1ltvdZavuGvxnn/ttdeMJLNr1y7nULEu/P1t2LDBSDIzZsyw9ZQkNjbWVK1a1eTk5Fi1xx9/3AQEBJgff/zRquXn55umTZuaiIiI62qfqQxYsUGZHTlyRH/84x9Vr149+fn5qVGjRnruueeUl5fnbLUxxujZZ5+Vr6+v/vWvf1n1efPmKSYmRtWqVVNgYKB69eqljIwM230HDRqkwMBA7dy5U3369FFgYKAiIiL09NNPX/J5C73//vuKiYlRYGCgAgMD1bp1a02bNs3WM336dLVq1UoBAQGqVauW7r33Xm3dutXWExsbW+zhkkGDBikqKsq6XbiUPWHCBE2cOFENGzZUYGCgYmJitH79etv9/vGPf0iOQ4G7d++WJM2fP18dO3ZUcHCwqlatqkaNGmnIkCHW/S9HcnKyDh8+rKFDh2rgwIHasWOHVq9ebY3Hx8crMjJSBQUFtvtJUseOHW0rdseOHdNjjz2mWrVqKTAwUHfffbd++OEHeTyeUh1e8/K6/I+hadOmKTAwUA8++KBV++WXXzR16lTdf//9atSoka3/cqWlpemHH37Q4MGDL/nznj59Wk8//bRat26t4OBg1apVSzExMfroo4+crdahm9mzZ6tp06aqWrWqWrVqpcWLFztb9fHHH6t169by9/dXw4YNNWHCBGfLJW3YsEFdunSx9qPx48fbfscXOxT10UcfqWXLlvL391ejRo30xhtvlHgYrDSvJzMzUw8//LDq1Kkjf39/NW3a1HofFCooKNDLL7+sJk2aqEqVKqpRo4ZatmypN954Qzp/KO6//uu/JMk6lOm5xKHyS/3+SvL9998rPT1dDzzwgIKCgqz6mjVr1KpVK9WrV8+qeXt7q3fv3tq3b5+++OILq44K4Ew6wIWcKz
anTp0yLVu2NNWqVTMTJkwwKSkp5vnnnzc+Pj6mT58+tvte+H9vp0+fNg899JCpXr26+eSTT6yeV155xXg8HjNkyBCzePFik5SUZGJiYky1atXMli1brL6BAwcaPz8/07RpUzNhwgSzdOlS88ILLxiPx2NefPFFq+9inn/+eSPJ3HfffWb+/PkmJSXFTJw40Tz//PNWz6uvvmokmd/97nfm448/NrNmzTKNGjUywcHBZseOHVZft27dil1VGDhwoImMjLRu79q1y0gyUVFR5q677jLJyckmOTnZtGjRwtSsWdMcO3bMGGPMzp07Tf/+/Y0ks27dOms7ffq0Wbt2rfF4POahhx4yS5YsMcuXLzczZswwjz766AXP/H8/U1neznfeeafx9/c3R44cMTt37jQej8cMGjTIGv/oo4+MJJOammq739atW40k8+abbxpz/v9+O3fubAICAsz48eNNSkqKefHFF83NN99sJJkxY8bY7n8pZVmx2bFjh5Fkhg4daquvXLnSSDKvvPKKGTZsmKlRo4bx9fU1bdu2NYsXL7b1ltbDDz9svLy8zJ49e5xDRRw7dswMGjTIzJ492yxfvtx8+umnZvTo0cbLy8u8++67tt7C/aNDhw7m3//+t1myZImJjY01Pj4+ttWmpUuXGm9vb9O5c2eTlJRk5s+fb9q3b28aNGhQqt97t27dTEhIiLn55pvNlClTTGpqqvnjH/9oJNl+psJ99sIVjE8++cR4eXmZ2NhYs3DhQjN//nzTsWNHExUVVeS5S/t6tmzZYoKDg02LFi3MrFmzTEpKinn66aeNl5eXGTt2rNU3btw44+3tbcaMGWOWLVtmPv30UzNp0iSrZ9++fWbEiBFGkklKSrLeOxeuppSkrCs2zz77rJFkVq9ebas3btzYdO3a1VYzxpjExEQjyfzzn/90DuEquvQ7ApWaM9hMmTLFSDL//ve/bX1/+9vfjCSTkpJi1XQ+2Bw+fNh07tzZ1KtXz2zevNka37t3r/Hx8TEjRoywasYYc/z4cRMaGmoeeOABqzZw4MBin7dPnz6mSZMmtprTDz/8YLy9vc3vf/9755Dl6NGjpkqVKkXC2d69e42/v795+OGHrVpZg02LFi1Mfn6+Vf/iiy+MJDN37lyrdrFDURMmTDCSrBB0MT169DDe3t7OcrF2795tvLy8zEMPPWTVunXrZqpVq2Zyc3ONMcacPXvW1K1b1/a6jTHmz3/+s/Hz8zOHDh0yxhjz8ccfG0lm8uTJtr5x48YZXeVg88wzzxidD4MXmjt3rpFkgoKCzO23324WLVpkFi9ebLp37248Ho/59NNPbf2XcvToURMQEGB69erlHCqV/Px8c/bsWfPYY4+Z6Oho25gkU7duXWvejTEmOzvbeHl5mXHjxlm1jh07mvDwcHPq1Cmrlpuba2rVqlXsfuNUGHw///xzW71Zs2a211VcsGnfvr2JiIgweXl5Vu348eMmJCSkyHOX9vX06tXL1K9fv0gASUhIMAEBAebIkSPGGGPuuece07p1a1uPU1kPRV2oLMEmPz/f1KtXz9xyyy3OIRMfH29q1Khhjh8/bqt36dLFSDKvvvqqrY6r6/LX5FApLV++XNWqVVP//v1t9UGDBkmSli1bZqvv2rVLMTExys3N1fr169WqVStr7LPPPlN+fr4GDBig/Px8awsICFC3bt2KLCd7PB717dvXVmvZsqX27Nljqzmlpqbq3LlzGj58uHPIsm7dOp06dcp6HYUiIiLUo0ePIq+rLO6++255e3tbt1u2bClJl/y5Jal9+/aSpAceeED//ve/tX//fmeLdH7e8/PzneVizZgxQwUFBbbDWUOGDNHJkyc1b948SZKPj48eeeQRJSUlKScnR5J07tw5zZ49W/369VNISIgkKT09XTr/813od7/7ne32lZafn693331Xt956q2677TbbWOGhFT8/P33yySfq27ev7r77bi1evFhhYWF66aWXbP2X8t577+
n06dMaOnSoc+ii5s+fr9tvv12BgYHy8fGRr6+vpk2bVuSwpiR1795d1atXt27XrVtXderUsfaPkydPasOGDbrvvvsUEBBg9VWvXr3I+6EkoaGh6tChg612qffPyZMntXHjRsXHx8vPz8+qBwYGXvS5L/V6Tp8+rWXLlunee+9V1apVbe/9Pn366PTp09ah2g4dOuirr77SH//4R3322WfKzc21Hreiffrpp9q/f78ee+wx55ASEhKUk5OjAQMG6IcfftCBAwf0/PPPa+3atVI5D3+h7JhtlMnhw4cVGhpa5Nh6nTp15OPjo8OHD9vqX3zxhXbs2KEHH3xQ9evXt40dOHBAOv/l7evra9vmzZunQ4cO2fqrVq1q+2CXJH9/f50+fdpWc/r5558lqcjzX6jw5w4LC3MOKTw8vMjrKovCEFDI399fknTq1ClbvThdu3ZVcnKyFQDr16+v5s2ba+7cuc7WUikoKNDMmTMVHh6utm3b6tixYzp27JjuuOMOVatWzXbO0ZAhQ3T69Gl98MEH0vkgmpWVpcGDB1s9hw8flo+Pj2rVqmXVdP7L7GpasmSJsrOziw0bhfPdqVMn2xds1apV1a1bN3355ZcXdF/atGnTVLt2bfXr1885VKykpCQ98MADqlevnubMmaN169Zpw4YN1nw6OfcPnd9HCvePo0ePqqCgwHbVV6HiahdzqecpztGjR2WMKfb3WVxNpXiew4cPKz8/X//7v/9b5H1feJVS4Xs/MTFREyZM0Pr169W7d2+FhISoZ8+e2rhxo+3xK8K0adPk6+trXR13oZ49e2rGjBlauXKlbrzxRoWGhiopKckK0Reee4Orj2CDMgkJCdGBAwfk/Jc4Dh48qPz8fN1www22+oMPPqiXXnpJzz33nF5++WXbWGHvhx9+qA0bNhTZPv/8c1v/5Sq8hPjHH390DlkKP4yL+3suP/30k+11BQQEFHvCsjOIXSn9+vXTsmXLlJOToxUrVqh+/fp6+OGHtW7dOmfrJS1dulR79uzRTz/9pJCQENWsWVM1a9ZUvXr1dPLkSa1fv17fffedJKlZs2bq0KGDZsyYIZ1f6QkPD1dcXJz1eCEhIcrPz9eRI0esmqQif1PmSps2bZr8/Pz06KOPOoesFbHiGGPK9H/PGRkZysjI0IABA+Tr6+scLtacOXPUsGFDzZs3T/Hx8brtttvUrl27YveZ0qhZs6Y8xfydHlXAPBc+d+H/hFzocp+7Zs2a8vb21qBBg4q85wu3woDj4+OjUaNG6csvv9SRI0c0d+5c7du3T7169brkZftX0sGDB7V48WL95je/UZ06dZzDkqSBAwcqOztb3333nTIzM7Vlyxbp/Epzly5dnO24ikr/DgfO/5/JiRMnlJycbKsX/h2Unj172uqS9Je//EWTJk3SCy+8oMTERKveq1cv+fj46Pvvv1e7du2K3a6EuLg4eXt7a/Lkyc4hS0xMjKpUqaI5c+bY6j/++KOWL19ue11RUVHasWOH7Yvq8OHD1rLz5SjNKo6/v7+6deumv/3tb9L5L92ymjZtmry8vJScnKy0tDTbNnv2bOn8lWGFBg8erM8//1yrV6/Wf/7zHw0cONB2WK1bt27S+SvbLlS4ynM1ZGdna8mSJYqPjy92dSAsLEwxMTFas2aN7dDFL7/8ovT09CKHrkpSuIJV3OGHi/F4PPLz87OtamZnZxd7VVRpVKtWTR06dFBSUpJtxef48eP6z3/+Y+u90qpVq6Z27dopOTlZZ86cseonTpwo9kqn0qhataq6d++ujIwMtWzZssh7vl27dsX+XmvUqKH+/ftr+PDhOnLkiHXVYGneO+U1a9YsnT179pL7gY+Pj5o2baqbbrpJOTk5euedd9SvXz9FRkY6W3E1OU+6AS7kPHm48Kqo6tWrm4kTJ5rU1FQzZswY4+vrW+TEWzn+psXUqVONl5eXSUhIMAUFBcacvxLJx8fHPP7442bhwoVmxYoVZt68eebpp5
82L7zwgnVf589RqLR/x6Pwqqj+/fubBQsWmKVLl5o333zT9hyFV0U9+uijZsmSJWb27NnmpptuKnJV1OrVq63H+uyzz8z7779vWrdubSIjI4s9efi1116zaoXkOLF2xowZVm39+vVmw4YNJi8vzzz//PNm8ODBZs6cOWbFihUmOTnZdO/e3fj6+ppvv/3Wun9pTh4+dOiQ8ff3N71793YOWdq0aWNq165tzpw5Y8z5K3yqVKli6tevbySZ7du32/rPnTtnbr/9dlOlShUzfvx4k5qaav7617+am266yUgq1RVrGzZsMPPnzzfz5883ERERplmzZtbt3bt3O9vN+PHjjRwnqjutWbPG+Pn5mdtuu80sXLjQJCcnmy5duhhfX1+zdu1aq2/37t3G29vbDBkyxHZ/c35fr1mzpunUqZNzqETTp083kswTTzxhli1bZmbOnGluvPFG60qxCznfI4UiIyPNwIEDrdspKSnGy8vLdO7c2SxcuNB8+OGH1km9zscszsX+js3FTngv6aqoDz/80HTs2NFERkYaj8dj9ZkyvJ4tW7aYmjVrmg4dOpgZM2aYtLQ0s2jRIjNx4kTTvXt3q++ee+4x//3f/20+/PBDk56ebmbNmmWioqJMZGSktY+mpaUZSebxxx83a9euNRs2bLCdvFycwv2r8KKH4cOHW7Xi3HLLLSX+PZoDBw6YP//5z+ajjz4yy5cvN2+//baJiooyjRo1Mvv373e24yq79DsClVpxgeLw4cNm2LBhJiwszPj4+JjIyEiTmJhoTp8+besr7kNu7ty5xsfHxwwePNj6kCj8sg4KCjL+/v4mMjLS9O/f3yxdutS6X3E/hylDsDHGmFmzZpn27dubgIAAExgYaKKjo4tcDTF16lTTsmVL4+fnZ4KDg02/fv1sl50Xevfdd03Tpk1NQECAadasmZk3b95FvyRKE2zy8vLM0KFDTe3atY3H4zE6f5XH4sWLTe/evU29evWMn5+fqVOnjunTp49ZtWqV7fFKc7n3pEmTjCSTnJzsHLIUXvW2YMECq/bwww8bSeb222+39RY6cuSIGTx4sKlRo4apWrWqufPOO8369euNJPPGG28424sovOKtuM35+zHnL62NioqywvHFrFq1ynTr1s1UrVrVVK1a1fTo0cOsWbPG1lP4O7rwS7fQe++9ZySZ6dOnO4cuafz48SYqKsr4+/ubpk2bmn/961/F7qsq5j1iigkCxhizaNEia99s0KCBGT9+fLGPWZzyBBtjjFm4cKFp0aKF7blHjhxpatasaesry+vZtWuXGTJkiKlXr57x9fU1tWvXNp06dTIvv/yy1fP666+bTp06mRtuuMF67scee6xI4E1MTDTh4eHGy8vLSDJpaWm2cSfnfnbh5rRmzRojyfY/QU6HDx82cXFxpnbt2sbX19c0aNDAjBgxwvz888/OVlQAj3GeLAEA5fT+++/r97//vdasWaNOnTo5h/Erd/bsWbVu3Vr16tVTSkqKcxi4pgg2AMpl7ty52r9/v1q0aCEvLy+tX79er732mqKjo63LwfHr9thjj+nOO+9UWFiYsrOzNWXKFKWnpyslJUV33HGHsx24pgg2AMpl8eLFGjt2rHbu3KmTJ08qLCxM8fHxevnll21/dh6/Xg888IDWrl2rn3/+Wb6+vmrTpo2effbZIv/CNXA9INgAAADX4HJvAADgGgQbAADgGgQbAADgGgQbAADgGgQbAADgGgQbAADgGgQbAADgGgQbAADgGpUi2KxcuVJ9+/ZVeHi4PB6PkpOTnS2XZIzRhAkT1LhxY/n7+ysiIkKvvvqqsw0AAFxDlSLYnDx5Uq1atdJbb73lHCq1J598UlOnTtWECRO0bds2/ec//1GHDh2cbQAA4BqqdP+kgsfj0cKFCxUfH2/Vzpw5o7/85S967733dOzYMTVv3lx/+9vfFBsbK0naunWrWrZsqW+//VZNmjS54NEAAMD1pFKs2FzK4MGDtWbNGn
3wwQf6+uuvdf/99+uuu+5SZmamJOk///mPGjVqpMWLF6thw4aKiorS0KFDdeTIEedDAQCAa6jSB5vvv/9ec+fO1fz589WlSxfdeOONGj16tDp37qwZM2ZIkn744Qft2bNH8+fP16xZszRz5kxt2rRJ/fv3dz4cAAC4hip9sPnyyy9ljFHjxo0VGBhobenp6fr+++8lSQUFBcrLy9OsWbPUpUsXxcbGatq0aUpLS9P27dudDwkAAK6RSh9sCgoK5O3trU2bNmnz5s3WtnXrVr3xxhuSpLCwMPn4+Khx48bW/Zo2bSpJ2rt3r1UDAADXVqUPNtHR0Tp37pwOHjyom266ybaFhoZKkm6//Xbl5+dbKziStGPHDklSZGSkVQMAANdWpbgq6sSJE9q5c6d0PshMnDhR3bt3V61atdSgQQM98sgjWrNmjV5//XVFR0fr0KFDWr58uVq0aKE+ffqooKBA7du3V2BgoCZNmqSCggINHz5cQUFBSklJcT4dAAC4RipFsFmxYoW6d+/uLGvgwIGaOXOmzp49q5dfflmzZs3S/v37FRISopiYGL344otq0aKFJOmnn37SiBEjlJKSomrVqql37956/fXXVatWLefDAgCAa6RSBBsAAFA5VPpzbAAAgHu4dsWmoKBAP/30k6pXry6Px+McBgAA1yFjjI4fP67w8HB5eZV9/cW1webHH39URESEswwAAH4F9u3bp/r16zvLl+TaYJOTk6MaNWpo3759CgoKcg4DAIDrUG5uriIiInTs2DEFBwc7hy/JtcEmNzdXwcHBysnJIdgAAPArUd7v77IfvAIAALhOEWwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwAAIBrEGwA4CryeEq/ASg/gg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHANgg0AAHCNCgk248aNU/v27VW9enXVqVNH8fHx2r59u7PNZsWKFfJ4PEW2bdu2OVsBAACkigo26enpGj58uNavX6/U1FTl5+crLi5OJ0+edLYWsX37dmVlZVnbzTff7GwBAACQJHmMMcZZvNp+/vln1alTR+np6eratatzWDq/YtO9e3cdPXpUNWrUcA5fUm5uroKDg5WTk6OgoCDnMABUCI/HWbm4iv80Bq4/5f3+rpAVG6ecnBxJUq1atZxDRURHRyssLEw9e/ZUWlqac9iSl5en3Nxc2wYAACqXCg82xhiNGjVKnTt3VvPmzZ3DlrCwML3zzjtasGCBkpKS1KRJE/Xs2VMrV650tkrnz+MJDg62toiICGcLAABwuQo/FDV8+HB9/PHHWr16terXr+8cLlHfvn3l8Xi0aNEi55Dy8vKUl5dn3c7NzVVERMRlL2UBwJXAoSigbH5Vh6JGjBihRYsWKS0trcyhRpJuu+02ZWZmOsuSJH9/fwUFBdk2AABQuVRIsDHGKCEhQUlJSVq+fLkaNmzobCmVjIwMhYWFOcsAAABSRQWb4cOHa86cOXr//fdVvXp1ZWdnKzs7W6dOnbJ6EhMTNWDAAOv2pEmTlJycrMzMTG3ZskWJiYlasGCBEhISrB4AAIALVUiwmTx5snJychQbG6uwsDBrmzdvntWTlZWlvXv3WrfPnDmj0aNHq2XLlurSpYtWr16tjz/+WPfdd5/VAwAAcKEKP3m4opT35CMAuBI4eRgom/J+f1fIig0AAEBFINgAAADXINgAAADXINgAAA
DXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXINgAAADXqJBgM27cOLVv317Vq1dXnTp1FB8fr+3btzvbikhPT1fbtm0VEBCgRo0aacqUKc4WAAAAS4UEm/T0dA0fPlzr169Xamqq8vPzFRcXp5MnTzpbLbt27VKfPn3UpUsXZWRk6Nlnn9XIkSO1YMECZysAAIAkyWOMMc7i1fbzzz+rTp06Sk9PV9euXZ3DkqRnnnlGixYt0tatW63asGHD9NVXX2ndunW23uLk5uYqODhYOTk5CgoKcg4DQIXweJyVi6v4T2Pg+lPe7+8KWbFxysnJkSTVqlXLOWRZt26d4uLibLVevXpp48aNOnv2rK0uSXl5ecrNzbVtAACgcqnwYGOM0ahRo9S5c2c1b97cOWzJzs5W3bp1bbW6desqPz9fhw4dstV1/jye4OBga4uIiHC2AAAAl6vwYJOQkKCvv/5ac+fOdQ4V4XGs4RYeNXPWJSkxMVE5OTnWtm/fPmcLAABwuQoNNiNGjNCiRYuUlpam+vXrO4dtQkNDlZ2dbasdPHhQPj4+CgkJsdUlyd/fX0FBQbYNAABULhUSbIwxSkhIUFJSkpYvX66GDRs6W4qIiYlRamqqrZaSkqJ27drJ19fXVgcAAFBFBZvhw4drzpw5ev/991W9enVlZ2crOztbp06dsnoSExM1YMAA6/awYcO0Z88ejRo1Slu3btX06dM1bdo0jR492uoBAAC4UIUEm8mTJysnJ0exsbEKCwuztnnz5lk9WVlZ2rt3r3W7YcOGWrJkiVasWKHWrVvrpZde0ptvvqnf/va3Vg8AAMCFrsnfsakI5b0OHgCuhGKudbgod34aA2VT3u/vClmxAQAAqAgEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BoEGwAA4BolBptGjRrp8OHDzrKOHTumRo0aOcsAAADXVInBZvfu3Tp37pyzrLy8PO3fv99ZBgAAuKY8xhjjLC5atEiSFB8fr3fffVfBwcHW2Llz57Rs2TKlpqZq+/btF9zr+pKbm6vg4GDl5OQoKCjIOQwAFcLjcVYuruinMVD5lPf7u9hg4+X1fws5Ho9HzmFfX19FRUXp9ddf1z333GMbu56Ud2IA4Eog2ABlU97v72IPRRUUFKigoEANGjTQwYMHrdsFBQXKy8vT9u3br+tQAwAAKqdig02hXbt26YYbbnCWAQAArkvFHoq60LJly7Rs2TJr5eZC06dPt92+npR3KQsArgQORQFlU97v7xJXbF588UXFxcVp2bJlOnTokI4ePWrbAAAAriclrtiEhYXp73//ux599FHn0HWvvIkPAK4EVmyAsinv93eJKzZnzpxRp06dnGUAAIDrUo
nBZujQoXr//fed5TJbuXKl+vbtq/DwcHk8HiUnJztbbFasWCGPx1Nk27Ztm7MVAADAUmKwOX36tCZOnKhu3bppxIgRGjVqlG0rrZMnT6pVq1Z66623nEMl2r59u7Kysqzt5ptvdrYAAABYSjzHpnv37s6SxePxaPny5c7yJXk8Hi1cuFDx8fHOIcuKFSvUvXt3HT16VDVq1HAOl0p5j9EBwJXAOTZA2ZT3+7vEYHM1lCXYREVF6fTp02rWrJn+8pe/lBi08vLylJeXZ93Ozc1VRETEZU8MAFwJBBugbMobbEo8FHWthIWF6Z133tGCBQuUlJSkJk2aqGfPnlq5cqWz1TJu3DgFBwdbW0REhLMFAAC4XIkrNt27d5enhP/duFqHoorTt29feTwe6x/odGLFBsD1qISP0CIu/mkMVB5XdcWmdevWatWqlbU1a9ZMZ86c0ZdffqkWLVo426+q2267TZmZmc6yxd/fX0FBQbYNAABULiWu2FzM2LFjdeLECU2YMME5dEmXu2LTv39/HTlypNSrROVNfABwJbBiA5RNeb+/S1yxuZhHHnmkTP9O1IkTJ7R582Zt3rxZOv+Pa27evFl79+6VJCUmJmrAgAFW/6RJk5ScnKzMzExt2bJFiYmJWrBggRISEqweAAAAp8sKNuvWrVNAQICzfFEbN25UdHS0oqOjJUmjRo1SdHS0XnjhBUlSVlaWFXJ0/i8ejx49Wi1btlSXLl20evVqffzxx7rvvvusHgAAAKcSD0U5g4QxRllZWdq4caOef/55jRkzxjZ+PSnvUhYAXAkcigLKprzf3yWu2Fx4+XRwcLBq1aql2NhYLVmy5LoONQAAoHIqccXm16y8iQ8ArgRWbICyKe/3d4krNoU2bdqkOXPm6L333lNGRoZzGAAA4LpQYrA5ePCgevToofbt22vkyJFKSEhQ27Zt1bNnT/3888/OdgAAgGuqxGAzYsQI5ebmasuWLTpy5IiOHj2qb7/9Vrm5uRo5cqSzHQAA4Joq8Ryb4OBgLV26VO3bt7fVv/jiC8XFxenYsWO2+vWkvMfoAOBK4BwboGzK+/1d4opNQUGBfH19nWX5+vqqoKDAWQYAALimSgw2PXr00JNPPqmffvrJqu3fv19/+tOf1LNnT1svAADAtVZisHnrrbd0/PhxRUVF6cYbb9RNN92khg0b6vjx4/rf//1fZzsAAMA1VeI5NoVSU1O1bds2GWPUrFkz3XHHHc6W6055j9EBwJXAOTZA2ZT3+7vYFZvly5erWbNmys3NlSTdeeedGjFihEaOHKn27dvr1ltv1apVq5x3AwAAuKaKDTaTJk3SH/7wh2KTUnBwsB5//HFNnDjROQQAAHBNFRtsvvrqK911113OsiUuLk6bNm1ylgEAAK6pYoPNgQMHir3Mu5CPjw9/eRgAAFx3ig029erV0zfffOMsW77++muFhYU5ywAAANdUscGmT58+euGFF3T69GnnkE6dOqUxY8bonnvucQ4BAABcU8Ve7n3gwAG1adNG3t7eSkhIUJMmTeTxeLR161b94x//0Llz5/Tll1+qbt26zrteN8p7uRgAXAlc7g2UTXm/v4sNNpK0Z88ePfHEE/rss89U2OLxeNSrVy+9/fbbioqKct7lulLeiQGAK4FgA5RNeb+/LxpsCh09elQ7d+6UMUY333yzatas6Wy5LpV3YgDgSiDYAGVT3u/vSwabX6vyTgwAXAkEG6Bsyvv9XezJwwAAAL9GBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaBBsAAOAaFR
JsVq5cqb59+yo8PFwej0fJycnOliLS09PVtm1bBQQEqFGjRpoyZYqzBQAAwKZCgs3JkyfVqlUrvfXWW86hYu3atUt9+vRRly5dlJGRoWeffVYjR47UggULnK0AAAAWjzHGOItXk8fj0cKFCxUfH+8csjzzzDNatGiRtm7datWGDRumr776SuvWrbP1Xkxubq6Cg4OVk5OjoKAg5zAAVAiPx1m5uIr9NAauT+X9/q6QFZuyWrduneLi4my1Xr16aePGjTp79qytXigvL0+5ubm2DQAAVC7XZbDJzs5W3bp1bbW6desqPz9fhw4dstULjRs3TsHBwdYWERHhbAEAAC53XQYbnT9kdaHCI2bOeqHExETl5ORY2759+5wtAADA5a7LYBMaGqrs7Gxb7eDBg/Lx8VFISIitXsjf319BQUG2DQAAVC7XZbCJiYlRamqqrZaSkqJ27drJ19fXVgcAAChUIcHmxIkT2rx5szZv3iydv5x78+bN2rt3r3T+MNKAAQOs/mHDhmnPnj0aNWqUtm7dqunTp2vatGkaPXq01QMAAOBUIcFm48aNio6OVnR0tCRp1KhRio6O1gsvvCBJysrKskKOJDVs2FBLlizRihUr1Lp1a7300kt688039dvf/tbqAQAAcKrwv2NTUcp7HTwAXAkXud6hWO78NAbKprzf3xWyYgMAAFARCDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1CDYAAMA1KizYvP3222rYsKECAgLUtm1brVq1ytliWbFihTweT5Ft27ZtzlYAAABLhQSbefPm6amnntJzzz2njIwMdenSRb1799bevXudrTbbt29XVlaWtd18883OFgAAAEuFBJuJEyfqscce09ChQ9W0aVNNmjRJERERmjx5srPVpk6dOgoNDbU2b29vZwsAAIDlqgebM2fOaNOmTYqLi7PV4+LitHbtWlvNKTo6WmFhYerZs6fS0tKcwzZ5eXnKzc21bQAAoHK56sHm0KFDOnfunOrWrWur161bV9nZ2bZaobCwML3zzjtasGCBkpKS1KRJE/Xs2VMrV650tlrGjRun4OBga4uIiHC2AAAAl/MYY4yzeCX99NNPqlevntauXauYmBir/sorr2j27NmlPiG4b9++8ng8WrRokXNIOr9ik5eXZ93Ozc1VRESEcnJyFBQUZOsFgIri8TgrF3d1P42BX4fc3FwFBwdf9vf3VV+xueGGG+Tt7V1kdebgwYNFVnFKcttttykzM9NZtvj7+ysoKMi2AQCAyuWqBxs/Pz+1bdtWqamptnpqaqo6depkq5UkIyNDYWFhzjIAAIDlqgcbSRo1apSmTp2q6dOna+vWrfrTn/6kvXv3atiwYZKkxMREDRgwwOqfNGmSkpOTlZmZqS1btigxMVELFixQQkLCBY8KAABgVyHB5sEHH9SkSZP017/+Va1bt9bKlSu1ZMkSRUZGSpKysrJsf9PmzJkzGj16tFq2bKkuXbpo9erV+vjjj3Xfffdd8KgAAAB2V/3k4WulvCcfAcCVwMnDQNmU9/u7QlZsAAAAKgLBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuA
bBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuAbBBgAAuEaFBZu3335bDRs2VEBAgNq2batVq1Y5W2zS09PVtm1bBQQEqFGjRpoyZYqzBQAAwKZCgs28efP01FNP6bnnnlNGRoa6dOmi3r17a+/evc5WSdKuXbvUp08fdenSRRkZGXr22Wc1cuRILViwwNkKAABg8RhjjLN4pXXs2FFt2rTR5MmTrVrTpk0VHx+vcePG2Xol6ZlnntGiRYu0detWqzZs2DB99dVXWrduna33YnJzcxUcHKycnBwFBQU5hwGgVDweZ+Xiivs0Le/9gcqmvN/fVz3YnDlzRlWrVtX8+fN17733WvUnn3xSmzdvVnp6uq1fkrp27aro6Gi98cYbVm3hwoV64IEH9Msvv8jX19fWL0l5eXnKy8uzbufk5KhBgwbat2/fZU0MAEhScLCzcnE5Oc5K+e8PVDa5ubmKiIjQsWPHFFyWN1Ahc5Xt37/fSDJr1qyx1V955RXTuHFjW63QzTffbF555RVbbc2aNUaS+emnn2z1QmPGjDGS2NjY2NjY2Fyw7du3z/lVXypXfcXmp59+Ur169bR27VrFxMRY9VdeeUWzZ8/Wtm3bbP2S1LhxYw0ePFiJiYlWbc2aNercubOysrIUGhpq61cxKzYFBQU6cuSIQkJC5CnLWnAJClMkq0Blw7xdPubu8jBvl4+5uzzM2+Vzzp0xRsePH1d4eLi8vMp+KvBVDzYVdSiqIpT3uF9lxbxdPubu8jBvl4+5uzzM2+W70nNX9ihURn5+fmrbtq1SU1Nt9dTUVHXq1MlWKxQTE1OkPyUlRe3atbtmoQYAAFz/rnqwkaRRo0Zp6tSpmj59urZu3ao//elP2rt3r4YNGyZJSkxM1IABA6z+YcOGac+ePRo1apS2bt2q6dOna9q0aRo9evQFjwoAAGDnPXbs2LHO4pXWvHlzhYSE6NVXX9WECRN06tQpzZ49W61atZIkzZkzR3v27NGgQYMkSTVr1lTnzp31z3/+Uy+99JIyMjL0yiuv2MLPteLt7a3Y2Fj5+Pg4h1AC5u3yMXeXh3m7fMzd5WHeLt+VnLurfo4NAABARamQQ1EAAAAVgWADAABcg2ADAABcg2ADAABcg2BTSm+//bYaNmyogIAAtW3bVqtWrXK2VGpjx46Vx+OxbRf+hWhjjMaOHavw8HBVqVJFsbGx2rJli+0xKouVK1eqb9++Cg8Pl8fjUXJysm28NHOVl5enESNG6IYbblC1atX0m9/8Rj/++KOtx40uNXeDBg0qsh/edttttp7KOHfjxo1T+/btVb16ddWpU0fx8fHavn27rYf9rqjSzBv7XPEmT56sli1bKigoSEFBQYqJidEnn3xijV/N/Y1gUwrz5s3TU089peeee04ZGRnq0qWLevfurb179zpbK7Vbb71VWVlZ1vbNN99YY3//+981ceJEvfXWW9qwYYNCQ0N155136vjx47bHqAxOnjypVq1a6a233nIOSaWcq6eeekoLFy7UBx98oNWrV+vEiRO65557dO7cOdtjuc2l5k6S7rrrLtt+uGTJEtt4ZZy79PR0DR8+XOvXr1dqaqry8/MVFxenkydPWj3sd0WVZt7EPles+vXra/z48dq4caM2btyoHj16qF+/flZ4uar7m/Mfj0JRHTp0MMOGDbPVbrnlFvPf//3ftlplNmbMGNOqVStn2RhjTE
FBgQkNDTXjx4+3aqdPnzbBwcFmypQptt7KRpJZuHChdbs0c3Xs2DHj6+trPvjgA6tn//79xsvLy3z66adWze2cc2eMMQMHDjT9+vWz1S7E3P2fgwcPGkkmPT3dGPa7UnPOm2GfK5OaNWuaqVOnXvX9jRWbSzhz5ow2bdqkuLg4Wz0uLk5r16611Sq7zMxMhYeHq2HDhnrooYf0ww8/SJJ27dql7Oxs2xz6+/urW7duzKFDaeZq06ZNOnv2rK0nPDxczZs3Zz4lrVixQnXq1FHjxo31hz/8QQcPHrTGmLv/k5OTI0mqVauWxH5Xas55K8Q+V7Jz587pgw8+0MmTJxUTE3PV9zeCzSUcOnRI586dU926dW31unXrKjs721arzDp27KhZs2bps88+07/+9S9lZ2erU6dOOnz4sDVPzOGllWausrOz5efnp5o1a160p7Lq3bu33nvvPS1fvlyvv/66NmzYoB49eigvL09i7qTz5zaMGjVKnTt3VvPmzSX2u1Ipbt7EPleib775RoGBgfL399ewYcO0cOFCNWvW7KrvbwSbUvJ4PLbbxpgitcqsd+/e+u1vf6sWLVrojjvu0McffyxJevfdd60e53wxhxfnnJfSzFVpetzuwQcf1N13363mzZurb9+++uSTT7Rjxw5rf7yYyjR3CQkJ+vrrrzV37lznUJE5KM28lKbHDS42b+xzF9ekSRNt3rxZ69ev1xNPPKGBAwfqu+++s8adr780c1KaHoLNJdxwww3y9vYukhAPHjxYJG3i/6lWrZpatGihzMxM6+oo5vDSSjNXoaGhOnPmjI4ePXrRHvyfsLAwRUZGKjMzU2LuNGLECC1atEhpaWmqX7++VWe/K9nF5q047HP/j5+fn2666Sa1a9dO48aNU6tWrfTGG29c9f2NYHMJfn5+atu2rVJTU2311NRUderUyVbD/5OXl6etW7cqLCxMDRs2VGhoqG0Oz5w5o/T0dObQoTRz1bZtW/n6+tp6srKy9O233zKfDocPH9a+ffsUFhYmVeK5M8YoISFBSUlJWr58uRo2bGgbZ78r3qXmrTjscxdnjFFeXt7V39+cZxOjqA8++MD4+vqaadOmme+++8489dRTplq1amb37t3O1krr6aefNitWrDA//PCDWb9+vbnnnntM9erVrTkaP368CQ4ONklJSeabb74xv/vd70xYWJjJzc11PpTrHT9+3GRkZJiMjAwjyUycONFkZGSYPXv2GFPKuRo2bJipX7++Wbp0qfnyyy9Njx49TKtWrUx+fv4Fz+Q+Jc3d8ePHzdNPP23Wrl1rdu3aZdLS0kxMTIypV69epZ+7J554wgQHB5sVK1aYrKwsa/vll1+sHva7oi41b+xzF5eYmGhWrlxpdu3aZb7++mvz7LPPGi8vL5OSkmLMVd7fCDal9I9//MNERkYaPz8/06ZNG9vlfjDmwQcfNGFhYcbX19eEh4eb++67z2zZssUaLygoMGPGjDGhoaHG39/fdO3a1XzzzTe2x6gs0tLSjKQi28CBA40p5VydOnXKJCQkmFq1apkqVaqYe+65x+zdu9fW40Ylzd0vv/xi4uLiTO3atY2vr69p0KCBGThwYJF5qYxz55yvwm3GjBlWD/tdUc75cs4b+9zFDRkyxPrOrF27tunZs6cVasxV3t885v9+eQAAAL96nGMDAABcg2ADAABcg2ADAABcg2ADAABcg2ADAABcg2ADAABcg2ADAABcg2ADAABcg2AD4JoaNGiQ4uPjneVS6dq1q95//31nucw8Ho+Sk5Od5QqRl5enBg0aaNOmTc4hAJeBYANUAuUJD1fK7t275fF4tHnzZufQZVm8eLGys7P10EMPWbVrGVAul7+/v0aPHq1nnnnGOQTgMhBsAPwqvfnmmxo8eLC8vH79H2O///3vtWrVKm3dutU5BKCMfv2fCADK7bvvvlOfPn0UGBiounXr6tFHH9WhQ4es8djYWI0cOVJ//vOfVatWLYWGhm
rs2LG2x9i2bZs6d+6sgIAANWvWTEuXLrWtoDRs2FCSFB0dLY/Ho9jYWNv9J0yYoLCwMIWEhGj48OE6e/asbfxChw4d0tKlS/Wb3/zGqkVFRUmS7r33Xnk8Huu2JE2ePFk33nij/Pz81KRJE82ePdsaK85f//pX1a1b11pdWrt2rbp27aoqVaooIiJCI0eO1MmTJ63+qKgovfrqqxoyZIiqV6+uBg0a6J133rHGz5w5o4SEBIWFhSkgIEBRUVEaN26cNR4SEqJOnTpp7ty5Vg3A5SHYAJVcVlaWunXrptatW2vjxo369NNPdeDAAT3wwAO2vnfffVfVqlXT559/rr///e/661//qtTUVElSQUGB4uPjVbVqVX3++ed655139Nxzz9nu/8UXX0iSli5dqqysLCUlJVljaWlp+v7775WWlqZ3331XM2fO1MyZMy+4t93q1atVtWpVNW3a1Kpt2LBBkjRjxgxlZWVZtxcuXKgnn3xSTz/9tL799ls9/vjjGjx4sNLS0qz7FjLG6Mknn9S0adO0evVqtW7dWt9884169eql++67T19//bXmzZun1atXKyEhwXbf119/Xe3atVNGRob++Mc/6oknntC2bduk86tLixYt0r///W9t375dc+bMsQUvSerQoYNWrVplqwG4DM5/7huA+wwcOND069fPWTbGGPP888+buLg4W23fvn1Gktm+fbsxxphu3bqZzp0723rat29vnnnmGWOMMZ988onx8fExWVlZ1nhqaqqRZBYuXGiMMWbXrl1GksnIyLB6zPmfLTIy0uTn51u1+++/3zz44IO2vgv9z//8j2nUqJGzbHu+Qp06dTJ/+MMfbLX777/f9OnTx7otycyfP9888sgj5pZbbjH79u2zxh599FHz//1//5912xhjVq1aZby8vMypU6eMMcZERkaaRx55xBovKCgwderUMZMnTzbGGDNixAjTo0cPU1BQYPU4vfHGGyYqKspZBlBGrNgAldymTZuUlpamwMBAa7vlllskSd9//73V17JlywvuJYWFhengwYOSpO3btysiIkKhoaHWeIcOHS7oLtmtt94qb29v6/aFj12cU6dOKSAgwFku1tatW3X77bfbarfffnuR81n+9Kc/ad26dVq1apXq169v1Tdt2qSZM2fa5qdXr14qKCjQrl27rL4L58fj8Sg0NNR6DYMGDdLmzZvVpEkTjRw5UikpKVZvoSpVquiXX35xlgGUEcEGqOQKCgrUt29fbd682bZlZmaqa9euVp+vr6/tfh6PRwUFBdL5Qzgej8c2XhYlPXZxbrjhBh09etRZvijnz1bcz3vnnXdq//79+uyzz2z1goICPf7447a5+eqrr5SZmakbb7zR6ivpNbRp00a7du3SSy+9pFOnTumBBx5Q//79bf1HjhxR7dq1bTUAZUewASq5Nm3aaMuWLYqKitJNN91k26pVq+ZsL9Ytt9yivXv36sCBA1at8ByXQn5+fpKkc+fO2eqXIzo6WtnZ2UXCja+vb5HHb9q0qVavXm2rrV271nZ+jiT95je/0fvvv6+hQ4fqgw8+sOqF8+Ocm5tuusl6TaURFBSkBx98UP/61780b948LViwQEeOHLHGv/32W0VHR9vuA6DsCDZAJZGTk1NkVWbv3r0aPny4jhw5ot/97nf64osv9MMPPyglJUVDhgwpEhIu5s4779SNN96ogQMH6uuvv9aaNWusk4cLV0bq1KmjKlWqWCcn5+TkOB6l9KKjo1W7dm2tWbPGVo+KitKyZctsoee//uu/NHPmTE2ZMkWZmZmaOHGikpKSNHr0aNt9df6KqtmzZ2vw4MH68MMPJUnPPPOM1q1bp+HDh1srWYsWLdKIESOcd7+o//mf/9EHH3ygbdu2aceOHZo/f75CQ0NVo0YNq2fVqlWKi4uz3Q9A2RFsgEpixYoVio6Otm0vvPCCwsPDtWbNGp07d069evVS8+bN9eSTTyo4OLjUfyPG29tbycnJOn
HihNq3b6+hQ4fqL3/5iyRZ58L4+PjozTff1D//+U+Fh4erX79+jkcpPW9vbw0ZMkTvvfeerf76668rNTVVERER1upHfHy83njjDb322mu69dZb9c9//lMzZswocrl5of79++vdd9/Vo48+qqSkJLVs2VLp6enKzMxUly5dFB0dreeff15hYWHOu15UYGCg/va3v6ldu3Zq3769du/erSVLlljzu27dOuXk5BQ5PAWg7Dzm/64IAIAras2aNercubN27txpOxflSjlw4IBuvfVWbdq0SZGRkc7hX5X7779f0dHRevbZZ51DAMqIYAPgili4cKECAwN18803a+fOnXryySdVs2bNIue3XEkfffSRatWqpS5dujiHfjXy8vL02muv6emnn1aVKlWcwwDKiGAD4IqYNWuWXnrpJe3bt0833HCD7rjjDr3++usKCQlxtgLAVUOwAQAArlG6MwMBAAB+BQg2AADANQg2AADANQg2AADANf5/xpvZ/AgCsC0AAAAASUVORK5CYII="
}
},
"cell_type": "markdown",
"id": "940ba698",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da33633a-7ad5-479c-8dff-f7a7a149d49c",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of prices\n",
"\n",
"prices = [item.price for item in items]\n",
"plt.title(f\"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"blueviolet\", bins=range(0, 1000, 10))\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d0f494d7-349e-4878-929c-075ac97c6b6d",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of categories\n",
"\n",
"category_counts = Counter()\n",
"for item in items:\n",
"    category_counts[item.category] += 1\n",
"\n",
"categories = category_counts.keys()\n",
"counts = [category_counts[category] for category in categories]\n",
"\n",
"# Bar chart by category\n",
"plt.bar(categories, counts, color=\"goldenrod\")\n",
"plt.title('How many items in each category')\n",
"plt.xlabel('Categories')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.xticks(rotation=30, ha='right')\n",
"\n",
"# Add value labels on top of each bar\n",
"for i, v in enumerate(counts):\n",
"    plt.text(i, v, f\"{v:,}\", ha='center', va='bottom')\n",
"\n",
"# Display the chart\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "d4fe384d-049b-4742-98e5-20d162db5151",
"metadata": {},
"source": [
"## 🎯 Data Sampling\n",
"\n",
"We sample to keep the dataset balanced but rich. Items are grouped by price rounded to the nearest dollar, then:\n",
"- 🎯 Keep every item in a group if the price is ≥ $240 or the group has ≤ 1200 items\n",
"- 🎯 For larger groups, randomly sample 1200 items without replacement, giving items outside the dominant category 5× the weight\n",
"\n",
"✅ This keeps valuable high-price items, caps crowded price points, and reduces the dominance of the largest category"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20330037-744d-4834-8ece-413a8dbe2030",
"metadata": {},
"outputs": [],
"source": [
"HEAVY_DATASET = \"Automotive\"  # Overrepresented category, down-weighted during sampling\n",
"\n",
"# Group items by rounded price:\n",
"# slots maps each rounded price to the list of items at that price\n",
"slots = defaultdict(list)\n",
"for item in items:\n",
"    slots[round(item.price)].append(item)\n",
"\n",
"np.random.seed(42)  # Set random seed for reproducibility\n",
"sample = []  # Final collection of items after our sampling process completes\n",
"\n",
"# Sampling loop\n",
"for price, items_at_price in slots.items():\n",
"\n",
"    # Take all items if price ≥ 240 or small group\n",
"    if price >= 240 or len(items_at_price) <= 1200:\n",
"        sample.extend(items_at_price)\n",
"\n",
"    # Otherwise sample 1200 items with weights\n",
"    else:\n",
"\n",
"        # Weight: 1 for the HEAVY_DATASET category, 5 for all others\n",
"        weights = [1 if item.category == HEAVY_DATASET else 5 for item in items_at_price]\n",
"        weights = np.array(weights) / sum(weights)\n",
"\n",
"        # replace=False: never pick the same index twice\n",
"        indices = np.random.choice(len(items_at_price), size=1200, replace=False, p=weights)\n",
"        sample.extend([items_at_price[i] for i in indices])\n",
"\n",
"print(f\"There are {len(sample):,} items in the sample\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21aed337-6f15-48e4-8155-70551ed1d5e0",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of prices in the sample\n",
"\n",
"prices = [float(item.price) for item in sample]\n",
"plt.title(f\"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"darkblue\", bins=range(0, 1000, 10))\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08a7353e-2752-4493-bb0b-6057d1eab16d",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of categories in the sample\n",
"\n",
"category_counts = Counter()\n",
"for item in sample:\n",
"    category_counts[item.category] += 1\n",
"\n",
"categories = category_counts.keys()\n",
"counts = [category_counts[category] for category in categories]\n",
"\n",
"# Create bar chart\n",
"plt.bar(categories, counts, color=\"pink\")\n",
"\n",
"# Customize the chart\n",
"plt.title('How many in each category')\n",
"plt.xlabel('Categories')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.xticks(rotation=30, ha='right')\n",
"\n",
"# Add value labels on top of each bar\n",
"for i, v in enumerate(counts):\n",
"    plt.text(i, v, f\"{v:,}\", ha='center', va='bottom')\n",
"\n",
"# Display the chart\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "9bdb0c58-24e0-4ab5-8a28-2136b53ab915",
"metadata": {},
"source": [
"The `HEAVY_DATASET` category is still in the lead, but the balance has improved somewhat."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4ce8ff80-cd19-4c3b-965f-ce6af8ee347d",
"metadata": {},
"outputs": [],
"source": [
"# Create pie chart\n",
"\n",
"fig, ax = plt.subplots(figsize=(8, 8))\n",
"wedges, texts, autotexts = ax.pie(\n",
"    counts,\n",
"    # labels=categories,\n",
"    autopct='%1.0f%%',\n",
"    startangle=90,\n",
"    pctdistance=0.85,\n",
"    labeldistance=1.1\n",
")\n",
"ax.legend(wedges, categories, title=\"Categories\", loc=\"lower center\", bbox_to_anchor=(0.5, 1.15), ncol=3)\n",
"\n",
"# Draw donut center\n",
"centre_circle = plt.Circle((0, 0), 0.70, fc='white')\n",
"fig.gca().add_artist(centre_circle)\n",
"\n",
"# Add center label\n",
"ax.text(0, 0, \"Categories\", ha='center', va='center', fontsize=14, fontweight='bold')\n",
"\n",
"# Equal aspect ratio\n",
"plt.axis('equal')\n",
"plt.title(\"Category Distribution\")\n",
"plt.tight_layout()\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "acbc6beb-fab4-49ab-bc7e-243638c1fa99",
"metadata": {},
"outputs": [],
"source": [
"# How does the price vary with the character count of the prompt?\n",
"\n",
"sizes = [len(item.prompt) for item in sample]\n",
"prices = [item.price for item in sample]\n",
"\n",
"# Create the scatter plot\n",
"plt.scatter(sizes, prices, s=0.2, color=\"red\")\n",
"\n",
"# Add labels and title\n",
"plt.xlabel('Prompt length (characters)')\n",
"plt.ylabel('Price ($)')\n",
"plt.title('Is there a simple correlation between prompt length and item price?')\n",
"\n",
"# Display the plot\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "76b060a4-0b8d-495c-bb96-28cb7b7ec623",
"metadata": {},
"source": [
"There is no strong or simple correlation between prompt length and item price.\n",
"\n",
"In other words, longer prompts don’t clearly mean higher prices, and vice versa."
]
},
{
"cell_type": "markdown",
"id": "0f33211c-3548-4a21-990b-21aa55089186",
"metadata": {},
"source": [
"## ✅ Final Check Before Training"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be8d0c68-ac6e-4a4d-a6c7-64e9c6763ec4",
"metadata": {},
"outputs": [],
"source": [
"# Check that the price label sits at the very end of the prompt\n",
"\n",
"def report(item):\n",
"    prompt = item.prompt\n",
"    tokens = Item.tokenizer.encode(prompt)\n",
"    print(prompt)\n",
"    print(tokens[-6:])  # IDs of the last six tokens\n",
"    print(Item.tokenizer.batch_decode(tokens[-6:]))  # The same six tokens decoded to text\n",
"\n",
"report(sample[50])"
]
},
{
"cell_type": "markdown",
"id": "656d523d-8297-4d75-a973-a7e5517d21bc",
"metadata": {},
"source": [
"LLaMA and GPT-4o both tokenize each number from 1 to 999 as a single token, while models like Qwen2, Gemma, and Phi-3 split them into multiple tokens. Single-token numbers keep the price label compact at the end of our prompts, which is useful for our project, though not strictly required."
]
},
{
"cell_type": "markdown",
"id": "e36254ba-d20f-44ad-b991-1f1f3cdc4aaa",
"metadata": {},
"source": [
"## 📦 Creating Train/Test Datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cfb5092-c38d-4c14-8dd0-e1d97c06d7f6",
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"random.shuffle(sample)\n",
"train = sample[:400_000]\n",
"test = sample[400_000:402_000]\n",
"print(f\"Divided into a training set of {len(train):,} items and test set of {len(test):,} items\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f084822-e489-4946-8cf5-f5b0ebd7a23c",
"metadata": {},
"outputs": [],
"source": [
"print(train[0].prompt)\n",
"print('*' * 40)\n",
"print(test[0].test_prompt())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d49a08ce-dd41-4af8-82f6-4701628e8152",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of prices in the first 250 test points\n",
"\n",
"prices = [float(item.price) for item in test[:250]]\n",
"plt.figure(figsize=(15, 6))\n",
"plt.title(f\"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"darkblue\", bins=range(0, 1000, 10))\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c581439-93f2-422a-924f-fd6c58ef8693",
"metadata": {},
"outputs": [],
"source": [
"# Extract prompts and prices\n",
"train_prompts = [item.prompt for item in train]\n",
"train_prices = [item.price for item in train]\n",
"test_prompts = [item.test_prompt() for item in test]\n",
"test_prices = [item.price for item in test]\n",
"\n",
"# Create Hugging Face datasets\n",
"train_dataset = Dataset.from_dict({\"text\": train_prompts, \"price\": train_prices})\n",
"test_dataset = Dataset.from_dict({\"text\": test_prompts, \"price\": test_prices})\n",
"dataset = DatasetDict({\n",
"    \"train\": train_dataset,\n",
"    \"test\": test_dataset\n",
"})\n",
"\n",
"# Make sure the output folder exists\n",
"os.makedirs(\"data\", exist_ok=True)\n",
"\n",
"# Save full Item objects to the folder\n",
"with open('data/train.pkl', 'wb') as file:\n",
"    pickle.dump(train, file)\n",
"\n",
"with open('data/test.pkl', 'wb') as file:\n",
"    pickle.dump(test, file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3914d029-350e-4140-a31f-e931fa289a41",
"metadata": {},
"outputs": [],
"source": [
"# Push to the Hugging Face Hub\n",
"USERNAME = \"lisekarimi\" # 🔧 Replace with your Hugging Face username\n",
"DATASET_NAME = f\"{USERNAME}/pricer-data\"\n",
"\n",
"dataset.push_to_hub(DATASET_NAME, private=True)"
]
},
{
"cell_type": "markdown",
"id": "3d8f3b33-41f8-4ee6-96ed-27677ffc8ec4",
"metadata": {},
"source": [
"**Note:** \n",
"- The dataset `pricer-data` on Hugging Face only contains `text` and `price`:\n",
"\n",
"\n",
"{\n",
" \"text\": \"How much does this cost...Price is $175.00\",\n",
" \"price\": 175.0\n",
"}\n",
"\n",
"- Full `Item` objects (with metadata) are available in `train.pkl` and `test.pkl`:\n",
"\n",
"Item(data={\n",
" \"title\": str,\n",
" \"description\": list[str],\n",
" \"features\": list[str],\n",
" \"details\": str\n",
"}, price=float)\n",
"\n",
"\n",
"Now, it’s time to move on to **Part 2: Model Benchmarking – Traditional ML vs Frontier LLMs.**\n",
"\n",
"🔜 See you in the [next notebook](https://github.com/lisekarimi/lexo/blob/main/09_part2_tradml_vs_frontier.ipynb)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}