Merge pull request #540 from aaron-official/submit/aaron-official

Some week 6 improvement suggestions
This commit is contained in:
Ed Donner
2025-07-25 10:17:20 -04:00
committed by GitHub
3 changed files with 2596 additions and 0 deletions

@@ -0,0 +1,823 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "28a0673e-96b5-43f2-8a8b-bd033bf851b0",
"metadata": {},
"source": [
"# The Product Pricer Continued\n",
"\n",
"A model that can estimate how much something costs, from its description.\n",
"\n",
"## Data Curation Part 2\n",
"\n",
"Today we'll extend our dataset to a greater coverage, and craft it into an excellent dataset for training. \n",
"Data curation can seem less exciting than other things we work on, but it's a crucial part of the LLM engineers' responsibility and an important craft to hone, so that you can build your own commercial solutions with high quality datasets.\n",
"\n",
"The dataset is here: \n",
"https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023\n",
"\n",
"And the folder with all the product datasets is here: \n",
"https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023/tree/main/raw/meta_categories\n",
"\n",
"Handles Large Datasets: This notebook is designed to efficiently process large datasets like the Amazon Reviews 2023 data, even with limited local resources.\n",
"https://colab.research.google.com/drive/1KY55mHyM5weQMSzHxiDXKSCxB_hItCD2?usp=sharing\n",
"\n",
"## Important Note - read me first please\n",
"\n",
"We are about to craft a massive dataset of 400,000 items covering multiple types of product. In Week 7 we will be using this data to train our own model. It's a pretty big dataset, and depending on the GPU you select, training could take 20+ hours. It will be really good fun, but it could cost a few dollars in compute units.\n",
"\n",
"As an alternative, if you want to keep things quick & low cost, you can work with a smaller dataset focused only on Home Appliances. You'll be able to cover the same learning points; the results will be good -- not quite as good as the full dataset, but still pretty amazing! If you'd prefer to do this, I've set up an alternative jupyter notebook in this folder called `lite.ipynb` that you should use in place of this one.\n",
"\n",
"Also, if you'd prefer, you can shortcut running all this data curation by downloading the pickle files that we save in the last cell. The pickle files are available here: https://drive.google.com/drive/folders/1f_IZGybvs9o0J5sb3xmtTEQB3BXllzrW"
]
},
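{
"cell_type": "markdown",
"id": "pickle-shortcut-note",
"metadata": {},
"source": [
"If you take the pickle shortcut mentioned above, the sketch below shows one way to load the files. It assumes you've downloaded `train.pkl` and `test.pkl` from the Google Drive folder into this directory; they are the same files that the final cell of this notebook saves with `pickle.dump`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pickle-shortcut-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional shortcut (a sketch): load the pre-built pickle files instead of re-running the curation\n",
"# Assumes train.pkl and test.pkl have been downloaded into this folder\n",
"# Note: items.py must be importable, since the pickled objects are Item instances\n",
"\n",
"import pickle\n",
"\n",
"with open('train.pkl', 'rb') as file:\n",
"    train = pickle.load(file)\n",
"\n",
"with open('test.pkl', 'rb') as file:\n",
"    test = pickle.load(file)\n",
"\n",
"print(f\"Loaded {len(train):,} training items and {len(test):,} test items\")"
]
},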
{
"cell_type": "code",
"execution_count": 1,
"id": "67cedf85-8125-4322-998e-9375fe745597",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import random\n",
"from dotenv import load_dotenv\n",
"from huggingface_hub import login\n",
"from datasets import load_dataset, Dataset, DatasetDict\n",
"import matplotlib.pyplot as plt\n",
"from collections import Counter, defaultdict\n",
"import numpy as np\n",
"import pickle"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "446bc939-62fe-4608-bec3-52ae1b2de322",
"metadata": {},
"outputs": [],
"source": [
"# Run this in your LOCAL environment to get the exact versions\n",
"import sys\n",
"print(f\"Python version: {sys.version}\")\n",
"print(\"=\"*50)\n",
"\n",
"# Check versions of all your dependencies\n",
"dependencies = [\n",
" 'datasets',\n",
" 'transformers', \n",
" 'huggingface_hub',\n",
" 'matplotlib',\n",
" 'numpy',\n",
" 'python-dotenv', # This is the package name for dotenv\n",
" 'tqdm' # Usually imported by datasets/transformers\n",
"]\n",
"\n",
"# Method 1: Using __version__ attribute\n",
"print(\"DEPENDENCY VERSIONS:\")\n",
"print(\"=\"*50)\n",
"\n",
"for dep in dependencies:\n",
" try:\n",
" if dep == 'python-dotenv':\n",
" import dotenv\n",
" version = dotenv.__version__\n",
" print(f\"python-dotenv: {version}\")\n",
" elif dep == 'huggingface_hub':\n",
" import huggingface_hub\n",
" version = huggingface_hub.__version__\n",
" print(f\"huggingface_hub: {version}\")\n",
" else:\n",
" module = __import__(dep)\n",
" version = getattr(module, '__version__', 'Unknown')\n",
" print(f\"{dep}: {version}\")\n",
" except ImportError:\n",
" print(f\"{dep}: NOT INSTALLED\")\n",
" except AttributeError:\n",
" print(f\"{dep}: Version attribute not found\")\n",
"\n",
"print(\"\\n\" + \"=\"*50)\n",
"print(\"INSTALLATION COMMANDS FOR COLAB:\")\n",
"print(\"=\"*50)\n",
"\n",
"# Method 2: Using pip show (more reliable)\n",
"import subprocess\n",
"import json\n",
"\n",
"def get_pip_version(package):\n",
" try:\n",
" result = subprocess.run([sys.executable, '-m', 'pip', 'show', package], \n",
" capture_output=True, text=True)\n",
" if result.returncode == 0:\n",
" for line in result.stdout.split('\\n'):\n",
" if line.startswith('Version:'):\n",
" return line.split(':', 1)[1].strip()\n",
" except:\n",
" pass\n",
" return None\n",
"\n",
"print(\"# Run these commands in Google Colab:\")\n",
"print(\"# (Copy and paste the exact versions from your local environment)\")\n",
"print()\n",
"\n",
"for dep in dependencies:\n",
" version = get_pip_version(dep)\n",
" if version:\n",
" print(f\"!pip install {dep}=={version}\")\n",
" else:\n",
" print(f\"# !pip install {dep} # Version not found\")\n",
"\n",
"print()\n",
"print(\"# Alternative: Install all at once\")\n",
"install_commands = []\n",
"for dep in dependencies:\n",
" version = get_pip_version(dep)\n",
" if version:\n",
" install_commands.append(f\"{dep}=={version}\")\n",
" else:\n",
" install_commands.append(dep)\n",
"\n",
"print(f\"!pip install {' '.join(install_commands)}\")\n",
"\n",
"print(\"\\n\" + \"=\"*50)\n",
"print(\"ADDITIONAL INFO:\")\n",
"print(\"=\"*50)\n",
"\n",
"# Check if we're in a virtual environment\n",
"print(f\"Virtual environment: {sys.prefix != sys.base_prefix}\")\n",
"print(f\"Python executable: {sys.executable}\")\n",
"\n",
"# Show pip list for reference\n",
"print(\"\\nFull pip list (for reference):\")\n",
"try:\n",
" result = subprocess.run([sys.executable, '-m', 'pip', 'list'], \n",
" capture_output=True, text=True)\n",
" if result.returncode == 0:\n",
" lines = result.stdout.split('\\n')\n",
" relevant_packages = []\n",
" for line in lines:\n",
" for dep in dependencies + ['torch', 'tensorflow', 'tokenizers']:\n",
" if dep.lower() in line.lower():\n",
" relevant_packages.append(line.strip())\n",
" break\n",
" \n",
" for pkg in relevant_packages:\n",
" print(f\" {pkg}\")\n",
"except Exception as e:\n",
" print(f\"Could not get pip list: {e}\")\n",
"\n",
"print(\"\\n\" + \"=\"*50)\n",
"print(\"REQUIREMENTS.TXT FORMAT:\")\n",
"print(\"=\"*50)\n",
"print(\"# Copy this to create a requirements.txt file:\")\n",
"\n",
"for dep in dependencies:\n",
" version = get_pip_version(dep)\n",
" if version:\n",
" print(f\"{dep}=={version}\")\n",
" else:\n",
" print(f\"{dep}\")\n",
"\n",
"print(\"\\n\" + \"=\"*50)\n",
"print(\"COLAB SETUP SCRIPT:\")\n",
"print(\"=\"*50)\n",
"print(\"\"\"# Copy this entire block to run in Colab:\n",
"\n",
"# Install exact versions from local environment\n",
"!pip install --upgrade pip\n",
"\n",
"# Your specific versions (replace with actual versions from above)\"\"\")\n",
"\n",
"for dep in dependencies:\n",
" version = get_pip_version(dep)\n",
" if version:\n",
" print(f\"!pip install {dep}=={version}\")\n",
"\n",
"print(\"\"\"\n",
"# Restart runtime after installation\n",
"import os\n",
"os.kill(os.getpid(), 9) # This will restart the runtime\n",
"\"\"\")\n",
"\n",
"print(\"\\n\" + \"=\"*50)\n",
"print(\"VERIFICATION SCRIPT FOR COLAB:\")\n",
"print(\"=\"*50)\n",
"print(\"\"\"# Run this in Colab AFTER installing to verify versions match:\n",
"\n",
"import sys\n",
"dependencies_to_check = [\n",
" 'datasets', 'transformers', 'huggingface_hub', \n",
" 'matplotlib', 'numpy', 'dotenv', 'tqdm'\n",
"]\n",
"\n",
"print(\"Verification of installed versions:\")\n",
"print(\"=\"*40)\n",
"for dep in dependencies_to_check:\n",
" try:\n",
" if dep == 'dotenv':\n",
" import dotenv as module\n",
" else:\n",
" module = __import__(dep)\n",
" version = getattr(module, '__version__', 'Unknown')\n",
" print(f\"{dep}: {version}\")\n",
" except ImportError:\n",
" print(f\"{dep}: NOT INSTALLED\")\n",
"\n",
"print(\"\\\\nIf all versions match your local environment, the code should work!\")\n",
"\"\"\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7390a6aa-79cb-4dea-b6d7-de7e4b13e472",
"metadata": {},
"outputs": [],
"source": [
"# environment\n",
"\n",
"load_dotenv(override=True)\n",
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n",
"os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')\n",
"os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0732274a-aa6a-44fc-aee2-40dc8a8e4451",
"metadata": {},
"outputs": [],
"source": [
"# Log in to HuggingFace\n",
"\n",
"hf_token = os.environ['HF_TOKEN']\n",
"login(hf_token, add_to_git_credential=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6746144c-2e19-485a-8086-368c144722b4",
"metadata": {},
"outputs": [],
"source": [
"# More imports after HF login\n",
"\n",
"from loaders import ItemLoader\n",
"from items import Item"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1adcf323-de9d-4c24-a9c3-d7ae554d06ca",
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"id": "01065d69-765c-42c8-9f90-68b8c8754068",
"metadata": {},
"source": [
"## The ItemLoader code\n",
"\n",
"Look in loaders.py - there's some useful code to make life easier for us"
]
},
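{
"cell_type": "markdown",
"id": "item-interface-note",
"metadata": {},
"source": [
"As a quick reference (not part of loaders.py or items.py), the sketch below covers just the parts of the Item interface this notebook relies on - everything in it is inferred from how items are used in the cells that follow. The helper function is hypothetical; run the example once the next cell has loaded `items`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "item-interface-code",
"metadata": {},
"outputs": [],
"source": [
"# A hypothetical helper (a sketch, not part of the course code) that summarises one Item,\n",
"# using only the attributes the later cells rely on: category, price, token_count and prompt\n",
"\n",
"def describe(item):\n",
"    print(f\"category:    {item.category}\")\n",
"    print(f\"price:       ${item.price}\")\n",
"    print(f\"token_count: {item.token_count}\")\n",
"    print(f\"prompt (start): {item.prompt[:80]}...\")\n",
"\n",
"# Example, once `items` has been loaded in the next cell:\n",
"# describe(items[1])"
]
},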
{
"cell_type": "code",
"execution_count": null,
"id": "049885d4-fdfa-4ff0-a932-4a2ed73928e2",
"metadata": {},
"outputs": [],
"source": [
"# Load in the same dataset as last time\n",
"\n",
"items = ItemLoader(\"All_Beauty\").load()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ffba41b5-ddb6-4359-9790-9b2db900eee1",
"metadata": {},
"outputs": [],
"source": [
"# Look for a familiar item..\n",
"print(items[1].prompt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7cc7f3e7-e98e-48c1-8eed-1608b42b0f65",
"metadata": {},
"outputs": [],
"source": [
"import datasets\n",
"print(datasets.__version__)"
]
},
{
"cell_type": "markdown",
"id": "e2b6dc50-ac5c-4cf2-af2e-968ed8ef86d7",
"metadata": {},
"source": [
"## Now to SCALE UP\n",
"\n",
"Let's look at all datasets of all the items that you might find in a large home retail store - electrical, electronic, office and related, but not clothes / beauty / books."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d1d06cd3-f3c2-44f0-a9f2-13b54ff8be5c",
"metadata": {},
"outputs": [],
"source": [
"dataset_names = [\n",
" \"Automotive\",\n",
" \"Electronics\",\n",
" \"Office_Products\",\n",
" \"Tools_and_Home_Improvement\",\n",
" \"Cell_Phones_and_Accessories\",\n",
" \"Toys_and_Games\",\n",
" \"Appliances\",\n",
" \"Musical_Instruments\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aa8fd0f0-509a-4298-8fcc-e499a061e1be",
"metadata": {},
"outputs": [],
"source": [
"items = []\n",
"for dataset_name in dataset_names:\n",
" loader = ItemLoader(dataset_name)\n",
" items.extend(loader.load())\n",
"\n",
"# Now, time for a coffee break!!\n",
"# By the way, I put the biggest datasets first.. it gets faster."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e29a5ab-ca61-41cc-9b33-22d374681b85",
"metadata": {},
"outputs": [],
"source": [
"print(f\"A grand total of {len(items):,} items\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89078cb1-9679-4eb0-b295-599b8586bcd1",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of token counts again\n",
"\n",
"tokens = [item.token_count for item in items]\n",
"plt.figure(figsize=(15, 6))\n",
"plt.title(f\"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\\n\")\n",
"plt.xlabel('Length (tokens)')\n",
"plt.ylabel('Count')\n",
"plt.hist(tokens, rwidth=0.7, color=\"skyblue\", bins=range(0, 300, 10))\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c38e0c43-9f7a-450e-a911-c94d37d9b9c3",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of prices\n",
"\n",
"prices = [item.price for item in items]\n",
"plt.figure(figsize=(15, 6))\n",
"plt.title(f\"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"blueviolet\", bins=range(0, 1000, 10))\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eabc7c61-0cd2-41f4-baa1-b85400bbf87f",
"metadata": {},
"outputs": [],
"source": [
"category_counts = Counter()\n",
"for item in items:\n",
" category_counts[item.category]+=1\n",
"\n",
"categories = category_counts.keys()\n",
"counts = [category_counts[category] for category in categories]\n",
"\n",
"# Bar chart by category\n",
"plt.figure(figsize=(15, 6))\n",
"plt.bar(categories, counts, color=\"goldenrod\")\n",
"plt.title('How many in each category')\n",
"plt.xlabel('Categories')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.xticks(rotation=30, ha='right')\n",
"\n",
"# Add value labels on top of each bar\n",
"for i, v in enumerate(counts):\n",
" plt.text(i, v, f\"{v:,}\", ha='center', va='bottom')\n",
"\n",
"# Display the chart\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "e5b6e987-83ba-4262-a082-57c6b0741062",
"metadata": {},
"source": [
"# Objective\n",
"\n",
"Craft a dataset which is more balanced in terms of prices. Less heavily scewed to cheap items, with an average that's higher than $60. Try to balance out the categories - fewer Automotive items."
]
},
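{
"cell_type": "markdown",
"id": "skew-check-note",
"metadata": {},
"source": [
"Before rebalancing, it can help to put a number on the skew described above. The sketch below only uses the `items` list already in memory; nothing in it is required for the curation steps that follow."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "skew-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional (a sketch): quantify the price skew before rebalancing\n",
"\n",
"all_prices = np.array([item.price for item in items])\n",
"print(f\"Mean price:   ${all_prices.mean():.2f}\")\n",
"print(f\"Median price: ${np.median(all_prices):.2f}\")\n",
"print(f\"Share of items under $20: {np.mean(all_prices < 20):.1%}\")"
]
},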
{
"cell_type": "code",
"execution_count": null,
"id": "3b9424c1-44e0-499a-b45e-a35246655469",
"metadata": {},
"outputs": [],
"source": [
"# Create a dict with a key of each price from $1 to $999\n",
"# And in the value, put a list of items with that price (to nearest round number)\n",
"\n",
"slots = defaultdict(list)\n",
"for item in items:\n",
" slots[round(item.price)].append(item)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7805a7f1-4ad8-48f6-bea3-d64b64894804",
"metadata": {},
"outputs": [],
"source": [
"# Create a dataset called \"sample\" which tries to more evenly take from the range of prices\n",
"# And gives more weight to items from categories other than Automotive\n",
"# Set random seed for reproducibility\n",
"\n",
"np.random.seed(42)\n",
"random.seed(42)\n",
"sample = []\n",
"for i in range(1, 1000):\n",
" slot = slots[i]\n",
" if i>=240:\n",
" sample.extend(slot)\n",
" elif len(slot) <= 1200:\n",
" sample.extend(slot)\n",
" else:\n",
" weights = np.array([1 if item.category=='Automotive' else 5 for item in slot])\n",
" weights = weights / np.sum(weights)\n",
" selected_indices = np.random.choice(len(slot), size=1200, replace=False, p=weights)\n",
" selected = [slot[i] for i in selected_indices]\n",
" sample.extend(selected)\n",
"\n",
"print(f\"There are {len(sample):,} items in the sample\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "430b432f-b769-41da-9506-a238cb5cf1b6",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of prices in sample\n",
"\n",
"prices = [float(item.price) for item in sample]\n",
"plt.figure(figsize=(15, 10))\n",
"plt.title(f\"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"darkblue\", bins=range(0, 1000, 10))\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d570794-6f1d-462e-b567-a46bae3556a1",
"metadata": {},
"outputs": [],
"source": [
"# OK, we did well in terms of raising the average price and having a smooth-ish population of prices\n",
"# Let's see the categories\n",
"\n",
"category_counts = Counter()\n",
"for item in sample:\n",
" category_counts[item.category]+=1\n",
"\n",
"categories = category_counts.keys()\n",
"counts = [category_counts[category] for category in categories]\n",
"\n",
"# Create bar chart\n",
"plt.figure(figsize=(15, 6))\n",
"plt.bar(categories, counts, color=\"lightgreen\")\n",
"\n",
"# Customize the chart\n",
"plt.title('How many in each category')\n",
"plt.xlabel('Categories')\n",
"plt.ylabel('Count')\n",
"\n",
"plt.xticks(rotation=30, ha='right')\n",
"\n",
"# Add value labels on top of each bar\n",
"for i, v in enumerate(counts):\n",
" plt.text(i, v, f\"{v:,}\", ha='center', va='bottom')\n",
"\n",
"# Display the chart\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6609d77c-3e0a-4679-9129-c7cdc3273070",
"metadata": {},
"outputs": [],
"source": [
"# Automotive still in the lead, but improved somewhat\n",
"# For another perspective, let's look at a pie\n",
"\n",
"plt.figure(figsize=(12, 10))\n",
"plt.pie(counts, labels=categories, autopct='%1.0f%%', startangle=90)\n",
"\n",
"# Add a circle at the center to create a donut chart (optional)\n",
"centre_circle = plt.Circle((0,0), 0.70, fc='white')\n",
"fig = plt.gcf()\n",
"fig.gca().add_artist(centre_circle)\n",
"plt.title('Categories')\n",
"\n",
"# Equal aspect ratio ensures that pie is drawn as a circle\n",
"plt.axis('equal') \n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "ac046cc1-2717-415b-96ad-b73b2950d235",
"metadata": {},
"source": [
"# Dataset Curated!\n",
"\n",
"We've crafted an excellent dataset.\n",
"\n",
"Let's do some final checks"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70219e99-22cc-4e08-9121-51f9707caef0",
"metadata": {},
"outputs": [],
"source": [
"# How does the price vary with the character count of the prompt?\n",
"\n",
"sizes = [len(item.prompt) for item in sample]\n",
"prices = [item.price for item in sample]\n",
"\n",
"# Create the scatter plot\n",
"plt.figure(figsize=(15, 8))\n",
"plt.scatter(sizes, prices, s=0.2, color=\"red\")\n",
"\n",
"# Add labels and title\n",
"plt.xlabel('Size')\n",
"plt.ylabel('Price')\n",
"plt.title('Is there a simple correlation?')\n",
"\n",
"# Display the plot\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30ae1453-b9fc-40db-8310-65d850c4b1da",
"metadata": {},
"outputs": [],
"source": [
"def report(item):\n",
" prompt = item.prompt\n",
" tokens = Item.tokenizer.encode(item.prompt)\n",
" print(prompt)\n",
" print(tokens[-10:])\n",
" print(Item.tokenizer.batch_decode(tokens[-10:]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9998b8d-d746-4541-9ac2-701108e0e8fb",
"metadata": {},
"outputs": [],
"source": [
"report(sample[398000])"
]
},
{
"cell_type": "markdown",
"id": "7aa0a3fc-d2fe-4e6e-8fdb-96913df2f588",
"metadata": {},
"source": [
"## Observation\n",
"\n",
"An interesting thing about the Llama tokenizer is that every number from 1 to 999 gets mapped to 1 token, much as we saw with gpt-4o. The same is not true of qwen2, gemma and phi3, which all map individual digits to tokens. This does turn out to be a bit useful for our project, although it's not an essential requirement."
]
},
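{
"cell_type": "markdown",
"id": "token-check-note",
"metadata": {},
"source": [
"If you want to verify the observation above (this also covers one of the to-dos at the end of the notebook), here's a minimal sketch that re-uses `Item.tokenizer`, the same tokenizer the `report()` function above relies on."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "token-check-code",
"metadata": {},
"outputs": [],
"source": [
"# Optional check (a sketch): count how many of the numbers 1-999 need more than one token\n",
"# using Item.tokenizer, the same tokenizer that report() uses above\n",
"\n",
"multi_token = [n for n in range(1, 1000)\n",
"               if len(Item.tokenizer.encode(str(n), add_special_tokens=False)) > 1]\n",
"print(f\"Numbers from 1 to 999 needing more than one token: {len(multi_token)}\")"
]
},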
{
"cell_type": "markdown",
"id": "0f03c0ee-3103-4603-af5c-b484884a3aa2",
"metadata": {},
"source": [
"# Finally\n",
"\n",
"It's time to break down our data into a training, test and validation dataset.\n",
"\n",
"It's typical to use 5%-10% of your data for testing purposes, but actually we have far more than we need at this point. We'll take 400,000 points for training, and we'll reserve 2,000 for testing, although we won't use all of them.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3b163ca2-18ef-4c26-8e9d-88eb55f114f6",
"metadata": {},
"outputs": [],
"source": [
"random.seed(42)\n",
"random.shuffle(sample)\n",
"train = sample[:400_000]\n",
"test = sample[400_000:402_000]\n",
"print(f\"Divided into a training set of {len(train):,} items and test set of {len(test):,} items\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "299b9816-8885-4798-829a-69d66d60eb01",
"metadata": {},
"outputs": [],
"source": [
"print(train[0].prompt)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "97222da3-9f2c-4d15-a5cd-5e5f8dbde6cc",
"metadata": {},
"outputs": [],
"source": [
"print(test[0].test_prompt())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a116369-335a-412b-b70c-2add6675c2e3",
"metadata": {},
"outputs": [],
"source": [
"# Plot the distribution of prices in the first 250 test points\n",
"\n",
"prices = [float(item.price) for item in test[:250]]\n",
"plt.figure(figsize=(15, 6))\n",
"plt.title(f\"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"darkblue\", bins=range(0, 1000, 10))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "d522d752-6f66-4786-a4dc-8ef51842558c",
"metadata": {},
"source": [
"# Finally - upload your brand new dataset\n",
"\n",
"Convert to prompts and upload to HuggingFace hub"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa11b3e5-fcf4-4efc-a573-f6f67fec3e73",
"metadata": {},
"outputs": [],
"source": [
"train_prompts = [item.prompt for item in train]\n",
"train_prices = [item.price for item in train]\n",
"test_prompts = [item.test_prompt() for item in test]\n",
"test_prices = [item.price for item in test]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b020ab1b-7153-4e5f-b8a3-d5bc2fafb6df",
"metadata": {},
"outputs": [],
"source": [
"# Create a Dataset from the lists\n",
"\n",
"train_dataset = Dataset.from_dict({\"text\": train_prompts, \"price\": train_prices})\n",
"test_dataset = Dataset.from_dict({\"text\": test_prompts, \"price\": test_prices})\n",
"dataset = DatasetDict({\n",
" \"train\": train_dataset,\n",
" \"test\": test_dataset\n",
"})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17639641-fb55-44e2-a463-b0b394d00f32",
"metadata": {},
"outputs": [],
"source": [
"# Uncomment these lines if you're ready to push to the hub, and replace my name with your HF username\n",
"\n",
"# HF_USER = \"ed-donner\"\n",
"# DATASET_NAME = f\"{HF_USER}/pricer-data\"\n",
"# dataset.push_to_hub(DATASET_NAME, private=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b85733ba-d165-4f07-b055-46803543edfe",
"metadata": {},
"outputs": [],
"source": [
"# One more thing!\n",
"# Let's pickle the training and test dataset so we don't have to execute all this code next time!\n",
"\n",
"with open('train.pkl', 'wb') as file:\n",
" pickle.dump(train, file)\n",
"\n",
"with open('test.pkl', 'wb') as file:\n",
" pickle.dump(test, file)"
]
},
{
"cell_type": "markdown",
"id": "2b58dc61-747f-46f7-b9e0-c205db4f3e5e",
"metadata": {},
"source": [
"## Todos for you:\n",
"\n",
"- Investigate the dataset more!\n",
"- Confirm that the tokenizer tokenizes all 3 digit prices into 1 token"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because one or more lines are too long

@@ -0,0 +1,676 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "91ae778b"
},
"source": [
"# Getting Started\n",
"\n",
"Before running this notebook, please ensure you have the following:\n",
"\n",
"1. **Local Modules:** Upload the necessary local Python files (`items.py`, `loaders.py`, `testing.py`) to the Colab runtime's temporary storage. You can do this by clicking the folder icon on the left sidebar, then the upload icon, and selecting the files.\n",
"2. **Hugging Face Access Token:** Add your Hugging Face access token to Colab's user data secrets. Click the key icon on the left sidebar, click \"New secret\", and add your token with the name `HF_TOKEN`.\n",
"3. **Install Dependencies:** Run the first code cell to install the required libraries with the specified versions.\n",
"\n",
"Once these steps are completed, you can run the rest of the notebook cells sequentially."
]
},
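{
"cell_type": "markdown",
"metadata": {
"id": "upload_sketch_note"
},
"source": [
"As an alternative to the sidebar upload described in step 1, the sketch below uses Colab's file-upload widget to pull in `items.py`, `loaders.py` and `testing.py` programmatically. It's entirely optional - uploading via the folder icon works just as well."
]
},
{
"cell_type": "code",
"source": [
"# Optional (a sketch): upload the local modules via the Colab file picker instead of the sidebar\n",
"from google.colab import files\n",
"\n",
"uploaded = files.upload()  # choose items.py, loaders.py and testing.py when prompted\n",
"print(f\"Uploaded: {list(uploaded.keys())}\")"
],
"metadata": {
"id": "upload_sketch_code"
},
"execution_count": null,
"outputs": []
},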
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_fj3pImYM5dw"
},
"outputs": [],
"source": [
"# Install exact versions from local environment to match the course's environment\n",
"!pip install --upgrade pip\n",
"\n",
"# Install specific versions of required libraries\n",
"!pip install datasets==3.6.0\n",
"!pip install transformers==4.51.3\n",
"!pip install huggingface_hub==0.31.2\n",
"!pip install matplotlib==3.10.3\n",
"!pip install numpy==1.26.4\n",
"!pip install python-dotenv==1.1.0\n",
"!pip install tqdm==4.67.1"
]
},
{
"cell_type": "code",
"source": [
"# Import necessary libraries\n",
"import os\n",
"import random\n",
"from dotenv import load_dotenv\n",
"from huggingface_hub import login\n",
"from datasets import load_dataset, Dataset, DatasetDict\n",
"import matplotlib.pyplot as plt\n",
"from collections import Counter, defaultdict\n",
"import numpy as np\n",
"import pickle"
],
"metadata": {
"id": "YQHruTKgPMRX"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Retrieve the Hugging Face access token from Colab's user data secrets\n",
"# This token is needed to interact with the Hugging Face Hub\n",
"from google.colab import userdata\n",
"userdata.get('HF_TOKEN')"
],
"metadata": {
"id": "jBdHkdyVNj_f"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Import custom classes from local files (items.py and loaders.py)\n",
"# These files were manually added to the Colab runtime's temporary storage\n",
"from loaders import ItemLoader\n",
"from items import Item"
],
"metadata": {
"id": "FdBT3PPzNmq3"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Set the backend for matplotlib to display plots inline in the notebook\n",
"%matplotlib inline"
],
"metadata": {
"id": "vynEBaq6OGEZ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Load a single dataset (\"All_Beauty\") using the custom ItemLoader\n",
"# This was likely an initial test or example loading step\n",
"items = ItemLoader(\"Appliances\").load()"
],
"metadata": {
"id": "OFOJtH6FOG2u"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Define a list of dataset names (Amazon product categories) to be loaded\n",
"dataset_names = [\n",
" \"Automotive\",\n",
" \"Electronics\",\n",
" \"Office_Products\",\n",
" \"Tools_and_Home_Improvement\",\n",
" \"Cell_Phones_and_Accessories\",\n",
" \"Toys_and_Games\",\n",
" \"Appliances\",\n",
" \"Musical_Instruments\",\n",
"]"
],
"metadata": {
"id": "rkLXYtfhOJNn"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Check and print the available CPU cores and RAM in the Colab runtime environment\n",
"# This helps understand the resources available for data processing\n",
"import psutil\n",
"print(f\"CPU cores: {psutil.cpu_count()}\")\n",
"print(f\"Available RAM: {psutil.virtual_memory().available / (1024**3):.1f} GB\")"
],
"metadata": {
"id": "1oQSUpovOfKs"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"items = []\n",
"for dataset_name in dataset_names:\n",
" loader = ItemLoader(dataset_name)\n",
" items.extend(loader.load(workers=8))\n",
"\n",
"# Now, time for a coffee break!!\n",
"# By the way, I put the biggest datasets first.. it gets faster."
],
"metadata": {
"id": "UcV9RB2Go8nC"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Print the total number of items loaded from all datasets\n",
"print(f\"A grand total of {len(items):,} items\")"
],
"metadata": {
"id": "YdkGJ_X3oI1g"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Extract token counts from all loaded items\n",
"tokens = [item.token_count for item in items]\n",
"# Create and display a histogram of token counts\n",
"plt.figure(figsize=(15, 6))\n",
"plt.title(f\"Token counts: Avg {sum(tokens)/len(tokens):,.1f} and highest {max(tokens):,}\\n\")\n",
"plt.xlabel('Length (tokens)')\n",
"plt.ylabel('Count')\n",
"plt.hist(tokens, rwidth=0.7, color=\"skyblue\", bins=range(0, 300, 10))\n",
"plt.show()"
],
"metadata": {
"id": "8VzKJ7neo-wp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Extract prices from all loaded items\n",
"prices = [item.price for item in items]\n",
"# Create and display a histogram of item prices\n",
"plt.figure(figsize=(15, 6))\n",
"plt.title(f\"Prices: Avg {sum(prices)/len(prices):,.1f} and highest {max(prices):,}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"blueviolet\", bins=range(0, 1000, 10))\n",
"plt.show()"
],
"metadata": {
"id": "ZLFJycNZpDak"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Count the occurrences of each category in the loaded items\n",
"category_counts = Counter()\n",
"for item in items:\n",
" category_counts[item.category]+=1\n",
"\n",
"# Extract categories and their counts for plotting\n",
"categories = category_counts.keys()\n",
"counts = [category_counts[category] for category in categories]\n",
"\n",
"# Create and display a bar chart showing the count of items per category\n",
"plt.figure(figsize=(15, 6))\n",
"plt.bar(categories, counts, color=\"goldenrod\")\n",
"plt.title('How many in each category')\n",
"plt.xlabel('Categories')\n",
"plt.ylabel('Count')\n",
"\n",
"# Rotate x-axis labels for better readability\n",
"plt.xticks(rotation=30, ha='right')\n",
"\n",
"# Add value labels on top of each bar for clarity\n",
"for i, v in enumerate(counts):\n",
" plt.text(i, v, f\"{v:,}\", ha='center', va='bottom')\n",
"\n",
"# Display the chart\n",
"plt.show()"
],
"metadata": {
"id": "6oRa8rI6pGb0"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Create a dictionary where keys are rounded prices and values are lists of items with that price\n",
"# This is done to group items by price for sampling\n",
"slots = defaultdict(list)\n",
"for item in items:\n",
" slots[round(item.price)].append(item)"
],
"metadata": {
"id": "7mT5ZubkpJ06"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Create a curated sample dataset with a more even distribution of prices and reduced bias towards 'Automotive' category\n",
"# Items with price >= $240 are included entirely\n",
"# For prices < $240, if the number of items is <= 1200, all are included\n",
"# If the number of items > 1200, a weighted random sample of 1200 items is taken,\n",
"# giving non-Automotive items higher weight\n",
"\n",
"# Set random seeds for reproducibility\n",
"np.random.seed(42)\n",
"random.seed(42)\n",
"sample = []\n",
"for i in range(1, 1000):\n",
" slot = slots[i]\n",
" if i>=240:\n",
" sample.extend(slot)\n",
" elif len(slot) <= 1200:\n",
" sample.extend(slot)\n",
" else:\n",
" # Assign weights: 1 for 'Automotive', 5 for other categories\n",
" weights = np.array([1 if item.category=='Automotive' else 5 for item in slot])\n",
" # Normalize weights\n",
" weights = weights / np.sum(weights)\n",
" # Randomly select 1200 indices based on weights\n",
" selected_indices = np.random.choice(len(slot), size=1200, replace=False, p=weights)\n",
" # Select the items corresponding to the chosen indices\n",
" selected = [slot[i] for i in selected_indices]\n",
" sample.extend(selected)\n",
"\n",
"# Print the total number of items in the curated sample\n",
"print(f\"There are {len(sample):,} items in the sample\")"
],
"metadata": {
"id": "qHJdXynopMBp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Extract prices from the curated sample\n",
"prices = [float(item.price) for item in sample]\n",
"# Create and display a histogram of prices for the sample dataset\n",
"# This helps visualize the effect of the sampling process on the price distribution\n",
"plt.figure(figsize=(15, 10))\n",
"plt.title(f\"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"darkblue\", bins=range(0, 1000, 10))\n",
"plt.show()"
],
"metadata": {
"id": "gtBkOdPGpOou"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Count the occurrences of each category in the curated sample\n",
"category_counts = Counter()\n",
"for item in sample:\n",
" category_counts[item.category]+=1\n",
"\n",
"# Extract categories and their counts for plotting\n",
"categories = category_counts.keys()\n",
"counts = [category_counts[category] for category in categories]\n",
"\n",
"# Create and display a bar chart showing the count of items per category in the sample\n",
"# This helps visualize the effect of weighted sampling on category distribution\n",
"plt.figure(figsize=(15, 6))\n",
"plt.bar(categories, counts, color=\"lightgreen\")\n",
"\n",
"# Customize the chart\n",
"plt.title('How many in each category')\n",
"plt.xlabel('Categories')\n",
"plt.ylabel('Count')\n",
"\n",
"# Rotate x-axis labels for better readability\n",
"plt.xticks(rotation=30, ha='right')\n",
"\n",
"# Add value labels on top of each bar for clarity\n",
"for i, v in enumerate(counts):\n",
" plt.text(i, v, f\"{v:,}\", ha='center', va='bottom')\n",
"\n",
"# Display the chart\n",
"plt.show()"
],
"metadata": {
"id": "-lYpt40jpTE1"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Create and display a pie chart showing the percentage distribution of items across categories in the sample\n",
"plt.figure(figsize=(12, 10))\n",
"plt.pie(counts, labels=categories, autopct='%1.0f%%', startangle=90)\n",
"\n",
"# Add a circle at the center to create a donut chart (optional)\n",
"centre_circle = plt.Circle((0,0), 0.70, fc='white')\n",
"fig = plt.gcf()\n",
"fig.gca().add_artist(centre_circle)\n",
"plt.title('Categories')\n",
"\n",
"# Equal aspect ratio ensures that pie is drawn as a circle\n",
"plt.axis('equal')\n",
"\n",
"plt.show()"
],
"metadata": {
"id": "5QPV4m2LpV3g"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Markdown cell indicates that the dataset curation is complete and ready for final checks\n",
"# Dataset Curated!\n",
"\n",
"# We've crafted an excellent dataset.\n",
"\n",
"# Let's do some final checks"
],
"metadata": {
"id": "3Xc2ZxjapZ0a"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Extract prompt lengths (character counts) and prices from the curated sample\n",
"sizes = [len(item.prompt) for item in sample]\n",
"prices = [item.price for item in sample]\n",
"\n",
"# Create and display a scatter plot to visualize the relationship between prompt size and price\n",
"# This helps check for any simple correlation between the two\n",
"plt.figure(figsize=(15, 8))\n",
"plt.scatter(sizes, prices, s=0.2, color=\"red\")\n",
"\n",
"# Add labels and title\n",
"plt.xlabel('Size')\n",
"plt.ylabel('Price')\n",
"plt.title('Is there a simple correlation?')\n",
"\n",
"# Display the plot\n",
"plt.show()"
],
"metadata": {
"id": "VXYQkVarpceE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Define a helper function to report information about an item\n",
"# It prints the item's prompt, the last 10 token IDs, and the decoded last 10 tokens\n",
"def report(item):\n",
" prompt = item.prompt\n",
" tokens = Item.tokenizer.encode(item.prompt)\n",
" print(prompt)\n",
" print(tokens[-10:])\n",
" print(Item.tokenizer.batch_decode(tokens[-10:]))"
],
"metadata": {
"id": "1BBJNDAKpgL_"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Use the report function to display information about a specific item in the sample\n",
"# This helps inspect the data and the tokenizer's behavior\n",
"report(sample[398000])"
],
"metadata": {
"id": "ZO2zF09wpiPp"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Observation\n",
"\n",
"An interesting thing about the Llama tokenizer is that every number from 1 to 999 gets mapped to 1 token, much as we saw with gpt-4o. The same is not true of qwen2, gemma and phi3, which all map individual digits to tokens. This does turn out to be a bit useful for our project, although it's not an essential requirement."
],
"metadata": {
"id": "GCkwmt_VpsaU"
}
},
{
"cell_type": "markdown",
"source": [
"# Finally\n",
"\n",
"It's time to break down our data into a training, test and validation dataset.\n",
"\n",
"It's typical to use 5%-10% of your data for testing purposes, but actually we have far more than we need at this point. We'll take 400,000 points for training, and we'll reserve 2,000 for testing, although we won't use all of them.\n"
],
"metadata": {
"id": "dy6WGVAmpx0g"
}
},
{
"cell_type": "code",
"source": [
"# Set random seed for reproducibility before shuffling and splitting the sample\n",
"random.seed(42)\n",
"# Shuffle the curated sample dataset\n",
"random.shuffle(sample)\n",
"# Split the shuffled sample into training (400,000 items) and testing (2,000 items) sets\n",
"train = sample[:400_000]\n",
"test = sample[400_000:402_000]\n",
"# Print the sizes of the training and testing sets\n",
"print(f\"Divided into a training set of {len(train):,} items and test set of {len(test):,} items\")"
],
"metadata": {
"id": "oY1ZSkW7p0VS"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Extract prices from the first 250 items of the test set\n",
"prices = [float(item.price) for item in test[:250]]\n",
"# Create and display a histogram of prices for the first 250 test items\n",
"# This provides a quick look at the price distribution in a small portion of the test data\n",
"plt.figure(figsize=(15, 6))\n",
"plt.title(f\"Avg {sum(prices)/len(prices):.2f} and highest {max(prices):,.2f}\\n\")\n",
"plt.xlabel('Price ($)')\n",
"plt.ylabel('Count')\n",
"plt.hist(prices, rwidth=0.7, color=\"darkblue\", bins=range(0, 1000, 10))\n",
"plt.show()"
],
"metadata": {
"id": "nLnRpUbtp17N"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Extract prompts from the training set\n",
"train_prompts = [item.prompt for item in train]\n",
"# Extract prices from the training set\n",
"train_prices = [item.price for item in train]\n",
"# Extract test prompts (using the test_prompt method) from the test set\n",
"test_prompts = [item.test_prompt() for item in test]\n",
"# Extract prices from the test set\n",
"test_prices = [item.price for item in test]"
],
"metadata": {
"id": "kpw1Y8IIp6kw"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Create Hugging Face Dataset objects from the training and testing data\n",
"train_dataset = Dataset.from_dict({\"text\": train_prompts, \"price\": train_prices})\n",
"test_dataset = Dataset.from_dict({\"text\": test_prompts, \"price\": test_prices})\n",
"# Create a DatasetDict containing the training and testing datasets\n",
"dataset = DatasetDict({\n",
" \"train\": train_dataset,\n",
" \"test\": test_dataset\n",
"})"
],
"metadata": {
"id": "WtEFiTlvp8hL"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Push the created DatasetDict to the Hugging Face Hub\n",
"# Replace \"aaron-official\" with your Hugging Face username\n",
"# The dataset will be named \"your-username/pricer-data\" and will be private\n",
"# HF_USER = \"aaron-official\" # Uncomment and replace with your HF username\n",
"# DATASET_NAME = f\"{HF_USER}/pricer-data\" # Uncomment\n",
"# dataset.push_to_hub(DATASET_NAME, private=True) # Uncomment to push to hub"
],
"metadata": {
"id": "sSnwZIxHp-VJ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"# Pickle (serialize) the training and testing datasets and save them as files\n",
"# This allows for quick loading of the processed data in future sessions\n",
"with open('train.pkl', 'wb') as file:\n",
" pickle.dump(train, file)\n",
"\n",
"with open('test.pkl', 'wb') as file:\n",
" pickle.dump(test, file)"
],
"metadata": {
"id": "WRawIsrOqMQ-"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "bd72e246"
},
"source": [
"# Mount Google Drive to access files in your Drive\n",
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "6fc5d915"
},
"source": [
"Once your Google Drive is mounted, you can copy the file to a folder in your Drive. Replace `My Drive/your_folder_name` with the path to the folder where you want to save the file."
]
},
{
"cell_type": "code",
"metadata": {
"id": "f319129b"
},
"source": [
"# Import the shutil module for file operations\n",
"import shutil\n",
"\n",
"# Define the destination path in Google Drive and the source path of the pickled training data\n",
"# Replace 'My Drive/your_folder_name' with your desired folder path in Google Drive\n",
"destination_path = '/content/drive/My Drive/train.pkl'\n",
"source_path = '/content/train.pkl'\n",
"\n",
"# Copy the pickled training data file from the Colab environment to Google Drive\n",
"shutil.copyfile(source_path, destination_path)\n",
"\n",
"# Print a confirmation message\n",
"print(f\"Copied {source_path} to {destination_path}\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "d23d6cf0"
},
"source": [
"# Import the shutil module for file operations\n",
"import shutil\n",
"\n",
"# Define the destination path in Google Drive and the source path of the pickled test data\n",
"# Replace 'My Drive/your_folder_name' with your desired folder path in Google Drive\n",
"destination_path = '/content/drive/My Drive/test.pkl'\n",
"source_path = '/content/test.pkl'\n",
"\n",
"# Copy the pickled test data file from the Colab environment to Google Drive\n",
"shutil.copyfile(source_path, destination_path)\n",
"\n",
"# Print a confirmation message\n",
"print(f\"Copied {source_path} to {destination_path}\")"
],
"execution_count": null,
"outputs": []
}
]
}