{ "cells": [ { "cell_type": "markdown", "id": "d9b9eaa6-a12f-4cf8-a4c5-e8ac2c15d15b", "metadata": { "id": "d9b9eaa6-a12f-4cf8-a4c5-e8ac2c15d15b" }, "source": [ "# πŸ” Predicting Item Prices from Descriptions (Part 3)\n", "---\n", "- Data Curation & Preprocessing\n", "- Model Benchmarking – Traditional ML vs LLMs\n", "- ➑️E5 Embeddings & RAG\n", "- Fine-Tuning GPT-4o Mini\n", "- Evaluating LLaMA 3.1 8B Quantized\n", "- Fine-Tuning LLaMA 3.1 with QLoRA\n", "- Evaluating Fine-Tuned LLaMA\n", "- Summary & Leaderboard\n", "\n", "---\n", "\n", "# 🧠 Part 3: E5 Embeddings & RAG\n", "\n", "- πŸ§‘β€πŸ’» Skill Level: Advanced\n", "- βš™οΈ Hardware: ⚠️ GPU required for embeddings (400K items) - use Google Colab\n", "- πŸ› οΈ Requirements: πŸ”‘ HF Token, Open API Key\n", "- Tasks:\n", " - Preprocessed item descriptions\n", " - Generated and stored embeddings in ChromaDB\n", " - Trained XGBoost on embeddings, pushed to HF Hub, and ran predictions\n", " - Predicted prices with GPT-4o Mini using RAG\n", "\n", "Is Word2Vec enough for XGBoost, or do contextual E5 embeddings perform better?\n", "\n", "Does retrieval improve price prediction for GPT-4o Mini?\n", "\n", "Let’s find out.\n", "\n", "⚠️ This notebook assumes basic familiarity with RAG and contextual embeddings.\n", "We use the same E5 embedding space for both XGBoost and GPT-4o Mini with RAG, enabling a fair comparison.\n", "Embeddings are stored and queried via ChromaDB β€” no LangChain is used for creation or retrieval.\n", "\n", "---\n", "πŸ“’ Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)" ] }, { "cell_type": "code", "execution_count": null, "id": "d8e2af5e-03cc-46dc-8a8b-37cb102d0e92", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "d8e2af5e-03cc-46dc-8a8b-37cb102d0e92", "outputId": "905907cc-81c5-4a3b-e7c8-9e237e594a09" }, "outputs": [], "source": [ "# Install required packages in Google Colab\n", "%pip install -q tqdm huggingface_hub numpy 
sentence-transformers datasets chromadb xgboost" ] }, { "cell_type": "code", "execution_count": null, "id": "4ce6a892-b357-4132-b9c0-a3142a0244c8", "metadata": { "id": "4ce6a892-b357-4132-b9c0-a3142a0244c8" }, "outputs": [], "source": [ "# imports\n", "\n", "import math\n", "import chromadb\n", "import re\n", "import joblib\n", "import os\n", "from tqdm import tqdm\n", "import gc\n", "from huggingface_hub import login, HfApi\n", "import numpy as np\n", "from sentence_transformers import SentenceTransformer\n", "from datasets import load_dataset\n", "from google.colab import userdata\n", "from xgboost import XGBRegressor\n", "from openai import OpenAI\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "id": "yBH-mvV0QBiw", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "yBH-mvV0QBiw", "outputId": "b4b6df10-dc05-4dbe-dd8b-55bae5a2b7af" }, "outputs": [], "source": [ "# Mount Google Drive to access persistent storage\n", "\n", "from google.colab import drive\n", "drive.mount('/content/drive')" ] }, { "cell_type": "code", "execution_count": null, "id": "3OUI1jQYyaeX", "metadata": { "id": "3OUI1jQYyaeX" }, "outputs": [], "source": [ "# Google Colab User Data\n", "# Ensure you have set the following in your Google Colab environment:\n", "openai_api_key = userdata.get(\"OPENAI_API_KEY\")\n", "hf_token = userdata.get('HF_TOKEN')" ] }, { "cell_type": "code", "execution_count": null, "id": "99f6f632", "metadata": {}, "outputs": [], "source": [ "openai = OpenAI(api_key=openai_api_key)\n", "login(hf_token, add_to_git_credential=True)\n", "\n", "# Configuration\n", "ROOT = \"/content/drive/MyDrive/deal_finder\"\n", "CHROMA_PATH = f\"{ROOT}/chroma\"" ] }, { "cell_type": "code", "execution_count": null, "id": "FF-HryRnDXm5", "metadata": { "id": "FF-HryRnDXm5" }, "outputs": [], "source": [ "# Helper class for evaluating model predictions\n", "\n", "GREEN = \"\\033[92m\"\n", "YELLOW = 
\"\\033[93m\"\n", "RED = \"\\033[91m\"\n", "RESET = \"\\033[0m\"\n", "COLOR_MAP = {\"red\":RED, \"orange\": YELLOW, \"green\": GREEN}\n", "\n", "class Tester:\n", "\n", " def __init__(self, predictor, data, title=None, size=250):\n", " self.predictor = predictor\n", " self.data = data\n", " self.title = title or predictor.__name__.replace(\"_\", \" \").title()\n", " self.size = size\n", " self.guesses = []\n", " self.truths = []\n", " self.errors = []\n", " self.sles = []\n", " self.colors = []\n", "\n", " def color_for(self, error, truth):\n", " if error<40 or error/truth < 0.2:\n", " return \"green\"\n", " elif error<80 or error/truth < 0.4:\n", " return \"orange\"\n", " else:\n", " return \"red\"\n", "\n", " def run_datapoint(self, i):\n", " datapoint = self.data[i]\n", " guess = self.predictor(datapoint)\n", " truth = datapoint[\"price\"]\n", " error = abs(guess - truth)\n", " log_error = math.log(truth+1) - math.log(guess+1)\n", " sle = log_error ** 2\n", " color = self.color_for(error, truth)\n", " # title = datapoint[\"text\"].split(\"\\n\\n\")[1][:20] + \"...\"\n", " self.guesses.append(guess)\n", " self.truths.append(truth)\n", " self.errors.append(error)\n", " self.sles.append(sle)\n", " self.colors.append(color)\n", " # print(f\"{COLOR_MAP[color]}{i+1}: Guess: ${guess:,.2f} Truth: ${truth:,.2f} Error: ${error:,.2f} SLE: {sle:,.2f} Item: {title}{RESET}\")\n", "\n", " def chart(self, title):\n", " # max_error = max(self.errors)\n", " plt.figure(figsize=(12, 8))\n", " max_val = max(max(self.truths), max(self.guesses))\n", " plt.plot([0, max_val], [0, max_val], color='deepskyblue', lw=2, alpha=0.6)\n", " plt.scatter(self.truths, self.guesses, s=3, c=self.colors)\n", " plt.xlabel('Ground Truth')\n", " plt.ylabel('Model Estimate')\n", " plt.xlim(0, max_val)\n", " plt.ylim(0, max_val)\n", " plt.title(title)\n", "\n", " # Add color legend\n", " from matplotlib.lines import Line2D\n", " legend_elements = [\n", " Line2D([0], [0], marker='o', color='w', 
label='Accurate (green)', markerfacecolor='green', markersize=8),\n", " Line2D([0], [0], marker='o', color='w', label='Medium error (orange)', markerfacecolor='orange', markersize=8),\n", " Line2D([0], [0], marker='o', color='w', label='High error (red)', markerfacecolor='red', markersize=8)\n", " ]\n", " plt.legend(handles=legend_elements, loc='upper right')\n", "\n", " plt.show()\n", "\n", "\n", " def report(self):\n", " average_error = sum(self.errors) / self.size\n", " rmsle = math.sqrt(sum(self.sles) / self.size)\n", " hits = sum(1 for color in self.colors if color==\"green\")\n", " title = f\"{self.title} Error=${average_error:,.2f} RMSLE={rmsle:,.2f} Hits={hits/self.size*100:.1f}%\"\n", " self.chart(title)\n", "\n", " def run(self):\n", " self.error = 0\n", " for i in range(self.size):\n", " self.run_datapoint(i)\n", " self.report()\n", "\n", " @classmethod\n", " def test(cls, function, data):\n", " cls(function, data).run()\n" ] }, { "cell_type": "markdown", "id": "6f82b230-2e03-4b1e-9be5-926fcd19acbe", "metadata": { "id": "6f82b230-2e03-4b1e-9be5-926fcd19acbe" }, "source": [ "## πŸ“₯ Load Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "3ae00568", "metadata": {}, "outputs": [], "source": [ "# #If you face NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported run:\n", "# %pip install -U datasets" ] }, { "cell_type": "code", "execution_count": null, "id": "55f1495b-f343-4152-8739-3a99f5ac405d", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 177, "referenced_widgets": [ "6e7c01d666f64fa58d6a059cc8d8f323", "597b7155767441e6a0283a19edced00f", "cf1360550eaa49a0867f55db8b8c4c77", "94f26137cccf47f6a36d9325bc8f5b9c", "a764b97f3dcd480c8860dde979e5e114", "f1ec9a46c9ce4e038f3051bbd1b2c661", "992f46ae91554731987b4baf79ba1bbd", "b4abe22402fe40fd82b7fe93b4bc06f3", "57ec058518734e3dbd27324cbba243c0", "f101230e8a9a431d85ee2f8e51add7ad", "e196658b093746588113240a60336437", 
"cb06a4d26cb84c708857b683d1e84c12", "e82ad07ba22e465cbe0232c504c3b693", "c4e0ed1165f54393aaec24cd4624d562", "295a3c6662034aaaab4d2e0192d1d1ce", "c38aff0c91a849feb547e78156c2c347", "69647c5595874c3185cebf6813ee908c", "1036b1af4b154916a3d4f16f5ed799eb", "e6347ff832cc4c04aef86594ea5a9e64", "01c63224aa6a4f0c9c88a4d85527e767", "1db34b9a4f1f42a897345b5a6630ced6", "9293f2d745024d7facb68e04cc188850", "26f6ec91efaf42909cec172fafe55987", "c1131f0324b0498da9bc59720e867eb6", "3e58017527a04634a489a33ed53fd312", "06cd89f57d08466c875d179e79e3ecd2", "2e0aa0aa87a04419a277f303f577f7ff", "8fa0fe1992db42a997e7cd3ee08bd09e", "accb1d5142a9498da0117f746fedd691", "fcc2fc2f82e2441995b9e61b23b9b91e", "da93fe316dd24cb48538b52ef2eaf6b5", "5cea58775faf41829c04d2a84e3e2c31", "1914ec7959d143d09a55da324bbcd47b", "a3d3504148df46f59b6770fb377e2bb6", "b088b9a503e24f179741d40d21a730d9", "b77dcf4632954d0c9c3b6d441c5f684d", "4cc8b3c4d9934f24a94b4601ab7816b5", "c093f1c0806a43b79594ddac856a301c", "9f4d9ac1aa074ed6b0248a4b18fde7db", "c00785b8fdda409e9cb435abbb0466da", "612e211af4cd46eb9d2f3148d1c7cb0b", "86f93c663cc446adbc6366a528cb01b0", "dd42911451ec48e086c1c99e76492321", "5b942241f11c4f2ab086f0f289f99a03", "d28a5c6172f74c0f8bbd2d949455f22e", "0e67b2055f214eb691b4b54d9431bdd8", "f81c4dc72b3b4b40a6a70528db732482", "043a355b6a85471ba0142eb25e2c9eb0", "8682bfab79a8409499797a3307e4d64d", "55a837644bb643ac864fa1a674e665c8", "33aae5a98bf5433b813ff8216e015089", "56eedfc5ba6642dc8443ab60f5f09b8c", "a1b710c227a84ea1a55c310084f13a93", "0d4bc0d0e88a4c77a202f9c11b2ee2a9", "20858379c2cd45d59070b18149d6e925" ] }, "id": "55f1495b-f343-4152-8739-3a99f5ac405d", "outputId": "37317fe6-b560-4ad0-c7d6-66517fd67c42" }, "outputs": [], "source": [ "HF_USER = \"lisekarimi\"\n", "DATASET_NAME = f\"{HF_USER}/pricer-data\"\n", "\n", "dataset = load_dataset(DATASET_NAME)\n", "train = dataset['train']\n", "test = dataset['test']" ] }, { "cell_type": "code", "execution_count": null, "id": "85880d79-f1ba-4ee8-a039-b6acea84562c", 
"metadata": { "id": "85880d79-f1ba-4ee8-a039-b6acea84562c" }, "outputs": [], "source": [ "print(train[0][\"text\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "88842541-d73b-4fae-a550-6dedf8fab633", "metadata": { "id": "88842541-d73b-4fae-a550-6dedf8fab633" }, "outputs": [], "source": [ "print(train[0][\"price\"])" ] }, { "cell_type": "markdown", "id": "7b8a9a5b-f74d-487d-a400-d157fea8c979", "metadata": { "id": "7b8a9a5b-f74d-487d-a400-d157fea8c979" }, "source": [ "## πŸ“¦ Embed + Save Training Data to Chroma\n", "- No LangChain used.\n", "- We use `intfloat/e5-small-v2` for embeddings:\n", " - Fast, high-quality, retrieval-tuned\n", " - **Requires 'passage:' prefix**\n", "- We embed item descriptions and store them in ChromaDB, with price saved as metadata." ] }, { "cell_type": "code", "execution_count": null, "id": "b95a87a8-2136-4e03-a36c-42e5d53a3e28", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 337, "referenced_widgets": [ "8216f5d45e9345e493a43b8cbbe6598a", "ec3854658f8448fc8463e8635889f700", "7a90822b2aff4d5cb926442f01a77a9b", "9518c3af589744cfbbb51f87d68f216e", "327044765c044384a14be4e660bb152f", "0b773d68d2394d80a2baf73c1808752a", "21568b9954c8411d863baa7385df624f", "0a08828a0ba4430ea6e039949f220b5b", "3d5a51cfb5f44eecbf80d46e2e4608fd", "313f059a82104a9394182f6dcdb0bfb4", "6a625748afc84fe89a8af7a4ef638675", "ebe43cd30e414f31ab52614c6e9f9f2b", "88c29992adaa44af857e3216f7e53e60", "0528af78cef844e8a2b489dcb8fce049", "8cbccd78a79447158f02caadfa7d805f", "076ce072490c493ba5b3c431f6166eda", "dd7780038f8a4cd3837972c78b6583bc", "9e285e2b58934552b98edd998b82a678", "338efda3245a4989a9b3ee0795949bb8", "136dfb68394742ea98d9eb845730846c", "891d821725b6457c9d06737bf75fe3ed", "14feb4e20339465d966a6a80504eb819", "c02b637785324b9eb88e6a2c00cb986b", "3635da14e6f04e8f90548eb6381290a8", "1314757f404e47f5b0f6fa4de8537863", "9e5f2478e931476d882e471c7f66aaeb", "4ad885d69d9f492c960ca53426189707", "992d5e88d7844a52a283c0e19475ab78", 
"43eaec936c774e3380ae4ff1a823f3dc", "ceeb11b317ac4d37b59641024f77265f", "5e0371de53164830b4e8c2b6954b5947", "63a729492e8a4a759d75b769cbb3e1e7", "14dde2c87b7b4c9ea16d48732108dcd7", "f50717b099d142be95390ae8f1e99e6a", "ffa64c304dab4ef18e9ef50ac1625cd6", "f358351612004f64adffb931c3130603", "7593358526ae4a87bf4be0eb1bcfc076", "51536b45f5674d498272dc7b2def635d", "8fbe2a3fc07943e7bf0fdc927bab795a", "6b265cc65d5a42638572c1776faafdb1", "39fa86a7760d43c793eb8ef27475af7d", "eee5113e2dd1402faf76d00f07d8e0af", "6792ed7123724b2d8091bc8d36255e68", "e35094b24c154340bb1b3ebba7ac0a0d", "dd63bb6ffed34b6687a0c79d8af93fb7", "32080bc9381c449ab63794655ec6d714", "eb7aa289fefc465d98edeed9ce2bff51", "53fae218b4b74863af5fe53a66a5f7ef", "35bc6d95c60f4c3d8ddc6b3b0845ff7e", "f4765ca278ad4da4b465bd2920a21320", "7ac6ead5baef4f30aff170a30a9a7977", "e7adb5eb38d54b29b734d207982411c8", "8f4f51b75af74daa9b9ad6696760109c", "ae4db932b7544c6cb9ff668fa954addd", "be63f07eedbd4d46ac4913df45216108", "2e47d9e7b36a4ec69a9071930671ae8e", "7b1c7f9bf0e8412abb66bcfc24cf9668", "5c8742d3f663470e9977d006e83314b7", "74ec67e07ee0477eb41e21093ae82858", "4b60a8f023bc4d759bc197b11bf4e160", "7a090f162fa84568a5e486ba935c3ed1", "8b650428a6834f5d8ebe62ad327493e0", "5c4d22bce82546d28a8b0c041895c8e3", "16121b830a2948afb3ca8eb54e27a678", "0305a4b4408f4562b87b58098148326d", "68f07b5b7ad447ce9a87023d872c2e73", "2156a5ced089414c99a1bb8dd3a0b3b7", "2e6cd134c70e455a85c47b1575135883", "f4264985b5cc4a0f970a088fb90b8bcf", "71d790bf25324e6dbb5372f636c53da9", "dac3ba29ee4d4083a9abca7eab632534", "5c75c020a1914da680340fe826f3f58d", "195e6dfb82c84f0191838acbbfe38126", "b06adcaf8d4c497897ed3625f3afb4eb", "d4ab3971183a4e8fa10402e3542e6466", "444ca1f5213241c2bc71fa9ebe9ac3ca", "34d571f76ef845f4bc272a5e05491c31", "e8ee76b022d64b2cb24a2cb7b61aeef7", "8c9ac87788b04ae6899f3b62fdc3ed0d", "431b638c435444c38e50a09573b8f31b", "0430f22e24d14171b83261faa090f349", "0fa5ae935a554461b086a4b81470b9ad", "f072e665d27e442ab4d0e2eb33c98db9", 
"fd3b1885c39c4b70b083d7fddf74d4b6", "f77051cb151645559223ecf835426688", "0e17661f878948598703ee7942e5e1a2", "fca913c6cfff48099d1744d5b091fc46", "085baf51ecef46318ceafbaba2bb4490", "52309039c2d8421bbb8e99f63f5ba91f", "f4233cd960ea4f549734a5b1e1da5e2e", "42ce1b7765f547cd9ecd8b428ec1c718", "e72a08514d3b42d2b5fbf87a920bcdf0", "ad05cf4c0ed44341aa3cd2cbd22b513d", "db9915d53d784b85accebe1552c4e7e1", "9519b6d9bf1b45e3b56da4c28d2aeb2e", "cfeb0597708b49fa9b65342e1ac446ae", "e29617eff6fd4199a74b670198ba2a69", "1cea197a15d94654a0e792318435d707", "89dcb96670a8433593e3452fad3c9210", "0802085388be453b8fe5edee7e0a01ef", "1ed257f19b8b44ee85f09e10178ae52f", "04107981561149cba5baf74ccba87aa6", "09afb010020e4b2f91d7cdbdca316962", "b11b51beaa54474cb7682110bd2d24ae", "47822470ddf842cd9e3368090549a2b5", "835bce5d87a2417c9b6a5b27627447dc", "5ca06dd536d44de784984a492d23573f", "8e75bdb4469e497c8f021ebde7c6c9b3", "7f4d4f8ece1d4651a2186f10a0cc25a5", "92036442af5f4b698f2a54ecba4650e2" ] }, "id": "b95a87a8-2136-4e03-a36c-42e5d53a3e28", "outputId": "6094328e-8c33-4b40-80e9-08c5cfb3e277" }, "outputs": [], "source": [ "# Load embedding model\n", "model_embedding = SentenceTransformer(\"intfloat/e5-small-v2\", device='cuda')" ] }, { "cell_type": "code", "execution_count": null, "id": "733cf41d-e81e-4cfc-b597-67da02dbc3cf", "metadata": { "id": "733cf41d-e81e-4cfc-b597-67da02dbc3cf" }, "outputs": [], "source": [ "# Init Chroma\n", "client = chromadb.PersistentClient(path=CHROMA_PATH)\n", "collection = client.get_or_create_collection(name=\"price_items\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1f493c7d-1c72-40f9-a5c6-63c7f6b1cf2c", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 91 }, "id": "1f493c7d-1c72-40f9-a5c6-63c7f6b1cf2c", "outputId": "72627732-4eee-4d9a-c8cb-0c42e2541a80" }, "outputs": [], "source": [ "# Format description function (no price in text)\n", "def description(item):\n", " text = item[\"text\"].replace(\"How much does this cost to the 
nearest dollar?\\n\\n\", \"\")\n", " text = text.split(\"\\n\\nPrice is $\")[0]\n", " return f\"passage: {text}\"\n", "\n", "description(train[0])" ] }, { "cell_type": "code", "execution_count": null, "id": "f44bf613-adf6-4993-bf7b-6aa9fad21a03", "metadata": { "id": "f44bf613-adf6-4993-bf7b-6aa9fad21a03" }, "outputs": [], "source": [ "batch_size = 300 # how many items to insert into Chroma at once\n", "encode_batch_size = 1024 # how many items to encode at once in GPU memory\n", "\n", "for i in tqdm(range(0, len(train), batch_size), desc=\"Processing batches\"):\n", "\n", " end_idx = min(i + batch_size, len(train))\n", "\n", " # Collect documents and metadata\n", " documents = [description(train[j]) for j in range(i, end_idx)]\n", " metadatas = [{\"price\": train[j][\"price\"]} for j in range(i, end_idx)]\n", " ids = [f\"doc_{j}\" for j in range(i, end_idx)]\n", "\n", " # GPU batch encoding\n", " vectors = model_embedding.encode(\n", " documents,\n", " batch_size=encode_batch_size,\n", " show_progress_bar=False,\n", " normalize_embeddings=True\n", " ).tolist()\n", "\n", " # Insert into Chroma\n", " collection.add(\n", " ids=ids,\n", " documents=documents,\n", " embeddings=vectors,\n", " metadatas=metadatas\n", " )\n", "\n", "print(\"βœ… Embedding and storage to ChromaDB completed.\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f2e2ccc9-b772-45f7-8258-cbc4f9c3ed59", "metadata": {}, "outputs": [], "source": [ "# Now flush and clean\n", "print(\"🧹 Cleaning up and saving ChromaDB...\")\n", "client = None\n", "gc.collect()" ] }, { "cell_type": "markdown", "id": "c35d2fab-583f-4527-a7cc-9d31214b2f35", "metadata": {}, "source": [ "Our ChromaDB is currently saved in a persistent Google Drive path; for a production-ready app, we recommend uploading it to AWS S3 for better reliability and scalability.\n", "\n", "🧩 Now that we've generated the E5 embeddings, let's use them for both **XGBoost regression** and **GPT-4o Mini with RAG** ." 
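] }, { "cell_type": "markdown", "id": "added-reload-chroma-note", "metadata": {}, "source": [ "If the Colab runtime restarts, the persisted store can be reopened from the same Drive path before continuing. A minimal sketch (reusing the `CHROMA_PATH` configured above; the count is just a sanity check):" ] }, { "cell_type": "code", "execution_count": null, "id": "added-reload-chroma-code", "metadata": {}, "outputs": [], "source": [ "# Sketch: reopen the persisted ChromaDB collection after a runtime restart\n", "client = chromadb.PersistentClient(path=CHROMA_PATH)\n", "collection = client.get_or_create_collection(name=\"price_items\")\n", "print(f\"Items in collection: {collection.count()}\")"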
] }, { "cell_type": "markdown", "id": "40e4c587-211d-4bc0-91cf-6267f45405d6", "metadata": { "id": "40e4c587-211d-4bc0-91cf-6267f45405d6" }, "source": [ "## πŸ“ˆ Embedding-Based Regression with XGBoost" ] }, { "cell_type": "code", "execution_count": null, "id": "f058ccac-3392-457d-b54c-6471960e9af3", "metadata": { "id": "f058ccac-3392-457d-b54c-6471960e9af3" }, "outputs": [], "source": [ "# Step 1: Load vectors and prices from Chroma\n", "result = collection.get(include=['embeddings', 'documents', 'metadatas'])\n", "vectors = np.array(result['embeddings'])\n", "documents = result['documents']\n", "prices = [meta['price'] for meta in result['metadatas']]" ] }, { "cell_type": "code", "execution_count": null, "id": "JYQo0RaMb8Ql", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 254 }, "id": "JYQo0RaMb8Ql", "outputId": "c1641347-1fd4-41bb-e060-147224fc6bed" }, "outputs": [], "source": [ "# Step 2: Train XGBoost model\n", "xgb_model = XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1, verbosity=0)\n", "xgb_model.fit(vectors, prices)" ] }, { "cell_type": "code", "execution_count": null, "id": "yaqG0z7jb919", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "yaqG0z7jb919", "outputId": "6a2f9120-97e0-4436-aa12-40d94fbc5c64" }, "outputs": [], "source": [ "# Step 3: Serialize XGBoost model locally for Hugging Face upload\n", "MODEL_DIR = os.path.join(ROOT, \"models\")\n", "MODEL_FILENAME = \"xgboost_model.pkl\"\n", "LOCAL_MODEL = os.path.join(MODEL_DIR, MODEL_FILENAME)\n", "\n", "os.makedirs(MODEL_DIR, exist_ok=True)\n", "joblib.dump(xgb_model, LOCAL_MODEL)" ] }, { "cell_type": "code", "execution_count": null, "id": "Z_17sQUdxIr3", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 104, "referenced_widgets": [ "2362f3121e5546b98e4623eb3680e96b", "ef53ee3b68c840d6a3fe98386d26bbd9", "a4768d0ecdd640a2a5bccd07a93c54b7", "e177440016974bc699b666fa721c6490", "2a9d0e5829174b738b4dfea1c71a3481", 
"ee6dffc7b79e405d923940166ef10590", "57bf3388622241869a5e9dab558dca72", "aa87f4feddd6409fbfb81f417e5d6662", "973a83ca118e4ed1b5a51821034ecc31", "d5a3c955aba14b3ea8e9b5c90a3bf20a", "daaa4f26bad545a394685e266f85a6ae" ] }, "id": "Z_17sQUdxIr3", "outputId": "68ebdbdb-d42e-4bc8-addc-85b42d418d1d" }, "outputs": [], "source": [ "# Step 4: Push serialized XGBoost model to Hugging Face Hub\n", "api = HfApi(token=hf_token)\n", "REPO_NAME = \"smart-deal-finder-models\"\n", "REPO_ID = f\"{HF_USER}/{REPO_NAME}\"\n", "\n", "# Create the model repo if it doesn't exist\n", "api.create_repo(repo_id=REPO_ID, repo_type=\"model\", private=True, exist_ok=True)\n", "\n", "# Upload the saved model\n", "api.upload_file(\n", " path_or_fileobj=LOCAL_MODEL,\n", " path_in_repo=MODEL_FILENAME,\n", " repo_id=REPO_ID,\n", " repo_type=\"model\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "3f59125d-9fa6-483b-957f-4423a9b2c900", "metadata": { "id": "3f59125d-9fa6-483b-957f-4423a9b2c900" }, "outputs": [], "source": [ "# Step 5: Define the predictor\n", "def xgb_predictor(datapoint):\n", " doc = description(datapoint)\n", " vector = model_embedding.encode([doc], normalize_embeddings=True)[0]\n", " return max(0, xgb_model.predict([vector])[0])" ] }, { "cell_type": "markdown", "id": "a890f1f0-d827-472f-a7a9-6c2cbe3d8341", "metadata": { "id": "a890f1f0-d827-472f-a7a9-6c2cbe3d8341" }, "source": [ "πŸ”” Reminder: In Part 2, XGBoost with Word2Vec (non-contextual embeddings) achieved:\n", "- Avg. Error: ~$107\n", "- RMSLE: 0.83\n", "- Accuracy: 29.20%\n", "\n", "πŸ§ͺ Now, let’s see if contextual embeddings improve XGBoost." 
] }, { "cell_type": "code", "execution_count": null, "id": "q-tIbVilTPxP", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 718 }, "id": "q-tIbVilTPxP", "outputId": "7c9043ef-a2c4-4933-b334-18d99690ba0f" }, "outputs": [], "source": [ "# Step 4: Run the Tester on a subset of test data\n", "tester = Tester(xgb_predictor, test)\n", "tester.run()" ] }, { "cell_type": "markdown", "id": "dcb09db0-7d69-40e1-a6e3-b92263e38f1e", "metadata": { "id": "dcb09db0-7d69-40e1-a6e3-b92263e38f1e" }, "source": [ "Xgb Predictor Error=$110.68 RMSLE=0.93 Hits=30.4%" ] }, { "cell_type": "markdown", "id": "1ccd5d3f-98cd-45a8-951f-d6446062addc", "metadata": { "id": "1ccd5d3f-98cd-45a8-951f-d6446062addc" }, "source": [ "Results are nearly the same. In this setup, switching to contextual embeddings didn’t yield performance gains for XGBoost." ] }, { "cell_type": "markdown", "id": "4db1051d-9a7e-4cec-87fc-0d77fd858ced", "metadata": { "id": "4db1051d-9a7e-4cec-87fc-0d77fd858ced" }, "source": [ "## 🚰 Retrieval-Augmented Pipeline – GPT-4o Mini\n", "\n", "- Preprocess: clean the input text (description(item))\n", "- Embed: generate embedding vector (get_embedding(item))\n", "- Retrieve: find similar items from ChromaDB (find_similar_items)\n", "- Build Prompt: create the LLM prompt using context and masked target (build_messages)\n", "- Predict: get price estimate from LLM (estimate_price)" ] }, { "cell_type": "code", "execution_count": null, "id": "YPLxSn7eHp9N", "metadata": { "id": "YPLxSn7eHp9N" }, "outputs": [], "source": [ "test[1]" ] }, { "cell_type": "code", "execution_count": null, "id": "eFxFKNroNiyD", "metadata": { "id": "eFxFKNroNiyD" }, "outputs": [], "source": [ "# Step 1: Preprocess test item text\n", "# (uses the same `description(item)` function as during training)\n", "description(test[1])" ] }, { "cell_type": "code", "execution_count": null, "id": "lxIEtSWYHqCT", "metadata": { "id": "lxIEtSWYHqCT" }, "outputs": [], "source": [ "# Step 2: Embed a test 
item\n", "def get_embedding(item):\n", " return model_embedding.encode([description(item)])" ] }, { "cell_type": "code", "execution_count": null, "id": "y43prQsuHp_w", "metadata": { "id": "y43prQsuHp_w" }, "outputs": [], "source": [ "# Step 3: Query Chroma for similar items\n", "def find_similars(item):\n", " results = collection.query(query_embeddings=get_embedding(item).astype(float).tolist(), n_results=5)\n", " documents = results['documents'][0][:]\n", " prices = [m['price'] for m in results['metadatas'][0][:]]\n", " return documents, prices" ] }, { "cell_type": "code", "execution_count": null, "id": "nxAOUFRkHp6v", "metadata": { "id": "nxAOUFRkHp6v" }, "outputs": [], "source": [ "documents, prices = find_similars(test[1])\n", "documents, prices" ] }, { "cell_type": "code", "execution_count": null, "id": "djPoSk6sHo84", "metadata": { "id": "djPoSk6sHo84" }, "outputs": [], "source": [ "# Step 4: Format similar items as context\n", "def format_context(similars, prices):\n", " message = \"To provide some context, here are some other items that might be similar to the item you need to estimate.\\n\\n\"\n", " for similar, price in zip(similars, prices):\n", " message += f\"Potentially related product:\\n{similar}\\nPrice is ${price:.2f}\\n\\n\"\n", " return message" ] }, { "cell_type": "code", "execution_count": null, "id": "F3yxhnqSHp4C", "metadata": { "id": "F3yxhnqSHp4C" }, "outputs": [], "source": [ "print(format_context(documents, prices))" ] }, { "cell_type": "code", "execution_count": null, "id": "pEJobsKNHqE8", "metadata": { "id": "pEJobsKNHqE8" }, "outputs": [], "source": [ "# Step 5: Mask the price in the test item\n", "def mask_price_value(text):\n", " return re.sub(r\"(\\n\\nPrice is \\$).*\", r\"\\1\", text)" ] }, { "cell_type": "code", "execution_count": null, "id": "vLhBNVBNQAHS", "metadata": { "id": "vLhBNVBNQAHS" }, "outputs": [], "source": [ "# Step 6: Build LLM messages\n", "def build_messages(datapoint, similars, prices):\n", "\n", " 
system_message = \"You estimate prices of items. Reply only with the price, no explanation.\"\n", "\n", " context = format_context(similars, prices)\n", "\n", " prompt = mask_price_value(datapoint[\"text\"])\n", " prompt = prompt.replace(\" to the nearest dollar\", \"\").replace(\"\\n\\nPrice is $\", \"\")\n", "\n", " user_prompt = context + \"And now the question for you:\\n\\n\" + prompt\n", "\n", " return [\n", " {\"role\": \"system\", \"content\": system_message},\n", " {\"role\": \"user\", \"content\": user_prompt},\n", " {\"role\": \"assistant\", \"content\": \"Price is $\"}\n", " ]" ] }, { "cell_type": "code", "execution_count": null, "id": "I94fNHfBHp1a", "metadata": { "id": "I94fNHfBHp1a" }, "outputs": [], "source": [ "build_messages(test[1], documents, prices)" ] }, { "cell_type": "code", "execution_count": null, "id": "5NfY_GAVHpy4", "metadata": { "id": "5NfY_GAVHpy4" }, "outputs": [], "source": [ "# Step 7: Run prediction\n", "def get_price(s):\n", " s = s.replace('$','').replace(',','')\n", " match = re.search(r\"[-+]?\\d*\\.\\d+|\\d+\", s)\n", " return float(match.group()) if match else 0\n", "\n", "def gpt_4o_mini_rag(item):\n", " documents, prices = find_similars(item)\n", " response = openai.chat.completions.create(\n", " model=\"gpt-4o-mini\",\n", " messages=build_messages(item, documents, prices),\n", " seed=42,\n", " max_tokens=5\n", " )\n", " reply = response.choices[0].message.content\n", " return get_price(reply)" ] }, { "cell_type": "code", "execution_count": null, "id": "Pg-GJTT0HpwV", "metadata": { "id": "Pg-GJTT0HpwV" }, "outputs": [], "source": [ "print(test[1][\"price\"])\n", "print(gpt_4o_mini_rag(test[1]))" ] }, { "cell_type": "markdown", "id": "54103ab4-d6dd-4c0b-add5-5d9741e934b4", "metadata": { "id": "54103ab4-d6dd-4c0b-add5-5d9741e934b4" }, "source": [ "πŸ”” Reminder: In Part 2, GPT-4o Mini (without RAG) achieved:\n", "- Avg. 
Error: ~$99\n", "- RMSLE: 0.75\n", "- Accuracy: 44.8%\n", "\n", "πŸ§ͺ Let’s find out if RAG can boost GPT-4o Mini’s price prediction capabilities.\n", " " ] }, { "cell_type": "code", "execution_count": null, "id": "r0NGJupwHppF", "metadata": { "id": "r0NGJupwHppF" }, "outputs": [], "source": [ "Tester.test(gpt_4o_mini_rag, test)" ] }, { "cell_type": "markdown", "id": "00545880-d9e1-4934-8008-b62c105d177b", "metadata": { "id": "00545880-d9e1-4934-8008-b62c105d177b" }, "source": [ "Gpt 4O Mini Rag Error=$59.54 RMSLE=0.42 Hits=69.2%" ] }, { "cell_type": "markdown", "id": "2b9f46ae-92b5-4189-89b0-df88a600bb89", "metadata": { "id": "2b9f46ae-92b5-4189-89b0-df88a600bb89" }, "source": [ "πŸŽ‰ **GPT-4o Mini + RAG shows clear gains:** \n", "Average error dropped from **$99 β†’ $59.54**, RMSLE from **0.75 β†’ 0.42**, and accuracy rose from **48.8% β†’ 69.2%**. \n", "\n", "Adding retrieval-based context led to a strong performance boost for GPT-4o Mini.\n", "\n", "Now the question is β€” can fine-tuning push it even further, surpass RAG, and challenge larger models?\n", "\n", "πŸ”œ See you in the [next notebook](https://github.com/lisekarimi/lexo/blob/main/09_part4_ft_gpt4omini.ipynb)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "A100", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 5 }