LLM_Engineering_OLD/week8/day2.4.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "40d49349-faaa-420c-9b65-0bdc9edfabce",
   "metadata": {},
   "source": [
    "# The Price is Right\n",
    "\n",
    "Today we build a more complex solution for estimating prices of goods.\n",
    "\n",
    "1. Day 2.0 notebook: create a RAG database with our 400,000 training data\n",
    "2. Day 2.1 notebook: visualize in 2D\n",
    "3. Day 2.2 notebook: visualize in 3D\n",
    "4. Day 2.3 notebook: build and test a RAG pipeline with GPT-4o-mini\n",
    "5. Day 2.4 notebook: (a) bring back our Random Forest pricer (b) Create a Ensemble pricer that allows contributions from all the pricers\n",
    "\n",
    "Phew! That's a lot to get through in one day!\n",
    "\n",
    "## PLEASE NOTE:\n",
    "\n",
    "We already have a very powerful product estimator with our proprietary, fine-tuned LLM. Most people would be very satisfied with that! The main reason we're adding these extra steps is to deepen your expertise with RAG and with Agentic workflows.\n",
    "\n",
    "## Finishing off with Random Forests & Ensemble"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbcdfea8-7241-46d7-a771-c0381a3e7063",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import os\n",
    "import re\n",
    "import math\n",
    "import json\n",
    "from tqdm import tqdm\n",
    "import random\n",
    "from dotenv import load_dotenv\n",
    "from huggingface_hub import login\n",
    "import numpy as np\n",
    "import pickle\n",
    "from openai import OpenAI\n",
    "from sentence_transformers import SentenceTransformer\n",
    "from datasets import load_dataset\n",
    "import chromadb\n",
    "from items import Item\n",
    "from testing import Tester\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "import joblib\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6e88bd1-f89c-4b98-92fa-aa4bc1575bca",
   "metadata": {},
   "outputs": [],
   "source": [
    "# CONSTANTS\n",
    "\n",
    "QUESTION = \"How much does this cost to the nearest dollar?\\n\\n\"\n",
    "DB = \"products_vectorstore\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98666e73-938e-469d-8987-e6e55ba5e034",
   "metadata": {},
   "outputs": [],
   "source": [
    "# environment\n",
    "\n",
    "load_dotenv(override=True)\n",
    "os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n",
    "os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc696493-0b6f-48aa-9fa8-b1ae0ecaf3cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load in the test pickle file:\n",
    "\n",
    "with open('test.pkl', 'rb') as file:\n",
    "    test = pickle.load(file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d26a1104-cd11-4361-ab25-85fb576e0582",
   "metadata": {},
   "outputs": [],
   "source": [
    "client = chromadb.PersistentClient(path=DB)\n",
    "collection = client.get_or_create_collection('products')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e00b82a9-a8dc-46f1-8ea9-2f07cbc8e60d",
   "metadata": {},
   "outputs": [],
   "source": [
    "result = collection.get(include=['embeddings', 'documents', 'metadatas'])\n",
    "vectors = np.array(result['embeddings'])\n",
    "documents = result['documents']\n",
    "prices = [metadata['price'] for metadata in result['metadatas']]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf6492cb-b11a-4ad5-859b-a71a78ffb949",
   "metadata": {},
   "source": [
    "# Random Forest\n",
    "\n",
    "We will now train a Random Forest model.\n",
    "\n",
    "Can you spot the difference from what we did in Week 6? In week 6 we used the word2vec model to form vectors; this time we'll use the vectors we already have in Chroma, from the SentenceTransformer model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48894777-101f-4fe5-998c-47079407f340",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This next line takes an hour on my M1 Mac!\n",
    "\n",
    "rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)\n",
    "rf_model.fit(vectors, prices)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62eb7ddf-e1da-481e-84c6-1256547566bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save the model to a file\n",
    "\n",
    "joblib.dump(rf_model, 'random_forest_model.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d281dc5e-761e-4a5e-86b3-29d9c0a33d4a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load it back in again\n",
    "\n",
    "rf_model = joblib.load('random_forest_model.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d438dec-8e5b-4e60-bb6f-c3f82e522dd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from agents.specialist_agent import SpecialistAgent\n",
    "from agents.frontier_agent import FrontierAgent\n",
    "from agents.random_forest_agent import RandomForestAgent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "afc39369-b97b-4a90-b17e-b20ef501d3c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "specialist = SpecialistAgent()\n",
    "frontier = FrontierAgent(collection)\n",
    "random_forest = RandomForestAgent()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e2d0d0a-8bb8-4b39-b046-322828c39244",
   "metadata": {},
   "outputs": [],
   "source": [
    "def description(item):\n",
    "    return item.prompt.split(\"to the nearest dollar?\\n\\n\")[1].split(\"\\n\\nPrice is $\")[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfe0434f-b29e-4cc0-bad9-b07624665727",
   "metadata": {},
   "outputs": [],
   "source": [
    "def rf(item):\n",
    "    return random_forest.price(description(item))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdf233ec-264f-4b34-9f2b-27c39692137b",
   "metadata": {},
   "outputs": [],
   "source": [
    "Tester.test(rf, test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f759bd2-7a7e-4c1a-80a0-e12470feca89",
   "metadata": {},
   "outputs": [],
   "source": [
    "product = \"Quadcast HyperX condenser mic for high quality audio for podcasting\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e44dbd25-fb95-4b6b-bbbb-8da5fc817105",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(specialist.price(product))\n",
    "print(frontier.price(product))\n",
    "print(random_forest.price(product))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1779b353-e2bb-4fc7-be7c-93057e4d688a",
   "metadata": {},
   "outputs": [],
   "source": [
    "specialists = []\n",
    "frontiers = []\n",
    "random_forests = []\n",
    "prices = []\n",
    "for item in tqdm(test[1000:1250]):\n",
    "    text = description(item)\n",
    "    specialists.append(specialist.price(text))\n",
    "    frontiers.append(frontier.price(text))\n",
    "    random_forests.append(random_forest.price(text))\n",
    "    prices.append(item.price)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0bca725-4e34-405b-8d90-41d67086a25d",
   "metadata": {},
   "outputs": [],
   "source": [
    "mins = [min(s,f,r) for s,f,r in zip(specialists, frontiers, random_forests)]\n",
    "maxes = [max(s,f,r) for s,f,r in zip(specialists, frontiers, random_forests)]\n",
    "\n",
    "X = pd.DataFrame({\n",
    "    'Specialist': specialists,\n",
    "    'Frontier': frontiers,\n",
    "    'RandomForest': random_forests,\n",
    "    'Min': mins,\n",
    "    'Max': maxes,\n",
    "})\n",
    "\n",
    "# Convert y to a Series\n",
    "y = pd.Series(prices)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1be5be8a-3e7f-42a2-be54-0c7e380f7cc4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train a Linear Regression\n",
    "np.random.seed(42)\n",
    "\n",
    "lr = LinearRegression()\n",
    "lr.fit(X, y)\n",
    "\n",
    "feature_columns = X.columns.tolist()\n",
    "\n",
    "for feature, coef in zip(feature_columns, lr.coef_):\n",
    "    print(f\"{feature}: {coef:.2f}\")\n",
    "print(f\"Intercept={lr.intercept_:.2f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0bdf6e68-28a3-4ed2-b17e-de0ede923d34",
   "metadata": {},
   "outputs": [],
   "source": [
    "joblib.dump(lr, 'ensemble_model.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e762441a-9470-4dd7-8a8f-ec0430e908c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from agents.ensemble_agent import EnsembleAgent\n",
    "ensemble = EnsembleAgent(collection)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a29f03c-8010-43b7-ae7d-1bc85ca6e8e2",
   "metadata": {},
   "outputs": [],
   "source": [
    "ensemble.price(product)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6a5e226-a508-43d5-aa42-cefbde72ffdf",
   "metadata": {},
   "outputs": [],
   "source": [
    "def ensemble_pricer(item):\n",
    "    return max(0,ensemble.price(description(item)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8397b1ef-2ea3-4af8-bb34-36594e0600cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "Tester.test(ensemble_pricer, test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "347c5350-d4b5-42ae-96f6-ec94f6ab41d7",
   "metadata": {},
   "source": [
    "# WHAT A DAY!\n",
    "\n",
    "We got so much done - a Fronter RAG pipeline, a Random Forest model using transformer-based encodings, and an Ensemble model.\n",
    "\n",
    "You can do better, for sure!\n",
    "\n",
    "Tweak this, and try adding components into the ensemble, to beat my performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85009065-851e-44a2-b39f-4c116f7fbd22",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}