{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "46d90d45-2d19-49c7-b853-6809dc417ea7",
   "metadata": {},
   "source": [
    "# Extra Project - Trading Code Generator\n",
    "\n",
    "This is an example extra project to show fine-tuning in action, applied to code generation.\n",
    "\n",
    "## Project Brief\n",
    "\n",
    "Build a prototype LLM that can generate example code to suggest trading decisions to buy or sell stocks!\n",
    "\n",
    "I generated test data using frontier models, in the other files in this directory. Use this to train an open source code model.\n",
    "\n",
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#f71;\">This project is provided as an extra resource</h2>\n",
    "            <span style=\"color:#f71;\">It will make most sense after completing Week 7, and might trigger some ideas for your own projects.<br/><br/>\n",
    "            This is provided without a detailed walk-through; the output from the Colab has been saved (see last cell) so you can review the results if you have any problems running it yourself.\n",
    "            </span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>\n",
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#900;\">Do not use for actual trading decisions!!</h2>\n",
    "            <span style=\"color:#900;\">It hopefully goes without saying: this project will generate toy trading code that is over-simplified and untrusted.<br/><br/>Please do not make actual trading decisions based on this!</span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "7e2c4bbb-5e8b-4d84-9997-ecb2c349cf54",
   "metadata": {},
   "source": [
    "## First step - generate training data from examples\n",
    "\n",
    "There are three sample Python files, generated (via multiple queries) by GPT-4o, Claude 3 Opus and Gemini 1.5 Pro.\n",
    "\n",
    "This notebook creates training data from these files, then converts it to the HuggingFace format and uploads it to the hub.\n",
    "\n",
    "Afterwards, we will move to Google Colab to fine-tune."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16cf3aa2-f407-4b95-8b9e-c3c586f67835",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import glob\n",
    "import random\n",
    "import matplotlib.pyplot as plt\n",
    "from datasets import Dataset\n",
    "from dotenv import load_dotenv\n",
    "from huggingface_hub import login\n",
    "from transformers import AutoTokenizer"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "375302b6-b6a7-46ea-a74c-c2400dbd8bbe",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables from a file called .env\n",
    "\n",
    "load_dotenv()\n",
    "hf_token = os.getenv('HF_TOKEN')\n",
    "if hf_token and hf_token.startswith(\"hf_\") and len(hf_token) > 5:\n",
    "    print(\"HuggingFace Token looks good so far\")\n",
    "else:\n",
    "    print(\"Potential problem with HuggingFace token - please check your .env file, and see the Troubleshooting notebook for more\")"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a0c9fff-9eff-42fd-971b-403c99d9b726",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Constants\n",
    "\n",
    "DATASET_NAME = \"trade_code_data\"\n",
    "BASE_MODEL = \"Qwen/CodeQwen1.5-7B\""
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "586b07ba-5396-4c34-a696-01c8bc3597a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A utility method to convert the text contents of a file into a list of method bodies\n",
    "\n",
    "def extract_method_bodies(text):\n",
    "    chunks = text.split('def trade')[1:]\n",
    "    results = []\n",
    "    for chunk in chunks:\n",
    "        lines = chunk.split('\\n')[1:]\n",
    "        body = '\\n'.join(line for line in lines if line != '')  # drop blank lines\n",
    "        results.append(body)\n",
    "    return results"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "953422d0-2e75-4d01-862e-6383df54d9e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read all .py files and convert into training data\n",
    "\n",
    "bodies = []\n",
    "for filename in glob.glob(\"*.py\"):\n",
    "    with open(filename, 'r', encoding='utf-8') as file:\n",
    "        content = file.read()\n",
    "        extracted = extract_method_bodies(content)\n",
    "        bodies += extracted\n",
    "\n",
    "print(f\"Extracted {len(bodies)} trade method bodies\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "836480e9-ba23-4aa3-a7e2-2666884e9a06",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's look at one\n",
    "\n",
    "print(random.choice(bodies))"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47b10e7e-a562-4968-af3f-254a9b424ac8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize the lines of code in each training sample\n",
    "\n",
    "%matplotlib inline\n",
    "fig, ax = plt.subplots(1, 1)\n",
    "lengths = [len(body.split('\\n')) for body in bodies]\n",
    "ax.set_xlabel('Lines of code')\n",
    "ax.set_ylabel('Count of training samples')\n",
    "_ = ax.hist(lengths, rwidth=0.7, color=\"green\", bins=range(0, max(lengths)))"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03b37f62-679e-4c3d-9e5b-5878a82696e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add the prompt to the start of every training example\n",
    "\n",
    "prompt = \"\"\"\n",
    "# tickers is a list of stock tickers\n",
    "import tickers\n",
    "\n",
    "# prices is a dict; the key is a ticker and the value is a list of historic prices, today first\n",
    "import prices\n",
    "\n",
    "# Trade represents a decision to buy or sell a quantity of a ticker\n",
    "import Trade\n",
    "\n",
    "import random\n",
    "import numpy as np\n",
    "\n",
    "def trade():\n",
    "\"\"\"\n",
    "\n",
    "data = [prompt + body for body in bodies]\n",
    "print(random.choice(data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28fdb82f-3864-4023-8263-547d17571a5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of tokens in our dataset\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)\n",
    "tokenized_data = [tokenizer.encode(each) for each in data]\n",
    "token_counts = [len(tokens) for tokens in tokenized_data]\n",
    "\n",
    "%matplotlib inline\n",
    "fig, ax = plt.subplots(1, 1)\n",
    "ax.set_xlabel('Number of tokens')\n",
    "ax.set_ylabel('Count of training samples')\n",
    "_ = ax.hist(token_counts, rwidth=0.7, color=\"purple\", bins=range(0, max(token_counts), 20))"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "b4eb73b0-88ef-4aeb-8e5b-fe7050109ba0",
   "metadata": {},
   "source": [
    "# Enforcing a maximum token length\n",
    "\n",
    "We need to specify a maximum number of tokens when we fine-tune.\n",
    "\n",
    "Let's pick a cut-off, and only keep training data points that fit within this number of tokens."
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ffb0d55c-5602-4518-b811-fa385c0959a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "CUTOFF = 320\n",
    "truncated = len([tokens for tokens in tokenized_data if len(tokens) > CUTOFF])\n",
    "percentage = truncated/len(tokenized_data)*100\n",
    "print(f\"With cutoff at {CUTOFF}, we truncate {truncated} datapoints which is {percentage:.1f}% of the dataset\")"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7064ef0a-7b07-4f24-a580-cbef2c5e1f2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's only keep datapoints that wouldn't get truncated\n",
    "\n",
    "filtered_data = [datapoint for datapoint in data if len(tokenizer.encode(datapoint)) <= CUTOFF]\n",
    "print(f\"After filtering, we now have {len(filtered_data)} datapoints\")"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb2bb067-2bd3-498b-9fc8-5e8186afbe27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mix up the data\n",
    "\n",
    "random.seed(42)\n",
    "random.shuffle(filtered_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26713fb9-765f-4524-b9db-447e97686d1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# I don't make a Training / Test split - if we had more training data, we would!\n",
    "\n",
    "dataset = Dataset.from_dict({'text': filtered_data})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfabba27-ef47-46a8-a26b-4d650ae3b193",
   "metadata": {},
   "outputs": [],
   "source": [
    "login(hf_token)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55b595cd-2df7-4be4-aec1-0667b17d36f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Push your dataset to your hub\n",
    "# I've also pushed the data to my account and made it public, which you can use from the colab below\n",
    "\n",
    "dataset.push_to_hub(DATASET_NAME, private=True)"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "4691a025-9800-4e97-a20f-a65f102401f1",
   "metadata": {},
   "source": [
    "## And now to head over to a Google Colab for fine-tuning in the cloud\n",
    "\n",
    "Follow this link for the Colab:\n",
    "\n",
    "https://colab.research.google.com/drive/1wry2-4AGw-U7K0LQ_jEgduoTQqVIvo1x?usp=sharing\n",
    "\n",
    "I've also saved this Colab with output included, so you can see the results without needing to train if you'd prefer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04a6c3e0-a2e6-4115-a01a-45e79dfdb730",
   "metadata": {},
   "outputs": [],
   "source": []
  }
|
|
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}