{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "46d90d45-2d19-49c7-b853-6809dc417ea7",
   "metadata": {},
   "source": [
    "# Extra Project - Trading Code Generator\n",
    "\n",
    "This is an example extra project to show fine-tuning in action, applied to code generation.\n",
    "\n",
    "## Project Brief\n",
    "\n",
    "Build a prototype LLM that can generate example code to suggest trading decisions to buy or sell stocks!\n",
    "\n",
    "I generated test data using frontier models, in the other files in this directory. Use this to train an open source code model.\n",
    "\n",
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#f71;\">This project is provided as an extra resource</h2>\n",
    "            <span style=\"color:#f71;\">It will make most sense after completing Week 7, and might trigger some ideas for your own projects.<br/><br/>\n",
    "            This is provided without a detailed walk-through; the output from the Colab has been saved (see last cell) so you can review the results if you have any problems running it yourself.\n",
    "            </span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>\n",
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#900;\">Do not use for actual trading decisions!!</h2>\n",
    "            <span style=\"color:#900;\">It hopefully goes without saying: this project will generate toy trading code that is over-simplified and untrusted.<br/><br/>Please do not make actual trading decisions based on this!</span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "7e2c4bbb-5e8b-4d84-9997-ecb2c349cf54",
   "metadata": {},
   "source": [
    "## First step - generate training data from examples\n",
    "\n",
    "There are three sample Python files, generated (via multiple queries) by GPT-4o, Claude 3 Opus and Gemini 1.5 Pro.\n",
    "\n",
    "This notebook creates training data from these files, then converts it to the HuggingFace format and uploads it to the hub.\n",
    "\n",
    "Afterwards, we will move to Google Colab to fine-tune."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16cf3aa2-f407-4b95-8b9e-c3c586f67835",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import glob\n",
    "import random\n",
    "import matplotlib.pyplot as plt\n",
    "from datasets import Dataset\n",
    "from dotenv import load_dotenv\n",
    "from huggingface_hub import login\n",
    "from transformers import AutoTokenizer"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "375302b6-b6a7-46ea-a74c-c2400dbd8bbe",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables from a file called .env\n",
    "\n",
    "load_dotenv()\n",
    "hf_token = os.getenv('HF_TOKEN')\n",
    "if hf_token and hf_token.startswith(\"hf_\") and len(hf_token) > 5:\n",
    "    print(\"HuggingFace Token looks good so far\")\n",
    "else:\n",
    "    print(\"Potential problem with HuggingFace token - please check your .env file, and see the Troubleshooting notebook for more\")"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a0c9fff-9eff-42fd-971b-403c99d9b726",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Constants\n",
    "\n",
    "DATASET_NAME = \"trade_code_data\"\n",
    "BASE_MODEL = \"Qwen/CodeQwen1.5-7B\""
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "586b07ba-5396-4c34-a696-01c8bc3597a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A utility method to convert the text contents of a file into a list of method bodies\n",
    "\n",
    "def extract_method_bodies(text):\n",
    "    chunks = text.split('def trade')[1:]\n",
    "    results = []\n",
    "    for chunk in chunks:\n",
    "        lines = chunk.split('\\n')[1:]\n",
    "        body = '\\n'.join(line for line in lines if line != '')  # drop blank lines\n",
    "        results.append(body)\n",
    "    return results"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "953422d0-2e75-4d01-862e-6383df54d9e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read all .py files and convert into training data\n",
    "\n",
    "bodies = []\n",
    "for filename in glob.glob(\"*.py\"):\n",
    "    with open(filename, 'r', encoding='utf-8') as file:\n",
    "        content = file.read()\n",
    "        extracted = extract_method_bodies(content)\n",
    "        bodies += extracted\n",
    "\n",
    "print(f\"Extracted {len(bodies)} trade method bodies\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "836480e9-ba23-4aa3-a7e2-2666884e9a06",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's look at one\n",
    "\n",
    "print(random.choice(bodies))"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47b10e7e-a562-4968-af3f-254a9b424ac8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize the lines of code in each training sample\n",
    "\n",
    "%matplotlib inline\n",
    "fig, ax = plt.subplots(1, 1)\n",
    "lengths = [len(body.split('\\n')) for body in bodies]\n",
    "ax.set_xlabel('Lines of code')\n",
    "ax.set_ylabel('Count of training samples')\n",
    "_ = ax.hist(lengths, rwidth=0.7, color=\"green\", bins=range(0, max(lengths)))"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03b37f62-679e-4c3d-9e5b-5878a82696e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add the prompt to the start of every training example\n",
    "\n",
    "prompt = \"\"\"\n",
    "# tickers is a list of stock tickers\n",
    "import tickers\n",
    "\n",
    "# prices is a dict; the key is a ticker and the value is a list of historic prices, today first\n",
    "import prices\n",
    "\n",
    "# Trade represents a decision to buy or sell a quantity of a ticker\n",
    "import Trade\n",
    "\n",
    "import random\n",
    "import numpy as np\n",
    "\n",
    "def trade():\n",
    "\"\"\"\n",
    "\n",
    "data = [prompt + body for body in bodies]\n",
    "print(random.choice(data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28fdb82f-3864-4023-8263-547d17571a5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of tokens in our dataset\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)\n",
    "tokenized_data = [tokenizer.encode(each) for each in data]\n",
    "token_counts = [len(tokens) for tokens in tokenized_data]\n",
    "\n",
    "%matplotlib inline\n",
    "fig, ax = plt.subplots(1, 1)\n",
    "ax.set_xlabel('Number of tokens')\n",
    "ax.set_ylabel('Count of training samples')\n",
    "_ = ax.hist(token_counts, rwidth=0.7, color=\"purple\", bins=range(0, max(token_counts), 20))"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "b4eb73b0-88ef-4aeb-8e5b-fe7050109ba0",
   "metadata": {},
   "source": [
    "# Enforcing a maximum token length\n",
    "\n",
    "We need to specify a maximum number of tokens when we fine-tune.\n",
    "\n",
    "Let's pick a cut-off, and only keep training data points that fit within this number of tokens."
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ffb0d55c-5602-4518-b811-fa385c0959a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "CUTOFF = 320\n",
    "truncated = len([tokens for tokens in tokenized_data if len(tokens) > CUTOFF])\n",
    "percentage = truncated/len(tokenized_data)*100\n",
    "print(f\"With cutoff at {CUTOFF}, we truncate {truncated} datapoints which is {percentage:.1f}% of the dataset\")"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7064ef0a-7b07-4f24-a580-cbef2c5e1f2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's only keep datapoints that wouldn't get truncated\n",
    "\n",
    "filtered_data = [datapoint for datapoint in data if len(tokenizer.encode(datapoint)) <= CUTOFF]\n",
    "print(f\"After filtering, we now have {len(filtered_data)} datapoints\")"
   ]
  },
|
|
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb2bb067-2bd3-498b-9fc8-5e8186afbe27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mix up the data\n",
    "\n",
    "random.seed(42)\n",
    "random.shuffle(filtered_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26713fb9-765f-4524-b9db-447e97686d1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# I don't make a Training / Test split - if we had more training data, we would!\n",
    "\n",
    "dataset = Dataset.from_dict({'text': filtered_data})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfabba27-ef47-46a8-a26b-4d650ae3b193",
   "metadata": {},
   "outputs": [],
   "source": [
    "login(hf_token)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55b595cd-2df7-4be4-aec1-0667b17d36f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Push your dataset to your hub\n",
    "# I've also pushed the data to my account and made it public, which you can use from the colab below\n",
    "\n",
    "dataset.push_to_hub(DATASET_NAME, private=True)"
   ]
  },
|
|
  {
   "cell_type": "markdown",
   "id": "4691a025-9800-4e97-a20f-a65f102401f1",
   "metadata": {},
   "source": [
    "## And now to head over to a Google Colab for fine-tuning in the cloud\n",
    "\n",
    "Follow this link for the Colab:\n",
    "\n",
    "https://colab.research.google.com/drive/1wry2-4AGw-U7K0LQ_jEgduoTQqVIvo1x?usp=sharing\n",
    "\n",
    "I've also saved this Colab with output included, so you can see the results without needing to train if you'd prefer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04a6c3e0-a2e6-4115-a01a-45e79dfdb730",
   "metadata": {},
   "outputs": [],
   "source": []
  }
|
|
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}