Merge branch 'main' of github.com:ed-donner/llm_engineering

This commit is contained in:
Edward Donner
2025-09-20 16:25:25 -04:00
31 changed files with 13394 additions and 48 deletions


@@ -0,0 +1,668 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# YOUR FIRST LAB\n",
"### Please read this section. This is valuable to get you prepared, even if it's a long read -- it's important stuff.\n",
"\n",
"## Your first Frontier LLM Project\n",
"\n",
"Let's build a useful LLM solution - in a matter of minutes.\n",
"\n",
"By the end of this course, you will have built an autonomous Agentic AI solution with 7 agents that collaborate to solve a business problem. All in good time! We will start with something smaller...\n",
"\n",
"Our goal is to code a new kind of Web Browser. Give it a URL, and it will respond with a summary. The Reader's Digest of the internet!!\n",
"\n",
"Before starting, you should have completed the setup for [PC](../SETUP-PC.md) or [Mac](../SETUP-mac.md) and you hopefully launched this jupyter lab from within the project root directory, with your environment activated.\n",
"\n",
"## If you're new to Jupyter Lab\n",
"\n",
"Welcome to the wonderful world of Data Science experimentation! Once you've used Jupyter Lab, you'll wonder how you ever lived without it. Simply click in each \"cell\" with code in it, such as the cell immediately below this text, and hit Shift+Return to execute that cell. As you wish, you can add a cell with the + button in the toolbar, and print values of variables, or try out variations. \n",
"\n",
"I've written a notebook called [Guide to Jupyter](Guide%20to%20Jupyter.ipynb) to help you get more familiar with Jupyter Lab, including adding Markdown comments, using `!` to run shell commands, and `tqdm` to show progress.\n",
"\n",
"## If you're new to the Command Line\n",
"\n",
"Please see these excellent guides: [Command line on PC](https://chatgpt.com/share/67b0acea-ba38-8012-9c34-7a2541052665) and [Command line on Mac](https://chatgpt.com/canvas/shared/67b0b10c93a081918210723867525d2b). \n",
"\n",
"## If you'd prefer to work in IDEs\n",
"\n",
"If you're more comfortable in IDEs like VSCode, Cursor or PyCharm, they all work great with these lab notebooks too. \n",
"If you'd prefer to work in VSCode, [here](https://chatgpt.com/share/676f2e19-c228-8012-9911-6ca42f8ed766) are instructions from an AI friend on how to configure it for the course.\n",
"\n",
"## If you'd like to brush up your Python\n",
"\n",
"I've added a notebook called [Intermediate Python](Intermediate%20Python.ipynb) to get you up to speed. But you should give it a miss if you already have a good idea what this code does: \n",
"`yield from {book.get(\"author\") for book in books if book.get(\"author\")}`\n",
"\n",
"## I am here to help\n",
"\n",
"If you have any problems at all, please do reach out. \n",
"I'm available through the platform, or at ed@edwarddonner.com, or at https://www.linkedin.com/in/eddonner/ if you'd like to connect (and I love connecting!) \n",
"And this is new to me, but I'm also trying out X/Twitter at [@edwarddonner](https://x.com/edwarddonner) - if you're on X, please show me how it's done 😂 \n",
"\n",
"## More troubleshooting\n",
"\n",
"Please see the [troubleshooting](troubleshooting.ipynb) notebook in this folder to diagnose and fix common problems. At the very end of it is a diagnostics script with some useful debug info.\n",
"\n",
"## For foundational technical knowledge (eg Git, APIs, debugging) \n",
"\n",
"If you're relatively new to programming -- I've got your back! While it's ideal to have some programming experience for this course, there's only one mandatory prerequisite: plenty of patience. 😁 I've put together a set of self-study guides that cover Git and GitHub, APIs and endpoints, beginner python and more.\n",
"\n",
"This covers Git and GitHub; what they are, the difference, and how to use them: \n",
"https://github.com/ed-donner/agents/blob/main/guides/03_git_and_github.ipynb\n",
"\n",
"This covers technical foundations: \n",
"ChatGPT vs API; taking screenshots; Environment Variables; Networking basics; APIs and endpoints: \n",
"https://github.com/ed-donner/agents/blob/main/guides/04_technical_foundations.ipynb\n",
"\n",
"This covers Python for beginners, and making sure that a `NameError` never trips you up: \n",
"https://github.com/ed-donner/agents/blob/main/guides/06_python_foundations.ipynb\n",
"\n",
"This covers the essential techniques for figuring out errors: \n",
"https://github.com/ed-donner/agents/blob/main/guides/08_debugging.ipynb\n",
"\n",
"And you'll find other useful guides in the same folder in GitHub. Some information applies to my other Udemy course (eg Async Python) but most of it is very relevant for LLM engineering.\n",
"\n",
"## If this is old hat!\n",
"\n",
"If you're already comfortable with today's material, please hang in there; you can move swiftly through the first few labs - we will get much more in depth as the weeks progress. Ultimately we will fine-tune our own LLM to compete with OpenAI!\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#900;\">Please read - important note</h2>\n",
" <span style=\"color:#900;\">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations. If you have a Github account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...</span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#f71;\">This code is a live resource - keep an eye out for my emails</h2>\n",
" <span style=\"color:#f71;\">I push updates to the code regularly. As people ask questions, I add more examples or improved commentary. As a result, you'll notice that the code below isn't identical to the videos. Everything from the videos is here; but I've also added better explanations and new models like DeepSeek. Consider this like an interactive book.<br/><br/>\n",
" I try to send emails regularly with important updates related to the course. You can find this in the 'Announcements' section of Udemy in the left sidebar. You can also choose to receive my emails via your Notification Settings in Udemy. I'm respectful of your inbox and always try to add value with my emails!\n",
" </span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#181;\">Business value of these exercises</h2>\n",
" <span style=\"color:#181;\">A final thought. While I've designed these notebooks to be educational, I've also tried to make them enjoyable. We'll do fun things like have LLMs tell jokes and argue with each other. But fundamentally, my goal is to teach skills you can apply in business. I'll explain business implications as we go, and it's worth keeping this in mind: as you build experience with models and techniques, think of ways you could put this into action at work today. Please do contact me if you'd like to discuss more or if you have ideas to bounce off me.</span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.options import Options\n",
"from selenium.webdriver.common.by import By\n",
"import time\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "markdown",
"id": "6900b2a8-6384-4316-8aaa-5e519fca4254",
"metadata": {},
"source": [
"# Connecting to OpenAI (or Ollama)\n",
"\n",
"The next cell is where we load in the environment variables in your `.env` file and connect to OpenAI. \n",
"\n",
"If you'd like to use free Ollama instead, please see the README section \"Free Alternative to Paid APIs\", and if you're not sure how to do this, there's a full solution in the solutions folder (day1_with_ollama.ipynb).\n",
"\n",
"## Troubleshooting if you have problems:\n",
"\n",
"Head over to the [troubleshooting](troubleshooting.ipynb) notebook in this folder for step by step code to identify the root cause and fix it!\n",
"\n",
"If you make a change, try restarting the \"Kernel\" (the python process sitting behind this notebook) by Kernel menu >> Restart Kernel and Clear Outputs of All Cells. Then try this notebook again, starting at the top.\n",
"\n",
"Or, contact me! Message me or email ed@edwarddonner.com and we will get this to work.\n",
"\n",
"Any concerns about API costs? See my notes in the README - costs should be minimal, and you can control them at every point. You can also use Ollama as a free alternative, which we discuss during Day 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
"    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif api_key.strip() != api_key:\n",
"    # Check for stray whitespace first: a leading space would otherwise be reported as a wrong prefix\n",
"    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"else:\n",
"    print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "019974d9-f3ad-4a8a-b5f9-0a3719aea2d3",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the Troubleshooting notebook in this folder for full instructions"
]
},
{
"cell_type": "markdown",
"id": "442fc84b-0815-4f40-99ab-d9a5da6bda91",
"metadata": {},
"source": [
"# Let's make a quick call to a Frontier model to get started, as a preview!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a58394bf-1e45-46af-9bfd-01e24da6f49a",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.\n",
"\n",
"message = \"Hello, GPT! This is my first ever message to you! Hi!\"\n",
"response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=[{\"role\":\"user\", \"content\":message}])\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "2aa190e5-cb31-456a-96cc-db109919cd78",
"metadata": {},
"source": [
"## OK onwards with our first project"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5e793b2-6775-426a-a139-4848291d0463",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"# If you're not familiar with Classes, check out the \"Intermediate Python\" notebook\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"    def __init__(self, url):\n",
"        self.url = url\n",
"        self.title = None\n",
"        self.text = None\n",
"\n",
"        # --- Try BeautifulSoup first ---\n",
"        try:\n",
"            response = requests.get(url, headers=headers, timeout=10)\n",
"            soup = BeautifulSoup(response.content, 'html.parser')\n",
"            self.title = soup.title.string if soup.title else \"No title found\"\n",
"            for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
"                irrelevant.decompose()\n",
"            self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n",
"        except Exception:\n",
"            self.text = None\n",
"\n",
"        # --- If BeautifulSoup fails or gets very little text, try Selenium ---\n",
"        if not self.text or len(self.text) < 100:\n",
"            driver = None\n",
"            try:\n",
"                options = Options()\n",
"                options.add_argument(\"--headless\")\n",
"                options.add_argument(\"--no-sandbox\")\n",
"                options.add_argument(\"--disable-dev-shm-usage\")\n",
"                # No need to specify executable_path; Selenium Manager will handle the driver\n",
"                driver = webdriver.Chrome(options=options)\n",
"                driver.get(url)\n",
"                time.sleep(3)  # Wait for JS to load\n",
"                self.title = driver.title or self.title or \"No title found\"\n",
"                try:\n",
"                    body = driver.find_element(By.TAG_NAME, \"body\")\n",
"                    self.text = body.text\n",
"                except Exception:\n",
"                    self.text = \"No body text found\"\n",
"            except Exception as e:\n",
"                self.text = f\"Selenium failed: {e}\"\n",
"            finally:\n",
"                # Always close the browser, even if loading the page failed\n",
"                if driver is not None:\n",
"                    driver.quit()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
"metadata": {},
"outputs": [],
"source": [
"# Let's try one out. Change the website and add print statements to follow along.\n",
"\n",
"ed = Website(\"https://edwarddonner.com\")\n",
"print(ed.title)\n",
"print(ed.text)"
]
},
{
"cell_type": "markdown",
"id": "6a478a0c-2c53-48ff-869c-4d08199931e1",
"metadata": {},
"source": [
"## Types of prompts\n",
"\n",
"You may know this already - but if not, you will get very familiar with it!\n",
"\n",
"Models like GPT-4o have been trained to receive instructions in a particular way.\n",
"\n",
"They expect to receive:\n",
"\n",
"**A system prompt** that tells them what task they are performing and what tone they should use\n",
"\n",
"**A user prompt** -- the conversation starter that they should reply to"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to \"Respond in markdown in Spanish.\"\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
"    user_prompt = f\"You are looking at a website titled {website.title}\"\n",
"    user_prompt += \"\\nThe contents of this website are as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
"    user_prompt += website.text\n",
"    return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26448ec4-5c00-4204-baec-7df91d11ff2e",
"metadata": {},
"outputs": [],
"source": [
"print(user_prompt_for(ed))"
]
},
{
"cell_type": "markdown",
"id": "ea211b5f-28e1-4a86-8e52-c0b7677cadcc",
"metadata": {},
"source": [
"## Messages\n",
"\n",
"The API from OpenAI expects to receive messages in a particular structure.\n",
"Many of the other APIs share this structure:\n",
"\n",
"```python\n",
"[\n",
" {\"role\": \"system\", \"content\": \"system message goes here\"},\n",
" {\"role\": \"user\", \"content\": \"user message goes here\"}\n",
"]\n",
"```\n",
"To give you a preview, the next 2 cells make a rather simple call - we won't stretch the mighty GPT (yet!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f25dcd35-0cd0-4235-9f64-ac37ed9eaaa5",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
" {\"role\": \"system\", \"content\": \"You are a snarky assistant\"},\n",
" {\"role\": \"user\", \"content\": \"What is 2 + 2?\"}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21ed95c5-7001-47de-a36d-1d6673b403ce",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with system and user messages:\n",
"\n",
"response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=messages)\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "d06e8d78-ce4c-4b05-aa8e-17050c82bb47",
"metadata": {},
"source": [
"## And now let's build useful messages for GPT-4o-mini, using a function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
"    ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36478464-39ee-485c-9f3f-6a4e458dbc9c",
"metadata": {},
"outputs": [],
"source": [
"# Try this out, and then try for a few more websites\n",
"\n",
"messages_for(ed)"
]
},
{
"cell_type": "markdown",
"id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
"metadata": {},
"source": [
"## Time to bring it together - the API for OpenAI is very simple!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
"outputs": [],
"source": [
"# And now: call the OpenAI API. You will get very familiar with this!\n",
"\n",
"def summarize(url):\n",
"    website = Website(url)\n",
"    response = openai.chat.completions.create(\n",
"        model=\"gpt-4o-mini\",\n",
"        messages=messages_for(website)\n",
"    )\n",
"    return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
"metadata": {},
"outputs": [],
"source": [
"summarize(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d926d59-450e-4609-92ba-2d6f244f1342",
"metadata": {},
"outputs": [],
"source": [
"# A function to display this nicely in the Jupyter output, using markdown\n",
"\n",
"def display_summary(url):\n",
"    summary = summarize(url)\n",
"    display(Markdown(summary))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3018853a-445f-41ff-9560-d925d1774b2f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "markdown",
"id": "b3bcf6f4-adce-45e9-97ad-d9a5d7a3a624",
"metadata": {},
"source": [
"# Let's try more websites\n",
"\n",
"Note that this will only work on websites that can be scraped using this simplistic approach.\n",
"\n",
"Websites that are rendered with JavaScript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)\n",
"\n",
"Also, websites protected with CloudFront (and similar) may give 403 errors - many thanks Andy J for pointing this out.\n",
"\n",
"But many websites will work just fine!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45d83403-a24c-44b5-84ac-961449b4008f",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://cnn.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75e9fd40-b354-4341-991e-863ef2e59db7",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://anthropic.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a87b0c4d",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://openai.com/about/\")"
]
},
{
"cell_type": "markdown",
"id": "c951be1a-7f1b-448f-af1f-845978e47e2c",
"metadata": {},
"source": [
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#181;\">Business applications</h2>\n",
" <span style=\"color:#181;\">In this exercise, you experienced calling the Cloud API of a Frontier Model (a leading model at the frontier of AI) for the first time. We will be using APIs like OpenAI at many stages in the course, in addition to building our own LLMs.\n",
"\n",
"More specifically, we've applied this to Summarization - a classic Gen AI use case to make a summary. This can be applied to any business vertical - summarizing the news, summarizing financial performance, summarizing a resume in a cover letter - the applications are limitless. Consider how you could apply Summarization in your business, and try prototyping a solution.</span>\n",
" </td>\n",
" </tr>\n",
"</table>\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
" <tr>\n",
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
" <img src=\"../important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
" </td>\n",
" <td>\n",
" <h2 style=\"color:#900;\">Before you continue - now try yourself</h2>\n",
" <span style=\"color:#900;\">Use the cell below to make your own simple commercial example. Stick with the summarization use case for now. Here's an idea: write something that will take the contents of an email, and will suggest an appropriate short subject line for the email. That's the kind of feature that might be built into a commercial email tool.</span>\n",
" </td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"You are a soccer commentator for Premier League games, mimicking Peter Drury's vocabulary and phrases to serenade a player on scoring a goal\"\n",
"user_prompt = \"\"\"\n",
"    Dominik Szoboszlai of Liverpool just hit a 32-yard free kick in the 82nd minute against Arsenal, making the score 1-0 to Liverpool and bringing them closer to winning the game.\n",
"\"\"\"\n",
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
"]\n",
"\n",
"# Step 3: Call OpenAI\n",
"\n",
"response = openai.chat.completions.create(\n",
"    model=\"gpt-4o-mini\",\n",
"    messages=messages\n",
")\n",
"\n",
"# Step 4: print the result\n",
"\n",
"print(response.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## An extra exercise for those who enjoy web scraping\n",
"\n",
"You may notice that if you try `display_summary(\"https://openai.com\")` - it doesn't work! That's because OpenAI has a fancy website that uses JavaScript. There are many ways around this that some of you might be familiar with. For example, Selenium is a hugely popular framework that runs a browser behind the scenes, renders the page, and allows you to query it. If you have experience with Selenium, Playwright or similar, then feel free to improve the Website class to use them. In the community-contributions folder, you'll find an example Selenium solution from a student (thank you!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5828d6c4",
"metadata": {},
"outputs": [],
"source": [
"display_summary(\"https://openai.com\")"
]
},
{
"cell_type": "markdown",
"id": "eeab24dc-5f90-4570-b542-b0585aca3eb6",
"metadata": {},
"source": [
"# Sharing your code\n",
"\n",
"I'd love it if you share your code afterwards so I can share it with others! You'll notice that some students have already made changes (including a Selenium implementation) which you will find in the community-contributions folder. If you'd like to add your changes to that folder, submit a Pull Request with your new versions in that folder and I'll merge your changes.\n",
"\n",
"If you're not an expert with git (and I am not!) then GPT has given some nice instructions on how to submit a Pull Request. It's a bit of an involved process, but once you've done it once it's pretty clear. As a pro-tip: it's best if you clear the outputs of your Jupyter notebooks (Edit >> Clear Outputs of All Cells, and then Save) for clean notebooks.\n",
"\n",
"Here are good instructions courtesy of an AI friend: \n",
"https://chatgpt.com/share/677a9cb5-c64c-8012-99e0-e06e88afd293"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4484fcf-8b39-4c3f-9674-37970ed71988",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "llms",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,188 @@
# YouTube Video Summarizer
A Python tool that fetches YouTube video transcripts and summarizes them with OpenAI's GPT-4o-mini model. Long transcripts are split into chunks, summarized chunk by chunk, and then combined into a single final summary.
## Features
- 🎬 **YouTube Integration**: Automatically fetches video transcripts
- 🤖 **AI-Powered Summaries**: Uses GPT-4o-mini for high-quality summaries
- 📊 **Smart Chunking**: Handles large videos by splitting into manageable chunks
- 🔄 **Automatic Stitching**: Combines chunk summaries into cohesive final summaries
- 💰 **Cost-Effective**: Optimized for GPT-4o-mini's token limits
- 🛡️ **Error Handling**: Robust error handling with helpful messages
## Installation
### Prerequisites
- Python 3.8 or higher
### Option 1: Using the installation script (Recommended)
```bash
# Run the automated installation script
python install.py
# The script will let you choose between UV and pip
# Then run the script with your chosen method
```
### Option 2: Using UV
```bash
# Install UV if not already installed
pip install uv
# Install dependencies and create virtual environment
uv sync
# Run the script
uv run python youtube_video_summarizer.py
```
### Option 3: Using pip
```bash
# Install dependencies
pip install -r requirements.txt
# Run the script
python youtube_video_summarizer.py
```
### Optional Dependencies
#### With UV:
```bash
# For Jupyter notebook support
uv sync --extra jupyter
# For development dependencies (testing, linting, etc.)
uv sync --extra dev
```
#### With pip:
```bash
# For Jupyter notebook support
pip install ipython jupyter
# For development dependencies
pip install pytest black flake8 mypy
```
## Setup
1. **Get an OpenAI API Key**:
   - Visit [OpenAI API](https://platform.openai.com/api-keys)
   - Create a new API key
2. **Create a .env file**:
   ```bash
   echo "OPENAI_API_KEY=your_api_key_here" > .env
   ```
3. **Update the video URL** in `youtube_video_summarizer.py`:
   ```python
   video_url = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
   ```
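The URL form used in step 3 carries the video ID in the `v` query parameter. As a rough illustration of how such a URL can be checked before any transcript is requested, here is a minimal sketch using only the standard library. The helper name `extract_video_id` is hypothetical and is not taken from `youtube_video_summarizer.py`, which may parse URLs differently.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the video ID from a YouTube URL, or None if the URL is not recognized.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard form: the ID is the value of the "v" query parameter
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
    if host == "youtu.be":
        # Short form: the ID is the first path segment
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None
    return None


print(extract_video_id("https://www.youtube.com/watch?v=YOUR_VIDEO_ID"))
```

Returning `None` for anything unrecognized lets the caller print a clear message instead of failing later during the transcript fetch.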
## Usage
### Basic Usage
```python
from youtube_video_summarizer import YouTubeVideo, summarize_video
# Create video object
video = YouTubeVideo("https://www.youtube.com/watch?v=VIDEO_ID")
# Generate summary
summary = summarize_video(video)
print(summary)
```
### Advanced Usage with Custom Settings
```python
# Custom chunking settings
summary = summarize_video(
    video,
    use_chunking=True,
    max_chunk_tokens=4000
)
```
## How It Works
1. **Video Processing**: Fetches YouTube video metadata and transcript
2. **Token Analysis**: Counts tokens to determine if chunking is needed
3. **Smart Chunking**: Splits large transcripts into manageable pieces
4. **Individual Summaries**: Generates summaries for each chunk
5. **Intelligent Stitching**: Combines chunk summaries into final result
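The five steps above follow a summarize-then-combine pattern. The sketch below shows the control flow only, under stated assumptions: `summarize_transcript` and `summarize_fn` are hypothetical names, whitespace-separated words stand in for real model tokens, and chunks do not overlap. The actual script counts tokens with `tiktoken` and calls the OpenAI API.

```python
def summarize_transcript(transcript, summarize_fn, max_chunk_tokens=4000):
    """Summarize chunks individually, then summarize the combined chunk summaries.

    summarize_fn is a hypothetical callable (text -> summary); in the real tool
    this role would be played by a call to the OpenAI API.
    """
    # Token analysis: whitespace-separated words stand in for model tokens here
    words = transcript.split()
    if len(words) <= max_chunk_tokens:
        # Short transcript: a single call is enough, no chunking needed
        return summarize_fn(transcript)

    # Chunking: consecutive slices of at most max_chunk_tokens words
    chunks = [
        " ".join(words[i:i + max_chunk_tokens])
        for i in range(0, len(words), max_chunk_tokens)
    ]
    # Individual summaries, one per chunk
    partial_summaries = [summarize_fn(chunk) for chunk in chunks]
    # Stitching: one more pass over the concatenated chunk summaries
    return summarize_fn("\n\n".join(partial_summaries))
```

Passing the summarizer in as a function keeps the flow testable without network access: a stub such as `lambda text: text[:50]` is enough to exercise both branches.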
## Configuration
### Model Settings
- **Model**: GPT-4o-mini (cost-effective and high-quality)
- **Temperature**: 0.3 (focused, consistent output)
- **Max Tokens**: 2,000 (optimal for summaries)
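To make these settings concrete, the sketch below assembles the keyword arguments for a chat completion request. It is an illustration under assumptions: `build_summary_request` and the system prompt text are hypothetical, and the script itself may structure its call differently. `model`, `temperature`, `max_tokens`, and `messages` are parameters of the OpenAI Chat Completions API.

```python
def build_summary_request(transcript_text):
    """Assemble keyword arguments for a chat completion using the settings above."""
    return {
        "model": "gpt-4o-mini",
        "temperature": 0.3,   # low temperature for focused, consistent output
        "max_tokens": 2000,   # upper bound on the length of the generated summary
        "messages": [
            # Hypothetical system prompt, for illustration only
            {"role": "system", "content": "You summarize video transcripts."},
            {"role": "user", "content": transcript_text},
        ],
    }


request = build_summary_request("Transcript text goes here")
print(request["model"], request["temperature"], request["max_tokens"])

# With the openai package installed and an API key configured, the call would be:
# response = OpenAI().chat.completions.create(**request)
```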
### Chunking Settings
- **Max Chunk Size**: 4,000 tokens (auto-calculated per model)
- **Overlap**: 5% of chunk size (maintains context)
- **Auto-detection**: Automatically determines if chunking is needed
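The overlap setting means each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears with some context. A minimal sketch of that idea follows. It assumes the input is already a list of tokens (for example from `tiktoken`), and the function name and the exact overlap calculation are assumptions rather than the script's own code.

```python
def chunk_with_overlap(tokens, max_chunk_tokens=4000, overlap_ratio=0.05):
    """Split a token list into chunks, each repeating the tail of the previous chunk."""
    if max_chunk_tokens <= 0:
        raise ValueError("max_chunk_tokens must be positive")
    overlap = round(max_chunk_tokens * overlap_ratio)  # 5% of 4,000 gives 200 tokens
    step = max(1, max_chunk_tokens - overlap)          # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_chunk_tokens])
        if start + max_chunk_tokens >= len(tokens):
            break  # this chunk reached the end of the input
        start += step
    return chunks
```

With the defaults, a 10,000-token transcript yields three chunks starting at tokens 0, 3,800, and 7,600, so neighbouring chunks share 200 tokens.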
## Error Handling
The script includes comprehensive error handling:
- ✅ **Missing Dependencies**: Clear installation instructions
- ✅ **Invalid URLs**: YouTube URL validation
- ✅ **API Errors**: OpenAI API error handling
- ✅ **Network Issues**: Request timeout and retry logic
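Retry logic of the kind listed above is commonly written as a small wrapper with exponential backoff. The sketch below is a generic illustration, not the script's implementation: `with_retries`, its parameters, and the doubling delay are assumptions.

```python
import time


def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait and try again, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** attempt))


# Example (not run here): wrap a network call that also sets its own timeout
# page = with_retries(lambda: requests.get(url, timeout=10))
```

The `sleep` parameter exists so the delay can be replaced in tests. In real use, catching a narrower exception type than `Exception` avoids retrying on errors that will never succeed, such as an invalid API key.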
## Requirements
- **Python**: 3.8 or higher
- **OpenAI API Key**: Required for summarization
- **Internet Connection**: For YouTube and OpenAI API access
## Dependencies
### Core Dependencies
- `requests`: HTTP requests
- `tiktoken`: Token counting
- `python-dotenv`: Environment variable management
- `openai`: OpenAI API client
- `youtube-transcript-api`: YouTube transcript fetching
- `beautifulsoup4`: HTML parsing
### Optional Dependencies
- `ipython`: Jupyter notebook support
- `jupyter`: Jupyter notebook support
## Troubleshooting
### Common Issues
1. **ModuleNotFoundError**:
   - With UV: Run `uv sync` to install dependencies
   - With pip: Run `pip install -r requirements.txt`
2. **UV not found**: Install UV with `pip install uv` or run `python install.py`
3. **OpenAI API Error**: Check your API key in `.env` file
4. **YouTube Transcript Error**: Video may not have transcripts available
5. **Token Limit Error**: Video transcript is too long (rare with chunking)
### Getting Help
If you encounter issues:
1. Check the error messages (they include helpful installation instructions)
2. Ensure all dependencies are installed:
   - With UV: `uv sync`
   - With pip: `pip install -r requirements.txt`
3. Verify your OpenAI API key is correct
4. Check that the YouTube video has transcripts available
5. Try running with the appropriate command:
   - With UV: `uv run python youtube_video_summarizer.py`
   - With pip: `python youtube_video_summarizer.py`
## License
This project is part of the LLM Engineering course materials.
## Contributing
Feel free to submit issues and enhancement requests!


@@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""
Installation script for YouTube Video Summarizer
This script installs all required dependencies for the project using either UV or pip.
"""
import subprocess
import sys
import os
import shutil


def run_command(command, description):
    """Run a command and handle errors"""
    print(f"🔄 {description}...")
    try:
        result = subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
        print(f"{description} completed successfully")
        return True
    except subprocess.CalledProcessError as e:
        print(f"{description} failed:")
        print(f" Error: {e.stderr}")
        return False


def check_python_version():
    """Check if Python version is compatible"""
    version = sys.version_info
    if version.major < 3 or (version.major == 3 and version.minor < 8):
        print("❌ Python 3.8 or higher is required")
        print(f" Current version: {version.major}.{version.minor}.{version.micro}")
        return False
    print(f"✅ Python {version.major}.{version.minor}.{version.micro} is compatible")
    return True


def check_uv_installed():
    """Check if UV is installed"""
    if shutil.which("uv"):
        print("✅ UV is already installed")
        return True
    else:
        print("❌ UV is not installed")
        return False


def install_uv():
    """Install UV package manager"""
    print("🔄 Installing UV...")
    try:
        # Try to install UV using pip first
        if not run_command(f"{sys.executable} -m pip install uv", "Installing UV via pip"):
            # Fallback to curl installation
            install_script = "curl -LsSf https://astral.sh/uv/install.sh | sh"
            if not run_command(install_script, "Installing UV via curl"):
                print("❌ Failed to install UV. Please install it manually:")
                print(" pip install uv")
                print(" or visit: https://github.com/astral-sh/uv")
                return False
        return True
    except Exception as e:
        print(f"❌ Error installing UV: {e}")
        return False


def choose_package_manager():
    """Let user choose between UV and pip"""
    print("\n📦 Choose your package manager:")
    print("1. UV (recommended - faster, better dependency resolution)")
    print("2. pip (traditional Python package manager)")
    while True:
        choice = input("\nEnter your choice (1 or 2): ").strip()
        if choice == "1":
            return "uv"
        elif choice == "2":
            return "pip"
        else:
            print("❌ Invalid choice. Please enter 1 or 2.")


def install_dependencies_uv():
    """Install dependencies using UV"""
    print("🚀 Installing YouTube Video Summarizer dependencies with UV...")
    print("=" * 60)
    # Check if UV is installed, install if not
    if not check_uv_installed():
        if not install_uv():
            return False
    # Check if pyproject.toml exists
    pyproject_file = os.path.join(os.path.dirname(__file__), "pyproject.toml")
    if not os.path.exists(pyproject_file):
        print("❌ pyproject.toml not found. Please ensure you're in the project directory.")
        return False
    # Install dependencies using UV
    if not run_command("uv sync", "Installing dependencies with UV"):
        return False
    print("=" * 60)
    print("🎉 Installation completed successfully!")
    print("\n📋 Next steps:")
    print("1. Create a .env file with your OpenAI API key:")
    print(" OPENAI_API_KEY=your_api_key_here")
    print("2. Run the script:")
    print(" uv run python youtube_video_summarizer.py")
    print("\n💡 For Jupyter notebook support, install with:")
    print(" uv sync --extra jupyter")
    print("\n💡 For development dependencies, install with:")
    print(" uv sync --extra dev")
    return True


def install_dependencies_pip():
    """Install dependencies using pip"""
    print("🚀 Installing YouTube Video Summarizer dependencies with pip...")
    print("=" * 60)
    # Upgrade pip first
    if not run_command(f"{sys.executable} -m pip install --upgrade pip", "Upgrading pip"):
        return False
    # Install dependencies from requirements.txt
    requirements_file = os.path.join(os.path.dirname(__file__), "requirements.txt")
    if os.path.exists(requirements_file):
        if not run_command(f"{sys.executable} -m pip install -r {requirements_file}", "Installing dependencies from requirements.txt"):
            return False
    else:
        # Install core dependencies individually
        core_deps = [
            "requests",
            "tiktoken",
            "python-dotenv",
            "openai",
            "youtube-transcript-api",
            "beautifulsoup4"
        ]
        for dep in core_deps:
            if not run_command(f"{sys.executable} -m pip install {dep}", f"Installing {dep}"):
                return False
    print("=" * 60)
    print("🎉 Installation completed successfully!")
    print("\n📋 Next steps:")
    print("1. Create a .env file with your OpenAI API key:")
    print(" OPENAI_API_KEY=your_api_key_here")
    print("2. Run the script:")
    print(" python youtube_video_summarizer.py")
    print("\n💡 For Jupyter notebook support, also install:")
    print(" pip install jupyter ipython")
    return True


def install_dependencies():
    """Install required dependencies using chosen package manager"""
    # Check Python version
    if not check_python_version():
        return False
    # Let user choose package manager
    package_manager = choose_package_manager()
    if package_manager == "uv":
        return install_dependencies_uv()
    else:
        return install_dependencies_pip()


def main():
    """Main installation function"""
    print("🎬 YouTube Video Summarizer - Installation Script")
    print("=" * 60)
    if install_dependencies():
        print("\n✅ All dependencies installed successfully!")
        print("🚀 You can now run the YouTube Video Summarizer!")
    else:
        print("\n❌ Installation failed. Please check the error messages above.")
        sys.exit(1)


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,78 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "youtube-video-summarizer"
version = "1.0.0"
description = "A tool to summarize YouTube videos using OpenAI's GPT models"
readme = "README.md"
requires-python = ">=3.8"
license = {text = "MIT"}
authors = [
{name = "YouTube Video Summarizer Team"},
]
keywords = ["youtube", "summarizer", "openai", "transcript", "ai"]
classifiers = [
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Topic :: Multimedia :: Video",
"Topic :: Scientific/Engineering :: Artificial Intelligence",
]
dependencies = [
"requests>=2.25.0",
"tiktoken>=0.5.0",
"python-dotenv>=0.19.0",
"openai>=1.0.0",
"youtube-transcript-api>=1.0.0",
"beautifulsoup4>=4.9.0",
]
[project.optional-dependencies]
jupyter = [
"ipython>=7.0.0",
"jupyter>=1.0.0",
]
dev = [
"pytest>=6.0.0",
"black>=22.0.0",
"flake8>=4.0.0",
"mypy>=0.950",
]
[project.urls]
Homepage = "https://github.com/your-username/youtube-video-summarizer"
Repository = "https://github.com/your-username/youtube-video-summarizer"
Issues = "https://github.com/your-username/youtube-video-summarizer/issues"
[project.scripts]
youtube-summarizer = "youtube_video_summarizer:main"
[tool.uv]
dev-dependencies = [
"pytest>=6.0.0",
"black>=22.0.0",
"flake8>=4.0.0",
"mypy>=0.950",
]
[tool.black]
line-length = 88
target-version = ['py38']
[tool.mypy]
python_version = "3.8"
warn_return_any = true
warn_unused_configs = true
disallow_untyped_defs = true

View File

@@ -0,0 +1,17 @@
# Core dependencies for YouTube Video Summarizer
requests>=2.25.0
tiktoken>=0.5.0
python-dotenv>=0.19.0
openai>=1.0.0
youtube-transcript-api>=1.0.0
beautifulsoup4>=4.9.0
# Optional dependencies for Jupyter notebook support
ipython>=7.0.0
jupyter>=1.0.0
# Development dependencies (optional)
pytest>=6.0.0
black>=22.0.0
flake8>=4.0.0
mypy>=0.950

View File

@@ -0,0 +1,906 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e371ea2b",
"metadata": {},
"source": [
"# YouTube Video Summarizer\n",
"\n",
"This notebook provides a comprehensive solution for summarizing YouTube videos using OpenAI's GPT models. It includes:\n",
"\n",
"- **Automatic transcript extraction** from YouTube videos\n",
"- **Intelligent chunking** for large videos that exceed token limits\n",
"- **Smart summarization** with academic-quality output\n",
"- **Error handling** and dependency management\n",
"\n",
"## Features\n",
"\n",
"- ✅ Extracts transcripts from YouTube videos\n",
"- ✅ Handles videos of any length with automatic chunking\n",
"- ✅ Generates structured, academic-quality summaries\n",
"- ✅ Includes proper error handling and dependency checks\n",
"- ✅ Optimized for different OpenAI models\n",
"- ✅ Interactive notebook format for easy testing\n",
"\n",
"## Prerequisites\n",
"\n",
"Make sure you have the required dependencies installed:\n",
"```bash\n",
"pip install -r requirements.txt\n",
"```\n",
"\n",
"You'll also need an OpenAI API key set in your environment variables or `.env` file.\n"
]
},
{
"cell_type": "markdown",
"id": "95b713e0",
"metadata": {},
"source": [
"## 1. Import Dependencies and Setup\n",
"\n",
"First, let's import all required libraries and set up the environment.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c940970b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"import sys\n",
"\n",
"# Check for required dependencies and provide helpful error messages\n",
"try:\n",
" import requests\n",
" print(\"✅ requests imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'requests' module not found.\")\n",
" print(\"💡 Install with: pip install requests\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" import tiktoken\n",
" print(\"✅ tiktoken imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'tiktoken' module not found.\")\n",
" print(\"💡 Install with: pip install tiktoken\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from dotenv import load_dotenv\n",
" print(\"✅ python-dotenv imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'python-dotenv' module not found.\")\n",
" print(\"💡 Install with: pip install python-dotenv\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from openai import OpenAI\n",
" print(\"✅ openai imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'openai' module not found.\")\n",
" print(\"💡 Install with: pip install openai\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from youtube_transcript_api import YouTubeTranscriptApi\n",
" print(\"✅ youtube-transcript-api imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'youtube-transcript-api' module not found.\")\n",
" print(\"💡 Install with: pip install youtube-transcript-api\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from bs4 import BeautifulSoup\n",
" print(\"✅ beautifulsoup4 imported successfully\")\n",
"except ImportError:\n",
" print(\"❌ Error: 'beautifulsoup4' module not found.\")\n",
" print(\"💡 Install with: pip install beautifulsoup4\")\n",
" print(\" Or: pip install -r requirements.txt\")\n",
" sys.exit(1)\n",
"\n",
"try:\n",
" from IPython.display import Markdown, display\n",
" print(\"✅ IPython.display imported successfully\")\n",
"except ImportError:\n",
" # IPython is optional for Jupyter notebooks\n",
" print(\"⚠️ Warning: IPython not available (optional for Jupyter notebooks)\")\n",
" Markdown = None\n",
" display = None\n",
"\n",
"print(\"\\n🎉 All dependencies imported successfully!\")\n"
]
},
{
"cell_type": "markdown",
"id": "603e9c3b",
"metadata": {},
"source": [
"## 2. Configuration and Constants\n",
"\n",
"Set up headers for web scraping and define the YouTubeVideo class.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8584ca1a",
"metadata": {},
"outputs": [],
"source": [
"# Headers for website scraping\n",
"headers = {\n",
"    \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class YouTubeVideo:\n",
"    \"\"\"Class to handle YouTube video data extraction and processing\"\"\"\n",
"\n",
"    def __init__(self, url):\n",
"        \"\"\"\n",
"        Initialize YouTube video object\n",
"\n",
"        Args:\n",
"            url (str): YouTube video URL\n",
"        \"\"\"\n",
"        self.url = url\n",
"        # Capture the video ID so trailing parameters (e.g. \"&t=30s\") are not included in it\n",
"        youtube_pattern = r'https://www\\.youtube\\.com/watch\\?v=([a-zA-Z0-9_-]+)'\n",
"\n",
"        match = re.match(youtube_pattern, url)\n",
"        if match:\n",
"            response = requests.get(url, headers=headers)\n",
"            soup = BeautifulSoup(response.content, 'html.parser')\n",
"            self.video_id = match.group(1)\n",
"            self.title = soup.title.string if soup.title else \"No title found\"\n",
"            self.transcript = YouTubeTranscriptApi().fetch(self.video_id)\n",
"        else:\n",
"            raise ValueError(\"Invalid YouTube URL\")\n",
"\n",
"    def get_transcript_text(self):\n",
"        \"\"\"Get transcript as a single text string\"\"\"\n",
"        return \" \".join([segment.text for segment in self.transcript])\n",
"\n",
"    def get_video_info(self):\n",
"        \"\"\"Get basic video information\"\"\"\n",
"        return {\n",
"            \"title\": self.title,\n",
"            \"video_id\": self.video_id,\n",
"            \"url\": self.url,\n",
"            \"transcript_length\": len(self.transcript)\n",
"        }\n",
"\n",
"print(\"✅ YouTubeVideo class defined successfully\")\n"
]
},
{
"cell_type": "markdown",
"id": "235e9998",
"metadata": {},
"source": [
"## 3. OpenAI API Setup\n",
"\n",
"Functions to handle OpenAI API key and client initialization.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4fa7aba3",
"metadata": {},
"outputs": [],
"source": [
"def get_api_key():\n",
"    \"\"\"Get OpenAI API key from environment variables\"\"\"\n",
"    load_dotenv(override=True)\n",
"    api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"    if not api_key:\n",
"        raise ValueError(\"OPENAI_API_KEY is not set. Please set it in your environment variables or .env file.\")\n",
"    return api_key\n",
"\n",
"def get_openai_client():\n",
"    \"\"\"Initialize and return OpenAI client\"\"\"\n",
"    api_key = get_api_key()\n",
"    return OpenAI(api_key=api_key)\n",
"\n",
"# Initialize the client (no request is sent here, so the key itself is not verified yet)\n",
"try:\n",
"    client = get_openai_client()\n",
"    print(\"✅ OpenAI client initialized successfully\")\n",
"    print(\"✅ API key loaded from environment\")\n",
"except Exception as e:\n",
"    print(f\"❌ Error initializing OpenAI client: {e}\")\n",
"    print(\"💡 Make sure you have set your OPENAI_API_KEY environment variable\")\n"
]
},
{
"cell_type": "markdown",
"id": "4d3223f4",
"metadata": {},
"source": [
"## 4. Token Counting and Chunking Functions\n",
"\n",
"Functions to handle token counting and intelligent chunking of large transcripts.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71f68ad0",
"metadata": {},
"outputs": [],
"source": [
"def count_tokens(text, model=\"gpt-4o-mini\"):\n",
"    \"\"\"Count tokens in text using tiktoken with fallback\"\"\"\n",
"    try:\n",
"        # Try model-specific encoding first\n",
"        encoding = tiktoken.encoding_for_model(model)\n",
"        return len(encoding.encode(text))\n",
"    except KeyError:\n",
"        # Fallback to cl100k_base (the GPT-4 / GPT-3.5 encoding) when tiktoken\n",
"        # does not recognise the model name; the count is then approximate\n",
"        encoding = tiktoken.get_encoding(\"cl100k_base\")\n",
"        return len(encoding.encode(text))\n",
"    except Exception as e:\n",
"        # Ultimate fallback - rough estimation\n",
"        print(f\"Warning: Token counting failed ({e}), using rough estimation\")\n",
"        return int(len(text.split()) * 1.3)  # Rough word-to-token ratio\n",
"\n",
"def get_optimal_chunk_size(model=\"gpt-4o-mini\"):\n",
"    \"\"\"Calculate optimal chunk size based on model's context window\"\"\"\n",
"    model_limits = {\n",
"        \"gpt-4o-mini\": 128000,\n",
"        \"gpt-4o\": 128000,\n",
"        \"gpt-4-turbo\": 128000,\n",
"        \"gpt-3.5-turbo\": 16385,\n",
"        \"gpt-4\": 8192,\n",
"    }\n",
"\n",
"    context_window = model_limits.get(model, 8192)  # Default to 8K\n",
"\n",
"    # Reserve tokens for:\n",
"    # - System prompt: ~800 tokens\n",
"    # - User prompt overhead: ~300 tokens\n",
"    # - Output: ~2000 tokens\n",
"    # - Safety buffer: ~500 tokens\n",
"    reserved_tokens = 800 + 300 + 2000 + 500\n",
"\n",
"    optimal_chunk_size = context_window - reserved_tokens\n",
"\n",
"    # Ensure minimum chunk size\n",
"    return max(optimal_chunk_size, 2000)\n",
"\n",
"print(\"✅ Token counting and chunk size functions defined\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6647838",
"metadata": {},
"outputs": [],
"source": [
"def chunk_transcript(transcript, max_tokens=4000, overlap_tokens=200, model=\"gpt-4o-mini\"):\n",
"    \"\"\"\n",
"    Split transcript into chunks that fit within token limits\n",
"\n",
"    Args:\n",
"        transcript: List of transcript segments from YouTube (objects with a .text attribute)\n",
"        max_tokens: Maximum tokens per chunk (pass None to derive it from the model)\n",
"        overlap_tokens: Number of tokens to overlap between chunks (pass None for 5% of max_tokens)\n",
"        model: Model name for token counting and token limit calculation\n",
"\n",
"    Returns:\n",
"        List of transcript chunks\n",
"    \"\"\"\n",
"    # Auto-calculate max_tokens based on model if not provided\n",
"    if max_tokens is None:\n",
"        max_tokens = get_optimal_chunk_size(model)\n",
"\n",
"    # Auto-calculate overlap as percentage of max_tokens\n",
"    if overlap_tokens is None:\n",
"        overlap_tokens = int(max_tokens * 0.05)  # 5% overlap\n",
"\n",
"    # Convert transcript to text\n",
"    transcript_text = \" \".join([segment.text for segment in transcript])\n",
"\n",
"    # If transcript is small enough, return as single chunk\n",
"    if count_tokens(transcript_text, model) <= max_tokens:\n",
"        return [transcript_text]\n",
"\n",
"    # Split into sentences for better chunking\n",
"    sentences = re.split(r'[.!?]+', transcript_text)\n",
"    chunks = []\n",
"    current_chunk = \"\"\n",
"\n",
"    for sentence in sentences:\n",
"        sentence = sentence.strip()\n",
"        if not sentence:\n",
"            continue\n",
"\n",
"        # Check if adding this sentence would exceed token limit\n",
"        test_chunk = current_chunk + \" \" + sentence if current_chunk else sentence\n",
"\n",
"        if count_tokens(test_chunk, model) <= max_tokens:\n",
"            current_chunk = test_chunk\n",
"        else:\n",
"            # Save current chunk and start new one\n",
"            if current_chunk:\n",
"                chunks.append(current_chunk)\n",
"\n",
"            # Start new chunk with overlap from previous chunk\n",
"            if chunks and overlap_tokens > 0:\n",
"                # Get last few words from previous chunk for overlap\n",
"                prev_words = current_chunk.split()[-overlap_tokens//4:]  # Rough word-to-token ratio\n",
"                current_chunk = \" \".join(prev_words) + \" \" + sentence\n",
"            else:\n",
"                current_chunk = sentence\n",
"\n",
"    # Add the last chunk\n",
"    if current_chunk:\n",
"        chunks.append(current_chunk)\n",
"\n",
"    return chunks\n",
"\n",
"print(\"✅ Chunking function defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "7ee3f8a4",
"metadata": {},
"source": [
"## 5. Prompt Generation Functions\n",
"\n",
"Functions to generate system prompts, user prompts, and stitching prompts for the summarization process.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7f20bf5",
"metadata": {},
"outputs": [],
"source": [
"def generate_system_prompt():\n",
"    \"\"\"Generate the system prompt for video summarization\"\"\"\n",
"    return f\"\"\"\n",
"    You are an expert YouTube video summarizer. Your job is to take the full transcript of a video and generate a structured, precise, and academically grounded summary.\n",
"\n",
"    Your output must include:\n",
"\n",
"    1. Title\n",
"    - Either reuse the video's title (if it is clear, accurate, and concise)\n",
"    - Or generate a new, sharper, more descriptive title that best reflects the actual content covered.\n",
"\n",
"    2. Topic & Area of Coverage\n",
"    - Provide a 1-2 line highlight of the main topic of the video and the specific area it best covers.\n",
"    - Format:\n",
"        - Domain (e.g., Finance, Health, Technology, Psychology, Fitness, Productivity, etc.)\n",
"        - Sub-area (e.g., investment strategies, portfolio design; training routine, best exercises; productivity systems, cognitive science insights, etc.)\n",
"\n",
"    3. Summary of the Video\n",
"    - A structured, clear, and concise summary of the video.\n",
"    - Focus only on relevant, high-value content.\n",
"    - Skip fluff, tangents, product promotions, personal banter, or irrelevant side discussions.\n",
"    - Include key insights, frameworks, step-by-step methods, and actionable advice.\n",
"    - Where applicable, reference scientific studies, historical sources, or authoritative references (with author + year or journal if mentioned in the video, or inferred if the reference is well known).\n",
"\n",
"    Style & Quality Rules:\n",
"    - Be extremely specific: avoid vague generalizations.\n",
"    - Use precise language and structured formatting (bullet points, numbered lists, sub-sections if needed).\n",
"    - Prioritize clarity and factual accuracy.\n",
"    - Write as though preparing an executive briefing or academic digest.\n",
"    - If the transcript includes non-relevant sections (jokes, ads, unrelated chit-chat), skip summarizing them entirely.\n",
"    \"\"\"\n",
"\n",
"def generate_user_prompt(website, transcript_chunk=None):\n",
"    \"\"\"Generate user prompt for video summarization\"\"\"\n",
"    if transcript_chunk:\n",
"        return f\"\"\"Here is a portion of a YouTube video transcript. Use the system instructions to generate a summary of this section.\n",
"\n",
"        Video Title: {website.title}\n",
"\n",
"        Transcript Section: {transcript_chunk}\n",
"        \"\"\"\n",
"    else:\n",
"        # Use the plain transcript text, not the repr of the transcript object\n",
"        return f\"\"\"Here is the transcript of a YouTube video. Use the system instructions to generate the output.\n",
"\n",
"        Video Title: {website.title}\n",
"\n",
"        Transcript: {website.get_transcript_text()}\n",
"        \"\"\"\n",
"\n",
"def generate_stitching_prompt(chunk_summaries, video_title):\n",
" \"\"\"Generate prompt for stitching together chunk summaries\"\"\"\n",
" return f\"\"\"You are an expert at combining multiple summaries into a cohesive, comprehensive summary.\n",
"\n",
" Video Title: {video_title}\n",
"\n",
" Below are summaries of different sections of this video. Combine them into a single, well-structured summary that:\n",
" 1. Maintains the original structure and quality standards\n",
" 2. Eliminates redundancy between sections\n",
" 3. Creates smooth transitions between topics\n",
" 4. Preserves all important information \n",
" 5. Maintains the academic, professional tone\n",
" 6. Include examples and nuances where relevant\n",
" 7. Include the citations and references where applicable\n",
"\n",
" Section Summaries:\n",
" {chr(10).join([f\"Section {i+1}: {summary}\" for i, summary in enumerate(chunk_summaries)])}\n",
"\n",
" Please provide a unified, comprehensive summary following the same format as the individual sections.\n",
" Make sure the final summary is cohesive and logical.\n",
" \"\"\"\n",
"\n",
"print(\"✅ Prompt generation functions defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "5c9a620d",
"metadata": {},
"source": [
"## 6. Summarization Functions\n",
"\n",
"Core functions for summarizing videos with support for both single-chunk and chunked processing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cc8a183b",
"metadata": {},
"outputs": [],
"source": [
"def summarize_single_chunk(website, client):\n",
"    \"\"\"Summarize a single chunk (small video)\"\"\"\n",
"    system_prompt = generate_system_prompt()\n",
"    user_prompt = generate_user_prompt(website)\n",
"\n",
"    try:\n",
"        response = client.chat.completions.create(\n",
"            model=\"gpt-4o-mini\",\n",
"            messages=[\n",
"                {\"role\": \"system\", \"content\": system_prompt},\n",
"                {\"role\": \"user\", \"content\": user_prompt}\n",
"            ],\n",
"            max_tokens=2000,\n",
"            temperature=0.3\n",
"        )\n",
"\n",
"        return response.choices[0].message.content\n",
"\n",
"    except Exception as e:\n",
"        return f\"Error generating summary: {str(e)}\"\n",
"\n",
"def summarize_with_chunking(website, client, max_chunk_tokens=4000):\n",
"    \"\"\"Summarize a large video by chunking and stitching\"\"\"\n",
"    print(\"Video is large, using chunking strategy...\")\n",
"\n",
"    # Chunk the transcript\n",
"    chunks = chunk_transcript(website.transcript, max_chunk_tokens)\n",
"    print(f\"Split into {len(chunks)} chunks\")\n",
"\n",
"    # Summarize each chunk\n",
"    chunk_summaries = []\n",
"    system_prompt = generate_system_prompt()\n",
"\n",
"    for i, chunk in enumerate(chunks):\n",
"        print(f\"Processing chunk {i+1}/{len(chunks)}...\")\n",
"        user_prompt = generate_user_prompt(website, chunk)\n",
"\n",
"        try:\n",
"            response = client.chat.completions.create(\n",
"                model=\"gpt-4o-mini\",\n",
"                messages=[\n",
"                    {\"role\": \"system\", \"content\": system_prompt},\n",
"                    {\"role\": \"user\", \"content\": user_prompt}\n",
"                ],\n",
"                max_tokens=1500,  # Smaller for chunks\n",
"                temperature=0.3\n",
"            )\n",
"\n",
"            chunk_summaries.append(response.choices[0].message.content)\n",
"\n",
"        except Exception as e:\n",
"            chunk_summaries.append(f\"Error in chunk {i+1}: {str(e)}\")\n",
"\n",
"    # Stitch the summaries together\n",
"    print(\"Stitching summaries together...\")\n",
"    stitching_prompt = generate_stitching_prompt(chunk_summaries, website.title)\n",
"\n",
"    try:\n",
"        response = client.chat.completions.create(\n",
"            model=\"gpt-4o-mini\",\n",
"            messages=[\n",
"                {\"role\": \"system\", \"content\": \"You are an expert at combining multiple summaries into a cohesive, comprehensive summary.\"},\n",
"                {\"role\": \"user\", \"content\": stitching_prompt}\n",
"            ],\n",
"            max_tokens=2000,\n",
"            temperature=0.3\n",
"        )\n",
"\n",
"        return response.choices[0].message.content\n",
"\n",
"    except Exception as e:\n",
"        return f\"Error stitching summaries: {str(e)}\"\n",
"\n",
"print(\"✅ Summarization functions defined\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "99168160",
"metadata": {},
"outputs": [],
"source": [
"def summarize_video(website, use_chunking=True, max_chunk_tokens=4000):\n",
"    \"\"\"Summarize a YouTube video using OpenAI API with optional chunking for large videos\"\"\"\n",
"    client = get_openai_client()\n",
"\n",
"    # Check if we need chunking\n",
"    transcript_text = \" \".join([segment.text for segment in website.transcript])\n",
"    total_tokens = count_tokens(transcript_text)\n",
"\n",
"    print(f\"Total transcript tokens: {total_tokens}\")\n",
"\n",
"    if not use_chunking or total_tokens <= max_chunk_tokens:\n",
"        # Single request when chunking is disabled or the transcript fits in one chunk\n",
"        return summarize_single_chunk(website, client)\n",
"    else:\n",
"        # Chunked summary for large videos\n",
"        return summarize_with_chunking(website, client, max_chunk_tokens)\n",
"\n",
"print(\"✅ Main summarization function defined\")\n"
]
},
{
"cell_type": "markdown",
"id": "54a76dab",
"metadata": {},
"source": [
"## 7. Interactive Demo\n",
"\n",
"Now let's test the YouTube video summarizer with a sample video. You can replace the URL with any YouTube video you want to summarize.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "87badeff",
"metadata": {},
"outputs": [],
"source": [
"# Example usage - replace with your YouTube URL\n",
"video_url = \"https://www.youtube.com/watch?v=Xan5JnecLNA\"\n",
"\n",
"try:\n",
" # Create YouTube video object\n",
" print(\"🎬 Fetching video data...\")\n",
" video = YouTubeVideo(video_url)\n",
" \n",
" # Display video info\n",
" print(f\"📺 Video Title: {video.title}\")\n",
" print(f\"🆔 Video ID: {video.video_id}\")\n",
" \n",
" # Count tokens in transcript\n",
" transcript_text = video.get_transcript_text()\n",
" total_tokens = count_tokens(transcript_text)\n",
" print(f\"📊 Total transcript tokens: {total_tokens}\")\n",
" \n",
" # Show video info\n",
" info = video.get_video_info()\n",
" print(f\"📝 Transcript segments: {info['transcript_length']}\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ Error: {str(e)}\")\n",
" print(\"💡 Make sure the YouTube URL is valid and the video has captions available\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9e4cf2f",
"metadata": {},
"outputs": [],
"source": [
"# Generate summary (automatically uses chunking if needed)\n",
"if 'video' in locals():\n",
"    print(\"\\n🤖 Generating summary...\")\n",
"    print(\"⏳ This may take a few minutes for long videos...\")\n",
"\n",
"    try:\n",
"        summary = summarize_video(video, use_chunking=True, max_chunk_tokens=4000)\n",
"\n",
"        # Display results with nice formatting\n",
"        print(\"\\n\" + \"=\"*60)\n",
"        print(\"📋 FINAL SUMMARY\")\n",
"        print(\"=\"*60)\n",
"\n",
"        # Use IPython display if available for better formatting\n",
"        if display and Markdown:\n",
"            display(Markdown(summary))\n",
"        else:\n",
"            print(summary)\n",
"\n",
"    except Exception as e:\n",
"        print(f\"❌ Error generating summary: {str(e)}\")\n",
"else:\n",
"    print(\"⚠️ Please run the previous cell first to load a video\")\n"
]
},
{
"cell_type": "markdown",
"id": "42ff8a15",
"metadata": {},
"source": [
"## 8. Testing and Utility Functions\n",
"\n",
"Additional functions for testing the chunking functionality and other utilities.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d798b08f",
"metadata": {},
"outputs": [],
"source": [
"from types import SimpleNamespace\n",
"\n",
"def test_chunking():\n",
"    \"\"\"Test function to demonstrate chunking with a sample transcript\"\"\"\n",
"    # Sample transcript for testing. chunk_transcript reads segment.text,\n",
"    # so the segments are objects with a .text attribute rather than dicts.\n",
"    sample_transcript = [\n",
"        SimpleNamespace(text=\"This is a sample transcript segment 1. \" * 100),  # ~1000 tokens\n",
"        SimpleNamespace(text=\"This is a sample transcript segment 2. \" * 100),  # ~1000 tokens\n",
"        SimpleNamespace(text=\"This is a sample transcript segment 3. \" * 100),  # ~1000 tokens\n",
"        SimpleNamespace(text=\"This is a sample transcript segment 4. \" * 100),  # ~1000 tokens\n",
"        SimpleNamespace(text=\"This is a sample transcript segment 5. \" * 100),  # ~1000 tokens\n",
"    ]\n",
"\n",
"    print(\"🧪 Testing chunking functionality...\")\n",
"    chunks = chunk_transcript(sample_transcript, max_tokens=2000, overlap_tokens=100)\n",
"\n",
"    print(f\"📊 Original transcript: {count_tokens(' '.join([s.text for s in sample_transcript]))} tokens\")\n",
"    print(f\"📦 Number of chunks: {len(chunks)}\")\n",
"\n",
"    for i, chunk in enumerate(chunks):\n",
"        print(f\"📄 Chunk {i+1}: {count_tokens(chunk)} tokens\")\n",
"\n",
"def analyze_video_tokens(video_url):\n",
"    \"\"\"Analyze token count and chunking strategy for a video\"\"\"\n",
"    try:\n",
"        video = YouTubeVideo(video_url)\n",
"        transcript_text = video.get_transcript_text()\n",
"        total_tokens = count_tokens(transcript_text)\n",
"\n",
"        print(f\"📺 Video: {video.title}\")\n",
"        print(f\"📊 Total tokens: {total_tokens}\")\n",
"        print(f\"📦 Optimal chunk size: {get_optimal_chunk_size()}\")\n",
"\n",
"        if total_tokens > 4000:\n",
"            chunks = chunk_transcript(video.transcript, max_tokens=4000)\n",
"            print(f\"🔀 Would be split into {len(chunks)} chunks\")\n",
"            print(\"✅ Chunking strategy recommended\")\n",
"        else:\n",
"            print(\"✅ Single summary strategy sufficient\")\n",
"\n",
"    except Exception as e:\n",
"        print(f\"❌ Error analyzing video: {str(e)}\")\n",
"\n",
"print(\"✅ Testing and utility functions defined\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bfd789e5",
"metadata": {},
"outputs": [],
"source": [
"# Test chunking functionality (optional)\n",
"# Uncomment the line below to test chunking with sample data\n",
"# test_chunking()\n"
]
},
{
"cell_type": "markdown",
"id": "3528125f",
"metadata": {},
"source": [
"## 9. Usage Instructions\n",
"\n",
"### How to Use This Notebook\n",
"\n",
"1. **Set up your OpenAI API key**:\n",
" - Create a `.env` file in the same directory as this notebook\n",
" - Add your API key: `OPENAI_API_KEY=your_api_key_here`\n",
" - Or set it as an environment variable\n",
"\n",
"2. **Install dependencies**:\n",
" ```bash\n",
" pip install -r requirements.txt\n",
" ```\n",
"\n",
"3. **Run the cells in order**:\n",
" - Start with the import and setup cells\n",
" - Modify the `video_url` variable in the demo section\n",
" - Run the demo cells to test the summarizer\n",
"\n",
"### Customization Options\n",
"\n",
"- **Change the model**: Modify the model parameter in the summarization functions\n",
"- **Adjust chunk size**: Change `max_chunk_tokens` parameter\n",
"- **Modify prompts**: Edit the prompt generation functions for different output styles\n",
"- **Add error handling**: Extend the exception handling as needed\n",
"\n",
"### Features\n",
"\n",
"- ✅ **Automatic transcript extraction** from YouTube videos\n",
"- ✅ **Intelligent chunking** for videos exceeding token limits\n",
"- ✅ **Academic-quality summaries** with structured output\n",
"- ✅ **Error handling** and dependency validation\n",
"- ✅ **Interactive testing** with sample data\n",
"- ✅ **Token analysis** and optimization recommendations\n",
"\n",
"### Troubleshooting\n",
"\n",
"- **\"No transcript available\"**: The video may not have captions enabled\n",
"- **\"Invalid YouTube URL\"**: Make sure the URL follows the correct format\n",
"- **\"API key not set\"**: Check your `.env` file or environment variables\n",
"- **Import errors**: Run `pip install -r requirements.txt` to install dependencies\n"
]
},
{
"cell_type": "markdown",
"id": "a5a44fb8",
"metadata": {},
"source": [
"## 10. Advanced Usage Examples\n",
"\n",
"Here are some advanced usage patterns you can try with this notebook.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2bef390a",
"metadata": {},
"outputs": [],
"source": [
"# Example 1: Analyze multiple videos\n",
"video_urls = [\n",
"    \"https://www.youtube.com/watch?v=Xan5JnecLNA\",\n",
"    # Add more URLs here\n",
"]\n",
"\n",
"for url in video_urls:\n",
"    print(f\"\\n{'='*50}\")\n",
"    print(f\"Analyzing: {url}\")\n",
"    print('='*50)\n",
"    analyze_video_tokens(url)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fbdb5cd8",
"metadata": {},
"outputs": [],
"source": [
"# Example 2: Custom summarization with different parameters\n",
"def custom_summarize(video_url, model=\"gpt-4o-mini\", max_tokens=3000, temperature=0.1):\n",
"    \"\"\"Custom summarization with specific parameters\"\"\"\n",
"    try:\n",
"        video = YouTubeVideo(video_url)\n",
"        client = get_openai_client()\n",
"\n",
"        # Use custom chunking parameters (max_tokens is the per-chunk input budget)\n",
"        chunks = chunk_transcript(video.transcript, max_tokens=max_tokens)\n",
"\n",
"        if len(chunks) == 1:\n",
"            # Single chunk\n",
"            system_prompt = generate_system_prompt()\n",
"            user_prompt = generate_user_prompt(video, chunks[0])\n",
"\n",
"            response = client.chat.completions.create(\n",
"                model=model,\n",
"                messages=[\n",
"                    {\"role\": \"system\", \"content\": system_prompt},\n",
"                    {\"role\": \"user\", \"content\": user_prompt}\n",
"                ],\n",
"                max_tokens=2000,  # cap on the length of the generated summary\n",
"                temperature=temperature\n",
"            )\n",
"\n",
"            return response.choices[0].message.content\n",
"        else:\n",
"            # Multiple chunks - use the standard chunking approach\n",
"            # (the model and temperature arguments are not forwarded on this path)\n",
"            return summarize_with_chunking(video, client, max_tokens)\n",
"\n",
"    except Exception as e:\n",
"        return f\"Error: {str(e)}\"\n",
"\n",
"# Example usage:\n",
"# custom_summary = custom_summarize(\"https://www.youtube.com/watch?v=Xan5JnecLNA\")\n",
"# print(custom_summary)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7a5a9e9",
"metadata": {},
"outputs": [],
"source": [
"# Generate summary (automatically uses chunking if needed)\n",
"if 'video' in locals():\n",
"    print(\"\\n🤖 Generating summary...\")\n",
"    print(\"⏳ This may take a few minutes for long videos...\")\n",
"\n",
"    try:\n",
"        summary = summarize_video(video, use_chunking=True, max_chunk_tokens=4000)\n",
"\n",
"        # Display results with nice formatting\n",
"        print(\"\\n\" + \"=\"*60)\n",
"        print(\"📋 FINAL SUMMARY\")\n",
"        print(\"=\"*60)\n",
"\n",
"        # Use IPython display if available for better formatting\n",
"        if display and Markdown:\n",
"            display(Markdown(summary))\n",
"        else:\n",
"            print(summary)\n",
"\n",
"    except Exception as e:\n",
"        print(f\"❌ Error generating summary: {str(e)}\")\n",
"else:\n",
"    print(\"⚠️ Please run the previous cell first to load a video\")\n"
]
},
{
"cell_type": "markdown",
"id": "4028fa5e",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "c100b384-2c3e-49de-92ce-f5dd0b4b58c0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
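
The "Invalid YouTube URL" troubleshooting note above can be made more concrete with a small URL check. The sketch below is only an illustration under stated assumptions: the helper name `extract_video_id` is hypothetical and not part of the notebook, and the set of accepted hosts is a guess at the common cases. It uses only the standard library.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs


# Hypothetical helper (not part of the notebook): pull the video ID out of
# common YouTube URL shapes, returning None when the URL is not recognised.
def extract_video_id(url: str) -> Optional[str]:
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Query parameters may appear in any order, e.g. ?t=10s&v=ID
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
    elif host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/ID
        video_id = parsed.path.lstrip("/")
        return video_id or None
    return None


print(extract_video_id("https://www.youtube.com/watch?v=Xan5JnecLNA&t=10s"))
```

Parsing the query string, rather than splitting on `v=`, keeps trailing parameters such as `&t=10s` out of the ID.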


@@ -0,0 +1,421 @@
import os
import re
import sys

# Check for required dependencies and provide helpful error messages
try:
    import requests
except ImportError:
    print("❌ Error: 'requests' module not found.")
    print("💡 Install with: pip install requests")
    print(" Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    import tiktoken
except ImportError:
    print("❌ Error: 'tiktoken' module not found.")
    print("💡 Install with: pip install tiktoken")
    print(" Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from dotenv import load_dotenv
except ImportError:
    print("❌ Error: 'python-dotenv' module not found.")
    print("💡 Install with: pip install python-dotenv")
    print(" Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from openai import OpenAI
except ImportError:
    print("❌ Error: 'openai' module not found.")
    print("💡 Install with: pip install openai")
    print(" Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from youtube_transcript_api import YouTubeTranscriptApi
except ImportError:
    print("❌ Error: 'youtube-transcript-api' module not found.")
    print("💡 Install with: pip install youtube-transcript-api")
    print(" Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from bs4 import BeautifulSoup
except ImportError:
    print("❌ Error: 'beautifulsoup4' module not found.")
    print("💡 Install with: pip install beautifulsoup4")
    print(" Or: pip install -r requirements.txt")
    sys.exit(1)

try:
    from IPython.display import Markdown, display
except ImportError:
    # IPython is optional for Jupyter notebooks
    print("⚠️ Warning: IPython not available (optional for Jupyter notebooks)")
    Markdown = None
    display = None
# Headers and class for the YouTube video to summarize
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}


class YouTubeVideo:
    def __init__(self, url):
        self.url = url
        # Capture the video ID so trailing parameters (e.g. &t=10s) are excluded
        youtube_pattern = r'https://www\.youtube\.com/watch\?v=([a-zA-Z0-9_-]+)'
        match = re.match(youtube_pattern, url)
        if not match:
            raise ValueError("Invalid YouTube URL")
        self.video_id = match.group(1)
        # Fetch the watch page only to read its <title>
        response = requests.get(url, headers=headers, timeout=30)
        soup = BeautifulSoup(response.content, 'html.parser')
        self.title = soup.title.string if soup.title else "No title found"
        self.transcript = YouTubeTranscriptApi().fetch(self.video_id)
# Get API key and OpenAI client
def get_api_key():
    load_dotenv(override=True)
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")
    return api_key


def get_openai_client():
    api_key = get_api_key()
    return OpenAI(api_key=api_key)
# Count tokens
def count_tokens(text, model="gpt-4o-mini"):
    """Count tokens in text using tiktoken with fallback"""
    try:
        # Try model-specific encoding first
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except KeyError:
        # Fallback to cl100k_base encoding when tiktoken does not recognise the model name
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception as e:
        # Ultimate fallback - rough estimation
        print(f"Warning: Token counting failed ({e}), using rough estimation")
        # Rough word-to-token ratio; cast to int so callers always get a whole number
        return int(len(text.split()) * 1.3)
def get_optimal_chunk_size(model="gpt-4o-mini"):
    """Calculate optimal chunk size based on model's context window"""
    model_limits = {
        "gpt-4o-mini": 128000,
        "gpt-4o": 128000,
        "gpt-4-turbo": 128000,
        "gpt-3.5-turbo": 16385,
        "gpt-4": 8192,
    }
    context_window = model_limits.get(model, 8192)  # Default to 8K for unknown models

    # Reserve tokens for:
    # - System prompt: ~800 tokens
    # - User prompt overhead: ~300 tokens
    # - Output: ~2000 tokens
    # - Safety buffer: ~500 tokens
    reserved_tokens = 800 + 300 + 2000 + 500
    optimal_chunk_size = context_window - reserved_tokens

    # Ensure minimum chunk size
    return max(optimal_chunk_size, 2000)
# Chunk transcript
def chunk_transcript(transcript, max_tokens=4000, overlap_tokens=200, model="gpt-4o-mini"):
    """
    Split transcript into chunks that fit within token limits

    Args:
        transcript: List of transcript segments from YouTube
        max_tokens: Maximum tokens per chunk (auto-calculated if None)
        overlap_tokens: Number of tokens to overlap between chunks (5% of max_tokens if None)
        model: Model name for token counting and token limit calculation

    Returns:
        List of transcript chunks
    """
    # Auto-calculate max_tokens based on model if not provided
    if max_tokens is None:
        max_tokens = get_optimal_chunk_size(model)

    # Auto-calculate overlap as percentage of max_tokens
    if overlap_tokens is None:
        overlap_tokens = int(max_tokens * 0.05)  # 5% overlap

    # Convert transcript to text
    transcript_text = " ".join([segment.text for segment in transcript])

    # If transcript is small enough, return as single chunk
    if count_tokens(transcript_text, model) <= max_tokens:
        return [transcript_text]

    # Split into sentences for better chunking
    # (note: the sentence-ending punctuation is dropped by this split)
    sentences = re.split(r'[.!?]+', transcript_text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue

        # Check if adding this sentence would exceed token limit
        test_chunk = current_chunk + " " + sentence if current_chunk else sentence
        if count_tokens(test_chunk, model) <= max_tokens:
            current_chunk = test_chunk
        else:
            # Save current chunk and start new one
            if current_chunk:
                chunks.append(current_chunk)

            # Start new chunk with overlap from previous chunk
            if chunks and overlap_tokens > 0:
                # Get last few words from previous chunk for overlap
                prev_words = current_chunk.split()[-overlap_tokens//4:]  # Rough word-to-token ratio
                current_chunk = " ".join(prev_words) + " " + sentence
            else:
                current_chunk = sentence

    # Add the last chunk
    if current_chunk:
        chunks.append(current_chunk)

    return chunks
# Generate system prompt
def generate_system_prompt():
    return """
    You are an expert YouTube video summarizer. Your job is to take the full transcript of a video and generate a structured, precise, and academically grounded summary.
    Your output must include:
    1. Title
       - Either reuse the video's title (if it is clear, accurate, and concise)
       - Or generate a new, sharper, more descriptive title that best reflects the actual content covered.
    2. Topic & Area of Coverage
       - Provide a 1-2 line highlight of the main topic of the video and the specific area it best covers.
       - Format:
         - Domain (e.g., Finance, Health, Technology, Psychology, Fitness, Productivity, etc.)
         - Sub-area (e.g., investment strategies, portfolio design; training routine, best exercises; productivity systems, cognitive science insights, etc.)
    3. Summary of the Video
       - A structured, clear, and concise summary of the video.
       - Focus only on relevant, high-value content.
       - Skip fluff, tangents, product promotions, personal banter, or irrelevant side discussions.
       - Include key insights, frameworks, step-by-step methods, and actionable advice.
       - Where applicable, reference scientific studies, historical sources, or authoritative references (with author + year or journal if mentioned in the video, or inferred if the reference is well known).
    Style & Quality Rules:
    - Be extremely specific: avoid vague generalizations.
    - Use precise language and structured formatting (bullet points, numbered lists, sub-sections if needed).
    - Prioritize clarity and factual accuracy.
    - Write as though preparing an executive briefing or academic digest.
    - If the transcript includes non-relevant sections (jokes, ads, unrelated chit-chat), skip summarizing them entirely.
    """
# Generate user prompt
def generate_user_prompt(website, transcript_chunk=None):
    if transcript_chunk:
        return f"""Here is a portion of a YouTube video transcript. Use the system instructions to generate a summary of this section.
Video Title: {website.title}
Transcript Section: {transcript_chunk}
"""
    else:
        # Join the segment texts; interpolating the transcript object directly
        # would insert the object's repr rather than plain text
        transcript_text = " ".join(segment.text for segment in website.transcript)
        return f"""Here is the transcript of a YouTube video. Use the system instructions to generate the output.
Video Title: {website.title}
Transcript: {transcript_text}
"""
# Generate stitching prompt
def generate_stitching_prompt(chunk_summaries, video_title):
    """Generate prompt for stitching together chunk summaries"""
    return f"""You are an expert at combining multiple summaries into a cohesive, comprehensive summary.
Video Title: {video_title}
Below are summaries of different sections of this video. Combine them into a single, well-structured summary that:
1. Maintains the original structure and quality standards
2. Eliminates redundancy between sections
3. Creates smooth transitions between topics
4. Preserves all important information
5. Maintains the academic, professional tone
6. Includes examples and nuances where relevant
7. Includes the citations and references where applicable
Section Summaries:
{chr(10).join([f"Section {i+1}: {summary}" for i, summary in enumerate(chunk_summaries)])}
Please provide a unified, comprehensive summary following the same format as the individual sections.
Make sure the final summary is cohesive and logical.
"""
# Summarize video
def summarize_video(website, use_chunking=True, max_chunk_tokens=4000):
    """Summarize a YouTube video using OpenAI API with optional chunking for large videos"""
    client = get_openai_client()

    # Check if we need chunking
    transcript_text = " ".join([segment.text for segment in website.transcript])
    total_tokens = count_tokens(transcript_text)
    print(f"Total transcript tokens: {total_tokens}")

    if total_tokens <= max_chunk_tokens or not use_chunking:
        # Single summary: the transcript fits in one chunk, or chunking was disabled
        return summarize_single_chunk(website, client)
    else:
        # Chunked summary for large videos
        return summarize_with_chunking(website, client, max_chunk_tokens)
# Summarize single chunk
def summarize_single_chunk(website, client):
    """Summarize a single chunk (small video)"""
    system_prompt = generate_system_prompt()
    user_prompt = generate_user_prompt(website)
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            max_tokens=2000,
            temperature=0.3
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error generating summary: {str(e)}"
# Summarize with chunking
def summarize_with_chunking(website, client, max_chunk_tokens=4000):
    """Summarize a large video by chunking and stitching"""
    print("Video is large, using chunking strategy...")

    # Chunk the transcript
    chunks = chunk_transcript(website.transcript, max_chunk_tokens)
    print(f"Split into {len(chunks)} chunks")

    # Summarize each chunk
    chunk_summaries = []
    system_prompt = generate_system_prompt()
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}...")
        user_prompt = generate_user_prompt(website, chunk)
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                max_tokens=1500,  # Smaller for chunks
                temperature=0.3
            )
            chunk_summaries.append(response.choices[0].message.content)
        except Exception as e:
            chunk_summaries.append(f"Error in chunk {i+1}: {str(e)}")

    # Stitch the summaries together
    print("Stitching summaries together...")
    stitching_prompt = generate_stitching_prompt(chunk_summaries, website.title)
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are an expert at combining multiple summaries into a cohesive, comprehensive summary."},
                {"role": "user", "content": stitching_prompt}
            ],
            max_tokens=2000,
            temperature=0.3
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error stitching summaries: {str(e)}"
# Main function
def main():
    """Main function to demonstrate usage"""
    # Example usage - replace with actual YouTube URL
    video_url = "https://www.youtube.com/watch?v=Xan5JnecLNA"
    try:
        # Create YouTube video object
        print("Fetching video data...")
        video = YouTubeVideo(video_url)

        # Display video info
        print(f"Video Title: {video.title}")
        print(f"Video ID: {video.video_id}")

        # Count tokens in transcript
        transcript_text = " ".join([segment.text for segment in video.transcript])
        total_tokens = count_tokens(transcript_text)
        print(f"Total transcript tokens: {total_tokens}")

        # Generate summary (automatically uses chunking if needed)
        print("\nGenerating summary...")
        summary = summarize_video(video, use_chunking=True, max_chunk_tokens=4000)

        # Display results
        print("\n" + "="*50)
        print("FINAL SUMMARY")
        print("="*50)
        print(summary)
    except Exception as e:
        print(f"Error: {str(e)}")
def test_chunking():
    """Test function to demonstrate chunking with a sample transcript"""
    from types import SimpleNamespace

    # Sample transcript for testing. chunk_transcript reads segment.text,
    # so each segment needs a .text attribute rather than a dict key.
    sample_transcript = [
        SimpleNamespace(text="This is a sample transcript segment 1. " * 100),  # ~1000 tokens
        SimpleNamespace(text="This is a sample transcript segment 2. " * 100),  # ~1000 tokens
        SimpleNamespace(text="This is a sample transcript segment 3. " * 100),  # ~1000 tokens
        SimpleNamespace(text="This is a sample transcript segment 4. " * 100),  # ~1000 tokens
        SimpleNamespace(text="This is a sample transcript segment 5. " * 100),  # ~1000 tokens
    ]
    print("Testing chunking functionality...")
    chunks = chunk_transcript(sample_transcript, max_tokens=2000, overlap_tokens=100)
    print(f"Original transcript: {count_tokens(' '.join([s.text for s in sample_transcript]))} tokens")
    print(f"Number of chunks: {len(chunks)}")
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}: {count_tokens(chunk)} tokens")


if __name__ == "__main__":
    # Uncomment the line below to test chunking
    # test_chunking()
    # Run main function
    main()
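
The chunk-and-overlap idea used by `chunk_transcript` can be tried out without `tiktoken` or an API key. The following is a minimal sketch under stated assumptions, not the script's method: it treats one word as one "token" and uses a fixed-size sliding window instead of the sentence-aware split above. The name `chunk_words` is hypothetical.

```python
from typing import List


# Hypothetical, dependency-free illustration of chunking with overlap.
# Assumption: one word counts as one "token"; the script above counts
# real tokens with tiktoken instead.
def chunk_words(text: str, max_words: int, overlap_words: int = 0) -> List[str]:
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be in [0, max_words)")

    words = text.split()
    chunks = []
    step = max_words - overlap_words  # how far the window advances each time
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the window already reached the end of the text
    return chunks


chunks = chunk_words("a b c d e f g h i j", max_words=4, overlap_words=1)
print(chunks)
```

Each chunk repeats the last `overlap_words` words of the previous one, which is the same purpose the overlap serves in the script: context that straddles a boundary appears in both neighbouring chunks.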


@@ -0,0 +1,106 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "4a9842d0-2465-4c0a-9f08-3c23f4202c3a",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI"
]
},
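{
"cell_type": "code",
"execution_count": null,
"id": "7fcfe08c-d074-41f5-befe-24358c967e1b",
"metadata": {},
"outputs": [],
"source": [
"# Load the API key from .env and create the client used below\n",
"load_dotenv(override=True)\n",
"openai = OpenAI()"
]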
},
{
"cell_type": "code",
"execution_count": null,
"id": "89fd124d-5e7b-4e61-af85-2fe978c688f2",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"You are an analyst who reviews financial transaction data and summarizes where the money was spent and where spending could be cut back to increase savings.\"\n",
"user_prompt = \"\"\"\n",
" data = [\n",
" {\"transaction_id\": 1, \"date\": \"2025-01-05\", \"merchant\": \"Amazon\", \"category\": \"Shopping\", \"amount\": -120.50, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 2, \"date\": \"2025-01-07\", \"merchant\": \"Starbucks\", \"category\": \"Food & Drink\", \"amount\": -4.75, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 3, \"date\": \"2025-01-09\", \"merchant\": \"Tesco\", \"category\": \"Groceries\", \"amount\": -56.20, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 4, \"date\": \"2025-01-10\", \"merchant\": \"Uber\", \"category\": \"Transport\", \"amount\": -15.80, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 5, \"date\": \"2025-01-15\", \"merchant\": \"Apple\", \"category\": \"Electronics\", \"amount\": -899.00, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 6, \"date\": \"2025-01-18\", \"merchant\": \"Netflix\", \"category\": \"Subscription\", \"amount\": -9.99, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 7, \"date\": \"2025-01-20\", \"merchant\": \"Salary\", \"category\": \"Income\", \"amount\": 2500.00, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 8, \"date\": \"2025-01-22\", \"merchant\": \"British Airways\", \"category\": \"Travel\", \"amount\": -450.00, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 9, \"date\": \"2025-01-25\", \"merchant\": \"Marks & Spencer\", \"category\": \"Shopping\", \"amount\": -75.30, \"currency\": \"GBP\"},\n",
" {\"transaction_id\": 10, \"date\": \"2025-01-30\", \"merchant\": \"HMRC\", \"category\": \"Tax\", \"amount\": -320.00, \"currency\": \"GBP\"},\n",
"]\n",
"\n",
"\n",
"\"\"\"\n",
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [\n",
"    {\"role\": \"system\", \"content\": system_prompt},\n",
"    {\"role\": \"user\", \"content\": user_prompt}\n",
"]\n",
"\n",
"# Step 3: Call OpenAI\n",
"\n",
"response = openai.chat.completions.create(\n",
"    model=\"gpt-4o-mini\",\n",
"    messages=messages\n",
")\n",
"\n",
"# Step 4: print the result\n",
"\n",
"display(Markdown(response.choices[0].message.content))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12d6642e-a4d4-49c7-a14b-8c8200dd210c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5fccacf-ddb4-4076-87cc-0ffe6a4d64a4",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
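
One way to sanity-check the model's spending summary is to compute the category totals locally and compare. The sketch below is an assumption about how that check might look, not part of the notebook: `spend_by_category` is a hypothetical helper, and `sample` is a small subset of the transactions listed in the prompt above.

```python
from collections import defaultdict


# Hypothetical helper: total spending per category, ignoring income.
def spend_by_category(transactions):
    totals = defaultdict(float)
    for tx in transactions:
        if tx["amount"] < 0:  # negative amounts are outgoings in this data
            totals[tx["category"]] += -tx["amount"]
    # Largest spend first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# A subset of the notebook's transactions, kept short for illustration
sample = [
    {"merchant": "Amazon", "category": "Shopping", "amount": -120.50},
    {"merchant": "Salary", "category": "Income", "amount": 2500.00},
    {"merchant": "Marks & Spencer", "category": "Shopping", "amount": -75.30},
    {"merchant": "Tesco", "category": "Groceries", "amount": -56.20},
]

for category, total in spend_by_category(sample):
    print(f"{category}: {total:.2f} GBP")
```

If the totals in the model's answer disagree with these, the locally computed figures are the ones to trust.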


@@ -0,0 +1,181 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d6b368e-728a-4a9d-8e9b-0d41cfd15ac9",
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    result = 5 / 0\n",
"except ZeroDivisionError:\n",
"    print(\"You can't divide by zero!\")\n",
"finally:\n",
"    print(\"Done.\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1bf7d5aa-670e-4eaa-a7a4-a0059fe5bae7",
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    x = int(\"hello\")  # Causes a ValueError\n",
"except ValueError:\n",
"    print(\"That's not an integer.\")\n",
"except TypeError:\n",
"    print(\"Wrong type.\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9870195c-2854-44bb-b451-bc670c3de536",
"metadata": {},
"outputs": [],
"source": [
"try:\n",
"    do_something()  # not defined in this notebook, so this raises NameError\n",
"except Exception as e:\n",
"    print(f\"Something went wrong: {e}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8b1318b-0940-42c0-8995-dee4ecde8f55",
"metadata": {},
"outputs": [],
"source": [
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
"    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
"    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
"    print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed6d8042-53ec-44f3-80b6-aa8ccc1dfd28",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"You are a football reporter who reports on the day's games \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"The summary should highlight the important things that happened in each game. \\\n",
"Also provide the location information if it is a local league, or among countries. \\\n",
"Do not mix in American football games; you can include American soccer results. \\\n",
"Please also provide the history of each league and who plays in each and why. \\\n",
"Respond in markdown.\"\n",
"user_prompt = \"\"\"\n",
" Give me the summary of the games on 9th of September 2023\n",
"\"\"\"\n",
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [\n",
"    {\"role\": \"system\", \"content\": system_prompt},\n",
"    {\"role\": \"user\", \"content\": user_prompt}\n",
"]\n",
"\n",
"# Step 3: Call OpenAI\n",
"\n",
"response = openai.chat.completions.create(model=\"gpt-4o-mini\", messages=messages)\n",
"\n",
"# Step 4: print the result\n",
"# display() renders the Markdown and returns None, so it is not wrapped in print()\n",
"\n",
"display(Markdown(response.choices[0].message.content))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afaa1c5c-04b7-43fb-b080-24fcc4d4b702",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
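
A backslash line continuation inside a string literal joins the lines exactly as written, so a sentence that ends right before the backslash runs straight into the next one unless a space is typed. One alternative, sketched here with a shortened, illustrative prompt (not the notebook's full text), is implicit concatenation of adjacent string literals inside parentheses, where the trailing space of each piece is easy to see.

```python
# Adjacent string literals inside parentheses are concatenated by Python.
# Each piece ends with an explicit space so sentences do not run together.
system_prompt = (
    "You are a football reporter. "
    "Summarize each game briefly. "
    "Respond in markdown."
)

print(system_prompt)
```

This form also avoids the problem of invisible trailing whitespace after a backslash, which Python rejects as a syntax error.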


@@ -0,0 +1,209 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "44aba2a0-c6eb-4fc1-a5cc-0a8f8679dbb8",
"metadata": {},
"source": [
"## Song-writing Assistant\n",
"\n",
"This app uses a GPT model to help you write a song that is sure to include the specific keywords you enter. You can also enter genres of music and specific artists to tailor the assistant to the styles of music you want to incorporate in your song."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ad35f4c-ef77-4a87-b790-d3ffaf517ff0",
"metadata": {},
"outputs": [],
"source": [
"# Imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e615a47-9dfa-48f3-b09a-db22d5462141",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
"    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
"    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
"    print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4d58124-5e9a-4f5a-9e0a-ff74f43896a8",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d33cc22-1dc0-4b49-a483-1c94fc466005",
"metadata": {},
"outputs": [],
"source": [
"# Initializing the genres set\n",
"\n",
"genres = []\n",
"numOfGenres = int(input(\"How many genres would you like your assistant to know?\\nYou MUST enter at least 1, but you can enter up to 3: \"))\n",
"\n",
"while ((numOfGenres < 1) or (numOfGenres > 3)):\n",
"    numOfGenres = int(input(\"\\nInvalid number of genres.\\nYou MUST enter at least 1, but you can enter up to 3: \"))\n",
"\n",
"print(f\"\\nEnter your genres below. Please keep in mind duplicate genres will be removed, so try to enter {numOfGenres} unique genres\")\n",
"for g in range(numOfGenres):\n",
"    genres.append(str(input(f\"Enter genre {g+1}: \")).lower())\n",
"\n",
"genres = set(genres)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ba7e045-ec3d-4555-ac56-badffa9f68f7",
"metadata": {},
"outputs": [],
"source": [
"# Initializing the music groups\n",
"\n",
"music_groups = []\n",
"\n",
"musicGroupsFlag = str(input(\"Would you like to add some singers/bands/groups that your assistant can be familiar with? (Y/N) \")).lower()\n",
"\n",
"if(musicGroupsFlag == \"y\"):\n",
"    numOfGroups = int(input(\"\\nHow many groups would you like your assistant to know?\\nYou MUST enter at least 1, but you can enter up to 3: \"))\n",
"    while ((numOfGroups < 1) or (numOfGroups > 3)):\n",
"        numOfGroups = int(input(\"\\nInvalid number of groups.\\nYou MUST enter at least 1, but you can enter up to 3: \"))\n",
"\n",
"    print(f\"\\nEnter your singers/bands/groups below. Please keep in mind duplicate singers/bands/groups will be removed, so try to enter {numOfGroups} unique singers/bands/groups.\")\n",
"    for m in range(numOfGroups):\n",
"        music_groups.append(str(input(f\"Enter singer/band/group {m+1}: \")))\n",
"\n",
"music_groups = set(music_groups)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69f75309-b179-4adc-979c-03d348dfb310",
"metadata": {},
"outputs": [],
"source": [
"# Initializing the keywords\n",
"\n",
"keywords = []\n",
"numOfWords = int(input(\"How many keywords would you like to add to your song?\\nYou MUST enter at least 1, but you can enter up to 20: \"))\n",
"\n",
"while ((numOfWords < 1) or (numOfWords > 20)):\n",
"    numOfWords = int(input(\"\\nInvalid number of words.\\nYou MUST enter at least 1, but you can enter up to 20: \"))\n",
"\n",
"print(f\"\\nEnter your words below. Please keep in mind duplicate words will be removed, so try to enter {numOfWords} unique words\")\n",
"for w in range(numOfWords):\n",
"    keywords.append(str(input(f\"Word {w+1}: \")).lower())\n",
"\n",
"keywords = set(keywords)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3f53326-aeac-439b-8cb5-0ab1064df118",
"metadata": {},
"outputs": [],
"source": [
"# Setting up the system and user prompts\n",
"\n",
"format_genres = \", \".join(genres)\n",
"\n",
"system_prompt = f\"You are a professional songwriter who has a specialty in the following genre(s) of music: {format_genres}.\"\n",
"\n",
"if(len(music_groups) > 0):\n",
"    format_groups = \", \".join(music_groups)\n",
"    system_prompt += f\"\\n\\nYou are also heavily familiar with the stylings of the following artist(s): {format_groups}\"\n",
"\n",
"system_prompt += \"\\n\\nUsing your knowledge of the genre(s) of music you know\"\n",
"\n",
"if(len(music_groups) > 0):\n",
"    system_prompt += \", as well as the artist(s) you are familiar with\"\n",
"\n",
"system_prompt += \", your task will be to write a song that incorporates all of these influences while also making sure \\\n",
"to use very specific keywords in that song. Please give your song in Markdown format.\"\n",
"\n",
"format_keywords = \", \".join(keywords)\n",
"\n",
"user_prompt = f\"I need you to write a song that has these specific keywords in it: {format_keywords}.\\n\\nMake sure that the song is in Markdown format\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67dc3099-2ccc-4ee8-8ff2-0dbbe4ae2fcb",
"metadata": {},
"outputs": [],
"source": [
"# Setting up the API\n",
"\n",
"messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt},\n",
"]\n",
" \n",
"response = openai.chat.completions.create(\n",
" model = \"gpt-4o-mini\",\n",
" messages = messages\n",
" )\n",
"\n",
"# Printing the song in markdown format\n",
"formatted_song = Markdown(response.choices[0].message.content)\n",
"display(formatted_song)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
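
The `int(input(...))` calls in this notebook raise `ValueError` as soon as the user types something that is not a whole number, before the range check in the `while` loop is reached. A minimal sketch of one way to handle that is shown below; `parse_count` and `ask_count` are hypothetical helpers, not part of the notebook, and separating parsing from prompting is a design choice made here so the parsing step can be tested without user input.

```python
from typing import Optional


# Hypothetical helper: turn raw user text into an int within [low, high],
# or return None when the text is not a valid choice.
def parse_count(raw: str, low: int, high: int) -> Optional[int]:
    try:
        value = int(raw.strip())
    except ValueError:
        return None  # e.g. "two" or an empty string
    return value if low <= value <= high else None


# Hypothetical helper: keep asking until parse_count accepts the answer.
def ask_count(prompt: str, low: int, high: int) -> int:
    while True:
        value = parse_count(input(prompt), low, high)
        if value is not None:
            return value
        print(f"Please enter a whole number between {low} and {high}.")
```

With these in place, a cell could call something like `ask_count("How many genres (1-3)? ", 1, 3)` instead of repeating the prompt and the range check by hand.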


@@ -0,0 +1,333 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c6227d68-b1f4-4f71-9cc6-18aa3ce54209",
"metadata": {},
"source": [
"# FirstPage URL Summarizer (OpenAI)\n",
"\n",
"This notebook does not crawl a whole site. It only fetches the first page for each provided URL and asks OpenAI to summarize it.\n",
"\n",
"### What it does\n",
"- Loads a list of URLs (provided inline or from a file)\n",
"- Fetches each page with `aiohttp` (HTML only)\n",
"- Extracts text via BeautifulSoup (basic)\n",
"- Calls OpenAI to produce a structured JSON summary\n",
"- Exports a CSV with: url, http_status, title, meta_description, summary, category, key_entities\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b0fe0e9-228e-461b-9a3e-f4392974c974",
"metadata": {},
"outputs": [],
"source": [
"# (Optional) If running locally, install deps here\n",
"import sys, subprocess\n",
"def pip_install(pkgs):\n",
" subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", *pkgs])\n",
"\n",
"pkgs = [\n",
" \"aiohttp>=3.10\",\n",
" \"beautifulsoup4>=4.12\",\n",
" \"lxml>=5.2\",\n",
" \"pandas>=2.2\",\n",
" \"python-dotenv>=1.0\",\n",
" \"openai>=1.51\",\n",
"]\n",
"try:\n",
" import aiohttp, bs4, lxml, pandas, dotenv, openai\n",
"except Exception:\n",
" pip_install(pkgs)\n",
"print(\"Ready ✔\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "86134741-0f8c-4049-894c-f31b27701da8",
"metadata": {},
"outputs": [],
"source": [
"import os, asyncio, aiohttp, pandas as pd\n",
"from bs4 import BeautifulSoup\n",
"from urllib.parse import urlparse\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"\n",
"load_dotenv() # reads .env if present\n",
"OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\")\n",
"MODEL = os.getenv(\"OPENAI_DEFAULT_MODEL\", \"gpt-4.1-mini\")\n",
"if not OPENAI_API_KEY:\n",
" print(\"Set OPENAI_API_KEY in .env or environment.\")\n",
"client = OpenAI(api_key=OPENAI_API_KEY)\n",
"\n",
"DEFAULT_HEADERS = {\"User-Agent\": \"FirstPageSummarizer/1.0 (+https://edwarddonner.com\"}"
]
},
{
"cell_type": "raw",
"id": "b96c4ed0-4c50-4347-8cc4-22ea21e7e483",
"metadata": {},
"source": [
"## 1) Provide URLs\n",
"You can paste a small list below, or set `URLS_FILE` to a text/CSV file containing URLs (one per line or in a column named `url`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ce4aef5-8df8-4f47-91b3-c3ecc7c4c8be",
"metadata": {},
"outputs": [],
"source": [
"URLS_INLINE = [\n",
" \"https://edwarddonner.com\"\n",
"]\n",
"URLS_FILE = None # e.g., \"urls.txt\" or \"urls.csv\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba9f6f25-a04c-44fe-a16c-f7b5c47ed100",
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"def load_urls(urls_inline, urls_file):\n",
" urls = []\n",
" if urls_file and os.path.exists(urls_file):\n",
" if urls_file.endswith(\".csv\"):\n",
" df = pd.read_csv(urls_file)\n",
" if \"url\" in df.columns:\n",
" urls.extend(df[\"url\"].dropna().tolist())\n",
" else:\n",
" with open(urls_file, \"r\", encoding=\"utf-8\") as f:\n",
" for line in f:\n",
" line=line.strip()\n",
" if line:\n",
" urls.append(line)\n",
" urls.extend([u for u in urls_inline if u])\n",
" # de-dup while preserving order\n",
" seen=set(); out=[]\n",
" for u in urls:\n",
" if u not in seen:\n",
" seen.add(u); out.append(u)\n",
" return out\n",
"\n",
"URLS = load_urls(URLS_INLINE, URLS_FILE)\n",
"print(f\"Loaded {len(URLS)} URLs\")"
]
},
{
"cell_type": "raw",
"id": "bb3761f0-3684-4f30-92e9-869fd4556529",
"metadata": {},
"source": [
"## 2) Fetch first page HTML only\n",
"This grabs the main HTML and extracts simple metadata and body text. No link following."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a7582b6-8277-4967-9d98-8cceeeab486d",
"metadata": {},
"outputs": [],
"source": [
"from aiohttp import ClientTimeout\n",
"from bs4 import BeautifulSoup\n",
"try:\n",
" from bs4 import FeatureNotFound\n",
"except Exception:\n",
" class FeatureNotFound(Exception):\n",
" ...\n",
"\n",
"DEFAULT_HEADERS = {\"User-Agent\": \"FirstPageSummarizer/1.0 (+https://edwarddonner.com)\"}\n",
"\n",
"async def fetch_one(session, url):\n",
" \"\"\"Fetch just one page (HTML if available).\"\"\"\n",
" try:\n",
" async with session.get(\n",
" url,\n",
" timeout=ClientTimeout(total=20),\n",
" headers=DEFAULT_HEADERS,\n",
" allow_redirects=True\n",
" ) as r:\n",
" ctype = r.headers.get(\"Content-Type\", \"\") or \"\"\n",
" is_html = \"html\" in ctype.lower()\n",
" text = await r.text(errors=\"ignore\") if is_html else \"\"\n",
" return {\n",
" \"url\": str(r.url),\n",
" \"status\": r.status,\n",
" \"content_type\": ctype,\n",
" \"html\": text,\n",
" }\n",
" except Exception as e:\n",
" return {\"url\": url, \"status\": None, \"content_type\": \"\", \"html\": \"\", \"error\": str(e)}\n",
"\n",
"def make_soup(html: str) -> BeautifulSoup:\n",
" \"\"\"Try lxml parser first, fall back to built-in html.parser if missing.\"\"\"\n",
" try:\n",
" return BeautifulSoup(html, \"lxml\")\n",
" except FeatureNotFound:\n",
" return BeautifulSoup(html, \"html.parser\")\n",
"\n",
"def extract_fields(url, html):\n",
" \"\"\"Extract title, meta description, and text from HTML.\"\"\"\n",
" soup = make_soup(html)\n",
" title = soup.title.string.strip() if soup.title and soup.title.string else \"\"\n",
"\n",
" meta_desc = \"\"\n",
" m = soup.find(\"meta\", attrs={\"name\": \"description\"})\n",
" if m and m.get(\"content\"):\n",
" meta_desc = m[\"content\"].strip()\n",
"\n",
" for tag in soup([\"script\", \"style\", \"noscript\"]):\n",
" tag.decompose()\n",
"\n",
" text = soup.get_text(\" \", strip=True)\n",
" text = text[:8000] # truncate to limit token size\n",
" return title, meta_desc, text\n",
"\n",
"async def fetch_all(urls):\n",
" \"\"\"Fetch and extract fields for a list of URLs (first page only).\"\"\"\n",
" import aiohttp\n",
" out = []\n",
" async with aiohttp.ClientSession() as session:\n",
" for u in urls:\n",
" resp = await fetch_one(session, u)\n",
" if resp.get(\"html\"):\n",
" title, meta_desc, text = extract_fields(resp[\"url\"], resp[\"html\"])\n",
" resp.update({\"title\": title, \"meta_description\": meta_desc, \"text\": text})\n",
" out.append(resp)\n",
" return out\n",
"\n",
"# Example usage in notebook (if URLS is defined):\n",
"# results = await fetch_all(URLS)\n",
"# len(results), results[:1]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d30a3c6d-b208-4d6b-a5ea-e4276935a629",
"metadata": {},
"outputs": [],
"source": [
"URLS = [\"https://edwarddonner.com\", \"https://www.wikipedia.org/\"]\n",
"results = await fetch_all(URLS)\n",
"len(results), results[:1]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2a53f08-4374-4125-9de8-6e1060e31200",
"metadata": {},
"outputs": [],
"source": [
"import os, json\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"\n",
"load_dotenv()\n",
"client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))\n",
"MODEL = os.getenv(\"OPENAI_DEFAULT_MODEL\", \"gpt-4.1-mini\")\n",
"\n",
"SYSTEM_PROMPT = \"\"\"\n",
"You summarize a web page for migration planning. \n",
"Return JSON with:\n",
"- title: short page title\n",
"- meta_description: concise (<= 160 chars)\n",
"- summary: 3-5 bullet points as a single string\n",
"- category: one of [blog, docs, product, pricing, careers, marketing, legal, support, account, other]\n",
"- key_entities: array of 3-8 important entities/keywords\n",
"\"\"\"\n",
"\n",
"def summarize_page(row):\n",
" user = (\n",
" f\"URL: {row['url']}\\n\"\n",
" f\"<title>{row.get('title','')}</title>\\n\"\n",
" f\"<meta_description>{row.get('meta_description','')}</meta_description>\\n\"\n",
" f\"<text>\\n{row.get('text','')[:6000]}\\n</text>\"\n",
" )\n",
" resp = client.responses.create(\n",
" model=MODEL,\n",
" input=[\n",
" {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
" {\"role\": \"user\", \"content\": user},\n",
" ],\n",
" response_format={\"type\": \"json_object\"}\n",
" )\n",
" return json.loads(resp.output[0].content[0].text)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "59f7d992-e7f0-4287-bd19-f8062fefe8c3",
"metadata": {},
"outputs": [],
"source": [
"enriched = []\n",
"for r in results:\n",
" if r.get(\"status\") and 200 <= r[\"status\"] < 400 and \"html\" in r.get(\"content_type\",\"\").lower():\n",
" try:\n",
" data = summarize_page(r)\n",
" enriched.append({**r, **data})\n",
" except Exception as e:\n",
" enriched.append({**r, \"error\": str(e)})\n",
" else:\n",
" enriched.append({**r, \"error\": \"Non-HTML or bad status\"})\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "822d8108-64c2-4cf1-abc5-1acd288b7574",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame(enriched)\n",
"df.to_csv(\"firstpage_summary.csv\", index=False)\n",
"df.head()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f05d05c-bf6d-4236-8767-8695e4d4618f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,6 @@
aiohttp>=3.10
beautifulsoup4>=4.12
lxml>=5.2
pandas>=2.2
python-dotenv>=1.0
openai>=1.51

View File

@@ -0,0 +1,188 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fe12c203-e6a6-452c-a655-afb8a03a4ff5",
"metadata": {},
"source": [
"# End of week 1 exercise\n",
"\n",
"To demonstrate your familiarity with OpenAI API, and also Ollama, build a tool that takes a technical question, \n",
"and responds with an explanation. This is a tool that you will be able to use yourself during the course!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1070317-3ed9-4659-abe3-828943230e03",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"import json\n",
"from typing import List\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display, update_display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a456906-915a-4bfd-bb9d-57e505c5093f",
"metadata": {},
"outputs": [],
"source": [
"# constants\n",
"\n",
"MODEL_GPT = 'gpt-4o-mini'\n",
"MODEL_LLAMA = 'llama3.2'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8d7923c-5f28-4c30-8556-342d7c8497c1",
"metadata": {},
"outputs": [],
"source": [
"# set up environment\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:\n",
" print(\"API key looks good so far\")\n",
"else:\n",
" print(\"There might be a problem with your API key? Please visit the troubleshooting notebook!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "847fa7cd-1ae6-4888-933a-012e04ab1bcd",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f0d0137-52b0-47a8-81a8-11a90a010798",
"metadata": {},
"outputs": [],
"source": [
"# here is the question; type over this to ask something new\n",
"\n",
"question = \"\"\"\n",
"Please explain what this code does and why:\n",
"yield from {book.get(\"author\") for book in books if book.get(\"author\")}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "60ce7000-a4a5-4cce-a261-e75ef45063b4",
"metadata": {},
"outputs": [],
"source": [
"# Get gpt-4o-mini to answer, with streaming\n",
"\n",
"tone_setting = \"\"\n",
"toneFlag = str(input(\"Would you like the tutor to have a tone to them? (Y/N)\")).lower()\n",
"\n",
"if(toneFlag == \"y\"):\n",
" toneChoice = str(input(\"What kind of tone should they have? You can choose between sarcastic, humorous, snide, scholarly, or lugubrious: \")).lower()\n",
" tone_setting = f\"You have a very {toneChoice} tone and you respond to your students questions in kind. \"\n",
"\n",
"system_prompt = \"You are a computer science tutor who is helping their students with any programming questions they might have. \" + tone_setting + \"\\\n",
"Please give your responses in markdown format.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c038b94e-5b69-4833-b75a-cbd5827d9fb7",
"metadata": {},
"outputs": [],
"source": [
"def question_prompt_setup(question):\n",
" user_prompt = \"The question I have for you is: \" + question\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c544acd-7541-4356-90cc-2c3a6d2f81bf",
"metadata": {},
"outputs": [],
"source": [
"def tutor_response(question):\n",
" stream = openai.chat.completions.create(\n",
" model=MODEL_GPT,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": question_prompt_setup(question)}\n",
" ],\n",
" stream=True\n",
" )\n",
"\n",
" response = \"\"\n",
" display_handle = display(Markdown(\"\"), display_id=True)\n",
" for chunk in stream:\n",
" response += chunk.choices[0].delta.content or ''\n",
" response = response.replace(\"```\",\"\").replace(\"markdown\", \"\")\n",
" update_display(Markdown(response), display_id=display_handle.display_id)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "975622fa-6c03-4069-a067-dfa0c878d04a",
"metadata": {},
"outputs": [],
"source": [
"tutor_response(question)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f7c8ea8-4082-4ad0-8751-3301adcf6538",
"metadata": {},
"outputs": [],
"source": [
"# Get Llama 3.2 to answer"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,249 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "87c471b2-6a46-47f6-9da9-81d2652dd1b6",
"metadata": {},
"source": [
"# The code given by tutor results in an error when more than 1 city name is entered."
]
},
{
"cell_type": "markdown",
"id": "d4c3cdc4-3af9-4b9e-a5d2-80cee3b120be",
"metadata": {},
"source": [
"# This code aims to solve that by giving proper prices for all the given cities"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "292b5152-8932-4341-b2c4-850f16a89e5e",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import json\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92d35c3d-cb2d-4ce8-a6da-3907ce3ce8b8",
"metadata": {},
"outputs": [],
"source": [
"# Initialization\n",
"\n",
"load_dotenv(override=True)\n",
"\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"MODEL = \"gpt-4o-mini\"\n",
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "54e11038-795c-4451-ad3b-f797abb57728",
"metadata": {},
"outputs": [],
"source": [
"system_message = \"You are a helpful assistant for an Airline called FlightAI. \"\n",
"system_message += \"Give short, courteous answers, no more than 1 sentence. \"\n",
"system_message += \"Always be accurate. If you don't know the answer, say so.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e06c982f-59f1-4e33-a1c1-2f56415efbde",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# This function looks rather simpler than the one from my video, because we're taking advantage of the latest Gradio updates\n",
"\n",
"def chat(message, history):\n",
" messages = [{\"role\": \"system\", \"content\": system_message}] + history + [{\"role\": \"user\", \"content\": message}]\n",
" response = openai.chat.completions.create(model=MODEL, messages=messages)\n",
" return response.choices[0].message.content\n",
"\n",
"gr.ChatInterface(fn=chat, type=\"messages\").launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d895e0ff-c47f-4b01-b987-4a236c452ba6",
"metadata": {},
"outputs": [],
"source": [
"# we'll try to impliment methods handle multi inputs in the query\n",
"ticket_prices = {\"london\": \"$799\", \"paris\": \"$899\", \"tokyo\": \"$1400\", \"berlin\": \"$499\"}\n",
"\n",
"def get_ticket_price(destination_city):\n",
" print(f\"Tool get_ticket_price called for {destination_city}\")\n",
" #return_prices = []\n",
" #for city in destination_city:\n",
" city = destination_city.lower()\n",
" #return_prices.append(ticket_prices.get(city,\"unknown\"))\n",
" return ticket_prices.get(city,\"Unknown\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2387fe7-a7ac-4192-ad46-9ec2a9bc49fa",
"metadata": {},
"outputs": [],
"source": [
"get_ticket_price(\"paris\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b63e229e-08c9-49b4-b7af-1883736f12cd",
"metadata": {},
"outputs": [],
"source": [
"# There's a particular dictionary structure that's required to describe our function:\n",
"\n",
"price_function = {\n",
" \"name\": \"get_ticket_price\",\n",
" \"description\": \"Get the price of a return ticket to the destination city. Call this whenever you need to know the ticket price, for example when a customer asks 'How much is a ticket to this city'\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"destination_city\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"List of cities that the customer wants to travel to\",\n",
" },\n",
" },\n",
" \"required\": [\"destination_city\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0162af66-2ea4-4221-93df-dd22f0ad92f7",
"metadata": {},
"outputs": [],
"source": [
"# And this is included in a list of tools:\n",
"\n",
"tools = [{\"type\": \"function\", \"function\": price_function}]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b2a5434-63d0-4519-907e-bce21852d48f",
"metadata": {},
"outputs": [],
"source": [
"def chat(message, history):\n",
" messages = [{\"role\": \"system\", \"content\": system_message}] + history + [{\"role\": \"user\", \"content\": message}]\n",
" response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)\n",
" print(f\"response ----------------- \\n {response}\")\n",
" if response.choices[0].finish_reason==\"tool_calls\":\n",
" message = response.choices[0].message\n",
" print(f\"message: -----------------\\n\",message)\n",
" response, city = handle_tool_call(message)\n",
" # print('response is --------', response)\n",
" # print('city is ----------',city)\n",
" messages.append(message)\n",
" messages.extend(response)\n",
" response = openai.chat.completions.create(model=MODEL, messages=messages)\n",
" \n",
" return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d7dfa28c-95f8-4d25-8f3c-cd677bb4a4d1",
"metadata": {},
"outputs": [],
"source": [
"# We have to write that function handle_tool_call:\n",
"\n",
"def handle_tool_call(message):\n",
" responses = []\n",
" all_cities = []\n",
" for tool_call in message.tool_calls:\n",
" \n",
" arguments = json.loads(tool_call.function.arguments)\n",
" list_of_city = arguments.get('destination_city')\n",
" print(f'list of city is ======== {list_of_city}')\n",
" price = get_ticket_price(list_of_city)\n",
" print(f'price of ticket to {list_of_city} is {price}')\n",
" response = {\n",
" \"role\": \"tool\",\n",
" \"content\": json.dumps({\"destination_city\": list_of_city,\"price\": price}),\n",
" \"tool_call_id\": tool_call.id\n",
" }\n",
" responses.append(response)\n",
" all_cities.append(list_of_city)\n",
" print(f'responses ====== {responses}')\n",
" print(f'cities ======= {all_cities}')\n",
" return responses,all_cities"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "15a4152d-6455-4116-bb63-6700eedf0626",
"metadata": {},
"outputs": [],
"source": [
"gr.ChatInterface(fn=chat, type=\"messages\").launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6b0fcfa-38b7-4063-933e-1c8177bf55f1",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -3,7 +3,7 @@ name: Run Python script
on:
push:
branches:
- figma_assistance
- Figma_AI_Assistant
jobs:
build:

View File

@@ -1,5 +1,5 @@
---
title: Figma_assistance
title: Figma_AI_Assistant
app_file: day_5_figma_assistance.py
sdk: gradio
sdk_version: 5.38.2

View File

@@ -292,7 +292,6 @@ custom_css = """
background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%);
padding: 15px 20px;
border-radius: 10px;
margin: 20px 0;
}
.quickstart-title {
@@ -315,7 +314,7 @@ with gr.Blocks(title="Figma Onboarding Assistant", theme=gr.themes.Soft(), css=c
gr.HTML(
"""
<div class="header-container">
<h1 class="header-title">🎨 Figma Onboarding Assistant</h1>
<h1 class="header-title">🎨 Figma AI Assistant</h1>
<p class="header-subtitle">Your AI-powered Figma learning companion</p>
</div>
@@ -351,26 +350,6 @@ with gr.Blocks(title="Figma Onboarding Assistant", theme=gr.themes.Soft(), css=c
"""
)
# Model selection dropdown
model_dropdown = gr.Dropdown(
choices=["OpenAI (GPT-3.5)", "Google Gemini (2.0 Flash)", "Claude (Sonnet 4)"],
value="OpenAI (GPT-3.5)",
label="Select AI Model",
info="Choose which AI model to use for responses"
)
with gr.Row():
msg = gr.Textbox(
placeholder="Type your Figma question here...",
container=False,
scale=4
)
submit_btn = gr.Button("Ask", scale=1, variant="primary")
clear_btn = gr.Button("Clear Chat", scale=1)
audio_btn = gr.Button("🔊 Play Audio", scale=1, variant="secondary")
clear_audio_btn = gr.Button("🔇 Clear Audio", scale=1, variant="secondary")
# Example questions
gr.HTML(
"""
@@ -380,7 +359,7 @@ with gr.Blocks(title="Figma Onboarding Assistant", theme=gr.themes.Soft(), css=c
</div>
"""
)
with gr.Row():
example_btns = [
gr.Button(
@@ -405,6 +384,24 @@ with gr.Blocks(title="Figma Onboarding Assistant", theme=gr.themes.Soft(), css=c
)
]
# Model selection dropdown
model_dropdown = gr.Dropdown(
choices=["OpenAI (GPT-3.5)", "Google Gemini (2.0 Flash)", "Claude (Sonnet 4)"],
value="OpenAI (GPT-3.5)",
label="Select AI Model",
info="Choose which AI model to use for responses"
)
with gr.Row():
msg = gr.Textbox(
placeholder="Type your Figma question here...",
container=False,
scale=4
)
submit_btn = gr.Button("Ask", scale=1, variant="primary")
clear_btn = gr.Button("Clear Chat", scale=1)
# Your components with simple styling
chatbot = gr.Chatbot(
type="messages",
@@ -412,6 +409,9 @@ with gr.Blocks(title="Figma Onboarding Assistant", theme=gr.themes.Soft(), css=c
placeholder="Ask me anything about Figma! For example: 'How do I create a component?' or 'What are frames in Figma?'",
elem_classes=["styled-chat"]
)
with gr.Row():
audio_btn = gr.Button("🔊 Text To Audio", scale=1, variant="primary")
clear_audio_btn = gr.Button("🔇 Clear Audio", scale=1, variant="secondary")
audio_output = gr.Audio(
label="Audio Response",

View File

@@ -0,0 +1,171 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4076637d",
"metadata": {},
"source": [
"Here we have 3 bots - Gpt, Gemini and Claude. They are trying to find the top 3 stocks on NYSE which all the 3 bots identify, review and finalize when in consensus. The call to the model is kept with a range of 100 in for loop to mimic infinite loop as a breaking condition is set for when the 3 bots are in consensus for the top 3 stocks.\n",
"I would like to invite the reader to go through the code and share any feedbacks that could help me improve more on this. Any suggestions and feedbacks are most welcome. You could send your feedback at - srbmisc@gmail.com.\n",
"\n",
"Thank You"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8c8a1f2",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import anthropic\n",
"from IPython.display import Markdown, display, update_display"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24b06e47",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"if anthropic_api_key:\n",
" print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n",
"else:\n",
" print(\"Anthropic API Key not set\")\n",
"\n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n",
"else:\n",
" print(\"Google API Key not set\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d687412b",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c5d430e",
"metadata": {},
"outputs": [],
"source": [
"sp_gpt = '''You are a bot Gpt. You are in a conversation with 2 other bots Gemini and Claude. All 3 of you are trying to figure out the top 3 stocks that have performed so far in NYSE. \n",
"At every turn, you propose a stock, share its performance and fundamentals, and ask for the others for their review on this stock. Similarly, when its their turn and they share their stock pick, you review their pick.\n",
"If you think thier pick was better, accept it and in your next turn share that same stock otherwise ask them to accept your pick. The goal is to come up with 3 stocks at the end that all 3 participants consider the best.\n",
"If there is a concensus on 3 top stocks and its your turn, just output like this CONSENSUS REACHED : stock 1, stock 2, stock 3\n",
"Prefix your response with Gpt: in bold and respond in Markdown'''\n",
"\n",
"sp_gemini = '''You are a bot Gemini. You are in a conversation with 2 other bots Gpt and Claude. All 3 of you are trying to figure out the top 3 stocks that have performed so far in NYSE. \n",
"At every turn, you propose a stock, share its performance and fundamentals, and ask for the others for their review on this stock. Similarly, when its their turn and they share their stock pick, you review their pick.\n",
"If you think thier pick was better, accept it and in your next turn share that same stock otherwise ask them to accept your pick. The goal is to come up with 3 stocks at the end that all 3 participants consider the best.\n",
"If there is a concensus on 3 top stocks and its your turn, just output like this CONSENSUS REACHED : stock 1, stock 2, stock 3\n",
"Prefix your response with Gemini: in bold and respond in Markdown'''\n",
"\n",
"sp_claude = '''You are a bot Claude. You are in a conversation with 2 other bots Gemini and Gpt. All 3 of you are trying to figure out the top 3 stocks that have performed so far in NYSE. \n",
"At every turn, you propose a stock, share its performance and fundamentals, and ask for the others for their review on this stock. Similarly, when its their turn and they share their stock pick, you review their pick.\n",
"If you think thier pick was better, accept it and in your next turn share that same stock otherwise ask them to accept your pick. The goal is to come up with 3 stocks at the end that all 3 participants consider the best.\n",
"If there is a concensus on 3 top stocks and its your turn, just output like this CONSENSUS REACHED : stock 1, stock 2, stock 3\n",
"Prefix your response with Claude: in bold and respond in Markdown'''\n",
"\n",
"talk = \"Gpt: Hello Gemini, Hello Claude. I want to discuss with you a good stock on the NYSE with you.<br> Gemini: Hello Gpt, Hello Claude. Sure go ahead, give us the best stock you think is there on the NYSE ?<br> Claude: Hello Gpt, Hello Gemini. Sure Gpt, lets discuss on some stocks. What stock do you have on mind ?<br>\"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8eb58eae",
"metadata": {},
"outputs": [],
"source": [
"def callBot(mode):\n",
"\n",
" global talk\n",
" talk = talk + \"<br><br><br>\"\n",
" messages = [{\"role\": \"system\", \"content\": sp_gpt if mode==0 else (sp_gemini if mode==1 else sp_claude)},\n",
" {\"role\":\"user\", \"content\":talk}]\n",
"\n",
" if mode==0:\n",
" model = 'gpt-4.1-mini'\n",
" client = OpenAI()\n",
" elif mode==1:\n",
" model = 'gemini-2.5-flash'\n",
" client = OpenAI(api_key=google_api_key, base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\")\n",
" else:\n",
" model = 'claude-3-5-haiku-latest'\n",
" client = OpenAI(api_key=anthropic_api_key, base_url=\"https://api.anthropic.com/v1/\")\n",
"\n",
" stream = client.chat.completions.create(\n",
" model=model,\n",
" messages=messages,\n",
" stream=True\n",
" )\n",
" for chunk in stream:\n",
" talk += (chunk.choices[0].delta.content or '')\n",
" talk = talk.replace(\"```\",\"\").replace(\"markdown\",\"\")\n",
" update_display(Markdown(talk), display_id=display_handle.display_id)\n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a3e9ebc",
"metadata": {},
"outputs": [],
"source": [
"display_handle = display(Markdown(\"\"), display_id=True) \n",
"\n",
"for i in range(100):\n",
" callBot(i%3)\n",
" if 'CONSENSUS REACHED :' in talk or 'CONSENSUS REACHED:' in talk:\n",
" break\n",
"\n",
" "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,359 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ddfa9ae6-69fe-444a-b994-8c4c5970a7ec",
"metadata": {},
"source": [
"# Project - Airline AI Assistant\n",
"\n",
"We'll now bring together what we've learned to make an AI Customer Support assistant for an Airline"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b50bbe2-c0b1-49c3-9a5c-1ba7efa2bcb4",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import json\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"import gradio as gr"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "747e8786-9da8-4342-b6c9-f5f69c2e22ae",
"metadata": {},
"outputs": [],
"source": [
"# Initialization\n",
"\n",
"load_dotenv(override=True)\n",
"\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"MODEL = \"gpt-4o-mini\"\n",
"openai = OpenAI()\n",
"\n",
"# As an alternative, if you'd like to use Ollama instead of OpenAI\n",
"# Check that Ollama is running for you locally (see week1/day2 exercise) then uncomment these next 2 lines\n",
"# MODEL = \"llama3.2\"\n",
"# openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a521d84-d07c-49ab-a0df-d6451499ed97",
"metadata": {},
"outputs": [],
"source": [
"system_message = \"You are a helpful assistant for an Airline called FlightAI. \"\n",
"system_message += \"Give short, courteous answers, no more than 1 sentence. \"\n",
"system_message += \"Always be accurate. If you don't know the answer, say so.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61a2a15d-b559-4844-b377-6bd5cb4949f6",
"metadata": {},
"outputs": [],
"source": [
"# This function looks rather simpler than the one from my video, because we're taking advantage of the latest Gradio updates\n",
"\n",
"def chat(message, history):\n",
" messages = [{\"role\": \"system\", \"content\": system_message}] + history + [{\"role\": \"user\", \"content\": message}]\n",
" response = openai.chat.completions.create(model=MODEL, messages=messages)\n",
" return response.choices[0].message.content\n",
"\n",
"gr.ChatInterface(fn=chat, type=\"messages\").launch()"
]
},
{
"cell_type": "markdown",
"id": "36bedabf-a0a7-4985-ad8e-07ed6a55a3a4",
"metadata": {},
"source": [
"## Tools\n",
"\n",
"Tools are an incredibly powerful feature provided by the frontier LLMs.\n",
"\n",
"With tools, you can write a function, and have the LLM call that function as part of its response.\n",
"\n",
"Sounds almost spooky.. we're giving it the power to run code on our machine?\n",
"\n",
"Well, kinda."
]
},
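{
"cell_type": "markdown",
"id": "7c2e5a10-3d4b-4f6a-9b1e-5a8f0c2d7e41",
"metadata": {},
"source": [
"To make \"kinda\" concrete, here is a minimal offline sketch of what a tool call amounts to. Nothing in it touches the OpenAI API: the `demo_` names, the call id and the hard-coded request are all made up for illustration. The point it tries to show is that the model only hands back a function name and a JSON string of arguments; our own code decides whether and how anything gets run."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4b9d3f6-1a27-4c58-8f0d-6b3a9c1e2d75",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Hypothetical stand-in for a real tool\n",
"def demo_get_price(destination_city):\n",
"    demo_prices = {\"paris\": \"$899\"}\n",
"    return demo_prices.get(destination_city.lower(), \"Unknown\")\n",
"\n",
"# What a model's tool request boils down to: a name plus arguments as a JSON string (values assumed)\n",
"demo_request = {\"id\": \"call_demo_1\", \"name\": \"demo_get_price\", \"arguments\": '{\"destination_city\": \"Paris\"}'}\n",
"\n",
"# Only functions we register here can ever be run\n",
"demo_registry = {\"demo_get_price\": demo_get_price}\n",
"\n",
"demo_arguments = json.loads(demo_request[\"arguments\"])\n",
"demo_result = demo_registry[demo_request[\"name\"]](**demo_arguments)\n",
"\n",
"# The result goes back to the model as a \"tool\" message tied to the request id\n",
"demo_reply = {\n",
"    \"role\": \"tool\",\n",
"    \"content\": json.dumps({**demo_arguments, \"price\": demo_result}),\n",
"    \"tool_call_id\": demo_request[\"id\"]\n",
"}\n",
"print(demo_reply)"
]
},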
{
"cell_type": "code",
"execution_count": null,
"id": "0696acb1-0b05-4dc2-80d5-771be04f1fb2",
"metadata": {},
"outputs": [],
"source": [
"# Let's start by making a useful function\n",
"\n",
"ticket_prices = {\"london\": \"$799\", \"paris\": \"$899\", \"tokyo\": \"$1400\", \"berlin\": \"$499\"}\n",
"flight_schedules = {\"london\": [\"08:00\", \"15:00\"], \"paris\": [\"09:00\", \"16:00\"], \"tokyo\": [\"12:00\"], \"berlin\": [\"07:00\", \"13:00\"]}\n",
"\n",
"def get_ticket_price(destination_city):\n",
"    print(f\"Tool get_ticket_price called for {destination_city}\")\n",
"    city = destination_city.lower()\n",
"    return ticket_prices.get(city, \"Unknown\")\n",
"\n",
"def get_flight_schedules(destination_city):\n",
"    print(f\"Tool get_flight_schedules called for {destination_city}\")\n",
"    city = destination_city.lower()\n",
"    return flight_schedules.get(city, \"Unknown\")\n",
"\n",
"def flight_confirmation_number(destination_city, date, hour):\n",
"    import random\n",
"    number = destination_city[:3].upper() + ''.join(random.choices('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', k=6))\n",
"    print(f\"Tool flight_confirmation_number called for {destination_city} on {date} at {hour}, returning {number}\")\n",
"    return number"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80ca4e09-6287-4d3f-997d-fa6afbcf6c85",
"metadata": {},
"outputs": [],
"source": [
"print(get_ticket_price(\"London\"))\n",
"print(get_flight_schedules(\"London\"))\n",
"print(flight_confirmation_number(\"London\", \"2024-10-01\", \"15:00\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4afceded-7178-4c05-8fa6-9f2085e6a344",
"metadata": {},
"outputs": [],
"source": [
"# There's a particular dictionary structure that's required to describe our function:\n",
"\n",
"price_function = {\n",
" \"name\": \"get_ticket_price\",\n",
" \"description\": \"Get the price of a return ticket to the destination city. Call this whenever you need \\\n",
" to know the ticket price, for example when a customer asks 'How much is a ticket to this city'\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"destination_city\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The city that the customer wants to travel to\",\n",
" },\n",
" },\n",
" \"required\": [\"destination_city\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}\n",
"\n",
"schedule_function = {\n",
" \"name\": \"get_flight_schedules\",\n",
" \"description\": \"Get the daily flight schedules (departure times) to the destination city. Call this \\\n",
" whenever you need to know the flight times, for example when a customer asks 'What time \\\n",
" are the flights to this city?'\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"destination_city\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The city that the customer wants to travel to\",\n",
" },\n",
" },\n",
" \"required\": [\"destination_city\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}\n",
"\n",
"confirmation_function = {\n",
" \"name\": \"flight_confirmation_number\",\n",
" \"description\": \"Get a flight confirmation number for a booking. Call this whenever you need to \\\n",
" provide a confirmation number, after a customer has selected a destination city, a flight \\\n",
" date and a departure time, and also asked for the price. For example when a customer says \\\n",
" 'I'd like to book that flight'\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"destination_city\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The city that the customer wants to travel to\",\n",
" },\n",
" \"date\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The date of the flight, in YYYY-MM-DD format\",\n",
" },\n",
" \"hour\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The departure time of the flight, in HH:MM format\",\n",
" },\n",
" },\n",
" \"required\": [\"destination_city\", \"date\", \"hour\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bdca8679-935f-4e7f-97e6-e71a4d4f228c",
"metadata": {},
"outputs": [],
"source": [
"# And this is included in a list of tools:\n",
"\n",
"tools = [{\"type\": \"function\", \"function\": price_function},\n",
" {\"type\": \"function\", \"function\": schedule_function},\n",
" {\"type\": \"function\", \"function\": confirmation_function}]"
]
},
{
"cell_type": "markdown",
"id": "c3d3554f-b4e3-4ce7-af6f-68faa6dd2340",
"metadata": {},
"source": [
"## Getting OpenAI to use our Tool\n",
"\n",
"There's some fiddly stuff to allow OpenAI \"to call our tool\"\n",
"\n",
"What we actually do is give the LLM the opportunity to inform us that it wants us to run the tool.\n",
"\n",
"Here's how the new chat function looks:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce9b0744-9c78-408d-b9df-9f6fd9ed78cf",
"metadata": {},
"outputs": [],
"source": [
"def chat(message, history):\n",
"    messages = [{\"role\": \"system\", \"content\": system_message}] + history + [{\"role\": \"user\", \"content\": message}]\n",
"    response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)\n",
"\n",
"    # Loop rather than branch once: the model may ask for another tool after seeing a result\n",
"    while response.choices[0].finish_reason == \"tool_calls\":\n",
"        tool_request = response.choices[0].message\n",
"        messages.append(tool_request)\n",
"        # Every requested call needs its own \"tool\" reply, matched by tool_call_id\n",
"        messages.extend(handle_tool_call(tool_request))\n",
"        response = openai.chat.completions.create(model=MODEL, messages=messages, tools=tools)\n",
"\n",
"    return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b0992986-ea09-4912-a076-8e5603ee631f",
"metadata": {},
"outputs": [],
"source": [
"# We have to write that function handle_tool_call:\n",
"def handle_price_tool_call(tool_call_id, arguments):\n",
"    city = arguments.get('destination_city')\n",
"    price = get_ticket_price(city)\n",
"    response = {\n",
"        \"role\": \"tool\",\n",
"        \"content\": json.dumps({\"destination_city\": city,\"price\": price}),\n",
"        \"tool_call_id\": tool_call_id\n",
"    }\n",
"    return response, city\n",
"\n",
"def handle_schedule_tool_call(tool_call_id, arguments):\n",
"    city = arguments.get('destination_city')\n",
"    schedules = get_flight_schedules(city)\n",
"    response = {\n",
"        \"role\": \"tool\",\n",
"        \"content\": json.dumps({\"destination_city\": city,\"schedules\": schedules}),\n",
"        \"tool_call_id\": tool_call_id\n",
"    }\n",
"    return response, city\n",
"\n",
"def handle_confirmation_tool_call(tool_call_id, arguments):\n",
"    city = arguments.get('destination_city')\n",
"    date = arguments.get('date')\n",
"    hour = arguments.get('hour')\n",
"    confirmation = flight_confirmation_number(city, date, hour)\n",
"    response = {\n",
"        \"role\": \"tool\",\n",
"        \"content\": json.dumps({\"destination_city\": city,\"date\": date,\"hour\": hour,\"confirmation_number\": confirmation}),\n",
"        \"tool_call_id\": tool_call_id\n",
"    }\n",
"    return response, city\n",
"\n",
"def handle_tool_call(message):\n",
"    # Returns one \"tool\" message per requested call, in the order the model asked for them\n",
"    print(\"Number of tool calls:\", len(message.tool_calls))\n",
"\n",
"    responses = []\n",
"    for tool_call in message.tool_calls:\n",
"        name = tool_call.function.name\n",
"        print(\"Tool call is for function:\", name)\n",
"        arguments = json.loads(tool_call.function.arguments)\n",
"\n",
"        if name == \"get_ticket_price\":\n",
"            response, _ = handle_price_tool_call(tool_call.id, arguments)\n",
"        elif name == \"get_flight_schedules\":\n",
"            response, _ = handle_schedule_tool_call(tool_call.id, arguments)\n",
"        elif name == \"flight_confirmation_number\":\n",
"            response, _ = handle_confirmation_tool_call(tool_call.id, arguments)\n",
"        else:\n",
"            # Report an unknown tool back to the model instead of failing on an unbound variable\n",
"            response = {\n",
"                \"role\": \"tool\",\n",
"                \"content\": json.dumps({\"error\": f\"Unknown tool: {name}\"}),\n",
"                \"tool_call_id\": tool_call.id\n",
"            }\n",
"        responses.append(response)\n",
"\n",
"    return responses"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4be8a71-b19e-4c2f-80df-f59ff2661f14",
"metadata": {},
"outputs": [],
"source": [
"gr.ChatInterface(fn=chat, type=\"messages\").launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "530e4bef",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -301,7 +301,6 @@
" background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%);\n",
" padding: 15px 20px;\n",
" border-radius: 10px;\n",
" margin: 20px 0;\n",
"}\n",
"\n",
".quickstart-title {\n",
@@ -324,7 +323,7 @@
" gr.HTML(\n",
" \"\"\"\n",
" <div class=\"header-container\">\n",
" <h1 class=\"header-title\">🎨 Figma Onboarding Assistant</h1>\n",
" <h1 class=\"header-title\">🎨 Figma AI Assistant</h1>\n",
" <p class=\"header-subtitle\">Your AI-powered Figma learning companion</p>\n",
" </div>\n",
" \n",
@@ -360,26 +359,6 @@
" \"\"\"\n",
" )\n",
" \n",
" # Model selection dropdown\n",
" model_dropdown = gr.Dropdown(\n",
" choices=[\"OpenAI (GPT-3.5)\", \"Google Gemini (2.0 Flash)\", \"Claude (Sonnet 4)\"],\n",
" value=\"OpenAI (GPT-3.5)\",\n",
" label=\"Select AI Model\",\n",
" info=\"Choose which AI model to use for responses\"\n",
" )\n",
" \n",
" with gr.Row():\n",
" msg = gr.Textbox(\n",
" placeholder=\"Type your Figma question here...\",\n",
" container=False,\n",
" scale=4\n",
" )\n",
" submit_btn = gr.Button(\"Ask\", scale=1, variant=\"primary\")\n",
" clear_btn = gr.Button(\"Clear Chat\", scale=1)\n",
" audio_btn = gr.Button(\"🔊 Play Audio\", scale=1, variant=\"secondary\")\n",
" clear_audio_btn = gr.Button(\"🔇 Clear Audio\", scale=1, variant=\"secondary\")\n",
" \n",
"\n",
" # Example questions\n",
" gr.HTML(\n",
" \"\"\"\n",
@@ -389,7 +368,7 @@
" </div>\n",
" \"\"\"\n",
" )\n",
" \n",
"\n",
" with gr.Row():\n",
" example_btns = [\n",
" gr.Button(\n",
@@ -414,6 +393,24 @@
" )\n",
" ]\n",
"\n",
" # Model selection dropdown\n",
" model_dropdown = gr.Dropdown(\n",
" choices=[\"OpenAI (GPT-3.5)\", \"Google Gemini (2.0 Flash)\", \"Claude (Sonnet 4)\"],\n",
" value=\"OpenAI (GPT-3.5)\",\n",
" label=\"Select AI Model\",\n",
" info=\"Choose which AI model to use for responses\"\n",
" )\n",
" \n",
" with gr.Row():\n",
" msg = gr.Textbox(\n",
" placeholder=\"Type your Figma question here...\",\n",
" container=False,\n",
" scale=4\n",
" )\n",
" submit_btn = gr.Button(\"Ask\", scale=1, variant=\"primary\")\n",
" clear_btn = gr.Button(\"Clear Chat\", scale=1)\n",
"\n",
"\n",
" # Your components with simple styling\n",
" chatbot = gr.Chatbot(\n",
" type=\"messages\",\n",
@@ -421,6 +418,9 @@
" placeholder=\"Ask me anything about Figma! For example: 'How do I create a component?' or 'What are frames in Figma?'\",\n",
" elem_classes=[\"styled-chat\"]\n",
" )\n",
" with gr.Row():\n",
" audio_btn = gr.Button(\"🔊 Text To Audio\", scale=1, variant=\"primary\")\n",
" clear_audio_btn = gr.Button(\"🔇 Clear Audio\", scale=1, variant=\"secondary\")\n",
"\n",
" audio_output = gr.Audio(\n",
" label=\"Audio Response\",\n",

View File

@@ -0,0 +1,324 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d006b2ea-9dfe-49c7-88a9-a5a0775185fd",
"metadata": {},
"source": [
"# Additional End of week Exercise - week 2\n",
"\n",
"Now use everything you've learned from Week 2 to build a full prototype for the technical question/answerer you built in Week 1 Exercise.\n",
"\n",
"This should include a Gradio UI, streaming, use of the system prompt to add expertise, and the ability to switch between models. Bonus points if you can demonstrate use of a tool!\n",
"\n",
"If you feel bold, see if you can add audio input so you can talk to it, and have it respond with audio. ChatGPT or Claude can help you, or email me if you have questions.\n",
"\n",
"I will publish a full solution here soon - unless someone beats me to it...\n",
"\n",
"There are so many commercial applications for this, from a language tutor, to a company onboarding solution, to a companion AI to a course (like this one!) I can't wait to see your results."
]
},
{
"cell_type": "markdown",
"id": "87f483d5-dc85-41d1-bb34-5b49c6eeb30c",
"metadata": {},
"source": [
"**I built a coding expert tutor with 2 models: Gemini and GPT.\n",
"It works with streaming and tools simultaneously.\n",
"If a user asks a mathematical question, DALL-E 3 generates an image of that equation.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a07e7793-b8f5-44f4-aded-5562f633271a",
"metadata": {},
"outputs": [],
"source": [
"import gradio\n",
"from openai import OpenAI\n",
"import os\n",
"from dotenv import load_dotenv\n",
"import math\n",
"import json\n",
"import base64\n",
"from io import BytesIO\n",
"from PIL import Image"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "436819d1-8a09-43e2-9429-35189cc92317",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n",
"else:\n",
" print(\"Google API Key not set\")\n",
" \n",
" \n",
"GPT_MODEL = \"gpt-5-nano\"\n",
"GEMINI_MODEL = \"gemini-2.5-flash\"\n",
"openai = OpenAI()\n",
"gemini = OpenAI(\n",
" api_key=google_api_key, \n",
" base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\"\n",
")\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e154015c-0c16-41a5-9518-163a9ae3ea0c",
"metadata": {},
"outputs": [],
"source": [
"system_message = \"You are an expert coding tutor. \\n\" \\\n",
"\"You explain the answers in a friendly and easy to understand way.\\n\" \\\n",
"\"However, if the input from the user feels too vague, ask them to provide more details before answering.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "937dc916-fc0b-47a4-b963-4d689cec4f60",
"metadata": {},
"outputs": [],
"source": [
"def calculate_math(math_equation):\n",
" print(\"Math calculator tool has been run...\")\n",
" \n",
" allowed = {\"__builtins__\": None}\n",
" allowed.update({k: getattr(math, k) for k in dir(math) if not k.startswith(\"_\")})\n",
" \n",
" result = eval(math_equation, allowed, {})\n",
" return result\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37a74256-fbf6-4539-8481-87bf73abefd4",
"metadata": {},
"outputs": [],
"source": [
"calculate_math(\"sqrt(25)\")"
]
},
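{
"cell_type": "markdown",
"id": "2b6f8d14-9c3e-4a71-b5d0-7e1f4a9c3b62",
"metadata": {},
"source": [
"A note on `eval`: setting `__builtins__` to `None` narrows what an expression can reach, but it is not a real sandbox, because attribute access on ordinary objects still works. Since the expression here is written by an LLM, a stricter evaluator may be worth considering. The cell below is only a sketch of one alternative, not what this notebook uses: `safe_calculate` and its helper tables are hypothetical names, and it walks the parsed expression with the standard `ast` module, allowing nothing but numbers, basic arithmetic and names from `math`. It does not limit the size of the computation, so something like a huge exponent can still take a long time."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d0a7c35-6e42-4b18-a3f9-1c5e8b2d4f07",
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import math\n",
"import operator\n",
"\n",
"# Hypothetical allow-lists: anything not listed here is rejected\n",
"_BINARY_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,\n",
"               ast.Div: operator.truediv, ast.Pow: operator.pow, ast.Mod: operator.mod}\n",
"_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}\n",
"_MATH_NAMES = {k: getattr(math, k) for k in dir(math) if not k.startswith(\"_\")}\n",
"\n",
"def safe_calculate(expression):\n",
"    def evaluate(node):\n",
"        if isinstance(node, ast.Expression):\n",
"            return evaluate(node.body)\n",
"        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):\n",
"            return node.value\n",
"        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:\n",
"            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))\n",
"        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:\n",
"            return _UNARY_OPS[type(node.op)](evaluate(node.operand))\n",
"        # Bare names are limited to math constants such as pi or e\n",
"        if isinstance(node, ast.Name) and node.id in _MATH_NAMES and not callable(_MATH_NAMES[node.id]):\n",
"            return _MATH_NAMES[node.id]\n",
"        # Calls are limited to math functions with positional arguments\n",
"        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)\n",
"                and node.func.id in _MATH_NAMES and not node.keywords):\n",
"            return _MATH_NAMES[node.func.id](*[evaluate(arg) for arg in node.args])\n",
"        raise ValueError(f\"Unsupported expression element: {type(node).__name__}\")\n",
"\n",
"    return evaluate(ast.parse(expression, mode=\"eval\"))\n",
"\n",
"print(safe_calculate(\"sqrt(25) + 2 * pi\"))"
]
},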
{
"cell_type": "code",
"execution_count": null,
"id": "c858d63d-c90f-4ab9-bf03-2047622ed151",
"metadata": {},
"outputs": [],
"source": [
"calculate_math_function = {\n",
"    \"name\": \"calculate_math\",\n",
"    \"description\": \"Calculate math requested by the user. You should run this tool when a user asks to know the result of ANY equation. For example: 'What is the result of this: sqrt(25)'\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"math_equation\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The math question the user wants to calculate. You should pass only the math equation, not text. For example: sqrt(25)\",\n",
" },\n",
" },\n",
" \"required\": [\"math_equation\"],\n",
" \"additionalProperties\": False\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1c32ef1f-909c-4646-b39f-006d26a44d10",
"metadata": {},
"outputs": [],
"source": [
"tools = [{\"type\": \"function\", \"function\": calculate_math_function}]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "edcea23f-769c-4d40-b07c-ac2fc89d2af9",
"metadata": {},
"outputs": [],
"source": [
"def generate_math_result_image(equation, result):\n",
"    image_response = openai.images.generate(\n",
"        model=\"dall-e-3\",\n",
"        prompt=f\"Generate a realistic image of a math equation: '{equation}={result}' on a school chalkboard.\",\n",
"        size=\"1024x1024\",\n",
"        n=1,\n",
"        response_format=\"b64_json\",\n",
"    )\n",
"    image_base64 = image_response.data[0].b64_json\n",
"    image_data = base64.b64decode(image_base64)\n",
"    return Image.open(BytesIO(image_data))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea0fa17b-069e-4080-9cfc-a0674a2bcca6",
"metadata": {},
"outputs": [],
"source": [
"def chat(history, model=\"GPT\"):\n",
"    messages = [{\"role\": \"system\", \"content\": system_message}] + history\n",
"    if model == \"GPT\":\n",
"        response = openai.chat.completions.create(model=GPT_MODEL, messages=messages, stream=True, tools=tools)\n",
"    else:\n",
"        response = gemini.chat.completions.create(model=GEMINI_MODEL, messages=messages, stream=True, tools=tools)\n",
"\n",
"    buffer = {\"role\": \"assistant\", \"content\": \"\", \"tool_calls\": []}\n",
"    tool_answer = \"\"\n",
"    image = None\n",
"\n",
"    for chunk in response:\n",
"        delta = chunk.choices[0].delta\n",
"        # Plain text: stream it straight into the chat window\n",
"        if delta.content:\n",
"            buffer[\"content\"] += delta.content\n",
"            yield history + [buffer], image\n",
"\n",
"        # Tool call: the first chunk carries the function name, later chunks carry argument fragments\n",
"        if delta.tool_calls:\n",
"            if delta.tool_calls[0].function.name:\n",
"                buffer[\"tool_calls\"].append(delta.tool_calls[0])\n",
"            for call in delta.tool_calls:\n",
"                # Only GPT streams the arguments in pieces; a fragment may be None, so fall back to \"\"\n",
"                if call.function and model == \"GPT\":\n",
"                    buffer[\"tool_calls\"][0].function.arguments += call.function.arguments or \"\"\n",
"\n",
"        # The model has finished describing the tool call: run it, then stream the follow-up answer\n",
"        if chunk.choices[0].finish_reason == \"tool_calls\":\n",
"            tool_call = buffer[\"tool_calls\"][0]\n",
"            response, result, math_equation = handle_calculate_tool_call(tool_call)\n",
"            messages.append(buffer)\n",
"            messages.append(response)\n",
"            image = generate_math_result_image(math_equation, result)\n",
"            if model == \"GPT\":\n",
"                next_response = openai.chat.completions.create(model=GPT_MODEL, messages=messages, stream=True)\n",
"            else:\n",
"                next_response = gemini.chat.completions.create(model=GEMINI_MODEL, messages=messages, stream=True)\n",
"            for next_chunk in next_response:\n",
"                tool_answer += next_chunk.choices[0].delta.content or \"\"\n",
"                yield history + [{\"role\": \"assistant\", \"content\": tool_answer}], image"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5605e90c-1ccb-4222-b15e-9be35fd58168",
"metadata": {},
"outputs": [],
"source": [
"def handle_calculate_tool_call(tool_call):\n",
" arguments = json.loads(tool_call.function.arguments)\n",
" math_equation = arguments.get('math_equation')\n",
" result = calculate_math(math_equation)\n",
" response = {\n",
" \"role\": \"tool\",\n",
" \"content\": json.dumps({\"math_equation\": math_equation, \"result\": result}),\n",
" \"tool_call_id\": tool_call.id\n",
" }\n",
" return response, result, math_equation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "89da6939-f38f-4584-9413-85ff843d9b32",
"metadata": {},
"outputs": [],
"source": [
"def transcribe(audio_file):\n",
"    if audio_file is None:\n",
"        return \"\"\n",
"    with open(audio_file, \"rb\") as f:\n",
"        transcription = openai.audio.transcriptions.create(\n",
"            model=\"gpt-4o-mini-transcribe\",\n",
"            file=f\n",
"        )\n",
"    return transcription.text"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6b9ba370-6014-4f66-8f57-824465b7fe41",
"metadata": {},
"outputs": [],
"source": [
"with gradio.Blocks() as ui:\n",
"    with gradio.Row():\n",
"        chatbot = gradio.Chatbot(height=500, type=\"messages\")\n",
"        image_output = gradio.Image(height=500)\n",
"    with gradio.Row():\n",
"        entry = gradio.Textbox(label=\"Chat with our code expert:\")\n",
"        microphone = gradio.Audio(sources=\"microphone\", type=\"filepath\")\n",
"    with gradio.Row():\n",
"        # Default to GPT so chat() never receives None when no model has been picked\n",
"        ai_model = gradio.Dropdown([\"GPT\", \"Gemini\"], value=\"GPT\", label=\"Select Model\")\n",
"        clear = gradio.Button(\"Clear\")\n",
"\n",
"    def do_entry(message, history):\n",
"        history += [{\"role\":\"user\", \"content\":message}]\n",
"        return \"\", history, None\n",
"\n",
"    entry.submit(do_entry, inputs=[entry, chatbot], outputs=[entry, chatbot, microphone]).then(\n",
"        chat, inputs=[chatbot, ai_model], outputs=[chatbot, image_output]\n",
"    )\n",
"    microphone.change(\n",
"        transcribe,\n",
"        inputs=[microphone],\n",
"        outputs=[entry]\n",
"    )\n",
"    clear.click(lambda: None, inputs=None, outputs=chatbot, queue=False)\n",
"\n",
"ui.launch()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53abd8ac-a7de-42d1-91bf-741a93e2347b",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@@ -0,0 +1,600 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "QTJt9pwUTbHo"
},
"source": [
"# Intelligent Synthetic Dataset Generator\n",
"\n",
"An AI-powered tool that creates realistic synthetic datasets for any business case, whether you provide the schema or let it intelligently design one for you.\n",
"\n",
"It works with Claude, Gemini, GPT and Hugging Face APIs."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "l_FljmlTUoka"
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "aONqZ-SjUJdg",
"outputId": "1f5c7b2e-95f0-4f23-cf01-2bd5bda0807a"
},
"outputs": [],
"source": [
"!pip install -q requests bitsandbytes anthropic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ub1unBFvTatE"
},
"outputs": [],
"source": [
"import os\n",
"import requests\n",
"import json\n",
"from google.colab import userdata\n",
"\n",
"from openai import OpenAI\n",
"import anthropic\n",
"from huggingface_hub import login\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig\n",
"import torch\n",
"import pandas as pd\n",
"\n",
"import gradio as gr\n",
"import gc"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "viZNPtObUOcz"
},
"outputs": [],
"source": [
"hf_token = userdata.get('HF_TOKEN')\n",
"openai_api_key = userdata.get('OPENAI_API_KEY')\n",
"anthropic_api_key = userdata.get('ANTHROPIC_API_KEY')\n",
"google_api_key = userdata.get('GOOGLE_API_KEY')\n",
"\n",
"login(hf_token, add_to_git_credential=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9Q94S6JTUWn5"
},
"outputs": [],
"source": [
"quant_config = BitsAndBytesConfig(\n",
" load_in_4bit=True,\n",
" bnb_4bit_use_double_quant=True,\n",
" bnb_4bit_compute_dtype=torch.bfloat16,\n",
" bnb_4bit_quant_type=\"nf4\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mrjdVEpaUxHz"
},
"source": [
"## Configuration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LvNE6foEUPaz"
},
"outputs": [],
"source": [
"LLAMA = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n",
"PHI3 = \"microsoft/Phi-3-mini-4k-instruct\"\n",
"GEMMA2 = \"google/gemma-2-2b-it\"\n",
"GPT = \"gpt-4o-mini\"\n",
"CLAUDE = \"claude-3-haiku-20240307\"\n",
"GEMINI = \"gemini-2.0-flash\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tvafTFD8XmaO"
},
"outputs": [],
"source": [
"MODELS = {\n",
" 'LLama 3.1' : LLAMA,\n",
" 'Phi 3 mini': PHI3,\n",
" 'Gemma 2': GEMMA2,\n",
" 'GPT 4.o mini': GPT,\n",
" 'Claude 3 Haiku': CLAUDE,\n",
" 'Gemini 2.0 Flash': GEMINI,\n",
"}\n",
"\n",
"HF_MODELS = [LLAMA, PHI3, GEMMA2]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2LZqA9QXXl0t"
},
"outputs": [],
"source": [
"FILE_FORMATS = [\".csv\", \".tsv\", \".jsonl\", \".json\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "d6EnN7SVXhza",
"outputId": "55f6ac4d-adeb-4216-b2a8-d67524b005d3"
},
"outputs": [],
"source": [
"SCHEMA = [\n",
" (\"Name\", \"TEXT\", \"Name of the restaurant\", \"Blue River Bistro\"),\n",
" (\"Address\", \"TEXT\", \"Restaurant address\", \"742 Evergreen Terrace, Springfield, IL 62704\"),\n",
" (\"Type\", \"TEXT\", \"Kitchen type\", 'One of [\"Thai\",\"Mediterranean\",\"Vegan\",\"Steakhouse\",\"Japanese\"] or other potential types'),\n",
" (\"Average Price\", \"TEXT\", \"Average meal price\", \"$45, or '--' if unknown\"),\n",
" (\"Year\", \"INT\", \"Year of restaurant opening\", 2015),\n",
" (\"Menu\", \"Array\", \"List of meals\", '[\"Grilled Salmon\", \"Caesar Salad\", \"Pad Thai\", \"Margherita Pizza\", ...]'),\n",
"]\n",
"\n",
"DEFAULT_SCHEMA_TEXT = \"\\n\".join([f\"{i+1}. {col[0]} ({col[1]}) - {col[2]}, example: {col[3]}\" for i, col in enumerate(SCHEMA)])\n",
"print(DEFAULT_SCHEMA_TEXT)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W-46TDTOXiS7"
},
"outputs": [],
"source": [
"system_prompt = \"\"\"\n",
"You are an expert in generating synthetic datasets tailored to a given business case and user requirements.\n",
"If the user does not specify output columns, infer and create the most appropriate columns based on your expertise.\n",
"Do NOT repeat column values from one row to another. Only output valid JSONL without any comments.\n",
"\"\"\"\n",
"\n",
"\n",
"def get_user_prompt(business_case, schema_text, nr_records):\n",
"    prompt = f\"The business case is: {business_case}.\\nGenerate {nr_records} rows of data in JSONL format.\\n\"\n",
"\n",
"    if schema_text is not None:\n",
"        prompt += f\"Each line should be a JSON object with the following fields: \\n{schema_text}\\n\"\n",
"\n",
"    return prompt"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gPf1GcAwhwa_"
},
"source": [
"## LLM handler"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Tf-WEQUKhY-z"
},
"outputs": [],
"source": [
"def ask_gpt(model: str, user_prompt: str):\n",
" client = OpenAI(api_key=openai_api_key)\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]\n",
" response = client.chat.completions.create(\n",
" model=model,\n",
" messages=messages,\n",
" temperature=0.7\n",
" )\n",
" content = response.choices[0].message.content\n",
"\n",
" return content"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "856pnIHahzDd"
},
"outputs": [],
"source": [
"def ask_claude(model: str, user_prompt: str):\n",
" client = anthropic.Anthropic(api_key=anthropic_api_key)\n",
" response = client.messages.create(\n",
" model=model,\n",
" messages=[{\"role\": \"user\", \"content\": user_prompt}],\n",
" max_tokens=4000,\n",
" temperature=0.7,\n",
" system=system_prompt\n",
" )\n",
" content = response.content[0].text\n",
"\n",
" return content"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "p0AfSbcBiUlg"
},
"outputs": [],
"source": [
"def ask_gemini(model: str, user_prompt: str):\n",
" client = OpenAI(\n",
" api_key=google_api_key,\n",
" base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\"\n",
" )\n",
" messages = [\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]\n",
" response = client.chat.completions.create(\n",
" model=model,\n",
" messages=messages,\n",
" temperature=0.7\n",
" )\n",
" content = response.choices[0].message.content\n",
"\n",
" return content"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "K9LZZPJ9irrH"
},
"outputs": [],
"source": [
"def ask_hf(model: str, user_prompt: str):\n",
"    global tokenizer, inputs, hf_model, outputs\n",
"\n",
"    messages = [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_prompt}\n",
"    ]\n",
"\n",
"    tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)\n",
"    tokenizer.pad_token = tokenizer.eos_token\n",
"    # add_generation_prompt=True ends the prompt with the assistant header, so the model starts its answer directly\n",
"    inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors=\"pt\").to(\"cuda\")\n",
"    # Reload when a different model is selected; otherwise the first loaded model would answer every request\n",
"    if hf_model is None or hf_model.name_or_path != model:\n",
"        hf_model = None\n",
"        gc.collect()\n",
"        torch.cuda.empty_cache()\n",
"        hf_model = AutoModelForCausalLM.from_pretrained(model, device_map=\"auto\", quantization_config=quant_config)\n",
"    outputs = hf_model.generate(inputs, max_new_tokens=4000)\n",
"\n",
"    # Decode only the newly generated tokens, which works for any chat template, not just Llama's header format\n",
"    content = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True).strip()\n",
"\n",
"    return content"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eu7Sv3bDhXdI"
},
"outputs": [],
"source": [
"def query_llm(model_name: str, user_prompt):\n",
"    try:\n",
"        model = MODELS[model_name]\n",
"\n",
"        if \"gpt\" in model.lower():\n",
"            response = ask_gpt(model, user_prompt)\n",
"\n",
"        elif \"claude\" in model.lower():\n",
"            response = ask_claude(model, user_prompt)\n",
"\n",
"        elif \"gemini\" in model.lower():\n",
"            response = ask_gemini(model, user_prompt)\n",
"\n",
"        elif model in HF_MODELS:\n",
"            response = ask_hf(model, user_prompt)\n",
"\n",
"        else:\n",
"            raise ValueError(f\"Unsupported model. Use one of {', '.join(MODELS.keys())}\")\n",
"\n",
"        # Keep only lines that look like JSON objects; this drops code fences and commentary\n",
"        lines = [line.strip() for line in response.strip().splitlines() if line.strip().startswith(\"{\")]\n",
"\n",
"        return [json.loads(line) for line in lines]\n",
"\n",
"    except Exception as e:\n",
"        # Chain the original exception so the underlying traceback is not lost\n",
"        raise RuntimeError(f\"Model query failed: {e}\") from e"
]
},
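{
"cell_type": "markdown",
"metadata": {
"id": "lenientJsonlNote"
},
"source": [
"`query_llm` parses every line that starts with `{`, so a single malformed row makes the whole request fail. If partial results are acceptable, a more forgiving parser is one option. The cell below is a sketch under that assumption, not part of the generator above: `parse_jsonl_lenient` is a hypothetical helper and the sample text is made up. It keeps the rows that parse and counts the ones that do not."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lenientJsonlCode"
},
"outputs": [],
"source": [
"import json\n",
"\n",
"def parse_jsonl_lenient(text):\n",
"    \"\"\"Hypothetical helper: keep the rows that parse, count the rows that do not.\"\"\"\n",
"    records, skipped = [], 0\n",
"    for line in text.splitlines():\n",
"        # Tolerate trailing commas, which models sometimes add between rows\n",
"        line = line.strip().rstrip(\",\")\n",
"        if not line.startswith(\"{\"):\n",
"            continue  # blank lines and any commentary around the data\n",
"        try:\n",
"            records.append(json.loads(line))\n",
"        except json.JSONDecodeError:\n",
"            skipped += 1\n",
"    return records, skipped\n",
"\n",
"# Made-up model output: a line of commentary, one good row and one truncated row\n",
"sample = '''Here is your data:\n",
"{\"Name\": \"Blue River Bistro\", \"Year\": 2015}\n",
"{\"Name\": \"Broken row\",\n",
"'''\n",
"records, skipped = parse_jsonl_lenient(sample)\n",
"print(f\"{len(records)} parsed, {skipped} skipped\")"
]
},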
{
"cell_type": "markdown",
"metadata": {
"id": "mxuwLUsVlBlY"
},
"source": [
"## Output Formatter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IAKfqgZIlGuP"
},
"outputs": [],
"source": [
"def save_dataset(records, file_format: str, file_name: str):\n",
"    df = pd.DataFrame(records)\n",
"    print(df.shape)\n",
"    if file_format == \".csv\":\n",
"        df.to_csv(file_name, index=False)\n",
"    elif file_format == \".tsv\":\n",
"        df.to_csv(file_name, sep=\"\\t\", index=False)\n",
"    elif file_format == \".jsonl\":\n",
"        with open(file_name, \"w\") as f:\n",
"            for record in records:\n",
"                f.write(json.dumps(record) + \"\\n\")\n",
"    elif file_format == \".json\":\n",
"        # orient=\"records\" never writes the index; passing index=False raises ValueError on older pandas versions\n",
"        df.to_json(file_name, orient=\"records\")\n",
"    else:\n",
"        raise ValueError(\"Unsupported file format\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gkpkQ0nal_5B"
},
"outputs": [],
"source": [
"def generate_dataset(\n",
"    model_name: str,\n",
"    business_case: str,\n",
"    num_records: int = 100,\n",
"    schema_text: str = None,\n",
"    file_format: str = '.jsonl',\n",
"    file_name: str = 'test_dataset.jsonl'\n",
"):\n",
"    \"\"\"\n",
"    Generates a synthetic dataset using an LLM based on the given business case and optional schema.\n",
"\n",
"    Returns:\n",
"        Tuple[str, pd.DataFrame | None]: A status message and a preview DataFrame (first 10 rows) if successful.\n",
"    \"\"\"\n",
"    try:\n",
"        # Validate number of records\n",
"        if num_records <= 10:\n",
"            return \"❌ Error: Number of records must be greater than 10.\", None\n",
"        if num_records > 1000:\n",
"            return \"❌ Error: Number of records must be less than or equal to 1000.\", None\n",
"\n",
"        # Validate file format\n",
"        if file_format not in FILE_FORMATS:\n",
"            return f\"❌ Error: Invalid file format '{file_format}'. Supported formats: {FILE_FORMATS}\", None\n",
"\n",
"        # Ensure file name has correct extension\n",
"        if not file_name.endswith(file_format):\n",
"            file_name += file_format\n",
"\n",
"        # Generate the prompt and query the model\n",
"        prompt = get_user_prompt(business_case, schema_text, num_records)\n",
"        records = query_llm(model_name, prompt)\n",
"\n",
"        if not records:\n",
"            return \"❌ Error: No valid records were generated by the model.\", None\n",
"\n",
"        # Save dataset\n",
"        save_dataset(records, file_format, file_name)\n",
"\n",
"        # Prepare preview\n",
"        df = pd.DataFrame(records)\n",
"        preview = df.head(10)\n",
"\n",
"        success_message = (\n",
"            f\"✅ Generated {len(records)} records successfully!\\n\"\n",
"            f\"📁 Saved to: {file_name}\\n\"\n",
"        )\n",
"\n",
"        return success_message, preview\n",
"\n",
"    except Exception as e:\n",
"        return f\"❌ Error: {str(e)}\", None"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 702
},
"id": "Z9WdaSfFUakj",
"outputId": "2fbce2c5-a6d3-4dd8-a9d2-0e38c18d202e"
},
"outputs": [],
"source": [
"with gr.Blocks(title=\"Synthetic Dataset Generator\", theme=gr.themes.Monochrome()) as interface:\n",
" tokenizer = None\n",
" inputs = None\n",
" hf_model = None\n",
" outputs = None\n",
"\n",
" gr.Markdown(\"# Dataset Generator\")\n",
" gr.Markdown(\"Generate synthetic datasets using AI models\")\n",
"\n",
" with gr.Row():\n",
" with gr.Column(scale=2):\n",
" schema_input = gr.Textbox(\n",
" label=\"Schema\",\n",
" value=DEFAULT_SCHEMA_TEXT,\n",
" lines=15,\n",
" placeholder=\"Define your dataset schema here... Please follow this format: Name (TYPE) - Description, example: Example\"\n",
" )\n",
"\n",
" business_case_input = gr.Textbox(\n",
" label=\"Business Case\",\n",
" value=\"I want to generate restaurant dataset\",\n",
" lines=1,\n",
" placeholder=\"Enter business case description...\"\n",
" )\n",
"\n",
" with gr.Row():\n",
" model_dropdown = gr.Dropdown(\n",
" label=\"Model\",\n",
" choices=list(MODELS.keys()),\n",
" value=list(MODELS.keys())[0],\n",
" interactive=True\n",
" )\n",
"\n",
" nr_records_input = gr.Number(\n",
" label=\"Number of records\",\n",
" value=27,\n",
" minimum=11,\n",
" maximum=1000,\n",
" step=1\n",
" )\n",
"\n",
" with gr.Row():\n",
" filename_input = gr.Textbox(\n",
" label=\"Save as\",\n",
" value=\"restaurant_dataset\",\n",
" placeholder=\"Enter filename (extension will be added automatically)\"\n",
" )\n",
"\n",
" file_format_dropdown = gr.Dropdown(\n",
" label=\"File format\",\n",
" choices=FILE_FORMATS,\n",
" value=FILE_FORMATS[0],\n",
" interactive=True\n",
" )\n",
"\n",
" generate_btn = gr.Button(\"🚀 Generate\", variant=\"secondary\", size=\"lg\")\n",
"\n",
" with gr.Column(scale=1):\n",
" gr.Markdown(\"\"\"\n",
" ### 📝 Dataset Generation Instructions\n",
"\n",
" 1. **🗂 Schema** Define your dataset structure\n",
" *(default: restaurant schema provided)*\n",
" 2. **💡 Business Case** Enter a prompt to guide the AI for generating data\n",
" 3. **🤖 Model** Choose your AI model: GPT, Claude, Gemini, or Hugging Face\n",
" 4. **📊 Number of Records** Specify entries to generate\n",
" *(min: 11, max: 1000)*\n",
" 5. **📁 File Format** Select output type: `.csv`, `.tsv`, `.jsonl`, or `.json`\n",
" 6. **💾 Save As** Provide a filename *(extension auto-added)*\n",
" 7. **🚀 Generate** Click **Generate** to create your dataset\n",
"\n",
" ### 🔧 Requirements\n",
"\n",
" Set API keys in Colabs secret section:\n",
" `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `HF_TOKEN`\n",
" \"\"\")\n",
" output_status = gr.Textbox(\n",
" label=\"Status\",\n",
" lines=4,\n",
" interactive=False\n",
" )\n",
"\n",
" output_preview = gr.Dataframe(\n",
" label=\"Preview (first 10 rows)\",\n",
" interactive=False,\n",
" wrap=True\n",
" )\n",
"\n",
" generate_btn.click(\n",
" fn=generate_dataset,\n",
" inputs=[\n",
" model_dropdown,\n",
" business_case_input,\n",
" nr_records_input,\n",
" schema_input,\n",
" file_format_dropdown,\n",
" filename_input\n",
" ],\n",
" outputs=[output_status, output_preview]\n",
" )\n",
"\n",
"interface.launch(debug=True)\n",
"\n",
"del tokenizer, inputs, hf_model, outputs\n",
"gc.collect()\n",
"torch.cuda.empty_cache()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "w-ewbsjInopm"
},
"outputs": [],
"source": []
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,828 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4a6ab9a2-28a2-445d-8512-a0dc8d1b54e9",
"metadata": {},
"source": [
"# Python Code Documentation Assistant\n",
"\n",
"The requirement: use a Frontier model to add docstrings and comments to your Python code\n"
]
},
{
"cell_type": "markdown",
"id": "d4634170-c444-4326-9e68-5f87c63fa0e0",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f72dfaf-9f20-4d81-b082-018eda152c9f",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U -q \"google-genai\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e610bf56-a46e-4aff-8de1-ab49d62b1ad3",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import io\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"from google import genai\n",
"from google.genai import types\n",
"import anthropic\n",
"from IPython.display import Markdown, display, update_display\n",
"import gradio as gr\n",
"import subprocess"
]
},
{
"cell_type": "markdown",
"id": "f91e8b32-4c98-4210-a1e1-bfe0b1fddab7",
"metadata": {},
"source": [
"## Environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f672e1c-87e9-4865-b760-370fa605e614",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"\n",
"if openai_api_key:\n",
" print(f\"OpenAI API Key exists and begins with: {openai_api_key[:8]}\")\n",
"else:\n",
" print(\"OpenAI API Key not set\")\n",
" \n",
"if anthropic_api_key:\n",
" print(f\"Anthropic API Key exists and begins with: {anthropic_api_key[:7]}\")\n",
"else:\n",
" print(\"Anthropic API Key not set\")\n",
"\n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins with: {google_api_key[:4]}\")\n",
"else:\n",
" print(\"Google API Key not set\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8aa149ed-9298-4d69-8fe2-8f5de0f667da",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"claude = anthropic.Anthropic()\n",
"gemini = genai.Client()\n",
"\n",
"OPENAI_MODEL = \"o4-mini\"\n",
"CLAUDE_MODEL = \"claude-3-7-sonnet-latest\"\n",
"GEMINI_MODEL = \"gemini-2.5-flash\""
]
},
{
"cell_type": "markdown",
"id": "88a18c58-40d5-4592-8dd3-d7c7b0d951aa",
"metadata": {},
"source": [
"## Prompts"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6896636f-923e-4a2c-9d6c-fac07828a201",
"metadata": {},
"outputs": [],
"source": [
"system_message = \"\"\"\n",
"You are an assistant that documents Python code. \n",
"Your task: \n",
"- Add concise, clear, and informative docstrings to functions, classes, and modules. \n",
"- Add inline comments only where they improve readability or clarify intent. \n",
"- Do not modify the code logic or structure. \n",
"- Respond with Python code only. \n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e7b3546-57aa-4c29-bc5d-f211970d04eb",
"metadata": {},
"outputs": [],
"source": [
"def user_prompt_for(python):\n",
" user_prompt = \"Add docstrings and comments to the following Python code:\\n\"\n",
" user_prompt += python\n",
" return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6190659-f54c-4951-bef4-4960f8e51cc4",
"metadata": {},
"outputs": [],
"source": [
"def messages_for(python):\n",
" return [\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\", \"content\": user_prompt_for(python)}\n",
" ]"
]
},
{
"cell_type": "markdown",
"id": "624e5066-bcf6-490d-a790-608d2bb34184",
"metadata": {},
"source": [
"## Helper functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71e1ba8c-5b05-4726-a9f3-8d8c6257350b",
"metadata": {},
"outputs": [],
"source": [
"def write_output(python, filename_suffix):\n",
" filename = f\"annotated_{filename_suffix}.py\"\n",
" code = python.replace(\"```python\",\"\").replace(\"```\",\"\")\n",
" with open(filename, \"w\") as f:\n",
" f.write(code)\n",
" print(f\"\\nWritten code to {filename}\")\n",
" return filename"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7d2fea8-74c6-4421-8f1e-0e76d5b201b9",
"metadata": {},
"outputs": [],
"source": [
"def annotate_with_gpt(python, task_name): \n",
" stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages_for(python), stream=True)\n",
" reply = \"\"\n",
" for chunk in stream:\n",
" fragment = chunk.choices[0].delta.content or \"\"\n",
" reply += fragment\n",
" print(fragment, end='', flush=True)\n",
" return write_output(reply, f\"{task_name}_gpt\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7cd84ad8-d55c-4fe0-9eeb-1895c95c4a9d",
"metadata": {},
"outputs": [],
"source": [
"def annotate_with_claude(python, task_name):\n",
" result = claude.messages.stream(\n",
" model=CLAUDE_MODEL,\n",
" max_tokens=2000,\n",
" system=system_message,\n",
" messages=[{\"role\": \"user\", \"content\": user_prompt_for(python)}],\n",
" )\n",
" reply = \"\"\n",
" with result as stream:\n",
" for text in stream.text_stream:\n",
" reply += text\n",
" print(text, end=\"\", flush=True)\n",
" return write_output(reply, f\"{task_name}_claude\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8a35102-1c95-469b-8855-e85f4c9bdbdf",
"metadata": {},
"outputs": [],
"source": [
"def annotate_with_gemini(python, task_name):\n",
" reply = gemini.models.generate_content(\n",
" model=GEMINI_MODEL,\n",
" contents=user_prompt_for(python),\n",
" config=types.GenerateContentConfig(\n",
" system_instruction=system_message,\n",
" )\n",
" )\n",
"\n",
" print(reply.text)\n",
" return write_output(reply.text, f\"{task_name}_gemini\")"
]
},
{
"cell_type": "markdown",
"id": "028dcfdd-2d52-4e11-a79e-2214a97cb26d",
"metadata": {},
"source": [
"# Run the Annotator"
]
},
{
"cell_type": "markdown",
"id": "7462d9f9-6215-4fb0-9471-1d0141d33205",
"metadata": {},
"source": [
"## Pi example"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1cbb778-fa57-43de-b04b-ed523f396c38",
"metadata": {},
"outputs": [],
"source": [
"pi = \"\"\"\n",
"import time\n",
"\n",
"def calculate(iterations, param1, param2):\n",
" result = 1.0\n",
" for i in range(1, iterations+1):\n",
" j = i * param1 - param2\n",
" result -= (1/j)\n",
" j = i * param1 + param2\n",
" result += (1/j)\n",
" return result\n",
"\n",
"start_time = time.time()\n",
"result = calculate(100_000_000, 4, 1) * 4\n",
"end_time = time.time()\n",
"\n",
"print(f\"Result: {result:.12f}\")\n",
"print(f\"Execution Time: {(end_time - start_time):.6f} seconds\")\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "105db6f9-343c-491d-8e44-3a5328b81719",
"metadata": {},
"outputs": [],
"source": [
"gpt_pi = annotate_with_gpt(pi, \"pi))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "415819d0-fc95-4f78-a6ae-5c7d6781c6a7",
"metadata": {},
"outputs": [],
"source": [
"# check if the script works\n",
"\n",
"exec(open(gpt_pi).read())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "983a11fe-e24d-4c65-8269-9802c5ef3ae6",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"claude_pi = annotate_with_claude(pi, \"pi\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "52f5b710-0dea-4884-8ed7-a94059d88281",
"metadata": {},
"outputs": [],
"source": [
"exec(open(claude_pi).read())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01f331f2-caac-48f6-9a03-8a228ee521bc",
"metadata": {},
"outputs": [],
"source": [
"gemini_pi = annotate_with_gemini(pi, \"pi\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "23529942-53fa-46ad-a5db-1f3096dd6607",
"metadata": {},
"outputs": [],
"source": [
"exec(open(gemini_pi).read())"
]
},
{
"cell_type": "markdown",
"id": "7d1eaeca-61be-4d0a-a525-dd09f52aaa0f",
"metadata": {},
"source": [
"## Hard example"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3b497b3-f569-420e-b92e-fb0f49957ce0",
"metadata": {},
"outputs": [],
"source": [
"python_hard = \"\"\"# Be careful to support large number sizes\n",
"\n",
"def lcg(seed, a=1664525, c=1013904223, m=2**32):\n",
" value = seed\n",
" while True:\n",
" value = (a * value + c) % m\n",
" yield value\n",
" \n",
"def max_subarray_sum(n, seed, min_val, max_val):\n",
" lcg_gen = lcg(seed)\n",
" random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]\n",
" max_sum = float('-inf')\n",
" for i in range(n):\n",
" current_sum = 0\n",
" for j in range(i, n):\n",
" current_sum += random_numbers[j]\n",
" if current_sum > max_sum:\n",
" max_sum = current_sum\n",
" return max_sum\n",
"\n",
"def total_max_subarray_sum(n, initial_seed, min_val, max_val):\n",
" total_sum = 0\n",
" lcg_gen = lcg(initial_seed)\n",
" for _ in range(20):\n",
" seed = next(lcg_gen)\n",
" total_sum += max_subarray_sum(n, seed, min_val, max_val)\n",
" return total_sum\n",
"\n",
"# Parameters\n",
"n = 10000 # Number of random numbers\n",
"initial_seed = 42 # Initial seed for the LCG\n",
"min_val = -10 # Minimum value of random numbers\n",
"max_val = 10 # Maximum value of random numbers\n",
"\n",
"# Timing the function\n",
"import time\n",
"start_time = time.time()\n",
"result = total_max_subarray_sum(n, initial_seed, min_val, max_val)\n",
"end_time = time.time()\n",
"\n",
"print(\"Total Maximum Subarray Sum (20 runs):\", result)\n",
"print(\"Execution Time: {:.6f} seconds\".format(end_time - start_time))\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dab5e4bc-276c-4555-bd4c-12c699d5e899",
"metadata": {},
"outputs": [],
"source": [
"exec(python_hard)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8d24ed5-2c15-4f55-80e7-13a3952b3cb8",
"metadata": {},
"outputs": [],
"source": [
"gpt_hard = annotate_with_gpt(python_hard, \"hard\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80a15259-3d51-47b8-953c-6271fbd4b6fb",
"metadata": {},
"outputs": [],
"source": [
"exec(open(gpt_hard).read())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9305446-1d0c-4b51-866a-b8c1e299bf5c",
"metadata": {},
"outputs": [],
"source": [
"gemini_hard = annotate_with_gemini(python_hard, \"hard\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad6eecc8-0517-43d8-bd21-5bbdedae7a10",
"metadata": {},
"outputs": [],
"source": [
"exec(open(gemini_hard).read())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ee75e72-9ecb-4edd-a74a-4d3a83c1eb79",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"claude_hard = annotate_with_claude(python_hard, \"hard\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "47af1516-455f-4d1c-8a1c-2da5a38c0ba5",
"metadata": {},
"outputs": [],
"source": [
"exec(open(claude_hard).read())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f60d33c-f6b7-4fc5-bc2b-57957b076e34",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"This module implements a Linear Congruential Generator (LCG) and uses it\n",
"to generate random numbers for calculating the maximum subarray sum.\n",
"It includes functions for the LCG, finding the maximum subarray sum, and\n",
"aggregating results over multiple runs.\n",
"\"\"\"\n",
"\n",
"def lcg(seed, a=1664525, c=1013904223, m=2**32):\n",
" \"\"\"\n",
" Implements a Linear Congruential Generator (LCG) to produce a sequence of\n",
" pseudorandom numbers.\n",
"\n",
" The generator uses the formula: X_{n+1} = (a * X_n + c) % m.\n",
"\n",
" Args:\n",
" seed (int): The initial seed value for the generator (X_0).\n",
" a (int, optional): The multiplier. Defaults to 1664525 (common LCG parameter).\n",
" c (int, optional): The increment. Defaults to 1013904223 (common LCG parameter).\n",
" m (int, optional): The modulus. Defaults to 2**32, meaning numbers will be\n",
" between 0 and m-1.\n",
"\n",
" Yields:\n",
" int: The next pseudorandom number in the sequence.\n",
" \"\"\"\n",
" value = seed\n",
" while True:\n",
" # Calculate the next pseudorandom number using the LCG formula.\n",
" value = (a * value + c) % m\n",
" yield value\n",
"\n",
"def max_subarray_sum(n, seed, min_val, max_val):\n",
" \"\"\"\n",
" Calculates the maximum possible sum of a contiguous subarray within a list\n",
" of 'n' pseudorandom numbers.\n",
"\n",
" The random numbers are generated using an LCG based on the provided seed,\n",
" and then mapped to the range [min_val, max_val].\n",
" This implementation uses a brute-force approach with O(n^2) complexity.\n",
"\n",
" Args:\n",
" n (int): The number of random integers to generate for the array.\n",
" seed (int): The seed for the LCG to generate the random numbers.\n",
" min_val (int): The minimum possible value for the generated random numbers.\n",
" max_val (int): The maximum possible value for the generated random numbers.\n",
"\n",
" Returns:\n",
" int: The maximum sum found among all contiguous subarrays.\n",
" \"\"\"\n",
" lcg_gen = lcg(seed)\n",
" # Generate a list of 'n' random numbers within the specified range [min_val, max_val].\n",
" random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]\n",
"\n",
" max_sum = float('-inf') # Initialize max_sum to negative infinity to handle all negative numbers.\n",
"\n",
" # Iterate through all possible starting points of a subarray.\n",
" for i in range(n):\n",
" current_sum = 0\n",
" # Iterate through all possible ending points for the current starting point.\n",
" for j in range(i, n):\n",
" current_sum += random_numbers[j]\n",
" # Update max_sum if the current subarray sum is greater.\n",
" if current_sum > max_sum:\n",
" max_sum = current_sum\n",
" return max_sum\n",
"\n",
"def total_max_subarray_sum(n, initial_seed, min_val, max_val):\n",
" \"\"\"\n",
" Calculates the sum of maximum subarray sums over 20 separate runs.\n",
"\n",
" Each run generates a new set of 'n' random numbers for `max_subarray_sum`\n",
" using a new seed derived from the initial LCG sequence.\n",
"\n",
" Args:\n",
" n (int): The number of random integers for each subarray sum calculation.\n",
" initial_seed (int): The initial seed for the LCG that generates seeds\n",
" for individual `max_subarray_sum` runs.\n",
" min_val (int): The minimum possible value for random numbers in each run.\n",
" max_val (int): The maximum possible value for random numbers in each run.\n",
"\n",
" Returns:\n",
" int: The sum of the maximum subarray sums across all 20 runs.\n",
" \"\"\"\n",
" total_sum = 0\n",
" lcg_gen = lcg(initial_seed) # LCG to generate seeds for subsequent runs.\n",
" # Perform 20 independent runs.\n",
" for _ in range(20):\n",
" # Get a new seed for each run from the initial LCG generator.\n",
" seed = next(lcg_gen)\n",
" # Add the maximum subarray sum of the current run to the total sum.\n",
" total_sum += max_subarray_sum(n, seed, min_val, max_val)\n",
" return total_sum\n",
"\n",
"# Parameters for the simulation\n",
"n = 10000 # Number of random numbers to generate for each subarray\n",
"initial_seed = 42 # Initial seed for the LCG that generates seeds for runs\n",
"min_val = -10 # Minimum value for the random numbers\n",
"max_val = 10 # Maximum value for the random numbers\n",
"\n",
"# Import the time module to measure execution time.\n",
"import time\n",
"\n",
"# Record the start time before executing the main function.\n",
"start_time = time.time()\n",
"# Call the function to calculate the total maximum subarray sum over multiple runs.\n",
"result = total_max_subarray_sum(n, initial_seed, min_val, max_val)\n",
"# Record the end time after the function completes.\n",
"end_time = time.time()\n",
"\n",
"# Print the final aggregated result.\n",
"print(\"Total Maximum Subarray Sum (20 runs):\", result)\n",
"# Print the total execution time, formatted to 6 decimal places.\n",
"print(\"Execution Time: {:.6f} seconds\".format(end_time - start_time))"
]
},
{
"cell_type": "markdown",
"id": "ff02ce09-0544-49a5-944d-a57b25bf9b72",
"metadata": {},
"source": [
"# Streaming"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0be9f47d-5213-4700-b0e2-d444c7c738c0",
"metadata": {},
"outputs": [],
"source": [
"def stream_gpt(python): \n",
" stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages_for(python), stream=True)\n",
" reply = \"\"\n",
" for chunk in stream:\n",
" fragment = chunk.choices[0].delta.content or \"\"\n",
" reply += fragment\n",
" yield reply.replace('```python\\n','').replace('```','')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8669f56b-8314-4582-a167-78842caea131",
"metadata": {},
"outputs": [],
"source": [
"def stream_claude(python):\n",
" result = claude.messages.stream(\n",
" model=CLAUDE_MODEL,\n",
" max_tokens=2000,\n",
" system=system_message,\n",
" messages=[{\"role\": \"user\", \"content\": user_prompt_for(python)}],\n",
" )\n",
" reply = \"\"\n",
" with result as stream:\n",
" for text in stream.text_stream:\n",
" reply += text\n",
" yield reply.replace('```python\\n','').replace('```','')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d48d44df-c082-4ed1-b3ea-fc2a880591c2",
"metadata": {},
"outputs": [],
"source": [
"def stream_gemini(python):\n",
" stream = gemini.models.generate_content_stream(\n",
" model=GEMINI_MODEL,\n",
" contents=user_prompt_for(python),\n",
" config=types.GenerateContentConfig(\n",
" system_instruction=system_message,\n",
" ),\n",
" )\n",
" reply = \"\"\n",
" for chunk in stream:\n",
" reply += chunk.text\n",
" yield reply.replace('```python\\n','').replace('```','')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f1ae8f5-16c8-40a0-aa18-63b617df078d",
"metadata": {},
"outputs": [],
"source": [
"def annotate(python, model):\n",
" if model == \"GPT\":\n",
" result = stream_gpt(python)\n",
" elif model == \"Claude\":\n",
" result = stream_claude(python)\n",
" elif model == \"Gemini\":\n",
" result = stream_gemini(python)\n",
" else:\n",
" raise ValueError(\"Unknown model\")\n",
" for stream_so_far in result:\n",
" yield stream_so_far "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19bf2bff-a822-4009-a539-f003b1651383",
"metadata": {},
"outputs": [],
"source": [
"def execute_python(code):\n",
" try:\n",
" output = io.StringIO()\n",
" sys.stdout = output\n",
" exec(code)\n",
" finally:\n",
" sys.stdout = sys.__stdout__\n",
" return output.getvalue()"
]
},
{
"cell_type": "markdown",
"id": "8391444b-b938-4f92-982f-91439b38d901",
"metadata": {},
"source": [
"# Gradio App"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a2274f1-d03b-42c0-8dcc-4ce159b18442",
"metadata": {},
"outputs": [],
"source": [
"css = \"\"\"\n",
".python {background-color: #306998;}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "76167ea9-d0a1-4bc6-8d73-633d3b8c8df6",
"metadata": {},
"outputs": [],
"source": [
"import gradio as gr\n",
"\n",
"# Parameters\n",
"LINES = 25\n",
"LINE_HEIGHT = 20 # px, typical CodeMirror line height\n",
"PADDING = 10 # px, top + bottom padding\n",
"\n",
"CODE_HEIGHT = LINES * LINE_HEIGHT + PADDING\n",
"\n",
"\n",
"with gr.Blocks(\n",
" theme=gr.themes.Soft(),\n",
" css=f\"\"\"\n",
"#code_input .cm-editor, #annotated_code .cm-editor {{\n",
" height: {CODE_HEIGHT}px !important;\n",
" overflow-y: auto !important;\n",
"}}\n",
"\"\"\"\n",
") as demo_v2:\n",
" gr.Markdown(\"## 🐍 Annotate Python Code with Docstrings and Comments\")\n",
"\n",
" with gr.Row():\n",
" with gr.Column(scale=1):\n",
" gr.Markdown(\"### Python code:\")\n",
" code_input = gr.Code(\n",
" language=\"python\", \n",
" value=python_hard,\n",
" elem_id=\"code_input\"\n",
" )\n",
" \n",
" with gr.Column(scale=1):\n",
" gr.Markdown(\"### Annotated code:\")\n",
" annotated_output = gr.Code(\n",
" language=\"python\",\n",
" elem_id=\"annotated_code\",\n",
" interactive=False\n",
" )\n",
"\n",
" with gr.Row():\n",
" with gr.Column(scale=1):\n",
" model_dropdown = gr.Dropdown(\n",
" choices=[\"Gemini\", \"GPT-4\", \"Claude\"],\n",
" value=\"Gemini\",\n",
" label=\"Select model\"\n",
" )\n",
" with gr.Column(scale=1):\n",
" annotate_btn = gr.Button(\"✨ Annotate code\", variant=\"primary\")\n",
" run_btn = gr.Button(\"▶️ Run Python\", variant=\"secondary\")\n",
"\n",
" with gr.Row():\n",
" with gr.Column():\n",
" gr.Markdown(\"### Python result:\")\n",
" result_output = gr.Textbox(\n",
" lines=5, \n",
" label=\"Output\",\n",
" interactive=False\n",
" )\n",
" \n",
" annotate_btn.click(\n",
" annotate,\n",
" inputs=[code_input, model_dropdown],\n",
" outputs=[annotated_output]\n",
" )\n",
" run_btn.click(execute_python, inputs=[annotated_output], outputs=[result_output])\n",
"\n",
" \n",
"demo_v2.launch(inbrowser=True)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea42883b-fdba-46ed-97be-f42e3cb41f11",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,113 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "1c8f17b7-dc42-408f-9b21-cdcfd7dbfb78",
"metadata": {},
"source": [
"# AutoTrader Code Generator\n",
"\n",
"Gemini-driven autonomous equities trading bot code generator for simulated market APIs"
]
},
{
"cell_type": "markdown",
"id": "fcffcfd7-000f-4995-82ae-94232fef1654",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "43d8c659-08d4-4a65-8c39-e42f6c458ba8",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U -q \"google-genai\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "de542411-6b4d-47cd-bf84-d80562b333a5",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import io\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from google import genai\n",
"from google.genai import types\n",
"from IPython.display import Markdown, display, update_display\n",
"import gradio as gr\n",
"import subprocess"
]
},
{
"cell_type": "markdown",
"id": "e9b78b19-2d47-4973-adbc-d281d8ac8224",
"metadata": {},
"source": [
"## Google API Key Setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9a2e07f-9b07-4afe-8938-ba40c41701ff",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"\n",
"if google_api_key:\n",
" print(f\"Google API Key exists and begins with: {google_api_key[:4]}\")\n",
"else:\n",
" print(\"Google API Key not set\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4da3664d-9d73-41a2-8a71-fe9468f3955f",
"metadata": {},
"outputs": [],
"source": [
"!python ./gemini_trading_code_generator.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f7b7c74-3d77-49d3-ac9f-f3a04743946c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,34 @@
{
"name": "ExampleEquitySim",
"base_url": "https://sim.example.com/api",
"endpoints": {
"get_price": {
"path": "/market/price",
"method": "GET",
"params": ["symbol"]
},
"place_order": {
"path": "/orders",
"method": "POST",
"body": ["symbol", "side", "quantity", "order_type", "price_optional"]
},
"cancel_order": {
"path": "/orders/{order_id}/cancel",
"method": "POST"
},
"get_balance": {
"path": "/account/balance",
"method": "GET"
},
"get_positions": {
"path": "/account/positions",
"method": "GET"
}
},
"auth": {
"type": "api_key_header",
"header_name": "X-API-KEY",
"api_key_placeholder": "<SIM_API_KEY>"
},
"notes": "This simulated API uses JSON and returns ISO timestamps in UTC."
}


@@ -0,0 +1,177 @@
"""
gemini_trading_code_generator.py
Usage:
- Prepare your API specification JSON file with your simulated API details.
- Run: pip install google-genai.
- Set GOOGLE_API_KEY env var before running.
- Run: python gemini_trading_code_generator.py
- The generated bot will be saved as `generated_trading_bot.py`.
Notes:
- THIS GENERATES CODE FOR A SIMULATED ENVIRONMENT. Read and review generated code before running.
- Keep your API keys safe.
"""
import os
import json
from typing import Dict, Any
from datetime import datetime
# Gemini client import (Google GenAI SDK)
try:
    from google import genai
    from google.genai import types
except Exception as e:
    raise RuntimeError("google-genai not installed. Run: pip install google-genai") from e
# ------------ Gemini / Prompting helpers -------------
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    # We won't fail here; the generator will raise when trying to call the client.
    pass
GEMINI_MODEL = "gemini-2.5-flash"
MAX_TOKEN_COUNT = 12000
def build_prompt(api_spec: Dict[str, Any], strategy: str = "sma_crossover") -> str:
"""
Create a clear instruction for Gemini to generate the trading bot code.
Strategy choices:
- sma_crossover (default): simple moving-average crossover strategy
- random: random buy/sell (for testing)
- placeholder for the user to request others
"""
now = datetime.utcnow().isoformat() + "Z"
prompt = f"""
You are a code-writing assistant. Produce a single, self-contained Python script named `generated_trading_bot.py`
that implements a trading bot for a *simulated equities API*. The simulator has the following specification (JSON):
{json.dumps(api_spec, indent=2)}
Requirements for the generated script:
1. The script must be runnable as-is (except for inserting API keys/config). Use only stdlib + `requests` (no other external deps).
2. Implement a simple trading strategy: {strategy}. For `sma_crossover`, implement:
- Fetch historical or recent prices (you may simulate historical by sampling current price in a loop if the API doesn't return history).
- Compute short and long simple moving averages (e.g., 5-period and 20-period).
- When short SMA crosses above long SMA: submit a MARKET or SIMULATED BUY order sized to use a configurable fraction of available cash.
- When short SMA crosses below long SMA: submit a MARKET or SIMULATED SELL order to close position.
3. Use the API endpoints from the spec exactly (build URLs using base_url + path). Respect auth header scheme from spec.
4. Include robust error handling and logging (print statements acceptable).
5. Include a `--dry-run` flag that prints actions instead of placing orders.
6. Include a safe simulation mode: throttle requests, avoid rapid-fire orders, and include a configurable `min_time_between_trades_seconds`.
7. Add inline comments explaining important functions and a short README-like docstring at the top of the generated file describing how to configure and run it.
8. At the end of the generated file, add a __main__ section that demonstrates a short run (e.g., a 60-second loop) in dry-run mode.
9. Do NOT assume any third-party libraries beyond `requests`. Use dataclasses where helpful. Use typing annotations.
10. Always document any assumptions you make in a top-level comment block.
11. Keep the entire output as valid Python code only (no additional text around it).
Generate code now.
Timestamp for reproducibility: {now}
"""
    return prompt.strip()

# ------------ Gemini call -------------
def generate_code_with_gemini(prompt: str, model: str = GEMINI_MODEL, max_tokens: int = MAX_TOKEN_COUNT) -> str:
    """
    Call the Gemini model to generate the code.
    Uses the google-genai SDK. Make sure the GOOGLE_API_KEY environment variable is set.
    """
    if not GOOGLE_API_KEY:
        raise RuntimeError("No Google API key found. Set GOOGLE_API_KEY environment variable.")
    # Create client (per Google Gen AI quickstart)
    client = genai.Client(api_key=GOOGLE_API_KEY)
    # The SDK surface has varied; using the documented 'models.generate_content' style.
    # If your SDK differs, adapt accordingly.
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(
            max_output_tokens=max_tokens,
        )
    )
    text = None
    if hasattr(response, "text") and response.text:
        text = response.text
    else:
        # Fall back to digging into other response shapes
        try:
            # Some SDKs return a dict-like object
            if isinstance(response, dict):
                # Try common keys
                for k in ("text", "content", "output", "candidates"):
                    if k in response and response[k]:
                        text = json.dumps(response[k]) if not isinstance(response[k], str) else response[k]
                        break
            else:
                # Object with attributes
                if hasattr(response, "output") and response.output:
                    # Navigate first candidate -> text
                    out = response.output
                    if isinstance(out, (list, tuple)) and len(out) > 0:
                        first = out[0]
                        if isinstance(first, dict) and "content" in first:
                            text = first["content"][0].get("text")
                        elif hasattr(first, "content"):
                            text = first.content[0].text
        except Exception:
            pass
    if not text:
        raise RuntimeError("Could not extract generated text from Gemini response. Inspect `response` object: " + repr(response))
    return text

# ------------ Save & basic verification -------------
def basic_sanity_check(code_text: str) -> bool:
    """Do a quick check that the output looks like a Python file and contains the required sections."""
    checks = [
        "import requests" in code_text or "import urllib" in code_text,
        "def " in code_text,
        "if __name__" in code_text,
        "place_order" in code_text or "order" in code_text
    ]
    return all(checks)
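
# A stricter check, sketched under assumptions: the substring tests in
# basic_sanity_check cannot tell whether the generated text even parses.
# `is_valid_python` is a hypothetical helper (not part of the original script)
# that uses the standard-library `ast` module to confirm the text is
# syntactically valid Python. It only parses the text; it never executes it.
import ast


def is_valid_python(code_text: str) -> bool:
    """Return True if code_text parses as Python source, False otherwise."""
    try:
        ast.parse(code_text)
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers inputs such as
        # null bytes on some Python versions.
        return False
    return True
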

def save_generated_file(code_text: str, filename: str = "generated_trading_bot.py") -> str:
    # Strip Markdown code fences the model may have wrapped around the code
    code_text = code_text.replace("```python","").replace("```","")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(code_text)
    return os.path.abspath(filename)
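
# Sketch of a more targeted alternative to the blanket fence removal in
# save_generated_file, which also deletes any triple backticks that appear
# inside the generated code and keeps any prose the model adds around the
# fence. `extract_code` and `_FENCE_RE` are hypothetical names, not part of
# the original script. Known limitation: if the code itself contains a
# triple-backtick sequence, the match ends early.
import re

_FENCE_RE = re.compile(r"```[a-zA-Z0-9_+-]*[ \t]*\n(.*?)```", re.DOTALL)


def extract_code(text: str) -> str:
    """Return the body of the first fenced block, or the stripped text if there is no fence."""
    match = _FENCE_RE.search(text)
    if match:
        return match.group(1).strip() + "\n"
    return text.strip() + "\n"
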
# ------------ Main CLI -------------
def main():
    import argparse
    parser = argparse.ArgumentParser(description="Generate trading bot code using Gemini (Google GenAI).")
    parser.add_argument("--api-spec", type=str, default="api_spec.json", help="Path to JSON file with API spec.")
    parser.add_argument("--out", type=str, default="generated_trading_bot.py", help="Output filename.")
    parser.add_argument("--model", type=str, default=GEMINI_MODEL, help="Gemini model to use.")
    parser.add_argument("--max-tokens", type=int, default=MAX_TOKEN_COUNT, help="Max tokens for generation.")
    parser.add_argument("--strategy", type=str, default="sma_crossover", help="Trading strategy to request.")
    args = parser.parse_args()
    with open(args.api_spec, "r", encoding="utf-8") as f:
        api_spec = json.load(f)
    prompt = build_prompt(api_spec, strategy=args.strategy)
    print("Calling Gemini to generate code... (this will use your GOOGLE_API_KEY)")
    generated = generate_code_with_gemini(prompt, model=args.model, max_tokens=args.max_tokens)
    print("Performing sanity checks on the generated code...")
    if not basic_sanity_check(generated):
        print("Warning: basic sanity checks failed. Still saving the file for inspection.")
    path = save_generated_file(generated, filename=args.out)
    print(f"Generated code saved to: {path}")
    print("Important: Review the generated code carefully before running against any system (even a simulator).")


if __name__ == "__main__":
    main()

View File

@@ -0,0 +1,532 @@
import requests
import json
import time
import os
import collections
import argparse
import math
from dataclasses import dataclass
from typing import Dict, Any, Deque, Optional, List
# --- Assumptions ---
# 1. Historical Price Data Simulation: The simulated API's `/market/price` endpoint
#    only provides the current price. To implement SMA crossover, which requires
#    historical data, this bot simulates history by repeatedly calling `get_price`
#    over time and storing the results. It assumes that calling `get_price` at regular
#    intervals (e.g., every 5 seconds) effectively provides a time-series of prices.
# 2. API Response Formats:
#    - `get_price`: Assumed to return `{"symbol": "SYM", "price": 123.45}`.
#    - `get_balance`: Assumed to return `{"cash_balance": 10000.00, ...}`.
#    - `get_positions`: Assumed to return a list of dictionaries, e.g.,
#      `[{"symbol": "SYM", "quantity": 10}, ...]`. If there is no position, an empty
#      list or a list without the symbol.
#    - `place_order`: Assumed to return `{"order_id": "...", "status": "accepted"}`.
# 3. Order Type: For `place_order`, `order_type` is assumed to be "MARKET" for simplicity,
#    as no other types are specified and "price_optional" implies it's for limit orders.
#    For MARKET orders, `price_optional` will not be sent.
# 4. Error Handling: Basic network and API-level error checking is implemented.
#    More complex retry logic or backoff strategies are not included to keep the example concise.
# 5. Time Zones: The API notes specify ISO timestamps in UTC. For internal logic,
#    `time.time()` (seconds since the epoch) is used for time comparisons, which is
#    sufficient for throttling and trade timing.

# --- Configuration ---
# You can override these defaults using command-line arguments.
DEFAULT_API_KEY = os.environ.get("SIM_API_KEY", "<YOUR_SIM_API_KEY>") # Set SIM_API_KEY env var or replace
DEFAULT_BASE_URL = "https://sim.example.com/api"
DEFAULT_SYMBOL = "AAPL" # Example stock symbol
# Trading Strategy Parameters
DEFAULT_SHORT_SMA_PERIOD = 5 # Number of price points for short SMA
DEFAULT_LONG_SMA_PERIOD = 20 # Number of price points for long SMA
DEFAULT_BUY_CASH_FRACTION = 0.95 # Fraction of available cash to use for a BUY order
# Bot Operation Parameters
DEFAULT_PRICE_FETCH_INTERVAL_SECONDS = 5 # How often to fetch a new price point for SMA calculation
DEFAULT_MAIN_LOOP_INTERVAL_SECONDS = 10 # How often the bot evaluates the strategy
DEFAULT_MIN_TIME_BETWEEN_TRADES_SECONDS = 60 # Minimum time (seconds) between placing orders
DEFAULT_INITIAL_HISTORY_COLLECTION_COUNT = DEFAULT_LONG_SMA_PERIOD + 5 # Ensure enough data for long SMA

@dataclass
class TradingBotConfig:
    api_key: str
    base_url: str
    symbol: str
    short_sma_period: int
    long_sma_period: int
    buy_cash_fraction: float
    price_fetch_interval_seconds: int
    main_loop_interval_seconds: int
    min_time_between_trades_seconds: int
    initial_history_collection_count: int
    dry_run: bool


class SimulatedAPIClient:
    """
    Client for interacting with the ExampleEquitySim API.
    Handles request building, authentication, and basic error parsing.
    """

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
        self.session = requests.Session()  # Use a session for connection pooling

    def _log(self, message: str) -> None:
        """Simple logging utility."""
        print(f"[API Client] {message}")

    def _make_request(
        self,
        method: str,
        path: str,
        params: Optional[Dict[str, Any]] = None,
        json_data: Optional[Dict[str, Any]] = None,
    ) -> Optional[Any]:
        """
        Generic helper to make API requests.
        Returns the decoded JSON body (a dict or a list), or None on error.
        """
        url = f"{self.base_url}{path}"
        try:
            response = self.session.request(
                method, url, headers=self.headers, params=params, json=json_data, timeout=10
            )
            response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
            return response.json()
        except requests.exceptions.HTTPError as e:
            self._log(f"HTTP error for {method} {url}: {e.response.status_code} - {e.response.text}")
        except requests.exceptions.ConnectionError as e:
            self._log(f"Connection error for {method} {url}: {e}")
        except requests.exceptions.Timeout as e:
            self._log(f"Timeout error for {method} {url}: {e}")
        except requests.exceptions.RequestException as e:
            self._log(f"An unexpected request error occurred for {method} {url}: {e}")
        except json.JSONDecodeError:
            self._log(f"Failed to decode JSON from response for {method} {url}: {response.text}")
        return None

    def get_price(self, symbol: str) -> Optional[float]:
        """
        Fetches the current market price for a given symbol.
        Returns the price as a float, or None on error.
        """
        path = "/market/price"
        params = {"symbol": symbol}
        response = self._make_request("GET", path, params=params)
        if response and "price" in response:
            return float(response["price"])
        self._log(f"Could not get price for {symbol}.")
        return None

    def place_order(
        self,
        symbol: str,
        side: str,  # "BUY" or "SELL"
        quantity: float,
        order_type: str = "MARKET",
        price_optional: Optional[float] = None  # For LIMIT orders, not used for MARKET
    ) -> Optional[Dict[str, Any]]:
        """
        Places a trading order.
        """
        path = "/orders"
        payload = {
            "symbol": symbol,
            "side": side,
            "quantity": quantity,
            "order_type": order_type,
        }
        if order_type != "MARKET" and price_optional is not None:
            payload["price_optional"] = price_optional
        self._log(f"Placing {side} order: {quantity} {symbol} ({order_type})...")
        response = self._make_request("POST", path, json_data=payload)
        if response and response.get("status") == "accepted":
            self._log(f"Order placed successfully: {response.get('order_id')}")
            return response
        self._log(f"Failed to place {side} order for {quantity} {symbol}. Response: {response}")
        return None

    def cancel_order(self, order_id: str) -> Optional[Dict[str, Any]]:
        """
        Cancels an existing order.
        """
        path = f"/orders/{order_id}/cancel"
        self._log(f"Cancelling order {order_id}...")
        response = self._make_request("POST", path)
        if response and response.get("status") == "cancelled":
            self._log(f"Order {order_id} cancelled.")
            return response
        self._log(f"Failed to cancel order {order_id}. Response: {response}")
        return None

    def get_balance(self) -> Optional[float]:
        """
        Fetches the current cash balance.
        Returns cash balance as float, or None on error.
        """
        path = "/account/balance"
        response = self._make_request("GET", path)
        if response and "cash_balance" in response:
            return float(response["cash_balance"])
        self._log("Could not get account balance.")
        return None

    def get_positions(self) -> Optional[List[Dict[str, Any]]]:
        """
        Fetches all current open positions.
        Returns a list of position dictionaries, or None on error.
        """
        path = "/account/positions"
        response = self._make_request("GET", path)
        if response is not None:
            # Assuming the API returns a list, even if empty
            if isinstance(response, list):
                return response
            else:
                self._log(f"Unexpected response format for get_positions: {response}")
                return []
        self._log("Could not get account positions.")
        return None


class TradingBot:
    """
    Implements the SMA Crossover trading strategy for a simulated equities API.
    """

    def __init__(self, config: TradingBotConfig, api_client: SimulatedAPIClient):
        self.config = config
        self.api_client = api_client
        # Deque for efficient rolling window of prices
        self.price_history: Deque[float] = collections.deque(
            maxlen=self.config.long_sma_period
        )
        self.last_trade_timestamp: float = 0.0
        self.current_position_quantity: float = 0.0
        self.previous_short_sma: Optional[float] = None
        self.previous_long_sma: Optional[float] = None
        self._log(f"Trading bot initialized for symbol: {self.config.symbol}")
        self._log(f"Short SMA: {self.config.short_sma_period} periods, Long SMA: {self.config.long_sma_period} periods")
        if self.config.dry_run:
            self._log("!!! DRY RUN MODE ACTIVE - NO REAL ORDERS WILL BE PLACED !!!")

    def _log(self, message: str) -> None:
        """Simple logging utility for the bot."""
        print(f"[Bot] {message}")

    def _fetch_and_store_price(self) -> Optional[float]:
        """
        Fetches the current price from the API and adds it to the price history.
        Returns the fetched price or None if failed.
        """
        price = self.api_client.get_price(self.config.symbol)
        if price is not None:
            self.price_history.append(price)
            self._log(f"Fetched price for {self.config.symbol}: {price}. History size: {len(self.price_history)}")
            return price
        self._log(f"Failed to fetch current price for {self.config.symbol}.")
        return None

    def _calculate_sma(self, period: int) -> Optional[float]:
        """
        Calculates the Simple Moving Average (SMA) for a given period
        using the stored price history.
        """
        if len(self.price_history) < period:
            return None
        # A deque does not support slicing, so convert to a list
        # before taking the last `period` prices.
        recent_prices = list(self.price_history)[-period:]
        return sum(recent_prices) / period

    def _update_current_position(self) -> None:
        """
        Fetches the current position for the trading symbol from the API
        and updates the bot's internal state.
        """
        positions = self.api_client.get_positions()
        self.current_position_quantity = 0.0
        if positions:
            for pos in positions:
                if pos.get("symbol") == self.config.symbol:
                    self.current_position_quantity = float(pos.get("quantity", 0))
                    break
        self._log(f"Current position in {self.config.symbol}: {self.current_position_quantity}")

    def _can_trade(self) -> bool:
        """
        Checks if enough time has passed since the last trade to place a new one.
        """
        time_since_last_trade = time.time() - self.last_trade_timestamp
        if time_since_last_trade < self.config.min_time_between_trades_seconds:
            self._log(f"Throttling: Waiting {math.ceil(self.config.min_time_between_trades_seconds - time_since_last_trade)}s before next trade.")
            return False
        return True

    def collect_initial_history(self) -> None:
        """
        Collects an initial set of price data before starting the trading strategy.
        This is crucial for calculating SMAs from the start.
        """
        total = self.config.initial_history_collection_count
        self._log(f"Collecting initial price history ({total} points required)...")
        for i in range(total):
            if self._fetch_and_store_price() is None:
                # A failed fetch is not retried; the sample is simply skipped.
                self._log(f"Failed to fetch a price on attempt {i+1}/{total}. Skipping this sample.")
            else:
                self._log(f"Collected a price on attempt {i+1}/{total}.")
            # Wait before fetching next price to simulate time passing
            time.sleep(self.config.price_fetch_interval_seconds)
        self._log("Initial price history collection complete.")

    def run_strategy_iteration(self) -> None:
        """
        Executes one iteration of the SMA crossover strategy.
        """
        self._log("--- Running strategy iteration ---")
        # 1. Fetch current position and balance
        self._update_current_position()
        cash_balance = self.api_client.get_balance()
        if cash_balance is None:
            self._log("Could not get cash balance. Skipping iteration.")
            return
        # 2. Fetch new price and update history
        if self._fetch_and_store_price() is None:
            return  # Skip iteration if price fetch fails
        # 3. Ensure enough data for SMAs
        if len(self.price_history) < self.config.long_sma_period:
            self._log(f"Not enough price history for SMAs (need {self.config.long_sma_period}, have {len(self.price_history)}). Waiting for more data.")
            return
        # 4. Calculate SMAs
        short_sma = self._calculate_sma(self.config.short_sma_period)
        long_sma = self._calculate_sma(self.config.long_sma_period)
        if short_sma is None or long_sma is None:
            self._log("Could not calculate SMAs. Skipping iteration.")
            return
        self._log(f"Current SMAs: Short={short_sma:.2f}, Long={long_sma:.2f}")
        # If this is the first time we calculated SMAs, just store them and exit
        if self.previous_short_sma is None or self.previous_long_sma is None:
            self._log("First SMA calculation. Storing values for next iteration comparison.")
            self.previous_short_sma = short_sma
            self.previous_long_sma = long_sma
            return
        # 5. Check for crossover signals
        # Buy Signal: Short SMA crosses above Long SMA
        if (self.previous_short_sma < self.previous_long_sma) and (short_sma >= long_sma):
            self._log("!!! BUY SIGNAL DETECTED: Short SMA crossed above Long SMA !!!")
            if self.current_position_quantity > 0:
                self._log(f"Already hold a position of {self.current_position_quantity} {self.config.symbol}. No new buy order.")
            elif not self._can_trade():
                pass  # Message already logged by _can_trade()
            else:
                buy_amount_dollars = cash_balance * self.config.buy_cash_fraction
                # Use the most recent price for calculating quantity
                current_price = self.price_history[-1]
                if current_price > 0:
                    quantity_to_buy = math.floor(buy_amount_dollars / current_price)
                    if quantity_to_buy > 0:
                        self._log(f"Attempting to BUY {quantity_to_buy} shares of {self.config.symbol} at approx ${current_price:.2f} using ${buy_amount_dollars:.2f} of cash.")
                        if not self.config.dry_run:
                            order_response = self.api_client.place_order(self.config.symbol, "BUY", quantity_to_buy)
                            if order_response:
                                self.last_trade_timestamp = time.time()
                                self._update_current_position()  # Refresh position after order
                        else:
                            self._log(f"DRY RUN: Would have placed BUY order for {quantity_to_buy} {self.config.symbol}.")
                            self.last_trade_timestamp = time.time()  # Still simulate trade delay
                    else:
                        self._log("Calculated quantity to buy is zero.")
                else:
                    self._log("Current price is zero, cannot calculate buy quantity.")
        # Sell Signal: Short SMA crosses below Long SMA
        elif (self.previous_short_sma > self.previous_long_sma) and (short_sma <= long_sma):
            self._log("!!! SELL SIGNAL DETECTED: Short SMA crossed below Long SMA !!!")
            if self.current_position_quantity == 0:
                self._log("No open position to sell. No new sell order.")
            elif not self._can_trade():
                pass  # Message already logged by _can_trade()
            else:
                quantity_to_sell = self.current_position_quantity
                self._log(f"Attempting to SELL {quantity_to_sell} shares of {self.config.symbol}.")
                if not self.config.dry_run:
                    order_response = self.api_client.place_order(self.config.symbol, "SELL", quantity_to_sell)
                    if order_response:
                        self.last_trade_timestamp = time.time()
                        self._update_current_position()  # Refresh position after order
                else:
                    self._log(f"DRY RUN: Would have placed SELL order for {quantity_to_sell} {self.config.symbol}.")
                    self.last_trade_timestamp = time.time()  # Still simulate trade delay
        else:
            self._log("No crossover signal detected.")
        # 6. Update previous SMA values for the next iteration
        self.previous_short_sma = short_sma
        self.previous_long_sma = long_sma
        self._log("--- Iteration complete ---")
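

# Illustrative sketch, not used by the bot above: the SMA and crossover rules
# from run_strategy_iteration pulled out as pure functions so they can be
# checked without any API calls. `simple_moving_average` and `crossover_signal`
# are hypothetical names introduced here; `prices` is assumed to be a plain
# list (a deque would need converting first, as _calculate_sma does).
from typing import List, Optional


def simple_moving_average(prices: List[float], period: int) -> Optional[float]:
    """Average of the last `period` prices, or None if there are too few."""
    if period <= 0 or len(prices) < period:
        return None
    return sum(prices[-period:]) / period


def crossover_signal(prev_short: float, prev_long: float,
                     short: float, long_: float) -> Optional[str]:
    """Return "BUY", "SELL", or None using the same comparisons as the bot."""
    if prev_short < prev_long and short >= long_:
        return "BUY"   # short SMA moved from below to at-or-above the long SMA
    if prev_short > prev_long and short <= long_:
        return "SELL"  # short SMA moved from above to at-or-below the long SMA
    return None
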


def main():
    """
    Main function to parse arguments, configure the bot, and run the trading loop.
    """
    parser = argparse.ArgumentParser(
        description="SMA Crossover Trading Bot for Simulated Equities API."
    )
    parser.add_argument(
        "--api-key",
        type=str,
        default=DEFAULT_API_KEY,
        # The default is not echoed in the help text so a real key is never printed.
        help="Your API key for the simulator. Defaults to the SIM_API_KEY environment variable if set."
    )
    parser.add_argument(
        "--base-url",
        type=str,
        default=DEFAULT_BASE_URL,
        help=f"Base URL of the simulated API. Default: {DEFAULT_BASE_URL}"
    )
    parser.add_argument(
        "--symbol",
        type=str,
        default=DEFAULT_SYMBOL,
        help=f"Trading symbol (e.g., AAPL). Default: {DEFAULT_SYMBOL}"
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="If set, the bot will log trade actions instead of placing real orders."
    )
    parser.add_argument(
        "--short-sma-period",
        type=int,
        default=DEFAULT_SHORT_SMA_PERIOD,
        help=f"Number of periods for the short SMA. Default: {DEFAULT_SHORT_SMA_PERIOD}"
    )
    parser.add_argument(
        "--long-sma-period",
        type=int,
        default=DEFAULT_LONG_SMA_PERIOD,
        help=f"Number of periods for the long SMA. Default: {DEFAULT_LONG_SMA_PERIOD}"
    )
    parser.add_argument(
        "--buy-cash-fraction",
        type=float,
        default=DEFAULT_BUY_CASH_FRACTION,
        help=f"Fraction of available cash to use for a BUY order (e.g., 0.95). Default: {DEFAULT_BUY_CASH_FRACTION}"
    )
    parser.add_argument(
        "--price-fetch-interval",
        type=int,
        default=DEFAULT_PRICE_FETCH_INTERVAL_SECONDS,
        help=f"Interval in seconds to fetch new price data for SMA calculation. Default: {DEFAULT_PRICE_FETCH_INTERVAL_SECONDS}"
    )
    parser.add_argument(
        "--main-loop-interval",
        type=int,
        default=DEFAULT_MAIN_LOOP_INTERVAL_SECONDS,
        help=f"Interval in seconds between strategy evaluations. Default: {DEFAULT_MAIN_LOOP_INTERVAL_SECONDS}"
    )
    parser.add_argument(
        "--min-trade-interval",
        type=int,
        default=DEFAULT_MIN_TIME_BETWEEN_TRADES_SECONDS,
        help=f"Minimum time in seconds between placing actual orders. Default: {DEFAULT_MIN_TIME_BETWEEN_TRADES_SECONDS}"
    )
    parser.add_argument(
        "--initial-history-count",
        type=int,
        default=DEFAULT_INITIAL_HISTORY_COLLECTION_COUNT,
        help=f"Number of initial price points to collect before starting strategy. Default: {DEFAULT_INITIAL_HISTORY_COLLECTION_COUNT}"
    )
    parser.add_argument(
        "--run-duration",
        type=int,
        default=300,  # Default to 5 minutes for demonstration
        help="Total duration in seconds to run the bot loop. (0 for indefinite run)."
    )
    args = parser.parse_args()
    if args.api_key == "<YOUR_SIM_API_KEY>":
        print("WARNING: API Key is not set. Please replace <YOUR_SIM_API_KEY> in the script or set SIM_API_KEY environment variable, or pass with --api-key.")
        print("Exiting...")
        return
    config = TradingBotConfig(
        api_key=args.api_key,
        base_url=args.base_url,
        symbol=args.symbol,
        short_sma_period=args.short_sma_period,
        long_sma_period=args.long_sma_period,
        buy_cash_fraction=args.buy_cash_fraction,
        price_fetch_interval_seconds=args.price_fetch_interval,
        main_loop_interval_seconds=args.main_loop_interval,
        min_time_between_trades_seconds=args.min_trade_interval,
        initial_history_collection_count=args.initial_history_count,
        dry_run=args.dry_run,
    )
    api_client = SimulatedAPIClient(config.base_url, config.api_key)
    trading_bot = TradingBot(config, api_client)
    # Ensure enough history for SMA calculations
    if config.initial_history_collection_count < config.long_sma_period:
        trading_bot._log(f"WARNING: Initial history collection count ({config.initial_history_collection_count}) is less than long SMA period ({config.long_sma_period}). Adjusting to {config.long_sma_period + 5}.")
        config.initial_history_collection_count = config.long_sma_period + 5
    # Collect initial price data
    trading_bot.collect_initial_history()
    # Main trading loop
    start_time = time.time()
    iteration = 0
    trading_bot._log(f"Starting main trading loop for {args.run_duration} seconds (0 for indefinite)...")
    try:
        while True:
            iteration += 1
            trading_bot._log(f"\n--- Main Loop Iteration {iteration} ---")
            trading_bot.run_strategy_iteration()
            if args.run_duration > 0 and (time.time() - start_time) >= args.run_duration:
                trading_bot._log(f"Run duration of {args.run_duration} seconds completed. Exiting.")
                break
            trading_bot._log(f"Sleeping for {config.main_loop_interval_seconds} seconds...")
            time.sleep(config.main_loop_interval_seconds)
    except KeyboardInterrupt:
        trading_bot._log("Bot stopped manually by user (KeyboardInterrupt).")
    except Exception as e:
        trading_bot._log(f"An unexpected error occurred in the main loop: {e}")
    finally:
        trading_bot._log("Trading bot shutting down.")


if __name__ == "__main__":
    # --- Demonstration Run ---
    # To run this example:
    # 1. Save this script as `generated_trading_bot.py`.
    # 2. Install requests: `pip install requests`
    # 3. Replace `<YOUR_SIM_API_KEY>` with an actual API key or set the SIM_API_KEY environment variable.
    # 4. Run from your terminal:
    #    `python generated_trading_bot.py --dry-run --run-duration 60 --symbol MSFT`
    #    This will simulate a 60-second run for MSFT in dry-run mode,
    #    printing potential trades without actually executing them.
    # For a longer run, change --run-duration (e.g., 3600 for 1 hour).
    # Remove --dry-run to enable live trading (use with caution!).
    main()

View File

@@ -0,0 +1,690 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "4a6ab9a2-28a2-445d-8512-a0dc8d1b54e9",
"metadata": {},
"source": [
"# Code Generator\n",
"\n",
"The requirement: use a Frontier model to generate high performance C++ code from Python code\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f72dfaf-9f20-4d81-b082-018eda152c9f",
"metadata": {},
"outputs": [],
"source": [
"!pip install -U -q \"google-genai\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e610bf56-a46e-4aff-8de1-ab49d62b1ad3",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import io\n",
"import sys\n",
"from dotenv import load_dotenv\n",
"from openai import OpenAI\n",
"from google import genai\n",
"from google.genai import types\n",
"import anthropic\n",
"from IPython.display import Markdown, display, update_display\n",
"import gradio as gr\n",
"import subprocess"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f672e1c-87e9-4865-b760-370fa605e614",
"metadata": {},
"outputs": [],
"source": [
"# environment\n",
"\n",
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')\n",
"google_api_key = os.getenv('GOOGLE_API_KEY')\n",
"\n",
"if openai_api_key:\n",
"    print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}\")\n",
"else:\n",
"    print(\"OpenAI API Key not set\")\n",
"\n",
"if anthropic_api_key:\n",
"    print(f\"Anthropic API Key exists and begins {anthropic_api_key[:7]}\")\n",
"else:\n",
"    print(\"Anthropic API Key not set\")\n",
"\n",
"if google_api_key:\n",
"    print(f\"Google API Key exists and begins {google_api_key[:8]}\")\n",
"else:\n",
"    print(\"Google API Key not set\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8aa149ed-9298-4d69-8fe2-8f5de0f667da",
"metadata": {},
"outputs": [],
"source": [
"# initialize\n",
"\n",
"openai = OpenAI()\n",
"claude = anthropic.Anthropic()\n",
"gemini = genai.Client()\n",
"\n",
"OPENAI_MODEL = \"o4-mini\"\n",
"CLAUDE_MODEL = \"claude-3-7-sonnet-latest\"\n",
"GEMINI_MODEL = \"gemini-2.5-flash\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6896636f-923e-4a2c-9d6c-fac07828a201",
"metadata": {},
"outputs": [],
"source": [
"system_message = \"You are an assistant that reimplements Python code in high performance C++ for an M1 Mac. \"\n",
"system_message += \"Respond only with C++ code; use comments sparingly and do not provide any explanation other than occasional comments. \"\n",
"system_message += \"The C++ response needs to produce an identical output in the fastest possible time.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e7b3546-57aa-4c29-bc5d-f211970d04eb",
"metadata": {},
"outputs": [],
"source": [
"def user_prompt_for(python):\n",
"    user_prompt = \"Rewrite this Python code in C++ with the fastest possible implementation that produces identical output in the least time. \"\n",
"    user_prompt += \"Respond only with C++ code; do not explain your work other than a few comments. \"\n",
"    user_prompt += \"Pay attention to number types to ensure no int overflows. Remember to #include all necessary C++ packages such as iomanip.\\n\\n\"\n",
"    user_prompt += python\n",
"    return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6190659-f54c-4951-bef4-4960f8e51cc4",
"metadata": {},
"outputs": [],
"source": [
"def messages_for(python):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_message},\n",
"        {\"role\": \"user\", \"content\": user_prompt_for(python)}\n",
"    ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "71e1ba8c-5b05-4726-a9f3-8d8c6257350b",
"metadata": {},
"outputs": [],
"source": [
"# write to a file called optimized.cpp\n",
"\n",
"def write_output(cpp):\n",
"    code = cpp.replace(\"```cpp\",\"\").replace(\"```\",\"\")\n",
"    with open(\"optimized.cpp\", \"w\") as f:\n",
"        f.write(code)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7d2fea8-74c6-4421-8f1e-0e76d5b201b9",
"metadata": {},
"outputs": [],
"source": [
"def optimize_gpt(python):\n",
"    stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages_for(python), stream=True)\n",
"    reply = \"\"\n",
"    for chunk in stream:\n",
"        fragment = chunk.choices[0].delta.content or \"\"\n",
"        reply += fragment\n",
"        print(fragment, end='', flush=True)\n",
"    write_output(reply)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7cd84ad8-d55c-4fe0-9eeb-1895c95c4a9d",
"metadata": {},
"outputs": [],
"source": [
"def optimize_claude(python):\n",
"    result = claude.messages.stream(\n",
"        model=CLAUDE_MODEL,\n",
"        max_tokens=2000,\n",
"        system=system_message,\n",
"        messages=[{\"role\": \"user\", \"content\": user_prompt_for(python)}],\n",
"    )\n",
"    reply = \"\"\n",
"    with result as stream:\n",
"        for text in stream.text_stream:\n",
"            reply += text\n",
"            print(text, end=\"\", flush=True)\n",
"    write_output(reply)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8a35102-1c95-469b-8855-e85f4c9bdbdf",
"metadata": {},
"outputs": [],
"source": [
"def optimize_gemini(python):\n",
"    reply = gemini.models.generate_content(\n",
"        model=GEMINI_MODEL,\n",
"        contents=user_prompt_for(python),\n",
"        config=types.GenerateContentConfig(\n",
"            system_instruction=system_message,\n",
"        )\n",
"    )\n",
"\n",
"    print(reply.text)\n",
"    write_output(reply.text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1cbb778-fa57-43de-b04b-ed523f396c38",
"metadata": {},
"outputs": [],
"source": [
"pi = \"\"\"\n",
"import time\n",
"\n",
"def calculate(iterations, param1, param2):\n",
"    result = 1.0\n",
"    for i in range(1, iterations+1):\n",
"        j = i * param1 - param2\n",
"        result -= (1/j)\n",
"        j = i * param1 + param2\n",
"        result += (1/j)\n",
"    return result\n",
"\n",
"start_time = time.time()\n",
"result = calculate(100_000_000, 4, 1) * 4\n",
"end_time = time.time()\n",
"\n",
"print(f\"Result: {result:.12f}\")\n",
"print(f\"Execution Time: {(end_time - start_time):.6f} seconds\")\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fe1cd4b-d2c5-4303-afed-2115a3fef200",
"metadata": {},
"outputs": [],
"source": [
"exec(pi)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "105db6f9-343c-491d-8e44-3a5328b81719",
"metadata": {},
"outputs": [],
"source": [
"optimize_gpt(pi)"
]
},
{
"cell_type": "markdown",
"id": "bf8f8018-f64d-425c-a0e1-d7862aa9592d",
"metadata": {},
"source": [
"# Compiling C++ and executing\n",
"\n",
"This next cell contains the command to compile a C++ file on my M1 Mac. \n",
"It compiles the file `optimized.cpp` into an executable called `optimized` \n",
"Then it runs the program called `optimized`\n",
"\n",
"In the next lab (day4), a student has contributed a full solution that compiles to efficient code on Mac, PC and Linux!\n",
"\n",
"You can wait for this, or you can google (or ask ChatGPT!) for how to do this on your platform, then replace the lines below.\n",
"If you're not comfortable with this step, you can skip it for sure - I'll show you exactly how it performs on my Mac.\n",
"\n",
"\n",
"OR alternatively: student Sandeep K.G. points out that you can run Python and C++ code online to test it out that way. Thank you Sandeep! \n",
"> Not an exact comparison but you can still get the idea of performance difference.\n",
"> For example here: https://www.programiz.com/cpp-programming/online-compiler/"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4194e40c-04ab-4940-9d64-b4ad37c5bb40",
"metadata": {},
"outputs": [],
"source": [
"# Compile C++ and run the executable\n",
"\n",
"!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized optimized.cpp\n",
"!./optimized"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "983a11fe-e24d-4c65-8269-9802c5ef3ae6",
"metadata": {},
"outputs": [],
"source": [
"optimize_claude(pi)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5a766f9-3d23-4bb4-a1d4-88ec44b61ddf",
"metadata": {},
"outputs": [],
"source": [
"# Repeat for Claude - again, use the right approach for your platform\n",
"\n",
"!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized optimized.cpp\n",
"!./optimized"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01f331f2-caac-48f6-9a03-8a228ee521bc",
"metadata": {},
"outputs": [],
"source": [
"optimize_gemini(pi)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ef707a4-930e-4b8b-9443-e7e4fd309c2a",
"metadata": {},
"outputs": [],
"source": [
"!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized optimized.cpp\n",
"!./optimized"
]
},
{
"cell_type": "markdown",
"id": "7d1eaeca-61be-4d0a-a525-dd09f52aaa0f",
"metadata": {},
"source": [
"# Python Hard Version"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3b497b3-f569-420e-b92e-fb0f49957ce0",
"metadata": {},
"outputs": [],
"source": [
"python_hard = \"\"\"# Be careful to support large number sizes\n",
"\n",
"def lcg(seed, a=1664525, c=1013904223, m=2**32):\n",
"    value = seed\n",
"    while True:\n",
"        value = (a * value + c) % m\n",
"        yield value\n",
"\n",
"def max_subarray_sum(n, seed, min_val, max_val):\n",
"    lcg_gen = lcg(seed)\n",
"    random_numbers = [next(lcg_gen) % (max_val - min_val + 1) + min_val for _ in range(n)]\n",
"    max_sum = float('-inf')\n",
"    for i in range(n):\n",
"        current_sum = 0\n",
"        for j in range(i, n):\n",
"            current_sum += random_numbers[j]\n",
"            if current_sum > max_sum:\n",
"                max_sum = current_sum\n",
"    return max_sum\n",
"\n",
"def total_max_subarray_sum(n, initial_seed, min_val, max_val):\n",
"    total_sum = 0\n",
"    lcg_gen = lcg(initial_seed)\n",
"    for _ in range(20):\n",
"        seed = next(lcg_gen)\n",
"        total_sum += max_subarray_sum(n, seed, min_val, max_val)\n",
"    return total_sum\n",
"\n",
"# Parameters\n",
"n = 10000 # Number of random numbers\n",
"initial_seed = 42 # Initial seed for the LCG\n",
"min_val = -10 # Minimum value of random numbers\n",
"max_val = 10 # Maximum value of random numbers\n",
"\n",
"# Timing the function\n",
"import time\n",
"start_time = time.time()\n",
"result = total_max_subarray_sum(n, initial_seed, min_val, max_val)\n",
"end_time = time.time()\n",
"\n",
"print(\"Total Maximum Subarray Sum (20 runs):\", result)\n",
"print(\"Execution Time: {:.6f} seconds\".format(end_time - start_time))\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dab5e4bc-276c-4555-bd4c-12c699d5e899",
"metadata": {},
"outputs": [],
"source": [
"exec(python_hard)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8d24ed5-2c15-4f55-80e7-13a3952b3cb8",
"metadata": {},
"outputs": [],
"source": [
"optimize_gpt(python_hard)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e0b3d073-88a2-40b2-831c-6f0c345c256f",
"metadata": {},
"outputs": [],
"source": [
"# Replace this with the right C++ compile + execute command for your platform\n",
"\n",
"!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized optimized.cpp\n",
"!./optimized"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9305446-1d0c-4b51-866a-b8c1e299bf5c",
"metadata": {},
"outputs": [],
"source": [
"optimize_gemini(python_hard)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0c181036-8193-4fdd-aef3-fc513b218d43",
"metadata": {},
"outputs": [],
"source": [
"# Replace this with the right C++ compile + execute command for your platform\n",
"\n",
"!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized optimized.cpp\n",
"!./optimized"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ee75e72-9ecb-4edd-a74a-4d3a83c1eb79",
"metadata": {},
"outputs": [],
"source": [
"optimize_claude(python_hard)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4a4ab43c-7df2-4770-bd05-6bbc198a8c45",
"metadata": {},
"outputs": [],
"source": [
"# Replace this with the right C++ compile + execute command for your platform\n",
"\n",
"!clang++ -O3 -std=c++17 -march=armv8.3-a -o optimized optimized.cpp\n",
"!./optimized"
]
},
{
"cell_type": "markdown",
"id": "ff02ce09-0544-49a5-944d-a57b25bf9b72",
"metadata": {},
"source": [
"# Streaming"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0be9f47d-5213-4700-b0e2-d444c7c738c0",
"metadata": {},
"outputs": [],
"source": [
"def stream_gpt(python):\n",
"    stream = openai.chat.completions.create(model=OPENAI_MODEL, messages=messages_for(python), stream=True)\n",
"    reply = \"\"\n",
"    for chunk in stream:\n",
"        fragment = chunk.choices[0].delta.content or \"\"\n",
"        reply += fragment\n",
"        yield reply.replace('```cpp\\n','').replace('```','')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8669f56b-8314-4582-a167-78842caea131",
"metadata": {},
"outputs": [],
"source": [
"def stream_claude(python):\n",
"    result = claude.messages.stream(\n",
"        model=CLAUDE_MODEL,\n",
"        max_tokens=2000,\n",
"        system=system_message,\n",
"        messages=[{\"role\": \"user\", \"content\": user_prompt_for(python)}],\n",
"    )\n",
"    reply = \"\"\n",
"    with result as stream:\n",
"        for text in stream.text_stream:\n",
"            reply += text\n",
"            yield reply.replace('```cpp\\n','').replace('```','')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d48d44df-c082-4ed1-b3ea-fc2a880591c2",
"metadata": {},
"outputs": [],
"source": [
"def stream_gemini(python):\n",
"    stream = gemini.models.generate_content_stream(\n",
"        model=GEMINI_MODEL,\n",
"        contents=user_prompt_for(python),\n",
"        config=types.GenerateContentConfig(\n",
"            system_instruction=system_message,\n",
"        ),\n",
"    )\n",
"    reply = \"\"\n",
"    for chunk in stream:\n",
"        # chunk.text can be None for chunks that carry no text, so fall back to an empty string\n",
"        reply += chunk.text or \"\"\n",
"        yield reply.replace('```cpp\\n','').replace('```','')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f1ae8f5-16c8-40a0-aa18-63b617df078d",
"metadata": {},
"outputs": [],
"source": [
"def optimize(python, model):\n",
"    if model==\"GPT\":\n",
"        result = stream_gpt(python)\n",
"    elif model==\"Claude\":\n",
"        result = stream_claude(python)\n",
"    elif model==\"Gemini\":\n",
"        result = stream_gemini(python)\n",
"    else:\n",
"        raise ValueError(\"Unknown model\")\n",
"    for stream_so_far in result:\n",
"        yield stream_so_far"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1ddb38e-6b0a-4c37-baa4-ace0b7de887a",
"metadata": {},
"outputs": [],
"source": [
"with gr.Blocks() as ui:\n",
"    with gr.Row():\n",
"        python = gr.Textbox(label=\"Python code:\", lines=10, value=python_hard)\n",
"        cpp = gr.Textbox(label=\"C++ code:\", lines=10)\n",
"    with gr.Row():\n",
"        model = gr.Dropdown([\"GPT\", \"Claude\", \"Gemini\"], label=\"Select model\", value=\"GPT\")\n",
"        convert = gr.Button(\"Convert code\")\n",
"\n",
"    convert.click(optimize, inputs=[python, model], outputs=[cpp])\n",
"\n",
"ui.launch(inbrowser=True)"
]
},
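{
"cell_type": "code",
"execution_count": null,
"id": "19bf2bff-a822-4009-a539-f003b1651383",
"metadata": {},
"outputs": [],
"source": [
"def execute_python(code):\n",
"    # Capture everything the code prints, then restore whatever stdout was active before\n",
"    output = io.StringIO()\n",
"    previous_stdout = sys.stdout\n",
"    try:\n",
"        sys.stdout = output\n",
"        # Run in one fresh namespace so functions defined in `code` can call each other\n",
"        exec(code, {\"__name__\": \"__main__\"})\n",
"    finally:\n",
"        sys.stdout = previous_stdout\n",
"    return output.getvalue()"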
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77f3ab5d-fcfb-4d3f-8728-9cacbf833ea6",
"metadata": {},
"outputs": [],
"source": [
"# M1 Mac version to compile and execute optimized C++ code:\n",
"\n",
"def execute_cpp(code):\n",
"    write_output(code)\n",
"    try:\n",
"        compile_cmd = [\"clang++\", \"-Ofast\", \"-std=c++17\", \"-march=armv8.5-a\", \"-mtune=apple-m1\", \"-mcpu=apple-m1\", \"-o\", \"optimized\", \"optimized.cpp\"]\n",
"        compile_result = subprocess.run(compile_cmd, check=True, text=True, capture_output=True)\n",
"        run_cmd = [\"./optimized\"]\n",
"        run_result = subprocess.run(run_cmd, check=True, text=True, capture_output=True)\n",
"        return run_result.stdout\n",
"    except subprocess.CalledProcessError as e:\n",
"        return f\"An error occurred:\\n{e.stderr}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a2274f1-d03b-42c0-8dcc-4ce159b18442",
"metadata": {},
"outputs": [],
"source": [
"css = \"\"\"\n",
".python {background-color: #306998;}\n",
".cpp {background-color: #050;}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1303932-160c-424b-97a8-d28c816721b2",
"metadata": {},
"outputs": [],
"source": [
"with gr.Blocks(css=css) as ui:\n",
"    gr.Markdown(\"## Convert code from Python to C++\")\n",
"    with gr.Row():\n",
"        python = gr.Textbox(label=\"Python code:\", value=python_hard, lines=20)\n",
"        cpp = gr.Textbox(label=\"C++ code:\", lines=20)\n",
"    with gr.Row():\n",
"        model = gr.Dropdown([\"GPT\", \"Claude\", \"Gemini\"], label=\"Select model\", value=\"GPT\")\n",
"        convert = gr.Button(\"Convert code\")\n",
"    with gr.Row():\n",
"        python_run = gr.Button(\"Run Python\")\n",
"        cpp_run = gr.Button(\"Run C++\")\n",
"    with gr.Row():\n",
"        python_out = gr.TextArea(label=\"Python result:\", elem_classes=[\"python\"])\n",
"        cpp_out = gr.TextArea(label=\"C++ result:\", elem_classes=[\"cpp\"])\n",
"\n",
"    convert.click(optimize, inputs=[python, model], outputs=[cpp])\n",
"    python_run.click(execute_python, inputs=[python], outputs=[python_out])\n",
"    cpp_run.click(execute_cpp, inputs=[cpp], outputs=[cpp_out])\n",
"\n",
"ui.launch(inbrowser=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea42883b-fdba-46ed-97be-f42e3cb41f11",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}