Merge pull request #332 from jstoops/community-contributions-branch

Added my contributions to week 1, 2, and 5 community-contributions - RecursiveCharacterTextSplitter
Ed Donner
2025-04-19 08:43:23 +02:00
committed by GitHub
5 changed files with 1927 additions and 0 deletions

@@ -0,0 +1,369 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "83bbedd0-eb58-48de-992e-484071b10104",
"metadata": {},
"source": [
"# Web Scraper with JavaScript Support\n",
"A simplified, easy-to-run version of the day1-webscraping-selenium-for-javascript.ipynb solution.\n",
"\n",
"## Install dependencies\n",
"Uncomment and run once"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2d91971-9dd0-4714-8ec7-f1fb25f95140",
"metadata": {},
"outputs": [],
"source": [
"# !pip install selenium\n",
"# !pip install undetected-chromedriver\n",
"# !ollama pull llama3.2"
]
},
{
"cell_type": "markdown",
"id": "967258fe-3296-464c-962d-2bcf821eae67",
"metadata": {},
"source": [
"## Import required dependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe8a87c8-0475-45a1-8ca2-fb9059e5470b",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"import undetected_chromedriver as uc\n",
"from selenium.webdriver.common.by import By\n",
"from selenium.webdriver.support.ui import WebDriverWait\n",
"from selenium.webdriver.support import expected_conditions as EC\n",
"import time\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "markdown",
"id": "df60545e-2ab6-4e37-b41c-27ddf2affb92",
"metadata": {},
"source": [
"## Run setup"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3846089-efa2-4602-8bc3-5f6f4945de64",
"metadata": {},
"outputs": [],
"source": [
"chrome_path = \"C:/Program Files/Google/Chrome/Application/chrome.exe\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b835812d-3692-4192-abc4-15fc463bd08f",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv()\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
"    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
"    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
"    print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "acb89abb-dcee-4da6-98f8-e339d258f2a4",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the troubleshooting notebook, or try the below line instead:\n",
"# openai = OpenAI(api_key=\"your-key-here-starting-sk-proj-\")"
]
},
{
"cell_type": "markdown",
"id": "e860e963-e7a1-4888-a4b9-db9c24bb9a6e",
"metadata": {},
"source": [
"# Create Prompts"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4933c36-db8a-4333-8f81-e9db7ba41287",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'\n",
"\n",
"system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
"and provides a short summary, ignoring text that might be navigation related. \\\n",
"Respond in markdown.\"\n",
"\n",
"# A function that writes a User Prompt that asks for summaries of websites:\n",
"\n",
"def user_prompt_for(website):\n",
"    user_prompt = f\"You are looking at a website titled {website.title}\"\n",
"    user_prompt += \"\\nThe contents of this website are as follows; \\\n",
"please provide a short summary of this website in markdown. \\\n",
"If it includes news or announcements, then summarize these too.\\n\\n\"\n",
"    user_prompt += website.text\n",
"    return user_prompt\n"
]
},
{
"cell_type": "markdown",
"id": "17cfab59-304d-4d2f-b324-c388d9e87fca",
"metadata": {},
"source": [
"# Create Functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca5e96e0-4d8f-49de-a608-a735a5b23b1a",
"metadata": {},
"outputs": [],
"source": [
"# Setup for how OpenAI expects to receive messages in a particular structure\n",
"\n",
"def messages_for(website):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
"    ]\n",
"\n",
"# Use Selenium and chrome to scrape website\n",
"class WebsiteCrawler:\n",
"    def __init__(self, url, wait_time=20, chrome_binary_path=None):\n",
"        \"\"\"\n",
"        Initialize the WebsiteCrawler using Selenium to scrape JavaScript-rendered content.\n",
"        \"\"\"\n",
"        self.url = url\n",
"        self.wait_time = wait_time\n",
"\n",
"        options = uc.ChromeOptions()\n",
"        options.add_argument(\"--disable-gpu\")\n",
"        options.add_argument(\"--no-sandbox\")\n",
"        options.add_argument(\"--disable-dev-shm-usage\")\n",
"        options.add_argument(\"--disable-blink-features=AutomationControlled\")\n",
"        options.add_argument(\"start-maximized\")\n",
"        options.add_argument(\n",
"            \"user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"        )\n",
"        if chrome_binary_path:\n",
"            options.binary_location = chrome_binary_path\n",
"\n",
"        self.driver = uc.Chrome(options=options)\n",
"\n",
"        try:\n",
"            # Load the URL\n",
"            self.driver.get(url)\n",
"\n",
"            # Wait for Cloudflare or similar checks\n",
"            time.sleep(10)\n",
"\n",
"            # Ensure the main content is loaded\n",
"            WebDriverWait(self.driver, self.wait_time).until(\n",
"                EC.presence_of_element_located((By.TAG_NAME, \"main\"))\n",
"            )\n",
"\n",
"            # Extract the main content\n",
"            main_content = self.driver.find_element(By.CSS_SELECTOR, \"main\").get_attribute(\"outerHTML\")\n",
"\n",
"            # Parse with BeautifulSoup\n",
"            soup = BeautifulSoup(main_content, \"html.parser\")\n",
"            self.title = self.driver.title if self.driver.title else \"No title found\"\n",
"            self.text = soup.get_text(separator=\"\\n\", strip=True)\n",
"\n",
"        except Exception as e:\n",
"            print(f\"Error occurred: {e}\")\n",
"            self.title = \"Error occurred\"\n",
"            self.text = \"\"\n",
"\n",
"        finally:\n",
"            self.driver.quit()\n",
"\n",
"def new_summary(url, chrome_path):\n",
"    web = WebsiteCrawler(url, 30, chrome_path)\n",
"    response = openai.chat.completions.create(\n",
"        model = \"gpt-4o-mini\",\n",
"        messages = messages_for(web)\n",
"    )\n",
"\n",
"    web_summary = response.choices[0].message.content\n",
"\n",
"    return display(Markdown(web_summary))"
]
},
{
"cell_type": "markdown",
"id": "e5f974b3-e417-43a2-88f1-8db06096cd53",
"metadata": {},
"source": [
"# Scrape and Summarize Web Page"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "55f240cb-1fca-46bf-81d1-1beeea64439d",
"metadata": {},
"outputs": [],
"source": [
"url = \"https://www.canva.com/\"\n",
"new_summary(url, chrome_path)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
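
Note on the key check in the notebook above: it tests the `sk-proj-` prefix before it tests for stray whitespace, so a key with a leading space is reported as having the wrong prefix rather than as having whitespace. A minimal sketch of the same checks as a pure function, with the whitespace test moved first. The name `check_api_key` and the short status strings are hypothetical and not part of the submitted notebook; only the `sk-proj-` prefix is taken from it.

```python
def check_api_key(api_key):
    """Return a short status string for an OpenAI-style API key.

    Hypothetical helper: the 'sk-proj-' prefix comes from the notebook,
    while the check order (whitespace before prefix) is this sketch's choice.
    """
    if not api_key:
        return "missing"
    if api_key != api_key.strip():
        # Checked before the prefix so " sk-proj-..." gets the accurate message.
        return "whitespace"
    if not api_key.startswith("sk-proj-"):
        return "wrong-prefix"
    return "ok"


if __name__ == "__main__":
    for candidate in (None, " sk-proj-abc", "sk-abc", "sk-proj-abc"):
        print(repr(candidate), "->", check_api_key(candidate))
```

Returning a status instead of printing keeps the check easy to test; the notebook cell could map each status to its existing message.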

@@ -0,0 +1,379 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a98030af-fcd1-4d63-a36e-38ba053498fa",
"metadata": {},
"source": [
"# Snarky brochure"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5b08506-dc8b-4443-9201-5f1848161363",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"# If these fail, please check you're running from an 'activated' environment with (llms) in the command prompt\n",
"\n",
"import os\n",
"import requests\n",
"import json\n",
"from typing import List\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display, update_display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc5d8880-f2ee-4c06-af16-ecbc0262af61",
"metadata": {},
"outputs": [],
"source": [
"# Initialize and constants\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:\n",
"    print(\"API key looks good so far\")\n",
"else:\n",
"    print(\"There might be a problem with your API key? Please visit the troubleshooting notebook!\")\n",
" \n",
"MODEL = 'gpt-4o-mini'\n",
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "106dd65e-90af-4ca8-86b6-23a41840645b",
"metadata": {},
"outputs": [],
"source": [
"# A class to represent a Webpage\n",
"\n",
"# Some websites need you to use proper headers when fetching them:\n",
"headers = {\n",
"    \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
"    \"\"\"\n",
"    A utility class to represent a Website that we have scraped, now with links\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, url):\n",
"        self.url = url\n",
"        response = requests.get(url, headers=headers)\n",
"        self.body = response.content\n",
"        soup = BeautifulSoup(self.body, 'html.parser')\n",
"        self.title = soup.title.string if soup.title else \"No title found\"\n",
"        if soup.body:\n",
"            for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
"                irrelevant.decompose()\n",
"            self.text = soup.body.get_text(separator=\"\\n\", strip=True)\n",
"        else:\n",
"            self.text = \"\"\n",
"        links = [link.get('href') for link in soup.find_all('a')]\n",
"        self.links = [link for link in links if link]\n",
"\n",
"    def get_contents(self):\n",
"        return f\"Webpage Title:\\n{self.title}\\nWebpage Contents:\\n{self.text}\\n\\n\""
]
},
{
"cell_type": "markdown",
"id": "1771af9c-717a-4fca-bbbe-8a95893312c3",
"metadata": {},
"source": [
"## Link prompts\n",
"### Multi-shot system prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6957b079-0d96-45f7-a26a-3487510e9b35",
"metadata": {},
"outputs": [],
"source": [
"link_system_prompt = \"You are provided with a list of links found on a webpage. \\\n",
"You are able to decide which of the links would be most relevant to include in a brochure about the company, \\\n",
"such as links to an About page, or a Company page, or Careers/Jobs pages.\\n\"\n",
"link_system_prompt += \"You should respond in JSON as in these examples:\"\n",
"link_system_prompt += \"\"\"\n",
"Example 1\n",
"['https://my-company.com', 'https://my-company.com/about-me', 'https://www.linkedin.com/in/my-company/', 'mailto:joe.blog@gmail.com', 'https://my-company.com/news', '/case-studies', 'https://patents.google.com/patent/US20210049536A1/', 'https://my-company.com/workshop-ai']\n",
"\n",
"Links:\n",
"{\n",
"    \"links\": [\n",
"        {\"type\": \"landing page\", \"url\": \"https://my-company.com\"},\n",
"        {\"type\": \"about page\", \"url\": \"https://my-company.com/about-me\"},\n",
"        {\"type\": \"news page\", \"url\": \"https://my-company.com/news\"},\n",
"        {\"type\": \"case studies page\", \"url\": \"https://my-company.com/case-studies\"},\n",
"        {\"type\": \"workshop page\", \"url\": \"https://my-company.com/workshop-ai\"}\n",
"    ]\n",
"}\n",
"Example 2\n",
"['https://www.acmeinc.com', '/#about', '/#projects', '/#experience', '/#skills', 'https://github.com/acmeinc']\n",
"\n",
"Links:\n",
"{\n",
"    \"links\": [\n",
"        {\"type\": \"landing page\", \"url\": \"https://www.acmeinc.com\"},\n",
"        {\"type\": \"GitHub projects\", \"url\": \"https://github.com/acmeinc\"}\n",
"    ]\n",
"}\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b97e4068-97ed-4120-beae-c42105e4d59a",
"metadata": {},
"outputs": [],
"source": [
"print(link_system_prompt)"
]
},
{
"cell_type": "markdown",
"id": "baf384bb-4577-4885-a445-dc8da232b1d9",
"metadata": {},
"source": [
"### User prompt"
]
},
{
"cell_type": "markdown",
"id": "51174859-666a-43ad-9c34-5f082298d398",
"metadata": {},
"source": [
"## Get links"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e1f601b-2eaf-499d-b6b8-c99050c9d6b3",
"metadata": {},
"outputs": [],
"source": [
"def get_links_user_prompt(website):\n",
"    user_prompt = f\"Here is the list of links on the website of {website.url} - \"\n",
"    user_prompt += \"please decide which of these are relevant web links for a brochure about the company, respond with the full https URL in JSON format. \\\n",
"Do not include Terms of Service, Privacy, email links.\\n\"\n",
"    user_prompt += \"Links (some might be relative links):\\n\"\n",
"    user_prompt += \"\\n\".join(website.links)\n",
"    return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a29aca19-ca13-471c-a4b4-5abbfa813f69",
"metadata": {},
"outputs": [],
"source": [
"def get_links(url):\n",
"    website = Website(url)\n",
"    response = openai.chat.completions.create(\n",
"        model=MODEL,\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": link_system_prompt},\n",
"            {\"role\": \"user\", \"content\": get_links_user_prompt(website)}\n",
"        ],\n",
"        response_format={\"type\": \"json_object\"}\n",
"    )\n",
"    result = response.choices[0].message.content\n",
"    return json.loads(result)"
]
},
{
"cell_type": "markdown",
"id": "0d74128e-dfb6-47ec-9549-288b621c838c",
"metadata": {},
"source": [
"## Create brochure"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85a5b6e2-e7ef-44a9-bc7f-59ede71037b5",
"metadata": {},
"outputs": [],
"source": [
"def get_all_details(url):\n",
"    result = \"Landing page:\\n\"\n",
"    result += Website(url).get_contents()\n",
"    links = get_links(url)\n",
"    print(\"Found links:\", links)\n",
"    for link in links[\"links\"]:\n",
"        result += f\"\\n\\n{link['type']}\\n\"\n",
"        result += Website(link[\"url\"]).get_contents()\n",
"    return result"
]
},
{
"cell_type": "markdown",
"id": "4b4d8ec1-4855-4c0e-afc0-33055e6b0a6d",
"metadata": {},
"source": [
"### Snarky system prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b863a55-f86c-4e3f-8a79-94e24c1a8cf2",
"metadata": {},
"outputs": [],
"source": [
"# system_prompt = \"You are an assistant that analyzes the contents of several relevant pages from a company website \\\n",
"# and creates a short brochure about the company for prospective customers, investors and recruits. Respond in markdown. \\\n",
"# Include details of company culture, customers and careers/jobs if you have the information.\"\n",
"\n",
"# Or uncomment the lines below for a more humorous brochure - this demonstrates how easy it is to incorporate 'tone':\n",
"\n",
"# system_prompt = \"You are an assistant that analyzes the contents of several relevant pages from a company website \\\n",
"# and creates a short humorous, entertaining, jokey brochure about the company for prospective customers, investors and recruits. Respond in markdown. \\\n",
"# Include details of company culture, customers and careers/jobs if you have the information.\"\n",
"\n",
"# Note the space before each trailing backslash: the continuation joins the lines with nothing in between.\n",
"system_prompt = \"You are an assistant that analyzes the contents of several relevant pages from a company website \\\n",
"and creates a short snarky, entertaining, pun-loaded brochure about the company for prospective customers, investors and recruits. Respond in markdown. \\\n",
"Include details of company culture, customers and careers/jobs if you have the information.\"\n"
]
},
{
"cell_type": "markdown",
"id": "c5766318-97cc-4442-bb9f-fa8c6998777e",
"metadata": {},
"source": [
"### User prompt"
]
},
{
"cell_type": "markdown",
"id": "d6e224b2-8ab0-476e-96c3-42763ad21f25",
"metadata": {},
"source": [
"### Generate brochure in English"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6ab83d92-d36b-4ce0-8bcc-5bb4c2f8ff23",
"metadata": {},
"outputs": [],
"source": [
"def get_brochure_user_prompt(company_name, url):\n",
"    user_prompt = f\"You are looking at a company called: {company_name}\\n\"\n",
"    user_prompt += \"Here are the contents of its landing page and other relevant pages; use this information to build a short brochure of the company in markdown.\\n\"\n",
"    user_prompt += get_all_details(url)\n",
"    user_prompt = user_prompt[:5_000] # Truncate if more than 5,000 characters\n",
"    return user_prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e44de579-4a1a-4e6a-a510-20ea3e4b8d46",
"metadata": {},
"outputs": [],
"source": [
"def create_brochure(company_name, url):\n",
"    response = openai.chat.completions.create(\n",
"        model=MODEL,\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": system_prompt},\n",
"            {\"role\": \"user\", \"content\": get_brochure_user_prompt(company_name, url)}\n",
"        ],\n",
"    )\n",
"    result = response.choices[0].message.content\n",
"    display(Markdown(result))\n",
"    return result"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e093444a-9407-42ae-924a-145730591a39",
"metadata": {},
"outputs": [],
"source": [
"brochure_text = create_brochure(\"HuggingFace\", \"https://huggingface.co\")"
]
},
{
"cell_type": "markdown",
"id": "30415c72-d26a-454e-8900-f584977aca96",
"metadata": {},
"source": [
"### Translate brochure to another language"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2331eb34-12bf-4e88-83f9-a48d97cc83ec",
"metadata": {},
"outputs": [],
"source": [
"translation_sys_prompt = \"You are a language translator who is very good at translating business documents from \\\n",
"English to any language. You preserve the formatting, tone and facts contained in the document.\"\n",
"\n",
"def translate_brochure(brochure, language):\n",
"    response = openai.chat.completions.create(\n",
"        model=MODEL,\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": translation_sys_prompt},\n",
"            {\"role\": \"user\", \"content\": f\"Translate the following document into {language}: {brochure}\"}\n",
"        ],\n",
"    )\n",
"    result = response.choices[0].message.content\n",
"    display(Markdown(result))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "112beb4d-984b-4162-8d36-8cef79c351cc",
"metadata": {},
"outputs": [],
"source": [
"translate_brochure(brochure_text, \"Spanish\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
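
Note on relative links in the notebook above: the user prompt warns that "some might be relative links" and relies on the model to return full URLs, which `get_all_details` then fetches as-is. One possible safeguard, sketched here under stated assumptions, is to resolve and filter links in code before fetching. The helper name `absolute_links` is hypothetical and not part of the submitted notebook; `urljoin` and `urlparse` are from the standard library.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, links):
    """Resolve links against base_url and keep unique http(s) URLs in order.

    Hypothetical helper for illustration only.
    """
    seen = set()
    resolved = []
    for link in links:
        if not link:
            continue  # skip None and empty href values
        full = urljoin(base_url, link.strip())
        if urlparse(full).scheme not in ("http", "https"):
            continue  # drops mailto:, tel:, javascript: and similar
        if full not in seen:
            seen.add(full)
            resolved.append(full)
    return resolved


if __name__ == "__main__":
    sample = ["/#about", "mailto:joe.blog@gmail.com", "https://github.com/acmeinc", "/#about"]
    print(absolute_links("https://www.acmeinc.com", sample))
```

Applied to `website.links` before building the prompt, or to each returned `url` before calling `Website(...)`, this would avoid passing a relative path to `requests.get`.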

@@ -0,0 +1,221 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fe12c203-e6a6-452c-a655-afb8a03a4ff5",
"metadata": {},
"source": [
"# End of week 1 exercise\n",
"\n",
"To demonstrate your familiarity with OpenAI API, and also Ollama, build a tool that takes a technical question, \n",
"and responds with an explanation. This is a tool that you will be able to use yourself during the course!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c1070317-3ed9-4659-abe3-828943230e03",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import Markdown, display, update_display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4a456906-915a-4bfd-bb9d-57e505c5093f",
"metadata": {},
"outputs": [],
"source": [
"# constants\n",
"\n",
"MODEL_GPT = 'gpt-4o-mini'\n",
"MODEL_LLAMA = 'llama3.2'\n",
"\n",
"OLLAMA_API = \"http://localhost:11434/v1\"\n",
"HEADERS = {\"Content-Type\": \"application/json\"}"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a8d7923c-5f28-4c30-8556-342d7c8497c1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"API key looks good so far\n"
]
}
],
"source": [
"# set up environment\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"if api_key and api_key.startswith('sk-proj-') and len(api_key)>10:\n",
"    print(\"API key looks good so far\")\n",
"else:\n",
"    print(\"There might be a problem with your API key? Please visit the troubleshooting notebook!\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "3f0d0137-52b0-47a8-81a8-11a90a010798",
"metadata": {},
"outputs": [],
"source": [
"# here is the system prompt and payloads;\n",
"\n",
"system_prompt = \"\"\"\n",
"You are an expert on LLMs and writing python code. You are able to answer complex questions with\n",
"detailed answers and explain what every line of code does. You can refactor the code when asked.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "60ce7000-a4a5-4cce-a261-e75ef45063b4",
"metadata": {},
"outputs": [],
"source": [
"# Function to get answer, with streaming\n",
"\n",
"def llm_copilot(question, model):\n",
"    if 'llama' in model.lower():\n",
"        openai = OpenAI(base_url=OLLAMA_API, api_key='ollama')\n",
"    else:\n",
"        openai = OpenAI()\n",
"\n",
"    stream = openai.chat.completions.create(\n",
"        model=model,\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": system_prompt},\n",
"            {\"role\": \"user\", \"content\": question}\n",
"        ],\n",
"        stream=True\n",
"    )\n",
"    response = \"\"\n",
"    display_handle = display(Markdown(\"\"), display_id=True)\n",
"    for chunk in stream:\n",
"        response += chunk.choices[0].delta.content or ''\n",
"        response = response.replace(\"```\",\"\").replace(\"markdown\", \"\")\n",
"        update_display(Markdown(response), display_id=display_handle.display_id)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8f7c8ea8-4082-4ad0-8751-3301adcf6538",
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Here's a revised version of your code:\n",
"\n",
"python\n",
"if 'llama' in model.lower():\n",
"\n",
"\n",
"OR if you want to keep the original style, you can modify it as follows:\n",
"\n",
"python\n",
"if model.split('.')[-1] == 'llama3.2':\n",
"\n",
"\n",
"In this second example, we use string indexing (`-1`) to get the last part of the `model` string after splitting at the dot (`.`) character.\n",
"\n",
"The first revised version uses Python's built-in string method `lower()` to convert `model` to lowercase and then checks if 'llama' is present in it. It returns True if the text contains \"llama\", otherwise, it will return False. \n",
"\n",
"However, both of these codes are using lazy evaluation, which means if you do this check inside a loop:\n",
"\n",
"python\n",
"for i in range(100):\n",
" print('llama')\n",
"\n",
"\n",
"Python will use 'a' instead of 'llame' most of the time until `i == 98` because it has to wait for the condition to be met (and also does some lookup and look-around). If you want a case-insensitive search without this slowness, consider using a regular expression as shown below\n",
"\n",
"python\n",
"import re\n",
"\n",
"if re.search(r' llama.', model):\n",
"\n",
"\n",
"And if you still want that specific code structure, then use `replace` function as follows:\n",
"\n",
"python\n",
"model = model.replace('llama', '')\n",
"if model == '3.2':\n"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Ask question\n",
"question = \"\"\"\n",
"Change this code to check for just the 'llama' portion of text instead of the entire string:\n",
"if model == 'llama3.2':\n",
"\"\"\"\n",
"\n",
"# llm_copilot streams its answer to the display and returns None, so call it directly rather than printing the result\n",
"llm_copilot(question, MODEL_LLAMA)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a4026cd-8967-4961-b26b-e3997307c4ba",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
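
Note on model routing in the notebook above: `llm_copilot` picks the Ollama endpoint when the model name contains "llama" and the default OpenAI client otherwise. A minimal sketch of that decision as a separate pure function, so it can be checked without a network call. The name `client_kwargs_for` is hypothetical; the URL and the placeholder key `'ollama'` are copied from the notebook.

```python
OLLAMA_API = "http://localhost:11434/v1"  # same value as the notebook constant


def client_kwargs_for(model):
    """Return keyword arguments for the OpenAI(...) constructor for this model.

    Hypothetical helper: a case-insensitive substring match on 'llama' routes
    to the local Ollama endpoint; anything else uses the client defaults.
    """
    if "llama" in model.lower():
        return {"base_url": OLLAMA_API, "api_key": "ollama"}
    return {}


if __name__ == "__main__":
    for name in ("llama3.2", "Llama3.2", "gpt-4o-mini"):
        print(name, "->", client_kwargs_for(name))
```

Inside `llm_copilot` the client could then be built with `OpenAI(**client_kwargs_for(model))`, which replaces the if/else while keeping the same behavior.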