{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "dc8af57c-23a9-452e-9fc3-0e5027edda14",
"metadata": {},
"source": [
"# AI-powered Brochure Generator\n",
"---\n",
"- 🌍 Task: Generate a company brochure using its name and website for clients, investors, and recruits.\n",
"- 🧠 Model: Toggle `USE_OPENAI` to switch between OpenAI and Ollama models\n",
"- 🕵️‍♂️ Data Extraction: Scraping website content and filtering key links (About, Products, Careers, Contact).\n",
"- 📌 Output Format: a Markdown-formatted brochure streamed in real-time.\n",
"- 🚀 Tools: BeautifulSoup, OpenAI API, and IPython display, ollama.\n",
"- 🧑‍💻 Skill Level: Intermediate.\n",
"\n",
"🛠️ Requirements\n",
"- ⚙️ Hardware: ✅ CPU is sufficient — no GPU required\n",
"- 🔑 OpenAI API Key \n",
"- Install Ollama and pull llama3.2:3b or another lightweight model\n",
"---\n",
"📢 Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)"
]
},
{
"cell_type": "markdown",
"id": "ec869f2c",
"metadata": {},
"source": [
"## 🧩 System Design Overview\n",
"\n",
"### Class Structure\n",
"\n",
"![](https://github.com/lisekarimi/lexo/blob/main/assets/02_brochure_class_diagram.png?raw=true)\n",
"\n",
"This code consists of three main classes:\n",
"\n",
"1. **`Website`**: \n",
" - Scrapes and processes webpage content. \n",
" - Extracts **text** and **links** from a given URL. \n",
"\n",
"2. **`LLMClient`**: \n",
" - Handles interactions with **OpenAI or Ollama (`llama3`, `deepseek`, `qwen`)**. \n",
" - Uses `get_relevant_links()` to filter webpage links. \n",
" - Uses `generate_brochure()` to create and stream a Markdown-formatted brochure. \n",
"\n",
"3. **`BrochureGenerator`**: \n",
" - Uses `Website` to scrape the main webpage and relevant links. \n",
" - Uses `LLMClient` to filter relevant links and generate a brochure. \n",
" - Calls `generate()` to run the entire process.\n",
"\n",
"### Workflow\n",
"\n",
"1. **`main()`** initializes `BrochureGenerator` and calls `generate()`. \n",
"2. **`generate()`** calls **`LLMClient.get_relevant_links()`** to extract relevant links using **LLM (OpenAI/Ollama)**. \n",
"3. **`Website` scrapes the webpage**, extracting **text and links** from the given URL. \n",
"4. **Relevant links are re-scraped** using `Website` to collect additional content. \n",
"5. **All collected content is passed to `LLMClient.generate_brochure()`**. \n",
"6. **`LLMClient` streams the generated brochure** using **OpenAI or Ollama**. \n",
"7. **The final brochure is displayed in Markdown format.**\n",
"\n",
"![](https://github.com/lisekarimi/lexo/blob/main/assets/02_brochure_process.png?raw=true)\n",
"\n",
"\n",
"### Intermediate reasoning\n",
"\n",
"In this workflow, we have intermediate reasoning because the LLM is called twice:\n",
"\n",
"1. **First LLM call**: Takes raw links → filters/selects relevant ones (reasoning step).\n",
"2. **Second LLM call**: Takes selected content → generates final brochure.\n",
"\n",
"🧠 **LLM output becomes LLM input** — thats intermediate reasoning.\n",
"\n",
"![](https://github.com/lisekarimi/lexo/blob/main/assets/02_llm_intermd_reasoning.png?raw=true)"
]
},
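{
"cell_type": "markdown",
"id": "b3e7a1c9",
"metadata": {},
"source": [
"### Sketch: chaining the two LLM calls\n",
"\n",
"The snippet below is a minimal, illustrative sketch of the two-call pattern, assuming `USE_OPENAI = True` and reusing the `openai_client` and `MODEL` defined later in this notebook. The `ask()` helper and the example inputs are hypothetical and exist only to show how the first call's output feeds the second call; the real implementation lives in `LLMClient` further down.\n",
"\n",
"```python\n",
"def ask(system: str, user: str) -> str:\n",
"    # One chat call; mirrors how LLMClient talks to the OpenAI API\n",
"    response = openai_client.chat.completions.create(\n",
"        model=MODEL,\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": system},\n",
"            {\"role\": \"user\", \"content\": user},\n",
"        ],\n",
"    )\n",
"    return response.choices[0].message.content\n",
"\n",
"# Example inputs (hypothetical, for illustration only)\n",
"raw_links = [\"https://company.com/about\", \"https://company.com/login\", \"https://company.com/careers\"]\n",
"page_text = \"We build rockets and hire engineers.\"\n",
"\n",
"# 1) Reasoning step: raw links -> selected links (JSON)\n",
"selected = ask(\"Pick only brochure-relevant links and return JSON.\", \", \".join(raw_links))\n",
"\n",
"# 2) Generation step: the first call's output becomes part of the second call's input\n",
"brochure = ask(\"Write a short Markdown brochure.\", selected + \"\\n\\n\" + page_text)\n",
"```"
]
},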
{
"cell_type": "markdown",
"id": "4b286461-35ee-4bc5-b07d-af554923e36d",
"metadata": {},
"source": [
"## 📦 Import Libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3fe5670c-5146-474b-9e75-484210533f55",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import requests\n",
"import json\n",
"import ollama\n",
"from dotenv import load_dotenv\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import display, Markdown, update_display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "markdown",
"id": "f3e23181-1e66-410d-a910-1fb4230f8088",
"metadata": {},
"source": [
"## 🧠 Define the Model\n",
"\n",
"The user can switch between OpenAI and Ollama by changing a single variable (`USE_OPENAI`). The model selection is dynamic."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa2bd452-0cf4-4fec-9542-e1c86584c23f",
"metadata": {},
"outputs": [],
"source": [
"# Load API key\n",
"load_dotenv()\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"if not api_key or not api_key.startswith('sk-'):\n",
" raise ValueError(\"Invalid OpenAI API key. Check your .env file.\")\n",
"\n",
"# Define the model dynamically\n",
"USE_OPENAI = True # True to use openai and False to use Ollama\n",
"MODEL = 'gpt-4o-mini' if USE_OPENAI else 'llama3.2:3b'\n",
"\n",
"openai_client = OpenAI() if USE_OPENAI else None"
]
},
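{
"cell_type": "markdown",
"id": "c4d9e2f1",
"metadata": {},
"source": [
"Optionally, you can run a quick smoke test before building the full pipeline. The sketch below is not required by the notebook; it simply sends one short prompt to whichever backend `USE_OPENAI` selects, using the same call patterns as the classes further down.\n",
"\n",
"```python\n",
"# Optional smoke test for the selected backend (sketch)\n",
"prompt = [{\"role\": \"user\", \"content\": \"Say hello in one word.\"}]\n",
"\n",
"if USE_OPENAI:\n",
"    reply = openai_client.chat.completions.create(model=MODEL, messages=prompt)\n",
"    print(reply.choices[0].message.content)\n",
"else:\n",
"    reply = ollama.chat(model=MODEL, messages=prompt)\n",
"    print(reply[\"message\"][\"content\"])\n",
"```"
]
},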
{
"cell_type": "markdown",
"id": "4fd997b7-1b89-4817-b53a-078164f5f71f",
"metadata": {},
"source": [
"## 🏗️ Define Classes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aed1af59-8b8f-4add-98dc-a9f1b5b511a5",
"metadata": {},
"outputs": [],
"source": [
"headers = {\n",
" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"}\n",
"\n",
"class Website:\n",
" \"\"\"\n",
" A utility class to scrape and process website content.\n",
" \"\"\"\n",
" def __init__(self, url):\n",
" self.url = url\n",
" response = requests.get(url, headers=headers)\n",
" soup = BeautifulSoup(response.content, 'html.parser')\n",
" self.title = soup.title.string if soup.title else \"No title found\"\n",
" self.text = self.extract_text(soup)\n",
" self.links = self.extract_links(soup)\n",
"\n",
" def extract_text(self, soup):\n",
" if soup.body:\n",
" for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
" irrelevant.decompose()\n",
" return soup.body.get_text(separator=\"\\n\", strip=True)\n",
" return \"\"\n",
"\n",
" def extract_links(self, soup):\n",
" links = [link.get('href') for link in soup.find_all('a')]\n",
" return [link for link in links if link and 'http' in link]\n",
"\n",
" def get_contents(self):\n",
" return f\"Webpage Title:\\n{self.title}\\nWebpage Contents:\\n{self.text}\\n\\n\""
]
},
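{
"cell_type": "markdown",
"id": "e5a1c7b3",
"metadata": {},
"source": [
"You can try the scraper on its own before wiring it into the pipeline. The sketch below simply instantiates `Website` on the same URL used in `main()` further down; note that `extract_links()` keeps only absolute links (anything without `http` is dropped).\n",
"\n",
"```python\n",
"# Quick check of the scraper in isolation (sketch)\n",
"site = Website(\"https://www.toureiffel.paris/fr\")\n",
"print(site.title)\n",
"print(len(site.links), \"absolute links found\")\n",
"print(site.text[:300])\n",
"```"
]
},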
{
"cell_type": "code",
"execution_count": null,
"id": "ea04dc7e-ff4c-4113-83b7-0bddcf5072b9",
"metadata": {},
"outputs": [],
"source": [
"class LLMClient:\n",
" def __init__(self, model=MODEL):\n",
" self.model = model\n",
"\n",
" def get_relevant_links(self, website):\n",
" link_system_prompt = \"\"\"\n",
" You are given a list of links from a company website.\n",
" Select only relevant links for a brochure (About, Company, Careers, Products, Contact).\n",
" Exclude login, terms, privacy, and emails.\n",
"\n",
" ### **Instructions**\n",
" - Return **only valid JSON**.\n",
" - **Do not** include explanations, comments, or Markdown.\n",
" - Example output:\n",
" {\n",
" \"links\": [\n",
" {\"type\": \"about\", \"url\": \"https://company.com/about\"},\n",
" {\"type\": \"contact\", \"url\": \"https://company.com/contact\"},\n",
" {\"type\": \"product\", \"url\": \"https://company.com/products\"}\n",
" ]\n",
" }\n",
" \"\"\"\n",
"\n",
" user_prompt = f\"\"\"\n",
" Here is the list of links on the website of {website.url}:\n",
" Please identify the relevant web links for a company brochure. Respond in JSON format.\n",
" Do not include login, terms of service, privacy, or email links.\n",
" Links (some might be relative links):\n",
" {', '.join(website.links)}\n",
" \"\"\"\n",
"\n",
" if USE_OPENAI:\n",
" response = openai_client.chat.completions.create(\n",
" model=self.model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": link_system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]\n",
" )\n",
" return json.loads(response.choices[0].message.content.strip())\n",
" else:\n",
" response = ollama.chat(\n",
" model=self.model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": link_system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ]\n",
" )\n",
" result = response.get(\"message\", {}).get(\"content\", \"\").strip()\n",
" try:\n",
" return json.loads(result) # Attempt to parse JSON\n",
" except json.JSONDecodeError:\n",
" print(\"Error: Response is not valid JSON\")\n",
" return {\"links\": []} # Return empty list if parsing fails\n",
"\n",
"\n",
" def generate_brochure(self, company_name, content, language):\n",
" system_prompt = \"\"\"\n",
" You are a professional translator and writer who creates fun and engaging brochures.\n",
" Your task is to read content from a companys website and write a short, humorous, joky,\n",
" and entertaining brochure for potential customers, investors, and job seekers.\n",
" Include details about the companys culture, customers, and career opportunities if available.\n",
" Respond in Markdown format.\n",
" \"\"\"\n",
"\n",
" user_prompt = f\"\"\"\n",
" Create a fun brochure for '{company_name}' using the following content:\n",
" {content[:5000]}\n",
" Respond in {language} only, and format your response correctly in Markdown.\n",
" Do NOT escape characters or return extra backslashes.\n",
" \"\"\"\n",
"\n",
" if USE_OPENAI:\n",
" response_stream = openai_client.chat.completions.create(\n",
" model=self.model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" stream=True\n",
" )\n",
" response = \"\"\n",
" display_handle = display(Markdown(\"\"), display_id=True)\n",
" for chunk in response_stream:\n",
" response += chunk.choices[0].delta.content or ''\n",
" response = response.replace(\"```\",\"\").replace(\"markdown\", \"\")\n",
" update_display(Markdown(response), display_id=display_handle.display_id)\n",
" else:\n",
" response_stream = ollama.chat(\n",
" model=self.model,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_prompt},\n",
" {\"role\": \"user\", \"content\": user_prompt}\n",
" ],\n",
" stream=True\n",
" )\n",
" display_handle = display(Markdown(\"\"), display_id=True)\n",
" full_text = \"\"\n",
" for chunk in response_stream:\n",
" if \"message\" in chunk:\n",
" content = chunk[\"message\"][\"content\"] or \"\"\n",
" full_text += content\n",
" update_display(Markdown(full_text), display_id=display_handle.display_id)\n"
]
},
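{
"cell_type": "markdown",
"id": "d7b8f4a2",
"metadata": {},
"source": [
"Note that the OpenAI branch of `get_relevant_links()` feeds the reply straight into `json.loads`, so a reply wrapped in extra text would raise. If you ever hit that, one possible hardening (a sketch, not part of the original notebook; the helper name is illustrative) is to request the API's JSON mode via `response_format` and reuse the same fallback as the Ollama branch:\n",
"\n",
"```python\n",
"def get_relevant_links_openai(link_system_prompt: str, user_prompt: str) -> dict:\n",
"    # Sketch of a more defensive OpenAI branch for get_relevant_links()\n",
"    response = openai_client.chat.completions.create(\n",
"        model=MODEL,\n",
"        messages=[\n",
"            {\"role\": \"system\", \"content\": link_system_prompt},\n",
"            {\"role\": \"user\", \"content\": user_prompt}\n",
"        ],\n",
"        response_format={\"type\": \"json_object\"}  # ask the API for strict JSON\n",
"    )\n",
"    try:\n",
"        return json.loads(response.choices[0].message.content.strip())\n",
"    except json.JSONDecodeError:\n",
"        print(\"Error: Response is not valid JSON\")\n",
"        return {\"links\": []}\n",
"```"
]
},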
{
"cell_type": "code",
"execution_count": null,
"id": "1c69651f-e004-421e-acc5-c439e57a8762",
"metadata": {},
"outputs": [],
"source": [
"class BrochureGenerator:\n",
" \"\"\"\n",
" Main class to generate a company brochure.\n",
" \"\"\"\n",
" def __init__(self, company_name, url, language='English'):\n",
" self.company_name = company_name\n",
" self.url = url\n",
" self.language = language\n",
" self.website = Website(url)\n",
" self.llm_client = LLMClient()\n",
"\n",
" def generate(self):\n",
" links = self.llm_client.get_relevant_links(self.website)\n",
" content = self.website.get_contents()\n",
"\n",
" for link in links['links']:\n",
" linked_website = Website(link['url'])\n",
" content += f\"\\n\\n{link['type']}:\\n\"\n",
" content += linked_website.get_contents()\n",
"\n",
" self.llm_client.generate_brochure(self.company_name, content, self.language)\n"
]
},
{
"cell_type": "markdown",
"id": "1379d39d",
"metadata": {},
"source": [
"## 📝 Generate Brochure"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1a63519a-1981-477b-9de1-f1ff9be94201",
"metadata": {},
"outputs": [],
"source": [
"def main():\n",
" company_name = \"Tour Eiffel\"\n",
" url = \"https://www.toureiffel.paris/fr\"\n",
" language = \"French\"\n",
"\n",
" generator = BrochureGenerator(company_name, url, language)\n",
" generator.generate()\n",
"\n",
"if __name__ == \"__main__\":\n",
" main()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}