{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "dc8af57c-23a9-452e-9fc3-0e5027edda14", "metadata": {}, "source": [ "# AI-powered Brochure Generator\n", "---\n", "- 🌍 Task: Generate a company brochure using its name and website for clients, investors, and recruits.\n", "- 🧠 Model: Toggle `USE_OPENAI` to switch between OpenAI and Ollama models\n", "- πŸ•΅οΈβ€β™‚οΈ Data Extraction: Scraping website content and filtering key links (About, Products, Careers, Contact).\n", "- πŸ“Œ Output Format: a Markdown-formatted brochure streamed in real-time.\n", "- πŸš€ Tools: BeautifulSoup, OpenAI API, and IPython display, ollama.\n", "- πŸ§‘β€πŸ’» Skill Level: Intermediate.\n", "\n", "πŸ› οΈ Requirements\n", "- βš™οΈ Hardware: βœ… CPU is sufficient β€” no GPU required\n", "- πŸ”‘ OpenAI API Key \n", "- Install Ollama and pull llama3.2:3b or another lightweight model\n", "---\n", "πŸ“’ Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)" ] }, { "cell_type": "markdown", "id": "ec869f2c", "metadata": {}, "source": [ "## 🧩 System Design Overview\n", "\n", "### Class Structure\n", "\n", "![](https://github.com/lisekarimi/lexo/blob/main/assets/02_brochure_class_diagram.png?raw=true)\n", "\n", "This code consists of three main classes:\n", "\n", "1. **`Website`**: \n", " - Scrapes and processes webpage content. \n", " - Extracts **text** and **links** from a given URL. \n", "\n", "2. **`LLMClient`**: \n", " - Handles interactions with **OpenAI or Ollama (`llama3`, `deepseek`, `qwen`)**. \n", " - Uses `get_relevant_links()` to filter webpage links. \n", " - Uses `generate_brochure()` to create and stream a Markdown-formatted brochure. \n", "\n", "3. **`BrochureGenerator`**: \n", " - Uses `Website` to scrape the main webpage and relevant links. \n", " - Uses `LLMClient` to filter relevant links and generate a brochure. \n", " - Calls `generate()` to run the entire process.\n", "\n", "### Workflow\n", "\n", "1. **`main()`** initializes `BrochureGenerator` and calls `generate()`. \n", "2. **`generate()`** calls **`LLMClient.get_relevant_links()`** to extract relevant links using **LLM (OpenAI/Ollama)**. \n", "3. **`Website` scrapes the webpage**, extracting **text and links** from the given URL. \n", "4. **Relevant links are re-scraped** using `Website` to collect additional content. \n", "5. **All collected content is passed to `LLMClient.generate_brochure()`**. \n", "6. **`LLMClient` streams the generated brochure** using **OpenAI or Ollama**. \n", "7. **The final brochure is displayed in Markdown format.**\n", "\n", "![](https://github.com/lisekarimi/lexo/blob/main/assets/02_brochure_process.png?raw=true)\n", "\n", "\n", "### Intermediate reasoning\n", "\n", "In this workflow, we have intermediate reasoning because the LLM is called twice:\n", "\n", "1. **First LLM call**: Takes raw links β†’ filters/selects relevant ones (reasoning step).\n", "2. 
{ "cell_type": "markdown", "id": "4b286461-35ee-4bc5-b07d-af554923e36d", "metadata": {}, "source": [ "## 📦 Import Libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "3fe5670c-5146-474b-9e75-484210533f55", "metadata": {}, "outputs": [], "source": [ "import os\n", "import requests\n", "import json\n", "import ollama\n", "from dotenv import load_dotenv\n", "from bs4 import BeautifulSoup\n", "from IPython.display import display, Markdown, update_display\n", "from openai import OpenAI" ] }, { "cell_type": "markdown", "id": "f3e23181-1e66-410d-a910-1fb4230f8088", "metadata": {}, "source": [ "## 🧠 Define the Model\n", "\n", "Switch between OpenAI and Ollama by changing a single variable (`USE_OPENAI`); the model name is selected accordingly. An OpenAI API key is only required when OpenAI is selected." ] }, { "cell_type": "code", "execution_count": null, "id": "fa2bd452-0cf4-4fec-9542-e1c86584c23f", "metadata": {}, "outputs": [], "source": [ "# Load environment variables (.env should define OPENAI_API_KEY when using OpenAI)\n", "load_dotenv()\n", "api_key = os.getenv('OPENAI_API_KEY')\n", "\n", "# Define the model dynamically\n", "USE_OPENAI = True  # True to use OpenAI, False to use Ollama\n", "MODEL = 'gpt-4o-mini' if USE_OPENAI else 'llama3.2:3b'\n", "\n", "# Only validate the key when OpenAI is the selected backend\n", "if USE_OPENAI and (not api_key or not api_key.startswith('sk-')):\n", "    raise ValueError(\"Invalid OpenAI API key. Check your .env file.\")\n", "\n", "openai_client = OpenAI() if USE_OPENAI else None" ] }, { "cell_type": "markdown", "id": "4fd997b7-1b89-4817-b53a-078164f5f71f", "metadata": {}, "source": [ "## 🏗️ Define Classes" ] }, { "cell_type": "code", "execution_count": null, "id": "aed1af59-8b8f-4add-98dc-a9f1b5b511a5", "metadata": {}, "outputs": [], "source": [ "headers = {\n", "    \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n", "}\n", "\n", "class Website:\n", "    \"\"\"\n", "    A utility class to scrape and process website content.\n", "    \"\"\"\n", "    def __init__(self, url):\n", "        self.url = url\n", "        response = requests.get(url, headers=headers)\n", "        soup = BeautifulSoup(response.content, 'html.parser')\n", "        self.title = soup.title.string if soup.title else \"No title found\"\n", "        self.text = self.extract_text(soup)\n", "        self.links = self.extract_links(soup)\n", "\n", "    def extract_text(self, soup):\n", "        # Strip scripts, styles, images, and form inputs before extracting text\n", "        if soup.body:\n", "            for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n", "                irrelevant.decompose()\n", "            return soup.body.get_text(separator=\"\\n\", strip=True)\n", "        return \"\"\n", "\n", "    def extract_links(self, soup):\n", "        # Keep only absolute links (relative links are dropped)\n", "        links = [link.get('href') for link in soup.find_all('a')]\n", "        return [link for link in links if link and 'http' in link]\n", "\n", "    def get_contents(self):\n", "        return f\"Webpage Title:\\n{self.title}\\nWebpage Contents:\\n{self.text}\\n\\n\"" ] },
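{ "cell_type": "markdown", "id": "c8f2d3e4", "metadata": {}, "source": [ "Optional sanity check: the cell below instantiates `Website` on the same URL used later in `main()` and prints the title, a few extracted links, and a text preview. It only needs network access (no API key); any public URL would work here." ] }, { "cell_type": "code", "execution_count": null, "id": "c8f2d3e5", "metadata": {}, "outputs": [], "source": [ "# Optional: quick check of the Website scraper defined above (network access only).\n", "sample_site = Website(\"https://www.toureiffel.paris/fr\")\n", "print(\"Title:\", sample_site.title)\n", "print(\"First links:\", sample_site.links[:5])\n", "print(\"Text preview:\", sample_site.text[:300])" ] },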
{ "cell_type": "code", "execution_count": null, "id": "ea04dc7e-ff4c-4113-83b7-0bddcf5072b9", "metadata": {}, "outputs": [], "source": [ "class LLMClient:\n", "    def __init__(self, model=MODEL):\n", "        self.model = model\n", "\n", "    def get_relevant_links(self, website):\n", "        link_system_prompt = \"\"\"\n", "        You are given a list of links from a company website.\n", "        Select only relevant links for a brochure (About, Company, Careers, Products, Contact).\n", "        Exclude login, terms, privacy, and email links.\n", "\n", "        ### **Instructions**\n", "        - Return **only valid JSON**.\n", "        - **Do not** include explanations, comments, or Markdown.\n", "        - Example output:\n", "        {\n", "            \"links\": [\n", "                {\"type\": \"about\", \"url\": \"https://company.com/about\"},\n", "                {\"type\": \"contact\", \"url\": \"https://company.com/contact\"},\n", "                {\"type\": \"product\", \"url\": \"https://company.com/products\"}\n", "            ]\n", "        }\n", "        \"\"\"\n", "\n", "        user_prompt = f\"\"\"\n", "        Here is the list of links on the website of {website.url}:\n", "        Please identify the relevant web links for a company brochure. Respond in JSON format.\n", "        Do not include login, terms of service, privacy, or email links.\n", "        Links (some might be relative links):\n", "        {', '.join(website.links)}\n", "        \"\"\"\n", "\n", "        if USE_OPENAI:\n", "            response = openai_client.chat.completions.create(\n", "                model=self.model,\n", "                messages=[\n", "                    {\"role\": \"system\", \"content\": link_system_prompt},\n", "                    {\"role\": \"user\", \"content\": user_prompt}\n", "                ]\n", "            )\n", "            return json.loads(response.choices[0].message.content.strip())\n", "        else:\n", "            response = ollama.chat(\n", "                model=self.model,\n", "                messages=[\n", "                    {\"role\": \"system\", \"content\": link_system_prompt},\n", "                    {\"role\": \"user\", \"content\": user_prompt}\n", "                ]\n", "            )\n", "            result = response.get(\"message\", {}).get(\"content\", \"\").strip()\n", "            try:\n", "                return json.loads(result)  # Attempt to parse JSON\n", "            except json.JSONDecodeError:\n", "                print(\"Error: Response is not valid JSON\")\n", "                return {\"links\": []}  # Return an empty list if parsing fails\n", "\n", "    def generate_brochure(self, company_name, content, language):\n", "        system_prompt = \"\"\"\n", "        You are a professional translator and writer who creates fun and engaging brochures.\n", "        Your task is to read content from a company's website and write a short, humorous,\n", "        and entertaining brochure for potential customers, investors, and job seekers.\n", "        Include details about the company's culture, customers, and career opportunities if available.\n", "        Respond in Markdown format.\n", "        \"\"\"\n", "\n", "        user_prompt = f\"\"\"\n", "        Create a fun brochure for '{company_name}' using the following content:\n", "        {content[:5000]}\n", "        Respond in {language} only, and format your response correctly in Markdown.\n", "        Do NOT escape characters or return extra backslashes.\n", "        \"\"\"\n", "\n", "        if USE_OPENAI:\n", "            response_stream = openai_client.chat.completions.create(\n", "                model=self.model,\n", "                messages=[\n", "                    {\"role\": \"system\", \"content\": system_prompt},\n", "                    {\"role\": \"user\", \"content\": user_prompt}\n", "                ],\n", "                stream=True\n", "            )\n", "            response = \"\"\n", "            display_handle = display(Markdown(\"\"), display_id=True)\n", "            for chunk in response_stream:\n", "                response += chunk.choices[0].delta.content or ''\n", "                # Strip code fences the model sometimes wraps around the Markdown\n", "                response = response.replace(\"```markdown\", \"\").replace(\"```\", \"\")\n", "                update_display(Markdown(response), display_id=display_handle.display_id)\n", "        else:\n", "            response_stream = ollama.chat(\n", "                model=self.model,\n", "                messages=[\n", "                    {\"role\": \"system\", \"content\": system_prompt},\n", "                    {\"role\": \"user\", \"content\": user_prompt}\n", "                ],\n", "                stream=True\n", "            )\n", "            display_handle = display(Markdown(\"\"), display_id=True)\n", "            full_text = \"\"\n", "            for chunk in response_stream:\n", "                if \"message\" in chunk:\n", "                    content = chunk[\"message\"][\"content\"] or \"\"\n", "                    full_text += content\n", "                    update_display(Markdown(full_text), display_id=display_handle.display_id)\n" ] },
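{ "cell_type": "markdown", "id": "d9a3e4f5", "metadata": {}, "source": [ "Optionally, you can inspect the intermediate reasoning step on its own before running the full pipeline: the cell below calls `get_relevant_links()` for the URL used in `main()` and pretty-prints the JSON returned by the model. It makes one LLM call with whichever backend `USE_OPENAI` selects." ] }, { "cell_type": "code", "execution_count": null, "id": "d9a3e4f6", "metadata": {}, "outputs": [], "source": [ "# Optional: inspect the link-selection step on its own (one LLM call).\n", "preview_site = Website(\"https://www.toureiffel.paris/fr\")\n", "preview_links = LLMClient().get_relevant_links(preview_site)\n", "print(json.dumps(preview_links, indent=2))" ] },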
{ "cell_type": "code", "execution_count": null, "id": "1c69651f-e004-421e-acc5-c439e57a8762", "metadata": {}, "outputs": [], "source": [ "class BrochureGenerator:\n", "    \"\"\"\n", "    Main class to generate a company brochure.\n", "    \"\"\"\n", "    def __init__(self, company_name, url, language='English'):\n", "        self.company_name = company_name\n", "        self.url = url\n", "        self.language = language\n", "        self.website = Website(url)\n", "        self.llm_client = LLMClient()\n", "\n", "    def generate(self):\n", "        links = self.llm_client.get_relevant_links(self.website)\n", "        content = self.website.get_contents()\n", "\n", "        for link in links['links']:\n", "            linked_website = Website(link['url'])\n", "            content += f\"\\n\\n{link['type']}:\\n\"\n", "            content += linked_website.get_contents()\n", "\n", "        self.llm_client.generate_brochure(self.company_name, content, self.language)\n" ] }, { "cell_type": "markdown", "id": "1379d39d", "metadata": {}, "source": [ "## 📝 Generate Brochure" ] }, { "cell_type": "code", "execution_count": null, "id": "1a63519a-1981-477b-9de1-f1ff9be94201", "metadata": {}, "outputs": [], "source": [ "def main():\n", "    company_name = \"Tour Eiffel\"\n", "    url = \"https://www.toureiffel.paris/fr\"\n", "    language = \"French\"\n", "\n", "    generator = BrochureGenerator(company_name, url, language)\n", "    generator.generate()\n", "\n", "if __name__ == \"__main__\":\n", "    main()" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }