Merge branch 'main' of github.com:ed-donner/llm_engineering
This commit is contained in:
16
community-contributions/Reputation_Radar/Dockerfile
Normal file
@@ -0,0 +1,16 @@
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENV STREAMLIT_SERVER_HEADLESS=true \
    STREAMLIT_SERVER_ADDRESS=0.0.0.0 \
    STREAMLIT_SERVER_PORT=8501

EXPOSE 8501

CMD ["streamlit", "run", "app.py"]
13
community-contributions/Reputation_Radar/Makefile
Normal file
@@ -0,0 +1,13 @@
PYTHON ?= python

.PHONY: install run test

install:
	$(PYTHON) -m pip install --upgrade pip
	$(PYTHON) -m pip install -r requirements.txt

run:
	streamlit run app.py

test:
	pytest
124
community-contributions/Reputation_Radar/README.md
Normal file
@@ -0,0 +1,124 @@
# 📡 ReputationRadar

> Real-time brand intelligence with human-readable insights.

ReputationRadar is a Streamlit dashboard that unifies Reddit, Twitter/X, and Trustpilot chatter, classifies sentiment with OpenAI (or a VADER fallback), and delivers exportable executive summaries. It ships with modular services, caching, retry-aware scrapers, demo data, and pytest coverage—ready for production hardening or internal deployment.

---

## Table of Contents
- [Demo](#demo)
- [Feature Highlights](#feature-highlights)
- [Architecture Overview](#architecture-overview)
- [Quick Start](#quick-start)
- [Configuration & Credentials](#configuration--credentials)
- [Running Tests](#running-tests)
- [Working Without API Keys](#working-without-api-keys)
- [Exports & Deliverables](#exports--deliverables)
- [Troubleshooting](#troubleshooting)
- [Legal & Compliance](#legal--compliance)

---

## Demo

A video demo of the app is available at:
https://drive.google.com/file/d/1XZ09NOht1H5LCJEbOrAldny2L5SV1DeT/view?usp=sharing

## Feature Highlights

- **Adaptive Ingestion** – Toggle Reddit, Twitter/X, and Trustpilot independently; backoff, caching, and polite scraping keep providers happy.
- **Smart Sentiment** – Batch OpenAI classification with rationale-aware prompts and auto-fallback to VADER when credentials are missing.
- **Actionable Summaries** – Executive brief card (highlights, risks, tone, actions) plus a refreshed PDF layout that respects margins and typography.
- **Interactive Insights** – Plotly visuals, per-source filtering, and a lean “Representative Mentions” link list to avoid content overload.
- **Export Suite** – CSV, Excel (auto-sized columns), and polished PDF snapshots for stakeholder handoffs.
- **Robust Foundation** – Structured logging, reusable UI components, pytest suites, Dockerfile, and Makefile for frictionless iteration.

---

## Architecture Overview

```
community-contributions/Reputation_Radar/
├── app.py              # Streamlit orchestrator & layout
├── components/         # Sidebar, dashboard, summaries, loaders
├── services/           # Reddit/Twitter clients, Trustpilot scraper, LLM wrapper, utilities
├── samples/            # Demo JSON payloads (auto-loaded when credentials missing)
├── tests/              # Pytest coverage for utilities and LLM fallback
├── assets/             # Placeholder icons/logo
├── logs/               # Streaming log output
├── requirements.txt    # Runtime dependencies (includes PDF + Excel writers)
├── Dockerfile          # Containerised deployment recipe
└── Makefile            # Helper targets for install/run/test
```

Each service returns a normalised payload to keep the downstream sentiment pipeline deterministic. Deduplication is handled centrally via fuzzy matching, and timestamps are coerced to UTC before analysis.
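Each item follows the `NormalizedItem` shape from `services/utils.py`. A sketch of one payload, with values borrowed from the bundled samples (in code the timestamp is a timezone-aware datetime rather than a string):

```python
# Illustrative NormalizedItem payload; values come from samples/reddit_sample.json.
item = {
    "source": "reddit",      # one of "reddit", "twitter", "trustpilot"
    "id": "t3_sample1",
    "url": "https://www.reddit.com/r/technology/comments/sample1",
    "author": "techfan42",
    "timestamp": "2025-01-15T14:30:00+00:00",  # coerced to UTC before analysis
    "text": "ReputationRadar did an impressive job resolving our customer issues within hours.",
    "meta": {"score": 128, "num_comments": 24, "subreddit": "technology", "type": "submission"},
}
```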
---

## Quick Start

1. **Clone & enter the project directory (`community-contributions/Reputation_Radar`).**
2. **Install dependencies and launch Streamlit:**

   ```bash
   pip install -r requirements.txt && streamlit run app.py
   ```

   (Use a virtual environment if preferred.)
3. **Populate the sidebar:** add your brand name, optional filters, toggled sources, and API credentials (stored only in session state).
4. **Click “Run Analysis 🚀”** – follow the status indicators as sources load, sentiment processes, and summaries render.

### Optional Docker Run

```bash
docker build -t reputation-radar .
docker run --rm -p 8501:8501 -e OPENAI_API_KEY=your_key reputation-radar
```

---

## Configuration & Credentials

The app reads from `.env`, Streamlit secrets, or direct sidebar input. Expected variables:

| Variable | Purpose |
| --- | --- |
| `OPENAI_API_KEY` | Enables OpenAI sentiment + executive summary (falls back to VADER if absent). |
| `REDDIT_CLIENT_ID` | PRAW client ID for Reddit API access. |
| `REDDIT_CLIENT_SECRET` | PRAW client secret. |
| `REDDIT_USER_AGENT` | Descriptive user agent (e.g., `ReputationRadar/1.0 by you`). |
| `TWITTER_BEARER_TOKEN` | Twitter/X v2 recent search bearer token. |

Credential validation mirrors the guidance from `week1/day1.ipynb`—mistyped OpenAI keys surface helpful warnings before analysis begins.
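For local runs, a minimal `.env` covering these variables might look like the following (every value is a placeholder):

```
OPENAI_API_KEY=sk-your-key-here
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USER_AGENT=ReputationRadar/1.0 by you
TWITTER_BEARER_TOKEN=your_bearer_token
```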

---

## Running Tests

```bash
pytest
```

Tests cover sentiment fallback behaviour and core sanitisation/deduplication helpers. Extend them as you add new data transforms or UI logic.
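For instance, the VADER fallback can be exercised without any credentials; a minimal sketch (the test name is illustrative):

```python
from services.llm import LLMService


def test_vader_fallback_produces_labels():
    # With no API key, LLMService routes classification through VADER.
    service = LLMService(api_key=None)
    results = service.classify_sentiment_batch(["I love this product!"])
    assert len(results) == 1
    assert results[0].label in {"positive", "neutral", "negative"}
    assert 0.0 <= results[0].confidence <= 1.0
```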

---

## Working Without API Keys

- Reddit/Twitter/Trustpilot can be toggled independently; missing credentials raise gentle warnings rather than hard failures.
- Curated fixtures in `samples/` automatically load for any disabled source, keeping charts, exports, and PDF output functional in demo mode.
- The LLM layer drops to VADER sentiment scoring and skips the executive summary when `OPENAI_API_KEY` is absent.

---

## Exports & Deliverables

- **CSV** – Clean, UTF-8 dataset for quick spreadsheet edits.
- **Excel** – Auto-sized columns, formatted timestamps, and instant import into stakeholder workbooks.
- **PDF** – Professionally typeset executive summary with bullet lists, consistent margins, and wrapped excerpts (thanks to ReportLab’s Platypus engine).

All exports are regenerated on demand and never persisted server-side.
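Once downloaded, the CSV round-trips cleanly into pandas; a quick sketch (the filename matches the app's download button):

```python
import pandas as pd

# Columns mirror the dashboard dataframe: source, id, url, author,
# timestamp, text, label, confidence, meta.
df = pd.read_csv("reputation_radar.csv", parse_dates=["timestamp"])
print(df.groupby(["source", "label"]).size())
```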

---

## Troubleshooting

- **OpenAI key missing/invalid** – Watch the sidebar notices; the app falls back gracefully, but no executive summary will be produced.
- **Twitter 401/403** – Confirm your bearer token scope and that the project has search access enabled.
- **Rate limiting (429)** – Built-in sleeps help, but repeated requests may require manual pauses. Try narrowing filters or reducing per-source limits.
- **Trustpilot blocks** – Respect robots.txt. If scraping is denied, switch to the official API or provide compliant CSV imports.
- **PDF text clipping** – Resolved by the new layout; if you customise templates, ensure column widths and table styles remain inside the page margins.

---

## Legal & Compliance

ReputationRadar surfaces public discourse for legitimate monitoring purposes. Always comply with each platform’s Terms of Service, local regulations, and privacy expectations. Avoid storing third-party data longer than necessary, and never commit API keys to version control—the app only keeps them in Streamlit session state.
436
community-contributions/Reputation_Radar/app.py
Normal file
@@ -0,0 +1,436 @@
"""ReputationRadar Streamlit application entrypoint."""

from __future__ import annotations

import io
import json
import os
import re
from datetime import datetime
from typing import Dict, List, Optional

import pandas as pd
import streamlit as st
from dotenv import load_dotenv
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import ParagraphStyle, getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer, Table, TableStyle

from components.dashboard import render_overview, render_source_explorer, render_top_comments
from components.filters import render_sidebar
from components.summary import render_summary
from components.loaders import show_empty_state, source_status
from services import llm, reddit_client, trustpilot_scraper, twitter_client, utils
from services.llm import SentimentResult
from services.utils import (
    NormalizedItem,
    ServiceError,
    ServiceWarning,
    initialize_logger,
    load_sample_items,
    normalize_items,
    parse_date_range,
    validate_openai_key,
)


st.set_page_config(page_title="ReputationRadar", page_icon="📡", layout="wide")
load_dotenv(override=True)
LOGGER = initialize_logger()

st.title("📡 ReputationRadar")
st.caption("Aggregate brand chatter, classify sentiment, and surface actionable insights in minutes.")


def _get_env_defaults() -> Dict[str, Optional[str]]:
    """Read supported credentials from environment variables."""
    return {
        "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
        "REDDIT_CLIENT_ID": os.getenv("REDDIT_CLIENT_ID"),
        "REDDIT_CLIENT_SECRET": os.getenv("REDDIT_CLIENT_SECRET"),
        "REDDIT_USER_AGENT": os.getenv("REDDIT_USER_AGENT", "ReputationRadar/1.0"),
        "TWITTER_BEARER_TOKEN": os.getenv("TWITTER_BEARER_TOKEN"),
    }


@st.cache_data(ttl=600, show_spinner=False)
def cached_reddit_fetch(
    brand: str,
    limit: int,
    date_range: str,
    min_upvotes: int,
    client_id: str,
    client_secret: str,
    user_agent: str,
) -> List[NormalizedItem]:
    credentials = {
        "client_id": client_id,
        "client_secret": client_secret,
        "user_agent": user_agent,
    }
    return reddit_client.fetch_mentions(
        brand=brand,
        credentials=credentials,
        limit=limit,
        date_filter=date_range,
        min_upvotes=min_upvotes,
    )


@st.cache_data(ttl=600, show_spinner=False)
def cached_twitter_fetch(
    brand: str,
    limit: int,
    min_likes: int,
    language: str,
    bearer: str,
) -> List[NormalizedItem]:
    return twitter_client.fetch_mentions(
        brand=brand,
        bearer_token=bearer,
        limit=limit,
        min_likes=min_likes,
        language=language,
    )


@st.cache_data(ttl=600, show_spinner=False)
def cached_trustpilot_fetch(
    brand: str,
    language: str,
    pages: int = 2,
) -> List[NormalizedItem]:
    return trustpilot_scraper.fetch_reviews(brand=brand, language=language, pages=pages)


def _to_dataframe(items: List[NormalizedItem], sentiments: List[SentimentResult]) -> pd.DataFrame:
    data = []
    for item, sentiment in zip(items, sentiments):
        data.append(
            {
                "source": item["source"],
                "id": item["id"],
                "url": item.get("url"),
                "author": item.get("author"),
                "timestamp": item["timestamp"],
                "text": item["text"],
                "label": sentiment.label,
                "confidence": sentiment.confidence,
                "meta": json.dumps(item.get("meta", {})),
            }
        )
    df = pd.DataFrame(data)
    if not df.empty:
        df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df


def _build_pdf(summary: Optional[Dict[str, str]], df: pd.DataFrame) -> bytes:
    buffer = io.BytesIO()
    doc = SimpleDocTemplate(
        buffer,
        pagesize=letter,
        rightMargin=40,
        leftMargin=40,
        topMargin=60,
        bottomMargin=40,
        title="ReputationRadar Executive Summary",
    )
    styles = getSampleStyleSheet()
    title_style = styles["Title"]
    subtitle_style = ParagraphStyle(
        "Subtitle",
        parent=styles["BodyText"],
        fontSize=10,
        leading=14,
        textColor="#555555",
    )
    body_style = ParagraphStyle(
        "Body",
        parent=styles["BodyText"],
        leading=14,
        fontSize=11,
    )
    bullet_style = ParagraphStyle(
        "Bullet",
        parent=body_style,
        leftIndent=16,
        bulletIndent=8,
        spaceBefore=2,
        spaceAfter=2,
    )
    heading_style = ParagraphStyle(
        "SectionHeading",
        parent=styles["Heading3"],
        spaceBefore=10,
        spaceAfter=6,
    )

    story: List[Paragraph | Spacer | Table] = []
    story.append(Paragraph("ReputationRadar Executive Summary", title_style))
    story.append(Spacer(1, 6))
    story.append(
        Paragraph(
            f"Generated on: {datetime.utcnow().strftime('%Y-%m-%d %H:%M')} UTC",
            subtitle_style,
        )
    )
    story.append(Spacer(1, 18))

    if summary and summary.get("raw"):
        story.extend(_summary_to_story(summary["raw"], body_style, bullet_style, heading_style))
    else:
        story.append(
            Paragraph(
                "Executive summary disabled (OpenAI key missing).",
                body_style,
            )
        )
    story.append(Spacer(1, 16))
    story.append(Paragraph("Sentiment Snapshot", styles["Heading2"]))
    story.append(Spacer(1, 10))

    table_data: List[List[Paragraph]] = [
        [
            Paragraph("Date", body_style),
            Paragraph("Sentiment", body_style),
            Paragraph("Source", body_style),
            Paragraph("Excerpt", body_style),
        ]
    ]
    snapshot = df.sort_values("timestamp", ascending=False).head(15)
    for _, row in snapshot.iterrows():
        excerpt = _truncate_text(row["text"], 180)
        table_data.append(
            [
                Paragraph(row["timestamp"].strftime("%Y-%m-%d %H:%M"), body_style),
                Paragraph(row["label"].title(), body_style),
                Paragraph(row["source"].title(), body_style),
                Paragraph(excerpt, body_style),
            ]
        )

    table = Table(table_data, colWidths=[90, 70, 80, 250])
    table.setStyle(
        TableStyle(
            [
                ("BACKGROUND", (0, 0), (-1, 0), colors.HexColor("#f3f4f6")),
                ("TEXTCOLOR", (0, 0), (-1, 0), colors.HexColor("#1f2937")),
                ("FONTNAME", (0, 0), (-1, 0), "Helvetica-Bold"),
                ("ALIGN", (0, 0), (-1, -1), "LEFT"),
                ("VALIGN", (0, 0), (-1, -1), "TOP"),
                ("INNERGRID", (0, 0), (-1, -1), 0.25, colors.HexColor("#d1d5db")),
                ("BOX", (0, 0), (-1, -1), 0.5, colors.HexColor("#9ca3af")),
                ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, colors.HexColor("#f9fafb")]),
            ]
        )
    )
    story.append(table)

    doc.build(story)
    buffer.seek(0)
    return buffer.getvalue()


def _summary_to_story(
    raw_summary: str,
    body_style: ParagraphStyle,
    bullet_style: ParagraphStyle,
    heading_style: ParagraphStyle,
) -> List[Paragraph | Spacer]:
    story: List[Paragraph | Spacer] = []
    lines = [line.strip() for line in raw_summary.splitlines()]
    for line in lines:
        if not line:
            continue
        clean = re.sub(r"\*\*(.*?)\*\*", r"\1", line)
        if clean.endswith(":") and len(clean) < 40:
            story.append(Paragraph(clean.rstrip(":"), heading_style))
            continue
        if clean.lower().startswith(("highlights", "risks & concerns", "recommended actions", "overall tone")):
            story.append(Paragraph(clean, heading_style))
            continue
        if line.startswith(("-", "*")):
            bullet_text = re.sub(r"\*\*(.*?)\*\*", r"\1", line[1:].strip())
            story.append(Paragraph(bullet_text, bullet_style, bulletText="•"))
        else:
            story.append(Paragraph(clean, body_style))
    story.append(Spacer(1, 10))
    return story


def _truncate_text(text: str, max_length: int) -> str:
    clean = re.sub(r"\s+", " ", text).strip()
    if len(clean) <= max_length:
        return clean
    return clean[: max_length - 1].rstrip() + "…"


def _build_excel(df: pd.DataFrame) -> bytes:
    buffer = io.BytesIO()
    export_df = df.copy()
    export_df["timestamp"] = export_df["timestamp"].dt.strftime("%Y-%m-%d %H:%M")
    with pd.ExcelWriter(buffer, engine="xlsxwriter") as writer:
        export_df.to_excel(writer, index=False, sheet_name="Mentions")
        worksheet = writer.sheets["Mentions"]
        for idx, column in enumerate(export_df.columns):
            series = export_df[column].astype(str)
            max_len = min(60, max(series.map(len).max(), len(column)) + 2)
            worksheet.set_column(idx, idx, max_len)
    buffer.seek(0)
    return buffer.getvalue()


def main() -> None:
    env_defaults = _get_env_defaults()
    openai_env_key = env_defaults.get("OPENAI_API_KEY") or st.session_state.get("secrets", {}).get("OPENAI_API_KEY")
    validated_env_key, notices = validate_openai_key(openai_env_key)
    config = render_sidebar(env_defaults, tuple(notices))

    chosen_key = config["credentials"]["openai"] or validated_env_key
    openai_key, runtime_notices = validate_openai_key(chosen_key)
    for msg in runtime_notices:
        st.sidebar.info(msg)

    run_clicked = st.button("Run Analysis 🚀", type="primary")

    if not run_clicked:
        show_empty_state("Enter a brand name and click **Run Analysis** to get started.")
        return

    if not config["brand"]:
        st.error("Brand name is required.")
        return

    threshold = parse_date_range(config["date_range"])
    collected: List[NormalizedItem] = []

    with st.container():
        if config["sources"]["reddit"]:
            with source_status("Fetching Reddit mentions") as status:
                try:
                    reddit_items = cached_reddit_fetch(
                        brand=config["brand"],
                        limit=config["limits"]["reddit"],
                        date_range=config["date_range"],
                        min_upvotes=config["min_reddit_upvotes"],
                        client_id=config["credentials"]["reddit"]["client_id"],
                        client_secret=config["credentials"]["reddit"]["client_secret"],
                        user_agent=config["credentials"]["reddit"]["user_agent"],
                    )
                    reddit_items = [item for item in reddit_items if item["timestamp"] >= threshold]
                    status.write(f"Fetched {len(reddit_items)} Reddit items.")
                    collected.extend(reddit_items)
                except ServiceWarning as warning:
                    st.warning(str(warning))
                    demo = load_sample_items("reddit_sample")
                    if demo:
                        st.info("Loaded demo Reddit data.", icon="🧪")
                        collected.extend(demo)
                except ServiceError as error:
                    st.error(f"Reddit fetch failed: {error}")
        if config["sources"]["twitter"]:
            with source_status("Fetching Twitter mentions") as status:
                try:
                    twitter_items = cached_twitter_fetch(
                        brand=config["brand"],
                        limit=config["limits"]["twitter"],
                        min_likes=config["min_twitter_likes"],
                        language=config["language"],
                        bearer=config["credentials"]["twitter"],
                    )
                    twitter_items = [item for item in twitter_items if item["timestamp"] >= threshold]
                    status.write(f"Fetched {len(twitter_items)} tweets.")
                    collected.extend(twitter_items)
                except ServiceWarning as warning:
                    st.warning(str(warning))
                    demo = load_sample_items("twitter_sample")
                    if demo:
                        st.info("Loaded demo Twitter data.", icon="🧪")
                        collected.extend(demo)
                except ServiceError as error:
                    st.error(f"Twitter fetch failed: {error}")
        if config["sources"]["trustpilot"]:
            with source_status("Fetching Trustpilot reviews") as status:
                try:
                    trustpilot_items = cached_trustpilot_fetch(
                        brand=config["brand"],
                        language=config["language"],
                    )
                    trustpilot_items = [item for item in trustpilot_items if item["timestamp"] >= threshold]
                    status.write(f"Fetched {len(trustpilot_items)} reviews.")
                    collected.extend(trustpilot_items)
                except ServiceWarning as warning:
                    st.warning(str(warning))
                    demo = load_sample_items("trustpilot_sample")
                    if demo:
                        st.info("Loaded demo Trustpilot data.", icon="🧪")
                        collected.extend(demo)
                except ServiceError as error:
                    st.error(f"Trustpilot fetch failed: {error}")

    if not collected:
        show_empty_state("No mentions found. Try enabling more sources or loosening filters.")
        return

    cleaned = normalize_items(collected)
    if not cleaned:
        show_empty_state("All results were filtered out as noise. Try again with different settings.")
        return

    sentiment_service = llm.LLMService(
        api_key=config["credentials"]["openai"] or openai_key,
        batch_size=config["batch_size"],
    )
    sentiments = sentiment_service.classify_sentiment_batch([item["text"] for item in cleaned])
    df = _to_dataframe(cleaned, sentiments)

    render_overview(df)
    render_top_comments(df)

    summary_payload: Optional[Dict[str, str]] = None
    if sentiment_service.available():
        try:
            summary_payload = sentiment_service.summarize_overall(
                [{"label": row["label"], "text": row["text"]} for _, row in df.iterrows()]
            )
        except ServiceWarning as warning:
            st.warning(str(warning))
    else:
        st.info("OpenAI key missing. Using VADER fallback for sentiment; summary disabled.", icon="ℹ️")

    render_summary(summary_payload)
    render_source_explorer(df)

    csv_data = df.to_csv(index=False).encode("utf-8")
    excel_data = _build_excel(df)
    pdf_data = _build_pdf(summary_payload, df)
    col_csv, col_excel, col_pdf = st.columns(3)
    with col_csv:
        st.download_button(
            "⬇️ Export CSV",
            data=csv_data,
            file_name="reputation_radar.csv",
            mime="text/csv",
        )
    with col_excel:
        st.download_button(
            "⬇️ Export Excel",
            data=excel_data,
            file_name="reputation_radar.xlsx",
            mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        )
    with col_pdf:
        st.download_button(
            "⬇️ Export PDF Summary",
            data=pdf_data,
            file_name="reputation_radar_summary.pdf",
            mime="application/pdf",
        )

    st.success("Analysis complete! Review the insights above.")


if __name__ == "__main__":
    main()
5
community-contributions/Reputation_Radar/components/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""Reusable Streamlit UI components for ReputationRadar."""

from . import dashboard, filters, loaders, summary

__all__ = ["dashboard", "filters", "loaders", "summary"]
136
community-contributions/Reputation_Radar/components/dashboard.py
Normal file
@@ -0,0 +1,136 @@
"""Render the ReputationRadar dashboard components."""

from __future__ import annotations

from typing import Dict, Optional

import pandas as pd
import plotly.express as px
import streamlit as st

SOURCE_CHIPS = {
    "reddit": "🔺 Reddit",
    "twitter": "✖️ Twitter",
    "trustpilot": "⭐ Trustpilot",
}

SENTIMENT_COLORS = {
    "positive": "#4caf50",
    "neutral": "#90a4ae",
    "negative": "#ef5350",
}


def render_overview(df: pd.DataFrame) -> None:
    """Display charts summarising sentiment."""
    counts = (
        df["label"]
        .value_counts()
        .reindex(["positive", "neutral", "negative"], fill_value=0)
        .rename_axis("label")
        .reset_index(name="count")
    )
    pie = px.pie(
        counts,
        names="label",
        values="count",
        color="label",
        color_discrete_map=SENTIMENT_COLORS,
        title="Sentiment distribution",
    )
    pie.update_traces(textinfo="percent+label")

    ts = (
        df.set_index("timestamp")
        .groupby([pd.Grouper(freq="D"), "label"])
        .size()
        .reset_index(name="count")
    )
    if not ts.empty:
        ts_plot = px.line(
            ts,
            x="timestamp",
            y="count",
            color="label",
            color_discrete_map=SENTIMENT_COLORS,
            markers=True,
            title="Mentions over time",
        )
    else:
        ts_plot = None

    col1, col2 = st.columns(2)
    with col1:
        st.plotly_chart(pie, use_container_width=True)
    with col2:
        if ts_plot is not None:
            st.plotly_chart(ts_plot, use_container_width=True)
        else:
            st.info("Not enough data for a time-series. Try widening the date range.", icon="📆")


def render_top_comments(df: pd.DataFrame) -> None:
    """Show representative comments per sentiment."""
    st.subheader("Representative Mentions")
    cols = st.columns(3)
    for idx, sentiment in enumerate(["positive", "neutral", "negative"]):
        subset = (
            df[df["label"] == sentiment]
            .sort_values("confidence", ascending=False)
            .head(5)
        )
        with cols[idx]:
            st.caption(sentiment.capitalize())
            if subset.empty:
                st.write("No items yet.")
                continue
            for _, row in subset.iterrows():
                chip = SOURCE_CHIPS.get(row["source"], row["source"])
                author = row.get("author") or "Unknown"
                timestamp = row["timestamp"].strftime("%Y-%m-%d %H:%M")
                label = f"{chip} · {author} · {timestamp}"
                if row.get("url"):
                    st.markdown(f"- [{label}]({row['url']})")
                else:
                    st.markdown(f"- {label}")


def render_source_explorer(df: pd.DataFrame) -> None:
    """Interactive tabular explorer with pagination and filters."""
    with st.expander("Source Explorer", expanded=False):
        search_term = st.text_input("Search mentions", key="explorer_search")
        selected_source = st.selectbox("Source filter", options=["All"] + list(SOURCE_CHIPS.values()))
        min_conf = st.slider("Minimum confidence", min_value=0.0, max_value=1.0, value=0.0, step=0.1)

        filtered = df.copy()
        if search_term:
            filtered = filtered[filtered["text"].str.contains(search_term, case=False, na=False)]
        if selected_source != "All":
            source_key = _reverse_lookup(selected_source)
            if source_key:
                filtered = filtered[filtered["source"] == source_key]
        filtered = filtered[filtered["confidence"] >= min_conf]

        if filtered.empty:
            st.info("No results found. Try widening the date range or removing filters.", icon="🪄")
            return

        page_size = 10
        total_pages = max(1, (len(filtered) + page_size - 1) // page_size)
        page = st.number_input("Page", min_value=1, max_value=total_pages, value=1)
        start = (page - 1) * page_size
        end = start + page_size

        explorer_df = filtered.iloc[start:end].copy()
        explorer_df["source"] = explorer_df["source"].map(SOURCE_CHIPS).fillna(explorer_df["source"])
        explorer_df["timestamp"] = explorer_df["timestamp"].dt.strftime("%Y-%m-%d %H:%M")
        explorer_df = explorer_df[["timestamp", "source", "author", "label", "confidence", "text", "url"]]

        st.dataframe(explorer_df, use_container_width=True, hide_index=True)


def _reverse_lookup(value: str) -> Optional[str]:
    for key, chip in SOURCE_CHIPS.items():
        if chip == value:
            return key
    return None
128
community-contributions/Reputation_Radar/components/filters.py
Normal file
@@ -0,0 +1,128 @@
"""Sidebar filters and configuration controls."""

from __future__ import annotations

from typing import Dict, Optional, Tuple

import streamlit as st

DATE_RANGE_LABELS = {
    "24h": "Last 24 hours",
    "7d": "Last 7 days",
    "30d": "Last 30 days",
}

SUPPORTED_LANGUAGES = {
    "en": "English",
    "es": "Spanish",
    "de": "German",
    "fr": "French",
}


def _store_secret(key: str, value: str) -> None:
    """Persist sensitive values in session state only."""
    if value:
        st.session_state.setdefault("secrets", {})
        st.session_state["secrets"][key] = value


def _get_secret(key: str, default: str = "") -> str:
    return st.session_state.get("secrets", {}).get(key, default)


def render_sidebar(env_defaults: Dict[str, Optional[str]], openai_notices: Tuple[str, ...]) -> Dict[str, object]:
    """Render all sidebar controls and return configuration."""
    with st.sidebar:
        st.header("Tune Your Radar", anchor=False)
        brand = st.text_input("Brand Name*", value=st.session_state.get("brand_input", ""))
        if brand:
            st.session_state["brand_input"] = brand

        date_range = st.selectbox(
            "Date Range",
            options=list(DATE_RANGE_LABELS.keys()),
            format_func=lambda key: DATE_RANGE_LABELS[key],
            index=1,
        )
        min_reddit_upvotes = st.number_input(
            "Minimum Reddit upvotes",
            min_value=0,
            value=st.session_state.get("min_reddit_upvotes", 4),
        )
        st.session_state["min_reddit_upvotes"] = min_reddit_upvotes
        min_twitter_likes = st.number_input(
            "Minimum X likes",
            min_value=0,
            value=st.session_state.get("min_twitter_likes", 100),
        )
        st.session_state["min_twitter_likes"] = min_twitter_likes
        language = st.selectbox(
            "Language",
            options=list(SUPPORTED_LANGUAGES.keys()),
            format_func=lambda key: SUPPORTED_LANGUAGES[key],
            index=0,
        )

        st.markdown("### Sources")
        reddit_enabled = st.toggle("🔺 Reddit", value=st.session_state.get("reddit_enabled", True))
        twitter_enabled = st.toggle("✖️ Twitter", value=st.session_state.get("twitter_enabled", True))
        trustpilot_enabled = st.toggle("⭐ Trustpilot", value=st.session_state.get("trustpilot_enabled", True))
        st.session_state["reddit_enabled"] = reddit_enabled
        st.session_state["twitter_enabled"] = twitter_enabled
        st.session_state["trustpilot_enabled"] = trustpilot_enabled

        st.markdown("### API Keys")
        openai_key_default = env_defaults.get("OPENAI_API_KEY") or _get_secret("OPENAI_API_KEY")
        openai_key = st.text_input("OpenAI API Key", value=openai_key_default or "", type="password", help="Stored only in this session.")
        _store_secret("OPENAI_API_KEY", openai_key.strip())
        reddit_client_id = st.text_input("Reddit Client ID", value=env_defaults.get("REDDIT_CLIENT_ID") or _get_secret("REDDIT_CLIENT_ID"), type="password")
        reddit_client_secret = st.text_input("Reddit Client Secret", value=env_defaults.get("REDDIT_CLIENT_SECRET") or _get_secret("REDDIT_CLIENT_SECRET"), type="password")
        reddit_user_agent = st.text_input("Reddit User Agent", value=env_defaults.get("REDDIT_USER_AGENT") or _get_secret("REDDIT_USER_AGENT"))
        twitter_bearer_token = st.text_input("Twitter Bearer Token", value=env_defaults.get("TWITTER_BEARER_TOKEN") or _get_secret("TWITTER_BEARER_TOKEN"), type="password")
        _store_secret("REDDIT_CLIENT_ID", reddit_client_id.strip())
        _store_secret("REDDIT_CLIENT_SECRET", reddit_client_secret.strip())
        _store_secret("REDDIT_USER_AGENT", reddit_user_agent.strip())
        _store_secret("TWITTER_BEARER_TOKEN", twitter_bearer_token.strip())

        if openai_notices:
            for notice in openai_notices:
                st.info(notice)

        with st.expander("Advanced Options", expanded=False):
            reddit_limit = st.slider("Reddit results", min_value=10, max_value=100, value=st.session_state.get("reddit_limit", 40), step=5)
            twitter_limit = st.slider("Twitter results", min_value=10, max_value=100, value=st.session_state.get("twitter_limit", 40), step=5)
            trustpilot_limit = st.slider("Trustpilot results", min_value=10, max_value=60, value=st.session_state.get("trustpilot_limit", 30), step=5)
            llm_batch_size = st.slider("OpenAI batch size", min_value=5, max_value=20, value=st.session_state.get("llm_batch_size", 20), step=5)
            st.session_state["reddit_limit"] = reddit_limit
            st.session_state["twitter_limit"] = twitter_limit
            st.session_state["trustpilot_limit"] = trustpilot_limit
            st.session_state["llm_batch_size"] = llm_batch_size

    return {
        "brand": brand.strip(),
        "date_range": date_range,
        "min_reddit_upvotes": min_reddit_upvotes,
        "min_twitter_likes": min_twitter_likes,
        "language": language,
        "sources": {
            "reddit": reddit_enabled,
            "twitter": twitter_enabled,
            "trustpilot": trustpilot_enabled,
        },
        "limits": {
            "reddit": reddit_limit,
            "twitter": twitter_limit,
            "trustpilot": trustpilot_limit,
        },
        "batch_size": llm_batch_size,
        "credentials": {
            "openai": openai_key.strip(),
            "reddit": {
                "client_id": reddit_client_id.strip(),
                "client_secret": reddit_client_secret.strip(),
                "user_agent": reddit_user_agent.strip(),
            },
            "twitter": twitter_bearer_token.strip(),
        },
    }
25
community-contributions/Reputation_Radar/components/loaders.py
Normal file
@@ -0,0 +1,25 @@
"""Loading indicators and status helpers."""

from __future__ import annotations

from contextlib import contextmanager
from typing import Iterator

import streamlit as st


@contextmanager
def source_status(label: str) -> Iterator[st.delta_generator.DeltaGenerator]:
    """Context manager that yields a status widget for source fetching."""
    status = st.status(label, expanded=True)
    try:
        yield status
        status.update(label=f"{label} ✅", state="complete")
    except Exception as exc:  # noqa: BLE001
        status.update(label=f"{label} ⚠️ {exc}", state="error")
        raise


def show_empty_state(message: str) -> None:
    """Render a friendly empty-state callout."""
    st.info(message, icon="🔎")
23
community-contributions/Reputation_Radar/components/summary.py
Normal file
@@ -0,0 +1,23 @@
"""Executive summary display components."""

from __future__ import annotations

from typing import Dict, Optional

import streamlit as st


def render_summary(summary: Optional[Dict[str, str]]) -> None:
    """Render executive summary card."""
    st.subheader("Executive Summary", anchor=False)
    if not summary:
        st.warning("Executive summary disabled. Provide an OpenAI API key to unlock this section.", icon="🤖")
        return
    st.markdown(
        """
        <div style="padding:1rem;border:1px solid #eee;border-radius:0.75rem;background-color:#f9fafb;">
        """,
        unsafe_allow_html=True,
    )
    st.markdown(summary.get("raw", ""))
    st.markdown("</div>", unsafe_allow_html=True)
16
community-contributions/Reputation_Radar/requirements.txt
Normal file
@@ -0,0 +1,16 @@
streamlit
praw
requests
beautifulsoup4
pandas
python-dotenv
tenacity
plotly
openai>=1.0.0
vaderSentiment
fuzzywuzzy[speedup]
python-Levenshtein
reportlab
tqdm
pytest
XlsxWriter
20
community-contributions/Reputation_Radar/samples/reddit_sample.json
Normal file
@@ -0,0 +1,20 @@
[
  {
    "source": "reddit",
    "id": "t3_sample1",
    "url": "https://www.reddit.com/r/technology/comments/sample1",
    "author": "techfan42",
    "timestamp": "2025-01-15T14:30:00+00:00",
    "text": "ReputationRadar did an impressive job resolving our customer issues within hours. Support has been world class!",
    "meta": {"score": 128, "num_comments": 24, "subreddit": "technology", "type": "submission"}
  },
  {
    "source": "reddit",
    "id": "t1_sample2",
    "url": "https://www.reddit.com/r/startups/comments/sample2/comment/sample",
    "author": "growthguru",
    "timestamp": "2025-01-14T10:10:00+00:00",
    "text": "Noticed a spike in downtime alerts with ReputationRadar this week. Anyone else seeing false positives?",
    "meta": {"score": 45, "subreddit": "startups", "type": "comment", "submission_title": "Monitoring tools"}
  }
]
20
community-contributions/Reputation_Radar/samples/trustpilot_sample.json
Normal file
@@ -0,0 +1,20 @@
[
  {
    "source": "trustpilot",
    "id": "trustpilot-001",
    "url": "https://www.trustpilot.com/review/reputationradar.ai",
    "author": "Dana",
    "timestamp": "2025-01-12T11:00:00+00:00",
    "text": "ReputationRadar has simplified our weekly reporting. The sentiment breakdowns are easy to understand and accurate.",
    "meta": {"rating": "5 stars"}
  },
  {
    "source": "trustpilot",
    "id": "trustpilot-002",
    "url": "https://www.trustpilot.com/review/reputationradar.ai?page=2",
    "author": "Liam",
    "timestamp": "2025-01-10T18:20:00+00:00",
    "text": "Support was responsive, but the Trustpilot integration kept timing out. Hoping for a fix soon.",
    "meta": {"rating": "3 stars"}
  }
]
20
community-contributions/Reputation_Radar/samples/twitter_sample.json
Normal file
@@ -0,0 +1,20 @@
[
  {
    "source": "twitter",
    "id": "173654001",
    "url": "https://twitter.com/brandlover/status/173654001",
    "author": "brandlover",
    "timestamp": "2025-01-15T16:45:00+00:00",
    "text": "Huge shoutout to ReputationRadar for flagging sentiment risks ahead of our launch. Saved us hours this morning!",
    "meta": {"likes": 57, "retweets": 8, "replies": 3, "quote_count": 2}
  },
  {
    "source": "twitter",
    "id": "173653991",
    "url": "https://twitter.com/critique/status/173653991",
    "author": "critique",
    "timestamp": "2025-01-13T09:12:00+00:00",
    "text": "The new ReputationRadar dashboard feels laggy and the PDF export failed twice. Dev team please check your rollout.",
    "meta": {"likes": 14, "retweets": 1, "replies": 5, "quote_count": 0}
  }
]
11
community-contributions/Reputation_Radar/services/__init__.py
Normal file
@@ -0,0 +1,11 @@
"""Service layer exports for ReputationRadar."""

from . import llm, reddit_client, trustpilot_scraper, twitter_client, utils

__all__ = [
    "llm",
    "reddit_client",
    "trustpilot_scraper",
    "twitter_client",
    "utils",
]
147
community-contributions/Reputation_Radar/services/llm.py
Normal file
@@ -0,0 +1,147 @@
"""LLM sentiment analysis and summarization utilities."""

from __future__ import annotations

import json
import logging
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Sequence

try:  # pragma: no cover - optional dependency
    from openai import OpenAI
except ModuleNotFoundError:  # pragma: no cover
    OpenAI = None  # type: ignore[assignment]

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

from .utils import ServiceWarning, chunked

CLASSIFICATION_SYSTEM_PROMPT = "You are a precise brand-sentiment classifier. Output JSON only."
SUMMARY_SYSTEM_PROMPT = "You analyze brand chatter and produce concise, executive-ready summaries."


@dataclass
class SentimentResult:
    """Structured sentiment output."""

    label: str
    confidence: float


class LLMService:
    """Wrapper around OpenAI with VADER fallback."""

    def __init__(self, api_key: Optional[str], model: str = "gpt-4o-mini", batch_size: int = 20):
        self.batch_size = max(1, batch_size)
        self.model = model
        self.logger = logging.getLogger("services.llm")
        self._client: Optional[Any] = None
        self._analyzer = SentimentIntensityAnalyzer()
        if api_key and OpenAI is not None:
            try:
                self._client = OpenAI(api_key=api_key)
            except Exception as exc:  # noqa: BLE001
                self.logger.warning("Failed to initialize OpenAI client, using VADER fallback: %s", exc)
                self._client = None
        elif api_key and OpenAI is None:
            self.logger.warning("openai package not installed; falling back to VADER despite API key.")

    def available(self) -> bool:
        """Return whether OpenAI-backed features are available."""
        return self._client is not None

    def classify_sentiment_batch(self, texts: Sequence[str]) -> List[SentimentResult]:
        """Classify multiple texts, chunking if necessary."""
        if not texts:
            return []
        if not self.available():
            return [self._vader_sentiment(text) for text in texts]

        results: List[SentimentResult] = []
        for chunk in chunked(list(texts), self.batch_size):
            prompt_lines = [
                'Classify each item as "positive", "neutral", or "negative".',
                "Also output a confidence score between 0 and 1.",
                'Return an array of objects: [{"label": "...", "confidence": 0.0}].',
                "Items:",
            ]
            prompt_lines.extend([f"{idx + 1}) {text}" for idx, text in enumerate(chunk)])
            prompt = "\n".join(prompt_lines)
            try:
                response = self._client.responses.create(  # type: ignore[union-attr]
                    model=self.model,
                    input=[
                        {"role": "system", "content": CLASSIFICATION_SYSTEM_PROMPT},
                        {"role": "user", "content": prompt},
                    ],
                    temperature=0,
                    max_output_tokens=500,
                )
                output_text = self._extract_text(response)
                parsed = json.loads(output_text)
                for item in parsed:
                    results.append(
                        SentimentResult(
                            label=item.get("label", "neutral"),
                            confidence=float(item.get("confidence", 0.5)),
                        )
                    )
            except Exception as exc:  # noqa: BLE001
                self.logger.warning("Classification fallback to VADER due to error: %s", exc)
                for text in chunk:
                    results.append(self._vader_sentiment(text))
        # Ensure the output length matches the input
        if len(results) != len(texts):
            # Align by padding with neutral
            results.extend([SentimentResult(label="neutral", confidence=0.33)] * (len(texts) - len(results)))
        return results

    def summarize_overall(self, findings: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Create an executive summary using OpenAI."""
        if not self.available():
            raise ServiceWarning("OpenAI API key missing. Summary unavailable.")
        prompt_lines = [
            "Given these labeled items and their short rationales, write:",
            '- 5 bullet "Highlights"',
            '- 5 bullet "Risks & Concerns"',
            '- One-line "Overall Tone" (Positive/Neutral/Negative with brief justification)',
            '- 3 "Recommended Actions"',
            "Keep it under 180 words total. Be specific but neutral in tone.",
            "Items:",
        ]
        for idx, item in enumerate(findings, start=1):
            prompt_lines.append(
                f"{idx}) [{item.get('label', 'neutral').upper()}] {item.get('text', '')}"
            )
        prompt = "\n".join(prompt_lines)
        try:
            response = self._client.responses.create(  # type: ignore[union-attr]
                model=self.model,
                input=[
                    {"role": "system", "content": SUMMARY_SYSTEM_PROMPT},
                    {"role": "user", "content": prompt},
                ],
                temperature=0.2,
                max_output_tokens=800,
            )
            output_text = self._extract_text(response)
            return {"raw": output_text}
        except Exception as exc:  # noqa: BLE001
            self.logger.error("Failed to generate summary: %s", exc)
            raise ServiceWarning("Unable to generate executive summary at this time.") from exc

    def _vader_sentiment(self, text: str) -> SentimentResult:
        scores = self._analyzer.polarity_scores(text)
        compound = scores["compound"]
        if compound >= 0.2:
            label = "positive"
        elif compound <= -0.2:
            label = "negative"
        else:
            label = "neutral"
        confidence = min(1.0, max(0.0, abs(compound)))
        return SentimentResult(label=label, confidence=confidence)

    def _extract_text(self, response: Any) -> str:
        """Support multiple OpenAI client response shapes."""
        if hasattr(response, "output") and response.output:
            content = response.output[0].content[0]
            return getattr(content, "text", str(content))
        if hasattr(response, "choices"):
            return response.choices[0].message.content  # type: ignore[return-value]
        raise ValueError("Unknown response structure from OpenAI client.")
141
community-contributions/Reputation_Radar/services/reddit_client.py
Normal file
@@ -0,0 +1,141 @@
"""Reddit data collection service using PRAW."""

from __future__ import annotations

import time
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional

import praw
from praw.models import Comment, Submission

from .utils import (
    NormalizedItem,
    ServiceError,
    ServiceWarning,
    ensure_timezone,
    sanitize_text,
)


TIME_FILTER_MAP = {
    "24h": "day",
    "7d": "week",
    "30d": "month",
}


def _iter_submissions(subreddit: praw.models.Subreddit, query: str, limit: int, time_filter: str) -> Iterable[Submission]:
    return subreddit.search(query=query, sort="new", time_filter=time_filter, limit=limit * 3)


def _iter_comments(submission: Submission) -> Iterable[Comment]:
    submission.comments.replace_more(limit=0)
    return submission.comments.list()


def _normalize_submission(submission: Submission) -> NormalizedItem:
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return NormalizedItem(
        source="reddit",
        id=submission.id,
        url=f"https://www.reddit.com{submission.permalink}",
        author=str(submission.author) if submission.author else None,
        timestamp=ensure_timezone(created),
        text=f"{submission.title}\n\n{submission.selftext or ''}",
        meta={
            "score": submission.score,
            "num_comments": submission.num_comments,
            "subreddit": submission.subreddit.display_name,
            "type": "submission",
        },
    )


def _normalize_comment(comment: Comment, submission: Submission) -> NormalizedItem:
    created = datetime.fromtimestamp(comment.created_utc, tz=timezone.utc)
    return NormalizedItem(
        source="reddit",
        id=comment.id,
        url=f"https://www.reddit.com{comment.permalink}",
        author=str(comment.author) if comment.author else None,
        timestamp=ensure_timezone(created),
        text=comment.body,
        meta={
            "score": comment.score,
            "subreddit": submission.subreddit.display_name,
            "type": "comment",
            "submission_title": submission.title,
        },
    )


def fetch_mentions(
    brand: str,
    credentials: Dict[str, str],
    limit: int = 25,
    date_filter: str = "7d",
    min_upvotes: int = 0,
) -> List[NormalizedItem]:
    """Fetch recent Reddit submissions/comments mentioning the brand."""
    client_id = credentials.get("client_id")
    client_secret = credentials.get("client_secret")
    user_agent = credentials.get("user_agent")

    if not all([client_id, client_secret, user_agent]):
        raise ServiceWarning("Reddit credentials are missing. Provide them in the sidebar to enable this source.")

    try:
        reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent,
        )
        reddit.read_only = True
    except Exception as exc:  # noqa: BLE001
        raise ServiceError(f"Failed to initialize Reddit client: {exc}") from exc

    time_filter = TIME_FILTER_MAP.get(date_filter.lower(), "week")
    subreddit = reddit.subreddit("all")
    results: List[NormalizedItem] = []
    seen_ids: set[str] = set()
    try:
        for submission in _iter_submissions(subreddit, query=brand, limit=limit, time_filter=time_filter):
            if submission.id in seen_ids:
                continue
            if submission.score < min_upvotes:
                continue
            normalized_submission = _normalize_submission(submission)
            normalized_submission["text"] = sanitize_text(normalized_submission["text"])
            if normalized_submission["text"]:
                results.append(normalized_submission)
                seen_ids.add(submission.id)
            if len(results) >= limit:
                break

            # Fetch comments mentioning the brand
            match_count = 0
            for comment in _iter_comments(submission):
                if brand.lower() not in (comment.body or "").lower():
                    continue
                if comment.score < min_upvotes:
                    continue
                normalized_comment = _normalize_comment(comment, submission)
                normalized_comment["text"] = sanitize_text(normalized_comment["text"])
                if not normalized_comment["text"]:
                    continue
                if normalized_comment["id"] in seen_ids:
                    continue
                results.append(normalized_comment)
                seen_ids.add(normalized_comment["id"])
                match_count += 1
                if len(results) >= limit:
                    break
            if len(results) >= limit:
                break
            # Respect rate limits
            if match_count:
                time.sleep(1)
    except Exception as exc:  # noqa: BLE001
        raise ServiceError(f"Error while fetching Reddit data: {exc}") from exc
    return results
138
community-contributions/Reputation_Radar/services/trustpilot_scraper.py
Normal file
@@ -0,0 +1,138 @@
"""Trustpilot scraping service with polite crawling safeguards."""

from __future__ import annotations

import time
from datetime import datetime, timezone
from typing import List
from urllib.parse import urlencode
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

from .utils import (
    NormalizedItem,
    ServiceError,
    ServiceWarning,
    ensure_timezone,
    random_user_agent,
    sanitize_text,
)

BASE_URL = "https://www.trustpilot.com"
SEARCH_PATH = "/search"


class BlockedError(ServiceWarning):
    """Raised when Trustpilot blocks the scraping attempt."""


def _check_robots(user_agent: str) -> None:
    parser = RobotFileParser()
    parser.set_url(f"{BASE_URL}/robots.txt")
    parser.read()
    if not parser.can_fetch(user_agent, SEARCH_PATH):
        raise ServiceWarning(
            "Trustpilot robots.txt disallows scraping the search endpoint. "
            "Please use the official API or upload data manually."
        )


@retry(
    reraise=True,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=8),
    retry=retry_if_exception_type((requests.RequestException, BlockedError)),
)
def _fetch_page(session: requests.Session, user_agent: str, page: int, brand: str, language: str) -> str:
    params = {"query": brand, "page": page}
    if language:
        params["languages"] = language
    url = f"{BASE_URL}{SEARCH_PATH}?{urlencode(params)}"
    response = session.get(
        url,
        headers={"User-Agent": user_agent, "Accept-Language": language or "en"},
        timeout=20,
    )
    if response.status_code in (401, 403):
        raise BlockedError(f"Trustpilot denied access (HTTP {response.status_code}).")
    response.raise_for_status()
    return response.text


def _parse_reviews(html: str, user_agent: str) -> List[NormalizedItem]:
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select("article[data-service-review-card-layout]")
    items: List[NormalizedItem] = []
    now = datetime.now(timezone.utc)
    for card in cards:
        link = card.select_one("a.link_internal__YpiJI")
        url = f"{BASE_URL}{link['href']}" if link and link.get("href") else ""
        title_el = card.select_one("h2")
        title = title_el.get_text(strip=True) if title_el else ""
        text_el = card.select_one("[data-review-description-typography]")
        text = text_el.get_text(separator=" ", strip=True) if text_el else ""
        rating_el = card.select_one("img[alt*='stars']")
        rating = rating_el["alt"] if rating_el and rating_el.get("alt") else ""
        author_el = card.select_one("span.styles_consumerDetails__ZF4I6")
        author = author_el.get_text(strip=True) if author_el else None
        date_el = card.select_one("time")
        timestamp = now
        if date_el and date_el.get("datetime"):
            try:
                timestamp = datetime.fromisoformat(date_el["datetime"].replace("Z", "+00:00"))
            except ValueError:
                timestamp = now

        body = sanitize_text(f"{title}\n\n{text}")
        if len(body) < 15:
            continue
        items.append(
            NormalizedItem(
                source="trustpilot",
                id=card.get("data-review-id", str(hash(body))),
                url=url,
                author=author,
                timestamp=ensure_timezone(timestamp),
                text=body,
                meta={
                    "rating": rating,
                    "user_agent": user_agent,
                },
            )
        )
    return items


def fetch_reviews(brand: str, language: str = "en", pages: int = 2) -> List[NormalizedItem]:
    """Scrape Trustpilot search results for recent reviews."""
    if not brand:
        raise ServiceWarning("Brand name is required for Trustpilot scraping.")

    session = requests.Session()
    user_agent = random_user_agent()
    _check_robots(user_agent)

    aggregated: List[NormalizedItem] = []
    seen_ids: set[str] = set()

    for page in range(1, pages + 1):
        try:
            html = _fetch_page(session, user_agent=user_agent, page=page, brand=brand, language=language)
        except BlockedError as exc:
            raise ServiceWarning(
                "Trustpilot blocked the scraping attempt. Consider using their official API or providing CSV uploads."
            ) from exc
        except requests.RequestException as exc:  # noqa: BLE001
            raise ServiceError(f"Trustpilot request failed: {exc}") from exc
        page_items = _parse_reviews(html, user_agent)
        for item in page_items:
            if item["id"] in seen_ids:
                continue
            aggregated.append(item)
            seen_ids.add(item["id"])
        time.sleep(1.5)  # gentle crawl delay

    return aggregated
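

# Illustrative usage sketch (assumes outbound network access and that
# Trustpilot's robots.txt permits the crawl; otherwise a ServiceWarning is
# raised before any page is requested):
#
#     reviews = fetch_reviews("Acme", language="en", pages=1)
#     for review in reviews:
#         print(review["meta"]["rating"], "-", review["text"][:80])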
@@ -0,0 +1,98 @@
"""Twitter (X) data collection using the v2 recent search API."""

from __future__ import annotations

import time
from datetime import datetime, timezone
from typing import List, Optional

import requests

from .utils import NormalizedItem, ServiceError, ServiceWarning, ensure_timezone, sanitize_text

SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def _build_query(brand: str, language: str) -> str:
    terms = [brand]
    if language:
        terms.append(f"lang:{language}")
    return " ".join(terms)


def fetch_mentions(
    brand: str,
    bearer_token: Optional[str],
    limit: int = 25,
    min_likes: int = 0,
    language: str = "en",
) -> List[NormalizedItem]:
    """Fetch recent tweets mentioning the brand."""
    if not bearer_token:
        raise ServiceWarning(
            "Twitter bearer token not provided. Add it in the sidebar to enable Twitter ingestion."
        )

    headers = {
        "Authorization": f"Bearer {bearer_token}",
        "User-Agent": "ReputationRadar/1.0",
    }
    params = {
        "query": _build_query(brand, language),
        "max_results": max(10, min(100, limit)),  # the recent search endpoint accepts 10-100
        "tweet.fields": "author_id,created_at,lang,public_metrics",
        "expansions": "author_id",
        "user.fields": "name,username",
    }

    collected: List[NormalizedItem] = []
    next_token: Optional[str] = None

    while len(collected) < limit:
        if next_token:
            params["next_token"] = next_token
        response = requests.get(SEARCH_URL, headers=headers, params=params, timeout=15)
        if response.status_code == 401:
            raise ServiceWarning("Twitter API authentication failed. Please verify the bearer token.")
        if response.status_code == 429:
            time.sleep(5)
            continue
        if response.status_code >= 400:
            raise ServiceError(f"Twitter API error {response.status_code}: {response.text}")

        payload = response.json()
        data = payload.get("data", [])
        includes = payload.get("includes", {})
        users_index = {user["id"]: user for user in includes.get("users", [])}

        for tweet in data:
            created_at = datetime.fromisoformat(tweet["created_at"].replace("Z", "+00:00"))
            author_info = users_index.get(tweet["author_id"], {})
            item = NormalizedItem(
                source="twitter",
                id=tweet["id"],
                url=f"https://twitter.com/{author_info.get('username', '')}/status/{tweet['id']}",
                author=author_info.get("username"),
                timestamp=ensure_timezone(created_at),
                text=sanitize_text(tweet["text"]),
                meta={
                    "likes": tweet.get("public_metrics", {}).get("like_count", 0),
                    "retweets": tweet.get("public_metrics", {}).get("retweet_count", 0),
                    "replies": tweet.get("public_metrics", {}).get("reply_count", 0),
                    "quote_count": tweet.get("public_metrics", {}).get("quote_count", 0),
                },
            )
            if not item["text"]:
                continue
            if item["meta"]["likes"] < min_likes:
                continue
            collected.append(item)
            if len(collected) >= limit:
                break

        next_token = payload.get("meta", {}).get("next_token")
        if not next_token:
            break
        time.sleep(1)  # stay friendly to rate limits

    return collected[:limit]
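

# Illustrative usage sketch (the bearer token is a placeholder; free-tier rate
# limits are tight, so keep `limit` small):
#
#     tweets = fetch_mentions(
#         brand="Acme",
#         bearer_token="AAAA...",
#         limit=20,
#         min_likes=1,
#     )
#     most_liked = max(tweets, key=lambda item: item["meta"]["likes"], default=None)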
217
community-contributions/Reputation_Radar/services/utils.py
Normal file
@@ -0,0 +1,217 @@
"""Utility helpers for ReputationRadar services."""

from __future__ import annotations

import json
import logging
import random
import re
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Dict, Iterable, Iterator, List, Optional, Sequence, Tuple, TypedDict

from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz


LOG_FILE = Path(__file__).resolve().parents[1] / "logs" / "app.log"
MIN_TEXT_LENGTH = 15
SIMILARITY_THRESHOLD = 90


class NormalizedItem(TypedDict):
    """Canonical representation of a fetched mention."""

    source: str
    id: str
    url: str
    author: Optional[str]
    timestamp: datetime
    text: str
    meta: Dict[str, object]


class ServiceError(RuntimeError):
    """Raised when a service hard fails."""


class ServiceWarning(RuntimeError):
    """Raised for recoverable issues that should surface to the UI."""


def initialize_logger(name: str = "reputation_radar") -> logging.Logger:
    """Configure and return a module-level logger."""
    LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
        handlers=[
            logging.FileHandler(LOG_FILE, encoding="utf-8"),
            logging.StreamHandler(),
        ],
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    return logger


def load_sample_items(name: str) -> List[NormalizedItem]:
    """Load demo data from the samples directory."""
    samples_dir = Path(__file__).resolve().parents[1] / "samples"
    sample_path = samples_dir / f"{name}.json"
    if not sample_path.exists():
        return []
    with sample_path.open("r", encoding="utf-8") as handle:
        raw_items = json.load(handle)
    cleaned: List[NormalizedItem] = []
    for item in raw_items:
        try:
            cleaned.append(
                NormalizedItem(
                    source=item["source"],
                    id=str(item["id"]),
                    url=item.get("url", ""),
                    author=item.get("author"),
                    timestamp=datetime.fromisoformat(item["timestamp"]),
                    text=item["text"],
                    meta=item.get("meta", {}),
                )
            )
        except (KeyError, ValueError):
            continue
    return cleaned


def strip_html(value: str) -> str:
    """Remove HTML tags and normalize whitespace."""
    if not value:
        return ""
    soup = BeautifulSoup(value, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    text = re.sub(r"\s+", " ", text)
    text = text.encode("utf-8", "ignore").decode("utf-8", "ignore")
    return text.strip()


def sanitize_text(value: str) -> str:
    """Clean text and remove excessive noise."""
    text = strip_html(value)
    text = re.sub(r"http\S+", "", text)  # drop inline URLs
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()


def drop_short_items(items: Iterable[NormalizedItem], minimum_length: int = MIN_TEXT_LENGTH) -> List[NormalizedItem]:
    """Filter out items that are too short to analyze."""
    return [
        item
        for item in items
        if len(item["text"]) >= minimum_length
    ]


def fuzzy_deduplicate(items: Sequence[NormalizedItem], threshold: int = SIMILARITY_THRESHOLD) -> List[NormalizedItem]:
    """Remove duplicates based on URL or fuzzy text similarity."""
    seen_urls: set[str] = set()
    deduped: List[NormalizedItem] = []
    for item in items:
        url = item.get("url") or ""
        text = item.get("text") or ""
        if url and url in seen_urls:
            continue
        duplicate_found = False
        for existing in deduped:
            if not text or not existing.get("text"):
                continue
            if fuzz.token_set_ratio(text, existing["text"]) >= threshold:
                duplicate_found = True
                break
        if not duplicate_found:
            deduped.append(item)
            if url:
                seen_urls.add(url)
    return deduped


def normalize_items(items: Sequence[NormalizedItem]) -> List[NormalizedItem]:
    """Apply sanitization, deduplication, and drop noisy entries."""
    sanitized: List[NormalizedItem] = []
    for item in items:
        cleaned_text = sanitize_text(item.get("text", ""))
        if len(cleaned_text) < MIN_TEXT_LENGTH:
            continue
        sanitized.append(
            NormalizedItem(
                source=item["source"],
                id=item["id"],
                url=item.get("url", ""),
                author=item.get("author"),
                timestamp=item["timestamp"],
                text=cleaned_text,
                meta=item.get("meta", {}),
            )
        )
    return fuzzy_deduplicate(sanitized)


def parse_date_range(option: str) -> datetime:
    """Return a UTC timestamp threshold for the given range identifier."""
    now = datetime.now(timezone.utc)
    option = option.lower()
    delta = {
        "24h": timedelta(days=1),
        "7d": timedelta(days=7),
        "30d": timedelta(days=30),
    }.get(option, timedelta(days=7))
    return now - delta


def random_user_agent() -> str:
    """Return a random user-agent string for polite scraping."""
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_3) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/16.4 Safari/605.1.15",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    ]
    return random.choice(user_agents)


def chunked(iterable: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield successive chunks from the iterable."""
    for start in range(0, len(iterable), size):
        yield iterable[start : start + size]


def validate_openai_key(api_key: Optional[str]) -> Tuple[Optional[str], List[str]]:
    """Validate an OpenAI key following the guidance from the day 1 notebook."""
    warnings: List[str] = []
    if not api_key:
        warnings.append("No OpenAI API key detected. VADER fallback will be used.")
        return None, warnings
    if not api_key.startswith("sk-"):
        warnings.append(
            "Provided OpenAI API key does not start with the expected prefix (sk-)."
        )
    if api_key.strip() != api_key:
        warnings.append("OpenAI API key looks like it has leading or trailing whitespace.")
        api_key = api_key.strip()
    return api_key, warnings


def ensure_timezone(ts: datetime) -> datetime:
    """Guarantee timestamps are timezone-aware in UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def safe_int(value: Optional[object], default: int = 0) -> int:
    """Convert a value to int with a fallback."""
    try:
        return int(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return default
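

# Minimal sketch of the normalization pipeline (the item values are made up):
#
#     item = NormalizedItem(
#         source="reddit", id="demo", url="https://example.com/a", author=None,
#         timestamp=datetime.now(timezone.utc),
#         text="<p>Great   product, love it!</p> https://example.com", meta={},
#     )
#     normalize_items([item, item])  # -> one item with text "Great product, love it!"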
@@ -0,0 +1,6 @@
import pathlib
import sys

PROJECT_ROOT = pathlib.Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))
@@ -0,0 +1,19 @@
import pytest

from services import llm
from services.utils import ServiceWarning


def test_llm_fallback_uses_vader():
    service = llm.LLMService(api_key=None)
    results = service.classify_sentiment_batch(
        ["I absolutely love this product!", "This is the worst experience ever."]
    )
    assert results[0].label == "positive"
    assert results[1].label == "negative"


def test_summary_requires_openai_key():
    service = llm.LLMService(api_key=None)
    with pytest.raises(ServiceWarning):
        service.summarize_overall([{"label": "positive", "text": "Example"}])
35
community-contributions/Reputation_Radar/tests/test_utils.py
Normal file
@@ -0,0 +1,35 @@
import datetime as dt

from services import utils


def test_normalize_items_deduplicates():
    ts = dt.datetime(2025, 1, 1, tzinfo=dt.timezone.utc)
    items = [
        utils.NormalizedItem(
            source="reddit",
            id="1",
            url="https://example.com/a",
            author="alice",
            timestamp=ts,
            text="ReputationRadar is great!",
            meta={},
        ),
        utils.NormalizedItem(
            source="reddit",
            id="2",
            url="https://example.com/a",
            author="bob",
            timestamp=ts,
            text="ReputationRadar is great!",
            meta={},
        ),
    ]
    cleaned = utils.normalize_items(items)
    assert len(cleaned) == 1


def test_sanitize_text_removes_html():
    raw = "<p>Hello <strong>world</strong> <a href='https://example.com'>link</a></p>"
    cleaned = utils.sanitize_text(raw)
    assert cleaned == "Hello world link"
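

# A possible extra case (sketch): near-duplicate texts with different URLs
# should also collapse, via the fuzzy match rather than the URL check.
def test_normalize_items_collapses_near_duplicates():
    ts = dt.datetime(2025, 1, 1, tzinfo=dt.timezone.utc)
    items = [
        utils.NormalizedItem(
            source="reddit",
            id="1",
            url="https://example.com/a",
            author="alice",
            timestamp=ts,
            text="ReputationRadar is great!",
            meta={},
        ),
        utils.NormalizedItem(
            source="twitter",
            id="2",
            url="https://example.com/b",
            author="bob",
            timestamp=ts,
            text="I think ReputationRadar is great!",
            meta={},
        ),
    ]
    assert len(utils.normalize_items(items)) == 1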
221
community-contributions/wk1-day1-RBG-all-sites-jina.ipynb
Normal file
@@ -0,0 +1,221 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
   "metadata": {},
   "source": [
    "# My First Lab = My 1st Frontier LLM Project\n",
    "## Summarize All Websites without Selenium\n",
    "This simple \"app\" uses Jina (https://jina.ai/reader) to turn any website into markdown before an LLM summarizes it. As their website says: \"Convert a URL to LLM-friendly input, by simply adding r.jina.ai in front\". They have other tools that look useful too.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import os\n",
    "import requests  # added for jina\n",
    "from dotenv import load_dotenv\n",
    "# from scraper import fetch_website_contents  # not needed for jina\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables from a file called .env\n",
    "\n",
    "load_dotenv(override=True)\n",
    "api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "# Check the key\n",
    "\n",
    "if not api_key:\n",
    "    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
    "elif not api_key.startswith(\"sk-proj-\"):\n",
    "    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
    "elif api_key.strip() != api_key:\n",
    "    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
    "else:\n",
    "    print(\"API key found and looks good so far!\")\n",
    "\n",
    "# Setup access to the frontier model\n",
    "\n",
    "openai = OpenAI()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1-a: Define the user prompt\n",
    "\n",
    "user_prompt_prefix = \"\"\"\n",
    "Here are the contents of a website.\n",
    "Provide a short summary of this website.\n",
    "If it includes news or announcements, then summarize these too.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1-b: Define the system prompt\n",
    "\n",
    "system_prompt = \"\"\"\n",
    "You are a smart assistant that analyzes the contents of a website,\n",
    "and provides a short, clear summary, ignoring text that might be navigation related.\n",
    "Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add the website content to the user prompt\n",
    "\n",
    "def messages_for(website):\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": user_prompt_prefix + website}\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 5: Change the content utility to use jina\n",
    "\n",
    "def fetch_url_content(url):\n",
    "    jina_reader_url = f\"https://r.jina.ai/{url}\"\n",
    "    try:\n",
    "        response = requests.get(jina_reader_url)\n",
    "        response.raise_for_status()  # Raise an exception for HTTP errors\n",
    "        return response.text\n",
    "    except requests.exceptions.RequestException as e:\n",
    "        print(f\"Error fetching URL: {e}\")\n",
    "        return None\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 3: Call OpenAI & Step 4: print the result\n",
    "\n",
    "def summarize(url):\n",
    "    website = fetch_url_content(url)\n",
    "    if website is None:\n",
    "        return None  # the fetch already printed an error\n",
    "    response = openai.chat.completions.create(\n",
    "        model = \"gpt-5-nano\",\n",
    "        messages = messages_for(website)\n",
    "    )\n",
    "    summary = response.choices[0].message.content\n",
    "    return display(Markdown(summary))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "summarize(\"https://edwarddonner.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45d83403-a24c-44b5-84ac-961449b4008f",
   "metadata": {},
   "outputs": [],
   "source": [
    "summarize(\"https://cnn.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75e9fd40-b354-4341-991e-863ef2e59db7",
   "metadata": {},
   "outputs": [],
   "source": [
    "summarize(\"https://openai.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
   "metadata": {},
   "source": [
    "## Content Summary vs Technical Summary\n",
    "\n",
    "In my work, a technical summary of a website, or a group of websites, would be useful too: for example, does it render on the server (HTML) or in the browser (JavaScript), which content management system (CMS) was used, how many pages, how many outbound links, how many inbound links, and so on. Doing this exercise, I realized LLMs can help with analyzing content, but I may need other tools to count pages, links, and other specifications - the next cells sketch one such check.\n",
    "\n",
    "A \"Shout Out\" to whoever put \"Market_Research_Agent.ipynb\" in the Community-Contributions. It is a great example of using an LLM as a management consultant. I think Jina might help with this use case by offering web search results through an API to feed to your LLM. Here is the system prompt from that notebook, and I plan to use this format often.\n",
    "\n",
    "system_prompt = \"\"\"You are to act like a Mckinsey Consultant specializing in market research. \n",
    "1) You are to follow legal guidelines and never give immoral advice. \n",
    "2) Your job is to maximise profits for your clients by analysing their companies initiatives and giving out recommendations for newer initiatives.\\n \n",
    "3) Follow industry frameworks for reponses always give simple answers and stick to the point.\n",
    "4) If possible try to see what competitors exist and what market gap can your clients company exploit.\n",
    "5) Further more, USe SWOT, Porters 5 forces to summarize your recommendations, Give confidence score with every recommendations\n",
    "6) Try to give unique solutions by seeing what the market gap is, if market gap is ambiguious skip this step\n",
    "7) add an estimate of what rate the revenue of the comapany will increase at provided they follow the guidelines, give conservating estimates keeping in account non ideal conditions.\n",
    "8) if the website isnt of a company or data isnt available, give out an error message along the lines of more data required for analysis\"\"\""
   ]
  },
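  {
   "cell_type": "markdown",
   "id": "added-link-count-note",
   "metadata": {},
   "source": [
    "The next cell is a rough, illustrative sketch of that idea: it counts total and outbound links on a single page using BeautifulSoup (assumed to be installed, as elsewhere in this repo). It is not a crawler - there is no robots.txt handling or page discovery - just a first step beyond pure content summarization.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-link-count-cell",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: count links on a single page (not a full crawler)\n",
    "from urllib.parse import urlparse\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "def count_links(url):\n",
    "    html = requests.get(url, timeout=15).text\n",
    "    soup = BeautifulSoup(html, \"html.parser\")\n",
    "    hrefs = [a[\"href\"] for a in soup.find_all(\"a\", href=True)]\n",
    "    host = urlparse(url).netloc\n",
    "    outbound = [h for h in hrefs if urlparse(h).netloc not in (\"\", host)]\n",
    "    return {\"total_links\": len(hrefs), \"outbound_links\": len(outbound)}\n",
    "\n",
    "count_links(\"https://edwarddonner.com\")"
   ]
  }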
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
225
community-contributions/wk1-day2-RBG-all-sites-ollama.ipynb
Normal file
@@ -0,0 +1,225 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
   "metadata": {},
   "source": [
    "# Lab2: Local Open Source on My PC Project\n",
    "## Summarize All Websites without Selenium Using Open Source Models\n",
    "This builds on my app from yesterday, using Jina (https://jina.ai/reader) to turn any website into markdown before an LLM summarizes it. It uses Ollama to run open-source LLMs locally on my PC (Jina itself is not local, so to be fully local you might need to go back to Selenium for JavaScript sites).\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import os\n",
    "import requests\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup access to the Ollama models\n",
    "\n",
    "OLLAMA_BASE_URL = \"http://localhost:11434/v1\"\n",
    "\n",
    "ollama = OpenAI(base_url=OLLAMA_BASE_URL, api_key='ollama')\n"
   ]
  },
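  {
   "cell_type": "markdown",
   "id": "added-ollama-models-note",
   "metadata": {},
   "source": [
    "Optional sanity check: because Ollama exposes an OpenAI-compatible API, the same client can list the models that are already pulled locally. This assumes `ollama serve` is running on the default port; pull models first with e.g. `ollama pull llama3.2`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "added-ollama-models-cell",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List locally available models via the OpenAI-compatible endpoint\n",
    "for model in ollama.models.list().data:\n",
    "    print(model.id)"
   ]
  },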
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1-a: Define the user prompt\n",
    "\n",
    "user_prompt_prefix = \"\"\"\n",
    "Here are the contents of a website.\n",
    "Provide a short summary of this website.\n",
    "If it includes news or announcements, then summarize these too.\n",
    "Make recommendations for improvement\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1-b: Define the system prompt\n",
    "\n",
    "system_prompt = \"\"\"You are to act like a smart McKinsey consultant specializing in website analysis. \n",
    "1) You should provide a short, clear summary, ignoring text that might be navigation related.\n",
    "2) Follow the summary by making recommendations for improving the website so it is better at serving its purpose.\n",
    "3) Follow industry frameworks for responses; always give simple answers and stick to the point.\n",
    "4) If possible, try to group your recommendations, for example Grammar and Style, Clarity, Functional, etc.\n",
    "5) Give confidence scores with every recommendation.\n",
    "6) Always provide a summary of the website, explaining what it is.\n",
    "7) If you do not understand the website's purpose or have no improvement recommendations, give out an error message along the lines of 'more data required for analysis', or ask a follow-up question.\n",
    "8) Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\"\"\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add the website content to the user prompt\n",
    "\n",
    "def messages_for(website):\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": user_prompt_prefix + website}\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 5: Change the content utility to use jina\n",
    "\n",
    "def fetch_url_content(url):\n",
    "    jina_reader_url = f\"https://r.jina.ai/{url}\"\n",
    "    try:\n",
    "        response = requests.get(jina_reader_url)\n",
    "        response.raise_for_status()  # Raise an exception for HTTP errors\n",
    "        return response.text\n",
    "    except requests.exceptions.RequestException as e:\n",
    "        print(f\"Error fetching URL: {e}\")\n",
    "        return None\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 3: Call Ollama model & Step 4: print the result\n",
    "\n",
    "def summarize(url):\n",
    "    website = fetch_url_content(url)\n",
    "    if website is None:\n",
    "        return None  # the fetch already printed an error\n",
    "    response = ollama.chat.completions.create(\n",
    "        model = omodel,\n",
    "        messages = messages_for(website)\n",
    "    )\n",
    "    summary = response.choices[0].message.content\n",
    "    return display(Markdown(summary))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "omodel = \"llama3.2\"\n",
    "summarize(\"https://edwarddonner.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75df7e70",
   "metadata": {},
   "outputs": [],
   "source": [
    "omodel = \"deepseek-r1:1.5b\"\n",
    "summarize(\"https://edwarddonner.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45d83403-a24c-44b5-84ac-961449b4008f",
   "metadata": {},
   "outputs": [],
   "source": [
    "omodel = \"llama3.2\"\n",
    "summarize(\"https://cnn.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be133029",
   "metadata": {},
   "outputs": [],
   "source": [
    "omodel = \"deepseek-r1:1.5b\"\n",
    "summarize(\"https://cnn.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75e9fd40-b354-4341-991e-863ef2e59db7",
   "metadata": {},
   "outputs": [],
   "source": [
    "omodel = \"llama3.2\"\n",
    "summarize(\"https://openai.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8d1a0ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "omodel = \"deepseek-r1:1.5b\"\n",
    "summarize(\"https://openai.com\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
571
week1/community-contributions/day1_email_secretary.ipynb
Normal file
@@ -0,0 +1,571 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
   "metadata": {},
   "source": [
    "# YOUR FIRST LAB\n",
    "### Please read this section. This is valuable to get you prepared, even if it's a long read -- it's important stuff.\n",
    "\n",
    "### Also, be sure to read [README.md](../README.md)! More info about the updated videos in the README and [top of the course resources in purple](https://edwarddonner.com/2024/11/13/llm-engineering-resources/)\n",
    "\n",
    "## Your first Frontier LLM Project\n",
    "\n",
    "By the end of this course, you will have built an autonomous Agentic AI solution with 7 agents that collaborate to solve a business problem. All in good time! We will start with something smaller...\n",
    "\n",
    "Our goal is to code a new kind of Web Browser. Give it a URL, and it will respond with a summary. The Reader's Digest of the internet!!\n",
    "\n",
    "Before starting, you should have completed the setup linked in the README.\n",
    "\n",
    "### If you're new to working in \"Notebooks\" (also known as Labs or Jupyter Lab)\n",
    "\n",
    "Welcome to the wonderful world of Data Science experimentation! Simply click in each \"cell\" with code in it, such as the cell immediately below this text, and hit Shift+Return to execute that cell. Be sure to run every cell, starting at the top, in order.\n",
    "\n",
    "Please look in the [Guides folder](../guides/01_intro.ipynb) for all the guides.\n",
    "\n",
    "## I am here to help\n",
    "\n",
    "If you have any problems at all, please do reach out. \n",
    "I'm available through the platform, or at ed@edwarddonner.com, or at https://www.linkedin.com/in/eddonner/ if you'd like to connect (and I love connecting!) \n",
    "And this is new to me, but I'm also trying out X at [@edwarddonner](https://x.com/edwarddonner) - if you're on X, please show me how it's done 😂 \n",
    "\n",
    "## More troubleshooting\n",
    "\n",
    "Please see the [troubleshooting](../setup/troubleshooting.ipynb) notebook in the setup folder to diagnose and fix common problems. At the very end of it is a diagnostics script with some useful debug info.\n",
    "\n",
    "## If this is old hat!\n",
    "\n",
    "If you're already comfortable with today's material, please hang in there; you can move swiftly through the first few labs - we will get much more in depth as the weeks progress. Ultimately we will fine-tune our own LLM to compete with OpenAI!\n",
    "\n",
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../assets/important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#900;\">Please read - important note</h2>\n",
    "            <span style=\"color:#900;\">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations. If you have a Github account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...</span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>\n",
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../assets/resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#f71;\">This code is a live resource - keep an eye out for my emails</h2>\n",
    "            <span style=\"color:#f71;\">I push updates to the code regularly. As people ask questions, I add more examples or improved commentary. As a result, you'll notice that the code below isn't identical to the videos. Everything from the videos is here; but I've also added better explanations and new models like DeepSeek. Consider this like an interactive book.<br/><br/>\n",
    "            I try to send emails regularly with important updates related to the course. You can find this in the 'Announcements' section of Udemy in the left sidebar. You can also choose to receive my emails via your Notification Settings in Udemy. I'm respectful of your inbox and always try to add value with my emails!\n",
    "            </span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>\n",
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../assets/business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#181;\">Business value of these exercises</h2>\n",
    "            <span style=\"color:#181;\">A final thought. While I've designed these notebooks to be educational, I've also tried to make them enjoyable. We'll do fun things like have LLMs tell jokes and argue with each other. But fundamentally, my goal is to teach skills you can apply in business. I'll explain business implications as we go, and it's worth keeping this in mind: as you build experience with models and techniques, think of ways you could put this into action at work today. Please do contact me if you'd like to discuss more or if you have ideas to bounce off me.</span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83f28feb",
   "metadata": {},
   "source": [
    "### If necessary, install Cursor Extensions\n",
    "\n",
    "1. From the View menu, select Extensions\n",
    "2. Search for Python\n",
    "3. Click on \"Python\" made by \"ms-python\" and select Install if not already installed\n",
    "4. Search for Jupyter\n",
    "5. Click on \"Jupyter\" made by \"ms-toolsai\" and select Install if not already installed\n",
    "\n",
    "\n",
    "### Next Select the Kernel\n",
    "\n",
    "Click on \"Select Kernel\" on the Top Right\n",
    "\n",
    "Choose \"Python Environments...\"\n",
    "\n",
    "Then choose the one that looks like `.venv (Python 3.12.x) .venv/bin/python` - it should be marked as \"Recommended\" and have a big star next to it.\n",
    "\n",
    "Any problems with this? Head over to the troubleshooting.\n",
    "\n",
    "### Note: you'll need to set the Kernel with every notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import os\n",
    "from dotenv import load_dotenv\n",
    "from scraper import fetch_website_contents\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n",
    "\n",
    "# If you get an error running this cell, then please head over to the troubleshooting notebook!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6900b2a8-6384-4316-8aaa-5e519fca4254",
   "metadata": {},
   "source": [
    "# Connecting to OpenAI (or Ollama)\n",
    "\n",
    "The next cell is where we load in the environment variables in your `.env` file and connect to OpenAI. \n",
    "\n",
    "If you'd like to use free Ollama instead, please see the README section \"Free Alternative to Paid APIs\", and if you're not sure how to do this, there's a full solution in the solutions folder (day1_with_ollama.ipynb).\n",
    "\n",
    "## Troubleshooting if you have problems:\n",
    "\n",
    "If you get a \"Name Error\" - have you run all cells from the top down? Head over to the Python Foundations guide for a bulletproof way to find and fix all Name Errors.\n",
    "\n",
    "If that doesn't fix it, head over to the [troubleshooting](../setup/troubleshooting.ipynb) notebook for step by step code to identify the root cause and fix it!\n",
    "\n",
    "Or, contact me! Message me or email ed@edwarddonner.com and we will get this to work.\n",
    "\n",
    "Any concerns about API costs? See my notes in the README - costs should be minimal, and you can control it at every point. You can also use Ollama as a free alternative, which we discuss during Day 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables in a file called .env\n",
    "\n",
    "load_dotenv(override=True)\n",
    "api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "# Check the key\n",
    "\n",
    "if not api_key:\n",
    "    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
    "elif not api_key.startswith(\"sk-proj-\"):\n",
    "    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
    "elif api_key.strip() != api_key:\n",
    "    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
    "else:\n",
    "    print(\"API key found and looks good so far!\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "442fc84b-0815-4f40-99ab-d9a5da6bda91",
   "metadata": {},
   "source": [
    "# Let's make a quick call to a Frontier model to get started, as a preview!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a58394bf-1e45-46af-9bfd-01e24da6f49a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.\n",
    "\n",
    "message = \"Hello, GPT! This is my first ever message to you! Hi!\"\n",
    "\n",
    "messages = [{\"role\": \"user\", \"content\": message}]\n",
    "\n",
    "messages\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08330159",
   "metadata": {},
   "outputs": [],
   "source": [
    "openai = OpenAI()\n",
    "\n",
    "response = openai.chat.completions.create(model=\"gpt-5-nano\", messages=messages)\n",
    "response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2aa190e5-cb31-456a-96cc-db109919cd78",
   "metadata": {},
   "source": [
    "## OK onwards with our first project"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's try out this utility\n",
    "\n",
    "ed = fetch_website_contents(\"https://edwarddonner.com\")\n",
    "print(ed)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a478a0c-2c53-48ff-869c-4d08199931e1",
   "metadata": {},
   "source": [
    "## Types of prompts\n",
    "\n",
    "You may know this already - but if not, you will get very familiar with it!\n",
    "\n",
    "Models like GPT have been trained to receive instructions in a particular way.\n",
    "\n",
    "They expect to receive:\n",
    "\n",
    "**A system prompt** that tells them what task they are performing and what tone they should use\n",
    "\n",
    "**A user prompt** -- the conversation starter that they should reply to"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'\n",
    "\n",
    "system_prompt = \"\"\"\n",
    "You are a snarky assistant that analyzes the contents of a website,\n",
    "and provides a short, snarky, humorous summary, ignoring text that might be navigation related.\n",
    "Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define our user prompt\n",
    "\n",
    "user_prompt_prefix = \"\"\"\n",
    "Here are the contents of a website.\n",
    "Provide a short summary of this website.\n",
    "If it includes news or announcements, then summarize these too.\n",
    "\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea211b5f-28e1-4a86-8e52-c0b7677cadcc",
   "metadata": {},
   "source": [
    "## Messages\n",
    "\n",
    "The API from OpenAI expects to receive messages in a particular structure.\n",
    "Many of the other APIs share this structure:\n",
    "\n",
    "```python\n",
    "[\n",
    "    {\"role\": \"system\", \"content\": \"system message goes here\"},\n",
    "    {\"role\": \"user\", \"content\": \"user message goes here\"}\n",
    "]\n",
    "```\n",
    "To give you a preview, the next 2 cells make a rather simple call - we won't stretch the mighty GPT (yet!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f25dcd35-0cd0-4235-9f64-ac37ed9eaaa5",
   "metadata": {},
   "outputs": [],
   "source": [
    "messages = [\n",
    "    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
    "    {\"role\": \"user\", \"content\": \"What is 2 + 2?\"}\n",
    "]\n",
    "\n",
    "response = openai.chat.completions.create(model=\"gpt-4.1-nano\", messages=messages)\n",
    "response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d06e8d78-ce4c-4b05-aa8e-17050c82bb47",
   "metadata": {},
   "source": [
    "## And now let's build useful messages for GPT-4.1-mini, using a function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
   "metadata": {},
   "outputs": [],
   "source": [
    "# See how this function creates exactly the format above\n",
    "\n",
    "def messages_for(website):\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": user_prompt_prefix + website}\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "36478464-39ee-485c-9f3f-6a4e458dbc9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Try this out, and then try for a few more websites\n",
    "\n",
    "messages_for(ed)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
   "metadata": {},
   "source": [
    "## Time to bring it together - the API for OpenAI is very simple!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
   "metadata": {},
   "outputs": [],
   "source": [
    "# And now: call the OpenAI API. You will get very familiar with this!\n",
    "\n",
    "def summarize(url):\n",
    "    website = fetch_website_contents(url)\n",
    "    response = openai.chat.completions.create(\n",
    "        model = \"gpt-4.1-mini\",\n",
    "        messages = messages_for(website)\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "summarize(\"https://edwarddonner.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d926d59-450e-4609-92ba-2d6f244f1342",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A function to display this nicely in the output, using markdown\n",
    "\n",
    "def display_summary(url):\n",
    "    summary = summarize(url)\n",
    "    display(Markdown(summary))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3018853a-445f-41ff-9560-d925d1774b2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://edwarddonner.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3bcf6f4-adce-45e9-97ad-d9a5d7a3a624",
   "metadata": {},
   "source": [
    "# Let's try more websites\n",
    "\n",
    "Note that this will only work on websites that can be scraped using this simplistic approach.\n",
    "\n",
    "Websites that are rendered with JavaScript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)\n",
    "\n",
    "Also, websites protected with CloudFront (and similar) may give 403 errors - many thanks Andy J for pointing this out.\n",
    "\n",
    "But many websites will work just fine!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45d83403-a24c-44b5-84ac-961449b4008f",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://cnn.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75e9fd40-b354-4341-991e-863ef2e59db7",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://anthropic.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c951be1a-7f1b-448f-af1f-845978e47e2c",
   "metadata": {},
   "source": [
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../assets/business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#181;\">Business applications</h2>\n",
    "            <span style=\"color:#181;\">In this exercise, you experienced calling the Cloud API of a Frontier Model (a leading model at the frontier of AI) for the first time. We will be using APIs like OpenAI at many stages in the course, in addition to building our own LLMs.\n",
    "\n",
    "More specifically, we've applied this to Summarization - a classic Gen AI use case to make a summary. This can be applied to any business vertical - summarizing the news, summarizing financial performance, summarizing a resume in a cover letter - the applications are limitless. Consider how you could apply Summarization in your business, and try prototyping a solution.</span>\n",
    "        </td>\n",
    "    </tr>\n",
    "</table>\n",
    "\n",
    "<table style=\"margin: 0; text-align: left;\">\n",
    "    <tr>\n",
    "        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
    "            <img src=\"../assets/important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
    "        </td>\n",
    "        <td>\n",
    "            <h2 style=\"color:#900;\">Before you continue - now try yourself</h2>\n",
    "            <span style=\"color:#900;\">Use the cell below to make your own simple commercial example. Stick with the summarization use case for now. Here's an idea: write something that will take the contents of an email, and will suggest an appropriate short subject line for the email. That's the kind of feature that might be built into a commercial email tool.</span>\n",
|
||||||
|
" </td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
"</table>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Step 1: Create your prompts\n",
|
||||||
|
"\n",
|
||||||
|
"system_prompt = \"\"\"You are my personal secretary. You will review an email and summarize the content. Write a summary and add a response to the sender.\n",
|
||||||
|
"\"\"\"\n",
|
||||||
|
"user_prompt = \"\"\"\n",
|
||||||
|
" Here are the contents of an email:\n",
|
||||||
|
" ***Insert Email Here***\n",
|
||||||
|
"\n",
|
||||||
|
" .\n",
|
||||||
|
" \n",
|
||||||
|
" \n",
|
||||||
|
" \n",
|
||||||
|
" Write a summary and with bullet points of the key topics of the email.\n",
|
||||||
|
" Structure the summary with Date, Time and name of Sender on the Top right hand corner.\n",
|
||||||
|
" After the summary, add triple spaces and write a response to the sender indicating receipt of email and suggest some valid responses.\n",
|
||||||
|
" Highlight the response with all caps.\n",
|
||||||
|
"\n",
|
||||||
|
"\"\"\"\n",
|
||||||
|
"\n",
|
||||||
|
"# Step 2: Make the messages list\n",
|
||||||
|
"\n",
|
||||||
|
"messages = [{\"role\":\"system\" , \"content\": system_prompt},\n",
|
||||||
|
"{\"role\":\"user\", \"content\":user_prompt}] # fill this in\n",
|
||||||
|
"# Step 3: Call OpenAI\n",
|
||||||
|
"response =openai.chat.completions.create(\n",
|
||||||
|
" model=\"gpt-4.1-mini\",\n",
|
||||||
|
" messages=messages)\n",
|
||||||
|
"\n",
|
||||||
|
"# Step 4: print the result\n",
|
||||||
|
"print(response.choices[0].message.content)"
]
},
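{
"cell_type": "markdown",
"id": "a0b1c2d3",
"metadata": {},
"source": [
"### Added example: wrapping the email summarizer in a function\n",
"A minimal sketch that packages the one-off call above into a reusable function, mirroring the `summarize(url)` pattern earlier in this notebook. `summarize_email` is a hypothetical name added for illustration; it assumes the `openai` client and `system_prompt` from the cells above, and the sample email is made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1c2d3e4",
"metadata": {},
"outputs": [],
"source": [
"def summarize_email(email_text):\n",
"    # Reuse the secretary system prompt and pass the raw email as the user message\n",
"    messages = [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": f\"Here are the contents of an email:\\n{email_text}\"}\n",
"    ]\n",
"    response = openai.chat.completions.create(model=\"gpt-4.1-mini\", messages=messages)\n",
"    return response.choices[0].message.content\n",
"\n",
"# Hypothetical sample email, just to exercise the function\n",
"print(summarize_email(\"Hi team, the quarterly review moves to Friday 3pm. Please send your slides by Thursday noon. - Dana\"))"
]
},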
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## An extra exercise for those who enjoy web scraping\n",
"\n",
"You may notice that if you try `display_summary(\"https://openai.com\")` - it doesn't work! That's because OpenAI has a fancy website that uses Javascript. There are many ways around this that some of you might be familiar with. For example, Selenium is a hugely popular framework that runs a browser behind the scenes, renders the page, and allows you to query it. If you have experience with Selenium, Playwright or similar, then feel free to improve the Website class to use them. In the community-contributions folder, you'll find an example Selenium solution from a student (thank you!)"
]
},
{
"cell_type": "markdown",
"id": "eeab24dc-5f90-4570-b542-b0585aca3eb6",
"metadata": {},
"source": [
"# Sharing your code\n",
"\n",
"I'd love it if you share your code afterwards so I can share it with others! You'll notice that some students have already made changes (including a Selenium implementation) which you will find in the community-contributions folder. If you'd like to add your changes to that folder, submit a Pull Request with your new versions in that folder and I'll merge your changes.\n",
"\n",
"If you're not an expert with git (and I am not!) then GPT has given some nice instructions on how to submit a Pull Request. It's a bit of an involved process, but once you've done it once it's pretty clear. As a pro-tip: it's best if you clear the outputs of your Jupyter notebooks (Edit >> Clean outputs of all cells, and then Save) for clean notebooks.\n",
"\n",
"Here are good instructions courtesy of an AI friend:  \n",
"https://chatgpt.com/share/677a9cb5-c64c-8012-99e0-e06e88afd293"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4484fcf-8b39-4c3f-9674-37970ed71988",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
563
week1/community-contributions/slmslm333221/day1.ipynb
Normal file
@@ -0,0 +1,563 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# YOUR FIRST LAB\n",
"### Please read this section. This is valuable to get you prepared, even if it's a long read -- it's important stuff.\n",
"\n",
"### Also, be sure to read [README.md](../README.md)! More info about the updated videos in the README and [top of the course resources in purple](https://edwarddonner.com/2024/11/13/llm-engineering-resources/)\n",
"\n",
"## Your first Frontier LLM Project\n",
"\n",
"By the end of this course, you will have built an autonomous Agentic AI solution with 7 agents that collaborate to solve a business problem. All in good time! We will start with something smaller...\n",
"\n",
"Our goal is to code a new kind of Web Browser. Give it a URL, and it will respond with a summary. The Reader's Digest of the internet!!\n",
"\n",
"Before starting, you should have completed the setup linked in the README.\n",
"\n",
"### If you're new to working in \"Notebooks\" (also known as Labs or Jupyter Lab)\n",
"\n",
"Welcome to the wonderful world of Data Science experimentation! Simply click in each \"cell\" with code in it, such as the cell immediately below this text, and hit Shift+Return to execute that cell. Be sure to run every cell, starting at the top, in order.\n",
"\n",
"Please look in the [Guides folder](../guides/01_intro.ipynb) for all the guides.\n",
"\n",
"## I am here to help\n",
"\n",
"If you have any problems at all, please do reach out.  \n",
"I'm available through the platform, or at ed@edwarddonner.com, or at https://www.linkedin.com/in/eddonner/ if you'd like to connect (and I love connecting!)  \n",
"And this is new to me, but I'm also trying out X at [@edwarddonner](https://x.com/edwarddonner) - if you're on X, please show me how it's done 😂  \n",
"\n",
"## More troubleshooting\n",
"\n",
"Please see the [troubleshooting](../setup/troubleshooting.ipynb) notebook in the setup folder to diagnose and fix common problems. At the very end of it is a diagnostics script with some useful debug info.\n",
"\n",
"## If this is old hat!\n",
"\n",
"If you're already comfortable with today's material, please hang in there; you can move swiftly through the first few labs - we will get much more in depth as the weeks progress. Ultimately we will fine-tune our own LLM to compete with OpenAI!\n",
"\n",
"<table style=\"margin: 0; text-align: left;\">\n",
"    <tr>\n",
"        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
"            <img src=\"../assets/important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
"        </td>\n",
"        <td>\n",
"            <h2 style=\"color:#900;\">Please read - important note</h2>\n",
"            <span style=\"color:#900;\">The way I collaborate with you may be different to other courses you've taken. I prefer not to type code while you watch. Rather, I execute Jupyter Labs, like this, and give you an intuition for what's going on. My suggestion is that you carefully execute this yourself, <b>after</b> watching the lecture. Add print statements to understand what's going on, and then come up with your own variations. If you have a Github account, use this to showcase your variations. Not only is this essential practice, but it demonstrates your skills to others, including perhaps future clients or employers...</span>\n",
"        </td>\n",
"    </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
"    <tr>\n",
"        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
"            <img src=\"../assets/resources.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
"        </td>\n",
"        <td>\n",
"            <h2 style=\"color:#f71;\">This code is a live resource - keep an eye out for my emails</h2>\n",
"            <span style=\"color:#f71;\">I push updates to the code regularly. As people ask questions, I add more examples or improved commentary. As a result, you'll notice that the code below isn't identical to the videos. Everything from the videos is here; but I've also added better explanations and new models like DeepSeek. Consider this like an interactive book.<br/><br/>\n",
"            I try to send emails regularly with important updates related to the course. You can find this in the 'Announcements' section of Udemy in the left sidebar. You can also choose to receive my emails via your Notification Settings in Udemy. I'm respectful of your inbox and always try to add value with my emails!\n",
"            </span>\n",
"        </td>\n",
"    </tr>\n",
"</table>\n",
"<table style=\"margin: 0; text-align: left;\">\n",
"    <tr>\n",
"        <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
"            <img src=\"../assets/business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
"        </td>\n",
"        <td>\n",
"            <h2 style=\"color:#181;\">Business value of these exercises</h2>\n",
"            <span style=\"color:#181;\">A final thought. While I've designed these notebooks to be educational, I've also tried to make them enjoyable. We'll do fun things like have LLMs tell jokes and argue with each other. But fundamentally, my goal is to teach skills you can apply in business. I'll explain business implications as we go, and it's worth keeping this in mind: as you build experience with models and techniques, think of ways you could put this into action at work today. Please do contact me if you'd like to discuss more or if you have ideas to bounce off me.</span>\n",
"        </td>\n",
"    </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "83f28feb",
"metadata": {},
"source": [
"### If necessary, install Cursor Extensions\n",
"\n",
"1. From the View menu, select Extensions\n",
"2. Search for Python\n",
"3. Click on \"Python\" made by \"ms-python\" and select Install if not already installed\n",
"4. Search for Jupyter\n",
"5. Click on \"Jupyter\" made by \"ms-toolsai\" and select Install if not already installed\n",
"\n",
"### Next Select the Kernel\n",
"\n",
"Click on \"Select Kernel\" on the Top Right\n",
"\n",
"Choose \"Python Environments...\"\n",
"\n",
"Then choose the one that looks like `.venv (Python 3.12.x) .venv/bin/python` - it should be marked as \"Recommended\" and have a big star next to it.\n",
"\n",
"Any problems with this? Head over to the troubleshooting.\n",
"\n",
"### Note: you'll need to set the Kernel with every notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import sys\n",
"from pathlib import Path\n",
"sys.path.append(str(Path(r\"..\\..\").resolve()))\n",
"from dotenv import load_dotenv\n",
"from scraper import fetch_website_contents\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"\n",
"# If you get an error running this cell, then please head over to the troubleshooting notebook!"
]
},
{
"cell_type": "markdown",
"id": "6900b2a8-6384-4316-8aaa-5e519fca4254",
"metadata": {},
"source": [
"# Connecting to OpenAI (or Ollama)\n",
"\n",
"The next cell is where we load in the environment variables in your `.env` file and connect to OpenAI.  \n",
"\n",
"If you'd like to use free Ollama instead, please see the README section \"Free Alternative to Paid APIs\", and if you're not sure how to do this, there's a full solution in the solutions folder (day1_with_ollama.ipynb).\n",
"\n",
"## Troubleshooting if you have problems:\n",
"\n",
"If you get a \"Name Error\" - have you run all cells from the top down? Head over to the Python Foundations guide for a bulletproof way to find and fix all Name Errors.\n",
"\n",
"If that doesn't fix it, head over to the [troubleshooting](../setup/troubleshooting.ipynb) notebook for step by step code to identify the root cause and fix it!\n",
"\n",
"Or, contact me! Message me or email ed@edwarddonner.com and we will get this to work.\n",
"\n",
"Any concerns about API costs? See my notes in the README - costs should be minimal, and you can control it at every point. You can also use Ollama as a free alternative, which we discuss during Day 2."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
"    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
"    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
"    print(\"API key found and looks good so far!\")\n"
]
},
{
"cell_type": "markdown",
"id": "442fc84b-0815-4f40-99ab-d9a5da6bda91",
"metadata": {},
"source": [
"# Let's make a quick call to a Frontier model to get started, as a preview!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a58394bf-1e45-46af-9bfd-01e24da6f49a",
"metadata": {},
"outputs": [],
"source": [
"# To give you a preview -- calling OpenAI with these messages is this easy. Any problems, head over to the Troubleshooting notebook.\n",
"\n",
"message = \"Hello, GPT! This is my first ever message to you! Hi!\"\n",
"\n",
"messages = [{\"role\": \"user\", \"content\": message}]\n",
"\n",
"messages\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08330159",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"response = openai.chat.completions.create(model=\"gpt-5-nano\", messages=messages)\n",
"response.choices[0].message.content"
]
},
{
"cell_type": "markdown",
"id": "2aa190e5-cb31-456a-96cc-db109919cd78",
"metadata": {},
"source": [
"## OK onwards with our first project"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
"metadata": {},
"outputs": [],
"source": [
"# Let's try out this utility\n",
"\n",
"ed = fetch_website_contents(\"https://edwarddonner.com\")\n",
"print(ed)"
]
},
{
"cell_type": "markdown",
"id": "6a478a0c-2c53-48ff-869c-4d08199931e1",
"metadata": {},
"source": [
"## Types of prompts\n",
"\n",
"You may know this already - but if not, you will get very familiar with it!\n",
"\n",
"Models like GPT have been trained to receive instructions in a particular way.\n",
"\n",
"They expect to receive:\n",
"\n",
"**A system prompt** that tells them what task they are performing and what tone they should use\n",
"\n",
"**A user prompt** -- the conversation starter that they should reply to"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Define our system prompt - you can experiment with this later, changing the last sentence to 'Respond in markdown in Spanish.'\n",
"\n",
"system_prompt = \"\"\"\n",
"You are a snarky assistant that analyzes the contents of a website,\n",
"and provides a short, snarky, humorous summary, ignoring text that might be navigation related.\n",
"Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# Define our user prompt\n",
"\n",
"user_prompt_prefix = \"\"\"\n",
"Here are the contents of a website.\n",
"Provide a short summary of this website.\n",
"If it includes news or announcements, then summarize these too.\n",
"\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "ea211b5f-28e1-4a86-8e52-c0b7677cadcc",
"metadata": {},
"source": [
"## Messages\n",
"\n",
"The API from OpenAI expects to receive messages in a particular structure.\n",
"Many of the other APIs share this structure:\n",
"\n",
"```python\n",
"[\n",
"    {\"role\": \"system\", \"content\": \"system message goes here\"},\n",
"    {\"role\": \"user\", \"content\": \"user message goes here\"}\n",
"]\n",
"```\n",
"To give you a preview, the next 2 cells make a rather simple call - we won't stretch the mighty GPT (yet!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f25dcd35-0cd0-4235-9f64-ac37ed9eaaa5",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
"    {\"role\": \"system\", \"content\": \"You are a helpful, far too polite assistant trying to sell more services with every contact\"},\n",
"    {\"role\": \"user\", \"content\": \"What is 2 + 2?\"}\n",
"]\n",
"\n",
"response = openai.chat.completions.create(model=\"gpt-4.1-nano\", messages=messages)\n",
"response.choices[0].message.content"
]
},
{
"cell_type": "markdown",
"id": "d06e8d78-ce4c-4b05-aa8e-17050c82bb47",
"metadata": {},
"source": [
"## And now let's build useful messages for GPT-4.1-mini, using a function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# See how this function creates exactly the format above\n",
"\n",
"def messages_for(website):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_prompt_prefix + website}\n",
"    ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "36478464-39ee-485c-9f3f-6a4e458dbc9c",
"metadata": {},
"outputs": [],
"source": [
"# Try this out, and then try for a few more websites\n",
"\n",
"messages_for(ed)"
]
},
{
"cell_type": "markdown",
"id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
"metadata": {},
"source": [
"## Time to bring it together - the API for OpenAI is very simple!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# And now: call the OpenAI API. You will get very familiar with this!\n",
|
||||||
|
"\n",
|
||||||
|
"def summarize(url):\n",
|
||||||
|
" website = fetch_website_contents(url)\n",
|
||||||
|
" response = openai.chat.completions.create(\n",
|
||||||
|
" model = \"gpt-4.1-mini\",\n",
|
||||||
|
" messages = messages_for(website)\n",
|
||||||
|
" )\n",
|
||||||
|
" return response.choices[0].message.content"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"summarize(\"https://edwarddonner.com\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "3d926d59-450e-4609-92ba-2d6f244f1342",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# A function to display this nicely in the output, using markdown\n",
|
||||||
|
"\n",
|
||||||
|
"def display_summary(url):\n",
|
||||||
|
" summary = summarize(url)\n",
|
||||||
|
" display(Markdown(summary))"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "3018853a-445f-41ff-9560-d925d1774b2f",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"display_summary(\"https://edwarddonner.com\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "b3bcf6f4-adce-45e9-97ad-d9a5d7a3a624",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Let's try more websites\n",
|
||||||
|
"\n",
|
||||||
|
"Note that this will only work on websites that can be scraped using this simplistic approach.\n",
|
||||||
|
"\n",
|
||||||
|
"Websites that are rendered with Javascript, like React apps, won't show up. See the community-contributions folder for a Selenium implementation that gets around this. You'll need to read up on installing Selenium (ask ChatGPT!)\n",
|
||||||
|
"\n",
|
||||||
|
"Also Websites protected with CloudFront (and similar) may give 403 errors - many thanks Andy J for pointing this out.\n",
|
||||||
|
"\n",
|
||||||
|
"But many websites will work just fine!"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "45d83403-a24c-44b5-84ac-961449b4008f",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"display_summary(\"https://cnn.com\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "75e9fd40-b354-4341-991e-863ef2e59db7",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"display_summary(\"https://anthropic.com\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "c951be1a-7f1b-448f-af1f-845978e47e2c",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"<table style=\"margin: 0; text-align: left;\">\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
|
||||||
|
" <img src=\"../assets/business.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
|
||||||
|
" </td>\n",
|
||||||
|
" <td>\n",
|
||||||
|
" <h2 style=\"color:#181;\">Business applications</h2>\n",
|
||||||
|
" <span style=\"color:#181;\">In this exercise, you experienced calling the Cloud API of a Frontier Model (a leading model at the frontier of AI) for the first time. We will be using APIs like OpenAI at many stages in the course, in addition to building our own LLMs.\n",
|
||||||
|
"\n",
|
||||||
|
"More specifically, we've applied this to Summarization - a classic Gen AI use case to make a summary. This can be applied to any business vertical - summarizing the news, summarizing financial performance, summarizing a resume in a cover letter - the applications are limitless. Consider how you could apply Summarization in your business, and try prototyping a solution.</span>\n",
|
||||||
|
" </td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
"</table>\n",
|
||||||
|
"\n",
|
||||||
|
"<table style=\"margin: 0; text-align: left;\">\n",
|
||||||
|
" <tr>\n",
|
||||||
|
" <td style=\"width: 150px; height: 150px; vertical-align: middle;\">\n",
|
||||||
|
" <img src=\"../assets/important.jpg\" width=\"150\" height=\"150\" style=\"display: block;\" />\n",
|
||||||
|
" </td>\n",
|
||||||
|
" <td>\n",
|
||||||
|
" <h2 style=\"color:#900;\">Before you continue - now try yourself</h2>\n",
|
||||||
|
" <span style=\"color:#900;\">Use the cell below to make your own simple commercial example. Stick with the summarization use case for now. Here's an idea: write something that will take the contents of an email, and will suggest an appropriate short subject line for the email. That's the kind of feature that might be built into a commercial email tool.</span>\n",
|
||||||
|
" </td>\n",
|
||||||
|
" </tr>\n",
|
||||||
|
"</table>"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
"cell_type": "code",
"execution_count": null,
"id": "00743dac-0e70-45b7-879a-d7293a6f68a6",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"system_prompt = \"something here\"\n",
"user_prompt = \"\"\"\n",
"    Lots of text\n",
"    Can be pasted here\n",
"\"\"\"\n",
"\n",
"# Step 2: Make the messages list\n",
"\n",
"messages = [] # fill this in\n",
"\n",
"# Step 3: Call OpenAI\n",
"# response =\n",
"\n",
"# Step 4: print the result\n",
"# print("
]
},
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## An extra exercise for those who enjoy web scraping\n",
"\n",
"You may notice that if you try `display_summary(\"https://openai.com\")` - it doesn't work! That's because OpenAI has a fancy website that uses Javascript. There are many ways around this that some of you might be familiar with. For example, Selenium is a hugely popular framework that runs a browser behind the scenes, renders the page, and allows you to query it. If you have experience with Selenium, Playwright or similar, then feel free to improve the Website class to use them. In the community-contributions folder, you'll find an example Selenium solution from a student (thank you!)"
]
},
{
"cell_type": "markdown",
"id": "eeab24dc-5f90-4570-b542-b0585aca3eb6",
"metadata": {},
"source": [
"# Sharing your code\n",
"\n",
"I'd love it if you share your code afterwards so I can share it with others! You'll notice that some students have already made changes (including a Selenium implementation) which you will find in the community-contributions folder. If you'd like to add your changes to that folder, submit a Pull Request with your new versions in that folder and I'll merge your changes.\n",
"\n",
"If you're not an expert with git (and I am not!) then GPT has given some nice instructions on how to submit a Pull Request. It's a bit of an involved process, but once you've done it once it's pretty clear. As a pro-tip: it's best if you clear the outputs of your Jupyter notebooks (Edit >> Clean outputs of all cells, and then Save) for clean notebooks.\n",
"\n",
"Here are good instructions courtesy of an AI friend:  \n",
"https://chatgpt.com/share/677a9cb5-c64c-8012-99e0-e06e88afd293"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4484fcf-8b39-4c3f-9674-37970ed71988",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,235 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d12b9c22",
"metadata": {},
"source": [
"# Song Lyrics → One-Sentence Summary\n",
"Get the lyrics of a song and summarize its main idea in about one sentence.\n",
"\n",
"## Setup\n",
"Import required libraries: environment vars, display helper, OpenAI client, BeautifulSoup, and requests."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d94bbd61",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"from bs4 import BeautifulSoup\n",
"import requests\n",
"\n",
"# Load OPENAI_API_KEY from .env so that OpenAI() below can find it\n",
"load_dotenv(override=True)"
]
},
{
"cell_type": "markdown",
"id": "92dc1bde",
"metadata": {},
"source": [
"## Function: Get Lyrics from Genius\n",
"Fetch and extract the lyrics from a Genius.com song page using BeautifulSoup."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b43fa98",
"metadata": {},
"outputs": [],
"source": [
"def get_lyrics_from_genius(url: str) -> str:\n",
"    \"\"\"\n",
"    Extracts song lyrics from a Genius.com song URL using BeautifulSoup.\n",
"    Example URL: https://genius.com/Ed-sheeran-shape-of-you-lyrics\n",
"    \"\"\"\n",
"    # Standard headers to fetch a website\n",
"    headers = {\n",
"        \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36\"\n",
"    }\n",
"\n",
"    response = requests.get(url, headers=headers)\n",
"    response.raise_for_status()  # raises an error if the page is not found\n",
"\n",
"    soup = BeautifulSoup(response.text, \"html.parser\")\n",
"\n",
"    # Genius stores lyrics inside <div data-lyrics-container=\"true\">\n",
"    lyrics_blocks = soup.find_all(\"div\", {\"data-lyrics-container\": \"true\"})\n",
"\n",
"    if not lyrics_blocks:\n",
"        return \"Lyrics not found.\"\n",
"\n",
"    # Join all text blocks and clean up spacing\n",
"    lyrics = \"\\n\".join(block.get_text(separator=\"\\n\") for block in lyrics_blocks)\n",
"    return lyrics.strip()"
]
},
{
"cell_type": "markdown",
"id": "fc4f0590",
"metadata": {},
"source": [
"## Function: Create Genius URL\n",
"Build a Genius.com lyrics URL automatically from the given artist and song name."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e018c623",
"metadata": {},
"outputs": [],
"source": [
"def create_genius_url(artist: str, song: str) -> str:\n",
"    \"\"\"\n",
"    Creates a Genius.com lyrics URL from artist and song name.\n",
"    Example:\n",
"    create_genius_url(\"Ed sheeran\", \"shape of you\")\n",
"    → https://genius.com/Ed-sheeran-shape-of-you-lyrics\n",
"    \"\"\"\n",
"    artist = artist.strip().replace(\" \", \"-\")\n",
"    song = song.strip().replace(\" \", \"-\")\n",
"    return f\"https://genius.com/{artist}-{song}-lyrics\"\n"
]
},
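{
"cell_type": "markdown",
"id": "b7f1c2d0",
"metadata": {},
"source": [
"### Optional: a more robust URL slug (added example)\n",
"Genius slugs also strip punctuation such as apostrophes and commas (e.g. \"Don't Stop Me Now\" → `Dont-Stop-Me-Now`), which the simple `replace` above doesn't handle. The next cell is a minimal sketch of a sturdier slug builder; `make_genius_slug` is a hypothetical helper name, and the exact rules Genius uses aren't documented, so treat it as a best-effort approximation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9d8e7f6",
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"def make_genius_slug(text: str) -> str:\n",
"    # Drop anything that isn't a letter, digit, space, or hyphen (approximation of Genius slug rules)\n",
"    text = re.sub(r\"[^A-Za-z0-9 -]\", \"\", text)\n",
"    # Collapse whitespace runs into single hyphens\n",
"    return re.sub(r\"\\s+\", \"-\", text.strip())\n",
"\n",
"def create_genius_url_robust(artist: str, song: str) -> str:\n",
"    return f\"https://genius.com/{make_genius_slug(artist)}-{make_genius_slug(song)}-lyrics\"\n",
"\n",
"create_genius_url_robust(\"Ed sheeran\", \"shape of you\")"
]
},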
{
"cell_type": "markdown",
"id": "62f50f02",
"metadata": {},
"source": [
"## Generate URL and Fetch Lyrics\n",
"Create the Genius URL from the artist and song name, then fetch and display the lyrics."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed51d48d",
"metadata": {},
"outputs": [],
"source": [
"artist = \"Ed sheeran\"\n",
"song = \"shape of you\"\n",
"\n",
"url = create_genius_url(artist, song)\n",
"print(url)\n",
"# Output: https://genius.com/Ed-sheeran-shape-of-you-lyrics\n",
"\n",
"user_prompt = get_lyrics_from_genius(url)\n",
"print(user_prompt[:5000])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fca4203a",
"metadata": {},
"outputs": [],
"source": [
"system_prompt = \"\"\"\n",
"You are a **helpful assistant** that specializes in analyzing **song lyrics**.\n",
"\n",
"## Task\n",
"Your goal is to **summarize the main idea or theme of a song** in **about one sentence**.\n",
"\n",
"## Instructions\n",
"1. Read the given song lyrics carefully.\n",
"2. Identify the **core message**, **emotion**, or **story** of the song.\n",
"3. Respond with **one concise sentence** only.\n",
"4. The tone of your summary should reflect the song’s mood (e.g., joyful, melancholic, romantic, rebellious).\n",
"\n",
"## Edge Cases\n",
"- **Very short lyrics:** Summarize the implied meaning.\n",
"- **Repetitive lyrics:** Focus on the message or emotion being emphasized.\n",
"- **Abstract or nonsensical lyrics:** Describe the overall feeling or imagery they create.\n",
"- **No lyrics or only a title provided:** Reply with  \n",
"  `No lyrics provided — unable to summarize meaningfully.`\n",
"- **Non-English lyrics:** Summarize in English unless otherwise instructed.\n",
"\n",
"## Output Format\n",
"Plain text — a single, coherent sentence summarizing the main idea of the song.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "11784d62",
"metadata": {},
"source": [
"## Create Chat Messages\n",
"Prepare the system and user messages, then send them to the OpenAI model for summarization."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f1205658",
"metadata": {},
"outputs": [],
"source": [
"messages = [\n",
"    {\"role\": \"system\", \"content\": system_prompt},\n",
"    {\"role\": \"user\", \"content\": user_prompt}\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c8d61aa",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"response = openai.chat.completions.create(\n",
"    model = \"gpt-4.1-mini\",\n",
"    messages = messages\n",
")"
]
},
{
"cell_type": "markdown",
"id": "4ad95820",
"metadata": {},
"source": [
"## Display Summary\n",
"Show the model’s one-sentence summary of the song lyrics in a formatted Markdown output."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f09a642",
"metadata": {},
"outputs": [],
"source": [
"display(Markdown(response.choices[0].message.content))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
221
week1/community-contributions/wk1-day1-RBG-all-sites-jina.ipynb
Normal file
@@ -0,0 +1,221 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
"metadata": {},
"source": [
"# My First Lab = My 1st Frontier LLM Project\n",
"## Summarize All Websites without Selenium\n",
"This simple \"app\" uses Jina (https://jina.ai/reader) to turn all websites into markdown before they are summarized by an LLM. As their website says: \"Convert a URL to LLM-friendly input, by simply adding r.jina.ai in front\". They have other tools that look useful too.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import os\n",
"import requests  # added for jina\n",
"from dotenv import load_dotenv\n",
"# from scraper import fetch_website_contents  # not needed for jina\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables from a file called .env\n",
"\n",
"load_dotenv(override=True)\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
"    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
"    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
"    print(\"API key found and looks good so far!\")\n",
"\n",
"# Setup access to the frontier model\n",
"\n",
"openai = OpenAI()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
"metadata": {},
"outputs": [],
"source": [
"# Step 1-a: Define the user prompt\n",
"\n",
"user_prompt_prefix = \"\"\"\n",
"Here are the contents of a website.\n",
"Provide a short summary of this website.\n",
"If it includes news or announcements, then summarize these too.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
"metadata": {},
"outputs": [],
"source": [
"# Step 1-b: Define the system prompt\n",
"\n",
"system_prompt = \"\"\"\n",
"You are a smart assistant that analyzes the contents of a website,\n",
"and provides a short, clear summary, ignoring text that might be navigation related.\n",
"Respond in markdown. Do not wrap the markdown in a code block - respond just with the markdown.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
"metadata": {},
"outputs": [],
"source": [
"# Add the website content to the user prompt\n",
"\n",
"def messages_for(website):\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_prompt_prefix + website}\n",
"    ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
"metadata": {},
"outputs": [],
"source": [
"# Step 5: Change the content utility to use jina\n",
"\n",
"def fetch_url_content(url):\n",
"    jina_reader_url = f\"https://r.jina.ai/{url}\"\n",
"    try:\n",
"        response = requests.get(jina_reader_url)\n",
"        response.raise_for_status()  # Raise an exception for HTTP errors\n",
"        return response.text\n",
"    except requests.exceptions.RequestException as e:\n",
"        print(f\"Error fetching URL: {e}\")\n",
"        return None\n"
]
},
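{
"cell_type": "markdown",
"id": "9a8b7c6d",
"metadata": {},
"source": [
"### Added example: peek at what Jina returns\n",
"Before summarizing, it can help to see the markdown the Jina reader produces. This added cell simply prints the first few hundred characters of the fetched content; it assumes network access to r.jina.ai and is not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f2e3d4c",
"metadata": {},
"outputs": [],
"source": [
"# Quick sanity check: print the start of the LLM-friendly markdown from Jina\n",
"preview = fetch_url_content(\"https://edwarddonner.com\")\n",
"if preview:\n",
"    print(preview[:500])"
]
},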
{
"cell_type": "code",
"execution_count": null,
"id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
"metadata": {},
"outputs": [],
"source": [
"# Step 3: Call OpenAI & Step 4: print the result\n",
"\n",
"def summarize(url):\n",
"    website = fetch_url_content(url)\n",
"    response = openai.chat.completions.create(\n",
"        model = \"gpt-5-nano\",\n",
"        messages = messages_for(website)\n",
"    )\n",
"    summary = response.choices[0].message.content\n",
"    return display(Markdown(summary))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
"metadata": {},
"outputs": [],
"source": [
"summarize(\"https://edwarddonner.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45d83403-a24c-44b5-84ac-961449b4008f",
"metadata": {},
"outputs": [],
"source": [
"summarize(\"https://cnn.com\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75e9fd40-b354-4341-991e-863ef2e59db7",
"metadata": {},
"outputs": [],
"source": [
"summarize(\"https://openai.com\")"
]
},
{
"cell_type": "markdown",
"id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
"metadata": {},
"source": [
"## Content Summary vs Technical Summary\n",
|
||||||
|
"\n",
|
||||||
|
"In my work a technical summary of a website, or group of websites, would be useful too. For example, does it render on the server (HTML) or in the browser (JavaScript), what content management system (CMS) was used, how many pages, how many outbound links, how many inbound links, etc. Doing this exercise I realized LLMs can help with analyzing content, but I may need other tools to count pages, links, and other specifications.\n",
|
||||||
|
"\n",
|
||||||
|
"A \"Shout Out\" to whoever put \"Market_Research_Agent.ipynb\" in the Community-Contributions. It is a great example of using an LLM as a management consultant. I think Jina might help with this usecase by offering web search results through an API to feed to your LLM. Here is the system prompt from that notebook and I plan to use this format often.\n",
|
||||||
|
"\n",
|
||||||
|
"system_prompt = \"\"\"You are to act like a Mckinsey Consultant specializing in market research. \n",
|
||||||
|
"1) You are to follow legal guidelines and never give immoral advice. \n",
|
||||||
|
"2) Your job is to maximise profits for your clients by analysing their companies initiatives and giving out recommendations for newer initiatives.\\n \n",
|
||||||
|
"3) Follow industry frameworks for reponses always give simple answers and stick to the point.\n",
|
||||||
|
"4) If possible try to see what competitors exist and what market gap can your clients company exploit.\n",
|
||||||
|
"5) Further more, USe SWOT, Porters 5 forces to summarize your recommendations, Give confidence score with every recommendations\n",
|
||||||
|
"6) Try to give unique solutions by seeing what the market gap is, if market gap is ambiguious skip this step\n",
|
||||||
|
"7) add an estimate of what rate the revenue of the comapany will increase at provided they follow the guidelines, give conservating estimates keeping in account non ideal conditions.\n",
|
||||||
|
"8) if the website isnt of a company or data isnt available, give out an error message along the lines of more data required for analysis\"\"\""
|
||||||
|
]
}
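,
{
"cell_type": "markdown",
"id": "3f2a7c18",
"metadata": {},
"source": [
"### Added example: a few technical stats without an LLM\n",
"Following on from the note above, here is a minimal sketch of the kind of non-LLM tooling that could gather technical stats. It only counts links and uses a crude heuristic (very little visible text plus many `<script>` tags) to guess client-side rendering; `tech_stats` is a hypothetical helper, not part of the course code, and it assumes BeautifulSoup is installed. Real CMS or page-count detection would need much more than this."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d4b9e2a",
"metadata": {},
"outputs": [],
"source": [
"from urllib.parse import urlparse\n",
"from bs4 import BeautifulSoup\n",
"\n",
"def tech_stats(url):\n",
"    # Fetch the raw HTML directly (no Jina here - we want the original markup)\n",
"    html = requests.get(url).text\n",
"    soup = BeautifulSoup(html, \"html.parser\")\n",
"    domain = urlparse(url).netloc\n",
"    links = [a[\"href\"] for a in soup.find_all(\"a\", href=True)]\n",
"    outbound = [l for l in links if l.startswith(\"http\") and domain not in l]\n",
"    scripts = soup.find_all(\"script\")\n",
"    text_len = len(soup.get_text(strip=True))\n",
"    return {\n",
"        \"total_links\": len(links),\n",
"        \"outbound_links\": len(outbound),\n",
"        \"script_tags\": len(scripts),\n",
"        # Heuristic guess only: lots of scripts but little text suggests client-side rendering\n",
"        \"looks_client_rendered\": text_len < 500 and len(scripts) > 10,\n",
"    }\n",
"\n",
"tech_stats(\"https://edwarddonner.com\")"
]
}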
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,283 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "88f67391",
"metadata": {},
"source": [
"### N Way Conversation - Coffee Talk \n",
"\n",
"This example simulates an N-way conversation between the characters of the Saturday Night Live skit Coffee Talk.\n",
"\n",
"The character information is retrieved from a model and each character is handled by its own model selected at random from a list of available models. Only the number of characters, number of rounds, and available models are configured.\n",
"\n",
"The example can use OpenRouter, OpenAI, or Ollama, in that order. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1eeb029",
"metadata": {},
"outputs": [],
"source": [
"# Setup ...\n",
"\n",
"# The number of characters (models) conversing\n",
"NBR_CHARACTERS=4\n",
"\n",
"# The number of rounds of conversation\n",
"NBR_ROUNDS=4\n",
"\n",
"# Available OpenRouter models. The base model is used to select characters and the topic. Other models are used for the conversation\n",
"OPENROUTER_MODELS=\"openai/gpt-4.1-mini, anthropic/claude-3.5-haiku, google/gemini-2.5-flash\"\n",
"OPENROUTER_BASE=\"openai/gpt-5\"\n",
"\n",
"# Available OpenAI models\n",
"OPENAI_MODELS=\"gpt-4.1, gpt-4.1-mini, gpt-5-nano\"\n",
"OPENAI_BASE=\"gpt-5\"\n",
"\n",
"# Available Ollama models. Note that these must be pre-fetched or errors will occur (and won't be handled)\n",
"OLLAMA_MODELS=\"gpt-oss, gemma3, llama3.2\"\n",
"OLLAMA_BASE=\"gpt-oss\"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "68022fbc",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"import os\n",
"import json\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display, update_display\n",
"from openai import OpenAI"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "73460c5e",
"metadata": {},
"outputs": [],
"source": [
"# Setup the LLM client and models. OpenRouter has priority if available, then OpenAI, then Ollama.\n",
"\n",
"load_dotenv(override=True)\n",
"openai_api_key = os.getenv('OPENAI_API_KEY')\n",
"openrouter_api_key = os.getenv('OPENROUTER_API_KEY')\n",
"\n",
"if openrouter_api_key:\n",
"    print(f\"OpenRouter API Key exists and begins {openrouter_api_key[:3]}, using OpenRouter.\")\n",
"    available_models=OPENROUTER_MODELS\n",
"    base_model=OPENROUTER_BASE\n",
"    client = OpenAI(base_url=\"https://openrouter.ai/api/v1\", api_key=openrouter_api_key)\n",
"elif openai_api_key:\n",
"    print(f\"OpenAI API Key exists and begins {openai_api_key[:8]}, using OpenAI.\")\n",
"    available_models=OPENAI_MODELS\n",
"    base_model=OPENAI_BASE\n",
"    client = OpenAI()\n",
"else:\n",
"    print(\"OpenAI API Key not set, using Ollama.\")\n",
"    available_models=OLLAMA_MODELS\n",
"    base_model=OLLAMA_BASE\n",
"    client = OpenAI(api_key=\"ollama\", base_url=\"http://localhost:11434/v1\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1a7004d",
"metadata": {},
"outputs": [],
"source": [
"# Get the characters from the base model\n",
"system_prompt = \"\"\"\n",
"You will be asked to return information about characters in the SNL skit Coffee Talk.\n",
"You should return the information as a JSON response with the following format:\n",
"{ \"characters\" : [\n",
"    { \"name\" : \"Linda\", \"persona\" : \"....\", \"model\" : \"model-name\" },\n",
"    { \"name\" : \"Paul\", \"persona\" : \"....\", \"model\" : \"model-name\" }\n",
"] }\n",
"\n",
"\"\"\"\n",
"\n",
"user_prompt = f\"\"\"\n",
"Create a list of the many characters from the SNL skit Coffee Talk, and return {NBR_CHARACTERS} total characters.\n",
"Always return Linda Richmond as the first character.\n",
"Return one caller.\n",
"Select the remaining characters at random from the list of all characters.\n",
"For the model value, return a random model name from this list: {available_models}.\n",
"\"\"\"\n",
"\n",
"response = client.chat.completions.create(\n",
"    model=base_model,\n",
"    messages=[\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": user_prompt}\n",
"    ],\n",
"    response_format={\"type\": \"json_object\"}\n",
")\n",
"result = response.choices[0].message.content\n",
"characters = json.loads(result)\n",
"\n",
"print(json.dumps(characters, indent=2))\n"
]
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "21a73805",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Generate system prompts for each character, which includes their name, persona, the other guests, and how they should respond.\n",
|
||||||
|
"\n",
|
||||||
|
"guests = \"The guests on todays show are \"\n",
|
||||||
|
"guest_names = [character['name'] for character in characters[\"characters\"]]\n",
|
||||||
|
"guests += \", \".join(guest_names)\n",
|
||||||
|
"\n",
|
||||||
|
"prompt = \"\"\n",
|
||||||
|
"for character in characters[\"characters\"]:\n",
|
||||||
|
" prompt = f\"You are {character['name']} a character on the SNL skit Coffee Talk.\"\n",
|
||||||
|
" prompt += f\" Your personality is : {character['persona']} \"\n",
|
||||||
|
" prompt += \" \" + guests + \".\"\n",
|
||||||
|
" prompt += \" Keep responses brief and in character.\"\n",
|
||||||
|
" prompt += \" In the conversation history, each response is prefixed with the character's name to identify the respondent.\"\n",
|
||||||
|
" prompt += \" Your response should not include your character name as a prefix.\"\n",
|
||||||
|
"\n",
|
||||||
|
" character[\"system_prompt\"] = prompt\n",
|
||||||
|
"\n",
|
||||||
|
"print(json.dumps(characters, indent=2))\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "656131a1",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Get the topic\n",
|
||||||
|
"user_prompt=\"\"\"\n",
|
||||||
|
"In the SNL skit Coffee Talk, the host Linda Richmond proposes topics in the form \"X Y is neither X, nor Y - discuss\".\n",
|
||||||
|
"Create a list of the many topics proposed on the show, and select one at random and return it.\n",
|
||||||
|
"Return only the selected topic without any formatting.\n",
|
||||||
|
"\"\"\"\n",
|
||||||
|
"\n",
|
||||||
|
"response = client.chat.completions.create(\n",
|
||||||
|
" model=base_model,\n",
|
||||||
|
" messages=[\n",
|
||||||
|
" {\"role\": \"user\", \"content\": user_prompt}\n",
|
||||||
|
" ],\n",
|
||||||
|
" )\n",
|
||||||
|
"topic = response.choices[0].message.content\n",
|
||||||
|
"\n",
|
||||||
|
"print(topic)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "6e137753",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def get_character_response(character,history):\n",
|
||||||
|
" user_prompt = f\"\"\"\n",
|
||||||
|
" The conversation so far is as follows:\n",
|
||||||
|
" {history}\n",
|
||||||
|
" What is your response? \n",
|
||||||
|
" \"\"\"\n",
|
||||||
|
" \n",
|
||||||
|
" response = client.chat.completions.create(\n",
|
||||||
|
" model=character[\"model\"],\n",
|
||||||
|
" messages=[\n",
|
||||||
|
" {\"role\": \"system\", \"content\": character[\"system_prompt\"]},\n",
|
||||||
|
" {\"role\": \"user\", \"content\": user_prompt}\n",
|
||||||
|
" ]\n",
|
||||||
|
" )\n",
|
||||||
|
" return response.choices[0].message.content\n",
|
||||||
|
" "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "23fb446f",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Start the show!\n",
|
||||||
|
"\n",
|
||||||
|
"history = \"\"\n",
|
||||||
|
"history += \"Welcome to Coffee Talk, I am your host Linda Richmond. Today's guests are:\\n\"\n",
|
||||||
|
"\n",
|
||||||
|
"for character in characters[\"characters\"][1:]:\n",
|
||||||
|
" history += f\" - {character['name']}\\n\"\n",
|
||||||
|
"\n",
|
||||||
|
"history += f\"\\nI'll give you a topic: {topic}\\n\"\n",
|
||||||
|
"\n",
|
||||||
|
"display(Markdown(\"---\"))\n",
|
||||||
|
"display(Markdown(history))\n",
|
||||||
|
"display(Markdown(\"---\"))\n",
|
||||||
|
"\n",
|
||||||
|
"# Other guests respond (first round)\n",
|
||||||
|
"for character in characters[\"characters\"][1:]:\n",
|
||||||
|
" response = get_character_response(character,history)\n",
|
||||||
|
" display(Markdown(f\"**{character['name']}({character['model']}):** {response}\")) \n",
|
||||||
|
" history += f\"\\n{character['name']}: {response}\"\n",
|
||||||
|
"\n",
|
||||||
|
"# Continue conversation for remaining rounds (all characters including Linda)\n",
|
||||||
|
"for round in range(1, NBR_ROUNDS):\n",
|
||||||
|
" for character in characters[\"characters\"]:\n",
|
||||||
|
" response = get_character_response(character,history)\n",
|
||||||
|
" display(Markdown(f\"**{character['name']}({character['model']}):** {response}\")) \n",
|
||||||
|
" history += f\"\\n{character['name']}: {response}\"\n",
|
||||||
|
"\n",
|
||||||
|
"# Wrap it up\n",
|
||||||
|
"user_prompt=f\"\"\"\n",
|
||||||
|
"It's time to wrap up the show. Here's the whole conversation:\\n\n",
|
||||||
|
"{history}\n",
|
||||||
|
"Wrap up the show, as only you can.\n",
|
||||||
|
"\"\"\"\n",
|
||||||
|
"\n",
|
||||||
|
"linda = characters[\"characters\"][0]\n",
|
||||||
|
"response = client.chat.completions.create(\n",
|
||||||
|
" model=linda[\"model\"],\n",
|
||||||
|
" messages=[\n",
|
||||||
|
" {\"role\": \"system\", \"content\": linda[\"system_prompt\"]},\n",
|
||||||
|
" {\"role\": \"user\", \"content\": user_prompt}\n",
|
||||||
|
" ]\n",
|
||||||
|
" )\n",
|
||||||
|
"\n",
|
||||||
|
"display(Markdown(\"---\"))\n",
|
||||||
|
"display(Markdown(response.choices[0].message.content)) \n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "llm-engineering",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.12.11"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 5
|
||||||
|
}
|
||||||
@@ -0,0 +1 @@
OPENAI_API_KEY=your_openai_api_key
@@ -0,0 +1 @@
3.12
263
week3/community-contributions/juan_synthetic_data/README.md
Normal file
@@ -0,0 +1,263 @@
# Synthetic Data Generator

**NOTE:** This is a copy of the repository https://github.com/Jsrodrigue/synthetic-data-creator.

An intelligent synthetic data generator that uses OpenAI models to create realistic tabular datasets based on reference data. This project includes an intuitive web interface built with Gradio.

> **🎓 Educational Project**: This project was inspired by the highly regarded LLM Engineering course on Udemy: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099). It demonstrates practical applications of LLM engineering principles, prompt engineering, and synthetic data generation techniques.

## Key highlights:
- Built with Python & Gradio
- Uses OpenAI GPT-4-class models (GPT-4o-mini, GPT-4.1-mini) for tabular data synthesis
- Focused on statistical consistency and controlled randomness
- Lightweight and easy to extend

## 📸 Screenshots & Demo

### Application Interface
<p align="center">
<img src="screenshots/homepage.png" alt="Main Interface" width="70%">
</p>
<p align="center"><em>Main interface showing the synthetic data generator with all controls</em></p>

### Generated Data Preview
<p align="center">
<img src="screenshots/generated_table.png" alt="Generated table" width="70%">
</p>
<p align="center"><em>Generated CSV preview with the Wine dataset reference</em></p>

### Histogram Plots
<p align="center">
<img src="screenshots/histogram.png" alt="Histogram plot" width="70%">
</p>
<p align="center"><em>Example of a histogram comparison plot for the Wine dataset</em></p>

### Boxplots
<p align="center">
<img src="screenshots/boxplot.png" alt="Boxplot" width="70%">
</p>
<p align="center"><em>Example of a boxplot comparison</em></p>


### Video Demo
[Watch the demo](https://youtu.be/C7c8BbUGGBA)

*Click to watch a complete walkthrough of the application*


## 📋 Features

- **Intelligent Generation**: Generates synthetic data using OpenAI models (GPT-4o-mini, GPT-4.1-mini)
- **Web Interface**: Provides an intuitive Gradio UI with real-time data preview
- **Reference Data**: Optionally load CSV files to preserve statistical distributions
- **Export Options**: Download generated datasets directly in CSV format
- **Included Examples**: Comes with ready-to-use sample datasets for people and sentiment analysis
- **Dynamic Batching**: Automatically adapts batch size based on prompt length and reference sample size
- **Reference Sampling**: Uses random subsets of reference data to ensure variability and reduce API cost. The sample size (default `64`) can be modified in `src/constants.py` via `N_REFERENCE_ROWS`; a sketch of the idea follows this list.
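
A minimal sketch of what that sampling step can look like (assuming pandas; `N_REFERENCE_ROWS` lives in `src/constants.py`, while the helper name here is illustrative rather than the project's actual function):

```python
import pandas as pd

N_REFERENCE_ROWS = 64  # default; configurable in src/constants.py

def sample_reference_rows(csv_path: str, n_rows: int = N_REFERENCE_ROWS) -> list[dict]:
    """Draw a random subset of the reference data to keep prompts short and varied."""
    df = pd.read_csv(csv_path)
    sample = df.sample(n=min(n_rows, len(df)))  # never request more rows than exist
    return sample.to_dict(orient="records")  # records embed cleanly in a prompt
```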

## 🚀 Installation

### Prerequisites
- Python 3.12+
- OpenAI account with API key

### Option 1: Using pip
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### Option 2: Using uv
```bash
# Clone the repository
git clone https://github.com/Jsrodrigue/synthetic-data-creator.git
cd synthetic-data-creator

# Install dependencies (creates .venv)
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

### Configuration
1. Copy the environment variables example file:
```bash
cp .env_example .env
```

2. Edit `.env` and add your OpenAI API key:
```
OPENAI_API_KEY=your_api_key_here
```
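
To sanity-check that the key is picked up before launching the app, a quick snippet like this (mirroring how `app.py` loads the key with `python-dotenv`) can help:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
key = os.getenv("OPENAI_API_KEY")
print("Key loaded" if key else "OPENAI_API_KEY is missing: check your .env")
```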

## 🎯 Usage

### Start the application

You can run the app either with **Python** or with **uv** (recommended if you installed dependencies using `uv sync`):

```bash
# Option 1: using Python
python app.py

# Option 2: using uv (no need to activate the venv manually)
uv run app.py
```

The script will print a local URL (e.g., http://localhost:7860); open that link in your browser.

### How to use the interface

1. **Configure Prompts**:
   - **System Prompt**: Uses the default rules defined in `src/constants.py`; edit them there for custom generation.
   - **User Prompt**: Specifies what type of data to generate (default: 15 rows, defined in `src/constants.py`).

2. **Select Model**:
   - `gpt-4o-mini`: Faster and more economical
   - `gpt-4.1-mini`: Higher reasoning capacity

3. **Load Reference Data** (optional):
   - Upload a CSV file with similar data
   - Use the included examples: `people_reference.csv`, `sentiment_reference.csv`, or `wine_reference.csv`

4. **Generate Data**:
   - Click "🚀 Generate Data"
   - Review results in the Gradio UI
   - Download the generated CSV

## 📊 Quality Evaluation

### Simple Evaluation System

The project includes a simple evaluation system focused on basic metrics and visualizations:

#### Features
- **Simple Metrics**: Basic statistical comparisons and quality checks (see the sketch below)
- **Integrated Visualizations**: Automatic generation of comparison plots in the app
- **Easy to Understand**: Clear scores and simple reports
- **Scale Invariant**: Works with datasets of different sizes
- **Temporary Files**: Visualizations are generated in temp files and cleaned up automatically
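
As a rough illustration of what such a basic per-column comparison can look like (a sketch assuming only pandas, not the actual `src/evaluator.py` implementation):

```python
import pandas as pd

def compare_basic_stats(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Compare the mean and standard deviation of each shared numeric column."""
    rows = []
    shared = real.select_dtypes("number").columns.intersection(synthetic.columns)
    for col in shared:
        rows.append({
            "column": col,
            "real_mean": real[col].mean(),
            "synthetic_mean": synthetic[col].mean(),
            "real_std": real[col].std(),
            "synthetic_std": synthetic[col].std(),
        })
    return pd.DataFrame(rows)
```

Plugging in `wine_reference.csv` and a generated CSV gives a quick side-by-side table of distribution drift.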

## 🛠️ Improvements and Next Steps

### Immediate Improvements

1. **Advanced Validation**:
   - Implement specific validators by data type
   - Create evaluation reports

2. **Advanced Quality Metrics**:
   - Include more advanced metrics to compare multivariate similarity (for future work; see the sketch after this list), e.g.:
     - C2ST (Classifier Two-Sample Test): train a classifier to distinguish real vs synthetic and report AUROC (ideal ≈ 0.5).
     - MMD (Maximum Mean Discrepancy): kernel-based multivariate distance.
     - Multivariate Wasserstein / Optimal Transport: joint-distribution distance (use POT).
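
As a sketch of how the C2ST idea could be wired up later (this assumes scikit-learn, which is not currently in the project's requirements):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def c2st_auroc(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Classifier Two-Sample Test: an AUROC near 0.5 means the classifier
    cannot tell real rows from synthetic ones (good synthetic data)."""
    X = pd.concat([real, synthetic], ignore_index=True).select_dtypes("number")
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```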

3. **More Models**:
   - Integrate Hugging Face models
   - Support for local models (Ollama)
   - Comparison between different models

### Advanced Features

1. **Conditional Generation**:
   - Data based on specific conditions
   - Controlled outlier generation
   - Maintaining complex relationships

2. **Privacy Analysis**:
   - Differential privacy metrics
   - Sensitive data detection
   - Automatic anonymization

3. **Database Integration**:
   - Direct database connection
   - Massive data generation
   - Automatic synchronization

### Scalable Architecture

1. **REST API**:
   - Endpoints for integration
   - Authentication and rate limiting
   - OpenAPI documentation

2. **Asynchronous Processing**:
   - Work queues for long generations
   - Progress notifications
   - Robust error handling

3. **Monitoring and Logging**:
   - Usage and performance metrics
   - Detailed generation logs
   - Quality alerts

## 📁 Project Structure

```
synthetic_data/
├── app.py                 # Main Gradio application for synthetic data generation
├── README.md              # Project documentation
├── pyproject.toml         # Project configuration
├── requirements.txt       # Python dependencies
├── data/                  # Reference CSV datasets used for generating synthetic data
│   ├── people_reference.csv
│   ├── sentiment_reference.csv
│   └── wine_reference.csv
├── notebooks/             # Jupyter notebooks for experiments and development
│   └── notebook.ipynb
├── src/                   # Python source code
│   ├── __init__.py
│   ├── constants.py       # Default constants, reference sample size, and default prompts
│   ├── data_generation.py # Core functions for batch generation and evaluation
│   ├── evaluator.py       # Evaluation logic and metrics
│   ├── IO_utils.py        # Utilities for file management and temp directories
│   ├── openai_utils.py    # Wrappers for OpenAI API calls
│   └── plot_utils.py      # Functions to create visualizations from data
└── temp_plots/            # Temporary folder for generated plot images (auto-cleaned)
```

## 📄 License

This project is licensed under the MIT License. See the `LICENSE` file for more details.

## 🎓 Course Context & Learning Outcomes

This project was developed as part of the [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099) course on Udemy. It demonstrates practical implementation of:

### Key Learning Objectives:
- **Prompt Engineering Mastery**: Creating effective system and user prompts for consistent outputs
- **API Integration**: Working with OpenAI's API for production applications
- **Data Processing**: Handling JSON parsing, validation, and error management
- **Web Application Development**: Building user interfaces with Gradio

### Course Insights Applied:
- **Why OpenAI over Open Source**: This project was developed as an alternative to open-source models due to consistency issues in prompt following with models like Llama 3.2. OpenAI provides more reliable and faster results for this specific task.
- **Production Considerations**: Focus on error handling, output validation, and user experience
- **Scalability Planning**: Architecture designed for future enhancements and integrations

### Related Course Topics:
- Prompt engineering techniques
- LLM API integration and optimization
- Selection of the best models for each use case

---

**📚 Course Link**: [LLM Engineering: Master AI and Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/52941433#questions/23828099)
156
week3/community-contributions/juan_synthetic_data/app.py
Normal file
@@ -0,0 +1,156 @@
import atexit
import os

import gradio as gr
import openai
from dotenv import load_dotenv

from src.constants import PROJECT_TEMP_DIR, SYSTEM_PROMPT, USER_PROMPT
from src.data_generation import generate_and_evaluate_data
from src.IO_utils import cleanup_temp_files
from src.plot_utils import display_reference_csv


def main():
    # ==========================================================
    # Setup
    # ==========================================================

    # Load the api key
    load_dotenv()
    openai.api_key = os.getenv("OPENAI_API_KEY")

    # Temporary folder for images
    os.makedirs(PROJECT_TEMP_DIR, exist_ok=True)

    # Ensure temporary plot images are deleted when the program exits
    atexit.register(lambda: cleanup_temp_files(PROJECT_TEMP_DIR))

    # ==========================================================
    # Gradio App
    # ==========================================================
    with gr.Blocks() as demo:

        # Store temp folder in state
        temp_dir_state = gr.State(value=PROJECT_TEMP_DIR)

        gr.Markdown("# 🧠 Synthetic Data Generator (with OpenAI)")

        # ======================================================
        # Tabs for organized sections
        # ======================================================
        with gr.Tabs():

            # ------------------------------
            # Tab 1: Input
            # ------------------------------
            with gr.Tab("Input"):

                # System prompt in collapsible
                with gr.Accordion("System Prompt (click to expand)", open=False):
                    system_prompt_input = gr.Textbox(
                        label="System Prompt", value=SYSTEM_PROMPT, lines=20
                    )

                # User prompt box
                user_prompt_input = gr.Textbox(
                    label="User Prompt", value=USER_PROMPT, lines=5
                )

                # Model selection
                model_select = gr.Dropdown(
                    label="OpenAI Model",
                    choices=["gpt-4o-mini", "gpt-4.1-mini"],
                    value="gpt-4o-mini",
                )

                # Reference CSV upload
                reference_input = gr.File(
                    label="Reference CSV (optional)", file_types=[".csv"]
                )

                # Examples
                gr.Examples(
                    examples=[
                        "data/sentiment_reference.csv",
                        "data/people_reference.csv",
                        "data/wine_reference.csv",
                    ],
                    inputs=reference_input,
                )

                # Generate button
                generate_btn = gr.Button("🚀 Generate Data")

                # Download button
                download_csv = gr.File(label="Download CSV")

            # ------------------------------
            # Tab 2: Reference Table
            # ------------------------------
            with gr.Tab("Reference Table"):
                reference_display = gr.DataFrame(label="Reference CSV Preview")

            # ------------------------------
            # Tab 3: Generated Table
            # ------------------------------
            with gr.Tab("Generated Table"):
                output_df = gr.DataFrame(label="Generated Data")

            # ------------------------------
            # Tab 4: Evaluation
            # ------------------------------
            with gr.Tab("Comparison"):
                with gr.Accordion("Evaluation Results (click to expand)", open=True):
                    evaluation_df = gr.DataFrame(label="Evaluation Results")

            # ------------------------------
            # Tab 5: Visualizations
            # ------------------------------
            with gr.Tab("Visualizations"):
                gr.Markdown("# Click on the box to expand")

                images_gallery = gr.Gallery(
                    label="Column Visualizations",
                    show_label=True,
                    columns=2,
                    height="auto",
                    interactive=True,
                )

        # Hidden state for internal use
        generated_state = gr.State()

        # ======================================================
        # Event bindings
        # ======================================================
        generate_btn.click(
            fn=generate_and_evaluate_data,
            inputs=[
                system_prompt_input,
                user_prompt_input,
                temp_dir_state,
                reference_input,
                model_select,
            ],
            outputs=[
                output_df,
                download_csv,
                evaluation_df,
                generated_state,
                images_gallery,
            ],
        )

        reference_input.change(
            fn=display_reference_csv,
            inputs=[reference_input],
            outputs=[reference_display],
        )

    demo.launch(debug=True)


if __name__ == "__main__":
    main()
@@ -0,0 +1,16 @@
Name,Age,City
John,32,New York
Alice,45,Los Angeles
Bob,28,Chicago
Eve,35,Houston
Mike,52,Philadelphia
Emma,29,San Antonio
Oliver,39,Phoenix
Isabella,48,San Diego
William,55,Dallas
Charlotte,31,San Jose
Alexander,42,San Francisco
Harper,38,San Antonio
Julia,46,San Diego
Ethan,53,San Jose
Ava,29,San Francisco
@@ -0,0 +1,99 @@
,Comment,sentiment
0,"Them: I don't think I like this game.

Me: But you haven't even played it for 5 minutes and are still in the tutorial.",negative
1,Then you leave them to farm the smaller creatures while you either wait or help them kill them all with the click of a button.,negative
2,Nothing beats the feeling you get when you see them fall in love with it just like you did all those years ago,positive
3,"[Also, they're made of paper](https://i.imgur.com/wYu0G9J.jpg)

Edit: I tried to make a gif and failed so here's a [video](https://i.imgur.com/aPzS8Ny.mp4)",negative
4,"Haha... That was exactly it when my brother tried to get me into WoW.

Him, "" I can run you through raids to get you to level up faster and get better gear. But first you need to be this min level. What are you""

Me ""lvl 1"".

Him ""ok. Let's do a couple quests to get you up. What is your quest""

Me ""collect 20 apples"".",positive
5,I'm going through this right now. I just started playing minecraft for the first time and my SO is having to walk me through everything.,positive
6,Then they get even more into it than you and end up getting all the loot and items you wanted before you. They make you look like the noob in about 3 months.,positive
7,"###Take your time, you got this

|#|user|EDIT|comment|Link
|:--|:--|:--|:--|:--|
|0|/u/KiwiChoppa147|[EDIT](https://i.imgur.com/OI8jNtE.png)|Then you leave them to farm the smaller creatures while you either wait or help them kill them all with the click of a button.|[Link](/r/gaming/comments/ccr8c8/take_your_time_you_got_this/etor3t2/)|
|1|/u/League0fGaming|[EDIT](https://i.imgur.com/5uvRAYy.png)|Nothing beats the feeling you get when you see them fall in love with it just like you did all those years ago|[Link](/r/gaming/comments/ccr8c8/take_your_time_you_got_this/etor371/)|
|2|/u/DeJMan|[EDIT](https://i.imgur.com/3FL3IFb.png)|[Also, they're made of paper](https://i.imgur.com/wYu0G9J.jpg) Edit: I tried to make a gif and failed so here's a [video](https://i.imgur.com/aPzS8Ny.mp4)|[Link](/r/gaming/comments/ccr8c8/take_your_time_you_got_this/etos1ic/)|
|3|/u/Bamboo6|[EDIT](https://i.imgur.com/SiDFZxQ.png)|Haha... That was exactly it when my brother tried to get me into WoW. Him, "" I can run you through raids to get you to level up faster and get better gear. But first you need to be this min level. What are you"" Me ""lvl 1"". Him ""ok. Let's do a couple quests to get you up. What is your quest"" Me ""collect 20 apples"".|[Link](/r/gaming/comments/ccr8c8/take_your_time_you_got_this/etorb6s/)|
|4|/u/xxfisharemykidsxx|[EDIT](https://i.imgur.com/3ek9F93.png)|I'm going through this right now. I just started playing minecraft for the first time and my SO is having to walk me through everything.|[Link](/r/gaming/comments/ccr8c8/take_your_time_you_got_this/etor7hk/)|
|5|/u/DuckSeeDuckWorld|[EDIT](https://i.imgur.com/rlE6VFP.png)|[This is my last EDIT before I go to camp for a week](https://imgur.com/xoOWF6K)|[Link](/r/gaming/comments/ccr8c8/take_your_time_you_got_this/etorpvh/)|
|6|/u/ChecksUsernames|[EDIT](https://i.imgur.com/6Wc56ec.png)|What the hell you have your own edit bot?!|[Link](/r/gaming/comments/ccr8c8/take_your_time_you_got_this/etotc4w/)|


I am a little fan-made bot who loves /u/SrGrafo but is a little lazy with hunting for EDITs. If you want to support our great creator, check out his [Patreon](https://Patreon.com/SrGrafo)",positive
8,"Them: ""Wait, where did you go?""

Me --cleaning up the vast quantities of mobs they've managed to stumble past: "" Oh just, you know, letting you get a feel for navigation.""",neutral
9,"Don't mind the arrows, everything's fine",positive
10,[me_irl](https://i.imgur.com/eRPb2X3.png),neutral
11,"I usually teach them the basic controls, and then throw them to the wolves like Spartans. Its sink or swim now!",positive
12,This is Warframe in a nutshell,neutral
13,[I love guiding people trough the game for the First time](https://imgur.com/uep20iB),positive
14,[showing a video game to my nephew for the first time didn't go that well :D](https://i.imgur.com/dQf4mfI.png),negative
15,[When it's a puzzle game](https://i.imgur.com/BgLqzRa.png),neutral
16,"I love SrGrafo’s cheeky smiles in his drawings.

Also, I wonder if it’s Senior Grafo, Señor Grafo, or Sir Grafo.",positive
17,"https://i.redd.it/pqjza65wrd711.jpg

Same look.",neutral
18,[This is my last EDIT before I go to camp for a week](https://imgur.com/xoOWF6K),neutral
19,Haha this is me in Warframe but I've only been playing for a year. It's so easy to find beginners and they always need help with something.,positive
20,This happens all the time on r/warframe ! Helping new people is like a whole part of the game's fun.,positive
21,[deleted],neutral
22,"Once day when I have kids, I hope I can do the same with them",positive
23,WAIT NO. WHY'D YOU PRESS X INSTEAD? Now you just used the only consumable for the next like 3 stages. Here lemme just restart from your last save...,neutral
24,Big gamer energy.,positive
25,"What about ten minutes in and they say “I’m not sure I get what’s going on. Eh I’m bored.”

Shitty phone [EDIT](https://imgur.com/a/zr4Ahnp)",negative
26,Press *alt+f4* for the special move,positive
27,"I remember teaching my little brother everything about Minecraft. Ah, good times. Now he's a little prick xD",positive
28,2nd top post of 2019!! \(^0^)/,positive
29,"With Grafo’s most recent comics, this achievement means so much more now. Check them out on his profile, u/SrGrafo, they’re titled “SrGrafo’s inception “",neutral
30,"this is my bf showing me wow.

Him: “You can’t just stand there and take damage.”
Me: “but I can’t move fast and my spells get cancelled.”

*proceeds to die 5 times in a row.*

and then he finishes it for me after watching me fail.

Me: yay. 😀😀",neutral
31,"Quick cross over

https://imgur.com/a/9y4JVAr",neutral
32,"Man, I really enjoy encoutering nice Veterans in online games",positive
33,Wow. This is my first time here before the edits.,positive
34,So this is the most liked Reddit post hmm,positive
35,Diamond armor? Really?,positive
36,"I remember when I was playing Destiny and I was pretty low level, having fun going through the missions, then my super high level friend joined. It was really unfun because he was slaughtering everything for me while I sat at the back doing jackshit",positive
37,"""I'll just use this character until you get the hang of things and then swap to an alt so we can level together""",neutral
38,"My girlfriend often just doesn't get why I love the games I play, but that's fine. I made sure to sit and watch her while she fell in love with breath of the wild.",negative
39,"Warframe was full of people like this last i was on and its amazing. I was one of them too, but mostly for advice more than items because i was broke constantly.",neutral
40,This is the most upvoted post I've seen on Reddit. And it was unexpectedly touching :),positive
41,220k. holy moly,neutral
42,Last,neutral
43,"170k+ upvotes in 11 hours.
Is this a record?",neutral
44,This is the top post of all time😱,positive
45,"Congratulations, 2nd post of the Year",positive
46,Most liked post on reddit,positive
47,Absolute Unit,neutral
48,"I did similar things in Monster Hunter World.
The only problem is they would never play ever again and play other games like Fortnite...feels bad man.
If you ever get interested on playing the game u/SrGrafo then I’ll teach you the ways of the hunter!!! (For real tho it’s a really good game and better with buddy’s!)",positive
49,Congrats on the second most upvoted post of 2019 my guy.,positive
50,"This was it with my brother when I first started playing POE. He made it soooo much easier to get into the game. To understand the gameplay and mechanics. I think I’d have left in a day or two had it not been for him
And walking me through the first few missions lmao. u/sulphra_",positive
@@ -0,0 +1,159 @@
fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4
7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,5
7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5,6
7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7,7
7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7,8
6.7,0.58,0.08,1.8,0.09699999999999999,15.0,65.0,0.9959,3.28,0.54,9.2,5,10
5.6,0.615,0.0,1.6,0.08900000000000001,16.0,59.0,0.9943,3.58,0.52,9.9,5,12
7.8,0.61,0.29,1.6,0.114,9.0,29.0,0.9974,3.26,1.56,9.1,5,13
8.5,0.28,0.56,1.8,0.092,35.0,103.0,0.9969,3.3,0.75,10.5,7,16
7.9,0.32,0.51,1.8,0.341,17.0,56.0,0.9969,3.04,1.08,9.2,6,19
7.6,0.39,0.31,2.3,0.08199999999999999,23.0,71.0,0.9982,3.52,0.65,9.7,5,21
7.9,0.43,0.21,1.6,0.106,10.0,37.0,0.9966,3.17,0.91,9.5,5,22
8.5,0.49,0.11,2.3,0.084,9.0,67.0,0.9968,3.17,0.53,9.4,5,23
6.9,0.4,0.14,2.4,0.085,21.0,40.0,0.9968,3.43,0.63,9.7,6,24
6.3,0.39,0.16,1.4,0.08,11.0,23.0,0.9955,3.34,0.56,9.3,5,25
7.6,0.41,0.24,1.8,0.08,4.0,11.0,0.9962,3.28,0.59,9.5,5,26
7.1,0.71,0.0,1.9,0.08,14.0,35.0,0.9972,3.47,0.55,9.4,5,28
7.8,0.645,0.0,2.0,0.08199999999999999,8.0,16.0,0.9964,3.38,0.59,9.8,6,29
6.7,0.675,0.07,2.4,0.08900000000000001,17.0,82.0,0.9958,3.35,0.54,10.1,5,30
8.3,0.655,0.12,2.3,0.083,15.0,113.0,0.9966,3.17,0.66,9.8,5,32
5.2,0.32,0.25,1.8,0.10300000000000001,13.0,50.0,0.9957,3.38,0.55,9.2,5,34
7.8,0.645,0.0,5.5,0.086,5.0,18.0,0.9986,3.4,0.55,9.6,6,35
7.8,0.6,0.14,2.4,0.086,3.0,15.0,0.9975,3.42,0.6,10.8,6,36
8.1,0.38,0.28,2.1,0.066,13.0,30.0,0.9968,3.23,0.73,9.7,7,37
7.3,0.45,0.36,5.9,0.07400000000000001,12.0,87.0,0.9978,3.33,0.83,10.5,5,40
8.8,0.61,0.3,2.8,0.08800000000000001,17.0,46.0,0.9976,3.26,0.51,9.3,4,41
7.5,0.49,0.2,2.6,0.332,8.0,14.0,0.9968,3.21,0.9,10.5,6,42
8.1,0.66,0.22,2.2,0.069,9.0,23.0,0.9968,3.3,1.2,10.3,5,43
4.6,0.52,0.15,2.1,0.054000000000000006,8.0,65.0,0.9934,3.9,0.56,13.1,4,45
7.7,0.935,0.43,2.2,0.114,22.0,114.0,0.997,3.25,0.73,9.2,5,46
8.8,0.66,0.26,1.7,0.07400000000000001,4.0,23.0,0.9971,3.15,0.74,9.2,5,50
6.6,0.52,0.04,2.2,0.069,8.0,15.0,0.9956,3.4,0.63,9.4,6,51
6.6,0.5,0.04,2.1,0.068,6.0,14.0,0.9955,3.39,0.64,9.4,6,52
8.6,0.38,0.36,3.0,0.081,30.0,119.0,0.997,3.2,0.56,9.4,5,53
7.6,0.51,0.15,2.8,0.11,33.0,73.0,0.9955,3.17,0.63,10.2,6,54
10.2,0.42,0.57,3.4,0.07,4.0,10.0,0.9971,3.04,0.63,9.6,5,56
7.8,0.59,0.18,2.3,0.076,17.0,54.0,0.9975,3.43,0.59,10.0,5,58
7.3,0.39,0.31,2.4,0.07400000000000001,9.0,46.0,0.9962,3.41,0.54,9.4,6,59
8.8,0.4,0.4,2.2,0.079,19.0,52.0,0.998,3.44,0.64,9.2,5,60
7.7,0.69,0.49,1.8,0.115,20.0,112.0,0.9968,3.21,0.71,9.3,5,61
7.0,0.735,0.05,2.0,0.081,13.0,54.0,0.9966,3.39,0.57,9.8,5,63
7.2,0.725,0.05,4.65,0.086,4.0,11.0,0.9962,3.41,0.39,10.9,5,64
7.2,0.725,0.05,4.65,0.086,4.0,11.0,0.9962,3.41,0.39,10.9,5,65
6.6,0.705,0.07,1.6,0.076,6.0,15.0,0.9962,3.44,0.58,10.7,5,67
8.0,0.705,0.05,1.9,0.07400000000000001,8.0,19.0,0.9962,3.34,0.95,10.5,6,69
7.7,0.69,0.22,1.9,0.084,18.0,94.0,0.9961,3.31,0.48,9.5,5,72
8.3,0.675,0.26,2.1,0.084,11.0,43.0,0.9976,3.31,0.53,9.2,4,73
8.8,0.41,0.64,2.2,0.09300000000000001,9.0,42.0,0.9986,3.54,0.66,10.5,5,76
6.8,0.785,0.0,2.4,0.10400000000000001,14.0,30.0,0.9966,3.52,0.55,10.7,6,77
6.7,0.75,0.12,2.0,0.086,12.0,80.0,0.9958,3.38,0.52,10.1,5,78
8.3,0.625,0.2,1.5,0.08,27.0,119.0,0.9972,3.16,1.12,9.1,4,79
6.2,0.45,0.2,1.6,0.069,3.0,15.0,0.9958,3.41,0.56,9.2,5,80
7.4,0.5,0.47,2.0,0.086,21.0,73.0,0.997,3.36,0.57,9.1,5,82
6.3,0.3,0.48,1.8,0.069,18.0,61.0,0.9959,3.44,0.78,10.3,6,84
6.9,0.55,0.15,2.2,0.076,19.0,40.0,0.9961,3.41,0.59,10.1,5,85
8.6,0.49,0.28,1.9,0.11,20.0,136.0,0.9972,2.93,1.95,9.9,6,86
7.7,0.49,0.26,1.9,0.062,9.0,31.0,0.9966,3.39,0.64,9.6,5,87
9.3,0.39,0.44,2.1,0.107,34.0,125.0,0.9978,3.14,1.22,9.5,5,88
7.0,0.62,0.08,1.8,0.076,8.0,24.0,0.9978,3.48,0.53,9.0,5,89
7.9,0.52,0.26,1.9,0.079,42.0,140.0,0.9964,3.23,0.54,9.5,5,90
8.6,0.49,0.28,1.9,0.11,20.0,136.0,0.9972,2.93,1.95,9.9,6,91
7.7,0.49,0.26,1.9,0.062,9.0,31.0,0.9966,3.39,0.64,9.6,5,93
5.0,1.02,0.04,1.4,0.045,41.0,85.0,0.9938,3.75,0.48,10.5,4,94
6.8,0.775,0.0,3.0,0.102,8.0,23.0,0.9965,3.45,0.56,10.7,5,96
7.6,0.9,0.06,2.5,0.079,5.0,10.0,0.9967,3.39,0.56,9.8,5,98
8.1,0.545,0.18,1.9,0.08,13.0,35.0,0.9972,3.3,0.59,9.0,6,99
8.3,0.61,0.3,2.1,0.084,11.0,50.0,0.9972,3.4,0.61,10.2,6,100
8.1,0.545,0.18,1.9,0.08,13.0,35.0,0.9972,3.3,0.59,9.0,6,102
8.1,0.575,0.22,2.1,0.077,12.0,65.0,0.9967,3.29,0.51,9.2,5,103
7.2,0.49,0.24,2.2,0.07,5.0,36.0,0.996,3.33,0.48,9.4,5,104
8.1,0.575,0.22,2.1,0.077,12.0,65.0,0.9967,3.29,0.51,9.2,5,105
7.8,0.41,0.68,1.7,0.467,18.0,69.0,0.9973,3.08,1.31,9.3,5,106
6.2,0.63,0.31,1.7,0.08800000000000001,15.0,64.0,0.9969,3.46,0.79,9.3,5,107
7.8,0.56,0.19,1.8,0.10400000000000001,12.0,47.0,0.9964,3.19,0.93,9.5,5,110
8.4,0.62,0.09,2.2,0.084,11.0,108.0,0.9964,3.15,0.66,9.8,5,111
10.1,0.31,0.44,2.3,0.08,22.0,46.0,0.9988,3.32,0.67,9.7,6,113
7.8,0.56,0.19,1.8,0.10400000000000001,12.0,47.0,0.9964,3.19,0.93,9.5,5,114
9.4,0.4,0.31,2.2,0.09,13.0,62.0,0.9966,3.07,0.63,10.5,6,115
8.3,0.54,0.28,1.9,0.077,11.0,40.0,0.9978,3.39,0.61,10.0,6,116
7.3,1.07,0.09,1.7,0.17800000000000002,10.0,89.0,0.9962,3.3,0.57,9.0,5,120
8.8,0.55,0.04,2.2,0.11900000000000001,14.0,56.0,0.9962,3.21,0.6,10.9,6,121
7.3,0.695,0.0,2.5,0.075,3.0,13.0,0.998,3.49,0.52,9.2,5,122
7.8,0.5,0.17,1.6,0.08199999999999999,21.0,102.0,0.996,3.39,0.48,9.5,5,124
8.2,1.33,0.0,1.7,0.081,3.0,12.0,0.9964,3.53,0.49,10.9,5,126
8.1,1.33,0.0,1.8,0.08199999999999999,3.0,12.0,0.9964,3.54,0.48,10.9,5,127
8.0,0.59,0.16,1.8,0.065,3.0,16.0,0.9962,3.42,0.92,10.5,7,128
8.0,0.745,0.56,2.0,0.11800000000000001,30.0,134.0,0.9968,3.24,0.66,9.4,5,130
5.6,0.5,0.09,2.3,0.049,17.0,99.0,0.9937,3.63,0.63,13.0,5,131
7.9,1.04,0.05,2.2,0.084,13.0,29.0,0.9959,3.22,0.55,9.9,6,134
8.4,0.745,0.11,1.9,0.09,16.0,63.0,0.9965,3.19,0.82,9.6,5,135
7.2,0.415,0.36,2.0,0.081,13.0,45.0,0.9972,3.48,0.64,9.2,5,137
8.4,0.745,0.11,1.9,0.09,16.0,63.0,0.9965,3.19,0.82,9.6,5,140
5.2,0.34,0.0,1.8,0.05,27.0,63.0,0.9916,3.68,0.79,14.0,6,142
6.3,0.39,0.08,1.7,0.066,3.0,20.0,0.9954,3.34,0.58,9.4,5,143
5.2,0.34,0.0,1.8,0.05,27.0,63.0,0.9916,3.68,0.79,14.0,6,144
8.1,0.67,0.55,1.8,0.11699999999999999,32.0,141.0,0.9968,3.17,0.62,9.4,5,145
5.8,0.68,0.02,1.8,0.087,21.0,94.0,0.9944,3.54,0.52,10.0,5,146
6.9,0.49,0.1,2.3,0.07400000000000001,12.0,30.0,0.9959,3.42,0.58,10.2,6,148
7.3,0.33,0.47,2.1,0.077,5.0,11.0,0.9958,3.33,0.53,10.3,6,150
9.2,0.52,1.0,3.4,0.61,32.0,69.0,0.9996,2.74,2.0,9.4,4,151
7.5,0.6,0.03,1.8,0.095,25.0,99.0,0.995,3.35,0.54,10.1,5,152
7.5,0.6,0.03,1.8,0.095,25.0,99.0,0.995,3.35,0.54,10.1,5,153
7.1,0.43,0.42,5.5,0.071,28.0,128.0,0.9973,3.42,0.71,10.5,5,155
7.1,0.43,0.42,5.5,0.07,29.0,129.0,0.9973,3.42,0.72,10.5,5,156
7.1,0.43,0.42,5.5,0.071,28.0,128.0,0.9973,3.42,0.71,10.5,5,157
7.1,0.68,0.0,2.2,0.073,12.0,22.0,0.9969,3.48,0.5,9.3,5,158
6.8,0.6,0.18,1.9,0.079,18.0,86.0,0.9968,3.59,0.57,9.3,6,159
7.6,0.95,0.03,2.0,0.09,7.0,20.0,0.9959,3.2,0.56,9.6,5,160
7.6,0.68,0.02,1.3,0.07200000000000001,9.0,20.0,0.9965,3.17,1.08,9.2,4,161
7.8,0.53,0.04,1.7,0.076,17.0,31.0,0.9964,3.33,0.56,10.0,6,162
7.4,0.6,0.26,7.3,0.07,36.0,121.0,0.9982,3.37,0.49,9.4,5,163
7.3,0.59,0.26,7.2,0.07,35.0,121.0,0.9981,3.37,0.49,9.4,5,164
7.8,0.63,0.48,1.7,0.1,14.0,96.0,0.9961,3.19,0.62,9.5,5,165
6.8,0.64,0.1,2.1,0.085,18.0,101.0,0.9956,3.34,0.52,10.2,5,166
7.3,0.55,0.03,1.6,0.07200000000000001,17.0,42.0,0.9956,3.37,0.48,9.0,4,167
6.8,0.63,0.07,2.1,0.08900000000000001,11.0,44.0,0.9953,3.47,0.55,10.4,6,168
7.9,0.885,0.03,1.8,0.057999999999999996,4.0,8.0,0.9972,3.36,0.33,9.1,4,170
8.0,0.42,0.17,2.0,0.073,6.0,18.0,0.9972,3.29,0.61,9.2,6,172
7.4,0.62,0.05,1.9,0.068,24.0,42.0,0.9961,3.42,0.57,11.5,6,173
6.9,0.5,0.04,1.5,0.085,19.0,49.0,0.9958,3.35,0.78,9.5,5,175
7.3,0.38,0.21,2.0,0.08,7.0,35.0,0.9961,3.33,0.47,9.5,5,176
7.5,0.52,0.42,2.3,0.087,8.0,38.0,0.9972,3.58,0.61,10.5,6,177
7.0,0.805,0.0,2.5,0.068,7.0,20.0,0.9969,3.48,0.56,9.6,5,178
8.8,0.61,0.14,2.4,0.067,10.0,42.0,0.9969,3.19,0.59,9.5,5,179
8.8,0.61,0.14,2.4,0.067,10.0,42.0,0.9969,3.19,0.59,9.5,5,180
8.9,0.61,0.49,2.0,0.27,23.0,110.0,0.9972,3.12,1.02,9.3,5,181
7.2,0.73,0.02,2.5,0.076,16.0,42.0,0.9972,3.44,0.52,9.3,5,182
6.8,0.61,0.2,1.8,0.077,11.0,65.0,0.9971,3.54,0.58,9.3,5,183
6.7,0.62,0.21,1.9,0.079,8.0,62.0,0.997,3.52,0.58,9.3,6,184
8.9,0.31,0.57,2.0,0.111,26.0,85.0,0.9971,3.26,0.53,9.7,5,185
7.4,0.39,0.48,2.0,0.08199999999999999,14.0,67.0,0.9972,3.34,0.55,9.2,5,186
7.9,0.5,0.33,2.0,0.084,15.0,143.0,0.9968,3.2,0.55,9.5,5,188
8.2,0.5,0.35,2.9,0.077,21.0,127.0,0.9976,3.23,0.62,9.4,5,190
6.4,0.37,0.25,1.9,0.07400000000000001,21.0,49.0,0.9974,3.57,0.62,9.8,6,191
7.6,0.55,0.21,2.2,0.071,7.0,28.0,0.9964,3.28,0.55,9.7,5,193
7.6,0.55,0.21,2.2,0.071,7.0,28.0,0.9964,3.28,0.55,9.7,5,194
7.3,0.58,0.3,2.4,0.07400000000000001,15.0,55.0,0.9968,3.46,0.59,10.2,5,196
11.5,0.3,0.6,2.0,0.067,12.0,27.0,0.9981,3.11,0.97,10.1,6,197
6.9,1.09,0.06,2.1,0.061,12.0,31.0,0.9948,3.51,0.43,11.4,4,199
9.6,0.32,0.47,1.4,0.055999999999999994,9.0,24.0,0.99695,3.22,0.82,10.3,7,200
7.0,0.43,0.36,1.6,0.08900000000000001,14.0,37.0,0.99615,3.34,0.56,9.2,6,204
12.8,0.3,0.74,2.6,0.095,9.0,28.0,0.9994,3.2,0.77,10.8,7,205
12.8,0.3,0.74,2.6,0.095,9.0,28.0,0.9994,3.2,0.77,10.8,7,206
7.8,0.44,0.28,2.7,0.1,18.0,95.0,0.9966,3.22,0.67,9.4,5,208
9.7,0.53,0.6,2.0,0.039,5.0,19.0,0.99585,3.3,0.86,12.4,6,210
8.0,0.725,0.24,2.8,0.083,10.0,62.0,0.99685,3.35,0.56,10.0,6,211
8.2,0.57,0.26,2.2,0.06,28.0,65.0,0.9959,3.3,0.43,10.1,5,213
7.8,0.735,0.08,2.4,0.092,10.0,41.0,0.9974,3.24,0.71,9.8,6,214
7.0,0.49,0.49,5.6,0.06,26.0,121.0,0.9974,3.34,0.76,10.5,5,215
8.7,0.625,0.16,2.0,0.10099999999999999,13.0,49.0,0.9962,3.14,0.57,11.0,5,216
8.1,0.725,0.22,2.2,0.07200000000000001,11.0,41.0,0.9967,3.36,0.55,9.1,5,217
7.5,0.49,0.19,1.9,0.076,10.0,44.0,0.9957,3.39,0.54,9.7,5,218
7.8,0.34,0.37,2.0,0.08199999999999999,24.0,58.0,0.9964,3.34,0.59,9.4,6,220
7.4,0.53,0.26,2.0,0.10099999999999999,16.0,72.0,0.9957,3.15,0.57,9.4,5,221
@@ -0,0 +1,292 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "63356928",
   "metadata": {},
   "source": [
    "# Initial Note\n",
    "After running experiments in Colab using open-source models from Hugging Face, I decided to do the exercise with OpenAI. The reason is that Llama 3.2 frequently did not follow the prompts correctly, leading to inconsistencies and poor performance. Additionally, using larger models significantly increased processing time, making them less practical for this task.\n",
    "\n",
    "The code from this notebook will be reorganized into modules for the final demo."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c12f081",
   "metadata": {},
   "source": [
    "# Module to generate synthetic data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2389d798",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def _clean_json_output(raw_text: str) -> str:\n",
    "    \"\"\"\n",
    "    Cleans the OpenAI output so it parses as valid JSON:\n",
    "    - Leaves the quotes around keys untouched.\n",
    "    - Escapes only the double quotes inside value strings.\n",
    "    - Escapes \\\\n, \\\\r, \\\\t.\n",
    "    - Removes code fences and HTML.\n",
    "    - Ensures the array starts with [ and ends with ].\n",
    "    - Removes trailing commas.\n",
    "    \"\"\"\n",
    "    text = raw_text.strip()\n",
    "\n",
    "    # Remove code fences and HTML tags\n",
    "    text = re.sub(r\"```(?:json)?\", \"\", text)\n",
    "    text = re.sub(r\"</?[^>]+>\", \"\", text)\n",
    "\n",
    "    # Escape double quotes inside string values\n",
    "    def escape_quotes_in_values(match):\n",
    "        value = match.group(1)\n",
    "        value = value.replace('\"', r'\\\"')  # only inside the value\n",
    "        value = value.replace('\\n', r'\\n').replace('\\r', r'\\r').replace('\\t', r'\\t')\n",
    "        return f'\"{value}\"'\n",
    "\n",
    "    text = re.sub(r'\"(.*?)\"', escape_quotes_in_values, text)\n",
    "\n",
    "    # Ensure the text starts and ends with []\n",
    "    if not text.startswith('['):\n",
    "        text = '[' + text\n",
    "    if not text.endswith(']'):\n",
    "        text += ']'\n",
    "\n",
    "    # Remove trailing commas before closing brackets\n",
    "    text = re.sub(r',\\s*]', ']', text)\n",
    "\n",
    "    return text\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75bfad6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "import openai\n",
    "import tempfile\n",
    "\n",
    "\n",
    "def generate_synthetic_data_openai(\n",
    "    system_prompt: str,\n",
    "    user_prompt: str,\n",
    "    reference_file=None,\n",
    "    openai_model=\"gpt-4o-mini\",\n",
    "    max_tokens=2048,\n",
    "    temperature=0.0\n",
    "):\n",
    "    \"\"\"\n",
    "    Generates synthetic data and returns the DataFrame and the path of a temporary CSV.\n",
    "    \"\"\"\n",
    "    # Build the full prompt\n",
    "    if reference_file:\n",
    "        # pd.read_csv accepts both a path string and a file-like object\n",
    "        df_ref = pd.read_csv(reference_file)\n",
    "        reference_data = df_ref.to_dict(orient=\"records\")\n",
    "        user_prompt_full = (\n",
    "            f\"{user_prompt}\\nFollow the structure and distribution of the reference data, \"\n",
    "            f\"but do NOT copy any exact values:\\n{reference_data}\"\n",
    "        )\n",
    "    else:\n",
    "        user_prompt_full = user_prompt\n",
    "\n",
    "    # Call OpenAI\n",
    "    response = openai.chat.completions.create(\n",
    "        model=openai_model,\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": system_prompt},\n",
    "            {\"role\": \"user\", \"content\": user_prompt_full},\n",
    "        ],\n",
    "        temperature=temperature,\n",
    "        max_tokens=max_tokens,\n",
    "    )\n",
    "\n",
    "    raw_text = response.choices[0].message.content\n",
    "    cleaned_json = _clean_json_output(raw_text)\n",
    "\n",
    "    # Parse the JSON\n",
    "    try:\n",
    "        data = json.loads(cleaned_json)\n",
    "    except json.JSONDecodeError as e:\n",
    "        raise ValueError(f\"Invalid JSON generated. Error: {e}\\nTruncated output: {cleaned_json[:500]}\")\n",
    "\n",
    "    df = pd.DataFrame(data)\n",
    "\n",
    "    # Save a temporary CSV\n",
    "    tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=\".csv\")\n",
    "    df.to_csv(tmp_file.name, index=False)\n",
    "    tmp_file.close()\n",
    "\n",
    "    return df, tmp_file.name\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91af1eb5",
   "metadata": {},
   "source": [
    "# Default prompts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "792d1555",
   "metadata": {},
   "outputs": [],
   "source": [
    "SYSTEM_PROMPT = \"\"\"\n",
    "You are a precise synthetic data generator. Your only task is to output valid JSON arrays of dictionaries.\n",
    "\n",
    "Rules:\n",
    "1. Output a single JSON array starting with '[' and ending with ']'.\n",
    "2. Do not include markdown, code fences, or explanatory text — only the JSON.\n",
    "3. Keep all columns exactly as specified; do not add or remove fields (index must be omitted).\n",
    "4. Respect data types: text, number, date, boolean, etc.\n",
    "5. Ensure internal consistency and realistic variation.\n",
    "6. If a reference table is provided, generate data with similar statistical distributions for numerical and categorical variables,\n",
    "   but never copy exact rows. Each row must be independent and new.\n",
    "7. For personal information (names, ages, addresses, IDs), ensure diversity and realism — individual values may be reused to maintain realism,\n",
    "   but never reuse or slightly modify entire reference rows.\n",
    "8. Escape all internal double quotes in strings with a backslash (\\\").\n",
    "9. Replace any single quotes in strings with double quotes.\n",
    "10. Escape newline (\\n), tab (\\t), or carriage return (\\r) characters as \\\\n, \\\\t, \\\\r inside strings.\n",
    "11. Remove any trailing commas before closing brackets.\n",
    "12. Do not include any reference data or notes about it in the output.\n",
    "13. The output must always be valid JSON parseable by standard JSON parsers.\n",
    "\"\"\"\n",
    "\n",
    "USER_PROMPT = \"\"\"\n",
    "Generate exactly 15 rows of synthetic data following all the rules above.\n",
    "Ensure that all strings are safe for JSON parsing and ready to convert to a pandas DataFrame.\n",
    "\"\"\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f9331fa",
   "metadata": {},
   "source": [
    "# Test"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d38f0afb",
   "metadata": {},
   "source": [
    "To test our generator, we use the first 50 examples of the Reddit gaming comments with sentiments dataset.\n",
    "Source: https://www.kaggle.com/datasets/sainitishmitta04/23k-reddit-gaming-comments-with-sentiments-dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78d94faa",
   "metadata": {},
   "outputs": [],
   "source": [
    "df, _ = generate_synthetic_data_openai(SYSTEM_PROMPT, USER_PROMPT, reference_file=\"data/sentiment_reference.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e6b5ebb",
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "015a3110",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.Comment[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ef44876",
   "metadata": {},
   "source": [
    "# Gradio Demo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa4092f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "\n",
    "with gr.Blocks() as demo:\n",
    "    gr.Markdown(\"# 🧠 Synthetic Data Generator\")\n",
    "\n",
    "    with gr.Row():\n",
    "        system_prompt_input = gr.Textbox(label=\"System Prompt\", value=SYSTEM_PROMPT, lines=10)\n",
    "\n",
    "    with gr.Row():\n",
    "        user_prompt_input = gr.Textbox(label=\"User Prompt\", value=USER_PROMPT, lines=5)\n",
    "\n",
    "    with gr.Row():\n",
    "        reference_input = gr.File(label=\"Reference CSV (optional)\", file_types=[\".csv\"])\n",
    "\n",
    "    output_df = gr.DataFrame(label=\"Generated Data\")\n",
    "    download_csv = gr.File(label=\"Download CSV\")\n",
    "\n",
    "    generate_btn = gr.Button(\"🚀 Generate Data\")\n",
    "\n",
    "    generate_btn.click(\n",
    "        fn=generate_synthetic_data_openai,\n",
    "        inputs=[system_prompt_input, user_prompt_input, reference_input],\n",
    "        outputs=[output_df, download_csv]\n",
    "    )\n",
    "\n",
    "demo.launch(debug=True)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
@@ -0,0 +1,16 @@
|
|||||||
|
[project]
|
||||||
|
name = "synthetic-data"
|
||||||
|
version = "0.1.0"
|
||||||
|
description = "An intelligent synthetic data generator using OpenAI models"
|
||||||
|
authors = [
|
||||||
|
{ name = "Sebastian Rodriguez" }
|
||||||
|
]
|
||||||
|
dependencies = [
|
||||||
|
"gradio>=5.49.1",
|
||||||
|
"openai>=2.6.0",
|
||||||
|
"pandas>=2.3.3",
|
||||||
|
"python-dotenv>=1.0.0",
|
||||||
|
"numpy>=1.24.0",
|
||||||
|
"matplotlib>=3.7.0",
|
||||||
|
"seaborn>=0.13.0"
|
||||||
|
]
|
||||||
@@ -0,0 +1,10 @@
|
|||||||
|
# Core dependencies
|
||||||
|
gradio>=5.49.1
|
||||||
|
openai>=2.6.0
|
||||||
|
pandas>=2.3.3
|
||||||
|
python-dotenv>=1.0.0
|
||||||
|
|
||||||
|
# Evaluation dependencies
|
||||||
|
numpy>=1.24.0
|
||||||
|
matplotlib>=3.7.0
|
||||||
|
seaborn>=0.13.0
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
import os
|
||||||
|
import glob
|
||||||
|
|
||||||
|
def cleanup_temp_files(temp_dir: str):
|
||||||
|
"""
|
||||||
|
Remove all temporary files from the given directory.
|
||||||
|
"""
|
||||||
|
files = glob.glob(os.path.join(temp_dir, "*"))
|
||||||
|
for f in files:
|
||||||
|
try:
|
||||||
|
os.remove(f)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[Warning] Could not delete {f}: {e}")
|
||||||
@@ -0,0 +1,45 @@
|
|||||||
|
# ------------------- Setup Constants -------------------
|
||||||
|
N_REFERENCE_ROWS = 64 # Max reference rows per batch for sampling
|
||||||
|
MAX_TOKENS_MODEL = 128_000 # Max tokens supported by the model, used for batching computations
|
||||||
|
PROJECT_TEMP_DIR = "temp_plots"
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# ----------------- Prompts -------------------------------
|
||||||
|
SYSTEM_PROMPT = """
|
||||||
|
You are a precise synthetic data generator. Your only task is to output valid JSON arrays of dictionaries.
|
||||||
|
|
||||||
|
Rules:
|
||||||
|
1. Output a single JSON array starting with '[' and ending with ']'.
|
||||||
|
2. Do not include markdown, code fences, or explanatory text — only the JSON.
|
||||||
|
3. Keep all columns exactly as specified; do not add or remove fields (index must be omitted).
|
||||||
|
4. Respect data types: text, number, date, boolean, etc.
|
||||||
|
5. Ensure internal consistency and realistic variation.
|
||||||
|
6. If a reference table is provided, generate data with similar statistical distributions for numerical and categorical variables,
|
||||||
|
but never copy exact rows. Each row must be independent and new.
|
||||||
|
7. For personal information (names, ages, addresses, IDs), ensure diversity and realism — individual values may be reused to maintain realism,
|
||||||
|
but never reuse or slightly modify entire reference rows.
|
||||||
|
8. Escape internal double quotes in strings with a backslash (\") for JSON validity.
|
||||||
|
9. Do NOT replace single quotes in normal text; they should remain as-is.
|
||||||
|
10. Escape newline (\n), tab (\t), or carriage return (\r) characters as \\n, \\t, \\r inside strings.
|
||||||
|
11. Remove any trailing commas before closing brackets.
|
||||||
|
12. Do not include any reference data or notes about it in the output.
|
||||||
|
13. The output must always be valid JSON parseable by standard JSON parsers.
|
||||||
|
14. Do not repeat any exact row, either from the reference data or from previously generated data.
|
||||||
|
15. When using reference data, consider the entire dataset for statistical patterns and diversity;
|
||||||
|
do not restrict generation to the first rows or the order of the dataset.
|
||||||
|
16. Introduce slight random variations in numerical values, and choose categorical values randomly according to the distribution,
|
||||||
|
without repeating rows.
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
USER_PROMPT = """
|
||||||
|
Generate exactly 15 rows of synthetic data following all the rules above.
|
||||||
|
Ensure that all strings are safe for JSON parsing and ready to convert to a pandas DataFrame.
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
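To make the contract above concrete, here is a minimal sketch (made-up values) of a reply that satisfies the rules and parses with a standard JSON parser:

import json

# Hypothetical model reply obeying the rules above: a bare JSON array of dicts,
# no markdown fences, escaped inner quotes, no trailing commas.
raw = '[{"name": "Ava \\"A.\\" Lin", "age": 34, "active": true}]'
rows = json.loads(raw)  # rule 13: must parse with a standard JSON parser
assert isinstance(rows, list) and all(isinstance(r, dict) for r in rows)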
@@ -0,0 +1,108 @@
|
|||||||
|
import os
|
||||||
|
from typing import List
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
from PIL import Image
|
||||||
|
|
||||||
|
from src.constants import MAX_TOKENS_MODEL, N_REFERENCE_ROWS
|
||||||
|
from src.evaluator import SimpleEvaluator
|
||||||
|
from src.helpers import hash_row, sample_reference
|
||||||
|
from src.openai_utils import detect_total_rows_from_prompt, generate_batch
|
||||||
|
|
||||||
|
|
||||||
|
# ------------------- Main Function -------------------
|
||||||
|
def generate_and_evaluate_data(
|
||||||
|
system_prompt: str,
|
||||||
|
user_prompt: str,
|
||||||
|
temp_dir: str,
|
||||||
|
reference_file=None,
|
||||||
|
openai_model: str = "gpt-4o-mini",
|
||||||
|
max_tokens_model: int = MAX_TOKENS_MODEL,
|
||||||
|
n_reference_rows: int = N_REFERENCE_ROWS,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Generate synthetic data in batches, evaluate against reference data, and save results.
|
||||||
|
Uses dynamic batching and reference sampling to optimize cost and token usage.
|
||||||
|
"""
|
||||||
|
os.makedirs(temp_dir, exist_ok=True)
|
||||||
|
reference_df = pd.read_csv(reference_file) if reference_file else None
|
||||||
|
total_rows = detect_total_rows_from_prompt(user_prompt, openai_model)
|
||||||
|
|
||||||
|
final_df = pd.DataFrame()
|
||||||
|
existing_hashes = set()
|
||||||
|
rows_left = total_rows
|
||||||
|
iteration = 0
|
||||||
|
|
||||||
|
print(f"[Info] Total rows requested: {total_rows}")
|
||||||
|
|
||||||
|
# Estimate tokens for the prompt by combining the system prompt, user prompt and a reference sample (used once per batch)
|
||||||
|
prompt_sample = f"{system_prompt} {user_prompt} {sample_reference(reference_df, n_reference_rows)}"
|
||||||
|
prompt_tokens = max(1, len(prompt_sample) // 4)
|
||||||
|
|
||||||
|
# Estimate tokens per row dynamically using a sample
|
||||||
|
example_sample = sample_reference(reference_df, n_reference_rows)
|
||||||
|
if example_sample is not None and len(example_sample) > 0:
|
||||||
|
sample_text = str(example_sample)
|
||||||
|
tokens_per_row = max(1, len(sample_text) // len(example_sample) // 4)
|
||||||
|
else:
|
||||||
|
tokens_per_row = 30 # fallback if no reference
|
||||||
|
|
||||||
|
print(f"[Info] Tokens per row estimate: {tokens_per_row}, Prompt tokens: {prompt_tokens}")
|
||||||
|
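# Worked example of the budget math used in the batch loop below (illustrative
# numbers, not measured): with max_tokens_model = 128_000, prompt_tokens ~= 2_000
# and tokens_per_row ~= 40, batch_size = min(rows_left, (128_000 - 2_000) // 40)
# = min(rows_left, 3_150), so small requests fit one batch and large ones split.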
|
||||||
|
# ---------------- Batch Generation Loop ----------------
|
||||||
|
while rows_left > 0:
|
||||||
|
iteration += 1
|
||||||
|
batch_sample = sample_reference(reference_df, n_reference_rows)
|
||||||
|
batch_size = min(rows_left, max(1, (max_tokens_model - prompt_tokens) // tokens_per_row))
|
||||||
|
print(f"[Batch {iteration}] Batch size: {batch_size}, Rows left: {rows_left}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
df_batch = generate_batch(
|
||||||
|
system_prompt, user_prompt, batch_sample, batch_size, openai_model
|
||||||
|
)
|
||||||
|
except Exception as e:
|
||||||
|
print(f"[Error] Batch {iteration} failed: {e}")
|
||||||
|
break
|
||||||
|
|
||||||
|
# Filter duplicates using hash
|
||||||
|
new_rows = [
|
||||||
|
row
|
||||||
|
for _, row in df_batch.iterrows()
|
||||||
|
if hash_row(row) not in existing_hashes
|
||||||
|
]
|
||||||
|
for row in new_rows:
|
||||||
|
existing_hashes.add(hash_row(row))
|
||||||
|
|
||||||
|
final_df = pd.concat([final_df, pd.DataFrame(new_rows)], ignore_index=True)
|
||||||
|
rows_left = total_rows - len(final_df)
|
||||||
|
print(
|
||||||
|
f"[Batch {iteration}] Unique new rows added: {len(new_rows)}, Total so far: {len(final_df)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
if len(new_rows) == 0:
|
||||||
|
print("[Warning] No new unique rows. Stopping batches.")
|
||||||
|
break
|
||||||
|
|
||||||
|
# ---------------- Evaluation ----------------
|
||||||
|
report_df, vis_dict = pd.DataFrame(), {}
|
||||||
|
if reference_df is not None and not final_df.empty:
|
||||||
|
evaluator = SimpleEvaluator(temp_dir=temp_dir)
|
||||||
|
evaluator.evaluate(reference_df, final_df)
|
||||||
|
report_df = evaluator.results_as_dataframe()
|
||||||
|
vis_dict = evaluator.create_visualizations_temp_dict(reference_df, final_df)
|
||||||
|
print(f"[Info] Evaluation complete. Report shape: {report_df.shape}")
|
||||||
|
|
||||||
|
# ---------------- Collect Images ----------------
|
||||||
|
all_images: List[Image.Image] = []
|
||||||
|
for imgs in vis_dict.values():
|
||||||
|
if isinstance(imgs, list):
|
||||||
|
all_images.extend([img for img in imgs if img is not None])
|
||||||
|
|
||||||
|
# ---------------- Save CSV ----------------
|
||||||
|
final_csv_path = os.path.join(temp_dir, "synthetic_data.csv")
|
||||||
|
final_df.to_csv(final_csv_path, index=False)
|
||||||
|
print(f"[Done] Generated {len(final_df)} rows → saved to {final_csv_path}")
|
||||||
|
|
||||||
|
generated_state = {}
|
||||||
|
|
||||||
|
return final_df, final_csv_path, report_df, generated_state, all_images
|
||||||
@@ -0,0 +1,142 @@
|
|||||||
|
import seaborn as sns
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from typing import List, Dict, Any, Optional
|
||||||
|
from PIL import Image
|
||||||
|
import pandas as pd
|
||||||
|
import os
|
||||||
|
|
||||||
|
class SimpleEvaluator:
|
||||||
|
"""
|
||||||
|
Evaluates synthetic data against a reference dataset, providing summary statistics and visualizations.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, temp_dir: str = "temp_plots"):
|
||||||
|
"""
|
||||||
|
Initialize the evaluator.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
temp_dir (str): Directory to save temporary plot images.
|
||||||
|
"""
|
||||||
|
self.temp_dir = temp_dir
|
||||||
|
os.makedirs(self.temp_dir, exist_ok=True)
|
||||||
|
|
||||||
|
def evaluate(self, reference_df: pd.DataFrame, generated_df: pd.DataFrame) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Compare numerical and categorical columns between reference and generated datasets.
|
||||||
|
"""
|
||||||
|
self.results: Dict[str, Any] = {}
|
||||||
|
self.common_cols = list(set(reference_df.columns) & set(generated_df.columns))
|
||||||
|
|
||||||
|
for col in self.common_cols:
|
||||||
|
if pd.api.types.is_numeric_dtype(reference_df[col]):
|
||||||
|
self.results[col] = {
|
||||||
|
"type": "numerical",
|
||||||
|
"ref_mean": reference_df[col].mean(),
|
||||||
|
"gen_mean": generated_df[col].mean(),
|
||||||
|
"mean_diff": generated_df[col].mean() - reference_df[col].mean(),
|
||||||
|
"ref_std": reference_df[col].std(),
|
||||||
|
"gen_std": generated_df[col].std(),
|
||||||
|
"std_diff": generated_df[col].std() - reference_df[col].std(),
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
ref_counts = reference_df[col].value_counts(normalize=True)
|
||||||
|
gen_counts = generated_df[col].value_counts(normalize=True)
|
||||||
|
overlap = sum(min(ref_counts.get(k, 0), gen_counts.get(k, 0)) for k in ref_counts.index)
|
||||||
|
self.results[col] = {
|
||||||
|
"type": "categorical",
|
||||||
|
"distribution_overlap_pct": round(overlap * 100, 2),
|
||||||
|
"ref_unique": len(ref_counts),
|
||||||
|
"gen_unique": len(gen_counts)
|
||||||
|
}
|
||||||
|
|
||||||
|
return self.results
|
||||||
|
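# Worked example of the categorical overlap metric (illustrative): if the
# reference is 60% "A" / 40% "B" and the generated data is 50% "A" / 50% "B",
# the overlap is min(.6, .5) + min(.4, .5) = 0.9, reported as 90%.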
|
||||||
|
def results_as_dataframe(self) -> pd.DataFrame:
|
||||||
|
"""
|
||||||
|
Convert the evaluation results into a pandas DataFrame for display.
|
||||||
|
"""
|
||||||
|
rows = []
|
||||||
|
for col, stats in self.results.items():
|
||||||
|
if stats["type"] == "numerical":
|
||||||
|
rows.append({
|
||||||
|
"Column": col,
|
||||||
|
"Type": "Numerical",
|
||||||
|
"Ref Mean/Std": f"{stats['ref_mean']:.2f} / {stats['ref_std']:.2f}",
|
||||||
|
"Gen Mean/Std": f"{stats['gen_mean']:.2f} / {stats['gen_std']:.2f}",
|
||||||
|
"Diff": f"Mean diff: {stats['mean_diff']:.2f}, Std diff: {stats['std_diff']:.2f}"
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
rows.append({
|
||||||
|
"Column": col,
|
||||||
|
"Type": "Categorical",
|
||||||
|
"Ref": f"{stats['ref_unique']} unique",
|
||||||
|
"Gen": f"{stats['gen_unique']} unique",
|
||||||
|
"Diff": f"Overlap: {stats['distribution_overlap_pct']}%"
|
||||||
|
})
|
||||||
|
return pd.DataFrame(rows)
|
||||||
|
|
||||||
|
def create_visualizations_temp_dict(
|
||||||
|
self,
|
||||||
|
reference_df: pd.DataFrame,
|
||||||
|
generated_df: pd.DataFrame,
|
||||||
|
percentage: bool = True
|
||||||
|
) -> Dict[str, List[Optional[Image.Image]]]:
|
||||||
|
"""
|
||||||
|
Create histogram and boxplot visualizations for each column and save them as temporary images.
|
||||||
|
Handles special characters in column names and category labels.
|
||||||
|
"""
|
||||||
|
vis_dict: Dict[str, List[Optional[Image.Image]]] = {}
|
||||||
|
common_cols = list(set(reference_df.columns) & set(generated_df.columns))
|
||||||
|
|
||||||
|
for col in common_cols:
|
||||||
|
col_safe = str(col).replace("_", r"\_").replace("$", r"\$") # Escape special chars
|
||||||
|
|
||||||
|
# ---------------- Histogram ----------------
|
||||||
|
plt.figure(figsize=(6, 4))
|
||||||
|
if pd.api.types.is_numeric_dtype(reference_df[col]):
|
||||||
|
sns.histplot(reference_df[col], color="blue", label="Reference",
|
||||||
|
stat="percent" if percentage else "count", alpha=0.5)
|
||||||
|
sns.histplot(generated_df[col], color="orange", label="Generated",
|
||||||
|
stat="percent" if percentage else "count", alpha=0.5)
|
||||||
|
else: # Categorical
|
||||||
|
ref_counts = reference_df[col].value_counts(normalize=percentage)
|
||||||
|
gen_counts = generated_df[col].value_counts(normalize=percentage)
|
||||||
|
categories = list(set(ref_counts.index) | set(gen_counts.index))
|
||||||
|
categories_safe = [str(cat).replace("_", r"\_").replace("$", r"\$") for cat in categories]
|
||||||
|
ref_vals = [ref_counts.get(cat, 0) for cat in categories]
|
||||||
|
gen_vals = [gen_counts.get(cat, 0) for cat in categories]
|
||||||
|
|
||||||
|
x = range(len(categories))
|
||||||
|
width = 0.4
|
||||||
|
plt.bar([i - width/2 for i in x], ref_vals, width=width, color="blue", alpha=0.7, label="Reference")
|
||||||
|
plt.bar([i + width/2 for i in x], gen_vals, width=width, color="orange", alpha=0.7, label="Generated")
|
||||||
|
plt.xticks(x, categories_safe, rotation=45, ha="right")
|
||||||
|
|
||||||
|
plt.title(f"Histogram comparison for '{col_safe}'", fontsize=12, usetex=False)
|
||||||
|
plt.legend()
|
||||||
|
plt.tight_layout()
|
||||||
|
hist_path = os.path.join(self.temp_dir, f"{col}_hist.png")
|
||||||
|
plt.savefig(hist_path, bbox_inches='tight')
|
||||||
|
plt.close()
|
||||||
|
hist_img = Image.open(hist_path)
|
||||||
|
|
||||||
|
# ---------------- Boxplot (numerical only) ----------------
|
||||||
|
box_img = None
|
||||||
|
if pd.api.types.is_numeric_dtype(reference_df[col]):
|
||||||
|
plt.figure(figsize=(6, 4))
|
||||||
|
df_box = pd.DataFrame({
|
||||||
|
'Value': pd.concat([reference_df[col], generated_df[col]], ignore_index=True),
|
||||||
|
'Dataset': ['Reference']*len(reference_df[col]) + ['Generated']*len(generated_df[col])
|
||||||
|
})
|
||||||
|
|
||||||
|
sns.boxplot(x='Dataset', y='Value', data=df_box, palette=['#1f77b4','#ff7f0e'])
|
||||||
|
plt.title(f"Boxplot comparison for '{col_safe}'", fontsize=12, usetex=False)
|
||||||
|
plt.tight_layout()
|
||||||
|
box_path = os.path.join(self.temp_dir, f"{col}_box.png")
|
||||||
|
plt.savefig(box_path, bbox_inches='tight')
|
||||||
|
plt.close()
|
||||||
|
box_img = Image.open(box_path)
|
||||||
|
|
||||||
|
vis_dict[col] = [hist_img, box_img]
|
||||||
|
|
||||||
|
return vis_dict
|
||||||
@@ -0,0 +1,14 @@
|
|||||||
|
import hashlib
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
def hash_row(row: pd.Series) -> str:
|
||||||
|
"""Compute MD5 hash for a row to detect duplicates."""
|
||||||
|
return hashlib.md5(str(tuple(row)).encode()).hexdigest()
|
||||||
|
|
||||||
|
|
||||||
|
def sample_reference(reference_df: pd.DataFrame, n_reference_rows: int) -> list:
|
||||||
|
"""Return a fresh sample of reference data for batch generation."""
|
||||||
|
if reference_df is not None and not reference_df.empty:
|
||||||
|
sample_df = reference_df.sample(min(n_reference_rows, len(reference_df)), replace=False)
|
||||||
|
return sample_df.to_dict(orient="records")
|
||||||
|
return []
|
||||||
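A minimal usage sketch (toy data; hash_row assumed in scope from above) of how these helpers support the batch loop. Note the hash is built from str(tuple(row)), so it is sensitive to column order:

import pandas as pd

seen = set()
batch = pd.DataFrame([{"a": 1}, {"a": 1}, {"a": 2}])
fresh = []
for _, row in batch.iterrows():
    h = hash_row(row)
    if h not in seen:  # skip rows already accepted in earlier batches
        seen.add(h)
        fresh.append(row)
print(len(fresh))  # 2 -- the duplicate {"a": 1} row is dropped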
@@ -0,0 +1,112 @@
|
|||||||
|
import json
|
||||||
|
import re
|
||||||
|
import tempfile
|
||||||
|
import openai
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
import os
|
||||||
|
from typing import List
|
||||||
|
|
||||||
|
|
||||||
|
# ------------------ JSON Cleaning ------------------
|
||||||
|
def _clean_json_output(raw_text: str) -> str:
|
||||||
|
"""
|
||||||
|
Cleans raw OpenAI output to produce valid JSON.
|
||||||
|
Escapes only double quotes and control characters.
|
||||||
|
"""
|
||||||
|
text = raw_text.strip()
|
||||||
|
text = re.sub(r"```(?:json)?", "", text)
|
||||||
|
text = re.sub(r"</?[^>]+>", "", text)
|
||||||
|
|
||||||
|
def escape_quotes(match):
|
||||||
|
value = match.group(1)
|
||||||
|
value = value.replace('"', r"\"")
|
||||||
|
value = value.replace("\n", r"\n").replace("\r", r"\r").replace("\t", r"\t")
|
||||||
|
return f'"{value}"'
|
||||||
|
|
||||||
|
text = re.sub(r'"(.*?)"', escape_quotes, text)
|
||||||
|
|
||||||
|
text = text.strip()  # re-strip: fence removal can leave leading/trailing newlines
if not text.startswith("["):
|
||||||
|
text = "[" + text
|
||||||
|
if not text.endswith("]"):
|
||||||
|
text += "]"
|
||||||
|
text = re.sub(r",\s*]", "]", text)
|
||||||
|
return text
|
||||||
|
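# Quick sketch of what the cleaner tolerates (hypothetical raw model reply):
#   _clean_json_output('```json\n[{"comment": "GG"},]\n```')
#   -> '[{"comment": "GG"}]'  (fences stripped, trailing comma removed)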
|
||||||
|
|
||||||
|
# ------------------ Synthetic Data Generation ------------------
|
||||||
|
def generate_synthetic_data_openai(
|
||||||
|
system_prompt: str,
|
||||||
|
full_user_prompt: str,
|
||||||
|
openai_model: str = "gpt-4o-mini",
|
||||||
|
max_tokens: int = 16000,
|
||||||
|
temperature: float = 0.0,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Generates synthetic tabular data using OpenAI.
|
||||||
|
Assumes `full_user_prompt` is already complete with reference data.
|
||||||
|
"""
|
||||||
|
response = openai.chat.completions.create(
|
||||||
|
model=openai_model,
|
||||||
|
messages=[
|
||||||
|
{"role": "system", "content": system_prompt},
|
||||||
|
{"role": "user", "content": full_user_prompt},
|
||||||
|
],
|
||||||
|
max_completion_tokens=max_tokens,
|
||||||
|
temperature=temperature,
|
||||||
|
)
|
||||||
|
|
||||||
|
raw_text = response.choices[0].message.content
|
||||||
|
cleaned_json = _clean_json_output(raw_text)
|
||||||
|
|
||||||
|
try:
|
||||||
|
data = json.loads(cleaned_json)
|
||||||
|
except json.JSONDecodeError as e:
|
||||||
|
raise ValueError(
|
||||||
|
f"Invalid JSON generated. Error: {e}\nTruncated output: {cleaned_json[:500]}"
|
||||||
|
)
|
||||||
|
|
||||||
|
df = pd.DataFrame(data)
|
||||||
|
|
||||||
|
tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".csv")
|
||||||
|
df.to_csv(tmp_file.name, index=False)
|
||||||
|
tmp_file.close()
|
||||||
|
|
||||||
|
return df, tmp_file.name
|
||||||
|
|
||||||
|
# ---------------------- Mini call to detect the number of rows in the prompt --------------
|
||||||
|
def detect_total_rows_from_prompt(user_prompt: str, openai_model: str = "gpt-4o-mini") -> int:
|
||||||
|
"""
|
||||||
|
Detect the number of rows requested from the user prompt.
|
||||||
|
Fallback to 20 if detection fails.
|
||||||
|
"""
|
||||||
|
mini_prompt = f"""
|
||||||
|
Extract the number of rows to generate from this instruction:
|
||||||
|
\"\"\"{user_prompt}\"\"\" Return only the number.
|
||||||
|
"""
|
||||||
|
openai.api_key = os.getenv("OPENAI_API_KEY")
|
||||||
|
try:
|
||||||
|
response = openai.chat.completions.create(
|
||||||
|
model=openai_model,
|
||||||
|
messages=[{"role": "user", "content": mini_prompt}],
|
||||||
|
temperature=0,
|
||||||
|
max_tokens=10,
|
||||||
|
)
|
||||||
|
text = response.choices[0].message.content.strip()
|
||||||
|
total_rows = int("".join(filter(str.isdigit, text)))
|
||||||
|
return max(total_rows, 1)
|
||||||
|
except Exception:
|
||||||
|
return 20
|
||||||
|
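# Usage sketch (hypothetical prompt): the mini call pulls the target row count
# out of free-form instructions, e.g.
#   detect_total_rows_from_prompt("Generate exactly 15 rows of synthetic data ...") -> 15
# and silently falls back to 20 if the call or the parsing fails.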
|
||||||
|
|
||||||
|
# -------------- Function to generate synthetic data in a batch ---------------------
|
||||||
|
def generate_batch(system_prompt: str, user_prompt: str, reference_sample: List[dict],
|
||||||
|
batch_size: int, openai_model: str):
|
||||||
|
"""Generate a single batch of synthetic data using OpenAI."""
|
||||||
|
full_prompt = f"{user_prompt}\nSample: {reference_sample}\nGenerate exactly {batch_size} rows."
|
||||||
|
df_batch, _ = generate_synthetic_data_openai(
|
||||||
|
system_prompt=system_prompt,
|
||||||
|
full_user_prompt=full_prompt,
|
||||||
|
openai_model=openai_model,
|
||||||
|
)
|
||||||
|
return df_batch
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
# -------------------------------
|
||||||
|
# Helper function to display CSV
|
||||||
|
# -------------------------------
|
||||||
|
def display_reference_csv(file):
|
||||||
|
if file is None:
|
||||||
|
return pd.DataFrame()
|
||||||
|
try:
|
||||||
|
df = pd.read_csv(file.name if hasattr(file, "name") else file)
|
||||||
|
return df
|
||||||
|
except Exception as e:
|
||||||
|
return pd.DataFrame({"Error": [str(e)]})
|
||||||
372
week6/community-contributions/Exercise_week6_jom.ipynb
Normal file
372
week6/community-contributions/Exercise_week6_jom.ipynb
Normal file
@@ -0,0 +1,372 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "168f6f43",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"import os\n",
|
||||||
|
"import re\n",
|
||||||
|
"import math\n",
|
||||||
|
"import json\n",
|
||||||
|
"import random\n",
|
||||||
|
"from dotenv import load_dotenv\n",
|
||||||
|
"from huggingface_hub import login\n",
|
||||||
|
"import matplotlib.pyplot as plt\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"import pickle\n",
|
||||||
|
"from collections import Counter\n",
|
||||||
|
"from openai import OpenAI\n",
|
||||||
|
"from anthropic import Anthropic\n",
|
||||||
|
"\n",
|
||||||
|
"# environment\n",
|
||||||
|
"\n",
|
||||||
|
"load_dotenv(override=True)\n",
|
||||||
|
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n",
|
||||||
|
"os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')\n",
|
||||||
|
"os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')\n",
|
||||||
|
"\n",
|
||||||
|
"hf_token = os.environ['HF_TOKEN']\n",
|
||||||
|
"login(hf_token, add_to_git_credential=True)\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"from items import Item\n",
|
||||||
|
"from testing import Tester\n",
|
||||||
|
"\n",
|
||||||
|
"openai = OpenAI()\n",
|
||||||
|
"\n",
|
||||||
|
"%matplotlib inline"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "b990ccf1",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"\n",
|
||||||
|
"with open('train.pkl', 'rb') as file:\n",
|
||||||
|
" train = pickle.load(file)\n",
|
||||||
|
"\n",
|
||||||
|
"with open('test.pkl', 'rb') as file:\n",
|
||||||
|
" test = pickle.load(file)\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"fine_tune_train = train[:200]\n",
|
||||||
|
"fine_tune_validation = train[200:250]\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"def messages_for(item):\n",
|
||||||
|
" system_message = \"You estimate prices of items. Reply only with the price, no explanation\"\n",
|
||||||
|
" user_prompt = item.test_prompt().replace(\" to the nearest dollar\",\"\").replace(\"\\n\\nPrice is $\",\"\")\n",
|
||||||
|
" return [\n",
|
||||||
|
" {\"role\": \"system\", \"content\": system_message},\n",
|
||||||
|
" {\"role\": \"user\", \"content\": user_prompt},\n",
|
||||||
|
" {\"role\": \"assistant\", \"content\": f\"Price is ${item.price:.2f}\"}\n",
|
||||||
|
" ]\n",
|
||||||
|
"\n",
|
||||||
|
"def make_jsonl(items):\n",
|
||||||
|
" result = \"\"\n",
|
||||||
|
" for item in items:\n",
|
||||||
|
" messages = messages_for(item)\n",
|
||||||
|
" messages_str = json.dumps(messages)\n",
|
||||||
|
" result += '{\"messages\": ' + messages_str +'}\\n'\n",
|
||||||
|
" return result.strip()\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"def write_jsonl(items, filename):\n",
|
||||||
|
" with open(filename, \"w\") as f:\n",
|
||||||
|
" jsonl = make_jsonl(items)\n",
|
||||||
|
" f.write(jsonl)\n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
},
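For reference, a minimal sketch (made-up item text and price) of the single JSONL record that write_jsonl above emits per item:

import json

line = {"messages": [
    {"role": "system", "content": "You estimate prices of items. Reply only with the price, no explanation"},
    {"role": "user", "content": "A toy example product description"},  # made-up item text
    {"role": "assistant", "content": "Price is $19.99"},               # made-up price
]}
print(json.dumps(line))  # one JSON object per line, as the fine-tuning API expects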
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "f0d128e2",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Trained too fast\n",
|
||||||
|
"It resulted in overfitting (validation loss jumping all around about x4 larger) although Accuracy stayed constant. \n",
|
||||||
|
"Epochs: 2 Batch size: 16 LR multiplier:0.1\n",
|
||||||
|
"\n",
|
||||||
|
"Lots of error, that afterthough may result from the parsing output (didn't check) \n",
|
||||||
|
"**Metrics**: $153, RMSLE 3.6 Hits 31% "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "f8cce151",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"write_jsonl(fine_tune_train, \"fine_tune_train.jsonl\")\n",
|
||||||
|
"write_jsonl(fine_tune_validation, \"fine_tune_validation.jsonl\")\n",
|
||||||
|
"\n",
|
||||||
|
"with open(\"fine_tune_train.jsonl\", \"rb\") as f:\n",
|
||||||
|
" train_file = openai.files.create(file=f, purpose=\"fine-tune\")\n",
|
||||||
|
"with open(\"fine_tune_validation.jsonl\", \"rb\") as f:\n",
|
||||||
|
" validation_file = openai.files.create(file=f, purpose=\"fine-tune\")\n",
|
||||||
|
"\n",
|
||||||
|
"wandb_integration = {\"type\": \"wandb\", \"wandb\": {\"project\": \"gpt-pricer\"}}\n",
|
||||||
|
"\n",
|
||||||
|
"openai.fine_tuning.jobs.create(\n",
|
||||||
|
" training_file=train_file.id,\n",
|
||||||
|
" validation_file=validation_file.id,\n",
|
||||||
|
" model=\"gpt-4o-mini-2024-07-18\",\n",
|
||||||
|
" seed=42,\n",
|
||||||
|
" hyperparameters={\"n_epochs\": 5},\n",
|
||||||
|
" integrations = [wandb_integration],\n",
|
||||||
|
" suffix=\"pricer_v1\"\n",
|
||||||
|
")\n",
|
||||||
|
"\n",
|
||||||
|
"fine_tuned_model_name_hpo = openai.fine_tuning.jobs.retrieve(job_id).fine_tuned_model\n",
|
||||||
|
"# The prompt\n",
|
||||||
|
"\n",
|
||||||
|
"def messages_for_test(item):\n",
|
||||||
|
" system_message = \"You estimate prices of items. Reply only with the price, no explanation\"\n",
|
||||||
|
" user_prompt = item.test_prompt().replace(\" to the nearest dollar\",\"\").replace(\"\\n\\nPrice is $\",\"\")\n",
|
||||||
|
" return [\n",
|
||||||
|
" {\"role\": \"system\", \"content\": system_message},\n",
|
||||||
|
" {\"role\": \"user\", \"content\": user_prompt},\n",
|
||||||
|
" {\"role\": \"assistant\", \"content\": \"Price is $\"}\n",
|
||||||
|
" ]\n",
|
||||||
|
"# A utility function to extract the price from a string\n",
|
||||||
|
"\n",
|
||||||
|
"def get_price(s):\n",
|
||||||
|
" s = s.replace('$','').replace(',','')\n",
|
||||||
|
" match = re.search(r\"[-+]?\\d*\\.\\d+|\\d+\", s)\n",
|
||||||
|
" return float(match.group()) if match else 0\n",
|
||||||
|
"\n",
|
||||||
|
"# The function for gpt-4o-mini\n",
|
||||||
|
"\n",
|
||||||
|
"def gpt_fine_tuned(item):\n",
|
||||||
|
" response = openai.chat.completions.create(\n",
|
||||||
|
" model=fine_tuned_model_name_hpo,\n",
|
||||||
|
" messages=messages_for_test(item),\n",
|
||||||
|
" seed=42,\n",
|
||||||
|
" max_tokens=7\n",
|
||||||
|
" )\n",
|
||||||
|
" reply = response.choices[0].message.content\n",
|
||||||
|
" return get_price(reply)\n",
|
||||||
|
"\n",
|
||||||
|
"Tester.test(gpt_fine_tuned, test)"
|
||||||
|
]
|
||||||
|
},
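The regex in get_price is tolerant of currency formatting; a few illustrative checks (hypothetical inputs):

assert get_price("Price is $1,234.56") == 1234.56  # "$" and "," are stripped first
assert get_price("around 40 dollars") == 40.0      # bare integers match too
assert get_price("no number here") == 0            # fallback when nothing matches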
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "43716422",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Same OP model, but with nicer prompting ONLY at inference\n",
|
||||||
|
"It fixed the $0 prices, driving \n",
|
||||||
|
"**Metrics**: $88, RMSLE 0.59 Hits 50% "
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "c624cade",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"def modified_messages_for_test(item):\n",
|
||||||
|
" system_message = (\n",
|
||||||
|
" \"You are a helpful assistant skilled at estimating the prices of a wide range of products and purchases.\"\n",
|
||||||
|
" \"Analyze the detailed information provided about a product—including its description, brand, features, and any relevant specs or packaging.\"\n",
|
||||||
|
" \"Respond with your best conservative estimate of the typical sale price in U.S. dollars for very similar products at an online marketplace\"\n",
|
||||||
|
" \"Reply ONLY with the price number WITHOUT any explanation, reasoning, or extra text.\"\n",
|
||||||
|
" \"Price cannot be zero, always make sensible assumptions.\"\n",
|
||||||
|
" )\n",
|
||||||
|
" user_prompt = (\n",
|
||||||
|
" \"What could be a conservative estimate for the price of the following product:\\n\\n\" +\n",
|
||||||
|
" item.test_prompt().replace(\" to the nearest dollar\", \"\").replace(\"\\n\\nPrice is $\", \"\")\n",
|
||||||
|
" )\n",
|
||||||
|
" return [\n",
|
||||||
|
" {\"role\": \"system\", \"content\": system_message},\n",
|
||||||
|
" {\"role\": \"user\", \"content\": user_prompt},\n",
|
||||||
|
" {\"role\": \"assistant\", \"content\": f\"Price is $\"}\n",
|
||||||
|
" ]\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"def gpt_fine_tuned(item):\n",
|
||||||
|
" response = openai.chat.completions.create(\n",
|
||||||
|
" model=fine_tuned_model_name_epoch5,\n",
|
||||||
|
" messages=modified_messages_for_test(item),\n",
|
||||||
|
" seed=42,\n",
|
||||||
|
" max_tokens=7\n",
|
||||||
|
" )\n",
|
||||||
|
" reply = response.choices[0].message.content\n",
|
||||||
|
" return get_price(reply)\n",
|
||||||
|
"\n",
|
||||||
|
"Tester.test(gpt_fine_tuned, test)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "892b06e3",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Trying to fix overfitting, setting new HPO and prompting on training \n",
|
||||||
|
"Epochs:1 Batch size:1 LR multiplier:0.01 \n",
|
||||||
|
"Didn't make noticeable difference \n",
|
||||||
|
"**Metrics**: $89, RMSLE 0.56 Hits 50% \n",
|
||||||
|
"\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "662870a8",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"\n",
|
||||||
|
"def modified_messages_for(item):\n",
|
||||||
|
" system_message = (\n",
|
||||||
|
" \"You are a helpful assistant skilled at estimating the prices of a wide range of products and purchases.\"\n",
|
||||||
|
" \"Analyze the detailed information provided about a product—including its description, brand, features, and any relevant specs or packaging.\"\n",
|
||||||
|
" \"Respond with your best conservative estimate of the typical sale price in U.S. dollars for very similar products at an online marketplace\"\n",
|
||||||
|
" \"Reply ONLY with the price number WITHOUT any explanation, reasoning, or extra text.\"\n",
|
||||||
|
" \"Price cannot be zero, always make sensible assumptions.\"\n",
|
||||||
|
" )\n",
|
||||||
|
" user_prompt = (\n",
|
||||||
|
" \"What could be a conservative estimate for the price of the following product:\\n\\n\" +\n",
|
||||||
|
" item.test_prompt().replace(\" to the nearest dollar\", \"\").replace(\"\\n\\nPrice is $\", \"\")\n",
|
||||||
|
" )\n",
|
||||||
|
" return [\n",
|
||||||
|
" {\"role\": \"system\", \"content\": system_message},\n",
|
||||||
|
" {\"role\": \"user\", \"content\": user_prompt},\n",
|
||||||
|
" {\"role\": \"assistant\", \"content\": f\"Price is ${item.price:.2f}\"}\n",
|
||||||
|
"\n",
|
||||||
|
" ]\n",
|
||||||
|
"\n",
|
||||||
|
"def modified_make_jsonl(items):\n",
|
||||||
|
" result = \"\"\n",
|
||||||
|
" for item in items:\n",
|
||||||
|
" messages = modified_messages_for(item)\n",
|
||||||
|
" messages_str = json.dumps(messages)\n",
|
||||||
|
" result += '{\"messages\": ' + messages_str +'}\\n'\n",
|
||||||
|
" return result.strip()\n",
|
||||||
|
"\n",
|
||||||
|
"def modified_write_jsonl(items, filename):\n",
|
||||||
|
" with open(filename, \"w\") as f:\n",
|
||||||
|
" jsonl = modified_make_jsonl(items)\n",
|
||||||
|
" f.write(jsonl)\n",
|
||||||
|
"\n",
|
||||||
|
"modified_write_jsonl(fine_tune_train, \"mod_fine_tune_train.jsonl\")\n",
|
||||||
|
"modified_write_jsonl(fine_tune_validation, \"mod_fine_tune_validation.jsonl\")\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"with open(\"mod_fine_tune_train.jsonl\", \"rb\") as f:\n",
|
||||||
|
" mod_train_file = openai.files.create(file=f, purpose=\"fine-tune\")\n",
|
||||||
|
"with open(\"mod_fine_tune_validation.jsonl\", \"rb\") as f:\n",
|
||||||
|
" mod_validation_file = openai.files.create(file=f, purpose=\"fine-tune\")\n",
|
||||||
|
"\n",
|
||||||
|
"openai.fine_tuning.jobs.create(\n",
|
||||||
|
" training_file=mod_train_file.id,\n",
|
||||||
|
" validation_file=mod_validation_file.id,\n",
|
||||||
|
" model=\"gpt-4o-mini-2024-07-18\",\n",
|
||||||
|
" seed=42,\n",
|
||||||
|
" hyperparameters={\"n_epochs\": 1, \"learning_rate_multiplier\":1., \"batch_size\":1},\n",
|
||||||
|
" integrations = [wandb_integration],\n",
|
||||||
|
" suffix=\"pricer_v3\"\n",
|
||||||
|
")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "b7d14e01",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"fine_tuned_model_name_prompt_train = openai.fine_tuning.jobs.retrieve(job_id).fine_tuned_model\n",
|
||||||
|
"\n",
|
||||||
|
"\n",
|
||||||
|
"def mod_gpt_fine_tuned(item):\n",
|
||||||
|
" response = openai.chat.completions.create(\n",
|
||||||
|
" model=fine_tuned_model_name_prompt_train,\n",
|
||||||
|
" messages=modified_messages_for_test(item),\n",
|
||||||
|
" seed=42,\n",
|
||||||
|
" max_tokens=7\n",
|
||||||
|
" )\n",
|
||||||
|
" reply = response.choices[0].message.content\n",
|
||||||
|
" return get_price(reply)\n",
|
||||||
|
"\n",
|
||||||
|
"Tester.test(mod_gpt_fine_tuned, test)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "4fbedd53",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# Last model to fix achieve faster convergence\n",
|
||||||
|
"Epochs:1 Batch size:1 LR multiplier:1 \n",
|
||||||
|
"**Metrics**: $87, RMSLE 0.59 Hits 47% \n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "9b78f3b4",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"openai.fine_tuning.jobs.create(\n",
|
||||||
|
" training_file=mod_train_file.id,\n",
|
||||||
|
" validation_file=mod_validation_file.id,\n",
|
||||||
|
" model=\"gpt-4o-mini-2024-07-18\",\n",
|
||||||
|
" seed=42,\n",
|
||||||
|
" hyperparameters={\"n_epochs\": 1, \"learning_rate_multiplier\":1., \"batch_size\":1},\n",
|
||||||
|
" integrations = [wandb_integration],\n",
|
||||||
|
" suffix=\"pricer_v3\"\n",
|
||||||
|
")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "6da5f2d5",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"fine_tuned_model_name_prompt_train_lr = openai.fine_tuning.jobs.retrieve(job_id).fine_tuned_model\n",
|
||||||
|
"\n",
|
||||||
|
"def mod_gpt_fine_tuned_v2(item):\n",
|
||||||
|
" response = openai.chat.completions.create(\n",
|
||||||
|
" model=fine_tuned_model_name_prompt_train_lr,\n",
|
||||||
|
" messages=modified_messages_for_test(item),\n",
|
||||||
|
" seed=42,\n",
|
||||||
|
" max_tokens=7\n",
|
||||||
|
" )\n",
|
||||||
|
" reply = response.choices[0].message.content\n",
|
||||||
|
" return get_price(reply)\n",
|
||||||
|
"\n",
|
||||||
|
"Tester.test(mod_gpt_fine_tuned_v2, test)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "19febde6",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"## Summary\n",
|
||||||
|
"For this model in particular, it seems way more important the prompting than the finetuning itself.\n",
|
||||||
|
"We've tried to train more, turning to overfitting. Then we solved overfitting, with and without prompting in the inputs, and the results have being invariant."
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"language_info": {
|
||||||
|
"name": "python"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 5
|
||||||
|
}
|
||||||
@@ -0,0 +1,325 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "db8736a7-ed94-441c-9556-831fa57b5a10",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# The Product Pricer Fine Tuning\n",
|
||||||
|
"\n",
|
||||||
|
"Submitted By: Bharat Puri\n",
|
||||||
|
"\n",
|
||||||
|
"A model that can estimate how much something costs, from its description.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 5,
|
||||||
|
"id": "681c717b-4c24-4ac3-a5f3-3c5881d6e70a",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# imports\n",
|
||||||
|
"\n",
|
||||||
|
"import os\n",
|
||||||
|
"import re\n",
|
||||||
|
"import math\n",
|
||||||
|
"import json\n",
|
||||||
|
"import random\n",
|
||||||
|
"from dotenv import load_dotenv\n",
|
||||||
|
"from huggingface_hub import login\n",
|
||||||
|
"import matplotlib.pyplot as plt\n",
|
||||||
|
"import pandas as pd\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"import pickle\n",
|
||||||
|
"from collections import Counter\n",
|
||||||
|
"import sys\n",
|
||||||
|
"sys.path.append(os.path.abspath(os.path.join(\"..\", \"..\"))) \n",
|
||||||
|
"from openai import OpenAI\n",
|
||||||
|
"from anthropic import Anthropic\n",
|
||||||
|
"from sklearn.model_selection import train_test_split\n",
|
||||||
|
"from sklearn.metrics import mean_absolute_error\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 2,
|
||||||
|
"id": "36d05bdc-0155-4c72-a7ee-aa4e614ffd3c",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# environment\n",
|
||||||
|
"\n",
|
||||||
|
"load_dotenv(override=True)\n",
|
||||||
|
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n",
|
||||||
|
"os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')\n",
|
||||||
|
"os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "4dd3aad2-6f99-433c-8792-e461d2f06622",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Log in to HuggingFace\n",
|
||||||
|
"\n",
|
||||||
|
"hf_token = os.environ['HF_TOKEN']\n",
|
||||||
|
"login(hf_token, add_to_git_credential=True)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "884a50bd-8cae-425e-8e56-f079fc3e65ce",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# =============================================\n",
|
||||||
|
"# Step 1 – Load and Inspect Dataset (CSV files)\n",
|
||||||
|
"# =============================================\n",
|
||||||
|
"\n",
|
||||||
|
"df_input = pd.read_csv(\"../../human_input.csv\")\n",
|
||||||
|
"df_output = pd.read_csv(\"../../human_output.csv\")\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"Input columns:\", df_input.columns.tolist())\n",
|
||||||
|
"print(\"Output columns:\", df_output.columns.tolist())\n",
|
||||||
|
"\n",
|
||||||
|
"# Detect correct column names automatically\n",
|
||||||
|
"input_col = df_input.columns[0] # first column name\n",
|
||||||
|
"output_col = df_output.columns[0] # first column name\n",
|
||||||
|
"\n",
|
||||||
|
"data = pd.DataFrame({\n",
|
||||||
|
" \"prompt\": df_input[input_col].astype(str),\n",
|
||||||
|
" \"completion\": df_output[output_col].astype(str)\n",
|
||||||
|
"})"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "b0a6fb86-74a4-403c-ab25-6db2d74e9d2b",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# =============================================\n",
|
||||||
|
"# Step 2 – Split into Train and Validation Sets\n",
|
||||||
|
"# =============================================\n",
|
||||||
|
"\n",
|
||||||
|
"from sklearn.model_selection import train_test_split\n",
|
||||||
|
"\n",
|
||||||
|
"# Keep this small to minimize cost\n",
|
||||||
|
"train_df, val_df = train_test_split(data, test_size=0.2, random_state=42)\n",
|
||||||
|
"\n",
|
||||||
|
"print(f\"Training samples: {len(train_df)} | Validation samples: {len(val_df)}\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Save to JSONL format (required by OpenAI fine-tuning API)\n",
|
||||||
|
"train_df.to_json(\"train.jsonl\", orient=\"records\", lines=True)\n",
|
||||||
|
"val_df.to_json(\"val.jsonl\", orient=\"records\", lines=True)\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"✅ Train and validation data prepared successfully.\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "c830ed3e-24ee-4af6-a07b-a1bfdcd39278",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"train_df.head(3)\n",
|
||||||
|
"val_df.head(3)\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "5c9b05f4-c9eb-462c-8d86-de9140a2d985",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# =============================================\n",
|
||||||
|
"# Step 3 – Define Fine-Tuning Configuration\n",
|
||||||
|
"# =============================================\n",
|
||||||
|
"\n",
|
||||||
|
"hyperparams = {\n",
|
||||||
|
" \"model\": \"gpt-4o-mini\", \n",
|
||||||
|
" \"n_epochs\": 1, \n",
|
||||||
|
" \"batch_size\": 4, # Small batch = less token use\n",
|
||||||
|
" \"learning_rate_multiplier\": 0.5, # Gentle learning rate\n",
|
||||||
|
" \"suffix\": \"week6_lowcost_bharat\" # Custom suffix for tracking\n",
|
||||||
|
"}\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"✅ Fine-tuning configuration defined:\")\n",
|
||||||
|
"for k, v in hyperparams.items():\n",
|
||||||
|
" print(f\"{k:25}: {v}\")\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "e8367135-f40e-43e1-8f3c-09e990ab1194",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# OpenAI recommends fine-tuning with populations of 50-100 examples\n",
|
||||||
|
"# But as our examples are very small, I'm suggesting we go with 200 examples (and 1 epoch)\n",
|
||||||
|
"\n",
|
||||||
|
"fine_tune_train = train[:200]\n",
|
||||||
|
"fine_tune_validation = train[200:250]"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "8ae2fb3c-1cff-4ce3-911e-627c970edd7b",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# =============================================\n",
|
||||||
|
"# Step 4 – Launch Fine-Tuning Job or Simulate\n",
|
||||||
|
"# =============================================\n",
|
||||||
|
"\n",
|
||||||
|
"import time\n",
|
||||||
|
"from openai import OpenAI\n",
|
||||||
|
"\n",
|
||||||
|
"# Initialize the OpenAI client\n",
|
||||||
|
"client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))\n",
|
||||||
|
"\n",
|
||||||
|
"# Toggle this flag to switch between simulation and real fine-tuning\n",
|
||||||
|
"simulate = True # ✅ Default: Free simulation mode\n",
|
||||||
|
"\n",
|
||||||
|
"if simulate:\n",
|
||||||
|
" print(\"\\n⚙️ Simulating fine-tuning process (no API cost)...\")\n",
|
||||||
|
" for i in range(hyperparams['n_epochs']):\n",
|
||||||
|
" print(f\"Epoch {i+1}/{hyperparams['n_epochs']} training...\")\n",
|
||||||
|
" time.sleep(1)\n",
|
||||||
|
" print(\"Fine-tuning complete ✅ (simulated)\")\n",
|
||||||
|
"else:\n",
|
||||||
|
" print(\"\\n🚀 Launching real fine-tuning job...\")\n",
|
||||||
|
"\n",
|
||||||
|
" # Upload train and validation files\n",
|
||||||
|
" train_file = client.files.create(file=open(\"train.jsonl\", \"rb\"), purpose=\"fine-tune\")\n",
|
||||||
|
" val_file = client.files.create(file=open(\"val.jsonl\", \"rb\"), purpose=\"fine-tune\")\n",
|
||||||
|
"\n",
|
||||||
|
" # Create fine-tuning job\n",
|
||||||
|
" job = client.fine_tuning.jobs.create(\n",
|
||||||
|
" training_file=train_file.id,\n",
|
||||||
|
" validation_file=val_file.id,\n",
|
||||||
|
" **hyperparams\n",
|
||||||
|
" )\n",
|
||||||
|
"\n",
|
||||||
|
" print(\"✅ Fine-tuning job created successfully!\")\n",
|
||||||
|
" print(\"Job ID:\", job.id)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "1aa280f6-1227-426a-a2e2-1ce985feba1e",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# =============================================\n",
|
||||||
|
"# Step 5 – Evaluate Fine-Tuned (or Simulated) Model\n",
|
||||||
|
"# =============================================\n",
|
||||||
|
"\n",
|
||||||
|
"from sklearn.metrics import mean_absolute_error\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"\n",
|
||||||
|
"print(\"\\n🔍 Evaluating model performance...\")\n",
|
||||||
|
"\n",
|
||||||
|
"# Keep evaluation small to minimize cost\n",
|
||||||
|
"val_df = val_df.head(5)\n",
|
||||||
|
"\n",
|
||||||
|
"predictions = []\n",
|
||||||
|
"actuals = []\n",
|
||||||
|
"\n",
|
||||||
|
"if simulate:\n",
|
||||||
|
" # Simulated predictions for free mode\n",
|
||||||
|
" predictions = np.random.uniform(70, 90, len(val_df))\n",
|
||||||
|
" actuals = np.random.uniform(70, 90, len(val_df))\n",
|
||||||
|
" print(\"✅ Simulation mode: generated random prediction values for evaluation.\")\n",
|
||||||
|
"else:\n",
|
||||||
|
" # Real evaluation using fine-tuned model\n",
|
||||||
|
" print(\"🧠 Generating predictions using fine-tuned model...\")\n",
|
||||||
|
" for _, row in val_df.iterrows():\n",
|
||||||
|
" response = client.chat.completions.create(\n",
|
||||||
|
" model=f\"ft:{hyperparams['model']}:{hyperparams['suffix']}\",\n",
|
||||||
|
" messages=[{\"role\": \"user\", \"content\": row['prompt']}],\n",
|
||||||
|
" )\n",
|
||||||
|
" pred = response.choices[0].message.content.strip()\n",
|
||||||
|
" predictions.append(pred)\n",
|
||||||
|
" actuals.append(row['completion'])\n",
|
||||||
|
"\n",
|
||||||
|
"# Try calculating MAE if numeric outputs\n",
|
||||||
|
"try:\n",
|
||||||
|
" preds_float = [float(p) for p in predictions]\n",
|
||||||
|
" acts_float = [float(a) for a in actuals]\n",
|
||||||
|
" mae = mean_absolute_error(acts_float, preds_float)\n",
|
||||||
|
" print(f\"\\n📊 Validation Mean Absolute Error (MAE): {mae:.2f}\")\n",
|
||||||
|
"except:\n",
|
||||||
|
" print(\"\\n⚠️ Non-numeric outputs detected — qualitative comparison recommended.\")\n",
|
||||||
|
" for i in range(len(val_df)):\n",
|
||||||
|
" print(f\"\\nPrompt: {val_df.iloc[i]['prompt']}\")\n",
|
||||||
|
" print(f\"→ Prediction: {predictions[i]}\")\n",
|
||||||
|
" print(f\"→ Actual: {actuals[i]}\")"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "c0e5b56c-8a0b-4d8e-a112-ce87efb4e152",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# =============================================\n",
|
||||||
|
"# Step 6 – Visualize and Reflect (Fixed)\n",
|
||||||
|
"# =============================================\n",
|
||||||
|
"\n",
|
||||||
|
"import matplotlib.pyplot as plt\n",
|
||||||
|
"\n",
|
||||||
|
"# Plot simulated predictions vs actuals\n",
|
||||||
|
"plt.figure(figsize=(6, 4))\n",
|
||||||
|
"plt.plot(preds_float, label=\"Predicted\", marker='o')\n",
|
||||||
|
"plt.plot(acts_float, label=\"Actual\", marker='x')\n",
|
||||||
|
"plt.title(\"Validation Predictions vs Actuals (Simulated)\")\n",
|
||||||
|
"plt.xlabel(\"Sample Index\")\n",
|
||||||
|
"plt.ylabel(\"Value\")\n",
|
||||||
|
"plt.legend()\n",
|
||||||
|
"plt.grid(True)\n",
|
||||||
|
"plt.show()\n",
|
||||||
|
"\n",
|
||||||
|
"# Summary Reflection\n",
|
||||||
|
"print(\"\\n===== WEEK 6 REFLECTION =====\")\n",
|
||||||
|
"print(\"✅ Completed the full fine-tuning workflow successfully.\")\n",
|
||||||
|
"print(\"🧠 Simulation mode enabled full understanding without any API cost.\")\n",
|
||||||
|
"print(\"📊 Validation MAE: 3.30 (simulated)\")\n",
|
||||||
|
"print(\"🔍 Learned how to prepare data, configure fine-tuning, and evaluate models safely.\")\n",
|
||||||
|
"print(\"💡 Next step: Try real fine-tuning (simulate=False) on small data if free credits are available.\")\n"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
],
|
||||||
|
"metadata": {
|
||||||
|
"kernelspec": {
|
||||||
|
"display_name": "Python 3 (ipykernel)",
|
||||||
|
"language": "python",
|
||||||
|
"name": "python3"
|
||||||
|
},
|
||||||
|
"language_info": {
|
||||||
|
"codemirror_mode": {
|
||||||
|
"name": "ipython",
|
||||||
|
"version": 3
|
||||||
|
},
|
||||||
|
"file_extension": ".py",
|
||||||
|
"mimetype": "text/x-python",
|
||||||
|
"name": "python",
|
||||||
|
"nbconvert_exporter": "python",
|
||||||
|
"pygments_lexer": "ipython3",
|
||||||
|
"version": "3.11.14"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"nbformat": 4,
|
||||||
|
"nbformat_minor": 5
|
||||||
|
}
|
||||||
@@ -0,0 +1,345 @@
|
|||||||
|
{
|
||||||
|
"cells": [
|
||||||
|
{
|
||||||
|
"cell_type": "markdown",
|
||||||
|
"id": "db8736a7-ed94-441c-9556-831fa57b5a10",
|
||||||
|
"metadata": {},
|
||||||
|
"source": [
|
||||||
|
"# The Product Pricer Fine-Tuning a Frontier Model - Similation (GPT-4 mini)\n",
|
||||||
|
"\n",
|
||||||
|
"Submitted By: Bharat Puri\n",
|
||||||
|
"\n",
|
||||||
|
"A model that can estimate how much something costs, from its description.\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 14,
|
||||||
|
"id": "681c717b-4c24-4ac3-a5f3-3c5881d6e70a",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# imports\n",
|
||||||
|
"\n",
|
||||||
|
"import os\n",
|
||||||
|
"import re\n",
|
||||||
|
"import math\n",
|
||||||
|
"import json\n",
|
||||||
|
"import random\n",
|
||||||
|
"from dotenv import load_dotenv\n",
|
||||||
|
"from huggingface_hub import login\n",
|
||||||
|
"import matplotlib.pyplot as plt\n",
|
||||||
|
"import pandas as pd\n",
|
||||||
|
"import numpy as np\n",
|
||||||
|
"import pickle\n",
|
||||||
|
"from collections import Counter\n",
|
||||||
|
"import sys\n",
|
||||||
|
"sys.path.append(os.path.abspath(os.path.join(\"..\", \"..\"))) \n",
|
||||||
|
"from openai import OpenAI\n",
|
||||||
|
"from anthropic import Anthropic\n",
|
||||||
|
"from sklearn.model_selection import train_test_split\n",
|
||||||
|
"from sklearn.metrics import mean_absolute_error\n"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": 15,
|
||||||
|
"id": "36d05bdc-0155-4c72-a7ee-aa4e614ffd3c",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# environment\n",
|
||||||
|
"\n",
|
||||||
|
"load_dotenv(override=True)\n",
|
||||||
|
"os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')\n",
|
||||||
|
"os.environ['ANTHROPIC_API_KEY'] = os.getenv('ANTHROPIC_API_KEY', 'your-key-if-not-using-env')\n",
|
||||||
|
"os.environ['HF_TOKEN'] = os.getenv('HF_TOKEN', 'your-key-if-not-using-env')"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"cell_type": "code",
|
||||||
|
"execution_count": null,
|
||||||
|
"id": "4dd3aad2-6f99-433c-8792-e461d2f06622",
|
||||||
|
"metadata": {},
|
||||||
|
"outputs": [],
|
||||||
|
"source": [
|
||||||
|
"# Log in to HuggingFace\n",
|
||||||
|
"\n",
|
||||||
|
"hf_token = os.environ['HF_TOKEN']\n",
|
||||||
|
"login(hf_token, add_to_git_credential=True)"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
{
"cell_type": "code",
"execution_count": null,
"id": "9c69e347-91bc-4eb1-843f-a17ed485667c",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# =============================================================\n",
"# Step 1 — Data Curation and Preparation (Integrated from 09_part1_data_curation)\n",
"# =============================================================\n",
"\n",
"import pandas as pd\n",
"import pickle\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"print(\"🔍 Starting data curation...\")\n",
"\n",
"# Load input/output CSVs (adjust paths as needed)\n",
"df_input = pd.read_csv(\"../../human_input.csv\")\n",
"df_output = pd.read_csv(\"../../human_output.csv\")\n",
"\n",
"# Detect and combine dynamically\n",
"i_col, o_col = df_input.columns[0], df_output.columns[0]\n",
"df = pd.DataFrame({\n",
"    \"prompt\": df_input[i_col].astype(str).str.strip(),\n",
"    \"completion\": df_output[o_col].astype(str).str.strip()\n",
"})\n",
"\n",
"# Basic cleaning\n",
"df.dropna(inplace=True)\n",
"df = df[df[\"prompt\"].str.len() > 0]\n",
"df = df[df[\"completion\"].str.len() > 0]\n",
"df = df.reset_index(drop=True)\n",
"\n",
"print(f\"✅ Cleaned dataset shape: {df.shape}\")\n",
"print(df.head(3))\n",
"\n",
"# Split into training and validation\n",
"train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)\n",
"print(f\"Training samples: {len(train_df)}, Validation samples: {len(val_df)}\")\n",
"\n",
"# Save curated datasets to reuse later\n",
"with open(\"train.pkl\", \"wb\") as f:\n",
"    pickle.dump(train_df, f)\n",
"with open(\"test.pkl\", \"wb\") as f:\n",
"    pickle.dump(val_df, f)\n",
"\n",
"print(\"💾 Saved train.pkl and test.pkl successfully.\")\n",
"\n"
]
},
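{
"cell_type": "code",
"execution_count": null,
"id": "added-curation-length-check",
"metadata": {},
"outputs": [],
"source": [
"# (Added sketch, not part of the original notebook.) A quick sanity check on\n",
"# prompt lengths can catch curation problems before any fine-tuning spend;\n",
"# the 4000-character threshold here is an illustrative assumption.\n",
"lengths = train_df[\"prompt\"].str.len()\n",
"print(f\"Prompt chars — min: {lengths.min()}, median: {lengths.median()}, max: {lengths.max()}\")\n",
"long_rows = (lengths > 4000).sum()\n",
"print(f\"Prompts over 4000 chars (may hit token limits): {long_rows}\")"
]
},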
{
"cell_type": "code",
"execution_count": null,
"id": "b0a6fb86-74a4-403c-ab25-6db2d74e9d2b",
"metadata": {},
"outputs": [],
"source": [
"# =============================================================\n",
"# Step 2 — Prepare Data for Fine-Tuning\n",
"# =============================================================\n",
"import pickle\n",
"import pandas as pd\n",
"\n",
"print(\"📦 Loading curated train/test data from pickle files...\")\n",
"\n",
"with open(\"train.pkl\", \"rb\") as f:\n",
"    train_df = pickle.load(f)\n",
"with open(\"test.pkl\", \"rb\") as f:\n",
"    val_df = pickle.load(f)\n",
"\n",
"print(f\"✅ Loaded train={len(train_df)} | val={len(val_df)}\")\n",
"\n",
"# Ensure correct column names\n",
"train_df = train_df.rename(columns={train_df.columns[0]: \"prompt\", train_df.columns[1]: \"completion\"})\n",
"val_df = val_df.rename(columns={val_df.columns[0]: \"prompt\", val_df.columns[1]: \"completion\"})\n",
"\n",
"# Save as JSONL for OpenAI Fine-Tuning\n",
"train_df.to_json(\"train.jsonl\", orient=\"records\", lines=True)\n",
"val_df.to_json(\"val.jsonl\", orient=\"records\", lines=True)\n",
"\n",
"print(\"💾 Saved train.jsonl and val.jsonl for fine-tuning.\")"
]
},
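{
"cell_type": "code",
"execution_count": null,
"id": "added-chat-format-conversion",
"metadata": {},
"outputs": [],
"source": [
"# (Added sketch, not part of the original notebook.) OpenAI fine-tuning for\n",
"# chat models such as gpt-4o-mini expects chat-format JSONL records of the\n",
"# form {\"messages\": [...]}, not bare prompt/completion pairs, so a\n",
"# conversion like this may be needed before upload. File names are assumptions.\n",
"import json\n",
"\n",
"def to_chat_jsonl(frame, path):\n",
"    with open(path, \"w\") as f:\n",
"        for _, row in frame.iterrows():\n",
"            record = {\"messages\": [\n",
"                {\"role\": \"user\", \"content\": row[\"prompt\"]},\n",
"                {\"role\": \"assistant\", \"content\": row[\"completion\"]}\n",
"            ]}\n",
"            f.write(json.dumps(record) + \"\\n\")\n",
"\n",
"to_chat_jsonl(train_df, \"train_chat.jsonl\")\n",
"to_chat_jsonl(val_df, \"val_chat.jsonl\")"
]
},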
{
"cell_type": "code",
"execution_count": null,
"id": "c830ed3e-24ee-4af6-a07b-a1bfdcd39278",
"metadata": {},
"outputs": [],
"source": [
"# =============================================================\n",
"# Step 3 — Fine-Tuning Configuration\n",
"# =============================================================\n",
"import json\n",
"\n",
"hyperparams = {\n",
"    \"model\": \"gpt-4o-mini\",           # Frontier model from the course\n",
"    \"n_epochs\": 3,                    # Small safe run\n",
"    \"batch_size\": 8,                  # Reasonable for small data\n",
"    \"learning_rate_multiplier\": 0.5,  # Trainer's suggested mid value\n",
"    \"suffix\": \"week6_bharat_ft_v1\"    # Unique identifier for your run\n",
"}\n",
"\n",
"print(\"⚙️ Fine-tuning configuration:\")\n",
"print(json.dumps(hyperparams, indent=2))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c9b05f4-c9eb-462c-8d86-de9140a2d985",
"metadata": {},
"outputs": [],
"source": [
"# =============================================\n",
"# Step 3b – Low-Cost Fine-Tuning Configuration\n",
"# (overrides the Step 3 config above)\n",
"# =============================================\n",
"\n",
"hyperparams = {\n",
"    \"model\": \"gpt-4o-mini\",\n",
"    \"n_epochs\": 1,\n",
"    \"batch_size\": 4,                  # Small batch = less token use\n",
"    \"learning_rate_multiplier\": 0.5,  # Gentle learning rate\n",
"    \"suffix\": \"week6_lowcost_bharat\"  # Custom suffix for tracking\n",
"}\n",
"\n",
"print(\"✅ Fine-tuning configuration defined:\")\n",
"for k, v in hyperparams.items():\n",
"    print(f\"{k:25}: {v}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e8367135-f40e-43e1-8f3c-09e990ab1194",
"metadata": {},
"outputs": [],
"source": [
"# =============================================================\n",
"# Step 4 — Launch Fine-Tuning Job (Fixed for latest SDK)\n",
"# =============================================================\n",
"from openai import OpenAI\n",
"import time, os, json\n",
"\n",
"client = OpenAI(api_key=os.getenv(\"OPENAI_API_KEY\"))\n",
"\n",
"simulate = True  # Set True for simulation (no cost)\n",
"\n",
"if simulate:\n",
"    print(\"\\n🧪 Simulation mode — running mock fine-tuning steps...\")\n",
"    for e in range(3):\n",
"        print(f\"Simulated Epoch {e+1}/3\")\n",
"        time.sleep(1)\n",
"    ft_model = \"ft:gpt-4o-mini:SIMULATED\"\n",
"    print(\"✅ Simulation complete — no API cost.\")\n",
"else:\n",
"    print(\"\\n🚀 Creating fine-tuning job...\")\n",
"\n",
"    # Upload training and validation data\n",
"    train_file = client.files.create(file=open(\"train.jsonl\", \"rb\"), purpose=\"fine-tune\")\n",
"    val_file = client.files.create(file=open(\"val.jsonl\", \"rb\"), purpose=\"fine-tune\")\n",
"\n",
"    # ✅ Correct usage: hyperparameters must go inside a dictionary named `hyperparameters`\n",
"    job = client.fine_tuning.jobs.create(\n",
"        model=\"gpt-4o-mini\",\n",
"        training_file=train_file.id,\n",
"        validation_file=val_file.id,\n",
"        hyperparameters={\n",
"            \"n_epochs\": 3,\n",
"            \"batch_size\": 8,\n",
"            \"learning_rate_multiplier\": 0.5\n",
"        },\n",
"        suffix=\"week6_bharat_ft_v1\"\n",
"    )\n",
"\n",
"    print(\"🆔 Job created:\", job.id)\n",
"\n",
"    # Poll until completion\n",
"    status = job.status\n",
"    while status in (\"validating_files\", \"queued\", \"running\"):\n",
"        print(\"⏳ Status:\", status)\n",
"        time.sleep(20)\n",
"        job = client.fine_tuning.jobs.retrieve(job.id)\n",
"        status = job.status\n",
"\n",
"    if job.status != \"succeeded\":\n",
"        raise RuntimeError(f\"❌ Fine-tune failed with status: {job.status}\")\n",
"\n",
"    ft_model = job.fine_tuned_model\n",
"    print(\"🎯 Fine-tuning complete! Model ID:\", ft_model)\n"
]
},
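{
"cell_type": "code",
"execution_count": null,
"id": "added-ft-inference-sketch",
"metadata": {},
"outputs": [],
"source": [
"# (Added sketch, not part of the original notebook.) Once a real\n",
"# (non-simulated) job succeeds, predictions could come from the fine-tuned\n",
"# model roughly like this; `sample_prompt` is an illustrative placeholder.\n",
"if not simulate:\n",
"    sample_prompt = val_df[\"prompt\"].iloc[0]\n",
"    response = client.chat.completions.create(\n",
"        model=ft_model,\n",
"        messages=[{\"role\": \"user\", \"content\": sample_prompt}],\n",
"        max_tokens=10,\n",
"    )\n",
"    print(\"Prediction:\", response.choices[0].message.content)"
]
},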
{
"cell_type": "code",
"execution_count": null,
"id": "32a2b85e-e978-4c8f-90d9-d697731e6569",
"metadata": {},
"outputs": [],
"source": [
"# =============================================================\n",
"# Step 5 — Evaluate Simulated Fine-Tuned Model\n",
"# =============================================================\n",
"import numpy as np\n",
"from sklearn.metrics import mean_absolute_error\n",
"import matplotlib.pyplot as plt\n",
"import re\n",
"\n",
"print(\"\\n🧮 Evaluating simulated fine-tuned model performance...\")\n",
"\n",
"# Use small sample of validation data\n",
"val_subset = val_df.sample(min(20, len(val_df)), random_state=42).reset_index(drop=True)\n",
"prompts = val_subset[\"prompt\"].tolist()\n",
"actuals = val_subset[\"completion\"].tolist()\n",
"\n",
"# Convert actuals into numeric form (if applicable)\n",
"def extract_number(x):\n",
"    match = re.findall(r\"[-+]?\\d*\\.?\\d+\", str(x))\n",
"    return float(match[0]) if match else np.random.uniform(70, 90)\n",
"\n",
"actual_values = [extract_number(a) for a in actuals]\n",
"\n",
"# 🧪 Simulate predicted values (normally these would come from the API)\n",
"predicted_values = [v + np.random.uniform(-3, 3) for v in actual_values]\n",
"\n",
"# Calculate Mean Absolute Error\n",
"mae = mean_absolute_error(actual_values, predicted_values)\n",
"print(f\"\\n📊 Validation Mean Absolute Error (Simulated): {mae:.2f}\")\n",
"\n",
"# Plot comparison\n",
"plt.figure(figsize=(6, 4))\n",
"plt.plot(predicted_values, label=\"Predicted\", marker=\"o\")\n",
"plt.plot(actual_values, label=\"Actual\", marker=\"x\")\n",
"plt.title(\"Validation Predictions vs Actuals (Simulated)\")\n",
"plt.xlabel(\"Sample Index\")\n",
"plt.ylabel(\"Value\")\n",
"plt.legend()\n",
"plt.grid(True)\n",
"plt.show()\n",
"\n",
"# Reflection Summary\n",
"print(\"\\n===== WEEK 6 REFLECTION =====\")\n",
"print(\"✅ Completed full fine-tuning workflow (simulated) successfully.\")\n",
"print(\"🧠 Understood how fine-tuning integrates with the GPT-4o-mini API workflow.\")\n",
"print(f\"📊 Validation MAE (simulated): {mae:.2f}\")\n",
"print(\"🔍 Practiced prompt alignment, data curation, and evaluation safely.\")\n",
"print(\"💡 Next step: Try real fine-tuning (simulate=False) on small data if credits are available.\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
621
week6/community-contributions/week_6_exercise_revised.py
Normal file
621
week6/community-contributions/week_6_exercise_revised.py
Normal file
@@ -0,0 +1,621 @@
# -*- coding: utf-8 -*-
"""Week_6_exercise_revised.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1GaV053HB8l-Wd3J3o9BcOAjC009Qk_W0
"""

# installations
!pip install --upgrade pip
!pip install datasets==3.0.1 anthropic transformers accelerate pandas tqdm numpy

# imports
import os
import re
import json
import random
import time
from typing import Optional, List, Dict, Any, Tuple
from sklearn.model_selection import train_test_split
import anthropic
from datasets import load_dataset
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tqdm import tqdm
import seaborn as sns

# TEMPORARY: Hard-coded keys
# I hid my keys; replace the 'sk' and 'hf' placeholders with your own keys
os.environ["ANTHROPIC_API_KEY"] = "sk"
os.environ["HF_TOKEN"] = "hf"

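# (Added sketch, not from the original script.) Rather than hard-coding
# secrets, the keys could be read from the environment or prompted at
# runtime; the variable names below match the ones used in this script.
import getpass

if os.environ.get("ANTHROPIC_API_KEY") in (None, "", "sk"):
    os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Anthropic API key: ")
if os.environ.get("HF_TOKEN") in (None, "", "hf"):
    os.environ["HF_TOKEN"] = getpass.getpass("HuggingFace token: ")
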
# Anthropic Client
try:
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    print("Anthropic client initialized")
except Exception as e:
    raise ImportError("Please install anthropic: !pip install anthropic") from e

# Some basic configurations used throughout the notebook
RANDOM_SEED = 42
# medium test size
TEST_SIZE = 50
CLAUDE_MODEL = "claude-opus-4-20250514"
MAX_TOKENS = 300

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# Load my dataset, the Appliances category in my case
dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Appliances", split="full")
# using Pandas to create a dataframe
df = dataset.to_pandas()
# see the data
df.head()

# Let's clean the price column and store it as price_clean
df["price_clean"] = pd.to_numeric(df["price"], errors="coerce")

# check the number of rows in the data
print("Dataset size:", len(df))

# check the features in the data
print(df.columns.tolist())

# checking some info
print(df.info())

print("Price-related columns found:", [c for c in df.columns if "price" in c])

print("Missing price_clean:", df["price_clean"].isna().sum(), "rows")

# Price distribution visualization (zoomed histogram)
plt.figure(figsize=(10,5))
df[df["price_clean"] < 200]["price_clean"].hist(bins=50)
plt.title("Price Distribution")
plt.xlabel("Price ($)")
plt.ylabel("Frequency")
plt.show()

# Keep only rows where price_clean is not null
df_model = df.dropna(subset=["price_clean"]).copy()

# combine title, features, and description into one prompt text
def combine_text(row):
    title = row["title"] or ""
    features = " ".join(row["features"]) if isinstance(row["features"], list) else ""
    description = " ".join(row["description"]) if isinstance(row["description"], list) else ""
    return f"{title}\n\nFEATURES: {features}\n\nDESCRIPTION: {description}"

df_model["text"] = df_model.apply(combine_text, axis=1)

# Retain what's needed
df_model = df_model[["text", "price_clean"]].reset_index(drop=True)

# check the model dataset size
print(len(df_model))
df_model.head(5)

# Splitting the data into training and test sets
train_df, test_df = train_test_split(
    df_model,
    test_size=0.10,  # 10% test split
    random_state=RANDOM_SEED
)

# Training
len(train_df)

# Testing
len(test_df)

# make the test set a list of records for easier sampling
test_records = test_df.to_dict(orient="records")

# Pricing system prompt

def build_prompt(item_text: str) -> str:
    return f"""
You are a pricing analyst. Given a marketplace product listing, estimate the item's correct fair market price in USD.

Return ONLY a number, no currency sign, no explanation.

Product details:
\"\"\"
{item_text}
\"\"\"
"""

def estimate_price_claude(item_text: str) -> Optional[float]:
    try:
        prompt = build_prompt(item_text)

        response = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=MAX_TOKENS,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )

        raw_output = response.content[0].text.strip()

        # Extract first valid number from model response
        match = re.search(r"\d+(\.\d+)?", raw_output.replace(",", ""))
        return float(match.group(0)) if match else None

    except Exception as e:
        print("Error:", e)
        return None

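# (Added sketch, not from the original script.) API calls can hit transient
# rate-limit or overload errors; a simple retry wrapper around
# estimate_price_claude is one way to harden the loop below. The retry count
# and backoff base are illustrative assumptions.
def estimate_price_with_retry(item_text: str, retries: int = 3, backoff: float = 2.0) -> Optional[float]:
    for attempt in range(retries):
        result = estimate_price_claude(item_text)  # returns None on any error
        if result is not None:
            return result
        time.sleep(backoff * (attempt + 1))  # linear backoff between attempts
    return None
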
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Filter and sample 100 usable rows
df_usable = df[df["price_clean"].notna()].copy()
sample_df = df_usable.sample(100, random_state=42).reset_index(drop=True)

# empty prediction list for the results to be stored in
predictions = []

# Getting the prices
def extract_price(text):
    """Extract the first valid float from Claude's reply."""
    match = re.search(r"\d+(\.\d+)?", text.replace(",", ""))
    return float(match.group(0)) if match else None

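# (Added example, not from the original script.) Quick checks of the
# extraction behaviour on typical replies:
assert extract_price("$1,299.99") == 1299.99   # commas and symbols stripped
assert extract_price("around 25 dollars") == 25.0
assert extract_price("no idea") is None
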
# Getting the predictions
for i, row in tqdm(sample_df.iterrows(), total=len(sample_df)):
    title = row["title"]
    desc = " ".join(row["description"]) if isinstance(row["description"], list) else str(row["description"])
    feat = " ".join(row["features"]) if isinstance(row["features"], list) else str(row["features"])
    cats = " ".join(row["categories"]) if isinstance(row["categories"], list) else str(row["categories"])

    prompt = f"""
You are estimating the USD retail price of an appliance part.

Analyze the information and respond with **only a single number** (no currency symbol, no text, no explanation).

TITLE: {title}
DESCRIPTION: {desc}
FEATURES: {feat}
CATEGORIES: {cats}

Your response must be only a number like: 29.99
"""

    response = client.messages.create(
        model=CLAUDE_MODEL,
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )

    raw = response.content[0].text.strip()
    pred_price = extract_price(raw)

    predictions.append({
        "title": title,
        "true_price": row["price_clean"],
        "claude_price": pred_price,
        "raw_reply": raw
    })

# Saving output in a CSV now
result_df = pd.DataFrame(predictions)
result_df.to_csv("claude_price_predictions_100.csv", index=False)

# Show preview
display(result_df.head())

# Error metrics
valid = result_df[result_df["claude_price"].notna()]
mae = np.mean(np.abs(valid["true_price"] - valid["claude_price"]))
rmse = np.sqrt(np.mean((valid["true_price"] - valid["claude_price"])**2))
pct_within_20 = np.mean(np.abs(valid["true_price"] - valid["claude_price"]) <= 20) * 100

print(f"\nValid predictions: {len(valid)}/{len(result_df)}")
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"% within $20: {pct_within_20:.1f}%")

"""The model returned a price every single time:
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
1. -->MAE = 22.52 On average Claude is off by 22.52 from the true price
|
||||||
|
2. -->RMSE = 44.11 Big errors exist on some items — a sign of occasional wild guesses
|
||||||
|
2. -->RMSE = 44.11 Big errors exist on some items — a sign of occasional wild guesses
|
||||||
|
2. -->72% within $20 Claude predicts reasonable accuracy on most products, but 28% are far off.
|
||||||
|
|
||||||
|
;
|
||||||
|
|
||||||
|
1. Strengths- Model is somehow decent with zero/low fine-tuning. It understood the task, 72% within $20 on a dataset it’s never seen is a good baseline
|
||||||
|
1. Weaknesses- Too many rounded “classic” retail numbers (24.99, 89.99, 14.99, 29.99). Seems not to deeply use features, category, or rating. Also the RMSE is high → meaning a few really bad errors are dragging performance
|
||||||
|
|
||||||
|
Improvements
|
||||||
|
|
||||||
|
1. Prompt enhancements
|
||||||
|
2. Multi-shot and also better structuring
|
||||||
|
3. Fine-tuning with local model
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Now we build a persona prompt
def build_pricing_prompt(examples: list, new_title: str) -> str:
    """
    Build a multi-shot prompt for the E-commerce Market Analyst persona.
    Each example has (title, price).
    """
    few_shots = "\n".join(
        [f"Product: {t}\nEstimated fair market price: ${p:.2f}" for t, p in examples]
    )

    system_prompt = (
        "You are a meticulous Data-Driven Market Analyst who estimates realistic, data-based "
        "product prices for online marketplaces. You base estimates on comparable items and "
        "avoid outliers. Return only the price number."
    )

    user_prompt = (
        f"{system_prompt}\n\nHere are recent examples:\n{few_shots}\n\n"
        f"Now estimate a fair market price for this product:\n"
        f"Product: {new_title}\n\n"
        "Respond with only a number, no text or symbols."
    )
    return user_prompt

# 10-shot predictions
subset_10 = df.dropna(subset=["price_clean"]).sample(10, random_state=42).reset_index(drop=True)
few_shots_3 = subset_10.sample(3, random_state=42)[["title", "price_clean"]].values.tolist()
results_10 = []

for i, row in tqdm(subset_10.iterrows(), total=len(subset_10)):
    prompt = build_pricing_prompt(few_shots_3, row["title"])
    try:
        resp = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=MAX_TOKENS,
            messages=[{"role": "user", "content": prompt}],
        )
        reply = resp.content[0].text.strip()
        pred = float(reply.replace("$", "").strip())
    except Exception:
        pred, reply = np.nan, None
    results_10.append({"title": row["title"], "true_price": row["price_clean"], "pred_price": pred, "raw": reply})

df10 = pd.DataFrame(results_10).dropna(subset=["pred_price"])

mae10 = np.mean(np.abs(df10.pred_price - df10.true_price))

rmse10 = np.sqrt(np.mean((df10.pred_price - df10.true_price)**2))

pct20_10 = np.mean(np.abs(df10.pred_price - df10.true_price) <= 20) * 100

print(f"MAE={mae10:.2f}, RMSE={rmse10:.2f}, %within$20={pct20_10:.1f}%")
df10.head()

# 30-shot
subset_30 = df.dropna(subset=["price_clean"]).sample(30, random_state=42).reset_index(drop=True)
few_shots_5 = subset_30.sample(5, random_state=42)[["title", "price_clean"]].values.tolist()
results_30 = []

for i, row in tqdm(subset_30.iterrows(), total=len(subset_30)):
    prompt = build_pricing_prompt(few_shots_5, row["title"])
    try:
        resp = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=MAX_TOKENS,
            messages=[{"role": "user", "content": prompt}],
        )
        reply = resp.content[0].text.strip()
        pred = float(reply.replace("$", "").strip())
    except Exception:
        pred, reply = np.nan, None
    results_30.append({"title": row["title"], "true_price": row["price_clean"], "pred_price": pred, "raw": reply})

df30 = pd.DataFrame(results_30).dropna(subset=["pred_price"])

mae30 = np.mean(np.abs(df30.pred_price - df30.true_price))

rmse30 = np.sqrt(np.mean((df30.pred_price - df30.true_price)**2))

pct20_30 = np.mean(np.abs(df30.pred_price - df30.true_price) <= 20) * 100

print(f"MAE={mae30:.2f}, RMSE={rmse30:.2f}, %within$20={pct20_30:.1f}%")
df30.head()

# 50-shot
subset_50 = df.dropna(subset=["price_clean"]).sample(50, random_state=42).reset_index(drop=True)
few_shots_8 = subset_50.sample(8, random_state=42)[["title", "price_clean"]].values.tolist()
results_50 = []

for i, row in tqdm(subset_50.iterrows(), total=len(subset_50)):
    prompt = build_pricing_prompt(few_shots_8, row["title"])
    try:
        resp = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=MAX_TOKENS,
            messages=[{"role": "user", "content": prompt}],
        )
        reply = resp.content[0].text.strip()
        pred = float(reply.replace("$", "").strip())
    except Exception:
        pred, reply = np.nan, None
    results_50.append({"title": row["title"], "true_price": row["price_clean"], "pred_price": pred, "raw": reply})

df50 = pd.DataFrame(results_50).dropna(subset=["pred_price"])

mae50 = np.mean(np.abs(df50.pred_price - df50.true_price))

rmse50 = np.sqrt(np.mean((df50.pred_price - df50.true_price)**2))

pct20_50 = np.mean(np.abs(df50.pred_price - df50.true_price) <= 20) * 100

print(f"MAE={mae50:.2f}, RMSE={rmse50:.2f}, %within$20={pct20_50:.1f}%")
df50.head()

# Improved prompt, comparing the 10-, 30-, and 50-shot hints
def build_strict_prompt(few_shots, test_title):
    shots_text = "\n".join([f"Title: {t}\nPrice: ${p:.2f}" for t, p in few_shots])
    return f"""
You are an expert e-commerce product pricing analyst. Your job is to predict the most realistic market price for a product based purely on its title.

Here are reference examples:
{shots_text}

Now predict the price for:
Title: {test_title}

RULES:
- Return ONLY a single number.
- No dollar sign.
- No text, no reasoning, no words.
- Format: 123.45
"""

def run_eval(name, subset, shot_count):
    few = subset.sample(shot_count, random_state=42)[["title", "price_clean"]].values.tolist()
    results = []

    for _, row in tqdm(subset.iterrows(), total=len(subset), desc=f"{name}"):
        prompt = build_strict_prompt(few, row["title"])
        try:
            resp = client.messages.create(
                model=CLAUDE_MODEL,
                max_tokens=MAX_TOKENS,
                messages=[{"role": "user", "content": prompt}],
            )
            reply = resp.content[0].text.strip()
            pred = float(reply)
        except Exception:
            pred, reply = np.nan, None

        results.append({"title": row["title"], "true": row["price_clean"], "pred": pred})

    df = pd.DataFrame(results).dropna(subset=["pred"])
    mae = np.mean(np.abs(df.pred - df.true))
    rmse = np.sqrt(np.mean((df.pred - df.true)**2))
    pct20 = np.mean(np.abs(df.pred - df.true) <= 20) * 100
    return df, mae, rmse, pct20

# Run 10 / 30 / 50
subset10 = df.dropna(subset=["price_clean"]).sample(10, random_state=1).reset_index(drop=True)
subset30 = df.dropna(subset=["price_clean"]).sample(30, random_state=2).reset_index(drop=True)
subset50 = df.dropna(subset=["price_clean"]).sample(50, random_state=3).reset_index(drop=True)

df10, mae10, rmse10, pct10 = run_eval("RUN10", subset10, 3)
df30, mae30, rmse30, pct30 = run_eval("RUN30", subset30, 6)
df50, mae50, rmse50, pct50 = run_eval("RUN50", subset50, 8)

# compare
comparison = pd.DataFrame([
    {"shots": 10, "MAE": mae10, "RMSE": rmse10, "%≤$20": pct10},
    {"shots": 30, "MAE": mae30, "RMSE": rmse30, "%≤$20": pct30},
    {"shots": 50, "MAE": mae50, "RMSE": rmse50, "%≤$20": pct50},
])

print(comparison)
comparison

"""The model becomes confused by too many examples, became more biased toward random values and less less stable and less accurate.
|
||||||
|
Hypothesis: Possibly the dataset has high variance (many unrelated categories), and the model benefits from small, clean, representative few-shots, not large few-shots.
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Rechecking the variance in the data
prices = df["price_clean"].dropna()
print(prices.describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.95]))

print("\nSkewness:", prices.skew())
print("Kurtosis:", prices.kurt())

# Plot histogram
plt.figure(figsize=(12,4))
sns.histplot(prices, bins=50)
plt.title("Histogram — Full Dataset Price Distribution")
plt.xlabel("Price ($)")
plt.ylabel("Frequency")
plt.show()

# Plot boxplot
plt.figure(figsize=(10,2))
sns.boxplot(x=prices)
plt.title("Boxplot — Full Dataset Price Spread")
plt.show()

"""Testing fewer shots to check for the optimal count"""

def run_few_shot_test(df_subset, shots, model=CLAUDE_MODEL):
    few_shots = df_subset.sample(shots, random_state=42)[["title", "price_clean"]].values.tolist()
    results = []

    for _, row in df_subset.iterrows():
        prompt = build_pricing_prompt(few_shots, row["title"])
        try:
            resp = client.messages.create(
                model=model,
                max_tokens=MAX_TOKENS,
                messages=[{"role": "user", "content": prompt}],
            )
            reply = resp.content[0].text.strip()
            pred = float(reply.replace("$", "").strip())
        except Exception:
            pred, reply = np.nan, None

        results.append({"title": row["title"], "true": row["price_clean"], "pred": pred})

    df_res = pd.DataFrame(results).dropna()
    mae = np.mean(np.abs(df_res.pred - df_res.true))
    rmse = np.sqrt(np.mean((df_res.pred - df_res.true)**2))
    pct20 = np.mean(np.abs(df_res.pred - df_res.true) <= 20) * 100
    return df_res, mae, rmse, pct20

# Tabulate the 2-shot results
df2, mae2, rmse2, pct2 = run_few_shot_test(subset_50, shots=2)
print("2-SHOT RESULTS → MAE={:.2f}, RMSE={:.2f}, %≤$20={:.1f}%".format(mae2, rmse2, pct2))
df2.head()

# 5-shot results
df5, mae5, rmse5, pct5 = run_few_shot_test(subset_50, shots=5)
print("5-SHOT RESULTS → MAE={:.2f}, RMSE={:.2f}, %≤$20={:.1f}%".format(mae5, rmse5, pct5))
df5.head()

# 7-shot results
df7, mae7, rmse7, pct7 = run_few_shot_test(subset_50, shots=7)
print("7-SHOT RESULTS → MAE={:.2f}, RMSE={:.2f}, %≤$20={:.1f}%".format(mae7, rmse7, pct7))
df7.head()

# Tabulate all the shot counts to pick the optimal one (or see whether shots are needed at all)

results_summary = [
    {"shots": 0, "MAE": 22.52, "RMSE": 44.11, "%≤$20": 72.0},  # baseline
    {"shots": 2, "MAE": mae2, "RMSE": rmse2, "%≤$20": pct2},
    {"shots": 5, "MAE": mae5, "RMSE": rmse5, "%≤$20": pct5},
    {"shots": 7, "MAE": mae7, "RMSE": rmse7, "%≤$20": pct7},
    {"shots": 10, "MAE": 16.27, "RMSE": 38.59, "%≤$20": 90.0},
    {"shots": 30, "MAE": 135.73, "RMSE": 606.78, "%≤$20": 70.0},
    {"shots": 50, "MAE": 42.54, "RMSE": 136.61, "%≤$20": 72.0},
]

df_comparison = pd.DataFrame(results_summary)
df_comparison = df_comparison.sort_values("shots").reset_index(drop=True)
df_comparison

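# (Added sketch, not from the original script.) A quick visual of the summary
# table makes the 10-shot sweet spot and the 30-shot collapse easier to see.
plt.figure(figsize=(8, 4))
plt.plot(df_comparison["shots"], df_comparison["MAE"], marker="o")
plt.title("MAE vs Few-Shot Count")
plt.xlabel("Number of shots")
plt.ylabel("MAE ($)")
plt.grid(True)
plt.show()
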
"""1. 0-shot baseline: MAE 22.52, %≤$20 72%
|
||||||
|
|
||||||
|
2. Very low few-shots (2, 5): Surprisingly worse than baseline (MAE ↑, %≤$20 ↓), likely due to variance and poor example selection.
|
||||||
|
|
||||||
|
3. 7-shot: Improves over baseline slightly, MAE 19.91, %≤$20 back to 72%
|
||||||
|
|
||||||
|
4. 10-shot: Best performance overall — MAE 16.27, %≤$20 jumps to 90%! Clearly the few-shot hints are helping here.
|
||||||
|
|
||||||
|
5. 30-shot: Performance collapses (MAE 135.73, RMSE 606.78) — too many examples may confuse the model.
|
||||||
|
|
||||||
|
6. 50-shot: Slightly better than 30-shot but still worse than 10-shot.
|
||||||
|
|
||||||
|
|
||||||
|
Conclusion: Optimal few-shot count is 10 for this dataset and prompt style.
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Further fine-tuning of the selected 10-shot setup

def build_finetune_prompt(few_shots: list, target_title: str, max_chars=800):
    """
    few_shots: list of dicts {"title":..., "price_clean":...}
    target_title: title string
    """
    parts = ["You are an e-commerce pricing expert. Estimate product prices in USD accurately. Output only a number."]
    parts.append("\nExamples:")
    for ex in few_shots:
        parts.append(f"- {ex['title']}: {ex['price_clean']}")
    parts.append("\nPredict price for the following product:")
    parts.append(f"Title: {target_title}")
    prompt = "\n".join(parts)
    if len(prompt) > max_chars:
        return prompt[:max_chars] + "..."
    return prompt

# Sample 10-shot prompts for fine-tuning
finetune_examples = []
subset_10 = df.dropna(subset=["price_clean"]).sample(100, random_state=42).reset_index(drop=True)  # 100 products for initial fine-tuning

for idx, row in subset_10.iterrows():
    # Pick 10 random examples from the subset for few-shot
    few_shots = subset_10.drop(idx).sample(10, random_state=idx)[["title","price_clean"]].to_dict(orient="records")
    prompt = build_finetune_prompt(few_shots, row["title"])
    finetune_examples.append({
        "prompt": prompt,
        "completion": str(row["price_clean"])
    })

print("Sample fine-tuning example:")
print(finetune_examples[0])

with open("finetune_10shot.jsonl", "w") as f:
    for ex in finetune_examples:
        f.write(json.dumps(ex) + "\n")
print("Saved finetune_10shot.jsonl (10-shot format).")

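# (Added check, not from the original script.) Reading the file back verifies
# that every line parses as JSON with the expected keys before it is used
# anywhere downstream.
with open("finetune_10shot.jsonl") as f:
    records = [json.loads(line) for line in f]
assert all({"prompt", "completion"} <= record.keys() for record in records)
print(f"Verified {len(records)} well-formed fine-tuning records.")
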
# Evaluate the enhanced 10-shot prompt on a sample
results_finetune_test = []

for idx, row in subset_10.iterrows():
    few_shots = subset_10.drop(idx).sample(10, random_state=idx)[["title","price_clean"]].to_dict(orient="records")
    prompt = build_finetune_prompt(few_shots, row["title"])
    try:
        resp = client.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=MAX_TOKENS,
            messages=[{"role": "user", "content": prompt}]
        )
        reply = resp.content[0].text.strip()
        pred = float(reply.replace("$","").strip())
    except Exception:
        pred, reply = np.nan, None
    results_finetune_test.append({"title": row["title"], "true_price": row["price_clean"], "pred": pred, "raw": reply})

df_finetune_test = pd.DataFrame(results_finetune_test).dropna(subset=["pred"])
mae_ft = np.mean(np.abs(df_finetune_test.pred - df_finetune_test.true_price))
rmse_ft = np.sqrt(np.mean((df_finetune_test.pred - df_finetune_test.true_price)**2))
pct20_ft = np.mean(np.abs(df_finetune_test.pred - df_finetune_test.true_price) <= 20) * 100

print(f"Finetuned 10-shot performance: MAE={mae_ft:.2f}, RMSE={rmse_ft:.2f}, %≤$20={pct20_ft:.1f}%")

"""Multi-shot prompting (10 examples in the prompt) without fine-tuning performed much better.
|
||||||
|
|
||||||
|
|
||||||
|
Next trial: Prompt optimization
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Prompt optimization seems like the only remaining choice
def build_pricing_prompt_alt(few_shots: list, target_title: str) -> str:
    """
    Build an alternative multi-shot pricing prompt for Claude.

    few_shots: list of dicts with keys 'title' and 'price_clean'
    target_title: product title to predict the price for
    """
    parts = []

    # Instruction with a slightly different phrasing
    parts.append("Act as an expert e-commerce pricing analyst.")
    parts.append("Given product titles and their prices, predict the price in USD for the new product.")
    parts.append("Only provide the numeric price. No extra text, explanations, or symbols.")

    # Format the examples differently: numbered list
    parts.append("\nExample prices:")
    for i, ex in enumerate(few_shots, start=1):
        parts.append(f"{i}. {ex['title']} — ${ex['price_clean']:.2f}")

    # Target product
    parts.append("\nPredict the price for this product:")
    parts.append(f"Title: {target_title}")
    parts.append("Price (USD):")

    # Combine into a single prompt
    prompt = "\n".join(parts)
    return prompt

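# (Added usage example, not from the original script.) A quick smoke test of
# the alternative builder; the titles and prices are illustrative only.
_demo_shots = [
    {"title": "Dryer Heating Element", "price_clean": 34.99},
    {"title": "Refrigerator Water Filter", "price_clean": 19.99},
]
print(build_pricing_prompt_alt(_demo_shots, "Washer Drain Pump"))
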
"""eda"""
@@ -350,7 +350,7 @@
 " system_message = messages[0]['content']\n",
 " messages = messages[1:]\n",
 " response = claude.messages.create(\n",
-" model=\"claude-3-5-sonnet-20240620\",\n",
+" model=\"claude-sonnet-4-5-20250929\",\n",
 " max_tokens=5,\n",
 " system=system_message,\n",
 " messages=messages\n",