{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "BSbc4VbLi2Ek"
},
"source": [
"# Synthetic Dataset generator\n",
"- 🚀 Live Demo: https://huggingface.co/spaces/lisekarimi/datagen\n",
"- 🧑‍💻 Repo: https://github.com/lisekarimi/datagen\n",
"\n",
"---\n",
"\n",
"- 🌍 **Task**: Generate realistic synthetic datasets\n",
"- 🎯 **Supported Data Types**: Tabular, Text, Time-series\n",
"- 🧠 **Models**: GPT (OpenAI) , Claude (Anthropic), CodeQwen1.5-7B-Chat (via Hugging Face Inference) / Llama (in Google Colab through T4 GPU)\n",
"- 🚀 **Tools**: Python, Gradio UI, OpenAI / Anthropic / HuggingFace APIs\n",
"- 📤 **Output Formats**: JSON and CSV file\n",
"- 🧑‍💻 **Skill Level**: Intermediate\n",
"\n",
"🎯 **How It Works**\n",
"\n",
"1⃣ Define your business problem or dataset topic.\n",
"\n",
"2⃣ Choose the dataset type, output format, model, and number of samples.\n",
"\n",
"3⃣ The LLM generates the code; you can adjust or modify it as needed.\n",
"\n",
"4⃣ Execute the code to generate your output file.\n",
"\n",
"🛠️ **Requirements** \n",
"- ⚙️ **Hardware**: ✅ GPU required (model download); Google Colab recommended (T4)\n",
"- 🔑 OpenAI API Key (for GPT) \n",
"- 🔑 Anthropic API Key (for Claude) \n",
"- 🔑 Hugging Face Token \n",
"\n",
"**Deploy CodeQwen Endpoint:**\n",
"- Visit https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat\n",
"- Click **Deploy** → **Inference Endpoints** → **Create Endpoint** (requires credit card)\n",
"- Copy your endpoint URL: `https://[id].us-east-1.aws.endpoints.huggingface.cloud`\n",
"\n",
"⚙️ **Customizable by user** \n",
"- 🤖 Selected model: GPT / Claude / Llama / Code Qwen\n",
"- 📜 `system_prompt`: Controls model behavior (concise, accurate, structured) \n",
"- 💬 `user_prompt`: Dynamic — include other fields\n",
"\n",
"---\n",
"📢 Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9E-Ioggxi2Em"
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pR-ftUatjEGd",
"outputId": "ae5668c5-c369-4066-bbbf-b560fb28e39a"
},
"outputs": [],
"source": [
"# Install required packages in Google Colab\n",
"%pip install -q python-dotenv gradio anthropic openai requests torch bitsandbytes transformers sentencepiece accelerate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VPmk2-Ggi2Em"
},
"outputs": [],
"source": [
"import re\n",
"import sys\n",
"import subprocess\n",
"import threading\n",
"import anthropic\n",
"import torch\n",
"import gradio as gr\n",
"from openai import OpenAI\n",
"from huggingface_hub import InferenceClient, login\n",
"from google.colab import userdata\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, BitsAndBytesConfig"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DUQ55_oji2En"
},
"source": [
"## Initialization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MiicxGawi2En"
},
"outputs": [],
"source": [
"# Google Colab User Data\n",
"# Ensure you have set the following in your Google Colab environment:\n",
"openai_api_key = userdata.get(\"OPENAI_API_KEY\")\n",
"anthropic_api_key = userdata.get(\"ANTHROPIC_API_KEY\")\n",
"hf_token = userdata.get('HF_TOKEN')"
]
},
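{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outside Google Colab, `google.colab.userdata` is unavailable. Below is a minimal sketch of an environment-variable fallback; it assumes the same key names are set in your shell or in a local `.env` file (`python-dotenv` is already in the install list above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: load the same keys from environment variables instead of Colab secrets\n",
"import os\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()  # reads a local .env file if one exists\n",
"openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"anthropic_api_key = os.getenv(\"ANTHROPIC_API_KEY\")\n",
"hf_token = os.getenv(\"HF_TOKEN\")"
]
},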
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"OPENAI_MODEL = \"gpt-4o-mini\"\n",
"CLAUDE_MODEL = \"claude-3-5-sonnet-20240620\"\n",
"LLAMA = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n",
"\n",
"code_qwen = \"Qwen/CodeQwen1.5-7B-Chat\"\n",
"CODE_QWEN_URL = \"https://zfkokxzs1xrqv13v.us-east-1.aws.endpoints.huggingface.cloud\"\n",
"\n",
"login(hf_token, add_to_git_credential=True)\n",
"openai = OpenAI(api_key=openai_api_key)\n",
"claude = anthropic.Anthropic(api_key=anthropic_api_key)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ipA1F440i2En"
},
"source": [
"## Prompts definition"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JgtqCyRji2En"
},
"outputs": [],
"source": [
"system_message = \"\"\"\n",
"You are a helpful assistant whose main purpose is to generate datasets for business problems.\n",
"\n",
"Be less verbose.\n",
"Be accurate and concise.\n",
"\n",
"The user will describe a business problem. Based on this, you must generate a synthetic dataset that fits the context.\n",
"\n",
"The dataset should be saved in a specific format such as CSV, JSON — the desired format will be specified by the user.\n",
"\n",
"The dependencies for python code should include only standard python libraries such as numpy, pandas and built-in libraries.\n",
"\n",
"When saving a DataFrame to JSON using `to_json()`, do not use the `encoding` parameter. Instead, manually open the file with `open()` and specify the encoding. Then pass the file object to `to_json()`.\n",
"\n",
"Ensure Python code blocks are correctly indented, especially inside `with`, `for`, `if`, `try`, and `def` blocks.\n",
"\n",
"Return only the Python code that generates and saves the dataset.\n",
"After saving the file, print the code that was executed and a message confirming the dataset was generated successfully.\n",
"\"\"\"\n"
]
},
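{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `to_json()` instruction in the system prompt exists because pandas' `DataFrame.to_json()` has no `encoding` parameter. A short sketch of the pattern the prompt asks the model to follow (illustrative data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the JSON-saving pattern the system prompt mandates:\n",
"# open the file with an explicit encoding, then hand the file object to to_json()\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\"product\": [\"café\", \"thé\"], \"units\": [12, 7]})\n",
"with open(\"dataset.json\", \"w\", encoding=\"utf-8\") as f:\n",
"    df.to_json(f, orient=\"records\", force_ascii=False)"
]
},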
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bk6saP4oi2Eo"
},
"outputs": [],
"source": [
"def user_prompt(**input_data):\n",
" user_prompt = f\"\"\"\n",
" Generate a synthetic {input_data[\"dataset_type\"].lower()} dataset in {input_data[\"output_format\"].upper()} format.\n",
" Business problem: {input_data[\"business_problem\"]}\n",
" Samples: {input_data[\"num_samples\"]}\n",
" \"\"\"\n",
" return user_prompt\n"
]
},
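{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the rendered prompt, using illustrative values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check: render the user prompt for a sample request (illustrative values)\n",
"print(user_prompt(\n",
"    business_problem=\"Predict churn for a telecom provider\",\n",
"    dataset_type=\"Tabular\",\n",
"    output_format=\"csv\",\n",
"    num_samples=100,\n",
"))"
]
},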
{
"cell_type": "markdown",
"metadata": {
"id": "XnrPiAZ7i2Eo"
},
"source": [
"## Call API for Closed Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Sx7hHKczi2Eo"
},
"outputs": [],
"source": [
"def stream_gpt(user_prompt):\n",
" stream = openai.chat.completions.create(\n",
" model=OPENAI_MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ],\n",
" stream=True,\n",
" )\n",
"\n",
" response = \"\"\n",
" for chunk in stream:\n",
" response += chunk.choices[0].delta.content or \"\"\n",
" yield response\n",
"\n",
" return response\n",
"\n",
"\n",
"def stream_claude(user_prompt):\n",
" result = claude.messages.stream(\n",
" model=CLAUDE_MODEL,\n",
" max_tokens=2000,\n",
" system=system_message,\n",
" messages=[\n",
" {\"role\": \"user\",\"content\": user_prompt}\n",
" ]\n",
" )\n",
" reply = \"\"\n",
" with result as stream:\n",
" for text in stream.text_stream:\n",
" reply += text\n",
" yield reply\n",
" print(text, end=\"\", flush=True)\n",
" return reply\n"
]
},
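{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both functions are generators that yield the accumulated response so far, which is what Gradio expects for a live-updating textbox. A minimal sketch of consuming one directly, assuming a valid OpenAI key:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: drive a streaming generator outside Gradio (requires a valid API key)\n",
"last = \"\"\n",
"for partial in stream_gpt(\"Generate a tiny CSV dataset about coffee sales. Samples: 5\"):\n",
"    last = partial  # each value is the full response so far, not a delta\n",
"print(last)"
]
},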
{
"cell_type": "markdown",
"metadata": {
"id": "PUPeZ4xPi2Eo"
},
"source": [
"## Call Open Source Models\n",
"- Llama is downloaded and run on T4 GPU (Google Colab).\n",
"- Code Qwen is run through inference endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W0AuZT2uk0Sd"
},
"outputs": [],
"source": [
"def stream_llama(user_prompt):\n",
" try:\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ]\n",
"\n",
" tokenizer = AutoTokenizer.from_pretrained(LLAMA)\n",
" tokenizer.pad_token = tokenizer.eos_token\n",
"\n",
" quant_config = BitsAndBytesConfig(\n",
" load_in_4bit=True,\n",
" bnb_4bit_use_double_quant=True,\n",
" bnb_4bit_compute_dtype=torch.bfloat16,\n",
" bnb_4bit_quant_type=\"nf4\"\n",
" )\n",
"\n",
" model = AutoModelForCausalLM.from_pretrained(\n",
" LLAMA,\n",
" device_map=\"auto\",\n",
" quantization_config=quant_config\n",
" )\n",
"\n",
" inputs = tokenizer.apply_chat_template(messages, return_tensors=\"pt\").to(\"cuda\")\n",
" streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)\n",
"\n",
" thread = threading.Thread(target=model.generate, kwargs={\n",
" \"input_ids\": inputs,\n",
" \"max_new_tokens\": 1000,\n",
" \"pad_token_id\": tokenizer.eos_token_id,\n",
" \"streamer\": streamer\n",
" })\n",
" thread.start()\n",
"\n",
" started = False\n",
" reply = \"\"\n",
"\n",
" for new_text in streamer:\n",
" if not started:\n",
" if \"<|start_header_id|>assistant<|end_header_id|>\" in new_text:\n",
" started = True\n",
" new_text = new_text.split(\"<|start_header_id|>assistant<|end_header_id|>\")[-1].strip()\n",
" else:\n",
" continue\n",
"\n",
" if \"<|eot_id|>\" in new_text:\n",
" new_text = new_text.replace(\"<|eot_id|>\", \"\")\n",
" if new_text.strip():\n",
" reply += new_text\n",
" yield reply\n",
" break\n",
"\n",
" if new_text.strip():\n",
" reply += new_text\n",
" yield reply\n",
"\n",
" return reply\n",
"\n",
" except Exception as e:\n",
" print(f\"LLaMA error: {e}\")\n",
" raise\n"
]
},
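{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the streamer is created with `skip_special_tokens=False`, the loop above has to strip Llama's chat markup by hand. Here is the same filtering applied to a canned token stream, so you can follow the logic without a GPU or model download:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the header/eot filtering from stream_llama on fake streamer output\n",
"fake_stream = [\"<|start_header_id|>assistant<|end_header_id|>\\n\\nHello\", \", world\", \"!<|eot_id|>\"]\n",
"started, reply = False, \"\"\n",
"for new_text in fake_stream:\n",
"    if not started:\n",
"        if \"<|start_header_id|>assistant<|end_header_id|>\" in new_text:\n",
"            started = True\n",
"            new_text = new_text.split(\"<|start_header_id|>assistant<|end_header_id|>\")[-1].strip()\n",
"        else:\n",
"            continue\n",
"    if \"<|eot_id|>\" in new_text:\n",
"        reply += new_text.replace(\"<|eot_id|>\", \"\")\n",
"        break\n",
"    reply += new_text\n",
"print(reply)  # -> Hello, world!"
]
},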
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "V0JS_6THi2Eo"
},
"outputs": [],
"source": [
"def stream_code_qwen(user_prompt):\n",
" tokenizer = AutoTokenizer.from_pretrained(code_qwen)\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ]\n",
" text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
" client = InferenceClient(CODE_QWEN_URL, token=hf_token)\n",
" stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)\n",
" result = \"\"\n",
" for r in stream:\n",
" result += r.token.text\n",
" yield result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PqG57dJIi2Eo"
},
"source": [
"## Select the model and generate the ouput"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YqSKnklRi2Eo"
},
"outputs": [],
"source": [
"def generate_from_inputs(model, **input_data):\n",
" # print(\"🔍 input_data received:\", input_data)\n",
" user_prompt_str = user_prompt(**input_data)\n",
"\n",
" if model == \"GPT\":\n",
" result = stream_gpt(user_prompt_str)\n",
" elif model == \"Claude\":\n",
" result = stream_claude(user_prompt_str)\n",
" elif model == \"Llama\":\n",
" result = stream_llama(user_prompt_str)\n",
" elif model == \"Code Qwen\":\n",
" result = stream_code_qwen(user_prompt_str)\n",
" else:\n",
" raise ValueError(\"Unknown model\")\n",
"\n",
" for stream_so_far in result:\n",
" yield stream_so_far\n",
"\n",
" return result\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zG6_TSfni2Eo"
},
"outputs": [],
"source": [
"def handle_generate(business_problem, dataset_type, dataset_format, num_samples, model):\n",
" input_data = {\n",
" \"business_problem\": business_problem,\n",
" \"dataset_type\": dataset_type,\n",
" \"output_format\": dataset_format,\n",
" \"num_samples\": num_samples,\n",
" }\n",
"\n",
" response = generate_from_inputs(model, **input_data)\n",
" for chunk in response:\n",
" yield chunk\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p5DQcx71i2Ep"
},
"source": [
"## Extract python code from the LLM output and execute it locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NcEkmsnai2Ep",
"jp-MarkdownHeadingCollapsed": true
},
"outputs": [],
"source": [
"def extract_code(text):\n",
" match = re.search(r\"```python(.*?)```\", text, re.DOTALL)\n",
"\n",
" if match:\n",
" code = match.group(0).strip()\n",
" else:\n",
" code = \"\"\n",
" print(\"No matching substring found.\")\n",
"\n",
" return code.replace(\"```python\\n\", \"\").replace(\"```\", \"\")\n",
"\n",
"\n",
"def execute_code_in_virtualenv(text, python_interpreter=sys.executable):\n",
" if not python_interpreter:\n",
" raise EnvironmentError(\"Python interpreter not found in the specified virtual environment.\")\n",
"\n",
" code_str = extract_code(text)\n",
" command = [python_interpreter, '-c', code_str]\n",
"\n",
" try:\n",
" result = subprocess.run(command, check=True, capture_output=True, text=True)\n",
" stdout = result.stdout\n",
" return stdout\n",
"\n",
" except subprocess.CalledProcessError as e:\n",
" return f\"Execution error:\\n{e}\"\n"
]
},
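{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the extraction step on a canned response, so you can verify the regex without an API call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check: extract_code pulls the fenced block out of a canned LLM reply\n",
"sample_reply = \"Here is the code:\\n```python\\nprint('hello dataset')\\n```\\nDone.\"\n",
"print(extract_code(sample_reply))  # -> print('hello dataset')"
]
},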
{
"cell_type": "markdown",
"metadata": {
"id": "DQgEyFzJi2Ep"
},
"source": [
"## Gradio interface"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SEiZVkdFi2Ep"
},
"outputs": [],
"source": [
"def update_output_format(dataset_type):\n",
" if dataset_type in [\"Tabular\", \"Time-series\"]:\n",
" return gr.update(choices=[\"JSON\", \"csv\"], value=\"JSON\")\n",
" elif dataset_type == \"Text\":\n",
" return gr.update(choices=[\"JSON\"], value=\"JSON\")\n",
"\n",
"with gr.Blocks() as ui:\n",
" gr.Markdown(\"## Create a dataset for a business problem\")\n",
"\n",
" with gr.Column():\n",
" business_problem = gr.Textbox(label=\"Business problem\", lines=2)\n",
" dataset_type = gr.Dropdown(\n",
" [\"Tabular\", \"Time-series\", \"Text\"], label=\"Dataset type\"\n",
" )\n",
"\n",
" output_format = gr.Dropdown( choices=[\"JSON\", \"csv\"], value=\"JSON\",label=\"Output Format\")\n",
"\n",
" num_samples = gr.Number(label=\"Number of samples\", value=10, precision=0)\n",
"\n",
" model = gr.Dropdown([\"GPT\", \"Claude\", \"Llama\", \"Code Qwen\"], label=\"Select model\", value=\"GPT\")\n",
"\n",
" dataset_type.change(update_output_format,inputs=[dataset_type], outputs=[output_format])\n",
"\n",
" with gr.Row():\n",
" with gr.Column():\n",
" dataset_run = gr.Button(\"Create a dataset\")\n",
" gr.Markdown(\"\"\"⚠️ For Llama and Code Qwen: The generated code might not be optimal. It's recommended to review it before execution.\n",
" Some mistakes may occur.\"\"\")\n",
"\n",
" with gr.Column():\n",
" code_run = gr.Button(\"Execute code for a dataset\")\n",
" gr.Markdown(\"\"\"⚠️ Be cautious when sharing this app with code execution publicly, as it could pose safety risks.\n",
" The execution of user-generated code may lead to potential vulnerabilities, and its important to use this tool responsibly.\"\"\")\n",
"\n",
" with gr.Row():\n",
" dataset_out = gr.Textbox(label=\"Generated Dataset\")\n",
" code_out = gr.Textbox(label=\"Executed code\")\n",
"\n",
" dataset_run.click(\n",
" handle_generate,\n",
" inputs=[business_problem, dataset_type, output_format, num_samples, model],\n",
" outputs=[dataset_out]\n",
" )\n",
"\n",
" code_run.click(\n",
" execute_code_in_virtualenv,\n",
" inputs=[dataset_out],\n",
" outputs=[code_out]\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 646
},
"id": "jCAkTEtMi2Ep",
"outputId": "deeeb1a7-c432-4007-eba2-cbcc28dbc0ff"
},
"outputs": [],
"source": [
"ui.launch(inbrowser=True)"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}