{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "BSbc4VbLi2Ek" }, "source": [ "# Synthetic Dataset generator\n", "- πŸš€ Live Demo: https://huggingface.co/spaces/lisekarimi/datagen\n", "- πŸ§‘β€πŸ’» Repo: https://github.com/lisekarimi/datagen\n", "\n", "---\n", "\n", "- 🌍 **Task**: Generate realistic synthetic datasets\n", "- 🎯 **Supported Data Types**: Tabular, Text, Time-series\n", "- 🧠 **Models**: GPT (OpenAI) , Claude (Anthropic), CodeQwen1.5-7B-Chat (via Hugging Face Inference) / Llama (in Google Colab through T4 GPU)\n", "- πŸš€ **Tools**: Python, Gradio UI, OpenAI / Anthropic / HuggingFace APIs\n", "- πŸ“€ **Output Formats**: JSON and CSV file\n", "- πŸ§‘β€πŸ’» **Skill Level**: Intermediate\n", "\n", "🎯 **How It Works**\n", "\n", "1️⃣ Define your business problem or dataset topic.\n", "\n", "2️⃣ Choose the dataset type, output format, model, and number of samples.\n", "\n", "3️⃣ The LLM generates the code; you can adjust or modify it as needed.\n", "\n", "4️⃣ Execute the code to generate your output file.\n", "\n", "πŸ› οΈ **Requirements** \n", "- βš™οΈ **Hardware**: βœ… GPU required (model download); Google Colab recommended (T4)\n", "- πŸ”‘ OpenAI API Key (for GPT) \n", "- πŸ”‘ Anthropic API Key (for Claude) \n", "- πŸ”‘ Hugging Face Token \n", "\n", "**Deploy CodeQwen Endpoint:**\n", "- Visit https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat\n", "- Click **Deploy** β†’ **Inference Endpoints** β†’ **Create Endpoint** (requires credit card)\n", "- Copy your endpoint URL: `https://[id].us-east-1.aws.endpoints.huggingface.cloud`\n", "\n", "βš™οΈ **Customizable by user** \n", "- πŸ€– Selected model: GPT / Claude / Llama / Code Qwen\n", "- πŸ“œ `system_prompt`: Controls model behavior (concise, accurate, structured) \n", "- πŸ’¬ `user_prompt`: Dynamic β€” include other fields\n", "\n", "---\n", "πŸ“’ Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)" ] }, { "cell_type": "markdown", "metadata": { "id": "9E-Ioggxi2Em" }, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "pR-ftUatjEGd", "outputId": "ae5668c5-c369-4066-bbbf-b560fb28e39a" }, "outputs": [], "source": [ "# Install required packages in Google Colab\n", "%pip install -q python-dotenv gradio anthropic openai requests torch bitsandbytes transformers sentencepiece accelerate" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VPmk2-Ggi2Em" }, "outputs": [], "source": [ "import re\n", "import sys\n", "import subprocess\n", "import threading\n", "import anthropic\n", "import torch\n", "import gradio as gr\n", "from openai import OpenAI\n", "from huggingface_hub import InferenceClient, login\n", "from google.colab import userdata\n", "from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, BitsAndBytesConfig" ] }, { "cell_type": "markdown", "metadata": { "id": "DUQ55_oji2En" }, "source": [ "## Initialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MiicxGawi2En" }, "outputs": [], "source": [ "# Google Colab User Data\n", "# Ensure you have set the following in your Google Colab environment:\n", "openai_api_key = userdata.get(\"OPENAI_API_KEY\")\n", "anthropic_api_key = userdata.get(\"ANTHROPIC_API_KEY\")\n", "hf_token = userdata.get('HF_TOKEN')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "OPENAI_MODEL = \"gpt-4o-mini\"\n", "CLAUDE_MODEL = 
\"claude-3-5-sonnet-20240620\"\n", "LLAMA = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n", "\n", "code_qwen = \"Qwen/CodeQwen1.5-7B-Chat\"\n", "CODE_QWEN_URL = \"https://zfkokxzs1xrqv13v.us-east-1.aws.endpoints.huggingface.cloud\"\n", "\n", "login(hf_token, add_to_git_credential=True)\n", "openai = OpenAI(api_key=openai_api_key)\n", "claude = anthropic.Anthropic(api_key=anthropic_api_key)" ] }, { "cell_type": "markdown", "metadata": { "id": "ipA1F440i2En" }, "source": [ "## Prompts definition" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JgtqCyRji2En" }, "outputs": [], "source": [ "system_message = \"\"\"\n", "You are a helpful assistant whose main purpose is to generate datasets for business problems.\n", "\n", "Be less verbose.\n", "Be accurate and concise.\n", "\n", "The user will describe a business problem. Based on this, you must generate a synthetic dataset that fits the context.\n", "\n", "The dataset should be saved in a specific format such as CSV, JSON β€” the desired format will be specified by the user.\n", "\n", "The dependencies for python code should include only standard python libraries such as numpy, pandas and built-in libraries.\n", "\n", "When saving a DataFrame to JSON using `to_json()`, do not use the `encoding` parameter. Instead, manually open the file with `open()` and specify the encoding. Then pass the file object to `to_json()`.\n", "\n", "Ensure Python code blocks are correctly indented, especially inside `with`, `for`, `if`, `try`, and `def` blocks.\n", "\n", "Return only the Python code that generates and saves the dataset.\n", "After saving the file, print the code that was executed and a message confirming the dataset was generated successfully.\n", "\"\"\"\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Bk6saP4oi2Eo" }, "outputs": [], "source": [ "def user_prompt(**input_data):\n", " user_prompt = f\"\"\"\n", " Generate a synthetic {input_data[\"dataset_type\"].lower()} dataset in {input_data[\"output_format\"].upper()} format.\n", " Business problem: {input_data[\"business_problem\"]}\n", " Samples: {input_data[\"num_samples\"]}\n", " \"\"\"\n", " return user_prompt\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XnrPiAZ7i2Eo" }, "source": [ "## Call API for Closed Models" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Sx7hHKczi2Eo" }, "outputs": [], "source": [ "def stream_gpt(user_prompt):\n", " stream = openai.chat.completions.create(\n", " model=OPENAI_MODEL,\n", " messages=[\n", " {\"role\": \"system\", \"content\": system_message},\n", " {\"role\": \"user\",\"content\": user_prompt},\n", " ],\n", " stream=True,\n", " )\n", "\n", " response = \"\"\n", " for chunk in stream:\n", " response += chunk.choices[0].delta.content or \"\"\n", " yield response\n", "\n", " return response\n", "\n", "\n", "def stream_claude(user_prompt):\n", " result = claude.messages.stream(\n", " model=CLAUDE_MODEL,\n", " max_tokens=2000,\n", " system=system_message,\n", " messages=[\n", " {\"role\": \"user\",\"content\": user_prompt}\n", " ]\n", " )\n", " reply = \"\"\n", " with result as stream:\n", " for text in stream.text_stream:\n", " reply += text\n", " yield reply\n", " print(text, end=\"\", flush=True)\n", " return reply\n" ] }, { "cell_type": "markdown", "metadata": { "id": "PUPeZ4xPi2Eo" }, "source": [ "## Call Open Source Models\n", "- Llama is downloaded and run on T4 GPU (Google Colab).\n", "- Code Qwen is run through inference endpoint" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": { "id": "W0AuZT2uk0Sd" }, "outputs": [], "source": [ "def stream_llama(user_prompt):\n", " try:\n", " messages=[\n", " {\"role\": \"system\", \"content\": system_message},\n", " {\"role\": \"user\",\"content\": user_prompt},\n", " ]\n", "\n", " tokenizer = AutoTokenizer.from_pretrained(LLAMA)\n", " tokenizer.pad_token = tokenizer.eos_token\n", "\n", " quant_config = BitsAndBytesConfig(\n", " load_in_4bit=True,\n", " bnb_4bit_use_double_quant=True,\n", " bnb_4bit_compute_dtype=torch.bfloat16,\n", " bnb_4bit_quant_type=\"nf4\"\n", " )\n", "\n", " model = AutoModelForCausalLM.from_pretrained(\n", " LLAMA,\n", " device_map=\"auto\",\n", " quantization_config=quant_config\n", " )\n", "\n", " inputs = tokenizer.apply_chat_template(messages, return_tensors=\"pt\").to(\"cuda\")\n", " streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)\n", "\n", " thread = threading.Thread(target=model.generate, kwargs={\n", " \"input_ids\": inputs,\n", " \"max_new_tokens\": 1000,\n", " \"pad_token_id\": tokenizer.eos_token_id,\n", " \"streamer\": streamer\n", " })\n", " thread.start()\n", "\n", " started = False\n", " reply = \"\"\n", "\n", " for new_text in streamer:\n", " if not started:\n", " if \"<|start_header_id|>assistant<|end_header_id|>\" in new_text:\n", " started = True\n", " new_text = new_text.split(\"<|start_header_id|>assistant<|end_header_id|>\")[-1].strip()\n", " else:\n", " continue\n", "\n", " if \"<|eot_id|>\" in new_text:\n", " new_text = new_text.replace(\"<|eot_id|>\", \"\")\n", " if new_text.strip():\n", " reply += new_text\n", " yield reply\n", " break\n", "\n", " if new_text.strip():\n", " reply += new_text\n", " yield reply\n", "\n", " return reply\n", "\n", " except Exception as e:\n", " print(f\"LLaMA error: {e}\")\n", " raise\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "V0JS_6THi2Eo" }, "outputs": [], "source": [ "def stream_code_qwen(user_prompt):\n", " tokenizer = AutoTokenizer.from_pretrained(code_qwen)\n", " messages=[\n", " {\"role\": \"system\", \"content\": system_message},\n", " {\"role\": \"user\",\"content\": user_prompt},\n", " ]\n", " text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n", " client = InferenceClient(CODE_QWEN_URL, token=hf_token)\n", " stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)\n", " result = \"\"\n", " for r in stream:\n", " result += r.token.text\n", " yield result" ] }, { "cell_type": "markdown", "metadata": { "id": "PqG57dJIi2Eo" }, "source": [ "## Select the model and generate the ouput" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YqSKnklRi2Eo" }, "outputs": [], "source": [ "def generate_from_inputs(model, **input_data):\n", " # print(\"πŸ” input_data received:\", input_data)\n", " user_prompt_str = user_prompt(**input_data)\n", "\n", " if model == \"GPT\":\n", " result = stream_gpt(user_prompt_str)\n", " elif model == \"Claude\":\n", " result = stream_claude(user_prompt_str)\n", " elif model == \"Llama\":\n", " result = stream_llama(user_prompt_str)\n", " elif model == \"Code Qwen\":\n", " result = stream_code_qwen(user_prompt_str)\n", " else:\n", " raise ValueError(\"Unknown model\")\n", "\n", " for stream_so_far in result:\n", " yield stream_so_far\n", "\n", " return result\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zG6_TSfni2Eo" }, "outputs": [], "source": [ "def handle_generate(business_problem, 
dataset_type, dataset_format, num_samples, model):\n", " input_data = {\n", " \"business_problem\": business_problem,\n", " \"dataset_type\": dataset_type,\n", " \"output_format\": dataset_format,\n", " \"num_samples\": num_samples,\n", " }\n", "\n", " response = generate_from_inputs(model, **input_data)\n", " for chunk in response:\n", " yield chunk\n" ] }, { "cell_type": "markdown", "metadata": { "id": "p5DQcx71i2Ep" }, "source": [ "## Extract python code from the LLM output and execute it locally" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NcEkmsnai2Ep", "jp-MarkdownHeadingCollapsed": true }, "outputs": [], "source": [ "def extract_code(text):\n", " match = re.search(r\"```python(.*?)```\", text, re.DOTALL)\n", "\n", " if match:\n", " code = match.group(0).strip()\n", " else:\n", " code = \"\"\n", " print(\"No matching substring found.\")\n", "\n", " return code.replace(\"```python\\n\", \"\").replace(\"```\", \"\")\n", "\n", "\n", "def execute_code_in_virtualenv(text, python_interpreter=sys.executable):\n", " if not python_interpreter:\n", " raise EnvironmentError(\"Python interpreter not found in the specified virtual environment.\")\n", "\n", " code_str = extract_code(text)\n", " command = [python_interpreter, '-c', code_str]\n", "\n", " try:\n", " result = subprocess.run(command, check=True, capture_output=True, text=True)\n", " stdout = result.stdout\n", " return stdout\n", "\n", " except subprocess.CalledProcessError as e:\n", " return f\"Execution error:\\n{e}\"\n" ] }, { "cell_type": "markdown", "metadata": { "id": "DQgEyFzJi2Ep" }, "source": [ "## Gradio interface" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SEiZVkdFi2Ep" }, "outputs": [], "source": [ "def update_output_format(dataset_type):\n", " if dataset_type in [\"Tabular\", \"Time-series\"]:\n", " return gr.update(choices=[\"JSON\", \"csv\"], value=\"JSON\")\n", " elif dataset_type == \"Text\":\n", " return gr.update(choices=[\"JSON\"], value=\"JSON\")\n", "\n", "with gr.Blocks() as ui:\n", " gr.Markdown(\"## Create a dataset for a business problem\")\n", "\n", " with gr.Column():\n", " business_problem = gr.Textbox(label=\"Business problem\", lines=2)\n", " dataset_type = gr.Dropdown(\n", " [\"Tabular\", \"Time-series\", \"Text\"], label=\"Dataset type\"\n", " )\n", "\n", " output_format = gr.Dropdown( choices=[\"JSON\", \"csv\"], value=\"JSON\",label=\"Output Format\")\n", "\n", " num_samples = gr.Number(label=\"Number of samples\", value=10, precision=0)\n", "\n", " model = gr.Dropdown([\"GPT\", \"Claude\", \"Llama\", \"Code Qwen\"], label=\"Select model\", value=\"GPT\")\n", "\n", " dataset_type.change(update_output_format,inputs=[dataset_type], outputs=[output_format])\n", "\n", " with gr.Row():\n", " with gr.Column():\n", " dataset_run = gr.Button(\"Create a dataset\")\n", " gr.Markdown(\"\"\"⚠️ For Llama and Code Qwen: The generated code might not be optimal. 
{ "cell_type": "markdown", "metadata": { "id": "DQgEyFzJi2Ep" }, "source": [ "## Gradio interface" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SEiZVkdFi2Ep" }, "outputs": [], "source": [ "def update_output_format(dataset_type):\n", "    if dataset_type in [\"Tabular\", \"Time-series\"]:\n", "        return gr.update(choices=[\"JSON\", \"CSV\"], value=\"JSON\")\n", "    elif dataset_type == \"Text\":\n", "        return gr.update(choices=[\"JSON\"], value=\"JSON\")\n", "\n", "with gr.Blocks() as ui:\n", "    gr.Markdown(\"## Create a dataset for a business problem\")\n", "\n", "    with gr.Column():\n", "        business_problem = gr.Textbox(label=\"Business problem\", lines=2)\n", "        dataset_type = gr.Dropdown(\n", "            [\"Tabular\", \"Time-series\", \"Text\"], label=\"Dataset type\"\n", "        )\n", "\n", "        output_format = gr.Dropdown(choices=[\"JSON\", \"CSV\"], value=\"JSON\", label=\"Output Format\")\n", "\n", "        num_samples = gr.Number(label=\"Number of samples\", value=10, precision=0)\n", "\n", "        model = gr.Dropdown([\"GPT\", \"Claude\", \"Llama\", \"Code Qwen\"], label=\"Select model\", value=\"GPT\")\n", "\n", "        dataset_type.change(update_output_format, inputs=[dataset_type], outputs=[output_format])\n", "\n", "    with gr.Row():\n", "        with gr.Column():\n", "            dataset_run = gr.Button(\"Create a dataset\")\n", "            gr.Markdown(\"\"\"⚠️ For Llama and Code Qwen: the generated code may contain mistakes. Review it before executing.\"\"\")\n", "\n", "        with gr.Column():\n", "            code_run = gr.Button(\"Execute code for a dataset\")\n", "            gr.Markdown(\"\"\"⚠️ Be cautious about sharing this app publicly: executing LLM-generated code is a security risk. Use this tool responsibly.\"\"\")\n", "\n", "    with gr.Row():\n", "        dataset_out = gr.Textbox(label=\"Generated code\")\n", "        code_out = gr.Textbox(label=\"Execution output\")\n", "\n", "    dataset_run.click(\n", "        handle_generate,\n", "        inputs=[business_problem, dataset_type, output_format, num_samples, model],\n", "        outputs=[dataset_out]\n", "    )\n", "\n", "    code_run.click(\n", "        execute_code_in_virtualenv,\n", "        inputs=[dataset_out],\n", "        outputs=[code_out]\n", "    )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 646 }, "id": "jCAkTEtMi2Ep", "outputId": "deeeb1a7-c432-4007-eba2-cbcc28dbc0ff" }, "outputs": [], "source": [ "ui.launch(inbrowser=True)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 4 }