{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "BSbc4VbLi2Ek"
},
"source": [
"# Synthetic Dataset generator\n",
"- 🚀 Live Demo: https://huggingface.co/spaces/lisekarimi/datagen\n",
"- 🧑‍💻 Repo: https://github.com/lisekarimi/datagen\n",
"\n",
"---\n",
"\n",
"- 🌍 **Task**: Generate realistic synthetic datasets\n",
"- 🎯 **Supported Data Types**: Tabular, Text, Time-series\n",
"- 🧠 **Models**: GPT (OpenAI) , Claude (Anthropic), CodeQwen1.5-7B-Chat (via Hugging Face Inference) / Llama (in Google Colab through T4 GPU)\n",
"- 🚀 **Tools**: Python, Gradio UI, OpenAI / Anthropic / HuggingFace APIs\n",
"- 📤 **Output Formats**: JSON and CSV file\n",
"- 🧑‍💻 **Skill Level**: Intermediate\n",
"\n",
"🎯 **How It Works**\n",
"\n",
"1⃣ Define your business problem or dataset topic.\n",
"\n",
"2⃣ Choose the dataset type, output format, model, and number of samples.\n",
"\n",
"3⃣ The LLM generates the code; you can adjust or modify it as needed.\n",
"\n",
"4⃣ Execute the code to generate your output file.\n",
"\n",
"🛠️ **Requirements** \n",
"- ⚙️ **Hardware**: ✅ GPU required (model download); Google Colab recommended (T4)\n",
"- 🔑 OpenAI API Key (for GPT) \n",
"- 🔑 Anthropic API Key (for Claude) \n",
"- 🔑 Hugging Face Token \n",
"\n",
"**Deploy CodeQwen Endpoint:**\n",
"- Visit https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat\n",
"- Click **Deploy** → **Inference Endpoints** → **Create Endpoint** (requires credit card)\n",
"- Copy your endpoint URL: `https://[id].us-east-1.aws.endpoints.huggingface.cloud`\n",
"\n",
"⚙️ **Customizable by user** \n",
"- 🤖 Selected model: GPT / Claude / Llama / Code Qwen\n",
"- 📜 `system_prompt`: Controls model behavior (concise, accurate, structured) \n",
"- 💬 `user_prompt`: Dynamic — include other fields\n",
"\n",
"---\n",
"📢 Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9E-Ioggxi2Em"
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pR-ftUatjEGd",
"outputId": "ae5668c5-c369-4066-bbbf-b560fb28e39a"
},
"outputs": [],
"source": [
"# Install required packages in Google Colab\n",
"%pip install -q python-dotenv gradio anthropic openai requests torch bitsandbytes transformers sentencepiece accelerate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VPmk2-Ggi2Em"
},
"outputs": [],
"source": [
"import re\n",
"import sys\n",
"import subprocess\n",
"import threading\n",
"import anthropic\n",
"import torch\n",
"import gradio as gr\n",
"from openai import OpenAI\n",
"from huggingface_hub import InferenceClient, login\n",
"from google.colab import userdata\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, BitsAndBytesConfig"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DUQ55_oji2En"
},
"source": [
"## Initialization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MiicxGawi2En"
},
"outputs": [],
"source": [
"# Google Colab User Data\n",
"# Ensure you have set the following in your Google Colab environment:\n",
"openai_api_key = userdata.get(\"OPENAI_API_KEY\")\n",
"anthropic_api_key = userdata.get(\"ANTHROPIC_API_KEY\")\n",
"hf_token = userdata.get('HF_TOKEN')"
]
},
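{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outside Google Colab, `google.colab.userdata` is unavailable. Below is a minimal sketch of an environment-variable fallback; it assumes the same key names are set in your shell or in a local `.env` file (`python-dotenv` is already in the install list above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: load the same keys from environment variables instead of Colab secrets\n",
"import os\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()  # reads a local .env file if one exists\n",
"openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"anthropic_api_key = os.getenv(\"ANTHROPIC_API_KEY\")\n",
"hf_token = os.getenv(\"HF_TOKEN\")"
]
},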
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"OPENAI_MODEL = \"gpt-4o-mini\"\n",
"CLAUDE_MODEL = \"claude-3-5-sonnet-20240620\"\n",
"LLAMA = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n",
"\n",
"code_qwen = \"Qwen/CodeQwen1.5-7B-Chat\"\n",
"CODE_QWEN_URL = \"https://zfkokxzs1xrqv13v.us-east-1.aws.endpoints.huggingface.cloud\"\n",
"\n",
"login(hf_token, add_to_git_credential=True)\n",
"openai = OpenAI(api_key=openai_api_key)\n",
"claude = anthropic.Anthropic(api_key=anthropic_api_key)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ipA1F440i2En"
},
"source": [
"## Prompts definition"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JgtqCyRji2En"
},
"outputs": [],
"source": [
"system_message = \"\"\"\n",
"You are a helpful assistant whose main purpose is to generate datasets for business problems.\n",
"\n",
"Be less verbose.\n",
"Be accurate and concise.\n",
"\n",
"The user will describe a business problem. Based on this, you must generate a synthetic dataset that fits the context.\n",
"\n",
"The dataset should be saved in a specific format such as CSV, JSON — the desired format will be specified by the user.\n",
"\n",
"The dependencies for python code should include only standard python libraries such as numpy, pandas and built-in libraries.\n",
"\n",
"When saving a DataFrame to JSON using `to_json()`, do not use the `encoding` parameter. Instead, manually open the file with `open()` and specify the encoding. Then pass the file object to `to_json()`.\n",
"\n",
"Ensure Python code blocks are correctly indented, especially inside `with`, `for`, `if`, `try`, and `def` blocks.\n",
"\n",
"Return only the Python code that generates and saves the dataset.\n",
"After saving the file, print the code that was executed and a message confirming the dataset was generated successfully.\n",
"\"\"\"\n"
]
},
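{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `to_json()` instruction in the system prompt exists because pandas' `DataFrame.to_json()` has no `encoding` parameter. A short sketch of the pattern the prompt asks the model to follow (illustrative data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the JSON-saving pattern the system prompt mandates:\n",
"# open the file with an explicit encoding, then hand the file object to to_json()\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\"product\": [\"café\", \"thé\"], \"units\": [12, 7]})\n",
"with open(\"dataset.json\", \"w\", encoding=\"utf-8\") as f:\n",
"    df.to_json(f, orient=\"records\", force_ascii=False)"
]
},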
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bk6saP4oi2Eo"
},
"outputs": [],
"source": [
"def user_prompt(**input_data):\n",
" user_prompt = f\"\"\"\n",
" Generate a synthetic {input_data[\"dataset_type\"].lower()} dataset in {input_data[\"output_format\"].upper()} format.\n",
" Business problem: {input_data[\"business_problem\"]}\n",
" Samples: {input_data[\"num_samples\"]}\n",
" \"\"\"\n",
" return user_prompt\n"
]
},
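{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the rendered prompt, using illustrative values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check: render the user prompt for a sample request (illustrative values)\n",
"print(user_prompt(\n",
"    business_problem=\"Predict churn for a telecom provider\",\n",
"    dataset_type=\"Tabular\",\n",
"    output_format=\"csv\",\n",
"    num_samples=100,\n",
"))"
]
},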
{
"cell_type": "markdown",
"metadata": {
"id": "XnrPiAZ7i2Eo"
},
"source": [
"## Call API for Closed Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Sx7hHKczi2Eo"
},
"outputs": [],
"source": [
"def stream_gpt(user_prompt):\n",
" stream = openai.chat.completions.create(\n",
" model=OPENAI_MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ],\n",
" stream=True,\n",
" )\n",
"\n",
" response = \"\"\n",
" for chunk in stream:\n",
" response += chunk.choices[0].delta.content or \"\"\n",
" yield response\n",
"\n",
" return response\n",
"\n",
"\n",
"def stream_claude(user_prompt):\n",
" result = claude.messages.stream(\n",
" model=CLAUDE_MODEL,\n",
" max_tokens=2000,\n",
" system=system_message,\n",
" messages=[\n",
" {\"role\": \"user\",\"content\": user_prompt}\n",
" ]\n",
" )\n",
" reply = \"\"\n",
" with result as stream:\n",
" for text in stream.text_stream:\n",
" reply += text\n",
" yield reply\n",
" print(text, end=\"\", flush=True)\n",
" return reply\n"
]
},
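{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both functions are generators that yield the accumulated response so far, which is what Gradio expects for a live-updating textbox. A minimal sketch of consuming one directly, assuming a valid OpenAI key:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: drive a streaming generator outside Gradio (requires a valid API key)\n",
"last = \"\"\n",
"for partial in stream_gpt(\"Generate a tiny CSV dataset about coffee sales. Samples: 5\"):\n",
"    last = partial  # each value is the full response so far, not a delta\n",
"print(last)"
]
},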
{
"cell_type": "markdown",
"metadata": {
"id": "PUPeZ4xPi2Eo"
},
"source": [
"## Call Open Source Models\n",
"- Llama is downloaded and run on T4 GPU (Google Colab).\n",
"- Code Qwen is run through inference endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W0AuZT2uk0Sd"
},
"outputs": [],
"source": [
"def stream_llama(user_prompt):\n",
" try:\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ]\n",
"\n",
" tokenizer = AutoTokenizer.from_pretrained(LLAMA)\n",
" tokenizer.pad_token = tokenizer.eos_token\n",
"\n",
" quant_config = BitsAndBytesConfig(\n",
" load_in_4bit=True,\n",
" bnb_4bit_use_double_quant=True,\n",
" bnb_4bit_compute_dtype=torch.bfloat16,\n",
" bnb_4bit_quant_type=\"nf4\"\n",
" )\n",
"\n",
" model = AutoModelForCausalLM.from_pretrained(\n",
" LLAMA,\n",
" device_map=\"auto\",\n",
" quantization_config=quant_config\n",
" )\n",
"\n",
" inputs = tokenizer.apply_chat_template(messages, return_tensors=\"pt\").to(\"cuda\")\n",
" streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)\n",
"\n",
" thread = threading.Thread(target=model.generate, kwargs={\n",
" \"input_ids\": inputs,\n",
" \"max_new_tokens\": 1000,\n",
" \"pad_token_id\": tokenizer.eos_token_id,\n",
" \"streamer\": streamer\n",
" })\n",
" thread.start()\n",
"\n",
" started = False\n",
" reply = \"\"\n",
"\n",
" for new_text in streamer:\n",
" if not started:\n",
" if \"<|start_header_id|>assistant<|end_header_id|>\" in new_text:\n",
" started = True\n",
" new_text = new_text.split(\"<|start_header_id|>assistant<|end_header_id|>\")[-1].strip()\n",
" else:\n",
" continue\n",
"\n",
" if \"<|eot_id|>\" in new_text:\n",
" new_text = new_text.replace(\"<|eot_id|>\", \"\")\n",
" if new_text.strip():\n",
" reply += new_text\n",
" yield reply\n",
" break\n",
"\n",
" if new_text.strip():\n",
" reply += new_text\n",
" yield reply\n",
"\n",
" return reply\n",
"\n",
" except Exception as e:\n",
" print(f\"LLaMA error: {e}\")\n",
" raise\n"
]
},
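{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the streamer is created with `skip_special_tokens=False`, the loop above has to strip Llama's chat markup by hand. Here is the same filtering applied to a canned token stream, so you can follow the logic without a GPU or model download:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the header/eot filtering from stream_llama on fake streamer output\n",
"fake_stream = [\"<|start_header_id|>assistant<|end_header_id|>\\n\\nHello\", \", world\", \"!<|eot_id|>\"]\n",
"started, reply = False, \"\"\n",
"for new_text in fake_stream:\n",
"    if not started:\n",
"        if \"<|start_header_id|>assistant<|end_header_id|>\" in new_text:\n",
"            started = True\n",
"            new_text = new_text.split(\"<|start_header_id|>assistant<|end_header_id|>\")[-1].strip()\n",
"        else:\n",
"            continue\n",
"    if \"<|eot_id|>\" in new_text:\n",
"        reply += new_text.replace(\"<|eot_id|>\", \"\")\n",
"        break\n",
"    reply += new_text\n",
"print(reply)  # -> Hello, world!"
]
},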
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "V0JS_6THi2Eo"
},
"outputs": [],
"source": [
"def stream_code_qwen(user_prompt):\n",
" tokenizer = AutoTokenizer.from_pretrained(code_qwen)\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ]\n",
" text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
" client = InferenceClient(CODE_QWEN_URL, token=hf_token)\n",
" stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)\n",
" result = \"\"\n",
" for r in stream:\n",
" result += r.token.text\n",
" yield result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PqG57dJIi2Eo"
},
"source": [
"## Select the model and generate the ouput"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YqSKnklRi2Eo"
},
"outputs": [],
"source": [
"def generate_from_inputs(model, **input_data):\n",
" # print(\"🔍 input_data received:\", input_data)\n",
" user_prompt_str = user_prompt(**input_data)\n",
"\n",
" if model == \"GPT\":\n",
" result = stream_gpt(user_prompt_str)\n",
" elif model == \"Claude\":\n",
" result = stream_claude(user_prompt_str)\n",
" elif model == \"Llama\":\n",
" result = stream_llama(user_prompt_str)\n",
" elif model == \"Code Qwen\":\n",
" result = stream_code_qwen(user_prompt_str)\n",
" else:\n",
" raise ValueError(\"Unknown model\")\n",
"\n",
" for stream_so_far in result:\n",
" yield stream_so_far\n",
"\n",
" return result\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zG6_TSfni2Eo"
},
"outputs": [],
"source": [
"def handle_generate(business_problem, dataset_type, dataset_format, num_samples, model):\n",
" input_data = {\n",
" \"business_problem\": business_problem,\n",
" \"dataset_type\": dataset_type,\n",
" \"output_format\": dataset_format,\n",
" \"num_samples\": num_samples,\n",
" }\n",
"\n",
" response = generate_from_inputs(model, **input_data)\n",
" for chunk in response:\n",
" yield chunk\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p5DQcx71i2Ep"
},
"source": [
"## Extract python code from the LLM output and execute it locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NcEkmsnai2Ep",
"jp-MarkdownHeadingCollapsed": true
},
"outputs": [],
"source": [
"def extract_code(text):\n",
" match = re.search(r\"```python(.*?)```\", text, re.DOTALL)\n",
"\n",
" if match:\n",
" code = match.group(0).strip()\n",
" else:\n",
" code = \"\"\n",
" print(\"No matching substring found.\")\n",
"\n",
" return code.replace(\"```python\\n\", \"\").replace(\"```\", \"\")\n",
"\n",
"\n",
"def execute_code_in_virtualenv(text, python_interpreter=sys.executable):\n",
" if not python_interpreter:\n",
" raise EnvironmentError(\"Python interpreter not found in the specified virtual environment.\")\n",
"\n",
" code_str = extract_code(text)\n",
" command = [python_interpreter, '-c', code_str]\n",
"\n",
" try:\n",
" result = subprocess.run(command, check=True, capture_output=True, text=True)\n",
" stdout = result.stdout\n",
" return stdout\n",
"\n",
" except subprocess.CalledProcessError as e:\n",
" return f\"Execution error:\\n{e}\"\n"
]
},
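{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the extraction step on a canned response, so you can verify the regex without an API call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check: extract_code pulls the fenced block out of a canned LLM reply\n",
"sample_reply = \"Here is the code:\\n```python\\nprint('hello dataset')\\n```\\nDone.\"\n",
"print(extract_code(sample_reply))  # -> print('hello dataset')"
]
},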
{
"cell_type": "markdown",
"metadata": {
"id": "DQgEyFzJi2Ep"
},
"source": [
"## Gradio interface"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SEiZVkdFi2Ep"
},
"outputs": [],
"source": [
"def update_output_format(dataset_type):\n",
" if dataset_type in [\"Tabular\", \"Time-series\"]:\n",
" return gr.update(choices=[\"JSON\", \"csv\"], value=\"JSON\")\n",
" elif dataset_type == \"Text\":\n",
" return gr.update(choices=[\"JSON\"], value=\"JSON\")\n",
"\n",
"with gr.Blocks() as ui:\n",
" gr.Markdown(\"## Create a dataset for a business problem\")\n",
"\n",
" with gr.Column():\n",
" business_problem = gr.Textbox(label=\"Business problem\", lines=2)\n",
" dataset_type = gr.Dropdown(\n",
" [\"Tabular\", \"Time-series\", \"Text\"], label=\"Dataset type\"\n",
" )\n",
"\n",
" output_format = gr.Dropdown( choices=[\"JSON\", \"csv\"], value=\"JSON\",label=\"Output Format\")\n",
"\n",
" num_samples = gr.Number(label=\"Number of samples\", value=10, precision=0)\n",
"\n",
" model = gr.Dropdown([\"GPT\", \"Claude\", \"Llama\", \"Code Qwen\"], label=\"Select model\", value=\"GPT\")\n",
"\n",
" dataset_type.change(update_output_format,inputs=[dataset_type], outputs=[output_format])\n",
"\n",
" with gr.Row():\n",
" with gr.Column():\n",
" dataset_run = gr.Button(\"Create a dataset\")\n",
" gr.Markdown(\"\"\"⚠️ For Llama and Code Qwen: The generated code might not be optimal. It's recommended to review it before execution.\n",
" Some mistakes may occur.\"\"\")\n",
"\n",
" with gr.Column():\n",
" code_run = gr.Button(\"Execute code for a dataset\")\n",
" gr.Markdown(\"\"\"⚠️ Be cautious when sharing this app with code execution publicly, as it could pose safety risks.\n",
" The execution of user-generated code may lead to potential vulnerabilities, and its important to use this tool responsibly.\"\"\")\n",
"\n",
" with gr.Row():\n",
" dataset_out = gr.Textbox(label=\"Generated Dataset\")\n",
" code_out = gr.Textbox(label=\"Executed code\")\n",
"\n",
" dataset_run.click(\n",
" handle_generate,\n",
" inputs=[business_problem, dataset_type, output_format, num_samples, model],\n",
" outputs=[dataset_out]\n",
" )\n",
"\n",
" code_run.click(\n",
" execute_code_in_virtualenv,\n",
" inputs=[dataset_out],\n",
" outputs=[code_out]\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 646
},
"id": "jCAkTEtMi2Ep",
"outputId": "deeeb1a7-c432-4007-eba2-cbcc28dbc0ff"
},
"outputs": [],
"source": [
"ui.launch(inbrowser=True)"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}