diff --git a/week4/community-contributions/07_data_generator.ipynb b/week4/community-contributions/07_data_generator.ipynb new file mode 100644 index 0000000..6de3bcf --- /dev/null +++ b/week4/community-contributions/07_data_generator.ipynb @@ -0,0 +1,569 @@ +{ + "cells": [
+ { + "cell_type": "markdown", + "metadata": { + "id": "BSbc4VbLi2Ek" + }, + "source": [ + "# Synthetic Dataset Generator\n", + "- πŸš€ Live Demo: https://huggingface.co/spaces/lisekarimi/datagen\n", + "- πŸ§‘β€πŸ’» Repo: https://github.com/lisekarimi/datagen\n", + "\n", + "---\n", + "\n", + "- 🌍 **Task**: Generate realistic synthetic datasets\n", + "- 🎯 **Supported Data Types**: Tabular, Text, Time-series\n", + "- 🧠 **Models**: GPT (OpenAI), Claude (Anthropic), CodeQwen1.5-7B-Chat (via a Hugging Face inference endpoint), Llama (run locally in Google Colab on a T4 GPU)\n", + "- πŸš€ **Tools**: Python, Gradio UI, OpenAI / Anthropic / Hugging Face APIs\n", + "- πŸ“€ **Output Formats**: JSON and CSV files\n", + "- πŸ§‘β€πŸ’» **Skill Level**: Intermediate\n", + "\n", + "🎯 **How It Works**\n", + "\n", + "1️⃣ Define your business problem or dataset topic.\n", + "\n", + "2️⃣ Choose the dataset type, output format, model, and number of samples.\n", + "\n", + "3️⃣ The LLM generates the code; you can adjust or modify it as needed.\n", + "\n", + "4️⃣ Execute the code to generate your output file.\n", + "\n", + "πŸ› οΈ **Requirements** \n", + "- βš™οΈ **Hardware**: βœ… GPU required (Llama is downloaded and run locally); Google Colab (T4) recommended\n", + "- πŸ”‘ OpenAI API Key (for GPT) \n", + "- πŸ”‘ Anthropic API Key (for Claude) \n", + "- πŸ”‘ Hugging Face Token \n", + "\n", + "**Deploy CodeQwen Endpoint:**\n", + "- Visit https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat\n", + "- Click **Deploy** β†’ **Inference Endpoints** β†’ **Create Endpoint** (requires credit card)\n", + "- Copy your endpoint URL: `https://[id].us-east-1.aws.endpoints.huggingface.cloud`\n", + "\n", + "βš™οΈ **Customizable by the user** \n", + "- πŸ€– Selected model: GPT / Claude / Llama / Code Qwen\n", + "- πŸ“œ `system_prompt`: Controls model behavior (concise, accurate, structured) \n", + "- πŸ’¬ `user_prompt`: Dynamic, built from the other input fields\n", + "\n", + "---\n", + "πŸ“’ Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)" ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "9E-Ioggxi2Em" + }, + "source": [ + "## Imports" ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pR-ftUatjEGd", + "outputId": "ae5668c5-c369-4066-bbbf-b560fb28e39a" + }, + "outputs": [], + "source": [ + "# Install required packages in Google Colab\n", + "%pip install -q python-dotenv gradio anthropic openai requests torch bitsandbytes transformers sentencepiece accelerate" ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VPmk2-Ggi2Em" + }, + "outputs": [], + "source": [ + "import re\n", + "import sys\n", + "import subprocess\n", + "import threading\n", + "import anthropic\n", + "import torch\n", + "import gradio as gr\n", + "from openai import OpenAI\n", + "from huggingface_hub import InferenceClient, login\n", + "from google.colab import userdata\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, BitsAndBytesConfig" ] + },
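+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Llama is downloaded and run locally later in this notebook, so it is worth confirming a GPU is visible before going further. This is a small added sanity check, not part of the original flow." ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Sanity check: the Llama cells below expect a CUDA device (e.g. a Colab T4).\n", + "print(\"CUDA available:\", torch.cuda.is_available())\n", + "if torch.cuda.is_available():\n", + "    print(\"Device:\", torch.cuda.get_device_name(0))" ] + },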
"metadata": { + "id": "MiicxGawi2En" + }, + "outputs": [], + "source": [ + "# Google Colab User Data\n", + "# Ensure you have set the following in your Google Colab environment:\n", + "openai_api_key = userdata.get(\"OPENAI_API_KEY\")\n", + "anthropic_api_key = userdata.get(\"ANTHROPIC_API_KEY\")\n", + "hf_token = userdata.get('HF_TOKEN')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "OPENAI_MODEL = \"gpt-4o-mini\"\n", + "CLAUDE_MODEL = \"claude-3-5-sonnet-20240620\"\n", + "LLAMA = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n", + "\n", + "code_qwen = \"Qwen/CodeQwen1.5-7B-Chat\"\n", + "CODE_QWEN_URL = \"https://zfkokxzs1xrqv13v.us-east-1.aws.endpoints.huggingface.cloud\"\n", + "\n", + "login(hf_token, add_to_git_credential=True)\n", + "openai = OpenAI(api_key=openai_api_key)\n", + "claude = anthropic.Anthropic(api_key=anthropic_api_key)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ipA1F440i2En" + }, + "source": [ + "## Prompts definition" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JgtqCyRji2En" + }, + "outputs": [], + "source": [ + "system_message = \"\"\"\n", + "You are a helpful assistant whose main purpose is to generate datasets for business problems.\n", + "\n", + "Be less verbose.\n", + "Be accurate and concise.\n", + "\n", + "The user will describe a business problem. Based on this, you must generate a synthetic dataset that fits the context.\n", + "\n", + "The dataset should be saved in a specific format such as CSV, JSON β€” the desired format will be specified by the user.\n", + "\n", + "The dependencies for python code should include only standard python libraries such as numpy, pandas and built-in libraries.\n", + "\n", + "When saving a DataFrame to JSON using `to_json()`, do not use the `encoding` parameter. Instead, manually open the file with `open()` and specify the encoding. 
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Bk6saP4oi2Eo" + }, + "outputs": [], + "source": [ + "def user_prompt(**input_data):\n", + "    prompt = f\"\"\"\n", + "    Generate a synthetic {input_data[\"dataset_type\"].lower()} dataset in {input_data[\"output_format\"].upper()} format.\n", + "    Business problem: {input_data[\"business_problem\"]}\n", + "    Samples: {input_data[\"num_samples\"]}\n", + "    \"\"\"\n", + "    return prompt\n" + ] + },
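+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A quick sanity check of the prompt builder; the field values below are just example inputs:" ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Preview the prompt that will be sent to the selected model (example values only).\n", + "print(user_prompt(\n", + "    dataset_type=\"Tabular\",\n", + "    output_format=\"csv\",\n", + "    business_problem=\"Predict customer churn for a telecom operator\",\n", + "    num_samples=25,\n", + "))" ] + },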
streamer\n", + " })\n", + " thread.start()\n", + "\n", + " started = False\n", + " reply = \"\"\n", + "\n", + " for new_text in streamer:\n", + " if not started:\n", + " if \"<|start_header_id|>assistant<|end_header_id|>\" in new_text:\n", + " started = True\n", + " new_text = new_text.split(\"<|start_header_id|>assistant<|end_header_id|>\")[-1].strip()\n", + " else:\n", + " continue\n", + "\n", + " if \"<|eot_id|>\" in new_text:\n", + " new_text = new_text.replace(\"<|eot_id|>\", \"\")\n", + " if new_text.strip():\n", + " reply += new_text\n", + " yield reply\n", + " break\n", + "\n", + " if new_text.strip():\n", + " reply += new_text\n", + " yield reply\n", + "\n", + " return reply\n", + "\n", + " except Exception as e:\n", + " print(f\"LLaMA error: {e}\")\n", + " raise\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "V0JS_6THi2Eo" + }, + "outputs": [], + "source": [ + "def stream_code_qwen(user_prompt):\n", + " tokenizer = AutoTokenizer.from_pretrained(code_qwen)\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": system_message},\n", + " {\"role\": \"user\",\"content\": user_prompt},\n", + " ]\n", + " text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n", + " client = InferenceClient(CODE_QWEN_URL, token=hf_token)\n", + " stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)\n", + " result = \"\"\n", + " for r in stream:\n", + " result += r.token.text\n", + " yield result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PqG57dJIi2Eo" + }, + "source": [ + "## Select the model and generate the ouput" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YqSKnklRi2Eo" + }, + "outputs": [], + "source": [ + "def generate_from_inputs(model, **input_data):\n", + " # print(\"πŸ” input_data received:\", input_data)\n", + " user_prompt_str = user_prompt(**input_data)\n", + "\n", + " if model == \"GPT\":\n", + " result = stream_gpt(user_prompt_str)\n", + " elif model == \"Claude\":\n", + " result = stream_claude(user_prompt_str)\n", + " elif model == \"Llama\":\n", + " result = stream_llama(user_prompt_str)\n", + " elif model == \"Code Qwen\":\n", + " result = stream_code_qwen(user_prompt_str)\n", + " else:\n", + " raise ValueError(\"Unknown model\")\n", + "\n", + " for stream_so_far in result:\n", + " yield stream_so_far\n", + "\n", + " return result\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zG6_TSfni2Eo" + }, + "outputs": [], + "source": [ + "def handle_generate(business_problem, dataset_type, dataset_format, num_samples, model):\n", + " input_data = {\n", + " \"business_problem\": business_problem,\n", + " \"dataset_type\": dataset_type,\n", + " \"output_format\": dataset_format,\n", + " \"num_samples\": num_samples,\n", + " }\n", + "\n", + " response = generate_from_inputs(model, **input_data)\n", + " for chunk in response:\n", + " yield chunk\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p5DQcx71i2Ep" + }, + "source": [ + "## Extract python code from the LLM output and execute it locally" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NcEkmsnai2Ep", + "jp-MarkdownHeadingCollapsed": true + }, + "outputs": [], + "source": [ + "def extract_code(text):\n", + " match = re.search(r\"```python(.*?)```\", text, re.DOTALL)\n", + "\n", + " if match:\n", + " code = match.group(0).strip()\n", + " else:\n", + " code = 
\"\"\n", + " print(\"No matching substring found.\")\n", + "\n", + " return code.replace(\"```python\\n\", \"\").replace(\"```\", \"\")\n", + "\n", + "\n", + "def execute_code_in_virtualenv(text, python_interpreter=sys.executable):\n", + " if not python_interpreter:\n", + " raise EnvironmentError(\"Python interpreter not found in the specified virtual environment.\")\n", + "\n", + " code_str = extract_code(text)\n", + " command = [python_interpreter, '-c', code_str]\n", + "\n", + " try:\n", + " result = subprocess.run(command, check=True, capture_output=True, text=True)\n", + " stdout = result.stdout\n", + " return stdout\n", + "\n", + " except subprocess.CalledProcessError as e:\n", + " return f\"Execution error:\\n{e}\"\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DQgEyFzJi2Ep" + }, + "source": [ + "## Gradio interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SEiZVkdFi2Ep" + }, + "outputs": [], + "source": [ + "def update_output_format(dataset_type):\n", + " if dataset_type in [\"Tabular\", \"Time-series\"]:\n", + " return gr.update(choices=[\"JSON\", \"csv\"], value=\"JSON\")\n", + " elif dataset_type == \"Text\":\n", + " return gr.update(choices=[\"JSON\"], value=\"JSON\")\n", + "\n", + "with gr.Blocks() as ui:\n", + " gr.Markdown(\"## Create a dataset for a business problem\")\n", + "\n", + " with gr.Column():\n", + " business_problem = gr.Textbox(label=\"Business problem\", lines=2)\n", + " dataset_type = gr.Dropdown(\n", + " [\"Tabular\", \"Time-series\", \"Text\"], label=\"Dataset type\"\n", + " )\n", + "\n", + " output_format = gr.Dropdown( choices=[\"JSON\", \"csv\"], value=\"JSON\",label=\"Output Format\")\n", + "\n", + " num_samples = gr.Number(label=\"Number of samples\", value=10, precision=0)\n", + "\n", + " model = gr.Dropdown([\"GPT\", \"Claude\", \"Llama\", \"Code Qwen\"], label=\"Select model\", value=\"GPT\")\n", + "\n", + " dataset_type.change(update_output_format,inputs=[dataset_type], outputs=[output_format])\n", + "\n", + " with gr.Row():\n", + " with gr.Column():\n", + " dataset_run = gr.Button(\"Create a dataset\")\n", + " gr.Markdown(\"\"\"⚠️ For Llama and Code Qwen: The generated code might not be optimal. 
+ { + "cell_type": "markdown", + "metadata": { + "id": "DQgEyFzJi2Ep" + }, + "source": [ + "## Gradio interface" ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SEiZVkdFi2Ep" + }, + "outputs": [], + "source": [ + "def update_output_format(dataset_type):\n", + "    if dataset_type in [\"Tabular\", \"Time-series\"]:\n", + "        return gr.update(choices=[\"JSON\", \"CSV\"], value=\"JSON\")\n", + "    elif dataset_type == \"Text\":\n", + "        return gr.update(choices=[\"JSON\"], value=\"JSON\")\n", + "\n", + "with gr.Blocks() as ui:\n", + "    gr.Markdown(\"## Create a dataset for a business problem\")\n", + "\n", + "    with gr.Column():\n", + "        business_problem = gr.Textbox(label=\"Business problem\", lines=2)\n", + "        dataset_type = gr.Dropdown(\n", + "            [\"Tabular\", \"Time-series\", \"Text\"], label=\"Dataset type\"\n", + "        )\n", + "\n", + "        output_format = gr.Dropdown(choices=[\"JSON\", \"CSV\"], value=\"JSON\", label=\"Output Format\")\n", + "\n", + "        num_samples = gr.Number(label=\"Number of samples\", value=10, precision=0)\n", + "\n", + "        model = gr.Dropdown([\"GPT\", \"Claude\", \"Llama\", \"Code Qwen\"], label=\"Select model\", value=\"GPT\")\n", + "\n", + "        dataset_type.change(update_output_format, inputs=[dataset_type], outputs=[output_format])\n", + "\n", + "    with gr.Row():\n", + "        with gr.Column():\n", + "            dataset_run = gr.Button(\"Create a dataset\")\n", + "            gr.Markdown(\"\"\"⚠️ For Llama and Code Qwen: the generated code may contain mistakes.\n", + "            Review it before executing.\"\"\")\n", + "\n", + "        with gr.Column():\n", + "            code_run = gr.Button(\"Execute code for a dataset\")\n", + "            gr.Markdown(\"\"\"⚠️ Be cautious about sharing this app publicly: executing LLM-generated code poses safety risks,\n", + "            so use this tool responsibly.\"\"\")\n", + "\n", + "    with gr.Row():\n", + "        dataset_out = gr.Textbox(label=\"Generated Dataset\")\n", + "        code_out = gr.Textbox(label=\"Executed code\")\n", + "\n", + "    dataset_run.click(\n", + "        handle_generate,\n", + "        inputs=[business_problem, dataset_type, output_format, num_samples, model],\n", + "        outputs=[dataset_out]\n", + "    )\n", + "\n", + "    code_run.click(\n", + "        execute_code_in_virtualenv,\n", + "        inputs=[dataset_out],\n", + "        outputs=[code_out]\n", + "    )" ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 646 + }, + "id": "jCAkTEtMi2Ep", + "outputId": "deeeb1a7-c432-4007-eba2-cbcc28dbc0ff" + }, + "outputs": [], + "source": [ + "ui.launch(inbrowser=True)" ] + }
+ ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.11" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}