Add week4 contributions

This commit is contained in:
lisekarimi
2025-06-05 16:40:08 +02:00
parent 5782ca2b43
commit 2d511eb3c6

@@ -0,0 +1,569 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "BSbc4VbLi2Ek"
},
"source": [
"# Synthetic Dataset generator\n",
"- 🚀 Live Demo: https://huggingface.co/spaces/lisekarimi/datagen\n",
"- 🧑‍💻 Repo: https://github.com/lisekarimi/datagen\n",
"\n",
"---\n",
"\n",
"- 🌍 **Task**: Generate realistic synthetic datasets\n",
"- 🎯 **Supported Data Types**: Tabular, Text, Time-series\n",
"- 🧠 **Models**: GPT (OpenAI) , Claude (Anthropic), CodeQwen1.5-7B-Chat (via Hugging Face Inference) / Llama (in Google Colab through T4 GPU)\n",
"- 🚀 **Tools**: Python, Gradio UI, OpenAI / Anthropic / HuggingFace APIs\n",
"- 📤 **Output Formats**: JSON and CSV file\n",
"- 🧑‍💻 **Skill Level**: Intermediate\n",
"\n",
"🎯 **How It Works**\n",
"\n",
"1⃣ Define your business problem or dataset topic.\n",
"\n",
"2⃣ Choose the dataset type, output format, model, and number of samples.\n",
"\n",
"3⃣ The LLM generates the code; you can adjust or modify it as needed.\n",
"\n",
"4⃣ Execute the code to generate your output file.\n",
"\n",
"🛠️ **Requirements** \n",
"- ⚙️ **Hardware**: ✅ GPU required (model download); Google Colab recommended (T4)\n",
"- 🔑 OpenAI API Key (for GPT) \n",
"- 🔑 Anthropic API Key (for Claude) \n",
"- 🔑 Hugging Face Token \n",
"\n",
"**Deploy CodeQwen Endpoint:**\n",
"- Visit https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat\n",
"- Click **Deploy** → **Inference Endpoints** → **Create Endpoint** (requires credit card)\n",
"- Copy your endpoint URL: `https://[id].us-east-1.aws.endpoints.huggingface.cloud`\n",
"\n",
"⚙️ **Customizable by user** \n",
"- 🤖 Selected model: GPT / Claude / Llama / Code Qwen\n",
"- 📜 `system_prompt`: Controls model behavior (concise, accurate, structured) \n",
"- 💬 `user_prompt`: Dynamic — include other fields\n",
"\n",
"---\n",
"📢 Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9E-Ioggxi2Em"
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pR-ftUatjEGd",
"outputId": "ae5668c5-c369-4066-bbbf-b560fb28e39a"
},
"outputs": [],
"source": [
"# Install required packages in Google Colab\n",
"%pip install -q python-dotenv gradio anthropic openai requests torch bitsandbytes transformers sentencepiece accelerate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "VPmk2-Ggi2Em"
},
"outputs": [],
"source": [
"import re\n",
"import sys\n",
"import subprocess\n",
"import threading\n",
"import anthropic\n",
"import torch\n",
"import gradio as gr\n",
"from openai import OpenAI\n",
"from huggingface_hub import InferenceClient, login\n",
"from google.colab import userdata\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, BitsAndBytesConfig"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DUQ55_oji2En"
},
"source": [
"## Initialization"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MiicxGawi2En"
},
"outputs": [],
"source": [
"# Google Colab User Data\n",
"# Ensure you have set the following in your Google Colab environment:\n",
"openai_api_key = userdata.get(\"OPENAI_API_KEY\")\n",
"anthropic_api_key = userdata.get(\"ANTHROPIC_API_KEY\")\n",
"hf_token = userdata.get('HF_TOKEN')"
]
},
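{
"cell_type": "markdown",
"metadata": {},
"source": [
"Outside Colab, `google.colab.userdata` is unavailable. Below is a minimal fallback sketch using `python-dotenv` (already in the install cell); it assumes a local `.env` file with the same key names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional fallback for local runs (assumes a .env file with the same key names)\n",
"# import os\n",
"# from dotenv import load_dotenv\n",
"#\n",
"# load_dotenv()\n",
"# openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n",
"# anthropic_api_key = os.getenv(\"ANTHROPIC_API_KEY\")\n",
"# hf_token = os.getenv(\"HF_TOKEN\")"
]
},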
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"OPENAI_MODEL = \"gpt-4o-mini\"\n",
"CLAUDE_MODEL = \"claude-3-5-sonnet-20240620\"\n",
"LLAMA = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n",
"\n",
"code_qwen = \"Qwen/CodeQwen1.5-7B-Chat\"\n",
"CODE_QWEN_URL = \"https://zfkokxzs1xrqv13v.us-east-1.aws.endpoints.huggingface.cloud\"\n",
"\n",
"login(hf_token, add_to_git_credential=True)\n",
"openai = OpenAI(api_key=openai_api_key)\n",
"claude = anthropic.Anthropic(api_key=anthropic_api_key)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ipA1F440i2En"
},
"source": [
"## Prompts definition"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JgtqCyRji2En"
},
"outputs": [],
"source": [
"system_message = \"\"\"\n",
"You are a helpful assistant whose main purpose is to generate datasets for business problems.\n",
"\n",
"Be less verbose.\n",
"Be accurate and concise.\n",
"\n",
"The user will describe a business problem. Based on this, you must generate a synthetic dataset that fits the context.\n",
"\n",
"The dataset should be saved in a specific format such as CSV, JSON — the desired format will be specified by the user.\n",
"\n",
"The dependencies for python code should include only standard python libraries such as numpy, pandas and built-in libraries.\n",
"\n",
"When saving a DataFrame to JSON using `to_json()`, do not use the `encoding` parameter. Instead, manually open the file with `open()` and specify the encoding. Then pass the file object to `to_json()`.\n",
"\n",
"Ensure Python code blocks are correctly indented, especially inside `with`, `for`, `if`, `try`, and `def` blocks.\n",
"\n",
"Return only the Python code that generates and saves the dataset.\n",
"After saving the file, print the code that was executed and a message confirming the dataset was generated successfully.\n",
"\"\"\"\n"
]
},
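{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the `to_json()` encoding workaround the system prompt asks the model to follow (the file name and columns are made up for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative example: pandas' to_json() has no encoding parameter,\n",
"# so open the file with the desired encoding and pass the handle instead.\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\"product\": [\"café\", \"thé\"], \"units_sold\": [12, 7]})\n",
"with open(\"sample_dataset.json\", \"w\", encoding=\"utf-8\") as f:\n",
"    df.to_json(f, orient=\"records\", force_ascii=False)"
]
},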
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bk6saP4oi2Eo"
},
"outputs": [],
"source": [
"def user_prompt(**input_data):\n",
" user_prompt = f\"\"\"\n",
" Generate a synthetic {input_data[\"dataset_type\"].lower()} dataset in {input_data[\"output_format\"].upper()} format.\n",
" Business problem: {input_data[\"business_problem\"]}\n",
" Samples: {input_data[\"num_samples\"]}\n",
" \"\"\"\n",
" return user_prompt\n"
]
},
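{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check of the prompt template with placeholder inputs (the values below are illustrative, not from the app):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview the rendered user prompt with hypothetical inputs\n",
"print(user_prompt(\n",
"    dataset_type=\"Tabular\",\n",
"    output_format=\"csv\",\n",
"    business_problem=\"Predict churn for a telecom operator\",\n",
"    num_samples=50,\n",
"))"
]
},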
{
"cell_type": "markdown",
"metadata": {
"id": "XnrPiAZ7i2Eo"
},
"source": [
"## Call API for Closed Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Sx7hHKczi2Eo"
},
"outputs": [],
"source": [
"def stream_gpt(user_prompt):\n",
" stream = openai.chat.completions.create(\n",
" model=OPENAI_MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ],\n",
" stream=True,\n",
" )\n",
"\n",
" response = \"\"\n",
" for chunk in stream:\n",
" response += chunk.choices[0].delta.content or \"\"\n",
" yield response\n",
"\n",
" return response\n",
"\n",
"\n",
"def stream_claude(user_prompt):\n",
" result = claude.messages.stream(\n",
" model=CLAUDE_MODEL,\n",
" max_tokens=2000,\n",
" system=system_message,\n",
" messages=[\n",
" {\"role\": \"user\",\"content\": user_prompt}\n",
" ]\n",
" )\n",
" reply = \"\"\n",
" with result as stream:\n",
" for text in stream.text_stream:\n",
" reply += text\n",
" yield reply\n",
" print(text, end=\"\", flush=True)\n",
" return reply\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PUPeZ4xPi2Eo"
},
"source": [
"## Call Open Source Models\n",
"- Llama is downloaded and run on T4 GPU (Google Colab).\n",
"- Code Qwen is run through inference endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W0AuZT2uk0Sd"
},
"outputs": [],
"source": [
"def stream_llama(user_prompt):\n",
" try:\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ]\n",
"\n",
" tokenizer = AutoTokenizer.from_pretrained(LLAMA)\n",
" tokenizer.pad_token = tokenizer.eos_token\n",
"\n",
" quant_config = BitsAndBytesConfig(\n",
" load_in_4bit=True,\n",
" bnb_4bit_use_double_quant=True,\n",
" bnb_4bit_compute_dtype=torch.bfloat16,\n",
" bnb_4bit_quant_type=\"nf4\"\n",
" )\n",
"\n",
" model = AutoModelForCausalLM.from_pretrained(\n",
" LLAMA,\n",
" device_map=\"auto\",\n",
" quantization_config=quant_config\n",
" )\n",
"\n",
" inputs = tokenizer.apply_chat_template(messages, return_tensors=\"pt\").to(\"cuda\")\n",
" streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)\n",
"\n",
" thread = threading.Thread(target=model.generate, kwargs={\n",
" \"input_ids\": inputs,\n",
" \"max_new_tokens\": 1000,\n",
" \"pad_token_id\": tokenizer.eos_token_id,\n",
" \"streamer\": streamer\n",
" })\n",
" thread.start()\n",
"\n",
" started = False\n",
" reply = \"\"\n",
"\n",
" for new_text in streamer:\n",
" if not started:\n",
" if \"<|start_header_id|>assistant<|end_header_id|>\" in new_text:\n",
" started = True\n",
" new_text = new_text.split(\"<|start_header_id|>assistant<|end_header_id|>\")[-1].strip()\n",
" else:\n",
" continue\n",
"\n",
" if \"<|eot_id|>\" in new_text:\n",
" new_text = new_text.replace(\"<|eot_id|>\", \"\")\n",
" if new_text.strip():\n",
" reply += new_text\n",
" yield reply\n",
" break\n",
"\n",
" if new_text.strip():\n",
" reply += new_text\n",
" yield reply\n",
"\n",
" return reply\n",
"\n",
" except Exception as e:\n",
" print(f\"LLaMA error: {e}\")\n",
" raise\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "V0JS_6THi2Eo"
},
"outputs": [],
"source": [
"def stream_code_qwen(user_prompt):\n",
" tokenizer = AutoTokenizer.from_pretrained(code_qwen)\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": system_message},\n",
" {\"role\": \"user\",\"content\": user_prompt},\n",
" ]\n",
" text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
" client = InferenceClient(CODE_QWEN_URL, token=hf_token)\n",
" stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)\n",
" result = \"\"\n",
" for r in stream:\n",
" result += r.token.text\n",
" yield result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "PqG57dJIi2Eo"
},
"source": [
"## Select the model and generate the ouput"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YqSKnklRi2Eo"
},
"outputs": [],
"source": [
"def generate_from_inputs(model, **input_data):\n",
" # print(\"🔍 input_data received:\", input_data)\n",
" user_prompt_str = user_prompt(**input_data)\n",
"\n",
" if model == \"GPT\":\n",
" result = stream_gpt(user_prompt_str)\n",
" elif model == \"Claude\":\n",
" result = stream_claude(user_prompt_str)\n",
" elif model == \"Llama\":\n",
" result = stream_llama(user_prompt_str)\n",
" elif model == \"Code Qwen\":\n",
" result = stream_code_qwen(user_prompt_str)\n",
" else:\n",
" raise ValueError(\"Unknown model\")\n",
"\n",
" for stream_so_far in result:\n",
" yield stream_so_far\n",
"\n",
" return result\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "zG6_TSfni2Eo"
},
"outputs": [],
"source": [
"def handle_generate(business_problem, dataset_type, dataset_format, num_samples, model):\n",
" input_data = {\n",
" \"business_problem\": business_problem,\n",
" \"dataset_type\": dataset_type,\n",
" \"output_format\": dataset_format,\n",
" \"num_samples\": num_samples,\n",
" }\n",
"\n",
" response = generate_from_inputs(model, **input_data)\n",
" for chunk in response:\n",
" yield chunk\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p5DQcx71i2Ep"
},
"source": [
"## Extract python code from the LLM output and execute it locally"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NcEkmsnai2Ep",
"jp-MarkdownHeadingCollapsed": true
},
"outputs": [],
"source": [
"def extract_code(text):\n",
" match = re.search(r\"```python(.*?)```\", text, re.DOTALL)\n",
"\n",
" if match:\n",
" code = match.group(0).strip()\n",
" else:\n",
" code = \"\"\n",
" print(\"No matching substring found.\")\n",
"\n",
" return code.replace(\"```python\\n\", \"\").replace(\"```\", \"\")\n",
"\n",
"\n",
"def execute_code_in_virtualenv(text, python_interpreter=sys.executable):\n",
" if not python_interpreter:\n",
" raise EnvironmentError(\"Python interpreter not found in the specified virtual environment.\")\n",
"\n",
" code_str = extract_code(text)\n",
" command = [python_interpreter, '-c', code_str]\n",
"\n",
" try:\n",
" result = subprocess.run(command, check=True, capture_output=True, text=True)\n",
" stdout = result.stdout\n",
" return stdout\n",
"\n",
" except subprocess.CalledProcessError as e:\n",
" return f\"Execution error:\\n{e}\"\n"
]
},
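{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small self-contained check of `extract_code` on a hand-written response (the fenced snippet below is made up for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical LLM reply wrapping code in a ```python fence\n",
"sample_reply = \"Here is the code:\\n```python\\nprint('hello dataset')\\n```\\nDone.\"\n",
"print(extract_code(sample_reply))  # expected output: print('hello dataset')"
]
},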
{
"cell_type": "markdown",
"metadata": {
"id": "DQgEyFzJi2Ep"
},
"source": [
"## Gradio interface"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SEiZVkdFi2Ep"
},
"outputs": [],
"source": [
"def update_output_format(dataset_type):\n",
" if dataset_type in [\"Tabular\", \"Time-series\"]:\n",
" return gr.update(choices=[\"JSON\", \"csv\"], value=\"JSON\")\n",
" elif dataset_type == \"Text\":\n",
" return gr.update(choices=[\"JSON\"], value=\"JSON\")\n",
"\n",
"with gr.Blocks() as ui:\n",
" gr.Markdown(\"## Create a dataset for a business problem\")\n",
"\n",
" with gr.Column():\n",
" business_problem = gr.Textbox(label=\"Business problem\", lines=2)\n",
" dataset_type = gr.Dropdown(\n",
" [\"Tabular\", \"Time-series\", \"Text\"], label=\"Dataset type\"\n",
" )\n",
"\n",
" output_format = gr.Dropdown( choices=[\"JSON\", \"csv\"], value=\"JSON\",label=\"Output Format\")\n",
"\n",
" num_samples = gr.Number(label=\"Number of samples\", value=10, precision=0)\n",
"\n",
" model = gr.Dropdown([\"GPT\", \"Claude\", \"Llama\", \"Code Qwen\"], label=\"Select model\", value=\"GPT\")\n",
"\n",
" dataset_type.change(update_output_format,inputs=[dataset_type], outputs=[output_format])\n",
"\n",
" with gr.Row():\n",
" with gr.Column():\n",
" dataset_run = gr.Button(\"Create a dataset\")\n",
" gr.Markdown(\"\"\"⚠️ For Llama and Code Qwen: The generated code might not be optimal. It's recommended to review it before execution.\n",
" Some mistakes may occur.\"\"\")\n",
"\n",
" with gr.Column():\n",
" code_run = gr.Button(\"Execute code for a dataset\")\n",
" gr.Markdown(\"\"\"⚠️ Be cautious when sharing this app with code execution publicly, as it could pose safety risks.\n",
" The execution of user-generated code may lead to potential vulnerabilities, and its important to use this tool responsibly.\"\"\")\n",
"\n",
" with gr.Row():\n",
" dataset_out = gr.Textbox(label=\"Generated Dataset\")\n",
" code_out = gr.Textbox(label=\"Executed code\")\n",
"\n",
" dataset_run.click(\n",
" handle_generate,\n",
" inputs=[business_problem, dataset_type, output_format, num_samples, model],\n",
" outputs=[dataset_out]\n",
" )\n",
"\n",
" code_run.click(\n",
" execute_code_in_virtualenv,\n",
" inputs=[dataset_out],\n",
" outputs=[code_out]\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 646
},
"id": "jCAkTEtMi2Ep",
"outputId": "deeeb1a7-c432-4007-eba2-cbcc28dbc0ff"
},
"outputs": [],
"source": [
"ui.launch(inbrowser=True)"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}