Refreshed notebooks, particularly with new Week 1

This commit is contained in:
Edward Donner
2024-11-13 15:46:22 +00:00
parent 6ba1875cd3
commit 21c7a8155c
34 changed files with 2331 additions and 410 deletions

View File

@@ -259,7 +259,22 @@
"\n",
"And will create a prompt to be used during Training.\n",
"\n",
"Items will be rejected if they don't have sufficient characters."
"Items will be rejected if they don't have sufficient characters.\n",
"\n",
"## But why 180 tokens??\n",
"\n",
"A student asked me a great question - why are we truncating to 180 tokens? How did we determine that number? (Thank you Moataz A. for the excellent question).\n",
"\n",
"The answer: this is an example of a \"hyper-parameter\". In other words, it's basically trial and error! We want a sufficiently large number of tokens so that we have enough useful information to gauge the price. But we also want to keep the number low so that we can train efficiently. You'll see this in action in Week 7.\n",
"\n",
"I started with a number that seemed reasonable, and experimented with a few variations before settling on 180. If you have time, you should do the same! You might find that you can beat my results by finding a better balance. This kind of trial-and-error might sound a bit unsatisfactory, but it's a crucial part of the data science R&D process.\n",
"\n",
"There's another interesting reason why we might favor a lower number of tokens in the training data. When we eventually get to use our model at inference time, we'll want to provide new products and have it estimate a price. And we'll be using short descriptions of products - like 1-2 sentences. For best performance, we should size our training data to be similar to the inputs we will provide at inference time.\n",
"\n",
"## But I see in items.py it constrains inputs to 160 tokens?\n",
"\n",
"Another great question from Moataz A.! The description of the products is limited to 160 tokens because we add some more text before and after the description to turn it into a prompt. That brings it to around 180 tokens in total.\n",
"\n"
]
},
{
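To make the token arithmetic above concrete, here is a minimal sketch, assuming the Llama 3.1 tokenizer and an illustrative prompt wrapper (the wording and helper names are not the course's actual code): it truncates a description to 160 tokens, wraps it in prompt text, and checks that the total lands near 180.

```python
# Minimal sketch of the token budget -- illustrative, not the course code.
from transformers import AutoTokenizer

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
MAX_TOKENS = 160  # description budget before the prompt text is added

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def truncate(description: str) -> str:
    # Encode, cut at the token budget, and decode back to text
    tokens = tokenizer.encode(description, add_special_tokens=False)
    return tokenizer.decode(tokens[:MAX_TOKENS])

def make_prompt(description: str) -> str:
    # Hypothetical wrapper text; the real prompt differs
    return f"How much does this cost?\n\n{truncate(description)}\n\nPrice is $"

sample = "Stainless steel 8-quart stock pot with a tempered glass lid. " * 20
print(len(tokenizer.encode(make_prompt(sample))))  # lands roughly at 180
```

The exact total depends on the wrapper text, which is why the description budget is 160 rather than 180: the remaining ~20 tokens are reserved for the prompt scaffolding.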

View File

@@ -155,8 +155,6 @@
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"with open('train.pkl', 'rb') as file:\n",
" train = pickle.load(file)\n",
"\n",

View File

@@ -529,77 +529,6 @@
"source": [
"Tester.test(gpt_fine_tuned, test)"
]
},
{
"cell_type": "code",
"execution_count": 320,
"id": "03ff4b48-3788-4370-9e34-6592f23d1bce",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DNS resolution for api.gradio.app: 54.68.118.249\n",
"Gradio API Status: 500\n",
"Gradio API Response: Internal Server Error\n",
"HuggingFace CDN Status: 403\n"
]
}
],
"source": [
"import requests\n",
"import socket\n",
"\n",
"def check_connectivity():\n",
" try:\n",
" # Check DNS resolution\n",
" ip = socket.gethostbyname('api.gradio.app')\n",
" print(f\"DNS resolution for api.gradio.app: {ip}\")\n",
"\n",
" # Check connection to Gradio API\n",
" response = requests.get(\"https://api.gradio.app/v2/tunnel/\", timeout=5)\n",
" print(f\"Gradio API Status: {response.status_code}\")\n",
" print(f\"Gradio API Response: {response.text}\")\n",
"\n",
" # Check connection to HuggingFace CDN\n",
" cdn_response = requests.get(\"https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_aarch64\", timeout=5)\n",
" print(f\"HuggingFace CDN Status: {cdn_response.status_code}\")\n",
" except Exception as e:\n",
" print(f\"Error in connectivity check: {e}\")\n",
"\n",
"check_connectivity()"
]
},
{
"cell_type": "code",
"execution_count": 323,
"id": "f7d4eec4-da5e-4fbf-ba3e-fbbcfb399d6c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'4.44.0'"
]
},
"execution_count": 323,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import gradio\n",
"gradio.__version__"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cad08a54-912b-43d2-9280-f00b5a7775a6",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {

View File

@@ -3,8 +3,10 @@ from transformers import AutoTokenizer
import re
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
MIN_TOKENS = 150 # Any less than this, and we don't have enough useful content
MAX_TOKENS = 160 # Truncate after this many tokens. Then after adding in prompt text, we will get to around 180 tokens
MIN_CHARS = 300
CEILING_CHARS = MAX_TOKENS * 7
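
For context, here is a sketch of how these four constants plausibly fit together; the filtering logic is an assumption for illustration, not the actual items.py implementation. The character limits act as a cheap pre-filter so that tokenization only ever runs on bounded strings.

```python
# Assumed filtering logic, for illustration only -- not the real items.py.
from typing import Optional
from transformers import AutoTokenizer

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
MIN_TOKENS = 150                 # below this, not enough content to gauge a price
MAX_TOKENS = 160                 # truncate here; prompt text brings the total to ~180
MIN_CHARS = 300                  # cheap reject before tokenizing at all
CEILING_CHARS = MAX_TOKENS * 7   # generous chars-per-token bound keeps encoding fast

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

def usable_text(text: str) -> Optional[str]:
    if len(text) < MIN_CHARS:
        return None                        # too short to be useful
    text = text[:CEILING_CHARS]            # bound the cost of tokenization
    tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(tokens) < MIN_TOKENS:
        return None                        # still not enough useful content
    return tokenizer.decode(tokens[:MAX_TOKENS])
```

The factor of 7 characters per token is deliberately generous (English text averages roughly 4), so the character ceiling never cuts a description below what the token-level truncation would keep anyway.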