Improvements to explanations and minor edits
@@ -11,7 +11,8 @@
 "\n",
 "## Data Curation Part 2\n",
 "\n",
-"Today we'll extend our dataset to a greater coverage, and craft it into an excellent dataset for training.\n",
+"Today we'll extend our dataset to a greater coverage, and craft it into an excellent dataset for training. \n",
+"Data curation can seem less exciting than other things we work on, but it's a crucial part of the LLM engineers' responsibility and an important craft to hone, so that you can build your own commercial solutions with high quality datasets.\n",
 "\n",
 "The dataset is here: \n",
 "https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023\n",
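The dataset referenced above can be pulled straight from the Hugging Face Hub. A minimal sketch, assuming the `datasets` library and a per-category config such as `raw_meta_Appliances` (the config name and `split="full"` are assumptions based on the dataset card, not taken from this commit):

```python
# Sketch: load one product category of the Amazon Reviews 2023 dataset.
# The config name and split are assumptions, not from the notebook itself.
from datasets import load_dataset

appliances = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_meta_Appliances",
    split="full",
    trust_remote_code=True,
)
print(f"{len(appliances):,} items")
print(appliances[0])
```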
@@ -23,7 +24,9 @@
 "\n",
 "We are about to craft a massive dataset of 400,000 items covering multiple types of product. In Week 7 we will be using this data to train our own model. It's a pretty big dataset, and depending on the GPU you select, training could take 20+ hours. It will be really good fun, but it could cost a few dollars in compute units.\n",
 "\n",
-"As an alternative, if you want to keep things quick & low cost, you can work with a smaller dataset focused only on Home Appliances. You'll be able to cover the same learning points; the results will be good -- not quite as good as the full dataset, but still pretty amazing! If you'd prefer to do this, I've set up an alternative jupyter notebook in this folder called `lite.ipynb` that you should use in place of this one."
+"As an alternative, if you want to keep things quick & low cost, you can work with a smaller dataset focused only on Home Appliances. You'll be able to cover the same learning points; the results will be good -- not quite as good as the full dataset, but still pretty amazing! If you'd prefer to do this, I've set up an alternative jupyter notebook in this folder called `lite.ipynb` that you should use in place of this one.\n",
+"\n",
+"Also, if you'd prefer, you can shortcut running all this data curation by downloading the pickle files that we save in the last cell. The pickle files are available here: https://drive.google.com/drive/folders/1f_IZGybvs9o0J5sb3xmtTEQB3BXllzrW"
 ]
 },
 {
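For the pickle shortcut added above, restoring the saved files is a one-liner per file. A minimal sketch, with `train.pkl` and `test.pkl` as assumed file names (the actual names come from the notebook's final cell):

```python
# Sketch: restore the curated datasets from the downloaded pickle files.
# File names are assumptions; use whatever the notebook's last cell saved.
import pickle

with open("train.pkl", "rb") as f:
    train = pickle.load(f)

with open("test.pkl", "rb") as f:
    test = pickle.load(f)
```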
@@ -610,7 +613,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.10"
+"version": "3.11.11"
 }
 },
 "nbformat": 4,
@@ -52,6 +52,20 @@
 "from sklearn.preprocessing import StandardScaler"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "b3c87c11-8dbe-4b8c-8989-01e3d3a60026",
+"metadata": {},
+"source": [
+"## NLP imports\n",
+"\n",
+"In the next cell, we have more imports for our NLP related machine learning. \n",
+"If the gensim import gives you an error like \"Cannot import name 'triu' from 'scipy.linalg'\" then please run in another cell: \n",
+"`!pip install \"scipy<1.13\"` \n",
+"As described on StackOverflow [here](https://stackoverflow.com/questions/78279136/importerror-cannot-import-name-triu-from-scipy-linalg-when-importing-gens). \n",
+"Many thanks to students Arnaldo G and Ard V for sorting this."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
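The scipy pin in the new markdown cell works because scipy 1.13 removed `triu` from `scipy.linalg`, which older gensim releases import at module load. A minimal sketch of a defensive import that surfaces the suggested fix (this guard is illustrative, not part of the commit):

```python
# Sketch: fail with an actionable message if gensim hits the scipy 1.13
# 'triu' removal described in the markdown cell above.
try:
    from gensim.models import Word2Vec
except ImportError as e:
    raise ImportError(
        "gensim import failed, likely the scipy.linalg 'triu' issue; "
        'try: pip install "scipy<1.13" and restart the kernel'
    ) from e
```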
@@ -59,7 +73,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# And more imports for our NLP related machine learning\n",
+"# NLP related imports\n",
 "\n",
 "from sklearn.feature_extraction.text import CountVectorizer\n",
 "from gensim.models import Word2Vec\n",
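To make the renamed import cell concrete: `CountVectorizer` builds a sparse bag-of-words matrix, while `Word2Vec` learns dense vectors from tokenized sentences. A minimal sketch with toy product descriptions (the data and parameters are illustrative, not from the notebook):

```python
# Sketch: contrast the two NLP imports on toy product descriptions.
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

docs = [
    "replacement drum belt for clothes dryer",
    "stainless steel washer fill hose",
]

# Bag-of-words: one row per document, one column per vocabulary word.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(counts.shape)  # (2, vocabulary size)

# Word2Vec expects pre-tokenized sentences and learns one vector per word.
sentences = [doc.split() for doc in docs]
w2v = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=1)
print(w2v.wv["dryer"].shape)  # (50,)
```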
@@ -1508,7 +1508,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.10"
+"version": "3.11.11"
 }
 },
 "nbformat": 4,