upload week5 exercise: personal knowledge assistant with local file, Gmail, outlook and Google Workspace files

Zhufeng-Qiu
2025-07-13 06:03:03 -07:00
parent 1bc1229395
commit 820cbd60c7
16 changed files with 2505 additions and 0 deletions


@@ -0,0 +1,154 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "35177005-196a-48b3-bf92-fa37d84693f4",
"metadata": {},
"source": [
"# Gmail API Credential Guide"
]
},
{
"cell_type": "markdown",
"id": "7bcad9ee-cd11-4b12-834d-9f1ddcefb190",
"metadata": {},
"source": [
"Use the Gmail API to Read Your Emails\n",
"1. Set up a Google Cloud project\n",
"\n",
"    Go to the Google Cloud Platform (GCP) Console\n",
"\n",
"    Create a new project\n",
"\n",
"2. Enable the Gmail API for that project\n",
"\n",
"    Select the created project and go to the \"APIs & services\" page\n",
"\n",
"    Click the \"+ Enable APIs and services\" button, search for \"Gmail API\", and enable it\n",
"\n",
"3. Go to \"OAuth Consent Screen\" and configure it:\n",
"\n",
"    Choose External and fill in the app name, developer email, etc.\n",
"\n",
"4. Create OAuth credentials\n",
"\n",
"    Go to \"APIs & Services\" > \"Credentials\"\n",
"\n",
"    Click \"+ Create Credentials\" > \"OAuth client ID\"\n",
"\n",
"    Choose Desktop App\n",
"\n",
"    Download the generated credentials.json\n",
"\n",
"    Sometimes GCP redirects you to \"Google Auth Platform\" > \"Clients\"; in that case, click \"+ Create client\" there to create the OAuth credentials\n",
"\n",
"5. Add test users for Gmail API OAuth access\n",
"\n",
"    Go to \"APIs & Services\" > \"OAuth consent screen\" > \"Audience\" > \"Test Users\"\n",
"\n",
"    Add the email account from which you want to extract email content.\n",
"\n",
"6. Create a 'credentials' folder to store the Gmail credentials and user tokens. The code below reads the downloaded client file from credentials/gmail_credentials.json"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc86bec0-bda8-4e9e-9c85-423179a99981",
"metadata": {},
"outputs": [],
"source": [
"# !pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4270e52e-378c-4127-bd52-1d082e9834e0",
"metadata": {},
"outputs": [],
"source": [
"from __future__ import print_function\n",
"import os.path\n",
"import base64\n",
"import re\n",
"from email import message_from_bytes\n",
"from google.oauth2.credentials import Credentials\n",
"from google_auth_oauthlib.flow import InstalledAppFlow\n",
"from googleapiclient.discovery import build\n",
"\n",
"# If modifying these SCOPES, delete the token.json\n",
"SCOPES = ['https://www.googleapis.com/auth/gmail.readonly']\n",
"PORT = 18000\n",
"\n",
"def main():\n",
"    creds = None\n",
"    # token.json stores the user's access and refresh tokens\n",
"    if os.path.exists('token.json'):\n",
"        creds = Credentials.from_authorized_user_file('token.json', SCOPES)\n",
"    else:\n",
"        # First run: open the browser consent flow and cache the resulting token\n",
"        flow = InstalledAppFlow.from_client_secrets_file('credentials/gmail_credentials.json', SCOPES)\n",
"        creds = flow.run_local_server(port=PORT)\n",
"        with open('token.json', 'w') as token:\n",
"            token.write(creds.to_json())\n",
"\n",
"    service = build('gmail', 'v1', credentials=creds)\n",
"\n",
"    # Get the latest message\n",
"    results = service.users().messages().list(userId='me', maxResults=1).execute()\n",
"    messages = results.get('messages', [])\n",
"\n",
"    if not messages:\n",
"        print(\"No messages found.\")\n",
"        return\n",
"\n",
"    # format='raw' returns the whole RFC 822 message as a base64url string\n",
"    msg = service.users().messages().get(userId='me', id=messages[0]['id'], format='raw').execute()\n",
"    raw_msg = base64.urlsafe_b64decode(msg['raw'].encode('ASCII'))\n",
"    email_message = message_from_bytes(raw_msg)\n",
"\n",
"    subject = email_message['Subject']\n",
"    print(\"Subject:\", subject)\n",
"\n",
"    # Extract text/plain body, decoding with the part's declared charset\n",
"    for part in email_message.walk():\n",
"        if part.get_content_type() == 'text/plain':\n",
"            charset = part.get_content_charset() or 'utf-8'\n",
"            print(\"Body:\")\n",
"            print(part.get_payload(decode=True).decode(charset, errors='replace'))\n",
"\n",
"if __name__ == '__main__':\n",
"    main()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ff68e06-3cfb-48ae-9dad-fa431d0d548a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
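
The decoding step in the Gmail notebook above can be tried without any Google credentials. The following is a minimal sketch, not the notebook's own code: `extract_plain_text` is a hypothetical helper, and the sample message is hand-built to stand in for the base64url `raw` field that the notebook reads from the API response. It decodes each `text/plain` part with the charset the part declares, falling back to UTF-8.

```python
import base64
from email import message_from_bytes
from email.message import EmailMessage


def extract_plain_text(raw_b64url: str) -> tuple[str, str]:
    """Return (subject, joined text/plain parts) from a base64url-encoded message."""
    raw = base64.urlsafe_b64decode(raw_b64url.encode("ascii"))
    message = message_from_bytes(raw)
    parts = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Use the declared charset; assume UTF-8 when none is given.
        charset = part.get_content_charset() or "utf-8"
        parts.append(payload.decode(charset, errors="replace"))
    return message["Subject"], "\n".join(parts)


# Hypothetical sample standing in for msg['raw'] from the Gmail API.
sample = EmailMessage()
sample["Subject"] = "Weekly notes"
sample.set_content("Café meeting at 10:00", charset="iso-8859-1", cte="quoted-printable")
sample_raw = base64.urlsafe_b64encode(sample.as_bytes()).decode("ascii")

subject, body = extract_plain_text(sample_raw)
print(subject)
print(body)
```

The sample deliberately uses ISO-8859-1 so that a hard-coded `.decode('utf-8')` would fail on it, which is the case the charset lookup is meant to cover.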


@@ -0,0 +1,294 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "35177005-196a-48b3-bf92-fa37d84693f4",
"metadata": {},
"source": [
"# Google Workspace API Credential Guide"
]
},
{
"cell_type": "markdown",
"id": "7bcad9ee-cd11-4b12-834d-9f1ddcefb190",
"metadata": {},
"source": [
"Use the Google Drive API to Read Files in Google Workspace\n",
"1. Set up a Google Cloud project\n",
"\n",
"    Go to the Google Cloud Platform (GCP) Console\n",
"\n",
"    Create a new project\n",
"\n",
"2. Enable the Google Workspace APIs for that project\n",
"\n",
"    Select the created project and go to the \"APIs & services\" page\n",
"\n",
"    Click the \"+ Enable APIs and services\" button and enable these APIs: Google Drive API, Google Docs API, Google Sheets API, and Google Slides API\n",
"\n",
"3. Go to \"OAuth Consent Screen\" and configure it:\n",
"\n",
"    Choose External and fill in the app name, developer email, etc.\n",
"\n",
"4. Create OAuth credentials\n",
"\n",
"    Go to \"APIs & Services\" > \"Credentials\"\n",
"\n",
"    Click \"+ Create Credentials\" > \"OAuth client ID\"\n",
"\n",
"    Choose Desktop App\n",
"\n",
"    Download the generated credentials.json\n",
"\n",
"    Sometimes GCP redirects you to \"Google Auth Platform\" > \"Clients\"; in that case, click \"+ Create client\" there to create the OAuth credentials\n",
"\n",
"5. Add test users for Google Workspace API OAuth access\n",
"\n",
"    Go to \"APIs & Services\" > \"OAuth consent screen\" > \"Audience\" > \"Test Users\"\n",
"\n",
"    Add the Google account whose files you want to read.\n",
"\n",
"6. Create a 'credentials' folder to store the Google Workspace credentials and user tokens. The code below reads the downloaded client file from credentials/google_drive_workspace_credentials.json"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc86bec0-bda8-4e9e-9c85-423179a99981",
"metadata": {},
"outputs": [],
"source": [
"# !pip install PyPDF2\n",
"# !pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4270e52e-378c-4127-bd52-1d082e9834e0",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ff68e06-3cfb-48ae-9dad-fa431d0d548a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "69c20f2d-2f49-408c-8700-f12d6745efd3",
"metadata": {},
"outputs": [],
"source": [
"from google_auth_oauthlib.flow import InstalledAppFlow\n",
"from googleapiclient.discovery import build\n",
"from google.oauth2.credentials import Credentials\n",
"from googleapiclient.http import MediaIoBaseDownload\n",
"import os\n",
"\n",
"import io\n",
"from PyPDF2 import PdfReader\n",
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.schema import Document\n",
"\n",
"GOOGLE_WORKSPACE_SCOPES = [\"https://www.googleapis.com/auth/drive.readonly\",\n",
" 'https://www.googleapis.com/auth/documents.readonly',\n",
" 'https://www.googleapis.com/auth/spreadsheets.readonly',\n",
" 'https://www.googleapis.com/auth/presentations.readonly'\n",
" ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7164903b-be81-46b2-8c04-886397599c27",
"metadata": {},
"outputs": [],
"source": [
"def extract_google_doc(docs_service, file_id):\n",
"    doc = docs_service.documents().get(documentId=file_id).execute()\n",
"    content = \"\"\n",
"    # Concatenate every text run of every paragraph in the document body\n",
"    for elem in doc.get(\"body\", {}).get(\"content\", []):\n",
"        if \"paragraph\" in elem:\n",
"            for run in elem[\"paragraph\"][\"elements\"]:\n",
"                content += run.get(\"textRun\", {}).get(\"content\", \"\")\n",
"    return content.strip()\n",
"\n",
"def extract_google_sheet(service, file_id):\n",
"    # Get spreadsheet metadata\n",
"    spreadsheet = service.spreadsheets().get(spreadsheetId=file_id).execute()\n",
"    all_text = \"\"\n",
"\n",
"    # Loop through each sheet\n",
"    for sheet in spreadsheet.get(\"sheets\", []):\n",
"        title = sheet[\"properties\"][\"title\"]\n",
"        result = service.spreadsheets().values().get(\n",
"            spreadsheetId=file_id,\n",
"            range=title\n",
"        ).execute()\n",
"\n",
"        values = result.get(\"values\", [])\n",
"        sheet_text = f\"### Sheet: {title} ###\\n\"\n",
"        sheet_text += \"\\n\".join([\", \".join(row) for row in values])\n",
"        all_text += sheet_text + \"\\n\\n\"\n",
"\n",
"    return all_text.strip()\n",
"\n",
"\n",
"def extract_google_slide(slides_service, file_id):\n",
"    pres = slides_service.presentations().get(presentationId=file_id).execute()\n",
"    text = \"\"\n",
"    for slide in pres.get(\"slides\", []):\n",
"        for element in slide.get(\"pageElements\", []):\n",
"            shape = element.get(\"shape\")\n",
"            if shape:\n",
"                for p in shape.get(\"text\", {}).get(\"textElements\", []):\n",
"                    if \"textRun\" in p:\n",
"                        text += p[\"textRun\"][\"content\"]\n",
"    return text.strip()\n",
"\n",
"def extract_pdf_from_drive(drive_service, file_id, filename='downloaded.pdf'):\n",
"    # Download the PDF into memory, then extract its text page by page\n",
"    request = drive_service.files().get_media(fileId=file_id)\n",
"    fh = io.BytesIO()\n",
"    downloader = MediaIoBaseDownload(fh, request)\n",
"    done = False\n",
"    while not done:\n",
"        _, done = downloader.next_chunk()\n",
"    fh.seek(0)\n",
"    reader = PdfReader(fh)\n",
"    # Extract each page once and skip pages without text\n",
"    page_texts = (page.extract_text() for page in reader.pages)\n",
"    return \"\\n\".join(text for text in page_texts if text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f2edc68-f9f8-4cba-810e-159bea4fe4ac",
"metadata": {},
"outputs": [],
"source": [
"def get_creds():\n",
"    # token.json is tied to GOOGLE_WORKSPACE_SCOPES; delete it if it was created with other scopes\n",
"    if os.path.exists(\"token.json\"):\n",
"        creds = Credentials.from_authorized_user_file(\"token.json\", GOOGLE_WORKSPACE_SCOPES)\n",
"    else:\n",
"        flow = InstalledAppFlow.from_client_secrets_file(\"credentials/google_drive_workspace_credentials.json\", GOOGLE_WORKSPACE_SCOPES)\n",
"        creds = flow.run_local_server(port=0)\n",
"        with open(\"token.json\", \"w\") as token:\n",
"            token.write(creds.to_json())\n",
"    return creds\n",
"\n",
"\n",
"def get_folder_id_by_name(drive_service, folder_name):\n",
"    query = f\"mimeType='application/vnd.google-apps.folder' and name='{folder_name}' and trashed=false\"\n",
"    results = drive_service.files().list(\n",
"        q=query,\n",
"        fields=\"files(id, name)\",\n",
"        pageSize=1\n",
"    ).execute()\n",
"\n",
"    folders = results.get(\"files\", [])\n",
"    if not folders:\n",
"        raise ValueError(f\"❌ Folder named '{folder_name}' not found.\")\n",
"    return folders[0]['id']\n",
"\n",
"\n",
"def extract_docs_from_google_workspace(folder_name):\n",
"    info = \"\"\n",
"\n",
"    creds = get_creds()\n",
"\n",
"    # Map each supported MIME type to its extractor; the services are built below\n",
"    file_types = {\n",
"        'application/vnd.google-apps.document': lambda fid: extract_google_doc(docs_service, fid),\n",
"        'application/vnd.google-apps.spreadsheet': lambda fid: extract_google_sheet(sheets_service, fid),\n",
"        'application/vnd.google-apps.presentation': lambda fid: extract_google_slide(slides_service, fid),\n",
"        'application/pdf': lambda fid: extract_pdf_from_drive(drive_service, fid),\n",
"    }\n",
"\n",
"    drive_service = build(\"drive\", \"v3\", credentials=creds)\n",
"    docs_service = build('docs', 'v1', credentials=creds)\n",
"    sheets_service = build('sheets', 'v4', credentials=creds)\n",
"    slides_service = build('slides', 'v1', credentials=creds)\n",
"\n",
"    folder_id = get_folder_id_by_name(drive_service, folder_name)\n",
"    info += f\"Collecting files from folder: {folder_name}\\n\"\n",
"\n",
"    query = (\n",
"        f\"'{folder_id}' in parents and (\"\n",
"        'mimeType=\"application/vnd.google-apps.document\" or '\n",
"        'mimeType=\"application/vnd.google-apps.spreadsheet\" or '\n",
"        'mimeType=\"application/vnd.google-apps.presentation\" or '\n",
"        'mimeType=\"application/pdf\")'\n",
"    )\n",
"\n",
"    # Only the first page of results (up to 20 files) is read\n",
"    results = drive_service.files().list(\n",
"        q=query,\n",
"        fields=\"files(id, name, mimeType)\",\n",
"        pageSize=20\n",
"    ).execute()\n",
"\n",
"    docs = []\n",
"    summary_info = {\n",
"        'application/vnd.google-apps.document': {'file_type': 'Google Doc', 'count': 0},\n",
"        'application/vnd.google-apps.spreadsheet': {'file_type': 'Google Sheet', 'count': 0},\n",
"        'application/vnd.google-apps.presentation': {'file_type': 'Google Slide', 'count': 0},\n",
"        'application/pdf': {'file_type': 'PDF', 'count': 0}\n",
"    }\n",
"    for file in results.get(\"files\", []):\n",
"        extractor = file_types.get(file['mimeType'])\n",
"        if extractor:\n",
"            try:\n",
"                content = extractor(file[\"id\"])\n",
"                if content:\n",
"                    docs.append(Document(page_content=content, metadata={\"source\": file[\"name\"]}))\n",
"                    summary_info[file['mimeType']]['count'] += 1\n",
"            except Exception as e:\n",
"                print(f\"❌ Error processing {file['name']}: {e}\")\n",
"\n",
"    total = 0\n",
"    for file_type, element in summary_info.items():\n",
"        total += element['count']\n",
"        info += f\"Found {element['count']} {element['file_type']} files\\n\"\n",
"    info += f\"Total documents loaded: {total}\"\n",
"    return docs, info"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a9da5c9-415c-4856-973a-627a1790f38d",
"metadata": {},
"outputs": [],
"source": [
"docs, info = extract_docs_from_google_workspace(\"google_workspace_knowledge_base\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
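
The text-flattening logic in the Workspace notebook above can be checked offline against plain dictionaries. This is a rough sketch under stated assumptions: `flatten_doc_body` and `flatten_sheet_values` are hypothetical stand-ins for the notebook's `extract_google_doc` and `extract_google_sheet`, and the sample data is hand-written to contain only the fields those functions read, not a real API response.

```python
def flatten_doc_body(doc: dict) -> str:
    """Concatenate the text runs of every paragraph in a Docs-style body."""
    chunks = []
    for elem in doc.get("body", {}).get("content", []):
        paragraph = elem.get("paragraph")
        if not paragraph:
            continue  # skip structural elements that are not paragraphs
        for run in paragraph.get("elements", []):
            chunks.append(run.get("textRun", {}).get("content", ""))
    return "".join(chunks).strip()


def flatten_sheet_values(title: str, values: list) -> str:
    """Render one sheet as a titled block of comma-separated rows."""
    # str() guards against non-string cells; an assumption beyond the notebook.
    rows = [", ".join(str(cell) for cell in row) for row in values]
    return f"### Sheet: {title} ###\n" + "\n".join(rows)


# Hypothetical sample shaped like the fields the notebook accesses.
sample_doc = {
    "body": {
        "content": [
            {"sectionBreak": {}},
            {"paragraph": {"elements": [{"textRun": {"content": "Title\n"}}]}},
            {"paragraph": {"elements": [
                {"textRun": {"content": "First "}},
                {"textRun": {"content": "paragraph.\n"}},
            ]}},
        ]
    }
}

doc_text = flatten_doc_body(sample_doc)
sheet_text = flatten_sheet_values("Q1", [["item", "count"], ["apples", 3]])
print(doc_text)
print(sheet_text)
```

Collecting chunks in a list and joining once avoids repeated string concatenation, which matters little for small files but keeps the helper predictable on long documents.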


@@ -0,0 +1,178 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "35177005-196a-48b3-bf92-fa37d84693f4",
"metadata": {},
"source": [
"# Outlook API Credential Guide"
]
},
{
"cell_type": "markdown",
"id": "7bcad9ee-cd11-4b12-834d-9f1ddcefb190",
"metadata": {},
"source": [
"Extract Outlook Emails via Microsoft Graph API\n",
"\n",
"1. Register your app on the Azure Portal\n",
"\n",
"    Go to Azure Portal > Azure Active Directory > App registrations\n",
"\n",
"    Click “New registration”\n",
"\n",
"    Choose Mobile/Desktop app\n",
"\n",
"    After creation, note the Application (client) ID. The code below reads it from the AZURE_CLIENT_ID environment variable\n",
"\n",
"2. API permissions\n",
"\n",
"    Go to the API permissions tab\n",
"\n",
"    Click Add permission > Microsoft Graph > Delegated\n",
"\n",
"    Choose: Mail.Read\n",
"\n",
"    Click Grant admin consent\n",
"\n",
"3. Allow public client flows\n",
"\n",
"    Navigate to: Azure Active Directory > App registrations > Your App\n",
"\n",
"    Go to the Authentication tab\n",
"\n",
"    Under \"Advanced settings\" → \"Allow public client flows\", set to \"Yes\"\n",
"\n",
"    Save changes"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc86bec0-bda8-4e9e-9c85-423179a99981",
"metadata": {},
"outputs": [],
"source": [
"!pip install msal requests python-dotenv"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4270e52e-378c-4127-bd52-1d082e9834e0",
"metadata": {},
"outputs": [],
"source": [
"from msal import PublicClientApplication\n",
"import os\n",
"from dotenv import load_dotenv\n",
"import requests"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ff68e06-3cfb-48ae-9dad-fa431d0d548a",
"metadata": {},
"outputs": [],
"source": [
"load_dotenv()\n",
"\n",
"CLIENT_ID = os.getenv(\"AZURE_CLIENT_ID\")\n",
"AUTHORITY = \"https://login.microsoftonline.com/common\"\n",
"SCOPES = [\"Mail.Read\"]\n",
"\n",
"app = PublicClientApplication(CLIENT_ID, authority=AUTHORITY)\n",
"\n",
"# Device code flow: the user opens the URL and enters the code printed below\n",
"flow = app.initiate_device_flow(scopes=SCOPES)\n",
"if \"user_code\" not in flow:\n",
"    raise RuntimeError(f\"Failed to start device flow: {flow}\")\n",
"print(\"Go to:\", flow[\"verification_uri\"])\n",
"print(\"Enter code:\", flow[\"user_code\"])\n",
"\n",
"# Blocks until the user completes sign-in or the code expires\n",
"result = app.acquire_token_by_device_flow(flow)\n",
"\n",
"if \"access_token\" not in result:\n",
"    raise RuntimeError(f\"Failed to authenticate: {result.get('error_description', result)}\")\n",
"\n",
"access_token = result[\"access_token\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c7f97da-68cc-4923-b280-1ddf7e5b7aa3",
"metadata": {},
"outputs": [],
"source": [
"print(\"Granted scopes:\", result.get(\"scope\"))\n",
"\n",
"headers = {\n",
"    \"Authorization\": f\"Bearer {access_token}\",\n",
"    \"Prefer\": \"outlook.body-content-type='text'\"\n",
"}\n",
"\n",
"query = (\n",
"    \"https://graph.microsoft.com/v1.0/me/messages\"\n",
"    \"?$top=1\"\n",
"    \"&$select=id,subject,receivedDateTime,body\"\n",
")\n",
"\n",
"all_emails = []\n",
"\n",
"# Follow @odata.nextLink until no pages remain; with $top=1 each request returns one message\n",
"while query:\n",
"    response = requests.get(query, headers=headers)\n",
"\n",
"    if not response.ok:\n",
"        print(f\"❌ HTTP {response.status_code}: {response.text}\")\n",
"        break\n",
"\n",
"    try:\n",
"        res = response.json()\n",
"    except ValueError:\n",
"        print(\"❌ Invalid JSON:\", response.text)\n",
"        break\n",
"\n",
"    for msg in res.get(\"value\", []):\n",
"        all_emails.append({\n",
"            \"id\": msg.get(\"id\"),\n",
"            \"subject\": msg.get(\"subject\", \"\"),\n",
"            \"body\": msg.get(\"body\", {}).get(\"content\", \"\"),\n",
"            \"date\": msg.get(\"receivedDateTime\", \"\")\n",
"        })\n",
"\n",
"    query = res.get(\"@odata.nextLink\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e29493b6-0a9e-4106-93c9-e58ff6aa0f97",
"metadata": {},
"outputs": [],
"source": [
"all_emails"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
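
The paging loop in the Outlook notebook above follows `@odata.nextLink` until the mailbox is exhausted, which with `$top=1` means one request per message. One possible way to bound that is sketched below. It is an illustration, not the notebook's code: `iter_messages`, `fetch_page`, and `max_messages` are hypothetical names, and the pages are fake dictionaries rather than real Graph responses.

```python
from typing import Callable, Iterator, Optional


def iter_messages(
    first_url: str,
    fetch_page: Callable[[str], dict],
    max_messages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield simplified messages page by page, stopping at max_messages."""
    if max_messages is not None and max_messages <= 0:
        return
    url = first_url
    yielded = 0
    while url:
        page = fetch_page(url)
        for msg in page.get("value", []):
            yield {
                "id": msg.get("id"),
                "subject": msg.get("subject", ""),
                "body": msg.get("body", {}).get("content", ""),
                "date": msg.get("receivedDateTime", ""),
            }
            yielded += 1
            if max_messages is not None and yielded >= max_messages:
                return  # stop before requesting another page
        url = page.get("@odata.nextLink")


# Fake pages keyed by URL, standing in for HTTP responses.
fake_pages = {
    "page1": {"value": [{"id": "a", "subject": "one"}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"id": "b", "subject": "two"}], "@odata.nextLink": "page3"},
    "page3": {"value": [{"id": "c", "subject": "three"}]},
}
requested = []


def fake_fetch(url: str) -> dict:
    requested.append(url)
    return fake_pages[url]


emails = list(iter_messages("page1", fake_fetch, max_messages=2))
print([e["subject"] for e in emails])
print(requested)
```

In the notebook, `fetch_page` could plausibly be a small wrapper such as `lambda url: requests.get(url, headers=headers).json()`, with the status and JSON checks from the original loop moved inside it.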


@@ -0,0 +1,3 @@
// delete key
{"installed":{"client_id":"196620306719-vr5i30l44mqmkmnp7j96iavjfqsfl41f.apps.googleusercontent.com","project_id":"llms-personal-knowledge","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","redirect_uris":["http://localhost"]}}


@@ -0,0 +1,3 @@
// delete key
{"installed":{"client_id":"196620306719-7qvdhd86sau3ngmrrlcb1314us9nuli4.apps.googleusercontent.com","project_id":"llms-personal-knowledge","auth_uri":"https://accounts.google.com/o/oauth2/auth","token_uri":"https://oauth2.googleapis.com/token","auth_provider_x509_cert_url":"https://www.googleapis.com/oauth2/v1/certs","redirect_uris":["http://localhost"]}}

Binary file not shown (image, 11 KiB).


@@ -0,0 +1,9 @@
<!DOCTYPE html>
<html>
<head>
<title>My First Web Page</title>
</head>
<body>
<h1>Zephyr won ZHTML award</h1>
</body>
</html>