reflector/server/notebooks/Viz-experiments.ipynb
2023-07-26 15:13:46 +07:00

{
"cells": [
{
"cell_type": "markdown",
"id": "a5ace857",
"metadata": {},
"source": [
"# Visualization Experiments"
]
},
{
"cell_type": "markdown",
"id": "9bfc569d",
"metadata": {},
"source": [
"Let's load the data artefacts into local memory. The pipeline automatically uploads these files to the pre-configured S3 bucket, so they have to be downloaded from there."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "edc584b2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"\u001b[32m2023-06-23 15:01:36.558\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mfile_utilities\u001b[0m:\u001b[36mdownload_files\u001b[0m:\u001b[36m36\u001b[0m - \u001b[1mDownloading file df_06-23-2023_06:10:03.pkl\u001b[0m\n",
"\u001b[32m2023-06-23 15:01:38.450\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mfile_utilities\u001b[0m:\u001b[36mdownload_files\u001b[0m:\u001b[36m36\u001b[0m - \u001b[1mDownloading file mappings_06-23-2023_06:10:03.pkl\u001b[0m\n",
"\u001b[32m2023-06-23 15:01:39.179\u001b[0m | \u001b[1mINFO \u001b[0m | \u001b[36mfile_utilities\u001b[0m:\u001b[36mdownload_files\u001b[0m:\u001b[36m36\u001b[0m - \u001b[1mDownloading file transcript_with_timestamp_06-23-2023_06:10:03.txt\u001b[0m\n"
]
}
],
"source": [
"from file_utilities import download_files\n",
"import pickle\n",
"\n",
"# Download files from the S3 bucket. Pass a list of names to download several files at once, e.g.\n",
"# files_to_download = [\"df.pkl\",\n",
"#                      \"mappings.pkl\",\n",
"#                      \"transcript_timestamps.txt\"]\n",
"\n",
"# Timestamp suffix shared by the artefact file names\n",
"timestamp = \"06-23-2023_06:10:03\"\n",
"\n",
"# File names of the DataFrame, the mappings and the timestamped transcript\n",
"df_file_name = \"df_\" + timestamp + \".pkl\"\n",
"mappings_file_name = \"mappings_\" + timestamp + \".pkl\"\n",
"transcript_file_name = \"transcript_with_timestamp_\" + timestamp + \".txt\"\n",
"\n",
"\n",
"files_to_download = [df_file_name,\n",
"                     mappings_file_name,\n",
"                     transcript_file_name]\n",
"download_files(files_to_download)"
]
},
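{
"cell_type": "code",
"execution_count": 19,
"id": "5027fe25",
"metadata": {},
"outputs": [],
"source": [
"# Download the NLTK data and load the spaCy model.\n",
"# The spaCy model has to be installed once beforehand: python -m spacy download en_core_web_md\n",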
"import nltk\n",
"import spacy\n",
"from nltk.corpus import stopwords\n",
"\n",
"nltk.download('punkt', quiet=True)\n",
"nltk.download('stopwords', quiet=True)\n",
"spaCy_model = \"en_core_web_md\"\n",
"nlp = spacy.load(spaCy_model)\n",
"spacy_stopwords = nlp.Defaults.stop_words\n",
"STOPWORDS = set(spacy_stopwords).union(set(stopwords.words('english')))"
]
},
{
"cell_type": "markdown",
"id": "8abc435d",
"metadata": {},
"source": [
"## Example template 1"
]
},
{
"cell_type": "markdown",
"id": "2b1a4834",
"metadata": {},
"source": [
"## Scatter plot of transcription with Topic modelling"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "55a75dcf",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>timestamp</th>\n",
" <th>text</th>\n",
" <th>ts_to_topic_mapping_top_1</th>\n",
" <th>ts_to_topic_mapping_top_2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>(0.0, 12.36)</td>\n",
" <td>this . Okay , yeah , so it looks like I am re...</td>\n",
" <td>TAM</td>\n",
" <td>Founders</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>(12.36, 25.76)</td>\n",
" <td>because Goku needs that for the audio plus the...</td>\n",
" <td>Founders</td>\n",
" <td>TAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>(25.76, 30.32)</td>\n",
" <td>the rest of the team did . So I want to just ...</td>\n",
" <td>Founders</td>\n",
" <td>AGENDA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>(30.32, 35.52)</td>\n",
" <td>then we can ask questions or how do you want t...</td>\n",
" <td>TAM</td>\n",
" <td>Founders</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>(35.52, 49.56)</td>\n",
" <td>introduction . So what I , it all started wit...</td>\n",
" <td>Founders</td>\n",
" <td>TAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>554</th>\n",
" <td>(3323.0, 3326.56)</td>\n",
" <td>It 's crazy . But definitely with the</td>\n",
" <td>Founders</td>\n",
" <td>TAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>555</th>\n",
" <td>(3326.56, 3332.24)</td>\n",
" <td>local models , we have n't found a way to work...</td>\n",
" <td>Founders</td>\n",
" <td>TAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>556</th>\n",
" <td>(3332.24, 3337.2)</td>\n",
" <td>if you 'd have 90 minutes of audio to transfer...</td>\n",
" <td>TAM</td>\n",
" <td>Founders</td>\n",
" </tr>\n",
" <tr>\n",
" <th>557</th>\n",
" <td>(3338.32, 3344.4)</td>\n",
" <td>We actually have a preprocessor to resolve wha...</td>\n",
" <td>Founders</td>\n",
" <td>TAM</td>\n",
" </tr>\n",
" <tr>\n",
" <th>558</th>\n",
" <td>(3344.4, None)</td>\n",
" <td>there 's still some struggles on the local mod...</td>\n",
" <td>Founders</td>\n",
" <td>TAM</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>559 rows × 4 columns</p>\n",
"</div>"
],
"text/plain": [
" timestamp text \\\n",
"0 (0.0, 12.36) this . Okay , yeah , so it looks like I am re... \n",
"1 (12.36, 25.76) because Goku needs that for the audio plus the... \n",
"2 (25.76, 30.32) the rest of the team did . So I want to just ... \n",
"3 (30.32, 35.52) then we can ask questions or how do you want t... \n",
"4 (35.52, 49.56) introduction . So what I , it all started wit... \n",
".. ... ... \n",
"554 (3323.0, 3326.56) It 's crazy . But definitely with the \n",
"555 (3326.56, 3332.24) local models , we have n't found a way to work... \n",
"556 (3332.24, 3337.2) if you 'd have 90 minutes of audio to transfer... \n",
"557 (3338.32, 3344.4) We actually have a preprocessor to resolve wha... \n",
"558 (3344.4, None) there 's still some struggles on the local mod... \n",
"\n",
" ts_to_topic_mapping_top_1 ts_to_topic_mapping_top_2 \n",
"0 TAM Founders \n",
"1 Founders TAM \n",
"2 Founders AGENDA \n",
"3 TAM Founders \n",
"4 Founders TAM \n",
".. ... ... \n",
"554 Founders TAM \n",
"555 Founders TAM \n",
"556 TAM Founders \n",
"557 Founders TAM \n",
"558 Founders TAM \n",
"\n",
"[559 rows x 4 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_pickle(df_file_name)\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "a795137e",
"metadata": {},
"source": [
"In the call at the bottom of the next cell, set `cat_1` and `cat_1_name` (scattertext's \"category\" and \"category_name\") to one agenda topic and change `cat_2_name` (\"not_category_name\"), then re-run the cell to see different plots."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "43e01074",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import scattertext as st\n",
"\n",
"df = pd.read_pickle(df_file_name)\n",
"\n",
"def plot_topic_modelling_and_word_to_sentence_search(df, cat_1, cat_1_name, cat_2_name):\n",
"    # Parse each transcript chunk into sentences for scattertext\n",
"    df = df.assign(parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences))\n",
"\n",
"    # Build a unigram corpus keyed on the top-1 topic, without stopwords\n",
"    corpus = st.CorpusFromParsedDocuments(\n",
"        df, category_col='ts_to_topic_mapping_top_1', parsed_col='parse'\n",
"    ).build().get_unigram_corpus().remove_terms(STOPWORDS, ignore_absences=True).compact(st.AssociationCompactor(2000))\n",
"\n",
"    html = st.produce_scattertext_explorer(\n",
"        corpus,\n",
"        category=cat_1, category_name=cat_1_name, not_category_name=cat_2_name,\n",
"        minimum_term_frequency=0, pmi_threshold_coefficient=0,\n",
"        width_in_pixels=1000,\n",
"        transform=st.Scalers.dense_rank\n",
"    )\n",
"    # Write the interactive plot to an HTML file and close the file handle\n",
"    with open('./new_viz_' + timestamp + '.html', 'w', encoding='utf-8') as f:\n",
"        f.write(html)\n",
"\n",
"plot_topic_modelling_and_word_to_sentence_search(df,\n",
" cat_1=\"Founders\",\n",
" cat_1_name=\"Founders\",\n",
" cat_2_name=\"TAM\")\n",
"\n",
"# once you are done, check the generated HTML file\n"
]
},
{
"cell_type": "markdown",
"id": "e9994c87",
"metadata": {},
"source": [
"## Example template 2"
]
},
{
"cell_type": "markdown",
"id": "35c4f7fd",
"metadata": {},
"source": [
"## Time driven Insights"
]
},
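{
"cell_type": "code",
"execution_count": 26,
"id": "7cdcd66f",
"metadata": {},
"outputs": [],
"source": [
"# Load the four mapping dictionaries; the context manager closes the file afterwards\n",
"with open(mappings_file_name, \"rb\") as f:\n",
"    mappings = pickle.load(f)\n",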
"timestamp_to_topic_first_match = mappings[0]\n",
"timestamp_to_topic_second_match = mappings[1]\n",
"topic_to_timestamp_first_match = mappings[2]\n",
"topic_to_timestamp_second_match = mappings[3]"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "11221022",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x576 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x576 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import collections\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_time_spent_for_topic(mapping, order):\n",
"    # Sum the duration of every (start, end) segment per topic\n",
"    topic_times = collections.defaultdict(int)\n",
"    for key in mapping.keys():\n",
"        # Skip open-ended segments such as (3344.4, None)\n",
"        if key[1] is None or key[0] is None:\n",
"            continue\n",
"        duration = key[1] - key[0]\n",
"        topic_times[mapping[key]] += duration\n",
"\n",
"    keys = list(topic_times.keys())\n",
"    vals = [int(topic_times[k]) for k in keys]\n",
"    plt.figure(figsize=(10,8))\n",
"    sns.barplot(x=vals, y=keys).set(title='Time spent on ' + order + ' matched topic')\n",
"\n",
" \n",
"\n",
"plot_time_spent_for_topic(timestamp_to_topic_first_match, \"first\")\n",
"plot_time_spent_for_topic(timestamp_to_topic_second_match, \"second\")"
]
},
{
"cell_type": "markdown",
"id": "e9ae6e25",
"metadata": {},
"source": [
"## Example template 3"
]
},
{
"cell_type": "markdown",
"id": "69be38ce",
"metadata": {},
"source": [
"## Enhanced search for timelines"
]
},
{
"cell_type": "markdown",
"id": "f8a47348",
"metadata": {},
"source": [
"We can already search for a particular word in the interactive HTML document from example 1 to see a list of all transcribed sentences containing that word (in the context of the chosen topic).\n",
"\n",
"We can also retrieve all the segments (timestamps) in the transcription related to a particular topic, in order to:\n",
"\n",
"i) Segregate all content on a particular topic of importance.\n",
"\n",
"ii) Perform selective summarization of the segregated content to make productive follow-ups. (Maybe use a model to extract action items and announcements from the transcription or selective summary?)\n",
"\n",
"iii) Use the timestamps to highlight video / audio / transcription segments.\n",
"\n",
"iv) Jump to a desired segment of video / audio / transcription."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "69d814c9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Timelines where Founders was covered : \n"
]
},
{
"data": {
"text/plain": [
"[(12.36, 25.76),\n",
" (25.76, 30.32),\n",
" (35.52, 49.56),\n",
" (76.64, 87.76),\n",
" (87.76, 96.08),\n",
" (104.32, 105.62),\n",
" (107.02, 111.0),\n",
" (119.24, 123.14),\n",
" (125.64, 130.64),\n",
" (130.8, 132.6),\n",
" (142.72, 145.24),\n",
" (152.88, 155.14),\n",
" (155.14, 157.8),\n",
" (157.8, 160.96),\n",
" (171.0, 188.68),\n",
" (188.68, 197.28),\n",
" (202.84, 214.48),\n",
" (214.48, 219.28),\n",
" (219.84, 228.44),\n",
" (228.44, 238.76),\n",
" (246.46, 248.84),\n",
" (248.84, 251.38),\n",
" (256.78, 259.5),\n",
" (259.5, 261.94),\n",
" (270.96, 278.4),\n",
" (278.4, 288.32),\n",
" (291.44, 293.84),\n",
" (336.0, 339.12),\n",
" (351.48, 352.32),\n",
" (362.64, 393.44),\n",
" (402.72, 403.04),\n",
" (427.12, 432.12),\n",
" (438.96, 441.34),\n",
" (448.72, 453.84),\n",
" (453.84, 463.36),\n",
" (463.36, 466.5),\n",
" (466.5, 468.26),\n",
" (471.66, 473.06),\n",
" (482.28, 487.6),\n",
" (515.84, 522.64),\n",
" (529.66, 533.24),\n",
" (533.24, 535.52),\n",
" (540.42, 543.16),\n",
" (544.0, 550.0),\n",
" (556.0, 564.0),\n",
" (564.0, 568.76),\n",
" (568.76, 569.6),\n",
" (573.36, 574.56),\n",
" (577.52, 580.96),\n",
" (580.96, 584.6),\n",
" (584.6, 605.62),\n",
" (607.86, 609.18),\n",
" (619.42, 621.54),\n",
" (621.54, 623.98),\n",
" (623.98, 626.88),\n",
" (630.24, 634.96),\n",
" (634.96, 639.96),\n",
" (641.68, 646.66),\n",
" (659.28, 661.96),\n",
" (666.3, 667.14),\n",
" (670.48, 673.92),\n",
" (673.92, 677.02),\n",
" (677.02, 678.52),\n",
" (678.52, 680.72),\n",
" (680.72, 683.48),\n",
" (683.48, 746.8),\n",
" (758.32, 773.0),\n",
" (797.28, 813.36),\n",
" (813.36, 819.6),\n",
" (820.32, 824.08),\n",
" (821.7, 824.86),\n",
" (828.56, 830.94),\n",
" (831.78, 832.88),\n",
" (832.88, 837.04),\n",
" (837.04, 840.68),\n",
" (840.68, 844.64),\n",
" (854.78, 860.2),\n",
" (860.2, 865.4),\n",
" (865.4, 870.0),\n",
" (870.0, 872.2),\n",
" (877.24, 880.94),\n",
" (880.94, 882.94),\n",
" (882.94, 885.96),\n",
" (889.52, 892.96),\n",
" (896.4, 900.88),\n",
" (900.88, 903.24),\n",
" (908.36, 911.42),\n",
" (911.42, 912.74),\n",
" (915.26, 917.68),\n",
" (921.32, 924.0),\n",
" (924.0, 953.04),\n",
" (982.0, 988.0),\n",
" (988.0, 995.0),\n",
" (995.0, 1009.52),\n",
" (1010.24, 1024.8),\n",
" (1022.36, 1025.5),\n",
" (1033.8, 1036.58),\n",
" (1040.9, 1044.56),\n",
" (1047.44, 1070.4),\n",
" (1070.48, 1076.6),\n",
" (1080.56, 1088.32),\n",
" (1088.32, 1091.08),\n",
" (1108.12, 1109.4),\n",
" (1115.16, 1119.1),\n",
" (1131.44, 1134.4),\n",
" (1134.4, 1135.64),\n",
" (1141.8, 1144.46),\n",
" (1151.52, 1152.88),\n",
" (1155.16, 1156.44),\n",
" (1161.36, 1166.24),\n",
" (1166.24, 1172.32),\n",
" (1198.36, 1203.16),\n",
" (1205.28, 1206.12),\n",
" (1211.94, 1213.8),\n",
" (1216.16, 1219.56),\n",
" (1221.84, 1224.2),\n",
" (1224.2, 1245.0),\n",
" (1254.34, 1256.64),\n",
" (1257.96, 1259.04),\n",
" (1259.04, 1260.76),\n",
" (1260.76, 1269.0),\n",
" (1269.0, 1275.0),\n",
" (1275.0, 1280.0),\n",
" (1290.64, 1293.92),\n",
" (1293.92, 1296.6),\n",
" (1296.6, 1300.16),\n",
" (1300.16, 1303.16),\n",
" (1303.16, 1307.2),\n",
" (1330.72, 1337.44),\n",
" (1394.92, 1401.62),\n",
" (1401.62, 1408.2),\n",
" (1408.2, 1409.56),\n",
" (1413.54, 1417.68),\n",
" (1417.68, 1420.92),\n",
" (1420.92, 1422.56),\n",
" (1422.56, 1424.36),\n",
" (1424.36, 1426.68),\n",
" (1427.52, 1428.86),\n",
" (1428.86, 1429.7),\n",
" (1429.7, 1430.54),\n",
" (1436.84, 1439.84),\n",
" (1455.56, 1456.64),\n",
" (1465.72, 1468.38),\n",
" (1468.38, 1469.54),\n",
" (1476.24, 1481.24),\n",
" (1482.64, 1490.0),\n",
" (1490.0, 1539.0),\n",
" (1541.0, 1547.68),\n",
" (1547.68, 1553.2),\n",
" (1553.2, 1558.96),\n",
" (1563.04, 1566.6),\n",
" (1566.6, 1568.4),\n",
" (1569.48, 1571.1),\n",
" (1571.1, 1574.0),\n",
" (1574.0, 1575.62),\n",
" (1577.2, 1578.98),\n",
" (1581.7, 1583.04),\n",
" (1583.04, 1584.32),\n",
" (1584.32, 1591.0),\n",
" (1596.0, 1606.92),\n",
" (1606.92, 1611.72),\n",
" (1614.04, 1615.88),\n",
" (1618.88, 1619.8),\n",
" (1625.72, 1630.44),\n",
" (1634.04, 1635.04),\n",
" (1646.4, 1648.52),\n",
" (1648.52, 1649.72),\n",
" (1649.72, 1652.04),\n",
" (1652.04, 1655.16),\n",
" (1658.96, 1660.88),\n",
" (1666.12, 1668.68),\n",
" (1672.84, 1674.56),\n",
" (1674.56, 1676.88),\n",
" (1676.88, 1680.44),\n",
" (1682.68, 1688.0),\n",
" (1694.0, 1702.0),\n",
" (1702.0, 1749.0),\n",
" (1769.36, 1772.84),\n",
" (1784.32, 1789.2),\n",
" (1821.24, 1826.8),\n",
" (1826.8, 1833.44),\n",
" (1839.68, 1854.0),\n",
" (1871.0, 1877.0),\n",
" (1877.0, 1883.0),\n",
" (1885.32, 1888.4),\n",
" (1888.4, 1891.36),\n",
" (1893.72, 1896.36),\n",
" (1901.52, 1903.04),\n",
" (1904.96, 1909.94),\n",
" (1915.06, 1918.38),\n",
" (1922.78, 1927.08),\n",
" (1927.08, 1933.0),\n",
" (1933.0, 1939.68),\n",
" (1939.68, 1946.62),\n",
" (1952.86, 1958.66),\n",
" (1958.66, 1961.74),\n",
" (1961.74, 1964.54),\n",
" (1964.54, 1987.12),\n",
" (1995.6, 2000.24),\n",
" (2006.5, 2009.9),\n",
" (2009.9, 2011.14),\n",
" (2016.94, 2017.86),\n",
" (2030.48, 2032.12),\n",
" (2034.52, 2036.64),\n",
" (2036.64, 2039.12),\n",
" (2039.12, 2041.36),\n",
" (2042.24, 2050.0),\n",
" (2050.0, 2070.08),\n",
" (2070.08, 2079.28),\n",
" (2079.28, 2087.44),\n",
" (2087.44, 2092.24),\n",
" (2100.16, 2104.64),\n",
" (2105.0, 2108.24),\n",
" (2112.8, 2115.12),\n",
" (2125.0, 2129.0),\n",
" (2136.0, 2139.0),\n",
" (2139.0, 2141.0),\n",
" (2141.0, 2143.0),\n",
" (2143.0, 2145.1),\n",
" (2145.1, 2148.98),\n",
" (2148.98, 2153.98),\n",
" (2162.38, 2165.68),\n",
" (2165.68, 2169.92),\n",
" (2172.02, 2173.66),\n",
" (2181.18, 2183.28),\n",
" (2183.28, 2209.52),\n",
" (2209.52, 2213.0),\n",
" (2213.0, 2214.44),\n",
" (2214.44, 2216.32),\n",
" (2216.32, 2217.72),\n",
" (2246.8, 2251.44),\n",
" (2292.0, 2307.76),\n",
" (2307.76, 2321.52),\n",
" (2338.0, 2341.0),\n",
" (2343.0, 2365.92),\n",
" (2365.92, 2370.76),\n",
" (2370.76, 2375.12),\n",
" (2378.96, 2382.32),\n",
" (2386.54, 2392.52),\n",
" (2399.18, 2409.92),\n",
" (2415.6, 2428.0),\n",
" (2446.0, 2448.0),\n",
" (2461.0, 2467.0),\n",
" (2471.3, 2475.8),\n",
" (2495.04, 2498.72),\n",
" (2526.72, 2535.76),\n",
" (2536.96, 2542.4),\n",
" (2545.48, 2547.16),\n",
" (2547.16, 2550.28),\n",
" (2550.28, 2554.44),\n",
" (2554.44, 2561.88),\n",
" (2561.88, 2563.64),\n",
" (2563.64, 2590.0),\n",
" (2590.8, 2595.36),\n",
" (2595.92, 2609.0),\n",
" (2609.0, 2612.0),\n",
" (2620.0, 2646.12),\n",
" (2650.58, 2651.92),\n",
" (2706.94, 2708.52),\n",
" (2708.52, 2710.6),\n",
" (2710.6, 2712.64),\n",
" (2715.68, 2719.76),\n",
" (2719.76, 2722.08),\n",
" (2722.08, 2723.6),\n",
" (2723.6, 2725.64),\n",
" (2726.88, 2728.32),\n",
" (2732.6, 2733.44),\n",
" (2739.76, 2741.72),\n",
" (2745.68, 2748.7),\n",
" (2751.14, 2754.6),\n",
" (2754.6, 2755.6),\n",
" (2764.32, 2766.4),\n",
" (2766.4, 2769.08),\n",
" (2774.04, 2776.2),\n",
" (2776.2, 2778.68),\n",
" (2778.68, 2782.06),\n",
" (2782.06, 2784.96),\n",
" (2784.96, 2804.24),\n",
" (2826.0, 2827.0),\n",
" (2838.0, 2853.0),\n",
" (2872.6, 2881.64),\n",
" (2881.64, 2889.2),\n",
" (2889.2, 2913.12),\n",
" (2913.12, 2926.0),\n",
" (2926.0, 2928.0),\n",
" (2928.0, 2933.0),\n",
" (2933.0, 2938.0),\n",
" (2938.0, 2940.0),\n",
" (2944.0, 2945.2),\n",
" (2953.3, 2955.5),\n",
" (2960.18, 2961.84),\n",
" (2968.24, 2975.2),\n",
" (2975.2, 2981.76),\n",
" (2986.0, 2992.78),\n",
" (3005.14, 3013.4),\n",
" (3013.4, 3049.0),\n",
" (3049.0, 3051.0),\n",
" (3051.0, 3053.0),\n",
" (3063.0, 3085.28),\n",
" (3091.2, 3097.04),\n",
" (3123.84, 3172.0),\n",
" (3209.44, 3216.72),\n",
" (3227.0, 3231.0),\n",
" (3231.0, 3234.0),\n",
" (3240.0, 3241.0),\n",
" (3244.56, 3249.92),\n",
" (3255.88, 3257.44),\n",
" (3257.44, 3261.24),\n",
" (3261.24, 3268.4),\n",
" (3268.4, 3275.28),\n",
" (3323.0, 3326.56),\n",
" (3326.56, 3332.24),\n",
" (3338.32, 3344.4),\n",
" (3344.4, None)]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def retrieve_time_segments(topic):\n",
"    # Return every (start, end) segment whose first-match topic is `topic`\n",
"    return topic_to_timestamp_first_match[topic]\n",
"\n",
"search_topic = \"Founders\"\n",
"print(\"Timelines where \" + search_topic + \" was covered : \")\n",
"time_segments_of_interest = retrieve_time_segments(topic=search_topic)\n",
"time_segments_of_interest"
]
},
{
"cell_type": "markdown",
"id": "b587da79",
"metadata": {},
"source": [
"## Selective segregation of content"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "5dc2014f",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import ast\n",
"\n",
"time_segments_of_interest = retrieve_time_segments(\"Founders\")\n",
"\n",
"# The transcript file holds a Python literal (a dict), so parse it with ast.literal_eval\n",
"with open(transcript_file_name, \"r\") as f:\n",
"    ts_transcript = ast.literal_eval(f.read())\n",
"\n",
"# Concatenate the text of every chunk whose timestamp belongs to the chosen topic\n",
"selective_transcribed_content = \"\"\n",
"for chunk in ts_transcript[\"chunks\"]:\n",
"    if chunk[\"timestamp\"] in time_segments_of_interest:\n",
"        selective_transcribed_content += chunk[\"text\"]"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "caeff7f1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"because Goku needs that for the audio plus the transcript plus the timestamps . So cool . Okay , so we can have our discussion as planned about ontology prompt . So JDC , I just wanted to learn about it and I think the rest of the team did . So I want to just start off , maybe you want to give us some context and introduction . So what I , it all started with the demo from Palantir. , as you know , they were able to make this AIP thing where they were such a powerful thing perhaps because they use an ontology . So it means that they have a definition of entities and actions that they can take . So then they control this to an LLM . So I started exploring with that with Jamo . in our hierarchical way using just like texts . And it kind of works . So in the end , I decided that one good pattern operations , mutations and queries over certain entities . So I assume that many of the LLMs have seen a lot of this data and know how to operate . So people from OpenAI and many companies So that 's how I started . So one of the problem domains that we could map to this , it 's CP because it 's some actions that you need to take . So that 's basically the system I end up developing . But that has some challenges . So the first thing is that , let me share the screen . And I can take you over . It 's amazing that Corey , Michal , and John are attending , because I 'm reaching to the point that I would love to have more ideas . Can you see the screen ? to you independently , just very late for him today . So I started by like mocking these kind of entities that could work with Zibi . So we have like employees , candidates . I just took a look at out of the comments that we support right now , reminders , and then put together this GraphQL thing . Yeah . Right now I see the whereby window . Is that what you 're showing ? Oh , sorry . No . Let me see . Maybe I 'm just sharing one , not the whole screen . Okay . I think this will solve it . 
that we have for many purposes , like requesting vacations and some other things . So there are some entities that are employees . I just like mocked some of them . just mocked up . Get reminder . This was the first thing . There is the vacations input , create reminder , and some mutation . So as you can see , this is part of the first test that I did . So let me show you here how this can work . So let 's say in the first day of July , I 'm trying to , sorry , I was in the middle of some page here . Oh , my God . It 's always show up . Yeah , yeah . I 'm refactoring like heavily this , but you see , it does work very well . So wait , let me change the strategy here . Let me see another example . Okay . So let 's say vacations . Because I 've been experimenting a lot . Just want to show you one related with vacation . Well . Let me see if it can work this way . I have many modes of operation . It behaves very good for this kind of . So it creates vacations , it fills all the things here . So this can basically be adapted to any CLI . GraphQL pretty well . command line interface for many things as long as you can describe the things like this . And this is a very flexible pattern . It was really taught by the people that invented this on Facebook . It divides the things by queries and mutations , and I 've tested heavily . What 's the challenge ? So the initial LLMs that we had access to , especially from OpenAI , I 'm giving it . And on the basis of this , I create like a subtree because this thing generates So there are two things reference here to types . There is vacations input and employee . This has not like further reference to other types . And then I come to employee here , and there are no further references . So I 've been implementing strategies for this . So one strategy kind of infers what do I need to use on the basis of the description . So let me show you , to Michal and Shonda , are very familiar . with these descriptions . 
and then identify or just use that strategy of the top K to know which ones should I include . Then I create this like sub tree that fits in most of the time in the GraphQL . Then I can use this . There is another strategy where I directly ask the LLM . So instead of using embeddings , I ask the LLM as if they were tools . That is the thing that is now like very popular . So I asked the LLM , this is the query I have . with their own descriptions . And then on the basis of this , I create again , these like sub tree of the GraphQL , and then I can obtain the query . with using the LLM in the case of , says me concretely , that it 's the use case that Adam wanted to test . This thing has so , the number of queries is so big , So let 's say , let 's say with GitHub . Let me see . So this is the query I 'm giving for the Rails organization , get the Rails repo on the last full request with the status open . This is the GraphQL that I 'm passing . The schema strategy , I 'm using embeddings . So I 'm just creating the embeddings and then getting the top K , rebuilding that tree so I can fit it to the LLM . Sorry . This is in an unstable state right now . Let me try to uncomment some things that I 'm thinking right now . Let me test another one , stars . Yeah , it 's better for me to use the LLM strategy here . So this usually can take a while . So there are even some cases where the GraphQL is so big that even after skimming it and just getting the parts , it 's too big . But with the Anthropic LLM , I can make it , idea is that you have a human query , then you have a description of the mutations and queries that you can take . But to be able to feed that to the LLM , you need a strategy to tell the LLM which parts it needs to use . how big it is , not even in the bigger ones , in the bigger Anthropic . So it 's 58,000 lines . And maybe make an average of two tokens per each . So that crosses even the 100K window of this LLM . 
So I just need to pick the entities that are relevant with whatever strategy . Right now there are two , like LLMs and embeddings . with whatever strategy . Right now there are two like LLMs and embeddings . and let me show it for you . Quick question . Do they all have like the really verbose comments there ? All the GraphQL queries ? Are they all , you know , do they have rich comments like that ? shrink that by eliminating those . Yeah . Right now that 's one of the requirements for the strategies . Because if you got rid of the comments , then the LLM would not maybe know as well . Yeah , as well . Yeah . because of the lower tokens , but then , you know , depending on whether the variable names are self-describing enough . Exactly , exactly . So I 've implemented some strategies . the tools that I have available . And then when I include the GraphQL , It 's all about like trying to save like characters always , not just for like cause reasons , But yeah , all of those combinations are available . and it might work if these variable names are self-descriptive . As you have seen , this one , it 's really without any comments , and it works pretty well . Let me show you how these things work . There is this Explorer from ... Let 's see . I asked this , get the ROS projects with most stars . so it 's , like , ordered . And then you 've got the stars and then you can . So that 's , let me try another one . Yeah , and maybe I could share a use case that what we were talking about , KTC , and maybe the others on the call have already seen it in the chat , but with that kind of US federal initiative says me the smart manufacturing Institute and then the think IQ platform they 're building . They 're really pushing tons of small medium sized manufacturers to leverage these open data standards to collect data for like incompatible appliances , like let 's say mixers on a manufacturing line from different manufacturers that do n't interoperate . 
let 's say mixers on a manufacturing line from different manufacturers that do n't interoperate . to do that with just their knowledge . then improving the GUI interface that they have , And I think where we can step in potentially what we 've been exploring is to go beyond the GUI interface where it 's like a lot of , they use GraphiQL actually , a lot of like pointing and clicking to select the specific appliances and data elements . How do we take that a step further and basically use language like JDC is describing here to say , Hey , I want to get a historian graph for all of the mixers that we had on the line for the last four Saturdays because you know , So that 's the idea to spit out a graph QL endpoint that would put that specific data for them , right ? That 's kind of the idea we 've been exploring at least . but by just describing . for a given instance after we 've specified time range . 1690 , max 10 samples within the start time , blah , blah , blah , end time . of more things , just the first , Yeah , I see that that 's an example . like English , right ? would be able to think about the information they need and have an awareness of that . And honestly , most engineers in manufacturing spaces are familiar with that type of notation , through this interface , then that 's incredibly robust . Exactly right . Yeah , because I 'm using Anthropic , That 's another story of like tweaking the prompt . then use the cheaper models that are faster . So right now it 's in a state that it 's like a Frankenstein , because I 'm using some parts from OpenAI and other from Anthropic . But right now it 's like the embeddings . It 's using like embedding from OpenAI and getting the tools that I need to use . And then when I get the GraphQL , I use it in Anthropic , Sesame 's GraphQL , it 's like , whoa . 15,000 , yeah . Yeah , 15,000 . Just let 's say like 30K , 30K tokens , making an average . So that goes into every prompt . 
It has to have all 30,000 tokens , so you 're left with 70,000 for our response , essentially . Or no , 70,000 for context , I guess . as context , that it 's the prompt plus all the data that I include , that it 's in context data , and that includes the GraphQL description . So that 's why I need to like to save as much as possible , trying to squeeze and getting just the parts that are relevant to the query . request . No , I 'm not . I 'm not . I 'm not like right now . I 'm not sampling that , but it 's less than that because the whole idea . I mean , once let me , let me explain you better . So the strategy that my point , but I think that the Miha and Sean , do you get the point ? Like it 's these things like reference other types , so you need to include them like all . Yeah , yeah , you do like an LLM guided tree shaking to minimize the input code . Yeah , yeah . And on the basis of this , I kind of like , how do you say , cut this huge tree of GraphQL . Like tree shaking , yeah . That 's basically . Yeah . To include just the things that I will know . And all the dependencies . Yeah , exactly . Yeah . Is it only the human language descriptions that I use for Anthropic . You must use the following criteria , blah blah , blah , context , date , time . Like these rules , the context , date , time , and that 's it . Yeah , but on the OpenAI side , when you generate the embeddings to find what you want to use from this . Let me show you . So I came up with this like a strategy . So there is this general thing , schema 'm passing , I return the whole schema . LLM strategy . So , it works like this . Get tool descriptions by operation . So , what it basically does , it gets the ... I mean , it 's easier to show it here . So , let 's say I get this , just this part that says code of conduct , And then I put two dots and say like this , look up the code of conduct by scheme . So I pass all of these in this format . You see here that key value . 
This is the descriptions of the tools . So it 's like , it 's the symbol name symbol name colon and then the comment . Yeah , yeah exactly . That 's the LLM strategy . So does it incorporate the comment on the type name as well ? Because it 's like the whole tree right or what ? Yeah and this list of tools separated by commas , answer the following question as best as you can , blah , blah , blah . And here you have the tool description . So that 's the format of code of conduct , comma . So that 's the LLM strategy . The other strategy is simpler . And it 's embedding strategy . So for the embedding strategy , I basically construct these documents that are the same thing , you see ? So for the LLM strategy , do you run the LLM once for every single field of every type in the source schema ? No , no , no , it 's not . I chunk this , like code of conduct to dot , look up the code of conduct by its key . Yeah , yeah . but I 'm not talking that case , but that 's not very hard to do . Yeah . and all the mutations and it still it works . It does n't work for Sesame . So that 's why I wrote embeddings . Because embeddings , you know , I just create embedding , Let 's say the tools are the queries and mutations shrink and tree , shrink and graph QL tree . and mutation descriptions . And then I create this document using the Java index capabilities . It 's just that , let 's say it will say like Enterprise and Lookup Enterprise in this . So you see here , for all in blah , blah , blah , get keys . And then I return the proper documents . So this is what it identifies the operations , order by a score , and then build the schema from operations . That 's what I call the thing . Once I know which queries or mutations I could be needing , this is the function works . But remarkably , it 's able to tackle these very huge schemas really well . You see ? You 've got it . JVC , what are your ? So two questions . One , I have n't seen if you put out a repo for this yet . 
I understand you 're refactoring . But if I want to run this on my own machine , are you going to be done refactoring soon ? Or what 's your thoughts on that ? open AI and for the GraphQL , So it 's here . What 's the other advantage of this ? awesome examples and maybe like two or three examples that like are totally out of scope and not feasible , and then take some time to demo that to them and maybe start the conversation us eager ? How could this make maybe some additional companies adopt this that are like , you know , they 're afraid of using a graph to a GUI to structure endpoints and then dump them into some sort of data explorer , right ? And then demonstrate the query , but then also maybe to be able to see the actual data being run itself . So if we put some sort of like a graphana type thing , whatever , on the third end there , I 'm wondering how much of this that , how elegant that could be really , right ? but at least maybe certain things it works with . Yeah , I 've tried with many of these queries , and give them extra details when those are missing , of all attributes . So query equipment , you see this one , equipment . My impression is that this kind of use case , that then they put into some like dashboard or anything . Because in the end , you will need , I think you will need some expertise because sometimes , unless you describe all the fields , it will miss some fields or unless you , let 's say , sometimes it just selects the first 10 . So unless you understand how to get rid of that . and specific interfaces for , again , key questions . Give me this particular data for these types of machines . Or give me a list of these machines . Or give me a list of these locations and all the machines , you know , and all the machines that are like the combination of all these queries . I think that that 's really how our customers are utilizing that to create these kind of custom reports . But like , what if that did n't have to happen ? 
And instead , you could have that natural language query that maybe works for like the simpler or majority of requests . And then you maybe it bypasses , I guess for instance , this one . A Grafana visualization of that , right ? but for any reason uncontrollable , it just got the first 10 . That 's another thing , in the GraphQL , but that 's another story . So there might be some inconsistency . Cause I just downloaded these from someplace here . Like , yeah . And in that case , we would need some sort of like a post-processor for these very specific errors , right ? To basically resolve them . Yeah . That 's ... I kind of see like more potential on some other use cases like general interface for general command line interface . There are some efforts on it . So that 's also worth mentioning maybe to the guys here . So these people from Stanford tried to make basically this kind of the same thing . So , scaling tools . So , these tool patterns , that is , you have some query and then you have some tools available to solve it , but you always have the that is some YAMA , and fine-tuned it on the instructions . That is some YAMA , and fine-tune it on the instructions . This is using tools such as Hugging Face Models and things like that . But it 's way more complex , in my opinion , I think , because they tried to format everything as function calls . And I think this thing has seen more ... rather than SQL or some function code . You know , there is this comas and things . GraphQL , it 's kind of simpler . You just know the fields . You might miss some fields and still the thing goes on . So it 's kind of in between in the spectrum of this contract . And on the other end , you have like things like REST , but in the , sorry , in the other end , you have just JSON , you have like REST and GraphQL . But there are many efforts doing this . And you know , right now also OpenAI has included natively a functionality to call , to do this kind of thing . 
I think behind the scenes , they are doing the same . So it does n't , one can just use the same thing and it will take advantage of the training that they put already . But underneath , it 's doing a very similar thing , I guess . It 's called OpenAI functions . Yeah , OK . For the graphic , sorry . Sorry , go ahead . them and get their perspective . And then , you know , I can summarize those learnings for us to think There is some ... It says that it has an error , but it 's not erroneous . That 's something I do n't understand sometimes with GraphQL . I instructed for the Rails organization , get the repo Rails and the last 10 pull request titles with open status . Let me show you . Sorry , Sharon , let me cut you off . I just want to close out that item . Yeah . So you can see here . Also , response request not issues . So , qualify association . It 's here , it 's in there . Get rid of unnecessary gem . You see it 's , and it 's kind of a complex query . I just said like for the Rails organization , get the repo Rails and the last 10 pull requests with titles with open status . And it has enums and things like that that are not trivial . This is an enum . This is this kind of nesting pattern that it needs to use . So I feel very confident with the thing . And it also did some good work with Sesme . One idea is that this can form the basis of some assistant , local assistant , so you could thing that I 've seen that people did with this Gorilla . So if you ... This is the ... That 's related with the other thing URLs , internal URLs ? These people are moving in that direction . Yeah , okay , yeah . like this text interface for many things . This is the one I was referencing , the AIP thing . Let 's say it receives an alert , and then it says , show me more details . You can just stuff the prompt , show me more details , but they are referencing these entities . 
So you know , immediately what 's the query that it 's involved there and even the IDs that you can like stuff into the prompt . So this is the one dimension that was the motivation to start with this . And it 's remarkable that these guys , there is some place where it 's shown that they are using open source model here . You see , they disclose this . So they 're using Flan , GPT-NEWX . Thank you , God . I got all of these screenshotted . You can go to AIP , or maybe at the start of ontology prompt . And I was analyzing all of these screens that they disclose . Yeah . Remember , they also disclose some very interesting thing here of how they work . So they just use the LLM as some other user . They do n't have this problem of what it can access or not . It 's just like another user . So whatever that user has access to , the LLM has access to . And this is the actions that it can take . They include one thing called suggestions . Yeah , that 's a main thing that they advertise , just like , figure out LLM security . Look , this other thing , this other thing has also a bunch of clues of how they , so instead of getting this PII problem embedded into the model , they just put an input filter to whatever PII is involved . They just cut it , then it 's the model , and then there is the validation . Very cool , yeah . Sorry , Tom , you were about to ask a question earlier . Yeah , so for the output of the GraphQL , are you using the constraint generation or whatever ? Or is it just opening the syntactically correct GraphQL on its own ? it 's open and it 's working well . I 'm not like constraining it to some like concrete schema . with knowledge of the schema . Yeah , definitely . But right now , it 's not constrained at all . Yeah , the only protection that I have right now for this . You validate afterwards , right ? Oh , I do two things . Like , let do two things . Let me show you . I think here it 's in main . Oh , okay . 
because there could be a bunch of errors . GraphQL at all , but I 've never encountered that . but the prompts that I 'm using are already like covering for that . like , hey , take into account that this data , it 's in this format , something like that . for GraphQL queries or mutations . Oh , like to extract the part of the result . Yeah , and then I do the validation . But I 'm really not very concerned . I think right now this is not a problem . I mean , it behaves pretty well . Okay , in that in that front . But yeah , you could like constrain I think there are some efforts and I there I read a debate of what was I read a debate of what was OpenAI really doing , because you could do two things . One , it 's this open-ended thing , and then constrain it with regex . And in the case that it misses , you can feed it back and tell it , hey , this is bad . the valid ones . That 's another solution that I 've seen . And finally , it might be the one that you have in mind that it 's like constrained , that you have to hack into the LLM latest layer , and then you can squash the probability because in that layer , you will have one output per token , so there will be some invalid tokens . There 's some guy that did that . Yeah . That 's kind of the approach that Gorilla ... They do n't need this . They are able to chunk thousands of tools without having to select the tool . It 's just the training that speeds that . Yeah , they did say . But it has some small errors . I think I prefer to have a way more powerful LLM rather than have that ... From what I 've seen , it 's very easy just to use the regex and then validate . Yeah , fair . Especially since so much of this is plain language already . The way the function calls are , they 're kind of intuitive when you read them in a way , right ? Yeah . Yeah , I 've seen that this model , because it 's this fine-tuning , foundational model really helps in that situation . from English to Chinese if you come here . 
And it selected the French model . So there should be , I do n't know why they include this example in their showcase , but they are being honest with that . So it 's like , how do you call that ? It 's like overfitting because you 're not even able to generalize these codes . You just use the model . completely biased , it 's like overfitted . So I kind of prefer the more general model than Google did this but for SQL . And if you read the paper like very , with a lot of attention , they did n't gain much by fine tuning . You are comparing it with others . Look at the figures . The difference between fine-tune and the other one was just like 1 % . Put the analysis here . Yeah . 77.3 to ... Look3 . 77.3 to 78.3 . Look , so the only difference ... Okay , this is the whole dataset . Sorry . So , the difference between a few-shot stuff model to a fine-tuned model , it was just 1 % . So , I really prefer to have the few shots . That 's something that is also emerging . samples to stuff in . And that might work better than just fine-tuning and that 's like costly . They changed their docs , but let me search for samples . Wait . Or read Or retreaters . Well , they have this concept , you know , that on the basis of the prompt , they fetch the most relevant view shot samples to stuff in . And according to what I 've seen , it might work better than just like fine-tuning . Yeah , but at some point it might make sense to fine-tune . Yeah , this is very exciting . I 'm looking forward to running it on my own machine and kind of understanding a little bit deeper . And yeah , exploring more use cases , that sounds really exciting as well . Yeah , so I play with ... Let me ... Can be stuffy . But it did work also for these huge GraphQLs . I 'm not using toy things . GitHub . Sorry ? Yeah . Well , at this point , just to comply with monadicals thing , sending the . 
Yeah , and that 's where Sean 's ideas about how to overcome the context window size and local processing would maybe help . If it was feasible . Not necessary . No , it 's just using a privately hosted LLMs . I 'm using Anthropix and OpenAI . It 's crazy . But definitely with the local models , we have n't found a way to work with that . Plus , you know , if some of them , it 's like , We actually have a preprocessor to resolve what we 're seeing on the screen right now . But yeah , there 's still some struggles on the local model that we have to figure out and get creative with . Okay , I 'm going to stop the recording unless anybody else has any last questions model could be useful for the token . Contact size problem as well . Like if something that could be applied to the scheme . Get rid of this . Anyway , that 's all I just wanted to mention . Yeah , yeah . Awesome . Hey , thanks , everyone . \""
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"selective_transcribed_content"
]
},
{
"cell_type": "markdown",
"id": "a20896b4",
"metadata": {},
"source": [
"## Selective topic summarization"
]
},
{
"cell_type": "markdown",
"id": "6f8ab415",
"metadata": {},
"source": [
"We can now summarize this selective content using the already available pipeline!"
]
},
{
"cell_type": "markdown",
"id": "06f009d5",
"metadata": {},
"source": [
"# And much more!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}