local-only demo with transcription, agenda comparison and summarization

2026-02-04 09:56:47 +00:00 · 2023-06-14 13:07:31 -04:00
parent 90c66b5a49
commit 47915820fe
10 changed files with 104 additions and 1 deletions
--- a/.DS_Store
+++ b/.DS_Store
--- a/.vscode/settings.json
+++ b/.vscode/settings.json
@@ -0,0 +1,6 @@
+{
+    "workbench.colorCustomizations": {
+        "minimap.background": "#00000000",
+        "scrollbar.shadow": "#00000000"
+    }
+}
--- a/reflector-local/.DS_Store
+++ b/reflector-local/.DS_Store
--- a/reflector-local/0-reflector-local.py
+++ b/reflector-local/0-reflector-local.py
@@ -28,3 +28,6 @@ subprocess.run(["python3", "1-transcript-generator.py", input_file, f"{input_fil

 # Run the second script to compare the transcript to the agenda
 subprocess.run(["python3", "2-agenda-transcript-diff.py", agenda_file, f"{input_file}_transcript.txt"])
+
+# Run the third script to summarize the transcript
+subprocess.run(["python3", "3-transcript-summarizer.py", f"{input_file}_transcript.txt", f"{input_file}_summary.txt"])
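
Review note: the `subprocess.run` calls above do not check each step's exit status, so a failed transcription would still be followed by the diff and summary steps. A minimal sketch of a fail-fast variant follows. `build_steps` and `run_steps` are hypothetical helper names that are not part of this commit, using `sys.executable` instead of `"python3"` is an assumption, and only the two steps whose full command lines are visible in this hunk are included.

```python
import subprocess
import sys


def build_steps(input_file, agenda_file):
    """Return argv lists for the agenda-diff and summarizer steps (hypothetical helper)."""
    transcript = f"{input_file}_transcript.txt"
    summary = f"{input_file}_summary.txt"
    return [
        [sys.executable, "2-agenda-transcript-diff.py", agenda_file, transcript],
        [sys.executable, "3-transcript-summarizer.py", transcript, summary],
    ]


def run_steps(steps):
    """Run each step in order; check=True raises CalledProcessError on a non-zero exit."""
    for argv in steps:
        subprocess.run(argv, check=True)
```

With this shape, `run_steps(build_steps(input_file, agenda_file))` stops at the first failing script instead of summarizing a transcript that may not exist.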
--- a/reflector-local/3-transcript-summarizer.py
+++ b/reflector-local/3-transcript-summarizer.py
@@ -0,0 +1,86 @@
+import argparse
+import nltk
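+nltk.download(['stopwords', 'punkt'])  # the tokenizers below need the Punkt models; some newer NLTK releases look for 'punkt_tab' instead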
+from nltk.corpus import stopwords
+from nltk.tokenize import word_tokenize, sent_tokenize
+from heapq import nlargest
+from loguru import logger
+
+# Function to initialize the argument parser
+def init_argparse():
+    parser = argparse.ArgumentParser(
+        usage="%(prog)s <TRANSCRIPT> <SUMMARY>",
+        description="Extractive summarization of a transcript based on word frequency"
+    )
+    parser.add_argument("transcript", type=str, help="Path to the input transcript file")
+    parser.add_argument("summary", type=str, help="Path to the output summary file")
+    parser.add_argument("--num_sentences", type=int, default=5, help="Number of sentences to include in the summary")
+    return parser
+
+# Function to read the input transcript file
+def read_transcript(file_path):
+    with open(file_path, "r", encoding="utf-8") as file:
+        transcript = file.read()
+    return transcript
+
+# Function to preprocess the text by removing stop words and special characters
+def preprocess_text(text):
+    stop_words = set(stopwords.words('english'))
+    words = word_tokenize(text)
+    words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
+    return words
+
+# Function to score each sentence based on the frequency of its words and return the top sentences
+def summarize_text(text, num_sentences):
+    # Tokenize the text into sentences
+    sentences = sent_tokenize(text)
+
+    # Preprocess the text by removing stop words and special characters
+    words = preprocess_text(text)
+
+    # Calculate the frequency of each word in the text
+    word_freq = nltk.FreqDist(words)
+
+    # Calculate the score for each sentence based on the frequency of its words
+    sentence_scores = {}
+    for i, sentence in enumerate(sentences):
+        sentence_words = preprocess_text(sentence)
+        for word in sentence_words:
+            if word in word_freq:
+                if i not in sentence_scores:
+                    sentence_scores[i] = word_freq[word]
+                else:
+                    sentence_scores[i] += word_freq[word]
+
+    # Select the top sentences based on their scores
+    top_sentences = nlargest(num_sentences, sentence_scores, key=sentence_scores.get)
+
+    # Sort the top sentences in the order they appeared in the original text
+    summary_sent = sorted(top_sentences)
+    summary = [sentences[i] for i in summary_sent]
+
+    return " ".join(summary)
+
+def main():
+    # Initialize the argument parser and parse the arguments
+    parser = init_argparse()
+    args = parser.parse_args()
+
+    # Read the input transcript file
+    logger.info(f"Reading transcript from: {args.transcript}")
+    transcript = read_transcript(args.transcript)
+
+    # Summarize the transcript using the nltk library
+    logger.info("Summarizing transcript")
+    summary = summarize_text(transcript, args.num_sentences)
+
+    # Write the summary to the output file
+    logger.info(f"Writing summary to: {args.summary}")
+    with open(args.summary, "w", encoding="utf-8") as f:
+        f.write("Summary of: " + args.transcript + "\n\n")
+        f.write(summary)
+
+    logger.info("Summarization completed")
+
+if __name__ == "__main__":
+    main()
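
Review note: the scoring idea in `summarize_text` can be tried without installing NLTK. The sketch below is a simplified stand-in, not the script itself: it swaps `word_tokenize` and `sent_tokenize` for regular expressions, uses a tiny illustrative stop-word set (an assumption), and the names `tokenize` and `summarize` are hypothetical.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word set (assumption); the script uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in", "on", "are", "was"}


def tokenize(text):
    """Return lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, num_sentences):
    # Naive split after ., ! or ? followed by whitespace (stand-in for sent_tokenize).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_freq = Counter(tokenize(text))
    # A sentence's score is the sum of the document-wide frequencies of its words.
    scores = {i: sum(word_freq[w] for w in tokenize(s)) for i, s in enumerate(sentences)}
    top = nlargest(num_sentences, scores, key=scores.get)
    # Restore the original order so the summary reads chronologically.
    return " ".join(sentences[i] for i in sorted(top))


text = "Cats chase mice. Dogs bark. Cats and mice are cats."
print(summarize(text, 2))  # prints "Cats chase mice. Cats and mice are cats."
```

Because scores are summed rather than averaged, longer sentences tend to win. Dividing each score by the sentence's token count would be one possible way to reduce that bias, at the cost of favoring very short sentences.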
--- a/reflector-local/30min-CyberHR/30min-CyberHR.m4a.mp4_summary.txt
+++ b/reflector-local/30min-CyberHR/30min-CyberHR.m4a.mp4_summary.txt
@@ -0,0 +1,3 @@
+Summary of: 30min-CyberHR/30min-CyberHR.m4a.mp4_transcript.txt
+
+Since the workforce is an organization's most valuable asset, investing in workforce experience activities, we've found has lead to more productive work, more efficient work, more innovative approaches to the work, and more engaged teams which ultimately results in better mission outcomes for your organization. And this one really focuses on not just pulsing a workforce once a year through an annual HR survey of, how do you really feel like, you know, what leadership considerations should we implement or, you know, how can we enhance the performance management process. We've just found that, you know, by investing in this and putting the workforce as, you know, the center part of what you invest in as an organization and leaders, it's not only about retention, talent, you know, the cyber workforce crisis, but people want to do work well and they're able to get more done and achieve more without you, you know, directly supervising and micromanaging or looking at everything because, you know, you know, you know, you're not going to be able to do anything. I hope there was a little bit of, you know, the landscape of the cyber workforce with some practical tips that you can take away for how to just think about, you know, improving the overall workforce experience and investing in your employees. So with this, you know, we know that all of you are in the trenches every day, you're facing this, you're living this, and we are just interested to hear from all of you, you know, just to start, like, what's one thing that has worked well in your organization in terms of enhancing or investing in the workforce experience?
--- a/reflector-local/30min-CyberHR/30min-CyberHR.m4a.mp4_transcript.txt
+++ b/reflector-local/30min-CyberHR/30min-CyberHR.m4a.mp4_transcript.txt
--- a/reflector-local/42min-StartupsTechTalk/42min-StartupsTechTalk.mp4_summary.txt
+++ b/reflector-local/42min-StartupsTechTalk/42min-StartupsTechTalk.mp4_summary.txt
@@ -0,0 +1,3 @@
+Summary of: 42min-StartupsTechTalk/42min-StartupsTechTalk.mp4_transcript.txt
+
+If you had perfect knowledge, and you need like one more piece of advertising, drove like 0.2 customers in each customer generates, like let's say you wanted to completely maximize, you'd make it say your contribution margin, on incremental sales, is just over what you're spending on ad revenue. Like if you're, I don't know, well, let's see, I got like you don't really want to advertise a ton in the huge and everywhere, and then getting to ubiquitous, because you grab it, damage your brands, but just like an economic textbook theory, and be like, it'd be that basic math. And the table's like exactly, we're going to be really cautious to like be able to move in a year if we need to, but Google's goal is going to be giving away foundational models, lock everyone in, make them use Google Cloud, make them use Google Tools, and it's going to be very hard to switch off. Like if you were starting to develop Figma, you might say, okay, well Adobe is just gonna eat my lunch, right, like right away. So when you see a startup or talk to a founder and he's saying these things in your head like, man, this isn't gonna work because of, you know, there's no tab or there's, you know, like Amazon's gonna roll these cuts over in like two days or whatever, you know, or the man, this is really interesting because not only they're not doing it and no one else is doing this, but like they're going after a big market.
--- a/reflector-local/42min-StartupsTechTalk/42min-StartupsTechTalk.mp4_transcript.txt
+++ b/reflector-local/42min-StartupsTechTalk/42min-StartupsTechTalk.mp4_transcript.txt
--- a/reflector-local/readme.md
+++ b/reflector-local/readme.md
@@ -8,4 +8,4 @@ python 0-reflector-local.py voicememo.m4a agenda.txt

 OR - using 30min-CyberHR example:

-python 0-reflector-local.py 30min-HR-cyber.m4a 30min-HR-cyber-agenda.txt
+python 0-reflector-local.py 30min-CyberHR/30min-CyberHR.m4a 30min-CyberHR/30min-CyberHR-agenda.txt