mirror of
https://github.com/Monadical-SAS/reflector.git
synced 2025-12-20 12:19:06 +00:00
Merge pull request #16 from Monadical-SAS/whisper-jax-gokul
Add feature for real time transcription locally
@@ -110,12 +110,35 @@ This is a jupyter notebook playground with template instructions on handling the
pipeline. Follow the instructions given and tweak your own logic into it, or use it as a playground to experiment with libraries and
visualizations on top of the metadata.

**WHISPER-JAX REALTIME TRANSCRIPTION PIPELINE:**

We also support real-time transcription using the whisper-jax pipeline, but there are
a few prerequisites before you can run it on your local machine. The instructions below are for
configuring it on macOS.

We need a way to route both the audio from an application opened in the browser (e.g. "Whereby") and the audio from the local
microphone you will be speaking into. We use [BlackHole](https://github.com/ExistentialAudio/BlackHole).

1) Install BlackHole-2ch (2 channels are enough) by one of the two options listed.
2) Set up an [Aggregate device](https://github.com/ExistentialAudio/BlackHole/wiki/Aggregate-Device) to route web audio and
local microphone input.

Be sure to mirror the settings shown in `images/aggregate_input.png` (including the device name, `ref-agg-input`, which the script searches for).
3) Set up a [Multi-Output device](https://github.com/ExistentialAudio/BlackHole/wiki/Multi-Output-Device).
Refer to `images/multi-output.png`. A quick way to verify that the aggregate device is visible is sketched below.

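Before running the script, you can check that the routing worked by listing the input devices PyAudio sees. This is a minimal sketch, not part of the commit; it only assumes PyAudio is installed and that you named the aggregate device `ref-agg-input` as above.

```python
# Sketch: confirm the BlackHole aggregate device is visible to PyAudio.
# "ref-agg-input" is the device name whisjax_realtime_trial.py searches for.
import pyaudio

p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    if info["maxInputChannels"] > 0:  # input-capable devices only
        marker = " <-- used by whisjax_realtime_trial.py" if info["name"] == "ref-agg-input" else ""
        print(f'{i}: {info["name"]} ({int(info["maxInputChannels"])} ch){marker}')
p.terminate()
```
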
From the reflector root folder, run:

```python3 whisjax_realtime_trial.py```

**Permissions:**

You may have to grant your Terminal/code editor permission to use the microphone so it can record audio, and add it under
```System Preferences -> Privacy & Security -> Accessibility``` as well.

NEXT STEPS:

1) Run this demo on a local Mac M1 to test the flow and observe the performance
2) Create a pipeline that uses a microphone to listen to audio chunks and perform transcription in real time (and also
summarize it efficiently) -> *done as part of whisjax_realtime_trial.py*
3) Create a RunPod setup for this feature (mentioned in 1 & 2) and test it end-to-end
4) Perform speaker diarization using Whisper-JAX
5) Based on the feasibility of the above points, explore suitable visualizations for transcription & summarization.

@@ -5,6 +5,7 @@ KMP_DUPLICATE_LIB_OK=TRUE
 OPENAI_APIKEY=
 # Export Whisper Model Size
 WHISPER_MODEL_SIZE=medium
+WHISPER_REAL_TIME_MODEL_SIZE=medium
 # AWS config
 AWS_ACCESS_KEY=***REMOVED***
 AWS_SECRET_KEY=***REMOVED***

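The script reads these values through a config object indexed as `config['DEFAULT'][...]` (visible in the Python diff below). A minimal sketch of that lookup, assuming the file is parsed with Python's `configparser`; the actual loading code is not shown in this diff:

```python
# Sketch: how the model-size keys might be loaded. Assumes configparser;
# a [DEFAULT] header is prepended because a bare KEY=VALUE file has no
# section header. The repository's real loader is not part of this diff.
import configparser

config = configparser.ConfigParser()
with open(".env") as f:
    config.read_string("[DEFAULT]\n" + f.read())

print(config["DEFAULT"]["WHISPER_MODEL_SIZE"])            # e.g. "medium"
print(config["DEFAULT"]["WHISPER_REAL_TIME_MODEL_SIZE"])  # e.g. "medium"
```
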
BIN images/aggregate_input.png (new file, 124 KiB)
BIN images/multi-output.png (new file, 113 KiB)

@@ -14,23 +14,28 @@ WHISPER_MODEL_SIZE = config['DEFAULT']["WHISPER_MODEL_SIZE"]
 
 FRAMES_PER_BUFFER = 8000
 FORMAT = pyaudio.paInt16
-CHANNELS = 1
+CHANNELS = 2
 RATE = 44100
-RECORD_SECONDS = 5
+RECORD_SECONDS = 15
 
 
 def main():
     p = pyaudio.PyAudio()
 
+    AUDIO_DEVICE_ID = -1
+    for i in range(p.get_device_count()):
+        if p.get_device_info_by_index(i)["name"] == "ref-agg-input":
+            AUDIO_DEVICE_ID = i
+    audio_devices = p.get_device_info_by_index(AUDIO_DEVICE_ID)
     stream = p.open(
         format=FORMAT,
         channels=CHANNELS,
         rate=RATE,
         input=True,
-        frames_per_buffer=FRAMES_PER_BUFFER
+        frames_per_buffer=FRAMES_PER_BUFFER,
+        input_device_index=audio_devices['index']
     )
 
-    pipeline = FlaxWhisperPipline("openai/whisper-" + WHISPER_MODEL_SIZE,
+    pipeline = FlaxWhisperPipline("openai/whisper-" + config["DEFAULT"]["WHISPER_REAL_TIME_MODEL_SIZE"],
                                   dtype=jnp.float16,
                                   batch_size=16)

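Two of these constants interact: at RATE = 44100 and FRAMES_PER_BUFFER = 8000, each stream read yields about 0.18 s of audio, so one RECORD_SECONDS = 15 window is roughly 83 reads. Below is a sketch, not code from this commit, of converting one such window of interleaved stereo int16 frames for the pipeline; the mono downmix and the dict input format are assumptions based on how transformers-style ASR pipelines accept raw arrays.

```python
# Sketch (not from this commit): transcribe one captured window of audio.
# Assumes frames is a list of byte buffers from stream.read(), CHANNELS = 2,
# RATE = 44100, and that FlaxWhisperPipline accepts {"array", "sampling_rate"}
# dicts the way transformers ASR pipelines do.
import numpy as np

def transcribe_window(pipeline, frames, channels=2, rate=44100):
    # interleaved int16 samples -> float32 in [-1.0, 1.0]
    audio = np.frombuffer(b"".join(frames), dtype=np.int16).astype(np.float32) / 32768.0
    if channels == 2:
        # average L/R pairs to get the mono signal Whisper expects
        audio = audio.reshape(-1, 2).mean(axis=1)
    return pipeline({"array": audio, "sampling_rate": rate})["text"]
```
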
@@ -48,8 +53,7 @@ def main():
 
     listener = keyboard.Listener(on_press=on_press)
     listener.start()
-    print("Listening...")
-
+    print("Attempting real-time transcription.. Listening...")
     while proceed:
         try:
             frames = []

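The hunk cuts off inside the capture loop. For orientation only, a hedged reconstruction of what such a loop typically looks like, assuming the `transcribe_window` helper sketched above, the `proceed` flag cleared by the pynput key handler, and the constants from the earlier hunk; the actual loop body in whisjax_realtime_trial.py may differ:

```python
# Sketch (assumptions, not the committed code): read RECORD_SECONDS of audio
# per iteration and transcribe it, until a key press clears `proceed`.
while proceed:
    try:
        frames = []
        for _ in range(int(RATE / FRAMES_PER_BUFFER * RECORD_SECONDS)):
            # exception_on_overflow=False avoids crashes if we fall behind
            frames.append(stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False))
        print(transcribe_window(pipeline, frames, CHANNELS, RATE))
    except KeyboardInterrupt:
        break

stream.stop_stream()
stream.close()
p.terminate()
```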