update feature for real-time transcription locally

This commit is contained in:
gokul
2023-06-20 17:37:01 +05:30
parent 60072062af
commit 7fd02607f6
5 changed files with 41 additions and 13 deletions

View File

@@ -110,12 +110,35 @@ This is a jupyter notebook playground with template instructions on handling the
pipeline. Follow the instructions given and tweak your own logic into it, or use it as a playground to experiment with libraries and
visualizations on top of the metadata.
**WHISPER-JAX REAL-TIME TRANSCRIPTION PIPELINE:**
We also support real-time transcription using the whisper-jax pipeline, but there are
a few prerequisites before you can run it on your local machine. The instructions below are for
configuring it on macOS.
We need a way to route the audio from an application opened in the browser (e.g. "Whereby") together with the audio from the local
microphone input you will be speaking into. We use [BlackHole](https://github.com/ExistentialAudio/BlackHole) for this.
1) Install BlackHole-2ch (2 channels are enough) using either of the two installation options listed.
2) Set up an [Aggregate Device](https://github.com/ExistentialAudio/BlackHole/wiki/Aggregate-Device) to route the web audio and
the local microphone input.
Be sure to mirror the settings shown ![here](./images/aggregate_input.png) (including the device name; the device-listing sketch below can help confirm it is visible).
3) Set up a [Multi-Output Device](https://github.com/ExistentialAudio/BlackHole/wiki/Multi-Output-Device).
Refer to the settings shown ![here](./images/multi-output.png).
From the reflector root folder,
run ```python3 whisjax_realtime_trial.py```
**Permissions:**
You may have to grant microphone access to your terminal or code editor so it can record audio, and add it under
```System Preferences -> Privacy & Security -> Accessibility``` as well.
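To confirm that the Aggregate Device is visible before running the script, a minimal sketch like the one below can be used. It assumes
the device was named ```ref-agg-input```, matching the lookup in ```whisjax_realtime_trial.py```:
```python
# List the audio input devices PyAudio can see and flag the aggregate device.
# Assumes the Aggregate Device was named "ref-agg-input" during setup.
import pyaudio

p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    marker = "  <-- aggregate device" if info["name"] == "ref-agg-input" else ""
    print(f'{i}: {info["name"]} (inputs: {info["maxInputChannels"]}){marker}')
p.terminate()
```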
**NEXT STEPS:**
1) Run this demo on a local Mac M1 to test the flow and observe the performance
2) Create a pipeline that uses a microphone to listen to audio chunks and perform transcription in real time (and also
summarize it efficiently) -> *done as part of whisjax_realtime_trial.py*
3) Create a RunPod setup for this feature (mentioned in 1 & 2) and test it end-to-end
4) Perform Speaker Diarization using Whisper-JAX
5) Based on the feasibility of the above points, explore suitable visualizations for transcription & summarization.
1) Create a RunPod setup for this feature (mentioned in 1 & 2) and test it end-to-end
2) Perform Speaker Diarization using Whisper-JAX
3) Based on the feasibility of the above points, explore suitable visualizations for transcription & summarization.

View File

@@ -5,6 +5,7 @@ KMP_DUPLICATE_LIB_OK=TRUE
OPENAI_APIKEY=
# Export Whisper Model Size
WHISPER_MODEL_SIZE=medium
WHISPER_REAL_TIME_MODEL_SIZE=medium
# AWS config
AWS_ACCESS_KEY=***REMOVED***
AWS_SECRET_KEY=***REMOVED***
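How this config file is loaded is not shown in this diff; the script below indexes it as ```config['DEFAULT'][...]```, so here is a minimal
sketch of one way the new ```WHISPER_REAL_TIME_MODEL_SIZE``` key might be read with ```configparser``` (the ```.env``` filename and the
section-less loading trick are assumptions):
```python
# Minimal sketch (assumption): load the section-less .env file with configparser
# by prepending a [DEFAULT] header, then read the new real-time model size key.
import configparser

config = configparser.ConfigParser()
with open(".env") as f:
    config.read_string("[DEFAULT]\n" + f.read())

realtime_model = config["DEFAULT"].get("WHISPER_REAL_TIME_MODEL_SIZE", "medium")
print("real-time Whisper model size:", realtime_model)
```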

BIN
images/aggregate_input.png Normal file

Binary file not shown.

Size: 124 KiB

BIN
images/multi-output.png Normal file

Binary file not shown.

Size: 113 KiB

View File

@@ -14,23 +14,28 @@ WHISPER_MODEL_SIZE = config['DEFAULT']["WHISPER_MODEL_SIZE"]
FRAMES_PER_BUFFER = 8000
FORMAT = pyaudio.paInt16
CHANNELS = 1
CHANNELS = 2
RATE = 44100
RECORD_SECONDS = 5
RECORD_SECONDS = 15
def main():
p = pyaudio.PyAudio()
AUDIO_DEVICE_ID = -1
for i in range(p.get_device_count()):
if p.get_device_info_by_index(i)["name"] == "ref-agg-input":
AUDIO_DEVICE_ID = i
audio_devices = p.get_device_info_by_index(AUDIO_DEVICE_ID)
stream = p.open(
format=FORMAT,
channels=CHANNELS,
rate=RATE,
input=True,
frames_per_buffer=FRAMES_PER_BUFFER
frames_per_buffer=FRAMES_PER_BUFFER,
input_device_index=audio_devices['index']
)
pipeline = FlaxWhisperPipline("openai/whisper-" + WHISPER_MODEL_SIZE,
pipeline = FlaxWhisperPipline("openai/whisper-" + config["DEFAULT"]["WHISPER_REAL_TIME_MODEL_SIZE"],
dtype=jnp.float16,
batch_size=16)
@@ -48,8 +53,7 @@ def main():
listener = keyboard.Listener(on_press=on_press)
listener.start()
print("Listening...")
print("Attempting real-time transcription.. Listening...")
while proceed:
try:
frames = []
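For context, here is a minimal sketch of how the recording loop might continue from this point, assuming the frames read from the stream
are buffered for ```RECORD_SECONDS```, written to a temporary WAV file, and handed to the whisper-jax pipeline. The WAV hand-off and the
temporary-file approach are assumptions; the actual loop lives in ```whisjax_realtime_trial.py```:
```python
# Minimal sketch (assumption): buffer RECORD_SECONDS of audio from the stream,
# write it to a temporary WAV file, and transcribe it with the whisper-jax pipeline.
import tempfile
import wave

for _ in range(int(RATE / FRAMES_PER_BUFFER * RECORD_SECONDS)):
    data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
    frames.append(data)

with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
    with wave.open(tmp.name, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setframerate(RATE)
        wf.writeframes(b"".join(frames))
    # whisper-jax returns a dict containing the transcribed "text"
    print(pipeline(tmp.name)["text"])
```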