RecordViewer: call transcription
RecordViewer version 0.3 is a relatively big update that introduces speech-to-text conversion using whisper.cpp. Transcription runs on the local PC, so there are no privacy issues, and it is free (except for the CPU time and energy cost). While transcription accuracy might be limited, searching the text for a keyword is much easier than trying to find it in hundreds of audio recordings.
Example: filtering using "Łódź" text (name of a city in Poland):
and opening window with text transcription of the call:
RecordViewer is intended to work with tSIP and should be unzipped to its folder (keeping tSIP.exe and RecordViewer.exe next to each other). In my proposed setup whisper.cpp is placed in a subfolder and RecordViewer uses relative paths by default, so it should be portable.
As the whisper.cpp binaries from its repository currently do not work with AVX-only CPUs (Sandy Bridge and Ivy Bridge i3/i5/i7), I've included recompiled whisper.cpp binaries in the zip. If you have a Haswell or newer "Core" CPU, you can switch to the AVX2-based version of whisper.cpp.
I'm not hosting whisper.cpp model files at the moment, so you need to download one or more model files separately. There are two variants of model files: English-only and multilingual. I've made some tests using recordings in Polish and while the "base" model (~150 MB) gives rather poor accuracy, the "small" model (~460 MB) seems useful. I believe that for English-only transcriptions even the "base" ggml-base.en.bin model might be worth trying. Note that when transcribing a language other than English, the language code should be specified in RecordViewer settings.
Transcription can be started from the popup menu for a selected file or from the main menu ("Tools") for currently visible recordings that have no transcription yet. If you want to find a particular recording and you know who you talked with, you can use that as a filter before starting the transcription process from the "Tools" menu.
The transcription process is quite slow, so running it overnight is recommended. By default RecordViewer starts whisper.cpp conversion using 2 threads, but you can adjust this in settings (the whisper.cpp default is 4 threads).
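To make the settings mentioned above (model file, language code, thread count) more concrete, here is a minimal sketch of how a whisper.cpp command line could be assembled. It is not RecordViewer's actual code. The executable path, the file names and the function name `build_whisper_command` are assumptions made for illustration; the `-m`, `-f`, `-t`, `-l` and `-otxt` options are those of the whisper.cpp `main` example program.

```python
def build_whisper_command(wav_path, model_path, threads=2, language=None,
                          executable="whisper.cpp/main.exe"):
    """Build an argument list for the whisper.cpp command-line program.

    The executable location is a hypothetical relative path; adjust it to
    wherever whisper.cpp was unpacked.
    """
    cmd = [
        executable,
        "-m", model_path,      # ggml model file, e.g. ggml-small.bin
        "-f", wav_path,        # input WAV file
        "-t", str(threads),    # number of worker threads
        "-otxt",               # also write the result as a text file
    ]
    if language:
        # Needed for multilingual models when the speech is not English.
        cmd += ["-l", language]
    return cmd


if __name__ == "__main__":
    # Example: Polish recording, "small" multilingual model, 2 threads.
    print(build_whisper_command("call.wav", "ggml-small.bin", language="pl"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Raising the thread count shortens the conversion at the cost of making the PC less responsive while it runs.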
tSIP can record stereo files, using separate audio channels for transmitted and received audio, with clean separation that can be used to reliably distinguish call parties. While whisper.cpp version 1.4 has some support for diarization, it is very basic: audio samples from the left and right channels are mixed together anyway and diarization is based on a simple energy calculation for each channel. This is not reliable when both parties are talking at the same time. To make the best use of stereo recording in tSIP, RecordViewer splits the channels and runs whisper.cpp separately for each one. This is slower, but it makes it possible to precisely filter recordings by text that was transmitted, received, or either. It should also improve overall accuracy, because even if there is no double-talk, there is always some noise from each direction.
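The channel splitting described above can be illustrated with a short sketch using only Python's standard `wave` module. This is not RecordViewer's implementation, and the function name `split_stereo` is made up for the example. It assumes an uncompressed PCM WAV file and makes no assumption about which channel carries the transmitted or the received audio.

```python
import wave


def split_stereo(src_path, left_path, right_path):
    """Write the left and right channels of a stereo PCM WAV as two mono WAVs."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        width = src.getsampwidth()   # bytes per sample, e.g. 2 for 16-bit
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Each frame holds one left sample followed by one right sample.
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(width)
            dst.setframerate(rate)
            dst.writeframes(bytes(data))
```

Each mono file can then be transcribed on its own, which keeps the text of each party separate at the cost of running the conversion twice.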