Resource requirements for a Whisper STT web service
I've been working on a web service to provide speech-to-text transcriptions to Chief of Staff.
Originally, I presumed some GPU-muscled serverless function would be the way to go. However, on the services I looked at at the time, model loading meant extended delays even for short transcriptions.
I wondered: "How cheaply can a persistant instance of Whisper as a webservice be had, and how fast could it be to deliver reasonable quality?"
Ahmet Oner has a great starting place with his `whisper-asr-webservice` project. The project is based on FastAPI and offers a bare-bones way to access OpenAI's Whisper and @guillaumekln's `faster-whisper`.
It took me a minute, but I have a working setup: VPS standup and basic configuration with Terraform, plus GitHub Actions for CI build and deployment of this service. As I refine these, I may publish the setup so others can get a turnkey Whisper instance of their own.
My setup uses `nginx-proxy` and `jrcs/letsencrypt-nginx-proxy-companion` in front of the gunicorn web instance.
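For anyone wiring up the same front end, here's a minimal sketch of that arrangement as plain `docker run` commands. The hostname, email, and model choice are placeholders, and the image names and env vars are my reading of the `nginx-proxy` and `whisper-asr-webservice` docs, so check them against the current READMEs:
```
# Reverse proxy: watches the Docker socket and routes requests by VIRTUAL_HOST
docker run -d --name nginx-proxy \
  -p 80:80 -p 443:443 \
  -v certs:/etc/nginx/certs \
  -v vhost:/etc/nginx/vhost.d \
  -v html:/usr/share/nginx/html \
  -v /var/run/docker.sock:/tmp/docker.sock:ro \
  nginxproxy/nginx-proxy

# Companion: obtains and renews Let's Encrypt certs for proxied containers
# (you@example.com is a placeholder)
docker run -d --name letsencrypt-companion \
  --volumes-from nginx-proxy \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -e DEFAULT_EMAIL=you@example.com \
  jrcs/letsencrypt-nginx-proxy-companion

# The STT service itself; whisper-asr-webservice listens on port 9000
# (whisper.example.com is a placeholder hostname)
docker run -d --name whisper \
  -e VIRTUAL_HOST=whisper.example.com \
  -e VIRTUAL_PORT=9000 \
  -e LETSENCRYPT_HOST=whisper.example.com \
  -e ASR_MODEL=base \
  -e ASR_ENGINE=faster_whisper \
  onerahmet/openai-whisper-asr-webservice:latest
```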
Starting small
I kicked off my testing using the cheapest possible VPS droplet: "Regular Intel" with 1 vCPU and 1 GB of memory ($6/mo), aka `s-1vcpu-1gb`.
This is "1-click Docker Droplet" meaning it comes with a recent version of Docker installed at the time the instance is created.
Not Enough Resources
Perhaps unsurprisingly, I found this VPS was not enough to complete even a short transcription (at least with the default transcription configuration, which I'll have to spell out in more detail to provide real benchmarks).
It wasn't totally obvious the job was failing: `top` on the host showed Docker suddenly climbing to 80% of its ~500 MB RAM allocation, and CPU climbed too, but neither stayed high for long before the call failed.
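If you want per-container numbers instead of host-level `top`, `docker stats` makes the memory ceiling much easier to see:
```
# One-shot snapshot of each container's CPU and memory usage
docker stats --no-stream
```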
Here's a quick pass at the symptoms when this STT web service fails due to insufficient resources.
After the usual warning about FP16/FP32, I had the following logs:
```
/app/.venv/lib/python3.10/site-packages/whisper/transcribe.py:114: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[2023-10-11 05:17:18 +0000] [1] [WARNING] Worker with pid 25 was terminated due to signal 9
[2023-10-11 05:17:19 +0000] [27] [INFO] Booting worker with pid: 27
[2023-10-11 05:17:23 +0000] [27] [INFO] Started server process [27]
[2023-10-11 05:17:23 +0000] [27] [INFO] Waiting for application startup.
[2023-10-11 05:17:23 +0000] [27] [INFO] Application startup complete.
```
You can see more detail using `dmesg` on the host, which included this log line:
```
[47344.013944] Out of memory: Killed process 36117 (gunicorn) total-vm:1334116kB, anon-rss:621308kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:2120kB oom_score_adj:0
```
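A quick way to confirm the OOM killer was the culprit is to grep the kernel log for it:
```
# -T prints human-readable timestamps; grep for OOM-killer activity
dmesg -T | grep -i "out of memory"
```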
On the web side of things, the request came back with either a 503 or a 502; I'll have to run it again to confirm which.
Enough Resources
This is probably the key question: what is enough?
Once I had the server up and working, I cranked up the resources and found "Premium Intel" with 2 vCPUs and 2 GB RAM ($21/mo) handled inbound STT requests no problem.
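For what it's worth, my test requests look roughly like this; the `/asr` endpoint and `audio_file` form field are from the `whisper-asr-webservice` docs, while the hostname and audio file are placeholders:
```
# POST a local audio file to the /asr endpoint and get JSON back
# (whisper.example.com and sample.wav are placeholders)
curl -F "audio_file=@sample.wav" \
  "https://whisper.example.com/asr?task=transcribe&output=json"
```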
I haven't gone back down to find the middle ground, and I still need to spec the test file and configuration (whisper vs. faster-whisper) to do proper benchmarking.
I'll come back and update this entry when I can with more info.