The company is a US-based startup that provides an end-to-end platform for immigrants in the country, helping them secure jobs and loans, find communities, and much more. The startup recently pivoted from an agency model to a tech platform to achieve its goals at scale.
The startup needed to reach migrants at scale to build a database of eligible candidates for the multiple roles that were opening up. Scaling the telephonic operations was not possible without incurring significant overheads: tele-calling costs around $20 per hour, which was proving prohibitively expensive.
The tele-callers connected with the migrants and, after initial pleasantries, gathered information about their past work history, their ability to work in various stressful environments, and other basics. This flow needed to move to an AI agent to achieve parallelization and a significant cost reduction.
The greatest challenges in building a screening agent were these:
[1] Handling user interruption: On a call, users naturally interrupt the agent when it goes off track, or when the agent has itself cut the user off mid-speech. Handling these interruptions adeptly was important for delivering a smooth user experience.
[2] Latency of response: To make a conversation sound natural, it is important to keep the response latency below 1.5 to 2 seconds. However, a voice calling agent has multiple systems in place, each with its own latency, and bringing the combined latency below the 1.5-second mark is challenging. Moreover, it is important not to interrupt a user while they are speaking, so there needs to be a fixed pause of ~0.75 seconds, and only after that should the agent start forming its response.
[3] Keeping costs reasonable: Despite their benefits, AI agents are exorbitantly expensive in a voice use case. They only deliver real benefit to the customer when multiple cost-optimization measures are in place, and it is important to preserve conversation quality while optimizing cost.
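To make the latency constraint concrete, a rough turn-level budget can be sketched as below. The per-component numbers are illustrative assumptions, not measurements; only the ~0.75 s pause and the 1.5 to 2 second target come from the description above.

```python
# Hypothetical latency budget for one agent turn. The pause duration and
# the overall target come from the constraints above; the per-component
# latencies are assumed values for illustration only.
PAUSE_DETECTION_S = 0.75   # wait to confirm the user has finished speaking
BUDGET_S = 2.0             # upper end of the 1.5-2 s response target

components = {
    "speech_to_text": 0.30,      # assumed transcription latency
    "llm_first_token": 0.35,     # assumed time to first LLM token
    "text_to_speech": 0.25,      # assumed synthesis latency
    "network_telephony": 0.10,   # assumed transport overhead
}

processing = sum(components.values())
total = PAUSE_DETECTION_S + processing
print(f"processing: {processing:.2f}s, total: {total:.2f}s, "
      f"within budget: {total <= BUDGET_S}")
```

Even with optimistic component numbers, the mandatory pause consumes a large share of the budget, which is why the streaming and chunking techniques described later matter so much.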
For a voice-to-voice interaction, we had to use three important models:
[1] Speech to Text: This model transcribed the user's voice frames into text for processing. We used Deepgram for this.
[2] Text Processing: We used GPT-4o mini to process the conversation and form coherent responses. It is in this model that we handled user interruption scenarios, filler words, and so on.
[3] Text to Speech: This model converted the processed text back to speech and sent it to the user as audio. We used Amazon Polly's generative voices for this to make the output sound as natural as possible.
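The three stages above can be sketched as a single turn handler. The `transcribe`, `respond`, and `synthesize` functions below are stand-ins for the Deepgram, GPT-4o mini, and Amazon Polly integrations; the real system streams audio asynchronously rather than processing whole turns.

```python
# Minimal sketch of the three-stage voice pipeline. All three functions
# are placeholders for the real streaming API integrations.

def transcribe(audio_frames: bytes) -> str:
    """Speech-to-text stage (Deepgram in the actual system)."""
    return "I worked two years in a warehouse."  # placeholder transcript

def respond(transcript: str, history: list) -> str:
    """Text-processing stage (GPT-4o mini in the actual system)."""
    history.append({"role": "user", "content": transcript})
    reply = "Thanks! And are you comfortable with night shifts?"  # placeholder
    history.append({"role": "assistant", "content": reply})
    return reply

def synthesize(text: str) -> bytes:
    """Text-to-speech stage (Amazon Polly in the actual system)."""
    return text.encode("utf-8")  # placeholder for synthesized audio

def handle_turn(audio_frames: bytes, history: list) -> bytes:
    """One conversational turn: audio in, audio out."""
    return synthesize(respond(transcribe(audio_frames), history))
```

Keeping the stages behind simple function boundaries like this also makes it easy to swap a provider in one stage without touching the others.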
We used Twilio for telephony so that calls could be placed to any mobile number.
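Placing the outbound call via Twilio's REST API looks roughly like the sketch below. The credentials, phone numbers, and webhook URL are placeholders; the `url` parameter points to a TwiML webhook that bridges the call audio to the agent.

```python
# Sketch of placing an outbound screening call with Twilio. All
# identifiers below (SID, token, numbers, webhook URL) are placeholders.

def build_call_params(to_number: str, from_number: str, webhook_url: str) -> dict:
    # Keyword arguments for twilio's Client.calls.create(); 'url' is the
    # TwiML webhook Twilio fetches when the callee answers.
    return {"to": to_number, "from_": from_number, "url": webhook_url}

def place_call(params: dict):
    # Requires the `twilio` package and valid account credentials.
    from twilio.rest import Client
    client = Client("ACXXXXXXXX", "auth_token")  # placeholder credentials
    return client.calls.create(**params)
```

In production the webhook would respond with TwiML that streams the call audio to the speech-to-text stage.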
We used multiple approaches to reduce the latency.
[1] Split sentences into chunks intelligently for faster processing
[2] Used hard-coded responses where the LLM was not needed
[3] Used OpenAI's streaming (socket) connections so that words stream out as they are generated instead of arriving in bulk
[4] Intelligent pause detection that factors in the user's speaking speed when deciding the pause duration. This ensured the user was not interrupted while still keeping latency low
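Two of the techniques above can be sketched briefly. The first function flushes streamed LLM tokens to text-to-speech at each sentence boundary instead of waiting for the full response (the token stream below is simulated; the real source is OpenAI's streaming API). The second is an assumed heuristic for speed-aware pause detection: faster speakers get a shorter wait, with the result clamped to a sensible range. The specific constants are illustrative, not the production values.

```python
import re

# Flush each complete sentence to TTS as soon as it appears in the
# token stream, rather than waiting for the whole response.
SENTENCE_END = re.compile(r"[.!?]\s*$")

def stream_to_chunks(token_stream):
    buffer = ""
    for token in token_stream:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()   # this sentence can go to TTS now
            buffer = ""
    if buffer.strip():
        yield buffer.strip()       # flush any trailing partial sentence

tokens = ["Thanks ", "for ", "that. ", "Can ", "you ", "work ", "nights?"]
chunks = list(stream_to_chunks(tokens))

# Speed-aware pause detection (assumed heuristic): scale a ~0.75 s base
# pause by how fast the user talks relative to a typical rate, clamped.
def pause_threshold(words_per_minute: float, base: float = 0.75) -> float:
    typical_wpm = 150.0  # assumed typical speaking rate
    scaled = base * (typical_wpm / max(words_per_minute, 1.0))
    return min(max(scaled, 0.4), 1.2)  # clamp to an assumed safe range
```

Sentence-level chunking lets synthesis of the first sentence begin while later sentences are still being generated, which is where most of the perceived latency reduction comes from.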
We used the following cost optimization techniques:
[1] Using lighter speech-synthesis models and LLMs wherever possible
[2] Hard-coded (decision-tree based) audio clips wherever possible
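The decision-tree idea amounts to routing fixed conversational moves to pre-recorded audio, with only open-ended turns paying for LLM and synthesis calls. The intents and audio paths below are illustrative, not the startup's actual taxonomy.

```python
# Route fixed conversational moves to pre-recorded audio to avoid
# LLM and TTS cost; everything else falls through to generation.
# Intent names and file paths are illustrative placeholders.
CANNED_AUDIO = {
    "greeting": "audio/greeting.mp3",
    "goodbye": "audio/goodbye.mp3",
    "repeat_question": "audio/repeat.mp3",
}

def route_response(intent: str):
    if intent in CANNED_AUDIO:
        return ("canned", CANNED_AUDIO[intent])  # zero marginal model cost
    return ("llm", None)  # pay for GPT-4o mini + speech synthesis
```

Since screening calls follow a fairly predictable script, a large fraction of turns can hit the canned path, which is what makes the per-call economics work.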
It was important to monitor and assess conversation quality so we could continuously improve it. We explored options such as alan.app for this, and the implementation is ongoing.