When I joined Praxis Labs, they were a VR company using Unity to create 3D Human Resources training experiences, which were used in in-person trainings with Quest headsets. Because fewer people were in the office in 2020, they pivoted to Unity WebGL. I made an AR phone prototype of the app, but we decided to go with web because it was the most accessible option for all users. The experiences were short interactive fiction modules, each centering on a different HR problem.

In 2024 we started integrating AI into the experiences. We added voice recognition using the offline Recognissimo plugin, so users could improvise responses to the characters instead of picking from a prewritten list, and used OpenAI to judge the quality of those responses. We then moved to AWS Transcribe over WebSockets for better recognition quality. Here's a link to a page on the finished product.

This proved very popular with clients, so we decided to abandon Unity and move to a fully AI-driven, React-based experience, using Hume for voice input, LLM response generation and voice output, and OpenAI to judge the quality of users' responses. Hume's voice output wasn't completely satisfactory (voices would vary in pitch and tone over the course of a conversation), so I researched other services extensively, but none matched Hume's combination of integration, responsiveness and price. ElevenLabs had particularly high-quality voices, but it was more expensive and didn't support enough simultaneous users for a large cohort of learners.

I also researched lip-sync solutions in the hope of making fully animated avatars. I particularly liked Azure's Viseme, but it didn't integrate with Hume. I produced a prototype that generated lip-synced animated avatars, using Three.js as the engine and Viseme to produce the facial animation, but it had much greater latency than Hume on its own, so it never got off the launchpad.
