Looks like even for the non-realtime API they're charging $200/M tokens for output audio. Their current TTS API is $15/M characters for output audio, which works out to about $60/M tokens if each token is around 4 characters. Then add in the manual piping to the 4o LLM, which is $15/M, for around $75/M total.
So going from $75/M to $200/M is a big premium for the convenience of one model and the quality of multimodal input/output. Will have to test and see if it's worth it.
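For anyone who wants to rerun that back-of-envelope math with their own numbers, here is a quick sketch. The prices are the ones quoted above; the 4 characters per token figure is a rough assumption, and so is treating audio output tokens as comparable to text tokens. The variable and function names are made up for illustration.

```python
# Prices in USD as quoted in the comment above (not re-checked against a price list).
TTS_PER_M_CHARS = 15.0            # TTS API, per million characters
LLM_PER_M_TOKENS = 15.0           # 4o LLM, per million tokens
AUDIO_OUTPUT_PER_M_TOKENS = 200.0 # audio output, per million tokens
CHARS_PER_TOKEN = 4               # assumption: roughly 4 characters per token


def piped_cost_per_m_tokens(chars_per_token=CHARS_PER_TOKEN):
    """Cost per million tokens when piping LLM text into the TTS API."""
    # Convert the per-character TTS price into a per-token price.
    tts_per_m_tokens = TTS_PER_M_CHARS * chars_per_token
    return tts_per_m_tokens + LLM_PER_M_TOKENS


piped = piped_cost_per_m_tokens()
premium = AUDIO_OUTPUT_PER_M_TOKENS / piped

print(f"piped LLM + TTS: ${piped:.0f}/M tokens")
print(f"single audio model: ${AUDIO_OUTPUT_PER_M_TOKENS:.0f}/M tokens")
print(f"premium: {premium:.2f}x")
```

The result is sensitive to the characters-per-token guess: at 5 characters per token the piped setup comes to $90/M, which narrows the gap a bit.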
Also, is there still no way to connect users directly to OpenAI? Like directly from a user's browser to OpenAI's servers, without the user having to supply their own API key? How does this work with realtime, which needs websockets? Do I need an intermediate proxy server for all my users' conversations? Seems like a waste of bandwidth, an unnecessary failure point, and a privacy problem. I hope I am wrong.