Audio processing
Eden AI aggregates a large set of audio processing features from a variety of well-known providers like amazon, google, Microsoft, deepgram, assembly and many others. With just a few lines of code, and without any heavy configuration, Eden AI allows you to use speech and audio technologies quickly and efficiently.
Subfeatures
Below is a complete list of all audio features on Eden AI:
Speech-to-text
Speech-to-text technology lets you transcribe audio/video files: it recognizes spoken words within a file and converts them to readable text using well-trained AI models.
The speech-to-text feature is designed to be asynchronous, meaning that the analysis happens in the background and the results become accessible when ready. [Please refer to this section](#asynchronous-process)
Supported languages, constraints and features
The languages supported by Eden AI for each provider are listed in the following Excel table, along with the features and file extensions that are provider-specific.
Indication
Before proceeding with your transcription, please verify that the file encoding you want to submit and the features you want to use are supported by the provider(s) you plan to use: Ultimate Guide
Supported programming languages
Eden AI speech-to-text has no programming-language constraints: you can call it from any programming language through a single API. Getting Started
Note
An open-source version of Eden AI is available on GitHub as a python module.
Usage
- The speech-to-text analysis starts with a POST request to the API endpoint with the audio/video file to process, the features to consider (e.g. filter profanities or provide a custom vocabulary) and the list of providers to use. The API call will return a job ID.
import requests

headers = {"Authorization": "Bearer Your_API_Key"}
url = "https://api.edenai.run/v2/audio/speech_to_text_async"
data = {"providers": "google,amazon,assembly", "language": "en"}

# Upload the audio/video file; the call returns a job ID, not the transcript
with open("path/to/your/sound.mp3", "rb") as f:
    response = requests.post(url, data=data, files={"file": f}, headers=headers)
data = response.json()
For a successful call, the API returns a JSON object with a public_id key whose value is the job ID.
{
"public_id": "9ac4dbe1-0990-4691-b5ed-4a84e83de357"
}
- A GET request with the job ID is then needed to retrieve the completion status:
url = "https://api.edenai.run/v2/audio/speech_to_text_async/{public_id}"
headers = {
"accept": "application/json",
"authorization": "Bearer π Your_API_Key"
}
response = requests.get(url, headers=headers)
data = response.json()
- Finally, when the audio processing is finished, you can access the processing results by performing the same GET request again. Here is an example of a call response using both amazon and assembly:
{
"public_id": "96732b2a-d810-4a87-a266-01973877e11",
"status": "finished",
"error": null,
"results": {
"assembly": {
"error": null,
"id": "53d3f72b-e99c-42b3-8d8c-44bbbcd3bed5",
"final_status": "succeeded",
"text": "There's always a ridiculously long line for the red bean cakes. I know. It's weird. I tried them once and couldn't for the life of me figure out what was so special about them. We're lucky they don't all like the Chiver Rolls instead. No waiting. Ten chiverls, please. Here you are. Be careful, they're hot. That'll be 130 anti dollars. Thanks, I'm stuffed. Those things are really filling.",
"diarization": {
"entries": [
],
"error_message": null,
"total_speakers": 1
}
},
"amazon": {
"error": null,
"id": "6450ac77-4740-447f-ad7d-f26f4ad3259c",
"final_status": "succeeded",
"text": "There's always a ridiculously long line for the red bean cakes. I know it's weird. I tried them once and couldn't for the life of me figure out what was so special about them. We're lucky. They don't all like the chive rolls instead. No waiting 10 chive rolls, please. Here you are. Be careful. They're hot. That'll be 130 anti dollars. Thanks. I'm stuffed. Those things are really filling.",
"diarization": {
"entries": [
],
"error_message": null,
"total_speakers": 1
}
}
}
}
Note
The status key within the API response indicates whether all the providers' calls are finished. However, it is still possible to access one provider's response as soon as it is finished, independently of the other providers' status.
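Since the job runs in the background, a common pattern is to poll the status endpoint until it reports finished. Here is a minimal polling sketch based on the response shape shown above; the 5-second interval is an arbitrary choice, and real code should also handle failure statuses:
import time
import requests

headers = {"authorization": "Bearer Your_API_Key"}
public_id = "9ac4dbe1-0990-4691-b5ed-4a84e83de357"  # job ID from the POST response
url = f"https://api.edenai.run/v2/audio/speech_to_text_async/{public_id}"

# Poll until the job reports "finished" (interval is an arbitrary choice)
while True:
    data = requests.get(url, headers=headers).json()
    if data["status"] == "finished":
        break
    time.sleep(5)

# Each provider's result lives under its own key in "results"
for provider, result in data["results"].items():
    print(provider, "->", result["text"])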
Features
Speakers Labels (Diarization)
Speaker diarization automatically detects the number of speakers and recognizes speaker changes. Speakers are labeled as Speaker 1, Speaker 2, etc., and each word or segment of text is associated with its speaker tag and a confidence score between 0 and 1.
Speaker diarization is enabled by default; no additional parameter is required. However, you can specify the number of speakers to consider for the diarization. To do so, include the speakers parameter with the value equal to the number of speakers.
When the transcription is completed, you'll notice the diarization key in the JSON response, as shown below. This key contains the number of detected speakers, a list of entries within the entries key, and an error_message if something went wrong. The entries key contains the diarization result: a list of words or segments of text along with their start time and end time (in seconds) from the beginning of the audio stream.
{
"diarization": {
"entries": [
{
"segment": "There's",
"speaker": 1,
"end_time": "0.954",
"confidence": 0.51003,
"start_time": "0.73"
},
{
"segment": "always",
"speaker": 1,
"end_time": "1.31",
"confidence": 0.99925,
"start_time": "1.002"
},
...
],
"error_message": null,
"total_speakers": 1
}
}
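As an illustration, the sketch below groups the entries list into one transcript line per speaker. The entries are copied from the example above; the grouping logic is just one possible way to consume them:
from collections import defaultdict

# "diarization" object taken from a provider's result (see the JSON above)
diarization = {
    "entries": [
        {"segment": "There's", "speaker": 1, "start_time": "0.73", "end_time": "0.954", "confidence": 0.51003},
        {"segment": "always", "speaker": 1, "start_time": "1.002", "end_time": "1.31", "confidence": 0.99925},
    ],
    "error_message": None,
    "total_speakers": 1,
}

# Collect segments per speaker label, preserving word order
speakers = defaultdict(list)
for entry in diarization["entries"]:
    speakers[entry["speaker"]].append(entry["segment"])

for speaker, segments in sorted(speakers.items()):
    print(f"Speaker {speaker}: {' '.join(segments)}")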
Profanity filter
Eden AI allows you to detect recognized profanity and remove it from the transcription result. By default (except for some providers), the API returns a verbatim transcription of the audio, meaning profanity will be present in the transcript if spoken in the audio.
To replace profanity with asterisks as shown below, or to convert it to the nearest recognized non-profane word (which is the case with deepgram), add the profanity_filter parameter to your request.
import requests

headers = {"Authorization": "Bearer Your_API_Key"}
url = "https://api.edenai.run/v2/audio/speech_to_text_async"
data = {
    "providers": "google,amazon,assembly",
    "language": "en",
    "speakers": 2,
    "profanity_filter": True  # Python boolean, not the JSON literal "true"
}

with open("path/to/your/sound.mp3", "rb") as f:
    response = requests.post(url, data=data, files={"file": f}, headers=headers)
data = response.json()
How are you? How are you not in f** school?
Note
When enabled, profanity will also be replaced with asterisks in the diarization key.
Indication
Check the [speech-to-text ultimate guide](#ultimate-guide) for profanity filter support across the available providers.
Automatic Language Detection
The automatic language detection feature identifies the dominant language spoken in an audio file. When no language code is provided, providers that support automatic language detection will recognize and detect the spoken language automatically.
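In practice, this means simply omitting the language field from the POST request, as in the sketch below (whether omission is sufficient may vary by provider, so check the guide linked underneath):
import requests

headers = {"Authorization": "Bearer Your_API_Key"}
url = "https://api.edenai.run/v2/audio/speech_to_text_async"

# No "language" key: providers that support automatic language
# detection will identify the spoken language on their own
data = {"providers": "google,amazon,assembly"}

with open("path/to/your/sound.mp3", "rb") as f:
    response = requests.post(url, data=data, files={"file": f}, headers=headers)
print(response.json())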
Indication
Check the [speech-to-text ultimate guide](#ultimate-guide) for automatic language detection support across the available providers.
Custom Vocabulary
Custom vocabulary helps speech-to-text recognize specific words or phrases that are used more frequently within a context. For example, if your speech includes the word meet and you want your provider to transcribe it as meet rather than meat, include the custom_vocabulary parameter in your POST request with a list of words or phrases to bias the providers' models toward the desired words (in this case, the word meet).
import requests

headers = {"Authorization": "Bearer Your_API_Key"}
url = "https://api.edenai.run/v2/audio/speech_to_text_async"
data = {
    "providers": "google,amazon,assembly",
    "custom_vocabulary": "meet, sell, Los Angeles"  # words/phrases to bias toward
}

with open("path/to/your/sound.mp3", "rb") as f:
    response = requests.post(url, data=data, files={"file": f}, headers=headers)
data = response.json()
Indication
Check the [speech-to-text ultimate guide](#ultimate-guide) for custom vocabulary support across the available providers.
Pricing
Speech-to-text pricing is calculated per unit of 1 second, 15 seconds, or 60 seconds, depending on the selected provider (for example, Google is billed per 15 seconds and NeuralSpace per minute). Cost prices are displayed per unit of 15 seconds. Please refer to the pricing table on the right.
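As a rough illustration, assuming a provider that bills in 15-second units and rounds partial units up (the rounding behavior is an assumption, not something documented here), the cost of a clip can be estimated like this:
import math

def estimate_cost(duration_seconds: float, price_per_unit: float, unit_seconds: int = 15) -> float:
    """Estimate speech-to-text cost, assuming partial units are rounded up."""
    units = math.ceil(duration_seconds / unit_seconds)
    return units * price_per_unit

# Hypothetical price of $0.006 per 15-second unit for a 100-second clip
print(estimate_cost(100, 0.006))  # 7 units -> 0.042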
Text-to-speech
Text-to-speech converts written text into speech audio.
Eden AI aggregates the best providers in the market; you can, for example, use one or more of the providers available on our platform, like amazon, google or Microsoft.
To perform a text-to-speech call as shown below, you need to pass the text that you want to synthesize, select a voice gender and a language:
- Text: the text you want to synthesize. You can provide the input as plain text or in Speech Synthesis Markup Language (SSML) format. With SSML you can control various aspects of speech, such as pronunciation, volume, pitch, and speech rate.
- Option: the voice gender, with either MALE or FEMALE values.
- Language: the language to use when synthesizing the provided text.
import requests

headers = {"Authorization": "Bearer Your_API_Key"}
url = "https://api.edenai.run/v2/audio/text_to_speech"
payload = {
    "providers": "google,amazon",
    "language": "en",
    "option": "MALE",
    "text": "this is a test"
}
response = requests.post(url, json=payload, headers=headers)
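To save the synthesized audio, you can decode it from the response. The sketch below assumes the response contains one object per provider with a base64-encoded audio field; both the layout and the field name are assumptions here, so check the API reference for the exact schema:
import base64

# Assumed response layout: {"google": {"audio": "<base64>", ...}, "amazon": {...}}
result = response.json()
for provider in ("google", "amazon"):
    audio_b64 = result[provider]["audio"]  # field name is an assumption
    with open(f"{provider}_output.mp3", "wb") as out:
        out.write(base64.b64decode(audio_b64))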
Supported Languages
The languages supported by Eden AI for each provider are listed in the following Excel table, along with the features and audio types that are provider-specific. Text-to-speech ultimate guide
Supported programming languages
Eden AI text-to-speech has no programming-language constraints: you can call it from any programming language through a single API. Getting Started
Note
An open-source version of Eden AI is available on GitHub as a python module.
Pricing
Text-to-speech pricing is calculated per unit of 1 character. For IBM exceptionally, the cost is calculated per unit of 1,000 characters. Cost prices are displayed per unit of 1 million characters. Please refer to the pricing table on the right.
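Since display prices are per 1 million characters, converting one into the cost of a given text is a one-line calculation (the price below is hypothetical):
def tts_cost(text: str, price_per_million_chars: float) -> float:
    """Cost of synthesizing `text` at a price displayed per 1M characters."""
    return len(text) * price_per_million_chars / 1_000_000

# Hypothetical display price of $16 per 1M characters
print(tts_cost("this is a test", 16.0))  # 14 chars -> 0.000224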