πŸ”Š Audio processing

Eden AI aggregates a large set of audio processing features from a variety of well-known providers such as Amazon, Google, Microsoft, Deepgram, Assembly and many others. With just a few lines of code, and without any heavy configuration, Eden AI allows you to use speech and audio technologies quickly and efficiently.

Subfeatures

Below is a complete list of all audio features available on Eden AI:

Speech-to-text

Speech-to-text technology transcribes audio and video files: it recognizes the spoken words within a file and converts them into readable text using well-trained AI models.

The speech-to-text feature is asynchronous, meaning that the analysis runs in the background and the results become accessible when ready. [Please refer to this section](#Asynchronous Process)

Supported languages, constraints and features

The languages supported by Eden AI for each provider are listed in the following Excel table, along with the provider-specific features and file extensions.

🚧

Indication

Before proceeding with your transcription, please verify that the file encoding you want to submit and the features you want to use are supported by the provider(s) you have selected. Ultimate Guide

Supported programming languages

Eden AI speech-to-text is free of programming language constraints, which means you can use any programming language through a single API. Getting Started

πŸ“˜

Note

An open-source version of Eden AI is available on GitHub as a Python module.

Usage

  1. The speech-to-text analysis starts with a POST request to the API endpoint, with the audio/video file to process, the features to apply (e.g. filtering profanity or providing a custom vocabulary) and the list of providers to use. The API call returns a job ID.
import requests

headers = {"Authorization": "Bearer πŸ”‘ Your_API_Key"}

url = "https://api.edenai.run/v2/audio/speech_to_text_async"
data = {"providers": "google,amazon,assembly", "language": "en"}

# Attach the audio/video file to transcribe
files = {"file": open("πŸ”Š path/to/your/sound.mp3", "rb")}

response = requests.post(url, data=data, files=files, headers=headers)
data = response.json()

For a successful call, the API returns a JSON object with a public_id key holding the job ID value.

{
    "public_id": "9ac4dbe1-0990-4691-b5ed-4a84e83de357"
}
  2. Next, send a GET request to retrieve the completion status using the job ID
# Use the job ID returned by the POST request
public_id = data["public_id"]
url = f"https://api.edenai.run/v2/audio/speech_to_text_async/{public_id}"

headers = {
    "accept": "application/json",
    "authorization": "Bearer πŸ”‘ Your_API_Key"
}

response = requests.get(url, headers=headers)
data = response.json()
  3. Finally, once the audio processing is finished, you can access the results by performing the same GET request again. Here is an example of a response using both amazon and assembly:
{
    "public_id": "96732b2a-d810-4a87-a266-01973877e11",
    "status": "finished",
    "error": null,
    "results": {
        "assembly": {
            "error": null,
            "id": "53d3f72b-e99c-42b3-8d8c-44bbbcd3bed5",
            "final_status": "succeeded",
            "text": "There's always a ridiculously long line for the red bean cakes. I know. It's weird. I tried them once and couldn't for the life of me figure out what was so special about them. We're lucky they don't all like the Chiver Rolls instead. No waiting. Ten chiverls, please. Here you are. Be careful, they're hot. That'll be 130 anti dollars. Thanks, I'm stuffed. Those things are really filling.",
            "diarization": {
                "entries": [
                    
                ],
                "error_message": null,
                "total_speakers": 1
            }
        },
        "amazon": {
            "error": null,
            "id": "6450ac77-4740-447f-ad7d-f26f4ad3259c",
            "final_status": "succeeded",
            "text": "There's always a ridiculously long line for the red bean cakes. I know it's weird. I tried them once and couldn't for the life of me figure out what was so special about them. We're lucky. They don't all like the chive rolls instead. No waiting 10 chive rolls, please. Here you are. Be careful. They're hot. That'll be 130 anti dollars. Thanks. I'm stuffed. Those things are really filling.",
            "diarization": {
                "entries": [
                    
                ],
                "error_message": null,
                "total_speakers": 1
            }
        }
    }
}

πŸ“˜

Note

The status key in the API response indicates whether all the providers' calls have finished. However, it is still possible to access one provider's response as soon as it has finished, independently of the other providers' status.
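
Putting the two calls together, here is a minimal polling sketch (the public_id variable and the finished status value come from the examples above; the 10-second interval is an arbitrary choice):

import time

import requests

headers = {"Authorization": "Bearer πŸ”‘ Your_API_Key"}
url = f"https://api.edenai.run/v2/audio/speech_to_text_async/{public_id}"

# Poll the endpoint until every provider has finished
while True:
    data = requests.get(url, headers=headers).json()
    if data["status"] == "finished":
        break
    time.sleep(10)  # wait between polls to avoid hammering the API

# Print each provider's transcript
for provider, result in data["results"].items():
    print(provider, "->", result["text"])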

Features

Speakers Labels (Diarization)

Speaker diarization automatically detects the number of speakers and recognizes speaker changes. Speakers are labeled Speaker 1, Speaker 2, etc., and each word or segment of text is associated with its speaker tag along with a confidence score between 0 and 1.

Speaker diarization is enabled by default; no additional parameter is required. However, you can optionally specify the number of speakers to consider for speaker diarization. To do so, include the speakers parameter with a value equal to the number of speakers, as shown below.
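
For example, the request payload from the Usage section becomes:

data = {
    "providers": "google,amazon,assembly",
    "language": "en",
    "speakers": 2  # expected number of speakers for diarization
}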

When the transcription is complete, you'll notice the diarization key in the JSON response, as shown below. This key contains the number of detected speakers, a list of entries under the entries key, and an error_message if something went wrong. The entries key holds the diarization result: a list of words or segments of text along with their start and end times (in seconds) from the beginning of the audio stream.

{
    "diarization": {
                "entries": [
                    {
                        "segment": "There's",
                        "speaker": 1,
                        "end_time": "0.954",
                        "confidence": 0.51003,
                        "start_time": "0.73"
                    },
                    {
                        "segment": "always",
                        "speaker": 1,
                        "end_time": "1.31",
                        "confidence": 0.99925,
                        "start_time": "1.002"
                    },
                    ...
                ],
                "error_message": null,
                "total_speakers": 1
            }
}
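
As an illustration of how this structure can be consumed (a minimal sketch assuming data holds the JSON response shown in the Usage section), the entries can be regrouped into one transcript per speaker:

# Group diarization segments by speaker
diarization = data["results"]["amazon"]["diarization"]

transcript_by_speaker = {}
for entry in diarization["entries"]:
    transcript_by_speaker.setdefault(entry["speaker"], []).append(entry["segment"])

for speaker, segments in transcript_by_speaker.items():
    print(f"Speaker {speaker}: {' '.join(segments)}")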

Profanity filter

Eden AI allows you to detect recognized profanity and remove it from the transcription result. By default (except for some providers), the API returns a verbatim transcription of the audio, meaning profanity will be present in the transcript if spoken in the audio.

To replace profanity with asterisks as shown below, or to convert it to the nearest recognized non-profane word (as deepgram does), add the profanity_filter parameter to your request.

import requests

headers = {"Authorization": "Bearer πŸ”‘ Your_API_Key"}

url = "https://api.edenai.run/v2/audio/speech_to_text_async"
data = {
    "providers": "google,amazon,assembly",
    "language": "en",
    "speakers": 2,
    "profanity_filter": True  # mask or replace recognized profanity
}

files = {"file": open("πŸ”Š path/to/your/sound.mp3", "rb")}

response = requests.post(url, data=data, files=files, headers=headers)
data = response.json()

How are you? How are you not in f** school?

πŸ“˜

Note

When enabled, profanity is also replaced with asterisks in the diarization key.

🚧

Indication

Check the [speech-to-text ultimate guide](#Ultimate Guide) to see which providers support the profanity filter.

Automatic Language Detection

The automatic language detection feature identifies the dominant language spoken in an audio file. When no language code is provided, providers that support automatic language detection will automatically recognize the spoken language.
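
For example (a minimal sketch reusing the request from the Usage section), simply omit the language key from the payload:

import requests

headers = {"Authorization": "Bearer πŸ”‘ Your_API_Key"}

url = "https://api.edenai.run/v2/audio/speech_to_text_async"
# No "language" key: supporting providers will detect the spoken language
data = {"providers": "google,amazon,assembly"}

files = {"file": open("πŸ”Š path/to/your/sound.mp3", "rb")}

response = requests.post(url, data=data, files=files, headers=headers)
data = response.json()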

🚧

Indication

Check the [speech-to-text ultimate guide](#Ultimate Guide) to see which providers support automatic language detection.

Custom Vocabulary

Custom vocabulary helps speech-to-text recognize specific words or phrases that are used more frequently within a given context. For example, if your speech includes the word meet and you want your provider to transcribe it as meet rather than meat, include the custom_vocabulary parameter in your POST request with a list of words or phrases to bias the providers' models toward the desired words (in this case, meet).

import requests

headers = {"Authorization": "Bearer πŸ”‘ Your_API_Key"}

url = "https://api.edenai.run/v2/audio/speech_to_text_async"
data = {
    "providers": "google,amazon,assembly",
    # Words or phrases to bias the models towards
    "custom_vocabulary": "meet, sell, Los Angeles"
}

files = {"file": open("πŸ”Š path/to/your/sound.mp3", "rb")}

response = requests.post(url, data=data, files=files, headers=headers)
data = response.json()

🚧

Indication

Check the [speech-to-text ultimate guide](#Ultimate Guide) to see which providers support custom vocabulary.

Pricing

Speech-to-text pricing is calculated per unit of 1 second, 15 seconds or 60 seconds, depending on the selected provider: most providers are billed per second, while Google, for example, is billed per 15 seconds and NeuralSpace per minute. Cost prices are displayed per unit of 15 seconds. Please refer to the pricing table on the right.
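
As a back-of-the-envelope sketch (the price and the rounding-up of partial units are illustrative assumptions, not actual Eden AI rates):

import math

duration_seconds = 70    # length of the audio file
unit_seconds = 15        # billing unit for a provider billed per 15 seconds
price_per_unit = 0.006   # hypothetical price in USD, for illustration only

billed_units = math.ceil(duration_seconds / unit_seconds)  # 5 units
print(f"{billed_units} units -> ${billed_units * price_per_unit:.3f}")  # $0.030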

Text-to-speech

Text-to-speech converts natural language text into speech audio.

Eden AI aggregates the best providers on the market: you can, for example, use one or more of the providers available on our platform, such as Amazon, Google or Microsoft.

To perform text-to-speech as shown below, you need to pass the text you want to synthesize and select a voice gender and a language.

  • Text: Provide the text you want to synthesize. You can provide the input as plain text or in Speech Synthesis Markup Language (SSML) format. With SSML you can control various aspects of speech, such as pronunciation, volume, pitch, and speech rate.
  • Option: Select a voice gender with either MALE or FEMALE values.
  • Language: Provide the language to use when synthesize the provided text
import requests

headers = {"Authorization": "Bearer πŸ”‘ Your_API_Key"}

url = "https://api.edenai.run/v2/audio/text_to_speech"
payload = {
    "providers": "google,amazon",
    "language": "en",
    "option": "MALE",  # voice gender: MALE or FEMALE
    "text": "this is a test"
}

response = requests.post(url, json=payload, headers=headers)
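
Since the text field also accepts SSML, a request could look like this (a minimal sketch; the SSML markup is illustrative and support may vary by provider):

import requests

headers = {"Authorization": "Bearer πŸ”‘ Your_API_Key"}

url = "https://api.edenai.run/v2/audio/text_to_speech"
payload = {
    "providers": "google,amazon",
    "language": "en",
    "option": "FEMALE",
    # SSML input: insert a one-second pause and slow down part of the speech
    "text": "<speak>This is a test.<break time='1s'/>"
            "<prosody rate='slow'>Spoken slowly.</prosody></speak>"
}

response = requests.post(url, json=payload, headers=headers)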

Supported Languages

The languages supported by Eden AI for each provider are listed in the following Excel table, along with the provider-specific features and audio types. Text-to-speech ultimate guide

Supported programming languages

Eden AI text-to-speech is free of programming language constraints, which means you can use any programming language through a single API. Getting Started

πŸ“˜

Note

An open-source version of Eden AI is available on GitHub as a Python module.

Pricing

Text-to-speech pricing is calculated per unit of 1 character. For IBM, exceptionally, the cost is calculated per unit of 1,000 characters. Cost prices are displayed per unit of 1 million characters. Please refer to the pricing table on the right.
