Amazon Just Made AI Easier

Among the product announcements from AWS re:Invent 2016, a new triplet of production web services has emerged under the heady title of Artificial Intelligence.

The three services, Lex, Polly and Rekognition, provide speech recognition and natural language processing, speech synthesis, and image-based face/object detection, respectively. Amazon has tackled the biggest blocker to broader uptake: it is providing these services “ready to run” against trained and tuned deep learning networks, taking the more expensive components of development and deployment out of the equation.

To achieve this, AWS will have had to use a huge amount of training data and tune the network design and architecture to respond reliably to a wide range of input cases. There are always edge cases, and it will be interesting to see how well the services perform in the real world.

Conversational User Interfaces
With the announcement of Lex and Polly, Amazon has made text-to-speech, automatic speech recognition and natural language processing a set of scalable managed services. Individually, these mechanisms are all valuable tools that are notoriously complicated to engineer and configure. Together, however, they represent something far more significant in the evolution of how users interact with computer systems.

Verbal communication interfaces for computer systems have been ubiquitous for a long time. They have allowed users to interact with software hands-free, often to aid accessibility and dictation. But coupled with natural language processing, the nature of how we interact with software shifts. To fully comprehend the significance of this, it is important to note what makes Amazon remarkable and why that company, specifically, chose to invest in creating greater access to this technology.

Embedded deep within the DNA of Amazon is the need to programmatically learn and adapt to human behavioral patterns, systemically, at scale. It is the cornerstone of how Amazon revolutionized retail and made online purchasing the dominant way goods and services are procured around the world. Amazon created a retail system that identifies usage patterns and contextualizes an individual’s experience with the software based on what it has learned.

Amazon has also revolutionized the way cloud technology is deployed, scaled and delivered to users. With services such as AWS, Amazon has made something that was complex and cost prohibitive, such as delivering software as a service globally over the internet, an affordable and relatively easy process.

Offering natural language processing and voice interfaces as services shows that Amazon (like other tech giants such as Google and Apple) is aiming to lower the barrier for developers to adopt these services, because it believes they will become an integral way for humans to interact with software systems. It should be noted that the way humans interact verbally and conversationally is very different from how they navigate screen-based software. Language and conversation are exploratory, nuanced and dynamic. Language is the base currency of human thought. Letting us use our “native” mode of communication as the user interface to a software system that can understand and learn from our particular nuances will change the way humans interact with computers (and one another). The effort Amazon has put into making these technologies affordable and ubiquitous suggests it believes this is the way of the future.


Configuring a Lex bot from the web console

Amazon Lex: Lex offers a sophisticated set of natural language processing and speech recognition services that can be configured using a web-based Lex console. Through this console, a Lex bot is given a scaffolding that guides its responses to users. The configuration is very structured and deceptively simple, largely made up of five component types: intents, utterances, slots, prompts and fulfillment. These components work in concert to meet the supplied success criteria (fulfillment) of the stated goal (intent). Lex touts an impressive list of integrations, not just with AWS services, but also with third-party offerings such as Facebook Messenger (available) and Slack (coming soon). The pricing model follows a well-trod Amazon pattern: a block of free requests per month (10,000 text requests and 5,000 speech requests), with a small fee per request beyond that.
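To make the relationship between the five component types more concrete, here is a minimal sketch in plain Python. It does not use the Lex API; the class names, the `next_step` helper and the hotel example are all hypothetical, and it only illustrates how prompts fill slots until the intent can be fulfilled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Slot:
    name: str    # a piece of data the intent needs, e.g. "city"
    prompt: str  # the question asked while the slot is still empty


@dataclass
class Intent:
    name: str                                     # the user's goal
    utterances: List[str]                         # sample phrases that signal the goal
    slots: List[Slot]                             # data required before fulfillment
    fulfillment: Callable[[Dict[str, str]], str]  # runs once every slot is filled


def next_step(intent: Intent, filled: Dict[str, str]) -> str:
    """Return the prompt for the first empty slot, or the fulfillment result."""
    for slot in intent.slots:
        if slot.name not in filled:
            return slot.prompt
    return intent.fulfillment(filled)


# Hypothetical example intent, not taken from the Lex console.
book_hotel = Intent(
    name="BookHotel",
    utterances=["book a hotel", "I need a room"],
    slots=[Slot("city", "Which city?"), Slot("nights", "How many nights?")],
    fulfillment=lambda s: f"Booked {s['nights']} nights in {s['city']}.",
)

print(next_step(book_hotel, {"city": "Seattle"}))
print(next_step(book_hotel, {"city": "Seattle", "nights": "2"}))
```

In the real service, the hard part (mapping free-form speech or text onto an intent and its slot values) is what Lex handles for you; the sketch only shows the bookkeeping around it.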

Amazon Polly: Polly provides a web console as well as command-line tools to generate lifelike audio in 24 languages. While text-to-speech functionality has been with us for a while, Polly operationalizes the synthesis and delivery with its cloud-based streaming infrastructure. All data is encrypted both at rest and in transit, and all audio files can be downloaded on demand. As with the other services, there is a monthly allocation of free requests, with a nominal fee per request beyond that.
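Besides the console and command line, Polly can be called from code. The sketch below assumes the `synthesize_speech` call from the boto3 SDK for Python and the `Joanna` voice. The helper name `synthesize_to_file` is our own, and the client is passed in so the function can be tried with a stand-in object before pointing it at a real account.

```python
def synthesize_to_file(client, text, path, voice_id="Joanna"):
    """Ask Polly for MP3 audio of `text`, write it to `path`, return the byte count."""
    response = client.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    # The audio comes back as a stream that is read once.
    audio = response["AudioStream"].read()
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)


# Usage with a real client (requires boto3 and configured AWS credentials):
#   import boto3
#   synthesize_to_file(boto3.client("polly"), "Hello from Polly", "hello.mp3")
```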

Amazon Rekognition: AWS touts Rekognition as an image detection and recognition service powered by deep learning. We are used to facial recognition as an application, so at first glance it looks like just another facial recognition service. However, that underplays what the service is intended to do by a long shot.


Scene and object detection in Rekognition

The service is able to detect and label image content and provides scored textual labels for the overall image scene as well as objects and faces. It’s more a question of “What is in this image?” than “Where is X in this image?” The fact that we are generating textual labels from images or camera feeds hints at the ability to hook into Polly and Lex to talk to the user about what we are seeing in an image, video stream or camera feed.
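As a rough illustration of that hand-off, the sketch below turns scored labels into a sentence that could be passed to a text-to-speech service. It assumes a response shaped like the one returned by the boto3 `detect_labels` call (a `Labels` list whose entries carry `Name` and `Confidence`). The `describe_labels` helper, the threshold and the sample data are made up for the example.

```python
def describe_labels(response, min_confidence=80.0, limit=3):
    """Build a short spoken-style description from scored image labels."""
    labels = [
        label for label in response.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]
    # Most confident labels first.
    labels.sort(key=lambda label: label["Confidence"], reverse=True)
    names = [label["Name"].lower() for label in labels[:limit]]
    if not names:
        return "I am not sure what is in this image."
    return "I can see: " + ", ".join(names) + "."


# Invented sample data in the assumed response shape.
sample = {
    "Labels": [
        {"Name": "Dog", "Confidence": 91.5},
        {"Name": "Beach", "Confidence": 97.2},
        {"Name": "Umbrella", "Confidence": 62.0},
    ]
}

print(describe_labels(sample))  # prints "I can see: beach, dog."
```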

Facial recognition does get special treatment, though. As well as detecting faces, the service applies finer-grained label sets to them, and you can store faces in one or more collections and use them for matching, so that your apps can start to recognize who is who. This can be valuable for additional authentication or for getting emotion-based cues as part of a conversational UI.
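Matching against a collection could look something like the sketch below. It assumes a response shaped like the one from the boto3 `search_faces_by_image` call, where each entry in `FaceMatches` has a `Similarity` score and a `Face` record whose `ExternalImageId` is whatever name was supplied when the face was stored. The `best_match` helper and the 90 percent threshold are illustrative choices, not recommendations from AWS.

```python
def best_match(response, min_similarity=90.0):
    """Return the stored identifier of the closest face match, or None."""
    matches = [
        match for match in response.get("FaceMatches", [])
        if match["Similarity"] >= min_similarity
    ]
    if not matches:
        return None
    top = max(matches, key=lambda match: match["Similarity"])
    return top["Face"].get("ExternalImageId")


# Invented sample data in the assumed response shape.
sample = {
    "FaceMatches": [
        {"Similarity": 93.4, "Face": {"FaceId": "a1", "ExternalImageId": "alice"}},
        {"Similarity": 99.1, "Face": {"FaceId": "b2", "ExternalImageId": "bob"}},
        {"Similarity": 71.0, "Face": {"FaceId": "c3", "ExternalImageId": "carol"}},
    ]
}

print(best_match(sample))  # prints "bob"
```

For anything security-related, the threshold deserves careful testing on your own images rather than a guess.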


Facial analysis in Rekognition

The pricing model is tiered, starting at $1 per 1,000 images for the first 1 million images processed per month. Face metadata storage is priced at $0.01 per 1,000 faces per month.
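For a back-of-the-envelope estimate, the arithmetic can be sketched as follows. This assumes the first-tier image rate is charged per 1,000 images (our reading of the price list, worth checking against the current AWS pricing page), and it deliberately models only the first tier, since the rates for higher tiers are not quoted here. The function name and parameters are our own.

```python
def monthly_cost(images, stored_faces, per_1000_images=1.00, per_1000_faces=0.01):
    """Estimate a monthly bill in USD, first pricing tier only.

    Assumes $1.00 per 1,000 images processed (first 1 million images)
    and $0.01 per 1,000 stored face records.
    """
    if images > 1_000_000:
        raise ValueError("only the first tier (up to 1 million images) is modeled")
    image_cost = images / 1000 * per_1000_images
    storage_cost = stored_faces / 1000 * per_1000_faces
    return image_cost + storage_cost


# 250,000 images and 10,000 stored faces: about $250 for images plus $0.10 for storage.
print(round(monthly_cost(250_000, 10_000), 2))
```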

Ready for More

While this initial set of production services is quite focused, AWS's move toward broader support for custom-built and custom-trained AI technologies is hinted at in its recent blog post on the MXNet framework. It will be interesting to see how AWS fills the gap between these specific services and the pre-configured plain vanilla AMIs in the near future.
