This article series shows how you can extend the AI capabilities of the robot Pepper by leveraging on-device machine learning with Google’s ML Kit.
Pepper is a social robot by Aldebaran with a humanoid body shape, designed to engage with people through natural language, using voice and gestures. Pepper can be programmed to adapt to humans by recognizing faces and basic human emotions, creating a sort of bond during an encounter. The interaction is also supported by a touch screen integrated into its chest, through which visual content can be presented to the user and input can be collected.
Although some of its traits are impressive during first interactions, such as the way it always meets your eyes while talking to you and how natural its movements feel, users tend to have very high expectations of its intelligence, and these are very often not met by the standard features. In this series, we will see how we can extend Pepper’s capabilities with on-device machine learning, building an application that brings a bit more AI into this cute robot using Google’s ML Kit.
ML Kit on a social robot
The fact that Pepper runs on Android allows us to benefit from all APIs available for mobile devices and, although its computing power is limited, it has enough capacity to process machine learning-powered applications that are based on lightweight models designed for mobile devices.
ML Kit is an SDK that brings Google’s machine learning to apps. Besides simplifying the task of integrating ML into apps, which is otherwise a challenging and time-consuming process, it has the particular advantage that all of its APIs run on-device. This is great news, since it addresses many of the issues arising when relying on cloud-based solutions, such as latency (the delay caused by sending data back and forth through the network prevents real-time use cases) and, even more importantly, security and privacy, as the data never leaves the device. Ensuring the privacy of an app’s data is a key part of developing an app that handles sensitive user data, such as camera and microphone input, and a requirement to meet privacy laws such as the GDPR. Another great benefit is that the app will also work offline.
ML Kit is composed of Vision APIs, which analyze image and video input, and Natural Language APIs, which process and translate text. Among the Vision APIs we find face detection, pose detection, text recognition, image labeling, object detection and tracking, digital ink recognition (recognizing handwritten text and drawings on a digital surface), and a few others. The Natural Language APIs, on the other hand, can identify the language of a text, translate text, generate smart replies, and extract entities. Because they are ready to use, we do not need deep knowledge about neural networks or to worry about training models. However, some APIs allow experienced developers to plug in their own custom models for cases where the standard ones do not cover the needs of the app, e.g., image classification. Another significant point is that the models are updated dynamically, which means they can be improved without updating the app.
To integrate it into our app, all we need to do is include Google’s Maven repository in both the buildscript and allprojects sections and add the dependencies for the ML Kit APIs we want to use to the module’s Gradle file.
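As an orientation, the setup could look like the following sketch. The artifact names correspond to the on-device APIs used in this series; the version numbers are only examples and will likely have moved on by the time you read this, so check the ML Kit release notes.

// project-level build.gradle
buildscript {
    repositories {
        google()        // Google's Maven repository
        mavenCentral()
    }
}
allprojects {
    repositories {
        google()
        mavenCentral()
    }
}

// module-level build.gradle: add only the ML Kit APIs you actually use
dependencies {
    implementation 'com.google.mlkit:digital-ink-recognition:18.1.0'   // drawing / handwriting recognition
    implementation 'com.google.mlkit:image-labeling:17.0.7'            // image labeling
    implementation 'com.google.mlkit:text-recognition:16.0.0'          // text recognition (OCR)
    implementation 'com.google.mlkit:translate:17.0.1'                 // on-device translation
}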
Use cases with Pepper
The ML Kit SDK opens up a tremendous range of possibilities for apps in combination with a social robot equipped with the necessary hardware, such as Pepper, and it is just what the robot needs to become significantly smarter in the areas where it lacks abilities.
From this whole range of possibilities, we will cover a few use cases we find interesting. For instance, we will use ML Kit to power a drawing game, where Pepper will guess what the user draws on the tablet on its chest. Furthermore, we will see how to allow Pepper to recognize and interact with its environment by analyzing what it sees through its cameras. We will also teach it how to read, very quickly, without sending it to school (it’s too young just yet), and to translate into different languages.
How to develop a Pepper app for mobile developers
If you have some experience with Android but have never developed an app for a robot, this should help you understand the differences. This article does not aim to repeat the contents of the official documentation by explaining how to use the APIs in detail; rather, it focuses on giving an overview of what is possible with them and how to integrate them into an app specifically designed to run on the Pepper robot.
While it is true that we program Pepper in Android Studio using Kotlin or Java and all the libraries we use for mobile apps, there are some things we need to take into account when developing for a robot as opposed to a smartphone, and they make both the app structure and the flow somewhat more complex. Let’s go through the most notable differences.
QiSDK
Actions, such as those that make the robot move or talk, are the main building blocks you will use to control the robot and create applications for Pepper. The most important actions are conversation, movements, engaging humans, performing animations from gestures to dances, looking somewhere specific, going to a point, taking pictures, and localizing itself in an environment.
All these Pepper-specific actions are part of the QiSDK from Aldebaran, which we need to include in our module’s build.gradle file. To do this, we first need to have installed the Pepper SDK plug-in in Android Studio and created a robot application, following the getting started guide from the documentation. Then, everything we need from the QiSDK can be added to our project via the following Gradle dependencies:
implementation 'com.aldebaran:qisdk:1.7.5'
implementation 'com.aldebaran:qisdk-design:1.7.5'
These are available from the following Maven repository:
maven { url 'https://qisdk.softbankrobotics.com/sdk/maven' }
There are also some experimental features, not needed for this project, which can be found in:
maven { url 'https://qisdk.softbankrobotics.com/experimental/maven' }
Lifecycle of a Pepper app
If you are familiar with programming mobile applications for Android, you will surely know about the Activity lifecycle. When developing for Pepper, the tricky part is understanding the robot lifecycle running alongside it, and how to integrate it into your app. In short, to be able to run robot actions an activity needs to have the so-called robot focus, which can only be owned by one activity at a time. To gain the focus, the activity must implement the RobotLifecycleCallbacks interface and register with the QiSDK once it is running, for example in the onCreate method. When the focus is gained, we are informed through the onRobotFocusGained callback. This is a crucial step, because there we obtain and store the QiContext object, which is necessary to create and run all robot actions as well as to retrieve robot services.
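A minimal sketch of this setup, assuming a single activity and omitting the layout and error handling, could look like this:

import android.os.Bundle
import com.aldebaran.qi.sdk.QiContext
import com.aldebaran.qi.sdk.QiSDK
import com.aldebaran.qi.sdk.RobotLifecycleCallbacks
import com.aldebaran.qi.sdk.design.activity.RobotActivity

class MainActivity : RobotActivity(), RobotLifecycleCallbacks {

    private var qiContext: QiContext? = null

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        // Register to be notified about the robot focus
        QiSDK.register(this, this)
    }

    override fun onDestroy() {
        QiSDK.unregister(this, this)
        super.onDestroy()
    }

    override fun onRobotFocusGained(qiContext: QiContext) {
        // Called on a background thread once the focus is gained:
        // store the QiContext, it is needed to build and run every robot action
        this.qiContext = qiContext
    }

    override fun onRobotFocusLost() {
        // Drop the QiContext and stop using robot actions
        qiContext = null
    }

    override fun onRobotFocusRefused(reason: String) {
        // The focus could not be obtained, e.g. another activity holds it
    }
}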
We also need to pay attention to the thread we are working on when executing an action. Since all robot actions can be executed both synchronously and asynchronously, we can choose one or the other depending on the situation. However, to keep things simple, in most cases going async will be the best option, e.g., when we are on the UI thread or when we need to run several actions simultaneously.
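For example, a Say action can be run either way. This is a small sketch with arbitrary text, assumed to run somewhere a QiContext is available:

// Build a Say action from the QiContext obtained in onRobotFocusGained
val say: Say = SayBuilder.with(qiContext)
    .withText("Hello, I am Pepper!")
    .build()

// Synchronous: blocks the current thread until Pepper has finished speaking,
// so it must not be called on the UI thread
say.run()

// Asynchronous: returns immediately with a Future we can chain on
say.async().run().andThenConsume {
    // speech finished
}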
Voice interaction
Another difference from (most) mobile applications, and a large part of any Pepper app, is voice interaction. Typically, it runs continuously in parallel with whatever interaction takes place on the tablet, triggering answers asynchronously. Oftentimes, you want to offer the user both modalities and replicate action listeners in each of them. The touch screen also works well as a fallback for situations where voice interaction is harder, for example in loud environments.
We introduce voice interaction capabilities into our app by using a Chat action, which takes care of speech recognition completely: detecting the start and end of speech, converting it to text, providing visual feedback while vocal communication is running, and matching an answer.
Chatbots
The Chat action makes use of one or several Chatbot components that contain answers matching a set of inputs. The reason you may want several of them running at the same time is that no single chatbot is capable of responding to everything. A combination of chatbots that each perform well in their respective domain can therefore be a better approach to cover a broad range of situations: for example, one chatbot in charge of Pepper’s main task at a given moment (e.g., “where do I find room A?” in a setting where Pepper guides visitors through a building), a second chatbot for more general questions not related to the task, which might be shared by several applications (e.g., “what sensors do you have?”), and a third one for general knowledge questions, where we might want to draw on an online encyclopedia or a computational knowledge engine such as Wikipedia or WolframAlpha. This is another way to make Pepper smarter: giving it the tools to respond appropriately to requests from humans, which in most cases tend to go beyond the scope of Pepper’s role. Say, for example, a robot is deployed as a retail assistant or as a guide in a museum. Besides fulfilling its task, it will often be expected to answer unrelated questions, and that is exactly what this setup covers.
In order to produce an answer, the Chat action queries the chatbots sequentially according to their priority until it finds a suitable answer.
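A hedged sketch of how several chatbots could be combined in one Chat action, assuming that withChatbot accepts several chatbots in priority order as described above; the chatbot variables are placeholders, and a fallback chatbot would typically be a custom Chatbot implementation wrapping, for instance, an external service:

// Chatbots are passed in priority order: the Chat action asks them one after
// another until one of them provides a valid reply
val chat: Chat = ChatBuilder.with(qiContext)
    .withChatbot(taskChatbot, smallTalkChatbot, knowledgeChatbot)
    .build()

chat.async().run()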
For creating a chatbot for language interaction, you are free to choose either QiChat, the integrated scripting language, or any other framework you prefer, such as Dialogflow or the Microsoft Bot Framework. For simplicity, and since in this demo we do not need the advanced natural language capabilities that may be required in a real scenario, we will use QiChat for the chatbot of this application.
In a similar fashion to how you localize other resources, such as strings, you can also provide your topic, a file written in QiChat syntax, in several languages, and the right one will be loaded according to the current locale of the robot.
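As an illustration (file and folder names are just examples), the topic files can live next to the other localized resources:

res/raw/demo_topic.top       // default locale, e.g. English
res/raw-de/demo_topic.top    // picked up when the current locale is German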
Camera
Pepper is equipped with cameras placed on its head, and these are the ones we will be using in the demos, instead of the camera mounted on the chest tablet. Using the tablet camera would of course also be technically possible, but it is less suitable for these use cases, as its angle is not appropriate for what we intend to capture. Therefore, we need to use Pepper’s APIs to obtain the images instead of the CameraX API.
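Here is a sketch of how a frame from the head camera could be obtained with the QiSDK TakePicture action and handed over to ML Kit, based on the take-picture example from the QiSDK documentation; treat the ML Kit hand-off at the end as an assumption about how the demos use it:

import android.graphics.BitmapFactory
import com.aldebaran.qi.sdk.builder.TakePictureBuilder
import com.google.mlkit.vision.common.InputImage

// qiContext comes from onRobotFocusGained
TakePictureBuilder.with(qiContext)
    .buildAsync()
    .andThenCompose { takePicture -> takePicture.async().run() }
    .andThenConsume { timestampedImageHandle ->
        // Decode the raw picture data into a Bitmap
        val buffer = timestampedImageHandle.image.value.data
        buffer.rewind()
        val bytes = ByteArray(buffer.remaining())
        buffer.get(bytes)
        val bitmap = BitmapFactory.decodeByteArray(bytes, 0, bytes.size)
        // Wrap it for ML Kit (0 = no extra rotation)
        val inputImage = InputImage.fromBitmap(bitmap, 0)
        // ... pass inputImage to the ML Kit detector of the current demo
    }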
Pepper & ML Kit app
After the short introduction about what makes Pepper applications different, let’s dive into programming one. If this is your first Pepper application, please carefully follow the getting started guide first.
This application is a demo meant to show some of what is possible by integrating several of the ML Kit APIs and how they work, and it therefore includes more feedback than would actually be necessary in a production app. Since it is complex, we will not cover every single line of it in this article, but you can find the full code here.
Let’s start with a short summary of the flow before we go through each demo in detail.
When the app starts, the user is presented with a menu of ImageButtons on the screen, and Pepper explains what this is about. The screen also includes a button in the upper left corner to change the robot’s language and, later, another icon to return to this menu once we’ve left it.
Each of the demos can be started either by pressing the button in the main menu of the tablet or by voice. Once something has been selected, Pepper explains how it works via voice and the UI changes to the specific UI of the demo.
Architecture
The app is composed of an Activity, a main ViewModel shared among the views, an auxiliary class we use for everything related to Pepper actions, and five packages with five Fragments (one for the main menu and four for the demos), their ViewModels, and some helper classes.
In case you are wondering why we use fragments instead of a separate activity for each demo plus a main one, it is a fair question. The reason is that, as mentioned before, at any given time one and only one activity may hold the robot focus, and switching the focus between activities and restarting the chat action takes a few seconds, interrupting the speech recognition engine. Separate activities can work, depending on the type of application you are building, but in this case we expect frequent changes from one demo to another and prefer smooth transitions, so we went for fragments. Moreover, there are elements common to all of them.
Implementation
Let’s look at the code. We need an Activity that extends the class RobotActivity from the SDK, which adds some functionality on top of a standard Android activity, and also implements the RobotLifecycleCallbacks, through which we will be informed about the status of the connection to the QiSDK.
The activity registers with the QiSDK in the onCreate method alongside all the other initializations, and then saves the QiContext received in the onRobotFocusGained callback in the ViewModel. This context is what we need to create all Pepper actions. The main actions we are going to create are a Chat, taking pictures with Pepper’s head camera, running animations, and navigating between the chat topic and the activity using bookmarks. In order to keep our activity code as clean as possible and avoid repeating code, we can use utility classes to collect the methods that encapsulate the code necessary for running each of those robot actions. In this application, we created a class called PepperActions, which follows a singleton pattern, for these methods. It is injected into all ViewModels using Hilt for dependency injection.
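As a rough sketch of what such a class and its injection might look like (class, ViewModel, and method names here are placeholders; the actual PepperActions in the repository may be structured differently):

import androidx.lifecycle.ViewModel
import com.aldebaran.qi.Future
import com.aldebaran.qi.sdk.QiContext
import com.aldebaran.qi.sdk.builder.AnimateBuilder
import com.aldebaran.qi.sdk.builder.AnimationBuilder
import com.aldebaran.qi.sdk.builder.SayBuilder
import dagger.hilt.android.lifecycle.HiltViewModel
import javax.inject.Inject
import javax.inject.Singleton

// A singleton wrapper around robot actions, injectable with Hilt
@Singleton
class PepperActions @Inject constructor() {

    // Speak a phrase asynchronously and return the Future of the running action
    fun say(qiContext: QiContext, text: String): Future<Void> =
        SayBuilder.with(qiContext)
            .withText(text)
            .buildAsync()
            .andThenCompose { say -> say.async().run() }

    // Run an animation stored as a raw resource
    fun animate(qiContext: QiContext, animationRes: Int): Future<Void> =
        AnimationBuilder.with(qiContext)
            .withResources(animationRes)
            .buildAsync()
            .andThenCompose { animation ->
                AnimateBuilder.with(qiContext).withAnimation(animation).buildAsync()
            }
            .andThenCompose { animate -> animate.async().run() }
}

// The ViewModel stores the QiContext received by the activity and gets PepperActions injected
@HiltViewModel
class MainViewModel @Inject constructor(
    private val pepperActions: PepperActions
) : ViewModel() {
    var qiContext: QiContext? = null
}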
Chat action
We start with the chat action. First, we create a topic file, a kind of scripted dialog written in QiChat syntax, or several if you want to support multiple languages. You can find the full topics (.top files) of this application in QiChat syntax in the corresponding 'raw' folders for each language here.
Then, we start the interaction, using the obtained QiContext and the resource IDs of the topic files to create Topics and a Chat action. We won’t go into the details here, because they are well explained in the documentation. By passing a callback, we will know when it’s ready and can then jump to explaining how the demo works.
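A minimal sketch of that step (the resource name is a placeholder, and the real code in the repository is more elaborate):

// Build the topic from the raw resource (the localized one is picked automatically)
val topic: Topic = TopicBuilder.with(qiContext)
    .withResource(R.raw.demo_topic)
    .build()

// Create a QiChatbot from the topic and wrap it in a Chat action
val qiChatbot: QiChatbot = QiChatbotBuilder.with(qiContext)
    .withTopic(topic)
    .build()

val chat: Chat = ChatBuilder.with(qiContext)
    .withChatbot(qiChatbot)
    .build()

// The callback telling us the chat is listening
chat.addOnStartedListener {
    // a good moment to let Pepper explain how the demo works
}

val chatFuture: Future<Void> = chat.async().run()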
In order to communicate back and forth with the topic file, we will be using Bookmarks. Setting bookmarks allows us to react in our views when a specific part of the dialog has been reached, e.g., when something was said. This is very useful whenever the answer needs more computation than simply responding with words, for example when something needs to be processed, as is our case. It also works the other way around, i.e., we can use them to navigate to a specific part of the dialog topic from the code. In this application, we will use both directions.
As mentioned before, besides pressing the button on the screen, we can select each demo by voice as well. Thus, when the name of a demo is matched by our chatbot as specified in the topic, a bookmark is reached and the corresponding listener is notified and starts the demo.
For example, this is what a rule inside the topic file including a bookmark we can listen to looks like. In this case, the trigger word is “draw”:
u:(draw) %startDrawingBookmark
And this is how it looks in the other direction, where we use bookmarks to jump from the code to a specific point in the dialog, such as these rules:
u:(^empty) %drawingRulesBookmark Hey, let's play a drawing game! \pau=600\ Think of something and draw it on my Tablet! \pau=600\ Using machine learning I will try to recognize what that is. Are you ready?
    u1:(~yes) great! \pau=600\ let's see %startGameBookmark \pau=600\ say done or press the button when you're done
    u1:(~no) okay, maybe later!
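In Kotlin, both directions could look roughly like this, assuming the topic and qiChatbot created earlier; the bookmark names match the snippets above:

// Topic -> code: react when a bookmark in the dialog is reached
qiChatbot.addOnBookmarkReachedListener { bookmark ->
    if (bookmark.name == "startDrawingBookmark") {
        // the user said "draw": switch to the drawing demo
    }
}

// Code -> topic: make Pepper jump to a specific rule of the dialog
val bookmarks: Map<String, Bookmark> = topic.bookmarks
bookmarks["drawingRulesBookmark"]?.let { bookmark ->
    qiChatbot.async().goToBookmark(
        bookmark,
        AutonomousReactionImportance.HIGH,
        AutonomousReactionValidity.IMMEDIATE
    )
}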
That is all for this introduction. I hope you enjoyed it! Check out the next articles of this series, where we are going to look at the mentioned use cases one by one and how to implement them in our ML Kit powered Android app for the Pepper robot!