How to Make Mycroft Run Offline for Real Privacy
Updated on 6th Jan 2021 16:18 in DIY, Home Assistant, Tutorial
In a previous post, we demonstrated building a voice assistant out of open source components. One of the issues that we didn't address at the time is that there are still dependencies on the cloud for some of the critical functions, such as the Speech to Text engine. Here we will see how we can modify that system to operate (almost) entirely offline. If you haven't seen my post where we build Mycroft, be sure to check it out here!
A hardware check
Unfortunately, both the Speech to Text and the Text to Speech components are very processor intensive and will strain even the best hardware. In my testing with an Intel i7-7700k CPU, both of the required programs ran at a speed that was simply unacceptable for a voice assistant. Keep in mind that's with a pretty strong CPU - especially when compared to the Raspberry Pi!
The key is to use a GPU, which dramatically accelerates things. With an Nvidia GTX 1080, a Text to Speech operation took around 2 seconds, compared to 17 seconds for the same sentence using only the CPU. As you can see, a GPU is almost mandatory for this application unless you are willing to accept very long and unnatural delays.
As such, the hardware requirement is a graphics card that supports CUDA (Nvidia only). Both programs run on TensorFlow, which at the time of writing requires a CUDA compute capability of 3.0. You can use this list to see which GPUs will work. Keep in mind that it does not need to be a powerful GPU in the "gaming" sense, as any GPU with CUDA capabilities will provide a significant boost over using the CPU alone.
What OS will we use?
Almost all of the installation instructions for DeepSpeech and Mozilla TTS that can be found online are done with a Linux operating system. Personally, I have no Linux computer with a CUDA GPU capable of running this stuff. However, my regular computer has a GPU that can easily handle these tasks without even blinking. As such, this guide will focus on using Windows as the OS for running both of these programs.
If you do have a Linux machine with a CUDA GPU, most of the instructions will still hold, but a few things will actually be more straightforward. As they are designed with Linux in mind, installation should be much shorter and likely won't have as many gotchas as the Windows procedure.
Installing the software
There are quite a few different software requirements, and some of them are a real pain to get installed correctly as a result of a series of dependency problems. To ease the pain, on Windows, we will use Anaconda to make managing the Python environments easier. Start by heading over to the Anaconda downloads page, then download and install the 64-bit version for your platform. After installing, you should now be able to open a Conda prompt from the start menu using "Start->Anaconda" (just start typing and it should appear). All commands will be run in an Anaconda prompt unless otherwise stated.
Installing drivers
All of the software we need requires some Nvidia drivers to work, so we will install those now. Start by installing the NVIDIA GPU drivers, which are just the regular drivers you install with your video card - you may already have these installed, but it's best to double-check. Next, install the CUDA Toolkit; be sure to install version 10.1, as no other version is correctly supported at the time of writing.
Finally, download and install the cuDNN SDK 7.6. Sadly this seems to require an Nvidia account, so you'll need to sign up before you can download. Once that's installed, all of the GPU driver dependencies have been taken care of and you are ready to install the rest of the software. Everything we just installed is listed as a requirement for TensorFlow, so for an always up-to-date list, check out the TensorFlow driver page.
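If you want a quick sanity check that the driver stack is set up before going any further, the following sketch tries to load the CUDA and cuDNN runtime DLLs from your PATH. The DLL names are assumptions based on the CUDA 10.1 and cuDNN 7.6 versions installed above; adjust them if you installed different versions. Run it with any Python 3 you have handy:
# Optional Windows-only check that the CUDA 10.1 and cuDNN 7 runtime DLLs
# can be found on the PATH. The DLL names below are assumptions matching the
# versions installed above - adjust them if you installed something else.
import ctypes

for dll in ("cudart64_101.dll", "cudnn64_7.dll"):
    try:
        ctypes.WinDLL(dll)
        print(dll, "loaded OK")
    except OSError:
        print(dll, "NOT found - double-check the CUDA/cuDNN install and PATH")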
Installing DeepSpeech
Now we will install the DeepSpeech server, which will allow HTTP requests to be made with an audio clip that will be converted into text by the library. For those who want to get more information about the DeepSpeech server, check out their Github page.
First, we must create a virtual environment so that all the packages we install don't interfere with anything else:
conda create --name stt
This creates a virtual environment named "stt". Next, we will activate the new environment:
activate stt
Your prompt should now have "(stt)" in front of it, indicating you are operating within the stt virtual environment and not the global python. The first thing to do is to install pip into the Conda environment to ensure we don't accidentally use the wrong one:
conda install pip
Now we are ready to follow the instructions on the DeepSpeech server page. Run the following command if you have a GPU; otherwise, omit the "-gpu" part, keeping in mind it will run very slowly:
pip install deepspeech-gpu
Now we can install the DeepSpeech-server:
pip install deepspeech-server
While these are the only instructions on the page, there are a few more things to do. As of writing, there is some sort of linking bug on Windows that leads to the DeepSpeech program attempting to load a non-existent DLL, which crashes everything. To fix this, open Windows File Explorer and navigate to the "<conda install directory>/envs/stt/Lib/site-packages/deepspeech/lib" directory.
The Conda directory is the location you chose during the Anaconda installer; if you installed to the C drive it could be "C:\Anaconda3\envs\stt\Lib\site-packages\deepspeech\lib". Within that directory, notice that there is a single file named "libdeepspeech.so"; this file needs to be copied to the directory above it for everything to work correctly. Copy the file, head over to "<conda install directory>/envs/stt/Lib/site-packages/deepspeech", and paste it there. That file should now be at the top level of the "deepspeech" folder, which resolves the linker issue.
Note: I had to pin the cudatoolkit package to version "10.1.243", as otherwise it would look for the wrong CUDA libraries. If you seem to have this problem too, try pinning it by running:
conda install cudatoolkit=10.1.243
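To confirm the DLL fix worked before going any further, a quick import check saves a lot of head scratching. This is just a sanity-check sketch: start "python" inside the activated stt environment and paste it in; if the native library still can't be found, the import itself will raise the DLL error.
# Sanity check: run inside the activated "stt" environment.
# If the libdeepspeech.so copy worked, the import succeeds instead of
# raising a DLL load error.
from deepspeech import Model

print("DeepSpeech imported successfully")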
Installing Mozilla TTS
Next, we are going to install the TTS server. Open a new Anaconda prompt, then create a new virtual environment:
conda create --name tts
Then activate the environment:
activate tts
Your prompt should now have "(tts)" in front of it. Now download and install espeak, which is a dependency. Be sure to install the "compiled for Windows" version, and add it to your PATH so that the other programs can find it.
Now in the Anaconda prompt, we will install pip again:
conda install pip
Now we are ready to install the TTS server. Head over to the table on the TTS Github page and copy the URL of the "Python package". Use that URL in the following command:
pip install <URL of the package>
This will install the TTS server. This should work without any changes, but if you have problems, ensure that the libraries are installed correctly.
Running the software
Usually, once the software is installed, the rest is pretty straightforward. Unfortunately, in this case, there are still some gotchas.
Running DeepSpeech
To run DeepSpeech, be sure that you are in an Anaconda prompt then activate the virtual environment:
activate stt
Now set an environment variable called "TF_FORCE_GPU_ALLOW_GROWTH" to true:
set TF_FORCE_GPU_ALLOW_GROWTH=true
This tells TensorFlow to allocate GPU memory on demand instead of reserving it all up front. I'm not sure why it is necessary here, but in my case it wouldn't work without it. Next, create a directory somewhere on your system; mine is called "Mycroft". Change to that directory in your Anaconda prompt.
Using your browser, download the DeepSpeech pre-trained model from the DeepSpeech Github. Scroll down to the "Assets" section of the latest release and download both the .pbmm and .scorer files. In my case the files were "deepspeech-0.8.2-models.pbmm" and "deepspeech-0.8.2-models.scorer", respectively. Save those to the directory you created earlier.
Now create a JSON file named "config.json" with the following settings, assuming your file names are the same as above (otherwise be sure to change them!). Adjust the "model" and "scorer" paths so that they point at the files you just downloaded, relative to the directory you will run the server from:
{
    "deepspeech": {
        "model": "Deepspeech/deepspeech-0.8.2-models.pbmm",
        "scorer": "Deepspeech/deepspeech-0.8.2-models.scorer",
        "beam_width": 500,
        "lm_alpha": 0.931289039105002,
        "lm_beta": 1.1834137581510284
    },
    "server": {
        "http": {
            "host": "0.0.0.0",
            "port": 8080,
            "request_max_size": 1048576
        }
    },
    "log": {
        "level": [
            { "logger": "deepspeech_server", "level": "DEBUG" }
        ]
    }
}
Make sure to create this file in the same directory as the model and scorer files you downloaded earlier; it will not work correctly otherwise.
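Before starting the server, you can optionally double-check the config with a short Python snippet. It is only a sketch: it assumes the file is named "config.json" (matching the command below) and simply verifies that the JSON parses and that the model and scorer paths resolve from the current directory.
# Optional sanity check: run from the same directory as config.json.
# Verifies the JSON parses and that the model/scorer paths exist.
import json
import os

with open("config.json") as f:
    cfg = json.load(f)

for key in ("model", "scorer"):
    path = cfg["deepspeech"][key]
    status = "found" if os.path.exists(path) else "MISSING"
    print(f"{key}: {path} ({status})")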
Finally, you are ready to run the DeepSpeech server. In the Anaconda prompt, from the directory you saved the JSON file to, run:
python C:\Anaconda3\envs\stt\Scripts\deepspeech-server --config config.json
Be sure to replace "C:\Anaconda3" with whatever your Anaconda path is. It should now be running! If you run into any problems, be sure to check out the official server configuration Github page.
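You can verify the server is answering requests before touching Mycroft at all. The sketch below is an assumption-laden test: it expects a 16 kHz, 16-bit mono WAV file named "test.wav" in the current directory, and posts the raw audio to the /stt endpoint the same way Mycroft will. Adjust the file name, host, and port to match your setup.
# Quick test of the running DeepSpeech server.
# Assumes a 16 kHz, 16-bit mono WAV file named test.wav in the current directory.
import requests

with open("test.wav", "rb") as f:
    audio = f.read()

response = requests.post("http://localhost:8080/stt", data=audio)
print("Transcription:", response.text)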
Running Mozilla TTS
TTS is usually easier to run. As before, open a new Anaconda prompt and activate the environment:
activate tts
Now run the following command to start the server:
python -m TTS.server.server --use_cuda true
The "use_cuda" option is essential as, without it, no GPU will be used, and it will be very slow. You will see some warnings about using the development server, but for now, this is okay. The table on the TTS Github page offers the required nginx/uWSGI config files to set this up properly, but on Windows, it is a bit of a pain to set up.
You can test that it is working by visiting "http://127.0.0.1:5002" in your browser.
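If you'd rather test from a script than the browser, the following sketch requests a sentence from the /api/tts endpoint (the same endpoint the Mycroft module below uses) and saves the synthesized audio to a WAV file. The output file name is just an example.
# Quick test of the TTS server: fetch synthesized speech and save it as a WAV.
import requests

params = {"text": "Hello from Mozilla TTS"}
response = requests.get("http://127.0.0.1:5002/api/tts", params=params)

with open("tts_test.wav", "wb") as f:
    f.write(response.content)
print("Saved tts_test.wav,", len(response.content), "bytes")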
Configuring Mycroft
This is the home stretch! We are almost done. Configuring the DeepSpeech server is very easy, as it is natively supported in Mycroft. The TTS server is a bit more involved, but also not too complicated. This will work with any Mycroft install, but these instructions are specifically for the one built in this post.
Configuring DeepSpeech
Run the following command in an SSH terminal connected to the computer running Mycroft:
~/mycroft-core/bin/mycroft-config edit user
Now add the following JSON config to that file:
"stt": {
"deepspeech_server": {
"uri": "http://<IP_ADDRESS>:8080/stt"
},
"module": "deepspeech_server"
},
Replace <IP_ADDRESS> with the IP address of the computer running DeepSpeech, then save and exit. The config will be updated to use your DeepSpeech server. You can test this by prompting Mycroft: "Hey Mycroft, what time is it?" You should see the transcribed text appear in the command window you used to start DeepSpeech.
Configuring TTS
This is more involved because, unfortunately, there is no out-of-the-box support for Mozilla TTS, so we will need to write our own module. It is quite simple to do; the code below works well with Mycroft v19.8.
# Copyright 2017 Mycroft AI Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import requests

from mycroft.tts import TTS, TTSValidator
from mycroft.tts.remote_tts import RemoteTTS


class MozillaTTS(RemoteTTS):
    # Default request parameters; "text" is replaced with the sentence to speak.
    PARAMS = {
        'text': 'Hello World'
    }

    def __init__(self, lang, config):
        super(MozillaTTS, self).__init__(
            lang, config, config.get('url'), '/api/tts', MozillaTTSValidator(self)
        )

    def build_request_params(self, sentence):
        params = self.PARAMS.copy()
        params['text'] = sentence.encode('utf-8')
        return params


class MozillaTTSValidator(TTSValidator):
    def __init__(self, tts):
        super(MozillaTTSValidator, self).__init__(tts)

    def validate_lang(self):
        # TODO: Verify Mozilla TTS can handle the requested language
        pass

    def validate_connection(self):
        # Check that the TTS server is reachable before Mycroft starts using it.
        r = requests.get(self.tts.url)
        if r.status_code == 200:
            return True
        raise AssertionError("Could not reach " + self.tts.url)

    def get_tts_class(self):
        return MozillaTTS
Copy the above code and paste it into a file named "mozilla_tts.py" under the "~/mycroft-core/mycroft/tts/" directory. One way to do this is with nano:
nano ~/mycroft-core/mycroft/tts/mozilla_tts.py
Then paste the code above into the file. Now we just need to make one quick edit to the TTS factory to enable our new file. Edit the file "~/mycroft-core/mycroft/tts/__init__.py":
nano ~/mycroft-core/mycroft/tts/__init__.py
Scroll all the way to the bottom and find the "CLASSES = {" section. Add the "mozillatts" module as follows:
CLASSES = {
    "mimic": Mimic,
    "mimic2": Mimic2,
    "google": GoogleTTS,
    "marytts": MaryTTS,
    "fatts": FATTS,
    "espeak": ESpeak,
    "spdsay": SpdSay,
    "watson": WatsonTTS,
    "bing": BingTTS,
    "responsive_voice": ResponsiveVoice,
    "yandex": YandexTTS,
    "mozillatts": MozillaTTS
}
Finally, add an import statement under the last one as follows:
from mycroft.tts.mozilla_tts import MozillaTTS
Now we just need to configure the JSON, and we're done. Once again run:
~/mycroft-core/bin/mycroft-config edit user
Paste the following config under the "stt" section from earlier, replacing <IP_ADDRESS> with the address of the computer running the TTS server:
"tts": {
"tts": {
"mozillatts": {
"url": "http://<ip_address>:5002"
},
"module": "mozillatts"
}
The final config file
The final configuration file that is displayed when you run:
~/mycroft-core/bin/mycroft-config edit user
should look as follows:
{
    "max_allowed_core_version": 19.8,
    "stt": {
        "deepspeech_server": {
            "uri": "http://<IP_ADDRESS>:8080/stt"
        },
        "module": "deepspeech_server"
    },
    "tts": {
        "mozillatts": {
            "url": "http://<IP_ADDRESS>:5002"
        },
        "module": "mozillatts"
    }
}
Note that we are directly modifying the Mycroft code. This means that if Mycroft updates, our changes will likely be overwritten and we will have to deploy this code again. It isn't optimal, but short of pushing this change upstream, there isn't much that can be done to avoid it. The Plasma Bigscreen OS image we used in the last post actually pins the Mycroft version, so if you are using that, you don't need to worry about this until they bump that version.
Testing it
Finally, restart all of the Mycroft services so that the new configuration and code are picked up:
~/mycroft-core/bin/mycroft-start all restart
Now you should be able to use Mycroft as usual: "Hey Mycroft, what time is it?" and it should be able to understand you via the STT server you configured, and respond via the TTS server. You can watch both command lines as they will display the result of each of their operations.
Wrap up
Unfortunately, this process isn't straightforward. Anything involving GPUs and TensorFlow seems to become tricky very quickly, but hopefully this guide helped bring everything together. Any strange error you encounter has likely been hit by someone else, so searching for the exact TensorFlow error message will often turn up a fix.
It should be noted that there are technically still online components involved; however, the part most people are concerned about is now totally offline. The audio recordings never leave your local network, and the responses are also generated locally. I enjoy using this as a way to get Home Assistant voice control without the cloud, since connecting Mycroft to Home Assistant is merely a matter of installing the Home Assistant skill.
In fact, just ask Mycroft to install it: "Hey Mycroft, install Home Assistant". The skill will then be installed, and you can use home.mycroft.ai to configure the skill under the "devices" section.
Should you do this?
The fact that this is even possible is fascinating, but it must be asked: should anyone really do this? The answer depends a lot on who you are and what you are trying to achieve. If you just want something that works and have no interest in tinkering further, this probably isn't the best setup for you. On the other hand, if you have the patience for cutting-edge technology, this is a perfect way to get a fully offline voice assistant.
From my testing, I think Mycroft's silence detection and DeepSpeech both need to be improved for this to feel comfortable for regular users. The reason is that if the environment is ever so slightly noisy (think of a fan running), it will struggle to detect the end of the sentence. DeepSpeech will then also struggle a bit to understand what was said, and you will end up with garbage.
It is possible to train your own DeepSpeech model, which we did not explore here as it is quite an advanced undertaking. That said, training the model to understand your voice better will more than likely significantly improve recognition quality on your specific setup.
The Text to Speech engine, on the other hand, really impresses me. I find that running it on a GPU gives both higher-quality and faster results than the default engine that runs on the Pi. At the end of the day, the Pi is not a very powerful machine; running the TTS on a different computer both produces speech faster and frees up resources on the Pi so that it can run other things.