Reverse Engineering the Amazon Dash Button’s Wireless Audio Configuration

Update: I’ve learned a bit more and as such, see the bottom of the post for an update. I’ve been slow to update this but now that it’s seeming to be gaining a lot of traffic (leading up to 33c3…hmmmm ;)), I want the info to be accurate and fresh.

During the great Amazon Dash Button Hype of 2015, I saw a few of the early teardowns and blog posts and decided to order a few dash buttons of my own to play around with and reverse engineer. Since the hype has burned off, there hasn’t been much in the way of new information about the inner workings of the button.

dash08-pcb
Photo credit: Matthew Petroff

The Amazon Dash button is a neat little IOT device which contains an STM32F205 ARM Cortex M3 microcontroller, a Broadcom BCM43362 Wi-Fi module, a permanently attached (boo!) Energizer Lithium AAA battery, an Invensense I2S digital microphone, some serial flash, and assorted LEDs and SMPS power supplies. For the $5 price tag, the Dash Button packs some serious punch! Just the components are worth a considerable amount more than $5.

Playing around with the button, the setup process on iOS quickly caught my attention. It (apparently*) differs from the Android setup process considerably due to differences in the inner workings of iOS. The Android setup involves connecting to the button via a network called “Amazon ConfigureMe”, while the iOS app appears to use ultrasound-esque audio to transfer information to the button for the initial setup.

*I don’t actually have an Android device on hand to test this with, hence the “apparently”.

Without even opening the button, I put together a basic theory on how the button was setup from the iOS app: The app sends a carefully crafted “audio” packet using the iOS CoreAudio Framework, which is then picked up by the Dash Button’s onboard mic and parsed for Wi-Fi config info. If the Wi-Fi credentials are correct, the button phones home to the Amazon configuration servers and the setup continues, but with further config info being sent directly to the button over the Wi-Fi.

I immediately ripped apart the button in search of a way to piggyback on the ADMP441 digital microphone’s I2S bus. I figured it would be trivial to toss a logic analyzer on the bus and decode what I2S data was being sent to the STM32. Since I2S is a very commonly used and extremely well documented audio protocol, I counted on this being a relatively quick task.

While I was impressed with the density of the design, I was most definitely not impressed with the lack of a visible testpoint on the board for the digital microphone’s data line. The EN (enable), SCK (clock), and WS (word select) lines are easily available, but the SD (data) line is nowhere to be found. I poked around for a bit but didn’t see anything that looked promising. I quickly came to the realization that I was probably going to have to analyze the audio protocol as it came out of my iPhone rather than sniff it on the board. This was about the same time that I also realized this was not going to be the quick and dirty analysis I was expecting…

Armed with my RØDE shotgun mic, I took a new approach. Using  Electroacoustics Toolbox, I performed some basic audio analysis on the packets coming from the Amazon iOS app. Based on Matthew Petroff’s Dash Button Teardown, I initially expected some sort of Frequency-Shift Encoded (FSK) modulation scheme. Using the Spectrogram tool, I could see that the configuration data was definitely coming in bursts of 20 packets in a try-retry scheme. It also looked like the frequency of the audio was spread out between 18kHz and 20kHz, which is on par for an audio FSK implementation.

Screen Shot 2015-12-23 at 1.21.00 PM
Spectrogram capture of an entire configuration transmission.

Things got interesting, however, when I took an FFT of an entire transmission. The FFT showed an obvious frequency spread near 19kHz, but lacked the characteristic “double peak” indicating frequency occurrences at both the mark and space frequencies.

FFT of entire configuration transmission.
FFT of entire configuration transmission.
FFT of FSK modulation. Note the very obvious "double peak".
FFT of FSK modulated data. Note the very obvious “double peak” at the mark and space frequencies.

As I examined the FFT, it became clearer and clearer that the configuration data was not being transmitted with an FSK modulation scheme. At this point, I switched to the basic audio oscilloscope tool to try to figure out what was going on. After the first capture, it was pretty obvious that the data was being Amplitude (AM) modulated, with a carrier frequency of 19kHz.

Screen Shot 2015-12-23 at 2.14.16 PM

The data was so clearly AM modulated that I wished I had just popped open the scope to begin with (note to future self)! Here’s a scope capture with a few repeated packets coming through.

Screen Shot 2015-12-23 at 4.01.07 PM

After “configuring” a few different dash buttons and examining the transmitted data, I was getting confused as to why there was so much variation in the peak levels of the packets. I checked for ground loops and background noise before transmitting, and confirmed that the noise floor of my microphone setup was far below the variations in peak amplitude I was seeing. After staring at a few captures, I started to notice that the “variations” were consistent in their amplitudes. Looking some more, I realized that it wasn’t noise at all: the data was intentionally being sent with four distinct amplitude levels!

0000_0000 2 copy

Clever, clever Amazon is using Amplitude-Shift Keying (ASK) modulation with 4-level binary to send the data across to the Dash Button.

The big benefit to this modulation scheme is that it’s got a 2-to-1 compression ratio, so the packet length is theoretically half of the length of an FSK packet. The downside, however, is that the Signal-to-Noise Ratio is halved. This isn’t really a problem, since the data is sent 20 times, and the transmitter (iOS device) can be closely physically located to the receiver (Dash Button).

After these discoveries, I came to a few conclusions:

  • The data is being sent from the iOS app using an ASK modulation scheme, with a carrier frequency of 19kHz. It’s resent 20 times before moving on.
  • Each “bit” (really, two bits) has a nominal bit time of 4ms. There are four levels of bit amplitude and there is no true zero. Every bit level, including 00, has some amplitude associated with it.
  • The first chunk of data is always the same. It looks like a simple calibration sequence, allowing the button to set the decoding thresholds for later down the road.
  • There appears to be both a start and stop glitch on all of the packets. This could be a byproduct of how Amazon is building their ASK packets in-app, or the hardware codec starting and stopping on the iPhone. This glitch isn’t harmful, because the transmission is stable by the time any meaningful data is coming through.
  • The packets are not of a fixed length. Entering a longer SSID or passphrase results in a longer packet.

Now that I had a rough idea of how data was transmitted, I wanted to give decoding some known data a shot. This is where things got really interesting for me, because I’ve got basically no experience in data transmission or communications theory. Luckily, I have a decent eye for patterns, which helped considerably in figuring out what data was represented where in each transmitted packet. I began by choosing an SSID and passphrase that were fairly easy to recognize. I ended up using 7’s and *’s in various combinations and orders. I quickly started to recognize the waveforms of each coming through in the data, but it wasn’t immediately clear how the characters were being translated from their ASCII representation.

7_* (3) copy
Packet containing both 7 and *.

I was getting nervous that some type of encryption was being used on the characters to prevent bored nerds like me from easily snooping on the packets.

In an effort to bruteforce whatever translation was taking place, I sent the characters 1 through 9 in the password field. I assigned amplitude level “1” on the received data as binary 00, level “2” as 01, level “3” as 10, and level “4” as 11. I recorded the ASK levels of each character, and busted out a table of what the received binary data looked like in comparison to the known ASCII value of each character. The first thing that was clear was that the binary representation of each character definitely related to the next, which was good news. This ruled out any sort of encryption or lookup-table based character set. The next observation was that the binary data was decrementing, rather than incrementing as the transmitted ASCII characters should be. It was also evident that it was somehow scrambled or flipped from the known representation.

After a bit of bit order manipulation, I arrived at three conclusions:

  • The levels I picked (level “4” as 11, and level “1” as 00) were incorrect. Flipping these levels yields non-inverted bits, which then results in upwards-counting binary data.
  • Each 8-bit ASCII representation of a character was actually being transmitted “backwards” from how I expected, with the first 2-bits transmitted representing the LSB end of the ASCII character. Characters themselves are transmitted in the order they are entered.
  • Each block is 4 pulses long, which represents a total of 8 bits of data.

Armed with the encoding info, my final task was to write a piece of software which would listen to the audio sent by the iOS app and decode it into various representations. Doing it by hand was fun for a bit, but got tedious quickly. I rather arbitrarily settled on MATLAB, mostly because it’s easy to interface with audio components, manipulate WAV data, and filter and analyze datasets. I also figured it would be a good way to sharpen up my MATLAB since it’s been a bit since I’ve fired it up.

With a few hours of coding, I’ve got a script that can listen via my external mic, trim the acquired data to a single packet (albeit semi-manually), and separate and decode each block into it’s decimal, hexadecimal, and ASCII representations. It then saves this as a CSV file.

To to this, the MATLAB utilizes the built-in MATLAB AudioRecorder function. It then waits for user input in regards to the bounds of a single packet. With these, it trims the data and performs some simple filtering and peak detection. The peak detection is done using a Hilbert Transform (a very common and useful digital peak detection method). It then finds each subsequent peak and indexes them based on their amplitude to find the corresponding binary data.

Captured and trimmed audio data displayed in MATLAB.
Captured and trimmed audio data displayed in MATLAB.
The same packet after filtering and peak detection. Each level of peak is indicated with a different colored symbol.
The same packet after filtering and peak detection. Each level of peak is indicated with a different colored symbol.

I also (for no good reason) wrote a tool that goes in the reverse: punch in an array of 4 levels (1/2/3/4), and out comes a psudeo-ASK representation of it.

Because why not?
Because why not?

Using these software tools and a several packets, I discovered a few things:

  • The first two blocks of hypothesized “calibration sequence” is definitely that. They’re 10 bits each, which doesn’t match the rest of the packet. I’ve looked at hundreds of packets and they all start the same way. My MATLAB code actually uses these to find out where to start looking for real data. Handy!
  • Block 3 (Decimal rep) is the total length of the data which will come after it, in “number of blocks”.
  • Blocks 4-9 in every packet appear to be some sort of UDID/CRC. I’ll come back to this later.
  • Block 10 (Decimal rep) is the length of the SSID, in blocks.
  • Block 11 (ASCII rep) is the first char of the SSID. In this example, it’s only one character long.
  • Block 12 (Decimal rep) is the length of the passphrase. This isn’t always block 12, it’s dependent on whatever the length of the SSID is. It’s also always present immediately after the SSID, regardless if there’s a passphrase or not. If there isn’t, it’s just decimal 0, indicating that there is no passphrase.
  • Block 13 (ASCII rep) is the first char of the passphrase, if it exists. It’s also only one char long in this case.
Various blocks numbered by order of occurrence.
Various blocks numbered by order of occurrence.
Hypothesized purpose of each block of data.
Hypothesized purpose of each block of data.

The last real question remaining is: what are blocks 4-9? In every packet I sent, they were different. I immediately thought some sort of CRC but the packet changed at times when I didn’t change the SSID or the passphrase, so it’s hard for me to tell. I’m leaning toward a on-demand Unique Device identifier (UDID) generated in the iOS app, potentially in combination with a CRC. With 48 bits to spare, a 32 bit UDID along with a 16 bit CRC seems more than reasonable.

With this scheme, device setup would look something like this:

Slide1

  • User logs into their Amazon account from the app. This takes place every time a Dash Button is configured. Amazon then generates a “short” (<=48 bits) UDID for the Dash Button which associates it with an Amazon Account. They also store this somewhere on their servers.
  • The SSID and passphrase for the Wi-Fi connection are sent via audio packet to the Dash Button, along with the UDID that was just generated.
  • The Dash Button parses the data and attempts to connect to the Wi-Fi network. If it’s successful, it phones home to the Amazon servers with the supplied UDID. The Amazon servers “register” the button as active and tell the iOS app to continue setup.
  • From here, any further configuration data is sent to the button over the network, including what account is registered to the button (likely with more sophisticated verification than I’m alluding to*), what product the button is ordering, and shipping preferences.

*Just looking at the string dumps from the Dash Button firmware show that there is more sophisticated authentication taking place, it’s just hard to say when. I’m tempted to decompile the firmware just for fun, but I’ve already spent enough time looking at this damn $5 button…

And of course, here’s the final outcome of my efforts:

BOOM!
BOOM!

I’ve attached my MATLAB code in the off chance anyone wants to try this at home. It’ll probably take some tweaking for your specific setup.

Here’s the MATLAB code on GitHub.

That’s all I’ve got so far. I’m still curious in figuring out the six mystery blocks: if you’ve got any thoughts on it feel free to let me know. I might make another followup post taking a look at the firmware using IDA or something in the future, we’ll see. And of course if any Amazon employees want to get ahold of me and tell me how far off I was, I’d be okay with that too 🙂

Thanks to Matthew Petroff, GitHub user dekuNukem, and anyone else whom I may have forgotten to credit.

EDIT: It’s been pointed out to me by a few looking deeper into the button’s internals that the modulation scheme actually IS FSK with four carriers at 18130, 18620, 19910, and 19600Hz. I believe the reason why it so strongly resembled ASK when I observed the audio packets is because of the awful frequency response at the higher end of my phone, my mic, or both. A linear attenuation right at the top of the audible spectrum would explain the highest frequency being measured as lower amplitude. That being said, all encoding and modulation schemes still apply, with the highest frequency encoding representing binary 11.

In addition, there is in fact a CRC16 attached to each packet. It’s the first two bytes after the packet length declaration. Also, that length byte includes the length of the two bytes of CRC. That leaves 32 bits for the UDID, which is POSTed to the Amazon servers at http://dash-button-na.amazon.com/2/r/oft?countryCode=XX&realm=XXAmazon where XX us US for the United States, DE for Germany, etc.  This jives quite strongly with my initial guess of button registration. Thanks to Benedikt Heinz (@EIZnuh) for sharing some of his research into the button’s firmware!