Downloading & Saving a Nest Cam Live Stream Using a Raspberry Pi + Debian Linux

Tonight, I stumbled on an interesting post on Reddit. It linked to two Nest Cams livestreaming the landfall of Hurricane Irma from a Miami condo. I popped the streams open on my phone (the hurricane had not hit yet, and the sky was mostly clear) and realized that I was probably going to fall asleep before the power went out and killed the stream.

Being an incredibly good-looking and overconfident dweeb, I then figured that there’s got to be a good way to rip and save a Nest Cam livestream so that I could watch it tomorrow, maybe post a time-lapse, and gain all of the Karma. After all, it’s already been done with tons of other streams from various sites. It seemed like a decent Friday night hackathon in the making, and would at least solve the problem of me falling asleep lest I fail.

I decided I wanted something that was (1) automatic, and (2) running in the background so that I didn’t need to stay up all night or keep my computer on. I ran through the list of options in my head:

  1. Just stream them in a browser window and use screen capture (lame and fails both #1 and #2),
  2. Use one of the millions of Chrome or Firefox plugins that allows for saving of streams (extra lame and still fails #2),
  3. Use some sort of stream-ripping software built for Linux so I could load it on my always-running Pi (not lame, but impossible for me to find something that worked after looking for an hour or so), or
  4. Hack it and do it myself.

If you haven’t guessed already, I went with #4. I’m going to show you how I figured it out and how to do it yourself. This assumes some basic knowledge of Linux command line and shell scripting.

First things first, I loaded one of the Nest Cam streams using the links provided on Reddit. The video livestream itself sits inside a Nest-branded HTML page that does this really annoying thing where it auto-pauses and pops over a Nest advertisement every once in a while. If I wasn’t going to rip it out already, I would have been sufficiently annoyed by this to figure out how to get to the base stream.

I poked around inside the page source using the Safari dev tools to see if I could find any obvious stream container or link, but didn’t see anything. I did find a more minimal stream that is formatted for Twitter but it still does the popover thing. Boo. I also poked around in the JavaScript (warning: there’s a lot) to see if the stream was being lazy-fetched from any obvious source. Again, nothing. Boo.

I decided to use the Timelines tool to see what’s being loaded on the network. I recorded for a few seconds and saw what was clearly a periodic fetch taking place. There’s an XHR request going out approximately every 4 seconds, and each round loads a media_xxxxxxxx_123.ts file and a chunklist_xxxxxxxx.m3u8 file. This is definitely an MPEG-2 transport stream served HLS-style, with the chunklist serving as a manifest for the media .ts segments. Bingo!

.m3u8 files are commonly used to define video streams, and so I knew I was on the right track. Right-clicking on the m3u8 file and choosing “Copy Link Address” and pasting it into the Safari address bar yielded a base-level video stream with no extra junk (*cough*) on top of it. It looks like Nest streams their livestream content from stream-bravo.dropcam.com or from stream-delta.dropcam.com. (Both are currently using Wowza Streaming Engine 4 Subscription Edition 4.7.1 build20635)
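If you have never looked inside a chunklist, it is just a short text file listing the current segments. Here is a minimal Python sketch of pulling the segment URLs out of one. The sample playlist text is made up for illustration (it is not a capture from the Nest servers), and `segment_urls` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Made-up example of an HLS chunklist; real ones differ in detail.
SAMPLE_CHUNKLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:123
#EXTINF:4.0,
media_example_123.ts
#EXTINF:4.0,
media_example_124.ts
"""

def segment_urls(playlist_text, base_url):
    """Return absolute URLs for every media segment listed in a playlist."""
    urls = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments; anything else is a segment URI.
        if line and not line.startswith("#"):
            urls.append(urljoin(base_url, line))
    return urls
```

Segment names are resolved relative to the chunklist's own URL, which is why copying the chunklist link alone is enough for a player (or ffmpeg) to follow the stream.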

The next step was saving the stream using this URL. Time to break out the Pi! I figured I could use ffmpeg to do this, and after a quick Google search, my assumptions were confirmed. This Stack Overflow question gave me what I needed, except I wanted to ensure that the ffmpeg command was always running (in the event the stream broke up and was restarted, a network issue occurred, etc).

For those of you who just want to save a Nest Cam stream to disk using Raspbian/Raspberry Pi/Debian/Other Linux, this is the command that will do it for you (you need ffmpeg installed in order to use this):

ffmpeg -i http://your_stream_chunklist_link.m3u8 -c copy -bsf:a aac_adtstoasc /path/to/output/file.mp4

For example, this is the command I used to save the stream I was watching to my home directory:

ffmpeg -i https://stream-delta.dropcam.com/nexus_aac/a8a645a10ef24a50b250c14a08b02ef9/chunklist_w719996219.m3u8 -c copy -bsf:a aac_adtstoasc Stream.mp4

In order to make sure that ffmpeg was always restarted in case of any issues, I whipped up the following shell script (named runStream.sh) to be run as a cronjob:

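#!/bin/bash
# runStream.sh
# Make sure the ffmpeg capture is always running.

# Text to look for in the process list. It must appear in the ffmpeg
# command line below, otherwise a new ffmpeg is started on every run.
process="chunklist_w719996219"
# Punctuation-less timestamp so each restart writes to a new file.
now=$(date +%Y%m%d%H%M%S)
makerun="ffmpeg -i https://stream-delta.dropcam.com/nexus_aac/a8a645a10ef24a50b250c14a08b02ef9/chunklist_w719996219.m3u8 -c copy -bsf:a aac_adtstoasc /media/HDD/Stream_$now.mp4"

# Already running? Then there is nothing to do.
if ps ax | grep -v grep | grep "$process" > /dev/null
then
 exit
else
 # Not running: start ffmpeg in the background.
 $makerun &
fi

exit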

The script checks to see if the ffmpeg command is running using ps ax and grep. If it is, there is no need to start it, so it exits. If it isn’t, ffmpeg is started in the background using the command stored in makerun. Note the $now variable at the end of the filename: it automatically appends a punctuation-less timestamp to each video file, so that the previous file is not lost when ffmpeg is automatically restarted.
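The same keep-it-running idea can also be sketched in Python as a single long-lived loop instead of a cron job. This is only an illustration under assumptions: the function names, the retry delay, and the restart policy are all hypothetical, and only the ffmpeg flags come from the command shown earlier.

```python
import subprocess
import time
from datetime import datetime

def output_path(prefix, when):
    """Build a filename ending in a punctuation-less timestamp plus .mp4."""
    return f"{prefix}_{when.strftime('%Y%m%d%H%M%S')}.mp4"

def ffmpeg_args(stream_url, out_file):
    """Same flags as the command shown earlier, as an argument list."""
    return ["ffmpeg", "-i", stream_url, "-c", "copy",
            "-bsf:a", "aac_adtstoasc", out_file]

def record_forever(stream_url, prefix, retry_delay=5):
    """Run ffmpeg and start a fresh file whenever it exits (assumed policy)."""
    while True:
        args = ffmpeg_args(stream_url, output_path(prefix, datetime.now()))
        subprocess.run(args)      # blocks until ffmpeg stops for any reason
        time.sleep(retry_delay)   # short pause before reconnecting
```

Calling `record_forever` with your chunklist link and an output prefix such as `/media/HDD/Stream` would behave much like the cron approach, with the trade-off that nothing restarts the Python process itself if it dies.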

The last thing to do was to make the script executable using chmod +x runStream.sh and add it to the crontab using crontab -e. I set it to run every minute (can’t miss any of the action!) using the following crontab:

# m h  dom mon dow   command
* * * * * /home/pi/runStream.sh

After saving the changes and waiting a minute, I saw the first video file pop up. A few hours in, the auto-restart has proven to be a great idea: it’s kicked in several times (likely due to haphazard internet because there’s a HURRICANE).

Stay safe out there, Florida. It’s going to get crazy.

Reverse Engineering the Amazon Dash Button’s Wireless Audio Configuration

Update: I’ve learned a bit more since writing this; see the bottom of the post for an update. I’ve been slow to update this, but now that it seems to be gaining a lot of traffic (leading up to 33c3…hmmmm ;)), I want the info to be accurate and fresh.

During the great Amazon Dash Button Hype of 2015, I saw a few of the early teardowns and blog posts and decided to order a few dash buttons of my own to play around with and reverse engineer. Since the hype has burned off, there hasn’t been much in the way of new information about the inner workings of the button.

[Photo: Dash Button PCB. Photo credit: Matthew Petroff]

The Amazon Dash Button is a neat little IoT device which contains an STM32F205 ARM Cortex-M3 microcontroller, a Broadcom BCM43362 Wi-Fi module, a permanently attached (boo!) Energizer Lithium AAA battery, an InvenSense I2S digital microphone, some serial flash, and assorted LEDs and switching power supplies. For the $5 price tag, the Dash Button packs some serious punch! The components alone are worth considerably more than $5.

Playing around with the button, the setup process on iOS quickly caught my attention. It (apparently*) differs from the Android setup process considerably due to differences in the inner workings of iOS. The Android setup involves connecting to the button via a network called “Amazon ConfigureMe”, while the iOS app appears to use ultrasound-esque audio to transfer information to the button for the initial setup.

*I don’t actually have an Android device on hand to test this with, hence the “apparently”.

Without even opening the button, I put together a basic theory on how the button was set up from the iOS app: The app sends a carefully crafted “audio” packet using the iOS CoreAudio Framework, which is then picked up by the Dash Button’s onboard mic and parsed for Wi-Fi config info. If the Wi-Fi credentials are correct, the button phones home to the Amazon configuration servers and the setup continues, but with further config info being sent directly to the button over the Wi-Fi.

I immediately ripped apart the button in search of a way to piggyback on the ADMP441 digital microphone’s I2S bus. I figured it would be trivial to toss a logic analyzer on the bus and decode what I2S data was being sent to the STM32. Since I2S is a very commonly used and extremely well documented audio protocol, I counted on this being a relatively quick task.

While I was impressed with the density of the design, I was most definitely not impressed with the lack of a visible testpoint on the board for the digital microphone’s data line. The EN (enable), SCK (clock), and WS (word select) lines are easily available, but the SD (data) line is nowhere to be found. I poked around for a bit but didn’t see anything that looked promising. I quickly came to the realization that I was probably going to have to analyze the audio protocol as it came out of my iPhone rather than sniff it on the board. This was about the same time that I also realized this was not going to be the quick and dirty analysis I was expecting…

Armed with my RØDE shotgun mic, I took a new approach. Using Electroacoustics Toolbox, I performed some basic audio analysis on the packets coming from the Amazon iOS app. Based on Matthew Petroff’s Dash Button Teardown, I initially expected some sort of Frequency-Shift Keying (FSK) modulation scheme. Using the Spectrogram tool, I could see that the configuration data was definitely coming in bursts of 20 packets in a try-retry scheme. It also looked like the frequency of the audio was spread out between 18kHz and 20kHz, which is par for the course for an audio FSK implementation.

Spectrogram capture of an entire configuration transmission.

Things got interesting, however, when I took an FFT of an entire transmission. The FFT showed an obvious frequency spread near 19kHz, but lacked the characteristic “double peak” indicating frequency occurrences at both the mark and space frequencies.

FFT of entire configuration transmission.
FFT of FSK modulated data. Note the very obvious “double peak” at the mark and space frequencies.

As I examined the FFT, it became clearer and clearer that the configuration data was not being transmitted with an FSK modulation scheme. At this point, I switched to the basic audio oscilloscope tool to try to figure out what was going on. After the first capture, it was pretty obvious that the data was being amplitude modulated (AM), with a carrier frequency of 19kHz.

[Screenshot: oscilloscope capture of the configuration audio]

The data was so clearly AM modulated that I wished I had just popped open the scope to begin with (note to future self)! Here’s a scope capture with a few repeated packets coming through.

[Screenshot: scope capture with a few repeated packets]

After “configuring” a few different dash buttons and examining the transmitted data, I was getting confused as to why there was so much variation in the peak levels of the packets. I checked for ground loops and background noise before transmitting, and confirmed that the noise floor of my microphone setup was far below the variations in peak amplitude I was seeing. After staring at a few captures, I started to notice that the “variations” were consistent in their amplitudes. Looking some more, I realized that it wasn’t noise at all: the data was intentionally being sent with four distinct amplitude levels!

[Image: captured packet showing the four distinct amplitude levels]

Clever, clever Amazon is using Amplitude-Shift Keying (ASK) modulation with 4-level binary to send the data across to the Dash Button.

The big benefit to this modulation scheme is that each symbol carries two bits instead of one, effectively a 2-to-1 reduction, so the packet is theoretically half the length of a two-tone FSK packet. The downside, however, is that four levels sit closer together than two would, so there is less margin against noise. This isn’t really a problem, since the data is sent 20 times, and the transmitter (iOS device) can be placed physically close to the receiver (Dash Button).

After these discoveries, I came to a few conclusions:

  • The data is being sent from the iOS app using an ASK modulation scheme, with a carrier frequency of 19kHz. It’s resent 20 times before moving on.
  • Each “bit” (really, two bits) has a nominal bit time of 4ms. There are four levels of bit amplitude and there is no true zero. Every bit level, including 00, has some amplitude associated with it.
  • The first chunk of data is always the same. It looks like a simple calibration sequence, allowing the button to set the decoding thresholds for later down the road.
  • There appears to be both a start and stop glitch on all of the packets. This could be a byproduct of how Amazon is building their ASK packets in-app, or the hardware codec starting and stopping on the iPhone. This glitch isn’t harmful, because the transmission is stable by the time any meaningful data is coming through.
  • The packets are not of a fixed length. Entering a longer SSID or passphrase results in a longer packet.
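To make those observations concrete, here is a rough Python sketch that synthesizes a signal with the same shape: a 19kHz carrier, 4ms per symbol, and four non-zero amplitude levels. The sample rate and the actual amplitude values are assumptions picked for illustration; they were not measured from the app.

```python
import math

SAMPLE_RATE = 48000       # assumed playback rate, in Hz
CARRIER_HZ = 19000        # carrier frequency observed above
SYMBOL_SECONDS = 0.004    # nominal 4 ms per symbol
# Hypothetical amplitudes for levels 1..4; note that none of them is zero.
LEVEL_AMPLITUDE = {1: 0.25, 2: 0.50, 3: 0.75, 4: 1.00}

def ask_waveform(levels):
    """Return a list of float samples, one carrier burst per amplitude level."""
    per_symbol = round(SAMPLE_RATE * SYMBOL_SECONDS)
    samples = []
    for level in levels:
        amplitude = LEVEL_AMPLITUDE[level]
        for _ in range(per_symbol):
            t = len(samples) / SAMPLE_RATE  # keeps the carrier phase continuous
            samples.append(amplitude * math.sin(2 * math.pi * CARRIER_HZ * t))
    return samples
```

At these settings each symbol is 192 samples long, so a one-byte block of four symbols lasts 16ms.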

Now that I had a rough idea of how data was transmitted, I wanted to give decoding some known data a shot. This is where things got really interesting for me, because I’ve got basically no experience in data transmission or communications theory. Luckily, I have a decent eye for patterns, which helped considerably in figuring out what data was represented where in each transmitted packet. I began by choosing an SSID and passphrase that were fairly easy to recognize. I ended up using 7’s and *’s in various combinations and orders. I quickly started to recognize the waveforms of each coming through in the data, but it wasn’t immediately clear how the characters were being translated from their ASCII representation.

Packet containing both 7 and *.

I was getting nervous that some type of encryption was being used on the characters to prevent bored nerds like me from easily snooping on the packets.

In an effort to brute-force whatever translation was taking place, I sent the characters 1 through 9 in the password field. I assigned amplitude level “1” on the received data as binary 00, level “2” as 01, level “3” as 10, and level “4” as 11. I recorded the ASK levels of each character, and busted out a table of what the received binary data looked like in comparison to the known ASCII value of each character. The first thing that was clear was that the binary representation of each character definitely related to the next, which was good news. This ruled out any sort of encryption or lookup-table based character set. The next observation was that the binary data was decrementing, rather than incrementing as the transmitted ASCII characters should be. It was also evident that it was somehow scrambled or flipped from the known representation.

After a bit of bit order manipulation, I arrived at three conclusions:

  • The levels I picked (level “4” as 11, and level “1” as 00) were incorrect. Flipping these levels yields non-inverted bits, which then results in upwards-counting binary data.
  • Each 8-bit ASCII representation of a character was actually being transmitted “backwards” from how I expected, with the first 2 bits transmitted representing the LSB end of the ASCII character. Characters themselves are transmitted in the order they are entered.
  • Each block is 4 pulses long, which represents a total of 8 bits of data.
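Put together, those three conclusions amount to a small decoding routine. The Python sketch below is one reading of them, not the original MATLAB: it assumes the two middle levels map linearly (level 2 as 10, level 3 as 01) and that each 2-bit pair is taken as a plain value with its high bit first. The function names are hypothetical.

```python
def level_to_bits(level):
    """Map amplitude level 1..4 to a 2-bit value: level 4 -> 0b00, level 1 -> 0b11."""
    if level not in (1, 2, 3, 4):
        raise ValueError(f"level must be 1..4, got {level!r}")
    return 4 - level

def decode_block(levels):
    """Turn one block of 4 levels into a byte; the first pulse holds the two lowest bits."""
    if len(levels) != 4:
        raise ValueError("a block is exactly 4 pulses")
    value = 0
    for position, level in enumerate(levels):
        value |= level_to_bits(level) << (2 * position)
    return value

def decode_levels(levels):
    """Decode a flat list of levels (length a multiple of 4) into bytes."""
    return bytes(decode_block(levels[i:i + 4]) for i in range(0, len(levels), 4))
```

Under these assumptions, `decode_levels([1, 3, 1, 4, 2, 2, 2, 4])` gives `b"7*"`: the character 7 is 0x37, whose 2-bit pairs from the low end are 11, 01, 11, 00, and * is 0x2A, whose pairs are 10, 10, 10, 00.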

Armed with the encoding info, my final task was to write a piece of software which would listen to the audio sent by the iOS app and decode it into various representations. Doing it by hand was fun for a bit, but got tedious quickly. I rather arbitrarily settled on MATLAB, mostly because it’s easy to interface with audio components, manipulate WAV data, and filter and analyze datasets. I also figured it would be a good way to sharpen up my MATLAB since it’s been a bit since I’ve fired it up.

With a few hours of coding, I’ve got a script that can listen via my external mic, trim the acquired data to a single packet (albeit semi-manually), and separate and decode each block into its decimal, hexadecimal, and ASCII representations. It then saves this as a CSV file.

To do this, the script uses MATLAB’s built-in audiorecorder function. It then waits for user input specifying the bounds of a single packet. With these, it trims the data and performs some simple filtering and peak detection. The peak detection is done using a Hilbert Transform (a very common and useful way to extract a signal’s envelope digitally). It then finds each subsequent peak and indexes them based on their amplitude to find the corresponding binary data.
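For a feel of the last two steps, here is a simplified Python stand-in. It does not use a Hilbert transform: it swaps in a cruder approach, taking the largest absolute sample in each symbol-length slice as the peak, then snapping each peak to one of four levels. It assumes the packet is already trimmed and symbol-aligned, and that the four levels are evenly spaced between a lowest and highest amplitude supplied by the caller; both are assumptions, and the function names are hypothetical.

```python
def symbol_peaks(samples, samples_per_symbol):
    """Peak absolute amplitude of each full symbol-length slice."""
    peaks = []
    last_start = len(samples) - samples_per_symbol
    for start in range(0, last_start + 1, samples_per_symbol):
        window = samples[start:start + samples_per_symbol]
        peaks.append(max(abs(x) for x in window))
    return peaks

def quantize(peaks, lowest, highest):
    """Snap each peak to the nearest of four evenly spaced levels, numbered 1..4."""
    step = (highest - lowest) / 3
    levels = []
    for peak in peaks:
        index = round((peak - lowest) / step)
        levels.append(min(4, max(1, index + 1)))  # clamp out-of-range peaks
    return levels
```

A real capture would need the filtering step first, since a single noise spike inside a slice is enough to throw off a max-based peak.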

Captured and trimmed audio data displayed in MATLAB.
The same packet after filtering and peak detection. Each level of peak is indicated with a different colored symbol.

I also (for no good reason) wrote a tool that goes in the reverse: punch in an array of 4 levels (1/2/3/4), and out comes a pseudo-ASK representation of it.

Because why not?
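A tool like that needs an array of levels to start from. As a companion sketch in Python (not the MATLAB tool itself), this turns ASCII text into levels using the block layout worked out earlier: low-order pair first, level 4 for 00 through level 1 for 11. The linear mapping of the middle levels is an assumption and the function names are hypothetical.

```python
def byte_to_levels(value):
    """Split a byte into four 2-bit pairs, lowest pair first, mapped to levels 1..4."""
    if not 0 <= value <= 255:
        raise ValueError("value must fit in one byte")
    return [4 - ((value >> shift) & 0b11) for shift in (0, 2, 4, 6)]

def text_to_levels(text):
    """Encode ASCII text as a flat list of levels, characters in the order entered."""
    levels = []
    for byte in text.encode("ascii"):
        levels.extend(byte_to_levels(byte))
    return levels
```

For example, `text_to_levels("7*")` returns `[1, 3, 1, 4, 2, 2, 2, 4]`.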

Using these software tools and several packets, I discovered a few things:

  • The first two blocks, the hypothesized “calibration sequence”, are definitely that. They’re 10 bits each, which doesn’t match the rest of the packet. I’ve looked at hundreds of packets and they all start the same way. My MATLAB code actually uses these to find out where to start looking for real data. Handy!
  • Block 3 (Decimal rep) is the total length of the data which will come after it, in “number of blocks”.
  • Blocks 4-9 in every packet appear to be some sort of UDID/CRC. I’ll come back to this later.
  • Block 10 (Decimal rep) is the length of the SSID, in blocks.
  • Block 11 (ASCII rep) is the first char of the SSID. In this example, it’s only one character long.
  • Block 12 (Decimal rep) is the length of the passphrase. This isn’t always block 12; it depends on the length of the SSID. It’s also always present immediately after the SSID, regardless of whether there’s a passphrase or not. If there isn’t, it’s just decimal 0, indicating that there is no passphrase.
  • Block 13 (ASCII rep) is the first char of the passphrase, if it exists. It’s also only one char long in this case.
Various blocks numbered by order of occurrence.
Hypothesized purpose of each block of data.
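The block layout listed above can be written down as a small parser. This Python sketch follows that hypothesized layout only: it expects the decoded byte values starting at block 3 (the two calibration blocks already stripped) and treats blocks 4-9 as six opaque bytes. `DashPacket` and `parse_packet` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class DashPacket:
    declared_length: int  # block 3: number of blocks that follow it
    mystery: bytes        # blocks 4-9: purpose not yet known at this point
    ssid: str
    passphrase: str

def parse_packet(blocks):
    """Parse decoded blocks, starting at block 3 (calibration blocks removed)."""
    data = bytes(blocks)
    declared_length = data[0]
    body = data[1:1 + declared_length]
    if len(body) != declared_length:
        raise ValueError("packet is shorter than its declared length")
    mystery = body[0:6]
    ssid_len = body[6]
    ssid_end = 7 + ssid_len
    ssid = body[7:ssid_end]
    pass_len = body[ssid_end]          # always present, 0 when there is no passphrase
    passphrase = body[ssid_end + 1:ssid_end + 1 + pass_len]
    return DashPacket(declared_length, mystery,
                      ssid.decode("ascii"), passphrase.decode("ascii"))
```

With a one-character SSID and a one-character passphrase, ten blocks follow the length block, which matches the 13-block example above.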

The last real question remaining is: what are blocks 4-9? In every packet I sent, they were different. I immediately thought some sort of CRC, but the packet changed at times when I didn’t change the SSID or the passphrase, so it’s hard for me to tell. I’m leaning toward an on-demand Unique Device Identifier (UDID) generated in the iOS app, potentially in combination with a CRC. With 48 bits to spare, a 32 bit UDID along with a 16 bit CRC seems more than reasonable.

With this scheme, device setup would look something like this:

[Diagram: hypothesized device setup flow]

  • User logs into their Amazon account from the app. This takes place every time a Dash Button is configured. Amazon then generates a “short” (<=48 bits) UDID for the Dash Button which associates it with an Amazon Account. They also store this somewhere on their servers.
  • The SSID and passphrase for the Wi-Fi connection are sent via audio packet to the Dash Button, along with the UDID that was just generated.
  • The Dash Button parses the data and attempts to connect to the Wi-Fi network. If it’s successful, it phones home to the Amazon servers with the supplied UDID. The Amazon servers “register” the button as active and tell the iOS app to continue setup.
  • From here, any further configuration data is sent to the button over the network, including what account is registered to the button (likely with more sophisticated verification than I’m alluding to*), what product the button is ordering, and shipping preferences.

*Just looking at the string dumps from the Dash Button firmware shows that there is more sophisticated authentication taking place; it’s just hard to say when. I’m tempted to decompile the firmware just for fun, but I’ve already spent enough time looking at this damn $5 button…

And of course, here’s the final outcome of my efforts:

BOOM!

I’ve attached my MATLAB code in the off chance anyone wants to try this at home. It’ll probably take some tweaking for your specific setup.

Here’s the MATLAB code on GitHub.

That’s all I’ve got so far. I’m still curious about the six mystery blocks: if you’ve got any thoughts on them, feel free to let me know. I might make another followup post taking a look at the firmware using IDA or something in the future; we’ll see. And of course if any Amazon employees want to get ahold of me and tell me how far off I was, I’d be okay with that too 🙂

Thanks to Matthew Petroff, GitHub user dekuNukem, and anyone else whom I may have forgotten to credit.

EDIT: It’s been pointed out to me by a few people looking deeper into the button’s internals that the modulation scheme actually IS FSK with four carriers at 18130, 18620, 19110, and 19600Hz. I believe the reason why it so strongly resembled ASK when I observed the audio packets is because of the awful frequency response at the higher end of my phone, my mic, or both. A linear attenuation right at the top of the audible spectrum would explain the highest frequency being measured as lower amplitude. That being said, the encoding details described above still apply, with the highest frequency encoding representing binary 11.
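Given that, a decoder that compares tone energy rather than amplitude should be less sensitive to a sloping frequency response. One common way to measure energy at a handful of known frequencies is the Goertzel algorithm; the Python sketch below is a generic version of it, not anything taken from the button’s firmware. The list of candidate frequencies and the sample rate are supplied by the caller, and the function names are hypothetical.

```python
import math

def goertzel_power(samples, freq_hz, sample_rate):
    """Signal power at one frequency, computed with the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2

def strongest_tone(samples, candidates_hz, sample_rate):
    """Return whichever candidate frequency carries the most power in this slice."""
    return max(candidates_hz, key=lambda f: goertzel_power(samples, f, sample_rate))
```

Called once per 4ms slice with the four carrier frequencies, `strongest_tone` would yield one symbol per slice, which can then go through the same 2-bits-per-symbol decoding as before.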

In addition, there is in fact a CRC16 attached to each packet. It’s the first two bytes after the packet length declaration. Also, that length byte includes the length of the two bytes of CRC. That leaves 32 bits for the UDID, which is POSTed to the Amazon servers at http://dash-button-na.amazon.com/2/r/oft?countryCode=XX&realm=XXAmazon where XX is US for the United States, DE for Germany, etc. This jibes quite strongly with my initial guess of button registration. Thanks to Benedikt Heinz (@EIZnuh) for sharing some of his research into the button’s firmware!