My Motivation (for building a rig + writing about it)
So, cut to the chase already, man, right?
Ok, will do. Here’s what specifically led me to build my own rig, and roughly how I wanted to go about it.
The initial catalyst was the Fast.ai forums, and if you haven’t wandered around that place, please do – IMHO there’s really nowhere else like it. I came across a great thread there on building a rig, and Slav Ivanov’s well-written post on the subject, both of which served as amazing starting points. Since these posts already cover the general reasons I decided to build a home rig (e.g. savings + performance over IaaS and MLaaS platforms like AWS), I’ll focus here on the criteria specific to me.
My main self-imposed constraints were that the machine needed to be:
- Small(ish): I live in NYC, where free (as in freedom) space is at a premium. As such, this computer has to fit in a tight space, tucked under a shelf in a compartment that measures 17"H x 10"W x 19"D. This primarily affects the chassis size, which in turn has obvious implications for the other components too.
- Quiet(ish): The aforementioned cabinet is in the common area of our apartment, and with a baby on the way the machine needed to produce a low hum at the loudest.
- Expandable: I tend to maintain a deep focus on the current topic I’m interested in, but that topic can change frequently. As such, this computer should be able to last me a couple of years, which means that it will likely need additional components as new libraries and methods emerge.
- Suitable for the GPU: This is the pièce de résistance of the deep learning rig, so I want everything else to be designed around that. At the time of writing, the GeForce GTX 1080 Ti was the most powerful Nvidia card on the market that would be suitable/affordable for home use.
- Reasonably Priced: While “reasonable” is subjective, I chose sub-$1800 as my price point to keep things somewhat under control.
If you’re feeling antsy, the gist of this section is included in this PCPartPicker link, and you can skip to the build section below for how it all fits together (Note: I did find better pricing than PCPartPicker offered though, which I detail in the next section if you’re looking to recreate this machine):
If, however, you want a more thorough breakdown, well, here you are.
1. The Case
To fit my cramped space requirement, I knew I needed a mATX size case or smaller. This selection is all about balancing the external space consumed with the internal space provided for all components, and so I started with mATX and referred back after selecting each subsequent component.
I ended up purchasing the Thermaltake Core V21, which had the most important aspects I was looking for:
- PSU fit: supports a standard ATX PS2-size power supply, so the one I selected (5.9x3.4x5.5") fits
- GPU fit: length of the graphics card tends to be the limiting factor here. The case I selected allows for 13.78" and the card I tentatively wanted was a 1080ti that measured 11.73"
- CPU cooler height: case allows for 7.283" and most coolers I was looking at were around 5.43" tall.
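If you'd rather sanity-check clearances in code than in your head, here's a minimal sketch using the figures quoted above. The `Clearance` class and its field names are made up for illustration; they don't come from PCPartPicker or any other tool.

```python
from dataclasses import dataclass


@dataclass
class Clearance:
    """One 'does it fit?' check. Names are illustrative only."""
    name: str
    allowed_in: float  # maximum the case allows, in inches
    needed_in: float   # size of the part being considered, in inches

    @property
    def margin_in(self) -> float:
        # Positive margin means spare room; negative means the part is too big.
        return self.allowed_in - self.needed_in

    @property
    def fits(self) -> bool:
        return self.margin_in >= 0


# Figures quoted in the list above for the case and the parts being considered.
checks = [
    Clearance("GPU length", allowed_in=13.78, needed_in=11.73),
    Clearance("CPU cooler height", allowed_in=7.283, needed_in=5.43),
]

for check in checks:
    verdict = "fits" if check.fits else "too big"
    print(f"{check.name}: {verdict} (margin {check.margin_in:.2f} in)")
```

Both parts clear with room to spare, which also leaves some slack for airflow.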
2. The GPU
The component I had the strongest opinion about coming in was the GPU; based on chats with friends who know better, I tended toward wanting Nvidia’s GeForce 1080 Ti for my chip. With that in mind, I still had to choose between multiple permutations:
- Price: This is an area where, given the resources, one could really go all out. The Titan X and Titan Z, for instance, would give amazing performance, but only incrementally so given the huge price jump. There was also high price variation even within the 1080 Ti family though.
- Size: There is a surprisingly high variation in the sizes of these cards, primarily due to fan and heat sink designs.
- Temperature: I have had cards burn out on home builds in the past, so I’m particularly sensitive to this, especially considering the tight quarters my rig was to live in.
- Memory: This affects how large your batch size can be, which affects how well the linear-algebra-optimized processor can speed up your training.
- Sound: A relatively minor consideration, but some GPUs really whir up a storm. I’d like one that can maintain a low level of noise without catching fire…
Initially, I was drawn to the Zotac Mini, since it is relatively tiny and could potentially provide more space in my rig for airflow and other components. However, after checking out the specs and some reviews, it appeared that while it produces similar performance metrics and is priced as a budget card, it runs significantly hotter with louder fans. Quiet fans are an explicit requirement of mine, and as I mentioned, temperature is a touchy subject. I checked out its big brother, the Zotac AMP! Extreme, but after watching another video review I saw that the copper coils and plates that comprise the heat sink were so big they’d cover other PCIe slots on most motherboards.
That AMP! review video, however, did lead me to my ultimate decision, which was the ASUS STRIX.
3. The CPU
Within the deep learning context, this guy is important for applications like preprocessing, and basically anything that isn’t massively parallelized and therefore can’t take advantage of a GPU’s thousands of cores (i.e. the CPU would be best suited for data augmentation, whereas a GPU would be better for simultaneous convolutions of multiple filters over multiple images).
The big considerations here were:
- Cores: The number of cores in a CPU determines a ceiling for the number of processes that can occur in parallel – higher is typically better. It is important to note that with hyperthreading there are typically more “logical” cores than the number of physical cores would indicate (e.g. Intel’s i7 7700k processors have 8 threads available across the 4 cores).
- Speed (cycles per second): It should be self-evident that the number of instructions per second that the processor can complete is important. My goal was essentially “go as fast as possible without breaking the bank.”
- Intel vs AMD: This debate has been going on in multiple contexts for as long as these guys have been around. The new AMD Ryzen chips stack up pretty nicely even against Intel’s i7 chip, and have the benefit of having tons more cores. However, the Ryzen-compatible motherboards that I was able to find would not fit with the other components, either logically or physically. So, after reviewing a couple of other folks’ opinions on both sides, and prioritizing single-core speed over number of cores (since I’ll be planning to train on the GPU when I need many threads), I decided on Intel.
- PCIe lanes: This is basically how many lanes of traffic comprise the freeway between the memory, CPU, and GPU (see the diagram below). Like many CPU criteria, this particular consideration is technically related to not just the CPU, but also the motherboard architecture. Since I’m considering expansion to 2 GPUs at some point, I’d ideally like an architecture that doesn’t artificially limit the traffic that can pass to these GPUs. While GPUs typically plug into 16 physical lanes, many processor/motherboard-chipset combos do limit the second card to just 8 or even fewer due to the architecture. Unfortunately this is the case for Intel chips unless I want to drop $1000+ which is beyond my current budget. But it appears that this bottleneck isn’t too significant for modern PCIe-3.0 devices.
- Motherboard Compatibility: In short, make sure that your motherboard works with the CPU you’ve chosen, or vice versa. For instance, an i5-6600k is typically paired with a Z170 chipset, while an i7-7700k is typically paired with a Z270 chipset.
Looking at some benchmarks for the i5-6600k vs i7-7700k, it does appear to be worth the slightly higher investment.
4. The CPU Cooler
Since I chose the unlocked (overclockable) 7700k, the onus was on me to get my own cooler. (Intel presumably doesn’t provide the stock cooler, because unlike the stock 7700, 99% of people buying the k version are using it to overclock, and the stock cooler is only good for stock speeds).
Given the option to choose my own, I wanted the quietest, most efficient cooler out there. Initially it looked like Noctua had it, and even made a low-profile version. However, they recommend that it not be used in conjunction with the i7-7700k.
I recalled buying a Coolermaster 212 back in the day, and started looking at them. After some searching, it still appears to be widely touted; the newer 212x looks like it beats the 212, and even came in cheaper than the 212 after rebate! Done deal.
5. The RAM
- Speed: It seems this actually doesn’t matter much, but I knew I wanted at least DDR4-2400 (2400 million data transfers per second) so that it doesn’t bottleneck.
- DIMM Sizes: I’ve heard multiple times as a rule of thumb that one should have about 2GB of RAM for every 1GB of GPU memory. Because my GPU has 11GB, I knew I wanted to drop in at least 22GB of RAM (I rounded up to 32GB). But should I go with a 2x16 kit, a single 32GB stick, or 4x8? I chose 2x16. It provides redundancy if one stick goes out, allows for expandability, and minimizes price/GB. I bought them as a kit so their timings would be matched as well.
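The rule-of-thumb arithmetic is simple enough to do on a napkin, but here it is as a small sketch. The 2:1 ratio is the rule of thumb mentioned above; the list of common total capacities is an assumption for illustration, and the function names are made up.

```python
def min_ram_gb(gpu_mem_gb: float, ratio: float = 2.0) -> float:
    """Rule of thumb: about `ratio` GB of system RAM per 1 GB of GPU memory."""
    return gpu_mem_gb * ratio


def smallest_total_that_fits(required_gb: float, totals_gb=(8, 16, 32, 64)) -> int:
    """Pick the smallest common total capacity that covers the requirement.

    `totals_gb` is an assumed list of typical totals, not an exhaustive one.
    """
    for total in sorted(totals_gb):
        if total >= required_gb:
            return total
    raise ValueError(f"no listed capacity covers {required_gb} GB")


required = min_ram_gb(11)  # the 1080 Ti has 11GB of memory
print(f"need at least {required:.0f} GB, so buy {smallest_total_that_fits(required)} GB")
```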
6. The Disk
With several physical hard disk drive failures littered across my past, I had already made up my mind I’d be buying solid state this time around. At the risk of dating myself, I’ll admit that SSDs were just rolling out the last time I’d built a machine, so my first reaction was one of amazement to see how compact and inexpensive they are these days!
My strategy was to get a 500GB min SSD for the OS and data sets I’m actively working on, and will eventually also put in a HDD for super cheap cold storage when the SSD caps out.
The Samsung 850 EVO was the cheapest and highest-rated SSD around, but it uses SATA connections, and I wanted NVMe to let my data fly across to the RAM, processor, and GPU. Luckily, the Samsung 960 EVO series uses actual NVMe tech, and is super easy to install into an M.2 slot.
7. The Motherboard
It’s helpful to select this last, since it’s essentially the substrate that all the other components do their work against. Thus, the constraints are dictated by the components:
- Chipset: needs to run the Z270 chipset to support the selected CPU
- Size: needs to fit in the microATX case I selected
- Expansion: needs to support GPU expansion (PCIe slots, 16 lanes if possible), RAM expansion (4 slots), disk expansion (more than one M.2 slot for potentially multiple SSDs, several SATA ports for future HDDs)
- Network Speed: I have a 1Gbps fiber connection at home, and didn’t want to artificially limit that with a crappy network card. Need one that supports gigabit ethernet.
Beyond these basic constraints, here are a few additional key tips, considerations, and potential “gotchas” I feel are worth noting:
- RTFM: reading the manual is pretty critical – I almost bought 2 boards with really bad compatibility issues that weren’t noted in the online listings, and only appeared in the pdf manuals, which were readily accessible online.
- Read customer reviews: there are lots of bad boards (15%+ of pretty much every board’s reviews are 1-star), but they fail in different ways and vary in the support. Customer reviews helped me avoid one board that looked technically phenomenal but appeared to fail almost half the time in reality.
- PCIe lane tricks: Watch for “PCIe x16 (x4) mode” which basically runs at 1/4 speed; manual will typically say “runs at x##, x## for dual; x##, x##, x## for triple”
- M.2 Type: sometimes the M.2 slot is linked to transmit data via SATA, which is significantly slower than NVMe (over PCIe). This could’ve really crapped on the care I took to select a fast SSD.
After all of these considerations were taken into account, my choice was easy; only one board met all of these criteria, and that was the ASROCK Z270M Extreme4.
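To put the PCIe lane tricks in numbers, here's a rough sketch of theoretical PCIe 3.0 bandwidth per slot width. It uses the standard PCIe 3.0 figures (8 GT/s per lane with 128b/130b encoding); real-world throughput will be lower, and the function name is just for illustration.

```python
PCIE3_TRANSFERS_PER_LANE = 8e9         # PCIe 3.0 signals at 8 GT/s per lane
PCIE3_ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding overhead


def pcie3_bandwidth_gb_per_s(lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe 3.0 link."""
    bits_per_s = lanes * PCIE3_TRANSFERS_PER_LANE * PCIE3_ENCODING_EFFICIENCY
    return bits_per_s / 8 / 1e9


for lanes in (16, 8, 4):
    print(f"x{lanes}: ~{pcie3_bandwidth_gb_per_s(lanes):.2f} GB/s")
```

A physical x16 slot wired as x4 tops out at a quarter of the x16 figure, which is why the wording in the manual matters.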
8. The Power Supply
Lastly, we need to get our previous 7 friends some electricity. Here’s the lowdown:
- Power Needs: Adding up the rated consumption of each of my components, a potential future GPU, and a 100W factor of safety left me with a 700W+ need.
- Size: Needs to fit ATX PS2 standard, which is dictated by the selected case
- Efficiency: Since I’ll be potentially running this thing for hours on end, I’d like to be energy efficient for the sake of (a) the broader environment, (b) the temperature in my rig, and (c) my electric bill. The lowest “80 Plus” rating I was willing to accept was Gold.
The Rosewill Capstone 750M fit each of these criteria, had good reviews, and was cheap enough to keep me in budget.
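For anyone redoing the power math for their own parts, here's a minimal sketch of the budget. The GPU and CPU figures are the published TDPs for a reference 1080 Ti and the i7-7700k; the 50W lump for everything else is a guess, and I'm assuming the 750 in the PSU's model name is its rated wattage. Swap in the numbers from your own spec sheets.

```python
# Rated draw per component, in watts. Only the first and third entries are
# published TDPs; the rest are assumptions for illustration.
component_watts = {
    "GTX 1080 Ti (reference TDP)": 250,
    "future second GPU (assumed to be the same card)": 250,
    "i7-7700k (TDP)": 91,
    "motherboard, RAM, SSD, fans (assumed lump sum)": 50,
}

SAFETY_MARGIN_W = 100  # the factor of safety mentioned above
PSU_RATED_W = 750      # assumed from the model name

required_w = sum(component_watts.values()) + SAFETY_MARGIN_W
headroom_w = PSU_RATED_W - required_w

print(f"required: {required_w} W, PSU: {PSU_RATED_W} W, headroom: {headroom_w} W")
```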
The grand total for all these components came in at $1799.39, which snuck into my budget with 61¢ to spare! Bill of materials follows:
Now while I did a decent amount of desk research to find the best pricing, I also used some tricks to get to this price point (e.g. deducting the 5% back from my wife’s Amazon card, counting rebates in the pricing, and ignoring sales tax — although shipping was free at least!). In the end though, I was proud to spec out a quality machine that could roughly meet my somewhat ambitious pricing goal of sub-$1800!
Assembling the Hardware
Select photos of the assembly process follow, ripped from my braggart-toned messages I sent to my buddies while building:
Wonderful — all the components were purchased, put together, and ready to go. So hit the power button, and………NOTHING. (ohhhh f***!)
Nothing was running. After panically disassembling and reassembling the machine several times to no effect, I did a bit of reading and asking around. It turns out that just popping the CMOS battery out and dropping it back in did the trick.
Beyond that there were a few other tricks to get things running in an interactive manner:
- On most builds, a GPU will not work right away because it’s almost certain the motherboard doesn’t yet include the proper drivers. Relatedly, HDMI on the motherboard’s onboard graphics likely will also not work right away because there are no drivers installed. So, be prepared to use VGA or DVI into the motherboard to get things initially kicked off.
- Having the GPU installed typically disables onboard motherboard video, so you’ll typically need to have that GPU popped out when first booting.
- You’ll need a keyboard + mouse. Your BIOS will likely need its clock set (particularly after popping out the CMOS battery like I did). After that, most of the BIOS GUIs are pretty nice these days, and even mouse-friendly, which was a hugely pleasant surprise after not building a machine for a few years!
Woohoo! At this point, the bios was now configured and the machine was ready for more.
Setting up the OS + Drivers
After getting the BIOS configured, my plan was to boot from an Ubuntu thumb drive and install. I used UNetbootin to prepare an old 2GB USB drive with Ubuntu 16.04, which worked great for the installation.
I used default options for pretty much everything, but also decided to remove the guest user and encrypt my drive. The only downside of disk encryption is that if my computer loses power, I need physical access and keyboard attached to decrypt it on startup (more on that power management stuff below in the remote access section).
Finally, I updated (
sudo apt-get update && sudo apt-get --assume-yes upgrade), set the computer to boot to cmd line by default, and put
alias startx='sudo service lightdm start' in my .bashrc file, or startx would just break.
Install NVIDIA Drivers, CUDA, CUDA Samples
I started by downloading the appropriate driver for my OS/GPU from Nvidia’s downloads, and installing it (note if you take this route you may have to use ctrl-alt-f1 to switch virtual terminals then exit X before installing to avoid conflicts – i.e.
sudo service lightdm stop then
sudo init 3).
However, I noted that in the additional info on the NVIDIA driver page, it says the Ubuntu distro probably has its own copy that might run better than the NVIDIA provided one.
After some googling I restarted the Nvidia process using Layla Tadjpour’s blog post as a reference. In short, I downloaded the run file, extracted the contents to a directory, and installed cuda 9.0 + cuda-samples (but not NVIDIA driver). I then used the ppa method described there to install the drivers (but with 384 since that was the newest at the time).
Rebooted with stars in my eyes, and…oh no, not again…black screen…
Update BIOS to Recognize Both Graphics Cards
At this point I know two things: (1) that putting in a GPU disables the onboard video card, and (2) there should be a setting in the bios to set the GPU as available but not the primary graphics adapter. But I didn’t see that option in my BIOS…
I ended up doing an internet flash of my BIOS, and lo and behold, I saw the option! I changed
advanced>chipset>primary graphics to “onboard,” and it worked!
Installing the Required Software
Phew! Now that I had all the hardware installed, being recognized, and booting properly, it was time to prepare for the fun part: deep learning software.
I started with the basic requirements:
- Python 3.6, required for latest anaconda, per these instructions; then set python3.6 as default (in a sort of hacky way by adding
alias python='python3' in .bashrc)
- Anaconda3 to match (from their downloads)
- CUDA deep learning libraries (CuDNN 7.0 for cuda 9.0), which consists of downloading a tarball, extracting, and copying 4 files to the cuda include and lib64 folders.
And finally, I installed Keras, Theano, and Tensorflow (e.g. pip install keras, conda install tensorflow), ran some benchmarks, and…crap, not again…
I got an error about tensorflow looking for cuda 8.0, but I’d installed 9.0. According to an active github issue to this exact effect that had been updated literally ~15m before, I learned there should be a rc supporting 9.0 that week.
I ended up building from source per a suggestion in the ticket, but in the meantime, as a short-term fix, I figured I’d just make my current tf install happy by…
Reverting to CUDA 8
- rm -r /usr/local/cuda-9.0 (and the symlink /usr/local/cuda)
- download legacy cuda 8.x run file + patch run file
- extract using
- run (as sudo) the cuda and samples run files (leave driver one alone)
- run the patch
- install cudnn 6.0 for cuda 8.0 (an older cudnn major version than the 7.0 I had installed before, but built for 8.0 and supported by the tensorflow wheel to avoid building from source)
- download tar file, extract, copy files from include and lib64 to respective /usr/local/cuda/* folders
I made sure CUDA was running and talking to my GPU by using a few different test scripts:
- Use cuda samples to make sure the GPU is detected:
- Test with pygpu:
DEVICE=cuda0 python -c "import pygpu;pygpu.test()"
Finally I made sure Theano was running on the GPU via CUDA, per these instructions. It initially appeared to be running on the cpu, so I (1) made sure to set up ~/.theanorc per this reference, (2) had to add an nvcc flag to force inlines, and (3) rebooted a few times when things appeared wonky (e.g. GPU would stop responding).
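Another quick sanity check, separate from the ones listed above, is to ask the driver directly which GPUs it can see. The sketch below shells out to `nvidia-smi`, which ships with the NVIDIA driver, and parses its CSV output. The function names are made up, and it simply returns an empty list if the tool is missing or fails.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_list(csv_text: str) -> List[Tuple[str, str]]:
    """Parse 'name, memory' lines as printed by nvidia-smi's CSV output."""
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


def list_gpus() -> List[Tuple[str, str]]:
    """Return [(name, total memory)] for each GPU the driver can see."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools not installed, or not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


print(list_gpus() or "no NVIDIA GPU visible to the driver")
```

If this prints an empty result, there's no point debugging Theano or TensorFlow yet; the problem is further down the stack.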
Installing Jupyter, Run on Boot
Finally, jupyter came packaged with anaconda but I wanted it to be available and exposed on my network on boot, so I set up a simple cron to run a bash file that started jupyter with the flags I wanted.
I had some minor issues since I had set up my home dir to be encrypted, but thanks to a tip I found, I could essentially wait a short amount of time between boot and start to hackily fix that.
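A slightly less hacky variant of the "wait a bit" trick is to poll until something inside the home directory actually exists, then start jupyter. This is only a sketch under assumptions: the marker path is hypothetical, the function names are made up, and the jupyter flags shown (`--no-browser`, `--ip`, `--port`) are common ones rather than necessarily the ones I used.

```python
import os
import subprocess
import time


def wait_for_path(path: str, timeout_s: float = 120.0, poll_s: float = 2.0) -> bool:
    """Poll until `path` exists. Returns False if the timeout passes first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(poll_s, remaining))


def start_jupyter_when_ready(marker_path: str, port: int = 8888) -> subprocess.Popen:
    """Start jupyter once `marker_path` (a file in the decrypted home) shows up."""
    if not wait_for_path(marker_path):
        raise TimeoutError(f"{marker_path} never appeared; is the home dir mounted?")
    return subprocess.Popen(
        ["jupyter", "notebook", "--no-browser", "--ip=0.0.0.0", f"--port={port}"]
    )
```

A boot script would then call something like `start_jupyter_when_ready("/home/user/.bashrc")`, with the path adjusted to a file that only exists once the home directory is decrypted.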
…and it WORKED!!! And zOMG was it fast…
As a baseline, I ran some trainings for my models for Kaggle’s cats vs dogs redux competition, and epochs (consisting of ~32k images) that were taking on the order of ~1hr on my relatively modern laptop’s CPU were now taking ~1m on my GPU.
Configuring Remote Access
I now had a machine in my living room that seemed roughly on par with the expensive on-demand GPU clusters at AWS. But, as with AWS, I wanted to be able to access it from anywhere in the world. Here’s a high-level 3-step summary of how I did that.
Step 1: Key-Based SSH Access on Local Network
The first step was to set up an ssh service on my rig that I could tunnel other services (like the web interface for jupyter) through. Rather than going with a username/password approach, I opted for a much more secure way to do this: using a keypair. This step looked like what’s described in this decent tutorial.
Next, I needed to set a static local IP for the box, so I knew where to connect to each time. This step looked somewhat like what’s described here.
Finally, with these pieces in place I could connect from any client that had the ssh client key, using ssh-based tunneling such as the following command, which maps the client’s (my laptop’s) port 8000 to the server’s (my rig’s) port 8888:
ssh -N -f -L localhost:8000:localhost:8888 user@staticlocalip
Success! I could now see jupyter on my rig from any machine in my home.
Step 2: Wake on LAN (WOL)
The next step was to be able to put the rig in suspended (sleep) mode to save power when I’m not using it, while maintaining an ability to wake it up when I want to do some training. The trick is that the rig is stashed away under a shelf that is relatively physically inaccessible. Enter Wake on LAN (WOL).
In short WOL works by having another machine on the local area network broadcast a “magic” UDP packet to all machines. The magic packet contains the MAC address of the target machine, repeated in a specific sequence, so that the machine knows to wake up.
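To make the "specific sequence" concrete: the standard magic packet is 6 bytes of 0xFF followed by the target's 6-byte MAC address repeated 16 times, for 102 bytes total. Here's a minimal sketch that builds and broadcasts one. The function names are my own, and port 9 plus the 255.255.255.255 broadcast address are conventional defaults rather than requirements.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a WOL magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast_ip: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP on the local network."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast_ip, port))


# Placeholder MAC address for illustration only.
packet = build_magic_packet("AA:BB:CC:DD:EE:FF")
print(len(packet), packet[:6].hex())
```

Calling `send_magic_packet` with the rig's real MAC address from another machine on the same LAN does roughly what a command-line WOL client does.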
However, the target machine first needs to be configured to receive such a packet, an ability that is typically turned off by default. A more general description of how to do this can be found here, but for my machine, this consisted of simply setting a config option in the BIOS:
- BIOS > Advanced > ACPI Configuration > I219 LAN Power On
I then tested from my macbook connected to the same network using a command-line client for Mac (
brew install wakeonlan) to run the following command:
wakeonlan -i [local broadcast ip] -p [any port] [mac address of machine]
Success! Waking up my machine and SSHing in was now working on my local network! But what about when I’m away from home?
Step 3: Wake on WAN (WOW)
I should note here that WOL is designed only for local networks, and that providing broader access to the Internet comes with a slew of security concerns that should be understood before proceeding (e.g. see “Wake on Lan over the Internet” section of this article).
After some desk research to get myself feeling comfortable and prepared to open up to the world, I saw two primary options to get my computer woken up:
- Use a specific port on my router to receive a magic packet from the Internet and forward it to the broadcast address on my LAN. (e.g. as described here)
- Send a message to a device on my network that’s always on, which triggers a WOL request from that device.
Long story short, option 1 is blocked by my router for (good) security reasons; even when trying some of the more exotic ssh and telnet arp hacks, the only way around appeared to be rooting my router.
So, opting against that, I went with option 2. Here’s how I set it up:
- Record my router’s public ip (e.g. by running
curl ipinfo.io/ip from any device on my network)
- Get an old android phone that I don’t care about anymore, wipe it and sign up for a new google account (e.g. “Waker McWakeface”), plug it in to power so it doesn’t die, and connect it to my LAN via wifi.
- Install LlamaLab Automate on the phone, which allows for various actions to trigger a WOL packet.
- Set up Automate such that when it receives a hangout message, it sends a WOL packet, pings the rig until it gets a response, then sends an email to the sender of the hangouts message to let it know the rig is up and running.
- Stash the phone somewhere out of the way, and forget about it.
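The "ping the rig until it gets a response" step is easy to picture in code. This is not the Automate flow itself, just a rough Python equivalent that swaps the ping for a TCP connection attempt to the SSH port (assumed here to be the default, 22), since that's the service I actually care about being up. Function names are made up.

```python
import socket
import time


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_for_port(host: str, port: int = 22, timeout_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Keep trying host:port until it answers or the timeout runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if port_is_open(host, port):
            return True
        time.sleep(poll_s)
    return False
```

After sending the wake-up packet, something like `wait_for_port("staticlocalip")` returning True would be the cue to send the "rig is up" notification.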
Wonderful! Now I can send a hangout message to “Waker McWakeface” and after about 5 seconds receive an email telling me I’m up and running. From there I can set up ssh tunneling as described in the previous section, and access via either the jupyter web interface or via secure shell.
When I’m done I can simply run
sudo pm-suspend in the shell, and put my rig back to sleep (note this was initially causing issues with my
LD_LIBRARY_PATH not getting loaded, and I had to update my rc.local file to get CUDA running reliably after a sleep/wake cycle).
Running Some Deep Learning Benchmarks
Just to make sure everything is configured and that the CUDA libs are actually both talking to the GPU and listening to requests from jupyter (which I don’t take for granted after the initial configuration experience), I wanted to replicate some of Slav Ivanov’s benchmarks from earlier this year. I’ll not go into the details that are covered in his analysis, other than to quote his summary of two models and give my results for two of the benchmarks: MNIST and VGG.
MNIST Multilayer Perceptron
We run the Keras example on MNIST which uses Multilayer Perceptron (MLP). The MLP means that we are using only fully connected layers, not convolutions. The model is trained for 20 epochs on this dataset, which achieves over 98% accuracy out of the box.
16s wall time!!! (Faster than the previous 1080 Ti benchmark of 31s)
A VGG net will be finetuned for the Kaggle Dogs vs Cats competition. In this competition, we need to tell apart pictures of dogs and cats. Running the model on CPUs for the same number of batches wasn’t feasible. Therefore we finetune for 390 batches (1 epoch) on the GPUs and 10 batches on the CPUs. The code used is on github.
50s wall time!!! (Faster than the previous 1080 Ti benchmark of 82s)
So why the speed-up over Slav’s results? My theory is a little improvement in the hardware, and vast improvements in the software over the past 6 months. While the relevant hardware is remarkably similar to what Slav had been using in his benchmarking, in the time since there have been new minor versions released of Keras (2.1) and TensorFlow (1.4) and major versions of CUDA (9.0) and CuDNN (7.0), which should have significant impact here.
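For the record, here's the speed-up arithmetic for the two wall times quoted above, as a tiny sketch (the dictionary layout and key names are just for illustration).

```python
# Wall times in seconds: the earlier 1080 Ti benchmark vs. this rig.
benchmarks = {
    "MNIST MLP": {"previous_s": 31, "this_rig_s": 16},
    "VGG finetune": {"previous_s": 82, "this_rig_s": 50},
}

speedups = {
    name: times["previous_s"] / times["this_rig_s"]
    for name, times in benchmarks.items()
}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x faster")
```

That works out to a bit under 2x on the MLP and roughly 1.6x on VGG, from a single run of each, so treat the ratios as ballpark figures.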
So what have I learned?
Tying this all back to my “tech is only as good as its applications’ effects” ethos, I’ve now validated just the first of 3 hypotheses I had coming into this:
- Having a machine is fun.
- Having a machine and using it is more fun.
- Having a machine and using it for positive external effect is most fun.
What’s next is the remaining two steps of my path.
For step 2, using the machine, this rig will allow me to run more rapid iterations of experiments in deep learning. Some areas I plan to dive more deeply into (and hope to link to from here as I write about them) include:
- Style Transfer (e.g. Van Gogh-ing my face)
- Generative Text (e.g. Lyrics, Recipes)
- Image Classification (e.g. Dogs v Cats)
- Image Segmentation (e.g. identifying empty parking spots, population densities, villages with solar panels, products in a shop)
- Collaborative Filtering (e.g. Movie Ratings)
- Facial Recognition (e.g. spatial-and-rotational-invariant embeddings)
- Optical Character Recognition (e.g. matching IDs to names)
- Predictive analysis for sports (e.g. next play based on NFL “all-22” cam)
From here, step 3 will be to find applications for these newly-acquired tools and skills as soon as possible — and I cordially invite you to follow my account here on Medium for updates!