BugSense takes pride in providing high-quality, world-class app Insights for the most popular mobile apps in the tubes. Efficiently managing extremely large volumes of your Insight requests to our backend servers is crucial to give you the real-time Insights you need to ensure mobile application quality. So how do we do it? The answer, which many of you may already know if you’ve been following our blog posts and talks, is threefold: Erlang, C and Lisp under the umbrella of LDB. We’re going to share how we optimize the heck out of our infrastructure to put your mobile app data to work for you.
When it came to designing how we’d communicate insight data to our backend servers, we chose not to reinvent the wheel. We used TCP/HTTPS as our protocol chain of choice, knowing that this meant optimizing a stateful (read: TCP) infrastructure as much as we could for millions of short-lived (read: stateless as in UDP) connections.
Our set of general optimizations distilled down to the following list (layered from low to high in the OSI stack):
Minimized the default TCP buffers: By aligning the default TCP buffer (via the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem sysctl variables) to a page (4 KB), we effectively minimized memory allocation requirements and memory fragmentation for each TCP connection - since we knew beforehand the maximum TCP payload size the mobile phones all over the world would send to our infrastructure.
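As a rough illustration of how such a setting could be checked, the sketch below parses the three-value (min, default, max) format used by net.ipv4.tcp_rmem and net.ipv4.tcp_wmem and reports whether the default equals one page. The helper names and the sample values are assumptions made for the example, not our production settings.

```python
import mmap


def parse_tcp_mem(value):
    """Parse a 'min default max' sysctl string into a tuple of ints."""
    fields = value.split()
    if len(fields) != 3:
        raise ValueError("expected three values: min default max")
    return tuple(int(f) for f in fields)


def default_is_one_page(value, page_size=mmap.PAGESIZE):
    """Return True when the default (middle) buffer size equals one page."""
    _min, default, _max = parse_tcp_mem(value)
    return default == page_size


if __name__ == "__main__":
    # On Linux the live value could be read from /proc/sys/net/ipv4/tcp_rmem.
    sample = "4096 4096 16384"  # hypothetical setting, for illustration only
    print(parse_tcp_mem(sample), default_is_one_page(sample, page_size=4096))
```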
Minimized the TCP TIME_WAIT period hardcoded inside the Linux kernel: The internet is inherently an unpredictable place for streams of TCP packets, hence the importance of the TCP TIME_WAIT period. When your payload is only 1 TCP packet (resulting in 7 total TCP packets for each mobile phone ping), the app/kernel keeps a TCP socket open for $TIMEWAIT seconds after receiving the 1-packet payload and the FIN handshake, doing nothing but consuming much-needed socket space in-kernel and open file descriptors in-app. By significantly shortening the TIME_WAIT period, we allowed for more efficient handling of a TCP-as-UDP workload. Check out http://highscalability.com/blog/2012/11/26/bigdata-using-erlang-c-and-lisp-to-fight-the-tsunami-of-mobi.html for more on this!
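To get a feel for why this matters, here is a back-of-the-envelope sketch: in steady state, the number of sockets parked in TIME_WAIT is roughly the rate of closed connections multiplied by the length of the period. The stock Linux value is 60 seconds; the connection rate and the shortened period below are hypothetical numbers picked for illustration, not measurements from our systems.

```python
def sockets_in_time_wait(closes_per_second, time_wait_seconds):
    """Steady-state estimate: every closed connection lingers for the
    whole TIME_WAIT period, so the backlog is rate * duration."""
    return closes_per_second * time_wait_seconds


if __name__ == "__main__":
    rate = 5000          # hypothetical connections closed per second
    stock_period = 60    # stock Linux TIME_WAIT length, in seconds
    short_period = 5     # hypothetical shortened period, in seconds

    print(sockets_in_time_wait(rate, stock_period))  # 300000
    print(sockets_in_time_wait(rate, short_period))  # 25000
```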
Used an evented HTTP server: Setting up a load-balanced farm of NGINX sharders was a natural choice. We just made sure we disabled HTTP keepalive everywhere, minimized the receive buffer (rcvbuf=4k) to match the in-kernel TCP receive buffers and disabled TCP keepalive (so_keepalive=off) in the listener setup.
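At the socket level, those two listener parameters correspond to the SO_RCVBUF and SO_KEEPALIVE options. The sketch below shows the equivalent setup on a plain Python listening socket; it only illustrates what the options do and is not how NGINX is configured. The function name, address and backlog are assumptions made for the example.

```python
import socket


def make_listener(host="127.0.0.1", port=0, rcvbuf=4096):
    """Create a listening TCP socket with a small receive buffer and
    TCP keepalive disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the buffer before listen() so accepted sockets inherit it.
    # The kernel may adjust the requested size (Linux doubles it).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    # 0 turns TCP keepalive probes off for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 0)
    sock.bind((host, port))  # port 0 lets the OS pick a free port
    sock.listen(128)
    return sock


if __name__ == "__main__":
    listener = make_listener()
    print(listener.getsockname(),
          listener.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    listener.close()
```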
Optimized the Erlang layer: Our Erlang (R16-based) stack consists of the famous Cowboy HTTP server acting as the packet router and our own Lethe C-and-Lisp-based database bound to the Erlang VM via the Erlang ports interface. As the whole system was built to be highly concurrent and asynchronous, VM optimization flags included the usual kernel epoll activation flag (+K true) and the SMP activation flag (-smp enable). We also raised the maximum number of simultaneously open ports (-env ERL_MAX_PORTS) since our in-VM communication is port-based. We lowered the scheduler wakeup threshold (+swt low) because our workload fluctuates easily and we need quick reactions from the Erlang scheduling system, and we kept the scheduler threads from going to sleep too quickly in order to minimize latency for incoming request spikes (+sbwt long). We also biased the scheduler towards port parallelism instead of low latency (+spp true). The last optimization alone lowered the server load threefold (3.0 in ops terms) on our busiest systems. A sidenote here: the +sbt option (a scheduler-to-CPU-affinity optimization) may be intriguing to benchmark and use, but our case showed that this flag really shines only in non-virtualized (read: non-Azure/AWS) environments, where thread-to-core placement is deterministic and not managed by a hypervisor. Using it in virtualized environments may even decrease performance!
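Put together, the flags named above would end up on the erl command line roughly as assembled by the sketch below. The helper name is made up for the example and the ERL_MAX_PORTS value is a placeholder, since the right limit depends on your own workload.

```python
def erl_flags(max_ports):
    """Collect the VM flags discussed above as an argv-style list."""
    return [
        "+K", "true",                             # kernel poll (epoll)
        "-smp", "enable",                         # SMP support
        "-env", "ERL_MAX_PORTS", str(max_ports),  # raise the port limit
        "+swt", "low",                            # scheduler wakeup threshold
        "+sbwt", "long",                          # scheduler busy-wait threshold
        "+spp", "true",                           # favour port parallelism
    ]


if __name__ == "__main__":
    # 65536 is a hypothetical value, used only to show the shape of the command.
    print(" ".join(["erl"] + erl_flags(65536)))
```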
Jemalloc’ed everything: From the NGINX sharders to the Erlang and C layers, we selected and enforced jemalloc as our memory allocator of choice (thanks Facebook! - https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919). This gave us memory allocation introspection (especially in the C layer) and effortless performance improvements on threaded code all around!
Of course, we never stop optimizing and innovating, so watch this space, we’ve got more to come! Please feel free to comment, we’d love your feedback!