New feature preview: Advanced events

We all love event tracking because it helps us understand how users interact with our apps. But there is a key difference between mobile apps and web apps: mobile apps ship in discrete versions, and old versions stay in use long after a new one is released.

An event might have a different meaning in each AppVersion, and of course a different conversion rate. Let me give you an example:

You want to track “User clicked ok”. You add the event in your AppVersion 1.1, you go to your dashboard and you see cool realtime analytics and graphs in our “Insight Boxes” pages. But then you release a new version of your app (1.2) in which you have optimized the flow or made some changes. Now you don’t know which events came from version 1.1 and which came from 1.2.
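
For instance, tracking such an event is a one-liner in your app’s code (the function name below is a hypothetical sketch; check the docs of your platform’s SDK for the exact call):

```javascript
// Hypothetical sketch of event tracking with a BugSense SDK;
// the exact function name varies per platform.
Bugsense.trackEvent('User clicked ok');
```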

One way is to create a new event “User clicked ok but this is another version”. The other way (which I call the “cool” way) is to use the new “Advanced events” feature available for preview to Premium customers.


Now you can filter events by AppVersion, see the trend of a specific app version and compare how this event is doing across all your releases, in a slick, realtime and responsive UI.


We also provide a bar with the top weekly events, the most trending events and other neat information, so you can always be on top of your game.

If you are a Premium customer, feel free to check out this new preview feature and send us some feedback.

If you see a change in top events from one app version to another, you know you’re getting instant visibility into what customers like or where they are getting tripped up!

Jon (@jonromero)

BugSense’s actions to address the HeartBleed issue

As you’re likely aware, a significant vulnerability in OpenSSL, which the security community is calling the “Heartbleed” vulnerability, was discovered and publicized last week. It affects a package that is in common use throughout the software industry.

The purpose of this blog post is to inform you about what Splunk BugSense did to address this issue. For more detailed information about the vulnerability itself, refer to http://heartbleed.com.

Here’s what you need to know:

The Splunk BugSense operations team is very security-conscious and takes new security threats and published vulnerabilities very seriously. Within 24 hours of the disclosure of the CVE-2014-0160 “Heartbleed” SSL bug, we responded with the following actions:

  • All servers were immediately updated to the latest fixed version of OpenSSL (1.0.1g). 
  • We re-issued the special certificate the BugSense SDKs use to communicate with the BugSense service. This mitigates possible information leakage in the window between the disclosure of the bug and the application of the patch to the service.
  • We re-issued the SSL certificate of www.bugsense.com immediately after Google announced AppEngine’s update to the latest OpenSSL version. 
  • We performed a full audit for possible information leaks on all public endpoints (mobile data endpoints, monitoring servers and developer tools servers) and found none. 
  • We applied a full SSL reissue to all SSL-enabled public endpoints (pertaining to the services described above).
  • We performed a full public endpoint scan using a readily available open source tool (located here: http://filippo.io/Heartbleed) and it found no vulnerable servers.

If you have any more questions on this issue, please feel free to contact us.

Bugsense.js v2.0.1

A few months ago we released version 2.0.0 of bugsense.js, a long-awaited rebuild of the plugin that also introduced some interesting features.

Today we’re happy to announce that our JavaScript SDK is hitting v2.0.1!

API Changes

We’re introducing some changes to the SDK’s API. We changed how the SDK is initialized in order to have more homogeneous SDKs across all supported platforms. Previously, the SDK was instantiated using `new Bugsense(…)`. That is now changing to `Bugsense.initAndStartSession(…)` to reflect the style of the rest of the BugSense SDKs.
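
In practice, the change looks like this (the options object shown here is illustrative; see the README for the exact parameters):

```javascript
// Before (v2.0.0): direct instantiation
var bugsense = new Bugsense({ apiKey: 'YOUR_API_KEY' });

// After (v2.0.1): static initializer, consistent with the other BugSense SDKs
Bugsense.initAndStartSession({ apiKey: 'YOUR_API_KEY' });
```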

We also made some changes to `Bugsense.notify(…)`. The problem is that web browsers create Error objects that are quite different from one another. We’re trying to be as cross-platform as possible, and that means supporting as many browsers as possible with minimal problems and a very straightforward API, avoiding browser-specific functions, etc.
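
The intended pattern is to hand the caught Error object straight to the SDK and let it smooth over the browser differences (the throwing function below is just a stand-in):

```javascript
try {
    doSomethingRisky(); // stand-in for any code that may throw
} catch (error) {
    // Pass the raw Error object as-is; the SDK normalizes the
    // browser-specific differences in its shape
    Bugsense.notify(error);
}
```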

Stacktraces

Stack traces in JavaScript environments are a true pain. We’re always trying to improve the quality of our stacktraces, so we incorporated TraceKit in order to generate better and more consistent stacktraces.
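
Roughly, TraceKit turns a browser-specific Error into a uniform list of stack frames; a quick sketch of what that looks like:

```javascript
try {
    undefinedFunction(); // deliberately throws a ReferenceError
} catch (error) {
    // TraceKit normalizes the error into a consistent structure:
    // trace.stack is an array of {url, func, line, column} frames,
    // regardless of which browser produced the error
    var trace = TraceKit.computeStackTrace(error);
    console.log(trace.stack);
}
```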

Release Notes

But the changes don’t stop here. Here’s the full changelog:

  • Changes in the API of the SDK - read more here
  • Full AMD support; the AMD build can be found here
  • Changes in the error parsing functions and the overall mechanism
  • TraceKit-generated stacktraces now included in error reports
  • Changes in the SDK structure to accommodate future releases and new features
  • [BUG] Fix for sending null stacktraces to BugSense’s REST API
  • [BUG] Fix for a Chrome-specific issue where the error’s column number was parsed as custom data

Phonegap Beta

This release is the foundation for our upcoming new Phonegap plugin that will bring feature parity with our native SDKs. Hybrid apps are a big trend and we love to help. If you are interested in joining the beta, please let us know here.

Contribute

We would like to remind you that bugsense.js is an Open Source project, so if you have any contributions or ideas, we would be more than happy to hear about them and work together. Our repository can be found on Github.

- Dimitris

New Insight Boxes & Real Time goodness!

We’ve been working on some brand new Insight Boxes and improvements that will become your de facto KPIs for your mobile apps: Retention, Crash Rate and Popular OS Versions & App Versions.

Popular OS Versions & App Versions

The most popular OS Versions & App Versions by sessions are now available as a second tab.

Retention

If your users love your app, they will keep using it. That’s retention. With the Retention box you can visualize how many of your users return and engage with your app. In addition, we now track how many of your users are new compared to the returning ones.

Crash Rate

Crash rate is the ratio of crashes to sessions per day. For example, 50 crashes across 10,000 sessions in one day is a 0.5% crash rate for that day. Using sessions in combination with crashes, we can visualize the trend over 7 days. This feature gives you the opportunity to monitor your app day by day and watch how a specific release affects your crash rate.

These two new Insight Boxes are currently available to all plans! Log in and take a look!

Realtime

The new real-time page allows you to watch the number of sessions vs. crashes right now, making it easier to understand how your app behaves in the wild. This is very important when you release a new version of your application or anticipate a major publicity event, like an appearance on TV or a startup conference.

The new real-time graph has a 5-minute window and the stats are updated every 15 seconds.


This new feature is available to all plans, so log in to your BugSense account and tell us what you think @bugsense.

-Maria

Erlang Binary Garbage Collection: A love/hate relationship

It’s a well-known fact that the Erlang VM’s generational GC does not do well when trying to garbage collect non-heap binaries. Here at Splunk, while we’ve been building brand new technology (standing on the shoulders of giants, of course) we’ve run into this weakness multiple times. This is a chronicle of our adventure.

A little background

Erlang binaries of up to a certain size (64 bytes to be precise) get stored in each process’s heap space and are garbage collected along with other state variables (tuples, lists etc). Larger ones however get stored as reference-counted (“refc”) binaries in a separate shared memory space, and only a small reference object (a ProcBin) is kept in the manipulating process’s heap space instead. Those “large binaries” are not garbage collected in the Erlang conventional way (that is, per-process GC) since they are not accounted for in the process’s memory usage. They are reference counted and have a different GC pattern and collection interval, which, as it turns out, is not very intuitive (even when fine-tuned) and can allow your application to self-destruct if it handles a sufficiently large number of binaries (in count and in total memory size).
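
You can see the distinction for yourself in an Erlang shell (a sketch; the exact contents of the reference list will vary):

```erlang
%% Binaries of up to 64 bytes live on the process heap; anything larger
%% becomes a refc binary in the shared space, referenced via a ProcBin.
Small = binary:copy(<<"x">>, 64),
Large = binary:copy(<<"x">>, 65),
%% process_info/2 with 'binary' lists only the refc binaries a process
%% references: expect a {_Address, 65, _RefCount} entry for Large,
%% while Small does not appear at all.
{binary, Refs} = erlang:process_info(self(), binary).
```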

In our architecture, we have a large volume of binary data coming into the system from a Cowboy web server instance, each packet up to 4 KB in size, and each packet is also touched by 3 different long-lived gen_server processes on its way through the system.

What we have observed is that, even though at some point in the lifetime of a binary datum it will be released by all the processes that have touched it (after being served in a request or sent over to another Erlang node), the datum’s space will still remain allocated (and the memory won’t be returned to the OS) for an indefinite amount of time. “Indefinite” here is explained by the Erlang documentation as “until memory pressure kicks in and an old generation GC occurs”, which is at best blurry as to what memory pressure really means and how it is measured. More importantly, memory pressure usually pertains to a process and not to the Erlang VM as a whole (which is where the large binaries are stored).

Tampering with the beast

To test the speed and effectiveness of Erlang’s binary GC, we’ve used wrk and a simple Lua script along the following lines (the header and file handling shown are an illustrative reconstruction):
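
```lua
-- post.lua: POST the contents of event.json on every request
-- (header and payload loading are illustrative)
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"

local f = assert(io.open("event.json", "r"))
wrk.body = f:read("*a")
f:close()
```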

with a 4 KB JSON event.json file to fill our Erlang application with data via Cowboy. The wrk command used is of roughly this shape (the endpoint, thread and connection counts are assumptions):
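
```sh
# Illustrative invocation: 8 threads, 400 open connections,
# 85-second run (matching the fill time quoted below)
wrk -t8 -c400 -d85s -s post.lua http://appserver:8080/events
```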

which, on our testbed 8-core virtualized server with 2 GB of memory and no swap, fills the internal data structures with about 1600 MB of memory in exactly 85 seconds.

The test procedure we followed is described below:

  1. Start the application server.

  2. Fill it up with about 1.6 GB of data from the wrk script.

  3. Fetch all of the data, serially, in batches of 40 MB each (10,000 events).

  4. The previous operation leaves the application server without any meaningful binary data stored in process state.

  5. Run the wrk script again to populate the data structures again.

  6. Crash (7 out of 10 times).

  7. Repeat steps 1 to 6 with a manual `[ erlang:garbage_collect(Pid) || Pid <- erlang:processes() ].` after step 4.

Kvetch

What we’ve observed is that:

  • Erlang will not garbage collect the shared binary space until there’s actual memory pressure. That translates to about 90% of the system’s memory being full, without significant competition from system processes.

  • Even then, it won’t run a full sweep to clean up the entire unused binary data set, but will start cleaning progressively, reclaiming as much space as needed in order to operate correctly (such as allocating outgoing buffers and memory for Cowboy handlers). This is standard behaviour in incremental generational GC systems, but in Erlang’s reality it doesn’t always occur fast enough to save the VM from crashing under OOM conditions.

  • Sometimes (3 times out of 10), it will misestimate the remaining system memory or fail to adequately prioritize the GC mechanism, and as a result much-needed memory will not be freed in time and the whole VM will crash under OOM conditions. Erlang fanboys will say that this may align with Erlang’s philosophy of “let it crash”, but we believe that this concept should only apply to an Erlang-controlled environment (that is, processes, functions, ports etc) and not to the VM as a whole.

  • Forcing an old-generation sweep after a lower number of minor sweeps (like 5, 10 or even 0) via `{spawn_opt, [{fullsweep_after, 5}]}` in `gen_server:start_link/4` (see the sketch after this list) did absolutely nothing, since in order to run a minor collection some memory pressure must occur in the process, and no such thing happens as long as the process is lightweight enough in terms of other state (remember that our binaries live in the shared binary space and are only referenced from the process heap via ProcBins).

  • Forcing a garbage collection with `erlang:garbage_collect(whereis(named_process)).` will do the job: it cleans up the entire stale binary data set from the shared space, and fast enough that you won’t notice it in CPU usage.
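
For reference, the spawn_opt experiment mentioned above looks like this (module and registration names are illustrative):

```erlang
%% Start a gen_server whose process runs a full sweep after
%% 5 minor collections (names here are illustrative).
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [],
                          [{spawn_opt, [{fullsweep_after, 5}]}]).
```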

Solution?

Unfortunately, there’s no elegant solution here. Even the official Erlang/OTP documentation states: “If the heap doesn’t grow, it’s likely that there won’t be a garbage collection, which may cause binaries to hang around longer than expected. A strategically-placed call to erlang:garbage_collect() will help.” What can be done is implementing some sort of self-scrubbing in OTP designs that use gen_server (or any other gen_* pattern), like the following sketch (the interval is an illustrative choice):
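
```erlang
%% Inside your gen_server module: schedule a periodic self-scrub.
%% The 60-second interval is an arbitrary, illustrative choice.
-define(SCRUB_INTERVAL, 60000).

init(Args) ->
    erlang:send_after(?SCRUB_INTERVAL, self(), scrub),
    {ok, Args}.

handle_info(scrub, State) ->
    %% Force a full GC on this process, dropping its references to
    %% stale refc binaries so the shared space can be reclaimed
    erlang:garbage_collect(self()),
    erlang:send_after(?SCRUB_INTERVAL, self(), scrub),
    {noreply, State}.
```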

This is as inelegant as it gets. Better solutions include measuring the binary memory usage of processes that are bound to create/manipulate/handle large binaries and running the self-scrubbing only when they hold a significant amount of it, along these lines (the threshold is illustrative):
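
```erlang
%% Only collect when this process references a significant amount of
%% refc binary data (the 50 MB threshold is illustrative; reuses the
%% ?SCRUB_INTERVAL timer from the previous sketch).
-define(BIN_THRESHOLD, 50 * 1024 * 1024).

handle_info(scrub, State) ->
    {binary, Bins} = erlang:process_info(self(), binary),
    %% Each entry is {Address, Size, RefCount}; sum the referenced sizes
    Held = lists:sum([Size || {_Addr, Size, _RefCount} <- Bins]),
    case Held > ?BIN_THRESHOLD of
        true  -> erlang:garbage_collect(self());
        false -> ok
    end,
    erlang:send_after(?SCRUB_INTERVAL, self(), scrub),
    {noreply, State}.
```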

or sending the GC signal to the offending processes right after handling a large binary transaction.

Hopefully the situation will improve with R17B, but until then workarounds such as the above have to be implemented to ensure proper application operation.

Panagiotis Papadomitsos

@priestjim