EntropyUUID: Why it’s the ultimate unique identifier, based on Information Theory

When we started tracking unique devices at BugSense recently, we initially chose to adopt the OpenUDID project. We soon found out it wasn’t the right choice for us mostly due to 2 reasons. First, a lot of devices have the same identifier and second, we need extra permissions to access it.

What we really needed was a unique ID generated on the device, without external dependencies, that couldn’t be personally identifiable. The solution came in the form of the “Unique User ID (UUID) entropy-based generator with 6 DOF" algorithm.

Let’s start by sharing why this is an amazing algorithm:

  • The concept is easy to adapt on various platforms, as it’s a cross-platform process in nature.
  • It doesn’t rely solely on one specific system feature (e.g. a device serial ID).
  • It doesn’t need any OS or hardware cap.
  • It doesn’t do any physical user identification. Also, the generated ID is the result of a one-way process - one cannot reverse the process to retrieve the “source”.
  • The algorithm runs locally on a device. No network connection, no external stimulation (for example the mouse pointer movement), no interaction with other applications is required.
  • It can be implemented in few lines of code and follows the KISS (Keep It Simple, Stupid) principle.
  • It can become more refined.

It has only one possible downside, which we’ll talk about later in the post so read on!!

The “quality” of a “random” series generator is analogous to the length of the series’ period. So why are these generators called “pseudo-random generators”?
Randomness is believed to be an illusion in non-microcosmic scales. We may model a series of events as a periodic series (even if we don’t observe the same sequence of events occurring more than once), by assuming that it has a very long period. Therefore there may be a repetition that we just haven’t observed yet. And because the universe’s entropy is a discrete , measurable value, there’s nothing to indicate we’re wrong. Infinity, after all, is only a mathematical model - it doesn’t really exist.

"Every pseudo-random generator is modelled as a chain of events with a period"

Assume that you are about to receive N bits (a message). The entropy H (representing the lack of information) for this message would be (derived from Shannon’s entropy equation):

                                                      H = Nlog(2)

as long as each bit is independent and can have the values of either 0 or 1 with equal probability (in other words, a Bernoulli process). This applies when we are in an ideal domain concerning bit distribution. Well, guess what: we are not!
So, we are seeking a way of maximizing the entropy quantity when a biased procedure takes place. This is achieved by modeling various system parameters and stimulating them to emit different results. Based on the mutability property of a system and the immutable property of time, we try to identify parameters that lead to highly distributed pseudo-random values. These parameters are the system’s degrees of freedom (DOF). A superposition of these parameters (read: addition, for a linear, time-invariant system) will yield a result “as random as possible” that will be the unique ID we are after. Here are the system’s 6 DOF:

  1. The time the program starts (timestamp since a predefined “zero” timestamp - an Epoch) won’t be the same for the same device, assuming there’s only one instance of the program running on this device at the same time.
  2. When a program runs, a datum may have an arbitrary (“random”) address (von Neumann model).
  3. The accuracy of a system measuring time intervals is not infinitely accurate.
  4. A system may have a distinguishable inherent property.
  5. Combining tasks and measuring how long it took the system to run them is a quantity that varies, based on the system load at that time and the 3rd DOF (see above). This measurement can take the role of the seed for a pseudo-random number generator.
  6. A complementary piece of information that acts as a kick-starter for the algorithm. This includes the module provider (and an age indication) imprinted on it.

If a fixed length of the generated ID is required, one can apply a message digest algorithm. This step will lead to possible collisions, but we can assume that, even though they may surely happen, they will be extremely

The gist here is that by combining various sequences with lengthy periods, one can obtain a sequence with an even lengthier period. Check out this sample implementation, written in C:

To wrap this up, there are several pros to this algorithm as we have already mentioned, while the con is only one: there is a tiny probability of generating same IDs. We haven’t witnessed such an issue (yet) at BugSense, but as I said before, it’s just a matter of not having observed the full period of the phenomenon yet! All in all, the  ”Unique User ID” algorithm provides a good, simple solution when you need a way to generate random sequences. 
If you feel like getting into more details, just send us a tweet and we’ll be happy to engage in some stimulating conversation!