In big data environments, one often has to determine the cardinality of a data set, i.e. the number of unique elements in a large amount of data. Programmed in the traditional way, every element seen so far has to be stored in order to decide whether a new element has already been counted or not. Such a lookup can be carried out efficiently in memory, for example with a hash map, but it requires storage space for every stored element. If, for example, you want to track the unique visitors of each page hosted on a web server via their IP addresses, this can very quickly lead to enormous RAM consumption, or, if you spill over to disk, to additional performance losses.
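To make the memory problem concrete, here is a minimal sketch of the exact approach in Python; the function name and the IP addresses are made up for illustration:

```python
# Exact counting: every distinct element must be held in memory,
# so space grows linearly with the number of unique elements.
def count_unique(elements) -> int:
    seen = set()
    for element in elements:
        seen.add(element)
    return len(seen)

# Hypothetical visitor log for a single page
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "192.168.0.7"]
print(count_unique(ips))  # 3
```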
The LogLog algorithm presented below (later improved to HyperLogLog [2]) takes its name from its extremely low memory requirement, which grows only with the double logarithm of the cardinality and, for large data sets, is at least two orders of magnitude below that of the hash map; strictly speaking, its memory consumption is even constant. Similar to the Morris counter, it works with binary representations and a kind of randomness.
The basic idea is to convert each value to be counted into a hash value using a hash function and then to look at its binary representation. Such a hash could, for example, be calculated for the value "Mannheim" using the Murmur3 algorithm.
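A short sketch of this hashing step in Python, assuming the third-party mmh3 package as the Murmur3 implementation (the exact bit pattern depends on the hash variant and seed, so it need not match the example discussed here):

```python
import mmh3  # third-party Murmur3 binding: pip install mmh3

# 32-bit Murmur3 hash of the value "Mannheim", read as an unsigned integer
h = mmh3.hash("Mannheim", signed=False)
print(f"{h:032b}")  # its binary representation, padded to 32 bits
```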
(The first few bits of the hash will only become relevant in a few moments.)
If we now look at the number of leading zeros in this binary representation, we can use it, together with a little probability theory, to estimate roughly how many hash values we must already have calculated: to do this, we simply remember the largest number of leading zeros seen so far across all calculated hash values.
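As a sketch, tracking this maximum over a stream of values could look like this (continuing with the hypothetical mmh3-based hashing from above; the input values are made up):

```python
import mmh3

def leading_zeros(h: int, bits: int = 32) -> int:
    """Number of leading zeros in the bits-wide binary representation of h."""
    return bits - h.bit_length()

# Remember only the largest number of leading zeros seen so far
max_zeros = 0
for value in ["Mannheim", "Berlin", "Hamburg"]:
    h = mmh3.hash(value, signed=False)
    max_zeros = max(max_zeros, leading_zeros(h))
```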
The idea behind this approach rests on the random properties of (good) hash functions: they strive for maximum entropy, i.e. a maximally uniform random distribution of zeros and ones in their bit strings. Accordingly, a zero in the first position has a probability of 50 percent, two leading zeros of 25 percent, three of 12.5 percent, and so on. If we set c to the largest number of leading zeros observed and calculate 2^c, we get a rough estimate of the number of elements seen so far, similar to the Morris counter. A hash value with two leading zeros, as in the example above, would therefore be expected after about four hash calculations.
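Put together, the crude single-observable estimator described so far fits in a few lines; note that this is only a sketch of the raw idea, with very high variance, and not yet the full LogLog algorithm:

```python
import mmh3

def crude_estimate(values, bits: int = 32) -> int:
    """Estimate the cardinality as 2**c, where c is the largest number of
    leading zeros among the 32-bit Murmur3 hashes of all values seen."""
    max_zeros = 0
    for value in values:
        h = mmh3.hash(value, signed=False)
        max_zeros = max(max_zeros, bits - h.bit_length())
    return 2 ** max_zeros

# With 1000 distinct values we expect a result in the right order of
# magnitude, but individual runs can be off by large factors.
print(crude_estimate(f"user-{i}" for i in range(1000)))
```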