Additional Features

Features used internally by the Nixer Spring Plugin which can also be used individually.

File-based Bloom Filter for Java

A Bloom filter is a probabilistic data structure designed to tell, rapidly and efficiently, whether an element is a member of a set. False positives are possible, but false negatives are not, which means a query returns either “possibly in set” or “definitely not in set”.
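
To make these semantics concrete, here is a minimal, illustrative sketch using Google Guava's in-memory BloomFilter (our file-based implementation is derived from the same algorithms, as described below):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomSemantics {
    public static void main(String[] args) {
        // In-memory Guava filter, shown only to illustrate the query semantics.
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000,   // expected insertions
                0.01);   // desired false positive probability

        filter.put("alice");

        System.out.println(filter.mightContain("alice")); // always true: no false negatives
        System.out.println(filter.mightContain("bob"));   // usually false, true with ~1% probability
    }
}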

Most Bloom filter implementations for Java keep the data set in memory, which is a serious limitation when it comes to handling large data sets, both in terms of performance and system stability.

Our goal was to be able to efficiently handle Bloom filters over massive data sets, counting billions of entries and taking up multiple GBs of storage.

We tackled the problem of large data sets by creating a Bloom filter implementation that keeps the data in a file rather than in memory.

The algorithms are copied from Google Guava, but the underlying bit array is backed by a file instead of Java arrays. The data file is memory-mapped (mmapped) into the process address space, not loaded, so there is no need to explicitly read or write from/to the disk. The Java heap is also not involved in storing the data.
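
The following is a minimal sketch of the memory-mapping technique itself, not the actual implementation (note that a single Java MappedByteBuffer is limited to 2 GB, so a real multi-GB filter needs several mappings):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapBitsSketch {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(
                Paths.get("filter_name.bloom-data"),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map the file into the process address space; the OS pages data
            // in and out on demand, bypassing the Java heap entirely.
            MappedByteBuffer bits = channel.map(
                    FileChannel.MapMode.READ_WRITE, 0, channel.size());
            // Reading bit i of the backing bit array: byte i / 8, bit i % 8.
            long bitIndex = 42;
            boolean isSet = (bits.get((int) (bitIndex >>> 3)) & (1 << (bitIndex & 7))) != 0;
            System.out.println(isSet);
        }
    }
}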

Our Bloom filter implementation is backed by two files:

  • filter_name.bloom is a metadata file containing the filter parameters (e.g. false positive probability, number of insertions),

  • filter_name.bloom-data is the data set file, sized to fit the provided number of expected insertions.

It can efficiently handle sets that require multi-GB data storage (e.g. billions of entries) while maintaining low false positive rates. You can see concrete numbers in the benchmark section.

The implementation is wrapped in a reusable Java library called simply bloom-filter.

Installation and usage

The bloom-filter library is distributed through Maven Central.

dependencies {
    implementation("io.nixer:bloom-filter:0.1.1.3")
}

The filter to be used is available as an implementation of io.nixer.bloom.BloomFilter.

In order to obtain an instance, use one of the factory methods defined in io.nixer.bloom.FileBasedBloomFilter:

  • create(..) for building an empty filter from scratch,

  • open(..) for accessing an already existing filter from its backing files, which might contain multi-GB data sets.

In most cases you are probably going to go with the second option and open an already existing filter, as in the sketch below.
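
As a rough sketch of how opening a filter might look (the exact signature of open(..) is an assumption here: we suppose it takes the path to the .bloom metadata file and a Guava Funnel describing how elements are hashed, and that the interface mirrors Guava's mightContain(..)):

import com.google.common.hash.Funnels;
import io.nixer.bloom.BloomFilter;
import io.nixer.bloom.FileBasedBloomFilter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

public class OpenFilterSketch {
    public static void main(String[] args) {
        // Assumed signature: open(Path to metadata file, Funnel for elements).
        // Consult io.nixer.bloom.FileBasedBloomFilter for the actual factory methods.
        BloomFilter<String> filter = FileBasedBloomFilter.open(
                Paths.get("my.bloom"),
                Funnels.stringFunnel(StandardCharsets.UTF_8));

        if (filter.mightContain("example")) {
            // "possibly in set" – treat as a hint and verify if necessary
        } else {
            // "definitely not in set"
        }
    }
}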

However, the filter has to be created somehow in the first place. In order to make this process more convenient we prepared a handy CLI tool.

Bloom CLI Tool

In addition to the Bloom filter library, we provide a simple command line utility for manipulating the Bloom filter backing files, called bloom-tool.

Installation

bloom-tool is built using the Gradle application plugin, so the package includes start scripts for various operating systems.

Download package from GitHub

The tool package is available on GitHub as an asset attached to each release of Nixer Spring Plugin.

Build from sources

Alternatively, it can be built from sources. To do so, clone or download Nixer Spring Plugin and build the project using Gradle:

./gradlew

The tool’s package will appear under the bloom-tool/build/distributions directory.

Usage

Unzip the bloom-tool package and invoke the following to see all available commands and options:

bin/bloom-tool --help

Basic use case

Suppose we have a text file containing a list of leaked credentials (e.g. one from here) which we want to put into a Bloom filter. Each line in the file consists of a password hash (as a hex string) and additional information, such as prevalence, separated by a colon (:), e.g.

0000000A1D4B746FAA3FD526FF6D5BC8052FDB38:16
0000000CAEF405439D57847A8657218C618160B2:15
0000000FC1C08E6454BED24F463EA2129E254D43:40

Let’s call this file entries.txt and assume it contains 500 million entries, which means around 20 GB of data.

In order to create a filter populated with all the hashes from this file, use the following command:

bloom-tool build --name my.bloom --size 500000000 --separator : --field 0 --input-file entries.txt

Where --field says the hash should be taken as the first value after splitting each line by the : separator. Remember fields are counted starting from 0, so in the example above field 0 is the hash and field 1 is the prevalence.

Note that execution for this amount of data might take a while. On a MacBook Pro (Core i7, 16 GB RAM) it took around 24 minutes.

Create a Bloom filter

Create a new filter my.bloom sized for 1M insertions, with a 1-in-1M false positive probability.

bloom-tool create --size=1000000 --fpp=1e-6 --name=my.bloom

This results in creating the my.bloom metadata file and an empty my.bloom-data data file.
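
The same can also be done programmatically via the create(..) factory method mentioned earlier. A hypothetical sketch, assuming create(..) mirrors the CLI options (metadata file path, expected insertions, false positive probability) plus a Guava Funnel; check the actual signatures in io.nixer.bloom.FileBasedBloomFilter:

import com.google.common.hash.Funnels;
import io.nixer.bloom.BloomFilter;
import io.nixer.bloom.FileBasedBloomFilter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

public class CreateFilterSketch {
    public static void main(String[] args) {
        // Assumed parameters, mirroring: bloom-tool create --size=1000000 --fpp=1e-6 --name=my.bloom
        BloomFilter<String> filter = FileBasedBloomFilter.create(
                Paths.get("my.bloom"),  // metadata file; data file created alongside
                1_000_000,              // --size: expected insertions
                1e-6,                   // --fpp: false positive probability
                Funnels.stringFunnel(StandardCharsets.UTF_8));
    }
}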

Insert values into a Bloom filter

Insert lines from entries.txt into the my.bloom filter, populating the my.bloom-data file.

bloom-tool insert --name=my.bloom --input-file=entries.txt

or from an input stream

cat entries.txt | bloom-tool insert --name=my.bloom --stdin

alternatively

bloom-tool insert --name=my.bloom --stdin < entries.txt

Insert hexadecimal values from hashes.txt into my.bloom. Useful if the filter is intended to be used from Java code generating some hashes.

bloom-tool insert --hex --name=my.bloom --input-file=hashes.txt
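
For example, a Java application might hash incoming passwords and check the digests against a filter built this way. The sketch below assumes that --hex makes the tool decode each hex string into raw bytes before insertion, so the Java side checks raw digest bytes via a byte-array funnel (the open(..) signature is, again, an assumption):

import com.google.common.hash.Funnels;
import io.nixer.bloom.BloomFilter;
import io.nixer.bloom.FileBasedBloomFilter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HexCheckSketch {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        // SHA-1, matching the hash format of the leaked-credentials example above.
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest("password123".getBytes(StandardCharsets.UTF_8));

        // Assumption: values inserted with --hex are stored as their decoded raw bytes.
        BloomFilter<byte[]> filter = FileBasedBloomFilter.open(
                Paths.get("my.bloom"), Funnels.byteArrayFunnel());

        System.out.println(filter.mightContain(digest));
    }
}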

Check whether an element is present in a filter’s data set

Check if the string example might have been inserted into the my.bloom filter; it is printed to standard output if it might be present, and skipped otherwise.

echo example | bloom-tool check --name=my.bloom --stdin

Performance benchmark

Execute a performance benchmark and correctness verification of the Bloom filter implementation. This might take a minute, or longer for greater sizes.

bloom-tool benchmark --size 10000000 --fpp=1e-7

Sample result:

Performance numbers (date: 2018-08-24, on a MacBook with 16 GB RAM; the generated file had a size of 4 GB):

$ caffeinate bloom-tool benchmark --size=1000000000 --fpp=1e-7
Creating: /var/folders/f3/pqxz3z7s6g37nhv65114479r0000gn/T/bloom-benchmark-tmp3708262923291670873/test.bloom
Creation time       : PT1.063892385S
Parameters          : BloomFilterParameters{expectedInsertions=1000000000, falsePositivesProbability=1.0E-7, numHashFunctions=23, bitSize=33547704320, strategy=MURMUR128_MITZ_64, byteOrder=LITTLE_ENDIAN}
Inserting           : 1000000000 values
Insertion time      : PT41M42.06361178S
Insertion speed     : 399658.84 op/sec
Checking            : 2000000000 values (half exist, half should not exist)
Check time          : PT45M27.88090649S
Check speed         : 733163.21 op/sec
False Positives     : 123
False Positive rate : 0.00000012300000
FP rate evaluation  : good (122.92% target)

Cloud IP Ranges

Most big cloud providers publish their IP address ranges. Knowing whether an IP address belongs to a particular cloud provider might be valuable for various reasons, e.g. as a hint that a request is suspicious in our context.

We provide a simple Python script whose purpose is to gather the current IP ranges of various cloud providers.

The script package is available as an asset in each release of Nixer Spring Plugin; you can also see the source code here.

Supported cloud providers

  • AWS
  • Azure
  • Google Cloud Engine
  • IBM Cloud
  • Cloudflare
  • Oracle Cloud

Usage

  1. Make sure you have pipenv installed. If you are not familiar with this tool, go here.

  2. Set up the Python virtual environment
    pipenv install
    
  3. Run the script
    pipenv run python cloud-ip-ranges.py
    
  4. See help for full set of options
    pipenv run python cloud-ip-ranges.py --help