A measurement of Oodle decode speed vs. Zlib & Brotli on several ARM devices. The Oodle compressors Kraken, Mermaid and Selkie are tested.
Updated for Oodle 2.3.0
About the test
About the devices
About the test sets
About the compressors
A note on threads
Sample encode speed and x64 decode
Oodle Kraken is a new lossless data compression algorithm from RAD Game Tools. It provides very high compression ratios (similar to Brotli-11 and LZMA) with much faster decode speeds - generally 2-4X faster than Zlib (which compresses much less) and 3-15X faster than compressors with comparable compression ratios. Kraken is generic : it works on all types of data with no specialized filters or dictionaries.
Mermaid and Selkie are new compressors in Oodle 2.3.0 that are designed to be even faster than Kraken. Mermaid generally compresses slightly more than Zlib, while decoding much faster (around 5-10X). Selkie is designed to decode faster than any other compressor (faster than LZ4) while getting more compression.
The speed of Oodle Kraken has been previously documented, mostly on x86/x64 platforms. See for example :
Oodle Kraken Pareto Frontier
Performance of Oodle Kraken
PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle
The aim of this report is to demonstrate the speed on ARM devices.
This report contains a sub-page for each device and each corpus. The page for each device-corpus pair shows the performance on each file in that corpus, as well as the total time & size over all files in the corpus.
The charts here show time and size, which are both linear costs (size is linearly proportional to time over the wire) - LOWER IS BETTER!
For example in arm32_canterbury_CortexA15.html you will see the total time and compressed size of the individual files within the Canterbury Corpus on the Cortex-A15 (Nexus 10) in 32-bit shown thusly :
From this chart we can see that Kraken is just about 2X faster than Zlib, with compression between brotli-9 and brotli-11. We can see Mermaid is 5X faster than brotli-11 and compresses much better than Zlib.
At the bottom of each data page are detailed numbers. See the "about" information at the bottom of this page for more details.
|CortexA57 (Samsung S6)||arm32_canterbury_CortexA57.html||arm32_silesia_CortexA57.html||arm32_pd3d_CortexA57.html|
|QcomKryo (Samsung S7)||arm32_canterbury_QcomKryo.html||arm32_silesia_QcomKryo.html||arm32_pd3d_QcomKryo.html|
|CortexA15 (Nexus 10)||arm32_canterbury_CortexA15.html||arm32_silesia_CortexA15.html||arm32_pd3d_CortexA15.html|
|CortexA9 (Nexus 7)||arm32_canterbury_CortexA9.html||arm32_silesia_CortexA9.html||arm32_pd3d_CortexA9.html|
|iPhone 6S ARM64||iOS_canterbury_iPhone6S_64b.html||iOS_silesia_iPhone6S_64b.html||iOS_pd3d_iPhone6S_64b.html|
|iPad Air2 ARM64||iOS_canterbury_iPadAir2_64b.html||iOS_silesia_iPadAir2_64b.html||iOS_pd3d_iPadAir2_64b.html|
|iPad Pro ARM64||iOS_canterbury_iPadPro_64b.html||iOS_silesia_iPadPro_64b.html||iOS_pd3d_iPadPro_64b.html|
|iPhone 6S ARM32||iOS_canterbury_iPhone6S_32b.html||iOS_silesia_iPhone6S_32b.html||iOS_pd3d_iPhone6S_32b.html|
|iPad Air2 ARM32||iOS_canterbury_iPadAir2_32b.html||iOS_silesia_iPadAir2_32b.html||iOS_pd3d_iPadAir2_32b.html|
|iPad Pro ARM32||iOS_canterbury_iPadPro_32b.html||iOS_silesia_iPadPro_32b.html||iOS_pd3d_iPadPro_32b.html|
Corpus total size and time for all devices :
About the test
Oodle, Zlib & Brotli decode speed was measured on a variety of ARM devices.
Getting reliable timings on Android is challenging, partly due to the lack of a high-precision timer, but mostly because these devices have wild fluctuations in clock rate due to thermal throttling. As much as possible, devices need to be kept in a controlled thermal environment, either through active cooling or by intentionally heating them up so that they drop to their minimum clock rate and stay there.
Timing was done by running the decoder at least 5 times and for at least 0.5 seconds to get a single timed span with enough precision. That was repeated 10 times and the median of those times is what is reported here.
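The timing procedure described above can be sketched roughly as follows. This is not the harness used for this report; it is a minimal illustration that assumes `std::chrono::steady_clock` as the timer, takes the decoder as a hypothetical `decode` callback, and takes the median of an even number of spans as the mean of the two middle values (the report does not say which convention was used).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Median of a non-empty sample set. For an even count this returns the mean
// of the two middle values (an assumption, see the text above).
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return (samples.size() % 2 == 1)
               ? samples[mid]
               : 0.5 * (samples[mid - 1] + samples[mid]);
}

// One timed span: call `decode` until it has run at least `min_runs` times
// AND at least `min_seconds` have elapsed. Returns seconds per call.
double time_one_span(const std::function<void()>& decode,
                     int min_runs = 5, double min_seconds = 0.5) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    int runs = 0;
    double elapsed = 0.0;
    do {
        decode();
        ++runs;
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (runs < min_runs || elapsed < min_seconds);
    return elapsed / runs;
}

// Repeat the timed span `spans` times and report the median seconds per decode.
double median_decode_seconds(const std::function<void()>& decode,
                             int spans = 10, int min_runs = 5,
                             double min_seconds = 0.5) {
    std::vector<double> per_call;
    for (int i = 0; i < spans; ++i)
        per_call.push_back(time_one_span(decode, min_runs, min_seconds));
    return median(per_call);
}
```

Calling `median_decode_seconds` with a lambda that performs one whole-buffer decode gives a single number per file. Using the median rather than the mean makes the result less sensitive to a few spans that happened to run while the device was throttled.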
Oodle is run via the Oodle 2.3.0 lib. Brotli was built from the public GitHub source code (downloaded 05-14-2016) into the Oodle lib for testing, using the same compiler and options as Oodle. Zlib is run from the platform library via zlib.h.
Oodle uses NEON, which is available on all the devices tested here.
All tests are run single threaded, memory to memory, whole buffer single call decodes.
The "total" result for a corpus is the sum of all decode times or all compressed file sizes from the individual runs of files in that corpus. That is, it is not a result on a concatenation of all files in the corpus. Summing time and size means that the total is a file-size weighted average over the corpus.
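To make the "sum, not concatenation" point concrete, here is a small sketch of how such a total could be accumulated. The `FileResult` and `CorpusTotal` types are hypothetical names invented for this illustration and are not part of Oodle or of the test harness.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Per-file measurement (hypothetical record layout for this sketch).
struct FileResult {
    std::uint64_t raw_bytes;
    std::uint64_t compressed_bytes;
    double decode_seconds;
};

// Corpus total: plain sums of the per-file sizes and times.
struct CorpusTotal {
    std::uint64_t raw_bytes = 0;
    std::uint64_t compressed_bytes = 0;
    double decode_seconds = 0.0;

    // Ratio of the summed sizes. Large files dominate this number, which is
    // what makes the total a file-size weighted average.
    double ratio() const {
        assert(compressed_bytes > 0);
        return static_cast<double>(raw_bytes) /
               static_cast<double>(compressed_bytes);
    }
};

CorpusTotal sum_corpus(const std::vector<FileResult>& files) {
    CorpusTotal total;
    for (const FileResult& f : files) {
        total.raw_bytes += f.raw_bytes;
        total.compressed_bytes += f.compressed_bytes;
        total.decode_seconds += f.decode_seconds;
    }
    return total;
}
```

For example, a 1000-byte file compressed 2:1 and a 9000-byte file compressed 6:1 give a total ratio of 5:1, not the unweighted mean of 4:1, because the larger file contributes more bytes to both sums.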
Despite our best efforts, there is still a large amount of fluctuation in the speed on some of these devices (particularly the Samsung devices). Unfortunately just running more test iterations doesn't make the times more stable. Because of this, any odd results on one file on one device should be taken with a grain of salt - it could just be a thermal fluctuation.
About the devices
|QcomKryo||Qualcomm Kryo CPU on Snapdragon 820, in a Samsung Galaxy S7 US|
|CortexA9||ARM Cortex-A9 in a Google Nexus 7|
|CortexA15||ARM Cortex-A15 in a Google Nexus 10|
|CortexA57||ARM Cortex-A57 in a Samsung Galaxy Tab S2 8.0, also found in Samsung Galaxy S6|
|iPad Air2||Apple A8X "Typhoon", 1.5 GHz, tri-core, 2MB L2, 4 MB L3|
|iPad Pro||Apple A9X "Twister", 2.2 GHz, dual-core, 3MB L2, no L3|
|iPhone 6S||Apple A9 "Twister", 1.85 GHz, dual-core, 3MB L2, 4MB L3 victim cache|
About the test sets
Canterbury is an old compression corpus. It consists mainly of text, and contains many very small files. We do not usually test on it, as we believe it is not reflective of data that compressors typically work on in the modern era. The files are too small, and it has too much text and almost no binary. "Kraken" and "Kraken444" have identical performance here because the large window is never used.
Canterbury contains some tiny files (around 4k bytes) which Mermaid & Selkie don't even try to compress. These tiny files have very little effect on the "total" results, because their contribution to total time & size is negligible.
The Silesia Compression Corpus is probably the best current mainstream public compression corpus. It consists of a wide mix of file types, including some that are not common in typical compressor usage (such as large uncompressed images and large text files), so looking at the total performance on it can be misleading.
Silesia "mozilla" and "ooffice" reflect performance on real application binaries.
Public Domain 3D Test Set is maintained by RAD Game Tools and is available here (pd3d.7z 8MB). PD3D consists of Public Domain 3d binary files and is designed to be reflective of real data that is distributed in shipping games. As such it contains compiled binary 3d models, and compressed textures in hardware-ready formats. It does not contain text geometry files, art source material, or uncompressed textures.
About the compressors
Kraken :
Oodle Kraken compressor, with unbounded window (match offset limit). Compressed at level 7 (Optimal3).
Kraken444 :
Oodle Kraken compressor, with 444k match offset limit. Compressed at level 6 (Optimal2). During testing we found that many of these devices have very slow RAM, so a compressor that keeps its match references in cache can be significantly faster on those devices. The smaller window reduces compression somewhat on large files. "Kraken" and "Kraken444" are identical on files ≤ 444k bytes (e.g. everything in Canterbury).
Mermaid & Selkie :
Mermaid & Selkie are two new compressors in Oodle designed for even higher speed than Kraken. Both are run with default options at level 7 (Optimal3) in this test.
Zlib :
Zlib is compressed at level 9 using the zlib.h library in the platform SDK.
brotli-9 & brotli-11 :
Brotli is compressed with the maximum window size (24 bits) to make it as competitive with "Kraken" as possible. The parameters were set up thusly :
    brotli::BrotliParams params;
    params.lgwin = 24;
    params.quality = 9; // or 11
Otherwise default options are used. Brotli level 9 encoding takes similar time to Kraken Optimal2 encoding; Brotli level 11 is several times slower to encode (4-8X).
The Brotli default window size of 22 bits is inappropriate for the majority of ARM cores. It is ideal only on desktop CPUs with 4 MB L3 caches. On these ARM devices, a 22 bit window size needlessly hurts compression on large files and isn't small enough to help decode speed. This Brotli run is directly comparable to Kraken, Mermaid & Selkie, which all have unbounded window. Brotli would get faster to decode with an 18 or 19 bit window on ARM, but also lose a lot of compression. "Kraken444" could be compared to Brotli with an 18 or 19 bit window, but that test has not been done here.
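To put numbers on those window sizes: the Brotli format (RFC 7932) defines the sliding window as `(1 << WBITS) - 16` bytes, with WBITS between 10 and 24. The sketch below computes that size and compares it to a cache size. The "fits in cache" check is only a crude proxy introduced for this illustration; real decode speed depends on how match offsets are actually distributed, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Brotli sliding-window size in bytes for a given lgwin,
// per RFC 7932: (1 << WBITS) - 16, with WBITS in [10, 24].
std::uint64_t brotli_window_bytes(int lgwin) {
    assert(lgwin >= 10 && lgwin <= 24);
    return (std::uint64_t{1} << lgwin) - 16;
}

// Crude proxy (an assumption of this sketch): could the whole window
// stay resident in a cache of `cache_bytes`?
bool window_fits_in_cache(int lgwin, std::uint64_t cache_bytes) {
    return brotli_window_bytes(lgwin) <= cache_bytes;
}
```

With these definitions a 22-bit window is just under 4 MiB, so it fits a 4 MB L3 but not the 3 MB L2 of the iPad Pro listed above, while 18-bit and 19-bit windows are roughly 256 KiB and 512 KiB. The 24-bit window used in this test is just under 16 MiB and does not fit in cache on any of these devices, which matches the "unbounded window" Kraken setting it is compared against.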
Brotli is run in this test with its static dictionary disabled. We believe this is the fair way to compare LZ compressors - either all with the same static dictionary, or all with no static dictionary - since a preloaded dictionary helps all LZ compressors in roughly the same way. The Brotli dictionary contains mainly text which would be helpful on the small files of Canterbury, but would not help on Silesia or PD3D. If such a dictionary is desirable for your application, it could be used in Oodle as well.
A note on threads :
All decompresses were run single-threaded for this test.
Kraken & Mermaid are able to decode using two threads, even on normal compressed data that hasn't been chunked into independent units. This provides a 1-2X speedup, typically in the 1.4X - 1.8X range.
This threaded decode was not tested here, but might be well suited to some of these ARM devices which often have lots of slow cores.
For more information see cbloomrants : Oodle Kraken Thread-Phased Decoding
Sample encode speed and x64 decode :
Encode speeds are not part of this test, but to give a rough idea of the time to encode each format here are some sample encode speeds :
(speed measured on a Core i7-3770 3.4 GHz, Linux x64, single threaded)
test file : lzt99

    miniz 9       : 24,700,820 ->13,120,668 = 4.249 bpb = 1.883 to 1
    miniz encode  : 2.572 seconds, 406.15 c/b, rate= 9.60 MB/s

    brotli-9      : 24,700,820 ->10,473,560 = 3.392 bpb = 2.358 to 1
    brotli encode : 29.630 seconds, 4.68 kc/b, rate= 833.65 KB/s

    brotli-11     : 24,700,820 -> 9,828,093 = 3.183 bpb = 2.513 to 1
    brotli encode : 83.433 seconds, 13.17 kc/b, rate= 296.06 KB/s

    Kraken -z4    : 24,700,820 ->10,320,928 = 3.343 bpb = 2.393 to 1
    encode only   : 1.442 seconds, 227.72 c/b, rate= 17.13 MB/s

    Kraken -z7    : 24,700,820 -> 9,763,128 = 3.162 bpb = 2.530 to 1
    encode only   : 13.967 seconds, 2.21 kc/b, rate= 1.77 MB/s

One of the nice things about Kraken is that the -z4 (Normal) level gets good compression at much higher encode speed than most other high-compression LZ's.
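For readers unfamiliar with the units in the sample output above, the bpb (bits per byte), ratio and rate figures follow directly from the raw size, compressed size and elapsed time. The helpers below are a small illustration with made-up function names; they assume 1 MB = 10^6 bytes, which is consistent with the numbers shown. The c/b (cycles per byte) figures depend on a measured cycle count and are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Compressed bits emitted per raw input byte: 8 * compressed / raw.
double bits_per_byte(std::uint64_t raw_bytes, std::uint64_t compressed_bytes) {
    assert(raw_bytes > 0);
    return 8.0 * static_cast<double>(compressed_bytes) /
           static_cast<double>(raw_bytes);
}

// Compression ratio expressed as "N to 1": raw / compressed.
double compression_ratio(std::uint64_t raw_bytes,
                         std::uint64_t compressed_bytes) {
    assert(compressed_bytes > 0);
    return static_cast<double>(raw_bytes) /
           static_cast<double>(compressed_bytes);
}

// Throughput in MB/s over raw bytes, with 1 MB = 10^6 bytes (assumption).
double megabytes_per_second(std::uint64_t raw_bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(raw_bytes) / seconds / 1e6;
}
```

Applied to the miniz line, `bits_per_byte(24700820, 13120668)` is about 4.249, `compression_ratio(24700820, 13120668)` is about 1.883, and `megabytes_per_second(24700820, 2.572)` is about 9.60, matching the figures in the listing.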
Sample : x64 decode performance on lzt99