Oodle ARM Report 07-14-2016

A measurement of Oodle decode speed vs. Zlib & Brotli on several ARM devices. The Oodle compressors Kraken, Mermaid and Selkie are tested.

Updated for Oodle 2.3.0

The Data

About the test

About the devices

About the test sets

About the compressors

A note on threads

Sample encode speed and x64 decode


Introduction

Oodle Kraken is a new lossless data compression algorithm from RAD Game Tools. It provides very high compression ratios (similar to Brotli-11 and LZMA) with much faster decode speeds - generally 2-4X faster than Zlib (which compresses far less) and 3-15X faster than compressors with comparable compression ratios. Kraken is generic : it works on all types of data with no specialized filters or dictionaries.

Mermaid and Selkie are new compressors in Oodle 2.3.0 that are designed to be even faster than Kraken. Mermaid generally compresses slightly more than Zlib, while decoding much faster (around 5-10X). Selkie is designed to decode faster than any other compressor (faster than LZ4) while getting more compression than LZ4.

The speed of Oodle Kraken has been previously documented, mostly on x86/x64 platforms. See for example :

Oodle Kraken Pareto Frontier
Performance of Oodle Kraken
PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle

The aim of this report is to demonstrate the speed on ARM devices.


How to read this report

This report contains a sub-page for each device and corpus pair. The page for each device-corpus pair shows the performance on each file in that corpus, as well as the total time & size over all files in the corpus.

The charts here show time and size, which are both linear costs (size is linearly proportional to time over the wire) - LOWER IS BETTER!

For example in arm32_canterbury_CortexA15.html you will see the total time and compressed size of the individual files within the Canterbury Corpus on the Cortex-A15 (Nexus 10) in 32-bit shown thusly :

From this chart we can see that Kraken is just about 2X faster than Zlib, with compression between brotli9 and brotli11. We can see Mermaid is 5X faster than brotli-11 and compresses much better than Zlib.

At the bottom of each data page are detailed numbers. See the "about" information at the bottom of this page for more details.


The Data

Android ARM32 (ARMv7-A)

Device | Canterbury | Silesia | PD3D
CortexA57 (Samsung S6) | arm32_canterbury_CortexA57.html | arm32_silesia_CortexA57.html | arm32_pd3d_CortexA57.html
QcomKryo (Samsung S7) | arm32_canterbury_QcomKryo.html | arm32_silesia_QcomKryo.html | arm32_pd3d_QcomKryo.html
CortexA15 (Nexus 10) | arm32_canterbury_CortexA15.html | arm32_silesia_CortexA15.html | arm32_pd3d_CortexA15.html
CortexA9 (Nexus 7) | arm32_canterbury_CortexA9.html | arm32_silesia_CortexA9.html | arm32_pd3d_CortexA9.html

iOS ARM64

Device | Canterbury | Silesia | PD3D
iPhone 6S ARM64 | iOS_canterbury_iPhone6S_64b.html | iOS_silesia_iPhone6S_64b.html | iOS_pd3d_iPhone6S_64b.html
iPad Air2 ARM64 | iOS_canterbury_iPadAir2_64b.html | iOS_silesia_iPadAir2_64b.html | iOS_pd3d_iPadAir2_64b.html
iPad Pro ARM64 | iOS_canterbury_iPadPro_64b.html | iOS_silesia_iPadPro_64b.html | iOS_pd3d_iPadPro_64b.html

iOS ARM32

Device | Canterbury | Silesia | PD3D
iPhone 6S ARM32 | iOS_canterbury_iPhone6S_32b.html | iOS_silesia_iPhone6S_32b.html | iOS_pd3d_iPhone6S_32b.html
iPad Air2 ARM32 | iOS_canterbury_iPadAir2_32b.html | iOS_silesia_iPadAir2_32b.html | iOS_pd3d_iPadAir2_32b.html
iPad Pro ARM32 | iOS_canterbury_iPadPro_32b.html | iOS_silesia_iPadPro_32b.html | iOS_pd3d_iPadPro_32b.html

All Devices

Corpus total size and time for all devices :

Canterbury | Silesia | PD3D
alldevices_canterbury.html | alldevices_silesia.html | alldevices_pd3d.html


About the test

Oodle, Zlib & Brotli decode speed was measured on a variety of ARM devices.

Getting reliable timings on Android is challenging, partly due to the lack of a high-precision timer, but mostly because these devices have wild fluctuations in clock rate due to thermal throttling. As much as possible, devices need to be kept in a controlled thermal environment, either through active cooling or by intentionally heating them up so that they drop to their minimum clock rate and stay there.

Timing was done by running the decoder at least 5 times and for at least 0.5 seconds to get a single timed span with enough precision. That was repeated 10 times and the median of those times is what is reported here.
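The timing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the actual Oodle test harness: the function names, the use of std::chrono::steady_clock, and the choice to convert each span into seconds per call are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Time one "span": call fn repeatedly until it has run at least min_reps
// times AND at least min_seconds have elapsed, then return seconds per call.
// (Dividing the span by the call count is an assumption of this sketch.)
double time_one_span(const std::function<void()>& fn,
                     int min_reps = 5, double min_seconds = 0.5)
{
    assert(min_reps > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    int reps = 0;
    double elapsed = 0.0;
    do {
        fn();
        ++reps;
        elapsed = std::chrono::duration<double>(clock::now() - start).count();
    } while (reps < min_reps || elapsed < min_seconds);
    return elapsed / reps;
}

// Median of the samples (mean of the two middle values for an even count).
double median(std::vector<double> v)
{
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t mid = v.size() / 2;
    return (v.size() % 2) ? v[mid] : 0.5 * (v[mid - 1] + v[mid]);
}

// Repeat the timed span num_spans times and report the median.
double median_seconds_per_call(const std::function<void()>& fn,
                               int num_spans = 10,
                               int min_reps = 5, double min_seconds = 0.5)
{
    std::vector<double> spans;
    for (int i = 0; i < num_spans; ++i)
        spans.push_back(time_one_span(fn, min_reps, min_seconds));
    return median(spans);
}
```

A caller would pass a lambda that performs one whole-buffer decode. Taking the median rather than the mean keeps a single throttled span from skewing the reported time.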

Oodle is run via the Oodle 2.3.0 lib. Brotli was built from the public GitHub source code (downloaded 05-14-2016) into the Oodle lib for testing, using the same compiler and options as Oodle. Zlib is run from the platform library via zlib.h.

Oodle uses NEON, which is available on all the devices tested here.

All tests are run single-threaded and memory-to-memory, as whole-buffer decodes in a single call.

The "total" result for a corpus is a sum of all decode times or all compressed file sizes from the individual runs of files in that corpus. That is, it is not a result on a concatenation of all files in the corpus. Summing time and size means that the total is a file-size weighted average over the corpus.
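As a rough illustration of why summing gives a size-weighted result, the sketch below accumulates per-file measurements into a corpus total. The struct and field names are hypothetical and are not taken from the Oodle test code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One decode measurement for a single file (names are illustrative).
struct FileResult {
    std::uint64_t raw_bytes;         // uncompressed size
    std::uint64_t compressed_bytes;  // compressed size
    double decode_seconds;           // measured decode time
};

struct CorpusTotal {
    std::uint64_t raw_bytes = 0;
    std::uint64_t compressed_bytes = 0;
    double decode_seconds = 0.0;

    // Ratio and speed are derived from the summed values, so each file
    // contributes in proportion to its size and its decode time.
    double ratio() const {
        return double(raw_bytes) / double(compressed_bytes);
    }
    double decode_bytes_per_second() const {
        return double(raw_bytes) / decode_seconds;
    }
};

// Sum sizes and times over the individual per-file runs.
CorpusTotal total_over_corpus(const std::vector<FileResult>& files)
{
    CorpusTotal t;
    for (const FileResult& f : files) {
        t.raw_bytes += f.raw_bytes;
        t.compressed_bytes += f.compressed_bytes;
        t.decode_seconds += f.decode_seconds;
    }
    return t;
}
```

With made-up numbers, a 1,000,000 byte file compressed 2 to 1 plus a 4,000 byte file stored uncompressed gives a total ratio of about 1.99 to 1, so the small file barely moves the total.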

Despite our best efforts, there is still a large amount of fluctuation in the speed on some of these devices (particularly the Samsung devices). Unfortunately just running more test iterations doesn't make the times more stable. Because of this, any odd results on one file on one device should be taken with some salt - it could just be a thermal fluctuation.


About the devices

Device | Description
QcomKryo | Qualcomm Kryo CPU on Snapdragon 820, in a Samsung Galaxy S7 US
CortexA9 | ARM Cortex-A9 in a Google Nexus 7
CortexA15 | ARM Cortex-A15 in a Google Nexus 10
CortexA57 | ARM Cortex-A57 in a Samsung Galaxy Tab S2 8.0, also found in Samsung Galaxy S6
iPad Air2 | Apple A8X "Typhoon", 1.5 GHz, tri-core, 2 MB L2, 4 MB L3
iPad Pro | Apple A9X "Twister", 2.2 GHz, dual-core, 3 MB L2, no L3
iPhone 6S | Apple A9 "Twister", 1.85 GHz, dual-core, 3 MB L2, 4 MB L3 victim cache


About the test sets

Canterbury :

Canterbury is an old compression corpus. It consists mainly of text, and contains many very small files. We do not usually test on it, as we believe it is not reflective of data that compressors typically work on in the modern era. The files are too small, and it has too much text and almost no binary. "Kraken" and "Kraken444" have identical performance here because the large window is never used.

Canterbury contains some tiny files (around 4k bytes) which Mermaid & Selkie don't even try to compress. These tiny files have very little effect on the "total" results, because their contribution to total time & size is negligible.

Silesia :

The Silesia Compression Corpus is probably the best current mainstream public compression corpus. It consists of a wide mix of file types, including some that are not common in typical compressor usage (such as large uncompressed images and large text files), so looking at the total performance on it can be misleading.

Silesia "mozilla" and "ooffice" reflect performance on real application binaries.

PD3D :

Public Domain 3D Test Set is maintained by RAD Game Tools and is available here (pd3d.7z 8MB). PD3D consists of Public Domain 3d binary files and is designed to be reflective of real data that is distributed in shipping games. As such it contains compiled binary 3d models, and compressed textures in hardware-ready formats. It does not contain text geometry files, art source material, or uncompressed textures.


About the compressors

Kraken :

Oodle Kraken compressor, with unbounded window (match offset limit). Compressed at level 7 (Optimal3).

Kraken444 :

Oodle Kraken compressor, with 444k match offset limit. Compressed at level 6 (Optimal2). During testing we found that many of these devices have very slow RAM, so a compressor that keeps its match references in cache can be significantly faster on those devices. The smaller window reduces compression somewhat on large files. "Kraken" and "Kraken444" are identical on files ≤ 444k bytes (e.g. everything in Canterbury).

Mermaid & Selkie :

Mermaid & Selkie are two new compressors in Oodle designed for even higher speed than Kraken. Both are run with default options at level 7 (Optimal3) in this test.

zlib9 :

Zlib is compressed at level 9 using the zlib.h library in the platform SDK.

brotli9 & brotli11 :

Brotli is compressed with the maximum window size (24 bits) to make it as competitive with "Kraken" as possible. The parameters were set up thusly :

	brotli::BrotliParams params;
	params.lgwin = 24;   // window size in bits; 24 is the maximum
	params.quality = 9;  // or 11

Otherwise default options are used. Brotli level 9 encoding takes similar time to Kraken Optimal2 encoding, while Brotli level 11 is several times slower to encode (4-8X).

The Brotli default window size of 22 bits is inappropriate for the majority of ARM cores. It is ideal only on desktop CPUs with 4 MB L3 caches. On these ARM devices, a 22 bit window size needlessly hurts compression on large files and isn't small enough to help decode speed. This Brotli run is directly comparable to Kraken, Mermaid & Selkie, which all have unbounded window. Brotli would get faster to decode with an 18 or 19 bit window on ARM, but also lose a lot of compression. "Kraken444" could be compared to Brotli with an 18 or 19 bit window, but that test has not been done here.
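To make the window arithmetic concrete, the sketch below converts window bits to a nominal size in bytes and checks it against a cache size. The helper names are invented for this example, and the fit test is deliberately crude: it ignores everything else that competes for the cache.

```cpp
#include <cassert>
#include <cstdint>

// Nominal window size in bytes for a given number of window bits.
constexpr std::uint64_t window_bytes(unsigned window_bits)
{
    return std::uint64_t{1} << window_bits;
}

// True when a window of that size is no larger than the cache. This is
// only a rough upper bound on what can stay resident in the cache.
constexpr bool window_fits_in_cache(unsigned window_bits,
                                    std::uint64_t cache_bytes)
{
    return window_bytes(window_bits) <= cache_bytes;
}

// The window sizes discussed in the text.
static_assert(window_bytes(18) == 256u * 1024u, "18 bits = 256 KB");
static_assert(window_bytes(19) == 512u * 1024u, "19 bits = 512 KB");
static_assert(window_bytes(22) == 4u * 1024u * 1024u, "22 bits = 4 MB");
static_assert(window_bytes(24) == 16u * 1024u * 1024u, "24 bits = 16 MB");
```

A 22 bit window is 4 MB, which matches a 4 MB desktop L3 but exceeds the 2 MB and 3 MB L2 caches listed for the Apple devices above. The 444k limit of "Kraken444" falls between the 18 bit (256 KB) and 19 bit (512 KB) sizes, which is why those are the natural Brotli settings to compare it against.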

Brotli is run in this test with its static dictionary disabled. We believe this is the fair way to compare LZ compressors - either all with the same static dictionary, or all with no static dictionary - since a preloaded dictionary helps all LZ compressors in roughly the same way. The Brotli dictionary contains mainly text which would be helpful on the small files of Canterbury, but would not help on Silesia or PD3D. If such a dictionary is desirable for your application, it could be used in Oodle as well.


A note on threads :

All decompresses were run single-threaded for this test.

Kraken & Mermaid are able to decode using two threads, even on normal compressed data that hasn't been chunked into independent units. This provides a 1-2X speedup, typically in the 1.4X - 1.8X range.

This threaded decode was not tested here, but might be well suited to some of these ARM devices which often have lots of slow cores.

For more information see cbloomrants : Oodle Kraken Thread-Phased Decoding


Sample encode speed and x64 decode :

Encode speeds are not part of this test, but to give a rough idea of the time to encode each format here are some sample encode speeds :


(speed measured on a Core i7-3770 3.4 GHz, Linux x64, single threaded)

test file : lzt99

miniz 9     : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1 
miniz encode     : 2.572 seconds, 406.15 c/b, rate= 9.60 MB/s

brotli-9    : 24,700,820 ->10,473,560 =  3.392 bpb =  2.358 to 1
brotli encode    : 29.630 seconds, 4.68 kc/b, rate= 833.65 KB/s

brotli-11   : 24,700,820 -> 9,828,093 =  3.183 bpb =  2.513 to 1
brotli encode    : 83.433 seconds, 13.17 kc/b, rate= 296.06 KB/s

Kraken -z4  : 24,700,820 ->10,320,928 =  3.343 bpb =  2.393 to 1
encode only      : 1.442 seconds, 227.72 c/b, rate= 17.13 MB/s

Kraken -z7  : 24,700,820 -> 9,763,128 =  3.162 bpb =  2.530 to 1
encode only      : 13.967 seconds, 2.21 kc/b, rate= 1.77 MB/s

One of the nice things about Kraken is that the -z4 (Normal) level gets good compression at much higher encode speed than most other high-compression LZs.
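For readers unfamiliar with the units in the listing above, "bpb" is bits of compressed output per byte of input, "N to 1" is the compression ratio, and the rate is input bytes per second of encode time. A minimal sketch of those derivations follows. The function names are made up for this example, and it assumes 1 MB = 1,000,000 bytes, which is consistent with the figures shown.

```cpp
#include <cassert>
#include <cstdint>

// Bits of compressed output per byte of uncompressed input ("bpb").
double bits_per_byte(std::uint64_t raw_bytes, std::uint64_t compressed_bytes)
{
    assert(raw_bytes > 0);
    return 8.0 * double(compressed_bytes) / double(raw_bytes);
}

// Compression ratio, shown in the listing as "N to 1".
double compression_ratio(std::uint64_t raw_bytes, std::uint64_t compressed_bytes)
{
    assert(compressed_bytes > 0);
    return double(raw_bytes) / double(compressed_bytes);
}

// Throughput in MB/s, assuming 1 MB = 1,000,000 bytes.
double megabytes_per_second(std::uint64_t raw_bytes, double seconds)
{
    assert(seconds > 0.0);
    return double(raw_bytes) / seconds / 1.0e6;
}
```

Plugging in the Kraken -z7 line (24,700,820 bytes in, 9,763,128 bytes out, 13.967 seconds) gives about 3.162 bpb, 2.530 to 1 and 1.77 MB/s, matching the listing.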

Sample : x64 decode performance on lzt99

