Add ktx

2026-06-14 19:09:18 +01:00
parent 14bd1a9271
commit 13fa90a0e9
3958 changed files with 999286 additions and 4 deletions
@@ -0,0 +1,315 @@
+# Building ASTC Encoder
+
+This page provides instructions for building `astcenc` from the sources in
+this repository.
+
+Builds must use CMake 3.15 or higher as the build system generator. The
+examples on this page show how to use it to generate build systems for NMake
+(Windows) and Make (Linux and macOS), but CMake supports other build system
+backends.
+
+## Windows
+
+Builds for Windows are tested with CMake 3.17, and Visual Studio 2019 or newer.
+
+### Configuring the build
+
+To use CMake you must first configure the build. Create a build directory in
+the root of the `astcenc` checkout, and then run `cmake` inside that directory
+to generate the build system.
+
+```shell
+# Create a build directory
+mkdir build
+cd build
+
+# Configure your build of choice, for example:
+
+# x86-64 using a Visual Studio solution
+cmake -G "Visual Studio 16 2019" -T ClangCL -DCMAKE_INSTALL_PREFIX=..\ ^
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_ISA_SSE41=ON -DASTCENC_ISA_SSE2=ON ..
+
+# x86-64 using NMake
+cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=..\ ^
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_ISA_SSE41=ON -DASTCENC_ISA_SSE2=ON ..
+```
+
+A single CMake configure can build multiple binaries for a single target CPU
+architecture, for example building x64 for both SSE2 and AVX2. Each binary name
+will include the build variant as a postfix. It is possible to build any set of
+the supported SIMD variants by enabling only the ones you require.
+
+Using the Visual Studio Clang-CL LLVM toolchain (`-T ClangCL`) is optional but
+produces significantly faster binaries than the default toolchain. The C++ LLVM
+toolchain component must be installed via the Visual Studio installer.
+
+### Building
+
+Once you have configured the build you can use NMake to compile the project
+from your build dir, and install to your target install directory.
+
+```shell
+# Run a build and install build outputs in `${CMAKE_INSTALL_PREFIX}/bin/`
+cd build
+nmake install
+```
+
+## macOS and Linux using Make
+
+Builds for macOS and Linux are tested with CMake 3.17, and clang++ 9.0 or
+newer.
+
+> Compiling using g++ is supported, but clang++ builds are faster by ~15%.
+
+### Configuring the build
+
+To use CMake you must first configure the build. Create a build directory
+in the root of the astcenc checkout, and then run `cmake` inside that directory
+to generate the build system.
+
+```shell
+# Select your compiler (clang++ recommended, but g++ works)
+export CXX=clang++
+
+# Create a build directory
+mkdir build
+cd build
+
+# Configure your build of choice, for example:
+
+# Arm aarch64
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ \
+    -DASTCENC_ISA_NEON=ON ..
+
+# x86-64
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ \
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_ISA_SSE41=ON -DASTCENC_ISA_SSE2=ON ..
+
+# macOS universal binary build
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ ..
+```
+
+A single CMake configure can build multiple binaries for a single target CPU
+architecture, for example building x64 for both SSE2 and AVX2. Each binary name
+will include the build variant as a postfix. It is possible to build any set of
+the supported SIMD variants by enabling only the ones you require.
+
+For macOS, we additionally support the ability to build a universal binary.
+This build includes SSE4.1 (`x86_64`), AVX2 (`x86_64h`), and NEON (`arm64`)
+build slices in a single output binary. The OS will select the correct variant
+to run for the machine being used. This is the default build target for a macOS
+build, but single-target binaries can still be built by setting
+`-DASTCENC_UNIVERSAL_BINARY=OFF` and then manually selecting the specific ISA
+variants that are required.
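+
+As a sketch of the single-target route, the following configures a build for
+Apple silicon only, using just the options named above. The generator, build
+type, and install prefix are carried over from the earlier examples and can be
+changed to suit your setup.
+
+```shell
+# Disable the universal binary and build only the NEON (arm64) variant
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ \
+    -DASTCENC_UNIVERSAL_BINARY=OFF -DASTCENC_ISA_NEON=ON ..
+```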
+
+### Building
+
+Once you have configured the build you can use Make to compile the project from
+your build dir, and install to your target install directory.
+
+```shell
+# Run a build and install build outputs in `${CMAKE_INSTALL_PREFIX}/bin/`
+# for executable binaries and `${CMAKE_INSTALL_PREFIX}/lib/` for libraries
+cd build
+make install -j16
+```
+
+## macOS using Xcode
+
+Builds for macOS are tested with CMake 3.17, and Xcode 14.0 or newer.
+
+### Configuring the build
+
+To use CMake you must first configure the build. Create a build directory
+in the root of the astcenc checkout, and then run `cmake` inside that directory
+to generate the build system.
+
+```shell
+# Create a build directory
+mkdir build
+cd build
+
+# Configure a universal build
+cmake -G Xcode -DCMAKE_INSTALL_PREFIX=../ ..
+```
+
+### Building
+
+Once you have configured the build you can use CMake to compile the project
+from your build dir, and install to your target install directory.
+
+```shell
+cmake --build . --config Release
+
+# Optionally install the binaries to the installation directory
+cmake --install . --config Release
+```
+
+## Advanced build options
+
+For codec developers and power users there are a number of useful features in
+the build system.
+
+### Build Types
+
+We support and test the following `CMAKE_BUILD_TYPE` options.
+
+| Value            | Description                                              |
+| ---------------- | -------------------------------------------------------- |
+| Release          | Optimized release build                                  |
+| RelWithDebInfo   | Optimized release build with debug info                  |
+| Debug            | Unoptimized debug build with debug info                  |
+
+Note that optimized release builds are compiled with link-time optimization,
+which can make profiling more challenging ...
+
+### Shared Libraries
+
+We support building the core library as a shared object by setting the CMake
+option `-DASTCENC_SHAREDLIB=ON` at configure time. For macOS build targets the
+shared library supports the same universal build configuration as the command
+line utility.
+
+Note that the command line tool is always statically linked; the shared objects
+are an extra build output that are not currently used by the command line tool.
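+
+For example, a minimal configure that adds the shared library to an x86-64
+build might look like the following. The choice of the AVX2 variant is only
+illustrative; any supported ISA option can be used instead.
+
+```shell
+# Build the AVX2 command line tool plus the shared library
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ \
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_SHAREDLIB=ON ..
+```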
+
+### Constrained block size builds
+
+All normal builds will support all ASTC block sizes, including the worst case
+6x6x6 3D block size (216 texels per block). Compressor memory footprint and
+performance can be improved by limiting the block sizes supported in the build
+by adding `-DASTCENC_BLOCK_MAX_TEXELS=<texel_count>` to the CMake command line
+when configuring. Legal block sizes that are unavailable in a restricted build
+will return the error `ASTCENC_ERR_NOT_IMPLEMENTED` during context creation.
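+
+The texel count of a block is the product of its dimensions, so the limit can
+be worked out from the largest block size you need. As an illustration, a
+limit of 64 texels covers 8x8 blocks (8 x 8 = 64) and smaller footprints such
+as 6x6 (36 texels), while excluding 12x12 (144 texels) and 6x6x6 (216 texels).
+The value 64 and the AVX2 variant below are example choices, not
+recommendations.
+
+```shell
+# Limit the build to block sizes of at most 64 texels (8 * 8 = 64)
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ \
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_BLOCK_MAX_TEXELS=64 ..
+```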
+
+### Non-invariant builds
+
+All normal builds are designed to be invariant, so any build from the same git
+revision will produce bit-identical results for all compilers and CPU
+architectures. To achieve this we sacrifice some performance, so if this is
+not required you can specify `-DASTCENC_INVARIANCE=OFF` to enable additional
+optimizations. This has most benefit for AVX2 builds where we are able to
+enable use of the FMA instruction set extensions.
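+
+A non-invariant AVX2 configuration could be sketched as follows, reusing the
+generator and install prefix from the earlier examples. Output from this build
+should not be expected to match invariant builds bit-for-bit.
+
+```shell
+# Trade bit-exact output for additional optimizations on AVX2
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ \
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_INVARIANCE=OFF ..
+```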
+
+### No intrinsics builds
+
+All normal builds will use SIMD accelerated code paths using intrinsics, as all
+supported target architectures (x86 and arm64) guarantee SIMD availability. For
+development purposes it is possible to build an intrinsic-free build which uses
+no explicit SIMD acceleration (the compiler may still auto-vectorize).
+
+To enable this binary variant add `-DASTCENC_ISA_NONE=ON` to the CMake command
+line when configuring. It is NOT recommended to use this for production; it is
+significantly slower than the vectorized SIMD builds.
+
+### No x86 gather instruction builds
+
+On many x86 microarchitectures the native AVX gather instructions are slower
+than simply performing manual scalar loads and combining the results. Gathers
+are enabled by default, but can be disabled by setting the CMake option
+`-DASTCENC_X86_GATHERS=OFF` on the command line when configuring.
+
+Note that we have seen mixed results when compiling the scalar fallback path,
+so we would recommend testing which option works best for the compiler and
+microarchitecture pairing that you are targeting.
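+
+One way to run that comparison is to configure a second build with gathers
+disabled and benchmark it against the default build on your target machine.
+The sketch below assumes an AVX2 build; the other options match the earlier
+examples.
+
+```shell
+# Use scalar loads instead of native AVX gather instructions
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../ \
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_X86_GATHERS=OFF ..
+```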
+
+### Test builds
+
+We support building unit tests. These use the `googletest` framework, which is
+pulled in through a git submodule. On first use, you must fetch the submodule
+dependency:
+
+```shell
+git submodule init
+git submodule update
+```
+
+To build unit tests add `-DASTCENC_UNITTEST=ON` to the CMake command line when
+configuring.
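+
+As a sketch, a Make-based configure and build with unit tests enabled might
+look like this. The AVX2 variant and the job count are example choices taken
+from the earlier sections.
+
+```shell
+cd build
+
+# Enable unit tests alongside the AVX2 build, then compile
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_UNITTEST=ON ..
+make -j16
+```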
+
+To run unit tests use the CMake `ctest` utility from your build directory after
+you have built the tests.
+
+```shell
+cd build
+ctest --verbose
+```
+
+### Sanitizer builds
+
+We support building with sanitizers on Linux and macOS when using Clang.
+
+To build binaries with ASAN checking enabled add `-DASTCENC_ASAN=ON` to the
+CMake command line when configuring.
+
+To build binaries with UBSAN checking enabled add `-DASTCENC_UBSAN=ON` to the
+CMake command line when configuring.
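+
+For instance, an ASAN-enabled configure on Linux or macOS could be sketched as
+below. The `RelWithDebInfo` build type and AVX2 variant are illustrative
+choices; substitute `-DASTCENC_UBSAN=ON` to build with UBSAN instead.
+
+```shell
+# Sanitizer builds require Clang
+export CXX=clang++
+
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=RelWithDebInfo \
+    -DASTCENC_ISA_AVX2=ON -DASTCENC_ASAN=ON ..
+```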
+
+### Android builds
+
+Builds of the command line utility for Android are not officially supported, but can be a useful
+development build for testing on e.g. different Arm CPU microarchitectures.
+
+The build script below shows one possible route to building the command line tool for Android. Once
+built the application can be pushed to e.g. `/data/local/tmp` and executed from an Android shell
+terminal over `adb`.
+
+```shell
+ANDROID_ABI=arm64-v8a
+ANDROID_NDK=/work/tools/android/ndk/22.1.7171670
+
+BUILD_TYPE=RelWithDebInfo
+
+BUILD_DIR=build
+
+mkdir -p ${BUILD_DIR}
+cd ${BUILD_DIR}
+
+cmake \
+    -DCMAKE_INSTALL_PREFIX=./ \
+    -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
+    -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK}/build/cmake/android.toolchain.cmake \
+    -DANDROID_ABI=${ANDROID_ABI} \
+    -DANDROID_ARM_NEON=ON \
+    -DANDROID_PLATFORM=android-21 \
+    -DCMAKE_ANDROID_NDK_TOOLCHAIN_VERSION=clang \
+    -DANDROID_TOOLCHAIN=clang \
+    -DANDROID_STL=c++_static \
+    -DARCH=aarch64 \
+    -DASTCENC_ISA_NEON=ON \
+    ..
+
+make -j16
+```
+
+## Packaging a release bundle
+
+We support building a release bundle of all enabled binary configurations in
+the current CMake configuration using the `package` build target.
+
+Configure CMake with:
+
+* `-DASTCENC_PACKAGE=<arch>` to set the package architecture/variant name used
+to name the package archive (not set by default).
+
+```shell
+# Run a build and package build outputs in `./astcenc-<ver>-<os>-<arch>.<fmt>`
+cd build
+make package -j16
+```
+
+Windows packages will use the `.zip` format, other packages will use the
+`.tar.gz` format.
+
+## Integrating as a library into another project
+
+The core codec of `astcenc` is built as a library, and so can be easily
+integrated into other projects using CMake. An example of the CMake integration
+and the codec API usage can be found in the `./Utils/Example` directory in the
+repository. See the [Example Readme](../Utils/Example/README.md) for more
+details.
+
+- - -
+
+_Copyright © 2019-2024, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,328 @@
+# 2.x series change log
+
+This page summarizes the major functional and performance changes in each
+release of the 2.x series.
+
+All performance data on this page is measured on an Intel Core i5-9600K
+clocked at 4.2 GHz, running astcenc using 6 threads.
+
+<!-- ---------------------------------------------------------------------- -->
+## 2.5
+
+**Status:** Released, March 2021
+
+The 2.5 release is the last major release in the 2.x series. After this release
+a `2.x` branch will provide stable long-term support, and the `main` branch
+will switch to focusing on more radical changes for the 3.x series.
+
+Reminder for users of the library interface - the API is not designed to be
+stable across versions, and this release is not compatible with earlier 2.x
+releases. Please update and rebuild your client-side code using the updated
+`astcenc.h` header.
+
+* **General:**
+  * **Feature:** The `ISA_INVARIANCE` build option is no longer supported, as
+    there is no longer any performance benefit from the variant paths. All
+    builds are now using the equivalent of the `ISA_INVARIANCE=ON` setting, and
+    all builds (except Armv7) are now believed to be invariant across operating
+    systems, compilers, CPU architectures, and SIMD instruction sets.
+  * **Feature:** Armv8 32-bit builds with NEON are now supported, with
+    out-of-the-box support for Arm Linux soft-float and hard-float ABIs. There
+    are no pre-built binaries for these targets; support is included for
+    library users targeting older 32-bit Android and iOS devices.
+  * **Feature:** A compressor mode for encoding HDR textures that have been
+    encoded into LDR RGBM wrapper format is now supported. Note that this
+    encoding has some strong recommendations for how the RGBM encoding is
+    implemented to avoid block artifacts in the compressed image.
+* **Core API:**
+  * **API Change:** The core API has been changed to be a pure C API, making it
+    easier to wrap the codec in a stable shared library ABI. Some entry points
+    that used to accept references now expect pointers.
+  * **API Change:** The decompression functionality in the core API has been
+    changed to allow use of multiple threads. The design pattern matches the
+    compression functionality, requiring the caller to create the threads,
+    synchronize them between images, and to call the new
+    `astcenc_decompress_reset()` function between images.
+  * **API Feature:** Defines to support exporting public API entry point
+    symbols from a shared object are provided, but not exposed off-the-shelf by
+    the CMake provided by the project.
+  * **API Feature:** New `astcenc_get_block_info()` function added to the core
+    API to allow users to perform high level analysis of compressed data. This
+    API is not implemented in decompressor-only builds.
+  * **API Feature:** Codec configuration structure has been extended to expose
+    the new RGBM compression mode. See the API header for details.
+
+<!-- ---------------------------------------------------------------------- -->
+## 2.4
+
+**Status:** Released, February 2021
+
+The 2.4 release is the fifth release in the 2.x series. It is primarily a bug
+fix release for HDR image handling, which impacts all earlier 2.x series
+releases.
+
+* **General:**
+  * **Feature:** When using the `-a` option, or the equivalent config option
+    for the API, any 2D blocks that are entirely zero alpha after the alpha
+    filter radius is taken into account are replaced by transparent black
+    constant color blocks. This is an RDO-like technique to improve compression
+    ratios of any additional application packaging compression that is applied.
+* **Command Line:**
+  * **Bug fix:** The command line wrapper now correctly loads HDR images that
+    have a non-square aspect ratio.
+
+<!-- ---------------------------------------------------------------------- -->
+## 2.3
+
+**Status:** Released, January 2021
+
+The 2.3 release is the fourth release in the 2.x series. It includes a number
+of performance improvements and new features.
+
+Reminder for users of the library interface - the API is not designed to be
+stable across versions, and this release is not compatible with 2.2. Please
+recompile your client-side code using the updated `astcenc.h` header.
+
+* **General:**
+  * **Feature:** Decompressor-only builds of the codec are supported again.
+    While this is primarily a feature for library users who want to shrink
+    binary size, a variant command line tool `astcdec` can be built by
+    specifying `DECOMPRESSOR=ON` on the CMake configure command line.
+  * **Feature:** Diagnostic builds of the codec can now be built. These builds
+    generate a JSON file containing a trace of the compressor execution.
+    Diagnostic builds are only suitable for codec development; they are slower
+    and JSON generation cannot be disabled. Build by setting `DIAGNOSTICS=ON`
+    on the CMake configure command line.
+  * **Feature:** Code compatibility improved with older versions of GCC,
+    earliest compiler now tested is GCC 7.5 (was GCC 9.3).
+  * **Feature:** Code compatibility improved with newer versions of LLVM,
+    latest compiler now tested is Clang 12.0 (was Clang 9.0).
+  * **Feature:** Code compatibility improved with the Visual Studio 2019 LLVM
+    toolset (`clang-cl`). Using the LLVM toolset gives 25% performance
+    improvements and is recommended.
+* **Command Line:**
+  * **Feature:** Quality level now accepts either a preset (`-fast`, etc) or a
+    float value between 0 and 100, allowing more control over the compression
+    quality vs performance trade-off. The presets are not evenly spaced in the
+    float range; they have been spaced to give the best distribution of points
+    between the fast and thorough presets.
+    * `-fastest`: 0.0
+    * `-fast`: 10.0
+    * `-medium`: 60.0
+    * `-thorough`: 98.0
+    * `-exhaustive`: 100.0
+* **Core API:**
+  * **API Change:** Quality level preset enum replaced with a float value
+    between 0 (`-fastest`) and 100 (`-exhaustive`). See above for more info.
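+
+To illustrate the preset mapping listed above, the two commands below would
+request the same quality level, and a third picks a point between presets.
+This is only a sketch: the binary name `astcenc-avx2`, the `-cl` compression
+mode, the 6x6 block size, and the file names are assumptions for the example,
+not taken from this change log.
+
+```shell
+# Preset form
+astcenc-avx2 -cl input.png output.astc 6x6 -medium
+
+# Equivalent float form (-medium maps to 60.0)
+astcenc-avx2 -cl input.png output.astc 6x6 60
+
+# A custom level between -medium (60.0) and -thorough (98.0)
+astcenc-avx2 -cl input.png output.astc 6x6 80
+```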
+
+### Performance
+
+This release includes a number of optimizations to improve performance.
+
+* New compressor algorithm for handling encoding candidates and refinement.
+* Vectorized implementation of `compute_error_of_weight_set()`.
+* Unrolled implementation of `encode_ise()`.
+* Many other small improvements!
+
+The most significant change is the change to the compressor path, which now
+uses an adaptive approach to candidate trials and block refinement.
+
+In earlier releases the quality level determined the number of encoding
+candidates and the number of iterative refinement passes used for each major
+encoding trial. This was a fixed behavior; the compressor always tried the full
+N candidates and M refinement iterations specified by the quality level for
+each encoding trial.
+
+The new approach implements two optimizations for this:
+
+* Compression will complete when a block candidate hits the specified target
+  quality, after its M refinement iterations have been applied. Later block
+  candidates are simply abandoned.
+* Block candidates will predict how much refinement can improve them, and
+  abandon refinement if they are unlikely to improve upon the best known
+  encoding already in-hand.
+
+This pair of optimizations provides significant performance improvement to the
+high quality modes which use the most block candidates and refinement
+iterations. A minor loss of image quality is expected, as the blocks we no
+longer test or refine may have been better coding choices.
+
+**Absolute performance vs 2.2 release:**
+
+![Absolute scores 2.3 vs 2.2](./ChangeLogImg/absolute-2.2-to-2.3.png)
+
+**Relative performance vs 2.2 release:**
+
+![Relative scores 2.3 vs 2.2](./ChangeLogImg/relative-2.2-to-2.3.png)
+
+<!-- ---------------------------------------------------------------------- -->
+## 2.2
+
+**Status:** Released, January 2021
+
+The 2.2 release is the third release in the 2.x series. It includes a number
+of performance improvements and new features.
+
+Reminder for users of the library interface - the API is not designed to be
+stable across versions, and this release is not compatible with 2.1. Please
+recompile your client-side code using the updated `astcenc.h` header.
+
+* **General:**
+  * **Feature:** New Arm aarch64 NEON accelerated vector library support.
+  * **Improvement:** New CMake build system for all platforms.
+  * **Improvement:** SSE4.2 feature profile changed to SSE4.1, which more
+    accurately reflects the feature set used.
+* **Binary releases:**
+  * **Improvement:** Linux binaries changed to use Clang 9.0, which gives
+    up to 15% performance improvement.
+  * **Improvement:** Windows binaries are now code signed.
+  * **Improvement:** macOS binaries for Apple silicon platforms now provided.
+  * **Improvement:** macOS binaries are now code signed and notarized.
+* **Command Line:**
+  * **Feature:** New image preprocess `-pp-normalize` option added. This forces
+    normal vectors to be unit length, which is useful when compressing source
+    textures that use normal length to encode an NDF, which is incompatible
+    with ASTC's two channel encoding.
+  * **Feature:** New image preprocess `-pp-premultiply` option added. This
+    scales RGB values by the alpha value. This can be useful to minimize
+    cross-channel color bleed caused by GPU post-multiply filtering/blending.
+  * **Improvements:** Command line tool cleanly traps and reports errors for
+    corrupt input images rather than relying on standard library `assert()`
+    calls in release builds.
+* **Core API:**
+  * **API Change:** Images using region-based metrics no longer need to include
+    padding; all input images should be tightly packed and `dim_pad` is removed
+    from the `astcenc_image` structure. This makes it easier to directly use
+    images loaded from other libraries.
+  * **API Change:** Image `data` is no longer a 3D array accessed using
+    `data[z][y][x]` indexing, it's an array of 2D slices. This makes it easier
+    to directly use images loaded from other libraries.
+  * **API Change:** New `ASTCENC_FLG_SELF_DECOMPRESS_ONLY` flag added to the
+    codec config. Using this flag enables additional optimizations that
+    aggressively exploit implementation- and configuration-specific behavior
+    to gain performance. When using this flag the codec can only reliably
+    decompress images that were compressed in the same context session. Images
+    produced via other means may fail to decompress correctly, even if they are
+    otherwise valid ASTC files.
+
+### Performance
+
+There is one major set of optimizations in this release, related to the new
+`ASTCENC_FLG_SELF_DECOMPRESS_ONLY` mode. These allow the compressor to only
+create data tables it knows that it is going to use, based on its current set
+of heuristics, rather than needing the full set the format allows.
+
+The first benefit of these changes is a reduced context creation time, which
+can be reduced by up to 250ms on our test machine. This is a significant
+percentage of the command line utility runtime for a small image when using a
+quick search preset. Compressing the whole Kodak test suite using the command
+line utility and the `-fastest` preset is ~30% faster with this release, which
+is mostly due to faster startup.
+
+The reduction in the data table size in this mode also improves the core codec
+speed. Our test sets show an average of 12% improvement in the codec for
+`-fastest` mode, and an average of 3% for `-medium` mode.
+
+Key for performance charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Absolute performance vs 2.1 release:**
+
+![Absolute scores 2.2 vs 2.1](./ChangeLogImg/absolute-2.1-to-2.2.png)
+
+**Relative performance vs 2.1 release:**
+
+![Relative scores 2.2 vs 2.1](./ChangeLogImg/relative-2.1-to-2.2.png)
+
+
+<!-- ---------------------------------------------------------------------- -->
+## 2.1
+
+**Status:** Released, November 2020
+
+The 2.1 release is the second release in the 2.x series. It includes a number
+of performance optimizations and new features.
+
+Reminder for users of the library interface - the API is not designed to be
+stable across versions, and this release is not compatible with 2.0. Please
+recompile your client-side code using the updated `astcenc.h` header.
+
+### Features:
+
+* **Command line:**
+  * **Bug fix:** The meaning of the `-tH\cH\dH` and `-th\ch\dh` compression
+    modes was inverted. They now match the documentation; use `-*H` for HDR
+    RGBA, and `-*h` for HDR RGB with LDR alpha.
+  * **Feature:** A new `-fastest` quality preset is now available. This is
+    designed for fast "roughing out" of new content, and sacrifices significant
+    image quality compared to `-fast`. We do not recommend its use for
+    production builds.
+  * **Feature:** A new `-candidatelimit` compression tuning option is now
+    available. This is a power-user control to determine how many candidates
+    are returned for each block mode encoding trial. This feature is used
+    automatically by the search presets; see `-help` for details.
+  * **Improvement:** The compression test modes (`-tl\ts\th\tH`) now emit a
+    MTex/s performance metric, in addition to coding time.
+* **Core API:**
+  * **Feature:** A new quality preset `ASTCENC_PRE_FASTEST` is available. See
+    `-fastest` above for details.
+  * **Feature:** A new tuning option `tune_candidate_limit` is available in
+    the config structure. See `-candidatelimit` above for details.
+  * **Feature:** Image input/output can now use `ASTCENC_TYPE_F32` data types.
+* **Stability:**
+  * **Feature:** The SSE2, SSE4.2, and AVX2 variants now produce identical
+    compressed output when run on the same CPU when compiled with the
+    preprocessor define `ASTCENC_ISA_INVARIANCE=1`. For Make builds this can
+    be set on the command line by setting `ISA_INV=1`. ISA invariance is off
+    by default; it reduces performance by 1-3%.
+
+### Performance
+
+Key for performance charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Absolute performance vs 2.0 release:**
+
+![Absolute scores 2.1 vs 2.0](./ChangeLogImg/absolute-2.0-to-2.1.png)
+
+**Relative performance vs 2.0 release:**
+
+![Relative scores 2.1 vs 2.0](./ChangeLogImg/relative-2.0-to-2.1.png)
+
+
+<!-- ---------------------------------------------------------------------- -->
+## 2.0
+
+**Status:** Released, August 2020
+
+The 2.0 release is the first release in the 2.x series. It includes a number of
+major changes over the earlier 1.7 series, and is not command-line compatible.
+
+### Features:
+
+* The core codec can be built as a library, exposed via a new codec API.
+* The core codec supports accelerated SIMD paths for SSE2, SSE4.2, and AVX2.
+* The command line syntax has a clearer mapping to Khronos feature profiles.
+
+### Performance:
+
+Key for performance charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Absolute performance vs 1.7 release:**
+
+![Absolute scores 2.0 vs 1.7](./ChangeLogImg/absolute-1.7-to-2.0.png)
+
+**Relative performance vs 1.7 release:**
+
+![Relative scores 2.0 vs 1.7](./ChangeLogImg/relative-1.7-to-2.0.png)
+
+- - -
+
+_Copyright © 2020-2022, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,308 @@
+# 3.x series change log
+
+This page summarizes the major functional and performance changes in each
+release of the 3.x series.
+
+All performance data on this page is measured on an Intel Core i5-9600K
+clocked at 4.2 GHz, running `astcenc` using AVX2 and 6 threads.
+
+<!-- ---------------------------------------------------------------------- -->
+## 3.7
+
+**Status:** April 2022
+
+The 3.7 release contains another round of performance optimizations, including
+significant improvements to the command line front-end (faster PNG loader) and
+the arm64 build of the codec (faster NEON implementation).
+
+* **General:**
+  * **Feature:** The command line tool PNG loader has been switched to use
+    the Wuffs library, which is robust and significantly faster than the
+    current stb_image implementation.
+  * **Feature:** Support for non-invariant builds returns. Opt-in to slightly
+    faster, but not bit-exact, builds by setting `-DNO_INVARIANCE=ON` for the
+    CMake configuration. This improves performance by around 2%.
+  * **Optimization:** Changed SIMD `select()` so that it matches the default
+    NEON behavior (bitwise select), rather than the default x86-64 behavior
+    (lane select on MSB). Specialization `select_msb()` added for the one case
+    we want to select on a sign-bit, where NEON needs a different
+    implementation. This provides a significant (>25%) performance uplift on
+    NEON implementations.
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 3.6 release:**
+
+![Relative scores 3.7 vs 3.6](./ChangeLogImg/relative-3.6-to-3.7.png)
+
+<!-- ---------------------------------------------------------------------- -->
+## 3.6
+
+**Status:** April 2022
+
+The 3.6 release contains another round of performance optimizations.
+
+There are no interface changes in this release, but in general the API is not
+designed to be binary compatible across versions. We always recommend
+rebuilding your client-side code using the updated `astcenc.h` header.
+
+* **General:**
+  * **Feature:** Data tables are now optimized for contexts without the
+    `SELF_DECOMPRESS_ONLY` flag set. The flag therefore no longer improves
+    compression performance, but still reduces context creation time and
+    context data table memory footprint.
+  * **Feature:** Image quality for 4x4 `-fastest` configuration has been
+    improved.
+  * **Optimization:** Decimation modes are reliably excluded from processing
+    when they are only partially selected in the compressor configuration (e.g.
+    if used for single plane, but not dual plane modes). This is a significant
+    performance optimization for all quality levels.
+  * **Optimization:** Fast-path block load function variant added for 2D LDR
+    images with no swizzle. This is a moderate performance optimization for the
+    fast and fastest quality levels.
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 3.5 release:**
+
+![Relative scores 3.6 vs 3.5](./ChangeLogImg/relative-3.5-to-3.6.png)
+
+<!-- ---------------------------------------------------------------------- -->
+## 3.5
+
+**Status:** March 2022
+
+The 3.5 release contains another round of performance optimizations.
+
+There are no interface changes in this release, but in general the API is not
+designed to be binary compatible across versions. We always recommend
+rebuilding your client-side code using the updated `astcenc.h` header.
+
+* **General:**
+  * **Feature:** Compressor configurations using `SELF_DECOMPRESS_ONLY` mode
+    store compacted partition tables, which significantly improves both
+    context create time and runtime performance.
+  * **Feature:** Bilinear infill for decimated weight grids supports a new
+    variant for half-decimated grids which are only decimated in one axis.
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 3.4 release:**
+
+![Relative scores 3.5 vs 3.4](./ChangeLogImg/relative-3.4-to-3.5.png)
+
+
+<!-- ---------------------------------------------------------------------- -->
+## 3.4
+
+**Status:** February 2022
+
+The 3.4 release introduces another round of optimizations, removing a number
+of power-user configuration options to simplify the core compressor data path.
+
+Reminder for users of the library interface - the API is not designed to be
+binary compatible across versions, and this release is not compatible with
+earlier releases. Please update and rebuild your client-side code using the
+updated `astcenc.h` header.
+
+* **General:**
+  * **Feature:** Many memory allocations have been moved off the stack into
+    dynamically allocated working memory. This significantly reduces the peak
+    stack usage, allowing the compressor to run in systems with 128KB stack
+    limits.
+  * **Feature:** Builds now support `-DBLOCK_MAX_TEXELS=<count>` to allow a
+    compressor to support a subset of block sizes. This can reduce binary size
+    and runtime memory footprint, and improve performance.
+  * **Feature:** The `-v` and `-va` options to set a per-texel error weight
+    function are no longer supported.
+  * **Feature:** The `-b` option to set a per-texel error weight boost for
+    block border texels is no longer supported.
+  * **Feature:** The `-a` option to set a per-texel error weight based on texel
+    alpha value is no longer supported as an error weighting tool, but is still
+    supported for providing sprite-sheet RDO.
+  * **Feature:** The `-mask` option to set an error metric for mask map
+    textures is still supported, but is currently a no-op in the compressor.
+  * **Feature:** The `-perceptual` option to set a perceptual error metric is
+    still supported, but is currently a no-op in the compressor for mask map
+    and normal map textures.
+  * **Bug-fix:** Corrected decompression of error blocks in some cases, so now
+    returning the expected error color (magenta for LDR, NaN for HDR). Note
+    that astcenc determines the error color to use based on the output image
+    data type not the decoder profile.
+* **Binary releases:**
+  * **Improvement:** Windows binaries changed to use ClangCL 12.0, which gives
+    up to 10% performance improvement.
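+
+As a rough sketch of the `BLOCK_MAX_TEXELS` build option described above, the
+configure line below limits the compressor to blocks of at most 64 texels,
+which covers every block size with 64 or fewer texels, such as 8x8. The value
+64, the Make generator, and the build type are illustrative assumptions, not
+recommended settings.
+
+```shell
+# Run from a build directory inside the astcenc checkout
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -DBLOCK_MAX_TEXELS=64 ..
+```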
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 3.3 release:**
+
+![Relative scores 3.4 vs 3.3](./ChangeLogImg/relative-3.3-to-3.4.png)
+
+
+<!-- ---------------------------------------------------------------------- -->
+## 3.3
+
+**Status:** November 2021
+
+The 3.3 release improves image quality for normal maps, and two component
+textures. Normal maps are expected to compress 25% slower than the 3.2
+release, although it should be noted that they are still faster to compress
+in 3.3 than when using the 2.5 series. This release also fixes one reported
+stability issue.
+
+* **General:**
+  * **Feature:** Normal map image quality has been improved.
+  * **Feature:** Two component image quality has been improved, provided
+    that unused components are correctly zero-weighted using e.g. `-cw` on the
+    command line.
+  * **Bug-fix:** Improved stability when trying to compress complex blocks that
+    could not beat even the starting quality threshold. These will now always
+    compress into constant color blocks.
+
+<!-- ---------------------------------------------------------------------- -->
+## 3.2
+
+**Status:** August 2021
+
+The 3.2 release is a bugfix release; no significant image quality or
+performance differences are expected.
+
+* **General:**
+  * **Bug-fix:** Improved stability when new contexts were created while other
+    contexts were compressing or decompressing an image.
+  * **Bug-fix:** Improved stability when decompressing blocks with invalid
+    block encodings.
+
+<!-- ---------------------------------------------------------------------- -->
+## 3.1
+
+**Status:** July 2021
+
+The 3.1 release gives another performance boost, typically between 5 and 20%
+faster than the 3.0 release, as well as further incremental improvements to
+image quality. A number of build system improvements make astcenc easier and
+faster to integrate into other projects as a library, including support for
+building universal binaries on macOS. Full change list is shown below.
+
+Reminder for users of the library interface - the API is not designed to be
+binary compatible across versions, and this release is not compatible with
+earlier releases. Please update and rebuild your client-side code using the
+updated `astcenc.h` header.
+
+* **General:**
+  * **Feature:** RGB color data now supports `-perceptual` operation. The
+    current implementation is simple, weighting color channel errors by their
+    contribution to perceived luminance. This mimics the behavior of the human
+    visual system, which is most sensitive to green, then red, then blue.
+  * **Feature:** Codec supports a new low weight search mode, which is a
+    simpler weight assignment for encodings with a low number of weights in the
+    weight grid. The weight threshold can be overridden using the new
+    `-lowweightmodelimit` command line option.
+  * **Feature:** All platform builds now support building a native binary.
+    Native binaries automatically select the SIMD level based on the default
+    configuration of the compiler in use. Native binaries built on one machine
+    may use different SIMD options than native binaries built on another.
+  * **Feature:** macOS platform builds now support building universal binaries
+    containing both `x86_64` and `arm64` target support.
+  * **Feature:** Building the command line can be disabled when using as a
+    library in another project. Set `-DCLI=OFF` during the CMake configure
+    step.
+  * **Feature:** A standalone minimal example of the core codec API usage has
+    been added in the `./Utils/Example/` directory.
+* **Core API:**
+  * **Feature:** Config flag `ASTCENC_FLG_USE_PERCEPTUAL` works for color data.
+  * **Feature:** Config option `tune_low_weight_count_limit` added.
+  * **Feature:** New heuristic added which prunes dual weight plane searches if
+    they are unlikely to help. This heuristic is not user controllable.
+  * **Feature:** Image quality has been improved. In general we see significant
+    improvements (up to 0.2dB) for high bitrate encodings (4x4, 5x4), and a
+    smaller improvement (up to 0.1dB) for lower bitrate encodings.
+  * **Bug fix:** Arm "none" SIMD builds were not always invariant with other builds.
+    This fix has also been back-ported to the 2.x LTS branch.
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 3.0 release:**
+
+![Relative scores 3.1 vs 3.0](./ChangeLogImg/relative-3.0-to-3.1.png)
+
+<!-- ---------------------------------------------------------------------- -->
+## 3.0
+
+**Status:** June 2021
+
+The 3.0 release is the first in a series of updates to the compressor that are
+making more radical changes than we felt we could make with the 2.x series.
+The primary goals of the 3.x series are to keep the image quality ~static or
+better compared to the 2.5 release, but continue to improve performance.
+
+Reminder for users of the library interface - the API is not designed to be
+binary compatible across versions, and this release is not compatible with
+earlier releases. Please update and rebuild your client-side code using the
+updated `astcenc.h` header.
+
+* **General:**
+  * **Feature:** The code has been significantly cleaned up, with improved
+    comments, API documentation, function naming, and variable naming.
+* **Core API:**
+  * **API Change:** The core APIs for `astcenc_compress_image()` and for
+    `astcenc_decompress_image()` now accept swizzle structures by `const`
+    pointer, instead of pass-by-value.
+  * **API Change:** Calling the `astcenc_compress_reset()` and the
+    `astcenc_decompress_reset()` functions between images is no longer required
+    if the context was created for use by a single thread.
+  * **Feature:** New heuristics have been added for controlling when to search
+    beyond 2 partitions and 1 plane, and when to search beyond 3 partitions and
+    1 plane. The previous `tune_partition_early_out_limit` config option has
+    been removed, and replaced with two new options
+    `tune_2_partition_early_out_limit_factor` and
+    `tune_3_partition_early_out_limit_factor`. See command line help for more
+    detailed documentation.
+  * **Feature:** New heuristics have been added for controlling when to use
+    dual weight planes. The previous `tune_two_plane_early_out_limit` has been
+    renamed to `tune_2_plane_early_out_limit_correlation`. See command line help
+    for more detailed documentation.
+  * **Feature:** Support for using dual weight planes has been restricted to
+    single partition blocks; it rarely helps blocks with 2 or more partitions
+    and takes considerable compression search time.
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 2.5 release:**
+
+![Relative scores 3.0 vs 2.5](./ChangeLogImg/relative-2.5-to-3.0.png)
+
+- - -
+
+_Copyright © 2021-2022, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,416 @@
+# 4.x series change log
+
+This page summarizes the major functional and performance changes in each
+release of the 4.x series.
+
+All performance data on this page is measured on an Intel Core i5-9600K
+clocked at 4.2 GHz, running `astcenc` using AVX2 and 6 threads.
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.8.0
+
+**Status:** May 2024
+
+The 4.8.0 release is a minor maintenance release.
+
+* **General:**
+  * **Bug fix:** Native builds on macOS will now correctly build for arm64 when
+    run outside of Rosetta on an Apple silicon device.
+  * **Bug fix:** Multiple small improvements to remove use of undefined
+    language behavior, to improve support for deployment using Emscripten.
+  * **Feature:** Builds using Clang can now build with undefined behavior
+    sanitizer by setting `-DASTCENC_UBSAN=ON` on the CMake configure line.
+  * **Feature:** Updated to Wuffs library 0.3.4, which ignores tRNS alpha
+    chunks for type 4 (LA) and 6 (RGBA) PNGs, to improve compatibility with
+    libpng.
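+
+A minimal sketch of enabling the sanitizer build described above, assuming
+Clang is installed as `clang` and `clang++` and that a Make-based build is
+used. The generator and build type are illustrative choices.
+
+```shell
+# Run from a build directory inside the astcenc checkout
+CC=clang CXX=clang++ cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
+    -DASTCENC_UBSAN=ON ..
+```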
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.7.0
+
+**Status:** January 2024
+
+The 4.7.0 release is a major maintenance release, fixing rounding behavior in
+the decompressor to match the Khronos specification. This fix includes the
+addition of explicit support for optimizing for `decode_unorm8` rounding.
+
+Reminder - the codec library API is not designed to be binary compatible across
+versions. We always recommend rebuilding your client-side code using the
+updated `astcenc.h` header.
+
+* **General:**
+  * **Bug fix:** sRGB LDR decompression now uses the correct endpoint expansion
+    method to create the 16-bit RGB endpoint colors, and removes the previous
+    correction code from the interpolation function. This bug could result in
+    LSB bit flips relative to the standard specification.
+  * **Bug fix:** Decompressing to an 8-bit per component output image now
+    matches the `decode_unorm8` extension rounding rules. This bug could result
+    in LSB bit flips relative to the standard specification.
+  * **Bug fix:** Code now avoids using `alignas()` in the reference C
+    implementation, as the default `alignas(16)` is narrower than the
+    native minimum alignment requirement on some CPUs.
+  * **Feature:** Library configuration supports a new flag,
+    `ASTCENC_FLG_USE_DECODE_UNORM8`. This flag indicates that the image will be
+    used with the `decode_unorm8` decode mode. When set during compression
+    this allows the compressor to use the correct rounding when determining the
+    best encoding.
+  * **Feature:** Command line tool supports a new option, `-decode_unorm8`.
+    This option indicates that the image will be used with the `decode_unorm8`
+    decode mode. This option will automatically be set for decompression
+    (`-d*`) and trial (`-t*`) tool operation if the decompressed output image
+    is stored to an 8-bit per component file format. This option must be set
+    manually for compression (`-c*`) tool operation, as the desired decode mode
+    cannot be reliably determined.
+  * **Feature:** Library configuration supports a new optional progress
+    reporting callback. This is called during compression to allow
+    interactive tooling use cases to display incremental progress. The
+    command line tool uses this feature to show compression progress unless
+    `-silent` is used.
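+
+For example, a compression run that opts in to `decode_unorm8` rounding might
+look like the sketch below. The binary name `astcenc-avx2`, the file names,
+the block size, and the quality preset are illustrative assumptions.
+
+```shell
+# -decode_unorm8 is not set automatically for compression (-c*) operations
+astcenc-avx2 -cl input.png output.astc 6x6 -medium -decode_unorm8
+```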
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.6.1
+
+**Status:** November 2023
+
+The 4.6.1 release is a minor maintenance release to fix a scaling bug on
+large core count Windows systems.
+
+* **General:**
+  * **Optimization:** Windows builds of the `astcenc` command line tool can now
+    use more than 64 cores on large core count systems. This change doubled
+    command line performance for `-exhaustive` compression when testing on a
+    96 core/192 thread system.
+  * **Feature:** Windows Arm64 native builds of the `astcenc` command line tool
+    are now included in the prebuilt release binaries.
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.6.0
+
+**Status:** November 2023
+
+The 4.6.0 release retunes the compressor heuristics to give improvements to
+performance for trivial losses to image quality. It also includes some minor
+bug fixes and code quality improvements.
+
+Reminder - the codec library API is not designed to be binary compatible across
+versions. We always recommend rebuilding your client-side code using the updated
+`astcenc.h` header.
+
+* **General:**
+  * **Bug-fix:** Fixed context allocation for contexts allocated with the
+    `ASTCENC_FLG_DECOMPRESS_ONLY` flag.
+  * **Bug-fix:** Reduced use of `reinterpret_cast` in the core codec to
+    avoid strict aliasing violations.
+  * **Optimization:** `-medium` search quality no longer tests 4 partition
+     encodings for block sizes between 25 and 83 texels (inclusive). This
+     improves performance for a tiny drop in image quality.
+  * **Optimization:** `-thorough` and higher search qualities no longer test the
+     mode0 first search for block sizes between 25 and 83 texels (inclusive).
+     This improves performance for a tiny drop in image quality.
+  * **Optimization:** `TUNE_MAX_PARTITIONING_CANDIDATES` reduced from 32 to 8
+     to reduce the size of stack allocated data structures. This causes a tiny
+     drop in image quality for the `-verythorough` and `-exhaustive` presets.
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.5.0
+
+**Status:** June 2023
+
+The 4.5.0 release is a maintenance release with small image quality
+improvements, and a number of build system quality of life improvements.
+
+* **General:**
+  * **Bug-fix:** Improved handling of compiler arguments in CMake, including
+    consistent use of MSVC-style command line arguments for ClangCL.
+  * **Bug-fix:** Invariant Clang builds now use `-ffp-model=precise` with
+    `-ffp-contract=off` which is needed to restore invariance due to recent
+    changes in compiler defaults.
+  * **Change:** macOS binary releases are now distributed as a single universal
+    binary for all platforms.
+  * **Change:** Windows binary releases are now compiled with VS2022.
+  * **Change:** Invariant MSVC builds for VS2022 now use `/fp:precise` instead
+    of `/fp:strict`, which is now possible because precise no longer implies
+    contraction. This should improve performance for MSVC builds.
+  * **Change:** Non-invariant Clang builds now use `-ffp-model=precise` with
+    `-ffp-contract=on`. This should improve performance on older Clang
+    versions which defaulted to no contraction.
+  * **Change:** Non-invariant MSVC builds for VS2022 now use `/fp:precise`
+    with `/fp:contract`. This should improve performance for MSVC builds.
+  * **Change:** CMake config variables now use an `ASTCENC_` prefix to add a
+    namespace and group options when the library is used in a larger project.
+  * **Change:** CMake config `ASTCENC_UNIVERSAL_BUILD` for building macOS
+    universal binaries has been improved to include the `x86_64h` slice for
+    AVX2 builds. Universal builds are now on by default for macOS, and always
+    include NEON (arm64), SSE4.1 (x86_64), and AVX2 (x86_64h) variants.
+  * **Change:** CMake config `ASTCENC_NO_INVARIANCE` has been inverted to
+    remove the negated option, and is now `ASTCENC_INVARIANCE` with a default
+    of `ON`. Disabling this option can substantially improve performance, but
+    images can differ across platforms and compilers.
+  * **Optimization:** Color quantization and packing for LDR RGB and RGBA has
+    been vectorized to improve performance.
+  * **Change:** Color quantization for LDR RGB and RGBA endpoints will now try
+    multiple quantization packing methods, and pick the one with the lowest
+    endpoint encoding error. This gives a minor image quality improvement, for
+    no significant performance impact when combined with the vectorization
+    optimizations.
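+
+As a sketch, a build that trades cross-platform invariance for speed could be
+configured as shown below. Only the `ASTCENC_INVARIANCE` option comes from the
+notes above; the generator and build type are illustrative assumptions.
+
+```shell
+# Run from a build directory inside the astcenc checkout.
+# Output images may differ across platforms and compilers with this setting.
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
+    -DASTCENC_INVARIANCE=OFF ..
+```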
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.4.0
+
+**Status:** March 2023
+
+The 4.4.0 release is a minor release with image quality improvements, a small
+performance boost, and a few new quality-of-life features.
+
+* **General:**
+  * **Change:** Core library no longer checks availability of required
+    instruction set extensions, such as SSE4.1 or AVX2. Checking compatibility
+    is now the responsibility of the caller. See `astcenccli_entry.cpp` for
+    an example of code performing this check.
+  * **Change:** Core library can be built as a shared object by setting the
+    `-DSHAREDLIB=ON` CMake option, resulting in e.g. `libastcenc-avx2-shared.so`.
+    Note that the command line tool is always statically linked.
+  * **Change:** Decompressed 3D images will now write one output file per
+    slice, if the target format is a 2D image format.
+  * **Change:** Command line errors print to stderr instead of stdout.
+  * **Change:** Color encoding uses new quantization tables, that now factor
+    in floating-point rounding if a distance tie is found when using the
+    integer quant256 value. This improves image quality for 4x4 and 5x5 block
+    sizes.
+  * **Optimization:** Partition selection uses a simplified line calculation
+    with a faster approximation. This improves performance for all block sizes.
+  * **Bug-fix:** Fixed missing symbol error in decompressor-only builds.
+  * **Bug-fix:** Fixed infinity handling in debug trace JSON files.
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 4.3 release:**
+
+![Relative scores 4.4 vs 4.3](./ChangeLogImg/relative-4.3-to-4.4.png)
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.3.1
+
+**Status:** January 2023
+
+The 4.3.1 release is a minor maintenance release. No performance or image
+quality changes are expected.
+
+* **General:**
+  * **Bug-fix:** Fixed typo in `-2/3/4partitioncandidatelimit` CLI options.
+  * **Bug-fix:** Fixed handling for `-3/4partitionindexlimit` CLI options.
+  * **Bug-fix:** Updated to `stb_image.h` v2.28, which includes multiple fixes
+    and improvements for image loading.
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.3.0
+
+**Status:** January 2023
+
+The 4.3.0 release is an optimization release. There are minor performance
+and image quality improvements in this release.
+
+Reminder - the codec library API is not designed to be binary compatible across
+versions. We always recommend rebuilding your client-side code using the updated
+`astcenc.h` header.
+
+* **General:**
+  * **Bug-fix:** Use lower case `windows.h` include for MinGW compatibility.
+  * **Change:** The `-mask` command line option, `ASTCENC_FLG_MAP_MASK` in the
+    library API, has been removed.
+  * **Optimization:** Always skip blue-contraction for `QUANT_256` encodings.
+    This gives a small image quality improvement for the 4x4 block size.
+  * **Optimization:** Always skip RGBO vector calculation for LDR encodings.
+  * **Optimization:** Defer color packing and scrambling to physical layer.
+  * **Optimization:** Remove folded `decimation_info` lookup tables. This
+    significantly reduces compressor memory footprint and improves context
+    creation time. Impact increases with the active block size.
+  * **Optimization:** Increased trial and refinement pruning by using stricter
+    target errors when determining whether to skip iterations.
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 4.2 release:**
+
+![Relative scores 4.3 vs 4.2](./ChangeLogImg/relative-4.2-to-4.3.png)
+
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.2.0
+
+**Status:** November 2022
+
+The 4.2.0 release is an optimization release. There are significant performance
+improvements, minor image quality improvements, and library interface changes in
+this release.
+
+Reminder - the codec library API is not designed to be binary compatible across
+versions. We always recommend rebuilding your client-side code using the updated
+`astcenc.h` header.
+
+* **General:**
+  * **Bug-fix:** Compression for RGB and RGBA base+offset encodings no
+    longer generates endpoints with the incorrect blue-contract behavior.
+  * **Bug-fix:** Lowest channel correlation calculation now correctly ignores
+    constant color channels for the purposes of filtering 2 plane encodings.
+    On average this improves both performance and image quality.
+  * **Bug-fix:** ISA compatibility now checked in `config_init()` as well as
+    in `context_alloc()`.
+  * **Change:** Removed the low-weight count optimization, as more recent
+    changes had significantly reduced its performance benefit. Option removed
+    from both command line and configuration structure.
+  * **Feature:** The `-exhaustive` mode now runs full trials on more
+    partitioning candidates and block candidates. This improves image quality
+    by 0.1 to 0.25 dB, but slows down compression by 3x. The `-verythorough`
+    and `-thorough` modes also test more candidates.
+  * **Feature:** A new preset, `-verythorough`, has been introduced to provide
+    a standard performance point between `-thorough` and the re-tuned
+    `-exhaustive` mode. This new mode is faster and higher quality than the
+    `-exhaustive` preset in the 4.1 release.
+  * **Feature:** The compressor can now independently vary the number of
+    partitionings considered for error estimation for 2/3/4 partitions. This
+    allows heuristics to put more effort into 2 partitions, and less into
+    3/4 partitions.
+  * **Feature:** The compressor can now run trials on a variable number of
+    candidate partitionings, allowing high quality modes to explore more of the
+    search space at the expense of slower compression. The number of trials is
+    independently configurable for 2/3/4 partition cases.
+  * **Optimization:** Introduce early-out threshold for 2/3/4 partition
+    searches based on the results after 1 of 2 trials. This significantly
+    improves performance for `-medium` and `-thorough` searches, for a minor
+    loss in image quality.
+  * **Optimization:** Reduce early-out threshold for 3/4 partition searches
+    based on 2/3 partition results. This significantly improves performance,
+    especially for `-thorough` searches, for a minor loss in image quality.
+  * **Optimization:** Use direct vector compare to create a SIMD mask instead
+    of a scalar compare that is broadcast to a vector mask.
+  * **Optimization:** Remove obsolete partition validity masks from the
+    partition selection algorithm.
+  * **Optimization:** Removed obsolete channel scaling from partition
+    `avgs_and_dirs()` calculation.
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 4.0 and 4.1 release:**
+
+![Relative scores 4.2 vs 4.0](./ChangeLogImg/relative-4.0-to-4.2.png)
+
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.1.0
+
+**Status:** August 2022
+
+The 4.1.0 release is a maintenance release. There is no performance or image
+quality change in this release.
+
+* **General:**
+  * **Change:** Command line decompressor no longer uses the legacy
+    `GL_LUMINANCE` or `GL_LUMINANCE_ALPHA` format enums when writing KTX
+    output files. Luminance textures now use the `GL_RED` format and
+    luminance_alpha textures now use the `GL_RG` format.
+  * **Change:** Command line tool gains a new `-dimage` option to generate
+    diagnostic images showing aspects of the compression encoding. The output
+    file name with its extension stripped is used as the stem of the diagnostic
+    image file names.
+  * **Bug-fix:** Library decompressor builds for SSE no longer use masked store
+    `maskmovdqu` instructions, as they can generate faults on masked lanes.
+  * **Bug-fix:** Command line decompressor now correctly uses sized type enums
+    for the internal format when writing output KTX files.
+  * **Bug-fix:** Command line compressor now correctly loads 16 and 32-bit per
+    component input KTX files.
+  * **Bug-fix:** Fixed GCC9 compiler warnings on Arm aarch64.
+
+<!-- ---------------------------------------------------------------------- -->
+## 4.0.0
+
+**Status:** July 2022
+
+The 4.0.0 release introduces some major performance enhancements, and a number
+of larger changes to the heuristics used in the codec to find a more effective
+cost:quality trade off.
+
+* **General:**
+  * **Change:** The `-array` option for specifying the number of image planes
+    for ASTC 3D volumetric block compression has been renamed to `-zdim`.
+  * **Change:** The build root package directory is now `bin` instead of
+    `astcenc`, allowing the CMake install step to write binaries into
+    `/usr/local/bin` if the user wishes to do so.
+  * **Feature:** A new `-ssw` option for specifying the shader sampling swizzle
+    has been added as a convenience alternative to the `-cw` option. This is
+    needed to correct error weighting during compression if not all components
+    are read in the shader. For example, to extract and compress two components
+    from an RGBA input image, weighting the two components equally when
+    sampling through .ra in the shader, use `-esw ggga -ssw ra`. In this
+    example `-ssw ra` is equivalent to the alternative `-cw 1 0 0 1` encoding.
+  * **Feature:** The `-a` alpha weighting option has been re-enabled in the
+    backend, and now again applies alpha scaling to the RGB error metrics when
+    encoding. This is based on the maximum alpha in each block, not the
+    individual texel alpha values used in the earlier implementation.
+  * **Feature:** The command line tool now has `-repeats <count>` for testing,
+    which will iterate around compression and decompression `count` times.
+    Reported performance metrics also now separate compression and
+    decompression scores.
+  * **Feature:** The core codec is now warning clean up to /W4 for both MSVC
+    `cl.exe` and `clangcl.exe` compilers.
+  * **Feature:** The core codec now supports arm64 for both MSVC `cl.exe` and
+    `clangcl.exe` compilers.
+  * **Feature:** `NO_INVARIANCE` builds will enable the `-ffp-contract=fast`
+    option for all targets when using Clang or GCC. In addition AVX2 targets
+    will also set the `-mfma` option. This reduces image quality by up to 0.2dB
+    (normally much less), but improves performance by 5-20%.
+  * **Optimization:** Angular endpoint min/max weight selection is restricted
+    to weight `QUANT_11` or lower. Higher quantization levels assume default
+    0-1 range, which is less accurate but much faster.
+  * **Optimization:** Maximum weight quantization for later trials is selected
+    based on the weight quantization of the best encoding from the 1 plane 1
+    partition trial. This significantly reduces the search space for the later
+    trials with more planes or partitions.
+  * **Optimization:** Small data tables now use in-register SIMD permutes
+    rather than gathers (AVX2) or unrolled scalar lookups (SSE/NEON). This can
+    be a significant optimization for paths that are load unit limited.
+  * **Optimization:** Decompressed image block writes in the decompressor now
+    use a vectorized approach to writing each row of texels in the block,
+    including the ability to exploit masked stores if the target supports them.
+  * **Optimization:** Weight scrambling has been moved into the physical layer;
+    the rest of the codec now uses linear order weights.
+  * **Optimization:** Weight packing has been moved into the physical layer;
+    the rest of the codec now uses unpacked weights in the 0-64 range.
+  * **Optimization:** Consistently vectorize the creation of unquantized weight
+    grids when they are needed.
+  * **Optimization:** Remove redundant per-decimation mode copies of endpoint
+    and weight structures, which were really read-only duplicates.
+  * **Optimization:** Early-out the same endpoint mode color calculation if it
+    cannot be applied.
+  * **Optimization:** Numerous type size reductions applied to arrays to reduce
+    both context working buffer size usage and stack usage.
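+
+The `-ssw` example in the notes above can be sketched as two command lines
+that should weight errors identically. The binary name `astcenc-avx2`, the
+file names, the block size, and the quality preset are illustrative
+assumptions.
+
+```shell
+# Weight errors by the components the shader samples (.ra)
+astcenc-avx2 -cl input.png output.astc 6x6 -medium -esw ggga -ssw ra
+
+# Equivalent explicit component weights
+astcenc-avx2 -cl input.png output.astc 6x6 -medium -esw ggga -cw 1 0 0 1
+```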
+
+### Performance:
+
+Key for charts:
+
+* Color = block size (see legend).
+* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).
+
+**Relative performance vs 3.7 release:**
+
+![Relative scores 4.0 vs 3.7](./ChangeLogImg/relative-3.7-to-4.0.png)
+
+
+- - -
+
+_Copyright © 2022-2024, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,105 @@
+# 5.x series change log
+
+This page summarizes the major functional and performance changes in each
+release of the 5.x series.
+
+All performance data on this page is measured on an Intel Core i5-9600K
+clocked at 4.2 GHz, running `astcenc` using AVX2 and 6 threads.
+
+<!-- ---------------------------------------------------------------------- -->
+## 5.3.0
+
+**Status:** March 2025
+
+The 5.3.0 release is a minor maintenance release.
+
+* **General:**
+  * **Feature:** Reference C builds (`ASTCENC_ISA_NONE`) now support compiling
+    for big-endian CPUs. Compile with `-DASTCENC_BIG_ENDIAN=ON` when compiling
+    for a big-endian target; it is not auto-detected.
+  * **Improvement:** Builds using GCC now specify `-flto=auto` to allow
+    parallel link steps, and remove the log warnings about not setting a CPU
+    count parameter value.
+  * **Bug fix:** Builds using MSVC `cl.exe` that do not specify an explicit
+    ISA using the preprocessor configuration defines will now correctly
+    default to the SSE2 backend on x86-64 and the NEON backend on Arm64.
+    Previously they would have defaulted to the reference C implementation,
+    which is around 3.25 times slower.
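+
+A minimal sketch of a big-endian reference C configuration is shown below. It
+assumes a cross-compilation toolchain for the big-endian target has already
+been set up for CMake, which is omitted here. The generator and build type
+are illustrative choices.
+
+```shell
+# Run from a build directory inside the astcenc checkout.
+# Endianness is not auto-detected, so it must be set explicitly.
+cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release \
+    -DASTCENC_ISA_NONE=ON -DASTCENC_BIG_ENDIAN=ON ..
+```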
+
+
+<!-- ---------------------------------------------------------------------- -->
+## 5.2.0
+
+**Status:** February 2025
+
+The 5.2.0 release is a minor maintenance release.
+
+This release includes changes to the public interface in the `astcenc.h`
+header. We always recommend rebuilding your client-side code using the
+header from the same release to avoid compatibility issues.
+
+* **General:**
+  * **Change:** Changed sRGB alpha channel endpoint expansion to match the
+    revised Khronos Data Format Specification (v1.4.0), which reverts an
+    unintended specification change. Compared to previous releases, this change
+    can cause LSB bit differences in the alpha channel of compressed images.
+  * **Feature:** Arm64 builds for Linux added to the GitHub Actions builds, and
+    Arm64 binaries for NEON, 128-bit SVE, and 256-bit SVE added to release
+    builds.
+  * **Feature:** Added a new codec API, `astcenc_compress_cancel()`, which can
+    be used to cancel an in-flight compression. This is designed to help make
+    it easier to integrate the codec into an interactive user interface that
+    can respond to user events with low latency.
+  * **Bug fix:** Removed incorrect `static` variable qualifier, which could
+    result in an incorrect `tune_mse_overshoot` heuristic threshold being used
+    if a user ran multiple concurrent compressions with different settings.
+
+<!-- ---------------------------------------------------------------------- -->
+## 5.1.0
+
+**Status:** November 2024
+
+The 5.1.0 release is an optimization release, giving moderate performance
+improvements on all platforms. There are no image quality differences.
+
+* **General:**
+  * **Feature:** Added a new CMake build option to control use of native
+    gathers, as they can be slower than scalar loads on some common x86
+    microarchitectures. Build with `-DASTCENC_X86_GATHERS=OFF` to disable use
+    of native gathers in AVX2 builds.
+  * **Optimization:** Added new `gather()` abstraction for gathers using byte
+    indices, allowing implementations without gather hardware to skip the
+    byte-to-int index conversion.
+  * **Optimization:** Optimized `compute_lowest_and_highest_weight()` to
+    pre-compute min/max outside of the main loop.
+  * **Optimization:** Added improved intrinsics sequence for SSE and AVX2
+    integer `hmin()` and `hmax()`.
+  * **Optimization:** Added improved intrinsics sequence for `vint4(uint8_t*)`
+    on systems implementing Arm SVE.
+
+<!-- ---------------------------------------------------------------------- -->
+## 5.0.0
+
+**Status:** November 2024
+
+The 5.0.0 release is the first stable release in the 5.x series. The main new
+feature is support for the Arm Scalable Vector Extensions (SVE) SIMD instruction
+set.
+
+* **General:**
+  * **Bug fix:** Fixed incorrect return type in "None" vector library
+    reference implementation.
+  * **Bug fix:** Fixed sincos table index under/overflow.
+  * **Feature:** Changed `ASTCENC_ISA_NATIVE` builds to use `-march=native` and
+    `-mcpu=native`.
+  * **Feature:** Added backend for Arm SVE fixed-width 256-bit builds. These
+    can only run on hardware implementing 256-bit SVE.
+  * **Feature:** Added backend for Arm SVE 128-bit builds. These are portable
+    builds and can run on hardware implementing any SVE vector length, but the
+    explicit SVE use is augmented NEON and will only use the bottom 128 bits of
+    each SVE vector.
+  * **Feature:** Optimized NEON mask `any()` and `all()` functions.
+  * **Feature:** Migrated build and test to GitHub Actions pipelines.
+
+- - -
+
+_Copyright © 2022-2025, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,235 @@
+# Effective ASTC Encoding
+
+Most texture compression schemes encode a single color format at a single
+bitrate, so there are relatively few configuration options available to content
+creators beyond selecting which compressed format to use.
+
+ASTC on the other hand is an extremely flexible container format which can
+compress multiple color formats at multiple bit rates. Inevitably this
+flexibility gives rise to questions about how to best use ASTC to encode a
+specific color format, or what the equivalent settings are to get a close
+match to another compression format.
+
+This page aims to give some guidelines, but note that they are only guidelines
+and are not exhaustive so please deviate from them as needed.
+
+## Traditional format reference
+
+The most commonly used non-ASTC compressed formats, their color format, and
+their compressed bitrate are shown in the table below.
+
+| Name     | Color Format | Bits/Pixel | Notes            |
+| -------- | ------------ | ---------- | ---------------- |
+| BC1      | RGB+A        | 4          | RGB565 + 1-bit A |
+| BC3      | RGB+A        | 8          | BC1 RGB + BC4 A  |
+| BC3nm    | G+R          | 8          | BC1 G   + BC4 R  |
+| BC4      | R            | 4          | L8               |
+| BC5      | R+G          | 8          | BC4 R + BC4 G    |
+| BC6H     | RGB (HDR)    | 8          |                  |
+| BC7      | RGB / RGBA   | 8          |                  |
+| EAC_R11  | R            | 4          | R11              |
+| EAC_RG11 | RG           | 8          | RG11             |
+| ETC1     | RGB          | 4          | RGB565           |
+| ETC2     | RGB+A        | 4          | RGB565 + 1-bit A |
+| ETC2+EAC | RGB+A        | 8          | RGB565 + EAC A   |
+| PVRTC    | RGBA         | 2 or 4     |                  |
+
+**Note:** BC2 (RGB+A) is not included in the table because it's rarely used in
+practice due to poor quality alpha encoding; BC3 is nearly always used instead.
+
+**Note:** Color representations shown with a `+` symbol indicate non-correlated
+compression groups; e.g. an `RGB + A` format compresses `RGB` and `A`
+independently and does not assume the two signals are correlated. This can be
+a strength (it improves quality when compressing non-correlated signals), but
+also a weakness (it reduces quality when compressing correlated signals).
+
+# ASTC Format Mapping
+
+The main question which arises with the mapping of another format on to ASTC
+is how to handle cases where the input isn't a 4 component RGBA input. ASTC is
+a container format which always decompresses in to a 4 component RGBA result.
+However, the internal compressed representation is very flexible and can store
+1-4 components as needed on a per-block basis.
+
+To get the best quality for a given bitrate, or the lowest bitrate for a given
+quality, it is important that as few components as possible are stored in the
+internal representation to avoid wasting coding space.
+
+Specific optimizations in the ASTC coding scheme exist for:
+
+* Encoding the RGB components as a single luminance component, so only a single
+  value needs to be stored in the coding instead of three.
+* Encoding the A component as a constant 1.0 value, so the coding doesn't
+  actually need to store a per-pixel alpha value at all.
+
+... so mapping the inputs you give to the compressor to hit these paths is
+really important if you want to get the best output quality for your chosen
+bitrate.
+
+## Encoding 1-4 component data
+
+The table below shows the recommended component usage for data with different
+numbers of color components present in the data.
+
+The coding swizzle should be applied when compressing an image. This can be
+handled by the compressor when reading an uncompressed input image by
+specifying the swizzle using the `-esw` command line option.
+
+The sampling swizzle is what you should use in your shader programs to read
+the data from the compressed texture, assuming no additional API-level
+component swizzling is specified by the application.
+
+| Input components | ASTC Endpoint | Coding Swizzle | Sampling Swizzle   |
+| ---------------- | ------------- | -------------- | ------------------ |
+| 1                | L + 1         | `rrr1`         | `.g` <sup>1</sup>  |
+| 2                | L + A         | `rrrg`         | `.ga` <sup>1</sup> |
+| 3                | RGB + 1       | `rgb1`         | `.rgb`             |
+| 4                | RGB + A       | `rgba`         | `.rgba`            |
+
+**1:** Sampling from `g` is preferred to sampling from `r` because it allows a
+single shader to be compatible with ASTC, BC1, or ETC formats. BC1 and ETC1
+store color endpoints as RGB565 data, so the `g` component will have higher
+precision. For ASTC it doesn't actually make any difference; the same single
+component luminance will be returned for all three of the `.rgb` components.
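+
+As an illustration of how a coding swizzle rearranges components, the sketch
+below applies a four-character swizzle string to one RGBA pixel. The
+`apply_swizzle` helper is hypothetical and is not part of `astcenc`; it only
+models the selectors used in the table above (`r`, `g`, `b`, `a`, `0`, `1`).
+
+```python
+def apply_swizzle(pixel, swizzle):
+    """Return a new RGBA tuple built by applying a 4-character swizzle."""
+    if len(swizzle) != 4:
+        raise ValueError("swizzle must have exactly four selectors")
+    r, g, b, a = pixel
+    # Each selector picks a source component or a constant.
+    sources = {"r": r, "g": g, "b": b, "a": a, "0": 0.0, "1": 1.0}
+    return tuple(sources[s] for s in swizzle)
+
+
+# Two-component data (first value in red, second in green) mapped on to the
+# L+A endpoint layout.
+packed = apply_swizzle((0.25, 0.75, 0.0, 1.0), "rrrg")
+# A shader would then read the two values back through the .ga components.
+x, y = packed[1], packed[3]
+```
+
+The first input value ends up replicated across RGB and the second lands in
+alpha, which is why the matching sampling swizzle is `.ga`.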
+
+## Equivalence with other formats
+
+Based on these component encoding requirements we can now derive the ASTC
+coding equivalents for most of the other texture compression formats in common
+use today.
+
+| Format   | ASTC Coding Swizzle | ASTC Sampling Swizzle | Notes            |
+| -------- | ------------------- | --------------------- | ---------------- |
+| BC1      | `rgba` <sup>1</sup> | `.rgba`               |                  |
+| BC3      | `rgba`              | `.rgba`               |                  |
+| BC3nm    | `gggr`              | `.ag`                 |                  |
+| BC4      | `rrr1`              | `.r`                  |                  |
+| BC5      | `rrrg`              | `.ra` <sup>2</sup>    |                  |
+| BC6H     | `rgb1`              | `.rgb` <sup>3</sup>   | HDR profile only |
+| BC7      | `rgba`              | `.rgba`               |                  |
+| EAC_R11  | `rrr1`              | `.r`                  |                  |
+| EAC_RG11 | `rrrg`              | `.ra` <sup>2</sup>    |                  |
+| ETC1     | `rgb1`              | `.rgb`                |                  |
+| ETC2     | `rgba` <sup>1</sup> | `.rgba`               |                  |
+| ETC2+EAC | `rgba`              | `.rgba`               |                  |
+
+**1:** ASTC has no equivalent of the 1-bit punch-through alpha encoding
+supported by BC1 or ETC2; if alpha is present it will be a full alpha
+component.
+
+**2:** ASTC relies on using the L+A color endpoint type for coding efficiency
+for two component data. It therefore has no direct equivalent of a two-plane
+format sampled through the `.rg` components such as BC5 or EAC_RG11. This can
+be emulated by setting texture component swizzles in the runtime API - e.g. via
+`glTexParameteri()` for OpenGL ES - although it has been noted that API
+controlled swizzles are not available in WebGL.
+
+**3:** ASTC can only store unsigned values, and has no equivalent of the BC6
+signed endpoint mode.
+
+# Other Considerations
+
+This section outlines some of the other things to consider when encoding
+textures using ASTC.
+
+## Decode mode extensions
+
+ASTC is specified to decompress into a 16-bit per component RGBA output by
+default, with the exception of the sRGB format which uses an 8-bit value for the
+RGB components.
+
+A 16-bit per component output format often provides more precision than many
+use cases require, especially for LDR textures which originally came from an
+8-bit per component source image. Most implementations of ASTC support the
+decode mode extensions, which allow an application to opt-in to a lower
+precision decompressed format (RGBA8 for LDR, RGB9E5 for HDR). Using these
+extensions can improve GPU texture cache efficiency, and even improve texture
+filtering throughput, for use cases that do not need the higher precision.
+
+The ASTC format uses different data rounding rules when the decode mode
+extensions are used. To ensure that the compressor chooses the best encodings
+for the RGBA8 rounding rules, you can specify `-decode_unorm8` when compressing
+textures that will be decompressed into the RGBA8 intermediate. This gives a
+small image quality boost.
+
+**Note:** This mode is automatically enabled if you use the `astcenc`
+decompressor to write an 8-bit per component output image.
+
+## Encoding non-correlated components
+
+Most other texture compression formats have a static component assignment in
+terms of the expected data correlation. For example, ETC2+EAC assumes that RGB
+are always correlated and that alpha is non-correlated. ASTC can automatically
+encode data as either fully correlated across all 4 components, or with any one
+component assigned to a separate non-correlated partition to the other three.
+
+The non-correlated component can be changed on a block-by-block basis, so the
+compressor can dynamically adjust the coding based on the data present in the
+image. This means that there is no need for non-correlated data to be stored
+in a specific component in the input image.
+
+It is however worth noting that the alpha component is treated differently to
+the RGB color components in some circumstances:
+
+* When coding for sRGB the alpha component will always be stored in linear
+  space.
+* When coding for HDR the alpha component can optionally be kept as LDR data.
+
+## Encoding normal maps
+
+The best way to store normal maps using ASTC is similar to the scheme used by
+BC5; store the X and Y components of a unit-length normal. The Z component of
+the normal can be reconstructed in shader code based on the knowledge that the
+vector is unit length.
+
+To encode this we need to store only two input components in the compressed
+data, and therefore use the `rrrg` coding swizzle to align the data with the
+ASTC luminance+alpha endpoint. We can sample this in shader code using the
+`.ga` sampling swizzle, and reconstruct the Z value with:
+
+    vec3 nml;
+    nml.xy = texture(...).ga;                // Load normals (range 0 to 1)
+    nml.xy = nml.xy * 2.0 - 1.0;             // Unpack normals (range -1 to +1)
+    nml.z = sqrt(1.0 - dot(nml.xy, nml.xy)); // Compute Z, given unit length
+
+The encoding swizzle and appropriate component weighting is enabled by using
+the `-normal` command line option. If you wish to use a different pair of
+components you can specify a custom swizzle after setting the `-normal`
+parameter. For example, to match BC5n component ordering use
+`-normal -esw gggr` for compression and `-normal -dsw arz1` for decompression.
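+
+The reconstruction arithmetic can be checked offline with a few lines of
+Python. This is a sketch of the same math as the shader snippet above, not code
+from `astcenc`; the `reconstruct_normal` name is hypothetical, and the clamp is
+an added assumption to guard against lossy compression pushing the XY length
+slightly above 1.
+
+```python
+import math
+
+
+def reconstruct_normal(g, a):
+    """Rebuild a unit normal from two sampled components in the range 0 to 1."""
+    x = g * 2.0 - 1.0
+    y = a * 2.0 - 1.0
+    # Clamp so that compression error cannot produce sqrt of a negative value.
+    z = math.sqrt(max(0.0, 1.0 - (x * x + y * y)))
+    return (x, y, z)
+
+
+flat = reconstruct_normal(0.5, 0.5)   # A normal pointing straight along +Z
+```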
+
+## Encoding sRGB data
+
+The ASTC LDR profile can compress sRGB encoded color, which is a more
+efficient use of bits than storing linear encoded color because the gamma
+corrected value distribution more closely matches human perception of
+luminance.
+
+For color data it is nearly always a perceptual quality win to use sRGB input
+source textures that are then compressed using the ASTC sRGB compression mode
+(compress using the `-cs` command line option rather than the `-cl` command
+line option). Note that sRGB gamma correction is only applied to the RGB
+components during decode; the alpha component is always treated as linear
+encoded data.
+
+*Important:* The uncompressed input texture provided on the command line must
+be stored in the sRGB color space for `-cs` to function correctly.
+
+## Encoding HDR data
+
+HDR data can be encoded just like LDR data, but with some caveats around
+handling the alpha component.
+
+For many use cases the alpha component is an actual alpha opacity component and
+is therefore used for storing an LDR value between 0 and 1. For these cases use
+the `-ch` compressor option which will treat the RGB components as HDR, but the
+A component as LDR.
+
+For other use cases the alpha component is simply a fourth data component which
+is also storing an HDR value. For these cases use the `-cH` compressor option
+which will treat all components as HDR data.
+
+- - -
+
+_Copyright © 2019-2024, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,71 @@
+# The .astc File Format
+
+The default file format for compressed textures generated by `astcenc`, as well
+as from many other ASTC compressors, is the `.astc` format. This is a very
+simple format consisting of a small header followed immediately by the binary
+payload for a single image surface.
+
+Header
+======
+
+The header is a fixed 16 byte structure, defined as storing only bytes to avoid
+any endianness issues or padding overhead.
+
+```
+struct astc_header
+{
+    uint8_t magic[4];
+    uint8_t block_x;
+    uint8_t block_y;
+    uint8_t block_z;
+    uint8_t dim_x[3];
+    uint8_t dim_y[3];
+    uint8_t dim_z[3];
+};
+```
+
+Magic number
+------------
+
+The 4 byte magic number at the start of the file acts as a format identifier.
+
+```
+    magic[0] = 0x13;
+    magic[1] = 0xAB;
+    magic[2] = 0xA1;
+    magic[3] = 0x5C;
+```
+
+Block size
+----------
+
+The `block_*` fields store the ASTC block dimensions in texels. For 2D images
+the Z dimension must be set to 1.
+
+Image dimensions
+----------------
+
+The `dim_*` fields store the image dimensions in texels.  For 2D images the
+Z dimension must be set to 1.
+
+Note that the image is not required to be an exact multiple of the compressed
+block size; the compressed data may include padding that is discarded during
+decompression.
+
+Each dimension is a 24 bit unsigned value that is reconstructed from the stored
+byte values as:
+
+```
+decoded_dim = dim[0] + (dim[1] << 8) + (dim[2] << 16);
+```
+
+Binary payload
+==============
+
+The binary payload is a byte stream that immediately follows the header. It
+contains 16 bytes per compressed block. The number of compressed blocks is
+determined from the header information.
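+
+As a worked example of the layout above, the following Python sketch decodes a
+header and derives the expected payload size. The `parse_astc_header` name and
+the returned dictionary are hypothetical, and the block count assumes that a
+partial block at an image edge is stored as a whole block, as implied by the
+padding note above.
+
+```python
+import struct
+
+ASTC_MAGIC = bytes([0x13, 0xAB, 0xA1, 0x5C])
+
+
+def parse_astc_header(data):
+    """Decode the 16 byte .astc header and compute the payload size."""
+    if len(data) < 16:
+        raise ValueError("file is too small to contain a header")
+    magic, block_x, block_y, block_z = struct.unpack_from("<4sBBB", data, 0)
+    if magic != ASTC_MAGIC:
+        raise ValueError("missing .astc magic number")
+    if 0 in (block_x, block_y, block_z):
+        raise ValueError("block dimensions must be non-zero")
+
+    # Each image dimension is three bytes, least significant byte first.
+    dims = [int.from_bytes(data[o:o + 3], "little") for o in (7, 10, 13)]
+
+    # Round each dimension up to a whole number of blocks.
+    blocks = 1
+    for dim, block in zip(dims, (block_x, block_y, block_z)):
+        blocks *= (dim + block - 1) // block
+
+    return {
+        "block": (block_x, block_y, block_z),
+        "dim": tuple(dims),
+        "payload_bytes": blocks * 16,
+    }
+```
+
+For a 100x50 image using 6x6 blocks this gives 17 x 9 blocks, so the payload
+would be expected to be 2448 bytes long.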
+
+- - -
+
+_Copyright © 2020-2022, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,488 @@
+# ASTC Format Overview
+
+Adaptive Scalable Texture Compression (ASTC) is an advanced lossy texture
+compression technology developed by Arm and AMD. It has been adopted as an
+official Khronos extension to the OpenGL and OpenGL ES APIs, and as a standard
+optional feature for the Vulkan API.
+
+ASTC offers a number of advantages over earlier texture compression formats:
+
+* **Format flexibility:** ASTC supports compressing between 1 and 4 channels of
+  data, including support for one non-correlated channel such as RGB+A
+  (correlated RGB, non-correlated alpha).
+* **Bit rate flexibility:** ASTC supports compressing images with a fine
+  grained choice of bit rates between 0.89 and 8 bits per texel (bpt). The bit
+  rate choice is independent of the color format choice.
+* **Advanced format support:** ASTC supports compressing images in either low
+  dynamic range (LDR), LDR sRGB, or high dynamic range (HDR) color spaces, as
+  well as support for compressing 3D volumetric textures.
+* **Improved image quality:** Despite the high degree of format flexibility,
+  ASTC manages to beat nearly all legacy texture compression formats -- such as
+  ETC2, PVRTC, and the BC formats -- on image quality at equivalent bit
+  rates.
+
+This article explores the ASTC format, and how it manages to generate the
+flexibility and quality improvements that it achieves.
+
+
+Why ASTC?
+=========
+
+Before the creation of ASTC, the format and bit rate coverage of the available
+formats was very sparse:
+
+![Legacy texture compression formats and bit rates](./FormatOverviewImg/coverage-legacy.svg)
+
+In reality the situation is even worse than this diagram shows, as many of
+these formats are proprietary or simply not available on some operating
+systems, so any single platform will have very limited compression choices.
+
+For developers this situation makes developing content which is portable across
+multiple platforms a tricky proposition. It's almost certain that differently
+compressed assets will be needed for different platforms. Each asset pack would
+likely then need to use different levels of compression, and may even have to
+fall back to no compression for some assets on some platforms, which leaves
+either some image quality or some memory bandwidth efficiency untapped.
+
+It was clear a better way was needed, so the Khronos Group asked members to
+submit proposals for a new compression algorithm to be adopted in the same
+manner that the earlier ETC algorithm was adopted for OpenGL ES. ASTC was the
+result of this, and has been adopted as an official algorithm for OpenGL,
+OpenGL ES, and Vulkan.
+
+
+Format overview
+===============
+
+Given the fragmentation issues with the existing compression formats, it should
+be no surprise that the high level design objectives for ASTC were to have
+something which could be used across the whole range of art assets found in
+modern content, and which allows artists to have more control over the quality
+to bit rate tradeoff.
+
+There are quite a few technical components which make up the ASTC format, so
+before we dive into detail it will be useful to give an overview of how ASTC
+works at a higher level.
+
+
+Block compression
+-----------------
+
+Compression formats for real-time graphics need the ability to quickly and
+efficiently make random samples into a texture. This places two technical
+requirements on any compression format:
+
+* It must be possible to compute the address of data in memory given only a
+  sample coordinate.
+* It must be possible to decompress random samples without decompressing too
+  much surrounding data.
+
+The standard solution for this used by all contemporary real-time formats,
+including ASTC, is to divide the image into fixed-size blocks of texels, each
+of which is compressed into a fixed number of output bits. This feature makes
+it possible to access texels quickly, in any order, and with a well-bounded
+decompression cost.
+
+The 2D block footprints in ASTC range from 4x4 texels up to 12x12 texels, which
+all compress into 128-bit output blocks. By dividing 128 bits by the number of
+texels in the footprint, we derive the format bit rates which range from 8 bpt
+(`128/(4*4)`) down to 0.89 bpt (`128/(12*12)`).
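+
+The same division can be written as a small helper. This is an illustrative
+sketch; the `bits_per_texel` name is hypothetical, and the optional `block_z`
+parameter is an assumption added to cover 3D footprints in the same way.
+
+```python
+def bits_per_texel(block_x, block_y, block_z=1):
+    """Bit rate of an ASTC block footprint; every block is 128 bits."""
+    return 128 / (block_x * block_y * block_z)
+
+
+for footprint in [(4, 4), (6, 6), (8, 8), (12, 12)]:
+    print(footprint, round(bits_per_texel(*footprint), 2))
+```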
+
+
+Color encoding
+--------------
+
+ASTC uses gradients to assign the color values of each texel. Each compressed
+block stores the end-point colors for a gradient, and an interpolation weight
+for each texel which defines the texel's location along that gradient. During
+decompression the color value for each texel is generated by interpolating
+between the two end-point colors, based on the per-texel weight.
+
+![One partition gradient storage](./FormatOverviewImg/gradient-1p.svg)
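+
+A minimal sketch of the interpolation step is shown below. It uses
+floating-point values for clarity, which is a simplification: the real format
+stores quantized integer endpoints and weights. The `interpolate_color` name
+is hypothetical.
+
+```python
+def interpolate_color(endpoint0, endpoint1, weight):
+    """Blend two endpoint colors using a weight in the range 0.0 to 1.0."""
+    return tuple(c0 + (c1 - c0) * weight
+                 for c0, c1 in zip(endpoint0, endpoint1))
+
+
+black = (0.0, 0.0, 0.0, 1.0)
+white = (1.0, 1.0, 1.0, 1.0)
+# One weight per texel selects a position along the black-to-white gradient.
+texels = [interpolate_color(black, white, w) for w in (0.0, 0.25, 1.0)]
+```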
+
+In many cases a block will contain a complex distribution of colors, for
+example a red ball sitting on green grass. In these scenarios a single color
+gradient will not be able to accurately represent all of the texels' values. To
+support this ASTC allows a block to define up to four distinct color gradients,
+known as partitions, and can assign each texel to a single partition. For our
+example we require two partitions, one for our ball texels and one for our
+grass texels.
+
+![Two partition gradient storage](./FormatOverviewImg/gradient-2p.svg)
+
+Now that you know the high level operation of the format, we can dive into more
+detail.
+
+
+Integer encoding
+================
+
+Initially the idea of fractional bits per texel sounds implausible, or even
+impossible, because we're so used to storing numbers as a whole number of bits.
+However, it's not quite as strange as it sounds. ASTC uses an encoding
+technique called Bounded Integer Sequence Encoding (BISE), which makes heavy
+use of storing numbers with a fractional number of bits to pack information
+more efficiently.
+
+
+Storing alphabets
+-----------------
+
+Even though color and weight values per texel are notionally floating-point
+values, we have far too few bits available to directly store the actual values,
+so they must be quantized during compression to reduce the storage size. For
+example, if we have a floating-point weight for each texel in the range 0.0 to
+1.0 we could choose to quantize it to five values - 0.0, 0.25, 0.5, 0.75, and
+1.0 - which we can then represent in storage using the integer values 0 to 4.
+
+In the general case we need to be able to efficiently store characters of an
+alphabet containing N symbols if we choose to quantize to N levels. An N symbol
+alphabet contains `log2(N)` bits of information per character. If we have an
+alphabet of 5 possible symbols then each character contains ~2.32 bits of
+information, but simple binary storage would require us to round up to 3 bits.
+This wastes 22.6% of our storage capacity. The chart below shows the percentage
+of our bit-space wasted when using simple binary encoding to store an arbitrary
+N symbol alphabet:
+
+![Binary encoding efficiency](./FormatOverviewImg/binary.png)
+
+... which shows for most alphabet sizes we waste a lot of our storage capacity
+when using an integer number of bits per character. Efficiency is of critical
+importance to a compression format, so this is something we needed to be able
+to improve.
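+
+The waste for any alphabet size can be computed directly from the `log2(N)`
+information content. The sketch below is illustrative only, and the
+`binary_waste` name is hypothetical.
+
+```python
+import math
+
+
+def binary_waste(symbols):
+    """Fraction of storage wasted storing one character in whole bits."""
+    information = math.log2(symbols)
+    stored_bits = max(1, math.ceil(information))
+    return 1.0 - information / stored_bits
+
+
+for n in (3, 4, 5, 6):
+    print(n, f"{binary_waste(n):.1%}")
+```
+
+Alphabet sizes that are an exact power of two waste nothing, while sizes just
+above a power of two waste the most.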
+
+**Note:** We could have chosen to round-up the quantization level to the next
+power of two, and at least use the bits we're spending. However, this forces
+the encoder to spend bits which could be used elsewhere for a bigger benefit,
+so it will reduce image quality and is a sub-optimal solution.
+
+
+Quints
+------
+
+Instead of rounding up a 5 symbol alphabet - called a "quint" in BISE - to
+three bits, we could choose to instead pack three quint characters together.
+Three characters in a 5-symbol alphabet have 5<sup>3</sup> (125) combinations,
+and contain 6.97 bits of information. We can store this in 7 bits and have a
+storage waste of only 0.5%.
+
+
+Trits
+-----
+
+We can similarly construct a 3-symbol alphabet - called a "trit" in BISE - and
+pack trit characters in groups of five. Each character group has 3<sup>5</sup>
+(243) combinations, and contains 7.92 bits of information. We can store this in
+8 bits and have a storage waste of only 1%.
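+
+The idea of packing several characters into one integer can be sketched with
+plain base-N arithmetic. Note that this is only an illustration of why three
+quints fit in 7 bits and five trits fit in 8 bits; it is an assumption-laden
+simplification and not the actual bit layout used by BISE. The helper names
+are hypothetical.
+
+```python
+def pack_symbols(symbols, base):
+    """Pack base-N symbols into one integer, first symbol in the low digit."""
+    value = 0
+    for symbol in reversed(symbols):
+        if not 0 <= symbol < base:
+            raise ValueError("symbol out of range for this alphabet")
+        value = value * base + symbol
+    return value
+
+
+def unpack_symbols(value, base, count):
+    """Recover `count` base-N symbols from a packed integer."""
+    symbols = []
+    for _ in range(count):
+        value, symbol = divmod(value, base)
+        symbols.append(symbol)
+    return symbols
+
+
+quints = [4, 0, 3]
+packed = pack_symbols(quints, 5)   # Always below 5**3 = 125, so fits in 7 bits
+assert unpack_symbols(packed, 5, 3) == quints
+```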
+
+
+BISE
+----
+
+The BISE encoding used by ASTC allows storage of character sequences using
+arbitrary alphabets of up to 256 symbols, encoding each alphabet size in the
+most space-efficient choice of bits, trits, and quints.
+
+* Alphabets with values up to (2<sup>n</sup> - 1) can be encoded using n bits
+  per character.
+* Alphabets with values up to (3 * 2<sup>n</sup> - 1) can be encoded using n
+  bits (m) and a trit (t) per character, and reconstructed using the equation
+  (t * 2<sup>n</sup> + m).
+* Alphabets with values up to (5 * 2<sup>n</sup> - 1) can be encoded using n
+  bits (m) and a quint (q) per character, and reconstructed using the equation
+  (q * 2<sup>n</sup> + m).
+
+When the number of characters in a sequence is not a multiple of three or five
+we need to avoid wasting storage at the end of the sequence, so we add another
+constraint on the encoding. If the last few values in the sequence to encode
+are zero, the last few bits in the encoded bit string must also be zero.
+Ideally, the number of non-zero bits should be easily calculated and not depend
+on the magnitudes of the previous encoded values. This is a little tricky to
+arrange during compression, but it is possible. This means that we do not need
+to store any padding after the end of the bit sequence, as we can safely assume
+that they are zero bits.
+
+With this constraint in place - and by some smart packing of the bits, trits,
+and quints - BISE encodes a string of S characters in an N symbol alphabet
+using a fixed number of bits:
+
+* S values up to (2<sup>n</sup> - 1) use (nS) bits.
+* S values up to (3 * 2<sup>n</sup> - 1) use (nS + ceil(8S / 5)) bits.
+* S values up to (5 * 2<sup>n</sup> - 1) use (nS + ceil(7S / 3)) bits.
+
+... and the compressor will choose the one of these which produces the smallest
+storage for the alphabet size being stored; some will use binary, some will use
+bits and a trit, and some will use bits and a quint. If we compare the storage
+efficiency of BISE against simple binary for the range of possible alphabet
+sizes we might want to encode we can see that it is much more efficient.
+
+![BISE encoding efficiency](./FormatOverviewImg/bise.png)
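+
+The storage formulas above translate into a short calculation. The sketch
+below is illustrative; `bise_bits` is a hypothetical helper where `bits` is
+the number of plain bits stored per value and `extra` selects an optional trit
+or quint.
+
+```python
+def bise_bits(count, bits, extra=None):
+    """Bits needed for `count` values of n bits plus an optional trit/quint."""
+    total = bits * count
+    if extra == "trit":
+        total += (8 * count + 4) // 5    # Integer form of ceil(8S / 5)
+    elif extra == "quint":
+        total += (7 * count + 2) // 3    # Integer form of ceil(7S / 3)
+    elif extra is not None:
+        raise ValueError("extra must be None, 'trit', or 'quint'")
+    return total
+
+
+# Sixteen values in the range 0 to 5 (one bit plus a trit per value) ...
+six_levels = bise_bits(16, 1, "trit")
+# ... compared with rounding up to three plain bits per value.
+eight_levels = bise_bits(16, 3)
+```
+
+In this case the trit encoding needs 42 bits where plain binary needs 48,
+leaving 6 bits that the compressor can spend elsewhere in the block.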
+
+
+Block sizes
+===========
+
+ASTC always compresses blocks of texels into 128-bit outputs, but allows the
+developer to select from a range of block sizes to enable a fine-grained
+tradeoff between image quality and size.
+
+| Block footprint | Bits/texel |     | Block footprint | Bits/texel |
+| --------------- | ---------- | --- | --------------- | ---------- |
+|             4x4 |       8.00 |     |            10x5 |       2.56 |
+|             5x4 |       6.40 |     |            10x6 |       2.13 |
+|             5x5 |       5.12 |     |             8x8 |       2.00 |
+|             6x5 |       4.27 |     |            10x8 |       1.60 |
+|             6x6 |       3.56 |     |           10x10 |       1.28 |
+|             8x5 |       3.20 |     |           12x10 |       1.07 |
+|             8x6 |       2.67 |     |           12x12 |       0.89 |
+
+
+
+Color endpoints
+===============
+
+The color data for a block is encoded as a gradient between two color
+endpoints, with each texel selecting a position along that gradient which is
+then interpolated during decompression. ASTC supports 16 color endpoint
+encoding schemes, known as "endpoint modes". Options for endpoint modes
+include:
+
+* Varying the number of color channels: e.g. luminance, luminance + alpha, rgb,
+  and rgba.
+* Varying the encoding method: e.g. direct, base+offset, base+scale,
+  quantization level.
+* Varying the data range: e.g. low dynamic range, or high dynamic range
+
+The endpoint modes, and the endpoint color BISE quantization level, can be
+chosen on a per-block basis.
+
+
+Color partitions
+================
+
+Colors within a block are often complex, and cannot be accurately captured by a
+single color gradient, as discussed earlier with our example of a red ball
+lying on green grass. ASTC allows up to four color gradients - known as
+"partitions" - to be assigned to a single block. Each texel is then assigned to
+a single partition for the purposes of decompression.
+
+Rather than directly storing the partition assignment for each texel, which
+would need a lot of decompressor hardware to store it for all block sizes, we
+generate it procedurally. Each block only needs to store the partition index -
+which is the seed for the procedural generator - and the per texel assignment
+can then be generated on-the-fly during decompression. The image below shows
+the generated texel assignments for two (top), three (middle), and four
+(bottom) partitions for the 8x8 block size.
+
+![ASTC partition table](./FormatOverviewImg/hash.png)
+
+The number of partitions and the partition index can be chosen on a per-block
+basis, and a different color endpoint mode can be chosen per partition.
+
+**Note:** ASTC uses a 10-bit seed to drive the partition assignments. The hash
+used will introduce horizontal bias in a third of the partitions, vertical bias
+in a third, and no bias in the rest. As they are procedurally generated not all
+of the partitions are useful, in particular with the smaller block sizes.
+
+* Many partitions are duplicates.
+* Many partitions are degenerate (an N partition hash results in at least one
+  partition assignment that contains no texels).
+
+
+Texel weights
+=============
+
+Each texel requires a weight, which defines the relative contribution of each
+color endpoint when interpolating the color gradient.
+
+For smaller block sizes we can choose to store the weight directly, with one
+weight per texel, but for the larger block sizes we simply do not have enough
+bits of storage to do this. To work around this ASTC allows the weight grid to
+be stored at a lower resolution than the texel grid. The per-texel weights are
+interpolated from the stored weight grid during decompression using a bilinear
+interpolation.
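+
+The sketch below shows the general idea of expanding a low resolution weight
+grid to one weight per texel. It is a floating-point illustration written
+under the assumption that grid edges map to block edges; the format itself
+defines an exact fixed-point procedure, which this does not reproduce. The
+`upsample_weights` name is hypothetical.
+
+```python
+def upsample_weights(grid, block_w, block_h):
+    """Bilinearly expand a stored weight grid to one weight per texel."""
+    grid_h, grid_w = len(grid), len(grid[0])
+    texels = []
+    for ty in range(block_h):
+        # Map the texel row on to the weight grid; edges map to edges.
+        gy = ty * (grid_h - 1) / (block_h - 1) if block_h > 1 else 0.0
+        y0 = min(int(gy), grid_h - 1)
+        y1 = min(y0 + 1, grid_h - 1)
+        fy = gy - y0
+        row = []
+        for tx in range(block_w):
+            gx = tx * (grid_w - 1) / (block_w - 1) if block_w > 1 else 0.0
+            x0 = min(int(gx), grid_w - 1)
+            x1 = min(x0 + 1, grid_w - 1)
+            fx = gx - x0
+            # Blend horizontally on both rows, then vertically between them.
+            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
+            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
+            row.append(top * (1 - fy) + bottom * fy)
+        texels.append(row)
+    return texels
+
+
+# A 2x2 stored grid expanded to cover a 3x3 group of texels.
+weights = upsample_weights([[0.0, 1.0], [1.0, 0.0]], 3, 3)
+```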
+
+The number of texel weights, and the weight value BISE quantization level, can
+be chosen on a per-block basis.
+
+
+Dual-plane weights
+------------------
+
+Using a single weight for all color channels works well when there is good
+correlation across the channels, but this is not always the case. Common
+examples where we would expect to get low correlation at least some of the time
+are textures storing RGBA data - alpha masks are not usually closely
+correlated with the color value - or normal data - the X and Y normal values
+often change independently.
+
+ASTC allows a dual-plane mode, which uses two separate weight grids for each
+texel. A single channel can be assigned to a second plane of weights, while
+the other three use the first plane of weights.
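+
+As a rough sketch of what dual-plane decoding means for a single texel, the
+example below interpolates three channels with the first weight and one chosen
+channel with the second. It uses floating-point values for readability and the
+`decode_dual_plane` name is hypothetical; it is not the exact decode
+arithmetic.
+
+```python
+def decode_dual_plane(endpoint0, endpoint1, weight1, weight2, plane2_channel):
+    """Interpolate RGBA, using weight2 only for the second-plane channel."""
+    color = []
+    for channel, (c0, c1) in enumerate(zip(endpoint0, endpoint1)):
+        weight = weight2 if channel == plane2_channel else weight1
+        color.append(c0 + (c1 - c0) * weight)
+    return tuple(color)
+
+
+# Alpha (channel 3) follows its own weight, independent of the RGB weight.
+texel = decode_dual_plane((0, 0, 0, 0), (1, 1, 1, 1), 0.25, 0.75, 3)
+```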
+
+The use of dual-plane mode can be chosen on a per-block basis, but its use
+prevents the use of four color partitions as we do not have enough bits to
+concurrently store both an extra plane of weights and an extra set of color
+endpoints.
+
+
+End results
+===========
+
+So, if we pull all of this together what do we end up with?
+
+
+Adaptive
+--------
+
+The first word in the name of ASTC is "adaptive", and it should now hopefully
+be clear why. Each block always compresses into 128 bits of storage, but the
+developer can choose from a wide range of texel block sizes and the compressor
+gets a huge amount of latitude to determine how those 128 bits are used.
+
+The compressor can trade off the number of bits assigned to colors (number of
+partitions, endpoint mode, and stored quantization level) and weights (number
+of weights per block, use of dual-plane, and stored quantization level) on a
+per-block basis to get the best image quality possible.
+
+![ASTC compressed parrot at various bit rates](./FormatOverviewImg/astc-quality.png)
+
+
+Format support
+--------------
+
+The compression scheme used by ASTC effectively compresses arbitrary sequences
+of floating point numbers, with a flexible number of channels, across any of
+the supported block sizes. There is no real notion of "color format" in the
+format itself at all, beyond the color endpoint mode selection, although a
+sensible compressor will want to use some format-specific heuristics to drive
+an efficient state-space search.
+
+The orthogonal encoding design allows ASTC to provide almost complete coverage
+of our desirable format matrix from earlier, across a wide range of bit rates:
+
+![ASTC 2D formats and bit rates](./FormatOverviewImg/coverage-astc.svg)
+
+The only significant omission is the absence of a dedicated two channel
+encoding for HDR textures. We simply ran out of entries in the space we had for
+encoding color endpoint modes, and this one didn't make the cut.
+
+The flexibility allowed by ASTC meets the requirement that almost any asset can
+be compressed to some degree, at an appropriate bitrate for its quality needs.
+This is a powerful enabler for a compression format, because it puts control in
+the hands of content creators and not arbitrary format restrictions.
+
+
+Image quality
+-------------
+
+The normal expectation would be that this level of format flexibility would
+come at a cost of image quality; it has to cost something, right? Luckily this
+isn't true. The high packing efficiency allowed by BISE encoding, and the
+ability to dynamically choose where to spend encoding space on a per-block
+basis, means that an ASTC compressor is not forced to spend bits on things that
+don't help image quality.
+
+This gives some significant improvements in image quality compared to the older
+texture formats, even though ASTC also handles a much wider range of options.
+
+* ASTC at 2 bpt outperforms PVRTC at 2 bpt by ~2.0dB.
+* ASTC at 3.56 bpt outperforms PVRTC and BC1 at 4 bpt by ~1.5dB, and ETC2 by
+  ~0.7dB, despite a 10% bit rate disadvantage.
+* ASTC at 8 bpt for LDR formats is comparable in quality to BC7 at 8 bpt.
+* ASTC at 8 bpt for HDR formats is comparable in quality to BC6H at 8 bpt.
+
+Differences as small as 0.25dB are visible to the human eye, and remember that
+dB uses a logarithmic scale, so these are significant image quality
+improvements.
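+
+To put the logarithmic scale in perspective, the sketch below converts a dB
+difference into an error ratio. It assumes the figures above are PSNR values
+measured against the same peak signal value; the symbols `MAX`, `MSE`, and
+`Delta` are introduced here for illustration.
+
+```latex
+% PSNR definition for a peak value MAX and mean squared error MSE
+\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)
+
+% A gain of \Delta dB at the same peak value corresponds to an MSE ratio of
+\frac{\mathrm{MSE}_{\text{old}}}{\mathrm{MSE}_{\text{new}}} = 10^{\Delta / 10}
+
+% Worked values for the gains quoted above
+\Delta = 0.7 \Rightarrow \approx 1.17, \qquad
+\Delta = 1.5 \Rightarrow \approx 1.41, \qquad
+\Delta = 2.0 \Rightarrow \approx 1.58
+```
+
+Under that assumption, a 1.5dB gain means the older format has roughly 41%
+more mean squared error.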
+
+
+3D compression
+--------------
+
+One of the nice bonus features of ASTC is that the techniques which underpin
+the format generalize to compressing volumetric texture data without needing
+very much additional decompression hardware.
+
+ASTC is therefore also able to optionally support compression of 3D textures,
+which is a unique feature not found in any earlier format, at the following
+bit rates:
+
+| Block footprint | Bits/texel |     | Block footprint | Bits/texel |
+| --------------- | ---------- | --- | --------------- | ---------- |
+|           3x3x3 |       4.74 |     |           5x5x4 |       1.28 |
+|           4x3x3 |       3.56 |     |           5x5x5 |       1.02 |
+|           4x4x3 |       2.67 |     |           6x5x5 |       0.85 |
+|           4x4x4 |       2.00 |     |           6x6x5 |       0.71 |
+|           5x4x4 |       1.60 |     |           6x6x6 |       0.59 |
+
+
+Availability
+============
+
+The ASTC functionality is specified as a set of feature profiles, allowing
+GPU hardware manufacturers to select which parts of the standard they
+implement. There are four commonly seen profiles:
+
+* "LDR":
+    * 2D blocks.
+    * LDR and sRGB color space.
+    * [KHR_texture_compression_astc_ldr][astc_ldr]: KHR OpenGL ES extension.
+* "LDR + Sliced 3D":
+    * 2D blocks and sliced 3D blocks.
+    * LDR and sRGB color space.
+    * [KHR_texture_compression_astc_sliced_3d][astc_3d]: KHR OpenGL ES extension.
+* "HDR":
+    * 2D and sliced 3D blocks.
+    * LDR, sRGB, and HDR color spaces.
+    * [KHR_texture_compression_astc_hdr][astc_ldr]: KHR OpenGL ES extension.
+* "Full":
+    * 2D, sliced 3D, and volumetric 3D blocks.
+    * LDR, sRGB, and HDR color spaces.
+    * [OES_texture_compression_astc][astc_full]: OES OpenGL ES extension.
+
+The LDR profile is mandatory in OpenGL ES 3.2 and a standardized optional
+feature for Vulkan, and therefore widely supported on contemporary mobile
+devices. The 2D HDR profile is not mandatory, but is widely supported.
+
+3D texturing
+------------
+
+The APIs expose 3D textures in two flavors.
+
+The sliced 3D texture support builds a 3D texture from an array of 2D image
+slices that have each been individually compressed using 2D ASTC compression.
+This is required for the HDR profile, so is also widely supported.
+
+The volumetric 3D texture support uses the native 3D block sizes provided by
+ASTC to implement true volumetric compression. This enables a wider choice of
+low bitrate options than the 2D blocks, which is particularly important for 3D
+textures of any non-trivial size. Volumetric formats are not widely supported,
+but are supported on all of the Arm Mali GPUs that support ASTC.
+
+ASTC decode mode
+----------------
+
+ASTC is specified to decompress texels into fp16 intermediate values, except
+for sRGB which always decompresses into 8-bit UNORM intermediates. For many use
+cases this gives more dynamic range and precision than required. This can cause
+a reduction in both texture cache efficiency and texture filtering performance
+due to the larger decompressed data size.
+
+A pair of extensions exist, and are widely supported on recent mobile GPUs,
+which allow applications to reduce the intermediate precision to either UNORM8
+(recommended for LDR textures) or RGB9e5 (recommended for HDR textures).
+
+* [EXT_texture_compression_astc_decode_mode][astc_decode]: Allow UNORM8
+  intermediates
+* [EXT_texture_compression_astc_decode_mode_rgb9e5][astc_decode]: Allow RGB9e5
+  intermediates
+
+[astc_ldr]: https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_texture_compression_astc_hdr.txt
+[astc_3d]: https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_texture_compression_astc_sliced_3d.txt
+[astc_full]: https://www.khronos.org/registry/OpenGL/extensions/OES/OES_texture_compression_astc.txt
+[astc_decode]: https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_texture_compression_astc_decode_mode.txt
+
+- - -
+
+_Copyright © 2019-2022, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,79 @@
+# Terminology for the ASTC Encoder
+
+Like most software, the `astcenc` code base has a set of naming conventions
+for variables which are used to ensure both accuracy and reasonable brevity.
+
+:construction: These conventions are being used for new patches, so new code
+will conform to this, but older code is still being cleaned up to follow
+these conventions.
+
+## Counts
+
+For counts of things prefer `<x>_count` rather than `<x>s`. For example:
+
+* `plane_count`
+* `weight_count`
+* `texel_count`
+
+Where possible aim for descriptive loop variables, as these are more literate
+than simple `i` or `j` variables. For example:
+
+* `plane_index`
+* `weight_index`
+* `texel_index`
+
+## Ideal, Unpacked Quantized, vs Packed Quantized
+
+Variables that are quantized, such as endpoint colors and weights, have
+multiple states depending on how they are being used.
+
+**Ideal values** represent unquantized numeric values that can take any value.
+These are often used during compression to work out the best value before
+any quantization is applied. For example, integer weights in the 0-64 range can
+take any of the 65 values available.
+
+**Quant uvalues** represent the unpacked numeric value after any quantization
+rounding has been applied. These are often used during compression to work out
+the error for the quantized value compared to the ideal value. For example,
+`QUANT_3` weights in the 0-64 range can only take one of `[0, 32, 64]`.
+
+**Quant pvalues** represent the packed numeric value in the quantized alphabet.
+This is what ends up encoded in the ASTC data, although note that the encoded
+ordering is scrambled to simplify hardware. For example, `QUANT_3` weights
+originally in the 0-64 range can only take one of `[0, 1, 2]`.
+
+For example:
+
+* `weights_ideal_value`
+* `weights_quant_uvalue`
+* `weights_quant_pvalue`
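+
+As a worked illustration of the three states, consider a single `QUANT_3`
+weight. The symbols `w`, `p`, and `u` are introduced here for the ideal value,
+packed value, and unpacked value, and nearest-value rounding is an assumption
+made for the example rather than a statement of how the encoder selects
+values.
+
+```latex
+% QUANT_3 alphabet: packed values and their unpacked equivalents
+p \in \{0, 1, 2\}, \qquad u = 32\,p \in \{0, 32, 64\}
+
+% Example: ideal weight w = 20, rounded to the nearest representable value
+p = \operatorname{round}\!\left(\tfrac{20}{32}\right) = 1, \qquad
+u = 32, \qquad |w - u| = 12
+```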
+
+## Full vs Decimated interpolation weights
+
+Weight grids have multiple states depending on how they are being used.
+
+**full_weights** represent per texel weight grids, storing one weight per texel.
+
+**decimated_weights** represent reduced weight grids, which can store fewer
+weights and which are bilinearly interpolated to generate the full weight grid.
+
+Full weights have no variable prefix, but decimated weights are stored with
+a `dec_` prefix.
+
+* `dec_weights_ideal_value`
+* `dec_weights_quant_uvalue`
+* `dec_weights_quant_pvalue`
+
+## Weight vs Significance
+
+The original encoder used "weight" for multiple purposes - texel significance
+(weight the error), color channel significance (weight the error), and endpoint
+interpolation weights. This gets very confusing in functions using all three!
+
+We are slowly refactoring the code to only use "weight" to mean the endpoint
+interpolation weights. The error weighting factors used for other purposes are
+being updated to use the term "significance".
+
+- - -
+
+_Copyright © 2020-2022, Arm Limited and contributors. All rights reserved._
@@ -0,0 +1,120 @@
+# Testing astcenc
+
+The repository contains a small suite of tests which can be used to sanity
+check source code changes to the compressor. It must be noted that this test
+suite is relatively limited in scope and does not cover every feature or
+bitrate of the standard.
+
+# Required software
+
+Running the tests requires Python 3.7 to be installed on the host machine, and
+an `astcenc-avx2` release build to have been previously compiled and installed
+into a directory called `astcenc` in the root of the git checkout. This
+can be achieved by configuring the CMake build using the install prefix
+`-DCMAKE_INSTALL_PREFIX=../` and then running a build with the `install` build
+target.
+
+# Running C++ unit tests
+
+We support a small (but growing) number of C++ unit tests, which are written
+using the `googletest` framework and integrated in the CMake "CTest" test
+framework.
+
+To build unit tests pull the `googletest` git submodule and add
+`-DASTCENC_UNITTEST=ON` to the CMake command line when configuring.
+
+To run unit tests use the CMake `ctest` utility from your build directory after
+you have built the tests.
+
+```shell
+cd build
+ctest --verbose
+```
+
+# Running command line tests
+
+To run the command line tests, which aim to get coverage of the command line
+options and core codec stability without testing the compression quality
+itself, run the command line:
+
+    python3 -m unittest discover -s Test -p astc_test*.py -v
+
+# Running image tests
+
+To run the image test suite run the following command from the root directory
+of the repository:
+
+    python3 ./Test/astc_test_image.py
+
+This will run through a series of image compression tests, comparing the image
+PSNR against a set of reference results from the last stable baseline. The test
+will fail if any reduction in PSNR above a set threshold is detected. Note that
+performance information is reported, but regressions will not flag a failure.
+
+For debug purposes, all decompressed test output images and result CSV files
+are stored in the `TestOutput` directory, using the same test set structure as
+the `Test/Images` folder.
+
+## Test selection
+
+The runner supports a number of options to filter down what is run, enabling
+developers to focus local testing on the parts of the code they are working on.
+
+* `--encoder` selects which encoder to run. By default the `avx2` encoder is
+  selected. Note that some out-of-tree reference encoders (older encoders, and
+  some third-party encoders) are supported for comparison purposes. These will
+  not work without the binaries being manually provided; they are not
+  distributed here.
+* `--test-set` selects which image set to run. By default the `Small` image
+  test set is selected, which aims to provide basic coverage of many different
+  color formats and color profiles.
+* `--block-size` selects which block size to run. By default a range of
+  block sizes (2D and 3D) are used.
+* `--color-profile` selects which color profiles from the standard should be
+  used (LDR, LDR sRGB, or HDR) to select images. By default all are selected.
+* `--color-format` selects which color formats should be used (L, XY, RGB,
+  RGBA) to select images. By default all are selected.
+
+## Performance tests
+
+To provide less noisy performance results the test suite supports compressing
+each image multiple times and returning the best measured performance. To
+enable this mode use the following options:
+
+* `--repeats <M>` : Run M test compression passes which are timed.
+
+**Note:** The reference CSV contains performance results measured on an Intel
+Core i5 9600K running at 4.3GHz, running each test 5 times.
+
+## Updating reference data
+
+The reference PSNR and performance scores are stored in CSVs committed to the
+repository. This data is created by running the tests using the last stable
+release on a standard test machine we use for performance testing builds.
+
+It can be useful for developers to rebuild the reference results for their
+local machine, in particular for measuring performance improvements. To build
+new reference CSVs, download the current reference `astcenc` binary (1.7) from
+GitHub for your host OS and place it into the `./Binaries/1.7/` directory.
+Once this is done, run the command:
+
+    python3 ./Test/astc_test_image.py --encoder 1.7 --test-set all --repeats 5
+
+... to regenerate the reference CSV files.
+
+**WARNING:** This can take some hours to complete, and it is best done when the
+test suite gets exclusive use of the machine to avoid other processing slowing
+down the compression and disturbing the performance data. It is recommended to
+shut down or disable any background applications that are running.
+
+## Valgrind memcheck
+
+It is always worth running the Valgrind memcheck tool to validate that we have
+not introduced any obvious memory errors. Build a release build with symbol
+information with `-DCMAKE_BUILD_TYPE=RelWithDebInfo` and then run:
+
+    valgrind --tool=memcheck --track-origins=yes <command>
+
+- - -
+
+_Copyright © 2019-2022, Arm Limited and contributors. All rights reserved._