master
   1
   2
   3
   4
   5
   6
   7Internet Engineering Task Force (IETF)                         Y. Collet
   8Request for Comments: 8478                             M. Kucherawy, Ed.
   9Category: Informational                                         Facebook
  10ISSN: 2070-1721                                             October 2018
  11
  12
  13       Zstandard Compression and the application/zstd Media Type
  14
  15Abstract
  16
  17   Zstandard, or "zstd" (pronounced "zee standard"), is a data
  18   compression mechanism.  This document describes the mechanism and
  19   registers a media type and content encoding to be used when
  20   transporting zstd-compressed content via Multipurpose Internet Mail
  21   Extensions (MIME).
  22
  23   Despite use of the word "standard" as part of its name, readers are
  24   advised that this document is not an Internet Standards Track
  25   specification; it is being published for informational purposes only.
  26
  27Status of This Memo
  28
  29   This document is not an Internet Standards Track specification; it is
  30   published for informational purposes.
  31
  32   This document is a product of the Internet Engineering Task Force
  33   (IETF).  It represents the consensus of the IETF community.  It has
  34   received public review and has been approved for publication by the
  35   Internet Engineering Steering Group (IESG).  Not all documents
  36   approved by the IESG are candidates for any level of Internet
  37   Standard; see Section 2 of RFC 7841.
  38
  39   Information about the current status of this document, any errata,
  40   and how to provide feedback on it may be obtained at
  41   https://www.rfc-editor.org/info/rfc8478.
  42
  43
  44
  45
  46
  47
  48
  49
  50
  51
  52
  53
  54
  55
  56
  57
  58Collet & Kucherawy            Informational                     [Page 1]
  59
  60RFC 8478                    application/zstd                October 2018
  61
  62
  63Copyright Notice
  64
  65   Copyright (c) 2018 IETF Trust and the persons identified as the
  66   document authors.  All rights reserved.
  67
  68   This document is subject to BCP 78 and the IETF Trust's Legal
  69   Provisions Relating to IETF Documents
  70   (https://trustee.ietf.org/license-info) in effect on the date of
  71   publication of this document.  Please review these documents
  72   carefully, as they describe your rights and restrictions with respect
  73   to this document.  Code Components extracted from this document must
  74   include Simplified BSD License text as described in Section 4.e of
  75   the Trust Legal Provisions and are provided without warranty as
  76   described in the Simplified BSD License.
  77
  78
  79
  80
  81
  82
  83
  84
  85
  86
  87
  88
  89
  90
  91
  92
  93
  94
  95
  96
  97
  98
  99
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114Collet & Kucherawy            Informational                     [Page 2]
 115
 116RFC 8478                    application/zstd                October 2018
 117
 118
 119Table of Contents
 120
 121   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   4
 122   2.  Definitions . . . . . . . . . . . . . . . . . . . . . . . . .   4
 123   3.  Compression Algorithm . . . . . . . . . . . . . . . . . . . .   5
 124     3.1.  Frames  . . . . . . . . . . . . . . . . . . . . . . . . .   6
 125       3.1.1.  Zstandard Frames  . . . . . . . . . . . . . . . . . .   6
 126         3.1.1.1.  Frame Header  . . . . . . . . . . . . . . . . . .   7
 127         3.1.1.2.  Blocks  . . . . . . . . . . . . . . . . . . . . .  12
 128         3.1.1.3.  Compressed Blocks . . . . . . . . . . . . . . . .  14
 129         3.1.1.4.  Sequence Execution  . . . . . . . . . . . . . . .  28
 130         3.1.1.5.  Repeat Offsets  . . . . . . . . . . . . . . . . .  29
 131       3.1.2.  Skippable Frames  . . . . . . . . . . . . . . . . . .  30
 132   4.  Entropy Encoding  . . . . . . . . . . . . . . . . . . . . . .  30
 133     4.1.  FSE . . . . . . . . . . . . . . . . . . . . . . . . . . .  31
 134       4.1.1.  FSE Table Description . . . . . . . . . . . . . . . .  31
 135     4.2.  Huffman Coding  . . . . . . . . . . . . . . . . . . . . .  34
 136       4.2.1.  Huffman Tree Description  . . . . . . . . . . . . . .  35
 137         4.2.1.1.  Huffman Tree Header . . . . . . . . . . . . . . .  36
 138         4.2.1.2.  FSE Compression of Huffman Weights  . . . . . . .  37
 139         4.2.1.3.  Conversion from Weights to Huffman Prefix Codes .  38
 140       4.2.2.  Huffman-Coded Streams . . . . . . . . . . . . . . . .  39
 141   5.  Dictionary Format . . . . . . . . . . . . . . . . . . . . . .  40
 142   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  42
 143     6.1.  The 'application/zstd' Media Type . . . . . . . . . . . .  42
 144     6.2.  Content Encoding  . . . . . . . . . . . . . . . . . . . .  43
 145     6.3.  Dictionaries  . . . . . . . . . . . . . . . . . . . . . .  43
 146   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  43
 147   8.  Implementation Status . . . . . . . . . . . . . . . . . . . .  44
 148   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  45
 149     9.1.  Normative References  . . . . . . . . . . . . . . . . . .  45
 150     9.2.  Informative References  . . . . . . . . . . . . . . . . .  45
 151   Appendix A.  Decoding Tables for Predefined Codes . . . . . . . .  46
 152     A.1.  Literal Length Code Table . . . . . . . . . . . . . . . .  46
 153     A.2.  Match Length Code Table . . . . . . . . . . . . . . . . .  49
 154     A.3.  Offset Code Table . . . . . . . . . . . . . . . . . . . .  52
 155   Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . .  53
 156   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  54
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170Collet & Kucherawy            Informational                     [Page 3]
 171
 172RFC 8478                    application/zstd                October 2018
 173
 174
 1751.  Introduction
 176
 177   Zstandard, or "zstd" (pronounced "zee standard"), is a data
 178   compression mechanism, akin to gzip [RFC1952].
 179
 180   Despite use of the word "standard" as part of its name, readers are
 181   advised that this document is not an Internet Standards Track
 182   specification; it is being published for informational purposes only.
 183
 184   This document describes the Zstandard format.  Also, to enable the
 185   transport of a data object compressed with Zstandard, this document
 186   registers a media type that can be used to identify such content when
 187   it is used in a payload encoded using Multipurpose Internet Mail
 188   Extensions (MIME).
 189
 1902.  Definitions
 191
 192   Some terms used elsewhere in this document are defined here for
 193   clarity.
 194
 195   uncompressed:  Describes an arbitrary set of bytes in their original
 196      form, prior to being subjected to compression.
 197
 198   compress, compression:  The act of processing a set of bytes via the
 199      compression mechanism described here.
 200
 201   compressed:  Describes the result of passing a set of bytes through
 202      this mechanism.  The original input has thus been compressed.
 203
 204   decompress, decompression:  The act of processing a set of bytes
 205      through the inverse of the compression mechanism described here,
 206      in an attempt to recover the original set of bytes prior to
 207      compression.
 208
 209   decompressed:  Describes the result of passing a set of bytes through
 210      the reverse of this mechanism.  When this is successful, the
 211      decompressed payload and the uncompressed payload are
 212      indistinguishable.
 213
 214   encode:  The process of translating data from one form to another;
 215      this may include compression or it may refer to other translations
 216      done as part of this specification.
 217
 218   decode:  The reverse of "encode"; describes a process of reversing a
 219      prior encoding to recover the original content.
 220
 221
 222
 223
 224
 225
 226Collet & Kucherawy            Informational                     [Page 4]
 227
 228RFC 8478                    application/zstd                October 2018
 229
 230
 231   frame:  Content compressed by Zstandard is transformed into a
 232      Zstandard frame.  Multiple frames can be appended into a single
 233      file or stream.  A frame is completely independent, has a defined
 234      beginning and end, and has a set of parameters that tells the
 235      decoder how to decompress it.
 236
 237   block:  A frame encapsulates one or multiple blocks.  Each block
 238      contains arbitrary content, which is described by its header, and
 239      has a guaranteed maximum content size that depends upon frame
 240      parameters.  Unlike frames, each block depends on previous blocks
 241      for proper decoding.  However, each block can be decompressed
 242      without waiting for its successor, allowing streaming operations.
 243
 244   natural order:  A sequence or ordering of objects or values that is
 245      typical of that type of object or value.  A set of unique
 246      integers, for example, is in "natural order" if when progressing
 247      from one element in the set or sequence to the next, there is
 248      never a decrease in value.
 249
 250   The naming convention for identifiers within the specification is
 251   Mixed_Case_With_Underscores.  Identifiers inside square brackets
 252   indicate that the identifier is optional in the presented context.
 253
 2543.  Compression Algorithm
 255
 256   This section describes the Zstandard algorithm.
 257
 258   The purpose of this document is to define a lossless compressed data
 259   format that is a) independent of the CPU type, operating system, file
 260   system, and character set and b) is suitable for file compression and
 261   pipe and streaming compression, using the Zstandard algorithm.  The
 262   text of the specification assumes a basic background in programming
 263   at the level of bits and other primitive data representations.
 264
 265   The data can be produced or consumed, even for an arbitrarily long
 266   sequentially presented input data stream, using only an a priori
 267   bounded amount of intermediate storage, and hence can be used in data
 268   communications.  The format uses the Zstandard compression method,
 269   and an optional xxHash-64 checksum method [XXHASH], for detection of
 270   data corruption.
 271
 272   The data format defined by this specification does not attempt to
 273   allow random access to compressed data.
 274
 275   Unless otherwise indicated below, a compliant compressor must produce
 276   data sets that conform to the specifications presented here.
 277   However, it does not need to support all options.
 278
 279
 280
 281
 282Collet & Kucherawy            Informational                     [Page 5]
 283
 284RFC 8478                    application/zstd                October 2018
 285
 286
 287   A compliant decompressor must be able to decompress at least one
 288   working set of parameters that conforms to the specifications
 289   presented here.  It may also ignore informative fields, such as the
 290   checksum.  Whenever it does not support a parameter defined in the
 291   compressed stream, it must produce a non-ambiguous error code and
 292   associated error message explaining which parameter is unsupported.
 293
 294   This specification is intended for use by implementers of software to
 295   compress data into Zstandard format and/or decompress data from
 296   Zstandard format.  The Zstandard format is supported by an open
 297   source reference implementation, written in portable C, and available
 298   at [ZSTD].
 299
 3003.1.  Frames
 301
 302   Zstandard compressed data is made up of one or more frames.  Each
 303   frame is independent and can be decompressed independently of other
 304   frames.  The decompressed content of multiple concatenated frames is
 305   the concatenation of each frame's decompressed content.
 306
 307   There are two frame formats defined for Zstandard: Zstandard frames
 308   and skippable frames.  Zstandard frames contain compressed data,
 309   while skippable frames contain custom user metadata.
 310
 3113.1.1.  Zstandard Frames
 312
 313   The structure of a single Zstandard frame is as follows:
 314
 315     +--------------------+------------+
 316     |    Magic_Number    | 4 bytes    |
 317     +--------------------+------------+
 318     |    Frame_Header    | 2-14 bytes |
 319     +--------------------+------------+
 320     |     Data_Block     | n bytes    |
 321     +--------------------+------------+
 322     | [More Data_Blocks] |            |
 323     +--------------------+------------+
 324     | [Content_Checksum] | 0-4 bytes  |
 325     +--------------------+------------+
 326
 327   Magic_Number:  4 bytes, little-endian format.  Value: 0xFD2FB528.
 328
 329   Frame_Header:  2 to 14 bytes, detailed in Section 3.1.1.1.
 330
 331   Data_Block:  Detailed in Section 3.1.1.2.  This is where data
 332      appears.
 333
 334
 335
 336
 337
 338Collet & Kucherawy            Informational                     [Page 6]
 339
 340RFC 8478                    application/zstd                October 2018
 341
 342
 343   Content_Checksum:  An optional 32-bit checksum, only present if
 344      Content_Checksum_Flag is set.  The content checksum is the result
 345      of the XXH64() hash function [XXHASH] digesting the original
 346      (decoded) data as input, and a seed of zero.  The low 4 bytes of
 347      the checksum are stored in little-endian format.
 348
 349   The magic number was selected to be less probable to find at the
 350   beginning of an arbitrary file.  It avoids trivial patterns (0x00,
 351   0xFF, repeated bytes, increasing bytes, etc.), contains byte values
 352   outside of ASCII range, and doesn't map into UTF-8 space, all of
 353   which reduce the likelihood of its appearance at the top of a text
 354   file.
 355
 3563.1.1.1.  Frame Header
 357
 358   The frame header has a variable size, with a minimum of 2 bytes and
 359   up to 14 bytes depending on optional parameters.  The structure of
 360   Frame_Header is as follows:
 361
 362     +-------------------------+-----------+
 363     | Frame_Header_Descriptor | 1 byte    |
 364     +-------------------------+-----------+
 365     |   [Window_Descriptor]   | 0-1 byte  |
 366     +-------------------------+-----------+
 367     |     [Dictionary_ID]     | 0-4 bytes |
 368     +-------------------------+-----------+
 369     |  [Frame_Content_Size]   | 0-8 bytes |
 370     +-------------------------+-----------+
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394Collet & Kucherawy            Informational                     [Page 7]
 395
 396RFC 8478                    application/zstd                October 2018
 397
 398
 3993.1.1.1.1.  Frame_Header_Descriptor
 400
 401   The first header's byte is called the Frame_Header_Descriptor.  It
 402   describes which other fields are present.  Decoding this byte is
 403   enough to tell the size of Frame_Header.
 404
 405     +------------+-------------------------+
 406     | Bit Number | Field Name              |
 407     +------------+-------------------------+
 408     |    7-6     | Frame_Content_Size_Flag |
 409     +------------+-------------------------+
 410     |     5      | Single_Segment_Flag     |
 411     +------------+-------------------------+
 412     |     4      | (unused)                |
 413     +------------+-------------------------+
 414     |     3      | (reserved)              |
 415     +------------+-------------------------+
 416     |     2      | Content_Checksum_Flag   |
 417     +------------+-------------------------+
 418     |    1-0     | Dictionary_ID_Flag      |
 419     +------------+-------------------------+
 420
 421   In this table, bit 7 is the highest bit, while bit 0 is the lowest
 422   one.
 423
 4243.1.1.1.1.1.  Frame_Content_Size_Flag
 425
 426   This is a 2-bit flag (equivalent to Frame_Header_Descriptor right-
 427   shifted 6 bits) specifying whether Frame_Content_Size (the
 428   decompressed data size) is provided within the header.  Flag_Value
 429   provides FCS_Field_Size, which is the number of bytes used by
 430   Frame_Content_Size according to the following table:
 431
 432     +----------------+--------+---+---+---+
 433     | Flag_Value     |   0    | 1 | 2 | 3 |
 434     +----------------+--------+---+---+---+
 435     | FCS_Field_Size | 0 or 1 | 2 | 4 | 8 |
 436     +----------------+--------+---+---+---+
 437
 438   When Flag_Value is 0, FCS_Field_Size depends on Single_Segment_Flag:
 439   If Single_Segment_Flag is set, FCS_Field_Size is 1.  Otherwise,
 440   FCS_Field_Size is 0; Frame_Content_Size is not provided.
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450Collet & Kucherawy            Informational                     [Page 8]
 451
 452RFC 8478                    application/zstd                October 2018
 453
 454
 4553.1.1.1.1.2.  Single_Segment_Flag
 456
 457   If this flag is set, data must be regenerated within a single
 458   continuous memory segment.
 459
 460   In this case, Window_Descriptor byte is skipped, but
 461   Frame_Content_Size is necessarily present.  As a consequence, the
 462   decoder must allocate a memory segment of size equal or larger than
 463   Frame_Content_Size.
 464
 465   In order to protect the decoder from unreasonable memory
 466   requirements, a decoder is allowed to reject a compressed frame that
 467   requests a memory size beyond the decoder's authorized range.
 468
 469   For broader compatibility, decoders are recommended to support memory
 470   sizes of at least 8 MB.  This is only a recommendation; each decoder
 471   is free to support higher or lower limits, depending on local
 472   limitations.
 473
 4743.1.1.1.1.3.  Unused Bit
 475
 476   A decoder compliant with this specification version shall not
 477   interpret this bit.  It might be used in a future version, to signal
 478   a property that is not mandatory to properly decode the frame.  An
 479   encoder compliant with this specification must set this bit to zero.
 480
 4813.1.1.1.1.4.  Reserved Bit
 482
 483   This bit is reserved for some future feature.  Its value must be
 484   zero.  A decoder compliant with this specification version must
 485   ensure it is not set.  This bit may be used in a future revision, to
 486   signal a feature that must be interpreted to decode the frame
 487   correctly.
 488
 4893.1.1.1.1.5.  Content_Checksum_Flag
 490
 491   If this flag is set, a 32-bit Content_Checksum will be present at the
 492   frame's end.  See the description of Content_Checksum above.
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506Collet & Kucherawy            Informational                     [Page 9]
 507
 508RFC 8478                    application/zstd                October 2018
 509
 510
 5113.1.1.1.1.6.  Dictionary_ID_Flag
 512
 513   This is a 2-bit flag (= Frame_Header_Descriptor & 0x3) indicating
 514   whether a dictionary ID is provided within the header.  It also
 515   specifies the size of this field as DID_Field_Size:
 516
 517     +----------------+---+---+---+---+
 518     | Flag_Value     | 0 | 1 | 2 | 3 |
 519     +----------------+---+---+---+---+
 520     | DID_Field_Size | 0 | 1 | 2 | 4 |
 521     +----------------+---+---+---+---+
 522
 5233.1.1.1.2.  Window Descriptor
 524
 525   This provides guarantees about the minimum memory buffer required to
 526   decompress a frame.  This information is important for decoders to
 527   allocate enough memory.
 528
 529   The Window_Descriptor byte is optional.  When Single_Segment_Flag is
 530   set, Window_Descriptor is not present.  In this case, Window_Size is
 531   Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16
 532   ExaBytes).
 533
 534     +------------+----------+----------+
 535     | Bit Number |   7-3    |   2-0    |
 536     +------------+----------+----------+
 537     | Field Name | Exponent | Mantissa |
 538     +------------+----------+----------+
 539
 540   The minimum memory buffer size is called Window_Size.  It is
 541   described by the following formulae:
 542
 543     windowLog = 10 + Exponent;
 544     windowBase = 1 << windowLog;
 545     windowAdd = (windowBase / 8) * Mantissa;
 546     Window_Size = windowBase + windowAdd;
 547
 548   The minimum Window_Size is 1 KB.  The maximum Window_Size is (1<<41)
 549   + 7*(1<<38) bytes, which is 3.75 TB.
 550
 551   In general, larger Window_Size values tend to improve the compression
 552   ratio, but at the cost of increased memory usage.
 553
 554   To properly decode compressed data, a decoder will need to allocate a
 555   buffer of at least Window_Size bytes.
 556
 557
 558
 559
 560
 561
 562Collet & Kucherawy            Informational                    [Page 10]
 563
 564RFC 8478                    application/zstd                October 2018
 565
 566
 567   In order to protect decoders from unreasonable memory requirements, a
 568   decoder is allowed to reject a compressed frame that requests a
 569   memory size beyond decoder's authorized range.
 570
 571   For improved interoperability, it's recommended for decoders to
 572   support values of Window_Size up to 8 MB and for encoders not to
 573   generate frames requiring a Window_Size larger than 8 MB.  It's
 574   merely a recommendation though, and decoders are free to support
 575   larger or lower limits, depending on local limitations.
 576
 5773.1.1.1.3.  Dictionary_ID
 578
 579   This is a variable size field, which contains the ID of the
 580   dictionary required to properly decode the frame.  This field is
 581   optional.  When it's not present, it's up to the decoder to know
 582   which dictionary to use.
 583
 584   Dictionary_ID field size is provided by DID_Field_Size.
 585   DID_Field_Size is directly derived from the value of
 586   Dictionary_ID_Flag.  One byte can represent an ID 0-255; 2 bytes can
 587   represent an ID 0-65535; 4 bytes can represent an ID 0-4294967295.
 588   Format is little-endian.
 589
 590   It is permitted to represent a small ID (for example, 13) with a
 591   large 4-byte dictionary ID, even if it is less efficient.
 592
 593   Within private environments, any dictionary ID can be used.  However,
 594   for frames and dictionaries distributed in public space,
 595   Dictionary_ID must be attributed carefully.  The following ranges are
 596   reserved for use only with dictionaries that have been registered
 597   with IANA (see Section 6.3):
 598
 599   low range:  <= 32767
 600   high range:  >= (1 << 31)
 601
 602   Any other value for Dictionary_ID can be used by private arrangement
 603   between participants.
 604
 605   Any payload presented for decompression that references an
 606   unregistered reserved dictionary ID results in an error.
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618Collet & Kucherawy            Informational                    [Page 11]
 619
 620RFC 8478                    application/zstd                October 2018
 621
 622
 6233.1.1.1.4.  Frame Content Size
 624
 625   This is the original (uncompressed) size.  This information is
 626   optional.  Frame_Content_Size uses a variable number of bytes,
 627   provided by FCS_Field_Size.  FCS_Field_Size is provided by the value
 628   of Frame_Content_Size_Flag.  FCS_Field_Size can be equal to 0 (not
 629   present), 1, 2, 4, or 8 bytes.
 630
 631     +----------------+--------------+
 632     | FCS Field Size | Range        |
 633     +----------------+--------------+
 634     |        0       | unknown      |
 635     +----------------+--------------+
 636     |        1       | 0 - 255      |
 637     +----------------+--------------+
 638     |        2       | 256 - 65791  |
 639     +----------------+--------------+
 640     |        4       | 0 - 2^32 - 1 |
 641     +----------------+--------------+
 642     |        8       | 0 - 2^64 - 1 |
 643     +----------------+--------------+
 644
 645   Frame_Content_Size format is little-endian.  When FCS_Field_Size is
 646   1, 4, or 8 bytes, the value is read directly.  When FCS_Field_Size is
 647   2, the offset of 256 is added.  It's allowed to represent a small
 648   size (for example 18) using any compatible variant.
 649
 6503.1.1.2.  Blocks
 651
 652   After Magic_Number and Frame_Header, there are some number of blocks.
 653   Each frame must have at least 1 block, but there is no upper limit on
 654   the number of blocks per frame.
 655
 656   The structure of a block is as follows:
 657
 658     +--------------+---------------+
 659     | Block_Header | Block_Content |
 660     +--------------+---------------+
 661     |    3 bytes   |    n bytes    |
 662     +--------------+---------------+
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674Collet & Kucherawy            Informational                    [Page 12]
 675
 676RFC 8478                    application/zstd                October 2018
 677
 678
 679   Block_Header uses 3 bytes, written using little-endian convention.
 680   It contains three fields:
 681
 682     +------------+------------+------------+
 683     | Last_Block | Block_Type | Block_Size |
 684     +------------+------------+------------+
 685     |    bit 0   |   bits 1-2 |  bits 3-23 |
 686     +------------+------------+------------+
 687
 6883.1.1.2.1.  Last_Block
 689
 690   The lowest bit (Last_Block) signals whether this block is the last
 691   one.  The frame will end after this last block.  It may be followed
 692   by an optional Content_Checksum (see Section 3.1.1).
 693
 6943.1.1.2.2.  Block_Type
 695
 696   The next 2 bits represent the Block_Type.  There are four block
 697   types:
 698
 699     +-----------+------------------+
 700     |   Value   |    Block_Type    |
 701     +-----------+------------------+
 702     |     0     |     Raw_Block    |
 703     +-----------+------------------+
 704     |     1     |     RLE_Block    |
 705     +-----------+------------------+
 706     |     2     | Compressed_Block |
 707     +-----------+------------------+
 708     |     3     |     Reserved     |
 709     +-----------+------------------+
 710
 711   Raw_Block:  This is an uncompressed block.  Block_Content contains
 712      Block_Size bytes.
 713
 714   RLE_Block:  This is a single byte, repeated Block_Size times.
 715      Block_Content consists of a single byte.  On the decompression
 716      side, this byte must be repeated Block_Size times.
 717
 718   Compressed_Block:  This is a compressed block as described in
 719      Section 3.1.1.3.  Block_Size is the length of Block_Content,
 720      namely the compressed data.  The decompressed size is not known,
 721      but its maximum possible value is guaranteed (see below).
 722
 723   Reserved:  This is not a block.  This value cannot be used with the
 724      current specification.  If such a value is present, it is
 725      considered to be corrupt data.
 726
 727
 728
 729
 730Collet & Kucherawy            Informational                    [Page 13]
 731
 732RFC 8478                    application/zstd                October 2018
 733
 734
 7353.1.1.2.3.  Block_Size
 736
 737   The upper 21 bits of Block_Header represent the Block_Size.
 738   Block_Size is the size of the block excluding the header.  A block
 739   can contain any number of bytes (even zero), up to
 740   Block_Maximum_Decompressed_Size, which is the smallest of:
 741
 742   o  Window_Size
 743
 744   o  128 KB
 745
 746   A Compressed_Block has the extra restriction that Block_Size is
 747   always strictly less than the decompressed size.  If this condition
 748   cannot be respected, the block must be sent uncompressed instead
 749   (i.e., treated as a Raw_Block).
 750
 7513.1.1.3.  Compressed Blocks
 752
 753   To decompress a compressed block, the compressed size must be
 754   provided from the Block_Size field within Block_Header.
 755
 756   A compressed block consists of two sections: a Literals
 757   Section (Section 3.1.1.3.1) and a
 758   Sequences_Section (Section 3.1.1.3.2).  The results of the two
 759   sections are then combined to produce the decompressed data in
 760   Sequence Execution (Section 3.1.1.4).
 761
 762   To decode a compressed block, the following elements are necessary:
 763
 764   o  Previous decoded data, up to a distance of Window_Size, or the
 765      beginning of the Frame, whichever is smaller.  Single_Segment_Flag
 766      will be set in the latter case.
 767
 768   o  List of "recent offsets" from the previous Compressed_Block.
 769
 770   o  The previous Huffman tree, required by Treeless_Literals_Block
 771      type.
 772
 773   o  Previous Finite State Entropy (FSE) decoding tables, required by
 774      Repeat_Mode, for each symbol type (literals lengths, match
 775      lengths, offsets).
 776
 777   Note that decoding tables are not always from the previous
 778   Compressed_Block:
 779
 780   o  Every decoding table can come from a dictionary.
 781
 782
 783
 784
 785
 786Collet & Kucherawy            Informational                    [Page 14]
 787
 788RFC 8478                    application/zstd                October 2018
 789
 790
 791   o  The Huffman tree comes from the previous
 792      Compressed_Literals_Block.
 793
 7943.1.1.3.1.  Literals_Section_Header
 795
 796   All literals are regrouped in the first part of the block.  They can
 797   be decoded first and then copied during Sequence Execution (see
 798   Section 3.1.1.4), or they can be decoded on the flow during Sequence
 799   Execution.
 800
 801   Literals can be stored uncompressed or compressed using Huffman
 802   prefix codes.  When compressed, an optional tree description can be
 803   present, followed by 1 or 4 streams.
 804
 805     +----------------------------+
 806     |   Literals_Section_Header  |
 807     +----------------------------+
 808     | [Huffman_Tree_Description] |
 809     +----------------------------+
 810     |        [Jump_Table]        |
 811     +----------------------------+
 812     |          Stream_1          |
 813     +----------------------------+
 814     |         [Stream_2]         |
 815     +----------------------------+
 816     |         [Stream_3]         |
 817     +----------------------------+
 818     |         [Stream_4]         |
 819     +----------------------------+
 820
 8213.1.1.3.1.1.  Literals_Section_Header
 822
 823   This field describes how literals are packed.  It's a byte-aligned
 824   variable-size bit field, ranging from 1 to 5 bytes, using little-
 825   endian convention.
 826
 827     +---------------------+-----------+
 828     | Literals_Block_Type |  2 bits   |
 829     +---------------------+-----------+
 830     |     Size_Format     | 1-2 bits  |
 831     +---------------------+-----------+
 832     |   Regenerated_Size  | 5-20 bits |
 833     +---------------------+-----------+
 834     |  [Compressed_Size]  | 0-18 bits |
 835     +---------------------+-----------+
 836
 837   In this representation, bits at the top are the lowest bits.
 838
 839
 840
 841
 842Collet & Kucherawy            Informational                    [Page 15]
 843
 844RFC 8478                    application/zstd                October 2018
 845
 846
 847   The Literals_Block_Type field uses the two lowest bits of the first
 848   byte, describing four different block types:
 849
 850     +---------------------------+-------+
 851     |    Literals_Block_Type    | Value |
 852     +---------------------------+-------+
 853     |     Raw_Literals_Block    |   0   |
 854     +---------------------------+-------+
 855     |     RLE_Literals_Block    |   1   |
 856     +---------------------------+-------+
 857     | Compressed_Literals_Block |   2   |
 858     +---------------------------+-------+
 859     |  Treeless_Literals_Block  |   3   |
 860     +---------------------------+-------+
 861
 862   Raw_Literals_Block:  Literals are stored uncompressed.
 863      Literals_Section_Content is Regenerated_Size.
 864
 865   RLE_Literals_Block:  Literals consist of a single-byte value repeated
 866      Regenerated_Size times.  Literals_Section_Content is 1.
 867
 868   Compressed_Literals_Block:  This is a standard Huffman-compressed
 869      block, starting with a Huffman tree description.  See details
 870      below.  Literals_Section_Content is Compressed_Size.
 871
 872   Treeless_Literals_Block:  This is a Huffman-compressed block, using
 873      the Huffman tree from the previous Compressed_Literals_Block, or a
 874      dictionary if there is no previous Huffman-compressed literals
 875      block.  Huffman_Tree_Description will be skipped.  Note that if
 876      this mode is triggered without any previous Huffman-table in the
 877      frame (or dictionary, per Section 5), it should be treated as data
 878      corruption.  Literals_Section_Content is Compressed_Size.
 879
 880   The Size_Format is divided into two families:
 881
 882   o  For Raw_Literals_Block and RLE_Literals_Block, it's only necessary
 883      to decode Regenerated_Size.  There is no Compressed_Size field.
 884
 885   o  For Compressed_Block and Treeless_Literals_Block, it's required to
 886      decode both Compressed_Size and Regenerated_Size (the decompressed
 887      size).  It's also necessary to decode the number of streams (1 or
 888      4).
 889
 890   For values spanning several bytes, the convention is little endian.
 891
 892   Size_Format for Raw_Literals_Block and RLE_Literals_Block uses 1 or 2
 893   bits.  Its value is (Literals_Section_Header[0]>>2) & 0x3.
 894
 895
 896
 897
 898Collet & Kucherawy            Informational                    [Page 16]
 899
 900RFC 8478                    application/zstd                October 2018
 901
 902
 903   Size_Format == 00 or 10:  Size_Format uses 1 bit.  Regenerated_Size
 904      uses 5 bits (value 0-31).  Literals_Section_Header uses 1 byte.
 905      Regenerated_Size = Literal_Section_Header[0]>>3.
 906
 907   Size_Format == 01:  Size_Format uses 2 bits.  Regenerated_Size uses
 908      12 bits (values 0-4095).  Literals_Section_Header uses 2 bytes.
 909      Regenerated_Size = (Literals_Section_Header[0]>>4) +
 910      (Literals_Section_Header[1]<<4).
 911
 912   Size_Format == 11:  Size_Format uses 2 bits.  Regenerated_Size uses
 913      20 bits (values 0-1048575).  Literals_Section_Header uses 3 bytes.
 914      Regenerated_Size = (Literals_Section_Header[0]>>4) +
 915      (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)
 916
 917   Only Stream_1 is present for these cases.  Note that it is permitted
 918   to represent a short value (for example, 13) using a long format,
 919   even if it's less efficient.
 920
 921   Size_Format for Compressed_Literals_Block and Treeless_Literals_Block
 922   always uses 2 bits.
 923
 924   Size_Format == 00:  A single stream.  Both Regenerated_Size and
 925      Compressed_Size use 10 bits (values 0-1023).
 926      Literals_Section_Header uses 3 bytes.
 927
 928   Size_Format == 01:  4 streams.  Both Regenerated_Size and
 929      Compressed_Size use 10 bits (values 0-1023).
 930      Literals_Section_Header uses 3 bytes.
 931
 932   Size_Format == 10:  4 streams.  Both Regenerated_Size and
 933      Compressed_Size use 14 bits (values 0-16383).
 934      Literals_Section_Header uses 4 bytes.
 935
 936   Size_Format == 11:  4 streams.  Both Regenerated_Size and
 937      Compressed_Size use 18 bits (values 0-262143).
 938      Literals_Section_Header uses 5 bytes.
 939
 940   Both the Compressed_Size and Regenerated_Size fields follow little-
 941   endian convention.  Note that Compressed_Size includes the size of
 942   the Huffman_Tree_Description when it is present.
 943
 9443.1.1.3.1.2.  Raw_Literals_Block
 945
 946   The data in Stream_1 is Regenerated_Size bytes long.  It contains the
 947   raw literals data to be used during Sequence Execution
 948   (Section 3.1.1.3.2).
 949
 950
 951
 952
 953
 954Collet & Kucherawy            Informational                    [Page 17]
 955
 956RFC 8478                    application/zstd                October 2018
 957
 958
 9593.1.1.3.1.3.  RLE_Literals_Block
 960
 961   Stream_1 consists of a single byte that should be repeated
 962   Regenerated_Size times to generate the decoded literals.
 963
 9643.1.1.3.1.4.  Compressed_Literals_Block and Treeless_Literals_Block
 965
 966   Both of these modes contain Huffman-encoded data.  For
 967   Treeless_Literals_Block, the Huffman table comes from the previously
 968   compressed literals block, or from a dictionary; see Section 5.
 969
 9703.1.1.3.1.5.  Huffman_Tree_Description
 971
 972   This section is only present when the Literals_Block_Type type is
 973   Compressed_Literals_Block (2).  The format of
 974   Huffman_Tree_Description can be found in Section 4.2.1.  The size of
 975   Huffman_Tree_Description is determined during the decoding process.
 976   It must be used to determine where streams begin.
 977
 978     Total_Streams_Size = Compressed_Size
 979                          - Huffman_Tree_Description_Size
 980
 9813.1.1.3.1.6.  Jump_Table
 982
 983   The Jump_Table is only present when there are 4 Huffman-coded
 984   streams.
 985
 986   (Reminder: Huffman-compressed data consists of either 1 or 4 Huffman-
 987   coded streams.)
 988
 989   If only 1 stream is present, it is a single bitstream occupying the
 990   entire remaining portion of the literals block, encoded as described
 991   within Section 4.2.2.
 992
 993   If there are 4 streams, Literals_Section_Header only provides enough
 994   information to know the decompressed and compressed sizes of all 4
 995   streams combined.  The decompressed size of each stream is equal to
 996   (Regenerated_Size+3)/4, except for the last stream, which may be up
 997   to 3 bytes smaller, to reach a total decompressed size as specified
 998   in Regenerated_Size.
 999
1000   The compressed size of each stream is provided explicitly in the
1001   Jump_Table.  The Jump_Table is 6 bytes long and consists of three
1002   2-byte little-endian fields, describing the compressed sizes of the
1003   first 3 streams.  Stream4_Size is computed from Total_Streams_Size
1004   minus sizes of other streams.
1005
1006
1007
1008
1009
1010Collet & Kucherawy            Informational                    [Page 18]
1011
1012RFC 8478                    application/zstd                October 2018
1013
1014
1015     Stream4_Size = Total_Streams_Size - 6
1016                    - Stream1_Size - Stream2_Size
1017                    - Stream3_Size
1018
1019   Note that if Stream1_Size + Stream2_Size + Stream3_Size exceeds
1020   Total_Streams_Size, the data are considered corrupted.
1021
1022   Each of these 4 bitstreams is then decoded independently as a
1023   Huffman-Coded stream, as described in Section 4.2.2.
1024
10253.1.1.3.2.  Sequences_Section
1026
1027   A compressed block is a succession of sequences.  A sequence is a
1028   literal copy command, followed by a match copy command.  A literal
1029   copy command specifies a length.  It is the number of bytes to be
1030   copied (or extracted) from the Literals Section.  A match copy
1031   command specifies an offset and a length.
1032
1033   When all sequences are decoded, if there are literals left in the
1034   literals section, these bytes are added at the end of the block.
1035
1036   This is described in more detail in Section 3.1.1.4.
1037
1038   The Sequences_Section regroups all symbols required to decode
1039   commands.  There are three symbol types: literals lengths, offsets,
1040   and match lengths.  They are encoded together, interleaved, in a
1041   single "bitstream".
1042
1043   The Sequences_Section starts by a header, followed by optional
1044   probability tables for each symbol type, followed by the bitstream.
1045
1046     Sequences_Section_Header
1047       [Literals_Length_Table]
1048       [Offset_Table]
1049       [Match_Length_Table]
1050       bitStream
1051
1052   To decode the Sequences_Section, it's necessary to know its size.
1053   This size is deduced from the size of the Literals_Section:
1054   Sequences_Section_Size = Block_Size - Literals_Section_Header -
1055   Literals_Section_Content
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066Collet & Kucherawy            Informational                    [Page 19]
1067
1068RFC 8478                    application/zstd                October 2018
1069
1070
10713.1.1.3.2.1.  Sequences_Section_Header
1072
1073   This header consists of two items:
1074
1075   o  Number_of_Sequences
1076
1077   o  Symbol_Compression_Modes
1078
1079   Number_of_Sequences is a variable size field using between 1 and 3
1080   bytes.  If the first byte is "byte0":
1081
1082   o  if (byte0 == 0): there are no sequences.  The sequence section
1083      stops here.  Decompressed content is defined entirely as Literals
1084      Section content.  The FSE tables used in Repeat_Mode are not
1085      updated.
1086
1087   o  if (byte0 < 128): Number_of_Sequences = byte0.  Uses 1 byte.
1088
1089   o  if (byte0 < 255): Number_of_Sequences = ((byte0 - 128) << 8) +
1090      byte1.  Uses 2 bytes.
1091
1092   o  if (byte0 == 255): Number_of_Sequences = byte1 + (byte2 << 8) +
1093      0x7F00.  Uses 3 bytes.
1094
1095   Symbol_Compression_Modes is a single byte, defining the compression
1096   mode of each symbol type.
1097
1098     +-------------+----------------------+
1099     | Bit Number  |      Field Name      |
1100     +-------------+----------------------+
1101     |     7-6     | Literal_Lengths_Mode |
1102     +-------------+----------------------+
1103     |     5-4     |     Offsets_Mode     |
1104     +-------------+----------------------+
1105     |     3-2     |  Match_Lengths_Mode  |
1106     +-------------+----------------------+
1107     |     1-0     |       Reserved       |
1108     +-------------+----------------------+
1109
1110   The last field, Reserved, must be all zeroes.
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122Collet & Kucherawy            Informational                    [Page 20]
1123
1124RFC 8478                    application/zstd                October 2018
1125
1126
1127   Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define
1128   the Compression_Mode of literals lengths, offsets, and match lengths
1129   symbols, respectively.  They follow the same enumeration:
1130
1131     +-------+---------------------+
1132     | Value |  Compression_Mode   |
1133     +-------+---------------------+
1134     |   0   |   Predefined_Mode   |
1135     +-------+---------------------+
1136     |   1   |      RLE_Mode       |
1137     +-------+---------------------+
1138     |   2   | FSE_Compressed_Mode |
1139     +-------+---------------------+
1140     |   3   |     Repeat_Mode     |
1141     +-------+---------------------+
1142
1143   Predefined_Mode:  A predefined FSE (see Section 4.1) distribution
1144      table is used, as defined in Section 3.1.1.3.2.2.  No distribution
1145      table will be present.
1146
1147   RLE_Mode:  The table description consists of a single byte, which
1148      contains the symbol's value.  This symbol will be used for all
1149      sequences.
1150
1151   FSE_Compressed_Mode:  Standard FSE compression.  A distribution table
1152      will be present.  The format of this distribution table is
1153      described in Section 4.1.1.  Note that the maximum allowed
1154      accuracy log for literals length and match length tables is 9, and
1155      the maximum accuracy log for the offsets table is 8.  This mode
1156      must not be used when only one symbol is present; RLE_Mode should
1157      be used instead (although any other mode will work).
1158
1159   Repeat_Mode:  The table used in the previous Compressed_Block with
1160      Number_Of_Sequences > 0 will be used again, or if this is the
1161      first block, the table in the dictionary will be used.  Note that
1162      this includes RLE_Mode, so if Repeat_Mode follows RLE_Mode, the
1163      same symbol will be repeated.  It also includes Predefined_Mode,
1164      in which case Repeat_Mode will have the same outcome as
1165      Predefined_Mode.  No distribution table will be present.  If this
1166      mode is used without any previous sequence table in the frame (or
1167      dictionary; see Section 5) to repeat, this should be treated as
1168      corruption.
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178Collet & Kucherawy            Informational                    [Page 21]
1179
1180RFC 8478                    application/zstd                October 2018
1181
1182
11833.1.1.3.2.1.1.  Sequence Codes for Lengths and Offsets
1184
1185   Each symbol is a code in its own context, which specifies Baseline
1186   and Number_of_Bits to add.  Codes are FSE compressed and interleaved
1187   with raw additional bits in the same bitstream.
1188
1189   Literals length codes are values ranging from 0 to 35 inclusive.
1190   They define lengths from 0 to 131071 bytes.  The literals length is
1191   equal to the decoded Baseline plus the result of reading
1192   Number_of_Bits bits from the bitstream, as a little-endian value.
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234Collet & Kucherawy            Informational                    [Page 22]
1235
1236RFC 8478                    application/zstd                October 2018
1237
1238
1239     +----------------------+----------+----------------+
1240     | Literals_Length_Code | Baseline | Number_of_Bits |
1241     +----------------------+----------+----------------+
1242     |         0-15         |  length  |       0        |
1243     +----------------------+----------+----------------+
1244     |          16          |    16    |       1        |
1245     +----------------------+----------+----------------+
1246     |          17          |    18    |       1        |
1247     +----------------------+----------+----------------+
1248     |          18          |    20    |       1        |
1249     +----------------------+----------+----------------+
1250     |          19          |    22    |       1        |
1251     +----------------------+----------+----------------+
1252     |          20          |    24    |       2        |
1253     +----------------------+----------+----------------+
1254     |          21          |    28    |       2        |
1255     +----------------------+----------+----------------+
1256     |          22          |    32    |       3        |
1257     +----------------------+----------+----------------+
1258     |          23          |    40    |       3        |
1259     +----------------------+----------+----------------+
1260     |          24          |    48    |       4        |
1261     +----------------------+----------+----------------+
1262     |          25          |    64    |       6        |
1263     +----------------------+----------+----------------+
1264     |          26          |    128   |       7        |
1265     +----------------------+----------+----------------+
1266     |          27          |    256   |       8        |
1267     +----------------------+----------+----------------+
1268     |          28          |    512   |       9        |
1269     +----------------------+----------+----------------+
1270     |          29          |   1024   |       10       |
1271     +----------------------+----------+----------------+
1272     |          30          |   2048   |       11       |
1273     +----------------------+----------+----------------+
1274     |          31          |   4096   |       12       |
1275     +----------------------+----------+----------------+
1276     |          32          |   8192   |       13       |
1277     +----------------------+----------+----------------+
1278     |          33          |  16384   |       14       |
1279     +----------------------+----------+----------------+
1280     |          34          |  32768   |       15       |
1281     +----------------------+----------+----------------+
1282     |          35          |  65536   |       16       |
1283     +----------------------+----------+----------------+
1284
1285
1286
1287
1288
1289
1290Collet & Kucherawy            Informational                    [Page 23]
1291
1292RFC 8478                    application/zstd                October 2018
1293
1294
1295   Match length codes are values ranging from 0 to 52 inclusive.  They
1296   define lengths from 3 to 131074 bytes.  The match length is equal to
1297   the decoded Baseline plus the result of reading Number_of_Bits bits
1298   from the bitstream, as a little-endian value.
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346Collet & Kucherawy            Informational                    [Page 24]
1347
1348RFC 8478                    application/zstd                October 2018
1349
1350
1351     +-------------------+-----------------------+----------------+
1352     | Match_Length_Code |       Baseline        | Number_of_Bits |
1353     +-------------------+-----------------------+----------------+
1354     |        0-31       | Match_Length_Code + 3 |       0        |
1355     +-------------------+-----------------------+----------------+
1356     |         32        |          35           |       1        |
1357     +-------------------+-----------------------+----------------+
1358     |         33        |          37           |       1        |
1359     +-------------------+-----------------------+----------------+
1360     |         34        |          39           |       1        |
1361     +-------------------+-----------------------+----------------+
1362     |         35        |          41           |       1        |
1363     +-------------------+-----------------------+----------------+
1364     |         36        |          43           |       2        |
1365     +-------------------+-----------------------+----------------+
1366     |         37        |          47           |       2        |
1367     +-------------------+-----------------------+----------------+
1368     |         38        |          51           |       3        |
1369     +-------------------+-----------------------+----------------+
1370     |         39        |          59           |       3        |
1371     +-------------------+-----------------------+----------------+
1372     |         40        |          67           |       4        |
1373     +-------------------+-----------------------+----------------+
1374     |         41        |          83           |       4        |
1375     +-------------------+-----------------------+----------------+
1376     |         42        |          99           |       5        |
1377     +-------------------+-----------------------+----------------+
1378     |         43        |         131           |       7        |
1379     +-------------------+-----------------------+----------------+
1380     |         44        |         259           |       8        |
1381     +-------------------+-----------------------+----------------+
1382     |         45        |         515           |       9        |
1383     +-------------------+-----------------------+----------------+
1384     |         46        |         1027          |       10       |
1385     +-------------------+-----------------------+----------------+
1386     |         47        |         2051          |       11       |
1387     +-------------------+-----------------------+----------------+
1388     |         48        |         4099          |       12       |
1389     +-------------------+-----------------------+----------------+
1390     |         49        |         8195          |       13       |
1391     +-------------------+-----------------------+----------------+
1392     |         50        |         16387         |       14       |
1393     +-------------------+-----------------------+----------------+
1394     |         51        |         32771         |       15       |
1395     +-------------------+-----------------------+----------------+
1396     |         52        |         65539         |       16       |
1397     +-------------------+-----------------------+----------------+
1398
1399
1400
1401
1402Collet & Kucherawy            Informational                    [Page 25]
1403
1404RFC 8478                    application/zstd                October 2018
1405
1406
1407   Offset codes are values ranging from 0 to N.
1408
1409   A decoder is free to limit its maximum supported value for N.
1410   Support for values of at least 22 is recommended.  At the time of
1411   this writing, the reference decoder supports a maximum N value of 31.
1412
1413   An offset code is also the number of additional bits to read in
1414   little-endian fashion and can be translated into an Offset_Value
1415   using the following formulas:
1416
1417     Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
1418     if (Offset_Value > 3) Offset = Offset_Value - 3;
1419
1420   This means that maximum Offset_Value is (2^(N+1))-1, supporting back-
1421   reference distance up to (2^(N+1))-4, but it is limited by the
1422   maximum back-reference distance (see Section 3.1.1.1.2).
1423
1424   Offset_Value from 1 to 3 are special: they define "repeat codes".
1425   This is described in more detail in Section 3.1.1.5.
1426
14273.1.1.3.2.1.2.  Decoding Sequences
1428
1429   FSE bitstreams are read in reverse of the direction they are written.
1430   In zstd, the compressor writes bits forward into a block, and the
1431   decompressor must read the bitstream backwards.
1432
1433   To find the start of the bitstream, it is therefore necessary to know
1434   the offset of the last byte of the block, which can be found by
1435   counting Block_Size bytes after the block header.
1436
1437   After writing the last bit containing information, the compressor
1438   writes a single 1 bit and then fills the byte with 0-7 zero bits of
1439   padding.  The last byte of the compressed bitstream cannot be zero
1440   for that reason.
1441
1442   When decompressing, the last byte containing the padding is the first
1443   byte to read.  The decompressor needs to skip 0-7 initial zero bits
1444   until the first 1 bit occurs.  Afterwards, the useful part of the
1445   bitstream begins.
1446
1447   FSE decoding requires a 'state' to be carried from symbol to symbol.
1448   For more explanation on FSE decoding, see Section 4.1.
1449
1450   For sequence decoding, a separate state keeps track of each literal
1451   lengths, offsets, and match lengths symbols.  Some FSE primitives are
1452   also used.  For more details on the operation of these primitives,
1453   see Section 4.1.
1454
1455
1456
1457
1458Collet & Kucherawy            Informational                    [Page 26]
1459
1460RFC 8478                    application/zstd                October 2018
1461
1462
1463   The bitstream starts with initial FSE state values, each using the
1464   required number of bits in their respective accuracy, decoded
1465   previously from their normalized distribution.  It starts with
1466   Literals_Length_State, followed by Offset_State, and finally
1467   Match_Length_State.
1468
1469   Note that all values are read backward, so the 'start' of the
1470   bitstream is at the highest position in memory, immediately before
1471   the last 1 bit for padding.
1472
1473   After decoding the starting states, a single sequence is decoded
1474   Number_Of_Sequences times.  These sequences are decoded in order from
1475   first to last.  Since the compressor writes the bitstream in the
1476   forward direction, this means the compressor must encode the
1477   sequences starting with the last one and ending with the first.
1478
1479   For each of the symbol types, the FSE state can be used to determine
1480   the appropriate code.  The code then defines the Baseline and
1481   Number_of_Bits to read for each type.  The description of the codes
1482   for how to determine these values can be found in
1483   Section 3.1.1.3.2.1.
1484
1485   Decoding starts by reading the Number_of_Bits required to decode
1486   offset.  It does the same for Match_Length and then for
1487   Literals_Length.  This sequence is then used for Sequence Execution
1488   (see Section 3.1.1.4).
1489
1490   If it is not the last sequence in the block, the next operation is to
1491   update states.  Using the rules pre-calculated in the decoding
1492   tables, Literals_Length_State is updated, followed by
1493   Match_Length_State, and then Offset_State.  See Section 4.1 for
1494   details on how to update states from the bitstream.
1495
1496   This operation will be repeated Number_of_Sequences times.  At the
1497   end, the bitstream shall be entirely consumed; otherwise, the
1498   bitstream is considered corrupted.
1499
15003.1.1.3.2.2.  Default Distributions
1501
1502   If Predefined_Mode is selected for a symbol type, its FSE decoding
1503   table is generated from a predefined distribution table defined here.
1504   For details on how to convert this distribution into a decoding
1505   table, see Section 4.1.
1506
1507
1508
1509
1510
1511
1512
1513
1514Collet & Kucherawy            Informational                    [Page 27]
1515
1516RFC 8478                    application/zstd                October 2018
1517
1518
15193.1.1.3.2.2.1.  Literals Length
1520
1521   The decoding table uses an accuracy log of 6 bits (64 states).
1522
1523     short literalsLength_defaultDistribution[36] =
1524       { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
1525         2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
1526         -1,-1,-1,-1
1527       };
1528
15293.1.1.3.2.2.2.  Match Length
1530
1531   The decoding table uses an accuracy log of 6 bits (64 states).
1532
1533     short matchLengths_defaultDistribution[53] =
1534       { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
1535         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1536         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
1537         -1,-1,-1,-1,-1
1538       };
1539
15403.1.1.3.2.2.3.  Offset Codes
1541
1542   The decoding table uses an accuracy log of 5 bits (32 states), and
1543   supports a maximum N value of 28, allowing offset values up to
1544   536,870,908.
1545
1546   If any sequence in the compressed block requires a larger offset than
1547   this, it's not possible to use the default distribution to represent
1548   it.
1549
1550     short offsetCodes_defaultDistribution[29] =
1551       { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
1552         1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1
1553       };
1554
15553.1.1.4.  Sequence Execution
1556
1557   Once literals and sequences have been decoded, they are combined to
1558   produce the decoded content of a block.
1559
1560   Each sequence consists of a tuple of (literals_length, offset_value,
1561   match_length), decoded as described in the
1562   Sequences_Section (Section 3.1.1.3.2).  To execute a sequence, first
1563   copy literals_length bytes from the decoded literals to the output.
1564
1565
1566
1567
1568
1569
1570Collet & Kucherawy            Informational                    [Page 28]
1571
1572RFC 8478                    application/zstd                October 2018
1573
1574
1575   Then, match_length bytes are copied from previous decoded data.  The
1576   offset to copy from is determined by offset_value:
1577
1578   o  if Offset_Value > 3, then the offset is Offset_Value - 3;
1579
1580   o  if Offset_Value is from 1-3, the offset is a special repeat offset
1581      value.  See Section 3.1.1.5 for how the offset is determined in
1582      this case.
1583
1584   The offset is defined as from the current position (after copying the
1585   literals), so an offset of 6 and a match length of 3 means that 3
1586   bytes should be copied from 6 bytes back.  Note that all offsets
1587   leading to previously decoded data must be smaller than Window_Size
1588   defined in Frame_Header_Descriptor (Section 3.1.1.1.1).
1589
15903.1.1.5.  Repeat Offsets
1591
1592   As seen above, the first three values define a repeated offset; we
1593   will call them Repeated_Offset1, Repeated_Offset2, and
1594   Repeated_Offset3.  They are sorted in recency order, with
1595   Repeated_Offset1 meaning "most recent one".
1596
1597   If offset_value is 1, then the offset used is Repeated_Offset1, etc.
1598
1599   There is one exception: When the current sequence's literals_length
1600   is 0, repeated offsets are shifted by 1, so an offset_value of 1
1601   means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3,
1602   and an offset_value of 3 means Repeated_Offset1 - 1_byte.
1603
1604   For the first block, the starting offset history is populated with
1605   the following values: Repeated_Offset1 (1), Repeated_Offset2 (4), and
1606   Repeated_Offset3 (8), unless a dictionary is used, in which case they
1607   come from the dictionary.
1608
1609   Then each block gets its starting offset history from the ending
1610   values of the most recent Compressed_Block.  Note that blocks that
1611   are not Compressed_Block are skipped; they do not contribute to
1612   offset history.
1613
1614   The newest offset takes the lead in offset history, shifting others
1615   back (up to its previous place if it was already present).  This
1616   means that when Repeated_Offset1 (most recent) is used, history is
1617   unmodified.  When Repeated_Offset2 is used, it is swapped with
1618   Repeated_Offset1.  If any other offset is used, it becomes
1619   Repeated_Offset1, and the rest are shifted back by 1.
1620
1621
1622
1623
1624
1625
1626Collet & Kucherawy            Informational                    [Page 29]
1627
1628RFC 8478                    application/zstd                October 2018
1629
1630
16313.1.2.  Skippable Frames
1632
1633     +--------------+------------+-----------+
1634     | Magic_Number | Frame_Size | User_Data |
1635     +--------------+------------+-----------+
1636     |    4 bytes   |   4 bytes  |  n bytes  |
1637     +--------------+------------+-----------+
1638
1639   Skippable frames allow the insertion of user-defined metadata into a
1640   flow of concatenated frames.
1641
1642   Skippable frames defined in this specification are compatible with
1643   skippable frames in [LZ4].
1644
1645   From a compliant decoder perspective, skippable frames simply need to
1646   be skipped, and their content ignored, resuming decoding after the
1647   skippable frame.
1648
1649   It should be noted that a skippable frame can be used to watermark a
1650   stream of concatenated frames embedding any kind of tracking
1651   information (even just a Universally Unique Identifier (UUID)).
1652   Users wary of such possibility should scan the stream of concatenated
1653   frames in an attempt to detect such frames for analysis or removal.
1654
1655   The fields are:
1656
1657   Magic_Number:  4 bytes, little-endian format.  Value: 0x184D2A5?,
1658      which means any value from 0x184D2A50 to 0x184D2A5F.  All 16
1659      values are valid to identify a skippable frame.  This
1660      specification does not detail any specific tagging methods for
1661      skippable frames.
1662
1663   Frame_Size:  This is the size, in bytes, of the following User_Data
1664      (without including the magic number nor the size field itself).
1665      This field is represented using 4 bytes, little-endian format,
1666      unsigned 32 bits.  This means User_Data can't be bigger than
1667      (2^32-1) bytes.
1668
1669   User_Data:  This field can be anything.  Data will just be skipped by
1670      the decoder.
1671
16724.  Entropy Encoding
1673
1674   Two types of entropy encoding are used by the Zstandard format: FSE
1675   and Huffman coding.  Huffman is used to compress literals, while FSE
1676   is used for all other symbols (Literals_Length_Code,
1677   Match_Length_Code, and offset codes) and to compress Huffman headers.
1678
1679
1680
1681
1682Collet & Kucherawy            Informational                    [Page 30]
1683
1684RFC 8478                    application/zstd                October 2018
1685
1686
16874.1.  FSE
1688
1689   FSE, short for Finite State Entropy, is an entropy codec based on
1690   [ANS].  FSE encoding/decoding involves a state that is carried over
1691   between symbols, so decoding must be done in the opposite direction
1692   as encoding.  Therefore, all FSE bitstreams are read from end to
1693   beginning.  Note that the order of the bits in the stream is not
1694   reversed; they are simply read in the reverse order from which they
1695   were written.
1696
1697   For additional details on FSE, see Finite State Entropy [FSE].
1698
1699   FSE decoding involves a decoding table that has a power of 2 size and
1700   contains three elements: Symbol, Num_Bits, and Baseline.  The base 2
1701   logarithm of the table size is its Accuracy_Log.  An FSE state value
1702   represents an index in this table.
1703
1704   To obtain the initial state value, consume Accuracy_Log bits from the
1705   stream as a little-endian value.  The next symbol in the stream is
1706   the Symbol indicated in the table for that state.  To obtain the next
1707   state value, the decoder should consume Num_Bits bits from the stream
1708   as a little-endian value and add it to Baseline.
1709
17104.1.1.  FSE Table Description
1711
1712   To decode FSE streams, it is necessary to construct the decoding
1713   table.  The Zstandard format encodes FSE table descriptions as
1714   described here.
1715
1716   An FSE distribution table describes the probabilities of all symbols
1717   from 0 to the last present one (included) on a normalized scale of
1718   (1 << Accuracy_Log).  Note that there must be two or more symbols
1719   with non-zero probability.
1720
1721   A bitstream is read forward, in little-endian fashion.  It is not
1722   necessary to know its exact size, since the size will be discovered
1723   and reported by the decoding process.  The bitstream starts by
1724   reporting on which scale it operates.  If low4bits designates the
1725   lowest 4 bits of the first byte, then Accuracy_Log = low4bits + 5.
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738Collet & Kucherawy            Informational                    [Page 31]
1739
1740RFC 8478                    application/zstd                October 2018
1741
1742
1743   This is followed by each symbol value, from 0 to the last present
1744   one.  The number of bits used by each field is variable and depends
1745   on:
1746
1747   Remaining probabilities + 1:  For example, presuming an Accuracy_Log
1748      of 8, and presuming 100 probabilities points have already been
1749      distributed, the decoder may read any value from 0 to
1750      (256 - 100 + 1) == 157, inclusive.  Therefore, it must read
1751      log2sup(157) == 8 bits.
1752
1753   Value decoded:  Small values use 1 fewer bit.  For example, presuming
1754      values from 0 to 157 (inclusive) are possible, 255 - 157 = 98
1755      values are remaining in an 8-bit field.  The first 98 values
1756      (hence from 0 to 97) use only 7 bits, and values from 98 to 157
1757      use 8 bits.  This is achieved through this scheme:
1758
1759     +------------+---------------+-----------+
1760     | Value Read | Value Decoded | Bits Used |
1761     +------------+---------------+-----------+
1762     |   0 - 97   |     0 - 97    |     7     |
1763     +------------+---------------+-----------+
1764     |  98 - 127  |    98 - 127   |     8     |
1765     +------------+---------------+-----------+
1766     | 128 - 225  |     0 - 97    |     7     |
1767     +------------+---------------+-----------+
1768     | 226 - 255  |   128 - 157   |     8     |
1769     +------------+---------------+-----------+
1770
1771   Symbol probabilities are read one by one, in order.  The probability
1772   is obtained from Value decoded using the formula P = Value - 1.  This
1773   means the value 0 becomes the negative probability -1.  This is a
1774   special probability that means "less than 1".  Its effect on the
1775   distribution table is described below.  For the purpose of
1776   calculating total allocated probability points, it counts as 1.
1777
1778   When a symbol has a probability of zero, it is followed by a 2-bit
1779   repeat flag.  This repeat flag tells how many probabilities of zeroes
1780   follow the current one.  It provides a number ranging from 0 to 3.
1781   If it is a 3, another 2-bit repeat flag follows, and so on.
1782
1783   When the last symbol reaches a cumulated total of
1784   (1 << Accuracy_Log), decoding is complete.  If the last symbol makes
1785   the cumulated total go above (1 << Accuracy_Log), distribution is
1786   considered corrupted.
1787
1788
1789
1790
1791
1792
1793
1794Collet & Kucherawy            Informational                    [Page 32]
1795
1796RFC 8478                    application/zstd                October 2018
1797
1798
1799   Finally, the decoder can tell how many bytes were used in this
1800   process and how many symbols are present.  The bitstream consumes a
1801   round number of bytes.  Any remaining bit within the last byte is
1802   simply unused.
1803
1804   The distribution of normalized probabilities is enough to create a
1805   unique decoding table.  The table has a size of (1 << Accuracy_Log).
1806   Each cell describes the symbol decoded and instructions to get the
1807   next state.
1808
1809   Symbols are scanned in their natural order for "less than 1"
1810   probabilities as described above.  Symbols with this probability are
1811   being attributed a single cell, starting from the end of the table
1812   and retreating.  These symbols define a full state reset, reading
1813   Accuracy_Log bits.
1814
1815   All remaining symbols are allocated in their natural order.  Starting
1816   from symbol 0 and table position 0, each symbol gets allocated as
1817   many cells as its probability.  Cell allocation is spread, not
1818   linear; each successor position follows this rule:
1819
1820     position += (tableSize >> 1) + (tableSize >> 3) + 3;
1821     position &= tableSize - 1;
1822
1823   A position is skipped if it is already occupied by a "less than 1"
1824   probability symbol.  Position does not reset between symbols; it
1825   simply iterates through each position in the table, switching to the
1826   next symbol when enough states have been allocated to the current
1827   one.
1828
1829   The result is a list of state values.  Each state will decode the
1830   current symbol.
1831
1832   To get the Number_of_Bits and Baseline required for the next state,
1833   it is first necessary to sort all states in their natural order.  The
1834   lower states will need 1 more bit than higher ones.  The process is
1835   repeated for each symbol.
1836
1837   For example, presuming a symbol has a probability of 5, it receives
1838   five state values.  States are sorted in natural order.  The next
1839   power of 2 is 8.  The space of probabilities is divided into 8 equal
1840   parts.  Presuming the Accuracy_Log is 7, this defines 128 states, and
1841   each share (divided by 8) is 16 in size.  In order to reach 8, 8 - 5
1842   = 3 lowest states will count "double", doubling the number of shares
1843   (32 in width), requiring 1 more bit in the process.
1844
1845
1846
1847
1848
1849
1850Collet & Kucherawy            Informational                    [Page 33]
1851
1852RFC 8478                    application/zstd                October 2018
1853
1854
1855   Baseline is assigned starting from the higher states using fewer
1856   bits, and proceeding naturally, then resuming at the first state,
1857   each taking its allocated width from Baseline.
1858
1859     +----------------+-------+-------+--------+------+-------+
1860     |   state order  |   0   |   1   |   2    |  3   |  4    |
1861     +----------------+-------+-------+--------+------+-------+
1862     |     width      |   32  |   32  |   32   |  16  |  16   |
1863     +----------------+-------+-------+--------+------+-------+
1864     | Number_of_Bits |   5   |   5   |   5    |  4   |  4    |
1865     +----------------+-------+-------+--------+------+-------+
1866     |  range number  |   2   |   4   |   6    |  0   |  1    |
1867     +----------------+-------+-------+--------+------+-------+
1868     |    Baseline    |   32  |   64  |   96   |  0   |  16   |
1869     +----------------+-------+-------+--------+------+-------+
1870     |     range      | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
1871     +----------------+-------+-------+--------+------+-------+
1872
1873   The next state is determined from the current state by reading the
1874   required Number_of_Bits and adding the specified Baseline.
1875
1876   See Appendix A for the results of this process that are applied to
1877   the default distributions.
1878
18794.2.  Huffman Coding
1880
1881   Zstandard Huffman-coded streams are read backwards, similar to the
1882   FSE bitstreams.  Therefore, to find the start of the bitstream, it is
1883   necessary to know the offset of the last byte of the Huffman-coded
1884   stream.
1885
1886   After writing the last bit containing information, the compressor
1887   writes a single 1 bit and then fills the byte with 0-7 0 bits of
1888   padding.  The last byte of the compressed bitstream cannot be 0 for
1889   that reason.
1890
1891   When decompressing, the last byte containing the padding is the first
1892   byte to read.  The decompressor needs to skip 0-7 initial 0 bits and
1893   the first 1 bit that occurs.  Afterwards, the useful part of the
1894   bitstream begins.
1895
1896   The bitstream contains Huffman-coded symbols in little-endian order,
1897   with the codes defined by the method below.
1898
1899
1900
1901
1902
1903
1904
1905
1906Collet & Kucherawy            Informational                    [Page 34]
1907
1908RFC 8478                    application/zstd                October 2018
1909
1910
19114.2.1.  Huffman Tree Description
1912
1913   Prefix coding represents symbols from an a priori known alphabet by
1914   bit sequences (codewords), one codeword for each symbol, in a manner
1915   such that different symbols may be represented by bit sequences of
1916   different lengths, but a parser can always parse an encoded string
1917   unambiguously symbol by symbol.
1918
1919   Given an alphabet with known symbol frequencies, the Huffman
1920   algorithm allows the construction of an optimal prefix code using the
1921   fewest bits of any possible prefix codes for that alphabet.
1922
1923   The prefix code must not exceed a maximum code length.  More bits
1924   improve accuracy but yield a larger header size and require more
1925   memory or more complex decoding operations.  This specification
1926   limits the maximum code length to 11 bits.
1927
1928   All literal values from zero (included) to the last present one
1929   (excluded) are represented by Weight with values from 0 to
1930   Max_Number_of_Bits.  Transformation from Weight to Number_of_Bits
1931   follows this pseudocode:
1932
1933     if Weight == 0
1934       Number_of_Bits = 0
1935     else
1936       Number_of_Bits = Max_Number_of_Bits + 1 - Weight
1937
1938   The last symbol's Weight is deduced from previously decoded ones, by
1939   completing to the nearest power of 2.  This power of 2 gives
1940   Max_Number_of_Bits the depth of the current tree.
1941
1942   For example, presume the following Huffman tree must be described:
1943
1944     +---------------+----------------+
1945     | Literal Value | Number_of_Bits |
1946     +---------------+----------------+
1947     |       0       |        1       |
1948     +---------------+----------------+
1949     |       1       |        2       |
1950     +---------------+----------------+
1951     |       2       |        3       |
1952     +---------------+----------------+
1953     |       3       |        0       |
1954     +---------------+----------------+
1955     |       4       |        4       |
1956     +---------------+----------------+
1957     |       5       |        4       |
1958     +---------------+----------------+
1959
1960
1961
1962Collet & Kucherawy            Informational                    [Page 35]
1963
1964RFC 8478                    application/zstd                October 2018
1965
1966
1967   The tree depth is 4, since its longest element uses 4 bits.  (The
1968   longest elements are those with the smallest frequencies.)  Value 5
1969   will not be listed as it can be determined from the values for 0-4,
1970   nor will values above 5 as they are all 0.  Values from 0 to 4 will
1971   be listed using Weight instead of Number_of_Bits.  The pseudocode to
1972   determine Weight is:
1973
1974     if Number_of_Bits == 0
1975       Weight = 0
1976     else
1977       Weight = Max_Number_of_Bits + 1 - Number_of_Bits
1978
1979   It gives the following series of weights:
1980
1981     +---------------+--------+
1982     | Literal Value | Weight |
1983     +---------------+--------+
1984     |       0       |   4    |
1985     +---------------+--------+
1986     |       1       |   3    |
1987     +---------------+--------+
1988     |       2       |   2    |
1989     +---------------+--------+
1990     |       3       |   0    |
1991     +---------------+--------+
1992     |       4       |   1    |
1993     +---------------+--------+
1994
1995   The decoder will do the inverse operation: having collected weights
1996   of literals from 0 to 4, it knows the last literal, 5, is present
1997   with a non-zero Weight.  The Weight of 5 can be determined by
1998   advancing to the next power of 2.  The sum of 2^(Weight-1) (excluding
1999   0's) is 15.  The nearest power of 2 is 16.  Therefore,
2000   Max_Number_of_Bits = 4 and Weight[5] = 16 - 15 = 1.
2001
20024.2.1.1.  Huffman Tree Header
2003
2004   This is a single byte value (0-255), which describes how the series
2005   of weights is encoded.
2006
2007   headerByte < 128:  The series of weights is compressed using FSE (see
2008      below).  The length of the FSE-compressed series is equal to
2009      headerByte (0-127).
2010
2011
2012
2013
2014
2015
2016
2017
2018Collet & Kucherawy            Informational                    [Page 36]
2019
2020RFC 8478                    application/zstd                October 2018
2021
2022
2023   headerByte >= 128:  This is a direct representation, where each
2024      Weight is written directly as a 4-bit field (0-15).  They are
2025      encoded forward, 2 weights to a byte with the first weight taking
2026      the top 4 bits and the second taking the bottom 4; for example,
2027      the following operations could be used to read the weights:
2028
2029     Weight[0] = (Byte[0] >> 4)
2030     Weight[1] = (Byte[0] & 0xf),
2031     etc.
2032
2033      The full representation occupies ceiling(Number_of_Symbols/2)
2034      bytes, meaning it uses only full bytes even if Number_of_Symbols
2035      is odd.  Number_of_Symbols = headerByte - 127.  Note that maximum
2036      Number_of_Symbols is 255 - 127 = 128.  If any literal has a value
2037      over 128, raw header mode is not possible, and it is necessary to
2038      use FSE compression.
2039
20404.2.1.2.  FSE Compression of Huffman Weights
2041
2042   In this case, the series of Huffman weights is compressed using FSE
2043   compression.  It is a single bitstream with two interleaved states,
2044   sharing a single distribution table.
2045
2046   To decode an FSE bitstream, it is necessary to know its compressed
2047   size.  Compressed size is provided by headerByte.  It's also
2048   necessary to know its maximum possible decompressed size, which is
2049   255, since literal values span from 0 to 255, and the last symbol's
2050   Weight is not represented.
2051
2052   An FSE bitstream starts by a header, describing probabilities
2053   distribution.  It will create a decoding table.  For a list of
2054   Huffman weights, the maximum accuracy log is 6 bits.  For more
2055   details, see Section 4.1.1.
2056
2057   The Huffman header compression uses two states, which share the same
2058   FSE distribution table.  The first state (State1) encodes the even-
2059   numbered index symbols, and the second (State2) encodes the odd-
2060   numbered index symbols.  State1 is initialized first, and then
2061   State2, and they take turns decoding a single symbol and updating
2062   their state.  For more details on these FSE operations, see
2063   Section 4.1.
2064
2065   The number of symbols to be decoded is determined by tracking the
2066   bitStream overflow condition: If updating state after decoding a
2067   symbol would require more bits than remain in the stream, it is
2068   assumed that extra bits are zero.  Then, symbols for each of the
2069   final states are decoded and the process is complete.
2070
2071
2072
2073
2074Collet & Kucherawy            Informational                    [Page 37]
2075
2076RFC 8478                    application/zstd                October 2018
2077
2078
20794.2.1.3.  Conversion from Weights to Huffman Prefix Codes
2080
2081   All present symbols will now have a Weight value.  It is possible to
2082   transform weights into Number_of_Bits, using this formula:
2083
2084     if Weight > 0
2085         Number_of_Bits = Max_Number_of_Bits + 1 - Weight
2086     else
2087         Number_of_Bits = 0
2088
2089   Symbols are sorted by Weight.  Within the same Weight, symbols keep
2090   natural sequential order.  Symbols with a Weight of zero are removed.
2091   Then, starting from the lowest Weight, prefix codes are distributed
2092   in sequential order.
2093
2094   For example, assume the following list of weights has been decoded:
2095
2096     +---------+--------+
2097     | Literal | Weight |
2098     +---------+--------+
2099     |    0    |   4    |
2100     +---------+--------+
2101     |    1    |   3    |
2102     +---------+--------+
2103     |    2    |   2    |
2104     +---------+--------+
2105     |    3    |   0    |
2106     +---------+--------+
2107     |    4    |   1    |
2108     +---------+--------+
2109     |    5    |   1    |
2110     +---------+--------+
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130Collet & Kucherawy            Informational                    [Page 38]
2131
2132RFC 8478                    application/zstd                October 2018
2133
2134
2135   Sorting by weight and then the natural sequential order yields the
2136   following distribution:
2137
2138     +---------+--------+----------------+--------------+
2139     | Literal | Weight | Number_Of_Bits | Prefix Codes |
2140     +---------+--------+----------------|--------------+
2141     |    3    |   0    |        0       |      N/A     |
2142     +---------+--------+----------------|--------------+
2143     |    4    |   1    |        4       |     0000     |
2144     +---------+--------+----------------|--------------+
2145     |    5    |   1    |        4       |     0001     |
2146     +---------+--------+----------------|--------------+
2147     |    2    |   2    |        3       |      001     |
2148     +---------+--------+----------------|--------------+
2149     |    1    |   3    |        2       |       01     |
2150     +---------+--------+----------------|--------------+
2151     |    0    |   4    |        1       |        1     |
2152     +---------+--------+----------------|--------------+
2153
21544.2.2.  Huffman-Coded Streams
2155
2156   Given a Huffman decoding table, it is possible to decode a Huffman-
2157   coded stream.
2158
2159   Each bitstream must be read backward, which starts from the end and
2160   goes up to the beginning.  Therefore, it is necessary to know the
2161   size of each bitstream.
2162
2163   It is also necessary to know exactly which bit is the last.  This is
2164   detected by a final bit flag: the highest bit of the last byte is a
2165   final-bit-flag.  Consequently, a last byte of 0 is not possible.  And
2166   the final-bit-flag itself is not part of the useful bitstream.
2167   Hence, the last byte contains between 0 and 7 useful bits.
2168
2169   Starting from the end, it is possible to read the bitstream in a
2170   little-endian fashion, keeping track of already used bits.  Since the
2171   bitstream is encoded in reverse order, starting from the end, read
2172   symbols in forward order.
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186Collet & Kucherawy            Informational                    [Page 39]
2187
2188RFC 8478                    application/zstd                October 2018
2189
2190
2191   For example, if the literal sequence "0145" was encoded using the
2192   above prefix code, it would be encoded (in reverse order) as:
2193
2194     +---------+----------+
2195     | Symbol  | Encoding |
2196     +---------+----------+
2197     |    5    |   0000   |
2198     +---------+----------+
2199     |    4    |   0001   |
2200     +---------+----------+
2201     |    1    |    01    |
2202     +---------+----------+
2203     |    0    |    1     |
2204     +---------+----------+
2205     | Padding |   00001  |
2206     +---------+----------+
2207
2208   This results in the following 2-byte bitstream:
2209
2210     00010000 00001101
2211
2212   Here is an alternative representation with the symbol codes separated
2213   by underscores:
2214
2215     0001_0000 00001_1_01
2216
2217   Reading the highest Max_Number_of_Bits bits, it's possible to compare
2218   the extracted value to the decoding table, determining the symbol to
2219   decode and number of bits to discard.
2220
2221   The process continues reading up to the required number of symbols
2222   per stream.  If a bitstream is not entirely and exactly consumed,
2223   hence reaching exactly its beginning position with all bits consumed,
2224   the decoding process is considered faulty.
2225
22265.  Dictionary Format
2227
2228   Zstandard is compatible with "raw content" dictionaries, free of any
2229   format restriction, except that they must be at least 8 bytes.  These
2230   dictionaries function as if they were just the content part of a
2231   formatted dictionary.
2232
2233   However, dictionaries created by "zstd --train" in the reference
2234   implementation follow a specific format, described here.
2235
2236   Dictionaries are not included in the compressed content but rather
2237   are provided out of band.  That is, the Dictionary_ID identifies
2238   which should be used, but this specification does not describe the
2239
2240
2241
2242Collet & Kucherawy            Informational                    [Page 40]
2243
2244RFC 8478                    application/zstd                October 2018
2245
2246
2247   mechanism by which the dictionary is obtained prior to use during
2248   compression or decompression.
2249
2250   A dictionary has a size, defined either by a buffer limit or a file
2251   size.  The general format is:
2252
2253     +--------------+---------------+----------------+---------+
2254     | Magic_Number | Dictionary_ID | Entropy_Tables | Content |
2255     +--------------+---------------+----------------+---------+
2256
2257   Magic_Number:  4 bytes ID, value 0xEC30A437, little-endian format.
2258
2259   Dictionary_ID:  4 bytes, stored in little-endian format.
2260      Dictionary_ID can be any value, except 0 (which means no
2261      Dictionary_ID).  It is used by decoders to check if they use the
2262      correct dictionary.  If the frame is going to be distributed in a
2263      private environment, any Dictionary_ID can be used.  However, for
2264      public distribution of compressed frames, the following ranges are
2265      reserved and shall not be used:
2266
2267         low range: <= 32767
2268         high range: >= (2^31)
2269
2270   Entropy_Tables:  Follow the same format as the tables in compressed
2271      blocks.  See the relevant FSE and Huffman sections for how to
2272      decode these tables.  They are stored in the following order:
2273      Huffman table for literals, FSE table for offsets, FSE table for
2274      match lengths, and FSE table for literals lengths.  These tables
2275      populate the Repeat Stats literals mode and Repeat distribution
2276      mode for sequence decoding.  It is finally followed by 3 offset
2277      values, populating repeat offsets (instead of using {1,4,8}),
2278      stored in order, 4-bytes little-endian each, for a total of 12
2279      bytes.  Each repeat offset must have a value less than the
2280      dictionary size.
2281
2282   Content:  The rest of the dictionary is its content.  The content
2283      acts as a "past" in front of data to be compressed or
2284      decompressed, so it can be referenced in sequence commands.  As
2285      long as the amount of data decoded from this frame is less than or
2286      equal to Window_Size, sequence commands may specify offsets longer
2287      than the total length of decoded output so far to reference back
2288      to the dictionary, even parts of the dictionary with offsets
2289      larger than Window_Size.  After the total output has surpassed
2290      Window_Size, however, this is no longer allowed, and the
2291      dictionary is no longer accessible.
2292
2293
2294
2295
2296
2297
2298Collet & Kucherawy            Informational                    [Page 41]
2299
2300RFC 8478                    application/zstd                October 2018
2301
2302
23036.  IANA Considerations
2304
2305   IANA has made two registrations, as described below.
2306
23076.1.  The 'application/zstd' Media Type
2308
2309   The 'application/zstd' media type identifies a block of data that is
2310   compressed using zstd compression.  The data is a stream of bytes as
2311   described in this document.  IANA has added the following to the
2312   "Media Types" registry:
2313
2314   Type name:  application
2315
2316   Subtype name:  zstd
2317
2318   Required parameters:  N/A
2319
2320   Optional parameters:  N/A
2321
2322   Encoding considerations:  binary
2323
2324   Security considerations:  See Section 7 of RFC 8478
2325
2326   Interoperability considerations:  N/A
2327
2328   Published specification:  RFC 8478
2329
2330   Applications that use this media type:  anywhere data size is an
2331      issue
2332
2333   Additional information:
2334
2335      Magic number(s):  4 bytes, little-endian format.
2336         Value: 0xFD2FB528
2337
2338      File extension(s):  zst
2339
2340      Macintosh file type code(s):  N/A
2341
2342   For further information:  See [ZSTD]
2343
2344   Intended usage:  common
2345
2346   Restrictions on usage:  N/A
2347
2348   Author:  Murray S.  Kucherawy
2349
2350   Change Controller:  IETF
2351
2352
2353
2354Collet & Kucherawy            Informational                    [Page 42]
2355
2356RFC 8478                    application/zstd                October 2018
2357
2358
2359   Provisional registration:  no
2360
23616.2.  Content Encoding
2362
2363   IANA has added the following entry to the "HTTP Content Coding
2364   Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters"
2365   registry:
2366
2367   Name:  zstd
2368
2369   Description:  A stream of bytes compressed using the Zstandard
2370      protocol
2371
2372   Pointer to specification text:  RFC 8478
2373
23746.3.  Dictionaries
2375
2376   Work in progress includes development of dictionaries that will
2377   optimize compression and decompression of particular types of data.
2378   Specification of such dictionaries for public use will necessitate
2379   registration of a code point from the reserved range described in
2380   Section 3.1.1.1.3 and its association with a specific dictionary.
2381
2382   However, there are at present no such dictionaries published for
2383   public use, so this document makes no immediate request of IANA to
2384   create such a registry.
2385
23867.  Security Considerations
2387
2388   Any data compression method involves the reduction of redundancy in
2389   the data.  Zstandard is no exception, and the usual precautions
2390   apply.
2391
2392   One should never compress a message whose content must remain secret
2393   with a message generated by a third party.  Such a compression can be
2394   used to guess the content of the secret message through analysis of
2395   entropy reduction.  This was demonstrated in the Compression Ratio
2396   Info-leak Made Easy (CRIME) attack [CRIME], for example.
2397
2398   A decoder has to demonstrate capabilities to detect and prevent any
2399   kind of data tampering in the compressed frame from triggering system
2400   faults, such as reading or writing beyond allowed memory ranges.
2401   This can be guaranteed by either the implementation language or
2402   careful bound checkings.  Of particular note is the encoding of
2403   Number_of_Sequences values that cause the decoder to read into the
2404   block header (and beyond), as well as the indication of a
2405   Frame_Content_Size that is smaller than the actual decompressed data,
2406   in an attempt to trigger a buffer overflow.  It is highly recommended
2407
2408
2409
2410Collet & Kucherawy            Informational                    [Page 43]
2411
2412RFC 8478                    application/zstd                October 2018
2413
2414
2415   to fuzz-test (i.e., provide invalid, unexpected, or random input and
2416   verify safe operation of) decoder implementations to test and harden
2417   their capability to detect bad frames and deal with them without any
2418   adverse system side effect.
2419
2420   An attacker may provide correctly formed compressed frames with
2421   unreasonable memory requirements.  A decoder must always control
2422   memory requirements and enforce some (system-specific) limits in
2423   order to protect memory usage from such scenarios.
2424
2425   Compression can be optimized by training a dictionary on a variety of
2426   related content payloads.  This dictionary must then be available at
2427   the decoder for decompression of the payload to be possible.  While
2428   this document does not specify how to acquire a dictionary for a
2429   given compressed payload, it is worth noting that third-party
2430   dictionaries may interact unexpectedly with a decoder, leading to
2431   possible memory or other resource exhaustion attacks.  We expect such
2432   topics to be discussed in further detail in the Security
2433   Considerations section of a forthcoming RFC for dictionary
2434   acquisition and transmission, but highlight this issue now out of an
2435   abundance of caution.
2436
2437   As discussed in Section 3.1.2, it is possible to store arbitrary user
2438   metadata in skippable frames.  While such frames are ignored during
2439   decompression of the data, they can be used as a watermark to track
2440   the path of the compressed payload.
2441
24428.  Implementation Status
2443
2444   Source code for a C language implementation of a Zstandard-compliant
2445   library is available at [ZSTD-GITHUB].  This implementation is
2446   considered to be the reference implementation and is production
2447   ready; it implements the full range of the specification.  It is
2448   routinely tested against security hazards and widely deployed within
2449   Facebook infrastructure.
2450
2451   The reference version is optimized for speed and is highly portable.
2452   It has been proven to run safely on multiple architectures (e.g.,
2453   x86, x64, ARM, MIPS, PowerPC, IA64) featuring 32- or 64-bit
2454   addressing schemes, a little- or big-endian storage scheme, a number
2455   of different operating systems (e.g., UNIX (including Linux, BSD,
2456   OS-X, and Solaris) and Windows), and a number of compilers (e.g.,
2457   gcc, clang, visual, and icc).
2458
2459
2460
2461
2462
2463
2464
2465
2466Collet & Kucherawy            Informational                    [Page 44]
2467
2468RFC 8478                    application/zstd                October 2018
2469
2470
24719.  References
2472
24739.1.  Normative References
2474
2475   [ZSTD]     "Zstandard", <http://www.zstd.net>.
2476
24779.2.  Informative References
2478
2479   [ANS]      Duda, J., "Asymmetric numeral systems: entropy coding
2480              combining speed of Huffman coding with compression rate of
2481              arithmetic coding", January 2014,
2482              <https://arxiv.org/pdf/1311.2540>.
2483
2484   [CRIME]    "CRIME", June 2018, <https://en.wikipedia.org/w/
2485              index.php?title=CRIME&oldid=844538656>.
2486
2487   [FSE]      "FiniteStateEntropy", commit 6efa78a, June 2018,
2488              <https://github.com/Cyan4973/FiniteStateEntropy/>.
2489
2490   [LZ4]      "LZ4 Frame Format Description", commit d03224b, January
2491              2018, <https://github.com/lz4/lz4/blob/master/doc/
2492              lz4_Frame_format.md>.
2493
2494   [RFC1952]  Deutsch, P., "GZIP file format specification version 4.3",
2495              RFC 1952, DOI 10.17487/RFC1952, May 1996,
2496              <https://www.rfc-editor.org/info/rfc1952>.
2497
2498   [XXHASH]   "XXHASH Algorithm", <http://www.xxhash.org>.
2499
2500   [ZSTD-GITHUB]
2501              "zstd", commit 8514bd8, August 2018,
2502              <https://github.com/facebook/zstd>.
2503
2504
2505
2506
2507
2508
2509
2510
2511
2512
2513
2514
2515
2516
2517
2518
2519
2520
2521
2522Collet & Kucherawy            Informational                    [Page 45]
2523
2524RFC 8478                    application/zstd                October 2018
2525
2526
2527Appendix A.  Decoding Tables for Predefined Codes
2528
2529   This appendix contains FSE decoding tables for the predefined literal
2530   length, match length, and offset codes.  The tables have been
2531   constructed using the algorithm as given above in Section 4.1.1.  The
2532   tables here can be used as examples to crosscheck that an
2533   implementation has built its decoding tables correctly.
2534
2535A.1.  Literal Length Code Table
2536
2537     +-------+--------+----------------+------+
2538     | State | Symbol | Number_Of_Bits | Base |
2539     +-------+--------+----------------+------+
2540     |    0  |    0   |        0       |   0  |
2541     +-------+--------+----------------+------+
2542     |    0  |    0   |        4       |   0  |
2543     +-------+--------+----------------+------+
2544     |    1  |    0   |        4       |  16  |
2545     +-------+--------+----------------+------+
2546     |    2  |    1   |        5       |  32  |
2547     +-------+--------+----------------+------+
2548     |    3  |    3   |        5       |   0  |
2549     +-------+--------+----------------+------+
2550     |    4  |    4   |        5       |   0  |
2551     +-------+--------+----------------+------+
2552     |    5  |    6   |        5       |   0  |
2553     +-------+--------+----------------+------+
2554     |    6  |    7   |        5       |   0  |
2555     +-------+--------+----------------+------+
2556     |    7  |    9   |        5       |   0  |
2557     +-------+--------+----------------+------+
2558     |    8  |   10   |        5       |   0  |
2559     +-------+--------+----------------+------+
2560     |    9  |   12   |        5       |   0  |
2561     +-------+--------+----------------+------+
2562     |   10  |   14   |        6       |   0  |
2563     +-------+--------+----------------+------+
2564     |   11  |   16   |        5       |   0  |
2565     +-------+--------+----------------+------+
2566     |   12  |   18   |        5       |   0  |
2567     +-------+--------+----------------+------+
2568     |   13  |   19   |        5       |   0  |
2569     +-------+--------+----------------+------+
2570     |   14  |   21   |        5       |   0  |
2571     +-------+--------+----------------+------+
2572     |   15  |   22   |        5       |   0  |
2573     +-------+--------+----------------+------+
2574     |   16  |   24   |        5       |   0  |
2575
2576
2577
2578Collet & Kucherawy            Informational                    [Page 46]
2579
2580RFC 8478                    application/zstd                October 2018
2581
2582
2583     +-------+--------+----------------+------+
2584     |   17  |   25   |        5       |  32  |
2585     +-------+--------+----------------+------+
2586     |   18  |   26   |        5       |   0  |
2587     +-------+--------+----------------+------+
2588     |   19  |   27   |        6       |   0  |
2589     +-------+--------+----------------+------+
2590     |   20  |   29   |        6       |   0  |
2591     +-------+--------+----------------+------+
2592     |   21  |   31   |        6       |   0  |
2593     +-------+--------+----------------+------+
2594     |   22  |    0   |        4       |  32  |
2595     +-------+--------+----------------+------+
2596     |   23  |    1   |        4       |   0  |
2597     +-------+--------+----------------+------+
2598     |   24  |    2   |        5       |   0  |
2599     +-------+--------+----------------+------+
2600     |   25  |    4   |        5       |  32  |
2601     +-------+--------+----------------+------+
2602     |   26  |    5   |        5       |   0  |
2603     +-------+--------+----------------+------+
2604     |   27  |    7   |        5       |  32  |
2605     +-------+--------+----------------+------+
2606     |   28  |    8   |        5       |   0  |
2607     +-------+--------+----------------+------+
2608     |   29  |   10   |        5       |  32  |
2609     +-------+--------+----------------+------+
2610     |   30  |   11   |        5       |   0  |
2611     +-------+--------+----------------+------+
2612     |   31  |   13   |        6       |   0  |
2613     +-------+--------+----------------+------+
2614     |   32  |   16   |        5       |  32  |
2615     +-------+--------+----------------+------+
2616     |   33  |   17   |        5       |   0  |
2617     +-------+--------+----------------+------+
2618     |   34  |   19   |        5       |  32  |
2619     +-------+--------+----------------+------+
2620     |   35  |   20   |        5       |   0  |
2621     +-------+--------+----------------+------+
2622     |   36  |   22   |        5       |  32  |
2623     +-------+--------+----------------+------+
2624     |   37  |   23   |        5       |   0  |
2625     +-------+--------+----------------+------+
2626     |   38  |   25   |        4       |   0  |
2627     +-------+--------+----------------+------+
2628     |   39  |   25   |        4       |  16  |
2629     +-------+--------+----------------+------+
2630     |   40  |   26   |        5       |  32  |
2631
2632
2633
2634Collet & Kucherawy            Informational                    [Page 47]
2635
2636RFC 8478                    application/zstd                October 2018
2637
2638
2639     +-------+--------+----------------+------+
2640     |   41  |   28   |        6       |   0  |
2641     +-------+--------+----------------+------+
2642     |   42  |   30   |        6       |   0  |
2643     +-------+--------+----------------+------+
2644     |   43  |    0   |        4       |  48  |
2645     +-------+--------+----------------+------+
2646     |   44  |    1   |        4       |  16  |
2647     +-------+--------+----------------+------+
2648     |   45  |    2   |        5       |  32  |
2649     +-------+--------+----------------+------+
2650     |   46  |    3   |        5       |  32  |
2651     +-------+--------+----------------+------+
2652     |   47  |    5   |        5       |  32  |
2653     +-------+--------+----------------+------+
2654     |   48  |    6   |        5       |  32  |
2655     +-------+--------+----------------+------+
2656     |   49  |    8   |        5       |  32  |
2657     +-------+--------+----------------+------+
2658     |   50  |    9   |        5       |  32  |
2659     +-------+--------+----------------+------+
2660     |   51  |   11   |        5       |  32  |
2661     +-------+--------+----------------+------+
2662     |   52  |   12   |        5       |  32  |
2663     +-------+--------+----------------+------+
2664     |   53  |   15   |        6       |   0  |
2665     +-------+--------+----------------+------+
2666     |   54  |   17   |        5       |  32  |
2667     +-------+--------+----------------+------+
2668     |   55  |   18   |        5       |  32  |
2669     +-------+--------+----------------+------+
2670     |   56  |   20   |        5       |  32  |
2671     +-------+--------+----------------+------+
2672     |   57  |   21   |        5       |  32  |
2673     +-------+--------+----------------+------+
2674     |   58  |   23   |        5       |  32  |
2675     +-------+--------+----------------+------+
2676     |   59  |   24   |        5       |  32  |
2677     +-------+--------+----------------+------+
2678     |   60  |   35   |        6       |   0  |
2679     +-------+--------+----------------+------+
2680     |   61  |   34   |        6       |   0  |
2681     +-------+--------+----------------+------+
2682     |   62  |   33   |        6       |   0  |
2683     +-------+--------+----------------+------+
2684     |   63  |   32   |        6       |   0  |
2685     +-------+--------+----------------+------+
2686
2687
2688
2689
2690Collet & Kucherawy            Informational                    [Page 48]
2691
2692RFC 8478                    application/zstd                October 2018
2693
2694
2695A.2.  Match Length Code Table
2696
2697     +-------+--------+----------------+------+
2698     | State | Symbol | Number_Of_Bits | Base |
2699     +-------+--------+----------------+------+
2700     |    0  |    0   |        0       |   0  |
2701     +-------+--------+----------------+------+
2702     |    0  |    0   |        6       |   0  |
2703     +-------+--------+----------------+------+
2704     |    1  |    1   |        4       |   0  |
2705     +-------+--------+----------------+------+
2706     |    2  |    2   |        5       |  32  |
2707     +-------+--------+----------------+------+
2708     |    3  |    3   |        5       |   0  |
2709     +-------+--------+----------------+------+
2710     |    4  |    5   |        5       |   0  |
2711     +-------+--------+----------------+------+
2712     |    5  |    6   |        5       |   0  |
2713     +-------+--------+----------------+------+
2714     |    6  |    8   |        5       |   0  |
2715     +-------+--------+----------------+------+
2716     |    7  |   10   |        6       |   0  |
2717     +-------+--------+----------------+------+
2718     |    8  |   13   |        6       |   0  |
2719     +-------+--------+----------------+------+
2720     |    9  |   16   |        6       |   0  |
2721     +-------+--------+----------------+------+
2722     |   10  |   19   |        6       |   0  |
2723     +-------+--------+----------------+------+
2724     |   11  |   22   |        6       |   0  |
2725     +-------+--------+----------------+------+
2726     |   12  |   25   |        6       |   0  |
2727     +-------+--------+----------------+------+
2728     |   13  |   28   |        6       |   0  |
2729     +-------+--------+----------------+------+
2730     |   14  |   31   |        6       |   0  |
2731     +-------+--------+----------------+------+
2732     |   15  |   33   |        6       |   0  |
2733     +-------+--------+----------------+------+
2734     |   16  |   35   |        6       |   0  |
2735     +-------+--------+----------------+------+
2736     |   17  |   37   |        6       |   0  |
2737     +-------+--------+----------------+------+
2738     |   18  |   39   |        6       |   0  |
2739     +-------+--------+----------------+------+
2740     |   19  |   41   |        6       |   0  |
2741     +-------+--------+----------------+------+
2742     |   20  |   43   |        6       |   0  |
2743
2744
2745
2746Collet & Kucherawy            Informational                    [Page 49]
2747
2748RFC 8478                    application/zstd                October 2018
2749
2750
2751     +-------+--------+----------------+------+
2752     |   21  |   45   |        6       |   0  |
2753     +-------+--------+----------------+------+
2754     |   22  |    1   |        4       |  16  |
2755     +-------+--------+----------------+------+
2756     |   23  |    2   |        4       |   0  |
2757     +-------+--------+----------------+------+
2758     |   24  |    3   |        5       |  32  |
2759     +-------+--------+----------------+------+
2760     |   25  |    4   |        5       |   0  |
2761     +-------+--------+----------------+------+
2762     |   26  |    6   |        5       |  32  |
2763     +-------+--------+----------------+------+
2764     |   27  |    7   |        5       |   0  |
2765     +-------+--------+----------------+------+
2766     |   28  |    9   |        6       |   0  |
2767     +-------+--------+----------------+------+
2768     |   29  |   12   |        6       |   0  |
2769     +-------+--------+----------------+------+
2770     |   30  |   15   |        6       |   0  |
2771     +-------+--------+----------------+------+
2772     |   31  |   18   |        6       |   0  |
2773     +-------+--------+----------------+------+
2774     |   32  |   21   |        6       |   0  |
2775     +-------+--------+----------------+------+
2776     |   33  |   24   |        6       |   0  |
2777     +-------+--------+----------------+------+
2778     |   34  |   27   |        6       |   0  |
2779     +-------+--------+----------------+------+
2780     |   35  |   30   |        6       |   0  |
2781     +-------+--------+----------------+------+
2782     |   36  |   32   |        6       |   0  |
2783     +-------+--------+----------------+------+
2784     |   37  |   34   |        6       |   0  |
2785     +-------+--------+----------------+------+
2786     |   38  |   36   |        6       |   0  |
2787     +-------+--------+----------------+------+
2788     |   39  |   38   |        6       |   0  |
2789     +-------+--------+----------------+------+
2790     |   40  |   40   |        6       |   0  |
2791     +-------+--------+----------------+------+
2792     |   41  |   42   |        6       |   0  |
2793     +-------+--------+----------------+------+
2794     |   42  |   44   |        6       |   0  |
2795     +-------+--------+----------------+------+
2796     |   43  |    1   |        4       |  32  |
2797     +-------+--------+----------------+------+
2798     |   44  |    1   |        4       |  48  |
2799
2800
2801
2802Collet & Kucherawy            Informational                    [Page 50]
2803
2804RFC 8478                    application/zstd                October 2018
2805
2806
2807     +-------+--------+----------------+------+
2808     |   45  |    2   |        4       |  16  |
2809     +-------+--------+----------------+------+
2810     |   46  |    4   |        5       |  32  |
2811     +-------+--------+----------------+------+
2812     |   47  |    5   |        5       |  32  |
2813     +-------+--------+----------------+------+
2814     |   48  |    7   |        5       |  32  |
2815     +-------+--------+----------------+------+
2816     |   49  |    8   |        5       |  32  |
2817     +-------+--------+----------------+------+
2818     |   50  |   11   |        6       |   0  |
2819     +-------+--------+----------------+------+
2820     |   51  |   14   |        6       |   0  |
2821     +-------+--------+----------------+------+
2822     |   52  |   17   |        6       |   0  |
2823     +-------+--------+----------------+------+
2824     |   53  |   20   |        6       |   0  |
2825     +-------+--------+----------------+------+
2826     |   54  |   23   |        6       |   0  |
2827     +-------+--------+----------------+------+
2828     |   55  |   26   |        6       |   0  |
2829     +-------+--------+----------------+------+
2830     |   56  |   29   |        6       |   0  |
2831     +-------+--------+----------------+------+
2832     |   57  |   52   |        6       |   0  |
2833     +-------+--------+----------------+------+
2834     |   58  |   51   |        6       |   0  |
2835     +-------+--------+----------------+------+
2836     |   59  |   50   |        6       |   0  |
2837     +-------+--------+----------------+------+
2838     |   60  |   49   |        6       |   0  |
2839     +-------+--------+----------------+------+
2840     |   61  |   48   |        6       |   0  |
2841     +-------+--------+----------------+------+
2842     |   62  |   47   |        6       |   0  |
2843     +-------+--------+----------------+------+
2844     |   63  |   46   |        6       |   0  |
2845     +-------+--------+----------------+------+
2846
2847
2848
2849
2850
2851
2852
2853
2854
2855
2856
2857
2858Collet & Kucherawy            Informational                    [Page 51]
2859
2860RFC 8478                    application/zstd                October 2018
2861
2862
2863A.3.  Offset Code Table
2864
2865     +-------+--------+----------------+------+
2866     | State | Symbol | Number_Of_Bits | Base |
2867     +-------+--------+----------------+------+
2868     |    0  |    0   |        0       |   0  |
2869     +-------+--------+----------------+------+
2870     |    0  |    0   |        5       |   0  |
2871     +-------+--------+----------------+------+
2872     |    1  |    6   |        4       |   0  |
2873     +-------+--------+----------------+------+
2874     |    2  |    9   |        5       |   0  |
2875     +-------+--------+----------------+------+
2876     |    3  |   15   |        5       |   0  |
2877     +-------+--------+----------------+------+
2878     |    4  |   21   |        5       |   0  |
2879     +-------+--------+----------------+------+
2880     |    5  |    3   |        5       |   0  |
2881     +-------+--------+----------------+------+
2882     |    6  |    7   |        4       |   0  |
2883     +-------+--------+----------------+------+
2884     |    7  |   12   |        5       |   0  |
2885     +-------+--------+----------------+------+
2886     |    8  |   18   |        5       |   0  |
2887     +-------+--------+----------------+------+
2888     |    9  |   23   |        5       |   0  |
2889     +-------+--------+----------------+------+
2890     |   10  |    5   |        5       |   0  |
2891     +-------+--------+----------------+------+
2892     |   11  |    8   |        4       |   0  |
2893     +-------+--------+----------------+------+
2894     |   12  |   14   |        5       |   0  |
2895     +-------+--------+----------------+------+
2896     |   13  |   20   |        5       |   0  |
2897     +-------+--------+----------------+------+
2898     |   14  |    2   |        5       |   0  |
2899     +-------+--------+----------------+------+
2900     |   15  |    7   |        4       |  16  |
2901     +-------+--------+----------------+------+
2902     |   16  |   11   |        5       |   0  |
2903     +-------+--------+----------------+------+
2904     |   17  |   17   |        5       |   0  |
2905     +-------+--------+----------------+------+
2906     |   18  |   22   |        5       |   0  |
2907     +-------+--------+----------------+------+
2908     |   19  |    4   |        5       |   0  |
2909     +-------+--------+----------------+------+
2910     |   20  |    8   |        4       |  16  |
2911
2912
2913
2914Collet & Kucherawy            Informational                    [Page 52]
2915
2916RFC 8478                    application/zstd                October 2018
2917
2918
2919     +-------+--------+----------------+------+
2920     |   21  |   13   |        5       |   0  |
2921     +-------+--------+----------------+------+
2922     |   22  |   19   |        5       |   0  |
2923     +-------+--------+----------------+------+
2924     |   23  |    1   |        5       |   0  |
2925     +-------+--------+----------------+------+
2926     |   24  |    6   |        4       |  16  |
2927     +-------+--------+----------------+------+
2928     |   25  |   10   |        5       |   0  |
2929     +-------+--------+----------------+------+
2930     |   26  |   16   |        5       |   0  |
2931     +-------+--------+----------------+------+
2932     |   27  |   28   |        5       |   0  |
2933     +-------+--------+----------------+------+
2934     |   28  |   27   |        5       |   0  |
2935     +-------+--------+----------------+------+
2936     |   29  |   26   |        5       |   0  |
2937     +-------+--------+----------------+------+
2938     |   30  |   25   |        5       |   0  |
2939     +-------+--------+----------------+------+
2940     |   31  |   24   |        5       |   0  |
2941     +-------+--------+----------------+------+
2942
2943Acknowledgments
2944
2945   zstd was developed by Yann Collet.
2946
2947   Bobo Bose-Kolanu, Felix Handte, Kyle Nekritz, Nick Terrell, and David
2948   Schleimer provided helpful feedback during the development of this
2949   document.
2950
2951
2952
2953
2954
2955
2956
2957
2958
2959
2960
2961
2962
2963
2964
2965
2966
2967
2968
2969
2970Collet & Kucherawy            Informational                    [Page 53]
2971
2972RFC 8478                    application/zstd                October 2018
2973
2974
2975Authors' Addresses
2976
2977   Yann Collet
2978   Facebook
2979   1 Hacker Way
2980   Menlo Park, CA  94025
2981   United States of America
2982
2983   Email: cyan@fb.com
2984
2985
2986   Murray S. Kucherawy (editor)
2987   Facebook
2988   1 Hacker Way
2989   Menlo Park, CA  94025
2990   United States of America
2991
2992   Email: msk@fb.com
2993
2994
2995
2996
2997
2998
2999
3000
3001
3002
3003
3004
3005
3006
3007
3008
3009
3010
3011
3012
3013
3014
3015
3016
3017
3018
3019
3020
3021
3022
3023
3024
3025
3026Collet & Kucherawy            Informational                    [Page 54]
3027